The Unicode Standard

Abbreviation: Unicode [not official, only for reference on this website]

Scope: Standard for character encoding of text documents

Standard body: Other

Keywords: international character, character encoding, character set, character, UCS, UTF-8

Use in CLARIN: fully recommended

Description:

Unicode is an international standard, developed by the Unicode Consortium, that defines nearly every character used in the written languages of the world. The first version of the standard was published in 1991 and covered over 7,000 characters. The number of characters has increased significantly since then: Unicode currently contains 112,956 characters, covering modern languages (alphabetic scripts of Europe, the Middle East, Asia and Africa), ancient languages (such as Latin, Sanskrit and classical Greek) and many other archaic and historic scripts. The standard also encodes many important symbol sets, including punctuation marks, mathematical symbols, technical symbols, geometric shapes, dingbats and emoji.

Unicode may be seen as a character superset. It combines the character sets defined in many international and national standards, such as those of ISO and ANSI/NISO, and also includes character sets from Adobe, Apple, Fujitsu, IBM, Lotus, Microsoft and many other vendors. As a result, the Unicode Standard offers the most complete, and one of the largest, character sets in the world. Each encoded character is unambiguously defined and represented independently of any computer system or application.
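The superset idea can be illustrated with a minimal Python sketch. It assumes three legacy encodings that ship with Python's standard codecs (`latin-1`, `mac_roman`, `cp437`); the helper name `to_code_points` and the choice of the letter "é" are illustrative only. Each legacy encoding stores the letter as a different byte, yet all three decode to the same Unicode code point.

```python
# Three legacy encodings store the letter "é" as three different byte values.
LEGACY_BYTES = {
    "latin-1": b"\xe9",    # ISO/IEC 8859-1
    "mac_roman": b"\x8e",  # Apple Mac OS Roman
    "cp437": b"\x82",      # IBM PC code page 437
}


def to_code_points(raw, encoding):
    """Decode legacy bytes and return the code points in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(encoding)]


# Hypothetical helper applied to each legacy byte sequence.
results = {name: to_code_points(raw, name) for name, raw in LEGACY_BYTES.items()}

for name, points in results.items():
    print(f"{name:10} -> {points}")
```

In this sketch every entry maps to U+00E9, which is what makes conversion between vendor-specific character sets possible through Unicode as a common reference.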

Unicode defines a name and a numerical value (code point) for each character. A code point can be represented in any of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). These forms allow the same data to be handled as bytes, 16-bit words or 32-bit double words, and text can be converted between them without loss.
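As a rough illustration of names, code points and the three encoding forms, the following Python sketch uses only the standard library (`unicodedata` and `str.encode`). The function name `describe` and the sample characters are assumptions made for this example; the big-endian codec variants are chosen so that no byte order mark is added to the output.

```python
import unicodedata


def describe(ch):
    """Return the name, code point and encoded bytes (hex) of one character."""
    return {
        "name": unicodedata.name(ch),
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": ch.encode("utf-8").hex(),       # 1 to 4 eight-bit code units
        "utf-16": ch.encode("utf-16-be").hex(),  # 1 or 2 sixteen-bit code units
        "utf-32": ch.encode("utf-32-be").hex(),  # always one 32-bit code unit
    }


# Sample characters: Latin letter, euro sign, and an emoji outside the
# Basic Multilingual Plane (written as escapes to keep the source ASCII).
for ch in ["A", "\u20ac", "\U0001F600"]:
    info = describe(ch)
    print(info["code_point"], info["name"])
    print("  UTF-8 :", info["utf-8"])
    print("  UTF-16:", info["utf-16"])
    print("  UTF-32:", info["utf-32"])
```

For the emoji, UTF-16 needs two code units (a surrogate pair) and UTF-8 needs four bytes, while UTF-32 always uses a single 32-bit unit. Decoding any of the three byte sequences yields the same code point.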

The character codes of the Unicode Standard and of ISO/IEC 10646 (Universal Character Set, UCS) are identical, so the two standards are fully compatible with each other.

Related Standard(s):

EAF: The textual content of annotations in EAF is encoded in Unicode.
IPA: Unicode defines code points for the symbols used in the IPA.
LMF: LMF uses Unicode.
UCS: The Unicode character set is identical, code point for code point, to ISO/IEC 10646 (UCS).

