The Unicode Standard
Abbreviation: Unicode [not official, only for reference in this website]
Scope: Standard for character encoding of text documents
Topic: Character encoding
Standard body: Other
Keywords: international character, character encoding, character set, character, UCS, UTF-8
Use in CLARIN: fully recommended

Unicode is an international standard, developed by the Unicode Consortium, that defines nearly every character used in the written languages of the world. The first version of the standard was published in 1991 and covered over 7000 characters. The number of characters has increased significantly since then: as of version 7.0, Unicode contains 112956 characters, covering the scripts of the world's modern languages (alphabetic scripts of Europe, the Middle East, Asia and Africa), ancient languages (such as Latin, Sanskrit and classical Greek) and many other archaic and historic scripts. Furthermore, the standard encodes many important symbol sets, including punctuation marks, mathematical symbols, technical symbols, geometric shapes, dingbats and emoji.

Unicode may be seen as a character superset. It combines the character sets defined in many international and national standards (ISO, ANSI/NISO and others), and it also includes character sets from Adobe, Apple, Fujitsu, IBM, Lotus, Microsoft and many other vendors. The Unicode Standard therefore offers one of the largest and most complete character sets in the world. Nearly all characters are encoded in Unicode, unambiguously defined, and represented independently of the computer system or application used.

Unicode defines a name and a numerical value (code point) for each character. A code point can be represented in three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). These forms make it easy to handle data in byte, word or double-word formats.
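
As an illustration of the name, numerical value and three encoding forms described above, the following minimal Python sketch looks up a character's Unicode name and code point and encodes it in UTF-8, UTF-16 and UTF-32. The helper name describe and the choice of big-endian byte order are assumptions made for this example only; they are not part of the standard.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the Unicode name, code point and encoded bytes of one character.

    `describe` is a hypothetical helper written for this illustration.
    Big-endian variants are used so that no byte order mark is prepended.
    """
    return {
        "name": unicodedata.name(ch),          # official character name
        "code_point": f"U+{ord(ch):04X}",      # numerical value, hex notation
        "utf8": ch.encode("utf-8"),            # 8-bit code units (1 to 4 bytes)
        "utf16": ch.encode("utf-16-be"),       # 16-bit code units (2 or 4 bytes)
        "utf32": ch.encode("utf-32-be"),       # one 32-bit code unit (4 bytes)
    }


if __name__ == "__main__":
    for ch in ("€", "😀"):
        info = describe(ch)
        print(info["code_point"], info["name"])
        for form in ("utf8", "utf16", "utf32"):
            # bytes.hex(sep) requires Python 3.8 or later
            print(f"  {form}: {info[form].hex(' ')}")
```

The same code point yields byte sequences of different lengths depending on the encoding form: a character outside the Basic Multilingual Plane, such as the emoji in the example, occupies four bytes in all three forms, whereas the euro sign needs three bytes in UTF-8, two in UTF-16 and four in UTF-32.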

The character codes of the Unicode Standard and of ISO/IEC 10646 (Universal Character Set) are identical, so the two standards are fully compatible with each other.

Related Standard(s):
  • EAF

    The textual content of annotations in EAF is encoded in Unicode.

  • IPA

    Unicode defines codes for the symbols used in the IPA.

  • LMF

    LMF uses Unicode.

  • UCS

    The Unicode character set is codepoint-for-codepoint identical to ISO/IEC 10646 (UCS).

Other standards in the same topic(s):
Recommended Reading:

Version Title: The Unicode Standard: Version 7.0 – Core Specification
Abbreviation: Unicode 7.0
Version Number: 7.0
Status: final
Release Date: 2014-10-08
  1. The Unicode Consortium