Generic Corpus Annotation

An annotated corpus is an important resource in linguistic research and is used in diverse ways, for example in word sense disambiguation, dictionary creation, lexicographic research, and information extraction. Corpus annotation is understood as “the practice of adding interpretative (especially linguistic) information to an existing corpus of spoken and/or written language” (Leech (1997): Corpus Annotation Schemes). This information can belong to different linguistic levels, such as pragmatics, syntax, semantics, or grammar. A distinction is therefore made between the primary data and its annotation data.

There are two techniques for data annotation: in-line and stand-off annotation. In in-line annotation, the primary data and the annotation data form a unit and are saved in a single file. The disadvantages of this technique are:

  • The original data is altered, which complicates further processing.
  • Annotation structures can overlap.
  • Multiple annotations are difficult to arrange.
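
The first disadvantage can be illustrated with a minimal Python sketch. The sample sentence, the `<w pos="...">` markup, and the helper names `annotate_inline` and `strip_markup` are assumptions made for illustration only; they are not taken from any of the standards listed on this page.

```python
import re

PRIMARY = "Time flies fast"

# Hypothetical part-of-speech labels, one per whitespace-separated token.
POS_TAGS = ["NOUN", "VERB", "ADV"]


def annotate_inline(text, tags):
    """Embed one tag per token directly in the text (in-line annotation)."""
    tokens = text.split()
    return " ".join(
        f'<w pos="{tag}">{token}</w>' for token, tag in zip(tokens, tags)
    )


def strip_markup(annotated):
    """Recover the primary text by deleting every tag."""
    return re.sub(r"<[^>]+>", "", annotated)


inline = annotate_inline(PRIMARY, POS_TAGS)

# The stored text is no longer the primary text, and offsets have shifted.
print(inline)
print(PRIMARY.find("flies"), inline.find("flies"))
print(strip_markup(inline) == PRIMARY)
```

Once the tags are embedded, the position of every word changes, and any tool that needs the plain text first has to strip the markup again.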

More recently, stand-off annotation has become widely used. The idea is that one or more annotations are kept apart from the primary data and linked to it. The annotation data can be stored in the same file as the primary data or in a separate file. The advantages are:

  • Separating the primary data from its annotations leaves the primary data unchanged and available for further processing. The annotations reference the primary data directly through indexing at the character level and word level. For this reason, the primary data must stay unchanged; otherwise the indexing would no longer point to the right positions. It is therefore reasonable to make the primary data file read-only.
  • Multiple annotations of the same primary data are possible.
  • It is relatively easy to add further annotations to the primary data; new files are created for this purpose.
  • Each annotation can be modified separately.
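
The points above can be sketched in Python under a few assumptions: annotations are modelled as character-offset spans, and the `Annotation` class, the layer names, and the labels are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass

# The primary data is never modified; annotations only point into it.
PRIMARY = "Time flies fast"


@dataclass(frozen=True)
class Annotation:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


def covered_text(primary, annotation):
    """Return the stretch of primary data that an annotation refers to."""
    return primary[annotation.start:annotation.end]


# Two independent annotation layers over the same primary data.
layers = {
    "pos": [
        Annotation(0, 4, "NOUN"),
        Annotation(5, 10, "VERB"),
        Annotation(11, 15, "ADV"),
    ],
    "syntax": [
        Annotation(0, 4, "NP"),
        Annotation(5, 15, "VP"),
    ],
}

for name, layer in layers.items():
    print(name, [(covered_text(PRIMARY, a), a.label) for a in layer])

# If the primary data changes, the stored offsets no longer match.
changed = "The " + PRIMARY
print(covered_text(changed, layers["pos"][1]))
```

Each layer can be edited or replaced without touching the other layer or the text. The last line shows why the primary data should be read-only: after a change at the start of the text, the offsets select the wrong characters.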

Extensible Markup Language (XML) is used for stand-off annotation in the majority of cases. XPointer and XLink expressions are used to link the annotation data to the original text and to reference it.
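
As a rough sketch of this idea, the Python example below writes a stand-off annotation document with the standard library and resolves it against the primary text. The element names `annotations` and `span`, the attributes `start`, `end`, and `label`, and the file name `primary.txt` are hypothetical. Plain integer offsets stand in for real XPointer expressions, which are not reproduced here; only the XLink namespace URI is the actual one.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)

PRIMARY = "Time flies fast"


def build_standoff(primary_href, spans):
    """Serialise (start, end, label) spans as a separate XML document."""
    root = ET.Element("annotations")
    # XLink attribute pointing at the primary data file (hypothetical name).
    root.set(f"{{{XLINK}}}href", primary_href)
    for start, end, label in spans:
        ET.SubElement(
            root, "span", {"start": str(start), "end": str(end), "label": label}
        )
    return ET.tostring(root, encoding="unicode")


def resolve(xml_text, primary):
    """Look up the text each span refers to in the primary data."""
    root = ET.fromstring(xml_text)
    return [
        (primary[int(s.get("start")):int(s.get("end"))], s.get("label"))
        for s in root.iter("span")
    ]


xml_text = build_standoff(
    "primary.txt", [(0, 4, "NOUN"), (5, 10, "VERB"), (11, 15, "ADV")]
)
print(xml_text)
print(resolve(xml_text, PRIMARY))
```

The annotation document contains no text from the primary data at all, only a link to the file and positions inside it.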

Types of corpus annotation include:

  • Lemmatization
  • Part-of-speech annotation
  • Syntactical annotation
  • Semantic annotation
  • Discourse annotation
  • Pragmatic annotation
  • Phonetic annotation
  • Lexical annotation

Standards dealing with this topic:

  1. Language resource management — Linguistic annotation framework
  2. Corpus Encoding Standard
  3. Language resource management — Word segmentation of written texts
  4. Language Resources Management — Multilingual Information Framework
  5. Guidelines for Electronic Text Encoding and Interchange
  6. Journal Article Tag Suite
  7. Darwin Information Typing Architecture
  8. NLM Journal Archiving and Interchange Tag Suite