The area in which a standard is applied depends on its purpose, for example whether it is used for the linguistic annotation of a corpus or for annotating metadata within a corpus or another linguistic resource. These areas of application are called topics, and the standards are categorized into the following topics:
- Annotation of Multilingual Data
- Character encoding
- Constraint Language
- Controlled Vocabulary
  - Guidelines for Multilingual Thesauri
  - DIN Language codes
  - Structured vocabularies for information retrieval
  - Country Codes
    - Codes for the representation of names of countries and their subdivisions — Part 1: Country codes
    - Codes for the representation of names of countries and their subdivisions — Part 2: Country subdivision code
    - Codes for the representation of names of countries and their subdivisions — Part 3: Code for formerly used names of countries
  - Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
  - Presentation/representation of entries in dictionaries — Requirements, recommendations and information
  - Information and documentation — Thesauri and interoperability with other vocabularies
  - Codes for the representation of names of languages
    - Codes for the representation of names of languages — Part 1: Alpha-2 code
    - Codes for the representation of names of languages — Part 2: Alpha-3 code
    - Codes for the representation of names of languages — Part 3: Alpha-3 code for comprehensive coverage of languages
    - Codes for the representation of names of languages — Part 4: General principles of coding of the representation of names of languages and related entities, and application guidelines
    - Codes for the representation of names of languages — Part 5: Alpha-3 code for language families and groups
- Data Categorization
- Feature Structure
- File Formats
  - Codes for the Human Analysis of Transcripts
  - Analyzed Layout and Text Object
  - Translation Memory eXchange
  - Document management — Electronic document file format for long-term preservation
    - Document management — Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4
    - Document management — Electronic document file format for long-term preservation — Part 2: Use of ISO 32000-1
    - Document management — Electronic document file format for long-term preservation — Part 3: Use of ISO 32000-1 with support for embedded files
    - Document management — Electronic document file format for long-term preservation — Part 4: Use of ISO 32000-2 (PDF/A-4)
  - ELAN Annotation Format
  - Rich Text Format
  - TermBase eXchange
  - Portable Document Format
- Formatting
- Generic Corpus Annotation
  - Language resource management — Linguistic annotation framework
  - Corpus Encoding Standard
  - Language resource management — Word segmentation of written texts
    - Language resource management — Word segmentation of written texts — Part 1: Basic concepts and general principles
    - Language resource management — Word segmentation of written texts — Part 2: Word segmentation for Chinese, Japanese and Korean
    - Language resource management — Word segmentation of written texts — Part 3: Thai, Hindi, Vietnamese, and other related languages
  - Language Resources Management — Multilingual Information Framework
  - Guidelines for Electronic Text Encoding and Interchange
  - Journal Article Tag Suite
  - Darwin Information Typing Architecture
  - NLM Journal Archiving and Interchange Tag Suite
- Knowledge Representation
  - Resource Description Framework
  - Simple Knowledge Organization System
  - Structured vocabularies for information retrieval
  - Distributed Ontology Language
  - Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
  - Ontology Integration and Interoperability
  - Web Ontology Language
  - Information Technology — Topic Maps
    - Information technology — Topic Maps — Part 2: Data model
    - Information technology — Topic Maps — Part 3: XML syntax
    - Information technology — Topic Maps — Part 4: Canonicalization
    - Information technology — Topic Maps — Part 5: Reference model
    - Information technology — Topic Maps — Part 6: Compact syntax
    - Information technology — Topic Maps — Part 7: Graphical notation
- Lexical Knowledge
- Markup Language
  - Rule Markup Language
  - Translation Memory eXchange
  - Semantic role markup language
  - XML Path Language
  - XQuery: an XML Query Language
  - REWERSE I1 Rule Markup Language
  - Guidelines for Electronic Text Encoding and Interchange
  - TermBase eXchange
  - Dialogue Act Markup Language
  - Markup Language for events and temporal expressions in natural language
  - Information technology — Hypermedia/Time-based Structuring Language
- Meta Language
- Metadata
  - Metadata Encoding and Transmission Standard
  - NISO Metadata for Images in XML Schema
  - Resource Description Framework
  - Open Language Archive Metadata
  - Dublin Core Metadata Element Set
  - Component Metadata Infrastructure
  - ISLE Metadata Initiative
  - Data Dictionary - Technical Metadata for Digital Still Images
  - International Standard Bibliographic Description
  - DCMI Abstract Model
  - Technical Metadata for Text
- Morpho-syntactic Annotation
- Ontology
  - Suggested Upper Merged Ontology
  - OpenCyc
  - Basic Formal Ontology
  - Upper Mapping and Binding Exchange Layer
  - General Ontology for Linguistic Description
  - Distributed Ontology Language
  - SIMPLE Core Ontology
  - DARPA Agent Markup Language + Ontology Integration Language
  - Ontologies of Linguistic Annotation
  - Flora-2
  - Descriptive Ontology for Linguistic and Cognitive Engineering
  - Ontology Integration and Interoperability
  - Web Ontology Language
  - Semantic Web Rule Language Combining OWL and RuleML
- Query
- Schema
- Segmentation
  - Language resource management — Word segmentation of written texts
    - Language resource management — Word segmentation of written texts — Part 1: Basic concepts and general principles
    - Language resource management — Word segmentation of written texts — Part 2: Word segmentation for Chinese, Japanese and Korean
    - Language resource management — Word segmentation of written texts — Part 3: Thai, Hindi, Vietnamese, and other related languages
  - Segmentation Rules eXchange
- Semantic Annotation
  - Semantic role markup language
  - Language resource management — Semantic annotation framework
    - Language resource management — Semantic annotation framework (SemAF) — Part 1: Time and events
    - Language resource management — Semantic annotation framework (SemAF) — Part 2: Dialogue acts
    - Language resource management — Semantic annotation framework (SemAF) — Part 3: Named entities
    - Language resource management — Semantic annotation framework (SemAF) — Part 4: Semantic roles
    - Language resource management — Semantic annotation framework (SemAF) — Part 5: Discourse structure
    - Language resource management — Semantic annotation framework (SemAF) — Part 6: Principles of semantic annotation
    - Language resource management — Semantic annotation framework — Part 7: Spatial information
    - Language resource management — Semantic annotation framework — Part 8: Semantic relations in discourse
  - Dialogue Act Markup Language
  - Markup Language for events and temporal expressions in natural language
- Serialization
- Syntactic Annotation
- Terminology
  - General Ontology for Linguistic Description
  - Language resource management — Persistent identification and sustainable access
  - Systems to manage terminology, knowledge and content — Design, implementation and maintenance of terminology management systems
  - Presentation/representation of entries in dictionaries — Requirements, recommendations and information
  - Ontologies of Linguistic Annotation
  - TermBase eXchange
  - Language resource management — Simplified natural language — Part 1: Basic concepts and general principles
- Thesaurus
  - Guidelines for Multilingual Thesauri
  - Simple Knowledge Organization System
  - Structured vocabularies for information retrieval
  - Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
  - Information and documentation — Thesauri and interoperability with other vocabularies
- Transcription
- Transformation
Multilingual data are identical or similar data in two or more languages. Multilingual data with different levels of annotation are used in language-specific and cross-linguistic research and in a wide range of natural language processing tasks, such as machine translation, speech recognition, and information retrieval. Examples include multilingual corpora (e.g. Europarl), international vocabularies (e.g. Agrovoc), multilingual datasets (e.g. DBpedia, YAGO), multilingual encyclopaedias and ontologies (e.g. BabelNet), and multilingual dictionaries (e.g. OmegaWiki).
Specification categorized in Annotation of Multilingual Data:
Specification categorized in Character encoding:
A constraint language provides a formal model and a syntax for defining specific constraints for validating instances of a given markup language. Constraint languages can be subdivided into grammar-based and rule-based constraint languages.
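The difference between the two families can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (`entry`, `term`), the `lang` attribute, the toy content models, and the helper functions are all invented here and do not come from any of the listed specifications. A grammar-based check asks whether the children of each element match a content model, while a rule-based check asserts a condition on nodes selected by a path.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical instance document; element names are invented for illustration.
DOC = "<entry><term lang='en'>corpus</term><term>Korpus</term></entry>"

# Grammar-based: every element has a content model over its child names.
CONTENT_MODEL = {"entry": r"(term )+", "term": r""}

def check_grammar(elem):
    """Return the tags of elements whose children violate the content model."""
    children = "".join(child.tag + " " for child in elem)
    bad = [] if re.fullmatch(CONTENT_MODEL[elem.tag], children) else [elem.tag]
    for child in elem:
        bad.extend(check_grammar(child))
    return bad

# Rule-based: assertions evaluated on the nodes selected by a path.
RULES = [(".//term", lambda e: "lang" in e.attrib, "term needs a lang attribute")]

def check_rules(root):
    """Return one message per selected node that fails its assertion."""
    return [msg for path, test, msg in RULES
            for node in root.findall(path) if not test(node)]

root = ET.fromstring(DOC)
grammar_errors = check_grammar(root)  # structure is fine: entry contains only terms
rule_errors = check_rules(root)       # the second term lacks the lang attribute
```

The sample document satisfies the toy grammar but fails the rule, which shows why the two kinds of constraint are often combined.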
Specification categorized in Constraint Language:
A controlled vocabulary is a "list of terms that have been enumerated explicitly. This list is controlled by and is available from a controlled vocabulary registration authority. All terms in a controlled vocabulary must have an unambiguous, non-redundant definition."
Specification categorized in Controlled Vocabulary:
Data Categorization is a mechanism to create a central global (or local) registry for both data categories (i.e. the element and attribute names or, more generically, concepts) and the values used in an annotation process.
Specification categorized in Data Categorization:
Specification categorized in Feature Structure:
Specification categorized in File Formats:
Formatting describes the representation of source data for output to online reading, printing, speech programs, etc. In contrast to transformation, which converts data of one structure and content into another specific structure or format, formatting focuses on how the source data should appear, be visualised, or be represented.
Specification categorized in Formatting:
An annotated corpus is an important resource in linguistic research and is used in diverse ways, for example in word sense disambiguation, dictionary creation, lexicographic research, and information extraction. Corpus annotation is understood as “the practice of adding interpretative (especially linguistic) information to an existing corpus of spoken and/or written language” (Leech (1997): Corpus Annotation Schemes). This information can belong to different linguistic levels: pragmatics, syntax, semantics, grammar, etc. A distinction is made between the primary data and its annotation data.
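One common way to keep primary data and annotation data apart is to store annotations separately and let them point into the primary text by character offsets. The sketch below illustrates that idea only; the `Annotation` class, the layer names, and the labels are hypothetical and are not taken from any of the listed specifications.

```python
from dataclasses import dataclass

PRIMARY = "Corpora are annotated."  # primary data, never modified

@dataclass(frozen=True)
class Annotation:
    start: int   # character offset into PRIMARY (inclusive)
    end: int     # character offset into PRIMARY (exclusive)
    layer: str   # hypothetical layer name, e.g. "pos" or "syntax"
    label: str   # hypothetical label on that layer

ANNOTATIONS = [
    Annotation(0, 7, "pos", "NOUN"),
    Annotation(8, 11, "pos", "AUX"),
    Annotation(12, 21, "pos", "VERB"),
    Annotation(0, 7, "syntax", "subject"),
]

def covered_text(annotation):
    """Resolve an annotation back to the span of primary data it describes."""
    return PRIMARY[annotation.start:annotation.end]

def layer(name):
    """Return (text, label) pairs for one annotation layer."""
    return [(covered_text(a), a.label) for a in ANNOTATIONS if a.layer == name]

pos_layer = layer("pos")
syntax_layer = layer("syntax")
```

Because the annotations only reference offsets, several layers can describe the same span without touching the primary text.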
Specification categorized in Generic Corpus Annotation:
Specification categorized in Knowledge Representation:
Lexical knowledge is knowledge about words. Its goal is to describe all information about words and their relationships. Lexical knowledge covers several aspects of linguistic theory, such as morpho-syntactic, semantic, pragmatic, and lexical aspects.
Specification categorized in Lexical Knowledge:
Markup languages provide extra information about a text in order to facilitate its automated processing, including editing and formatting for display or printing.
Specification categorized in Markup Language:
A meta language is the foundation of a markup language. It provides at least the syntax and the formal model. Sometimes a meta language also provides one or more document grammar formalisms for defining a specific markup language.
Specification categorized in Meta Language:
Metadata contain information about other data. This information should allow or facilitate the discovery, retrieval, use, and management of relevant resources. In short, metadata help to organize those resources.
Specification categorized in Metadata:
During morpho-syntactic annotation, each lexical token in the text corpus is assigned a tag of morpho-syntactic labels, such as part of speech and other morphological characteristics.
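As a minimal sketch of what such a tag can look like, the example below attaches a part of speech and a set of morphological features to each token by lexicon lookup. The lexicon, the tag names, and the feature names are assumptions made for illustration; real tagsets and taggers are considerably richer.

```python
# Toy lexicon: word form -> (part of speech, morphological features).
# Tag and feature names are illustrative assumptions, not a standard tagset.
LEXICON = {
    "the": ("DET", {}),
    "dogs": ("NOUN", {"Number": "Plur"}),
    "bark": ("VERB", {"Tense": "Pres"}),
}

def tag(tokens):
    """Attach a morpho-syntactic tag to every token; unknown forms get 'X'."""
    return [(t, *LEXICON.get(t.lower(), ("X", {}))) for t in tokens]

tagged = tag(["The", "dogs", "bark", "loudly"])
```

Each result is a `(token, part_of_speech, features)` triple, so further morphological characteristics can be added without changing the structure.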
Specification categorized in Morpho-syntactic Annotation:
An ontology is a formal, structured representation of any number of concepts and their relationships to each other. It describes knowledge with the help of standardized terminology, the meaning of individual concepts, and the semantic relations between the terms. All information in an ontology forms a kind of network for representing knowledge. This knowledge can be used in different ways, for example in information retrieval, artificial intelligence, the Semantic Web, software engineering, biomedical informatics, and library science.
Specification categorized in Ontology:
A query language is a computer language used for requesting information from databases or other information systems.
Specification categorized in Query:
Specification categorized in Schema:
According to Jurafsky and Martin (1999: 178), segmentation is the process of taking an undifferentiated sequence of symbols and segmenting it into meaningful linguistic units. Such units can be sentences, words, or topics. A distinction is made, for example, between sentence segmentation and word segmentation. The task of sentence segmentation is to find the sentence boundaries in a text. Similarly, the task of word segmentation is to find the word boundaries and split the text into words.
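Both tasks can be illustrated with a deliberately naive sketch. The two functions and their regular expressions are assumptions chosen for brevity: they treat `.`, `!` and `?` followed by whitespace as sentence boundaries and rely on whitespace and punctuation to delimit words.

```python
import re

def split_sentences(text):
    """Naive rule: a sentence boundary follows '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Words are runs of word characters; each punctuation mark is its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Segmentation is hard. Is it? Yes!"
sentences = split_sentences(text)
words = split_words(sentences[0])
```

Rules this simple break on abbreviations such as "e.g." and do not work for scripts that do not separate words with spaces, which is one reason dedicated segmentation specifications exist.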
Specification categorized in Segmentation:
Unlike a human, a computer cannot understand the meaning of the words or sentences in a text. However, in some areas, such as information retrieval, named entity recognition, event extraction, or sentiment analysis, it is necessary to find, extract, manipulate, or manage specific knowledge contained in heterogeneous documents in a short time. To achieve this, many statistical and machine learning methods extract and identify the information automatically; these methods rely on semantically annotated corpora.
Specification categorized in Semantic Annotation:
Specification categorized in Serialization:
Syntactic annotation involves analysis across words, and its goal is to describe the structure of sentences. It shows how words form phrases and sentences and how they relate to each other within a sentence.
Specification categorized in Syntactic Annotation:
Specification categorized in Terminology:
A thesaurus is an alphabetically or topically structured vocabulary. The terms in a thesaurus are grouped according to similarity of meaning or domain of knowledge and are interconnected by semantic relations such as synonymy, antonymy, hypernymy (generalization), and hyponymy (specialization).
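The interconnection of terms can be modelled as a small labelled graph. The following sketch is one possible in-memory representation, not the data model of any listed standard; the `Thesaurus` class, the relation names, and the sample terms are hypothetical. Here `relate(a, "hypernym", b)` is read as "b is a hypernym of a", and the inverse relation is stored automatically so that both terms stay linked.

```python
from collections import defaultdict

# Each relation maps to the relation that holds in the opposite direction.
INVERSE = {"hypernym": "hyponym", "hyponym": "hypernym",
           "synonym": "synonym", "antonym": "antonym"}

class Thesaurus:
    def __init__(self):
        # term -> relation name -> set of related terms
        self.relations = defaultdict(lambda: defaultdict(set))

    def relate(self, term, relation, other):
        """Store a relation and its inverse so both terms stay linked."""
        self.relations[term][relation].add(other)
        self.relations[other][INVERSE[relation]].add(term)

    def related(self, term, relation):
        """Return related terms alphabetically, without creating new entries."""
        return sorted(self.relations.get(term, {}).get(relation, ()))

    def terms(self):
        """Return the whole vocabulary in alphabetical order."""
        return sorted(self.relations)

t = Thesaurus()
t.relate("dog", "hypernym", "animal")
t.relate("dog", "synonym", "hound")
t.relate("cat", "hypernym", "animal")
```

With these entries, `t.related("animal", "hyponym")` lists the narrower terms of "animal", and `t.terms()` gives the alphabetical view of the vocabulary.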
Specification categorized in Thesaurus:
Specification categorized in Transcription:
The goal of transformation is to convert source data of a certain structure and content into another specific structure or format, while the source data remain unchanged. Typical tasks include changing or reformatting the data structure, merging data, and modifying the data representation.
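A small sketch can make the "source remains unchanged" property concrete. The record layout, the element names, and the `to_xml` helper below are assumptions for illustration: a list of records is converted into an XML string, and a copy taken beforehand shows that the source was not modified.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical source data: a list of flat records.
SOURCE = [{"term": "corpus", "lang": "en"}, {"term": "Korpus", "lang": "de"}]

def to_xml(records):
    """Build a new XML tree from the records without modifying them."""
    root = ET.Element("terms")
    for record in records:
        item = ET.SubElement(root, "term", lang=record["lang"])
        item.text = record["term"]
    return ET.tostring(root, encoding="unicode")

snapshot = copy.deepcopy(SOURCE)   # copy of the source before transforming
xml_text = to_xml(SOURCE)
unchanged = SOURCE == snapshot     # the source data are still identical
```

The same records could be sent through a different function to produce another target format, which is the sense in which transformation is independent of how the result is later displayed.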
Specification categorized in Transformation: