Corpus Encoding Standard
Abbreviation: CES/XCES
Scope: corpus annotation
Topic: Generic Corpus Annotation
Standard body: EAGLES
Keywords: corpus, corpus encoding, CES, XCES

Version Title: Corpus Encoding Standard
Abbreviation: CES
Version Number: 4.1
Status: final
Release Date: 1996-10-14
  1. Nancy Ide
  2. Greg Priest-Dorman

CES is an encoding standard for corpus annotation. It was developed within the framework of the EAGLES (Expert Advisory Group on Language Engineering Standards) project. The aim of the CES is to provide a unitary coding standard for linguistic corpus annotation. The CES can be used to encode corpora as resources for natural language processing.

SGML (ISO 8879:1986 Standard Generalized Markup Language) was the foundation of the CES. Besides SGML, the TEI (Text Encoding Initiative) Guidelines were taken into account in the development of the CES. Like the TEI, the CES standardizes document structure (e.g. title, caption, break) and document information (metadata). In addition, the CES standardizes the linguistic annotation of a text (e.g. morpho-syntactic tagging, parallel text alignment, prosody, phonetic transcription). TEI P3 and the CES are compatible with each other, so they can be used side by side. XCES (Corpus Encoding Standard for XML) has also been developed as an XML-based version of the CES.

The CES can be applied to monolingual, multilingual, and parallel corpora.

  • metaLanguage: SGML
  • constraintLanguage: DTD
  • grammarClass: LTG
  • formalModel: Tree
  • notation: Standoff
  • multipleHierarchies: standoff annotation
Related Standard(s):
  • SGML

    CES is based on SGML.

  • TEI Guidelines-1994

    CES is an application of the SGML-based TEI P3 using the TEI modification layer.

  • XCES

    CES is the SGML ancestor of the XML-based XCES.

Version Title: Corpus Encoding Standard in XML
Abbreviation: XCES
Version Number: 1.0.4
Status: final
Release Date: 2008-06-20
  1. Nancy Ide
  2. Patrice Bonhomme

XCES is the XML version of the Corpus Encoding Standard (CES). It was developed by the Department of Computer Science, Vassar College, and Equipe Langue et Dialogue, LORIA/CNRS, because XML is the standard for data representation and exchange on the World Wide Web. The aims of this conversion included offering a state-of-the-art representation of corpus data and making the standard accessible to the language engineering community.

XCES offers DTDs and XML schemas for encoding basic document structure and linguistic annotation. The XML implementation of the CES supports syntactic annotation in addition to morpho-syntactic annotation. With the aid of XLink and XPointer, XCES offers a more powerful method of referring to standoff-annotated corpus data than the SGML-based CES.
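The idea of standoff annotation linked through XLink can be illustrated with a minimal Python sketch. The element and attribute names used here (`text`, `s`, `w`, `annotations`, `ann`, `pos`) and the file name `primary.xml` are hypothetical and chosen only for illustration; they are not taken from the XCES schemas. The sketch assumes that each annotation points at a token in the primary document through an `xlink:href` whose fragment is a bare element ID.

```python
import xml.etree.ElementTree as ET

# Attribute name of xlink:href as ElementTree exposes it (namespace in braces).
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical primary document: tokens carry IDs but no linguistic annotation.
PRIMARY = """<text>
  <s id="s1"><w id="w1">Corpora</w> <w id="w2">matter</w></s>
</text>"""

# Hypothetical standoff document: each <ann> points at a token in the
# primary document via xlink:href with a bare-name fragment identifier.
ANNOTATION = """<annotations xmlns:xlink="http://www.w3.org/1999/xlink">
  <ann xlink:href="primary.xml#w1" pos="NOUN"/>
  <ann xlink:href="primary.xml#w2" pos="VERB"/>
</annotations>"""


def resolve_standoff(primary_xml, annotation_xml):
    """Return (token text, pos) pairs by following each href fragment."""
    primary = ET.fromstring(primary_xml)
    # Index every element of the primary document that has an id attribute.
    by_id = {el.get("id"): el for el in primary.iter() if el.get("id")}
    pairs = []
    for ann in ET.fromstring(annotation_xml).iter("ann"):
        # Keep only the part after '#', i.e. the ID of the target element.
        _, _, fragment = ann.get(XLINK_HREF).partition("#")
        pairs.append((by_id[fragment].text, ann.get("pos")))
    return pairs


print(resolve_standoff(PRIMARY, ANNOTATION))
```

Because the annotations live outside the primary document, several independent annotation layers can point at the same tokens without altering the source text. The sketch ignores the document part of the href and handles only bare-name fragments, not full XPointer expressions.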

Furthermore, XCES currently includes XML Schemas for validation and some XSLT scripts for transforming documents into HTML.

XCES is under continuing development and is planned to become compliant with TEI P5. Currently, the gap between the TEI Guidelines and XCES is so large that TEI P5 cannot be used in XCES.

  • metaLanguage: XML
  • constraintLanguage: XSD
  • grammarClass: LTG
  • formalModel: Graph
  • notation: Standoff
  • multipleHierarchies: standoff annotation
Related Standard(s):
  • CES

    XCES is the XML instantiation of CES.

  • LAF-2012
  • TEI Guidelines

    The XCES specification is based on the TEI P3 Standard.

  • XML

    XCES is an application of the Extensible Markup Language (XML), i.e. it uses XML syntax.

  • XSD

    XCES uses XML Schema 1.0 as a constraint language.

Used in CLARIN centre(s):