eXtensible Markup Language
Abbreviation: XML
Type | Id
SIS ID | fXML
LOC (Library of Congress) | fdd000075
PRONOM (UK National Archives) | fmt/101
Wikidata | Q2115
Media type(s):
File extension(s): .xml
Format family: Markup.Full
Functional domains:
  • Audiovisual Annotation
  • Catalogue Metadata
  • Contextual Information
  • Documentation
  • Geodata
  • Image Annotation
  • Language Description
  • Lexical Resource
  • Metadata
  • Text Annotation
  • Textual Source Language Data
  • Tool Support
Centre | Domain | Level | Comments
PORTULAN-CLARIN | Lexical Resource | acceptable |
PORTULAN-CLARIN | Metadata | acceptable |
IDS | Metadata | acceptable |
CLARINO_Bergen | Text Annotation | acceptable | Well-known, well-defined XML standards are preferred. When depositing non-standard or lesser-known formats, consider also depositing schema documents (ODD, XSD, DTD or RelaxNG), guidelines and documentation to improve usability.
Sprakbanken | Metadata | acceptable |
SAW | Catalogue Metadata | acceptable |
SAW | Contextual Information | acceptable |
SAW | Metadata | acceptable |
LAC | Metadata | acceptable |
FIN-CLARIN | Text Annotation | acceptable |
OTA | Metadata | acceptable |
CLARIN-CH | Documentation | acceptable |
CLARIN-CH | Metadata | acceptable |
CLARIN.SI | Documentation | recommended |
CLARIN.SI | Text Annotation | recommended |
Sprakbanken | Documentation | recommended |
SAW | Audiovisual Annotation | recommended |
SAW | Image Annotation | recommended |
SAW | Text Annotation | recommended |
SAW | Language Description | recommended |
MPI-PL | Documentation | recommended |
MPI-PL | Text Annotation | recommended |
ACDH-ARCHE | Documentation | recommended |
ACDH-ARCHE | Text Annotation | recommended |
MI | Documentation | recommended |
MI | Text Annotation | recommended |
CLARIN-DK-UCPH | Documentation | recommended |
CLARIN-DK-UCPH | Text Annotation | recommended |
EKUT | Documentation | recommended |
EKUT | Text Annotation | recommended |
OTA | Documentation | recommended |
ZIM | Documentation | recommended |
ZIM | Text Annotation | recommended |
BBAW | Documentation | recommended |
BBAW | Text Annotation | recommended |
DANS | Audiovisual Annotation | recommended | See more info from DANS
DANS | Catalogue Metadata | recommended | See more info from DANS
DANS | Contextual Information | recommended | See more info from DANS
DANS | Documentation | recommended | See more info from DANS
DANS | Geodata | recommended | See more info from DANS
DANS | Image Annotation | recommended | See more info from DANS
DANS | Language Description | recommended | See more info from DANS
DANS | Lexical Resource | recommended | See more info from DANS
DANS | Metadata | recommended | See more info from DANS
DANS | Text Annotation | recommended | See more info from DANS
DANS | Textual Source Language Data | recommended | See more info from DANS
DANS | Tool Support | recommended | See more info from DANS
ILC4CLARIN | Documentation | recommended |
ILC4CLARIN | Text Annotation | recommended |
CLARIN-CH | Textual Source Language Data | recommended |
CLARIN-CH | Text Annotation | recommended |
CLARIN-CH | Audiovisual Annotation | recommended |
CLARIN-CH | Language Description | recommended |
CLARIN-CH | Lexical Resource | recommended |

Domain descriptions:
  • Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.
  • Catalogue Metadata: Basic structured information for discoverability and general description, to be openly provided for harvesting.
  • Contextual Information: Structured information on the communicative event or text and its creators (i.e. participants or authors) relevant for analysis.
  • Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.
  • Geodata: Information on geographic locations.
  • Image Annotation: Annotations of image sources.
  • Language Description: Structured or unstructured descriptions of linguistic varieties or phenomena, typological databases etc.
  • Lexical Resource: Structured (item-based) resources for lexical and/or conceptual information on units of language (e.g. wordlists, lexicons, WordNets etc.).
  • Metadata: Comprehensive structured information including descriptive, structural and administrative metadata.
  • Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off.
  • Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.
  • Tool Support: Tool-related formats required for specific functionality of the tool or reliable reuse of resources (e.g. tagsets, annotation schemes, vocabularies, language models, parameter files, and other specifications or settings).

In the context of format recommendations, "XML" is too general a pointer to support a meaningful recommendation. In fringe cases, a centre may receive a deposit of prose text encoded simply as a series of <paragraph> elements within a <text> element. In regular cases, however, and especially if the text has internal structure and is accompanied by annotations and a header containing metadata, it is best to adhere to one of the established formats that are generally recognized by the tools at the disposal of CLARIN centres.

The following is an emphatically non-exhaustive list of XML-based formats recognized by CLARIN:

  • TEI-based formats
  • XCES
  • TigerXML
  • TCF
  • FOLKER/OrthoNormal
  • ALTO
  • FoLiA
  • SVG (for graphics)
  • XSD, RNG, Schematron (for document grammars)
  • ...
  • many others: the SIS is going to feature a format-family browsing facility in the "near future".
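
To illustrate why "XML" alone says little about a deposit, the following minimal Python sketch guesses the concrete format of a document from its root element and namespace. The mapping covers only a few of the formats listed above and is an illustrative assumption, not a CLARIN or SIS registry; the function name guess_format and the fallback label "generic XML" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative mapping from (namespace URI, root element name) to a format label.
# Only a handful of the formats listed above are covered.
KNOWN_ROOTS = {
    ("http://www.tei-c.org/ns/1.0", "TEI"): "TEI",
    ("http://www.tei-c.org/ns/1.0", "teiCorpus"): "TEI",
    ("http://www.w3.org/2000/svg", "svg"): "SVG",
    ("http://www.w3.org/2001/XMLSchema", "schema"): "XSD",
    ("http://relaxng.org/ns/structure/1.0", "grammar"): "RNG",
    ("http://purl.oclc.org/dsdl/schematron", "schema"): "Schematron",
}


def guess_format(xml_text: str) -> str:
    """Return a format label based on the root element, or 'generic XML'."""
    root = ET.fromstring(xml_text)
    tag = root.tag
    # ElementTree writes namespaced tags as "{namespace}localname".
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
    else:
        namespace, local = "", tag
    return KNOWN_ROOTS.get((namespace, local), "generic XML")


# The fringe case from the text: ad hoc elements with no established vocabulary.
ad_hoc = "<text><paragraph>Some prose.</paragraph></text>"
# A skeletal TEI document, recognizable by its namespace and root element.
tei = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/><text/></TEI>'

print(guess_format(ad_hoc))
print(guess_format(tei))
```

Both inputs are well-formed XML, but only the second identifies a vocabulary that tools can rely on; the ad hoc document falls through to the fallback label.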
Keywords: data format, annotation format, format family
Related Standard(s):
