Functional domains

In order to arrive at adequate recommendations concerning individual file formats and standards, or to decide on their suitability in particular kinds of research, the purpose for which they are intended has to be taken into account. For example, while PDF/A has been developed for unproblematic long-term archiving and is an excellent format choice for (unstructured) documentation, i.e. documents containing a corpus manual, corpus guidelines or the annotation guidelines applied for the project, it is undoubtedly not suitable as a format for annotated corpus data. That demonstrates that recommending PDF/A or any other format to researchers and data depositors without information on the intended purpose of that format is bound to create issues rather than solve them. Therefore, the CLARIN Standards Committee has, by reviewing the policies and deposited data of CLARIN centres, suggested a set of functional domains representing purposes specifically relevant to the field of digital language resources.
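
The point that a recommendation is only meaningful for a format in combination with a purpose can be sketched as a lookup keyed by both. The following Python fragment is a minimal illustration and not the Committee's data model: the level names `recommended` and `discouraged`, the `RECOMMENDATIONS` table and the `recommendation` helper are assumptions introduced here, and mapping "annotated corpus data" to the Text Annotation domain is one possible reading of the PDF/A example above.

```python
from typing import Optional

# Hypothetical table: (format, functional domain) -> recommendation level.
# The two entries mirror the PDF/A example in the text; the level names
# and the choice of "Text Annotation" are assumptions made for this sketch.
RECOMMENDATIONS = {
    ("PDF/A", "Documentation"): "recommended",
    ("PDF/A", "Text Annotation"): "discouraged",
}


def recommendation(fmt: str, domain: Optional[str] = None) -> str:
    """Return a level for a format, but only when a domain is given."""
    if domain is None:
        # Without an intended purpose there is nothing meaningful to say.
        raise ValueError(
            f"no recommendation for {fmt!r} without a functional domain"
        )
    return RECOMMENDATIONS.get((fmt, domain), "no information")


print(recommendation("PDF/A", "Documentation"))
print(recommendation("PDF/A", "Text Annotation"))
```

Asking for a recommendation without a domain raises an error on purpose, which mirrors the argument that a format name alone is not enough to give advice.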

It has to be borne in mind that the set of functional domains described below has been mainly designed to be useful in the practical work of the Standards Committee in gathering information on standards and data formats currently in use within CLARIN. It does not claim or aim to be a complete and detailed taxonomy and does not reflect all possible distinctions between different resource (sub)types. The Standards Committee acknowledges that in order to make an individual recommendation on suitable formats to be used within a specific research project, more subtle differences usually become relevant. As an example, the most suitable data format for a corpus will not only depend on whether it is based on audiovisual or textual source data, but also on the complexity of the source data and the annotation schemes and possibly interoperability with relevant existing resources within the same research area.

For pragmatic reasons, some domains are very broad, e.g. the Tool Support domain comprises highly different information types, which however share the purpose of enabling the use of tools or services. Conversely, a single format might serve several purposes, e.g. when CMDI is being used for all types of metadata and contextual information, or when TEI is used to model both text annotations and contextual information (please note, however, that we want to avoid using plain "TEI" or plain "XML", without specifying the concrete schema, because such broad, unqualified format names are rather meaningless in the intended context). The focus on formats that are likely to be a part of digital language resources created and deposited by researchers in the humanities and social sciences further implies a conscious reduction of the scope of the taxonomy.

Below, we list the functional domains, with a brief characterization of each. They have been gathered into several groups (under the headings "Annotation", "Data/Resource Description", "Databases", "Source Data", and... "Uncategorized"), in order to make it easier to perceive their commonalities. Rather than forcing each domain into a group, we set the "Tool Support" and "Other" domains aside: the former because of its special and internally mixed character, and the latter because it is the "elsewhere" domain, which, by design, is going to be used when everything else fails.

Please note that, depending on the feedback that the Committee receives from users and centre representatives, it is possible to also supply more documentation, with examples, etc., or to adjust the entire system: this is partially a bottom-up initiative, after all.

  • Annotation

    • Audiovisual Annotation

      Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.

    • Image Annotation

      Annotations of image sources.

    • Text Annotation

      Annotations of textual sources/written text, with the original text included or as stand-off.

  • Data/Resource Description

    • Catalogue Metadata

      Basic structured information for discoverability and general description, to be openly provided for harvesting.

    • Contextual Information

      Structured information on the communicative event or text and its creators (i.e. participants or authors) relevant for analysis.

    • Documentation

      Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.

    • Metadata

      Comprehensive structured information including descriptive, structural and administrative metadata. See the National Information Standards Organization primer on metadata for further hints.

  • Databases

    • Geodata

      Information on geographic locations.

    • Language Description

      Structured or unstructured descriptions of linguistic varieties or phenomena, typological databases etc.

    • Lexical Resource

      Structured (item-based) resources for lexical and/or conceptual information on units of language (e.g. wordlists, lexicons, WordNets).

    • Statistical Data

      Data from surveys and tests in numeric formats.

  • Source Data

    • Audiovisual Source Language Data

      Audio or video recordings providing spoken/multimodal or signed language data for research purposes.

    • Contextual Data

      Images (photos or drawings) or documents relevant to the communicative event or text but not part of the source language data.

    • Image Source Language Data

      Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions).

    • Textual Source Language Data

      Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.

  • Uncategorized

    • Other

      Any other function that cannot be included in an existing domain. The content of this domain will be periodically examined for potential patterns that may give rise to new domains.

    • Tool Support

      Tool-related formats required for specific functionality of the tool or reliable reuse of resources (e.g. tagsets, annotation schemes, vocabularies, language models, parameter files, and other specifications or settings).
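
For readers who want to work with the list programmatically, the grouping above can be written down as a plain mapping. This is only a sketch: the names `FUNCTIONAL_DOMAINS` and `group_of` are hypothetical and not part of any CLARIN software, while the group and domain names are copied from the list above.

```python
from typing import Optional

# Groups and their functional domains, as listed above.
FUNCTIONAL_DOMAINS = {
    "Annotation": [
        "Audiovisual Annotation",
        "Image Annotation",
        "Text Annotation",
    ],
    "Data/Resource Description": [
        "Catalogue Metadata",
        "Contextual Information",
        "Documentation",
        "Metadata",
    ],
    "Databases": [
        "Geodata",
        "Language Description",
        "Lexical Resource",
        "Statistical Data",
    ],
    "Source Data": [
        "Audiovisual Source Language Data",
        "Contextual Data",
        "Image Source Language Data",
        "Textual Source Language Data",
    ],
    # Set aside rather than forced into one of the groups above.
    "Uncategorized": ["Other", "Tool Support"],
}


def group_of(domain: str) -> Optional[str]:
    """Return the group a domain is listed under, or None if unknown."""
    for group, domains in FUNCTIONAL_DOMAINS.items():
        if domain in domains:
            return group
    return None


print(group_of("Lexical Resource"))
print(sum(len(domains) for domains in FUNCTIONAL_DOMAINS.values()))
```

A flat mapping is enough here because each domain appears under exactly one heading; a richer model would only be needed if domains were later allowed to belong to several groups.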