- CLARIN (B-centre)
- Jussi Piitulainen (February 15, 2024)
The following measures are taken to improve the chances that the data will remain interpretable in the future.
The number of accepted file formats is small and well documented to make future conversions to other formats more feasible. Open (non-proprietary) file formats are strongly preferred. The Language Bank of Finland recommends formats listed in the CLARIN Standards Information System.
The Language Bank's participation in relevant networks like CLARIN keeps it informed about recent developments in file formats and encodings. Plans to migrate or convert files will be developed if new standards arise.
For more information, see the Language Bank of Finland's Portal.
Data to be deposited might need to be converted to accepted or recommended formats for long-term preservation.
Plain text and XML files will normally only be accepted in Unicode character encoding, preferably UTF-8.
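A depositor may want to check the encoding of a file before submitting it. The following is only a minimal sketch, not a Language Bank tool: the function names `is_valid_utf8` and `file_is_utf8` are hypothetical, and the check simply attempts a strict UTF-8 decode using the Python standard library.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path: str) -> bool:
    """Read the whole file as bytes and test it (hypothetical helper)."""
    return is_valid_utf8(Path(path).read_bytes())
```

Note that a pure ASCII file also passes this check, since ASCII is a subset of UTF-8, and that a successful decode does not prove the file was intended as UTF-8; it only shows that the bytes form a valid UTF-8 sequence.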
As a general guideline, we believe that the file formats best suited for long-term sustainability and accessibility:
- Are frequently used
- Have open specifications
- Are independent of specific software, developers or vendors

Each recommendation applies to a format within one of the following domains:

- Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.
- Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes.
- Catalogue Metadata: Basic structured information for discoverability and general description, to be openly provided for harvesting.
- Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.
- Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions).
- Lexical Resource: Structured (item-based) resources for lexical and/or conceptual information on units of language (e.g. wordlists, lexicons, WordNets etc.).
- Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off.
- Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.
- Tool Support: Tool-related formats required for specific functionality of the tool or reliable reuse of resources (e.g. tagsets, annotation schemes, vocabularies, language models, parameter files, and other specifications or settings).
- Metadata: Comprehensive structured information including descriptive, structural and administrative metadata.

| Format | Domain | Level | Comments |
|---|---|---|---|
| ALTO | Text Annotation | acceptable | |
| CHAT | Audiovisual Annotation | discouraged | |
| CHAT-XML | Audiovisual Annotation | discouraged | |
| CMDI | Catalogue Metadata | recommended | |
| CoNLL-U | Text Annotation | recommended | |
| CSV | Lexical Resource | recommended | |
| CSV | Metadata | acceptable | |
| CWB-VRT | Text Annotation | recommended | |
| DOCX | Audiovisual Annotation | discouraged | |
| EAF | Audiovisual Annotation | acceptable | |
| FLAC | Audiovisual Source Language Data | acceptable | |
| GZIP | Tool Support | acceptable | |
| JPEG | Image Source Language Data | recommended | |
| JSON | Metadata | acceptable | |
| LMF | Lexical Resource | recommended | |
| Markdown | Documentation | recommended | |
| MP3 | Audiovisual Source Language Data | discouraged | |
| MP4 | Audiovisual Source Language Data | acceptable | |
| MPEG-4 AVC | Audiovisual Source Language Data | recommended | |
| | Documentation | acceptable | |
| PDF/A | Documentation | recommended | |
| plainText | Audiovisual Annotation | discouraged | |
| plainText | Textual Source Language Data | recommended | |
| plainText | Documentation | recommended | |
| PNG | Image Source Language Data | recommended | |
| Praat | Audiovisual Annotation | acceptable | |
| Praat | Textual Source Language Data | recommended | |
| SVG | Image Source Language Data | recommended | |
| TAR | Tool Support | acceptable | |
| TEI | Documentation | recommended | |
| TEI | Text Annotation | recommended | |
| TEISpoken | Audiovisual Annotation | recommended | |
| TSV | Lexical Resource | recommended | |
| WAVE | Audiovisual Source Language Data | recommended | |
| WAVE | Audiovisual Source Language Data | acceptable | |
| XML | Text Annotation | acceptable | |
| ZIP | Tool Support | recommended | |
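
Because the level depends on the combination of format and domain, not on the format alone, a lookup keyed on both values is a natural way to use the table in a script. The sketch below is illustrative only: the names `LEVELS` and `level_for` are hypothetical, and the dictionary copies just a handful of rows from the table rather than the full list.

```python
from typing import Optional

# A few (format, domain) -> level entries copied from the table above.
# This is a partial, illustrative subset, not the complete recommendation list.
LEVELS = {
    ("ALTO", "Text Annotation"): "acceptable",
    ("CoNLL-U", "Text Annotation"): "recommended",
    ("CHAT", "Audiovisual Annotation"): "discouraged",
    ("CSV", "Lexical Resource"): "recommended",
    ("CSV", "Metadata"): "acceptable",
    ("plainText", "Audiovisual Annotation"): "discouraged",
    ("plainText", "Textual Source Language Data"): "recommended",
}


def level_for(fmt: str, domain: str) -> Optional[str]:
    """Return the level for a format in a domain, or None if not listed here."""
    return LEVELS.get((fmt, domain))


if __name__ == "__main__":
    # The same format can be rated differently depending on the domain.
    print(level_for("plainText", "Textual Source Language Data"))
    print(level_for("plainText", "Audiovisual Annotation"))
```

A combination that is missing from the dictionary returns `None`, which a caller should treat as "not assessed in this subset" rather than as "discouraged".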