- CLARIN (B-centre)
- Jussi Piitulainen (February 15, 2024)
The following measures are taken to improve the chances that the data will remain interpretable in the future.
The set of accepted file formats is kept small and well documented so that future conversions to other formats remain feasible. Open (non-proprietary) file formats are strongly preferred. The Language Bank of Finland recommends formats listed in the CLARIN Standards Information System.
The Language Bank's participation in relevant networks such as CLARIN keeps it informed of recent developments in file formats and encodings. Plans to migrate or convert files will be developed if new standards arise.
For more information, see the Language Bank of Finland's Portal.
Data to be deposited might need to be converted to accepted or recommended formats for long-term preservation.
Plain text and XML files will normally only be accepted in Unicode character encoding, preferably UTF-8.
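As an illustration of the encoding requirement, a depositor could check that a file decodes as UTF-8 before submitting it. The following is a minimal Python sketch under that assumption, not part of the Language Bank's own tooling; the function names `is_valid_utf8` and `file_is_valid_utf8` are hypothetical.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def file_is_valid_utf8(path: str) -> bool:
    """Read a file in binary mode and check that its content is valid UTF-8."""
    # Hypothetical helper: reads the whole file into memory, which is
    # fine for typical text files but not for very large ones.
    return is_valid_utf8(Path(path).read_bytes())
```

A passing check only shows that the bytes are well-formed UTF-8. Pure ASCII content also passes, and the check cannot tell whether the decoded characters are the ones the author intended.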
As a general guideline, we believe that the file formats best suited for long-term sustainability and accessibility:
- Are frequently used
- Have open specifications
- Are independent of specific software, developers or vendors
| Format | Domain | Level | Comments |
|---|---|---|---|
| EAF | Audiovisual Annotation | acceptable | |
| WAVE | Audiovisual Source Language Data | acceptable | |
| | Documentation | acceptable | |
| Praat | Audiovisual Annotation | acceptable | |
| XML | Text Annotation | acceptable | |
| MP4 | Audiovisual Source Language Data | acceptable | |
| FLAC | Audiovisual Source Language Data | acceptable | |
| GZIP | Packaging | acceptable | |
| TAR | Packaging | acceptable | |
| JSON | Metadata | acceptable | |
| CSV | Metadata | acceptable | |
| ALTO | Text Annotation | acceptable | |
| CHAT | Audiovisual Annotation | discouraged | |
| CHAT-XML | Audiovisual Annotation | discouraged | |
| DOCX | Audiovisual Annotation | discouraged | |
| plainText | Audiovisual Annotation | discouraged | |
| MP3 | Audiovisual Source Language Data | discouraged | |
| TEISpoken | Audiovisual Annotation | recommended | |
| WAVE | Audiovisual Source Language Data | recommended | |
| CMDI | Catalogue Metadata | recommended | |
| PDF/A | Documentation | recommended | |
| TEI | Documentation | recommended | |
| JPEG | Image Source Language Data | recommended | |
| PNG | Image Source Language Data | recommended | |
| SVG | Image Source Language Data | recommended | |
| CSV | Lexical Resource | recommended | |
| TSV | Lexical Resource | recommended | |
| LMF | Lexical Resource | recommended | |
| CoNLL-U | Text Annotation | recommended | |
| CWB-VRT | Text Annotation | recommended | |
| TEI | Text Annotation | recommended | |
| Praat | Textual Source Language Data | recommended | |
| plainText | Textual Source Language Data | recommended | |
| MPEG-4 AVC | Audiovisual Source Language Data | recommended | |
| ZIP | Packaging | recommended | |
| Markdown | Documentation | recommended | |
| plainText | Documentation | recommended | |

The domains in the table are defined as follows:
- Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.
- Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes.
- Catalogue Metadata: Basic structured information for discoverability and general description, to be openly provided for harvesting.
- Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.
- Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions).
- Lexical Resource: Structured (item-based) resources for lexical and/or conceptual information on units of language (e.g. wordlists, lexicons, WordNets etc.).
- Metadata: Comprehensive structured information including descriptive, structural and administrative metadata.
- Packaging: Packaging formats of various nature (archiving, compression, library) if no more specific domain is suitable.
- Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off.
- Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.