The Language Bank of Finland
Abbreviation: FIN-CLARIN
Registry: CLARIN: https://centres.clarin.eu/centre/17
Research infrastructure:
  • CLARIN (B-centre)
Curation:
Description:

The following measures are taken to improve the chances that the data remain interpretable in the future.

The set of accepted file formats is kept small, and the formats are well documented, which makes future conversions to other formats more feasible. Open (non-proprietary) file formats are strongly preferred. The Language Bank of Finland recommends formats listed in the CLARIN Standards Information System.

The Language Bank's participation in relevant networks such as CLARIN keeps it continuously informed about recent developments in file formats and encodings. Plans to migrate or convert files will be developed if new standards arise.

For more information, see the Language Bank of Finland's Portal.

Data to be deposited might need to be converted to accepted or recommended formats for long-term preservation.

Plain text and XML files will normally only be accepted in Unicode character encoding, preferably UTF-8.
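
As an illustration of the encoding requirement above, a depositor could check files before submission. The following is a minimal sketch in Python, not a tool provided by the Language Bank of Finland; the function names `is_valid_utf8` and `non_utf8_files` are hypothetical and chosen only for this example.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_utf8_files(paths):
    """Return the subset of paths whose content is not valid UTF-8.

    Hypothetical helper: reads each file fully into memory, which is
    fine for typical text and XML files but not for very large ones.
    """
    return [p for p in paths if not is_valid_utf8(Path(p).read_bytes())]
```

Note that pure ASCII files pass this check, since ASCII is a subset of UTF-8. The check only shows that the bytes are well-formed UTF-8; it cannot prove that the text was decoded from the right legacy encoding before conversion.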

As a general guideline, we believe that the file formats best suited for long-term sustainability and accessibility:

  • Are frequently used
  • Have open specifications
  • Are independent of specific software, developers or vendors
Data functions covered by the recommendations: ...
Format recommendations:
Format | Domain | Level | Comments
EAF | Audiovisual Annotation | acceptable |
WAVE | Audiovisual Source Language Data | acceptable | PCM-WAV above 22 kHz/16 bit
PDF | Documentation | acceptable |
Praat | Audiovisual Annotation | acceptable |
XML | Text Annotation | acceptable |
MP4 | Audiovisual Source Language Data | acceptable |
FLAC | Audiovisual Source Language Data | acceptable |
GZIP | Packaging | acceptable |
TAR | Packaging | acceptable |
JSON | Metadata | acceptable | regular and structured; consider using JSONLD with a schema
CSV | Metadata | acceptable |
ALTO | Text Annotation | acceptable |
CHAT | Audiovisual Annotation | discouraged | Consider using TEISpoken instead.
CHAT-XML | Audiovisual Annotation | discouraged | Consider using TEISpoken instead.
DOCX | Audiovisual Annotation | discouraged | Consider using PDFA instead.
plainText | Audiovisual Annotation | discouraged |
MP3 | Audiovisual Source Language Data | discouraged | lossy formats should be avoided if possible
TEISpoken | Audiovisual Annotation | recommended | See format description.
WAVE | Audiovisual Source Language Data | recommended | PCM-WAV, 48 kHz, 16 bit
CMDI | Catalogue Metadata | recommended |
PDF/A | Documentation | recommended |
TEI | Documentation | recommended |
JPEG | Image Source Language Data | recommended |
PNG | Image Source Language Data | recommended |
SVG | Image Source Language Data | recommended |
CSV | Lexical Resource | recommended |
TSV | Lexical Resource | recommended |
LMF | Lexical Resource | recommended |
CoNLL-U | Text Annotation | recommended |
CWB-VRT | Text Annotation | recommended |
TEI | Text Annotation | recommended |
Praat | Textual Source Language Data | recommended |
plainText | Textual Source Language Data | recommended | UTF-8 encoded
MPEG-4 AVC | Audiovisual Source Language Data | recommended | 25 fps, 1920×1080, constant bit rate
ZIP | Packaging | recommended |
Markdown | Documentation | recommended |
plainText | Documentation | recommended | e.g. as README.txt

Domains:
  • Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.
  • Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes.
  • Catalogue Metadata: Basic structured information for discoverability and general description, to be openly provided for harvesting.
  • Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.
  • Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions).
  • Lexical Resource: Structured (item-based) resources for lexical and/or conceptual information on units of language (e.g. wordlists, lexicons, WordNets etc.).
  • Metadata: Comprehensive structured information including descriptive, structural and administrative metadata.
  • Packaging: Packaging formats of various nature (archiving, compression, library) if no more specific domain is suitable.
  • Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off.
  • Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.
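
The two WAVE entries in the recommendations above differ only in sampling rate and bit depth, so a recording can be checked against them automatically. The following is a minimal sketch using Python's standard `wave` module. The function names `classify_pcm` and `wav_level` are hypothetical, and the reading of "above 22 kHz/16 bit" as "sampling rate above 22 000 Hz and at least 16 bits per sample" is an assumption made for this example, not a rule stated by the Language Bank of Finland.

```python
import wave


def classify_pcm(rate_hz: int, bits: int) -> str:
    """Map sampling rate and bit depth to a level from the WAVE rows.

    Assumption: "above 22 kHz/16 bit" is read as rate > 22000 Hz and
    at least 16 bits per sample; anything else is reported separately.
    """
    if rate_hz == 48000 and bits == 16:
        return "recommended"
    if rate_hz > 22000 and bits >= 16:
        return "acceptable"
    return "below threshold"


def wav_level(path: str) -> str:
    """Read a WAV header and classify the file.

    The standard wave module only reads uncompressed PCM data and
    raises wave.Error for other encodings.
    """
    with wave.open(path, "rb") as w:
        rate_hz = w.getframerate()
        bits = w.getsampwidth() * 8  # sample width is given in bytes
    return classify_pcm(rate_hz, bits)
```

Under this reading, a 44.1 kHz/16 bit recording counts as acceptable, and only 48 kHz/16 bit counts as recommended; a centre may of course judge higher-resolution files differently.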
Last update commit-id: f1792b82