The Language Bank of Finland
Abbreviation: FIN-CLARIN
Link: https://centres.clarin.eu/centre/17
Research infrastructure:
  • CLARIN (B-centre)
Curation:
Description:

The following measures are taken to enhance the chance of future interpretability of the data.

The number of accepted file formats is small and well documented to make future conversions to other formats more feasible. Open (non-proprietary) file formats are strongly preferred. The Language Bank of Finland recommends formats listed in the CLARIN Standards Information System.

The Language Bank's participation in relevant networks such as CLARIN keeps it informed about recent developments in file formats and encodings. Plans to migrate or convert files will be developed if new standards arise.

For more information, see the Language Bank of Finland's Portal.

Data to be deposited might need to be converted to accepted or recommended formats for long-term preservation.

Plain text and XML files will normally only be accepted in Unicode character encoding, preferably UTF-8.
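
Depositors may want to check the encoding of their files before submission. The following is a minimal sketch, not a tool provided by the Language Bank: the function name `is_valid_utf8` is hypothetical, and the check only confirms that the bytes decode as UTF-8, not that the text is meaningful.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid byte sequence.
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A file that fails this check was probably saved in a legacy encoding such as Latin-1 and would need to be re-encoded before deposit.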

As a general guideline we believe that the file formats best suited for long-term sustainability and accessibility:

  • Are frequently used
  • Have open specifications
  • Are independent of specific software, developers or vendors
Data functions covered by the recommendations
Recommendations provided by this centre concern the following functions of data:
  • Audiovisual Annotation
  • Audiovisual Source Language Data
  • Catalogue Metadata
  • Documentation
  • Image Source Language Data
  • Lexical Resource
  • Text Annotation
  • Textual Source Language Data
  • Packaging
  • Metadata
Format recommendations:
Format | Domain | Level | Comments
ALTO | Text Annotation | acceptable |
CHAT | Audiovisual Annotation | discouraged | Consider using TEISpoken instead.
CHAT-XML | Audiovisual Annotation | discouraged | Consider using TEISpoken instead.
CMDI | Catalogue Metadata | recommended |
CoNLL-U | Text Annotation | recommended |
CSV | Lexical Resource | recommended |
CSV | Metadata | acceptable |
CWB-VRT | Text Annotation | recommended |
DOCX | Audiovisual Annotation | discouraged | Consider using PDFA instead.
EAF | Audiovisual Annotation | acceptable |
FLAC | Audiovisual Source Language Data | acceptable |
GZIP | Packaging | acceptable |
JPEG | Image Source Language Data | recommended |
JSON | Metadata | acceptable | regular and structured; consider using JSONLD with a schema
LMF | Lexical Resource | recommended |
Markdown | Documentation | recommended |
MP3 | Audiovisual Source Language Data | discouraged | lossy formats should be avoided if possible
MP4 | Audiovisual Source Language Data | acceptable |
MPEG-4 AVC | Audiovisual Source Language Data | recommended | 25 fps, 1920×1080, constant bit rate
PDF | Documentation | acceptable |
PDF/A | Documentation | recommended |
plainText | Audiovisual Annotation | discouraged |
plainText | Textual Source Language Data | recommended | UTF-8 encoded
plainText | Documentation | recommended | e.g. as README.txt
PNG | Image Source Language Data | recommended |
Praat | Audiovisual Annotation | acceptable |
Praat | Textual Source Language Data | recommended |
SVG | Image Source Language Data | recommended |
TAR | Packaging | acceptable |
TEI | Documentation | recommended |
TEI | Text Annotation | recommended |
TEISpoken | Audiovisual Annotation | recommended | See format description.
TSV | Lexical Resource | recommended |
WAVE | Audiovisual Source Language Data | recommended | PCM-WAV, 48 kHz, 16 bit
WAVE | Audiovisual Source Language Data | acceptable | PCM-WAV above 22 kHz/16 bit
XML | Text Annotation | acceptable |
ZIP | Packaging | recommended |

Domain descriptions:
  • Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.
  • Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes.
  • Catalogue Metadata: Basic structured information for discoverability and general description, to be openly provided for harvesting.
  • Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.
  • Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions).
  • Lexical Resource: Structured (item-based) resources for lexical and/or conceptual information on units of language (e.g. wordlists, lexicons, WordNets etc.)
  • Metadata: Comprehensive structured information including descriptive, structural and administrative metadata. See the for further hints.
  • Packaging: Packaging formats of various nature (archiving, compression, library) if no more specific domain is suitable.
  • Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off.
  • Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.
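
The WAVE rows above give concrete audio parameters, so a depositor could check a recording against them before submission. The sketch below uses Python's standard `wave` module and is only an illustration: the function name `wav_level` is hypothetical, and reading "above 22 kHz/16 bit" as "sampling rate above 22 kHz with at least 16 bits per sample" is an assumption, not a rule stated by the centre.

```python
import wave


def wav_level(path):
    """Classify a PCM WAV file against the WAVE rows of the table.

    Assumption: "above 22 kHz/16 bit" is read as a sampling rate
    above 22 kHz with at least 16 bits per sample.
    """
    with wave.open(path, "rb") as w:
        rate = w.getframerate()        # samples per second
        bits = w.getsampwidth() * 8    # bytes per sample -> bits
    if rate == 48000 and bits == 16:
        return "recommended"
    if rate > 22000 and bits >= 16:
        return "acceptable"
    return "unlisted"
```

A result of "unlisted" only means the parameters match neither WAVE row; whether such a file can still be deposited is a question for the centre.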
Last update commit-id: f1792b82