Centre: FIN-CLARIN

The Language Bank of Finland

Abbreviation: FIN-CLARIN

Registry: CLARIN: https://centres.clarin.eu/centre/17

Research infrastructure:

CLARIN (B-centre)

Curation:

Jussi Piitulainen (February 15, 2024)

Description:

The following measures are taken to improve the chances that the data remain interpretable in the future.

The number of accepted file formats is kept small, and the formats are well documented, to make future conversions to other formats more feasible. Open (non-proprietary) file formats are strongly preferred. The Language Bank of Finland recommends formats listed in the CLARIN Standards Information System.

The Language Bank's participation in relevant networks such as CLARIN keeps it informed of recent developments in file formats and encodings. Plans to migrate or convert files will be developed if new standards arise.

For more information, see the Language Bank of Finland's Portal.

Data to be deposited might need to be converted to accepted or recommended formats for long-term preservation.

Plain text and XML files will normally only be accepted in Unicode character encoding, preferably UTF-8.
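
As an illustration of the encoding requirement above, a depositor could check a file before submission roughly as follows. This is a minimal sketch, not a tool provided by the Language Bank of Finland; the function names `decodes_as_utf8` and `file_is_utf8` are hypothetical.

```python
from pathlib import Path


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path) -> bool:
    """Read the whole file as bytes and test whether it decodes as UTF-8.

    Hypothetical helper: it only checks well-formedness, not whether the
    decoded text is what the author intended.
    """
    return decodes_as_utf8(Path(path).read_bytes())
```

Note that pure ASCII files pass this check, since ASCII is a subset of UTF-8. A file in a legacy encoding such as Latin-1 that contains non-ASCII characters will usually fail it and would need to be converted first.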

As a general guideline, we believe that the file formats best suited for long-term sustainability and accessibility:

Are frequently used
Have open specifications
Are independent of specific software, developers or vendors

Data functions covered by the recommendations: ...

Format recommendations:

Format	Domain	Level	Comments
EAF	Audiovisual Annotation	acceptable
WAVE	Audiovisual Source Language Data	acceptable	PCM-WAV above 22 kHz/16 bit
PDF	Documentation	acceptable
Praat	Audiovisual Annotation	acceptable
XML	Text Annotation	acceptable
MP4	Audiovisual Source Language Data	acceptable
FLAC	Audiovisual Source Language Data	acceptable
GZIP	Packaging	acceptable
TAR	Packaging	acceptable
JSON	Metadata	acceptable	regular and structured; consider using JSONLD with a schema
CSV	Metadata	acceptable
ALTO	Text Annotation	acceptable
CHAT	Audiovisual Annotation	discouraged	Consider using TEISpoken instead.
CHAT-XML	Audiovisual Annotation	discouraged	Consider using TEISpoken instead.
DOCX	Audiovisual Annotation	discouraged	Consider using PDFA instead.
plainText	Audiovisual Annotation	discouraged
MP3	Audiovisual Source Language Data	discouraged	lossy formats should be avoided if possible
TEISpoken	Audiovisual Annotation	recommended	See format description.
WAVE	Audiovisual Source Language Data	recommended	PCM-WAV, 48 kHz, 16 bit
CMDI	Catalogue Metadata	recommended
PDF/A	Documentation	recommended
TEI	Documentation	recommended
JPEG	Image Source Language Data	recommended
PNG	Image Source Language Data	recommended
SVG	Image Source Language Data	recommended
CSV	Lexical Resource	recommended
TSV	Lexical Resource	recommended
LMF	Lexical Resource	recommended
CoNLL-U	Text Annotation	recommended
CWB-VRT	Text Annotation	recommended
TEI	Text Annotation	recommended
Praat	Textual Source Language Data	recommended
plainText	Textual Source Language Data	recommended	UTF-8 encoded
MPEG-4 AVC	Audiovisual Source Language Data	recommended	25 fps, 1920×1080, constant bit rate
ZIP	Packaging	recommended
Markdown	Documentation	recommended
plainText	Documentation	recommended	e.g. as README.txt

Domain descriptions:

Domain	Description
Audiovisual Annotation	Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.
Audiovisual Source Language Data	Audio or video recordings providing spoken/multimodal or signed language data for research purposes.
Catalogue Metadata	Basic structured information for discoverability and general description, to be openly provided for harvesting.
Documentation	Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.
Image Source Language Data	Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions).
Lexical Resource	Structured (item-based) resources for lexical and/or conceptual information on units of language (e.g. wordlists, lexicons, WordNets etc.).
Metadata	Comprehensive structured information including descriptive, structural and administrative metadata.
Packaging	Packaging formats of various nature (archiving, compression, library) if no more specific domain is suitable.
Text Annotation	Annotations of textual sources/written text, with the original text included or as stand-off.
Textual Source Language Data	Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.
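
The two WAVE rows in the table distinguish a recommended setting (PCM-WAV, 48 kHz, 16 bit) from an acceptable one (PCM-WAV above 22 kHz/16 bit). The sketch below shows one way a depositor might check a recording against those rows using Python's standard `wave` module. The function names and the "below threshold" label are hypothetical, and reading "above 22 kHz/16 bit" as "sample rate above 22 000 Hz and at least 16 bits per sample" is an assumption, not a rule stated by the centre.

```python
import wave


def classify_pcm(rate_hz: int, bits: int) -> str:
    """Map a sample rate and bit depth to a level from the WAVE rows.

    Assumption: "above 22 kHz/16 bit" means rate > 22000 Hz and >= 16 bits.
    """
    if rate_hz == 48000 and bits == 16:
        return "recommended"
    if rate_hz > 22000 and bits >= 16:
        return "acceptable"
    return "below threshold"  # hypothetical label, not a level in the table


def classify_wav(path: str) -> str:
    """Read the header of a PCM WAV file and classify it.

    wave.open raises wave.Error for files it cannot parse as PCM WAV.
    """
    with wave.open(path, "rb") as wav:
        rate_hz = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
    return classify_pcm(rate_hz, bits)
```

For example, a 44.1 kHz, 16-bit recording would be classified as acceptable under these assumptions, while a 16 kHz recording would fall below the threshold.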

Last update commit-id: f1792b82
