- CLARIN (C-centre)
- Leif-Jöran Olsson (October 17, 2023)
The following measures are taken to improve the chances that the data remain interpretable in the future.
The set of accepted file formats is kept small and well documented to make future conversions to other formats more feasible. As far as possible, open (non-proprietary) file formats are used. For textual resources, XML formats are used whenever possible, so that the files can still be interpreted even if the tool that created them no longer exists. The Språkbanken Text repository recommends using the formats listed in the CLARIN Standards Information System.
Språkbanken Text's participation in relevant networks such as CLARIN keeps it informed about recent developments in file formats and encodings. Plans to migrate or convert files will be developed if new standards arise; all relevant features of the old formats will be preserved using reliable procedures.
For more information, see the full Språkbanken Text repository description.
The Språkbanken Text CLARIN repository gives further information on policies.
Data will not necessarily be stored in the delivered formats. They might be converted to more appropriate formats for long-term preservation.
Data that are not delivered in the accepted formats usually require more curation effort. Through a consultation, you can receive an estimate of when capacity will be available for depositing data in formats that are not accepted.
Plain text and XML files will normally be accepted only in a Unicode character encoding, including UTF-8.
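
Before depositing, it can help to confirm that a text or XML file really is in the expected encoding. The following is a minimal sketch, not part of the repository's own tooling: the function names `is_utf8` and `file_is_utf8` are hypothetical, and the check covers UTF-8 only, not other Unicode encodings such as UTF-16.

```python
from pathlib import Path


def is_utf8(data: bytes) -> bool:
    """Return True if the given bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path: str) -> bool:
    """Read a file as raw bytes and test whether it is valid UTF-8."""
    return is_utf8(Path(path).read_bytes())


if __name__ == "__main__":
    # The same word encoded two ways: only the first is valid UTF-8.
    print(is_utf8("Språkbanken".encode("utf-8")))
    print(is_utf8("Språkbanken".encode("latin-1")))
```

A file that fails this check would have to be re-encoded before it matches the encoding requirement stated above.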
As a general guideline, Språkbanken Text believes that the file formats best suited for long-term sustainability and accessibility:
- Are frequently used
- Have open specifications
- Are independent of specific software, developers or vendors
Format | Domain | Level | Comments |
---|---|---|---|
AIFF | Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes. | acceptable | |
ALTO | Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off. | acceptable | |
CHAT | Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation. | discouraged | |
CHAT-XML | Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation. | discouraged | |
CMDI | Metadata: Comprehensive structured information including descriptive, structural and administrative metadata. See the for further hints. | recommended | |
CoNLL-U | Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off. | recommended | |
CoNLL-U | Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes. | recommended | |
CSV | Metadata: Comprehensive structured information including descriptive, structural and administrative metadata. See the for further hints. | acceptable | |
DC XML | Metadata: Comprehensive structured information including descriptive, structural and administrative metadata. See the for further hints. | recommended | |
DOCX | Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation. | discouraged | |
DOCX | Metadata: Comprehensive structured information including descriptive, structural and administrative metadata. See the for further hints. | discouraged | |
EAF | Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation. | acceptable | |
FLAC | Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes. | acceptable | |
GeoJSON | Geodata: Information on geographic locations. | recommended | |
GeoTIFF | Geodata: Information on geographic locations. | recommended | |
GML | Geodata: Information on geographic locations. | recommended | |
HTML | Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes. | acceptable | |
JSON | Metadata: Comprehensive structured information including descriptive, structural and administrative metadata. See the for further hints. | acceptable | |
JSON | Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off. | acceptable | |
JSON | Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes. | acceptable | |
KML | Geodata: Information on geographic locations. | acceptable | |
M2J | Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes. | acceptable | |
Markdown | Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines. | acceptable | |
MP3 | Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes. | discouraged | |
MP4 | Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes. | acceptable | |
MPEG-1 | Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes. | acceptable | |
MPEG-2 | Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes. | acceptable | |
MPEG-4 AVC | Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes. | recommended | |
ODT | Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines. | recommended | |
| | Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines. | acceptable | |
PDF/A | Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions). | recommended | |
PDF/A-1 | Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines. | recommended | |
PDF/A-2 | Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines. | recommended | |
PDF/A-3 | Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines. | acceptable | |
plainText | Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation. | discouraged | |
plainText | Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines. | recommended | |
plainText | Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off. | discouraged | |
plainText | Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes. | acceptable | |
Praat | Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation. | acceptable | |
QGIS.qgs | Geodata: Information on geographic locations. | acceptable | |
Shapefile | Geodata: Information on geographic locations. | acceptable | |
SVG | Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions). | recommended | |
TEI | Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines. | recommended | |
TEI | Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off. | recommended | |
TEIHeader | Metadata: Comprehensive structured information including descriptive, structural and administrative metadata. See the for further hints. | recommended | |
TEISpoken | Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation. | recommended | |
TIFF | Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions). | recommended | |
WAVE | Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes. | recommended | |
WAVE | Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes. | acceptable | |
XML | Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines. | recommended | |
XML | Metadata: Comprehensive structured information including descriptive, structural and administrative metadata. See the for further hints. | acceptable | |
XSD | Tool Support: Tool-related formats required for specific functionality of the tool or reliable reuse of resources (e.g. tagsets, annotation schemes, vocabularies, language models, parameter files, and other specifications or settings) | recommended | |
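
As the table shows, a format's level depends on the domain it is used in: plainText, for example, is recommended for documentation but discouraged for text annotation. The sketch below illustrates a lookup keyed on both format and domain. It copies only a handful of rows from the table, and the names `LEVELS` and `level_for` are hypothetical, not part of any CLARIN or Språkbanken Text API.

```python
from typing import Optional

# A few (format, domain) -> level entries copied from the table above.
# This is an illustrative excerpt, not the complete list.
LEVELS = {
    ("plainText", "Documentation"): "recommended",
    ("plainText", "Text Annotation"): "discouraged",
    ("plainText", "Textual Source Language Data"): "acceptable",
    ("JSON", "Metadata"): "acceptable",
    ("TEI", "Text Annotation"): "recommended",
    ("MP3", "Audiovisual Source Language Data"): "discouraged",
}


def level_for(fmt: str, domain: str) -> Optional[str]:
    """Return the level for a format in a domain, or None if not in the excerpt."""
    return LEVELS.get((fmt, domain))


if __name__ == "__main__":
    for fmt, domain in [
        ("plainText", "Documentation"),
        ("plainText", "Text Annotation"),
        ("DOCX", "Geodata"),  # combination not in the excerpt
    ]:
        print(fmt, "|", domain, "|", level_for(fmt, domain))
```

Keying on the pair rather than on the format alone avoids treating a format as uniformly acceptable or discouraged across all domains.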