Språkbanken
Abbreviation: Sprakbanken
Link: https://centres.clarin.eu/centre/37
Research infrastructure:
  • CLARIN (C-centre)
Curation:
Description:

The following measures are taken to enhance the chance of future interpretability of the data.

The set of accepted file formats is small and well documented, which makes future conversions to other formats more feasible. As far as possible, open (non-proprietary) file formats are used. For textual resources, XML formats are used whenever possible, so that the files can still be interpreted even if the tool that created them no longer exists. The Språkbanken Text repository recommends using the formats listed in the CLARIN Standards Information System.

Språkbanken Text's participation in relevant networks such as CLARIN keeps it informed about recent developments in file formats and encodings. Plans to migrate or convert files will be developed if new standards arise; all relevant features of the old formats will be preserved using reliable procedures.

For more information, see the full Språkbanken Text repository description.

The Språkbanken Text CLARIN repository gives further information on policies.

Data will not necessarily be stored in the delivered formats. They might be converted to more appropriate formats for long-term preservation.

Data that are not delivered in the accepted formats usually require more curation effort. On consultation, you can receive an estimate of when capacity will be available for depositing data in such formats.

Plain text and XML files will normally be accepted only in a Unicode character encoding such as UTF-8.
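
Before depositing, it may help to confirm that a text or XML file really is valid UTF-8. The following is a minimal sketch, not a tool provided by Språkbanken Text; the function names `is_valid_utf8` and `file_is_utf8` are hypothetical, and the check covers UTF-8 only, not other Unicode encodings such as UTF-16.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path: str) -> bool:
    """Read a file as raw bytes and test whether it is valid UTF-8."""
    return is_valid_utf8(Path(path).read_bytes())


if __name__ == "__main__":
    # The same word encoded two ways: only the first is valid UTF-8.
    print(is_valid_utf8("Språkbanken".encode("utf-8")))
    print(is_valid_utf8("Språkbanken".encode("latin-1")))
```

A file that fails this check would probably need to be re-encoded before deposit; a file that passes is valid UTF-8 but could still contain mis-converted characters, so a visual spot check remains useful.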

As a general guideline, Språkbanken Text believes that the file formats best suited for long-term sustainability and accessibility:

  • Are frequently used
  • Have open specifications
  • Are independent of specific software, developers or vendors
Functional domains:
  • Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.
  • Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes.
  • Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.
  • Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions).
  • Metadata: Comprehensive structured information including descriptive, structural and administrative metadata.
  • Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off.
  • Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.
  • Geodata: Information on geographic locations.
  • Tool Support: Tool-related formats required for specific functionality of the tool or reliable reuse of resources (e.g. tagsets, annotation schemes, vocabularies, language models, parameter files, and other specifications or settings).
Format recommendations:
Format | Domain | Level | Comments
AIFF | Audiovisual Source Language Data | acceptable |
ALTO | Text Annotation | acceptable | Conversion to a suitable TEI-based format is expected.
CHAT | Audiovisual Annotation | discouraged | Consider using TEISpoken instead.
CHAT-XML | Audiovisual Annotation | discouraged | Consider using TEISpoken instead.
CMDI | Metadata | recommended |
CoNLL-U | Text Annotation | recommended |
CoNLL-U | Textual Source Language Data | recommended |
CSV | Metadata | acceptable |
DC XML | Metadata | recommended |
DOCX | Audiovisual Annotation | discouraged |
DOCX | Metadata | discouraged |
EAF | Audiovisual Annotation | acceptable |
FLAC | Audiovisual Source Language Data | acceptable |
GeoJSON | Geodata | recommended |
GeoTIFF | Geodata | recommended |
GML | Geodata | recommended |
HTML | Textual Source Language Data | acceptable | without js etc. and with generic markup
JSON | Metadata | acceptable | regular and structured; consider using JSONLD with a schema
JSON | Text Annotation | acceptable | regular and structured; consider using JSONLD with a schema
JSON | Textual Source Language Data | acceptable | regular and structured; consider using JSONLD with a schema
KML | Geodata | acceptable |
M2J | Audiovisual Source Language Data | acceptable |
Markdown | Documentation | acceptable |
MP3 | Audiovisual Source Language Data | discouraged | lossy formats should be avoided if possible
MP4 | Audiovisual Source Language Data | acceptable |
MPEG-1 | Audiovisual Source Language Data | acceptable |
MPEG-2 | Audiovisual Source Language Data | acceptable |
MPEG-4 AVC | Audiovisual Source Language Data | recommended | 25 fps, 1920×1080, constant bit rate
ODT | Documentation | recommended |
PDF | Documentation | acceptable |
PDF/A | Image Source Language Data | recommended |
PDF/A-1 | Documentation | recommended |
PDF/A-2 | Documentation | recommended |
PDF/A-3 | Documentation | acceptable |
plainText | Audiovisual Annotation | discouraged |
plainText | Documentation | recommended |
plainText | Text Annotation | discouraged |
plainText | Textual Source Language Data | acceptable | without markup
Praat | Audiovisual Annotation | acceptable |
QGIS.qgs | Geodata | acceptable |
Shapefile | Geodata | acceptable |
SVG | Image Source Language Data | recommended |
TEI | Documentation | recommended |
TEI | Text Annotation | recommended | with ODD or other schema
TEIHeader | Metadata | recommended |
TEISpoken | Audiovisual Annotation | recommended | See format description.
TIFF | Image Source Language Data | recommended |
WAVE | Audiovisual Source Language Data | recommended | PCM-WAV, 48 kHz, 16 bit
WAVE | Audiovisual Source Language Data | acceptable | PCM-WAV with non-recommended parameters (not 48 kHz, 16 bit)
XML | Documentation | recommended |
XML | Metadata | acceptable |
XSD | Tool Support | recommended |
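
The two WAVE entries above differ only in sampling rate and bit depth, so the distinction can be checked mechanically. The sketch below is an illustration using Python's standard `wave` module, not a validator used by Språkbanken Text; the names `wave_level` and `make_silent_wav` are hypothetical. It assumes the input is a PCM WAV file that the `wave` module can read and does not inspect the number of channels.

```python
import io
import wave

RECOMMENDED_RATE_HZ = 48_000
RECOMMENDED_SAMPLE_WIDTH_BYTES = 2  # 16 bit


def wave_level(source) -> str:
    """Classify a PCM WAV file (path or binary file object) by the table's criteria."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
    if rate == RECOMMENDED_RATE_HZ and width == RECOMMENDED_SAMPLE_WIDTH_BYTES:
        return "recommended"
    return "acceptable"


def make_silent_wav(rate_hz: int, sample_width_bytes: int) -> bytes:
    """Build a tiny mono WAV file in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * sample_width_bytes * 10)
    return buffer.getvalue()


if __name__ == "__main__":
    print(wave_level(io.BytesIO(make_silent_wav(48_000, 2))))
    print(wave_level(io.BytesIO(make_silent_wav(44_100, 2))))
```

In practice `wave_level` would be called with the path of the recording to be deposited; the in-memory helper exists only so the example runs without an audio file on disk.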
Last update commit-id: 38a46ffa