Leibniz-Institut für Deutsche Sprache
Abbreviation: IDS
Link: https://centres.clarin.eu/centre/11
Research infrastructure:
  • CLARIN (C-centre)
  • Text+ (Collections, Lexical Resources, Operations)
Curation:
Description:

The main points of contact for data depositors are, for written corpora, the project Ausbau und Pflege der Korpora geschriebener Gegenwartssprache, which serves the Deutsches Referenzkorpus (DeReKo), and, for spoken-language corpora, the Archiv für Gesprochenes Deutsch with the Datenbank für Gesprochenes Deutsch (AGD with DGD), since these are the two large corpora or corpus collections of the IDS. The long-term archive and repository of the Leibniz-Institut für Deutsche Sprache will take over all data deposited there in the medium term and follows their format requirements for import.

The data deposit guidelines (Datenübernahmerichtlinien) of the Leibniz-Institut für Deutsche Sprache provide further information on depositing data.

Data are not necessarily stored in the formats in which they were delivered; where appropriate, they are converted into formats suitable for long-term archiving.

An overview of the landscape of formats, best practices and standards, geared towards linguistics, can be found in the recommendations of DFG review board 104 (Empfehlungen des DFG-Fachkollegiums 104 “Sprachwissenschaften”).

Data that do not conform to the recommended formats generally require more curation effort. They can therefore be accepted only after prior consultation and an estimate of the effort involved, provided that the necessary capacity is available or is made available.

Plain text and XML are accepted only in a Unicode encoding: UTF-8 (a superset of ASCII) or, if necessary, UTF-16 or UTF-32.
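
The Unicode requirement above can be checked before submission. The following is a minimal sketch, not the validation procedure used by the IDS: the function name detect_unicode_encoding is hypothetical, and it assumes that UTF-16 and UTF-32 files start with a byte order mark (BOM), because those encodings cannot be reliably recognised without one.

```python
import codecs

# BOM signatures, UTF-32 first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_unicode_encoding(data: bytes):
    """Return a Python codec name if data decodes as UTF-8/16/32, else None."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            try:
                data.decode(codec)
                return codec
            except UnicodeDecodeError:
                continue  # BOM matched by accident; try the next candidate
    try:
        # No usable BOM: accept the data only if it is valid UTF-8
        # (plain ASCII is valid UTF-8 as well).
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

A file read in binary mode can be passed directly, for example detect_unicode_encoding(open(path, "rb").read()). A return value of None suggests a legacy encoding such as Latin-1, which would need to be converted before deposit.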

Functional domains:
  • Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.
  • Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes.
  • Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.
  • Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions).
  • Metadata: Comprehensive structured information including descriptive, structural and administrative metadata.
  • Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off.
  • Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.
Format recommendations:
Format | Domain | Level | Comments
AIFF | Audiovisual Source Language Data | acceptable |
ALTO | Text Annotation | acceptable | Conversion to a suitable TEI-based format is expected, per Empfehlungen des DFG-Fachkollegiums 104 “Sprachwissenschaften” (Oct. 2019)
ANVIL | Audiovisual Annotation | acceptable |
CHAT | Audiovisual Annotation | discouraged | Consider using TEISpoken instead.
CHAT-XML | Audiovisual Annotation | discouraged | Consider using TEISpoken instead.
CMDI | Metadata | recommended |
Coma | Metadata | recommended | For transcriptions of speech. Coma is the EXMARaLDA Corpus-Manager.
CoNLL-U | Text Annotation | recommended |
CoNLL-U | Textual Source Language Data | recommended |
CSV | Metadata | acceptable |
DC XML | Metadata | recommended |
DGD-XML | Metadata | recommended |
DOCX | Audiovisual Annotation | discouraged |
DOCX | Metadata | discouraged |
DTABf | Text Annotation | acceptable |
EAF | Audiovisual Annotation | acceptable |
EXB | Audiovisual Annotation | recommended |
EXS | Audiovisual Annotation | recommended |
F4 | Audiovisual Annotation | discouraged |
FLAC | Audiovisual Source Language Data | acceptable |
FLN | Audiovisual Annotation | recommended |
HTML | Textual Source Language Data | acceptable | Without JavaScript etc. and with generic markup
I5 | Text Annotation | recommended | See the format description.
I5 | Textual Source Language Data | recommended | See the format description.
JSON | Metadata | acceptable | Regular and structured; consider using JSON-LD with a schema
JSON | Text Annotation | acceptable | Regular and structured; consider using JSON-LD with a schema
JSON | Textual Source Language Data | acceptable | Regular and structured; consider using JSON-LD with a schema
KorAPXML | Metadata | recommended |
KorAPXML | Text Annotation | recommended |
KorAPXML | Textual Source Language Data | recommended |
M2J | Audiovisual Source Language Data | acceptable |
MP3 | Audiovisual Source Language Data | discouraged | Lossy formats should be avoided if possible
MP4 | Audiovisual Source Language Data | acceptable |
MPEG-1 | Audiovisual Source Language Data | acceptable |
MPEG-2 | Audiovisual Source Language Data | acceptable |
MPEG-4 AVC | Audiovisual Source Language Data | recommended | 25 fps, 1920×1080, constant bit rate
MPEG-4 AVC | Audiovisual Source Language Data | acceptable | 50 fps, 4096×2160, constant bit rate
PAULA | Text Annotation | acceptable |
PDF/A | Documentation | recommended |
PDF/A | Image Source Language Data | recommended |
plainText | Audiovisual Annotation | discouraged |
plainText | Documentation | recommended |
plainText | Text Annotation | discouraged |
plainText | Textual Source Language Data | acceptable | Without markup
Praat | Audiovisual Annotation | acceptable |
TEI | Text Annotation | recommended | With ODD or other schema
TEIHeader | Metadata | recommended |
TEISpoken | Audiovisual Annotation | recommended | See the format description.
Transana | Audiovisual Annotation | discouraged |
TRS | Audiovisual Annotation | acceptable |
WAVE | Audiovisual Source Language Data | recommended | PCM-WAV, 48 kHz, 16 bit
WAVE | Audiovisual Source Language Data | acceptable | PCM-WAV with non-recommended parameters (not 48 kHz, 16 bit)
XML | Metadata | acceptable |
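
The two WAVE rows differ only in sampling rate and bit depth, so the applicable level can be read from the file header. The sketch below is an illustration, not an official IDS tool: the names wav_level and make_silent_wav are hypothetical, and it relies on Python's standard wave module, which reads PCM WAV files and raises wave.Error for compressed variants.

```python
import io
import wave

RECOMMENDED_RATE_HZ = 48_000
RECOMMENDED_BITS = 16


def wav_level(source):
    """Classify a PCM WAV file (path or binary file object) by its parameters."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
    if (rate, bits) == (RECOMMENDED_RATE_HZ, RECOMMENDED_BITS):
        return "recommended"
    return "acceptable"


def make_silent_wav(rate, sample_width_bytes):
    """Build a tiny in-memory mono WAV, used here only for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * sample_width_bytes * 10)  # ten silent frames
    buffer.seek(0)
    return buffer


print(wav_level(make_silent_wav(48_000, 2)))
print(wav_level(make_silent_wav(44_100, 2)))
```

In practice wav_level would be called with the path of a recording. A result of "acceptable" means only that the file is PCM WAV with other parameters; whether it is accepted remains subject to the consultation described in the Description.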
Last update commit-id: 76e7d218