- CLARIN (C-centre)
- Text+ (Collections, Lexical Resources, Operations)
- Text formats: Harald Lüngen (July 24, 2023)
- Audiovisual formats: Siegwalt Lindenfelser (May 8, 2024)
The main points of contact for data depositors are, for written corpora, the project Ausbau und Pflege der Korpora geschriebener Gegenwartssprache for the Deutsches Referenzkorpus (DeReKo) and, for spoken-language corpora, the Archiv für Gesprochenes Deutsch with its Datenbank für Gesprochenes Deutsch (AGD with DGD), since these are the two major corpora or corpus collections of the IDS. The long-term archive and repository of the Leibniz Institute for the German Language will, in the medium term, take over all data deposited there, and it follows their format requirements for import.
The data deposit guidelines (Datenübernahmerichtlinien) of the Leibniz Institute for the German Language provide further information on depositing data.
Data are not necessarily stored in the formats in which they were delivered; where appropriate, they are converted into formats suitable for long-term archiving.
An overview of the landscape of formats, best practices and standards, written with linguistics in mind, can be found in the recommendations of the DFG-Fachkollegium 104 “Sprachwissenschaften” (DFG Review Board 104, Linguistics).
Data that do not conform to the recommended formats generally require more curation effort. They can therefore be accepted only after prior consultation and an estimate of the effort involved, and only if the necessary capacity is available or is provided.
Plain text and XML are accepted only in a Unicode encoding: UTF-8 (a superset of ASCII) or, if necessary, UTF-16 or UTF-32.
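Depositors may want to check the encoding of plain-text and XML files before submission. The following is a minimal sketch of such a check, not part of any IDS tooling: the helper name `detect_unicode_encoding` is hypothetical, and the approach (look for a byte order mark, otherwise try strict UTF-8 decoding) is an assumption about what a reasonable pre-check could look like.

```python
import codecs
from typing import Optional

# Byte order marks, with UTF-32 listed before UTF-16 because the
# UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_unicode_encoding(data: bytes) -> Optional[str]:
    """Return a Python codec name if `data` decodes cleanly as
    UTF-8, UTF-16 or UTF-32, otherwise None (hypothetical helper)."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            try:
                data.decode(codec)  # strict decoding; the codec reads the BOM
            except UnicodeDecodeError:
                return None
            return codec
    # No BOM: accept only content that is valid UTF-8 (this includes ASCII).
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"
```

The sketch does not recognise UTF-16 or UTF-32 files that lack a byte order mark, and a file that merely happens to decode as UTF-8 is not guaranteed to have been written as UTF-8. It is a first filter, not a substitute for consulting the centre.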
- Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.
- Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes.
- Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.
- Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions).
- Metadata: Comprehensive structured information including descriptive, structural and administrative metadata.
- Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off.
- Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.

Format | Domain | Level | Comments |
---|---|---|---|
AIFF | Audiovisual Source Language Data | acceptable | |
ALTO | Text Annotation | acceptable | |
ANVIL | Audiovisual Annotation | acceptable | |
CHAT | Audiovisual Annotation | discouraged | |
CHAT-XML | Audiovisual Annotation | discouraged | |
CMDI | Metadata | recommended | |
Coma | Metadata | recommended | |
CoNLL-U | Text Annotation | recommended | |
CoNLL-U | Textual Source Language Data | recommended | |
CSV | Metadata | acceptable | |
DC XML | Metadata | recommended | |
DGD-XML | Metadata | recommended | |
DOCX | Audiovisual Annotation | discouraged | |
DOCX | Metadata | discouraged | |
DTABf | Text Annotation | acceptable | |
EAF | Audiovisual Annotation | acceptable | |
EXB | Audiovisual Annotation | recommended | |
EXS | Audiovisual Annotation | recommended | |
F4 | Audiovisual Annotation | discouraged | |
FLAC | Audiovisual Source Language Data | acceptable | |
FLN | Audiovisual Annotation | recommended | |
HTML | Textual Source Language Data | acceptable | |
I5 | Text Annotation | recommended | |
I5 | Textual Source Language Data | recommended | |
JSON | Metadata | acceptable | |
JSON | Text Annotation | acceptable | |
JSON | Textual Source Language Data | acceptable | |
KorAPXML | Metadata | recommended | |
KorAPXML | Text Annotation | recommended | |
KorAPXML | Textual Source Language Data | recommended | |
M2J | Audiovisual Source Language Data | acceptable | |
MP3 | Audiovisual Source Language Data | discouraged | |
MP4 | Audiovisual Source Language Data | acceptable | |
MPEG-1 | Audiovisual Source Language Data | acceptable | |
MPEG-2 | Audiovisual Source Language Data | acceptable | |
MPEG-4 AVC | Audiovisual Source Language Data | recommended | |
MPEG-4 AVC | Audiovisual Source Language Data | acceptable | |
PAULA | Text Annotation | acceptable | |
PDF/A | Documentation | recommended | |
PDF/A | Image Source Language Data | recommended | |
plainText | Audiovisual Annotation | discouraged | |
plainText | Documentation | recommended | |
plainText | Text Annotation | discouraged | |
plainText | Textual Source Language Data | acceptable | |
Praat | Audiovisual Annotation | acceptable | |
TEI | Text Annotation | recommended | |
TEIHeader | Metadata | recommended | |
TEISpoken | Audiovisual Annotation | recommended | |
Transana | Audiovisual Annotation | discouraged | |
TRS | Audiovisual Annotation | acceptable | |
WAVE | Audiovisual Source Language Data | recommended | |
WAVE | Audiovisual Source Language Data | acceptable | |
XML | Metadata | acceptable | |
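
To show how the table might be used when preparing a deposit, here is a minimal sketch that groups files by recommendation level. Everything in it is illustrative: `LEVELS` is only an excerpt of the table, limited to rows that appear once per format and domain, and the function `triage`, the label `"unlisted"` and the file names are hypothetical.

```python
from collections import defaultdict

# Excerpt of the table above, keyed by (format, domain). Not the full list.
LEVELS = {
    ("FLAC", "Audiovisual Source Language Data"): "acceptable",
    ("MP3", "Audiovisual Source Language Data"): "discouraged",
    ("EXB", "Audiovisual Annotation"): "recommended",
    ("TEI", "Text Annotation"): "recommended",
    ("plainText", "Text Annotation"): "discouraged",
    ("plainText", "Documentation"): "recommended",
    ("CMDI", "Metadata"): "recommended",
    ("DOCX", "Metadata"): "discouraged",
}


def triage(items):
    """Group (filename, format, domain) triples by recommendation level.

    Combinations missing from LEVELS are collected under "unlisted",
    a label invented for this sketch.
    """
    groups = defaultdict(list)
    for filename, fmt, domain in items:
        level = LEVELS.get((fmt, domain), "unlisted")
        groups[level].append(filename)
    return dict(groups)


# Hypothetical deposit used only to illustrate the call.
deposit = [
    ("interview01.flac", "FLAC", "Audiovisual Source Language Data"),
    ("interview01.exb", "EXB", "Audiovisual Annotation"),
    ("notes.docx", "DOCX", "Metadata"),
    ("scan.tiff", "TIFF", "Image Source Language Data"),
]
by_level = triage(deposit)
```

The lookup is keyed by format and domain together because the same format can carry different levels in different domains, as plainText does. Which levels call for prior consultation with the centre is a policy question that the sketch deliberately leaves open.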