Centre: IDS

Leibniz-Institut für Deutsche Sprache

Abbreviation: IDS

Link: https://centres.clarin.eu/centre/11

Research infrastructure:

CLARIN (C-centre)
Text+ (Collections, Lexical Resources, Operations)

Curation:

Text formats: Harald Lüngen (July 24, 2023)

Audio visual formats: Siegwalt Lindenfelser (May 8, 2024)

Description:

The main points of contact for data depositors are, for written corpora, the project Ausbau und Pflege der Korpora geschriebener Gegenwartssprache for the Deutsches Referenzkorpus (DeReKo) and, for corpora of spoken language, the Archiv für Gesprochenes Deutsch with the Datenbank für Gesprochenes Deutsch (AGD with DGD), since these are the two major corpora or corpus collections of the IDS. The long-term archive and repository of the Leibniz-Institut für Deutsche Sprache will take over all data deposited there in the medium term and follows their format requirements for import.

The data deposition guidelines (Datenübernahmerichtlinien) of the Leibniz-Institut für Deutsche Sprache provide further information on depositing data.

Data are not necessarily stored in the formats in which they were delivered; where appropriate, they are converted into formats suitable for long-term archiving.

An overview of the landscape of formats, best practices and standards, geared towards linguistics, can be found in the Empfehlungen des DFG-Fachkollegiums 104 “Sprachwissenschaften” (recommendations of DFG review board 104, Linguistics).

Data that do not conform to the recommended formats generally require more curation effort. They can therefore be accepted only after prior consultation and an estimate of the effort involved, and only if the necessary capacity is available or is made available.

Plain text and XML are accepted only in a Unicode encoding, i.e. UTF-8 (a superset of ASCII) or, if necessary, UTF-16 or UTF-32.
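
The Unicode requirement above can be checked before deposition. The following is a minimal sketch, not an IDS tool: the function name `detect_unicode_encoding` is hypothetical, and the approach (look for a byte order mark, otherwise try strict UTF-8) is an assumption about what a reasonable pre-check might look like. UTF-16 or UTF-32 data without a byte order mark is not recognised by this sketch.

```python
import codecs
from typing import Optional

# Byte order marks, longest first: the UTF-32 LE mark begins with the
# same two bytes as the UTF-16 LE mark, so it has to be tested earlier.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_unicode_encoding(data: bytes) -> Optional[str]:
    """Return a Python codec name that decodes `data` strictly, or None.

    Hypothetical helper: None means the bytes are neither valid UTF-8
    nor BOM-marked UTF-16/UTF-32 (for example Latin-1 text with umlauts).
    """
    for bom, codec in _BOMS:
        if data.startswith(bom):
            try:
                data.decode(codec)
                return codec
            except UnicodeDecodeError:
                continue  # BOM-like prefix, but the rest does not decode
    try:
        data.decode("utf-8")  # pure ASCII also passes, as a subset of UTF-8
        return "utf-8"
    except UnicodeDecodeError:
        return None
```

In use, one might read a file in binary mode (`open(path, "rb").read()`) and treat a `None` result as a sign that the file needs re-encoding before submission.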

Functional domains:

Audiovisual Annotation: Annotations of audiovisual sources, usually including a basic rendering of the spoken content (transcription) and sometimes further annotation.
Audiovisual Source Language Data: Audio or video recordings providing spoken/multimodal or signed language data for research purposes.
Documentation: Unstructured documentation of the resource and its parts such as corpus or annotation guidelines.
Image Source Language Data: Digitized images of analogue sources of written language data for research purposes (e.g. facsimiles, scans of handwriting, photos of inscriptions).
Metadata: Comprehensive structured information including descriptive, structural and administrative metadata.
Text Annotation: Annotations of textual sources/written text, with the original text included or as stand-off.
Textual Source Language Data: Written unstructured/plain text or originally structured text (e.g. HTML) without linguistic or other mark-up added for research purposes.

Format recommendations:

Format	Domain	Level	Comments
AIFF	Audiovisual Source Language Data	acceptable
ALTO	Text Annotation	acceptable	Conversion to a suitable TEI-based format is expected, per Empfehlungen des DFG-Fachkollegiums 104 “Sprachwissenschaften” (Oct. 2019)
ANVIL	Audiovisual Annotation	acceptable
CHAT	Audiovisual Annotation	discouraged	Consider using TEISpoken instead.
CHAT-XML	Audiovisual Annotation	discouraged	Consider using TEISpoken instead.
CMDI	Metadata	recommended
Coma	Metadata	recommended	For transcriptions of speech. Coma is the EXMARaLDA Corpus-Manager.
CoNLL-U	Text Annotation	recommended
CoNLL-U	Textual Source Language Data	recommended
CSV	Metadata	acceptable
DC XML	Metadata	recommended
DGD-XML	Metadata	recommended
DOCX	Audiovisual Annotation	discouraged
DOCX	Metadata	discouraged
DTABf	Text Annotation	acceptable
EAF	Audiovisual Annotation	acceptable
EXB	Audiovisual Annotation	recommended
EXS	Audiovisual Annotation	recommended
F4	Audiovisual Annotation	discouraged
FLAC	Audiovisual Source Language Data	acceptable
FLN	Audiovisual Annotation	recommended
HTML	Textual Source Language Data	acceptable	Without JavaScript etc. and with generic markup
I5	Text Annotation	recommended	See the format description.
I5	Textual Source Language Data	recommended	See the format description.
JSON	Metadata	acceptable	Regular and structured; consider using JSON-LD with a schema
JSON	Text Annotation	acceptable	Regular and structured; consider using JSON-LD with a schema
JSON	Textual Source Language Data	acceptable	Regular and structured; consider using JSON-LD with a schema
KorAPXML	Metadata	recommended
KorAPXML	Text Annotation	recommended
KorAPXML	Textual Source Language Data	recommended
M2J	Audiovisual Source Language Data	acceptable
MP3	Audiovisual Source Language Data	discouraged	Lossy formats should be avoided if possible
MP4	Audiovisual Source Language Data	acceptable
MPEG-1	Audiovisual Source Language Data	acceptable
MPEG-2	Audiovisual Source Language Data	acceptable
MPEG-4 AVC	Audiovisual Source Language Data	recommended	25 fps, 1920×1080, constant bit rate
MPEG-4 AVC	Audiovisual Source Language Data	acceptable	50 fps, 4096×2160, constant bit rate
PAULA	Text Annotation	acceptable
PDF/A	Documentation	recommended
PDF/A	Image Source Language Data	recommended
plainText	Audiovisual Annotation	discouraged
plainText	Documentation	recommended
plainText	Text Annotation	discouraged
plainText	Textual Source Language Data	acceptable	Without mark-up
Praat	Audiovisual Annotation	acceptable
TEI	Text Annotation	recommended	With ODD or other schema
TEIHeader	Metadata	recommended
TEISpoken	Audiovisual Annotation	recommended	See the format description.
Transana	Audiovisual Annotation	discouraged
TRS	Audiovisual Annotation	acceptable
WAVE	Audiovisual Source Language Data	recommended	PCM-WAV, 48 kHz, 16 bit
WAVE	Audiovisual Source Language Data	acceptable	PCM-WAV with non-recommended parameters (not 48 kHz, 16 bit)
XML	Metadata	acceptable
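
The two WAVE rows in the table differ only in their audio parameters, so the distinction can be sketched with Python's standard `wave` module. This is an illustration under stated assumptions rather than an official IDS check: the function name `wave_level` is hypothetical, and the sketch only compares sample rate and bit depth against the values in the table.

```python
import wave


def wave_level(source) -> str:
    """Classify a PCM WAVE file against the table above.

    `source` is a file path or a binary file object. Returns
    "recommended" for 48 kHz / 16 bit and "acceptable" for any other
    PCM parameters. Hypothetical helper: the `wave` module raises
    wave.Error for files it cannot parse (e.g. compressed WAVE data),
    and that error is deliberately left to propagate.
    """
    with wave.open(source, "rb") as wav:
        sample_rate = wav.getframerate()      # samples per second
        bit_depth = wav.getsampwidth() * 8    # bytes per sample -> bits
    if (sample_rate, bit_depth) == (48000, 16):
        return "recommended"
    return "acceptable"
```

A CD-quality recording at 44.1 kHz and 16 bit would therefore be classed as acceptable, not recommended, because only the sample rate deviates from the table's parameters.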

Last update commit-id: 76e7d218
