Penn Treebank (Phrase Structure Treebank)
Abbreviation: Penn Treebank
Scope: Syntactic annotation format
Topic: Syntactic Annotation
Standard body: Other

The Penn Treebank format is an annotation format for part-of-speech-tagged and syntactically parsed corpora. It was developed at the University of Pennsylvania for the Penn Treebank. Syntactic tree structures are encoded by hierarchical “bracketing” of words and phrases. Because every bracketed constituent must be a contiguous, properly nested span, the analysis is essentially context-free, and non-contiguous structures and dependencies cannot be expressed by the bracketing itself.
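
To illustrate the hierarchical bracketing described above, here is a minimal Python sketch that reads one bracketed sentence into nested lists. The function names `tokenize` and `parse` and the example sentence are illustrative assumptions, not part of any official Penn Treebank tooling, and the sketch ignores details of real corpus files such as escaped brackets or comments.

```python
import re


def tokenize(text):
    # Split the input into "(", ")", and runs of other non-space characters.
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens):
    """Parse one bracketed structure into nested lists: [label, child, ...]."""
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected an opening bracket")
        pos += 1  # consume "("
        node = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())  # nested constituent
            else:
                node.append(tokens[pos])  # label or word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume ")"
        return node

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the closing bracket")
    return tree


# Hypothetical example sentence, invented for illustration.
example = "(S (NP (DT The) (NN cat)) (VP (VBD sat)))"
tree = parse(tokenize(example))
print(tree)
```

Because each constituent is simply a bracket inside another bracket, a plain recursive reader is enough; nothing in the format itself lets a constituent cover two separate stretches of the sentence.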

Penn Treebank corpus files contain no metadata; they consist solely of a series of annotated sentences. There is no formal specification of how tag sets and similar resources should be published. Bracketed structures can be arbitrarily complex, and one bracketed structure can nest within another. Each bracket is labeled with its syntactic category and has exactly one label; the only exception is the bracket that surrounds the entire sentence, which has no label. A phrase can contain an unlimited number of elements, and its head element is not marked explicitly. Penn Treebank uses special symbols for different kinds of null elements.
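
The following Python sketch shows how the unlabeled outer bracket and null elements might be handled when extracting the words of a sentence. It assumes the tree has already been read into nested lists of the form `[label, child, ...]`, and that null elements carry the tag `-NONE-`, as in the commonly distributed Penn Treebank releases. The helper name `leaves` and the hand-built example tree are illustrative assumptions.

```python
def leaves(node):
    """Yield (tag, word) pairs from a nested-list tree, left to right."""
    # A preterminal is a two-element list of strings: [tag, word].
    if len(node) == 2 and all(isinstance(item, str) for item in node):
        yield (node[0], node[1])
        return
    for child in node:
        # Plain strings at this level are constituent labels; skip them.
        if isinstance(child, list):
            yield from leaves(child)


# Hand-built example (invented for illustration). The outermost list has no
# label, mirroring the unlabeled bracket around the whole sentence.
tree = [
    ["S",
     ["NP", ["-NONE-", "*"]],  # null element (assumed tag: -NONE-)
     ["VP", ["TO", "to"], ["VP", ["VB", "leave"]]]]
]

all_leaves = list(leaves(tree))
# Drop null elements to recover only the words that appear in the text.
words = [word for tag, word in all_leaves if tag != "-NONE-"]
print(all_leaves)
print(words)
```

Since heads are not marked in the format, a consumer that needs them has to apply its own head-finding rules on top of a structure like this one.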

Treebank I bracketing was used until 1992. In 1994, guidelines for a refined bracketing format were published. The goals were to provide some form of predicate-argument structure explicitly, to offer richer forms of annotation, and to mark non-contiguous structures. If only the Penn Treebank file format is adopted and a project defines its own annotation standards, the only difference between formats I and II lies in the number of constituent labels and the number of tags allowed.

Abbreviation: Treebank I bracketing
Release Date: 1991
  1. Beatrice Santorini
Recommended Reading:
  • M. Marcus et al., "Building a Large Annotated Corpus of English: The Penn Treebank," Comput. Linguist., vol. 19, pp. 313-330, 1993.
Abbreviation: Treebank II bracketing
Release Date: 1995
  1. Linguistic Data Consortium
Recommended Reading:
  • M. Marcus et al., "The Penn Treebank: Annotating Predicate Argument Structure," in Proceedings of the Workshop on Human Language Technology, Association for Computational Linguistics, 1994, pp. 114-119.