The Penn Treebank format is an annotation format for part-of-speech tagged and syntactically parsed corpora. It was developed at the University of Pennsylvania for the Penn Treebank. The syntactic dependencies of a tree structure are encoded by hierarchical “bracketing” of words and phrases. Because of this ordering principle, the analysis is essentially context-free: each bracket covers a contiguous span of words, so non-contiguous structures and dependencies cannot be expressed by the bracketing itself.
Penn Treebank corpus files do not contain any metadata; they consist solely of a series of annotated sentences. There is no formal specification of how tag sets and similar information should be published. Bracketed structures can be arbitrarily complex, and one bracketed structure can nest within another. Brackets are labeled with their syntactic category. Every bracket has exactly one label, except the bracket that surrounds the entire sentence, which has none. A phrase can contain any number of elements; its head element is not marked explicitly. The Penn Treebank uses special symbols for different kinds of null elements.
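The bracketing described above can be read with a small recursive parser. The following Python code is only a minimal sketch, not an official reader: the function names `tokenize`, `parse` and `leaves`, the nested-list tree representation, and the example sentence are assumptions made for illustration. It assumes that brackets are written as `(` and `)`, that the label directly follows the opening bracket, and that words contain neither brackets nor whitespace.

```python
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Split bracketed text into '(', ')' and whitespace-free tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(text: str) -> list:
    """Parse one bracketed sentence into nested lists.

    Each node is [label, child, ...]. The label is "" when the bracket
    carries none, as for the bracket around the entire sentence.
    Unbalanced input raises IndexError or ValueError.
    """
    tokens = tokenize(text)
    pos = 0

    def node() -> list:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        # A label is present unless the next token is itself a bracket.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested bracket
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing ')'
        return [label] + children

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the sentence bracket")
    return tree


def leaves(tree: list) -> List[Tuple[str, str]]:
    """Return (tag, word) pairs of all tagged words, left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs = []
    for child in children:
        if isinstance(child, list):
            pairs.extend(leaves(child))
    return pairs


# Hypothetical example sentence in bracketed form.
sentence = "( (S (NP (DT The) (NN dog)) (VP (VBD barked)) (. .)) )"
tree = parse(sentence)
print(tree[0] == "")  # the outermost bracket has no label
print(leaves(tree))
```

Because the head of a phrase is not marked, a reader like this can recover only labels, nesting and word order; identifying heads would require additional rules that are not part of the file format.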
Treebank I bracketing was used until 1992. In 1994, guidelines for a refined bracketing format (Treebank II) were published. The aims were to provide some form of predicate-argument structure explicitly, to offer richer forms of annotation, and to mark non-contiguous structures. If only the Penn Treebank file format is adopted and combined with one's own annotation standards, the only difference between formats I and II lies in the number of constituent labels and the number of tags allowed.
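In Treebank II style annotation, predicate-argument structure and non-contiguous structures are commonly expressed through the labels themselves: a constituent label can carry hyphen-separated function tags and a numeric index (for example `NP-SBJ-1`), and a null element such as `*T*-1` refers back to the constituent with the same index. The following Python sketch splits such labels. It is an illustration under these assumptions only: the names `Label`, `split_label` and `trace_index` are hypothetical, other label conventions (such as `=` indices) are not handled, and corpora with their own annotation standards may use different labels.

```python
from typing import List, NamedTuple, Optional


class Label(NamedTuple):
    category: str          # e.g. "NP"
    functions: List[str]   # e.g. ["SBJ"]
    index: Optional[int]   # e.g. 1, or None if the label has no index


def split_label(label: str) -> Label:
    """Split a label such as 'NP-SBJ-1' into category, function tags, index."""
    # Labels that start with '-' (e.g. -NONE-, -LRB-) are kept whole.
    if label.startswith("-"):
        return Label(label, [], None)
    parts = label.split("-")
    index = None
    if len(parts) > 1 and parts[-1].isdigit():
        index = int(parts.pop())
    return Label(parts[0], parts[1:], index)


def trace_index(token: str) -> Optional[int]:
    """Return the index of a null element such as '*T*-1', else None."""
    head, sep, tail = token.rpartition("-")
    if sep and head.startswith("*") and tail.isdigit():
        return int(tail)
    return None


print(split_label("NP-SBJ-1"))
print(split_label("-NONE-"))
print(trace_index("*T*-1"))
```

Matching the index returned by `trace_index` against the index of a labeled constituent is one way to recover a dependency that the contiguous bracketing alone cannot express.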
- Beatrice Santorini
- M. Marcus et al., "Building a Large Annotated Corpus of English: The Penn Treebank," Comput. Linguist., vol. 19, pp. 313-330, 1993.
- Linguistic Data Consortium
- M. Marcus et al., "The Penn Treebank: Annotating Predicate Argument Structure", in Proceedings of the Workshop on Human Language Technology, 1994, pp. 114-119, Association for Computational Linguistics.