Penn Treebank (Phrase Structure Treebank)

Abbreviation: Penn Treebank

Scope: Syntactic annotation format

Standard body: Other

Description:

The Penn Treebank format is an annotation format for part-of-speech tagged and syntactically parsed corpora. It was developed at the University of Pennsylvania for the Penn Treebank. Syntactic structure is represented as a tree, realized by hierarchical “bracketing” of words and phrases. Because of this nesting principle, the analysis is essentially context-free, and non-contiguous structures and dependencies cannot be expressed by the bracketing itself.

Penn Treebank corpus files contain no metadata; they consist only of a series of annotated sentences. There is no formal specification of how tag sets and similar resources should be published. Bracketed structures can be arbitrarily complex, and one bracketed structure can nest within another. Each bracket is labeled with its syntactic category; every bracket has exactly one label, except the bracket that surrounds the entire sentence, which has no label. A phrase can contain any number of elements; its head element is not marked explicitly. The Penn Treebank uses dedicated symbols for different kinds of null elements.
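
The bracketing described above can be illustrated with a minimal sketch of a reader for one bracketed sentence. The sample sentence is invented for illustration and is not taken from the corpus; the function names (tokenize, parse, leaves) are hypothetical. The sketch assumes that literal parentheses never occur inside tokens, so splitting on brackets and whitespace is sufficient.

```python
def tokenize(text):
    """Split a bracketed string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse the bracket that opens at tokens[pos].

    Returns ((label, children), next_pos). The label is None for the
    unlabeled bracket that surrounds the entire sentence.
    """
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    pos += 1
    label = None
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]  # syntactic category or part-of-speech tag
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)  # nested bracket
        else:
            child = tokens[pos]  # a word
            pos += 1
        children.append(child)
    return (label, children), pos + 1


def leaves(node):
    """Collect (tag, word) pairs in sentence order."""
    label, children = node
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((label, child))
        else:
            pairs.extend(leaves(child))
    return pairs


# Illustrative sentence, not from the corpus.
SAMPLE = "( (S (NP (DT The) (NN dog)) (VP (VBD barked)) (. .)) )"

tree, _ = parse(tokenize(SAMPLE))
print(leaves(tree))
```

Each node is a (label, children) pair. The outermost node has the label None, which mirrors the unlabeled sentence bracket, and no node records which child is the head, since the format does not mark it.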

Treebank I bracketing was used until 1992. In 1994, guidelines for a refined bracketing format (Treebank II) were published. The goals were to provide some form of predicate-argument structure explicitly, to offer richer annotation, and to mark non-contiguous structures. If only the Penn Treebank file format is adopted and a project defines its own annotation standards, the only difference between formats I and II lies in the number of constituent labels and the number of tags allowed.

URL(s): http://www.cis.upenn.edu/~treebank/
