The Prague Dependency Treebank (and WS02)
Jan Hajič
Institute of Formal and Applied Linguistics
School of Computer Science
Faculty of Mathematics and Physics
Charles University, Prague, Czech Republic
This Talk: an Overview
- The Prague Dependency Treebank
  - The project
  - The 3 annotation layers:
    - morphology
    - surface syntax (also: the lab)
    - deep syntax
- Use of the deep representation:
  - Machine translation
  - Challenges for NL generation
5/7/2002 PreWS02 Summer School
The Prague Dependency Treebank Project (Czech Language Treebank)
- 1996-2004
- 1998: PDT v. 0.5 released (JHU workshop)
  - 400k words annotated, unchecked
- 2001: PDT 1.0 released (LDC)
  - 1.3M words annotated, morphology & surface syntax
- 2004: PDT 2.0 release planned
  - 1.0M words annotated, underlying (deep) syntax: the tectogrammatical layer
Annotation Layers
- Morphology
  - Tag (full morphology, 13 categories), lemma
- Analytical layer (surface syntax)
  - Dependency, analytical function
- Tectogrammatical layer (underlying syntax)
  - Dependency, functor (detailed), grammatemes, ellipsis resolution, coreference, topic/focus (deep word order)
Morphological Annotation
13 categories:
Category      # of values   Example(s)
POS           10            N (noun), Z (punctuation)
SUBPOS        75            P (personal pron.), U (possessive adj.)
GENDER        8             I (masc. inanimate), X (any), - (N/A)
NUMBER        4             P (plural), D (dual)
CASE          9             1 (nominative), 6 (locative)
POSSGENDER    4             M (masc. animate), F (feminine)
POSSNUMBER    3             S (singular), P (plural)
PERSON        5             1 (first),...
TENSE         4             P (present), M (past)
GRADE         5             3 (superlative)
NEGATION      3             A (affirmative), N (negative)
VOICE         3             A (active), P (passive)
VAR           11            1 (1st variant), 6 (colloq. style), 8 (abbrev.)
Layer 1: Morphology
- Tag: 13 categories
  - Example: AAFP3----3N---- ("(to) the most uninteresting")
    A  Adjective      -  no poss. gender    N  negated
    A  Regular        -  no poss. number    -  no voice
    F  Feminine       -  no person          -  reserve1
    P  Plural         -  no tense           -  reserve2
    3  Dative         3  superlative        -  base var.
- Lemma: unique identifier
  - Ex.: Books/verb -> book-1, went -> go, to/prep. -> to-1
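A positional tag like the one above can be unpacked mechanically. The following is a minimal sketch, assuming the 15 positions are the 13 categories in the order listed on the previous slide with the two reserved slots placed before VAR, as the example layout suggests. The names `POSITIONS`, `decode_tag` and `set_categories` are illustrative and not part of any PDT tool.

```python
# Position names: the 13 categories from the slide plus two reserved slots.
# The ordering is read off the example tag layout; treat it as an assumption.
POSITIONS = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]


def decode_tag(tag):
    """Map a 15-character positional tag to {category: value}."""
    if len(tag) != len(POSITIONS):
        raise ValueError(
            f"expected {len(POSITIONS)} characters, got {len(tag)}"
        )
    return dict(zip(POSITIONS, tag))


def set_categories(tag):
    """Keep only the categories that carry a value (anything but '-')."""
    return {name: value for name, value in decode_tag(tag).items()
            if value != "-"}


if __name__ == "__main__":
    # The adjective from the slide: feminine, plural, dative,
    # superlative, negated.
    print(set_categories("AAFP3----3N----"))
```

Keeping the tag as a fixed-width string and decoding on demand preserves the property that every position can be inspected or matched independently.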
Layer 2: Analytical syntax
- Surface, dependency-based representation
- Every word gets a node, plus one extra node (the root)
- Interested in:
  - dependency structure
  - analytical function:
    - Pred, Sb, Obj, Adv, Atr, Atv, Pnom;
    - AuxV, AuxP, AuxC,...;
    - Coord, Apos, parenthesis
    - ExD
Layer 2: Analytical syntax
Dependency + Analytical Function
[Figure: analytical dependency tree, with dependent and governor marked, for the sentence "The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated."]
Comparison: parse trees vs. dependency
Compare:
- Lexicalized parse tree: (S(walks) (NP(John) (NNP John)) (VP(walks) (VBZ walks)))
- Dependency tree: walks/VBZ (governor) -> John/NNP (dependent)
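The relation between the two representations can be made concrete: once every constituent knows which child carries its head, the dependency tree falls out by attaching the head word of each non-head child to the head word of its parent. This is a minimal sketch of that idea, assuming head children are already marked; the `Tree` class and its `head_index` field are illustrative names, not an existing API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Tree:
    label: str                          # nonterminal or POS tag
    word: Optional[str] = None          # set on preterminals only
    children: List["Tree"] = field(default_factory=list)
    head_index: int = 0                 # which child carries the head


def head_word(tree: Tree) -> str:
    """Follow head children down to the lexical head."""
    if not tree.children:
        return tree.word
    return head_word(tree.children[tree.head_index])


def dependencies(tree: Tree, deps=None) -> List[Tuple[str, str]]:
    """Collect (dependent, governor) pairs from a head-marked parse tree."""
    if deps is None:
        deps = []
    if tree.children:
        governor = head_word(tree)
        for i, child in enumerate(tree.children):
            if i != tree.head_index:
                deps.append((head_word(child), governor))
            dependencies(child, deps)
    return deps


# The example from the slide: S(walks) -> NP(John) VP(walks)
parse = Tree("S", head_index=1, children=[
    Tree("NP", children=[Tree("NNP", word="John")]),
    Tree("VP", children=[Tree("VBZ", word="walks")]),
])

if __name__ == "__main__":
    print(dependencies(parse))  # [('John', 'walks')]
```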
Layer 3: Tectogrammatical
- Underlying (deep) syntax
- 4 sublayers:
  - dependency structure, (detailed) functors
  - topic/focus and deep word order
  - coreference (mostly grammatical only)
  - all the rest (grammatemes):
    - detailed functors
    - underlying gender, number,...
Analytical vs. Tectogrammatical annotation
(TR: sublayer 1 only shown)
[Figure: analytical and tectogrammatical trees side by side. Callouts: underlying verb + tense; deep function; elided Actor in; another ellipsis...; prepositions out]
Dependency structure
- Similar to the surface (Analytical) layer...
- ...but:
  - certain nodes deleted
    - auxiliaries, non-autosemantic words, punctuation
  - some nodes added
    - based on word (mostly verb, noun) valency
    - some ellipsis resolution
  - detailed dependency relation labels (functors)
Tectogrammatical Functors
- Actants: ACT, PAT, EFF, ADDR, ORIG
  - cannot repeat in a clause, usually compulsory
- Free modifications (~ 50)
  - can repeat; optional, sometimes compulsory
  - Ex.: LOC, DIR1,...; TWHEN, TTILL,...; RESTR, DESC; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR
- Special
  - Coordination, Rhematizers, Foreign phrases,...
Tectogrammatical Example
- Analytical verb form:
  - (he) allowed would-be to-be enrolled
  - směl by být zapsán
- [Figure callouts: collapsed; additional attributes (grammatemes): conditional + allow]
Tectogrammatical Example
- Predicate with copula (state)
  - (the) pool has-been already filled
  - bazén byl již napuštěný
Tectogrammatical Example
- Passive construction (action)
  - (The) book has-been translated [by Mr. X]
  - Kniha byla přeložena
- [Figure callouts: disappeared; added]
Tectogrammatical Example
- Object
  - (he) gave him a-book
  - dal mu knihu
- Obj goes into ACT, PAT, ADDR, EFF or ORIG based on the governor's valency frame
Tectogrammatical Example
- Relative clause (embedded)
  - (a) house, which is expensive, (we) (to-ourselves) will-not-buy
  - dům, který je drahý, si nekoupíme
Tectogrammatical Example
- Incomplete phrases
  - Peter works well, but Paul badly
  - Petr pracuje dobře, ale Pavel špatně
- [Figure callout: added]
Deep word order, topic/focus
- Deep word order:
  - from old information to the new one (left-to-right) at every level (head included)
  - projectivity by definition
    - i.e., partial level-based order -> total d.w.o.
- Topic/focus/contrastive topic
  - attribute of every node
  - restricted by d.w.o. and other constraints
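The step from a partial, level-based order to a total order can be illustrated with a short sketch. It assumes each node stores its dependents already split into those ordered before the head and those ordered after it; because the tree is projective by definition, expanding every node in place yields the total order. The dictionary layout and the toy tree are assumptions made for illustration only.

```python
def linearize(node):
    """Expand a projective tree into a total left-to-right order.

    Each node is a dict with a "word" and optional "left" / "right" lists
    holding its dependents in their level-based order relative to the head.
    """
    words = []
    for dependent in node.get("left", []):
        words.extend(linearize(dependent))
    words.append(node["word"])
    for dependent in node.get("right", []):
        words.extend(linearize(dependent))
    return words


# Toy tree (hypothetical annotation, loosely based on the baker example).
tree = {
    "word": "bakes",
    "left": [{"word": "baker", "left": [{"word": "remaining"}]}],
    "right": [{"word": "rolls", "left": [{"word": "famous"}]}],
}

if __name__ == "__main__":
    print(linearize(tree))
    # ['remaining', 'baker', 'bakes', 'famous', 'rolls']
```

Only the order within each level has to be annotated or predicted; the total order is then fully determined.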
Deep word order, topic/focus
- Example:
  - Analytical dep. tree: Baker bakes rolls. vs. Baker IC bakes rolls.
Coreference
- Grammatical (vs. textual)
- Ex.: Peter moved to Iowa after he finished his PhD.
  move PRED
    Peter ACT
    Iowa DIR1
    finish TWHEN
      he ACT
      PhD PAT
        he APP
- NB: poster about Control, this morning
Grammatemes
- Syntactic (= detailed functors)
  - only for some functors:
    - WHEN: before/after
    - LOC: next-to, behind, in-front-of
- Lexical, underlying
  - number (SG/PL), tense, modality, degree of comparison
  - strictly only where necessary (agreement!)
The Valency Lexicon
- Valency frames
  - each verb (+ some nouns, adjectives) has slots for functor/form pairs:
    - give: ACT(Nom) PAT(Acc) ADDR(to+Dat)
- Basic set prepared in advance; annotators add entries on-the-go; a checking and approval process follows (consistency)
- Compare: Levin's Classes, Proposition Bank
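A valency frame of this kind is essentially a lookup from surface form to functor for a given governor. The sketch below shows that lookup under the assumption that a frame can be stored as a mapping from form to functor; only the `give` entry comes from the slide, and `VALENCY` and `assign_functor` are illustrative names rather than the actual lexicon format.

```python
# Hypothetical in-memory lexicon: verb -> {surface form: functor}.
# The single entry mirrors the slide: give: ACT(Nom) PAT(Acc) ADDR(to+Dat).
VALENCY = {
    "give": {"Nom": "ACT", "Acc": "PAT", "to+Dat": "ADDR"},
}


def assign_functor(verb, form):
    """Return the functor the verb's frame assigns to a surface form.

    Returns None when the verb or the form is not covered, which is the
    point where an annotator would have to add or extend an entry.
    """
    frame = VALENCY.get(verb)
    if frame is None:
        return None
    return frame.get(form)


if __name__ == "__main__":
    for form in ("Nom", "Acc", "to+Dat", "Instr"):
        print(form, "->", assign_functor("give", form))
```

Returning `None` instead of guessing keeps uncovered cases visible, which matches the described workflow of adding entries on-the-go and checking them afterwards.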
Tectogrammatical Annotation
- Manual annotation
  - 4 groups of annotators ~ 4 sublayers
- Special graphical tool (TrEd)
  - Customizable graphical tree editor
- Preprocessing
  - Data from Analytical Layer, preprocessed
  - Online dependency function preassignment
The [Manual] Annotation Tool
- Perl/PerlTk based, platform-independent
  - Linux, Windows 95/98/2000, Solaris,...
- Perl as the macro language
  - unlimited online processing capability
- Flexibility for interactive checking
  - split screen, graphical diff function
- Customization, printing, plugins
The TrEd Tree Editor
Graphical tool TrEd
[Screenshot: main screen. Original sentence: "This year's flu season is still quiet in Europe." Callouts: editing window customization; run a macro; multiwindow editing/compare]
Valency Lexicon in TrEd
[Screenshot: valency lexicon entry "to write sth (about sth)"]
What Is It Good For?
- Machine Translation
  - The tectogrammatical layer (TL) representation is closer to an interlingua than surface (analytical) syntax => less work in the transfer phase
  - more work in parsing and generation
  - ...but an advantage in multilingual MT applications
- Question answering
  - same representation for questions and answers
Machine Translation Architecture
Typical (structural) MT system:
source sentence -> Analysis (parsing) -> a parse -> Transfer -> Generation (synthesis) -> target sentence
Machine Translation Architecture
Tectogrammatical layer-based system:
- Source side: source sentence -> morphology (tagging) -> morphological layer -> parsing -> analytical layer -> (tectogrammatical) parsing -> tectogrammatical layer
- Transfer: at the tectogrammatical layer
- Target side: tectogrammatical layer -> generation -> analytical layer -> linearization -> morphological layer -> morph. synthesis -> target sentence
Comparison: analytical layer
Comparison: tectogrammatical layer
- The [Homestead's] only remaining baker bakes the most famous rolls to the north of Long River.
- al-xabaaz al- axiir al-baaqii [fii Homestead] yaśmacu ashhar al-kruasaanaat ilaa shimaal min Long River.
The Three Crucial Steps
- Analytical (surface) -> Tectogrammatical
  - additional parsing required
- Transfer
  - minimal effort: only true transformations needed (like swimming ~ schwimmen gern)
- Generation
  - back from Tectogrammatical representation to Analytical (surface syntax)
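The Devil's In...
The additional three steps: (tectogrammatical) parsing, Transfer, Generation
- Source side: source sentence -> morphology (tagging) -> morphological layer -> parsing -> analytical layer -> (tectogrammatical) parsing -> tectogrammatical layer
- Transfer: at the tectogrammatical layer
- Target side: tectogrammatical layer -> Generation -> analytical layer -> linearization (trivial) -> morphological layer -> morph. synthesis (easy) -> target sentence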
The Devil's In...
The additional three steps:
- Tectogrammatical parsing: source analytical layer -> tectogrammatical layer
- (Simple) transfer
- Generation: tectogrammatical layer -> target analytical layer
  - Deletions
  - Insertions: prepositions, conjunctions,...
  - Word order
  - Morphology
Components:...the Generation
- Deletions of nodes [rare if going into English]
- Insertions of nodes
  - prepositions, conjunctions, punctuation
  - splitting phrases/idioms/named entities
- Tree reorganization (numeric expressions)
- Surface word order (analytical tree: defined w.o.)
- Morphology (agreement, cases based on subcat)
Generation: Insertion of Prepositions
(governor -> dependent)
- tectogrammatical layer:
  - Czech: střed -> přitažlivost (APP.sg)
  - English: center -> gravity (APP.sg)
- analytical layer:
  - Czech: středu -> přitažlivosti.nfs2 (Atr)
  - English: center -> of (AuxP) -> gravity.nn (Atr)
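The English side of this example can be mimicked with a small tree rewrite: a dependent whose functor calls for a preposition gets a new AuxP node inserted between it and its governor. The sketch below is rule-based purely for illustration (the talk proposes statistical reconstruction), and the rule table, the dictionary layout and the function name are assumptions; only the APP -> "of" + Atr case comes from the slide.

```python
# Hypothetical rule table: functor -> (preposition to insert, afun of the
# dependent). Only the APP entry is taken from the slide.
PREPOSITION_RULES = {"APP": ("of", "Atr")}


def to_analytical(tnode, afun=None):
    """Rewrite a tectogrammatical node (dict) into an analytical one.

    Dependents whose functor has a rule are wrapped in an inserted AuxP
    node; other dependents are copied with an unknown (None) afun.
    """
    anode = {"word": tnode["word"], "afun": afun, "children": []}
    for child in tnode.get("children", []):
        rule = PREPOSITION_RULES.get(child.get("functor"))
        if rule is None:
            anode["children"].append(to_analytical(child))
        else:
            preposition, child_afun = rule
            anode["children"].append({
                "word": preposition,
                "afun": "AuxP",
                "children": [to_analytical(child, child_afun)],
            })
    return anode


# center -> gravity (APP), as on the slide
tecto = {"word": "center",
         "children": [{"word": "gravity", "functor": "APP"}]}
analytical = to_analytical(tecto)

if __name__ == "__main__":
    print(analytical)
```

For Czech the same functor surfaces as a genitive case ending rather than an inserted node, which is why the insertion step is language-specific.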
Generation: Surface word order
(governor -> dependents)
- tectogrammatical layer:
  - Czech: přijít.past -> včera (TWHEN), Petr (ACT)
  - English: come.past -> yesterday (TWHEN), Peter (ACT)
- analytical layer:
  - Czech: přijít.vb3sp -> včera (Adv), Petr (Sb)
  - English: come.vbd -> Peter (Sb), yesterday (Adv)
Generation: Complex Input
[Figure: complex input tree; label: English translation]
Generation: How-To (1)
- Statistical, (perhaps) in two steps
  - Analytical tree reconstruction
    - everything except word order
    - i.e., includes morphology (tag assignment)
  - Word Order
    - projective trees assumed here [1]
    - thus, it is sufficient to determine level-by-level word order
[1] Additional step required for non-projective constructions [can be avoided for English]
Generation: How-To (2)
- Reconstruction: two possible ways
  - transformation-based learning ([fn]tbl)
  - probabilistic, by a dependency tree model:
    - based on triplets <word,tag,afun> and dependency relation (governor,dependent)
    - ~ Collins bilexical model, Charniak parser model, Bangalore & Rambow
    - afun instead of nonterminals
Generation: How-To (3)
- Word order
  - language model for a single level in the tree:
    - <word,tag,afun> triples; includes head (no afun)
    - Ex.: come.vbd (head), Peter.NNP Sb, yesterday.adv Adv
  - non-projective constructions (and some more) by classic n-gram LM
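Ordering a single level amounts to scoring the permutations of a head and its dependents with a language model and keeping the best one. The following is a minimal sketch of that search, assuming a bigram model; for brevity it conditions only on the afun (with `HEAD` standing in for the head), not on the full <word,tag,afun> triple, and the log-probabilities in `BIGRAM_LOGPROB` are made-up toy values, not trained estimates.

```python
from itertools import permutations

# Toy bigram log-probabilities over afun labels (hypothetical values).
BIGRAM_LOGPROB = {
    ("<s>", "Sb"): -0.2,
    ("Sb", "HEAD"): -0.1,
    ("HEAD", "Adv"): -0.3,
    ("Adv", "</s>"): -0.1,
}
UNSEEN_LOGPROB = -5.0  # flat penalty for any bigram not in the table


def score(labels):
    """Sum bigram log-probabilities over a padded label sequence."""
    padded = ["<s>", *labels, "</s>"]
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN_LOGPROB)
               for pair in zip(padded, padded[1:]))


def best_order(level):
    """Pick the highest-scoring permutation of one tree level.

    `level` is a list of (word, label) pairs: the head plus its dependents.
    Exhaustive search is fine here because a single level is short.
    """
    best = max(permutations(level),
               key=lambda order: score([label for _, label in order]))
    return [word for word, _ in best]


level = [("come", "HEAD"), ("yesterday", "Adv"), ("Peter", "Sb")]

if __name__ == "__main__":
    print(best_order(level))  # ['Peter', 'come', 'yesterday']
```

Since projective trees are assumed, applying this search level by level is enough to fix the order of the whole sentence.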
Generation: How-To (4)
- Data
  - trained on WSJ:
    - converted to analytical dependency trees
      - adapted Jason Eisner's head assignment rules
      - added rules for heads of base NPs
      - added rules for analytical functions
    - rule-based parsing to tectogrammatical layer (for now; manual annotation will follow)
  - i.e., TR/AR data available (English)
Some pointers
- Current version of PDT: v1.0
  - morphology + analytical level
  - 1.3M words (train/dev test/eval test)
- http://ufal.mff.cuni.cz
  - Projects -> Treebank
- http://www.ldc.upenn.edu
  - LDC2001T10 (PDT v1.0)
- http://www.clsp.jhu.edu: Workshop 2002
  - Using TL for MT Generation