ANNOTATING DISCOURSE IN PRAGUE DEPENDENCY TREEBANK
PDTB WORKSHOP, PHILADELPHIA, APRIL 30, 2012
Lucie Poláková (Mladová), Charles University in Prague
PRAGUE DEPENDENCY TREEBANK & DISCOURSE
2006: Prague Dependency Treebank (PDT) 2.0 (Hajič et al.): multilayer annotations of 50,000 sentences of Czech journalistic texts:
- morphology
- surface syntax
- underlying syntax + sentence semantics (tectogrammatical tree structures)
+ other phenomena included: information structure, grammatically bound and textual pronominal coreference
2009-2012: annotation of discourse-related phenomena - Prof. Eva Hajičová
DISCOURSE-LEVEL PHENOMENA IN PDT
Manual annotations of the whole treebank: 49,431 sentences in 3,165 documents of 1-231 sentences, with an average length of 15.6 sentences
1. Explicit discourse connectives + their arguments (scopes), sense tags (= PDTB 2.0 annotation)
2. a) Extended textual coreference annotation
   b) Bridging relations
ALL DISCOURSE PHENOMENA ANNOTATED DIRECTLY ON THE SYNTACTIC (tectogrammatical) TREES!
Both projects were completed in 2011 and are currently undergoing checking procedures.
EXAMPLE OF ANNOTATION ON TREES
1. Searching for possible discourse connectives in plain text (unlike in Penn)
   V polovině července radní předložené rozhodnutí akceptovali. Dodnes však žádná smlouva podepsána nebyla.
   In mid-July, the councilors accepted the proposed decision. However, no contract has been signed so far.
2. Marking the discourse relation on the tectogrammatical trees:
In mid-July, the councilors accepted the proposed decision. However, no contract has been signed so far.
- linking the arguments of the relation with an arrow (automatically coupled with the choice of a semantic interpretation)
- adding the connective
- marking the extent of the arguments
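The three annotation steps above (an arrow carrying a sense, the connective, the extent of the arguments) can be pictured as one record per relation. The following is a minimal sketch, not the PDT data format: the class, its field names, the node identifiers, and the sense label `opp` chosen for this example are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DiscourseRelation:
    """One discourse arrow between two arguments (hypothetical schema)."""
    arg1_nodes: list[str]  # tectogrammatical nodes covering the first argument
    arg2_nodes: list[str]  # tectogrammatical nodes covering the second argument
    connective: list[str]  # surface connective token(s)
    sense: str             # semantic label chosen when the arrow is drawn


# The example sentence pair from the slide; node ids and sense are made up.
relation = DiscourseRelation(
    arg1_nodes=["s1:akceptovali"],
    arg2_nodes=["s2:podepsána"],
    connective=["však"],
    sense="opp",
)
print(relation.connective, relation.sense)
```

Keeping the arguments as lists of tree nodes rather than character spans mirrors the point of the slide: the relation is attached to the trees, not to the plain text.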
DISCOURSE REPRESENTATION COMPACT VIEW
SEMANTIC LABELS IN PRAGUE
TEMPORAL: asynchronous, synchronous
CONTINGENCY: reason-result, pragmatic reason-result, purpose, explication, condition, pragmatic condition
COMPARISON (CONTRAST): confrontation, opposition, pragmatic contrast, restrictive opposition, concession, correction (replacement), gradation
EXPANSION: conjunction, instantiation, specification, equivalence, generalization, conjunctive alternative, disjunctive alternative
ADDITIONAL INFORMATION ANNOTATED
- List structures (quite a typical way of text composition)
- Headings
- Alternative lexical expressions of the connectives, for example:
  Z toho vyplývá (This implies) = tedy, takže (so, therefore)
  Důvodem je (The reason is that) = protože (because)
- Double-sense relations
- The so-called collections
Not annotated so far:
- Implicit connectives, their arguments and senses (in comparison to PDTB 2.0)
- Attribution (partly present in some features of the syntactic trees)
TREEBANK STATISTICS
Measured on train data = 9/10 of the treebank = 43,955 sentences

Relation   Intra-sentential   Inter-sentential   Total
conj                   4706               1259    5965
opp                    1179               1602    2781
reason                 1507                900    2407
cond                   1332                 15    1347
conc                    618                234     852
preced                  586                205     791
confr                   311                272     583
spec                    399                 99     498
purp                    459                  1     460
corr                    292                109     401
grad                    150                182     332
restr                   110                148     258
synchr                  210                 43     253
explicat                 75                116     191
disjalt                 174                 12     186
exempl                   21                106     127
equiv                    36                 56      92
gener                     7                 84      91
conjalt                  49                 16      65
f_opp                    23                 26      49
f_reason                  8                 26      34
f_cond                   14                  1      15
total                 12266               5512   17778
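The counts above can be checked and summarized with a few lines of code. This sketch only re-enters the numbers from the slide; the variable names are my own, and the "mostly inter-sentential" list is a derived observation, not a claim made on the slide.

```python
# (intra-sentential, inter-sentential) counts per relation, copied from the slide.
COUNTS = {
    "conj": (4706, 1259), "opp": (1179, 1602), "reason": (1507, 900),
    "cond": (1332, 15), "conc": (618, 234), "preced": (586, 205),
    "confr": (311, 272), "spec": (399, 99), "purp": (459, 1),
    "corr": (292, 109), "grad": (150, 182), "restr": (110, 148),
    "synchr": (210, 43), "explicat": (75, 116), "disjalt": (174, 12),
    "exempl": (21, 106), "equiv": (36, 56), "gener": (7, 84),
    "conjalt": (49, 16), "f_opp": (23, 26), "f_reason": (8, 26),
    "f_cond": (14, 1),
}

# Column totals, to compare against the "total" row of the table.
intra_total = sum(intra for intra, _ in COUNTS.values())
inter_total = sum(inter for _, inter in COUNTS.values())
grand_total = intra_total + inter_total

# Share of all explicit relations that cross a sentence boundary.
inter_share = inter_total / grand_total

# Relations annotated more often across sentences than within them.
mostly_inter = [rel for rel, (intra, inter) in COUNTS.items() if inter > intra]

print(intra_total, inter_total, grand_total)
print(f"inter-sentential share: {inter_share:.1%}")
print(mostly_inter)
```

The column sums reproduce the totals row (12266, 5512, 17778), so the table is internally consistent.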
INTERANNOTATOR AGREEMENT MEASUREMENT
- treebank: 10 sections; in each of them, a sample annotated by all annotators for the IAA measurement (2,084 sentences)
- following table: inter-sentential relations only (agreement is assumed to improve once intra-sentential relations are included)
- connective-based measurement, one pair of annotators
- average agreement on semantic types: 0.77 (similar measure in PDTB: 0.8; Prasad et al., LREC 2008)
- agreement at the higher level (4 basic semantic classes): 0.89 (Penn: 0.94)
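Agreement is higher on the 4 basic classes because two different fine-grained types can still fall into the same class. The sketch below illustrates this with simple percentage agreement. The annotator labels are invented toy data, and the assignment of types to classes is one reading of the semantic-labels slide that should be checked against the annotation manual.

```python
# Fine-grained types grouped under the 4 basic classes (assumed grouping).
CLASSES = {
    "TEMPORAL": ["asynchronous", "synchronous"],
    "CONTINGENCY": ["reason-result", "pragmatic reason-result", "purpose",
                    "explication", "condition", "pragmatic condition"],
    "COMPARISON": ["confrontation", "opposition", "pragmatic contrast",
                   "restrictive opposition", "concession", "correction",
                   "gradation"],
    "EXPANSION": ["conjunction", "instantiation", "specification",
                  "equivalence", "generalization", "conjunctive alternative",
                  "disjunctive alternative"],
}
COARSE = {fine: coarse for coarse, fines in CLASSES.items() for fine in fines}


def agreement(labels_a, labels_b):
    """Share of relations on which both annotators chose the same label."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Hypothetical labels for four relations both annotators recognized.
ann_a = ["opposition", "conjunction", "reason-result", "concession"]
ann_b = ["opposition", "conjunction", "explication", "opposition"]

fine_agreement = agreement(ann_a, ann_b)
coarse_agreement = agreement([COARSE[x] for x in ann_a],
                             [COARSE[x] for x in ann_b])
print(fine_agreement, coarse_agreement)
```

In this toy sample the annotators disagree on two fine-grained types, but both disagreements stay inside one class, so the coarse agreement is higher than the fine-grained one.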
IAA IN THE SUBSEQUENT MEASUREMENTS

Measurement   Connective-based F1   Agreement on sem. types   Kappa on sem. types
train-2                      0.83                      0.69                  0.57
train-3                      0.79                      0.80                  0.75
train-4                      0.80                      0.75                  0.69
train-5                      0.85                      0.76                  0.71
train-6                      0.84                      0.77                  0.68
train-7                      0.79                      0.67                  0.61
train-8                      0.86                      0.84                  0.79
dtest                        0.85                      0.73                  0.67
etest                        0.83                      0.72                  0.68
train-1                      0.84                      0.91                  0.88
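For readers unfamiliar with the measures in the table, here is a minimal sketch of how such numbers could be computed. It assumes that the connective-based F1 compares the sets of connectives each annotator marked and that the kappa is Cohen's kappa. The slides do not spell out either definition, so both are assumptions, and the data are invented.

```python
from collections import Counter


def f1_on_connectives(marked_a, marked_b):
    """F1 between the sets of connective positions marked by two annotators."""
    a, b = set(marked_a), set(marked_b)
    shared = len(a & b)
    if shared == 0:
        return 0.0
    precision = shared / len(b)  # annotator A treated as the reference
    recall = shared / len(a)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement on labels for the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labeled at random with
    # their own label distribution.
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical token positions of connectives marked by each annotator.
f1 = f1_on_connectives({3, 10, 17, 25}, {3, 10, 17, 31})

# Hypothetical sense labels on relations both annotators recognized.
kappa = cohen_kappa(["opp", "opp", "conj", "conj"],
                    ["opp", "opp", "conj", "opp"])
print(f1, kappa)
```

Kappa is lower than raw agreement whenever some agreement is expected by chance, which is consistent with the third column of the table always lying below the second.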
WHAT HAVE WE LEARNED? Linguistically:
- the absolute number of explicit inter-sentential relations is low (it increases once "syntactic" discourse edges are incorporated)
- coreference annotation (entity-based, EntRel) on the same data HELPS a lot
- the relations differ in nature (condition: syntax-bound; specification: text-bound)
- some connectives in Czech are very ambiguous (ale = but)
- some Czech connectives have no exact English counterpart (totiž: an additional semantic category?)
- some findings correspond with the Penn numbers, so discourse structure is to some extent language-independent
- we need genre distinction
WHAT HAVE WE LEARNED? Technically:
- trees with resolved syntactic structure help a lot (ellipsis, easy extraction of intra-sentential relations, accessible coreference)
- larger arguments and the complexity of the trees are sometimes difficult to deal with
- the representation offers a lot of data with all the various types of linguistic information AT ONCE
RECENT & NEXT
Completed:
- the annotation of explicit connectives
- annotation manual in Czech and in English
- webpage with a browser for a sample of the data
- integration of the two projects into ONE layer with coreference and bridging annotations
In progress:
- extracting relevant tectogrammatical information (intra-sentential discourse relations)
- release of the discourse and coreference data as PDT 2.5
Next:
- altlexes
- genre distinction of the texts
- interplays (information structure, Play Coref, the phenomenon of contrast...)
- automatic experiments
Thank you! polakova@ufal.mff.cuni.cz http://ufal.mff.cuni.cz/discourse This work was supported by the Grant Agency of the Czech Republic (P406/12/0658, P406/2010/0875 ) and by the Czech Ministry of Education (ME 10018).