PDTB-style Discourse Annotation of Chinese Text Nianwen Xue, Yaqin Yang, Yuping Zhou Brandeis University PDTB Workshop University of Pennsylvania 4/30/2012
Questions
  How is Chinese discourse similar to or different from English discourse?
  Why is discourse annotation particularly important for Chinese?
  Can discourse relations be extracted from existing annotated resources, i.e., a treebank?
  Is PDTB-style discourse annotation feasible for Chinese, and what adaptations have to be made?
Similarities
  Explicit and implicit relations
  Explicit connectives: subordinating conjunctions, coordinating conjunctions, discourse adverbials
Subordinating conjunctions
  Gloss: /if /reform /measure /not /effective (么/then) /investor /then /have /possibility /BA /attention /turn to /emerging /market
  "If the reform measures are not effective and the confidence crisis still exists, then investors are likely to turn their attention to other emerging markets."
Coordinating conjunctions
  Gloss: /modern /parent /difficult /to /DE /area /be /not only /no way /eliminate /blood /in /traditional /DE /values /but also /need /face /new /DE /values
  "The difficulty of being modern parents lies in the fact that they cannot get rid of the traditional values flowing in their blood, and they also need to face new values."
Adverbial connectives
  Gloss: /Clinton /Administration /already /indicate /will /extend /China /DE /MFN /status /therefore /this /CL /lobby /DE /target /be /those /relatively /conservative /DE /congressmen
  "The Clinton Administration has already indicated that it will extend China's MFN status; therefore, the focus of the lobby this time is on those relatively conservative congressmen."
Differences in discourse connectives
  Characteristics of the connectives in Chinese:
    They are often optional
    They vary in their syntactic position
    There are many paired connectives, and the boundary between subordinating and coordinating conjunctions is less clear
    There are many different ways of expressing the same discourse relation
An example
  Gloss: /Taiwan businessmen /children /school (/although) /already /lay foundation /{but, however} /funding /insufficient /faculty /undecided
  "The foundation of the school for Taiwan businessmen has been laid, but the funding is insufficient and its faculty hasn't been decided."
Same discourse relation expressed with multiple discourse connectives
  [Table: columns Gloss, Part 1, Part 2. Glosses listed: although, because, if, even if, as long as, only if, therefore, for example. The Chinese connective forms in Part 1 and Part 2 did not survive extraction.]
Differences in punctuation marks
  Characteristics of the Chinese punctuation marks:
    The comma is a good indicator of the boundary of a discourse unit (an argument in the sense of PDTB)
      Where there is a discourse unit boundary, there is usually a comma
      But where there is a comma, there isn't always a discourse unit boundary
    The period is not a good indicator of sentence boundaries
      Periods, exclamation marks, and question marks always end sentences, but (arguably) not all sentences are ended by them
      Commas (arguably) sometimes serve that function as well
What is a sentence?
  Gloss: pay attention to this Nano 3 recently, (1) even visit a few computer stores in person, (2) comparatively speaking, (3) Zhuoyue's prices be relatively low, (4) and can also guarantee that be genuine, (5) therefore place the order.
  Translation: I have been paying attention to this Nano 3 recently, (1) and I even visited a few computer stores in person. (2) Comparatively speaking, (3) Zhuoyue's prices are relatively low, (4) and they can also guarantee that their products are genuine. (5) Therefore I placed the order.
Answers to Question 1
  Why is discourse annotation particularly important for Chinese?
    Discourse structure is needed to determine the sentence boundaries in Chinese (in addition to the usual purposes of discourse structure).
An attempt to automatically extract discourse relations
  Commas in Chinese are more reliable anchors of discourse relations than discourse connectives
    There isn't always a discourse connective, but there is always a punctuation mark (comma) at the boundary of a discourse segment
  A first approximation of discourse relation analysis can be modeled as comma classification
    Based on automatically extractable patterns around commas in the Chinese Treebank
Syntactic patterns
  [Tree diagrams; only the node labels around each comma survive.]
  a. SENTENCE BOUNDARY: IP , IP under IP-Root
  b. IP-COORDINATION: IP , IP under a non-root IP
  c. VP COORDINATION: VP , VP
  d. ADJUNCTION: CP/IP-CND , main clause
  e. COMPLEMENTATION: VV , IP
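The five patterns can be read as a lookup from the syntactic context of a comma to a label. The following is a minimal sketch of that idea, not the procedure used to build the resource: the `CommaContext` record, its field names, the accepted label strings, and the order of the rules are all assumptions made for illustration. The fallback label NON-RELATION is borrowed from the taxonomy slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommaContext:
    """Hypothetical summary of the treebank context around one comma."""
    parent: str  # label of the node that directly dominates the comma
    left: str    # label of the sibling immediately before the comma
    right: str   # label of the sibling immediately after the comma


def label_comma(ctx: CommaContext) -> str:
    """Map a comma's context to one of the five pattern labels."""
    if ctx.left == "IP" and ctx.right == "IP":
        # Clause , clause: at the root this is treated as a sentence boundary.
        if ctx.parent == "IP-Root":
            return "SENTENCE BOUNDARY"
        return "IP-COORDINATION"
    if ctx.left == "VP" and ctx.right == "VP":
        return "VP COORDINATION"
    if ctx.left == "CP" or ctx.left.endswith("-CND"):
        # Subordinate clause before the comma; the main clause follows.
        # Matching on "CP" or a "-CND" suffix is an assumption of this sketch.
        return "ADJUNCTION"
    if ctx.left == "VV" and ctx.right == "IP":
        return "COMPLEMENTATION"
    return "NON-RELATION"
```

For example, `label_comma(CommaContext("VP", "VV", "IP"))` falls through to the complementation rule. A real extractor would have to read these contexts off treebank trees and handle many more configurations.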
Taxonomy of automatically extractable discourse relations
  ALL
    NON-RELATION
    RELATION
      SENT BOUNDARY
      COORD
        COORD-IP
        COORD-VP
      SUBORD
        ADJUNCTION
        COMP (ATTRIBUTION)
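One way to work with such a taxonomy is to store it as a nested mapping, so that a fine-grained label can be backed off to a coarser ancestor. The sketch below assumes one particular reading of the hierarchy (ALL at the root, RELATION split into SENT BOUNDARY, COORD and SUBORD); the names `TAXONOMY`, `path_to` and `coarsen` are hypothetical helpers, not part of any released tool.

```python
# Assumed nesting of the labels; leaves map to empty dicts.
TAXONOMY = {
    "ALL": {
        "NON-RELATION": {},
        "RELATION": {
            "SENT BOUNDARY": {},
            "COORD": {"COORD-IP": {}, "COORD-VP": {}},
            "SUBORD": {"ADJUNCTION": {}, "COMP (ATTRIBUTION)": {}},
        },
    }
}


def path_to(label, tree=TAXONOMY, prefix=()):
    """Return the labels from the root down to `label`, or None if absent."""
    for name, children in tree.items():
        here = prefix + (name,)
        if name == label:
            return here
        found = path_to(label, children, here)
        if found is not None:
            return found
    return None


def coarsen(label, depth):
    """Back a label off to its ancestor at `depth` (0 is the root).

    Labels that sit above `depth` are returned unchanged.
    """
    path = path_to(label)
    if path is None:
        raise KeyError(label)
    return path[min(depth, len(path) - 1)]
```

With this layout, `coarsen("COORD-VP", 1)` gives the two-way RELATION versus NON-RELATION distinction, and `coarsen("COORD-VP", 2)` gives the intermediate COORD class.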
Answers to Question 2
  Can discourse relations be extracted from existing annotated resources, i.e., a treebank?
    A first approximation of discourse structure can be extracted and may even be useful
    Only sentence-internal discourse relations can be extracted
    Substantial manual annotation is needed to construct the full discourse structure of a document
Systematic Adaptations
  Procedural division between explicit and implicit discourse relations
  Annotation of implicit discourse relations
  Definition of Arg1 and Arg2
Explicit and Implicit Unified
  Always use punctuation marks as potential anchors of discourse relations
  Mark explicit connectives as an attribute of the discourse relation
  Justification:
    Punctuation marks are more reliable indicators of discourse unit boundaries than discourse connectives
    Discourse connectives are often optional and syntactically flexible
    82% implicit, 18% explicit
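The unified treatment can be pictured as a single record type in which the punctuation anchor is always present and the connective is an optional attribute. This is a minimal sketch under that reading; the `DiscourseRelation` class, its field names, and the `implicit_share` helper are hypothetical and do not reflect the actual annotation file format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiscourseRelation:
    anchor: str                       # punctuation mark anchoring the relation
    sense: str                        # sense label from the hierarchy
    connective: Optional[str] = None  # explicit connective, if one is present

    @property
    def is_explicit(self) -> bool:
        # Explicit and implicit relations differ only in this attribute.
        return self.connective is not None


def implicit_share(relations: List[DiscourseRelation]) -> float:
    """Fraction of relations that carry no explicit connective."""
    if not relations:
        return 0.0
    implicit = sum(1 for r in relations if not r.is_explicit)
    return implicit / len(relations)
```

Because both kinds of relation share one record, a statistic such as the 82% to 18% split reduces to counting records whose `connective` is unset.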
Annotation of Implicit Relations
  No insertion of explicit connective
    Difficulties of insertion
    Inter-annotator agreement (Miltsakaki et al. 2004)
  Annotating with senses directly
    Now that a sense hierarchy already exists
    Difficulties (Prasad et al. 2008): use prototypical connectives as aids
  Benefits
    Inclusion of EntRel in the sense hierarchy
    Move from annotating connectives to annotating relations
Arg1/2 Defined Semantically
  Why?
    82% implicit: distinction less meaningful
    Discourse connectives often optional
  How?
    Use the sense hierarchy already developed for English
    Example: CONTINGENCY: Cause
      reason: for cases like because, since, etc.
      result: for cases like so, as a result, etc.
      reason: Arg1, clause bound to (because) etc.
      result: Arg2, clause bound to (therefore) etc.
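The Cause example can be illustrated with a small function that assigns argument labels from meaning rather than from the position of a connective. The sketch assumes the mapping reads as reason = Arg1 and result = Arg2; the `Args` tuple, the function name, and the `reason_is_first` flag are hypothetical, and the English clauses stand in for Chinese text.

```python
from typing import NamedTuple


class Args(NamedTuple):
    arg1: str  # the reason clause (assumed mapping)
    arg2: str  # the result clause (assumed mapping)


def assign_cause_args(first_clause: str, second_clause: str,
                      reason_is_first: bool) -> Args:
    """Label the two clauses of a CONTINGENCY:Cause relation.

    The labels depend only on which clause expresses the reason,
    not on clause order or on where a connective attaches.
    """
    if reason_is_first:
        return Args(arg1=first_clause, arg2=second_clause)
    return Args(arg1=second_clause, arg2=first_clause)


# Same relation, two surface orders, identical labels.
a = assign_cause_args("prices were low", "I placed the order", True)
b = assign_cause_args("I placed the order", "prices were low", False)
```

Both calls return the same `Args` value, which is the point of a semantic definition: the labeling stays stable whether or not a connective is present and whichever clause comes first.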
Annotation experiments
  Annotation type     Chinese token count   Chinese f (p/r) (%)   PDTB (%)
  Rel id              3951*                 95.4 (96.0/94.7)      N/A
  Rel type            3951                  95.1                  N/A
  Imp sense type      2967                  87.4                  72
  Argument order      3059                  99.8                  N/A
  Exp span exact      1580                  84.2                  90.2
  Exp span partial    1580                  99.6                  94.5
  Imp span exact      5934                  96.9                  85.1
  Overall boundary    14039                 87.7 (87.5/87.9)      N/A
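Two rows of the table report precision and recall alongside f. Assuming f is the balanced F-score, i.e. the harmonic mean of precision and recall (the slide does not say so explicitly), it can be recomputed as in the sketch below; `f_score` is a helper written for this illustration.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "Overall boundary" row: p = 87.5, r = 87.9
overall_boundary = round(f_score(87.5, 87.9), 1)
```

Under this assumption the Overall boundary row is reproduced from its rounded precision and recall. For the Rel id row the same calculation on the rounded inputs comes out about 0.1 lower than the reported 95.4, a gap that could come from rounding of p and r, although the table does not state how f was obtained.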
Answers to Question 3
  Is PDTB-style discourse annotation feasible?
    Yes, it's feasible
  What adaptations have to be made?
    Intra-sentential discourse relations are often delimited by a comma without an explicit connective
      Use commas as well as periods as indicators of discourse unit boundaries
    Significantly more implicit relations than in English: 82% implicit in Chinese vs. 54.5% implicit in PDTB 2.0
      Annotate explicit and implicit discourse relations in one unified process
    Discourse connectives are often optional
      Define argument labels semantically
References
  Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. 2004. Annotating discourse connectives and their arguments. In Proceedings of the HLT/NAACL Workshop on Frontiers in Corpus Annotation, pages 9-16, Boston, MA, May.
  Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008).
  Yuping Zhou and Nianwen Xue. 2012 (to appear). PDTB-style discourse annotation of Chinese text. ACL-2012. Jeju Island, Korea.
  Yaqin Yang and Nianwen Xue. 2012 (to appear). Chinese comma disambiguation for discourse analysis. ACL-2012. Jeju Island, Korea.