PDTB-style Discourse Annotation of Chinese Text

PDTB-style Discourse Annotation of Chinese Text Nianwen Xue, Yaqin Yang, Yuping Zhou Brandeis University PDTB Workshop University of Pennsylvania 4/30/2012

Questions
  How is Chinese discourse similar to / different from English discourse?
  Why is discourse annotation particularly important for Chinese?
  Can discourse relations be extracted from existing annotated resources, i.e., a treebank?
  Is PDTB-style discourse annotation feasible for Chinese, and what adaptations have to be made?

Similarities
  Explicit and implicit relations
  Explicit connectives: subordinating conjunctions, coordinating conjunctions, discourse adverbials

Subordinating conjunctions
Gloss: /if /reform /measure /not /effective ( 么 /then) /investor /then /have /possibility /BA /attention /turn to /emerging /market
Translation: "If the reform measures are not effective and the confidence crisis still exists, then investors are likely to turn their attention to other emerging markets."

Coordinating conjunctions
Gloss: /modern /parent /difficult /to /DE /area /be /not only /no way /eliminate /blood /in /traditional /DE /values /but also /need /face /new /DE /values
Translation: "The difficulty of being modern parents lies in the fact that they cannot get rid of the traditional values flowing in their blood, and they also need to face new values."

Adverbial connectives
Gloss: /Clinton /Administration /already /indicate /will /extend /China /DE /MFN /status /therefore /this /CL /lobby /DE /target /be /those /relatively /conservative /DE /congressmen
Translation: "The Clinton Administration has already indicated that it will extend China's MFN status; therefore, the focus of the lobby this time is on those relatively conservative congressmen."

Differences in discourse connectives
Characteristics of the connectives in Chinese:
  They are often optional
  They vary in their syntactic position
  There are many paired connectives, and the boundary between subordinating and coordinating conjunctions is less clear
  There are many different ways of expressing the same discourse relation

An example
Gloss: /Taiwan businessmen /children /school ( /although) /already /lay foundation / {but, however} /funding /insufficient /faculty /undecided
Translation: "The foundation of the school for Taiwan businessmen has been laid, but the funding is insufficient and its faculty hasn't been decided."

Same discourse relation expressed with multiple discourse connectives
Gloss (the Chinese forms listed under Part 1 and Part 2 are not preserved in this transcript, apart from 么 in the "if" row):
  although
  because
  if
  even if
  as long as, only if
  therefore
  for example

Differences in punctuation marks
Characteristics of Chinese punctuation marks:
  The comma is a good indicator of the boundary of a discourse unit (an argument in the PDTB sense)
    Where there is a discourse unit boundary, there is usually a comma
    But where there is a comma, there isn't always a discourse unit boundary
  The period is not a good indicator of sentence boundaries
    Periods, exclamation marks, and question marks always end sentences, but (arguably) not all sentences end with one of these marks
    Commas (arguably) sometimes serve that function as well

What is a sentence?
Gloss: pay attention to this Nano 3 recently, (1) even visit a few computer stores in person, (2) comparatively speaking, (3) Zhuoyue's prices be relatively low, (4) and can also guarantee that be genuine, (5) therefore place the order.
Translation: "I have been paying attention to this Nano 3 recently, (1) and I even visited a few computer stores in person. (2) Comparatively speaking, (3) Zhuoyue's prices are relatively low, (4) and they can also guarantee that their products are genuine. (5) Therefore I placed the order."

Answers to Question 1
Why is discourse annotation particularly important for Chinese?
  Discourse structure is needed to determine the sentence boundaries in Chinese (in addition to the usual purposes of discourse structure).

An attempt to automatically extract discourse relations
  Commas in Chinese are more reliable anchors of discourse relations than discourse connectives
    There isn't always a discourse connective, but there is always a punctuation mark (comma) at the boundary of a discourse segment
  A first approximation of discourse relation analysis can be modeled as comma classification
    Based on automatically extractable patterns around commas in the Chinese Treebank

Syntactic patterns
[Figure: five tree patterns around a comma. a. SENTENCE BOUNDARY (IP, IP under IP-Root); b. IP-COORDINATION (IP, IP); c. VP COORDINATION (VP, VP); d. ADJUNCTION (CP/IP-CND, main clause); e. COMPLEMENTATION (VV, IP)]

Taxonomy of automatically extractable discourse relations
ALL
  NON-RELATION
  RELATION
    SENT BOUNDARY
    COORD
      COORD-IP
      COORD-VP
    SUBORD
      ADJUNCTION
      COMP (ATTRIBUTION)
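
The mapping from syntactic patterns to taxonomy labels can be sketched as a small rule table. This is only an illustration under stated assumptions, not the classifier used in the work described here: the function names, the argument layout (parent label, root flag, left and right sibling labels), and the exact label strings are hypothetical, and the rules cover only the five patterns named on the slides.

```python
def base_and_tags(label):
    """Split a treebank-style label such as 'IP-CND' into ('IP', {'CND'})."""
    parts = label.split("-")
    return parts[0], set(parts[1:])


def classify_comma(parent, parent_is_root, left, right):
    """Label one comma from the node labels around it (illustrative rules only).

    parent:         label of the node directly dominating the comma
    parent_is_root: True if that node is the root of the sentence tree
    left, right:    labels of the comma's nearest left and right siblings
    """
    parent_base, _ = base_and_tags(parent)
    left_base, left_tags = base_and_tags(left)
    right_base, _ = base_and_tags(right)

    # d. ADJUNCTION: a subordinate clause (CP, or a clause tagged CND)
    #    precedes the comma.
    if left_base == "CP" or "CND" in left_tags:
        return "ADJUNCTION"
    # e. COMPLEMENTATION: a verb followed by a clausal complement.
    if parent_base == "VP" and left_base == "VV" and right_base == "IP":
        return "COMP"
    # c. VP COORDINATION: two verb phrases joined by the comma.
    if left_base == "VP" and right_base == "VP":
        return "COORD-VP"
    # a./b. Two clauses joined by the comma: a sentence boundary when they
    #       sit directly under the root, IP coordination otherwise.
    if left_base == "IP" and right_base == "IP":
        return "SENT BOUNDARY" if parent_is_root else "COORD-IP"
    # Anything else is treated as a comma that anchors no relation.
    return "NON-RELATION"


print(classify_comma("IP", True, "IP", "IP"))
print(classify_comma("IP", False, "IP-CND", "IP"))
```

The adjunction rule is checked first because a label such as IP-CND would otherwise match the clause-coordination rule.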

Answers to Question 2
Can discourse relations be extracted from existing annotated resources, i.e., a treebank?
  A first approximation of discourse structure can be extracted and may even be useful
  Only sentence-internal discourse relations can be extracted
  Substantial manual annotation is needed to construct the full discourse structure of a document

Systematic Adaptations
  Procedural division between explicit and implicit discourse relations
  Annotation of implicit discourse relations
  Definition of Arg1 and Arg2

Explicit and Implicit Unified
  Always use punctuation marks as potential anchors of discourse relations
  Mark explicit connectives as an attribute of the discourse relation
  Justification:
    Punctuation marks are more reliable indicators of discourse unit boundaries than discourse connectives
    Discourse connectives are often optional and syntactically flexible
    82% implicit, 18% explicit
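
One way to picture the unified scheme is a single record type in which the punctuation mark is the anchor and the connective is an optional attribute. The sketch below is an assumption for illustration only: the class name, field names, and the sense string are hypothetical and do not reflect the project's actual annotation format. The example text is the English translation from the adverbial connective slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscourseRelation:
    anchor: str                       # punctuation mark anchoring the relation
    arg1: str
    arg2: str
    sense: str
    connective: Optional[str] = None  # None when no connective is present

    @property
    def relation_type(self) -> str:
        # Explicit vs. implicit is derived from the attribute rather than
        # decided by a separate annotation procedure.
        return "Explicit" if self.connective else "Implicit"


explicit = DiscourseRelation(
    anchor=",",
    arg1="The Clinton Administration has already indicated that it will extend China's MFN status",
    arg2="the focus of the lobby this time is on those relatively conservative congressmen",
    sense="CONTINGENCY:Cause",        # illustrative sense label
    connective="therefore",
)
# The same relation with the (optional) connective left out.
implicit = DiscourseRelation(explicit.anchor, explicit.arg1, explicit.arg2, explicit.sense)

print(explicit.relation_type, implicit.relation_type)
```

Because the connective is just a field, dropping it changes the derived relation type but leaves the anchor, arguments, and sense untouched.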

Annotation of Implicit Relations
  No insertion of an explicit connective
    Difficulties of insertion
    Inter-annotator agreement (Miltsakaki et al. 2004)
  Annotating with senses directly
    Now that a sense hierarchy already exists
    Difficulties (Prasad et al. 2008): use prototypical connectives as aids
  Benefits
    Inclusion of EntRel in the sense hierarchy
    Move from annotating connectives to annotating relations

Arg1/2 Defined Semantically
Why?
  82% implicit: distinction less meaningful
  Discourse connectives often optional
How?
  Use the sense hierarchy already developed for English
  Example: CONTINGENCY: Cause
    reason: for cases like "because", "since", etc.
    result: for cases like "so", "as a result", etc.
    reason: Arg1, the clause bound to ("because") etc.
    result: Arg2, the clause bound to ("therefore") etc.
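
For the Cause example, the semantic definition means the argument labels follow the role of each clause, not its position or the presence of a connective. A minimal sketch, assuming a hypothetical helper that is told which role the textually first clause plays:

```python
def assign_cause_args(first_clause, second_clause, first_role):
    """Assign Arg1/Arg2 for a Cause relation by semantic role.

    first_role is the role of the textually first clause: 'reason' or 'result'.
    The reason clause becomes Arg1 and the result clause becomes Arg2,
    whichever order they appear in.
    """
    if first_role not in ("reason", "result"):
        raise ValueError("first_role must be 'reason' or 'result'")
    if first_role == "reason":
        return {"Arg1": first_clause, "Arg2": second_clause}
    return {"Arg1": second_clause, "Arg2": first_clause}


# Reason first, result second.
print(assign_cause_args("prices are relatively low", "I placed the order", "reason"))
# Result first, reason second: the labels stay with the roles.
print(assign_cause_args("I placed the order", "prices are relatively low", "result"))
```

Both calls yield the same Arg1 (the reason) and Arg2 (the result), which is the point of defining the labels semantically.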

Annotation experiments
Annotation type     Chinese token count   Chinese f (p/r) (%)   PDTB (%)
Rel id              3951*                 95.4 (96.0/94.7)      N/A
Rel type            3951                  95.1                  N/A
Imp sense type      2967                  87.4                  72
Argument order      3059                  99.8                  N/A
Exp span exact      1580                  84.2                  90.2
Exp span partial    1580                  99.6                  94.5
Imp span exact      5934                  96.9                  85.1
Overall boundary    14039                 87.7 (87.5/87.9)      N/A
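
The "f (p/r)" figures can be related to each other if f is read as the balanced F-score, the harmonic mean of precision and recall. That reading is an assumption; the slides do not define f. Under it, the Overall boundary row can be checked as follows (the function name is hypothetical):

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall boundary row: p = 87.5, r = 87.9
print(round(f_score(87.5, 87.9), 1))  # 87.7
```

Because the published precision and recall are themselves rounded to one decimal place, a value recomputed this way can differ from a reported f by about 0.1.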

Answers to Question 3
Is PDTB-style discourse annotation feasible?
  Yes, it's feasible
What adaptations have to be made?
  Intra-sentential discourse relations are often delimited by a comma without an explicit connective
    Use commas as well as periods as indicators of discourse unit boundaries
  Significantly more implicit relations than in English: 82% implicit in Chinese vs. 54.5% implicit in PDTB 2.0
    Annotate explicit and implicit discourse relations in one unified process
  Discourse connectives are often optional
    Define argument labels semantically

References
Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. 2004. Annotating discourse connectives and their arguments. In Proceedings of the HLT/NAACL Workshop on Frontiers in Corpus Annotation, pages 9-16, Boston, MA, May.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008).
Yuping Zhou and Nianwen Xue. 2012 (to appear). PDTB-style discourse annotation of Chinese text. ACL-2012. Jeju Island, Korea.
Yaqin Yang and Nianwen Xue. 2012 (to appear). Chinese comma disambiguation for discourse analysis. ACL-2012. Jeju Island, Korea.