ANNOTATING DISCOURSE IN PRAGUE DEPENDENCY TREEBANK

PDTB WORKSHOP, PHILADELPHIA, APRIL 30, 2012
Lucie Poláková (Mladová), Charles University in Prague

PRAGUE DEPENDENCY TREEBANK & DISCOURSE
2006: Prague Dependency Treebank (PDT) 2.0 (Hajič et al.): multilayer annotations of 50,000 sentences of Czech journalistic texts:
- morphology
- surface syntax
- underlying syntax + sentence semantics (tectogrammatical tree structures)
+ other phenomena included: information structure, grammatically bound and textual pronominal coreference
2009-2012: annotation of discourse-related phenomena - Prof. Eva Hajičová

DISCOURSE-LEVEL PHENOMENA IN PDT
Manual annotations of the whole treebank: 49,431 sentences in 3,165 documents of 1-231 sentences, with an average length of 15.6 sentences
1. Explicit discourse connectives + their arguments (scopes), sense tags (= PDTB II annotation)
2. a) Extended textual coreference annotation
   b) Bridging relations
ALL DISCOURSE PHENOMENA ANNOTATED DIRECTLY ON THE SYNTACTIC (tectogrammatical) TREES!
Both projects completed in 2011, currently under checking procedures

EXAMPLE OF ANNOTATION ON TREES
1. Searching for possible discourse connectives in plain text (unlike in Penn)
V polovině července radní předložené rozhodnutí akceptovali. Dodnes však žádná smlouva podepsána nebyla.
In mid-July, the councilors accepted the proposed decision. However, no contract has been signed so far.
2. Marking the discourse relation on the tectogrammatical trees:

- linking of the arguments of the relation by an arrow (automatically connected with the choice of semantic interpretation)
- adding of the connective
- marking of the extent of the arguments
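The three annotation steps above (an arrow between the arguments, the connective, the extent of the arguments) could be captured in a record like the one below. This is only a minimal sketch, not the PDT file format: the class name, the field names, the node identifiers, the arrow direction, the "subtree" extent value and the "opp" label used for the example sentences are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DiscourseRelation:
    """Hypothetical record for one discourse arrow between two tree nodes."""
    start_node: str                 # node where the arrow starts (assumed id scheme)
    target_node: str                # node where the arrow ends
    sense: str                      # semantic label chosen together with the arrow
    connective: list = field(default_factory=list)  # connective word forms
    start_extent: str = "subtree"   # assumed default: the node's whole subtree
    target_extent: str = "subtree"


# The example sentence pair; node ids and the "opp" label are made up here.
rel = DiscourseRelation(
    start_node="s2-podepsat",
    target_node="s1-akceptovat",
    sense="opp",
    connective=["však"],
)
print(rel.sense, rel.connective, rel.start_extent)
```

Keeping the relation as a link between two node identifiers mirrors the point of the slides: the discourse arrow lives on the trees, so everything already annotated on those nodes stays reachable from the relation.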

DISCOURSE REPRESENTATION COMPACT VIEW

SEMANTIC LABELS IN PRAGUE
TEMPORAL:
- asynchronous
- synchronous
CONTINGENCY:
- reason - result
- pragmatic reason - result
- purpose
- explication
- condition
- pragmatic condition
COMPARISON (CONTRAST):
- confrontation
- opposition
- pragmatic contrast
- restrictive opposition
- concession
- correction (replacement)
- gradation
EXPANSION:
- conjunction
- instantiation
- specification
- equivalence
- generalization
- conjunctive alternative
- disjunctive alternative

ADDITIONAL INFORMATION ANNOTATED
- List structures (quite a typical way of text composition)
- Headings
- Alternative lexical expressions of the connectives, for example:
  Z toho vyplývá = tedy, takže (This implies = so, therefore)
  Důvodem je = protože (The reason is that = because)
- Double-sense relations
- The so-called collections
Not annotated so far:
- Implicit connectives, their arguments and senses (in comparison to PDTB 2.0)
- Attribution (partly present in some features of the syntactic trees)

TREEBANK STATISTICS
Measured on train data = 9/10 of the treebank = 43,955 sentences

Relation    Intra-sentential   Inter-sentential   Total
conj        4706               1259               5965
opp         1179               1602               2781
reason      1507               900                2407
cond        1332               15                 1347
conc        618                234                852
preced      586                205                791
confr       311                272                583
spec        399                99                 498
purp        459                1                  460
corr        292                109                401
grad        150                182                332
restr       110                148                258
synchr      210                43                 253
explicat    75                 116                191
disjalt     174                12                 186
exempl      21                 106                127
equiv       36                 56                 92
gener       7                  84                 91
conjalt     49                 16                 65
f_opp       23                 26                 49
f_reason    8                  26                 34
f_cond      14                 1                  15
total       12266              5512               17778
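The counts above can be checked and explored with a few lines of Python. The sketch below only re-adds the figures from the table and computes, per relation, the share of inter-sentential instances; the names COUNTS, totals and inter_share are introduced here for illustration and are not part of any PDT tooling.

```python
# (intra-sentential, inter-sentential) counts copied from the statistics table
COUNTS = {
    "conj": (4706, 1259), "opp": (1179, 1602), "reason": (1507, 900),
    "cond": (1332, 15), "conc": (618, 234), "preced": (586, 205),
    "confr": (311, 272), "spec": (399, 99), "purp": (459, 1),
    "corr": (292, 109), "grad": (150, 182), "restr": (110, 148),
    "synchr": (210, 43), "explicat": (75, 116), "disjalt": (174, 12),
    "exempl": (21, 106), "equiv": (36, 56), "gener": (7, 84),
    "conjalt": (49, 16), "f_opp": (23, 26), "f_reason": (8, 26),
    "f_cond": (14, 1),
}


def totals(counts):
    """Return (intra, inter, overall) sums over all relation types."""
    intra = sum(i for i, _ in counts.values())
    inter = sum(e for _, e in counts.values())
    return intra, inter, intra + inter


def inter_share(counts, relation):
    """Fraction of a relation's instances that cross a sentence boundary."""
    intra, inter = counts[relation]
    return inter / (intra + inter)


intra, inter, total = totals(COUNTS)
print(intra, inter, total)  # 12266 5512 17778, matching the "total" row
for name in ("cond", "opp", "gener"):
    print(name, round(inter_share(COUNTS, name), 2))
```

The per-relation share makes the contrast in the table easy to see: some relations occur almost only inside a sentence, others mostly across sentence boundaries.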

INTERANNOTATOR AGREEMENT MEASUREMENT
- treebank: 10 sections, in each of them a sample annotated by all annotators for the IAA measurement (2,084 sentences)
- following table: only the inter-sentential relations (agreement is assumed to improve once intra-sentential relations are included)
- connective-based measurement
- one pair of annotators
- average agreement on semantic types: 0.77 (similar measure in PDTB: 0.8; Prasad et al., LREC 2008)
- agreement on higher level, i.e. the 4 basic semantic classes: 0.89 (Penn: 0.94)
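A connective-based F1 measure can be sketched as follows. The slides do not spell out the matching criterion, so this sketch assumes that two annotators agree on a relation when the connective tokens they marked overlap; the function name and the representation of each relation as a set of token positions are likewise assumptions.

```python
def connective_f1(ann_a, ann_b):
    """F1 between two annotators' relations.

    ann_a, ann_b: lists of sets; each set holds the token positions of the
    connective marked for one relation (assumed representation).
    """
    def matched(source, other):
        # a relation counts as matched if its connective overlaps any
        # connective marked by the other annotator
        return sum(1 for conn in source if any(conn & o for o in other))

    if not ann_a or not ann_b:
        return 0.0
    precision = matched(ann_a, ann_b) / len(ann_a)
    recall = matched(ann_b, ann_a) / len(ann_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: annotator A marked 3 relations, annotator B marked 4.
a = [{1}, {5, 6}, {9}]
b = [{1}, {6}, {12}, {15}]
print(round(connective_f1(a, b), 3))
```

Treating one annotator as the reference for precision and the other for recall keeps the measure symmetric, which suits a setting where neither annotation is a gold standard.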

IAA IN THE SUBSEQUENT MEASUREMENTS

Measurement   Connective-based F1 measure   Agreement on semantic types   Kappa on sem. types
train-2       0.83                          0.69                          0.57
train-3       0.79                          0.8                           0.75
train-4       0.8                           0.75                          0.69
train-5       0.85                          0.76                          0.71
train-6       0.84                          0.77                          0.68
train-7       0.79                          0.67                          0.61
train-8       0.86                          0.84                          0.79
dtest         0.85                          0.73                          0.67
etest         0.83                          0.72                          0.68
train-1       0.84                          0.91                          0.88
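The last two columns differ because kappa discounts the agreement two annotators would reach by chance. The slides do not say which kappa variant was used; the sketch below assumes Cohen's kappa for a single pair of annotators, and the function names and toy labels are illustrative only.

```python
from collections import Counter


def simple_agreement(labels_a, labels_b):
    """Share of relations on which both annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    same = sum(x == y for x, y in zip(labels_a, labels_b))
    return same / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = simple_agreement(labels_a, labels_b)
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # chance agreement from each annotator's own label distribution
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Toy labels for four relations that both annotators recognized.
ann_a = ["conj", "conj", "opp", "opp"]
ann_b = ["conj", "conj", "opp", "conj"]
print(simple_agreement(ann_a, ann_b), cohen_kappa(ann_a, ann_b))  # 0.75 0.5
```

In the toy data the annotators agree on 3 of 4 labels, but half of that agreement is expected by chance, so kappa comes out lower than the simple agreement, as it does in every row of the table.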

WHAT HAVE WE LEARNED?
Linguistically:
- the absolute number of explicit inter-sentential relations is low (it increases with the incorporation of "syntactic" discourse edges)
- coreference annotation (entity-based, EntRel) on the same data HELPS with a lot of things
- different nature of the relations (condition: syntax-bound; specification: text-bound)
- some very ambiguous connectives in Czech (ale = but)
- some Czech connectives have no exact English counterpart (totiž: an additional semantic category?)
- some findings correspond with the Penn numbers: discourse structure is to some extent language-independent
- we need genre distinction

WHAT HAVE WE LEARNED?
Technically:
- trees with resolved syntactic structure help a lot (ellipsis, easy extraction of intra-sentential relations, coreference accessible)
- sometimes it is problematic to deal with larger arguments and with the complexity of the trees
- the representation offers a lot of data with all the various types of linguistic information AT ONCE

RECENT & NEXT
Completed:
- the annotation of explicits
- annotation manual in Czech and in English
- webpage with a browser for a sample of the data
- integration of the two projects into ONE layer with coreference and bridging annotations
In progress:
- extracting relevant tectogrammatical information (intra-sentential discourse relations)
- release of the discourse and coreference data as PDT 2.5
Next:
- altlexes
- genre distinction of the texts
- interplays (information structure, Play Coref, the phenomenon of contrast...)
- automatic experiments

Thank you!
polakova@ufal.mff.cuni.cz
http://ufal.mff.cuni.cz/discourse
This work was supported by the Grant Agency of the Czech Republic (P406/12/0658, P406/2010/0875) and by the Czech Ministry of Education (ME 10018).