Correcting Errors in a Treebank Based on Synchronous Tree Substitution Grammar


Yoshihide Kato (Information Technology Center, Nagoya University)
Shigeki Matsubara (Graduate School of Information Science, Nagoya University)
Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
yosihide@el.itc.nagoya-u.ac.jp

Abstract

This paper proposes a method of correcting annotation errors in a treebank. Using a synchronous grammar, the method transforms parse trees containing annotation errors into trees whose errors are corrected. The synchronous grammar is automatically induced from the treebank. We report an experimental result of applying our method to the Penn Treebank. The result demonstrates that our method corrects syntactic annotation errors with high precision.

1 Introduction

Annotated corpora play an important role in fields such as theoretical linguistic research and the development of NLP systems. However, they often contain annotation errors, which are introduced by the manual or semi-manual mark-up process. These errors are problematic for corpus-based research. To address this problem, several error detection and correction methods have been proposed (Eskin, 2000; Nakagawa and Matsumoto, 2002; Dickinson and Meurers, 2003a; Dickinson and Meurers, 2003b; Ule and Simov, 2004; Murata et al., 2005; Dickinson and Meurers, 2005; Boyd et al., 2008). These methods detect corpus positions which are marked up incorrectly and find the correct labels (e.g. pos-tags) for those positions. However, they cannot correct errors in structural annotation, and are therefore insufficient for correcting annotation errors in a treebank.

This paper proposes a method of correcting errors in structural annotation. Our method is based on a synchronous grammar formalism called synchronous tree substitution grammar (STSG) (Eisner, 2003), which defines a tree-to-tree transformation. Using an STSG, our method transforms parse trees containing errors into trees whose errors are corrected. The grammar is automatically induced from the treebank. To select
STSG rules which are useful for error correction, we define a score function based on the occurrence frequencies of the rules. An experimental result shows that the selected rules achieve high precision.

This paper is organized as follows: Section 2 gives an overview of previous work, Section 3 explains our method of correcting errors in a treebank, and Section 4 reports an experimental result using the Penn Treebank.

2 Previous Work

This section summarizes previous methods for correcting errors in corpus annotation and discusses their limitations. Some research addresses the detection of errors in pos-annotation (Nakagawa and Matsumoto, 2002; Dickinson and Meurers, 2003a), syntactic annotation (Dickinson and Meurers, 2003b; Ule and Simov, 2004; Dickinson and Meurers, 2005) and dependency annotation (Boyd et al., 2008). These methods only detect the corpus positions where errors occur; it remains unclear how the errors can be corrected. Several methods can correct annotation errors (Eskin, 2000; Murata et al., 2005), but they target tag-annotation errors only, that is, they simply suggest a candidate tag for each position where an error is detected. Such methods cannot correct syntactic annotation errors, because syntactic annotation is structural. There has been no previous approach to correcting structural annotation errors.

To clarify the problem, let us consider an example. Figure 1 depicts two parse trees annotated according to the Penn Treebank annotation (the null elements *-1 and 0 appear in the trees).

Proceedings of the ACL 2010 Conference Short Papers, pages 74-79, Uppsala, Sweden, 11-16 July 2010. (c) 2010 Association for Computational Linguistics
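To make the limitation concrete, here is a minimal sketch (with hypothetical trees and labels, not the trees of Figure 1): two analyses of the same word sequence share a root label and a yield but differ in bracketing. A tag-correction method can relabel a single position, but turning one tree into the other requires moving a subtree.

```python
# A parse tree as a nested tuple: (label, child1, child2, ...); leaves are words.
# Hypothetical PP-attachment example: the PP attaches to the VP in the correct
# tree but is buried inside the NP in the erroneous one. The yields are identical.

def yield_of(tree):
    """Return the word sequence (yield) dominated by a tree."""
    if isinstance(tree, str):          # a leaf is a word
        return (tree,)
    _label, *children = tree
    return tuple(w for c in children for w in yield_of(c))

wrong = ("VP", ("VB", "saw"),
               ("NP", ("DT", "the"), ("NN", "man"),
                      ("PP", ("IN", "with"), ("NP", ("NN", "binoculars")))))
right = ("VP", ("VB", "saw"),
               ("NP", ("DT", "the"), ("NN", "man")),
               ("PP", ("IN", "with"), ("NP", ("NN", "binoculars"))))

# Same words, same root label, different bracketing: no per-position
# relabeling can turn `wrong` into `right`; a subtree has to move.
assert yield_of(wrong) == yield_of(right)
assert wrong != right
```

The assertions hold because the error lives in the tree shape, which is exactly the kind of error the tag-level methods above cannot express.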

Figure 1: An example of a treebank error ((a) an incorrect parse tree; (b) the correct parse tree).

Parse tree (a) contains errors, and parse tree (b) is the corrected version. In parse tree (a), the positions of the two subtrees are erroneous. To correct the errors, we need to move the subtrees to the positions which are directly dominated by the appropriate node. This example demonstrates that a framework for transforming tree structures is required to correct structural annotation errors.

3 Correcting Errors by Using a Synchronous Grammar

To solve the problem described in Section 2, this section proposes a method of correcting structural annotation errors by using a synchronous tree substitution grammar (STSG) (Eisner, 2003). An STSG defines a tree-to-tree transformation. Our method induces an STSG which transforms parse trees containing errors into trees whose errors are corrected.

3.1 Synchronous Tree Substitution Grammar

First of all, we describe the STSG formalism. An STSG defines a set of tree pairs and can be treated as a tree transducer which takes a tree as input and produces a tree as output. Each grammar rule consists of the following elements:

- a pair of trees, called elementary trees: the source and the target;
- a one-to-one alignment between nodes in the elementary trees.

Figure 2: An example of an STSG rule.

For a tree pair <t, t'>, the trees t and t' are called the source and the target, respectively. The nonterminal leaves of elementary trees are called frontier nodes, and there exists a one-to-one alignment between the frontier nodes in t and t'. A rule means that a structure matching the source elementary tree is transformed into the structure represented by the target elementary tree. Figure 2 shows an example of an STSG rule; the subscripts indicate the alignment. This rule can correct the errors in parse tree (a) of Figure 1.

An STSG derives tree pairs. Any derivation process starts with the pair of
nodes labeled with special symbols called start symbols. A derivation proceeds in the following steps:

1. Choose a pair of frontier nodes <η, η'> for which there exists an alignment.
2. Choose a rule <t, t'> such that label(η) = root(t) and label(η') = root(t'), where label(η) is the label of η and root(t) is the root label of t.
3. Substitute t and t' into η and η', respectively.

Figure 3 shows a derivation process in an STSG. In the rest of the paper, we focus on the rules in which the source elementary tree is not identical to its target, since identical rules cannot contribute to error correction.

3.2 Inducing an STSG for Error Correction

This section describes a method of inducing an STSG for error correction. The basic idea of our method is similar to the method presented by Dickinson and Meurers (2003b). Their method detects errors by seeking word sequences satisfying the following conditions:

- The word sequence occurs more than once in the corpus.

Figure 3: A derivation process of tree pairs in an STSG.

Figure 4: An example of a partial parse tree pair in a pseudo parallel corpus.

- Different syntactic labels are assigned to the occurrences of the word sequence.

Unlike their method, our method seeks word sequences whose occurrences have different partial parse trees. We call a collection of such word sequences with their partial parse trees a pseudo parallel corpus. Moreover, our method extracts STSG rules which transform one of the partial parse trees into the other.

3.2.1 Constructing a Pseudo Parallel Corpus

Our method first constructs a pseudo parallel corpus, which represents a correspondence between parse trees containing errors and trees whose errors are corrected. The procedure is as follows. Let T be the set of the parse trees occurring in the corpus. We write Sub(σ) for the set of the partial parse trees included in the parse tree σ. A pseudo parallel corpus Para(T) is constructed as follows:

  Para(T) = { <τ, τ'> | τ ≠ τ', τ, τ' ∈ ∪_{σ ∈ T} Sub(σ), yield(τ) = yield(τ'), root(τ) = root(τ') }

where yield(τ) is the word sequence dominated by τ.

Figure 5: Another example of a parse tree containing the word sequence.

Let us consider an example. If the parse trees depicted in Figure 1 exist in the treebank T, the pair of partial parse trees depicted in Figure 4 is an element of Para(T). We also obtain this pair when the treebank does not contain parse tree (b) of Figure 1 but does contain the parse tree depicted in Figure 5, which contains the same word sequence.

3.2.2 Inducing a Grammar from a Pseudo Parallel Corpus

Our method induces an STSG from the pseudo parallel corpus according to the method proposed by Cohn and Lapata (2009), which induces an STSG representing the correspondences in a parallel corpus. Their method first determines an alignment of nodes between the pairs of trees in the parallel corpus and extracts
STSG rules according to the alignments. For partial parse trees τ and τ', we define a node alignment C(τ, τ') as follows:

  C(τ, τ') = { <η, η'> | η ∈ Node(τ), η' ∈ Node(τ'), η is not the root of τ,

η' is not the root of τ', label(η) = label(η'), yield(η) = yield(η') }

where Node(τ) is the set of the nodes in τ and yield(η) is the word sequence dominated by η. Figure 4 shows an example of a node alignment; the subscripts indicate the alignment.

An STSG rule is extracted from a partial parse tree pair <τ, τ'> ∈ Para(T) by deleting nodes: for each <η, η'> ∈ C(τ, τ'), delete the descendants of η and η'. For example, the rule shown in Figure 2 is extracted from the pair shown in Figure 4.

3.3 Rule Selection

Some rules extracted by the procedure in Section 3.2 are not useful for error correction, since the pseudo parallel corpus contains tree pairs whose source tree is correct or whose target tree is incorrect; the rules extracted from such pairs can be harmful. To select rules which are useful for error correction, we define a score function based on the occurrence frequencies of elementary trees in the treebank:

  Score(<t, t'>) = f(t') / (f(t) + f(t'))

where f(.) is the occurrence frequency in the treebank. The score function ranges from 0 to 1. We assume that the occurrence frequency of an elementary tree matching incorrect parse trees is very low. Under this assumption, Score(<t, t'>) is high when the source elementary tree t matches incorrect parse trees and the target elementary tree t' matches correct parse trees. Therefore, STSG rules with high scores are regarded as useful for error correction.

4 An Experiment

To evaluate the effectiveness of our method, we conducted an experiment using the Penn Treebank (Marcus et al., 1993). We used the 49,208 sentences in the Wall Street Journal sections. We induced STSG rules by applying our method to the corpus and obtained 8776 rules.

Figure 6: Examples of error correction rules induced from the Penn Treebank.

We measured the precision of the rules, which is defined as follows:

  precision = (# of the positions where an error is corrected) / (#
of the positions to which some rule is applied)

We manually checked whether each rule application corrected an error, because a corrected version of the treebank does not exist; this also means that we cannot measure the recall of the rules. Furthermore, we evaluated only the first 100 rules as ordered by the score function described in Section 3.3, since it is time-consuming and expensive to evaluate all of the rules. These 100 rules were applied at 331 positions, and their precision is 71.9%. We also measured the precision of each individual rule: 70 rules achieved 100% precision. These results demonstrate that our method can correct syntactic annotation errors with high precision. Moreover, 30 of those 70 rules transformed bracketed structures. This shows that the treebank contains structural errors which cannot be dealt with by the previous methods.

Figure 6 depicts examples of error correction rules which achieved 100% precision. Rules (1), (2) and (3) transform bracketed structures, while rule (4) simply replaces a node label. Rule (1) corrects an erroneous position of a comma (see Figure 7 (a)). Rule (2) deletes a useless node in a subject position (see Figure 7 (b)). Rule (3) inserts a node (see Figure 7 (c)). Rule (4) replaces a node label with the correct label (see Figure 7 (d)). These examples demonstrate that our method can correct syntactic annotation errors.

Figure 8 depicts an example where our method detected an annotation error but could not correct it. To correct the error, we need to attach the node

Figure 7: Examples of correcting syntactic annotation errors ((a) "I think all you need is one good one"; (b) a useless node in a subject position; (c) "of the respondents"; (d) "only two or three other major banks in the U.S.").

Figure 8: An example where our method detected an annotation error but could not correct it ("At 1:33 when ...").

SBAR under the appropriate node. We found that 22 of the rule applications were of this type.

Figure 9 depicts a false positive example, in which our method mistakenly transformed a correct syntactic structure. The score of the rule is very high, since the source elementary tree, rooted in TOP, is infrequent. This example shows that our method carries a risk of changing correct annotations of infrequent syntactic structures.

Figure 9: A false positive example where a correct syntactic structure was mistakenly transformed ("The average of interbank offered rates based on quotations at five major banks").

5 Conclusion

This paper proposed a method of correcting errors in a treebank by using a synchronous tree substitution grammar. Our method constructs a pseudo parallel corpus from the treebank and extracts STSG rules from the parallel corpus. The experimental result demonstrates that we can obtain error correction rules with high precision. In future work, we will explore methods of increasing the recall of error correction by constructing a wide-coverage STSG.

Acknowledgements

This research is partially supported by a Grant-in-Aid for Scientific Research (B) (No. 22351) from JSPS and by the Kayamori Foundation of Informational Science Advancement.
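As a rough illustration of how Sections 3.2 and 3.3 fit together, the sketch below builds a pseudo parallel corpus from a toy treebank and ranks rewrite candidates with the score f(t')/(f(t)+f(t')). It is a simplification, not the authors' implementation: trees are nested Python tuples, whole subtree pairs stand in for STSG rules (no frontier-node alignment or descendant deletion), and every tree, label, and helper name is invented for the example.

```python
from collections import Counter
from itertools import combinations

# A parse tree as a nested tuple: (label, child1, ...); leaves are words.

def yield_of(tree):
    """The word sequence dominated by a tree."""
    if isinstance(tree, str):
        return (tree,)
    _label, *children = tree
    return tuple(w for c in children for w in yield_of(c))

def subtrees(tree):
    """All partial parse trees (internal nodes) of a tree."""
    if isinstance(tree, str):
        return []
    _label, *children = tree
    result = [tree]
    for c in children:
        result.extend(subtrees(c))
    return result

def pseudo_parallel(treebank):
    """Pairs of distinct subtrees with the same root label and the same yield."""
    subs = [s for t in treebank for s in subtrees(t)]
    return [(a, b) for a, b in combinations(subs, 2)
            if a != b and a[0] == b[0] and yield_of(a) == yield_of(b)]

def score(src, tgt, freq):
    """Score(<t, t'>) = f(t') / (f(t) + f(t'))."""
    return freq[tgt] / (freq[src] + freq[tgt])

# Toy treebank (hypothetical trees): the NP "the man" is bracketed
# correctly twice and incorrectly once.
good_np = ("NP", ("DT", "the"), ("NN", "man"))
bad_np = ("NP", ("NP", ("DT", "the")), ("NN", "man"))
treebank = [("S", good_np, ("VB", "left")),
            ("S", good_np, ("VB", "ran")),
            ("S", bad_np, ("VB", "fell"))]

freq = Counter(s for t in treebank for s in subtrees(t))

# Rank both rewrite directions of every pseudo-parallel pair by score:
# the frequent analysis wins, so the best candidate rewrites bad_np to good_np.
rules = [(score(src, tgt, freq), src, tgt)
         for a, b in pseudo_parallel(treebank)
         for src, tgt in ((a, b), (b, a))]
best = max(rules, key=lambda r: r[0])
```

On this toy data, the highest-scoring candidate rewrites the once-seen bracketing into the twice-seen one (score 2/3), mirroring the paper's assumption that erroneous analyses are rare relative to correct ones.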

References

Adriane Boyd, Markus Dickinson, and Detmar Meurers. 2008. On detecting errors in dependency treebanks. Research on Language and Computation, 6(2):113-137.

Trevor Cohn and Mirella Lapata. 2009. Sentence compression as tree transduction. Journal of Artificial Intelligence Research, 34(1):637-674.

Markus Dickinson and Detmar Meurers. 2003a. Detecting errors in part-of-speech annotation. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, pages 107-114.

Markus Dickinson and Detmar Meurers. 2003b. Detecting inconsistencies in treebanks. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories.

Markus Dickinson and W. Detmar Meurers. 2005. Prune diseased branches to get healthy trees! How to find erroneous local trees in a treebank and why it matters. In Proceedings of the 4th Workshop on Treebanks and Linguistic Theories.

Jason Eisner. 2003. Learning non-isomorphic tree mappings for machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Companion Volume, pages 205-208.

Eleazar Eskin. 2000. Detecting errors within a corpus using anomaly detection. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 148-153.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Masaki Murata, Masao Utiyama, Kiyotaka Uchimoto, Hitoshi Isahara, and Qing Ma. 2005. Correction of errors in a verb modality corpus for machine translation with a machine-learning method. ACM Transactions on Asian Language Information Processing, 4(1):18-37.

Tetsuji Nakagawa and Yuji Matsumoto. 2002. Detecting errors in corpora using support vector machines. In Proceedings of the 19th International Conference on Computational Linguistics, pages 709-715.

Tylman Ule and Kiril Simov. 2004. Unexpected productions may well be errors. In Proceedings of the 4th International Conference on
Language Resources and Evaluation, pages 1795-1798.