A simple DOP model for constituency parsing of Italian sentences


Federico Sangati
Institute for Logic, Language and Computation, University of Amsterdam
f.sangati@uva.nl

Abstract. We present a simplified Data-Oriented Parsing (DOP) formalism for learning the constituency structure of Italian sentences. In our approach we simplify the original DOP methodology by constraining the number and type of fragments we extract from the training corpus. We provide some examples of the types of constructions that occur most often in the treebank, and quantify the performance of our grammar on the constituency parsing task.

Keywords: Data-Oriented Parsing, tree substitution grammar, statistical model, fragments, kernel methods

1 Introduction

The Data-Oriented Parsing (DOP) framework, proposed in [1] and developed in [2], has become one of the most successful methods in constituency parsing (cf. [3], [4]). The main idea behind this methodology is to extract as many fragments as possible from the training corpus and to recombine them via a probabilistic generative model in order to parse novel sentences. In the current EVALITA '09 task we aim at simplifying the original DOP methodology by constraining the number of fragments we extract from the training corpus. In particular, we keep only those fragments which occur at least twice in the training data. The main motivation behind this choice is to keep in our grammar only those fragments for which there is empirical evidence of their reusability.

1.1 Data-Oriented Parsing

A DOP grammar can be described as a collection T of fragments. Figure 1 shows an example of four fragments extracted from the training parse tree depicted in Figure 2, belonging to the TUT¹ training corpus. Fragments are defined in such a way that every node is either a non-terminal leaf (with no daughters) or has exactly the same daughters as in the original tree. Two elementary trees α and β can be combined by means of the substitution operation, α ∘ β, iff the root of β has the same label as the leftmost nonterminal leaf of α. The result of this operation is a unified fragment which corresponds to α with its leftmost nonterminal leaf replaced by the entire fragment β. The substitution operation can be applied iteratively: α ∘ β ∘ γ = (α ∘ β) ∘ γ.

¹ Turin University Treebank: http://www.di.unito.it/~tutreeb, see also [5].
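The leftmost-substitution operation just described can be sketched as follows. This is a minimal illustration: the `Node` class, the bracket notation, and the convention that uppercase labels mark nonterminals are our own assumptions, not part of the paper's implementation.

```python
class Node:
    """A parse-tree node; a childless nonterminal is a substitution site."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

    def is_site(self):
        # Illustrative convention: nonterminal labels are uppercase,
        # lexical items lowercase; a site is a nonterminal leaf.
        return not self.children and self.label.isupper()

    def __str__(self):
        if not self.children:
            return self.label
        return "(" + self.label + " " + " ".join(str(c) for c in self.children) + ")"


def substitute(alpha, beta):
    """alpha o beta: replace the leftmost nonterminal leaf of alpha
    with beta, provided beta's root carries the same label."""
    def leftmost_site(node):
        for i, child in enumerate(node.children):
            if child.is_site():
                return node, i
            found = leftmost_site(child)
            if found is not None:
                return found
        return None

    site = leftmost_site(alpha)
    assert site is not None, "alpha has no substitution site"
    parent, i = site
    assert parent.children[i].label == beta.label, "root label mismatch"
    parent.children[i] = beta
    return alpha
```

For example, combining (S NP (VP sleeps)) with (NP john) yields (S (NP john) (VP sleeps)); iterated application of `substitute` corresponds to the left-associative α ∘ β ∘ γ above.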

Fig. 1: Example of elementary trees of depth 4, 3, 3, and 2.

Fig. 2: Parse tree of the sentence "Ogni mezzo di prova è ammesso" (Every piece of evidence is admitted).

When the tree resulting from a series of substitution operations is a complete parse tree, i.e. all its leaf nodes are lexical nodes, we define the sequence of the elementary trees used in the operations as a derivation of the complete parse tree. Considering the 4 elementary trees in Figure 1, τ1 ∘ τ2 ∘ τ3 ∘ τ4 constitutes a possible derivation of the complete parse tree of Figure 2. A stochastic instantiation of this grammar can be defined as follows: for every τ ∈ T, the probability of using τ in a substitution operation is defined as P(τ) = f(τ, T) / f(root(τ), T), where the numerator returns the frequency of τ in T, and the denominator the number of fragments in T having root(τ) as root node. If a derivation d consists of n elementary trees τ1 ∘ τ2 ∘ ... ∘ τn, the probability of the derivation is calculated as P(d) = ∏_{i=1}^{n} P(τi). Given multiple derivations d1, d2, ..., dm for the same parse tree t, the probability of t is defined as P(t) = ∑_{i=1}^{m} P(di).
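The probability model above can be computed directly once fragment frequencies are known. The following is a toy sketch: the fragment inventory and its bracket-string encoding are invented for illustration.

```python
from collections import Counter
from math import prod

# Toy fragment inventory T: (root label, fragment) -> frequency f(tau, T).
fragments = Counter({
    ("S", "(S NP VP)"): 4,
    ("S", "(S (NP john) VP)"): 2,
    ("NP", "(NP john)"): 3,
    ("VP", "(VP sleeps)"): 3,
})

# f(root(tau), T): total frequency of fragments sharing tau's root label.
root_totals = Counter()
for (root, _), freq in fragments.items():
    root_totals[root] += freq

def p_fragment(tau):
    """P(tau) = f(tau, T) / f(root(tau), T)."""
    return fragments[tau] / root_totals[tau[0]]

def p_derivation(d):
    """P(d) = product of P(tau_i) over the elementary trees in d."""
    return prod(p_fragment(tau) for tau in d)

def p_tree(derivations):
    """P(t) = sum of P(d_i) over all derivations d_1 ... d_m yielding t."""
    return sum(p_derivation(d) for d in derivations)
```

With this toy inventory the tree (S (NP john) (VP sleeps)) has two derivations, with probabilities 4/6 · 3/3 · 3/3 = 2/3 and 2/6 · 3/3 = 1/3, so P(t) = 1.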

2 Implementation

In order to build our DOP grammar we have extracted all the fragments occurring two or more times in the 2,200 training structures², using an algorithm analogous to the one presented in [6]. In Figure 3 we show the distribution of the frequencies of the extracted fragments with respect to their depths. In Figure 4 we report the most common fragments containing the verb è (is), which can be seen as a collection of its main valency structures appearing in the annotated data. In addition to these fragments we have added to our grammar all CFG rules that occur exactly once in the training corpus (9,497 rules). We have converted the DOP grammar to an isomorphic CFG (more details in [7]) and used the BitPar parser [8] to parse the 200 sentences in the test set. For every test sentence we have approximated the most probable parse tree by taking the 1,000 most probable derivations, summing the probabilities of those yielding the same parse tree, and selecting the most probable one.

3 Results

Table 1 shows a summary of the parsing results of our system, which achieves 75.79% labeled F-score. More detailed analyses of the results are given in Figure 6, where we show the accuracy within each single label: all the main categories achieve an accuracy in line with the overall score of the system. Further investigations, presented in Figure 5 and Figure 7, suggest that the majority of parsing errors are due to crossing brackets among these main categories; wrongly labeled constituents are in fact a minor source of error.

4 Conclusions

We have presented a simplified DOP formalism for learning the constituency structure of Italian sentences. As in previous work (cf. [7], [9], [10]), the main motivation was to build a grammar based on those structures which are linguistically relevant, in this case those for which there is some empirical evidence of their reusability. The results are poor relative to the same methodology applied to English treebanks. One of the main reasons is certainly the smaller size of the training corpus used in the current shared task: as with other types of exemplar-based learning techniques, DOP models require a large amount of data in order to achieve high accuracy. We nevertheless believe that a few more steps could help improve results within the same framework, in particular the use of proper smoothing techniques over the fragments, as in [11], and an investigation of different probability distributions.

Acknowledgments. We gratefully acknowledge funding by the Netherlands Organization for Scientific Research (NWO): the author is funded through a Vici grant "Integrating Cognition" (277-70-006) to Rens Bod.

² We have removed all empty nodes, traces, and functional labels from the corpus.
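The tree-selection step described in Section 2 (summing derivation probabilities per tree over an n-best list, then picking the maximum) could be sketched as follows. The (probability, tree-string) pair format is an illustrative assumption, not the actual BitPar output format.

```python
from collections import defaultdict

def most_probable_parse(nbest):
    """Approximate the most probable parse tree from an n-best list of
    (derivation probability, parse tree) pairs: sum the probabilities
    of derivations yielding the same tree and return the tree with the
    highest total, together with that total."""
    totals = defaultdict(float)
    for prob, tree in nbest:
        totals[tree] += prob
    best = max(totals, key=totals.get)
    return best, totals[best]
```

For example, given derivations [(0.4, "t1"), (0.25, "t2"), (0.2, "t2")], tree t2 wins with total probability 0.45 even though t1 has the single most probable derivation; this is why summing over derivations differs from simply taking the best derivation.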

Depth   Types   Tokens
1       3364    8874
2       5818    72718
3       8768    6651
4       8328    37745
5       4795    1647
6       1839    535
7       560     1343
8       118     248
9       29      61
10      6       13
11      2       4
13      1       2
14      1       2

Fig. 3: Distribution of the frequency of the extracted 33,629 fragments with respect to their depths. All these fragments occur at least two times in the training corpus.

Fig. 4: The most frequent fragments in the grammar containing the verb è (is), when it is a main verb (VMA~RE) and not an auxiliary (VAU~RE).

Table 1: Summary of the parsing evaluation results.

Bracketing Labeled Recall       78.53%
Bracketing Labeled Precision    73.24%
Bracketing Labeled F-score      75.79%
Complete match                  2%
Average crossing                2.47
No crossing                     42.5%
2 or less crossing              66%

Fig. 5: Frequencies of crossing brackets (number of constituents in the gold tree that cross one constituent in the parsed tree).

Fig. 6: F-score of the 8 most frequent categories in the corpus.

Fig. 7: Frequencies of constituents in the parsed tree with correct spans but wrong labels.

References

1. Scha, R.: Taaltheorie en taaltechnologie; competence en performance. In: de Kort, R., Leerdam, G. (eds.) Computertoepassingen in de Neerlandistiek, pp. 7-22. LVVN, Almere, the Netherlands (1990). English translation at http://iaaa.nl/rs/leerdame.html
2. Bod, R.: A computational model of language performance: Data-oriented parsing. In: Proceedings of COLING 1992, pp. 855-859 (1992)
3. Bod, R.: What is the minimal set of fragments that achieves maximal parse accuracy? In: Proceedings of ACL 2001, pp. 66-73 (2001)
4. Bod, R., Sima'an, K., Scha, R.: Data-Oriented Parsing. University of Chicago Press, Chicago, IL, USA (2003)
5. Lesmo, L., Lombardo, V., Bosco, C.: Treebank development: the TUT approach. In: Proceedings of the International Conference on Natural Language Processing, pp. 61-70. Vikas Publishing House (2002)
6. Collins, M., Duffy, N.: Convolution kernels for natural language. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.) NIPS, pp. 625-632. MIT Press (2001)
7. Sangati, F.: Towards simpler tree substitution grammars. MSc thesis (2007)
8. Schmid, H.: Efficient parsing of highly ambiguous context-free grammars with bit vectors. In: Proceedings of COLING 2004, pp. 162-168. Geneva, Switzerland (2004)
9. Sangati, F., Zuidema, W.: Unsupervised methods for head assignments. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pp. 701-709. Association for Computational Linguistics (2009)
10. Zuidema, W.: Parsimonious data-oriented parsing. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 551-560. Association for Computational Linguistics (2007)
11. Bod, R.: Two questions about data-oriented parsing. In: Proceedings of the Fourth Workshop on Very Large Corpora (1996)