Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]


Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]
Jakub Waszczuk, Agata Savary

To cite this version: Jakub Waszczuk, Agata Savary. Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]. PARSEME 6th general meeting, Apr 2016, Struga, Macedonia.

HAL Id: hal-0505052
https://hal.archives-ouvertes.fr/hal-0505052
Submitted in April 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]

Jakub Waszczuk, Agata Savary
Université François Rabelais Tours, Laboratoire d'informatique, France
first.last@univ-tours.fr

Natural language parsing is known to potentially produce a high number of syntactic interpretations for a sentence. Some of them may contain multiword expressions (MWEs), and reaching such interpretations faster than their compositional alternatives has proved efficient in symbolic parsing (see below). We propose to apply this strategy to symbolic LTAG (Lexicalized Tree Adjoining Grammar) parsing, using an architecture adaptable to probabilistic parsing.

We are particularly interested in LTAGs because, according to (Abeillé and Schabes 1989), they show several advantages with respect to parsing MWEs. Firstly, unification constraints on feature structures attached to tree nodes allow one to naturally express dependencies between arguments at different depths in the elementary trees (as in NP0 vider DET sac 'to express one's secret thoughts', lit. 'to empty one's bag', where the determiner DET embedded in the direct object must agree in person and number with the subject NP0). Secondly, the so-called extended domain of locality offers a natural framework for representing two different kinds of discontinuities. Namely, discontinuities coming from the internal structure of a MWE are directly visible in elementary trees and are handled in parsing mostly by substitution. Discontinuities coming from the insertion of modifiers (e.g. a bunch of N vs. a whole bunch of N) are invisible in elementary trees but are handled in parsing by adjunction.

Consider the sentence in example (1).

(1) Acid rains in Ghana are equally grim.

When it is scanned by a left-to-right parser, two competing interpretations are syntactically valid for the first 4 words. One of them considers rains as a verb whose subject is acid while, according to the other, rains is the head noun of the compound acid rains.
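The two discontinuity mechanisms just described can be made concrete with a toy sketch. The tree encoding and the adjoin helper below are our own illustration, not the paper's implementation: the MWE a bunch of N lives in a single elementary tree with an open substitution slot (written "N@" here), while a modifier such as whole is invisible in that tree and enters by adjunction at an internal node.

```python
# Toy nested-tuple trees: (label, children); a child is either a
# subtree or a terminal string; "N@" marks an open substitution slot.
# (Illustrative encoding only; not the paper's data structures.)
mwe_tree = ("NP", ["a", ("Nbar", ["bunch", ("PP", ["of", "N@"])])])

def adjoin(tree, label, modifier):
    """Insert `modifier` by wrapping each node labelled `label`,
    mimicking adjunction of an auxiliary tree whose foot node
    resumes the original subtree."""
    if isinstance(tree, str):
        return tree
    lab, children = tree
    if lab == label:
        return (lab, [modifier, (lab, children)])
    return (lab, [adjoin(c, label, modifier) for c in children])

def leaves(tree):
    """Left-to-right terminal yield of a tree."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]

# The MWE's internal structure is directly visible in its tree:
assert leaves(mwe_tree) == ["a", "bunch", "of", "N@"]
# The modifier is inserted by adjunction, not listed in the MWE tree:
assert leaves(adjoin(mwe_tree, "Nbar", "whole")) == ["a", "whole", "bunch", "of", "N@"]
```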
Our objective is to propose a parsing strategy which would promote the latter interpretation due to the fact that it contains a known MWE. More precisely, the parser should: (i) trivially, admit only grammar-compliant analyses of a sentence, (ii) reach MWE-oriented interpretations more rapidly than potential compositional interpretations, (iii) eliminate no grammar-compliant interpretations.

Note that all these conditions could rather easily be met for sentence (1) in a pre-processing-based approach, in which potential MWEs are identified prior to parsing and conflated into words-with-spaces tokens. Such an approach might however lead to a parsing failure in the case of sentence (2), if the two initial tokens are wrongly merged into a nominal compound in the pre-parsing step. In order to avoid errors of this kind, MWE identification and parsing should be performed jointly.

(2) Hunger strikes the civilians since 2010.

Seminal works, such as (Finkel and Manning 2009, Green et al. 2011, 2013, Constant et al. 2013), show that the results of probabilistic MWE identification and/or parsing are improved when both tasks are performed simultaneously. (Wehrli et al. 2010) point out that such an improvement (also within further parsing-based applications, e.g. machine translation) occurs in symbolic parsing (here: in a Chomskyan grammar-based approach) when the knowledge about a potential occurrence of MWEs guides the parsing process. Our goal is to apply a strategy similar to the one of (Wehrli et al. 2010), i.e. to systematically promote MWE-oriented interpretations, within LTAG parsing. (The parsing algorithm should of course abstract away from the way the input LTAG grammar was obtained: manually crafted, generated from a metagrammar, or learned from a treebank.) We additionally wish to design the parser architecture in such a way that corpus-based probabilities about MWE contexts can be

easily injected into it as soon as they are available (we have performed no experiments to obtain them yet).

Note that promoting MWEs will of course be inaccurate for sentence (2). However: (i) the correct interpretation will not be discarded (it will simply be explored later than the MWE-oriented one), (ii) (Wehrli et al. 2010) shows that giving high priority to certain types of MWEs in parsing is a good strategy on average.

LTAG with weighted terminals

Our parser relies on a particular LTAG grammar representation in which each elementary LTAG tree is converted into a set of flat production rules, similarly to (Alonso et al. 1999). (The proposals from the following section apply, though, also to the standard LTAG grammar format.) Fig. 1 illustrates this conversion on a set of 3 elementary trees. (For the sake of simplicity, we only present initial trees and ignore auxiliary trees in this abstract. Our algorithm, however, does take auxiliary trees as well as the adjunction operation into account.)

[Figure 1: A toy LTAG grammar and its conversion into flat rules. The three initial trees (the noun acid, the verb rains, and the MWE acid rains) yield flat rules such as NP → N0, N0 → acid, S → NP VP, VP → V2, V2 → rains, N3 → acid, N4 → rains and NP → N3 N4; in the MWE tree, the terminals acid and rains each carry weight 0.5.]

Note that the non-terminal N, occurring 3 times in this grammar, is represented by 3 different non-terminals, N0, N3 and N4, in the target rules. (We do not present the conversion process in detail here; it includes a compression stage based on common-subtree sharing and on representing the flat rules via a finite-state automaton.) This distinction is necessary in order to prevent non-compatible subtree combinations. For instance, we should not admit an N-compound rains acid, which would be admitted if the two terminals from the 3rd tree were not distinguished in the resulting production rules.

We admit a version of the grammar in which each elementary tree has the same weight (equal to 1), i.e. the same probability of being used in parsing a sentence.
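The conversion just described can be sketched as follows. This is a deliberate simplification under assumed encodings: trees are nested tuples, only preterminal nodes (those directly dominating a terminal) receive fresh indices, the subtree-sharing compression stage is skipped, and the resulting indices differ from those in Fig. 1.

```python
from itertools import count

# Global index supply, so that preterminals from different elementary
# trees can never be confused with one another.
counter = count()

def flatten(tree, rules):
    """Convert a nested-tuple elementary tree into flat rules.

    A tree is (label, children); a child is a subtree or a terminal
    string; a subtree with no children is an open substitution slot.
    """
    label, children = tree
    body = []
    for child in children:
        if isinstance(child, str):        # terminal leaf
            body.append(child)
        elif child[1]:                    # internal node: recurse
            body.append(flatten(child, rules))
        else:                             # substitution slot, e.g. NP
            body.append(child[0])
    if any(isinstance(c, str) for c in children):
        label = f"{label}{next(counter)}" # index preterminals: N -> N0, ...
    rules.append((label, tuple(body)))
    return label

rules = []
trees = [
    ("NP", [("N", ["acid"])]),                        # acid (noun)
    ("S", [("NP", []), ("VP", [("V", ["rains"])])]),  # rains (verb)
    ("NP", [("N", ["acid"]), ("N", ["rains"])]),      # MWE: acid rains
]
for t in trees:
    flatten(t, rules)

assert ("N0", ("acid",)) in rules       # indexed preterminal for the 1st tree
assert ("NP", ("N2", "N3")) in rules    # the MWE tree keeps its own indices
```

Because the two terminals of the MWE tree get indices of their own (N2 and N3 here), no rule combination can produce the spurious compound rains acid.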
This weight is then distributed equally over all terminal nodes occurring in the tree. Here, the terminal nodes acid and rains have weight 1 in each of the first two trees, while they have weight 0.5 in the 3rd tree.

Parsing as a hypergraph

We propose an Earley-style parsing algorithm for LTAGs inspired by (Klein and Manning 2001). The parsing process is represented here as a hypergraph (Gallo et al. 1993) whose nodes are parsing chart states, and whose hyperarcs represent applications of inference rules, i.e. combinations of previous chart states resulting in new states. The appendix shows a fragment of the hypergraph created while parsing the two initial words of sentence (1) with the grammar from Fig. 1. For instance, the hyperarc leading from the initial state (N3 → • acid, 0, 0) to the state (N3 → acid •, 0, 1) indicates that the terminal acid has been recognized over the sentence span from position 0 to 1. The latter state can then be combined with the state (NP → • N3 N4, 0, 0), yielding a new state (NP → N3 • N4, 0, 1), and so on. The whole sentence is successfully parsed if a state has been reached whose underlying rule has the S symbol in its head and the dot at the end of its body, and whose span goes from 0 to the length of the sentence.

Note that some hyperarcs, namely those corresponding to scanning a symbol from the input, are weighted with the values stemming from the corresponding terminal nodes in the grammar. For instance, the hyperarc from (N0 → • acid, 0, 0) to (N0 → acid •, 0, 1) has weight 1, since its underlying rule N0 → acid stems from the 1st tree in Fig. 1, while the hyperarc from (N3 → • acid, 0, 0) to (N3 → acid •, 0, 1) has weight 0.5, since its rule stems from the 3rd tree. The cost of a parse is then defined as the sum of the weights of all traversed hyperarcs. Here, the hyperpath corresponding to the idiomatic interpretation of acid rains (highlighted in bold in the appendix figure) has cost 1, while the interpretation assuming that rains is a verb has cost 2.
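The weight distribution and the two resulting parse costs can be checked numerically; the dictionary encoding below is our own illustration, not the paper's code.

```python
# Each elementary tree carries total weight 1, spread equally over
# its terminal anchors (illustrative encoding only).
trees = {
    "acid_noun":  ["acid"],
    "rains_verb": ["rains"],
    "acid_rains": ["acid", "rains"],   # the MWE tree
}
terminal_weight = {name: 1.0 / len(terms) for name, terms in trees.items()}

# The cost of a parse is the sum of the scan weights along its hyperpath.
idiomatic_cost = sum(terminal_weight["acid_rains"] for _ in trees["acid_rains"])
compositional_cost = terminal_weight["acid_noun"] + terminal_weight["rains_verb"]

assert idiomatic_cost == 1.0       # the MWE reading of "acid rains"
assert compositional_cost == 2.0   # the verbal reading
```

The MWE reading is cheaper precisely because the unit weight of the MWE tree is split over its two anchors.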
Thus, promoting MWE-oriented interpretations boils down to finding minimum-cost hyperpaths in the parsing hypergraph. Recall that we also wish to find such interpretations earlier than their compositional alternatives. We think that this problem could be solved by

an A*-style algorithm, similarly to (Lewis and Steedman 2014) for CCG parsing. The A* algorithm is based on a heuristic which estimates the distance that separates a given node from the target node. This distance estimation must never overestimate. We propose an estimation function h based precisely on the potential occurrence of MWEs in the part of the sentence that remains to be parsed. It assumes that each remaining word will be scanned with the grammar terminal carrying the lowest possible weight, thus providing a lower bound on the remaining parsing cost. For example, the value of h(N0 → acid •, 0, 1) is 0.5, because the remaining part, rains (assuming that acid rains is all there is to parse), cannot be scanned more cheaply than at cost 0.5. The total estimated cost of this state is thus equal to 1.5; therefore, it will not be visited before the state (S → NP • VP, 0, 2), which represents the optimal-cost interpretation of acid rains, is reached. Note that the more terminals a grammar tree contains, the lower the weights assigned to these terminals. Thus, this strategy truly promotes MWE-oriented interpretations.

Formally, the remaining-cost estimation for a state (q, i, j) depends only on its span (i, j):

    h(q, i, j) = Σ_{k ∈ {1,…,i} ∪ {j+1,…,|s|}} w(k)

    w(k) = min{ weight(r, l) : r ∈ F(G), l ∈ {1,…,|r|}, r_l = s_k }

where s is the input sentence, s_k is its k-th word (counting from 1), G is a TAG, F(G) is G converted to the set of flat rules, |r| is the length of r's body, r_l is its l-th body element, and weight(r, l) is the weight assigned to the l-th body element of r.

The perspectives of this work include proving the correctness of our MWE-based heuristic in A*, and providing experimental results for the parser. In the long run, the weights assigned to grammar trees might be enhanced with probabilities acquired from a corpus, which would result in a probabilistic MWE-prone parser for LTAGs.

References

Abeillé, A. and Schabes, Y. (1989). Parsing idioms in lexicalized TAGs. In H. L. Somers and M. M.
Wood, eds., Proceedings of the 4th Conference of the European Chapter of the ACL, EACL '89, Manchester, pp. 1-9. The Association for Computational Linguistics.

Alonso, M. A., Cabrero, D., de la Clergerie, E. V., and Ferro, M. V. (1999). Tabular algorithms for TAG parsing. In EACL 1999, 9th Conference of the European Chapter of the Association for Computational Linguistics, June 8-12, 1999, University of Bergen, Bergen, Norway, pp. 150-157. The Association for Computational Linguistics.

Constant, M., Roux, J. L., and Sigogne, A. (2013). Combining compound recognition and PCFG-LA parsing with word lattices and conditional random fields. ACM Trans. Speech Lang. Process., 10(3), 8:1-8:24.

Finkel, J. R. and Manning, C. D. (2009). Joint Parsing and Named Entity Recognition. In HLT-NAACL, pp. 326-334. The Association for Computational Linguistics.

Gallo, G., Longo, G., Pallottino, S., and Nguyen, S. (1993). Directed hypergraphs and applications. Discrete Appl. Math., 42(2-3), 177-201.

Green, S., de Marneffe, M.-C., Bauer, J., and Manning, C. D. (2011). Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French. In EMNLP, pp. 725-735. ACL.

Green, S., de Marneffe, M.-C., and Manning, C. D. (2013). Parsing Models for Identifying Multiword Expressions. Computational Linguistics, 39(1), 195-227.

Klein, D. and Manning, C. D. (2001). Parsing and hypergraphs. In Proceedings of the Seventh International Workshop on Parsing Technologies (IWPT-2001), 17-19 October 2001, Beijing, China. Tsinghua University Press.

Lewis, M. and Steedman, M. (2014). A* CCG Parsing with a Supertag-factored Model. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 990-1000. Association for Computational Linguistics.

Wehrli, E., Seretan, V., and Nerima, L. (2010). Sentence analysis and collocation identification. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), pp. 27-35, Beijing, China. Association for Computational Linguistics.

Appendix A. Chart parsing of the substring acid rains represented as a hypergraph
(linearized reconstruction of the original figure; • marks the parsing dot, states are written (dotted rule, i, j), and scan hyperarcs carry their weights)

Idiomatic (MWE) hyperpath, minimal cost of reaching (S → NP • VP, 0, 2) = 1:
  (N3 → • acid, 0, 0)  =scan acid, weight 0.5=>  (N3 → acid •, 0, 1)
  (NP → • N3 N4, 0, 0) + (N3 → acid •, 0, 1)  =>  (NP → N3 • N4, 0, 1)
  (N4 → • rains, 1, 1)  =scan rains, weight 0.5=>  (N4 → rains •, 1, 2)
  (NP → N3 • N4, 0, 1) + (N4 → rains •, 1, 2)  =>  (NP → N3 N4 •, 0, 2)
  (S → • NP VP, 0, 0) + (NP → N3 N4 •, 0, 2)  =>  (S → NP • VP, 0, 2)

Compositional (verbal) hyperpath, minimal cost of reaching (S → NP VP •, 0, 2) = 2:
  (N0 → • acid, 0, 0)  =scan acid, weight 1=>  (N0 → acid •, 0, 1)
  (NP → • N0, 0, 0) + (N0 → acid •, 0, 1)  =>  (NP → N0 •, 0, 1)
  (S → • NP VP, 0, 0) + (NP → N0 •, 0, 1)  =>  (S → NP • VP, 0, 1)
  (V2 → • rains, 1, 1)  =scan rains, weight 1=>  (V2 → rains •, 1, 2)
  (VP → • V2, 1, 1) + (V2 → rains •, 1, 2)  =>  (VP → V2 •, 1, 2)
  (S → NP • VP, 0, 1) + (VP → V2 •, 1, 2)  =>  (S → NP VP •, 0, 2)
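As a complement to the appendix, the remaining-cost estimate h defined in the body of the paper can be sketched in code. The flat-rule encoding below mirrors the toy grammar of Fig. 1 with per-terminal weights; the representation itself is our own illustration, not the paper's implementation.

```python
# Flat rules as (head, body, per-element weights); only terminal body
# elements matter for scanning, and here every body is one terminal.
# (Illustrative encoding mirroring the toy grammar of Fig. 1.)
RULES = [
    ("N0", ("acid",),  (1.0,)),   # from the single-anchor noun tree
    ("V2", ("rains",), (1.0,)),   # from the single-anchor verb tree
    ("N3", ("acid",),  (0.5,)),   # from the two-anchor MWE tree
    ("N4", ("rains",), (0.5,)),
]

def w(sentence, k):
    """Cheapest weight with which the k-th word (1-based) can be scanned."""
    word = sentence[k - 1]
    return min(wt for _, body, wts in RULES
               for elem, wt in zip(body, wts) if elem == word)

def h(sentence, i, j):
    """Lower bound on the cost of parsing everything outside span (i, j)."""
    outside = list(range(1, i + 1)) + list(range(j + 1, len(sentence) + 1))
    return sum(w(sentence, k) for k in outside)

s = ["acid", "rains"]
assert h(s, 0, 1) == 0.5   # only "rains" remains; the MWE tree scans it cheapest
assert h(s, 0, 2) == 0.0   # nothing remains to be parsed
```

Since w(k) takes the minimum over all rules that could scan word k, the sum can never overestimate the true remaining cost, which is the admissibility condition A* requires.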