Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities


Yoav Goldberg, Reut Tsarfaty, Meni Adler, Michael Elhadad
Ben Gurion University / University of Amsterdam
EACL 2009, Athens

What we do: Unlexicalized Hebrew Parsing

Parsing with PCFGs
Basic stuff you probably already know.
Learning: start with a treebank, extract a grammar, and assign probabilities to the rules:
S → NP VP (0.2)    NP → DT NN (0.04)    VP → VB NP (0.5)    ...
DT → the (0.1)    NN → cat (0.002)    NN → cake (0.005)    NN → dog (0.003)
VB → ate (0.08)    VB → kicked (0.09)
Inference: standard CKY.
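To make the learning step concrete, here is a minimal sketch in Python; the nested-tuple tree format and the toy treebank are assumptions for illustration, not the authors' implementation:

```python
from collections import defaultdict

def extract_rules(tree, counts):
    """Collect CFG rules from a nested-tuple tree (label, child, ...),
    where bare strings are terminals."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, counts)

def estimate_pcfg(treebank):
    """Relative-frequency estimates: p(LHS -> RHS) = count(LHS -> RHS) / count(LHS)."""
    counts = defaultdict(int)
    for tree in treebank:
        extract_rules(tree, counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Toy treebank mirroring the slide's rules.
treebank = [("S", ("NP", ("DT", "the"), ("NN", "cat")),
                  ("VP", ("VB", "ate"), ("NP", ("DT", "the"), ("NN", "cake"))))]
for rule, p in sorted(estimate_pcfg(treebank).items()):
    print(rule, round(p, 3))
```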

Parsing with PCFGs: two kinds of rules.
Syntactic rules: a finite (small) set of symbols; relative-frequency estimates plus some smoothing work fine.
Lexical rules: a huge set of terminal symbols; problems with rare events, sparsity, and overfitting. These are the focus of this work.

A piece of Hebrew, in (mostly) English words.
Affixation: and, from, to, the, which, as, in are prefixes; possessives are suffixed to nouns. "In her net" → "inhernet".
Unvocalized writing system: most vowels are dropped in writing. "in her net" → "inhernet" → "inhrnt".
Rich morphology: is "inhrnt" in her net? in her note? in her night? inherent? And "inherent" could be inflected into different forms according to singular/plural and masculine/feminine properties: inhrnt, inhrnti, inhrntit, inhrntiot, inhrntim.
Especially complex verb morphology: root + template morphology for verbs, e.g. ktb → ktb, mktyb, ywktb, htktb, kwtb, yktwb, ykwtb, ...

Tying it together... the situation in Hebrew.
Complex, productive morphology: many word forms (487K distinct tokens in a 34M-word corpus).
High level of ambiguity: 2.7 tags/token, vs. 1.4 in English.
POS tags carry a lot of information: gender, number, tense, possessiveness, status, ...
Which means: a treebank-derived lexicon is inadequate. Coverage is low, many events are unseen, and it is hard to guess the POS of unknown words.

But first... some baseline parsing performance.

Our parsing setup.
Data: Hebrew Treebank V2 (~6,000 sentences).
Syntactic rules (Goldberg and Tsarfaty 2008): parent annotation and linguistically motivated state splits; rule probabilities p(X → Y) are relative-frequency estimates (unsmoothed).
Lexical rules: stable lexical items (seen ≥ K times in the treebank) get the fixed estimate p(word | tag) = p_rf(word | tag); the treatment of rare/unseen items (seen < K times) varies, and is the subject of the rest of this talk.

Is the low coverage of the TB lexicon really a problem?
Easy baseline, assuming a segmentation oracle. Input sentence: inhrnt. Parser sees: in hr nt.
Model: rare/unknown items are replaced with a RARE token, and the lexical probability for a rare word backs off to the distribution over rare words:
p(word | tag) = p_rf(RARE | tag) if word is rare, and p_rf(word | tag) otherwise.
Result: 72.24 F (evalb score).
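A minimal sketch of this baseline's lexical model; the threshold value, token name, and data layout are illustrative assumptions, not taken from the paper:

```python
from collections import Counter, defaultdict

K = 2          # assumed stability threshold
RARE = "*RARE*"

def train_lexical_model(tagged_corpus):
    """tagged_corpus: list of (word, tag) pairs from the treebank."""
    word_counts = Counter(w for w, _ in tagged_corpus)
    emit = defaultdict(Counter)                  # tag -> word or RARE -> count
    for w, t in tagged_corpus:
        emit[t][w if word_counts[w] >= K else RARE] += 1
    probs = {t: {w: n / sum(c.values()) for w, n in c.items()}
             for t, c in emit.items()}
    return word_counts, probs

def p_word_given_tag(word, tag, word_counts, probs):
    """p(word | tag) = p_rf(RARE | tag) for rare/unseen words, else p_rf(word | tag)."""
    key = word if word_counts.get(word, 0) >= K else RARE
    return probs.get(tag, {}).get(key, 0.0)
```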

Is the low coverage of the TB lexicon really a problem?
Realistic baseline, no oracles. Input sentence: inhrnt. Parser sees: inhrnt.
Model: that of Goldberg and Tsarfaty (2008): a lattice parser over a non-trivial treebank-based morphological analyzer, extended with a spellchecker wordlist (for details, see the paper).
Results: 72.24 F with the segmentation oracle (evalb score) vs. 67.02 F without (generalized evalb score).

What can we do? Look outside of the treebank.
A dictionary-based morphological analyzer (developed and maintained by the Knowledge Center for Processing Hebrew) maps word forms to their possible analyses, e.g.:
כתבתי → Noun (f,s) + gen/1st suffix, or Verb (s,1st,past,paal)
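Functionally, the analyzer is a wide-coverage map from surface forms to candidate analyses. A toy stand-in (entries and feature strings invented here; the real resource covers hundreds of thousands of forms plus prefixation rules and unknown-word heuristics):

```python
# Toy stand-in for the wide-coverage analyzer: surface form -> possible analyses.
ANALYSES = {
    "ktbti": [("Noun", "f,s,+gen/1st"),      # a possible nominal reading
              ("Verb", "s,1st,past,paal")],  # a possible verbal reading
}

def analyze(word):
    """Possible dictionary analyses for a form (empty if unknown)."""
    return ANALYSES.get(word, [])
```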

Treebank vs. Dictionary.
Treebank (low lexical coverage): 6,219 sentences; 17,731 unique (non-affixed) word forms; 28,349 unique tokens.
Dictionary (high lexical coverage): 25K lemmas; 562,439 (non-prefixed) word forms; 73 prefixes and prefixation rules; plus a smart heuristic for unknown words (Adler et al. 2008).

Resource incompatibility.
Let's use the dictionary for rare words! But the tagsets are different...

Resource incompatibility: the treebank and the dictionary use different tagsets.
Treebank: NN NNT NNP PRP JJ JJT RB RBR MOD VB VBMD VBINF AUX AGR IN COM REL CC QW HAM WDT DT CD CDE CDT AT POS
Dictionary: Noun NounC Proper Pron Adj AdjC Adv Exist Copula Conj Pref Verb Beinoni Modal Infinitive Prep QW Det Num NumExp NumC At Pos

Some tags correspond (nearly) one-to-one:
NN ↔ Noun, NNT ↔ NounC, NNP ↔ Proper, AT ↔ At, ..., POS ↔ Pos

Others do not: treebank RB JJ MOD VB AUX IN COM REL AGR CC vs. dictionary Adj Adv Exist Cop Conj Pref Verb Beinoni Prep is a many-to-many correspondence.

What causes the treebank/dictionary incompatibility? Differences in annotation perspectives.
Syntactic annotation scheme: if a word modifies a verb and can be replaced with an adverb, it's an adverb.
Lexicographic guidelines: if a word can have this inflection, it can be a verb.

Conversion? Retag the treebank with the dictionary tagset?
A lesson from Arabic: the Arabic TB was originally constructed with lexicon-based tags, which hurt parser performance; switching to more syntactic tags improved results by 2 F-points (Maamouri et al. 2008).

And in Hebrew? We retagged the treebank (90% automatically, 10% manually) and ran a gold-morphology oracle experiment. Input sentence: inhrnt. Parser sees: IN PRP-f,p NN-f,s.
Result: 83.29 F with the original TB tags vs. 81.29 F with the dictionary tags. Retagging hurt parser performance.

Notice that the grammar is the same throughout: gold morphology 83.29, gold segmentation 72.24, full ambiguity 67.02.
Morphology is informative! Morphology is ambiguous! Morphology is hard!

Fuzzy map.
Retagging the treebank with the dictionary tagset hurt parser performance. We would like to keep the syntactic hints of the TB tagging and still benefit from the large coverage of the dictionary.
Probabilistic fuzzy mapping takes the best of both worlds: define a probabilistic mapping function between the tagsets, p(T_Dict | T_TB). (Sometimes, for example, demonstrative pronouns function as adjectives.)

Layered trees.
The fuzzy map gives rise to a simple generative process: T_TB → T_Dict → word.

[Figure: a TB tree plus a Dict tree combine into a layered tree, with the dictionary tag as a mapping layer between each TB tag and the word. E.g. זה 'this': JJ-ZY over Pron-M-S-3-DEM; במסגרת 'inside' (= ב 'in' + מסגרת 'frame'): IN over Prep (ב) and Noun-F-S (מסגרת).]
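With the mapping layer in place, the fuzzy map can be estimated by simple counting over the layered trees' leaves. A sketch, assuming the leaves are available as (TB tag, Dict tag, word) triples and using toy counts (both are inventions for illustration):

```python
from collections import Counter, defaultdict

def estimate_fuzzy_map(layered_leaves):
    """Estimate p(T_Dict | T_TB) by relative frequency over layered-tree leaves."""
    counts = defaultdict(Counter)
    for tb_tag, dict_tag, _word in layered_leaves:
        counts[tb_tag][dict_tag] += 1
    return {tb: {d: n / sum(c.values()) for d, n in c.items()}
            for tb, c in counts.items()}

# Toy leaves: the demonstrative-pronoun-as-adjective case from the slides.
leaves = [("JJ-ZY", "Pron-M-S-3-DEM", "zh"),
          ("JJ-ZY", "Adj", "gdwl"),
          ("IN", "Prep", "b")]
print(estimate_fuzzy_map(leaves)["JJ-ZY"])   # {'Pron-M-S-3-DEM': 0.5, 'Adj': 0.5}
```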

Combining the fuzzy mapping in a parser: a new lexical model.
Stable words (seen ≥ 2 times in training) are estimated as usual: p(word | T_TB) = p_rf(word | T_TB).
Rare/unseen words: p(T_TB | word) = Σ_{T_Dict} p(T_TB | T_Dict) p(T_Dict | word).
But... what is p(T_Dict | word)?
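The rare-word combination itself is just a weighted sum. A sketch, taking a mapping table p(T_TB | T_Dict) (obtainable from the map estimated above via Bayes' rule) and a p(T_Dict | word) distribution as given; all names are illustrative:

```python
def p_tb_given_rare_word(word, tb_tags, p_tb_given_dict, p_dict_given_word):
    """p(T_TB | w) = sum over T_Dict of p(T_TB | T_Dict) * p(T_Dict | w).

    p_tb_given_dict: dict keyed by (tb_tag, dict_tag) -> probability
    p_dict_given_word: callable word -> {dict_tag: probability}
    """
    dict_dist = p_dict_given_word(word)
    return {tb: sum(p_tb_given_dict.get((tb, d), 0.0) * p
                    for d, p in dict_dist.items())
            for tb in tb_tags}
```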

Estimating p(T_Dict | w_rare): the dictionary as a filter.
Option 1, LexFilter: use the tag distribution over rare words in training, but zero out analyses incompatible with the lexicon:
p(w_rare | T_Dict) = count(RARE, T_Dict) / count(T_Dict) if T_Dict ∈ Dict(w_rare), and 0 if T_Dict ∉ Dict(w_rare).
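LexFilter in sketch form, following the piecewise definition above; `dict_analyses` stands in for the analyzer's licensed tag set for a word, and all names are invented:

```python
def lex_filter(word, rare_count_with_tag, tag_count, dict_analyses):
    """p(w_rare | T_Dict): relative frequency of the RARE event under each tag,
    zeroed out for tags the dictionary does not license for this word."""
    licensed = dict_analyses(word)
    return {t: (rare_count_with_tag.get(t, 0) / tag_count[t]
                if t in licensed else 0.0)
            for t in tag_count}
```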

Results.

Model       Dict   Segmentation Oracle   No Oracle
Baseline    –      72.24                 67.02
LexFilter   +      76.54                 68.84

Realistic performance is still low... can we do better?

Hope in the face of uncertainty

Estimating p(T_Dict | w_rare): semi-supervised estimation.
Option 2, LexProb: consider the familiar HMM tagging model,
p(t_1, ..., t_n, w_1, ..., w_n) = Π_i p(t_i | t_{i-1}, t_{i-2}) p(w_i | t_i),
which can be estimated from raw text using EM.

Dictionary + raw text → the "smart thing" (EM-HMM training) → p(t | t_{-1}, t_{-2}) and p(w | t), at > 92% tagging accuracy (Adler and Elhadad 2006; Goldberg et al. 2008).
Ignore the transition probabilities; use the learned lexical model as p(T_Dict | word).
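The slides only say the learned lexical model is "used as" p(T_Dict | word); one simple way to turn the EM-learned emission table p(w | t) into that conditional is Bayes' rule with the learned tag marginals. A sketch under that assumption:

```python
def p_dict_tag_given_word(word, emit, tag_prior):
    """p(t | w) proportional to p(w | t) * p(t), from the EM-HMM's lexical model.

    emit: dict tag -> {word: p(word | tag)};  tag_prior: dict tag -> p(tag).
    """
    scores = {t: emit.get(t, {}).get(word, 0.0) * p_t
              for t, p_t in tag_prior.items()}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()} if z > 0 else scores
```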

Results.

Model       Dict   Raw text   Segmentation Oracle   No Oracle
Baseline    –      –          72.24                 67.02
LexFilter   +      –          76.54                 68.84
LexProb     +      +          76.64                 73.69

We're happy (... at least until next year).

Take-home message.
Treebank-derived lexicons are sparse: use an external dictionary / morphological analyzer.
Tagsets may (and should) differ, and that's OK: use a fuzzy map.
Dictionaries don't provide probabilities: use semi-supervised estimation from the dictionary and raw text.