A Lemma-Based Approach to a Maximum Entropy Word Sense Disambiguation System for Dutch

Tanja Gaustad
Humanities Computing, University of Groningen, The Netherlands
tanja@let.rug.nl
www.let.rug.nl/tanja
Coling 2004

Overview
* Word Sense Disambiguation (WSD)
* Lemma-based approach
  * Dictionary-based lemmatizer for Dutch
* Maximum entropy WSD system
* Results
* Evaluation

Word Sense Disambiguation
Semantic lexical ambiguity
* is a major problem in NLP
* is largely unsolved
* arises, for example, in MT or IR
WSD is the task of attributing the correct sense(s) to words in context.
The WSD system used here is
* for Dutch
* supervised and corpus-based
* a combination of statistical classification with linguistic information

Lemma-Based Approach
Previous research built a separate classifier for each ambiguous word form, e.g. voet ('foot') and voeten ('feet').
The lemma-based approach builds a separate classifier for each ambiguous lemma, e.g. voet subsumes voet and voeten.
Advantage: all inflected forms are clustered together; the more inflection a language has, the more lemmatization will compress and generalize the data (see the sketch below).
Higher accuracy is expected with the lemma-based approach.
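As an illustration (not part of the original slides), a minimal sketch of how training instances might be pooled per lemma rather than per word form; the lemma dictionary, instance layout, and function names are hypothetical:

```python
from collections import defaultdict

# Hypothetical lemma dictionary standing in for the CELEX-based lemmatizer.
LEMMA_OF = {"voet": "voet", "voeten": "voet"}

def lemmatize(word_form: str) -> str:
    """Map an inflected word form to its lemma (fallback: the form itself)."""
    return LEMMA_OF.get(word_form, word_form)

def group_training_instances(instances):
    """Pool (word_form, features, sense) instances per lemma.

    Word-form-based training would instead key on the word form itself,
    producing one smaller training set per inflected form.
    """
    per_lemma = defaultdict(list)
    for word_form, features, sense in instances:
        per_lemma[lemmatize(word_form)].append((features, sense))
    return per_lemma

# Example: instances of 'voet' and 'voeten' end up in one training set.
data = [("voet", ["de", "linker"], "foot_body"),
        ("voeten", ["twee", "grote"], "foot_body")]
print(group_training_instances(data))  # {'voet': [... two instances ...]}
```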

Dictionary-Based Lemmatizer for Dutch
Corpora contain many different, often infrequent words.
The lemmatizer reduces all inflected forms of a word to their lemma.
Consequently, the number of different lemmas is smaller than the number of different word forms, which allows a more reliable estimation of probabilities.
An accurate and fast lemmatizer is a prerequisite for the lemma-based approach to work.
It combines a lexical database (CELEX) with finite-state automata.

Dictionary-Based Lemmatizer for Dutch II
[Pipeline diagram: dataset → dictionary lookup in CELEX (lemmas + PoS) → disambiguation by FSA → lemmatized data; word forms not found in CELEX are handled by a guessing FSA as backup strategy]
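A minimal sketch of this lookup-with-backup scheme (not the original implementation; the dictionary contents and guessing rules are placeholders standing in for CELEX and the guessing FSA):

```python
from typing import Optional

# Placeholder standing in for the CELEX lexical database:
# word form -> list of (lemma, PoS) analyses.
CELEX = {
    "voeten": [("voet", "N")],
    "loopt": [("lopen", "V")],
}

def guess_lemma(word_form: str) -> str:
    """Crude stand-in for the guessing FSA (backup strategy):
    strip a common Dutch inflectional ending, else return the form unchanged."""
    for suffix in ("en", "s", "t"):
        if word_form.endswith(suffix) and len(word_form) > len(suffix) + 2:
            return word_form[: -len(suffix)]
    return word_form

def lemmatize(word_form: str, pos: Optional[str] = None) -> str:
    analyses = CELEX.get(word_form)
    if analyses:
        # Disambiguation step: prefer the analysis whose PoS matches the tag.
        if pos is not None:
            for lemma, lemma_pos in analyses:
                if lemma_pos == pos:
                    return lemma
        return analyses[0][0]
    return guess_lemma(word_form)  # not in CELEX -> guessing FSA backup

print(lemmatize("voeten", "N"))  # 'voet' via dictionary lookup
print(lemmatize("tafels"))       # 'tafel' via the guessing backup
```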

Lemma-Based Approach II
Constructing classifiers based on lemmas instead of word forms reduces the number of classifiers.
Lemmas provide more concise and generic evidence than inflected forms (already noted by Yarowsky (1994)), so more training data is available per classifier.
E.g. all instances of one verb are clustered in a single classifier instead of several (one for each inflected form found in the data).
N.B. The Dutch SENSEVAL-2 data is ambiguous with regard to both meaning and part-of-speech (PoS).

Schematic Overview of Lemma-Based Approach
[Schematic diagram: a non-ambiguous word form (1 sense) maps directly to its sense; an ambiguous word form (X senses) is handled by the LEMMA MODEL if it corresponds to 1 lemma, and by the WORD FORM MODEL if it corresponds to X lemmas, in both cases yielding the predicted sense]

Maximum Entropy WSD System
WSD is seen as a statistical classification task.
Maximum entropy is a technique to estimate probability distributions.
Features extracted from labeled training data are used to derive constraints for the model.
The constraints characterize class-specific expectations for the distribution.
The distribution should maximize entropy while the model satisfies the constraints imposed by the training data.
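For reference (not spelled out on the slide), the resulting conditional maximum entropy model has the familiar log-linear form, where the f_i are the feature (property) functions and the λ_i their weights:

```latex
p(c \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, c)\Big),
\qquad
Z(x) = \sum_{c'} \exp\Big(\sum_i \lambda_i f_i(x, c')\Big)
```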

Maximum Entropy Classification
Examples of features:
* PoS of the ambiguous word (e.g. N, V)
* first context word to the left of the ambiguous word
* first context word to the right of the ambiguous word, etc.
Training: a weight λ_i for each feature i present in the training data is computed and stored.
Testing: for each class c, the sum of the weights λ_i of all features i found in the test instance is computed, and the class with the highest score is chosen (see the sketch below).
Gaussian priors are used for smoothing.
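A minimal sketch of this test-time decision rule, assuming the trained weights are available as a dictionary keyed by (feature, sense); the names and data layout are illustrative, not the original system's:

```python
from collections import defaultdict

def classify(active_features, classes, weights):
    """Pick the sense whose summed feature weights are highest.

    active_features: features extracted from one test instance,
                     e.g. [('pos', 'N'), ('left-1', 'de'), ('right+1', 'in')]
    weights: dict mapping (feature, sense) -> learned weight lambda_i
    """
    scores = defaultdict(float)
    for sense in classes:
        for feature in active_features:
            scores[sense] += weights.get((feature, sense), 0.0)
    return max(scores, key=scores.get)

# Illustrative weights for two senses of a hypothetical ambiguous word.
weights = {(("pos", "N"), "sense_1"): 0.8,
           (("left-1", "de"), "sense_1"): 0.3,
           (("pos", "N"), "sense_2"): 0.1}
features = [("pos", "N"), ("left-1", "de")]
print(classify(features, ["sense_1", "sense_2"], weights))  # sense_1
```

Summing the weights and taking the argmax is equivalent to choosing the class with the highest model probability, since the normalizer Z(x) is the same for every class.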

Maximum Entropy Classification II
Main advantages:
* property functions can take into account any information which might be useful for disambiguation
* dissimilar types of information can be combined into a single model for WSD
* no independence assumptions (as in e.g. a Naive Bayes algorithm) are necessary

Corpus and Building Classifiers
Dutch SENSEVAL-2 WSD data (training: 120,000 tokens, testing: 40,000 tokens)
Procedure to build classifiers (a feature-extraction sketch follows below):
* lemmatize and PoS-tag the corpus
* extract all instances for each ambiguous word form or lemma
* transform instances into feature vectors, e.g. "aarde N gat in de , zodat het aarde grond" (target lemma, PoS, context lemmas, sense label)
* build a classifier for each ambiguous word form or lemma
Settings: ±3 context lemmas (only within the same sentence), PoS, morphological information
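A minimal sketch of the kind of feature extraction the procedure describes (±3 context lemmas within the same sentence plus the PoS tag); the instance layout, tag names, and helper names are assumptions, not the original code:

```python
def extract_features(sentence, target_index, window=3):
    """Build a feature list for one occurrence of an ambiguous lemma.

    sentence: list of (lemma, pos) pairs for one lemmatized, PoS-tagged sentence
    target_index: position of the ambiguous lemma in that sentence
    """
    features = [("pos", sentence[target_index][1])]
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        position = target_index + offset
        if 0 <= position < len(sentence):  # only within the same sentence
            features.append((f"context{offset:+d}", sentence[position][0]))
    return features

# Dutch example fragment: "gat in de aarde , zodat het ..." (lemmatized, tagged)
sentence = [("gat", "N"), ("in", "P"), ("de", "Art"),
            ("aarde", "N"), (",", "Punc"), ("zodat", "Conj"), ("het", "Pron")]
print(extract_features(sentence, target_index=3))
# [('pos', 'N'), ('context-3', 'gat'), ('context-2', 'in'), ('context-1', 'de'),
#  ('context+1', ','), ('context+2', 'zodat'), ('context+3', 'het')]
```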

Results with Word Form and Lemma-Based Approach

Model                            Accuracy   # classifiers
baseline (all ambiguous words)    78.47%        953
word form classifiers             83.66%        953
lemma-based classifiers           84.15%        669

Baseline: choose the most frequent sense for each ambiguous word.
Comparison of the word form-based and the lemma-based approach: the lemma-based approach works significantly better.
Fewer classifiers need to be built with the lemma-based approach, so more training material is available per classifier.

Number of Classifiers Used During Testing

                                        lemma-based   word forms
unique ambiguous word forms                 512           512
classifiers used, based on word forms       230           410
classifiers used, based on lemmas            70             0
word forms subsumed                         208             0
word forms seen for the 1st time             74           102

Detailed Comparison of Results

Model                     Accuracy
baseline                   76.77%
word form classifiers      78.66%
lemma-based classifiers    80.39%

Comparison of the word form-based and the lemma-based approach for word forms with different classifiers only.
Clear gain from lemmatization: an error rate reduction of 8%.
Fewer classifiers, a smaller system, and more word forms classified.

Comparison of Different WSD Systems

Model                     ambiguous words   all test data
baseline                       78.5%            89.4%
word form classifiers          83.7%            92.4%
lemma-based classifiers        84.1%            92.5%
Hendrickx et al. 2002          84.0%            92.5%

The MBL system (Hendrickx et al. 2002) uses
* extensive parameter optimization per classifier
* a frequency threshold of min. 10 training instances (the frequency baseline is used for words below the threshold)
The lemma-based system scores the same without extensive per-classifier parameter optimization (better results may be possible).

Comparison of Different WSD Systems: The Impact of Deep Syntactic Information

Model                         ambiguous words   all test data
baseline                           78.5%            89.4%
word form classifiers              83.7%            92.4%
lemma-based classifiers            84.1%            92.5%
incl. syntactic information        85.7%            93.4%
Hendrickx et al. 2002              84.0%            92.5%

Evaluation and Conclusion
The system using the lemma-based approach
* is smaller
* is more robust
* has higher accuracy (best results to date)
Compared to earlier results for WSD of Dutch, the lemma-based approach performs the same while involving less work.

Smoothing with Gaussian Priors
Smoothing is essential to optimize feature weights (sparseness).
The parameters of the MaxEnt model should not become too large, which would lead to optimization problems with infinite weights.
A Gaussian prior with mean µ = 0 and variance σ² = 1000 is enforced on the distribution of the parameters.
Effects on the MaxEnt model:
* trade off some expectation-matching for smaller parameters
* more weight for more common features
* better accuracy and faster convergence
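For reference (the slide does not show the objective explicitly), MAP estimation with such a zero-mean Gaussian prior amounts to maximizing the penalized conditional log-likelihood, where the second term pulls each weight towards zero:

```latex
\hat{\lambda} = \arg\max_{\lambda}
\sum_{j} \log p_{\lambda}(c_j \mid x_j)
\;-\; \sum_{i} \frac{\lambda_i^2}{2\sigma^2}
```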