Word Disambiguation Lecture #13

Word Disambiguation Lecture #13 Computational Linguistics CMPSCI 591N, Spring 2006 University of Massachusetts Amherst Andrew McCallum

Words and their meaning. Three lectures: Last time: Collocations: multiple words together, different meaning than the sum of its parts. Today: Word disambiguation: one word, multiple meanings; Expectation Maximization. Future: Word clustering: multiple words, same meaning.

Today's Main Points: What is word sense disambiguation, and why is it useful. Homonymy, Polysemy. Other similar NLP problems. Methods for performing WSD: supervised (naïve Bayes); unsupervised (Expectation Maximization).

Word Sense Disambiguation. The task is to determine which of the various senses of a word is invoked in context. True annuals are plants grown from seed that blossom, set new seed and die in a single year. Nissan's Tennessee manufacturing plant beat back a United Auto Workers organizing effort with aggressive tactics. This is an important problem: most words are ambiguous (have multiple senses). Problem statement: a word is assumed to have a finite number of discrete senses; make a forced choice of sense for each word usage, based on some limited context around the word. Converse: words or senses that mean (almost) the same: image, likeness, portrait, facsimile, picture (next lecture).

WSD is important for: Translation: "The spirit is willing but the flesh is weak." vs. "The vodka is good, but the meat is spoiled." Information Retrieval: query "wireless mouse" vs. document "Australian long-tailed hopping mouse". Computational lexicography: to automatically identify multiple definitions to be listed in a dictionary. Parsing: to give preference to parses with correct use of senses. There isn't generally one way to divide the uses of a word into a set of non-overlapping categories; senses depend on the task [Kilgarriff 1997].

WSD: Many other cases are harder. title: name/heading of a book, statute, work of art or music, etc.; material at the start of a film; the right of legal ownership (of land); the document that is evidence of this right; an appellation of respect attached to a person's name; a written work.

WSD: types of problems. Homonymy: meanings are unrelated: bank of a river vs. bank, a financial institution. Polysemy: related meanings (as on the previous slide): title of a book vs. title, material at the start of a film. Systematic polysemy: standard methods of extending a meaning, such as from an organization to the building where it is housed: The speaker of the legislature. The legislature decided today. He opened the door, and entered the legislature. A word frequently takes on further related meanings through systematic polysemy or metaphor.

Upper and lower bounds on performance. Upper bound: human performance. How often do human judges agree on the correct sense assignment? Particularly interesting if you only give humans the same input context given to the machine method. (A good test for any NLP method!) Gale 1992: give pairs of words in context, humans say whether they are the same sense. Agreement is 97-99% for words with clear senses, but ~65-70% for polysemous words. Lower bound: simple baseline algorithm: always pick the most common sense for each word. Accuracy depends greatly on the sense distribution! Anywhere from 90% down to 50%?

Senseval competitions Senseval 1: September 1998. Results in Computers and the Humanities 34(1-2). OUP Hector corpus. Senseval 2: In first half of 2001. WordNet senses. http://www.itri.brighton.ac.uk/events/senseval

WSD automated method performance. Varies widely depending on how difficult the disambiguation task is. Accuracies over 90% are commonly reported on the classic, often fairly easy, word disambiguation tasks (pike, star, interest). Senseval brought careful evaluation of difficult WSD (many senses, different POS). Senseval 1, fine-grained senses, wide range of types: Overall: about 75% accuracy. Nouns: about 80% accuracy. Verbs: about 70% accuracy.

WSD solution #1: expert systems [Small 1980] [Hirst 1988]. Most early work used semantic networks, frames, logical reasoning, or expert system methods for disambiguation based on contexts. The problem got quite out of hand: "The word expert for 'throw' is currently six pages long, but should be ten times that size" (Small and Rieger 1982).

WSD solution #2: dictionary-based [Lesk 1986]. A word's dictionary definitions are likely to be good indicators for the senses they define: one sense for each dictionary definition. Look for overlap between the words in each definition and the words in the context at hand. Word = ash. Sense 1, tree: "a tree of the olive family". Sense 2, burned: "the solid residue left when combustible material is burned". "This cigar burns slowly and creates a stiff ash": sense1 = 0, sense2 = 1. "The ash is one of the last trees to come into leaf": sense1 = 1, sense2 = 0. There is often insufficient information in the definitions. Accuracy 50-70%.
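A minimal sketch of this overlap idea (not Lesk's exact procedure): the toy definitions mirror the ash example above, while the stop-word list and the crude stemmer are assumptions of mine; the stemmer is only there so that "burns" can match "burned", as the slide's scoring implies.

# Simplified Lesk-style disambiguation: pick the sense whose dictionary
# definition shares the most (stemmed, non-stopword) words with the context.
STOPWORDS = {"a", "the", "of", "is", "to", "and", "when", "this", "into", "one"}

SENSES = {  # toy dictionary entries for "ash"
    "tree": "a tree of the olive family",
    "burned": "the solid residue left when combustible material is burned",
}

def stem(word):
    for suffix in ("ing", "ed", "s"):  # very crude stemming
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def content_words(text):
    words = {w.strip(".,").lower() for w in text.split()} - STOPWORDS
    return {stem(w) for w in words}

def lesk(context, senses=SENSES):
    scores = {sense: len(content_words(context) & content_words(definition))
              for sense, definition in senses.items()}
    return max(scores, key=scores.get), scores

print(lesk("This cigar burns slowly and creates a stiff ash"))      # 'burned' wins
print(lesk("The ash is one of the last trees to come into leaf"))   # 'tree' wins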

WSD solution #3: thesaurus-based [Walker 1987] [Yarowsky 1992]. Occurrences of a word in multiple thesaurus subject codes are a good indicator of its senses. Count the number of times context words appear among the entries for each possible subject code. Increase coverage of rare words and proper nouns by also looking in the thesaurus for words that co-occur with context words more often than chance, e.g. Hawking co-occurs with cosmology, black hole.
Word   Sense          Roget category   Accuracy
star   space object   UNIVERSE         96%
       celebrity      ENTERTAINER      95%
       star-shaped    INSIGNIA         82%

An extra trick: global constraints [Yarowsky 1995]. One sense per discourse: the sense of a word is highly consistent within a document. You get a lot more context words because you combine the contexts of multiple occurrences. True for topic-dependent words; not so true for other items like adjectives and verbs, e.g. make, take.

Other similar disambiguation problems. Sentence boundary detection: "I live on Palm Dr. Smith lives downtown." Only really ambiguous when the word before the period is an abbreviation that can end a sentence (not something like a title), and the word after the period is capitalized and can be a proper name (otherwise it must be a sentence end). Context-sensitive spelling correction: "I know their is a problem with there account."
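As a rough illustration of that ambiguity test (a sketch only; the abbreviation list is a toy assumption, and real systems use large abbreviation lexicons plus a trained classifier):

# Flag periods that are genuinely ambiguous sentence boundaries: the token
# before the period is a known abbreviation and the next token is capitalized.
ABBREVIATIONS = {"dr", "st", "mr", "mrs", "prof", "inc"}  # toy list

def ambiguous_periods(text):
    tokens = text.split()
    spots = []
    for i, tok in enumerate(tokens[:-1]):
        before, after = tok[:-1].lower(), tokens[i + 1]
        if tok.endswith(".") and before in ABBREVIATIONS and after[:1].isupper():
            spots.append(i)  # e.g. "Dr." before "Smith": abbreviation or sentence end?
    return spots

print(ambiguous_periods("I live on Palm Dr. Smith lives downtown."))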

WSD solution #4: supervised classification Gather a lot of labeled data: words in context, hand-labeled into different sense categories. Use naïve Bayes document classification with context as the document! Straightforward classification problem. Simple, powerful method! :-) Requires hand-labeling a lot of data :-( Can we still use naïve Bayes, but without labeled data?

WSD sol'n #5: unsupervised disambiguation. Supervised: word + context, labeled according to sense; train one multinomial per class via maximum likelihood (what you just did for HW#1). Unsupervised: word + context, unlabeled; the label is missing!

28 years ago

Filling in Missing Labels with EM [Dempster et al 77], [Ghahramani & Jordan 95], [McLachlan & Krishnan 97] Expectation Maximization is a class of iterative algorithms for maximum likelihood estimation with incomplete data. E-step: Use current estimates of model parameters to guess value of missing labels. M-step: Use current guesses for missing labels to calculate new estimates of model parameters. Repeat E- and M-steps until convergence. Finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.

Recall: Naïve Bayes Pick the most probable class, given the evidence: - a class (like Planning ) - a document (like language intelligence proof... ) Bayes Rule: Naïve Bayes : - the i th word in d (like proof )

Recall: Parameter Estimation in Naïve Bayes. Estimate of P(c): the fraction of training documents labeled with class c, P(c) = N_c / N. Estimate of P(w|c): the fraction of word occurrences in class-c documents that are w, usually with add-one smoothing: P(w|c) = (1 + count(w, c)) / (|V| + Σ_w' count(w', c)).
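A compact sketch of these estimates and the resulting classifier; the sense labels, context windows, and add-one smoothing choice below are toy assumptions for illustration, not the course's homework setup.

import math
from collections import Counter, defaultdict

# Toy labeled data: (sense label, context words around the ambiguous word "plant").
TRAIN = [
    ("plant_living",  "seed blossom grow leaf".split()),
    ("plant_living",  "annuals grown from seed die in a year".split()),
    ("plant_factory", "manufacturing workers union factory".split()),
    ("plant_factory", "nissan factory organizing effort".split()),
]

def train(data):
    class_counts = Counter(c for c, _ in data)
    word_counts = defaultdict(Counter)
    vocab = set()
    for c, words in data:
        word_counts[c].update(words)
        vocab.update(words)
    priors = {c: n / len(data) for c, n in class_counts.items()}        # P(c)
    def p_word(w, c):                                                   # P(w|c), add-one smoothed
        return (1 + word_counts[c][w]) / (len(vocab) + sum(word_counts[c].values()))
    return priors, p_word

def classify(context, priors, p_word):
    # argmax_c  log P(c) + sum_i log P(w_i|c)
    scores = {c: math.log(p) + sum(math.log(p_word(w, c)) for w in context)
              for c, p in priors.items()}
    return max(scores, key=scores.get)

priors, p_word = train(TRAIN)
print(classify("the seed will blossom".split(), priors, p_word))   # -> plant_living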

EM Recipe. Initialization: create an array P(c|d) for each document, and fill it with random (normalized) values; set P(c) to the uniform distribution. M-step (likelihood Maximization): calculate maximum-likelihood estimates for the parameters P(w|c) using the current P(c|d). E-step (missing-value Estimation): using the current parameters, calculate new P(c|d) the same way you would at test time. Loop back to the M-step, until convergence. Converged when the maximum change in a parameter P(w|c) is below some threshold.
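A self-contained sketch of this recipe for a two-class multinomial mixture; the toy unlabeled documents, the class count, the smoothing, and the fixed iteration cap are my assumptions (the recipe above instead stops when the change in P(w|c) falls below a threshold).

import math, random
from collections import Counter

DOCS = [
    "struck out in the last inning".split(),
    "homerun in the first inning".split(),
    "perfect triple jump on the ice".split(),
    "gold medal performance at the ice rink".split(),
]
K = 2                      # assumed number of classes/senses
VOCAB = sorted({w for doc in DOCS for w in doc})
random.seed(0)

# Initialization: random normalized responsibilities P(c|d) for each document.
resp = []
for _ in DOCS:
    r = [random.random() for _ in range(K)]
    total = sum(r)
    resp.append([x / total for x in r])

for iteration in range(20):   # fixed cap instead of a convergence test
    # M-step: (add-one smoothed) maximum-likelihood estimates of P(c) and P(w|c)
    # from the current soft counts.
    p_c = [sum(resp[d][c] for d in range(len(DOCS))) / len(DOCS) for c in range(K)]
    p_w_c = []
    for c in range(K):
        counts = Counter()
        for d, doc in enumerate(DOCS):
            for w in doc:
                counts[w] += resp[d][c]
        total = sum(counts.values())
        p_w_c.append({w: (1 + counts[w]) / (len(VOCAB) + total) for w in VOCAB})

    # E-step: recompute P(c|d) exactly as you would at test time, then renormalize.
    for d, doc in enumerate(DOCS):
        log_post = [math.log(p_c[c]) + sum(math.log(p_w_c[c][w]) for w in doc)
                    for c in range(K)]
        peak = max(log_post)
        post = [math.exp(lp - peak) for lp in log_post]
        z = sum(post)
        resp[d] = [p / z for p in post]

for doc, r in zip(DOCS, resp):
    print(" ".join(doc), "->", [round(p, 2) for p in r])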

EM. We could have simply written down the likelihood, taken the derivative, and solved, but unlike the complete-data case it is not solvable in closed form; we must use an iterative method such as gradient ascent. EM is another form of ascent on this likelihood surface. Convergence, speed, and local minima are all issues. If you make hard 0-versus-1 assignments in P(c|d), you get the K-means algorithm. Likelihood will always be highest with more classes; use a prior over the number of classes, or just pick arbitrarily.

EM. Some good things about EM: no learning-rate parameter; very fast for low dimensions; each iteration is guaranteed to improve the likelihood; adapts unused units rapidly. Some bad things about EM: can get stuck in local minima; ignores model cost (how many classes?); both steps require considering all explanations of the data (all classes).

Semi-Supervised Document Classification. Training data with class labels: Web pages the user says are interesting; Web pages the user says are uninteresting. Data available at training time, but without class labels: Web pages the user hasn't seen or said anything about. Can we use the unlabeled documents to increase accuracy?

Semi-Supervised Document Classification. Build a classification model using the limited labeled data. Use the model to estimate the labels of the unlabeled documents. Use all documents to build a new classification model, which is often more accurate because it is trained using more data.
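A rough sketch of that loop using scikit-learn (assuming it is available); the documents are toy placeholders, and it hard-labels the unlabeled documents each round, which is a simplification of the soft P(c|d) assignments EM uses.

import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled_docs = ["struck out in the last inning", "homerun in the first inning",
                "perfect triple jump", "gold medal performance on the ice"]
labels = np.array(["baseball", "baseball", "skating", "skating"])
unlabeled_docs = ["new ice skates", "the hitter struck out",
                  "practice at the rink every day"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(labeled_docs + unlabeled_docs)
X_lab, X_unl = X[:len(labeled_docs)], X[len(labeled_docs):]

# 1. Build a classification model from the limited labeled data.
clf = MultinomialNB().fit(X_lab, labels)
for _ in range(5):
    # 2. Use the model to estimate labels for the unlabeled documents.
    guessed = clf.predict(X_unl)
    # 3. Retrain on all documents, labeled plus newly guessed.
    clf = MultinomialNB().fit(vstack([X_lab, X_unl]),
                              np.concatenate([labels, guessed]))

print(clf.predict(vectorizer.transform(["triple jump on the ice"])))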

An Example. Labeled data, Baseball: "The new hitter struck out...", "Struck out in last inning...", "Homerun in the first inning...", "Pete Rose is not as good an athlete as Tara Lipinski...". Labeled data, Ice Skating: "Fell on the ice...", "Perfect triple jump...", "Katarina Witt's gold medal performance...", "New ice skates...", "Practice at the ice rink every day...". Unlabeled data: "Tara Lipinski's substitute ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal." "Tara Lipinski bought a new house for her parents." Before EM: Pr(Lipinski | Baseball) = 0.01, Pr(Lipinski | Ice Skating) = 0.001. After EM: Pr(Lipinski | Ice Skating) = 0.02, Pr(Lipinski | Baseball) = 0.003.

WebKB Data Set: 4 classes (student, faculty, course, project), 4199 documents from CS academic departments.

Word Vector Evolution with EM (D is a digit).
Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog.
Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec.
Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript.

EM as Clustering [figure: X marks denote unlabeled documents]

EM as Clustering, Gone Wrong [figure: X marks denote unlabeled documents]

20 Newsgroups Data Set sci.crypt rec.sport.hockey rec.sport.baseball comp.windows.x comp.sys.mac.hardware comp.sys.ibm.pc.hardware comp.os.ms-windows.misc comp.graphics alt.atheism talk.politics.misc talk.politics.mideast talk.politics.guns sci.space sci.electronics sci.med talk.religion.misc 20 class labels, 20,000 documents 62k unique words

Newsgroups Classification Accuracy varying # labeled documents