Natural Language Processing CS 6320 Lecture 13 Word Sense Disambiguation

Natural Language Processing CS 6320 Lecture 13: Word Sense Disambiguation. Instructor: Sanda Harabagiu. Copyright 2011 by Sanda Harabagiu.

Word Sense Disambiguation Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities. Sense Inventory usually comes from a dictionary or thesaurus, e.g. WordNet. Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.

Word Senses: the meaning of a word distinguished in a given context. Word sense representations: (1) With respect to a dictionary: chair = a seat for one person, with a support for the back ("he put his coat over the back of the chair and sat down"); chair = the position of professor ("he was awarded an endowed chair in economics"). (2) With respect to the translation in a second language: chair = chaise; chair = directeur. (3) With respect to the context where it occurs (discrimination): "Sit on a chair", "Take a seat on this chair" vs. "The chair of the Math Department", "The chair of the meeting".
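
To make the notion of a sense inventory concrete, the sketch below lists the WordNet senses of "chair", which include both the seat and the professorship readings used above. It assumes NLTK is installed and the WordNet corpus has been downloaded; the printed lines are example output, not part of the original slides.

```python
# A minimal sketch: listing the WordNet sense inventory for the noun "chair".
# Assumes the WordNet data has been fetched once via nltk.download("wordnet").
from nltk.corpus import wordnet as wn

for synset in wn.synsets("chair", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())
# Example output:
# chair.n.01 - a seat for one person, with a support for the back
# professorship.n.01 - the position of professor
# ...
```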

Possible definitions for the inventory of sense tags. Two variants of the WSD task: 1. The lexical sample task: a small pre-selected set of target words is chosen along with an inventory of senses for each word from some lexicon; supervised machine learning techniques are typically used. 2. The all-words task: entire texts are considered along with an inventory of senses; similar to part-of-speech tagging, but with a larger set of tags.

Approaches to Word Sense Disambiguation. Knowledge-based disambiguation: use of external lexical resources such as dictionaries and thesauri, and of discourse properties. Supervised disambiguation: based on a labeled training set; the learning system has a training set of feature-encoded inputs AND their appropriate sense label (category). Unsupervised disambiguation: based on unlabeled corpora; the learning system has a training set of feature-encoded inputs BUT NOT their appropriate sense label (category).

Two methods of WSD developed by David Yarowsky. Method 1, published in COLING-92, uses statistical models of Roget's categories trained on large corpora. The senses of a word are defined by the list of categories for that word in Roget's International Thesaurus (4th Edition, Chapman 1977). Note: other concept hierarchies could be used, e.g. WordNet or LDOCE subject codes.

Sense Disambiguation. The disambiguation of a word depends on its entry in Roget's Thesaurus: a word (word_i) is listed under several categories (list 1, list 2, ..., list j), and disambiguation means selecting the list (category) which is most probable given the surrounding context.

Example: the word crane. Two senses: crane as MACHINE, crane as ANIMAL.

Proposed Method: 3 observations: a) different conceptual classes of words, e.g. ANIMALS or MACHINES, tend to appear in recognizably different contexts; b) different word senses tend to belong to different conceptual classes; c) if one can build a context discriminator for a conceptual class, one has effectively built a context discriminator for the word senses that are members of those classes.

What should be done? There are 1042 Roget categories. For each category: 1. Collect contexts which are representative of the Roget category; 2. Identify salient words in the collective context and determine weights for each word; 3. Use the resulting weights to predict the appropriate category for a polysemous word occurring in novel text.

Step 1: Collect Contexts. How? Extract concordances of 100 surrounding words for each occurrence of each member of the category in the corpus. Example: partial concordances for words in the category TOOLS/MACHINERY. The complete set contains 30,924 lines selected from the 10-million-word, June 1991 electronic version of Grolier's Encyclopedia.

Spurious Examples. Ideally each concordance line should only include references to a given category. In reality this is rarely the case, since many words are polysemous.

Step 2: Identify salient words in the collective context and weight them appropriately. What is a salient word? Intuitively, a word which appears significantly more often in the context of a category than at other points in the corpus, i.e. a better-than-average indicator for the category.

Formalization: use a mutual-information-like estimate as the salience (weight) of a word w for a Roget category RCat: salience(w) = Pr(w | RCat) / Pr(w). Salience combined with frequency determines how important w is as an indicator of RCat.
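
A minimal sketch of this estimate, assuming the concordance lines from Step 1 are already tokenized; the add-0.5 smoothing is an assumption made here to avoid zero probabilities, not part of the original method.

```python
import math
from collections import Counter

def salience_weights(category_contexts, corpus_tokens, smoothing=0.5):
    """Estimate log(Pr(w | RCat) / Pr(w)) for every word w seen in the
    concordance lines collected for one Roget category (Step 1)."""
    cat_counts = Counter(w for line in category_contexts for w in line)
    corpus_counts = Counter(corpus_tokens)
    cat_total = sum(cat_counts.values())
    corpus_total = len(corpus_tokens)
    weights = {}
    for w, count in cat_counts.items():
        p_w_given_cat = (count + smoothing) / (cat_total + smoothing)
        p_w = (corpus_counts[w] + smoothing) / (corpus_total + smoothing)
        weights[w] = math.log(p_w_given_cat / p_w)
    return weights
```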

Category words vs. important words. Example: category TOOLS/MACHINERY. Meronyms: blade, engine, gear, wheel, shaft, tooth, piston, cylinder. Functions of machines: cut, rotate, move, turn, pull. Typical objects of those actions: wood, metal. Typical modifiers for machines: electric, mechanical, pneumatic.

Step 3: Use the resulting weights to predict the appropriate category for a word in novel text. How? When any of the salient words from Step 2 appear in the context of an ambiguous word, this is evidence that the word belongs to the indicated category; when several such words appear, the evidence compounds. How? Bayes' rule: sum the weights over all words in the context and determine the category for which the sum is greatest: ARGMAX over RCat of the sum, over w in context, of log( Pr(w | RCat) * Pr(RCat) / Pr(w) ).
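
A sketch of the resulting classifier, reusing the salience_weights function sketched above; treating unseen words as contributing weight 0 is an assumption for illustration.

```python
import math

def predict_category(context_words, weights_by_cat, prior_by_cat):
    """argmax over RCat of  log Pr(RCat) + sum_w log(Pr(w | RCat) / Pr(w)),
    summed over the words w in a 100-word context window."""
    best_cat, best_score = None, float("-inf")
    for cat, weights in weights_by_cat.items():
        score = math.log(prior_by_cat[cat])
        # weights[w] already holds log(Pr(w | RCat) / Pr(w)); unseen words add 0
        score += sum(weights.get(w, 0.0) for w in context_words)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat
```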

Results (two slides of results tables, not reproduced in this transcript).

Method 2: David Yarowsky (ACL-95). An unsupervised learning algorithm for sense disambiguation based on two powerful properties of human language. Heuristic 1: One sense per collocation. Nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship. Heuristic 2: One sense per discourse. The sense of a target word is highly consistent within any given document.

How are the heuristics used? For a word w and its senses s_1, s_2, ..., s_n: use a seed set of collocations for each sense s_i, 1 <= i <= n, then use H1 + H2 to incrementally identify collocations for sense s_i of w.

How valid is the one-sense-per-discourse heuristic? Use 37,232 examples (hand-tagged over three years). Measure: the accuracy (when the word occurs more than once in a discourse, how often it takes on the majority sense for the discourse) and the applicability (how often the word does occur more than once in a discourse).

The one-sense-per-discourse hypothesis

One Sense Per Collocation. There is a strong tendency for words to exhibit only one sense in a given collocation. However, this effect varies depending on the type of collocation: it is stronger for words in a predicate-argument relationship than for arbitrary associations at equivalent distance, and it is much stronger for collocations with content words than with function words.

Using decision lists. Integrate a wide diversity of potential evidence sources (lemmas, inflected forms, parts of speech and arbitrary word classes) in a variety of positional relationships (local and distant collocations, trigram sequences, predicate-argument associations). Training procedure: a) compute the word-sense probability distribution for all such collocations; b) order the collocations by log-likelihood ratio: log( Pr(Sense_A | Collocation_i) / Pr(Sense_B | Collocation_i) ).
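
A sketch of step (b), assuming two senses A and B and a precomputed table of per-collocation counts; the small additive constant alpha is an assumption standing in for the paper's smoothing procedure.

```python
import math

def build_decision_list(collocation_counts, alpha=0.1):
    """collocation_counts maps each collocation (e.g. ("word-to-right", "life"))
    to a (count_with_sense_A, count_with_sense_B) pair.  Returns the collocations
    sorted by the absolute log-likelihood ratio
        log( Pr(Sense_A | collocation) / Pr(Sense_B | collocation) ),
    so the most reliable evidence is consulted first."""
    decision_list = []
    for colloc, (count_a, count_b) in collocation_counts.items():
        llr = math.log((count_a + alpha) / (count_b + alpha))
        sense = "A" if llr > 0 else "B"
        decision_list.append((abs(llr), colloc, sense))
    decision_list.sort(reverse=True)
    return decision_list
```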

Unsupervised WSD Algorithm. Step 1: for a polysemous word w, identify all its examples in a given corpus and store their contexts as lines in an initially untagged training set.

Step 2: For each sense of the word, identify a relatively small number of training examples representative of that sense. One solution: hand-tag a subset of the training sentences. Yarowsky had a better solution: identify a small number of seed collocations representative of each sense and tag all training examples containing the seed collocates with the sense label. Example: word: plant; sense A collocation: plant life; sense B collocation: manufacturing plant.

Training Examples (two slides of sample training contexts for plant, not reproduced).

Sample Initial State. Sense-A seed: life; Sense-B seed: factory. All occurrences of the target word are identified, and a small training set of seed data is tagged with the word sense.

Step 3a: Train the supervised classification algorithm on the SENSE-A/SENSE-B seed sets.

Step 3b: Apply the decision-list classifier to the entire sample set. Take those members in the residual that are tagged as SENSE-A or SENSE-B with probability above a certain threshold and add those examples to the growing seed sets. What happens? The new additions contain newly-learned collocations that are reliably indicative of the previously-trained seed sets.

Sample Intermediate State: the seed set grows and the residual set shrinks.

Later: convergence. Stop when the residual set stabilizes.

Step 3c: Optionally, use the one-sense-per-discourse heuristic to both filter and augment the addition of collocations. If several instances of a polysemous word in a discourse have already been assigned SENSE-A, extend this tag to all examples in the discourse, conditional on the relative numbers and the probabilities associated with the tagged examples.

Step 3d: Repeat Step 3 iteratively. The training set (seeds + newly added examples) will tend to grow, and the residual will tend to shrink. Step 4: STOP. When the training parameters are held constant, the algorithm will converge on a stable residual set. Step 5: The classification procedure from the final supervised training step can be applied to new data.
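
The sketch below pulls Steps 1-4 together, assuming single-word seed collocates, one-word collocations only, and a fixed log-likelihood threshold; all of these are simplifications of the full method, made here for illustration.

```python
import math
from collections import defaultdict

def yarowsky_bootstrap(examples, seeds, threshold=2.0, max_iters=20, alpha=0.1):
    """examples: list of token lists (contexts of the ambiguous word).
    seeds: dict mapping a seed collocate (e.g. "life") to a sense ("A" or "B").
    Returns one label per example (None = still in the residual set)."""
    # Step 2: tag the examples that contain a seed collocate.
    labels = [next((s for w, s in seeds.items() if w in ex), None) for ex in examples]
    for _ in range(max_iters):
        # Step 3a: count how often each context word co-occurs with each sense.
        counts = defaultdict(lambda: [0, 0])
        for ex, lab in zip(examples, labels):
            if lab is not None:
                for w in set(ex):
                    counts[w]["AB".index(lab)] += 1
        # One-word decision list: word -> signed log-likelihood ratio.
        llr = {w: math.log((a + alpha) / (b + alpha)) for w, (a, b) in counts.items()}
        # Step 3b: label residual examples whose best evidence clears the threshold.
        new_labels = list(labels)
        for i, ex in enumerate(examples):
            if labels[i] is None:
                scores = [llr[w] for w in ex if w in llr]
                if scores:
                    best = max(scores, key=abs)
                    if abs(best) >= threshold:
                        new_labels[i] = "A" if best > 0 else "B"
        if new_labels == labels:          # Step 4: residual set has stabilized
            break
        labels = new_labels
    return labels
```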

Final decision list for plant: precision 97%.

3rd Method: Mihalcea & Moldovan (ACL 99). Novelties: 1) use the Internet to search for collocations between two words; 2) for a pair of words (W1, W2), consider the senses of W2 while keeping the sense of W1 fixed; 3) rank the senses by the order provided by the number of hits.

Contextual ranking of word senses. Algorithm 1. Input: a pair (word1, word2). Output: a ranking of the senses of one of the words. Procedure. Step 1: Generate a similarity list for each sense of one of the words. Example: for the pair (report, study), the similarity list for one sense of report contains words from its synset and from its hypernyms in WordNet: (report, news report, story, account, write up).

Step 2 (Algorithm 1): Generate the W1-W2 pairs. For each sense i of W2 (1 <= i <= m), Step 1 gives a similarity list (W2_i, W2_i(1), W2_i(2), ..., W2_i(k_i)). Pairing every element with W1 yields the similarity pair-lists (W1-W2_i, W1-W2_i(1), W1-W2_i(2), ..., W1-W2_i(k_i)), one per sense.

Step 3: Search the Internet and rank the senses W2_i. Use AltaVista to generate queries of the form (W1 W2_i OR W1 W2_i(1) OR W1 W2_i(2) OR ... OR W1 W2_i(k_i)) and ((W1 NEAR W2_i) OR (W1 NEAR W2_i(1)) OR (W1 NEAR W2_i(2)) OR ... OR (W1 NEAR W2_i(k_i))), for all 1 <= i <= m. Rank the m senses of W2 by how strongly they relate to W1, i.e. by the number of hits.
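
A sketch of the hit-count ranking; AltaVista and its NEAR operator are long gone, so the search engine is abstracted into a hit_count function that the caller supplies (any API that returns result counts would do), and the quoted phrase-query syntax shown is an assumption.

```python
def rank_senses_by_hits(w1, similarity_lists, hit_count):
    """similarity_lists: for each sense i of W2, its similarity list
    [W2_i, W2_i(1), ..., W2_i(k_i)] from Algorithm 1, Step 1.
    hit_count: a function mapping a query string to a number of hits
    (a stand-in for the AltaVista queries used in the original experiment)."""
    ranked = []
    for i, sim_list in enumerate(similarity_lists, start=1):
        # Phrase-style query: ("W1 W2_i" OR "W1 W2_i(1)" OR ...)
        query = " OR ".join(f'"{w1} {w}"' for w in sim_list)
        ranked.append((hit_count(query), i))
    ranked.sort(reverse=True)
    return [sense for hits, sense in ranked]   # sense indices, best first
```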

Conceptual density algorithm: a measure of the relatedness between words. Approach: build a linguistic context for each sense of a verb and a noun, measure the number of common nouns, and take the synset glosses as micro-contexts.

Algorithm. INPUT: a semantically untagged verb-noun pair and a ranking of noun senses (the output of Algorithm 1). OUTPUT: a sense-tagged verb-noun pair. Step 1: Given a verb-noun pair V-N, denote by <v_1, v_2, ..., v_h> and <n_1, n_2, ..., n_l> the possible senses of V and N. Step 2: Rank the senses of N using Algorithm 1; use only the first t senses.

Step 3: Conceptual Density. For each possible pair v_i-n_j compute the conceptual density as follows: a) extract all the glosses from the WordNet sub-hierarchy containing v_i; b) determine the nouns in these glosses; they form the noun-context of the verb. Each such noun is stored with a weight w that indicates the level in the sub-hierarchy of the verb concept in whose gloss the noun was found.

Conceptual density (cont.). c) Determine the nouns from the sub-hierarchy including n_j. d) Compute the conceptual density C_ij between the nouns obtained at b) and the nouns obtained at c): C_ij = ( sum over k = 1..|cd_ij| of w_k ) / log(descendants_j).

Conceptual density (cont.). In C_ij = ( sum over k = 1..|cd_ij| of w_k ) / log(descendants_j): cd_ij is the number of concepts common to the hierarchies of v_i and n_j; w_k are the levels of those nouns in the hierarchy of v_i; descendants_j is the number of words within the hierarchy of the noun n_j. Step 4: C_ij is used to rank each pair v_i-n_j.
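
A minimal sketch of the formula, assuming the common concepts and their level weights have already been extracted as in steps a)-c); the guard against very small hierarchies is an added assumption to avoid a zero denominator.

```python
import math

def conceptual_density(common_concept_weights, noun_descendants):
    """Compute C_ij for a verb sense v_i and noun sense n_j.
    common_concept_weights: the weights w_k of the |cd_ij| nouns shared by the
    noun-context of v_i (step b) and the sub-hierarchy of n_j (step c).
    noun_descendants: the number of words in the sub-hierarchy of n_j."""
    if noun_descendants <= 1:
        return 0.0                      # log(1) = 0 would make the ratio undefined
    return sum(common_concept_weights) / math.log(noun_descendants)
```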