Natural Language Processing CS 630 Lecture 13: Word Sense Disambiguation Instructor: Sanda Harabagiu Copyright 2011 by Sanda Harabagiu 1
Word Sense Disambiguation Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities. Sense Inventory usually comes from a dictionary or thesaurus, e.g. WordNet. Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
Word Senses The meaning of a word distinguished in a given context Word sense representations With respect to a dictionary chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down" chair = the position of professor; "he was awarded an endowed chair in economics" With respect to the translation in a second language chair = chaise chair = directeur With respect to the context where it occurs (discrimination) Sit on a chair Take a seat on this chair The chair of the Math Department The chair of the meeting 3
Possible definitions for the inventory of sense tags Two variants of the WSD task: 1. The lexical sample task: a small pre-selected set of target words is chosen, along with an inventory of senses for each word from some lexicon. Supervised machine learning techniques are typically used. 2. The all-words task: entire texts are considered along with an inventory of senses. Similar to part-of-speech tagging, but with a much larger set of tags. 4
Approaches to Word Sense Disambiguation Knowledge-Based Disambiguation use of external lexical resources such as dictionaries and thesauri discourse properties Supervised Disambiguation based on a labeled training set the learning system has: a training set of feature-encoded inputs AND their appropriate sense label (category) Unsupervised Disambiguation based on unlabeled corpora the learning system has: a training set of feature-encoded inputs BUT NOT their appropriate sense label (category) 5
Two methods of WSD developed by David Yarowsky Method 1: published in COLING-92, uses statistical models of Roget's Categories trained on large corpora. The senses of a word are defined by the list of categories for that word in Roget's International Thesaurus (4th Edition, Chapman 1977). Note: other concept hierarchies could be used, e.g. WordNet or LDOCE subject codes. 6
Sense Disambiguation The disambiguation of a word depends on its definition in Roget's Thesaurus: word_i is associated with categories list_1, list_2, ..., list_j, and disambiguation amounts to selecting the list (category) which is most probable given the surrounding context. 7
Example: the word crane Two senses: crane as MACHINE crane as ANIMAL 8
Proposed Method Three observations: a) different conceptual classes of words, e.g. ANIMALS or MACHINES, tend to appear in recognizably different contexts. b) different word senses tend to belong to different conceptual classes. c) if one can build a context discriminator for a conceptual class, one has effectively built a context discriminator for the word senses that are members of that class. 9
What should be done? There are 1,042 Roget Categories. For each category: 1. Collect contexts which are representative of the Roget Category. 2. Identify salient words in the collective context and determine weights for each word. 3. Use the resulting weights to predict the appropriate category for a polysemous word occurring in novel text. 10
Step 1: Collect Contexts How? Extract concordances of 100 surrounding words for each occurrence of each member of the category in the corpus. Example of partial concordances for words in the category TOOLS/MACHINERY. The complete set contains 30,924 lines selected from the 10 million-word, June 1991 electronic version of Grolier's Encyclopedia. 11
Spurious Examples Ideally each concordance line should include references to only the given category. In reality this is rarely the case, since many words are polysemous. 12
Step 2: Identify salient words in the collective context and weight them appropriately. What is a salient word? Intuitively, a word which appears significantly more often in the context of a category than at other points in the corpus; a better-than-average indicator for the category. 13
Formalization Use a mutual-information-like estimate: salience(w) = Pr(w | RCat) / Pr(w). Frequency combined with importance gives the salience (weight) of a word for the category. 14
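A rough sketch of this estimate over toy counts (the function name and the unsmoothed relative-frequency probabilities are illustrative assumptions; the original work smooths and interpolates these estimates):

```python
from collections import Counter

def salience(word, category_context, corpus):
    """Mutual-information-like salience: Pr(w | RCat) / Pr(w).

    category_context: tokens drawn from concordances of the Roget category.
    corpus: tokens from the whole corpus.
    Probabilities are raw relative frequencies (a simplification).
    """
    p_w_given_cat = Counter(category_context)[word] / len(category_context)
    p_w = Counter(corpus)[word] / len(corpus)
    return p_w_given_cat / p_w if p_w > 0 else 0.0
```

On such toy data, a content word like "blade" scores far above a function word like "the", matching the intuition of a better-than-average category indicator.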
Category words vs. important words Example: category TOOLS/MACHINERY MERONYMS blade, engine, gear, wheel, shaft, tooth, piston, cylinder Functions of machines cut, rotate, move, turn, pull Typical objects of those actions wood, metal Typical modifiers for machines electric, mechanical, pneumatic 15
Step 3 Use the resulting weights to predict the appropriate category for a word in novel text. How? When any of the salient words from Step 2 appear in the context of an ambiguous word, this is evidence that the word belongs to the indicated category. When several such words appear, the evidence is compounded. How? Bayes' rule: sum the weights over all words in the context and determine the category for which the sum is greatest: ARGMAX_RCat Σ_{w in context} log( Pr(w | RCat) × Pr(RCat) / Pr(w) ). 16
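A minimal sketch of this Bayes-rule scoring (the add-one smoothing, the token-based prior, and the toy category data below are my assumptions, not the paper's own interpolation scheme):

```python
import math
from collections import Counter

def predict_category(context, category_docs, corpus):
    """ARGMAX over categories of sum_w log(Pr(w|RCat) * Pr(RCat) / Pr(w)).

    category_docs: dict mapping category name -> list of context tokens.
    corpus: list of all tokens, used to estimate Pr(w).
    """
    vocab = set(corpus)
    corpus_counts = Counter(corpus)
    total_tokens = sum(len(toks) for toks in category_docs.values())
    best, best_score = None, float("-inf")
    for cat, tokens in category_docs.items():
        counts = Counter(tokens)
        p_cat = len(tokens) / total_tokens  # crude token-based prior
        score = 0.0
        for w in context:
            p_w_cat = (counts[w] + 1) / (len(tokens) + len(vocab))  # add-one
            p_w = (corpus_counts[w] + 1) / (len(corpus) + len(vocab))
            score += math.log(p_w_cat * p_cat / p_w)
        if score > best_score:
            best, best_score = cat, score
    return best
```

With salient TOOLS words such as "engine" and "cut" in the context, the TOOLS category accumulates the greatest summed weight and wins.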
Results 17
Results 18
Method #2 David Yarowsky (ACL-95) An unsupervised learning algorithm for sense disambiguation based on two powerful properties of human language: Heuristic 1: One sense per collocation. Nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship. Heuristic 2: One sense per discourse. The sense of a target word is highly consistent within any given document. 19
How are the heuristics used? For a word w and its senses s_1, s_2, ..., s_n: use a seed set of collocations for each sense s_i, 1 ≤ i ≤ n; use H1 + H2 to incrementally identify collocations for sense s_i of w. 20
How valid is the One sense per discourse heuristic? Use 37,232 examples (hand-tagged over three years). Measure: the accuracy (when the word occurs more than once in a discourse, how often it takes on the majority sense for the discourse) and the applicability (how often the word does occur more than once in a discourse). 21
The one-sense-per-discourse hypothesis (results table omitted) 22
One Sense Per Collocation There is a strong tendency for words to exhibit only one sense in a given collocation. However, this effect varies with the type of collocation: it is stronger for words in a predicate-argument relationship than for arbitrary associations at equivalent distance, and it is much stronger for collocations with content words than with function words. 23
Using decision lists Integrate a wide diversity of potential evidence sources (lemmas, inflected forms, parts of speech and arbitrary word classes) in a variety of positional relationships (local and distant collocations, trigram sequences, predicate-argument associations) Training procedure: a) compute the word-sense probability distribution for all such collocations b) order the probabilities by log-likelihood ratio: log( Pr(Sense_A | Collocation_i) / Pr(Sense_B | Collocation_i) ) 24
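The training procedure can be sketched as follows (the 0.1 smoothing constant is an illustrative assumption to avoid division by zero; Yarowsky discusses several smoothing options):

```python
import math
from collections import defaultdict

def build_decision_list(tagged_examples):
    """tagged_examples: list of (feature_set, sense) pairs, sense in {"A", "B"}.

    Returns rules (abs_llr, feature, sense) sorted by |log-likelihood ratio|.
    """
    counts = defaultdict(lambda: {"A": 0.0, "B": 0.0})
    for feats, sense in tagged_examples:
        for f in feats:
            counts[f][sense] += 1
    dlist = []
    for f, c in counts.items():
        llr = math.log((c["A"] + 0.1) / (c["B"] + 0.1))  # smoothed ratio
        dlist.append((abs(llr), f, "A" if llr > 0 else "B"))
    dlist.sort(reverse=True)
    return dlist

def classify(features, dlist):
    """Decision-list classification: fire only the highest-ranked matching rule."""
    for _, f, sense in dlist:
        if f in features:
            return sense
    return None
```

Note the design choice inherited from the method: unlike naive Bayes, only the single strongest matching piece of evidence decides, which is what makes individual collocations so powerful here.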
Unsupervised WSD Algorithm Step 1: for a polysemous word w, identify all its examples in a given corpus and store their contexts as lines in an initially untagged training set. 5
Step 2 For each sense of the word, identify a relatively small number of training examples representative of that sense. One solution: hand-tag a subset of the training sentences. Yarowsky had a better solution: identify a small number of seed collocations representative of each sense and tag all training examples containing the seed collocates with the sense label. Example: word: plant; sense A collocation: plant life; sense B collocation: manufacturing plant 26
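A sketch of the seed-tagging step (single seed words stand in here for the two-word collocations "plant life" / "manufacturing plant"; the function name is invented):

```python
def seed_tag(contexts, seeds):
    """Tag each context that contains a seed collocate; leave the rest untagged.

    contexts: list of token lists around occurrences of the target word.
    seeds: dict mapping sense label -> set of seed words.
    Returns (tagged, residual), both as (context, label) pairs;
    residual entries carry label None.
    """
    tagged, residual = [], []
    for ctx in contexts:
        label = None
        for sense, words in seeds.items():
            if words & set(ctx):  # first matching seed set wins
                label = sense
                break
        (tagged if label else residual).append((ctx, label))
    return tagged, residual
```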
Training Examples 7
Training Examples (continued) 28
Sample Initial State Sense-A: life Sense-B: factory All occurrences of the target word are identified A small training set of seed data is tagged with word sense 9
Step 3a Train the supervised classification algorithm on the SENSE-A/SENSE-B seed sets. 30
Step 3b Apply the decision-list classifier to the entire sample set. Take those members in the residual that are tagged as SENSE-A or SENSE-B with probability above a certain threshold and add those examples to the growing seed sets. What happens? The new additions contain newly-learned collocations that are reliably indicative of the previously-trained seed sets. 31
Sample Intermediate State Seed set grows and residual set shrinks. 32
Later Convergence: Stop when residual set stabilizes 33
Step 3c Optionally, use the one-sense-per-discourse heuristic to both filter and augment the addition of collocations. If several instances of a polysemous word in a discourse have already been assigned SENSE-A, extend this tag to all examples in the discourse, conditional on the relative numbers and the probabilities associated with the tagged examples. 34
Step 3d Repeat Step 3 iteratively. The training set (seeds + newly added examples) will tend to grow, while the residual will tend to shrink. Step 4: STOP. When the training parameters are held constant, the algorithm converges on a stable residual set. Step 5: The classification procedure from the final supervised training step can be applied to new data. 35
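Steps 1 through 4 can be compressed into a bootstrapping loop like the sketch below. This is a stand-in, not the full algorithm: the real method retrains a complete decision list each round, while here each unlabeled context is scored by its single strongest feature; the threshold, the 0.1 smoothing constant, and the fixed sense labels "A"/"B" are illustrative assumptions.

```python
import math
from collections import defaultdict

def bootstrap(contexts, seeds, threshold=1.0, max_iters=10):
    """Yarowsky-style bootstrapping sketch.

    contexts: list of token sets; seeds: dict {"A": set, "B": set}.
    Returns a dict mapping context index -> sense label.
    """
    labels = {}
    for i, ctx in enumerate(contexts):          # Step 2: seed tagging
        for sense, words in seeds.items():
            if words & ctx:
                labels[i] = sense
    for _ in range(max_iters):                  # Step 3d: iterate
        counts = defaultdict(lambda: defaultdict(float))
        for i, sense in labels.items():         # Step 3a: (re)train
            for w in contexts[i]:
                counts[w][sense] += 1
        grew = False
        for i, ctx in enumerate(contexts):      # Step 3b: tag the residual
            if i in labels:
                continue
            best = max(
                ((abs(math.log((c.get("A", 0) + 0.1) / (c.get("B", 0) + 0.1))),
                  "A" if c.get("A", 0) >= c.get("B", 0) else "B")
                 for w, c in counts.items() if w in ctx),
                default=(0.0, None))
            if best[1] and best[0] >= threshold:
                labels[i] = best[1]
                grew = True
        if not grew:                            # Step 4: residual stabilized
            break
    return labels
```

In the test below, the third context contains no seed, but it shares "grows" with a SENSE-A example, so the second round pulls it into the SENSE-A set, illustrating how newly-learned collocations grow the seed sets.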
Final decision list for plant. Precision 97% 36
3rd Method Mihalcea & Moldovan (ACL-99) Novelties: 1) Use the Internet to search for collocations between two words. 2) For pairs of words (W1, W2): vary the senses of W2 while keeping the sense of W1 fixed. 3) Rank the senses in the order given by the number of hits. 37
Contextual ranking of word senses Algorithm 1: Input: word1 word2 Output: a ranking of the senses of one word Procedure Step 1: Generate a similarity list for each sense of one of the words. Example: (report, study) similarity list: words from the synset and from the hypernyms (WordNet): (report, news report, story, account, write up) 38
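Step 1 can be sketched over a hand-made mini-inventory (in practice the similarity lists would come from the real WordNet, e.g. via nltk.corpus.wordnet; the MINI_WORDNET entries and sense numbers below are invented for illustration):

```python
# Invented stand-in for WordNet: (word, sense) -> synset members + hypernym words.
MINI_WORDNET = {
    ("report", 1): {"synset": ["report", "news report", "story", "account", "write up"],
                    "hypernyms": ["document"]},
    ("report", 2): {"synset": ["report", "composition", "paper", "theme"],
                    "hypernyms": ["essay"]},
}

def similarity_list(word, sense):
    """Similarity list for one sense: synset words, then hypernym words."""
    entry = MINI_WORDNET[(word, sense)]
    seen, out = set(), []
    for w in entry["synset"] + entry["hypernyms"]:
        if w not in seen:  # keep order, drop duplicates
            seen.add(w)
            out.append(w)
    return out
```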
Step 2 (Algorithm 1) Generate the W1-W2(i) pairs. For each sense i of W2 (1 ≤ i ≤ m), with similarity list W2(i)(1), ..., W2(i)(ki): Similarity lists: (W2(1), W2(1)(1), W2(1)(2), ..., W2(1)(k1)), (W2(2), W2(2)(1), W2(2)(2), ..., W2(2)(k2)), ..., (W2(m), W2(m)(1), W2(m)(2), ..., W2(m)(km)) Similarity pair-lists: (W1-W2(1), W1-W2(1)(1), ..., W1-W2(1)(k1)), (W1-W2(2), W1-W2(2)(1), ..., W1-W2(2)(k2)), ..., (W1-W2(m), W1-W2(m)(1), ..., W1-W2(m)(km)) 39
Step 3 Search the Internet and rank the senses W2(i). Use AltaVista to generate the queries: ("W1 W2(i)" OR "W1 W2(i)(1)" OR "W1 W2(i)(2)" OR ... OR "W1 W2(i)(ki)") and ((W1 NEAR W2(i)) OR (W1 NEAR W2(i)(1)) OR (W1 NEAR W2(i)(2)) OR ... OR (W1 NEAR W2(i)(ki))) for all 1 ≤ i ≤ m, then rank the m senses of W2 by the number of hits, i.e. by how strongly they relate to W1. 40
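The query construction is plain string-building, sketched below (the quoting and NEAR syntax follow the slide's AltaVista-era operators, not any current search engine; the function name is invented):

```python
def build_queries(w1, sense_lists):
    """Build the two query strings per sense of W2.

    sense_lists: one similarity list (list of words) per sense of W2.
    Returns, per sense, a (phrase_query, near_query) pair.
    """
    queries = []
    for words in sense_lists:
        phrase_q = " OR ".join(f'"{w1} {w}"' for w in words)
        near_q = " OR ".join(f"({w1} NEAR {w})" for w in words)
        queries.append((f"({phrase_q})", f"({near_q})"))
    return queries
```

Each sense's hit count for these queries then supplies the ranking of Step 3.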
Conceptual density algorithm a measure of the relatedness between words Approach: Build a linguistic context for each sense of a verb & a noun Measure the # of common nouns Take the synset glosses as micro-contexts. 41
Algorithm 2 INPUT: a semantically untagged (verb-noun) pair and a ranking of noun senses (the output of Algorithm 1) OUTPUT: a sense-tagged (verb-noun) pair Step 1: Given a verb-noun pair V-N, denote by <v1, v2, ..., vh> and <n1, n2, ..., nl> the possible senses of V and N. Step 2: Rank the senses of N using Algorithm 1; use only the first t senses. 42
Step 3: Conceptual Density For each possible pair vi-nj compute the conceptual density as follows: a) Extract all the glosses from the WordNet sub-hierarchy containing vi. b) Determine the nouns from these glosses: the noun-context of the verb. Each such noun is stored with a weight w that indicates the level of the sub-hierarchy of the verb concept in whose gloss the noun was found. 43
Conceptual density (cont) c) Determine the nouns from the sub-hierarchy including nj. d) Compute the conceptual density Cij of the concepts shared between the nouns obtained at b) and the nouns obtained at c): C_ij = ( Σ_{k=1..cd_ij} w_k ) / log(descendents_j) 44
Conceptual density (cont) C_ij = ( Σ_{k=1..cd_ij} w_k ) / log(descendents_j), where: cd_ij = # of common concepts between the hierarchies of v_i and n_j; w_k = the levels of those nouns in the hierarchy of v_i; descendents_j = # of words within the hierarchy of the noun n_j. Step 4: C_ij is used to rank each pair v_i-n_j. 45
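Assuming the formula reads as the sum of the common nouns' weights divided by the log of the noun-hierarchy size (a reconstruction from the definitions above), the computation is a one-liner; the guard for tiny hierarchies is my addition:

```python
import math

def conceptual_density(common_noun_weights, descendents_j):
    """C_ij = (sum of weights w_k of the common nouns) / log(descendents_j).

    common_noun_weights: one level-based weight w_k per concept shared
    between the verb and noun hierarchies (so its length is cd_ij).
    descendents_j: number of words within the hierarchy of noun n_j.
    """
    if descendents_j <= 1:
        return 0.0  # guard: log(1) == 0 would divide by zero
    return sum(common_noun_weights) / math.log(descendents_j)
```

More shared concepts raise C_ij, while a larger (more generic) noun hierarchy lowers it, which is exactly the trade-off the ranking in Step 4 exploits.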