Word Sense Disambiguation

Word Sense Disambiguation. Computational Lexical Semantics. Gemma Boleda (Universitat Politècnica de Catalunya) and Stefan Evert (University of Osnabrück). ESSLLI, Bordeaux, France, July 2009. 1 / 56

Thanks. These slides are based on Jurafsky & Martin (2004: chapter 20) and material by Ann Copestake (course at UPF, 2008). 2 / 56

Outline Overview 1 Overview 2 3 4 5 3 / 56

Overview. Word Sense Disambiguation: the task of selecting the correct sense for a word in context. Potentially helpful in many applications: machine translation, question answering, information retrieval, ... We focus on WSD as a stand-alone task (artificial!). 5 / 56

WSD algorithm, basic form:
input: a word in context and a fixed inventory of word senses
output: the correct word sense for that use
Which context? Words surrounding the target word: annotated? just the words, in no particular order? context size?
Which inventory? Task-dependent:
machine translation from English to Spanish: the set of Spanish translations
speech synthesis: homographs with differing pronunciations (e.g., bass)
stand-alone task: a lexical resource (usually WordNet)
6 / 56

An example
WordNet Sense   Target Word in Context
bass4           ... fish as Pacific salmon and striped bass and ...
bass4           ... produce filets of smoked bass or sturgeon ...
bass7           ... exciting jazz bass player since Ray Brown ...
bass7           ... play bass because he doesn't have to solo ...
Figure: Possible inventory of sense tags for the word bass. 7 / 56

Variants of the task
lexical sample task: WSD for a small set of target words; a number of corpus instances are selected and labeled (similar to the task in our case study); supervised approaches; word-specific classifiers
all-words: WSD for all content words in a text; similar to POS-tagging, but with a very large tagset! data sparseness: not enough training data for every word
8 / 56

Outline Overview 1 Overview 2 3 4 5 9 / 56

Feature extraction
supervised approach: need to identify features that are predictive of word senses
fundamental (and early) insight: look at the context of the target word, e.g. a 1-word window: ... smoked bass or ... / ... jazz bass player ...
10 / 56

Method
process the dataset (POS-tagging, lemmatization, parsing)
build a feature representation encoding the relevant linguistic information
two main feature types: (1) collocational features, (2) bag-of-words features
11 / 56

Collocational features: features that take order or syntactic relations into account, restricted to the immediate context of the target word (usually a fixed window). For example: lemma and part of speech within a two-word window; syntactic function of the target word. 12 / 56

Collocational features: Example
Example (20.1): An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
2-word window representation, using parts of speech:
[guitar, NN, and, CC, player, NN, stand, VB]
[w-2, P-2, w-1, P-1, w+1, P+1, w+2, P+2]
13 / 56
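A minimal Python sketch (not from the slides) of how such a window representation could be built, assuming the sentence is already POS-tagged as a list of (word, POS) pairs; the function name collocational_features is invented for illustration:

def collocational_features(tagged, target_index, window=2):
    # Build [w-2, P-2, w-1, P-1, w+1, P+1, w+2, P+2] around the target,
    # padding with None when the window extends past the sentence edges.
    features = []
    for offset in list(range(-window, 0)) + list(range(1, window + 1)):
        i = target_index + offset
        word, pos = tagged[i] if 0 <= i < len(tagged) else (None, None)
        features.extend([word, pos])
    return features

# Example (20.1), target "bass" at index 2:
tagged = [("guitar", "NN"), ("and", "CC"), ("bass", "NN"),
          ("player", "NN"), ("stand", "VB")]
print(collocational_features(tagged, 2))
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']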

Bag-of-words features: lexical features; pre-selected words that are potentially relevant for sense distinctions. For example: for the all-words task, frequent content words in the corpus; for the lexical sample task, content words in the sentences containing the target word. Test for the presence/absence of each such word in the selected context. 14 / 56

Bag-of-words features: Example Example: (20.1) An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps. pre-selected words: [fishing, big, sound, player, fly] feature vector: [0, 0, 0, 1, 0] 15 / 56
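A minimal sketch of the presence/absence test; the function name and the toy context are illustrative only:

def bag_of_words_features(context_words, selected_words):
    # 1 if the pre-selected word occurs anywhere in the context, else 0.
    context = set(context_words)
    return [1 if w in context else 0 for w in selected_words]

selected = ["fishing", "big", "sound", "player", "fly"]
context = "an electric guitar and bass player stand off to one side".split()
print(bag_of_words_features(context, selected))   # [0, 0, 0, 1, 0]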

More on features
collocational cues account for: collocational effects (bass+player = bass7); syntax-related sense differences (serve breakfast to customers vs. serve Philadelphia)
bag-of-words features account for topic- and domain-related effects; resemblance to semantic fields, frames, ...
complementary information: both feature types are usually combined
16 / 56

Combined representation: Example. Simplified representation for 2 sentences (... jazz bass player ... and ... smoked bass or ...): collocational features corresponding to a 1-word window; bag-of-words features for only fishing and player. 17 / 56

Combined representation, Weka (ARFF) format, for ... jazz bass player ... and ... smoked bass or ...:
@relation bass
@attribute wordl1 {jazz,smoke}
@attribute posl1 {CC,VBD}
@attribute wordr1 {player,or}
@attribute posr1 {CC,NN}
@attribute fishing {0,1}
@attribute player {0,1}
@attribute sense {s4,s7}
@data
jazz,CC,player,NN,0,1,s7
smoke,VBD,or,NN,0,0,s4
18 / 56

Method: any supervised algorithm (Decision Trees, for example J48; Decision Lists, similar to Decision Trees; Naive Bayes, probabilistic; ...) and any tool (Weka, R, SVMTool, your own implementation, ...). 19 / 56
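To make this step concrete, a hypothetical sketch using scikit-learn (one possible alternative to the tools listed above); the feature dictionaries and sense labels are invented toy data in the spirit of the combined representation:

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# One feature dictionary per labeled instance of "bass".
train_X = [
    {"w-1": "jazz",   "pos-1": "NN",  "w+1": "player", "pos+1": "NN", "player": 1},
    {"w-1": "smoked", "pos-1": "VBD", "w+1": "or",     "pos+1": "CC", "fishing": 0},
]
train_y = ["bass7", "bass4"]

vec = DictVectorizer()                    # one-hot encodes string-valued features
clf = MultinomialNB().fit(vec.fit_transform(train_X), train_y)

test = {"w-1": "electric", "w+1": "player", "player": 1}
print(clf.predict(vec.transform([test])))   # -> ['bass7']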

Interim Summary supervised approaches use sense-annotated datasets need many annotated examples for every word relevant information in the context: lexico-syntactic information (collocational features) lexical information (bag of words features) information is encoded in the form of features... and a classifier is trained to distinguish different senses of a given word 20 / 56

Outline Overview 1 Overview 2 3 4 5 21 / 56

Extrinsic evaluation. Long-term goal: improve performance in an end-to-end application. Extrinsic evaluation (or task-based, end-to-end, in vivo evaluation). Example: Word Sense Disambiguation for (Cross-Lingual) Information Retrieval, http://ixa2.si.ehu.es/clirwsd 22 / 56

Intrinsic evaluation. However, extrinsic evaluation is difficult and time-consuming. Intrinsic evaluation (or in vitro evaluation): treat the WSD component as if it were a stand-alone system. Measure: sense accuracy (percentage of words correctly tagged), Accuracy = matches / total. Method: held-out data from the same sense-tagged corpora used for training (train-test methodology). To standardize datasets and methods: SensEval and SemEval competitions. Example: our case study. 23 / 56

Baseline: the performance we would get without much knowledge / with a simple approach; necessary for any Machine Learning experiment (how good is 70%?). Simplest baseline: most frequent sense (WordNet: first-sense heuristic, since senses are ordered). A very powerful baseline, because of the skewed distribution of senses in corpora. BUT we need access to annotated data for every word in the dataset to estimate sense frequencies, so this is a knowledge-laden baseline. 24 / 56
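A small sketch combining the most-frequent-sense baseline with the sense-accuracy measure from the previous slide; the toy sense lists are invented:

from collections import Counter

def most_frequent_sense(train_senses):
    # Predict the sense seen most often in the annotated training data.
    return Counter(train_senses).most_common(1)[0][0]

def accuracy(predicted, gold):
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

train = ["bass7", "bass7", "bass4", "bass7"]
gold  = ["bass7", "bass4", "bass7"]
mfs = most_frequent_sense(train)              # 'bass7'
print(accuracy([mfs] * len(gold), gold))      # 0.666...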

Ceiling, or upper bound, for performance: inter-coder agreement. All-words corpora using WordNet: A_o ≈ 0.75-0.8; more coarse-grained sense distinctions: A_o ≈ 0.9. Another possibility: avoid annotation by using pseudowords (e.g., banana-door). However, this is unrealistic: real polysemy is not like banana-doors! Need to find better ways to create pseudowords. 25 / 56

Outline Overview 1 Overview 2 3 4 5 26 / 56

Sense-labeled corpora give accurate information, but they are scarce! Need other sources: dictionaries, thesauri, selectional restrictions, ... Idea: use dictionaries as corpora (identifying related words in definitions and examples). 27 / 56

An example
Example (20.10): The bank can guarantee deposits will eventually cover future tuition costs because it invests in adjustable-rate mortgage securities.
bank1. Gloss: a financial institution that accepts deposits and channels the money into lending activities. Examples: "he cashed a check at the bank"; "that bank holds the mortgage on my home"
bank2. Gloss: sloping land (especially beside a body of water). Examples: "they pulled the canoe up on the bank"; "he sat on the bank of the river"
Figure: WordNet information for two senses of bank. 28 / 56

Signatures. A signature is the set of words that characterizes a given sense of a target word, extracted from dictionaries, thesauri, tagged corpora, ... For example (20.10):
bank1: financial, institution, accept, deposit, channel, money, lending, activity, cash, check, hold, mortgage, home
bank2: sloping, land, body, water, pull, canoe, bank, sit, river
30 / 56

Lesk Algorithm
function SIMPLIFIED LESK(word, sentence) returns best sense of word
  best-sense ← most frequent sense for word
  max-overlap ← 0
  context ← set of words in sentence
  for each sense in senses of word do
    signature ← set of words in the gloss and examples of sense
    overlap ← COMPUTEOVERLAP(signature, context)
    if overlap > max-overlap then
      max-overlap ← overlap
      best-sense ← sense
  end
  return(best-sense)
31 / 56

Lesk Algorithm
Example: "she strolled by the river bank"
best-sense ← bank1; max-overlap ← 0
context ← {she, stroll, river}
sense bank1: signature ← {financial, institution, accept, deposit, channel, money, lending, activity, cash, check, hold, mortgage, home}; overlap ← 0; 0 > 0 fails
sense bank2: signature ← {sloping, land, body, water, pull, canoe, bank, sit, river}; overlap ← 1; 1 > 0 succeeds; best-sense ← bank2; max-overlap ← 1
return bank2
32 / 56
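A minimal, runnable Python version of SIMPLIFIED LESK, assuming the sense inventory is given as plain strings (gloss plus examples); it skips the lemmatization used in the trace above but reaches the same decision for the bank example:

def simplified_lesk(word, sentence, sense_inventory, stopwords=frozenset()):
    # sense_inventory: {sense: gloss-and-examples text}; the first entry
    # stands in for the most frequent sense (the default answer).
    context = {w for w in sentence.lower().split()
               if w != word and w not in stopwords}
    senses = list(sense_inventory)
    best_sense, max_overlap = senses[0], 0
    for sense, signature_text in sense_inventory.items():
        signature = set(signature_text.lower().split())
        overlap = len(signature & context)       # COMPUTEOVERLAP
        if overlap > max_overlap:
            max_overlap, best_sense = overlap, sense
    return best_sense

inventory = {
    "bank1": "financial institution accept deposit channel money lending"
             " activity cash check hold mortgage home",
    "bank2": "sloping land body water pull canoe bank sit river",
}
print(simplified_lesk("bank", "she strolled by the river bank", inventory))
# bank2 (overlap 1, via 'river')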

Right intuition: words that appear in dictionary definitions and examples are relevant to a given sense. Problem: data sparseness (dictionary entries are short and do not always include examples). The Lesk algorithm is currently used as a baseline, BUT many extensions are possible and have been tried (generalizations over lemmata, corpus data, weighting, ...), AND dictionary-derived features can be (and are) used in standard supervised approaches. 33 / 56

Interim Summary. Information encoded in dictionaries (definitions, examples) is useful for WSD; it can be used exclusively or in addition to other information (collocations, bag of words) in supervised approaches. The Lesk algorithm disambiguates solely on the basis of dictionary information: the overlap between the dictionary entry and the context of the word occurrence. The most frequent sense and the Lesk algorithm are used as baselines for evaluation. 34 / 56

Overview we have a huge number of classes (senses) need large hand-built resources: supervised approaches need large annotated corpora (unrealistic) dictionary methods need large dictionaries, which, even if available, often do not provide enough information alternatives: Minimally supervised WSD Unsupervised WSD both make use of unannotated data these approaches are not as successful as supervised approaches 35 / 56

Minimally supervised WSD: Bootstrapping. For a given word (for example, plant): start with a small number of annotated examples (seeds) for each sense; collect additional examples for each sense based on their similarity to the annotated examples; iterate. 36 / 56

Bootstrapping: example plant (Yarowsky 1995) sense A: living entity; sense B: building first examples: those that appear with life (sense A) and manufacturing (sense B) Figure: Bootstrapping word senses. Figure 20.4 in Jurafsky & Martin. 37 / 56

Yarowsky 1995. Influential insights (used as heuristics in Yarowsky's algorithm): one sense per collocation (life+plant = plant A; manufacturing+plant = plant B); one sense per discourse (if a word appears multiple times in a text, probably all occurrences will bear the same sense), also useful to enlarge datasets. 38 / 56
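A schematic sketch of the bootstrapping loop (not Yarowsky's exact algorithm): train_classifier and predict_with_confidence are placeholders for any supervised learner (Yarowsky used decision lists), and the confidence threshold is an assumed parameter:

def bootstrap(seed_examples, unlabeled, train_classifier,
              predict_with_confidence, threshold=0.9, max_iter=10):
    labeled = list(seed_examples)            # [(features, sense), ...]
    for _ in range(max_iter):
        clf = train_classifier(labeled)
        newly_labeled, still_unlabeled = [], []
        for x in unlabeled:
            sense, confidence = predict_with_confidence(clf, x)
            if confidence >= threshold:      # accept only confident labels
                newly_labeled.append((x, sense))
            else:
                still_unlabeled.append(x)
        if not newly_labeled:                # nothing new: stop iterating
            break
        labeled.extend(newly_labeled)
        unlabeled = still_unlabeled
    return train_classifier(labeled)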

Unsupervised WSD: no previous knowledge, no human-defined word senses; simply group examples according to their similarity (clustering) and infer senses from that. Problem: hard to interpret and evaluate. 39 / 56

Outline Overview 1 Overview 2 3 4 5 40 / 56

Interim summary. WSD can be framed as a standard classification task: training data, feature definition, classifier, evaluation (supervised approaches). Most useful information: syntactic and lexical context (collocational features); words related to the different senses of a given word (bag-of-words features); words in dictionary (thesaurus, etc.) entries. Other approaches try to make use of unannotated data (bootstrapping, unsupervised learning); this would be great, but they are not as successful as supervised approaches (and harder to interpret and work with). 41 / 56

Useful empirical facts. Skewed distribution of senses: most-frequent-sense baseline, a heuristic when no other information is available, BUT the distribution varies with the text/corpus! (cone in a geometry textbook). One sense per collocation (bass+player = bass7): simple cues for sense classification (heuristic). One sense per discourse: different occurrences of a word in a given text tend to be used in the same sense; a heuristic for classification and for data gathering. 42 / 56

Conceptual problems. The task as currently defined does not allow for generalization over different words: learning is word-specific; the number of classes = the number of senses, equal to or greater than the number of words! Need training data for every sense of every word, but most words have low frequency (Zipf's law), and there is no chance with unknown words. This wouldn't be a problem if word sense alternation were like bank1 / bank2 (homonymy)... but many alternations are systematic! (regular polysemy, metonymy, metaphor) 43 / 56

Regular polysemy
conversion: bank (N), financial institution; bank (V), put money in a bank; same for sugar, hammer, tango, etc. (also derivation: -ize)
adjectives (Boleda 2007): qualitative vs. relational: cara familiar ('familiar face') vs. reunió familiar ('family meeting'); event-related vs. qualitative: fet sabut ('known fact') vs. home sabut ('wise man')
44 / 56

Regular polysemy: mass/count
animal/meat: chicken1: animal, chicken2: meat; lamb1: animal, lamb2: meat; ...
portions/kinds: two beers = two servings of beer or two types of beer
generally: thing/derived substance (grinding): After several lorries had run over the body, there was rabbit splattered all over the road.
45 / 56

Regular polysemy: verb alternations
causative/inchoative (Levin 1993): John broke the window / The window broke
Spanish psychological verbs: Le preocupa la situación ('the situation worries him/her'; Dative + Subject) vs. Bruna no quiere preocuparla ('Bruna does not want to worry her'; Subject + Accusative)
46 / 56

Contextual coercion / Logical metonymy (Also see course by Louise McNally.) object to eventuality (Pustejovsky 1995) Mary enjoyed the book. After three martinis, Kim felt much happier. adjectives (Pustejovsky 1995): event selection fast runner vs. fast typist vs. fast car 47 / 56

Metonymy
container/content: He drank a bottle of whisky. Morphology again: He drank a bottleful of whisky (-ful suffixation)
fruit/plant: olive, grapefruit, ... Spanish: often the tree is masculine (olivo, naranjo), the fruit feminine (oliva, naranja)
figure/ground: Kim painted the door / Kim walked through the door
48 / 56

Metonymy
country names: Location: I live in China. Government: The US and Libya have agreed to work together to solve ... Team (sports): England won last year's World Cup.
more generally, institutions: Barcelona applied for the Olympic Games. The banks won't give credit now. The newspapers criticized this policy.
object/person: The cello is playing badly.
Not so regular, contextual metaphor: The ham sandwich wants his check. (Lakoff & Johnson 1980)
49 / 56

Metaphor
physical → mental: depart1 / arrive1 / go1: physical transfer; depart2 / arrive2 / go2: mental transfer
concrete → abstract: aigua clara ('clear water') vs. estil clar ('clear style'); cabells negres ('black hair') vs. humor negre ('black humour')
50 / 56

To sum up: pervasive systematicity in sense alternations (regular polysemy, metonymy, metaphor), and it is productive: We found a little, hairy wampimuk sleeping behind the tree (McDonald & Ramscar 2001); Wampimuk soup is delicious! An inherent property of language: analogical reasoning (psychology again). WSD as currently handled cannot capture these regularities: a theoretical and practical problem! 51 / 56

WSD and regularities: what one can do. Generalize on FEATURES, e.g. jazz → MUSIC-STYLE = {jazz, rock, blues, ...}, provided some lexical resource is available that encodes this information: He is a jazz bass player. / I love bass solos in rock music. Problem: when (and how) to generalize? when to stop? 52 / 56

WSD and regularities: what would be desirable. Train on chicken and use the data for lamb, wampimuk, ... Resources such as WordNet encode the meat/animal distinction:
WordNet info for chicken: chicken1: the flesh of a chicken used for food; chicken2: a domesticated gallinaceous bird (hyponym); chicken3: a person who lacks confidence; chicken4: a foolhardy competition.
WordNet info for lamb: lamb1: young sheep; lamb2: a person easily deceived or cheated; lamb3: a sweet innocent mild-mannered person; lamb4: the flesh of a young domestic sheep eaten as food.
WHAT IS MISSING: links between chicken2 and lamb1, and between chicken1 and lamb4 (note the other senses). 53 / 56

Word Sense Disambiguation Computational Lexical Semantics Gemma Boleda 1 Stefan Evert 2 1 Universitat Politècnica de Catalunya 2 University of Osnabrück ESSLLI. Bordeaux, France, July 2009. 54 / 56

Classifier example 1: Naive Bayes. A probabilistic classifier (related to HMMs): choosing the best sense amounts to choosing the most probable sense given the feature vector (a conditional probability), BUT it is impossible to estimate this directly (too many feature combinations). Two strategies: decompose the probabilities (Bayes' rule), which makes them easier to estimate, and make an unrealistic assumption: the features are independent of each other given the sense (hence "Naive" Bayes). Training the classifier = estimating the probabilities from the sense-tagged corpus. 55 / 56
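A minimal sketch of the resulting decision rule, choosing the sense that maximizes P(s) * prod_j P(f_j | s); the smoothed probability tables are assumed to have been estimated from a sense-tagged corpus, and the numbers below are invented:

import math

def naive_bayes_sense(features, prior, likelihood):
    # prior[s] = P(s); likelihood[s][f] = P(f | s) (smoothed estimates).
    # Summing logs avoids numeric underflow for long feature vectors.
    def score(sense):
        return math.log(prior[sense]) + sum(
            math.log(likelihood[sense].get(f, 1e-6)) for f in features)
    return max(prior, key=score)

prior = {"bass4": 0.4, "bass7": 0.6}
likelihood = {"bass4": {"fish": 0.30, "player": 0.01},
              "bass7": {"fish": 0.01, "player": 0.25}}
print(naive_bayes_sense(["player"], prior, likelihood))   # bass7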

Classifier example 2: Decision Lists. Similar to decision trees (difference: only one condition per test).
Rule                    Sense
fish within window      bass4
striped bass            bass4
guitar within window    bass7
play/V bass             bass7
Figure: Decision list for the word bass
To learn a decision list classifier: generate and order the tests according to the training data. 56 / 56
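A minimal sketch of applying such a decision list, assuming the tests have already been generated and ordered (e.g., by log-likelihood ratio on the training data) and simplified here to single-word tests:

def apply_decision_list(rules, context_words, default_sense):
    # rules: ordered list of (test_word, sense); the first matching test wins.
    context = set(context_words)
    for test_word, sense in rules:
        if test_word in context:
            return sense
    return default_sense              # fall back to the most frequent sense

rules = [("fish", "bass4"), ("striped", "bass4"),
         ("guitar", "bass7"), ("play", "bass7")]
context = "an exciting jazz bass player with a guitar".split()
print(apply_decision_list(rules, context, "bass4"))   # bass7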