Word Sense Disambiguation


Carlo Strapparava
FBK-Irst, Istituto per la Ricerca Scientifica e Tecnologica
I-38050 Povo, Trento, Italy
strappa@fbk.eu

The problem of WSD
What is the idea of word sense disambiguation?
- Many words have several meanings or senses.
- For such words, given out of context, there is ambiguity about how they should be interpreted.
- WSD is the task of examining word tokens in context and specifying exactly which sense of each word is being used.

Computers versus Humans
- Polysemy: most words have many possible meanings. A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human.
- Ambiguity is rarely a problem for humans in their day-to-day communication, except in extreme cases.

Ambiguity for a Computer
- The fisherman jumped off the bank and into the water.
- The bank down the street was robbed!
- Back in the day, we had an entire bank of computers devoted to this problem.
- The bank in that road is entirely too steep and is really dangerous.
- The plane took a bank to the left, and then headed off towards the mountains.

Other examples
Two examples:
1. In my office there are 2 tables and 4 chairs.
2. This year the chair of the ACL conference is prof. W. D.
For humans this is not a problem:
- Ex. 1: chair in the sense of a piece of furniture
- Ex. 2: chair in the sense of the role covered by a person
For machines it is not trivial: it is one of the hardest problems in NLP.

Brief Historical Overview
- 1970s-1980s: rule-based systems, relying on hand-crafted knowledge sources.
- 1990s: corpus-based approaches, with a dependence on sense-tagged text. (Ide and Veronis, 1998) give an overview of the history from the early days to 1998.
- 2000s: hybrid systems, minimizing or eliminating the use of sense-tagged text and taking advantage of the Web.

Interdisciplinary Connections
- Cognitive Science & Psychology: Quillian (1968), Collins and Loftus (1975): spreading activation; Hirst (1987) developed a marker-passing model.
- Linguistics: Fodor & Katz (1963): selectional preferences; Resnik (1993) pursued these statistically.
- Philosophy of Language: Wittgenstein (1958): meaning as use. "For a large class of cases - though not for all - in which we employ the word 'meaning' it can be defined thus: the meaning of a word is its use in the language."

Why do we need the senses?
- Sense disambiguation is an intermediate task (Wilks and Stevenson, 1996).
- It is necessary for most natural language tasks that involve language understanding (e.g. message understanding, man-machine communication, ...).

Where WSD could be useful
- Machine translation
- Information retrieval and hypertext navigation
- Content and thematic analysis
- Grammatical analysis
- Speech processing
- Text processing

WSD for machine translation
- Sense disambiguation is essential for the proper translation of words.
- For example, the French word "grille", depending on the context, can be translated as railings, gate, bar, grid, scale, schedule, etc.

WSD for information retrieval
- WSD could be useful for information retrieval and hypertext navigation.
- When searching for specific keywords, it is desirable to eliminate documents where the words are used in an inappropriate sense. For example, when searching for judicial references, "court" as associated with royalty rather than with law.
- Voorhees (1999); Krovetz (1997, 2002) => more benefits in Cross-Language IR.

WSD for grammatical analysis
- Sense disambiguation can be useful in part-of-speech tagging. Ex.: "L'étagère plie sous les livres" [the shelf is bending under (the weight of) the books]: "livres" (which can mean books or pounds) is masculine in the former sense, feminine in the latter.
- Prepositional phrase attachment (Hindle and Rooth, 1993).
- In general it can restrict the space of parses (Alshawi and Carter, 1994).

WSD for speech processing
- Sense disambiguation is required for the correct phonetization of words in speech synthesis. Ex. the word "conjure": "He conjured up an image" vs. "I conjure you to help me" (Yarowsky, 1996).
- It is also needed for word segmentation and homophone discrimination in speech recognition.

WSD for content representation
- WSD is useful to obtain a content-based representation of documents. Ex.: a user model for multilingual news web sites.
- A content-based technique represents documents as a starting point to build a model of the user's interests.
- A user model is built using a word-sense-based representation of the visited documents.
- A filtering procedure dynamically predicts new documents on the basis of the user's interest model.

A simple idea: selectional restriction-based disambiguation
Examples from the Wall Street Journal:
- "... washing dishes ..."
- "... Ms. Chen works well, stir-frying several simple dishes ..."
Two senses of "dishes": Physical Objects vs. Meals or Recipes.
- The selection can be based on restrictions imposed by "wash" or "stir-fry" on their PATIENT role.
- The object of "stir-fry" should be something edible.
- The sense "meal" conflicts with the restrictions imposed by "wash".

Selectional restrictions (2)
However there are some limitations:
- Sometimes the available selectional restrictions are too general -> "what kind of dishes do you recommend?"
- There are violations of selectional restrictions that are nevertheless perfectly well-formed and interpretable sentences (e.g. metaphors, metonymy).
- The approach requires a huge knowledge base. Some attempts: FrameNet (Fillmore @ Berkeley), LCS (Lexical Conceptual Structure) - Jackendoff 83.

WSD methodology
A WSD task involves two steps:
1. Sense repository -> the determination of all the different senses for every word relevant to the text under consideration
2. Sense assignment -> a means to assign each occurrence of a word to the appropriate sense

Sense repositories
- Distinguish word senses in texts with respect to a dictionary, thesaurus, etc.: WordNet, LDOCE, Roget's Thesaurus.
- A word is assumed to have a finite number of discrete senses.
- However, a word often has somewhat related senses, and it is unclear whether and where to draw lines between them.
- This raises the issue of a coarse-grained task.

Word senses (ex. WordNet)
The noun "title" has 10 senses (first 7 from tagged texts):
1. title, statute title -- (a heading that names a statute or legislative bill; gives a brief summary of the matters it deals with; "Title 8 provided federal help for schools")
2. title -- (the name of a work of art or literary composition etc.; "he looked for books with the word 'jazz' in the title"; "he refused to give titles to his paintings"; "I can never remember movie titles")
3. title -- (a general or descriptive heading for a section of a written work; "the novel had chapter titles")
4. championship, title -- (the status of being a champion; "he held the title for two years")
5. deed, deed of conveyance, title -- (a legal document signed and sealed and delivered to effect a transfer of property and to show the legal right to possess it; "he signed the deed"; "he kept the title to his car in the glove compartment")
6. title -- (an identifying appellation signifying status or function: e.g. Mr. or General; "the professor didn't like his friends to use his formal title")
7. title, claim -- (an established or recognized right; "a strong legal claim to the property"; "he had no documents confirming his title to his father's estate")
8. title -- ((usually plural) written material introduced into a movie or TV show to give credits or represent dialogue or explain an action; "the titles go by faster than I can read")
9. title -- (an appellation signifying nobility; "'your majesty' is the appropriate title to use in addressing a king")
10. claim, title -- (an informal right to something; "his claim on her attentions"; "his title to fame")
The verb "title" has 1 sense (from tagged texts):
1. entitle, title -- (give a title to)

Sense repositories - a different approach: Word Sense Discrimination
- Sense discrimination divides the occurrences of a word into a number of classes by determining for any two occurrences whether they belong to the same sense or not.
- We need only determine which occurrences have the same meaning, not what the meaning actually is.
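As an aside, a sense inventory like the one above can be browsed programmatically. A minimal sketch with NLTK's WordNet interface (assuming the wordnet corpus has been downloaded; not part of the original slides):

```python
# Minimal sketch: listing WordNet senses for "title" with NLTK.
# Assumes `pip install nltk` and `nltk.download('wordnet')` have been run.
from nltk.corpus import wordnet as wn

for synset in wn.synsets('title', pos=wn.NOUN):
    # Each synset groups synonymous lemmas and carries a gloss (definition).
    lemmas = ', '.join(lemma.name() for lemma in synset.lemmas())
    print(f"{synset.name()}: {lemmas} -- {synset.definition()}")
```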

Word Sense Discrimination
Discrimination = clustering word senses in a text.
Pros:
- no need for a-priori dictionary definitions
- agglomerative clustering is a well-studied field
Cons:
- the sense inventory varies from one text to another
- hard to evaluate
- hard to standardize
(Schütze 98) Automatic Word Sense Discrimination, ACL 98

WSD methodology (recap)
A WSD task involves two steps:
1. Sense repository -> the determination of all the different senses for every word relevant to the text under consideration
2. Sense assignment -> a means to assign each occurrence of a word to the appropriate sense

Sense assignment
The assignment of words to senses relies on two major sources of information:
- the context of the word to be disambiguated, in the broad sense: this includes information contained within the text in which the word appears
- external knowledge sources, including lexical, encyclopedic, etc. resources, as well as (possibly) hand-devised knowledge sources

Sense assignment (2)
All disambiguation work involves matching the context of the word to be disambiguated with either
- information from an external knowledge source (knowledge-driven WSD), or
- information about the contexts of previously disambiguated instances of the word, derived from corpora (corpus-based WSD)

Some approaches
- Corpus-based approaches
  - Supervised algorithms: Exemplar-Based Learning (Ng & Lee 96), Naïve Bayes
  - Semi-supervised algorithms: bootstrapping approaches (Yarowsky 95)
- Dictionary-based approaches: Lesk 86
- Hybrid algorithms (supervised + dictionary): Mihalcea 00

Brief review: What is Supervised Learning?
- Collect a set of examples that illustrate the various possible classifications or outcomes of an event.
- Identify patterns in the examples associated with each particular class of the event.
- Generalize those patterns into rules.
- Apply the rules to classify a new event.

Learn from these examples: when do I go to the university?
(Toy training table, shown twice on the slides: for each day 1-4 it records the class "Go to University?" and three features, F1 "Hot Outside?", F2 "Slept Well?", F3 "Ate Well?", with YES marks in some of the cells.)

Supervised WSD
Supervised WSD: the class of methods that induces a classifier from manually sense-tagged text using machine learning techniques.
Resources:
- sense-tagged text
- a dictionary (implicit source of the sense inventory)
- syntactic analysis (POS tagger, chunker, parser, ...)
Scope:
- typically one target word per context
- the part of speech of the target word is resolved
- lends itself to the targeted-word formulation
This looks at WSD as a classification problem, where a target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs.

An Example of Sense Tagged Text
- Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers.
- My bank/1 charges too much for an overdraft.
- I went to the bank/1 to deposit my check and get a new ATM card.
- The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.
- My grandfather planted his pole in the bank/2 and got a great big catfish!
- The bank/2 is pretty muddy, I can't walk there.

Two Bags of Words (co-occurrences in the window of context)
FINANCIAL_BANK_BAG: a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were
RIVER_BANK_BAG: a an and big campus cant catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West

Simple Supervised Approach
Given a sentence S containing "bank":

  For each word W_i in S:
      if W_i is in FINANCIAL_BANK_BAG then Sense_1 = Sense_1 + 1
      if W_i is in RIVER_BANK_BAG then Sense_2 = Sense_2 + 1
  If Sense_1 > Sense_2 then print "Financial"
  else if Sense_2 > Sense_1 then print "River"
  else print "Can't Decide"
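A minimal runnable sketch of this counting scheme, using a trimmed, lowercased subset of the two bags above (the function name classify_bank is just illustrative):

```python
# Minimal sketch of the simple bag-of-words sense tagger described above.
# The bag contents are a lowercased subset of the slides' bags.
FINANCIAL_BANK_BAG = {"atm", "card", "charges", "check", "deposit", "overdraft", "robbers"}
RIVER_BANK_BAG = {"campus", "catfish", "mississippi", "muddy", "pole", "river", "walk"}

def classify_bank(sentence: str) -> str:
    words = sentence.lower().split()
    sense_1 = sum(w in FINANCIAL_BANK_BAG for w in words)  # financial evidence
    sense_2 = sum(w in RIVER_BANK_BAG for w in words)      # river evidence
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(classify_bank("I went to the bank to deposit my check"))  # -> Financial
```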

General Supervised Methodology
- Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities (one tagged word per instance: lexical-sample disambiguation).
- Select a set of features with which to represent context: co-occurrences, collocations, POS tags, verb-object relations, etc.
- Convert the sense-tagged training instances to feature vectors.
- Apply a machine learning algorithm to induce a classifier:
  - form: structure or relation among features
  - parameters: strength of feature interactions
- Convert a held-out sample of test data into feature vectors.
- Apply the classifier to the test instances to assign a sense tag.

From Text to Feature Vectors
(S1) My/pronoun grandfather/noun used/verb to/prep fish/verb along/adv the/det banks/shore of/prep the/det Mississippi/noun River/noun.
(S2) The/det bank/finance issued/verb a/det check/noun for/prep the/det amount/noun of/prep interest/noun.

        P-2   P-1   P+1   P+2   fish  check  river  interest  SENSE TAG
  S1    adv   det   prep  det   Y     N      Y      N         SHORE
  S2    -     det   verb  det   N     Y      N      Y         FINANCE
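A small sketch of how a feature vector like the S1/S2 rows above could be assembled from a POS-tagged sentence; the keyword list and the dictionary layout are illustrative choices, not part of the original slides:

```python
# Illustrative sketch: build a feature vector (POS window + keyword flags)
# from a POS-tagged sentence given the position of the target word.
KEYWORDS = ["fish", "check", "river", "interest"]  # example keyword features

def make_features(tagged, target_idx):
    """tagged: list of (word, pos) pairs; target_idx: index of the target word."""
    feats = {}
    for offset in (-2, -1, 1, 2):
        i = target_idx + offset
        feats[f"P{offset:+d}"] = tagged[i][1] if 0 <= i < len(tagged) else "-"
    words = {w.lower() for w, _ in tagged}
    for kw in KEYWORDS:
        feats[kw] = "Y" if kw in words else "N"
    return feats

s2 = [("The", "det"), ("bank", "noun"), ("issued", "verb"), ("a", "det"),
      ("check", "noun"), ("for", "prep"), ("the", "det"), ("amount", "noun")]
print(make_features(s2, 1))  # {'P-2': '-', 'P-1': 'det', 'P+1': 'verb', 'P+2': 'det', 'fish': 'N', ...}
```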

Supervised Learning Algorithms
Once the data is converted to feature-vector form, any supervised learning algorithm can be used. Many have been applied to WSD with good results:
- Support Vector Machines
- Nearest Neighbor Classifiers
- Decision Trees
- Decision Lists
- Naïve Bayesian Classifiers
- Perceptrons
- Neural Networks
- Graphical Models
- Log-Linear Models

Summing up: Supervised algorithms
- In ML approaches, systems are trained on a set of labeled instances to perform the task of WSD.
- What is learned is a classifier that can be used to assign unseen examples to senses.
- These approaches vary in the nature of the training material, how much material is needed, and the kind of linguistic knowledge used.

Summing up: Supervised algorithms (2)
- All approaches put emphasis on acquiring the knowledge needed for the task from data (rather than from humans).
- The question is about scaling up: is it possible (or realistic) to apply these methodologies to the entire lexicon of the language? Manually building sense-tagged corpora is extremely costly.

Inputs: feature vectors
The input consists of
- the target word (i.e. the word to be disambiguated)
- the context (i.e. a portion of the text in which it is embedded)
The input is normally part-of-speech tagged. The context consists of larger or smaller segments surrounding the target word. Often some kind of morphological processing is performed on all words of the context. Seldom, some form of parsing is performed to find out grammatical roles and relations.

Feature vectors
Two steps:
- selecting the relevant linguistic features
- encoding them in a way suitable for a learning algorithm
A feature vector consists of numerical or nominal values encoding the selected linguistic information.

Linguistic features
The linguistic features used in training WSD systems can be divided into two classes:
- Syntagmatic features: two words are syntagmatically related when they frequently appear in the same syntagm (e.g. when one of them frequently follows or precedes the other).
- Paradigmatic features: two words are paradigmatically related when their meanings are closely related (e.g. synonyms, hyponyms, words with the same semantic domains).

Syntagmatic features
Typical syntagmatic features are collocational features, encoding information about the lexical inhabitants of specific positions on the left or on the right of the target word:
- the word itself
- the root form of the word
- the word's part of speech
Ex. "An electric guitar and bass player stand off": 2 words on the right, 2 on the left, with POS: [guitar, NN1, and, CJC, player, NN1, stand, VVB]

Paradigmatic features
Features that are effective at capturing the general topic of the discourse in which the target word has occurred:
- bag of words
- domain labels
In a bag of words, co-occurrences of words are recorded ignoring their exact position: the value of the feature is the number of times the word occurs in a region surrounding the target word.
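The collocational window from the guitar/bass example above can be extracted with a few lines of code. A sketch under the assumption that a tagged sentence is already available (tags not given on the slide are marked "?"):

```python
# Sketch: collocational features (word + POS, 2 left / 2 right of the target)
# for the "bass" example above; POS tagging is taken as given here.
tagged = [("An", "?"), ("electric", "?"), ("guitar", "NN1"), ("and", "CJC"),
          ("bass", "NN1"), ("player", "NN1"), ("stand", "VVB"), ("off", "?")]

target = 4  # index of "bass"
window = tagged[target - 2:target] + tagged[target + 1:target + 3]
features = [item for pair in window for item in pair]
print(features)  # ['guitar', 'NN1', 'and', 'CJC', 'player', 'NN1', 'stand', 'VVB']
```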

Ng and Lee (1996): LEXAS
- Exemplar-based learning (in practice, k-nearest-neighbor learning).
- LEXAS = LEXical Ambiguity-resolving System.
- A set of features is extracted from disambiguated examples.
- When a new untagged example is encountered, it is compared with each of the training examples using a distance function.

LEXAS: the features
Feature extraction: [L3, L2, L1, R1, R2, R3, K1, ..., Km, C1, ..., C9, V]
- Part of speech and morphological form: the parts of speech of the words to the left (L3, L2, L1) and right (R1, R2, R3).
- Unordered set of surrounding words: keywords K1, ..., Km that co-occur frequently with the word w.
- Local collocations C1, ..., C9, determined by left and right offsets, e.g. [-1, 1]: "national interest of".
- Verb-object syntactic relations (V), used only for nouns.

LEXAS: the metric
Distance among feature vectors:
- the distance between two vectors is the sum of the distances between their features
- the distance between two values v_1 and v_2 of a feature f is

  \delta(v_1, v_2) = \sum_{i=1}^{n} \left| \frac{C_{1,i}}{C_1} - \frac{C_{2,i}}{C_2} \right|

where C_{1,i} is the number of training examples with value v_1 for feature f that are classified with sense i in the training corpus, C_1 is the number with value v_1 in any sense, C_{2,i} and C_2 denote the corresponding quantities for v_2, and n is the total number of senses for the word.

LEXAS: the algorithm
- Training phase: build the training examples.
- Testing: for a new untagged occurrence of word w, measure the distance to the training examples and choose the sense of the example that gives the minimum distance.
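A small sketch of this per-feature value distance under the definitions above; the list-of-pairs data layout is an assumption made for illustration:

```python
# Sketch of the value-difference metric defined above: for each sense i,
# compare the proportion of training examples with value v1 vs. v2 that
# carry that sense, and sum the absolute differences.
from collections import Counter

def value_distance(examples, feature, v1, v2, senses):
    """examples: list of (feature_dict, sense) pairs from the training corpus."""
    c1 = Counter(sense for feats, sense in examples if feats.get(feature) == v1)
    c2 = Counter(sense for feats, sense in examples if feats.get(feature) == v2)
    n1, n2 = sum(c1.values()) or 1, sum(c2.values()) or 1  # avoid division by zero
    return sum(abs(c1[s] / n1 - c2[s] / n2) for s in senses)
```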

LEXAS: evaluation
Two datasets:
- The "interest" corpus by Bruce and Wiebe (1994): 2,639 sentences from the Wall Street Journal, each containing the noun "interest". Sense repository: one of the six senses from LDOCE.
  Results: LEXAS 89%, Yarowsky 72%, Bruce & Wiebe 79%.
- 192,800 occurrences of very ambiguous words: 121 nouns (e.g. action, board), ~7.8 senses per word, and 70 verbs (e.g. add, build), ~12 senses per word, from the Brown corpus and the WSJ. Sense repository: WordNet.
  Results: LEXAS 68.6%, most frequent sense 63.7%.

Intermezzo: Evaluating a WSD system
- Numerical evaluation is meaningless without specifying how difficult the task is. Ex.: 90% accuracy is easy for a POS tagger, but it is beyond the ability of any machine translation system.
- Estimating an upper and a lower bound makes sense of the performance of an algorithm.

Evaluating a WSD system: upper bound
- The upper bound is usually human performance.
- In the case of WSD, if humans disagree on sense disambiguation, we cannot expect a WSD system to do better.
- Inter-judge agreement is higher for words with clear sense distinctions, ex. bank (95% and higher), and lower for polysemous words, ex. title (65% to 70%).

Evaluating a WSD system: inter-judge agreement
- To correctly compare the extent of inter-judge agreement, we need to correct for the expected chance agreement.
- This depends on the number of senses being distinguished.
- This is done with the Kappa statistic (Carletta, 1996).
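A minimal sketch of a two-annotator, Cohen-style Kappa computation in the spirit of Carletta (1996); the helper name and the toy labels are illustrative:

```python
# Illustrative two-annotator kappa: observed agreement corrected for the
# agreement expected by chance given each annotator's label distribution.
from collections import Counter

def kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    expected = sum((dist_a[s] / n) * (dist_b[s] / n) for s in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

print(kappa(["bank/1", "bank/1", "bank/2", "bank/2"],
            ["bank/1", "bank/2", "bank/2", "bank/2"]))  # 0.5
```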

Evaluating a WSD system: lower bound
- The lower bound, or baseline, is the performance of the simplest possible algorithm, usually the assignment of the most frequent sense.
- 90% accuracy is a good result for a word with 2 equiprobable senses, but a trivial result for a word with 2 senses in a 9-to-1 frequency ratio.
- Another possible baseline: random choice.

Evaluating a WSD system: scoring
Precision and recall:
- precision is the proportion of classified instances that were correctly classified
- recall is the proportion of all instances that were classified correctly
These allow for the possibility of an algorithm choosing not to classify a given instance. For the sense-tagging task, accuracy is reported as recall. The coverage of a system (i.e. the percentage of items for which the system guesses some sense tag) can be computed by dividing recall by precision.
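The relationship between precision, recall, and coverage described above can be made concrete with a short sketch (the function name and toy labels are illustrative):

```python
# Sketch: WSD scoring when the system may abstain (predict None).
# precision = correct / attempted, recall = correct / total,
# coverage = attempted / total = recall / precision.
def score(gold, predicted):
    total = len(gold)
    attempted = sum(p is not None for p in predicted)
    correct = sum(p == g for g, p in zip(gold, predicted) if p is not None)
    precision = correct / attempted if attempted else 0.0
    recall = correct / total
    coverage = attempted / total
    return precision, recall, coverage

print(score(["s1", "s2", "s1", "s2"], ["s1", None, "s1", "s2"]))  # (1.0, 0.75, 0.75)
```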

Evaluation frameworks
SENSEVAL, http://www.senseval.org
A competition with various WSD tasks, on different languages:
- all-words task
- lexical-sample task
Editions:
- Senseval 1: 1999 [Hector dictionary], ~10 teams
- Senseval 2: 2001 [WordNet dictionary], ~30 teams
- Senseval 3: 2004 [mainly WordNet dictionary], ~60 teams, many different tasks
- Senseval 4 -> SemEval-1: 2007

Naïve Bayes
A premise: choosing the best sense for an input vector is choosing the most probable sense given that vector,

  \hat{s} = \arg\max_{s \in S} P(s \mid V)

Rewriting in the usual Bayesian manner:

  \hat{s} = \arg\max_{s \in S} \frac{P(V \mid s)\, P(s)}{P(V)}

But the available data that associates specific vectors with senses is too sparse. What is largely available in the training set is information about individual feature-value pairs for a specific sense.

Naïve Bayes (2)
Naïve assumption: the features are independent,

  P(V \mid s) \approx \prod_{j=1}^{n} P(v_j \mid s)

so that

  \hat{s} = \arg\max_{s \in S} P(s) \prod_{j=1}^{n} P(v_j \mid s)

P(V) is the same for all possible senses, so it does not affect the final ranking of the senses. Training a naïve Bayes classifier consists of collecting the individual feature-value statistics with respect to each sense of the target word in a sense-tagged training corpus. In practice, considerations about smoothing apply.

Semi-supervised approaches
The problem with supervised algorithms is the need for a large sense-tagged training set.
Bootstrapping approach (Yarowsky, 1995):
- A small number of labeled instances (seeds) is used to train an initial classifier in a supervised way.
- This classifier is then used to extract a large training set from an untagged corpus.
- Iterating this process results in a series of increasingly accurate classifiers.
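Returning to the naïve Bayes formulation above, a compact sketch of such a sense classifier over nominal feature vectors, with add-one smoothing standing in for the "considerations about smoothing"; the data layout and names are illustrative:

```python
# Illustrative naive Bayes WSD: argmax_s P(s) * prod_j P(v_j | s), estimated
# from sense-tagged feature vectors with add-one (Laplace) smoothing.
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (feature_dict, sense) pairs."""
    sense_counts = Counter(sense for _, sense in examples)
    value_counts = defaultdict(Counter)  # (sense, feature) -> Counter of values
    for feats, sense in examples:
        for f, v in feats.items():
            value_counts[(sense, f)][v] += 1
    return sense_counts, value_counts

def classify(feats, sense_counts, value_counts):
    total = sum(sense_counts.values())
    best, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log P(s)
        for f, v in feats.items():
            counts = value_counts[(sense, f)]
            # add-one smoothed log P(v_j | s)
            score += math.log((counts[v] + 1) / (sum(counts.values()) + len(counts) + 1))
        if score > best_score:
            best, best_score = sense, score
    return best
```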

One-sense-per constraints
There are constraints between different occurrences of an ambiguous word within a corpus that can be exploited for disambiguation:
- One sense per discourse: the sense of a target word is highly consistent within any given document. E.g. "He planted the pansy seeds himself, buying them from a pansy specialist. These specialists have done a great deal of work to improve the size and health of the plants and the resulting flowers. Their seeds produce vigorous blooming plants half again the size of the unimproved strains."
- One sense per collocation: nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship. E.g. "industrial plant": the same meaning of "plant" regardless of where this collocation occurs.

One-sense-per constraints: summing up
- One sense per discourse: a word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992).
- One sense per collocation: a word tends to preserve its meaning when used in the same collocation (Yarowsky 1993). This is strong for adjacent collocations and weaker as the distance between the words increases.

Bootstrapping (Yarowsky 95)
Simplification: binary sense assignment.
- Step 1: identify in a corpus all examples of a given polysemous word.
- Step 2: identify a small set of representative examples for each sense.
- Step 3:
  a) Train a classification algorithm on the Sense-A/Sense-B seed sets.
  b) Apply the resulting classifier to the rest of the corpus and add the new examples to the seed set.
  c) Repeat iteratively.
- Step 4: apply the classification algorithm to the test set.

Bootstrapping (Yarowsky 95): example
Word: "plant". Sense-A: living organism; Sense-B: factory. Seed collocations: "life" and "manufacturing". Extract examples containing these seeds.
Decision list:

  LogL   Collocation                             Sense
  8.10   plant life                              A
  7.58   manufacturing plant                     B
  7.39   life (within +/- 2-10 words)            A
  7.20   manufacturing (within +/- 2-10 words)   B
  6.27   animal (within +/- 2-10 words)          A
  4.70   equipment (within +/- 2-10 words)       B
  4.36   employee (within +/- 2-10 words)        B
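A schematic sketch of the loop in steps 3a-3c, with a generic classifier and a confidence threshold standing in for Yarowsky's decision list; all names and the threshold value are illustrative:

```python
# Schematic bootstrapping loop: train on seeds, label the most confident
# untagged examples, add them to the training set, and repeat.
def bootstrap(seed_examples, unlabeled, train, predict_proba, threshold=0.95, rounds=10):
    """seed_examples: list of (features, sense); unlabeled: list of features."""
    labeled = list(seed_examples)
    remaining = list(unlabeled)
    classifier = train(labeled)                          # step 3a
    for _ in range(rounds):
        newly_labeled, still_unlabeled = [], []
        for feats in remaining:
            sense, confidence = predict_proba(classifier, feats)
            if confidence >= threshold:                  # confident -> becomes training data
                newly_labeled.append((feats, sense))     # step 3b
            else:
                still_unlabeled.append(feats)
        if not newly_labeled:                            # nothing new: stop iterating
            break
        labeled.extend(newly_labeled)
        remaining = still_unlabeled
        classifier = train(labeled)                      # step 3c: retrain and repeat
    return classifier
```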

Bootstrapping (Yarowsky 95): iteration and results
- Use this decision list to classify new examples.
- Repeat until no more examples can be classified.
- Test: apply the decision list to the test set.
- Option for choosing training seeds: use words in dictionary definitions, e.g. a single defining collocate (such as "bird" and "machine" for the word "crane").
Results: Yarowsky 96.5%, most frequent sense 63.9%.
It works well only for distinct senses of words; sometimes this is not the case:
1. bass -- (the lowest part of the musical range)
3. bass, basso -- (an adult male singer with the lowest voice)

Some approaches
- Corpus-based approaches
  - Supervised algorithms: Exemplar-Based Learning (Ng & Lee 96), Naïve Bayes
  - Semi-supervised algorithms: bootstrapping approaches (Yarowsky 95)
- Dictionary-based approaches: Lesk 86
- Hybrid algorithms (supervised + dictionary): Mihalcea 00

Lesk algorithm (1986)
- It is one of the first algorithms developed for the semantic disambiguation of all words in open text.
- The only resource required is a set of dictionary entries (definitions).
- The most likely sense for a word in a given context is decided by a measure of the overlap between the definitions of the target word and those of the words in the current context.

Lesk algorithm - dictionary based
The main idea of the original version of the algorithm is to disambiguate words by finding the overlap among their sense definitions:

  (1) for each sense i of W1
  (2)   for each sense j of W2
  (3)     determine Overlap(i,j) as the number of common occurrences between the definitions of sense i of W1 and sense j of W2
  (4) find i and j for which Overlap(i,j) is maximum
  (5) assign sense i to W1 and sense j to W2
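A minimal runnable rendering of this pairwise overlap idea; definitions are passed in as plain strings, and the pine/cone entries used below anticipate the example on the next slide:

```python
# Minimal sketch of the original (dictionary-based) Lesk idea: pick the pair
# of senses whose definitions share the most words.
def lesk_pair(senses_w1, senses_w2):
    """senses_w1/senses_w2: dicts mapping a sense label to its definition string."""
    best = (None, None, -1)
    for s1, def1 in senses_w1.items():
        for s2, def2 in senses_w2.items():
            overlap = len(set(def1.lower().split()) & set(def2.lower().split()))
            if overlap > best[2]:
                best = (s1, s2, overlap)
    return best

pine = {"pine#1": "kinds of evergreen tree with needle-shaped leaves",
        "pine#2": "waste away through sorrow or illness"}
cone = {"cone#1": "solid body which narrows to a point from a round flat base",
        "cone#2": "something of this shape whether solid or hollow",
        "cone#3": "fruit of certain evergreen trees"}
print(lesk_pair(pine, cone))  # ('pine#1', 'cone#3', 2) -- shared words: "evergreen", "of"
```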

Lesk algorithm - an example
Select the appropriate senses of "cone" and "pine" in the phrase "pine cone", given the following definitions:
pine
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
cone
1. solid body which narrows to a point from a round flat base
2. something of this shape whether solid or hollow
3. fruit of certain evergreen trees
Select pine#1 and cone#3 because they have two words in common.

Lesk algorithm - corpus based
A corpus-based variation also takes into consideration additional tagged examples:

  (1) for each sense s of the word W1
  (2)   set weight(s) to zero
  (3) for each unique word w in the surrounding context of W1
  (4)   for each sense s,
  (5)     if w occurs in the training examples / dictionary definitions for sense s,
  (6)       add weight(w) to weight(s)
  (7) choose the sense with the greatest weight(s)

weight(w) = IDF = -log(p(w)), where p(w) is estimated over the examples and dictionary definitions.

Lesk algorithm - evaluation
- The corpus-based variation is one of the best-performing baselines in comparative evaluations of WSD systems.
- In Senseval-2 it reached 51.2% precision, compared to 64.2% achieved by the best supervised system.
Problems of this approach:
- dictionary entries are relatively short
- combinatorial explosion when applied to more than two words

Some approaches
- Corpus-based approaches
  - Supervised algorithms: Exemplar-Based Learning (Ng & Lee 96), Naïve Bayes
  - Semi-supervised algorithms: bootstrapping approaches (Yarowsky 95)
- Dictionary-based approaches: Lesk 86
- Hybrid algorithms (supervised + dictionary): Mihalcea 00

Hybrid algorithms (Mihalcea 2000)
It combines two sources of information: WordNet and a sense-tagged corpus (SemCor). It is based on:
- WordNet definitions
- WordNet ISA relations
- rules acquired from SemCor

Hybrid algorithm (Mihalcea 2000)
- It was developed for the purpose of improving Information Retrieval with WSD techniques: disambiguation of the words in the input IR query and disambiguation of the words in the documents.
- The algorithm determines a set of nouns and verbs that can be disambiguated with high precision.
- Several procedures (8) are called iteratively in the main algorithm.

Procedure 1
- The system uses a Named Entity recognizer, in particular for person names, organizations and locations.
- Identify Named Entities in the text and mark them with sense #1.
Examples:
- Bush => PER => person#1
- Trento => LOC => location#1
- IBM => ORG => organization#1

Procedure 2
Exploiting the monosemous words: identify words having only one sense in WordNet and mark them with that sense.
Example: the noun "subcommittee" has only one sense in WordNet.
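Procedure 2 is easy to sketch with NLTK's WordNet interface (assuming the wordnet corpus is installed; the function name is illustrative):

```python
# Sketch of Procedure 2: tag a word with its single WordNet sense if it is
# monosemous for the given part of speech.
from nltk.corpus import wordnet as wn

def monosemous_sense(word, pos=wn.NOUN):
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if len(synsets) == 1 else None  # None -> ambiguous or unknown

print(monosemous_sense("subcommittee"))  # Synset('subcommittee.n.01')
print(monosemous_sense("bank"))          # None (polysemous)
```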

Procedure 3
Exploiting contextual clues about the usage of words: given a clue (collocation), search for it in SemCor and mark the occurrences with the corresponding sense from SemCor.
Example: disambiguate "approval" in "approval of" => 4 examples in SemCor:
- with the approval#1 of the Farm Credit Association
- subject to the approval#1 of the Secretary of State
- administrative approval#1 of the reclassification
- recommended approval#1 of the 1-A classification
In all these occurrences the sense of "approval" is approval#1.

Procedure 4
Using SemCor, for a given noun N in the text, determine the noun-context of each of its senses.
- Noun-context: the list of nouns which occur most often within the context of N.
- Find common words between the current context and the noun-context.
Example: "diameter" has 2 senses, hence 2 noun-contexts:
- diameter#1: {property, hole, ratio}
- diameter#2: {form}

Procedure 5
Find words that are semantically connected to already disambiguated words. "Connected" means there is a relation in WordNet; if the words belong to the same synset, the connection distance is 0.
Example: "authorize" and "clear" in a text to be disambiguated. Knowing that authorize#1 was disambiguated with Procedure 2, it follows that "clear" gets sense clear#4, because they are synonyms in WordNet.

Procedure 6
Find words that are semantically connected and for which the connection distance is 0. This is weaker than Procedure 5: none of the words considered are already disambiguated.
Example: "measure" and "bill", both ambiguous ("bill" has 10 senses, "measure" 9); bill#1 and measure#4 belong to the same synset.
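The synset-sharing test behind Procedures 5 and 6 can be sketched with NLTK (illustrative helper; the exact sense numbers reported depend on the installed WordNet version):

```python
# Sketch for Procedures 5-6: find sense pairs of two words that share a synset
# (connection distance 0), i.e. the words are synonyms in WordNet.
from nltk.corpus import wordnet as wn

def shared_synsets(word1, word2, pos=None):
    return [s for s in wn.synsets(word1, pos=pos) if s in set(wn.synsets(word2, pos=pos))]

print(shared_synsets("measure", "bill", pos=wn.NOUN))    # e.g. the draft-statute sense
print(shared_synsets("authorize", "clear", pos=wn.VERB))
```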

Procedures 7-8
Similar to Procedures 5-6, but they use other semantic relations (connection distance = 1): hypernymy/hyponymy, i.e. ISA relations.
Example: "subcommittee" and "committee": subcommittee#1 was disambiguated with Procedure 2, and committee#1 is chosen because it is a hypernym of subcommittee#1.

(Mihalcea 2000) - Evaluation
- The procedures presented above are applied iteratively.
- This allows the identification of a set of nouns and verbs that can be disambiguated with high precision.
- Tests on six randomly selected files from SemCor: the algorithm disambiguates 55% of the nouns and verbs with 92% precision.