Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling


Rada Mihalcea
Department of Computer Science
University of North Texas
rada@cs.unt.edu

Abstract

This paper introduces a graph-based algorithm for sequence data labeling, using random walks on graphs encoding label dependencies. The algorithm is illustrated and tested in the context of an unsupervised word sense disambiguation problem, and shown to significantly outperform the accuracy achieved through individual label assignment, as measured on standard sense-annotated data sets.

1 Introduction

Many natural language processing tasks consist of labeling sequences of words with linguistic annotations, e.g. word sense disambiguation, part-of-speech tagging, named entity recognition, and others. Typical labeling algorithms attempt to formulate the annotation task as a traditional learning problem, where the correct label is individually determined for each word in the sequence using a learning process, usually conducted independently of the labels assigned to the other words in the sequence. Such algorithms do not have the ability to encode and thereby exploit dependencies across the labels corresponding to the words in the sequence, which potentially limits their performance in applications where such dependencies can influence the selection of the correct set of labels.

In this paper, we introduce a graph-based sequence data labeling algorithm well suited for such natural language annotation tasks. The algorithm simultaneously annotates all the words in a sequence by exploiting relations identified among word labels, using random walks on graphs encoding label dependencies. The random walks are mathematically modeled through iterative graph-based algorithms, which are applied on the label graph associated with the given sequence of words, resulting in a stationary distribution over label probabilities. These probabilities are then used to simultaneously select the most probable set of labels for the words in the input sequence.

The annotation method is illustrated and tested on an unsupervised word sense disambiguation problem, targeting the annotation of all open-class words in unrestricted text using information derived exclusively from dictionary definitions. The graph-based sequence data labeling algorithm significantly outperforms the accuracy achieved through individual data labeling, resulting in an error reduction of 10.7%, as measured on standard sense-annotated data sets. The method is also shown to exceed the performance of other previously proposed unsupervised word sense disambiguation algorithms.

2 Iterative Graphical Algorithms for Sequence Data Labeling

In this section, we introduce the iterative graphical algorithm for sequence data labeling. The algorithm is succinctly illustrated using a sample sequence for a generic annotation problem, with a more extensive illustration and evaluation provided in Section 3.

Given a sequence of words W = {w_1, w_2, ..., w_n}, each word w_i with corresponding admissible labels L_{w_i} = {l^1_{w_i}, ..., l^{N_{w_i}}_{w_i}}, we define a label graph G = (V, E) such that there is a vertex v in V for every possible label l^j_{w_i}, i = 1..n, j = 1..N_{w_i}. Dependencies between pairs of labels are represented as directed or undirected edges e in E, defined over the set of vertex pairs V x V. Such label dependencies can be learned from annotated data, or derived by other means, as illustrated later.
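To make this structure concrete, here is a minimal sketch (ours, not from the paper) of one way to represent such a label graph in Python; the class and method names are illustrative only.

    from collections import defaultdict

    class LabelGraph:
        """Vertices are (word_index, label) pairs; edges carry dependency weights."""

        def __init__(self):
            self.vertices = set()
            self.weight = defaultdict(float)  # (u, v) -> dependency weight

        def add_label(self, word_index, label):
            # one vertex for every admissible label of every word
            self.vertices.add((word_index, label))

        def add_dependency(self, u, v, w):
            # an undirected weighted edge between two label vertices
            self.vertices.update((u, v))
            self.weight[(u, v)] = w
            self.weight[(v, u)] = w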

Figure 1 shows an example of a graphical structure derived over the set of labels for a sequence of four words. Note that the graph does not have to be fully connected, as not all label pairs can be related by a dependency.

Figure 1: Sample graph built on the set of possible labels (shaded nodes) for a sequence of four words (unshaded nodes). Label dependencies are indicated as edge weights. Scores computed by the graph-based algorithm are shown in brackets, next to each label.

Given such a label graph associated with a sequence of words, the likelihood of each label can be recursively determined using an iterative graph-based ranking algorithm, which runs over the graph of labels and identifies the importance of each label (vertex) in the graph. The iterative graphical algorithm models a random walk, leading to a stationary distribution over label probabilities, represented as scores attached to vertices in the graph. These scores are then used to identify the most probable label for each word, resulting in the annotation of all the words in the input sequence. For instance, for the graph drawn in Figure 1, each word is assigned the label whose score (shown in brackets) is the maximum among the scores assigned to all the admissible labels associated with that word.

A remarkable property that makes these iterative graphical algorithms appealing for sequence data labeling is the fact that they take into account global information recursively drawn from the entire graph, rather than relying on local vertex-specific information. Through the random walk performed on the label graph, these iterative algorithms attempt to collectively exploit the dependencies drawn between all labels in the graph, which makes them superior to other approaches that rely only on local information, individually derived for each word in the sequence.

2.1 Graph-based Ranking

The basic idea implemented by an iterative graph-based ranking algorithm is that of voting or recommendation. When one vertex links to another one, it is basically casting a vote for that other vertex. The higher the number of votes that are cast for a vertex, the higher the importance of the vertex. Moreover, the importance of the vertex casting a vote determines how important the vote itself is, and this information is also taken into account by the ranking algorithm. While there are several graph-based ranking algorithms previously proposed in the literature, we focus on only one such algorithm, namely PageRank (Brin and Page, 1998), as it was previously found successful in a number of applications, including Web link analysis, social networks, citation analysis, and more recently in several text processing applications.

Given a graph G = (V, E), let In(V_i) be the set of vertices that point to vertex V_i (predecessors), and let Out(V_i) be the set of vertices that vertex V_i points to (successors). The PageRank score associated with the vertex V_i is then defined using a recursive function that integrates the scores of its predecessors:

    S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}    (1)

where d is a parameter that is set between 0 and 1 [1]. This vertex scoring scheme is based on a random walk model, where a walker takes random steps on the graph, with the walk being modeled as a Markov process; that is, the decision on what edge to follow is solely based on the vertex where the walker is currently located. Under certain conditions, this model converges to a stationary distribution of probabilities associated with the vertices in the graph.
Based on the Ergodic theorem for Markov chains (Grimmett and Stirzaker, 1989), the algorithm is guaranteed to converge if the graph is both aperiodic and irreducible. The first condition is achieved for any graph that is non-bipartite, while the second condition holds for any strongly connected graph, a property achieved by PageRank through the random jumps introduced by the (1 - d) factor. In matrix notation, the PageRank vector P of stationary probabilities is the principal eigenvector of the matrix A_{row}^T, which is obtained from the adjacency matrix A representing the graph, with all rows normalized to sum to 1: P = A_{row}^T P.

Intuitively, the stationary probability associated with a vertex in the graph represents the probability of finding the walker at that vertex during the random walk, and thus it represents the importance of the vertex within the graph.

[1] The typical value for d is 0.85 (Brin and Page, 1998), and this is the value we are also using in our implementation.
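As a concrete illustration of Equation (1), the sketch below (our code, not the paper's) iterates the score update on a directed graph given as a successor map, with d = 0.85 as in the footnote; the convergence threshold is an arbitrary choice.

    def pagerank(successors, d=0.85, tol=1e-6, max_iter=100):
        """successors: dict mapping each vertex to the set of vertices it points to."""
        vertices = set(successors) | {v for out in successors.values() for v in out}
        preds = {v: [] for v in vertices}  # In(V): vertices pointing to V
        for u, out in successors.items():
            for v in out:
                preds[v].append(u)
        score = {v: 1.0 for v in vertices}
        for _ in range(max_iter):
            new = {v: (1 - d) + d * sum(score[u] / len(successors[u]) for u in preds[v])
                   for v in vertices}
            if max(abs(new[v] - score[v]) for v in vertices) < tol:
                return new
            score = new
        return score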

In the context of sequence data labeling, the random walk is performed on the label graph associated with a sequence of words, and thus the resulting stationary distribution of probabilities can be used to decide on the most probable set of labels for the given sequence.

2.2 Ranking on Weighted Graphs

In a weighted graph, the decision on what edge to follow during a random walk also takes into account the weights of outgoing edges, with a higher likelihood of following an edge that has a larger weight. The weighted version of the ranking algorithm is particularly useful for sequence data labeling, since the dependencies between pairs of labels are more naturally modeled through weights indicating their strength, rather than binary values. Given a set of weights w_{ij} associated with edges connecting vertices V_i and V_j, the weighted PageRank score is determined as:

    S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} S(V_j)    (2)

2.3 Algorithm for Sequence Data Labeling

Given a sequence of words with their corresponding admissible labels, the algorithm for sequence data labeling seeks to identify a graph of label dependencies on which a random walk can be performed, resulting in a set of scores that can be used for label assignment. Algorithm 1 shows the pseudocode for the labeling process. The algorithm consists of three main steps: (1) construction of the label dependencies graph; (2) label scoring using graph-based ranking algorithms; (3) label assignment.

First, a weighted graph of label dependencies is built by adding a vertex for each admissible label, and an edge for each pair of labels for which a dependency is identified. A maximum allowable distance can be set (MaxDist), indicating a constraint over the distance between words for which a label dependency is sought. For instance, if MaxDist is set to 3, no edges will be drawn between labels corresponding to words that are more than three words apart, counting all running words. Label dependencies are determined through the Dependency function, whose definition depends on the application and type of resources available (see Section 2.4).

Next, scores are assigned to vertices using a graph-based ranking algorithm. Current experiments are based on PageRank, but other ranking algorithms can be used as well. Finally, the most likely set of labels is determined by identifying for each word the label that has the highest score. Note that all admissible labels corresponding to the words in the input sequence are assigned with a score, and thus the selection of two or more most likely labels for a word is also possible.

Algorithm 1: Graph-based Sequence Data Labeling
Input: Sequence W = {w_i | i = 1..n}
Input: Admissible labels L_{w_i} = {l^t_{w_i} | t = 1..N_{w_i}}, i = 1..n
Output: Sequence of labels L = {l_{w_i} | i = 1..n}, with label l_{w_i} corresponding to word w_i from the input sequence.

Build graph G of label dependencies
 1: for i = 1 to n do
 2:   for j = i + 1 to n do
 3:     if j - i > MaxDist then
 4:       break
 5:     end if
 6:     for t = 1 to N_{w_i} do
 7:       for s = 1 to N_{w_j} do
 8:         weight <- Dependency(l^t_{w_i}, l^s_{w_j})
 9:         if weight > 0 then
10:           AddEdge(G, l^t_{w_i}, l^s_{w_j}, weight)
11:         end if
12:       end for
13:     end for
14:   end for
15: end for

Score vertices in G
 1: repeat
 2:   for all V_a in Vertices(G) do
 3:     update S(V_a) according to Equation (2)
 4:   end for
 5: until convergence of scores S

Label assignment
 1: for i = 1 to n do
 2:   l_{w_i} = argmax_{t = 1..N_{w_i}} S(l^t_{w_i})
 3: end for
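The following Python sketch condenses Algorithm 1 under stated assumptions: dependency is any user-supplied function returning a non-negative weight (see Section 2.4), edges are treated as undirected, scores are iterated with Equation (2) for a fixed number of rounds rather than to convergence, and all names are ours.

    def label_sequence(words, labels, dependency, max_dist=3, d=0.85, iters=30):
        """labels[i] lists the admissible labels of words[i]; returns one label per word."""
        # Step 1: build the weighted label graph within the MaxDist window.
        vertices = [(i, l) for i, ls in enumerate(labels) for l in ls]
        weight = {}
        for i in range(len(words)):
            for j in range(i + 1, min(i + max_dist + 1, len(words))):
                for li in labels[i]:
                    for lj in labels[j]:
                        w = dependency(li, words[i], lj, words[j])
                        if w > 0:
                            weight[((i, li), (j, lj))] = w
                            weight[((j, lj), (i, li))] = w
        out_sum = {v: sum(weight.get((v, u), 0.0) for u in vertices) for v in vertices}
        # Step 2: score vertices with the weighted PageRank of Equation (2).
        score = {v: 1.0 for v in vertices}
        for _ in range(iters):
            score = {v: (1 - d) + d * sum(score[u] * weight[(u, v)] / out_sum[u]
                                          for u in vertices
                                          if (u, v) in weight)
                     for v in vertices}
        # Step 3: assign each word its highest-scoring admissible label.
        return [max(ls, key=lambda l: score[(i, l)]) for i, ls in enumerate(labels)]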
2.4 Label Dependencies

Label dependencies can be defined in various ways, depending on the application at hand and on the knowledge sources that are available. If an annotated corpus is available, dependencies can be defined as label co-occurrence probabilities P(l_i, l_j) approximated with frequency counts, or as conditional probabilities P(l_i | l_j). Optionally, these dependencies can be lexicalized by taking into account the corresponding words in the sequence, e.g. P(l_i | l_j, w_i, w_j). In the absence of an annotated corpus, dependencies can be derived by other means, e.g. part-of-speech probabilities can be approximated from a raw corpus as in (Cutting et al., 1992), word-sense dependencies can be derived as definition-based similarities, etc. Label dependencies are set as weights on the arcs drawn between corresponding labels. Arcs can be directed or undirected for joint probabilities or similarity measures, and are usually directed for conditional probabilities.
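For the annotated-corpus case, a hypothetical sketch of the conditional-probability option: estimating directed weights P(l_2 | l_1) for adjacent labels from frequency counts. The corpus format and the restriction to adjacent pairs are our simplifying assumptions.

    from collections import Counter

    def conditional_dependencies(tagged_sequences):
        """tagged_sequences: iterable of label sequences from an annotated corpus."""
        pair_counts, label_counts = Counter(), Counter()
        for seq in tagged_sequences:
            for l1, l2 in zip(seq, seq[1:]):  # adjacent label pairs
                pair_counts[(l1, l2)] += 1
                label_counts[l1] += 1
        # weight on the directed arc l1 -> l2 is P(l2 | l1)
        return {(l1, l2): n / label_counts[l1] for (l1, l2), n in pair_counts.items()}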

" " " 9 of-speech probabiities can be approximated from a raw corpus as in (Cutting et a., 1992), word-sense dependencies can be derived as definition-based simiarities, etc. Labe dependencies are set as weights on the arcs drawn between corresponding abes. Arcs can be directed or undirected for joint probabiities or simiarity measures, and are usuay directed for conditiona probabiities. 2.5 Labeing Exampe Consider again the exampe from Figure 1, consisting of a sequence of four words, and their possibe corresponding abes. In the first step of the agorithm, abe dependencies are determined, and et us assume that the vaues for these dependencies are as indicated through the edge weights in Figure 1. Next, vertices in the graph are scored using an iterative ranking agorithm, resuting in a score attached to each abe, shown in brackets next to each vertex. Finay, the most probabe abe for each word is seected. Word is thus assigned with abe, since the score of this abe (, ) is the maximum among the scores associated with a its possibe abes (,,,*,, ). Simiary, word is assigned with abe, with abe, and receives abe. 2.6 Efficiency Considerations For a sequence of words 24, each word with admissibe abes, the running time of the graph-based sequence data abeing agorithm } 2 2 is proportiona with O( ) ) (the time spent in buiding the abe graph and iterating the agorithm for a constant number of times ). This is order of magnitudes better than the running time 2 of O( ) for agorithms that attempt to seect the best sequence of abes by searching through the entire space of possibe abe combinations, athough it can be significanty higher than the running time of 24 O( ) for individua data abeing. 2.7 Other Agorithms for Sequence Data Labeing It is interesting to contrast our agorithm with previousy proposed modes for sequence data abeing, e.g. Hidden Markov Modes, Maximum Entropy Markov Modes, or Conditiona Random Fieds. Athough they differ in the mode used (generative, discriminative, or dua), and the type of probabiities invoved (joint or conditiona), these previous agorithms are a parameterized agorithms that typicay require parameter training through maximization of ikeihood on training exampes. In these modes, parameters that maximize sequence probabiities are earned from a corpus during a training phase, and then appied to the annotation of new unseen data. Instead, in the agorithm proposed in this paper, the ikeihood of a sequence of abes is determined during test phase, through random waks performed on the abe graph buit for the data to be annotated. Whie current evauations of our agorithm are performed on an unsupervised abeing task, future work wi consider the evauation of the agorithm in the presence of an annotated corpus, which wi aow for direct comparison with these previousy proposed modes for sequence data abeing. 3 Experiments in Word Sense Disambiguation The agorithm for sequence data abeing is iustrated and tested on an a-words word sense disambiguation probem. Word sense disambiguation is a abeing task consisting of assigning the correct meaning to each open-cass word in a sequence (usuay a sentence). Most of the efforts for soving this probem were concentrated so far toward targeted supervised earning, where each sense tagged occurrence of a particuar word is transformed into a feature vector used in an automatic earning process. 
The applicability of such supervised algorithms is however limited to those few words for which sense-tagged data is available, and their accuracy is strongly connected to the amount of labeled data available at hand. Instead, algorithms that attempt to disambiguate all words in unrestricted text have received significantly less attention, as the development and success of such algorithms has been hindered by both (a) lack of resources (training data), and (b) efficiency aspects resulting from the large size of the problem.

3.1 Graph-based Sequence Data Labeling for Unsupervised Word Sense Disambiguation

To apply the graph-based sequence data labeling algorithm to the disambiguation of an input text, we need information on labels (word senses) and dependencies (word sense dependencies). Word senses can be easily obtained from any sense inventory, e.g. WordNet or LDOCE. Sense dependencies can be derived in various ways, depending on the type of resources available for the language and/or domain at hand. In this paper, we explore the unsupervised derivation of sense dependencies using information drawn from machine readable dictionaries, which is general and can be applied to any language or domain for which a sense inventory is available.

Relying exclusively on a machine readable dictionary, a sense dependency can be defined as a measure of similarity between word senses. There are several metrics that can be used for this purpose; see for instance (Budanitsky and Hirst, 2001) for an overview. However, most of them rely on measures of semantic distance computed on semantic networks, and thus they are limited by the availability of explicitly encoded semantic relations (e.g. is-a, part-of). To maintain the unsupervised aspect of the algorithm, we chose instead to use a measure of similarity based on sense definitions, which can be computed on any dictionary, and can be evaluated across different parts of speech.

Given two word senses and their corresponding definitions, the sense similarity is determined as a function of definition overlap, measured as the number of common tokens between the two definitions, after running them through a simple filter that eliminates all stop-words. To avoid promoting long definitions, we also use a normalization factor, and divide the content overlap of the two definitions with the length of each definition. This sense similarity measure is inspired by the definition of the Lesk algorithm (Lesk, 1986).

Starting with a sense inventory and a function for computing sense dependencies, the application of the sequence data labeling algorithm to the unsupervised disambiguation of a new text proceeds as follows. First, for the given text, a label graph is built by adding a vertex for each possible sense of all open-class words in the text. Next, weighted edges are drawn using the definition-based semantic similarity measure, computed for all pairs of senses for words found within a certain distance (MaxDist, as defined in Algorithm 1). Once the graph is constructed, the graph-based ranking algorithm is applied, and a score is determined for all word senses in the graph. Finally, for each open-class word in the text, we select the vertex in the label graph which has the highest score, and label the word with the corresponding word sense.
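A small sketch of this definition-overlap measure follows (our code; the stop-word list is a tiny placeholder and the tokenizer is naive). The normalization described above, dividing the overlap by the length of each definition, is rendered here as the sum of the two normalized overlaps; other readings of that description are possible.

    STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "that", "is", "and"}

    def content_tokens(definition):
        return [t for t in definition.lower().split() if t not in STOPWORDS]

    def sense_similarity(def1, def2):
        """Normalized definition overlap, inspired by Lesk (1986)."""
        t1, t2 = content_tokens(def1), content_tokens(def2)
        if not t1 or not t2:
            return 0.0
        overlap = len(set(t1) & set(t2))
        # divide the overlap by the length of each definition, then combine
        return overlap / len(t1) + overlap / len(t2)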
3.2 An Example

Consider the task of assigning senses to the words in the text "The church bells no longer rung on Sundays" [2]. For the purpose of illustration, we assume at most three senses for each word, which are shown in Figure 2. Word senses and definitions are obtained from the WordNet sense inventory (Miller, 1995).

Figure 2: The label graph for assigning senses to words in the sentence "The church bells no longer rung on Sundays", built over the following senses:
church 1: one of the groups of Christians who have their own beliefs and forms of worship; 2: a place for public (especially Christian) worship; 3: a service conducted in a church
bell 1: a hollow device made of metal that makes a ringing sound when struck; 2: a push button at an outer door that gives a ringing or buzzing signal when pushed; 3: the sound of a bell
ring 1: make a ringing sound; 2: ring or echo with sound; 3: make (bells) ring, often for the purposes of musical edification
Sunday 1: first day of the week; observed as a day of rest and worship by most Christians

All word senses are added as vertices in the label graph, and weighted edges are drawn as dependencies among word senses, derived using the definition-based similarity measure (no edges are drawn between word senses with a similarity of zero). The resulting label graph is an undirected weighted graph, as shown in Figure 2. After running the ranking algorithm, scores are identified for each word sense in the graph, indicated in brackets next to each node. Selecting for each word the sense with the largest score results in the following sense assignment: The church#2 bells#1 no longer rung#3 on Sundays#1, which is correct according to annotations performed by professional lexicographers.

[2] Example drawn from the data set provided during the SENSEVAL-2 English all-words task. Manual sense annotations were also made available for this data.
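For readers who want to reproduce the flavor of this example, the following usage sketch wires the earlier sense_similarity and label_sequence sketches to NLTK's WordNet interface (assumed installed, with the WordNet data downloaded); the hand-picked lemmas and the three-sense cap mirror the illustration above, and the exact output depends on the WordNet version.

    from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

    words = ["church", "bells", "rung", "Sundays"]          # open-class words
    lemmas = ["church", "bell", "ring", "Sunday"]           # hand-lemmatized
    senses = [wn.synsets(lemma)[:3] for lemma in lemmas]    # cap at three senses

    def dependency(s1, w1, s2, w2):
        # definition-based similarity between two WordNet synsets
        return sense_similarity(s1.definition(), s2.definition())

    assignment = label_sequence(words, senses, dependency, max_dist=len(words))
    for word, synset in zip(words, assignment):
        print(word, "->", synset.name(), "|", synset.definition())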

3.3 Results and Discussion

The algorithm was primarily evaluated on the SENSEVAL-2 English all-words data set, consisting of three documents from Penn Treebank, with 2,456 open-class words (Palmer et al., 2001). Unlike other sense-annotated data sets, e.g. SENSEVAL-3 or SemCor, SENSEVAL-2 is the only testbed for all-words word sense disambiguation that includes a sense map, which allows for additional coarse-grained sense evaluations. Moreover, there is a larger body of previous work that was evaluated on this data set, which can be used as a basis of comparison.

The performance of our algorithm is compared with the disambiguation accuracy obtained with a variation of the Lesk algorithm [3] (Lesk, 1986), which selects the meaning of an open-class word by finding the word sense that leads to the highest overlap between the corresponding dictionary definition and the current context. Similar to the definition similarity function used in the graph-based disambiguation algorithm (Section 3.1), the overlap measure used in the Lesk implementation does not take into account stop-words, and it is normalized with the length of each definition to avoid promoting longer definitions. We are thus comparing the performance of sequence data labeling, which takes into account label dependencies, with individual data labeling, where a label is selected independently of the other labels in the text. Note that both algorithms rely on the same knowledge source, i.e. dictionary definitions, and thus they are directly comparable. Moreover, neither algorithm takes into account the dictionary sense order (e.g. the most frequent sense provided by WordNet), and therefore they are both fully unsupervised.

Table 1 shows precision and recall figures [4] for a context size (MaxDist) equal to the length of each sentence, using: (a) sequence data labeling with iterative graph-based algorithms; (b) individual data labeling with a version of the Lesk algorithm; (c) a random baseline. Evaluations are run for both fine-grained and coarse-grained sense distinctions, to determine the algorithm performance under different classification granularities. The accuracy of the graph-based sequence data labeling algorithm exceeds by a large margin the individual data labeling algorithm, resulting in a 10.7% error rate reduction for fine-grained sense distinctions, which is statistically significant (paired t-test). Performance improvements are equally distributed across all parts of speech, with comparable improvements obtained for nouns, verbs, and adjectives.

[3] Given a sequence of words, the original Lesk algorithm attempts to identify the combination of word senses that maximizes the redundancy (overlap) across all corresponding definitions. The algorithm was later improved through a method for simulated annealing (Cowie et al., 1992), which solved the combinatorial explosion of word senses, while still finding an optimal solution. However, recent comparative evaluations of different variants of the Lesk algorithm have shown that the performance of the original algorithm is significantly exceeded by an algorithm variation that relies on the overlap between word senses and the current context (Vasilescu et al., 2004). We are thus using this latter Lesk variant in our implementation.

[4] Recall is particularly low for each individual part of speech because it is calculated with respect to the entire data set. The overall precision and recall figures coincide, reflecting the 100% coverage of the algorithm.
A similar error rate reduction of 11.0% is obtained for coarse-grained sense distinctions, which suggests that the performance of the graph-based sequence data labeling algorithm does not depend on classification granularity, and similar improvements over individual data labeling can be obtained regardless of the average number of labels per word.

We also measured the variation of performance with context size, and evaluated the disambiguation accuracy for both algorithms for a window size ranging from two words to an entire sentence. The window size parameter limits the number of surrounding words considered when seeking label dependencies (sequence data labeling), or the words counted in the measure of definition-context overlap (individual data labeling). Figure 3 plots the disambiguation accuracy of the two algorithms as a function of context size. As seen in the figure, both algorithms benefit from larger contexts, with a steady increase in performance observed for increasingly larger window sizes. Although the initial growth observed for the sequence data labeling algorithm is somewhat sharper, the gap between the two curves stabilizes for window sizes larger than five words, which suggests that the improvement in performance achieved with sequence data labeling over individual data labeling does not depend on the size of the available context.

The algorithm was also evaluated on two other data sets, the SENSEVAL-3 English all-words data (Snyder and Palmer, 2004) and a subset of SemCor (Miller et al., 1993), although only fine-grained sense evaluations could be conducted on these test sets. The disambiguation precision on the SENSEVAL-3 data was measured at 52.2% using sequence data labeling, compared to 48.1% obtained with individual data labeling, and 34.3% achieved through random sense assignment. The average disambiguation figure obtained on all the words in a random subset of 10 SemCor documents, covering different domains, was 56.5% for sequence data labeling, 47.4% for individual labeling, and 35.3% for the random baseline.

Table 1: Precision (P) and recall (R) for graph-based sequence data labeling, individual data labeling, and the random baseline, for fine-grained and coarse-grained sense distinctions.

Fine-grained sense distinctions:
Part-of-speech | Random baseline P / R | Individual (Lesk) P / R | Sequence (graph-based) P / R
Noun           | 41.4% / 19.4%         | 50.3% / 23.6%           | 57.5% / 27.0%
Verb           | 20.7% /  3.9%         | 30.5% /  5.7%           | 36.5% /  6.9%
Adjective      | 41.3% /  9.3%         | 49.1% / 11.0%           | 56.7% / 12.7%
Adverb         | 44.6% /  5.2%         | 64.6% /  7.6%           | 70.9% /  8.3%
ALL            | 37.9% / 37.9%         | 48.7% / 48.7%           | 54.2% / 54.2%

Coarse-grained sense distinctions:
Part-of-speech | Random baseline P / R | Individual (Lesk) P / R | Sequence (graph-based) P / R
Noun           | 42.7% / 20.0%         | 51.4% / 24.1%           | 58.8% / 27.5%
Verb           | 22.8% /  4.3%         | 31.9% /  6.0%           | 37.9% /  7.1%
Adjective      | 42.6% / 42.6%         | 49.8% / 11.2%           | 57.6% / 12.9%
Adverb         | 40.7% /  4.8%         | 65.3% /  7.7%           | 71.9% /  8.5%
ALL            | 38.7% / 38.7%         | 49.8% / 49.8%           | 55.3% / 55.3%

Figure 3: Disambiguation results using sequence data labeling, individual labeling, and the random baseline, for various context sizes (precision, %, plotted against window size).

Comparison with Related Work

For a given sequence of ambiguous words, the original definition of the Lesk algorithm (Lesk, 1986), and more recent improvements based on simulated annealing (Cowie et al., 1992), seek to identify the combination of senses that maximizes the overlap among their dictionary definitions. Tests performed with this algorithm on the SENSEVAL-2 data set resulted in a disambiguation accuracy of 39.5%. This precision is exceeded by the Lesk algorithm variation used in the experiments reported in this paper, which measures the overlap between sense definitions and the current context, for a precision of 48.7% on the same data set (see Table 1).

In the SENSEVAL-2 evaluations, the best performing fully unsupervised algorithm [5] was developed by (Litkowski, 2001), who combines analysis of multiword units and contextual clues based on collocations and content words from dictionary definitions and examples, for an overall precision and recall of 45.1%. More recently, (McCarthy et al., 2004) reported one of the best results on the SENSEVAL-2 data set, using an algorithm that automatically derives the most frequent sense for a word using distributional similarities learned from a large raw corpus, for a disambiguation precision of 53.0% and a recall of 49.0%.

Another related line of work consists of the disambiguation algorithms based on lexical chains (Morris and Hirst, 1991), and the more recent improvements reported in (Galley and McKeown, 2003), where threads of meaning are identified throughout a text. Lexical chains however only take into account connections between concepts identified in a static way, without considering the importance of the concepts that participate in a relation, which is recursively determined in our algorithm. Moreover, the construction of lexical chains requires structured dictionaries such as WordNet, with explicitly defined semantic relations between word senses, whereas our algorithm can also work with simple unstructured dictionaries that provide only word sense definitions. (Galley and McKeown, 2003) evaluated their algorithm on the nouns from a subset of SemCor, reporting 62.09% disambiguation precision. The performance of our algorithm on the same subset of SemCor nouns was measured at 64.2% [6].
[5] Algorithms that integrate the most frequent sense in WordNet are not considered here, since this represents a supervised knowledge source (WordNet sense frequencies are derived from a sense-annotated corpus).

[6] Note that the results are not directly comparable, since (Galley and McKeown, 2003) used the WordNet sense order to break ties, whereas we assume that such sense order frequency is not available, and thus we break ties through random choice.

Finally, another disambiguation method relying on graph algorithms that exploit the structure of semantic networks was proposed in (Mihalcea et al., 2004), with a disambiguation accuracy of 50.9% measured on all the words in the SENSEVAL-2 data set. Although it relies exclusively on dictionary definitions, the graph-based sequence data labeling algorithm proposed in this paper, with its overall performance of 54.2%, significantly exceeds the accuracy of all these previously proposed unsupervised word sense disambiguation methods, proving the benefits of taking into account label dependencies when annotating sequence data. An additional interesting benefit of the algorithm is that it provides a ranking over word senses, and thus the selection of two or more most probable senses for each word is also possible.

4 Conclusions

We proposed a graphical algorithm for sequence data labeling that relies on random walks on graphs encoding label dependencies. Through the label graphs it builds for a given sequence of words, the algorithm exploits relations between word labels, and implements a concept of recommendation: a label recommends other related labels, and the strength of the recommendation is recursively computed based on the importance of the labels making the recommendation. In this way, the algorithm simultaneously annotates all the words in an input sequence, by identifying the most probable (most recommended) set of labels.

The algorithm was illustrated and tested on an unsupervised word sense disambiguation problem, targeting the annotation of all words in unrestricted texts. Through experiments performed on standard sense-annotated data sets, the graph-based sequence data labeling algorithm was shown to significantly outperform the accuracy achieved through individual data labeling, resulting in a statistically significant error rate reduction of 10.7%. The disambiguation method was also shown to exceed the performance of previously proposed unsupervised word sense disambiguation algorithms. Moreover, comparative results obtained under various experimental settings have shown that the algorithm is robust to changes in classification granularity and context size.

Acknowledgments

This work was partially supported by National Science Foundation grant IIS-0336793.

References

S. Brin and L. Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).

A. Budanitsky and G. Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, Pittsburgh.

J. Cowie, L. Guthrie, and J. Guthrie. 1992. Lexical disambiguation using simulated annealing. In Proceedings of the 5th International Conference on Computational Linguistics (COLING 1992).

D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. 1992. A practical part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing (ANLP-92).

M. Galley and K. McKeown. 2003. Improving word sense disambiguation in lexical chaining. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI 2003), Acapulco, Mexico, August.

G. Grimmett and D. Stirzaker. 1989. Probability and Random Processes. Oxford University Press.

M.E. Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the SIGDOC Conference 1986, Toronto.

K. Litkowski. 2001. Use of machine readable dictionaries in word sense disambiguation for Senseval-2. In Proceedings of ACL/SIGLEX Senseval-2, Toulouse, France.

D. McCarthy, R. Koeling, J. Weeds, and J. Carroll. 2004. Using automatically acquired predominant senses for word sense disambiguation. In Proceedings of ACL/SIGLEX Senseval-3, Barcelona, Spain.
R. Mihalcea, P. Tarau, and E. Figa. 2004. PageRank on semantic networks, with application to word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004).

G. Miller, C. Leacock, T. Randee, and R. Bunker. 1993. A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology, Plainsboro, New Jersey.

G. Miller. 1995. WordNet: A lexical database. Communications of the ACM, 38(11):39-41.

J. Morris and G. Hirst. 1991. Lexical cohesion, the thesaurus, and the structure of text. Computational Linguistics, 17(1):21-48.

M. Palmer, C. Fellbaum, S. Cotton, L. Delfs, and H.T. Dang. 2001. English tasks: all-words and verb lexical sample. In Proceedings of ACL/SIGLEX Senseval-2, Toulouse, France.

B. Snyder and M. Palmer. 2004. The English all-words task. In Proceedings of ACL/SIGLEX Senseval-3, Barcelona, Spain.

F. Vasilescu, P. Langlais, and G. Lapalme. 2004. Evaluating variants of the Lesk approach for disambiguating words. In Proceedings of the Conference on Language Resources and Evaluation (LREC 2004).