Word Sense Disambiguation
L645 / B659
(Some material from Jurafsky & Martin (2009) and Manning & Schütze (2000))
Dept. of Linguistics, Indiana University
Fall 2015

Lexical Semantics

- A (word) sense represents one meaning of a word:
  - bank_1: financial institution
  - bank_2: sloped ground near water
- Various relations:
  - homonymy: 2 words/senses happen to sound the same (e.g., bank_1 & bank_2)
  - polysemy: 2 senses have some semantic relation between them (e.g., bank_1 & bank_3, where bank_3 = repository for biological entities)

WordNet

- WordNet (http://wordnet.princeton.edu/) is a database of lexical relations:
  - nouns (117,798), verbs (11,529), adjectives (21,479) & adverbs (4,481)
    (https://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html)
- WordNet contains different senses of a word, defined by synsets (synonym sets):
  - {chump_1, fool_2, gull_1, mark_9, patsy_1, fall guy_1, sucker_1, soft touch_1, mug_2}
  - Words in a synset are substitutable in some contexts
  - gloss: a person who is gullible and easy to take advantage of
- See http://babelnet.org for other languages
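
As a quick illustration, the sense inventory can be queried programmatically. A minimal sketch using NLTK's WordNet interface (assuming nltk is installed and the WordNet data has been downloaded, e.g. via nltk.download('wordnet')):

```python
# Minimal sketch: querying WordNet senses with NLTK.
from nltk.corpus import wordnet as wn

# Each synset is one sense; the gloss is its definition.
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())

# Members of a synset are (near-)synonyms, substitutable in some contexts.
print(wn.synset('chump.n.01').lemma_names())
# ['chump', 'fool', 'gull', 'mark', 'patsy', 'fall_guy', 'sucker', 'soft_touch', 'mug']
```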

Word Sense Disambiguation (WSD)

- WSD: determine the proper sense of an ambiguous word in a given context
  - e.g., given the word bank, is it:
    - the rising ground bordering a body of water?
    - an establishment for exchanging funds?
    - or maybe a repository (e.g., blood bank)?
- WSD comes in two variants:
  - Lexical sample task: small pre-selected set of target words (along with a sense inventory)
  - All-words task: entire texts
- Our goal: get a flavor for the insights & what the techniques need to accomplish

Supervised WSD

- Extract features which are helpful for particular senses & train a classifier to assign the correct sense
- Lexical sample task: labeled corpora for individual words
- All-words disambiguation task: use a semantic concordance (e.g., SemCor)

WSD Evaluation

- Extrinsic (in vivo) evaluation: evaluate WSD in the context of another task, e.g., question answering
- Intrinsic (in vitro) evaluation: evaluate WSD as a stand-alone system
  - Exact-match sense accuracy
  - Precision/recall measures, if systems pass on some labelings
- Baseline: most frequent sense (MFS); for WordNet, take the first sense (later)
- Ceiling: inter-annotator agreement, generally 75-80%
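
A minimal sketch of intrinsic evaluation, exact-match accuracy against an MFS baseline, over hypothetical gold/predicted labels:

```python
# Minimal sketch: exact-match sense accuracy and a most-frequent-sense baseline.
# The labels below are hypothetical placeholders, not real data.
from collections import Counter

gold      = ['bank_1', 'bank_1', 'bank_2', 'bank_1', 'bank_2']
predicted = ['bank_1', 'bank_2', 'bank_2', 'bank_1', 'bank_1']

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# MFS baseline: always predict the majority sense of the gold data.
mfs = Counter(gold).most_common(1)[0][0]
mfs_accuracy = sum(g == mfs for g in gold) / len(gold)

print(f"system accuracy: {accuracy:.2f}, MFS baseline: {mfs_accuracy:.2f}")
```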

Feature Extraction

1. POS tag, lemmatize/stem, & perhaps parse the sentence in question
2. Extract context features within a certain window of the target word
   - Feature vector: numeric or nominal values encoding linguistic information

Collocational Features

- Collocational features encode information about specific positions to the left or right of a target word
  - capture local lexical & grammatical information
- Consider: An electric guitar and bass player stand off to one side, not really part of the scene...

  [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}]
  [guitar, NN, and, CC, player, NN, stand, VB]
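
A minimal sketch of extracting such positional features from a POS-tagged sentence (the tagging of the example sentence is assumed):

```python
# Minimal sketch: collocational features (words and POS tags at fixed
# positions around the target word), as in the feature vector above.
def collocational_features(tagged, i, window=2):
    """tagged: list of (word, pos) pairs; i: index of the target word."""
    features = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the target word itself
        j = i + offset
        if 0 <= j < len(tagged):
            features.extend(tagged[j])           # w_{i+offset}, POS_{i+offset}
        else:
            features.extend(['<PAD>', '<PAD>'])  # sentence-boundary padding
    return features

tagged = [('An', 'DT'), ('electric', 'JJ'), ('guitar', 'NN'), ('and', 'CC'),
          ('bass', 'NN'), ('player', 'NN'), ('stand', 'VB'), ('off', 'RP')]
print(collocational_features(tagged, i=4))
# -> ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']
```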

Bag-of-Words Features

- Bag-of-words features encode unordered sets of surrounding words, ignoring exact position
  - captures more semantic properties & the general topic of discourse
- Vocabulary for the surrounding words is usually pre-defined
  - e.g., the 12 most frequent content words from bass sentences in the WSJ:
    [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
  - leading to this feature vector:
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
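
A minimal sketch of the bag-of-words encoding over a fixed vocabulary, reproducing the vector above:

```python
# Minimal sketch: binary bag-of-words features over a pre-defined vocabulary,
# ignoring word order within the context window.
vocab = ['fishing', 'big', 'sound', 'player', 'fly', 'rod',
         'pound', 'double', 'runs', 'playing', 'guitar', 'band']

def bow_features(context_words, vocab):
    context = set(context_words)
    return [1 if w in context else 0 for w in vocab]

context = ['an', 'electric', 'guitar', 'and', 'bass', 'player', 'stand',
           'off', 'to', 'one', 'side']
print(bow_features(context, vocab))
# -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```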

Bayesian WSD

- Look at a context of surrounding words, call it c, within a window of a particular size
- Select the best sense s from among the different senses:

  (1)  \hat{s} = \arg\max_{s_k} P(s_k \mid c) = \arg\max_{s_k} \frac{P(c \mid s_k) P(s_k)}{P(c)} = \arg\max_{s_k} P(c \mid s_k) P(s_k)

- Computationally simpler to calculate logarithms, giving:

  (2)  \hat{s} = \arg\max_{s_k} [\log P(c \mid s_k) + \log P(s_k)]

Naive Bayes Assumption

- Treat the context c as a bag of words v_j
- Make the assumption that every surrounding word v_j is independent of the other ones:

  (3)  P(c \mid s_k) = \prod_{v_j \in c} P(v_j \mid s_k)

  (4)  \hat{s} = \arg\max_{s_k} [\sum_{v_j \in c} \log P(v_j \mid s_k) + \log P(s_k)]

- We use maximum likelihood estimates from the corpus to obtain P(s_k) and P(v_j \mid s_k)
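
Putting (3) and (4) together, a minimal sketch of a naive-Bayes sense classifier over labeled contexts; the add-one smoothing and the toy training pairs are assumptions not on the slide:

```python
# Minimal sketch: naive-Bayes WSD with MLE counts and add-one smoothing.
# Training data format (hypothetical): list of (context_words, sense) pairs.
import math
from collections import Counter, defaultdict

def train(labeled_contexts):
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in labeled_contexts:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def disambiguate(context, sense_counts, word_counts, vocab):
    total = sum(sense_counts.values())
    best_sense, best_score = None, float('-inf')
    for sense, count in sense_counts.items():
        score = math.log(count / total)                    # log P(s_k)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for v in context:                                  # sum_j log P(v_j | s_k)
            score += math.log((word_counts[sense][v] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

data = [(['fishing', 'rod', 'river'], 'bass_fish'),
        (['guitar', 'player', 'band'], 'bass_music')]
model = train(data)
print(disambiguate(['guitar', 'band'], *model))   # -> 'bass_music'
```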

Dictionary-Based WSD

- Use general characterizations of the senses to aid in disambiguation
- Intuition: words found in a particular sense definition can provide contextual cues, e.g., for ash:

  Sense                 | Definition
  s_1: tree             | a tree of the olive family
  s_2: burned stuff     | the solid residue left when combustible material is burned

- If tree is in the context of ash, the sense is more likely s_1

Look at the words within each sense definition and at the words within the definitions of the context words, too (unioning over their different senses):

1. Take all senses s_k of a word w and gather the set of words in each definition
   - Treat it as a bag of words
2. Gather all the words in the definitions of the surrounding words, within some context window
3. Calculate the overlap
4. Choose the sense with the highest overlap
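
A minimal sketch of this overlap computation (a simplified Lesk-style procedure); the two-sense inventory, the stopword list, and the crude suffix stripping are simplifications for illustration only:

```python
# Minimal sketch: dictionary-overlap disambiguation in the spirit of the steps
# above. Stopword removal and suffix stripping are crude stand-ins for real
# preprocessing; the sense inventory is a toy example.
STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'is', 'to', 'this', 'into', 'one', 'when'}

def normalize(words):
    """Lowercase, drop stopwords, and strip a few common suffixes."""
    stems = set()
    for w in words:
        w = w.lower()
        if w in STOPWORDS:
            continue
        for suffix in ('ing', 'ed', 's'):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[:-len(suffix)]
                break
        stems.add(w)
    return stems

definitions = {
    'ash_1 (tree)':         'a tree of the olive family',
    'ash_2 (burned stuff)': 'the solid residue left when combustible material is burned',
}

def overlap_disambiguate(sentence, definitions):
    context = normalize(sentence.split())
    scores = {sense: len(context & normalize(gloss.split()))
              for sense, gloss in definitions.items()}
    return max(scores, key=scores.get), scores

print(overlap_disambiguate('This cigar burns slowly and creates a stiff ash', definitions))
print(overlap_disambiguate('The ash is one of the last trees to come into leaf', definitions))
```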

Example

(5) This cigar burns slowly and creates a stiff ash.
(6) The ash is one of the last trees to come into leaf.

- So, sense s_2 goes with the first sentence and s_1 with the second
- Note that, depending on the dictionary, leaf might also be a contextual cue for sense s_1 of ash

Problems with Dictionary-Based WSD

- Not very accurate: 50%-70%
- Highly dependent upon the choice of dictionary
- Not always clear whether the dictionary definitions align with what we think of as different senses

Bootstrapping

- Can use a heuristic to automatically select seeds:
  - One sense per discourse: the sense of a word is highly consistent within a given document
  - One sense per collocation: collocations rarely have multiple senses associated with them

One Sense per Collocation

- Rank senses based on the collocations the word appears in, e.g., show interest might be strongly correlated with the attention, concern usage of interest
- The collocational feature could be a surrounding POS tag, or a word in the object position
- For a given context, select which collocational feature will be used to disambiguate, based on which feature is the strongest indicator
  - This avoids having to combine different pieces of information
- Rankings are based on the following ratio, where f is a collocational feature:

  (7)  \frac{P(s_{k_1} \mid f)}{P(s_{k_2} \mid f)}

Calculating Collocations

1. Initially, calculate the collocations for s_k
2. Calculate the contexts in which an ambiguous word is assigned to s_k, based on those collocations
3. Calculate the set of collocations that are most characteristic of the contexts for s_k, using the formula:

   (8)  \frac{P(s_{k_1} \mid f)}{P(s_{k_2} \mid f)}

4. Repeat steps 2 & 3 until a threshold is reached.
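
A minimal sketch in this spirit: rank collocational features by the (smoothed, logged) ratio in (7)/(8) and disambiguate with the single strongest feature present in the context. The toy training pairs, feature names, and smoothing constant are assumptions:

```python
# Minimal sketch: score each collocational feature f by log( P(s1|f) / P(s2|f) )
# and let the single strongest feature in the context decide the sense
# (a decision-list flavour of "one sense per collocation").
import math
from collections import defaultdict

def feature_log_ratios(labeled, s1, s2, alpha=0.1):
    counts = defaultdict(lambda: {s1: 0.0, s2: 0.0})
    for features, sense in labeled:
        for f in features:
            counts[f][sense] += 1
    return {f: math.log((c[s1] + alpha) / (c[s2] + alpha))
            for f, c in counts.items()}

def disambiguate(features, ratios, s1, s2):
    present = [f for f in features if f in ratios]
    if not present:
        return None
    best = max(present, key=lambda f: abs(ratios[f]))  # strongest indicator wins
    return s1 if ratios[best] > 0 else s2

labeled = [({'verb:show', 'next_pos:IN'}, 'interest_attention'),
           ({'verb:accrue', 'prev_word:compound'}, 'interest_money'),
           ({'verb:show'}, 'interest_attention')]
ratios = feature_log_ratios(labeled, 'interest_attention', 'interest_money')
print(disambiguate({'verb:show', 'prev_word:compound'}, ratios,
                   'interest_attention', 'interest_money'))
# -> 'interest_attention'
```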

Word Similarity

- Idea: we expect synonyms to behave similarly
- Define this in two ways:
  - Knowledge-based: thesaurus-based WSD
  - Knowledge-free: distributional methods
- Word similarity computations are useful for IR, QA, summarization, language modeling, etc.

Thesaurus-Based WSD

- Use essentially the same set-up as dictionary-based WSD, but now:
  - instead of requiring context words to have overlapping dictionary definitions,
  - we require surrounding context words to list the focus word w (or the subject code of w) as one of their topics
  - e.g., if an animal or insect appears in the context of bass, we choose the fish sense instead of the musical one
- Alternative: use path lengths in an ontology like WordNet to calculate word similarity
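
A minimal sketch of the path-length alternative, using NLTK's WordNet interface (assuming the WordNet data is installed): pick the target sense whose synset lies closest to some context word's senses.

```python
# Minimal sketch: choose the sense of a target word whose WordNet synset is
# closest (by path similarity) to any sense of a context word.
from nltk.corpus import wordnet as wn

def best_sense_by_path(target, context_word):
    best = (None, 0.0)
    for t_syn in wn.synsets(target, pos=wn.NOUN):
        for c_syn in wn.synsets(context_word, pos=wn.NOUN):
            sim = t_syn.path_similarity(c_syn) or 0.0
            if sim > best[1]:
                best = (t_syn, sim)
    return best

# 'trout' as context should favour the fish sense of "bass",
# 'guitar' the musical sense.
print(best_sense_by_path('bass', 'trout'))
print(best_sense_by_path('bass', 'guitar'))
```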

Translation-Based WSD

- Idea: when disambiguating a word w, look for a combination of w and some contextual word which translates to a particular pair, indicating a particular sense
  - interest can be legal share (Beteiligung in German) or concern (Interesse)
  - In the phrase show interest, we are more likely to translate to Interesse zeigen than Beteiligung zeigen
  - So, in this English context, the German translation tells us to go with the sense that corresponds to Interesse

Information-Theoretic WSD

- Instead of using all contextual features (which we assume are independent), an information-theoretic approach tries to find a single disambiguating feature
- Take a set of possible indicators and determine which is best, i.e., which gives the highest mutual information in the training data
- Possible indicators:
  - object of the verb
  - the verb tense
  - word to the left
  - word to the right
  - etc.
- When sense tagging, find the value of that indicator and use it to tag

Partitioning

- More specifically, determine what the values x_i of the indicator indicate, i.e., what sense s_i they point to
- Assume two senses (P_1 and P_2), which can be captured in subsets Q_1 = {x_i | x_i indicates sense 1} and Q_2 = {x_i | x_i indicates sense 2}
- We will have a set of indicator values Q; our goal is to partition Q into these two sets
- The partition we choose is the one which maximizes the mutual information scores I(P_1; Q_1) and I(P_2; Q_2)
- The Flip-Flop algorithm is used when you have to automatically determine your senses (e.g., if using parallel text)

The Flip-Flop Algorithm (roughly)

1. Randomly partition P (possible senses/translations) into P_1 and P_2
2. While the mutual information score is still improving:
   2.1 Find the partition of Q (possible indicators) into Q_1 and Q_2 which maximizes I(P; Q)
       - Q might be the set of objects which appear after the verb in question
   2.2 Find the partition of P into P_1 and P_2 which maximizes I(P; Q)
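
A rough sketch of the Flip-Flop loop over a small joint count table; for readability it brute-forces the two-way partitions rather than using the efficient splitting theorem, and the toy counts (translation, verb-object indicator) are hypothetical:

```python
# Rough sketch of the Flip-Flop idea: alternately re-partition the translations
# P and the indicator values Q so as to increase I(P; Q). For a small toy table
# we brute-force all 2-way partitions; the real algorithm is more efficient.
import math
from itertools import combinations

counts = {                      # hypothetical counts[(translation, indicator)]
    ('Interesse', 'zeigen'): 8, ('Interesse', 'wecken'): 5,
    ('Beteiligung', 'erwerben'): 7, ('Beteiligung', 'zeigen'): 1,
}
P_values = sorted({p for p, _ in counts})
Q_values = sorted({q for _, q in counts})
total = sum(counts.values())

def mutual_information(P1, Q1):
    """I between the binary variables 'p in P1' and 'q in Q1'."""
    mi = 0.0
    for p_in in (True, False):
        for q_in in (True, False):
            joint = sum(c for (p, q), c in counts.items()
                        if (p in P1) == p_in and (q in Q1) == q_in) / total
            p_marg = sum(c for (p, _), c in counts.items() if (p in P1) == p_in) / total
            q_marg = sum(c for (_, q), c in counts.items() if (q in Q1) == q_in) / total
            if joint > 0:
                mi += joint * math.log(joint / (p_marg * q_marg))
    return mi

def proper_subsets(values):
    return [set(s) for r in range(1, len(values))
            for s in combinations(values, r)]

P1 = {P_values[0]}                 # 1. arbitrary starting partition of P
best_mi = -1.0
while True:                        # 2. alternate until I(P; Q) stops improving
    Q1 = max(proper_subsets(Q_values), key=lambda q: mutual_information(P1, q))  # 2.1
    P1 = max(proper_subsets(P_values), key=lambda p: mutual_information(p, Q1))  # 2.2
    mi = mutual_information(P1, Q1)
    if mi <= best_mi + 1e-12:
        break
    best_mi = mi

print('P1 =', P1, 'Q1 =', Q1, 'I(P; Q) =', round(best_mi, 3))
```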

After determining the best indicator and partitioning its values, disambiguating is easy:

1. Determine the value x_i of the indicator for the ambiguous word.
2. If x_i is in Q_1, assign sense 1; otherwise, sense 2.

This method is also applicable for determining which indicators are best for a set of translation words.

Unsupervised WSD

- Perform sense discrimination, or clustering
  - In other words, group comparable senses together, even if you cannot give them a correct label
- We will look briefly at the EM (Expectation-Maximization) algorithm for this task, based on a Bayesian model

EM Algorithm: Bayesian Review

- Bayesian WSD for supervised learning:
  - Look at a context of surrounding words, call it c (v_j = word in context), within a window of a particular size
  - Select the best sense s from among the different senses:

  (9)  \hat{s} = \arg\max_{s_k} P(s_k \mid c)
              = \arg\max_{s_k} \frac{P(c \mid s_k) P(s_k)}{P(c)}
              = \arg\max_{s_k} P(c \mid s_k) P(s_k)
              = \arg\max_{s_k} [\log P(c \mid s_k) + \log P(s_k)]
              = \arg\max_{s_k} [\sum_{v_j \in c} \log P(v_j \mid s_k) + \log P(s_k)]

- Without labeled data, we need some other way to get estimates of P(s_k) and P(c \mid s_k)

EM Algorithm

1. Initialize the parameters randomly, i.e., the probabilities for all senses and contexts
   - Also decide K, the number of senses you want; this determines how fine-grained your distinctions are
2. While still improving:
   2.1 Expectation: re-estimate the probability of s_k generating the context c_i:

   (10)  \hat{P}(c_i \mid s_k) = \frac{P(c_i \mid s_k)}{\sum_{k'=1}^{K} P(c_i \mid s_{k'})}

   Recall that all contextual words v_j (i.e., P(v_j \mid s_k)) are used to calculate P(c_i \mid s_k)

EM Algorithm (cont.)

   2.2 Maximization: use the expected probabilities to re-estimate the parameters:

   (11)  P(v_j \mid s_k) = \frac{\sum_{\{c_i : v_j \in c_i\}} \hat{P}(c_i \mid s_k)}{\sum_{k} \sum_{\{c_i : v_j \in c_i\}} \hat{P}(c_i \mid s_k)}

   Of all the times that v_j occurs in a context of any of this word's senses, how often does v_j indicate s_k?

   (12)  P(s_k) = \frac{\sum_i \hat{P}(c_i \mid s_k)}{\sum_k \sum_i \hat{P}(c_i \mid s_k)}

   Of all the times that any sense generates c_i, how often does s_k generate it?
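
A minimal sketch of EM-based sense discrimination over bag-of-words contexts. Note that, unlike (10), the E-step below uses the full posterior (it includes the sense prior P(s_k)); the random initialization, smoothing, iteration count, and toy contexts are assumptions for illustration:

```python
# Minimal sketch: EM for unsupervised sense discrimination with a bag-of-words
# (naive-Bayes) model. E-step: responsibilities P(s_k | c_i); M-step: re-estimate
# P(s_k) and P(v_j | s_k) from those responsibilities.
import math
import random

def em_sense_discrimination(contexts, K=2, iterations=20, seed=0, alpha=0.01):
    rng = random.Random(seed)
    vocab = sorted({v for c in contexts for v in c})
    # 1. random initialization of P(s_k) and P(v_j | s_k)
    prior = [1.0 / K] * K
    word_prob = [{v: rng.random() + 0.5 for v in vocab} for _ in range(K)]
    for k in range(K):
        z = sum(word_prob[k].values())
        word_prob[k] = {v: p / z for v, p in word_prob[k].items()}

    for _ in range(iterations):
        # 2.1 Expectation: responsibility of sense k for context i
        resp = []
        for c in contexts:
            log_scores = [math.log(prior[k]) +
                          sum(math.log(word_prob[k][v]) for v in c)
                          for k in range(K)]
            m = max(log_scores)
            exp_scores = [math.exp(s - m) for s in log_scores]
            z = sum(exp_scores)
            resp.append([s / z for s in exp_scores])

        # 2.2 Maximization: re-estimate P(s_k) and P(v_j | s_k)
        prior = [sum(r[k] for r in resp) / len(contexts) for k in range(K)]
        for k in range(K):
            totals = {v: alpha for v in vocab}   # small smoothing to avoid zeros
            for c, r in zip(contexts, resp):
                for v in c:
                    totals[v] += r[k]
            z = sum(totals.values())
            word_prob[k] = {v: t / z for v, t in totals.items()}
    return resp

contexts = [['fishing', 'rod', 'river'], ['river', 'fishing', 'boat'],
            ['guitar', 'band', 'player'], ['band', 'player', 'music']]
for c, r in zip(contexts, em_sense_discrimination(contexts)):
    print(c, '->', [round(p, 2) for p in r])
```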

Surveys on WSD Systems

- Roberto Navigli (2009). Word Sense Disambiguation: A Survey. ACM Computing Surveys, 41(2), pp. 1-69.
  - Covers: decision lists, decision trees, neural networks, instance-based learning, SVMs, ensemble methods, clustering, multilinguality, Semeval/Senseval, etc.
  - http://wwwusers.di.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf
- Alok Ranjan Pal and Diganta Saha (2015). Word Sense Disambiguation: A Survey. International Journal of Control Theory and Computer Modeling (IJCTCM), 5(3).
  - http://arxiv.org/pdf/1508.01346.pdf