Final Projects. Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison


Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. Alessandro Raganato, José Camacho-Collados and Roberto Navigli. lcl.uniroma1.it/wsdeval

Word Sense Disambiguation (WSD): given a word in context, find its correct sense. Example: "The mouse ate the cheese" (the animal sense, not the device sense: "an object held in one's hand, with one or more buttons").

International Workshops on Semantic Evaluation. Many evaluation datasets have been constructed for the task: Senseval-2 (2001), WN 1.7; Senseval-3 (2004), WN 1.7.1; SemEval-2007, WN 2.1; SemEval-2013, WN 3.0; SemEval-2015, WN 3.0. Problem: different formats, construction guidelines and sense inventories.

Building a Unified Evaluation Framework. Our goal: build a unified framework for all-words WSD (training and testing), and use it to perform a fair quantitative and qualitative empirical comparison. How: standardizing the WSD datasets and training corpora into a unified format; semi-automatically converting annotations from any dataset to WordNet 3.0; preprocessing the datasets consistently with the same pipeline.

Building a Unified Evaluation Framework. Pipeline for standardizing any given WSD dataset. Standardizing format: convert all datasets to a unified XML schema in which preprocessing information (e.g. lemma, PoS tag) of a given corpus can be encoded.
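To make the idea concrete, here is a minimal sketch of parsing a sentence in a WSDEval-style XML scheme. The element and attribute names below follow the general shape described on the slide (word forms vs. annotated target instances, with lemma and PoS encoded), but the exact names and ids are illustrative, not taken from the released framework.

```python
import xml.etree.ElementTree as ET

# A toy sentence in a WSDEval-style XML scheme (names illustrative):
# <wf> is a plain word form, <instance> is a target to disambiguate.
xml_data = """
<corpus lang="en">
  <text id="d000">
    <sentence id="d000.s000">
      <wf lemma="the" pos="DET">The</wf>
      <instance id="d000.s000.t000" lemma="mouse" pos="NOUN">mouse</instance>
      <wf lemma="eat" pos="VERB">ate</wf>
      <wf lemma="the" pos="DET">the</wf>
      <instance id="d000.s000.t001" lemma="cheese" pos="NOUN">cheese</instance>
    </sentence>
  </text>
</corpus>
"""

root = ET.fromstring(xml_data)
# Collect the disambiguation targets with their preprocessing information.
targets = [(i.get("id"), i.get("lemma"), i.get("pos"))
           for i in root.iter("instance")]
print(targets)
```

Encoding lemma and PoS as attributes on each token is what lets every dataset be consumed by the same evaluation code regardless of its original format.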

Building a Unified Evaluation Framework. WN version mapping: map the sense annotations from their original WordNet version to 3.0, carried out semi-automatically (Daudé et al., 2003). [Jordi Daudé, Lluís Padró, and German Rigau. Validation and tuning of WordNet mapping techniques. In Proceedings of RANLP 2003.]
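A rough sketch of what "semi-automatic" mapping looks like in practice: confident one-to-one mappings are applied automatically, while low-confidence or one-to-many cases are flagged for manual review. The offsets and scores below are placeholders, not real WordNet data or the actual Daudé et al. mapping tables.

```python
# Toy mapping table from an old WordNet version to 3.0 (made-up offsets).
# Each entry maps an old synset offset to candidate 3.0 offsets with a
# confidence score.
mapping_old_to_30 = {
    "02331046-n": [("02330245-n", 1.0)],                       # unambiguous
    "00139876-v": [("00140112-v", 0.6), ("00140331-v", 0.4)],  # ambiguous
}

def map_annotation(old_offset, threshold=0.9):
    """Return the 3.0 offset if the mapping is confident, else None
    (meaning the annotation goes to manual verification)."""
    candidates = mapping_old_to_30.get(old_offset, [])
    best = max(candidates, key=lambda c: c[1], default=None)
    if best and best[1] >= threshold:
        return best[0]
    return None

print(map_annotation("02331046-n"))
print(map_annotation("00139876-v"))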

Building a Unified Evaluation Framework. Preprocessing: use the Stanford CoreNLP toolkit for part-of-speech tagging and lemmatization.

Building a Unified Evaluation Framework. Semi-automatic verification: a script checks that the final dataset conforms to the guidelines, ensuring that each sense annotation matches the lemma and PoS tag provided by Stanford CoreNLP.
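The core of such a check can be sketched in a few lines. WordNet sense keys encode the lemma and synset type directly (the `lemma%ss_type:...` convention, where ss_type 1 = noun, 2 = verb, etc.), so each gold key can be tested against the instance's lemma and PoS attributes; the function below is a simplified illustration, not the authors' actual script.

```python
# WordNet sense keys look like "mouse%1:05:00::"; the digit after '%'
# is the synset type (1=noun, 2=verb, 3=adjective, 4=adverb, 5=adj. satellite).
SS_TYPE_TO_POS = {"1": "NOUN", "2": "VERB", "3": "ADJ", "4": "ADV", "5": "ADJ"}

def check_instance(lemma, pos, sense_key):
    """True iff the gold sense key agrees with the instance's lemma and PoS."""
    key_lemma, rest = sense_key.split("%", 1)
    key_pos = SS_TYPE_TO_POS[rest.split(":")[0]]
    return key_lemma == lemma and key_pos == pos

assert check_instance("mouse", "NOUN", "mouse%1:05:00::")
assert not check_instance("mouse", "VERB", "mouse%1:05:00::")
print("all checks passed")
```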

Data: evaluation framework. Training data: SemCor, a manually sense-annotated corpus; OMSTI (One Million Sense-Tagged Instances), a large annotated corpus constructed automatically using an alignment-based WSD approach. Testing data: Senseval-2 (nouns, verbs, adverbs and adjectives); Senseval-3 (nouns, verbs, adverbs and adjectives); SemEval-2007 (nouns and verbs); SemEval-2013 (nouns only); SemEval-2015 (nouns, verbs, adverbs and adjectives); ALL, the concatenation of all five test sets.

Statistics: training data. SemCor: 226,036 annotations, 33,362 sense types. OMSTI: 911,134 annotations, 3,730 sense types.

Statistics: testing data. Senseval-2: 2,282 instances (5.4 senses per word on average); Senseval-3: 1,850 (6.8); SemEval-2007: 455 (8.5); SemEval-2013: 1,644 (4.9); SemEval-2015: 1,022 (5.5).

Statistics: testing data (ALL). ALL, the concatenation of all five evaluation datasets. Total test instances: 7,253. By part of speech: nouns 4,300 (4.8 senses on average), verbs 1,652 (10.4), adjectives 955 (3.8), adverbs 346 (3.1).

Evaluation

Evaluation: Comparison systems (knowledge-based). Lesk_extended (Banerjee and Pedersen, 2003); Lesk+emb (Basile et al., 2014); UKB (Agirre et al., 2014); Babelfy (Moro et al., 2014).

Evaluation: Comparison systems (knowledge-based). Lesk (Lesk, 1986): based on the overlap between the definitions of a given sense and the context of the target word. Two configurations: Lesk_extended (Banerjee and Pedersen, 2003), which also includes related senses and uses tf-idf for word weighting; and Lesk+emb (Basile et al., 2014), an enhanced version in which the similarity between definitions and the target context is computed via word embeddings.
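The basic gloss-overlap idea can be sketched in a few lines. The sense inventory and glosses below are toy stand-ins (abridged, not actual WordNet text), and real implementations add related senses' glosses and tf-idf weighting as the slide notes.

```python
# Simplified Lesk over a toy two-sense inventory (glosses are made up).
SENSES = {
    "mouse#animal": "small rodent with a long tail that eats grain and cheese",
    "mouse#device": "hand operated pointing device with one or more buttons",
}

def lesk(context, senses=SENSES):
    """Pick the sense whose gloss shares the most words with the context."""
    ctx = set(context.lower().split())
    return max(senses, key=lambda s: len(ctx & set(senses[s].split())))

print(lesk("The mouse ate the cheese"))   # overlap on "cheese" -> mouse#animal
print(lesk("click the mouse buttons"))    # overlap on "buttons" -> mouse#device
```

The weakness this exposes, and that the extended variants address, is that exact word overlap with a short gloss is sparse; embeddings (Lesk+emb) soften the matching.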

Evaluation: Comparison systems (knowledge-based). UKB (Agirre et al., 2014): graph-based system that exploits random walks over a semantic network, using Personalized PageRank. It uses the standard WordNet graph plus disambiguated glosses as connections. NEW, UKB*: enhanced configuration using sense distributions from SemCor and running Personalized PageRank for each word.
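A pure-Python sketch of the Personalized PageRank computation UKB relies on: ranks are propagated over the sense graph, with teleportation biased towards the context. The four-node graph and teleport vector below are toy data; UKB runs this over the full WordNet graph.

```python
# Personalized PageRank by power iteration over a tiny toy sense graph.
def personalized_pagerank(edges, teleport, damping=0.85, iters=50):
    nodes = sorted({n for e in edges for n in e} | set(teleport))
    out = {n: [b for a, b in edges if a == n] for n in nodes}
    t_sum = sum(teleport.values())
    p = {n: teleport.get(n, 0.0) / t_sum for n in nodes}  # teleport distribution
    r = dict(p)
    for _ in range(iters):
        nxt = {n: (1 - damping) * p[n] for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * r[n] / len(out[n])
                for m in out[n]:
                    nxt[m] += share
            else:  # dangling node: redistribute its mass via teleport
                for m in nodes:
                    nxt[m] += damping * r[n] * p[m]
        r = nxt
    return r

# Context word "cheese" is connected to the animal sense only, so
# personalizing the walk on it should favor that sense.
edges = [("cheese", "mouse_animal"), ("mouse_animal", "cheese"),
         ("button", "mouse_device"), ("mouse_device", "button")]
ranks = personalized_pagerank(edges, teleport={"cheese": 1.0})
print(ranks)
```

Personalizing the teleport vector on the context words is what turns generic graph centrality into context-sensitive disambiguation.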

Evaluation: Comparison systems (knowledge-based). Babelfy (Moro et al., 2014): graph-based system that uses random walks with restart over a semantic network, creating high-coherence semantic interpretations of the input text. It uses BabelNet as its semantic network, which provides a large set of connections coming from Wikipedia and other resources.

Evaluation: Results on the concatenation of all datasets (ALL), knowledge-based systems, F-Measure (%): Lesk_extended 48.7; UKB 57.5; Lesk+emb 63.7; MCS baseline 65.2; Babelfy 65.5. For reference, the worst supervised system scores 68.4.
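A note on the F-Measure used throughout these results: WSD systems may leave some instances unanswered, so precision (over attempted instances) and recall (over all instances) can differ, and F1 is their harmonic mean. The counts below are made up for illustration.

```python
# F1 for WSD evaluation: precision over attempted, recall over all instances.
def f1(correct, attempted, total):
    precision = correct / attempted
    recall = correct / total
    return 2 * precision * recall / (precision + recall)

# Toy numbers: 620 correct out of 900 attempted, 1000 total instances.
print(round(100 * f1(correct=620, attempted=900, total=1000), 1))  # -> 65.3
```

For a system that attempts every instance, precision equals recall and F1 collapses to plain accuracy.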

Evaluation: Comparison systems. Knowledge-based: Lesk_extended (Banerjee and Pedersen, 2003); Lesk+emb (Basile et al., 2014); UKB (Agirre et al., 2014); Babelfy (Moro et al., 2014). Supervised: IMS (Zhong and Ng, 2010); IMS+emb (Iacobacci et al., 2016); Context2Vec (Melamud et al., 2016).

Evaluation: Comparison systems (supervised). IMS (Zhong and Ng, 2010): an SVM classifier over a set of conventional features: surrounding words, PoS tags and local collocations. Integrating word embeddings as an additional feature yields improvements (Taghipour and Ng, 2015; Rothe and Schütze, 2015; Iacobacci et al., 2016) -> IMS+emb.
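A minimal sketch of IMS-style feature extraction for one target word, feeding the kind of sparse feature dictionary an SVM would consume. The window sizes and feature names are simplified stand-ins, not the actual IMS configuration.

```python
# Extract surrounding words, surrounding PoS tags, and a local collocation
# for the target at position i (toy version of the IMS feature set).
def extract_features(tokens, pos_tags, i, window=2):
    feats = {}
    for k in range(-window, window + 1):
        if k != 0 and 0 <= i + k < len(tokens):
            feats[f"word_{k}"] = tokens[i + k].lower()
            feats[f"pos_{k}"] = pos_tags[i + k]
    # Local collocation: the immediate word pair around the target.
    if 0 < i < len(tokens) - 1:
        feats["colloc_-1_+1"] = f"{tokens[i-1].lower()}_{tokens[i+1].lower()}"
    return feats

tokens = ["The", "mouse", "ate", "the", "cheese"]
pos = ["DET", "NOUN", "VERB", "DET", "NOUN"]
print(extract_features(tokens, pos, i=1))
```

IMS+emb extends exactly this dictionary with the (averaged or concatenated) word embeddings of the context as additional real-valued features.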

Evaluation: Comparison systems (supervised). Context2Vec (Melamud et al., 2016), in three steps: first, a bidirectional LSTM is trained on an unlabeled corpus; then, this model is used to learn an output (context) vector for each sense annotation in the sense-annotated training corpus; finally, the sense whose context vector is closest to the target word's context vector is selected as the intended sense.
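The final selection step can be sketched as a nearest-neighbor search by cosine similarity. The three-dimensional vectors below are tiny made-up stand-ins for the biLSTM context encodings, shown only to make the decision rule concrete.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Toy per-sense context vectors learned from annotated training contexts.
sense_vectors = {
    "mouse#animal": [0.9, 0.1, 0.2],
    "mouse#device": [0.1, 0.8, 0.7],
}
# Pretend biLSTM encoding of the test context "The _ ate the cheese".
context_vec = [0.8, 0.2, 0.1]

best = max(sense_vectors, key=lambda s: cosine(sense_vectors[s], context_vec))
print(best)
```

Because both sense vectors and test contexts live in the same space, no per-word classifier has to be trained, which is what distinguishes this approach from IMS.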

Evaluation: Results on the concatenation of all datasets (ALL), supervised systems trained on SemCor, F-Measure (%): MFS baseline 64.8; IMS 68.4; Context2Vec 69.0; IMS+emb 69.6. Adding OMSTI to the training data yields small further gains (from +0.1 to +0.4 depending on the system).

Evaluation: Analysis. Training corpus: the automatically constructed OMSTI improves the results of supervised systems over training on SemCor alone. Research direction -> (semi-)automatic construction of sense-annotated datasets to overcome the knowledge-acquisition bottleneck.

Evaluation: Analysis. Knowledge-based vs. supervised: supervised systems clearly outperform knowledge-based systems, and seem to better capture local contexts, e.g.: "In sum, at both the federal and state government levels at least part of the seemingly irrational behavior voters display in the voting booth may have an exceedingly rational explanation."

Evaluation: Analysis. Knowledge-based systems are competitive for nouns but underperform on the other parts of speech. The Most Common Sense (MCS) baseline is still hard to beat: only Babelfy and UKB* manage to outperform it, but Babelfy uses the MCS baseline as a back-off strategy, and the configuration of UKB that outperforms the baseline integrates all the sense distributions from SemCor.
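The MCS/MFS baseline itself is simple to state in code: count sense annotations per lemma in the training corpus and always answer with the most frequent one. The annotation counts below are made up for illustration.

```python
from collections import Counter

# Toy (lemma, sense) annotation pairs standing in for a training corpus.
train_annotations = [
    ("mouse", "mouse#animal"), ("mouse", "mouse#animal"),
    ("mouse", "mouse#device"),
]

counts = {}
for lemma, sense in train_annotations:
    counts.setdefault(lemma, Counter())[sense] += 1

def mfs(lemma):
    """Always answer the training-set most frequent sense of the lemma."""
    return counts[lemma].most_common(1)[0][0]

print(mfs("mouse"))
```

That such a context-blind rule sits at 64-65 F1 on these datasets is what makes it a stubborn baseline for knowledge-based systems.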

Evaluation: Analysis. Bias towards the Most Frequent Sense (MFS): all IMS-based systems answer with the MFS over 75% of the time; Context2Vec is slightly less affected (73.1% on average). The MFS bias is also present in graph-based systems, confirming the findings of previous studies (Calvo and Gelbukh, 2015; Postma et al., 2016).

Evaluation: Analysis. Low overall performance on verbs: all systems score below 58%. Verbs are extremely fine-grained in WordNet: 10.4 senses per verb on average across all datasets (vs. 4.8 for nouns, and lower for adjectives and adverbs). For example, the verb "keep" has 22 meanings in WordNet, 6 of them denoting possession.

Conclusion. We presented a unified evaluation framework for all-words Word Sense Disambiguation, including standardized training and testing data. This makes it easier for researchers to evaluate their systems and ensures a fair comparison. Two potential research directions based on semi-supervised learning: exploiting large amounts of unlabeled corpora for learning accurate word embeddings or training neural language models; and (semi-)automatic construction of high-quality sense-annotated corpora. http://lcl.uniroma1.it/wsdeval

Thank you! All the data available at http://lcl.uniroma1.it/wsdeval