CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 26 Unsupervised EM based WSD)


Based on: Mitesh Khapra, Salil Joshi and Pushpak Bhattacharyya, "It Takes Two to Tango: A Bilingual Unsupervised Approach for Estimating Sense Distributions using Expectation Maximization", 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, November 2011.
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
12th March, 2012

Some Quick Definitions
Synset (Synonymy Set): a synset represents a concept and contains a set of words, each of which is synonymous with the other words in the set.
Word Sense Disambiguation (WSD): identify the correct sense/synset of a word, e.g., "river bank" vs. "financial bank".

WSD: Cost-Accuracy Trade-off

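Example of sense marking: its need
एक_4187 नए शोध_1138 के अनुसार_3123 जिन लोगों_1189 का सामाजिक_43540 जीवन_125623 व्यस्त_48029 होता है उनके दिमाग_16168 के एक_4187 हिस्से_120425 में अधिक_42403 जगह_113368 होती है।
(According to a new research, those people who have a busy social life have larger space in a part of their brain.)
नेचर न्यूरोसाइंस में छपे एक_4187 शोध_1138 के अनुसार_3123 कई_4118 लोगों_1189 के दिमाग_16168 के स्कैन से पता_11431 चला कि दिमाग_16168 का एक_4187 हिस्सा_120425 एमिगडाला सामाजिक_43540 व्यस्तताओं_1438 के साथ_328602 सामंजस्य_166 के लिए थोड़ा_38861 बढ़_25368 जाता है। यह शोध_1138 58 लोगों_1189 पर किया गया जिसमें उनकी उम्र_13159 और दिमाग_16168 के साइज़ के आंकड़े_128065 लिए गए। अमरीकी_413405 टीम_14077 ने पाया_227806 कि जिन लोगों_1189 की सोशल नेटवर्किंग अधिक_42403 है उनके दिमाग_16168 का एमिगडाला वाला हिस्सा_120425 बाकी_130137 लोगों_1189 की तुलना_में_38220 अधिक_42403 बड़ा_426602 है। दिमाग_16168 का एमिगडाला वाला हिस्सा_120425 भावनाओं_1912 और मानसिक_42151 स्थिति_1652 से जुड़ा हुआ माना_212436 जाता है।
(According to research published in Nature Neuroscience, brain scans of several people showed that a part of the brain, the amygdala, grows slightly to cope with a busy social life. The study covered 58 people, recording their age and brain size. The American team found that in people who do more social networking, the amygdala region of the brain is larger than in other people. The amygdala region is believed to be linked to emotions and mental state.)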

Scenario in India
Tourism, Health, Sports, Finance, Politics, etc.

Impractical to collect data in multiple languages

Alternatives (1/2)
Use unsupervised and knowledge-based approaches (e.g., McCarthy et al., 2004; Mihalcea, 2005; Agirre & Soroa, 2009).
Disambiguation by translation:
  Needs parallel corpora, which is an unreasonable demand (e.g., Gale, Church & Yarowsky, 1992; Diab and Resnik, 2002; Ng, Wang and Chan, 2003).
  Approaches that use non-parallel corpora give very poor accuracies (e.g., Kaji and Morimoto, 2002; Li and Li, 2004).

Alternatives (2/2)
Recent work on parameter projection (Khapra et al., 2009, 2011) leverages an annotated corpus available in one resource-rich language.
What if such a resource-rich language is not available?


Focus of this work
Can two languages mutually benefit from each other's in-domain untagged (non-parallel) data?
The performance will not be as high as that of supervised approaches, but:
  Can it be better than state-of-the-art knowledge-based and unsupervised bilingual approaches?
  Can it come close to the wordnet first sense baseline (a supervised baseline)?

Intuition
Counts of translations in the corpus of another language from the same domain provide clues about sense distributions.
For example, the Marathi word maan has two senses with different Hindi translations:

Sense  Meaning   Hindi translation
S1     neck      gardan, galaa
S2     prestige  aadar, izzat

In the Health domain, S1 would be more prevalent, and hence the translations gardan and galaa would be more prevalent in a Hindi Health corpus.
Sense distributions can be estimated using the counts of these translations.
The counts are then refined using an iterative algorithm (EM).
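The intuition above can be sketched in a few lines of Python. This is only an illustration of the first, count-based estimate (before any EM refinement); the function name `estimate_sense_distribution` and the corpus counts are invented for the example and do not come from the lecture.

```python
from collections import Counter

# Senses of the Marathi word "maan" and their Hindi translations (from the table above).
SENSE_TRANSLATIONS = {
    "S1 (neck)": ["gardan", "galaa"],
    "S2 (prestige)": ["aadar", "izzat"],
}

def estimate_sense_distribution(sense_translations, target_counts):
    """Normalize the summed translation counts of each sense into a distribution."""
    totals = {
        sense: sum(target_counts.get(word, 0) for word in words)
        for sense, words in sense_translations.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No translation was observed: fall back to a uniform distribution.
        return {sense: 1.0 / len(totals) for sense in totals}
    return {sense: count / grand_total for sense, count in totals.items()}

# Invented counts standing in for a Hindi Health corpus (assumption, not real data).
hindi_health_counts = Counter({"gardan": 60, "galaa": 30, "aadar": 6, "izzat": 4})

distribution = estimate_sense_distribution(SENSE_TRANSLATIONS, hindi_health_counts)
print(distribution)
```

With these made-up counts the neck sense receives most of the probability mass, which is the behaviour the slide predicts for a Health corpus. The estimate ignores the fact that the Hindi words may themselves be ambiguous; that is what the iterative refinement is meant to address.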

Background

Parameter projection (Khapra et al., 2009)

Synset-Based Multilingual Dictionary
[Figure: a sample entry from the MultiDict, linking the Hindi and Marathi synsets S1 to S7]
Expansion approach for creating wordnets [Mohanty et al., 2008]:
  Instead of creating a wordnet from scratch, link to the synsets of an existing wordnet.
  Relations get borrowed from the existing wordnet.

Cross Linkages Between Synset Members
Cross linkages capture the native speaker's intuition: wherever the word ladkaa appears in Hindi, one would expect to see the word mulgaa in Marathi.
For this work we do not use these manual cross linkages, as they have a cost associated with them.
Instead, we assume that every word in the Hindi synset is a translation of every word in the corresponding Marathi synset.
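A small sketch of what this assumption means for a dictionary lookup. The data layout, the synset ID `synset_boy`, and the helper `crosslinks` are hypothetical; only the pair ladkaa/mulgaa comes from the slide, and the other synset members are illustrative.

```python
# Hypothetical MultiDict-style entry: one synset ID shared by both languages.
MULTIDICT = {
    "synset_boy": {  # invented ID for illustration
        "hindi": ["ladkaa", "baalak", "bachchaa"],   # members other than ladkaa are illustrative
        "marathi": ["mulgaa", "porgaa", "por"],      # members other than mulgaa are illustrative
    },
}

def crosslinks(word, synset_id, source_lang, target_lang, multidict=MULTIDICT):
    """Return the assumed translations of `word` in the given sense.

    Without manual cross linkages, every member of the target-language
    synset is treated as a translation of every source-language member.
    """
    entry = multidict[synset_id]
    if word not in entry[source_lang]:
        return []
    return list(entry[target_lang])

print(crosslinks("ladkaa", "synset_boy", "hindi", "marathi"))
```

The trade-off is that no annotation effort is needed, but less natural pairings are given the same status as the pairing a native speaker would choose.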

Approach

Estimating Sense Distributions
If a sense-tagged Marathi corpus were available, we could have estimated the sense distributions directly from it (equation not preserved in this transcript).
But such a corpus is not available.
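For reference, the usual relative-frequency estimate from a sense-tagged corpus would look as follows. The notation is an assumption made here, not taken from the slide: u is a Marathi word, S_1, ..., S_k are its candidate senses, and #(S_i, u) is the number of times u is tagged with sense S_i.

\[
P(S_i \mid u) \;=\; \frac{\#(S_i, u)}{\sum_{j=1}^{k} \#(S_j, u)}
\]

Without sense tags the counts #(S_i, u) cannot be observed, which is why the approach has to approximate them from translation counts in the other language.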

Framework: Figure 1 and Figure 2

E-M steps

Points to note
Symmetric formulation: the E and M steps are identical except for the change in language.
Either can be treated as the E-step, making the other the M-step.
A back-and-forth traversal over translation correspondences in the two languages.
Does not require a parallel corpus; only an in-domain corpus is needed.
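The symmetric, back-and-forth structure can be sketched as below. This is one plausible reading of the points above, not the exact update from the paper: each sense of a word is scored by the counts of its translations in the other language, weighted by how likely those translations currently are to carry that sense. All names (`update`, `run_em`) and the toy data, including the extra sense S3 and the word ghasaa, are assumptions for illustration.

```python
def uniform_init(senses):
    """Start with P(sense | word) uniform over each word's candidate senses."""
    return {w: {s: 1.0 / len(cands) for s in cands} for w, cands in senses.items()}

def update(senses_a, prob_a, members_b, prob_b, counts_b):
    """Re-estimate P(sense | word) in language A from language B's current estimates."""
    new_prob = {}
    for word, candidates in senses_a.items():
        # Score of a sense: sum over its translations v of P_B(sense | v) * count_B(v).
        scores = {
            sense: sum(
                prob_b.get(v, {}).get(sense, 0.0) * counts_b.get(v, 0)
                for v in members_b.get(sense, [])
            )
            for sense in candidates
        }
        total = sum(scores.values())
        if total > 0:
            new_prob[word] = {s: scores[s] / total for s in candidates}
        else:
            new_prob[word] = dict(prob_a[word])  # no evidence: keep the old values
    return new_prob

def run_em(senses, members, counts, first, second, iterations=20):
    """Alternate the same update between the two languages."""
    prob = {lang: uniform_init(senses[lang]) for lang in (first, second)}
    for _ in range(iterations):
        prob[first] = update(senses[first], prob[first],
                             members[second], prob[second], counts[second])
        prob[second] = update(senses[second], prob[second],
                              members[first], prob[first], counts[first])
    return prob

# Toy data (invented). Sense IDs are shared across languages, as in a synset-aligned dictionary.
senses = {
    "marathi": {"maan": ["S1", "S2"], "ghasaa": ["S3"]},
    "hindi": {"gardan": ["S1"], "galaa": ["S1", "S3"], "aadar": ["S2"], "izzat": ["S2"]},
}
members = {  # members[lang][sense] = words of that synset in that language
    "marathi": {"S1": ["maan"], "S2": ["maan"], "S3": ["ghasaa"]},
    "hindi": {"S1": ["gardan", "galaa"], "S2": ["aadar", "izzat"], "S3": ["galaa"]},
}
counts = {
    "marathi": {"maan": 50, "ghasaa": 20},
    "hindi": {"gardan": 60, "galaa": 30, "aadar": 6, "izzat": 4},
}

result = run_em(senses, members, counts, first="marathi", second="hindi")
print(result["marathi"]["maan"])
print(result["hindi"]["galaa"])
```

Because `update` is the same function in both directions, swapping `first` and `second` only changes which language is updated first, mirroring the remark that either step can be called the E-step. Only word counts from two untagged in-domain corpora and the aligned dictionary are consumed; no sentence alignment is used anywhere.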

In general

Experiments

Experimental Setup
Languages: Hindi, Marathi
Domains: Tourism and Health (largest domain-specific sense-tagged corpus)

Algorithms Being Compared
EM (our approach)
Personalized PageRank (Agirre and Soroa, 2009)
State-of-the-art bilingual approach using mutual information (Kaji and Morimoto, 2002)
Random baseline
Wordnet first sense baseline (a supervised baseline)

Results
EM performs better than other state-of-the-art knowledge-based and unsupervised approaches.
It does not beat the wordnet first sense (WFS) baseline, which is a supervised baseline.

Error Analysis
Non-progressive estimation:
Some words have the same translations in the target language across senses, e.g., saagar (Hindi) and samudra (Marathi), both meaning "large water body" as well as "limitless".
Such words form a closed loop of translations.
In such cases the algorithm does not progress and stays stuck at the initial values.
The same happens for some language-specific words for which corresponding synsets were not available in the other language.
Such words accounted for 17-19% of the total words in the test corpus.
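A simple check for such words can be sketched as follows. It is a simplification under stated assumptions: a word is flagged when all of its senses map to the same set of translations (including the empty set, which covers synsets missing in the other language). The function name `is_non_progressive` and the sense IDs are invented for the example.

```python
def is_non_progressive(word, senses, members_other):
    """Flag polysemous words whose senses all share one set of translations.

    senses: word -> list of candidate sense IDs
    members_other: sense ID -> translations in the other language
    """
    candidate_senses = senses[word]
    translation_sets = {frozenset(members_other.get(s, ())) for s in candidate_senses}
    return len(candidate_senses) > 1 and len(translation_sets) == 1

# Hindi "saagar": both senses translate to Marathi "samudra" (example from the slide).
hindi_senses = {"saagar": ["S_water_body", "S_limitless"]}
marathi_members = {"S_water_body": ["samudra"], "S_limitless": ["samudra"]}

# Marathi "maan": the two senses have different Hindi translations.
marathi_senses = {"maan": ["S1", "S2"]}
hindi_members = {"S1": ["gardan", "galaa"], "S2": ["aadar", "izzat"]}

print(is_non_progressive("saagar", hindi_senses, marathi_members))
print(is_non_progressive("maan", marathi_senses, hindi_members))
```

For a flagged word, translation counts cannot separate the senses, so an update driven only by those counts has nothing to move the initial distribution with.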

Results on words that do not have the problem of non-progressive estimation
Results are now closer to the wordnet first sense (WFS) baseline.
For 2 out of the 4 language-domain pairs the results are slightly better than WFS, which is remarkable for an unsupervised approach.

Further Error Analysis (1/2)
MultiDict-related issues:
  The Hindi word sankraman (infection) translates to sansarg (infection) in Marathi.
  However, sansarg was absent from the corresponding Marathi synset (an incomplete Marathi synset).
Poor performance on verbs:
  Verbs are highly polysemous, a common bane for all WSD algorithms.
  They do not form a closed loop of translations but share many translations across senses, e.g., the Hindi word karna (do) has the same Marathi translation in 8 out of its 21 senses.
  Thus translations do not play a discriminatory role.

Further Error Analysis (2/2)
Influence of synonyms in a rare sense:
  The Hindi word jab has two senses, viz., "when" (S1) and "if" (S2).
  It is rarely used in the sense S2 (if).
  However, its synonyms yadi (if) and agar (if) are frequently used in this sense (S2).
  The same is observed in the Marathi corpus, where the translations of yadi and agar in S2 are very frequent.
  As a result, these translations strongly bias the probability towards the "if" sense of jab.

Conclusions
An unsupervised bilingual approach for estimating sense distributions using EM.
Performs a back-and-forth traversal over translation correspondences.
Performs better than current state-of-the-art knowledge-based and unsupervised approaches.
When restricted to words not facing the problem of non-progressive estimation, the performance was better than WFS for 2 out of 4 language-domain pairs.
An effective way of utilizing untagged corpora in two languages.

Future work
Can the problem of non-progressive estimation be solved using more than two languages?