CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 26 Unsupervised EM based WSD)

Based on: Mitesh Khapra, Salil Joshi and Pushpak Bhattacharyya, "It takes two to Tango: A Bilingual Unsupervised Approach for Estimating Sense Distributions using Expectation Maximization", 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, November 2011.
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
12th March, 2012

Some Quick Definitions
Synset (Synonymy Set)
  A synset represents a concept and contains a set of words, each of which is synonymous with the other words in the set
Word Sense Disambiguation (WSD)
  Identify the correct sense/synset of a word, e.g., river bank vs. financial bank

WSD: Cost-Accuracy Trade-off

Example of sense marking: its need
एक_4187 नए शोध_1138 के अनुसार_3123 जिन लोगों_1189 का सामाजिक_43540 जीवन_125623 व्यस्त_48029 होता है उनके दिमाग_16168 के एक_4187 हिस्से_120425 में अधिक_42403 जगह_113368 होती है। (According to a new research, those people who have a busy social life have larger space in a part of their brain.)
नेचर न्यूरोसाइंस में छपे एक_4187 शोध_1138 के अनुसार_3123 कई_4118 लोगों_1189 के दिमाग_16168 के स्कैन से पता_11431 चला कि दिमाग_16168 का एक_4187 हिस्सा_120425 एमिगडाला सामाजिक_43540 व्यस्तताओं_1438 के साथ_328602 सामंजस्य_166 के लिए थोड़ा_38861 बढ़_25368 जाता है। यह शोध_1138 58 लोगों_1189 पर किया गया जिसमें उनकी उम्र_13159 और दिमाग_16168 की साइज़ के आँकड़े_128065 लिए गए। अमरीकी_413405 टीम_14077 ने पाया_227806 कि जिन लोगों_1189 की सोशल नेटवर्किंग अधिक_42403 है उनके दिमाग_16168 का एमिगडाला वाला हिस्सा_120425 बाकी_130137 लोगों_1189 की तुलना_में_38220 अधिक_42403 बड़ा_426602 है। दिमाग_16168 का एमिगडाला वाला हिस्सा_120425 भावनाओं_1912 और मानसिक_42151 स्थिति_1652 से जुड़ा हुआ माना_212436 जाता है।

Scenario in India
Domains: Tourism, Health, Sports, Finance, Politics, etc.

Impractical to collect data in multiple languages

Alternatives (1/2)
Use unsupervised and knowledge-based approaches (e.g., McCarthy et al., 2004; Mihalcea, 2005; Agirre & Soroa, 2009)
Disambiguation by translation
  Needs parallel corpora, which is an unreasonable demand (e.g., Gale, Church & Yarowsky, 1992; Diab and Resnik, 2002; Ng, Wang and Chan, 2003)
  Approaches that use non-parallel corpora give very poor accuracies (e.g., Kaji and Morimoto, 2002; Li and Li, 2004)

Alternatives (2/2)
Recent work on parameter projection (Khapra et al., 2009, 2011)
  Leverages an annotated corpus available in one resource-rich language
  What if such a resource-rich language is not available?


Focus of this work
Can two languages mutually benefit from each other's in-domain untagged (non-parallel) data?
The performance will not be as high as that of supervised approaches, but:
  Can it be better than state-of-the-art knowledge-based and unsupervised bilingual approaches?
  Can the performance come close to the wordnet first sense baseline (a supervised baseline)?

Intuition
Counts of translations in the corpus of another language from the same domain provide clues about sense distributions
For example, the Marathi word maan has two senses with different Hindi translations:

Sense  Meaning   Hindi translations
S1     neck      gardan, galaa
S2     prestige  aadar, izzat

In the Health domain, S1 would be more prevalent, and hence the translations gardan and galaa would be more prevalent in a Hindi Health corpus
Sense distributions can be estimated using the counts of these translations
Refine the counts using an iterative algorithm (EM)
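The intuition can be sketched in a few lines of Python. This is only an illustration of the counting idea before any EM refinement; the dictionary layout, the function name `sense_distribution` and the corpus counts are all made-up assumptions, not data from the lecture.

```python
from collections import Counter

# Hypothetical dictionary fragment: Marathi word -> sense -> Hindi translations.
TRANSLATIONS = {
    "maan": {
        "S1 (neck)": ["gardan", "galaa"],
        "S2 (prestige)": ["aadar", "izzat"],
    }
}

def sense_distribution(word, target_counts):
    """Estimate P(sense | word) from counts of each sense's translations."""
    totals = {
        sense: sum(target_counts[t] for t in translations)
        for sense, translations in TRANSLATIONS[word].items()
    }
    norm = sum(totals.values())
    if norm == 0:
        # No translation was observed: fall back to a uniform distribution.
        return {sense: 1 / len(totals) for sense in totals}
    return {sense: count / norm for sense, count in totals.items()}

# Made-up word counts standing in for an untagged Hindi Health corpus.
hindi_health_counts = Counter({"gardan": 60, "galaa": 30, "aadar": 6, "izzat": 4})
dist = sense_distribution("maan", hindi_health_counts)
print(dist)
```

With these invented counts, most of the probability mass goes to the neck sense, which is the behaviour the slide describes for the Health domain.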

Background

Parameter projection (Khapra et al., 2009)

Synset Based Multilingual Dictionary
[Figure: Hindi and Marathi synsets S1 to S7 linked to each other; a sample entry from the MultiDict]
Expansion approach for creating wordnets [Mohanty et al., 2008]
  Instead of creating a wordnet from scratch, link to the synsets of an existing wordnet
  Relations get borrowed from the existing wordnet

Cross Linkages Between Synset Members
Cross linkages capture native speakers' intuition
  Wherever the word ladkaa appears in Hindi, one would expect to see the word mulgaa in Marathi
For this work we do not use these manual cross linkages, as they have a cost associated with them
Instead, we assume that every word in the Hindi synset is a translation of a word in the corresponding Marathi synset
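A minimal sketch of this assumption, using a hypothetical in-memory layout for the synset-aligned dictionary. The synset ids, the `MULTIDICT` structure and the function name `crosslinks` are illustrative and are not the actual MultiDict format.

```python
# Hypothetical synset-aligned dictionary: one entry per concept, listing the
# member words of the linked synset in each language.
MULTIDICT = {
    "boy": {"hindi": ["ladkaa"], "marathi": ["mulgaa"]},
    "neck": {"hindi": ["gardan", "galaa"], "marathi": ["maan"]},
    "prestige": {"hindi": ["aadar", "izzat"], "marathi": ["maan"]},
}

def crosslinks(word, synset_id, src, tgt):
    """Words in `tgt` assumed to translate `word` in the given synset.

    Without manual cross linkages, every member of the linked target-language
    synset is treated as a possible translation.
    """
    entry = MULTIDICT[synset_id]
    if word not in entry[src]:
        return []
    return list(entry[tgt])

print(crosslinks("maan", "neck", "marathi", "hindi"))
print(crosslinks("ladkaa", "boy", "hindi", "marathi"))
```

The trade-off is that no manual linking is needed, but some assumed translation pairs will be looser than the ones a native speaker would pick.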

Approach 16

ESTIMATING SENSE DISTRIBUTIONS
If a sense-tagged Marathi corpus were available, we could have estimated the sense distributions directly from it
But such a corpus is not available
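For reference, the usual relative-frequency estimate from a sense-tagged corpus is written out below. The notation is an assumption made for illustration: \#(S_i, u) is the number of times word u is tagged with sense S_i, and the sum runs over all senses of u.

```latex
P(S_i \mid u) = \frac{\#(S_i, u)}{\sum_{j} \#(S_j, u)}
% Example for the Marathi word maan with senses S_1 (neck) and S_2 (prestige):
% P(S_1 \mid maan) = \frac{\#(S_1, maan)}{\#(S_1, maan) + \#(S_2, maan)}
```

Without a sense-tagged corpus the counts \#(S_i, u) are unobserved, so they have to be approximated some other way.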

Framework: Figure 1 and Figure 2

E-M steps
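One way to write the two steps, consistent with the description in these slides, is given below. The notation is an assumption: u is a word of language L_1, v and y are words of language L_2, \pi_{L_2}(S) is the L_2 synset linked to synset S, crosslinks_{L_2}(u, S) is the set of L_2 translations of u in sense S, and \#(v) is the count of v in the untagged L_2 corpus.

```latex
% E-step: re-estimate the sense distribution of an L_1 word u
P(S^{L_1} \mid u) =
  \frac{\sum_{v \in crosslinks_{L_2}(u, S^{L_1})} P(\pi_{L_2}(S^{L_1}) \mid v) \cdot \#(v)}
       {\sum_{S_i^{L_1}} \sum_{y \in crosslinks_{L_2}(u, S_i^{L_1})} P(\pi_{L_2}(S_i^{L_1}) \mid y) \cdot \#(y)}

% M-step: the same update with the roles of L_1 and L_2 exchanged,
% where v is now an L_2 word and a, b are its L_1 translations
P(S^{L_2} \mid v) =
  \frac{\sum_{a \in crosslinks_{L_1}(v, S^{L_2})} P(\pi_{L_1}(S^{L_2}) \mid a) \cdot \#(a)}
       {\sum_{S_i^{L_2}} \sum_{b \in crosslinks_{L_1}(v, S_i^{L_2})} P(\pi_{L_1}(S_i^{L_2}) \mid b) \cdot \#(b)}
```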

Points to note
Symmetric formulation
  The E and M steps are identical except for the change in language
  Either can be treated as the E-step, making the other the M-step
A back-and-forth traversal over translation correspondences in the two languages
Does not require a parallel corpus; only in-domain corpora are needed
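The symmetric back-and-forth traversal can be sketched as follows. This is a toy sketch under stated assumptions, not the authors' implementation: the `MULTIDICT` layout, the function names, the uniform initialisation and the corpus counts are all hypothetical.

```python
from collections import Counter

# Hypothetical synset-aligned dictionary (illustrative entries only).
MULTIDICT = {
    "neck": {"marathi": ["maan"], "hindi": ["gardan", "galaa"]},
    "prestige": {"marathi": ["maan"], "hindi": ["aadar", "izzat"]},
}

def vocabulary(lang):
    return {w for entry in MULTIDICT.values() for w in entry[lang]}

def synsets_of(word, lang):
    return [sid for sid, entry in MULTIDICT.items() if word in entry[lang]]

def uniform(lang):
    """Initial estimate: every sense of a word is equally likely."""
    dist = {}
    for w in vocabulary(lang):
        senses = synsets_of(w, lang)
        dist[w] = {s: 1.0 / len(senses) for s in senses}
    return dist

def half_step(src, tgt, p_tgt, counts_tgt):
    """Re-estimate P(sense | word) for all `src` words from the current
    estimates and corpus counts of their `tgt` translations."""
    p_src = {}
    for u in vocabulary(src):
        scores = {
            s: sum(p_tgt[v][s] * counts_tgt[v] for v in MULTIDICT[s][tgt])
            for s in synsets_of(u, src)
        }
        norm = sum(scores.values())
        if norm > 0:
            p_src[u] = {s: x / norm for s, x in scores.items()}
        else:
            # No evidence from the other language: stay at the uniform values.
            p_src[u] = {s: 1.0 / len(scores) for s in scores}
    return p_src

def run_em(counts, l1="marathi", l2="hindi", iterations=5):
    p = {l1: uniform(l1), l2: uniform(l2)}
    for _ in range(iterations):
        p[l1] = half_step(l1, l2, p[l2], counts[l2])  # "E-step"
        p[l2] = half_step(l2, l1, p[l1], counts[l1])  # "M-step", same code
    return p

# Made-up counts standing in for untagged in-domain (Health) corpora.
COUNTS = {
    "hindi": Counter({"gardan": 60, "galaa": 30, "aadar": 6, "izzat": 4}),
    "marathi": Counter({"maan": 50}),
}
estimates = run_em(COUNTS)
print(estimates["marathi"]["maan"])
```

Both steps call the same `half_step` function with the languages swapped, which mirrors the symmetry noted above. In this toy example the Hindi words are monosemous, so the estimate for maan settles after the first pass.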

In general...

Experiments

Experimental Setup
Languages: Hindi, Marathi
Domains: Tourism and Health (largest domain-specific sense-tagged corpus)

Algorithms Being Compared
EM (our approach)
Personalized PageRank (Agirre and Soroa, 2009)
State-of-the-art bilingual approach using Mutual Information (Kaji and Morimoto, 2002)
Random baseline
Wordnet first sense baseline (a supervised baseline)

Results
EM performs better than other state-of-the-art knowledge-based and unsupervised approaches
It does not beat the Wordnet First Sense (WFS) baseline, which is a supervised baseline

Error Analysis
Non-progressive estimation
  Some words have the same translations in the target language across senses, e.g., saagar (Hindi) and samudra (Marathi), both meaning "large water body" as well as "limitless"
  Such words thus form a closed loop of translations
  In such cases the algorithm does not progress and gets stuck with the initial values
  The same holds for some language-specific words for which corresponding synsets were not available in the other language
  Such words accounted for 17-19% of the total words in the test corpus
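A simple heuristic for flagging such words is sketched below. It assumes a hypothetical synset-aligned dictionary layout, and the synset ids and the function name `is_non_progressive` are illustrative. A polysemous word is flagged when every one of its senses has exactly the same set of translations, which includes the case where no translation is available at all.

```python
# Hypothetical synset-aligned dictionary (illustrative entries only).
MULTIDICT = {
    "large_water_body": {"hindi": ["saagar"], "marathi": ["samudra"]},
    "limitless": {"hindi": ["saagar"], "marathi": ["samudra"]},
    "neck": {"hindi": ["gardan", "galaa"], "marathi": ["maan"]},
    "prestige": {"hindi": ["aadar", "izzat"], "marathi": ["maan"]},
}

def synsets_of(word, lang):
    return [sid for sid, entry in MULTIDICT.items() if word in entry[lang]]

def is_non_progressive(word, src, tgt):
    """True if translation counts cannot separate the senses of `word`."""
    senses = synsets_of(word, src)
    if len(senses) < 2:
        return False  # A monosemous word needs no disambiguation.
    translation_sets = [frozenset(MULTIDICT[s][tgt]) for s in senses]
    # Identical translation sets give every sense the same evidence, so the
    # estimate cannot move away from its initial values.
    return all(ts == translation_sets[0] for ts in translation_sets)

print(is_non_progressive("saagar", "hindi", "marathi"))
print(is_non_progressive("maan", "marathi", "hindi"))
```

A check like this could be used to separate out the words on which the estimates cannot progress when evaluating the algorithm.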

Results restricted to words that do not have the problem of non-progressive estimation
Results are now closer to the Wordnet First Sense (WFS) baseline
For 2 out of the 4 language-domain pairs the results are slightly better than WFS, which is remarkable for an unsupervised approach

Further Error Analysis (1/2)
MultiDict-related issues
  The Hindi word sankraman (infection) translates to sansarg (infection) in Marathi
  However, sansarg was absent from the corresponding Marathi synset (an incomplete Marathi synset)
Poor performance on verbs
  Verbs are highly polysemous, a common bane for all WSD algorithms
  They do not form a closed loop of translations but share many translations across senses
  e.g., the Hindi word karna (do) has the same Marathi translation in 8 out of its 21 senses
  Thus translations do not play a discriminatory role

Further Error Analysis (2/2)
Influence of synonyms in a rare sense
  The Hindi word jab has two senses, viz., when (S1) and if (S2)
  It is rarely used in sense S2 (if)
  However, its synonyms yadi (if) and agar (if) are frequently used in this sense (S2)
  The same is observed in the Marathi corpus, where the translations of yadi and agar in S2 are very frequent
  As a result, these translations strongly bias the probability towards the "if" sense of jab

Conclusions
An unsupervised bilingual approach for estimating sense distributions using EM
Performs a back-and-forth traversal over translation correspondences
Performs better than current state-of-the-art knowledge-based and unsupervised approaches
When restricted to words not facing the problem of non-progressive estimation, the performance was better than WFS for 2 out of 4 language-domain pairs
An effective way of utilizing untagged corpora in two languages

Future work
Can the problem of non-progressive estimation be solved using more than two languages?