Word Sense Disambiguation in Natural Language Processing

Similar documents
DCA प रय जन क य म ग नद शक द र श नद श लय मह म ग ध अ तरर य ह द व व व लय प ट ह द व व व लय, ग ध ह स, वध (मह र ) DCA-09 Project Work Handbook

HinMA: Distributed Morphology based Hindi Morphological Analyzer

S. RAZA GIRLS HIGH SCHOOL

क त क ई-व द य लय पत र क 2016 KENDRIYA VIDYALAYA ADILABAD

Question (1) Question (2) RAT : SEW : : NOW :? (A) OPY (B) SOW (C) OSZ (D) SUY. Correct Option : C Explanation : Question (3)


CROSS LANGUAGE INFORMATION RETRIEVAL: IN INDIAN LANGUAGE PERSPECTIVE

The Prague Bulletin of Mathematical Linguistics NUMBER 95 APRIL

Word Sense Disambiguation

ह द स ख! Hindi Sikho!

Detection of Multiword Expressions for Hindi Language using Word Embeddings and WordNet-based Features

Leveraging Sentiment to Compute Word Similarity

ENGLISH Month August

A Bayesian Learning Approach to Concept-Based Document Classification

Robust Sense-Based Sentiment Classification

Parsing of part-of-speech tagged Assamese Texts

SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)

A Comparison of Two Text Representations for Sentiment Analysis

Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data

ScienceDirect. Malayalam question answering system

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY

A Case Study: News Classification Based on Term Frequency

AQUA: An Ontology-Driven Question Answering System

F.No.29-3/2016-NVS(Acad.) Dated: Sub:- Organisation of Cluster/Regional/National Sports & Games Meet and Exhibition reg.

11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation

Ensemble Technique Utilization for Indonesian Dependency Parser

Vocabulary Usage and Intelligibility in Learner Language

Assessing System Agreement and Instance Difficulty in the Lexical Sample Tasks of SENSEVAL-2

Named Entity Recognition: A Survey for the Indian Languages

Linking Task: Identifying authors and book titles in verbose queries

Cross Language Information Retrieval

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases

Chunk Parsing for Base Noun Phrases using Regular Expressions. Let s first let the variable s0 be the sentence tree of the first sentence.

A Comparative Evaluation of Word Sense Disambiguation Algorithms for German

2.1 The Theory of Semantic Fields

Indian Institute of Technology, Kanpur

Combining a Chinese Thesaurus with a Chinese Dictionary

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Twitter Sentiment Classification on Sanders Data using Hybrid Approach

! # %& ( ) ( + ) ( &, % &. / 0!!1 2/.&, 3 ( & 2/ &,

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

व रण क ए आ दन-पत र. Prospectus Cum Application Form. न दय व kऱय सम त. Navodaya Vidyalaya Samiti ਨਵ ਦ ਆ ਦਵਦ ਆਦ ਆ ਸਦ ਤ. Navodaya Vidyalaya Samiti

Short Text Understanding Through Lexical-Semantic Analysis

Multilingual Sentiment and Subjectivity Analysis

Using Semantic Relations to Refine Coreference Decisions

Word Segmentation of Off-line Handwritten Documents

University of Alberta. Large-Scale Semi-Supervised Learning for Natural Language Processing. Shane Bergsma

The stages of event extraction

Rule Learning With Negation: Issues Regarding Effectiveness

A process by any other name

Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities

THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING

The Smart/Empire TIPSTER IR System

DKPro WSD A Generalized UIMA-based Framework for Word Sense Disambiguation

Linguistic Variation across Sports Category of Press Reportage from British Newspapers: a Diachronic Multidimensional Analysis

Improving the Quality of MT Output using Novel Name Entity Translation Scheme

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features

On document relevance and lexical cohesion between query terms

A Minimalist Approach to Code-Switching. In the field of linguistics, the topic of bilingualism is a broad one. There are many

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

Beyond the Pipeline: Discrete Optimization in NLP

Memory-based grammatical error correction

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

The Internet as a Normative Corpus: Grammar Checking with a Search Engine

EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

Applications of memory-based natural language processing

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics

Teaching Vocabulary Summary. Erin Cathey. Middle Tennessee State University

Part III: Semantics. Notes on Natural Language Processing. Chia-Ping Chen

The Discourse Anaphoric Properties of Connectives

A heuristic framework for pivot-based bilingual dictionary induction

Matching Similarity for Keyword-Based Clustering

A Domain Ontology Development Environment Using a MRD and Text Corpus

CS 598 Natural Language Processing

BYLINE [Heng Ji, Computer Science Department, New York University,

Interactive Corpus Annotation of Anaphor Using NLP Algorithms

Intension, Attitude, and Tense Annotation in a High-Fidelity Semantic Representation

Available online at ScienceDirect. Procedia Computer Science 54 (2015 )

DEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS

Universiteit Leiden ICT in Business

Rule Learning with Negation: Issues Regarding Effectiveness

LEXICAL COHESION ANALYSIS OF THE ARTICLE WHAT IS A GOOD RESEARCH PROJECT? BY BRIAN PALTRIDGE A JOURNAL ARTICLE

Annotation Projection for Discourse Connectives

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

The MEANING Multilingual Central Repository

Natural Language Processing. George Konidaris

Graph Alignment for Semi-Supervised Semantic Role Labeling

A Semantic Similarity Measure Based on Lexico-Syntactic Patterns

Analysis of Probabilistic Parsing in NLP

Reducing Features to Improve Bug Prediction

A Graph Based Authorship Identification Approach

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

SEMAFOR: Frame Argument Resolution with Log-Linear Models

CSL465/603 - Machine Learning

Accuracy (%) # features

BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS

Welcome to. ECML/PKDD 2004 Community meeting

Human Emotion Recognition From Speech

Transcription:

Word Sense Disambiguation in Natural Language Processing Preeti Dubey Department of Computer Science, Govt. College for Women, Parade Jammu, J&K- India Abstract: Word Sense Disambiguation has been a research area since the evolution of Natural Language processing. Most of the languages have ambiguous words. Resolution to these ambiguous words is a very importance task in developing any tool for natural language processing which otherwise hampers the efficiency of the developed systems. The accuracy of the output generated depends on the sense of the context a word is being used in a sentence. Word Sense Disambiguation is used for resolving the ambiguous words. This paper presents a brief introduction to Word Sense Disambiguation, an overview of the approaches that can be used solving ambiguity and work done for various Indian languages. Keywords: WSD, NLP, Ambiguity, Lexeme Introduction All human languages have ambiguous words, these are the words whose meaning differ with context to their reference. The computational identification of correct sense of a word in a context is known as Word Sense Disambiguation (WSD). It automatically figures out the intended meaning of an ambiguous word when used in a sentence. Few illustrations of ambiguous words are: English: Bark, light, fair, fan, right, fly, watch, bank, bat, lead, live Hindi: स न, मत, ग ल ब, आम Illustration of some ambiguous words in English and Hindi is shown below in table I: Word Sense Disambiguation is applied in various areas such as Machine translation (MT), Information retrieval (IR), speech recognition, computational advertising, text processing, etc. Without WSD, the processing of data is error prone. WSD plays a very important role in increasing the accuracy of the output produced by a system. It preserves the meaning of the word with reference to the context. Ambiguity A word, phrase, or sentence is said to be ambiguous, if it has more than one meaning. The various types of ambiguities are discussed below: Preetidubey2000@yahoo.com *Preeti dubey 114

Bark Fan मत ग ल ब ENGLISH The bark of the tree is bown I was awakened by the bark of the dog. The fan stopped worked. I am a die hard fan of Tom Cruise. HINDI मत ब ल (Don t Talk) मत क आध र पर फ सल ह आ The decision was based on votes. ग ल ब ज म न अ छ ह Gulab Jamun (Indian Sweet dish) is sweet. 115 ग ल ब ल ल ह The rose is red. Table I: Examples of Ambiguity Lexical ambiguity: Words with multiple senses can either be homonymous or polysemous. Two senses of a word are said to be homonyms when they mean entirely different things but have the same spelling. For example, the two senses of the word right may be uses as below: 1) He is a right-handed batsman 2) He is right. A word is said to be polysemous when its sense has various shades of the same basic meaning i.e. it has different but related meaning. For example, the word accident is polysemous since its two senses a mishap and anything that happens by chance are somewhat related to each other. Structural (Syntactic) Ambiguity: If the ambiguity is in a sentence or clause, it is called structural ambiguity. It occurs when a phrase or sentence has more than one underlying structure e.g We should be discussing violence on TV. We should be discussing violence on TV. Referential ambiguity Sentences in which pronouns refer to certain words, but it is often difficult to find out, to which word it is referring e.g. She dropped the plate on the table and broke it. [13] Here, the ambiguity is that the pronoun It refers to which noun, the table or the plate.

Such references are called anaphoric references or anaphora. The Multi-Word Constructs: Ambiguity arising due to multi-word constructs like Idioms and Phrasal verbs. For example, My neighbor s house was broken into last night. The actual meaning of the idiom can t be identified from its constituent words. Approaches for WSD The Word Sense Disambiguation approaches are classified into three main categories: 1) Supervised approach 2) Unsupervised approach 3) Knowledge based approach Supervised Approach The Supervised is based on a annotated data/ corpus. The supervised approach s based on trained sense annotated corpus to build classifiers. Initially, annotated corpus is required to build a classifier. The classifier is used to recognized the sense of the word based on their context of use. This approach assumes that the context can provide enough evidence on its own to disambiguate words. A major disadvantage of this approach is he requirement of a large sense-tagged corpora. The accuracy of the classifier depends on the size and the variety of the data incorporated in the corpus. Larger the corpus, better the results. The creation of corpus itself is a challenge in terms of time and money. Some important WSD algorithms based on supervised approach are: Naïve Bayes Method, Decision List Method, Decision Tree Method, Support Vector Machine (SVM) Method. Unsupervised Approach The Unsupervised approach is based on unannotated corpora. It is based on clustering of words. It assumes that sense of the ambiguous word can be induced from the neighboring words. Since this approach is based on unannotated data, it is very essential to have most of the senses of the word in the training corpus. Some algorithms based on this approach are: Context Group Discrimination, Co-occurrence graphs, WSD using parallel corpora. Knowledge Based Approach This approach based on using Knowledge bases for Word Sense Disambiguation. The knowledge bases that can be used are: Word Net, Machine readable dictionaries, thesauri etc. This 116

approach is suitable for domain specific word sense disambiguation. One of the famous algorithm used in this approach is the algorithm proposed by Michael Lesk in1986 and is named Lesk Algorithm. Literature Review Lesk Algorithm [5] developed by Micheal Lesk in 1986. It is used to identify senses of polysemy words. He used the overlap of word definition from the Oxford Advanced Learner s Dictionary of Current English (OALD) to disambiguate the word senses. Banerjee and Pedersen [7] adapted the original Lesk algorithm and used WordNet as the knowledge source. The overall accuracy of their system was evaluated to be 32%. They used Senseval-2 word sense disambiguation exercise to evaluate their system. Sinha M. et. al. [8] at IIT Bombay developed automatic WSD for Hindi language using Hindi WordNet.They used statistical method for determining the senses. The system could disambiguate the nouns only. They compared the context of the polysemy word in a sentence with the contexts constructed from the WordNet. They evaluated the system using the Hindi corpora provided by the Central Institute of Indian Languages (CIIL). The accuracy of their algorithm was found in the range from 40% to 70%. Shrestha N. et. al. [9] developed WSD module based on Lesk algorithm for the disambiguation of Nepali ambiguous words. They collected words to be used as the knowledge resource from synset, gloss, examples and hypernym. [] The number of example for each sense of the target word was only one in the database. Each word in the context words is compared with each word in the collection of words available in the formed database. Dhungana and Shakya [12] have also worked for the disambiguation of the polysemous words in Nepali language using the Lesk algorithm. The experiments performed on 348 words (including the different senses of 59 polysemy words and context words) with the test data containing 201 Nepali sentences shows the accuracy of their system to be 88.05%. Richard Singh and K. Ghosh [14] in 2013 proposed a architecture for Manipuri Language. The system performs WSD in two phases: training phase and testing phase. The Manipuri word sense disambiguation system is composed of the following steps: (i) pre-processing, (ii) feature selection and generation and (iii) training, (iv) testing and (v) performance evaluation. 117

Haroon, R.P. (2010) [10] has given the first attempt for an automatic WSD in Malayalam. The author used the knowledge based approach. Rakesh and Ravinder [16] have proposed a WSD algorithm for removing ambiguity from the Punjabi text document. The authors used the Modified Lesk Algorithm for WSD. Ayan Das and Sudeshna Sarkar [10] have developed WSD system based on unsupervised graph-based clustering approach for Bengali language, which is applied to the Bengali-Hindi machine translation.. Preeti Dubey (2014) [13] has applied the unannotated corpus based WSD algorithm for word sense disambiguation in the copyrighted Hindi-Dogri Machine Translation system developed by her under the guidance of Prof. Devanand and Prof. Shashi Pathania with assistance from Dr.Vishal Goyal. Ambiguity in the Hindi-Dogri Machine Translation System is resolved using the n gram approach for the morphemes स and और. Some use cases of the morpheme स are shown in Table II: As seen in Table II, the translations from Hindi to Dogri did using the Hindi-Dogri MTS show high accuracy rate for disambiguation of the morpheme स. Out of 9 sentences taken from online Hindi news only two resolutions shown in table III of स are incorrect. Sentence Incorrect Translation Correct Translation ब हर स आकर ब र क ल आइय ब र द आइय ब ब स स ब ब स क ल ब ब स क न Table III: shows the incorrect resolutions by WSD module in Hindi-Dogri MTS as well as the correct translations of स. Conclusion Word sense disambiguation (WSD) has been a very active area of research in computational linguistics. It plays an important role in the increased accuracy of the output generated by systems. One of the factors that has hampered WSD research is the unavailability of linguistic resources. The state of art is that most researchers are mostly using the knowledge based approach of disambiguation based on Lesk s algorithm. Due to lack of corpus, researchers are using WordNet for disambiguation. 118

It is also studied that the knowledge based approaches are efficient in resolving the sense of the ambiguous words, but there is a great need to develop lexical resources to be used for the efficiency of the natural language processing tools. References [1] Arindam Chatterjee, Salil Rajeev Joshi, Mitesh M. Khapra, Pushpak Bhattacharyya, Introduction to Tools for IndoWordNet and Word Sense Disambiguation http://www.cfilt.iitb.ac.in/wordnet/webhwn/indowordnetpapers/18_tools%20for%20indowordnet%20&%20wsd.pdf [2] C. Fellbaum, WordNet: An Electronic Lexical Database., MIT Press, 1998. [3] Lokesh Nandanwar, Kalyani Mamulkar, Supervised, Semi-Supervised and Unsupervised WSD Approaches: An Overview, International Journal of Science and Research (IJSR) ISSN (Online): 2319-7064 [4] Kuhoo Gupta, Manish Shrivastava, Smriti Singh and Pushpak Bhattacharyya, Morphological Richness Offsets Resource Poverty- an Experience in Building a POS Tagger for Hindi, COLING/ACL-2006, Sydney, Australia, July, 2006. https://pdfs.semanticscholar.org/1daa/370563deb1e9f5772d6ada1edfb4ec4cd7b4.pdf [5] M. Lesk. 1986. Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In Proceedings of the 5th annual international conference on Systems documentation, SIGDOC 86, pages 24 26, New York, NY, USA. ACM [6] Udaya Raj et.al, Word Sense Disambiguation using WSD specific Wordnet of polysemy Words https://arxiv.org/ftp/arxiv/papers/1409/1409.3512.pdf [7] Banerjee and Pederson An adapted Lesk algorithm for Word Sense Disambiguity using Word Net http://www.cs.cmu.edu/~banerjee/publications/cicling2002.pdf [8] Sinha M. et. Al, Unsupervised Graph-based Word Sense Disambiguation Using Measures of Word Semantic Similarity http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.3892&rep=rep1&type=pdf [9] Shrestha N et.al Resources for Napali WSD in proceedings of IEEE conference on Natural Language Processing and Knowledge Engineering, 2008. NLP-KE '08. 2008 [10] Alok Ranjan Pal and Diganta Saha, WORD SENSE DISAMBIGUATION: A SURVEY, International Journal of Control Theory and Computer Modeling (IJCTCM) Vol.5, No.3, July 2015 [11] Z.Z. Merhbene Laroussi, Unsupervised system for Lexical Disambiguation of Arabic Language using a vote procedure, IEEE-ICoAC, 2011 [12] Raj Dhungana, Subarna Shakya,et.al Word Sense Disambiguation Using Wsd Specific Wordnet Of Polysemy Words Udaya International Journal on Natural Language Computing (IJNLC) Vol. 3, No.4, August 2014 [13] Study and Development of Machine Translation System: an important tool to bridge the digital divide. Ph.D. Thesis submitted in the Department of Computer Science & IT, University of Jammu, 2014. [14] Shallu et.al A Survey of Word-sense Disambiguation Effective Techniques and Methods for Indian Languages, JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 4, NOVEMBER 2013 119

S.No Hindi Dogri 1 व क न स द श ह, जह ज़ दग श नद र ह. ओ ज दग श नद र ऐ क ड़ द श न, ज थ 2 ब हतर न स वध ओ स ल स ह? ब हतर न स वध ओ क न ल स ऐ? 3 उ तर व क क़र ब थत ड नम क त न न ड क द श म स एक ह 4 जम न स ड नम क आकर बस एन ट नब ख कहत ह क सफ़ अपन न ग रक क लए ह नह, ड नम क ब हर स आकर बसन व ल क लए भ ज नत ह. 5 य ज़ ल ड और इसक क़र ब द श ऑ लय रहन क लह ज़ स श नद र द श ह 6 प र द श स ल ग एक पच पर य म ग करत ह ए एक-द सर स ज ड़त गए. ज स -ज स द तख़त बढ़, अज़ क ल ब ई भ बढ़त गई. आज य अज़ य ज़ ल ड क एक य ज़यम म रख ह. उ तर व द कर ब थत ड नम क न ड क द श ब च इक ऐ जम न क ल ड नम क आइय बस एन ट नब ख आख दय न ज सफ अपन न ग रक द आ त ग न, ड नम क ब र क ल आइय बसन आ ल द आ त ब ज नत ऐ. य ज़ ल ड त ए द क़र ब द श ऑ लय र न द लह ज़ क न श नद र द श न प र द श श ल क इक पच पर ए म ग करद ह ए इक - द ए क न ज ड़द ग. जʼय - जʼय दसखत बध, अज़ द ल ब ई ब बधद ग ई. अ ज ए अज़ य ज़ ल ड द इक य ज़यम च र ख ऐ. 7 यह क ज वन- तर बह त स प चम द श स ब हतर ह. इ थ द ज वन - तर प चम द श क ल श ल ऐ. बड़ ह र 8 ब ब स स ब ह ए स फ़ न कह क, "ब ड ग क ल म पलन शहर म पलन स ब हतर ह त ह. 9 तब स ल ग इस ठ ड जगह म बस ह. ब ब स क ल ब ह ए स फ न आख आ ज, ब ड ग क ल च पलन श र च पलन क ल श ल ह द ऐ. अद द ब स न. ल क इस ठ ड थ र च Table II: Use cases of word sense disambiguation in Hindi-Dogri MTS 120