Multiword Expressions Dataset for Indian Languages

Dhirendra Singh, Sudha Bhingardive, Pushpak Bhattacharyya
Department of Computer Science and Engineering, Indian Institute of Technology Bombay, India.
{dhirendra,sudha,pb}@cse.iitb.ac.in

Abstract

Multiword Expressions (MWEs) are used frequently in natural languages, but understanding the diversity of MWEs is one of the open problems in Natural Language Processing. In the context of Indian languages, MWEs play an important role. In this paper, we present an MWE annotation dataset created for two Indian languages, Hindi and Marathi. We extract possible MWE candidates using two repositories: 1) the POS-tagged corpus and 2) the IndoWordNet synsets. Annotation is done for two types of MWEs: compound nouns and light verb constructions. During annotation, human annotators tag the valid MWEs among these candidates based on the standard guidelines provided to them. Using the two repositories mentioned above, we obtained 3178 compound nouns and 2556 light verb constructions in Hindi, and 1003 compound nouns and 2416 light verb constructions in Marathi. The resource is publicly available and can be used as a gold standard for Hindi and Marathi MWE systems.

Keywords: Multiword Expressions, MWEs, WordNet, Hindi WordNet, Compound Nouns, Light Verb Constructions

1. Introduction

Recently, various approaches have been proposed for the identification and extraction of MWEs (Calzolari et al., 2002; Baldwin et al., 2003; Guevara, 2010; Al-Haj and Wintner, 2010; Kunchukuttan and Damani, 2008; Chakrabarti et al., 2008; Sinha, 2011; Singh et al., 2012; Reddy et al., 2011). The quality of such approaches depends on the algorithms used and also on the quality of the resources used. Various standard MWE datasets[1] are available for languages such as English, French, German and Portuguese, and can be used to evaluate MWE approaches. For Indian languages, however, no such standard datasets are publicly available.
Our goal is to create MWE annotations for two Indian languages, Hindi and Marathi, and to make them publicly available. We have explored two types of MWEs: compound nouns (CNs) and light verb constructions (LVCs), since they occur in text much more frequently than other MWEs. The created resource can be useful for various natural language processing applications such as information extraction, word sense disambiguation and machine translation.

The rest of the paper is organized as follows. Section 2 describes compound nouns and light verb constructions. Section 3 describes the extraction of possible MWE candidates. Section 4 gives the statistics of the MWE annotation for Hindi and Marathi. The MWE guidelines are given in Section 5, followed by discussions in Section 6. Section 7 concludes the paper and points to future work.

[1] http://multiword.sourceforge.net/phite.php?sitesig=FILES&page=FILES_20_Data_Sets

2. Compound Nouns and Light Verb Constructions

In the context of Indian languages, MWEs are quite varied, and many of them are borrowed from other languages such as English, Urdu, Arabic and Sanskrit. For Hindi, investigations on MWE extraction are limited. Venkatapathy and Joshi (2006) worked on syntactic and semantic features for N-V collocation extraction using a MaxEnt classifier. Mukerjee et al. (2006) proposed part-of-speech projection from English to Hindi with corpus alignment for extracting complex predicates. Kunchukuttan and Damani (2008) presented a method for extracting compound nouns in Hindi using statistical co-occurrence. Sinha (2009) used linguistic properties of light verbs to extract complex predicates from a Hindi-English parallel corpus. All of the work mentioned above considered only limited aspects of Hindi MWEs. In this paper, we focus on creating gold standard data for CNs and LVCs.

Compound Nouns: A word-pair forms a CN if its meaning cannot be composed from the meanings of its constituent words.
CNs are formed by either Noun+Noun (N+N) or Adjective+Noun (Adj+N) word combinations. For example, बाग बगीचा (baaga bagiichaa, garden) (N+N) and काला धन (kaalaa dhana, black money) (Adj+N) are CNs in Hindi.

Light Verb Constructions: LVCs show highly idiosyncratic constructions with nouns. It is difficult to predict which light verb chooses which noun and why the light verb cannot be substituted with another. LVCs are further classified into Conjunct Verbs (CjVs) and Compound Verbs (CpVs).

Conjunct Verbs: CjVs are formed by Noun+Verb (N+V), Adjective+Verb (Adj+V) or Adverb+Verb (Adv+V) word combinations. For example, काम करना (kaama karanaa, to work) (N+V), ठीक करना (thik karanaa, to repair) (Adj+V) and वापस आना (vaapas aanaa, to come back) (Adv+V) are CjVs in Hindi.

Compound Verbs: CpVs are formed by Verb+Verb (V+V) word combinations. For example, भाग जाना (bhaaga jaanaa, run away) (V+V) and उठ जाना (utha jaanaa, to wake up) (V+V) are CpVs in Hindi.

3. MWE Candidates Extraction

We extracted possible MWE candidates using two resources: 1) the POS-tagged corpus and 2) the IndoWordNet synsets.

3.1. Candidate Extraction using POS-tagged Corpus

For Indian languages, standard POS-tagged corpora are publicly available[2]. We used such corpora to extract possible MWE candidates. For CNs, we extracted candidates matching the patterns noun followed by noun and adjective followed by noun. For LVCs, we extracted candidates matching the patterns noun followed by verb, adjective followed by verb, adverb followed by verb and verb followed by verb.

3.2. Candidate Extraction using IndoWordNet Synsets

IndoWordNet[3] (Bhattacharyya, 2010) is the Indian-language WordNet covering 18 official languages of India. It consists of synsets and of semantic and lexical relations. It also stores MWEs, as they represent concepts (synsets). For example, it stores Hindi CNs such as बाग बगीचा (baaga bagiichaa, garden), धन दौलत (dhana daulata, wealth) and काला धन (kaalaa dhana, black money), and LVCs such as गुजर जाना (gujara jaanaa, passed away), काम करना (kaama karanaa, to work) and भाग जाना (bhaaga jaanaa, run away). We extracted possible MWE candidates from the WordNet synsets of Hindi and Marathi. Synsets consisting of words that match the following patterns were extracted and used as possible candidates:

- noun followed by noun
- adjective followed by noun
- noun followed by verb
- adjective followed by verb
- adverb followed by verb
- verb followed by verb

All these MWE candidates were given to three human annotators for each of the two languages.
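The pattern-based candidate extraction described in Sections 3.1 and 3.2 can be illustrated with a minimal sketch. It is not the authors' implementation: the coarse tags ("N", "ADJ", "ADV", "V"), the function name extract_candidates and the toy transliterated sentence are assumptions made for illustration, and a real POS-tagged corpus would use its own tagset.

```python
from typing import List, Set, Tuple

# Hypothetical coarse tags; an actual POS-tagged corpus uses its own tagset.
CN_PATTERNS: Set[Tuple[str, str]] = {("N", "N"), ("ADJ", "N")}
LVC_PATTERNS: Set[Tuple[str, str]] = {
    ("N", "V"), ("ADJ", "V"), ("ADV", "V"), ("V", "V"),
}

TaggedToken = Tuple[str, str]  # (word, tag)


def extract_candidates(
    sentence: List[TaggedToken], patterns: Set[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return adjacent word pairs whose tag pair matches one of `patterns`."""
    candidates = []
    # Walk over every pair of neighbouring tokens in the sentence.
    for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
        if (t1, t2) in patterns:
            candidates.append((w1, w2))
    return candidates


# Toy sentence made of transliterated examples from the text (illustrative only).
sentence = [("kaalaa", "ADJ"), ("dhana", "N"), ("vaapas", "ADV"), ("aanaa", "V")]
cn_candidates = extract_candidates(sentence, CN_PATTERNS)
lvc_candidates = extract_candidates(sentence, LVC_PATTERNS)
```

Every pair returned this way is only a possible candidate; whether it is a valid MWE is decided by the human annotators.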
The annotators were told to tag the valid MWEs based on the guidelines provided to them (see Section 5).

[2] http://www.ldcil.org/resourcestextcorp.aspx
[3] Wordnets for Indian languages have been developed under the IndoWordNet umbrella. Wordnets are available in the following languages: Assamese, Bodo, Bengali, English, Gujarati, Hindi, Kashmiri, Konkani, Kannada, Malayalam, Manipuri, Marathi, Nepali, Punjabi, Sanskrit, Tamil, Telugu and Urdu. These languages cover 3 different language families: Indo-Aryan, Sino-Tibetan and Dravidian. http://www.cfilt.iitb.ac.in/indowordnet/

4. MWEs Annotation Statistics

This section gives statistics of the MWEs annotated by three human annotators. Valid MWEs are obtained by taking the majority vote. This MWE dataset has been made available on the CFILT website http://www.cfilt.iitb.ac.in/downloads.html.

4.1. Annotation Statistics of MWEs obtained from the POS-tagged Corpus

For CNs, we extracted 12000 possible candidates from the Hindi and 2000 possible candidates from the Marathi POS-tagged corpus. For LVCs, we extracted 4000 possible candidates each from the Hindi and Marathi POS-tagged corpora. The statistics of valid MWEs annotated by human annotators for Hindi and Marathi are shown in Table 1 and Table 2 respectively.

                            Possible candidates    Valid MWEs
Compound Nouns              12000                  2178
Light Verb Constructions    4000                   1556

Table 1: Hindi MWEs annotation statistics obtained from the POS-tagged corpus

                            Possible candidates    Valid MWEs
Compound Nouns              2000                   503
Light Verb Constructions    4000                   1916

Table 2: Marathi MWEs annotation statistics obtained from the POS-tagged corpus

4.2. Annotation Statistics of MWEs obtained from the IndoWordNet Synsets

For Hindi, we extracted 19326 possible candidates for CNs and 4017 possible candidates for LVCs from the IndoWordNet synsets. For Marathi, we extracted 5327 possible candidates for CNs and 1838 possible candidates for LVCs from the WordNet synsets. Statistics of valid MWEs annotated by human experts for Hindi and Marathi are shown in Table 3 and Table 4 respectively.

                            Possible candidates    Annotated MWEs
Compound Nouns              19326                  1000
Light Verb Constructions    4017                   1000

Table 3: Hindi MWEs annotation statistics obtained from the IndoWordNet synsets

                            Possible candidates    Annotated MWEs
Compound Nouns              5327                   500
Light Verb Constructions    1838                   500

Table 4: Marathi MWEs annotation statistics obtained from the IndoWordNet synsets

The inter-annotator agreement was calculated using Cohen's kappa. It was found to be 0.86 for Hindi and 0.82 for Marathi.

5. MWE Annotation Guidelines

In this section, we describe the guidelines given to human annotators for annotating MWEs among the possible candidates. Annotators were told to check whether a candidate (word-pair) satisfies any of the following criteria of MWE formation.

Reduplication: Here, a root or stem of a word, or part of it, is repeated. Reduplication can be further subdivided into:

- Onomatopoeic Expression: The constituent words imitate a sound or the sound of an action. Generally, the words are repeated twice with the same matra. For example, टिक टिक (tick tick, the ticking sound of a watch's needle).
- Non-Onomatopoeic Expression: The constituent words have meaning, but they are repeated to convey a particular meaning. For example, चलते चलते (chalate chalate, while walking).
- Partial Reduplication: One of the constituent words is meaningful, while the other is constructed by partially repeating the first word. For example, पानी वाणी (paani vaani, water).
- Semantic Reduplication: The constituent words have some semantic relationship between them. For example, धन दौलत (dhana daulata, wealth) shows [Synonymy], and दिन रात (dina raata, always) shows [Antonymy].
Fixed Expression: Fixed expressions are immutable expressions, which do not undergo any transformation or morphological inflection and do not allow insertion between their words. For example, कम से कम (kam se kam, at least), ज्यादा से ज्यादा (jyada se jyada, maximal).

Semi-fixed Expression: Semi-fixed expressions obey constraints on word order and composition. They might show some degree of lexical variation. For example, कार पार्क (car park, car park) can be used as कार पार्क्स (car parks, car parks).

Non-Compositional: The meaning of the complete multiword expression cannot be completely determined from the meanings of its constituent words. For example, अक्षय तृतीया (akshaya Tritiyaa, a festival in India).

Decomposable Idioms: Decomposable idioms are syntactically flexible and behave like semantically linked parts, but it is difficult to predict exactly what type of syntactic expression they are. For example, आटे-दाल का भाव मालूम होना (aate daal ka bhava maalum honaa, to create a knowledge). In this example, we can replace the phrase 'आटे-दाल का भाव मालूम होना' (aate daal ka bhava maalum honaa) with 'आटे-दाल का दाम मालूम पड़ना' (aate daal ka daama maalum padanaa).

Non-Decomposable Idioms: Non-decomposable idioms are idioms that do not undergo any syntactic variation but might allow some minor lexical modification. For example, नौ दो ग्यारह होना (Nau do gyaraaha honaa, to run away).

Named Entity Recognition (NER): Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Named entities are syntactically highly idiosyncratic. They are generally formed based on a place or a person. For example, भारतीय प्रौद्योगिकी संस्थान (Bhartiya Prodyogiki Sansthan, Indian Institute of Technology) (Organization), सचिन तेंदुलकर (Sachin Tendulkar, Sachin Tendulkar) (Proper noun), ताज महल (Taj Mahal) (Location), etc.
Collocations: A collocation is an expression consisting of two or more words that corresponds to some conventional way of saying things. For example, कड़क चाय (kadaka chai, strong tea), पोस्ट ऑफिस (post office, post office), etc.

Foreign Words: Words borrowed from other languages are called foreign words. They can be treated as valid MWEs in the context of Indian languages. For example, रेलवे स्टेशन (Railway station, railway station), पोस्ट ऑफिस (Post office, post office), etc.

6. Discussions

While annotating CNs and LVCs, annotators faced some difficulties, which are described below.

Polysemous candidates: Some extracted candidates were found to be polysemous. As we did not provide the context in which these candidates occur, annotators were confused while annotating them. Most of the time, these candidates behave as MWEs because of their frequent metaphoric usage. For example:

1. आग लगाना (aag lagaana) has two senses in Hindi: 1) to destroy by fire and 2) to provoke. It forms an MWE when used in its second sense, which is metaphoric in nature.
2. पर्दा उठाना (pardaa uthanaa) has two senses in Hindi: 1) to reveal secret information and 2) to make visible. It forms an MWE when used in its first sense.

Annotators tagged such polysemous candidates as valid MWEs based on their knowledge and context.

Infrequent candidates: Some candidates are not tagged as MWEs even though they satisfy some of the guidelines, because they are used infrequently. For example, नीला पीला (neela piila) is not considered a valid MWE even though it looks similar to the valid MWE लाल पीला (lala piila). Such infrequent candidates are not annotated as MWEs.

7. Conclusion

In this paper, we presented a manually annotated dataset of MWEs in Hindi and Marathi. The annotation was done for compound nouns and light verb constructions. MWE candidates were extracted from the POS-tagged corpus and the IndoWordNet synsets. The annotation process involved three annotators for each language, and MWEs were validated by majority vote. For Hindi, we obtained 3178 compound nouns and 2556 light verb constructions as valid MWEs, and for Marathi, we obtained 1003 compound nouns and 2416 light verb constructions as valid MWEs. This MWE dataset has been made publicly available and can be used as a gold standard dataset for MWE systems and their applications. In future, we would like to annotate MWEs in running text and to explore other types of MWEs and other languages.

8. Acknowledgments

We would like to thank Jaya Saraswati, Rajita Shukla, Laxmi Kashyap, Nilesh Joshi and Irawati Kulkarni from the CFILT lab at IIT Bombay for their valuable contribution to the gold standard data creation. We also acknowledge the support of the Department of Electronics and Information Technology (DeitY), Ministry of Communication and Information Technology, Government of India, and of the Ministry of Human Resource Development.

9. Bibliographical References

Al-Haj, H. and Wintner, S. (2010). Identifying multiword expressions by leveraging morphological and syntactic idiosyncrasy. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 10-18. Association for Computational Linguistics.
Baldwin, T., Bannard, C., Tanaka, T., and Widdows, D. (2003). An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment - Volume 18, pages 89-96.
Bhattacharyya, P. (2010). IndoWordNet. In Language Resources and Evaluation Conference (LREC), Malta.
Calzolari, N., Fillmore, C. J., Grishman, R., Ide, N., Lenci, R., Macleod, C., and Zampolli, A. (2002). Towards best practice for multiword expressions in computational lexicons. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands. Citeseer.
Chakrabarti, D., Mandalia, H., Priya, R., Sarma, V. M., and Bhattacharyya, P. (2008). Hindi compound verbs and their automatic extraction. In COLING (Posters), pages 27-30.
Guevara, E. (2010). A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 33-37.
Kunchukuttan, A. and Damani, O. P. (2008). A system for compound noun multiword expression extraction for Hindi. In 6th International Conference on Natural Language Processing, pages 20-29.
Mukerjee, A., Soni, A., and Raina, A. M. (2006). Detecting complex predicates in Hindi using POS projection across parallel corpora. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 28-35.
Reddy, S., McCarthy, D., and Manandhar, S. (2011). An empirical study on compositionality in compound nouns. In IJCNLP, pages 210-218.
Singh, S., Damani, O. P., and Sarma, V. M. (2012). Noun group and verb group identification for Hindi. In COLING, pages 2491-2506. Citeseer.
Sinha, R. M. K. (2009). Mining complex predicates in Hindi using a parallel Hindi-English corpus. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications, pages 40-46. Association for Computational Linguistics.
Sinha, R. M. K. (2011). Stepwise mining of multiword expressions in Hindi. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World, pages 110-115.
Venkatapathy, S. and Joshi, A. K. (2006). Using information about multi-word expressions for the word-alignment task. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 20-27. Association for Computational Linguistics.