Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources

Siva Reddy (1, 2), Serge Sharoff (3)
1 Lexical Computing Ltd, UK
2 University of York, UK
3 University of Leeds, UK
Presenter: Aswarth Abhilash, IIIT Hyderabad, India
CLIA 2011 @ IJCNLP 2011, Chiang Mai, Thailand, November 13, 2011

Indian Languages
- Many Indian languages exhibit similarities in morphology and syntactic behaviour.
- Indo-Aryan languages: Hindi, Urdu, Marathi, Nepali, Punjabi, Gujarati, Rajasthani, Bengali, Oriya, Bihari
- Dravidian languages: Telugu, Tamil, Kannada, Malayalam
- Some pairs exhibit high similarity: Hindi-Urdu, Hindi-Marathi, Tamil-Malayalam, Telugu-Kannada, ...

Kannada and Telugu
Some facts about Kannada and Telugu:
- Both belong to the Dravidian family.
- They are spoken by 35 and 75 million people respectively.
- Telugu was highly influenced by Kannada, making them slightly mutually intelligible (Datta, 1998).
- The scripts belong to the same family; until the 13th century both languages had the same script.
- Telugu is relatively resource-richer than Kannada: corpus, morphological analyzer, POS tagger, dependency parser.

Kannada and Telugu: Similarities at Word Level and in Structure
Kannada: Vivādagaḷa hinneleyalli tamma taṇḍada punarracisalu aṇṇā hajāreyavaru yōjisiddāre
Telugu: Vivādāla nēpadhyanlō tana jaṭṭunu punarvyavasthīkarincālani annā hajārē yōcincāru
English gloss: Controversies / in view of / his / team / restructuring / Anna / Hajare / planning
Translation: Anna Hazare is planning to restructure his team in view of controversies.

Cross Language Tools
Building tools for a target language using cross-language resources.
Motivation: Kannada tools from Telugu resources
- Not many resources exist for Kannada.
- Existing resources are not as efficient as those for other languages.
- Telugu is relatively resource-richer than Kannada.
- Kannada and Telugu are typologically related and exhibit similarities.
Our focus is to build POS taggers and morphological disambiguators/analyzers.

Our Tagset
- Bharati et al. (2006) designed a common POS tagset for all Indian languages, e.g. CC, JJ, NN, VM.
- We added morphological information to this tagset, e.g. NN.n.f.pl.3.d.
- Fields, in order: main POS tag, coarse label, gender, number, person, case.
- Since the POS tag contains morphological information, our tagger can also be used as a morphological analyzer.

Tagset Statistics

| Field | Description | Number of Tags | Tags |
|---|---|---|---|
| - | Full Tag | 311 | NN.n.f.pl.3.d, VM.v.n.sg.3., ... |
| 1 | Main POS Tag | 25 | CC, JJ, NN, VM, ... |
| 2 | Coarse POS Category | 9 | adj, n, num, unk, ... |
| 3 | Gender | 6 | any, f, m, n, punc, null |
| 4 | Number | 4 | any, pl, sg, null |
| 5 | Person | 5 | 1, 2, 3, any, null |
| 6 | Case | 3 | d, o, null |

Table: Fields in each tag and their corresponding statistics. null denotes an empty value, e.g. in the tag VM.v.n..3. the number and case fields are null.
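The dot-separated tag layout above can be unpacked mechanically. The following is a minimal sketch, not the authors' code: the names `Tag`, `parse_tag` and `main_pos` are hypothetical, and padding short tags with empty fields is an assumption made so that tags such as `NN.unk...` do not raise an error.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    """The six fields of a full tag, in the order given on the slide."""
    main: str
    coarse: str
    gender: str
    number: str
    person: str
    case: str


def parse_tag(tag: str) -> Tag:
    """Split a full tag such as 'NN.n.f.pl.3.d' into its six fields.

    Empty strings stand for the 'null' value. Tags with fewer than six
    fields are padded with empty fields (an assumption of this sketch).
    """
    parts = tag.split(".")
    if len(parts) > 6:
        raise ValueError(f"too many fields in tag: {tag!r}")
    parts += [""] * (6 - len(parts))
    return Tag(*parts)


def main_pos(tag: str) -> str:
    """Return only the main POS tag, e.g. 'NN' for 'NN.n.n.sg..o'."""
    return tag.split(".", 1)[0]
```

For example, `parse_tag("VM.v.n..3.")` yields empty `number` and `case` fields, matching the null fields described in the table caption.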

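HMM-Based POS Tagger

$$\underset{t_1 \ldots t_n}{\operatorname{argmax}} \left[ \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i) \right] \qquad (1)$$

- $w_1 \ldots w_n$ is the word sequence to be tagged.
- $t_1 \ldots t_n$ denotes the tag sequence.
- $P(t_i \mid t_{i-1}, t_{i-2})$ denotes the tag transition probabilities.
- $P(w_i \mid t_i)$ denotes the emission probabilities.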
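To make Equation (1) concrete, here is a small sketch that scores a tag sequence and finds the argmax by brute force. It is illustrative only: a real tagger such as TnT uses Viterbi search with smoothing rather than enumeration, and the function names, the `<s>` padding symbol, the `floor` value for unseen events and the toy probabilities are all assumptions of this example.

```python
from itertools import product
from math import log


def sequence_log_prob(words, tags, trans, emit, floor=1e-12):
    """Log of the product in Eq. (1) for one candidate tag sequence.

    trans maps (t_{i-2}, t_{i-1}, t_i) to a probability and
    emit maps (w_i, t_i) to a probability. Unseen events get `floor`.
    """
    padded = ["<s>", "<s>"] + list(tags)  # two start symbols for the trigram
    total = 0.0
    for i, (word, tag) in enumerate(zip(words, tags)):
        t_prev2, t_prev1 = padded[i], padded[i + 1]
        total += log(trans.get((t_prev2, t_prev1, tag), floor))
        total += log(emit.get((word, tag), floor))
    return total


def best_tags(words, tagset, trans, emit):
    """Brute-force argmax over all tag sequences (exponential, toy use only)."""
    candidates = product(tagset, repeat=len(words))
    return max(candidates, key=lambda tags: sequence_log_prob(words, tags, trans, emit))


# Toy model with made-up numbers.
toy_trans = {
    ("<s>", "<s>", "NN"): 0.8, ("<s>", "<s>", "VM"): 0.2,
    ("<s>", "NN", "VM"): 0.7, ("<s>", "NN", "NN"): 0.3,
    ("<s>", "VM", "NN"): 0.5, ("<s>", "VM", "VM"): 0.5,
}
toy_emit = {
    ("dog", "NN"): 0.9, ("dog", "VM"): 0.1,
    ("runs", "VM"): 0.8, ("runs", "NN"): 0.2,
}
toy_result = best_tags(["dog", "runs"], ["NN", "VM"], toy_trans, toy_emit)
```

Working in log space avoids numerical underflow when the product runs over long sentences.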

Kannada POS Tagger from Telugu
- Estimate transition probabilities and emission probabilities from Telugu.
- Exploit lexical and syntactic level similarities between Kannada and Telugu.
Kannada: Vivādagaḷa hinneleyalli tamma taṇḍada punarracisalu aṇṇā hajāreyavaru yōjisiddāre
Telugu: Vivādāla nēpadhyanlō tana jaṭṭunu punarvyavasthīkarincālani annā hajārē yōcincāru
English gloss: Controversies / in view of / his / team / restructuring / Anna / Hajare / planning
Translation: Anna Hazare is planning to restructure his team in view of controversies.

Steps Involved
1. Build large corpora of Kannada and Telugu, and POS tag the corpora with existing tools.
2. Determine the transition probabilities of Kannada from the Telugu tagged corpus (cross-lingual) or from the Kannada tagged corpus (mono-lingual).
3. Estimate the emission probabilities of Kannada from the Telugu tagged corpus, or using heuristics combined with a morphological analyser, or from the Kannada tagged corpus.
4. Use the probabilities from Steps 2 and 3 to build a POS tagger for Kannada.

Step 1: Large Corpus Creation
Corpus Factory (Kilgarriff et al., 2010):
- Build frequency lists of Telugu and Kannada from Wikipedia.
- Generate thousands of random tuples from the frequency lists.
- Feed them into a search engine and download the search hits.
- Clean the pages: remove HTML markup, apply language filtering.
- Remove duplicates.
Resulting corpora:
- Telugu: 4.6 million words [collected in Dec 2009]
- Kannada: 16 million words [collected in June 2011]
- The difference in size is due to the difference in collection time.
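The tuple-generation step can be sketched as follows. This is a hedged illustration rather than the Corpus Factory implementation: `make_queries`, the tuple size of 3 and the fixed random seed are assumptions, and the search-engine and page-cleaning steps are left out because they depend on external services.

```python
import random
from math import comb


def make_queries(freq_list, n_queries, tuple_size=3, seed=0):
    """Draw distinct random word tuples from a (word, frequency) list.

    Each tuple becomes one search-engine query string. The frequencies
    are ignored here; a real pipeline would typically use them to pick
    which words qualify as seeds.
    """
    words = sorted({word for word, _ in freq_list})
    if n_queries > comb(len(words), tuple_size):
        raise ValueError("not enough seed words for that many distinct tuples")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n_queries:
        # Sorting makes (a, b, c) and (c, b, a) count as the same tuple.
        seen.add(tuple(sorted(rng.sample(words, tuple_size))))
    return [" ".join(t) for t in sorted(seen)]
```

A call such as `make_queries(freq_list, 1000)` would return 1000 query strings of three seed words each, provided the frequency list is large enough.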

Step 1: Large Corpus Creation
Annotate the Telugu corpus with an existing tagger:
- We used the ILMT consortium tagger, built using (Avinesh and Karthik, 2007).
- The tagger is an integration of many tools running in a pipeline: tokenization, transliteration, morph analyzer, CRF model, transliteration.
- If anything fails in the pipeline, the tagger fails.
- Only 70% of the corpus was finally tagged.
- Not scalable, but usable.
- We converted the output to our tagset.
We also tagged the Kannada corpus using the ILMT Kannada tagger, in order to build mono-lingual taggers and compare their performance with the cross-lingual tagger.

Step 2: Transition Probabilities
From Telugu (cross-lingual):
- Transition probabilities across typologically related languages are likely to be the same (Hana et al., 2004).
- Kannada and Telugu exhibit high similarities in syntactic structure.
- Compute transition probabilities from the Telugu tagged corpus of Step 1.
Kannada: Vivādagaḷa hinneleyalli tamma taṇḍada punarracisalu aṇṇā hajāreyavaru
Telugu: Vivādāla nēpadhyanlō tana jaṭṭunu punarvyavasthīkarincālani annā hajārē
The tag sequence in Kannada and Telugu is the same:
NN.n.n.pl..o NN.n.n.sg..o PRP.n.m.sg.3.d NN.n.n.sg..o VM.v.any.any.any. NNP.n.m.sg.3.o NNP.n.m.sg.3.o
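Computing transition probabilities from a tagged corpus amounts to counting tag trigrams. The sketch below uses plain maximum-likelihood estimates with no smoothing, which is a simplification; the function name, the `<s>` padding symbol and the toy corpus are assumptions of this example, not the authors' setup.

```python
from collections import Counter


def trigram_transitions(tagged_sentences):
    """Estimate P(t_i | t_{i-2}, t_{i-1}) by relative frequency.

    tagged_sentences is a list of sentences, each a list of (word, tag)
    pairs. Keys of the result are (t_{i-2}, t_{i-1}, t_i).
    """
    trigrams = Counter()
    contexts = Counter()
    for sentence in tagged_sentences:
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for a, b, c in zip(tags, tags[1:], tags[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return {key: n / contexts[key[:2]] for key, n in trigrams.items()}


# Toy corpus with placeholder words; only the tags matter here.
toy_corpus = [
    [("w1", "NN"), ("w2", "VM")],
    [("w3", "NN"), ("w4", "NN")],
]
toy_probs = trigram_transitions(toy_corpus)
```

In the cross-lingual setting the same counts would be taken from the Telugu corpus and reused unchanged for Kannada.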

Step 3: Emission Probabilities
From Telugu using approximate string matching:
- A Telugu-Kannada dictionary could be used, but we do not have such a dictionary.
- Exploit lexical similarities: Kannada and Telugu are slightly mutually intelligible (Datta, 1998).
- Edit distance: the minimum number of edits needed to transform one string into the other.
- For each word in Kannada, choose the nearest Telugu word.
- Since Telugu emission probabilities are known from the Telugu tagged corpus, use the mappings of Kannada-Telugu words to estimate Kannada emission probabilities.

| Kannada | Neighbours in Telugu |
|---|---|
| vibagavu | (vibagamu, 0.539), (vibaga, 0.5), (vibagalanu, 0.467), (vibagamulu, 0.467) |
| xaswanu | (xaswan, 0.545), (xaswaru, 0.5), (raswanu, 0.5), (xaswadu, 0.5) |
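The nearest-word mapping can be sketched with a standard Levenshtein distance. This is an assumption-laden illustration: the slide does not say which similarity measure produced the scores in the table, so this code will not reproduce those numbers, and the helper names and the alphabetical tie-break are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def nearest(word, vocab):
    """Closest vocabulary item; ties are broken alphabetically."""
    return min(vocab, key=lambda v: (edit_distance(word, v), v))


def transfer_emissions(kannada_words, telugu_emit):
    """Give each Kannada word the tag distribution of its nearest Telugu word.

    telugu_emit maps a Telugu word to a dict of tag probabilities.
    """
    vocab = sorted(telugu_emit)
    return {w: telugu_emit[nearest(w, vocab)] for w in kannada_words}
```

With the neighbour list from the table, `nearest("vibagavu", ["vibagamu", "vibaga", "vibagalanu", "vibagamulu"])` picks `vibagamu`, which agrees with the top-ranked neighbour on the slide. A linear scan over the whole vocabulary is slow for millions of words, so a real system would need an index or pruning.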

Step 3: Emission Probabilities
Telugu tags and Kannada morphology:
- Using the Telugu tagged corpus, the mappings from a morphological set to all possible tags are learned.
- For example, the morphological set n.n.sg..o is associated with all the tags that match the pattern *.n.n.sg..o.
- For every word in Kannada, based on its morphology as determined by the morphological analyser, we assign all the tags learned from Telugu.
- A uniform tag distribution is assumed.
Pitfall: explosion of the search space of tags for each word.
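The mapping from a morphological set to candidate tags could look like the sketch below. It assumes the morphological set is everything after the main POS tag, as the pattern *.n.n.sg..o suggests; the function names are hypothetical and the tag list is a toy sample built from tags that appear on these slides.

```python
from collections import defaultdict


def morph_of(tag: str) -> str:
    """Drop the main POS tag: 'NN.n.m.sg.3.o' -> 'n.m.sg.3.o'."""
    return tag.split(".", 1)[1] if "." in tag else ""


def learn_morph_to_tags(observed_tags):
    """Map each morphological set to every full tag seen with it."""
    mapping = defaultdict(set)
    for tag in observed_tags:
        mapping[morph_of(tag)].add(tag)
    return mapping


def uniform_candidates(morph: str, mapping):
    """All tags compatible with an analysed morphology, equally weighted."""
    candidates = sorted(mapping.get(morph, ()))
    return {tag: 1.0 / len(candidates) for tag in candidates}


telugu_tags = ["NN.n.m.sg.3.o", "NNP.n.m.sg.3.o", "NN.n.n.pl..o"]
morph_map = learn_morph_to_tags(telugu_tags)
```

Here `uniform_candidates("n.m.sg.3.o", morph_map)` gives NN and NNP a weight of 0.5 each, which shows how quickly the candidate set grows when many main tags share one morphology.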

Step 3: Emission Probabilities
Kannada tags with uniform distribution:
- For each word, learn all the possible tag associations from the Kannada tagged corpus.
- Though we learn from a tagged corpus, we do not use frequency information.
- We assume a uniform distribution over all tags for a word.
- The search space is reduced.
From the Kannada corpus:
- Learn emission probabilities directly from the tagged Kannada corpus.
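The two Kannada-based options differ only in whether counts are used. The sketch below contrasts them on a toy token list; the function names are hypothetical, and how the authors converted the uniform lexicon into tagger input is not stated on the slide, so the per-word normalisation here is an assumption.

```python
from collections import Counter, defaultdict


def lexicon_uniform(tagged_tokens):
    """Word -> {tag: weight}, with equal weight for every tag seen with the word."""
    tags_of = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_of[word].add(tag)
    return {w: {t: 1.0 / len(ts) for t in sorted(ts)} for w, ts in tags_of.items()}


def emission_mle(tagged_tokens):
    """(word, tag) -> P(word | tag) by relative frequency."""
    pair_counts = Counter(tagged_tokens)
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    return {(w, t): n / tag_counts[t] for (w, t), n in pair_counts.items()}


# Placeholder words "x" and "y"; a real run would use the tagged Kannada corpus.
toy_tokens = [("x", "NN"), ("x", "NN"), ("x", "VM"), ("y", "NN")]
uniform = lexicon_uniform(toy_tokens)
mle = emission_mle(toy_tokens)
```

The uniform variant needs only a word-to-tags lexicon, which is why it remains usable when no frequency information is available.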

Step 4: Tagging Model
- We use TnT (Brants, 2000), an implementation of an HMM tagger.
- Transition and emission probabilities come from Steps 2 and 3.
- Avinesh and Karthik (2007) report that performance increased when morphological information was used as features for their CRF model. Since our tagset includes morphological information, the TnT model may benefit from this information.
- TnT is known for predicting tags for unseen words, which makes it a potential morphological analyser for new words.

Additional Tools
- For each word form, we learned the association between POS tag, lemma and suffix markers from the Kannada annotated corpora [Step 1].
- Our tagger can therefore also be used for lemmatization and suffix prediction.

Sample Output

| Word | POS Tag | Lemma.Suffix |
|---|---|---|
| ಕ ಯ | NN.n.n.sg..o | ಕ.ಅ |
| ಪ ರ | NN.n.n.sg..d | ಪ ರ.0 |
| ಯ ನ | NN.unk... | ಯ ನ. |
| ಆಟದ | NN.n.n.sg..o | ಆಟ.ಅ |
| ಜ ದ | VM.unk... | ಜ ದ. |
| ಚ ದ ಗ ಪ ನ | NNP.unk... | ಚ ದ ಗ ಪ ನ. |
| ಅಪ ಯ | NN.n.m.sg.3.o | ಅಪ.ಅ |
| ತ | NN.n.n.sg..d | ತ.0 |
| ವ ದ | VM.v.any.any.any. | ವ ಸ.ಇದ |
| ಇ ಬ | QC.unk... | ಇ ಬ. |
| ಹ ಡ ಗನ | NN.n.m.sg.3.o | ಹ ಡ ಗ.ಅ |
| ರ ಯನ | NN.n.n.sg..o | ರ.ಅನ |
| | VM.v..pl.2. | ಡ.0 |
| | NN.n.n.sg..d | .0 |

Evaluation
- Evaluation is carried out only for the main POS tag. In NN.n.n.sg..o, the main POS tag is NN.
- Evaluation data: manually annotated Kannada corpora developed by the ILMT consortium (licensed).
- The corpus consists of 201,373 words.
- No evaluation data exists for morphological labels.

Results

| Model | Type | Transition Prob | Emission Prob | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| 1 | Cross-language | From Telugu | Approximate string matching | 56.88 | 56.88 | 56.88 |
| 2 | Cross-language | From Telugu | Telugu tags and Kannada morphology | 28.65 | 28.65 | 28.65 |
| 3 | Cross-language | From Telugu | Kannada tags with uniform distribution | 75.10 | 75.10 | 75.10 |
| 4 | Cross-language | From Telugu | Kannada emission probabilities | 77.63 | 77.63 | 77.63 |
| 5 | Mono-lingual | From the Kannada language | Kannada emission probabilities | 77.66 | 77.66 | 77.66 |
| 6 | Mono-lingual | Avinesh and Karthik (2007) | | 78.64 | 61.48 | 69.01 |

Table: Evaluation results of various tagging models [main tag only].

- Cross-language taggers are as good as mono-lingual taggers (Model 4 at 77.63 vs. Model 5 at 77.66).
- Models 3 and 4 are better than the existing Kannada tagger (F-measure 75.10 and 77.63 vs. 69.01).
- Model 3 is easy to build since it requires only a Kannada lexicon.
- Over 50% accuracy (56.88, Model 1) with almost no Kannada resources.
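The F-measure column is the harmonic mean of precision and recall, which can be checked against the table. The sketch below also shows one way precision and recall can diverge, as they do for Model 6: by treating untagged tokens as `None`. That treatment, and the function names, are assumptions of this example; the slides do not describe how the scores were computed.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def main_pos(tag: str) -> str:
    """Main POS tag only, e.g. 'NN' for 'NN.n.n.sg..o'."""
    return tag.split(".", 1)[0]


def evaluate(gold, predicted):
    """Main-tag precision, recall and F-measure in percent.

    A prediction of None means the tagger produced no tag for that token,
    which lowers recall but not precision.
    """
    tagged = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(main_pos(g) == main_pos(p) for g, p in tagged)
    precision = 100.0 * correct / len(tagged) if tagged else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)
```

Plugging in Model 6's figures, `f_measure(78.64, 61.48)` rounds to 69.01, matching the table. When every token receives a tag, precision and recall coincide, which is consistent with the identical values reported for Models 1 to 5.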

Summary
- Cross-language resources can be used to build tools for other languages.
- The resulting tagger is as good as a mono-lingual tagger if a target-language lexicon exists.
- At least 50% accuracy is reached even if no resources exist for the target language.
- This is a promising direction for many resource-poor Indian languages.
- The POS tagger also serves as a morphological analyser/disambiguator.
- Tools can be downloaded from http://sivareddy.in

Bibliography
Avinesh, P. V. S. and Karthik, G. (2007). Part-Of-Speech Tagging and Chunking using Conditional Random Fields and Transformation-Based Learning. In Proceedings of the IJCAI and the Workshop On Shallow Parsing for South Asian Languages (SPSAL), pages 21-24.
Bharati, A., Sangal, R., Sharma, D. M., and Bai, L. (2006). AnnCorra: Annotating Corpora Guidelines for POS and Chunk Annotation for Indian Languages. Technical Report (TR-LTRC-31), LTRC, IIIT-Hyderabad.
Brants, T. (2000). TnT: A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLC '00, pages 224-231, Stroudsburg, PA, USA. Association for Computational Linguistics.
Datta, A. (1998). The Encyclopaedia of Indian Literature, volume 2.
Hana, J., Feldman, A., and Brew, C. (2004). A Resource-light Approach to Russian Morphology: Tagging Russian using Czech Resources. In Proceedings of EMNLP 2004, Barcelona, Spain.
Kilgarriff, A., Reddy, S., Pomikálek, J., and PVS, A. (2010). A Corpus Factory for Many Languages. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).