Cross Language POS taggers for Resource Poor Languages


April 22, 2011

1 Introduction

A POS tagger is one of the basic requirements for advancing linguistic research in any language, yet many languages lack one. The reasons vary. One is the absence of other basic resources such as corpora, lexicons or morphological analyzers. With the advent of the Web, corpora are no longer a major problem (see Kilgarriff et al. (2010)). With technical advances in lexicography (Atkins and Rundell, 2008), lexicon building and morphological analysis have also been addressed well enough for the next stages of research to build on. Another reason is that few or no research groups work on a particular language, so no annotated data exists from which to build efficient taggers. This problem can be addressed if research and resources for a resource-rich language (the source language) can be reused for a resource-poor language (the target language); if the two languages are typologically related, efficient taggers can be built. In this work, we aim to build POS taggers for resource-poor languages by exploiting the resources of their typologically related, resource-rich relatives.

2 Related Work

Several existing methods build a POS tagger for a target language without using any of its annotated data. Yarowsky et al. (2001), Yarowsky and Ngai (2001) and Das and Petrov (2011) build POS taggers for a target language using a parallel corpus of the target language and a source language, where the source language is expected to have a POS tagger. First, the source-language tools annotate the source side of the parallel corpus. These annotations are then projected to the target-language side using the word alignments in the parallel corpus. In this way an annotated corpus is built for the target language, from which POS taggers are trained.
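The projection step described above can be sketched as follows. The tag inventory, tokens and alignment pairs are hypothetical illustrations (in practice the alignments would come from a word aligner such as GIZA++ or fast_align); this is a sketch, not the implementation used by the cited work.

```python
# Minimal sketch of annotation projection across word alignments.

def project_tags(source_tags, alignments, target_len):
    """Project POS tags from source tokens onto aligned target tokens.

    source_tags: list of tags for the source sentence
    alignments:  list of (source_index, target_index) alignment pairs
    target_len:  number of tokens in the target sentence
    """
    target_tags = ["UNK"] * target_len  # unaligned target tokens stay untagged
    for s_idx, t_idx in alignments:
        target_tags[t_idx] = source_tags[s_idx]
    return target_tags

# Example: a 3-token source sentence aligned to a 3-token target sentence.
tags = project_tags(["PRON", "VERB", "NOUN"], [(0, 0), (1, 2), (2, 1)], 3)
print(tags)  # ['PRON', 'NOUN', 'VERB']
```

Running this on a fully tagged source side of a parallel corpus yields the (noisy) annotated target corpus from which a tagger can then be trained.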
Other methods that make use of parallel corpora include Snyder et al. (2008) and Naseem et al. (2009). They use unsupervised approaches based on hierarchical

Bayesian models, using Markov chain Monte Carlo sampling for inference and gaining from the information shared across languages. The main disadvantage of the above methods is that they rely heavily on parallel corpora, which are themselves a costly resource for resource-poor languages. Hana et al. (2004) and Feldman et al. (2006) propose a method for developing a POS tagger (including a morphological analyzer) for a target language using another, typologically related language. Their motivation is similar to ours: they also aim to develop taggers for resource-less languages. The method is described in Section 3.

3 Hana et al. (2004)

Hana et al. aim to develop a tagger for Russian using Czech. They use an HMM-based tagging model, and report that it works well even though Czech and Russian are free-word-order languages. An HMM tagger is based on two kinds of probabilities: transition and emission probabilities. Transition probabilities give the conditional probability of a tag for the current word given the tags of the previous words. Based on the intuition that transition probabilities remain similar across typologically related languages, they reuse the Czech transition probabilities for Russian. Emission probabilities give the conditional probability of a word given its tag, and can be derived from estimates of the probability of each tag for a given word. Since Hana et al. do not use a bilingual lexicon, they cannot reuse the Czech emission probabilities for Russian, and since they use no annotated Russian data, emission probabilities cannot be estimated directly either. To overcome this, they develop a light, paradigm-based (rule-based) lexicon of Russian that lists all possible tags (including morphological information) for a given word form. The distribution over a word form's possible tags is assumed to be uniform, and emission probabilities are calculated under this assumption.
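Under the uniform-tag assumption, emission probabilities could be derived by Bayes inversion using corpus word counts, roughly as sketched below. The lexicon entries and counts are invented for illustration; this is a simplified sketch, not Hana et al.'s actual estimation procedure.

```python
from collections import defaultdict

def emission_probs(lexicon, word_counts):
    """Estimate P(word | tag) from uniform P(tag | word) via Bayes inversion.

    lexicon:     dict mapping word -> set of possible tags (from the
                 paradigm-based lexicon)
    word_counts: dict mapping word -> frequency in a raw target corpus
    """
    joint = defaultdict(float)     # unnormalized P(word, tag)
    tag_mass = defaultdict(float)  # unnormalized P(tag)
    for word, tags in lexicon.items():
        p_tag_given_word = 1.0 / len(tags)  # uniform over the word's tags
        for tag in tags:
            joint[(word, tag)] = word_counts.get(word, 0) * p_tag_given_word
            tag_mass[tag] += joint[(word, tag)]
    # Normalize within each tag to obtain P(word | tag).
    return {(w, t): p / tag_mass[t]
            for (w, t), p in joint.items() if tag_mass[t] > 0}

# Toy example: "kota" is unambiguously a noun; "pada" is noun/verb ambiguous.
lexicon = {"kota": {"NOUN"}, "pada": {"VERB", "NOUN"}}
counts = {"kota": 6, "pada": 4}
em = emission_probs(lexicon, counts)
print(em[("kota", "NOUN")])  # 0.75
```

These emissions, combined with transition probabilities borrowed from the source language, are all an HMM decoder needs to tag target-language text.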
Apart from this, to prevent errors in the transition probabilities caused by differences between the languages, they remove patterns from the Czech training corpus that do not occur in Russian; they call this Russification. After Russification, the transition behaviour of the Czech model is expected to match Russian more closely, and their results show that it improves performance. Also, to limit errors and over-generation by the light Russian morphological analyzer, they apply certain filters, which are likewise found to improve performance. In addition, they train separate models for the main POS tag and for other morphological features such as gender, number, case and tense, and arrive at a final tag for each word through a voting scheme. Training separate models for each tag type also helps improve the accuracy of the tagger.
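The Russification-style filtering could be sketched roughly as follows, simplified here to dropping source-language training sentences that contain tag bigrams known to be absent from the target language. The forbidden patterns and toy sentences are hypothetical; in practice such patterns come from linguistic knowledge of the target language.

```python
def filter_training_corpus(tagged_sentences, forbidden_bigrams):
    """Keep only sentences whose tag bigrams all occur in the target language.

    tagged_sentences:  list of sentences, each a list of (word, tag) pairs
    forbidden_bigrams: set of (tag, tag) pairs absent from the target language
    """
    kept = []
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        bigrams = set(zip(tags, tags[1:]))
        if not (bigrams & forbidden_bigrams):  # no forbidden pattern present
            kept.append(sentence)
    return kept

# Toy Czech-like corpus; suppose AUX-ADJ sequences do not occur in the target.
corpus = [
    [("on", "PRON"), ("byl", "AUX"), ("rád", "ADJ")],   # contains AUX-ADJ
    [("to", "PRON"), ("je", "VERB"), ("dům", "NOUN")],
]
kept = filter_training_corpus(corpus, {("AUX", "ADJ")})
print(len(kept))  # 1 -- the first sentence was filtered out
```

Transition probabilities trained on the filtered corpus should then reflect only tag sequences plausible in the target language.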

4 Our Focus: Target and Source Languages

We aim to develop POS taggers for Dravidian languages like Kannada and Malayalam, using Telugu or Tamil as source languages. Dravidian languages are spoken by more than 200 million people, with Telugu, Tamil, Kannada and Malayalam spoken by roughly 75, 65, 35 and 33 million respectively (source: Wikipedia). Though these numbers are large, the resources for these languages are poor compared to the Indo-Aryan languages, the other major group of Indian languages. Even among the Dravidian languages, Kannada and Malayalam are resource-poor compared to Telugu and Tamil. Since these languages are highly morphologically rich, building resources for them is especially difficult. The majority of existing computational-linguistics research on Dravidian languages has focused on Telugu and Tamil, and as a result the resources for those two languages are notable compared to Kannada and Malayalam.

4.1 Kannada and Telugu

Telugu is known to be highly influenced by Kannada, making the two languages slightly mutually intelligible (Datta, 1998, pg. 1690). Until the 13th century both languages used the same script; the scripts later diverged, but close similarities can still be observed, and both belong to the same script family. To build a Kannada POS tagger, we aim to take advantage of Telugu resources.

4.2 Tamil and Malayalam

There are studies (Asher and Kumari, 1997) suggesting that Malayalam and Tamil both originated from Ancient Tamil, and some even consider Malayalam a dialect of Tamil. Both writing scripts belong to the same family. Tamil is widely spoken and resource-richer than Malayalam, so we aim to use Tamil resources to build a Malayalam tagger.

4.3 Tagset

All the Indian languages share similarities in morphological properties and syntactic behaviour; the main difference is the agglutinative behaviour of the Dravidian languages. Observing these similarities and differences, Bharati et al. (2006) proposed a common POS tagset for all Indian languages.
We aim to use this tagset.

4.4 Resources available

Some linguistic tools for Dravidian languages have been made available by an Indian Government initiative called Indian Language Machine Translation, where

many universities formed a consortium to develop linguistic resources for Indian languages (tools for 9 Indian languages: http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php). These tools include morphological analyzers, POS taggers, and transliteration tools for converting these languages to an ASCII encoding called the wx-format. All these POS taggers are built using the method of Avinesh and Karthik (2007), in which conditional random field (CRF) models are trained on manually annotated corpora. The training corpora for Kannada and Malayalam are very small compared to those for Telugu and Tamil. [Serge: We may have to use the already trained models of Telugu and Tamil, since the annotated corpora are not freely available.] We create annotated corpora for Telugu and Tamil by tagging large corpora with the existing tools; these corpora are then used to create models for Kannada and Malayalam respectively.

5 Our Method

Our POS tags carry the main part-of-speech information along with morphological information such as case, gender, number, tense and aspect. Our method is inspired by that of Hana et al. (2004). Our contributions lie in how the transition and emission probabilities of the target language are learned.

5.1 Estimating Transition Probabilities

The transition probabilities for Kannada and Malayalam are learned from Telugu and Tamil respectively. Our contribution here is to minimize the manual intervention needed to supply linguistic information about the target language, such as the combination of features (both morphological and POS-tag information) that is useful for tagging a target word. We aim to learn this automatically using decision trees; Schmid and Laws (2008) use decision trees to compute transition probabilities together with selection of the best features.

5.2 Estimating Emission Probabilities

Our major contribution will be in estimating the emission probabilities. Since the languages we deal with are slightly mutually intelligible, we try to exploit this.
We use different schemes for estimating emission probabilities.

5.2.1 Edit distance

Hana et al. (2004) assume a uniform distribution over the POS tags of the target language, without using any information from the source language. But source-language information can be exploited if a translation system exists. Since the languages we deal with are slightly mutually intelligible, we transliterate both the source and target languages to ASCII and then use similarities between

words in the source and target languages. The similarity between source-language and target-language words is measured using edit-distance methods.

5.2.2 Bilingual Lexicon

The emission probabilities of the source language are mapped to the target language using a bilingual lexicon. This lexicon can be a direct one, or pivot-based, where both the source and target languages have bilingual lexicons in common with a third language.

5.2.3 Uniform Distribution

Similar to Hana et al. (2004), we use a uniform distribution for a given word over all its possible tags.

5.2.4 Relative Frequencies

Rather than using a uniform distribution over all tags for a given word, we use the relative frequencies of the POS tags in the source language.

5.3 Further refinement

[Serge's idea:] An initial HMM tagger is built using the above transition and emission probabilities. To make the tagger more accurate, we tag a large corpus of the target language with this initial tagger and observe common error patterns in its output. Using simple regular expressions, we correct these errors to build near-gold-standard data for the target language, which we then use to train a new decision-tree-based HMM model. Our expectation is that this new tagger will be more accurate than the previous one, though we must take care not to lose the useful information learned from the source language.

6 Evaluation

[This is something we have not discussed yet.] We aim to create manual gold-standard data and compare our taggers' performance against it. We will also compare our models with existing taggers.

References

Asher, R. E. and Kumari, T. C. (1997). Malayalam.

Atkins, S. B. T. and Rundell, M. (2008). The Oxford Guide to Practical Lexicography. Oxford University Press, Oxford.

Avinesh, P. V. S. and Karthik, G. (2007). Part-of-speech tagging and chunking using conditional random fields and transformation-based learning. In Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages (SPSAL), pages 21-24.

Bharati, A., Sangal, R., Sharma, D. M., and Bai, L. (2006). AnnCorra: Annotating corpora guidelines for POS and chunk annotation for Indian languages. Technical Report TR-LTRC-31, LTRC, IIIT-Hyderabad.

Das, D. and Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL 2011.

Datta, A. (1998). The Encyclopaedia of Indian Literature, volume 2.

Feldman, A., Hana, J., and Brew, C. (2006). A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of LREC, pages 549-554.

Hana, J., Feldman, A., and Brew, C. (2004). A resource-light approach to Russian morphology: Tagging Russian using Czech resources. In Proceedings of EMNLP 2004, Barcelona, Spain.

Kilgarriff, A., Reddy, S., Pomikálek, J., and PVS, A. (2010). A corpus factory for many languages. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Naseem, T., Snyder, B., Eisenstein, J., and Barzilay, R. (2009). Multilingual part-of-speech tagging: Two unsupervised approaches. Journal of Artificial Intelligence Research (JAIR), 36:341-385.

Schmid, H. and Laws, F. (2008). Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 777-784, Stroudsburg, PA, USA. Association for Computational Linguistics.

Snyder, B., Naseem, T., Eisenstein, J., and Barzilay, R. (2008). Unsupervised multilingual learning for POS tagging.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 1041-1050, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yarowsky, D. and Ngai, G. (2001). Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL '01), pages 1-8, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yarowsky, D., Ngai, G., and Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research (HLT '01), pages 1-8, Stroudsburg, PA, USA. Association for Computational Linguistics.