Cross-Language POS Taggers for Resource-Poor Languages

April 22, 2011

1 Introduction

A POS tagger is one of the basic requirements for advancing linguistic research on any language, yet many languages lack one, for varying reasons. One reason is the lack of other basic resources such as corpora, lexicons or morphological analyzers. With the advent of the Web, corpora are no longer a major problem (see Kilgarriff et al., 2010), and with technical advances in lexicography (Atkins and Rundell, 2008), lexicon building and morphological analysis have been addressed well enough that later stages of research can be bootstrapped. The other reason is that few or no research groups work on a particular language, so no annotated data exists from which to build efficient taggers. This problem can be addressed if research and resources for a resource-rich language (the source language) can be reused for a resource-poor language (the target language); if the two languages are typologically related, efficient taggers can be built. In this work, we aim to build POS taggers for resource-poor languages by exploiting the resources of their typologically related resource-rich languages.

2 Related Work

There are many existing methods which build POS taggers for a target language without using any of its annotated data. Yarowsky et al. (2001), Yarowsky and Ngai (2001) and Das and Petrov (2011) build POS taggers for a target language using a parallel corpus of the target language and a source language, where the source language is expected to have a POS tagger. First, the source-language tools annotate the source side of the parallel corpus. These annotations are then projected to the target side using the word alignments in the parallel corpus. In this way an annotated corpus is built for the target language, from which POS taggers are trained.
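The projection step can be sketched in a few lines. This is a minimal illustration that assumes word alignments are already computed (e.g. by an off-the-shelf aligner); the list-of-pairs alignment format, the last-tag-wins tie-breaking and the `UNK` default are simplifying assumptions, not part of the cited methods:

```python
def project_tags(src_tags, n_target, alignments, default='UNK'):
    """Project POS tags from a tagged source sentence onto its
    (untagged) aligned target sentence.

    src_tags   : list of tags for the source tokens
    n_target   : number of target tokens
    alignments : iterable of (src_index, tgt_index) word-alignment pairs

    A target word aligned to several source words keeps the last tag
    seen; unaligned target words receive `default`.
    """
    projected = [default] * n_target
    for s, t in alignments:
        projected[t] = src_tags[s]
    return projected

# Source "the dog barks" tagged DET NN VB, monotonically aligned
# to a three-token target sentence:
tags = project_tags(['DET', 'NN', 'VB'], 3, [(0, 0), (1, 1), (2, 2)])
# tags == ['DET', 'NN', 'VB']
```

In practice the projected corpus is noisy, which is why Yarowsky et al. apply robust re-estimation on top of the raw projections.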
Other methods which make use of parallel corpora include those of Snyder et al. (2008) and Naseem et al. (2009). They use unsupervised approaches based on hierarchical
Bayesian models, with Markov chain Monte Carlo sampling for inference, gaining from information shared across languages. The main disadvantage of the above methods is that they rely heavily on parallel corpora, which are themselves a costly resource for resource-poor languages.

Hana et al. (2004); Feldman et al. (2006) propose a method for developing a POS tagger (including a morphological analyzer) for a target language using another, typologically related language. Their motivation is similar to ours: they aim to develop taggers for resource-less languages. The method is described in Section 3.

3 Hana et al. (2004)

Hana et al. aim to develop a tagger for Russian using Czech. They use an HMM-based tagging model; even though Czech and Russian are free-word-order languages, they report that an HMM tagger works well. An HMM tagger is based on two probabilities: transition and emission. Transition probabilities give the conditional probability of a tag for the current word given the tags of the previous words. Based on the intuition that transition probabilities remain stable across typologically related languages, they treat the transition probabilities of Russian as identical to those of Czech. Emission probabilities give the conditional probability of a word given its tag. Since Hana et al. do not use a bilingual lexicon, they cannot reuse the Czech emission probabilities for Russian; and since they do not use annotated Russian data, emission probabilities cannot be estimated directly either. To overcome this, they develop a light, paradigm-based (rule-set) lexicon of Russian which lists all the possible tags (including morphological information) for a given word form. The distribution over a word form's possible tags is assumed to be uniform, and emission probabilities are calculated under this assumption.
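The uniform-tag assumption can be turned into emission estimates by spreading fractional counts over a raw, untagged target corpus. A minimal sketch, where `possible_tags` stands in for the light paradigm-based lexicon and the toy lexicon is purely illustrative:

```python
from collections import defaultdict

def uniform_emissions(corpus_tokens, possible_tags):
    """Estimate P(word | tag) without any tagged data.

    Each token contributes a fractional count of 1/|tags(w)| to every
    tag its lexicon allows, i.e. a uniform P(tag | word); normalizing
    the fractional counts per tag yields emission probabilities.
    """
    count = defaultdict(float)      # (tag, word) -> fractional count
    tag_total = defaultdict(float)  # tag -> total fractional mass
    for w in corpus_tokens:
        tags = possible_tags(w) or []
        if not tags:
            continue                # word unknown to the lexicon
        for t in tags:
            count[(t, w)] += 1.0 / len(tags)
            tag_total[t] += 1.0 / len(tags)
    return {tw: c / tag_total[tw[0]] for tw, c in count.items()}

# Toy lexicon: 'fish' is ambiguous between noun and verb.
lexicon = {'the': ['DET'], 'fish': ['NN', 'VB'], 'swims': ['VB']}
emit = uniform_emissions(['the', 'fish', 'swims'], lexicon.get)
```

Here `emit[('VB', 'fish')]` comes out lower than `emit[('VB', 'swims')]` because the verb mass for `fish` is split with the noun reading.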
Apart from this, to prevent errors in the transition probabilities due to differences between the languages, they remove patterns from the Czech training corpus which do not occur in Russian. They call this Russification. After Russification, the transition behaviour of Czech is expected to match that of Russian, and their results show that Russification improves performance. To prevent errors and over-generation by the light Russian morphological analyzer, they also apply certain filters, which are likewise found to improve performance. In addition, they train separate models for the main POS tag and for other morphological features such as gender, number, case and tense, and arrive at a final tag for each word through a voting scheme. Training separate models for each tag type also helps improve the accuracy of the tagger.
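Russification can be approximated as a corpus filter. The sketch below drops source-language training sentences containing tag bigrams that linguistic knowledge says cannot occur in the target; the bigram blocklist and the drop-the-sentence strategy are simplifying assumptions (the actual procedure may transform material rather than discard it):

```python
def russify(tagged_sents, forbidden_bigrams):
    """Keep only source-language sentences whose tag-bigram sequences
    are plausible in the target language.

    tagged_sents      : list of sentences, each a list of (word, tag)
    forbidden_bigrams : set of (tag, tag) pairs assumed absent in the
                        target language (hand-specified, illustrative)
    """
    kept = []
    for sent in tagged_sents:
        tags = [t for _, t in sent]
        bad = any((a, b) in forbidden_bigrams
                  for a, b in zip(tags, tags[1:]))
        if not bad:
            kept.append(sent)
    return kept
```

Transition probabilities estimated from the filtered corpus then reflect only patterns shared by both languages.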
4 Our Focus: Target and Source Languages

We aim to develop POS taggers for Dravidian languages like Kannada and Malayalam using Telugu or Tamil as source languages. Dravidian languages are spoken by more than 200 million people, with Telugu, Tamil, Kannada and Malayalam spoken by 75, 65, 35 and 33 million respectively (source: Wikipedia). Though the numbers are huge, the resources for these languages are relatively poor compared to those of the Indo-Aryan languages, the other major group of Indian languages. Even among the Dravidian languages, Kannada and Malayalam are resource-poor compared to Telugu and Tamil. Since these languages are morphologically very rich, they pose extra difficulty for resource building. The majority of existing computational linguistics research on Dravidian languages has focused on Telugu and Tamil, as a result of which the resources for these two languages are notable compared to those for Kannada and Malayalam.

4.1 Kannada and Telugu

Telugu is known to be highly influenced by Kannada, making the two languages slightly mutually intelligible (Datta, 1998, pg. 1690). Until the 13th century, both languages had the same script. The scripts later diverged, but close similarities can still be observed, and both belong to the same script family. To build a Kannada POS tagger, we aim to take advantage of Telugu resources.

4.2 Tamil and Malayalam

There are studies (Asher and Kumari, 1997) which say that Malayalam and Tamil originated from Ancient Tamil, and some say that Malayalam is a dialect of Tamil. Both writing scripts belong to the same family. Tamil is widely spoken and is richer in resources than Malayalam, so we aim to use Tamil resources to build a Malayalam tagger.

4.3 Tagset

All the Indian languages have similarities in morphological properties and syntactic behaviour; the main difference is the agglutinative behaviour of the Dravidian languages. Observing these similarities and differences among Indian languages, Bharati et al. (2006) proposed a common POS tagset for all Indian languages.
We aim to use this tagset.

4.4 Resources Available

Some linguistic tools for Dravidian languages have been made available through the Indian Government initiative called Indian Language Machine Translation, where
many universities formed a consortium to develop linguistic resources for Indian languages [1]. These tools include morphological analyzers, POS taggers, and transliteration tools for converting these languages to an ASCII encoding called wx-format. All these POS taggers are built using the method of Avinesh and Karthik (2007), in which conditional random field (CRF) models are trained on manually annotated corpora. The training corpora for Kannada and Malayalam are very small compared to those for Telugu and Tamil. [Serge: We may have to use the already trained models of Telugu and Tamil, since the annotated corpora are not freely available.] We create annotated corpora of Telugu and Tamil by tagging large corpora using the existing tools. These corpora are later used to create models for Kannada and Malayalam respectively.

5 Our Method

Our POS tags carry the main part-of-speech information along with morphological information such as case, gender, number, tense and aspect. Our method is inspired by that of Hana et al. (2004). Our contributions lie in how the transition and emission probabilities of the target language are learned.

5.1 Estimating Transition Probabilities

The transition probabilities of Kannada and Malayalam are learned from Telugu and Tamil respectively. Our contribution here is to minimize manual intervention in providing linguistic information about the target language, such as the combination of features (both morphological and POS information) that are useful for tagging a target word. We aim to learn this automatically using decision trees; Schmid and Laws (2008) use decision trees to compute transition probabilities together with feature selection.

5.2 Estimating Emission Probabilities

Our major contribution will be in estimating the emission probabilities. Since the languages we deal with are slightly mutually intelligible, we try to exploit this.
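One way to exploit this similarity is simple string matching between transliterated word forms. A minimal Levenshtein edit-distance sketch, assuming both languages are already transliterated to a common romanization; the `closest_source_word` cognate lookup is an illustrative strategy, not our final scheme:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, via the standard
    dynamic program kept to two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest_source_word(target_word, source_words):
    """Return the source word most similar to the target word; its
    source-side tag distribution can then serve as a proxy for the
    unseen target word's emission probabilities."""
    return min(source_words, key=lambda s: edit_distance(target_word, s))
```

In practice a length-normalized distance with a cutoff would be needed so that accidental near-matches between unrelated words do not inject noise.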
We use different schemes for estimating emission probabilities.

5.2.1 Edit Distance

Hana et al. (2004) assume a uniform distribution over POS tags in the target language, without using any information from the source language. But one can make use of source-language information if a translation mechanism exists. Since the languages we deal with are slightly mutually intelligible, we transliterate both the source and target languages to ASCII and then use similarities between words in the source and target. Similarity between source-language and target-language words is measured using edit-distance methods.

5.2.2 Bilingual Lexicon

The emission probabilities of the source language are converted to the target language using a bilingual lexicon. This bilingual lexicon can be a direct one, or pivot-language based, where the source and target each have a bilingual lexicon in common with a third language.

5.2.3 Uniform Distribution

Similar to Hana et al. (2004), we use a uniform tag distribution for a given word over all its possible tags.

5.2.4 Relative Frequencies

Rather than using a uniform distribution over all tags for a given word, we use the relative frequencies of the POS tags of the source language.

5.3 Further Refinement

[Serge's idea:] An initial HMM tagger is built using the above transition and emission probabilities. To make the tagger more accurate, we tag a large corpus of the target language using this initial tagger. We then observe common error patterns in the tagging of the target language. Using simple regular expressions, we correct these errors to build near-gold-standard data for the target language. We then use this data to train a new decision-tree-based HMM model. Our expectation is that this new tagger will be more accurate than the previous one, but we must take care not to lose major useful information learned from the source language.

6 Evaluation

[This is something we have not discussed yet.] We aim to use manually created gold-standard data and compare our taggers' performance against it. We also compare our models with existing taggers.

[1] Tools for 9 Indian languages: http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php

References

Asher, R. E. and Kumari, T. C. (1997). Malayalam.

Atkins, S. B. T. and Rundell, M. (2008). The Oxford Guide to Practical Lexicography. Oxford University Press, Oxford.
Avinesh, P. V. S. and Karthik, G. (2007). Part-of-speech tagging and chunking using conditional random fields and transformation-based learning. In Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages (SPSAL), pages 21–24.

Bharati, A., Sangal, R., Sharma, D. M., and Bai, L. (2006). AnnCorra: Annotating corpora guidelines for POS and chunk annotation for Indian languages. Technical Report TR-LTRC-31, LTRC, IIIT Hyderabad.

Das, D. and Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL 2011.

Datta, A. (1998). The Encyclopaedia of Indian Literature, volume 2.

Feldman, A., Hana, J., and Brew, C. (2006). A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of LREC, pages 549–554.

Hana, J., Feldman, A., and Brew, C. (2004). A resource-light approach to Russian morphology: Tagging Russian using Czech resources. In Proceedings of EMNLP 2004, Barcelona, Spain.

Kilgarriff, A., Reddy, S., Pomikálek, J., and PVS, A. (2010). A corpus factory for many languages. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).

Naseem, T., Snyder, B., Eisenstein, J., and Barzilay, R. (2009). Multilingual part-of-speech tagging: Two unsupervised approaches. Journal of Artificial Intelligence Research (JAIR), 36:341–385.

Schmid, H. and Laws, F. (2008). Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING '08), pages 777–784, Stroudsburg, PA, USA. Association for Computational Linguistics.

Snyder, B., Naseem, T., Eisenstein, J., and Barzilay, R. (2008). Unsupervised multilingual learning for POS tagging.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 1041–1050, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yarowsky, D. and Ngai, G. (2001). Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL '01), pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.
Yarowsky, D., Ngai, G., and Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research (HLT '01), pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.