Statistical Transliteration for Cross Language Information Retrieval using HMM alignment and CRF

Prasad Pingali, Surya Ganesh, Sree Harsha, Vasudeva Varma, IIIT, Hyderabad

Outline
Introduction, Transliteration method, Evaluation, Conclusion

Introduction
A CLIR system issues a query in one language and retrieves documents from all languages.
Issues in CLIR: out-of-vocabulary words (named entities etc.) are a common source of errors.

Cont.
Transliteration: the practice of transcribing a word or text written in one language into another language.
More than one valid transliteration is possible for a given word. Example: गौतम can be written Gautam, Gautham, Gowtham, or Gowtam.
Statistical methods can achieve this.

Previous work
"Hindi CLIR in Thirty Days" by Larkey, Connell, and Abdul Jaleel (2003).
"Statistical Transliteration for English-Arabic Cross Language Information Retrieval" by Nasreen Abdul Jaleel and Leah S. Larkey (2003).
Many other works on European and Asia-Pacific languages.

Problem description
Transliteration can be stated as a sequence labeling problem.
Source language word ~ observation sequence $x = x_1 x_2 \ldots x_n$
Target language word ~ label sequence $y = y_1 y_2 \ldots y_n$
$x_i$: character n-gram in the source language word
$y_i$: character n-gram in the target language word
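As a small illustration of this framing, the sketch below pairs each source symbol (observation) with a target symbol (label). The aligned pair is hand-made and hypothetical, not taken from the corpus, and the function name is ours.

```python
# Hypothetical hand-aligned pair (not taken from the corpus).
source = ["ग", "ौ", "त", "म"]      # observation sequence x_1 .. x_n
target = ["g", "au", "ta", "m"]    # label sequence y_1 .. y_n

def as_labeled_sequence(xs, ys):
    """Pair each observation with its label; both sequences must have length n."""
    if len(xs) != len(ys):
        raise ValueError("observation and label sequences must have the same length")
    return list(zip(xs, ys))

pairs = as_labeled_sequence(source, target)
# Concatenating the labels gives the target language word.
transliteration = "".join(label for _, label in pairs)
print(pairs, transliteration)
```

Reading off the label sequence left to right yields the transliterated word, which is why the task fits a sequence labeler.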

Cont.
The valid target language symbol ($y_i$) for a source language symbol ($x_i$) in the input may depend on the following:
- The source language symbol itself in the input word.
- The context (symbols) surrounding the source language symbol ($x_i$) in the input word.
- The context (symbols) surrounding the target language symbol ($y_i$) in the desired output word.

Transliteration method
The statistical model for transliteration is based on Hidden Markov Model (HMM) alignment and a Conditional Random Field (CRF). It is language independent.
Training: bilingual corpus -> HMM alignment -> character-aligned corpus -> CRF -> trained model
Transliteration: input word -> transliteration system -> target language words (n)

HMM alignment
An HMM alignment model, trained using GIZA++, gives the character-level (n-gram) alignment of source and target language words.
It maximizes the probability of the observed word pairs using the expectation-maximization algorithm.
The character-level (n-gram) alignments are then set to the maximum posterior predictions of the model.
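To give a feel for EM-based character alignment, here is a deliberately simplified sketch. It uses an IBM Model 1-style lexical model, not the HMM alignment model that GIZA++ trains, and the toy word pairs and function names are hypothetical.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """EM for Model 1-style probabilities t[(target_char, source_char)]."""
    tgt_vocab = {c for _, tgt in pairs for c in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for c_t in tgt:
                # E-step: posterior over which source character produced c_t.
                norm = sum(t[(c_t, c_s)] for c_s in src)
                for c_s in src:
                    p = t[(c_t, c_s)] / norm
                    count[(c_t, c_s)] += p
                    total[c_s] += p
        # M-step: renormalize the expected counts per source character.
        for (c_t, c_s), c in count.items():
            t[(c_t, c_s)] = c / total[c_s]
    return t

def align(src, tgt, t):
    """For each target character, pick the most probable source position."""
    return [max(range(len(src)), key=lambda i: t[(c_t, src[i])]) for c_t in tgt]

# Toy romanized data standing in for a bilingual corpus.
toy_pairs = [("ab", "xy"), ("a", "x"), ("b", "y")]
t = train_model1(toy_pairs)
print(align("ab", "xy", t))
```

The HMM model additionally conditions each alignment position on the previous one, which this sketch leaves out.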

CRF
A probabilistic framework for labeling and segmenting sequential data.
A discriminative, undirected graphical model in which each vertex represents a random variable whose distribution is to be inferred and each edge represents a dependency between two random variables.
It defines a conditional distribution over a label sequence given an observation sequence.

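Cont.
The probability of a target language word Y given a source language word X is given by
$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i) \right)$$
where $Z(X)$ is a normalizing factor, each $f_k$ is either a state function or a transition function, and $\lambda_k$ are the parameters to be estimated.
The parameters of the CRF are usually estimated from fully observed training data. Maximum likelihood training chooses the parameter values that maximize the likelihood of the training data.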
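The conditional probability can be checked by brute force on a toy example: exponentiate the weighted feature sum for one label sequence and divide by the same quantity summed over all label sequences. The weights, labels, and start symbol below are hypothetical, and real implementations compute the normalizing factor with dynamic programming rather than enumeration.

```python
import math
from itertools import product

def score(xs, ys, weights):
    """Weighted sum of state and transition indicator features along the sequence."""
    total = 0.0
    prev = "B"  # hypothetical start label
    for x, y in zip(xs, ys):
        total += weights.get(("state", x, y), 0.0)     # state function
        total += weights.get(("trans", prev, y), 0.0)  # transition function
        prev = y
    return total

def crf_probability(xs, ys, labels, weights):
    """P(Y | X) = exp(score(X, Y)) / Z(X), with Z(X) enumerated explicitly."""
    z = sum(math.exp(score(xs, cand, weights))
            for cand in product(labels, repeat=len(xs)))
    return math.exp(score(xs, ys, weights)) / z

# Hypothetical weights for a two-symbol toy problem.
weights = {("state", "a", "x"): 2.0, ("state", "b", "y"): 2.0, ("trans", "x", "y"): 1.0}
labels = ["x", "y"]
p_best = crf_probability(["a", "b"], ("x", "y"), labels, weights)
p_all = sum(crf_probability(["a", "b"], cand, labels, weights)
            for cand in product(labels, repeat=2))
print(p_best, p_all)
```

Enumeration grows exponentially with word length, so it is only useful for verifying small cases.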

Algorithm
Three important phases: two are offline and one is online.
Offline phases: preprocessing the bilingual corpus and training the model.
Online phase: takes the input word and generates the transliterations.

Algorithm Cont.
Preprocessing
The words in the bilingual corpus are prefixed with the symbol B and suffixed with the symbol E.
The words are segmented into unigrams (lexemes) and aligned using GIZA++.
Feature pruning: the instances where a sequence of target language characters is aligned to a single source language character, and vice versa, are counted, and the 50 most frequent instances are added to the symbol inventory.
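The two preprocessing steps above can be sketched as follows. The representation of alignments as (source unit, target unit) string pairs is an assumption made for illustration, as are the function names; the actual GIZA++ output needs to be parsed into such pairs first.

```python
from collections import Counter

def segment(word):
    """Prefix with B, suffix with E, and split into unigram symbols.
    Note: this splits by Unicode code point, so combining marks become separate symbols."""
    return ["B", *word, "E"]

def frequent_multichar_units(aligned_units, top_k=50):
    """Count aligned units where either side spans more than one character
    and return the top_k most frequent ones."""
    counts = Counter(
        (src, tgt) for src, tgt in aligned_units if len(src) > 1 or len(tgt) > 1
    )
    return [unit for unit, _ in counts.most_common(top_k)]

# Hypothetical romanized alignments standing in for parsed GIZA++ output.
units = [("k", "kh"), ("k", "kh"), ("a", "a"), ("sh", "s")]
print(segment("ram"))
print(frequent_multichar_units(units, top_k=1))
```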

Algorithm Cont.
Preprocessing
The source and target language words are re-segmented based on the new symbol inventory and aligned again using GIZA++.
The alignment file from the GIZA++ output is used to generate the training file required for CRF++. CRF++ is a simple, customizable, open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data. Ref: http://crfpp.sourceforge.net/
Each source language n-gram aligned to a target language n-gram is called a token and is represented on one line.
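A minimal sketch of writing the training file, assuming the alignments are already available as lists of (source n-gram, target n-gram) tokens per word. CRF++ expects one token per line with whitespace-separated columns, the label in the last column, and a blank line between sequences; the function name and toy data are hypothetical.

```python
def to_crfpp_training(aligned_words):
    """Render aligned words as CRF++ training text: one 'source<TAB>target'
    token per line, with a blank line between words."""
    blocks = []
    for tokens in aligned_words:
        blocks.append("\n".join(f"{src}\t{tgt}" for src, tgt in tokens))
    return "\n\n".join(blocks) + "\n"

# Hypothetical romanized alignments with B/E boundary symbols.
aligned = [
    [("B", "B"), ("a", "x"), ("E", "E")],
    [("B", "B"), ("b", "y"), ("E", "E")],
]
training_text = to_crfpp_training(aligned)
print(training_text)
```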

Algorithm Cont.
Training phase
Training requires a template file that specifies the features to be selected by the model.
The source language lexemes are used as features with a window length of 5.
Training uses the limited-memory Broyden-Fletcher-Goldfarb-Shanno method (L-BFGS), a quasi-Newton algorithm for large-scale numerical optimization problems.
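One plausible reading of "window length of 5" is a centered window covering the two symbols before and after the current one. Under that assumption, the sketch below generates unigram templates in the CRF++ template syntax, where %x[row,col] refers to a relative row and a column of the data file. The exact template used by the authors is not given in the slides.

```python
def unigram_template(window=5):
    """Build CRF++ unigram feature templates over a centered window of source symbols."""
    half = window // 2
    lines = [
        f"U{i:02d}:%x[{offset},0]"
        for i, offset in enumerate(range(-half, half + 1))
    ]
    return "\n".join(lines) + "\n"

template_text = unigram_template()
print(template_text)
```

With the template and training file saved to disk, CRF++ training is typically invoked as `crf_learn template_file train_file model_file`.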

Algorithm Cont.
Testing phase (Transliterator)
The input words that need to be transliterated are converted to the CRF++ test file format.
The trained model is used to generate the top n most probable target language words.
CRF++ uses forward Viterbi and backward A* search to produce exact n-best results.
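A sketch of converting an input word into one-symbol-per-line form. Greedy longest-match segmentation against the symbol inventory is our assumption; the slides do not say how input words are re-segmented. Whether a placeholder label column is also needed depends on the CRF++ setup and is not covered here.

```python
def to_crfpp_test(word, inventory=()):
    """Add B/E markers and segment the word greedily, preferring the longest
    matching multi-character symbol from the inventory (an assumed strategy)."""
    units = sorted(inventory, key=len, reverse=True)
    symbols, i = ["B"], 0
    while i < len(word):
        # Fall back to a single character when no inventory symbol matches.
        match = next((u for u in units if word.startswith(u, i)), word[i])
        symbols.append(match)
        i += len(match)
    symbols.append("E")
    return "\n".join(symbols) + "\n"

test_text = to_crfpp_test("shah", inventory=["sh"])
print(test_text)
```

With the test file on disk, n-best output can be requested from CRF++ with the `-n` option, for example `crf_test -n 20 -m model_file test_file`.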

Evaluation
We evaluate our transliteration system in two ways: by transliteration accuracy and by CLIR (Cross-Lingual Information Retrieval) performance.
We compare our model with a model developed using HMM only (Jaleel et al., 2003).

Evaluation Cont.
Transliteration accuracy
The models were trained on 30,000 words and tested on 1,000 words containing Indian city names, family names, and first and last names of persons.
We evaluated the models on both in-corpus and out-of-corpus words. The out-of-corpus words contain Indian and foreign city names and person names.
We evaluated the models considering the top 5, 10, 15, 20, and 25 transliterations.

Evaluation Cont.
Transliteration accuracy
Accuracy = (C/N)*100
C: number of test words whose correct transliteration appeared among the desired number (5, 10, 15, 20, 25) of transliterations.
N: total number of test words.
Transliteration accuracy (%) on training data
Model      Top 5  Top 10  Top 15  Top 20  Top 25
HMM        74.2   78.7    81.1    82.1    83.0
HMM & CRF  76.5   83.6    86.5    88.9    89.7
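The accuracy formula above can be sketched as a short function. The candidate lists and reference transliterations below are made up for illustration and are not from the evaluation data.

```python
def top_n_accuracy(candidates, references, n):
    """Accuracy = (C / N) * 100, where C counts test words whose reference
    transliteration appears among the first n candidates."""
    correct = sum(
        1 for cands, ref in zip(candidates, references) if ref in cands[:n]
    )
    return 100.0 * correct / len(references)

# Hypothetical ranked system outputs and gold transliterations.
candidates = [["gautam", "gowtham"], ["ram", "raam"], ["sita", "seeta"]]
references = ["gowtham", "ram", "seetha"]
print(top_n_accuracy(candidates, references, 1))
print(top_n_accuracy(candidates, references, 2))
```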

Evaluation Cont.
Transliteration accuracy (%) on testing data
Model      Top 5  Top 10  Top 15  Top 20  Top 25
HMM        69.3   74.3    77.8    80.5    81.3
HMM & CRF  72.1   79.9    83.5    85.6    86.5

Evaluation Cont.
CLIR evaluation
We tested the systems on the CLEF 2007 documents and 50 topics.
Only the topics that contained named entities, 13 in number, were used for evaluation.

Evaluation Cont.
CLIR evaluation
We developed a basic CLIR system that performs the following steps:
1. Tokenizes the Hindi query and removes the stop words.
2. Performs query translation: each Hindi word is looked up in a Hindi-English dictionary, and all English meanings of the word are added to the translated query. For words not found in the dictionary, the top 20 transliterations generated by one of the systems are added to the query.
3. Retrieves relevant documents by running the translated query against the CLEF documents using Lucene.
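The query translation step can be sketched as below. The stop-word list, dictionary, and transliteration function are hypothetical stand-ins (romanized for readability), and the retrieval step with Lucene is not shown.

```python
def translate_query(tokens, stop_words, dictionary, transliterate, top_n=20):
    """Build the translated query: dictionary meanings where available,
    otherwise the top_n transliterations of the out-of-vocabulary word."""
    translated = []
    for token in tokens:
        if token in stop_words:
            continue  # stop words are dropped before translation
        if token in dictionary:
            translated.extend(dictionary[token])
        else:
            translated.extend(transliterate(token)[:top_n])
    return translated

# Hypothetical resources for a toy query.
stop_words = {"ka"}
dictionary = {"shahar": ["city", "town"]}
fake_transliterate = lambda word: ["gautam", "gowtham", "gautham"]
query = translate_query(["gautam", "ka", "shahar"], stop_words, dictionary,
                        fake_transliterate, top_n=2)
print(query)
```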

Evaluation Cont.
We present standard IR evaluation metrics such as P10, bpref, and mean average precision (MAP) in the table below for the two systems.
Model      P10     tot_rel  tot_rel_ret  MAP     bpref
HMM        0.3308  13000    3493         0.1347  0.2687
HMM & CRF  0.4154  13000    3687         0.1499  0.2836
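For reference, two of these metrics can be computed as sketched here for a single query; MAP is the mean of average precision over all queries. The ranked list and relevance judgments are invented for illustration, and bpref is omitted.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where relevant documents occur,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranking and judgments for one query.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
print(precision_at_k(ranked, relevant, k=2))
print(average_precision(ranked, relevant))
```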

Microsoft Dataset
MSR Transliteration Data: results for three language pairs (Hindi-English, Tamil-English, Arabic-English).
Training: 3,000 words; testing: 300 words.
Results
Language  Top 1  Top 5  Top 10  Top 20
Hindi     0.600  0.823  0.897   0.920
Tamil     0.343  0.603  0.690   0.770
Arabic    0.347  0.593  0.703   0.760

Conclusion
We demonstrated a statistical transliteration system for CLIR using an HMM alignment model and a CRF.
Important observations:
- The quality of transliteration seems to be very good when going from a larger alphabet to a smaller alphabet and vice versa.
- The difference between the accuracies for top n and top n-5 (n > 5) decreases as n increases.
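The second observation can be checked directly against the testing-data accuracies reported earlier. The sketch below computes the gain from each cutoff to the next; the helper name is ours.

```python
cutoffs = [5, 10, 15, 20, 25]
# Transliteration accuracy (%) on testing data, as reported in the evaluation.
test_accuracy = {
    "HMM": [69.3, 74.3, 77.8, 80.5, 81.3],
    "HMM & CRF": [72.1, 79.9, 83.5, 85.6, 86.5],
}

def gains(values):
    """Accuracy gained when moving from top n-5 to top n, for each n > 5."""
    return [round(b - a, 1) for a, b in zip(values, values[1:])]

for model, values in test_accuracy.items():
    print(model, dict(zip(cutoffs[1:], gains(values))))
```

For both models the gains shrink at every step, so returning more than about 20 candidates adds little.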

Thank You Questions?