Statistical Transliteration for Cross Language Information Retrieval using HMM alignment and CRF


Statistical Transliteration for Cross Language Information Retrieval using HMM alignment and CRF
Prasad Pingali, Surya Ganesh, Sree Harsha, Vasudeva Varma
IIIT, Hyderabad

Outline
- Introduction
- Transliteration method
- Evaluation
- Conclusion

Introduction
CLIR system
- The user issues a query in one language.
- The system retrieves documents from all languages.
Issues in CLIR
- Out-of-vocabulary words (named entities etc.) are a common source of errors.

Introduction (cont.)
Transliteration
- The practice of transcribing a word or text written in one language into another language.
- More than one valid transliteration is possible for a given word. Example: गौतम -> Gautam, Gautham, Gowtham, Gowtam.
- Statistical methods can produce such multiple transliterations.

Previous work
- "Hindi CLIR in Thirty Days" by Larkey, Connell, and Abdul Jaleel (2003).
- "Statistical Transliteration for English-Arabic Cross Language Information Retrieval" by Nasreen Abdul Jaleel and Leah S. Larkey (2003).
- Many other works on European and Asia-Pacific languages.

Problem description
Transliteration can be stated as a sequence labeling problem.
- Source language word ~ observation sequence
- Target language word ~ label sequence
- $x_i$: character n-gram in the source language word
- $y_i$: character n-gram in the target language word

$$x = x_1 x_2 \ldots x_n \qquad y = y_1 y_2 \ldots y_n$$
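The sequence-labeling view can be illustrated with a small sketch. The alignment below is hand-made and hypothetical, chosen only to show how source units act as observations and target units as labels; in the described system the alignment is produced by GIZA++, not by a table like this.

```python
# Hypothetical, hand-made alignment for illustration only.
source_units = ["ग", "ौ", "त", "म"]    # observation sequence x_1 .. x_n
target_units = ["G", "au", "ta", "m"]  # label sequence y_1 .. y_n


def join_labels(labels):
    """Concatenate the per-unit labels into the target-language word."""
    return "".join(labels)


# One label per observation: the two sequences have the same length.
assert len(source_units) == len(target_units)

for x_i, y_i in zip(source_units, target_units):
    print(f"{x_i} -> {y_i}")
print(join_labels(target_units))
```

Each source unit receives exactly one label, which is what lets a sequence labeler such as a CRF be applied to transliteration.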

Problem description (cont.)
The valid target language character ($y_i$) for a source language character ($x_i$) in the input may depend on the following:
- The source language character itself in the input word.
- The context (characters) surrounding the source language character ($x_i$) in the input word.
- The context (characters) surrounding the target language character ($y_i$) in the desired output word.

Transliteration method
- The statistical model for transliteration is based on Hidden Markov Model (HMM) alignment and a Conditional Random Field (CRF).
- The method is language independent.
Training pipeline: bilingual corpus -> HMM alignment -> character-aligned corpus -> CRF -> trained model
Transliteration: input word -> transliteration system -> target language words (n)

HMM alignment model using GIZA++
- Used to obtain character-level (n-gram) alignments of source and target language words.
- Maximizes the probability of the observed word pairs using the expectation-maximization algorithm.
- The character-level (n-gram) alignments are set to the maximum posterior predictions of the model.

CRF
- A probabilistic framework for labeling and segmenting sequential data.
- A discriminative, undirected graphical model in which
  - each vertex represents a random variable whose distribution is to be inferred;
  - each edge represents a dependency between two random variables.
- Defines a conditional distribution over a label sequence given an observation sequence.

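CRF (cont.)
The probability of a target language word $Y = y_1 \ldots y_n$ given a source language word $X$ is given by

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, X, i) \Big)$$

where
- $Z(X)$ is a normalizing factor,
- each $f_j$ is either a state function or a transition function,
- $\lambda_j$ is the parameter to be estimated.
The parameters of the CRF are usually estimated from fully observed training data. Maximum likelihood training chooses the parameter values that maximize the log-likelihood of the training data.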
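To make the conditional probability concrete, here is a minimal sketch of a linear-chain CRF score with the normalizer computed by brute-force enumeration. The label set, feature keys, and weights are hypothetical toy values, and real toolkits compute the normalizer with dynamic programming rather than enumeration.

```python
import itertools
import math

LABELS = ["a", "b"]  # hypothetical toy label set


def score(x, y, weights):
    """Weighted sum of state and transition features for one labeling."""
    total = 0.0
    for i, (x_i, y_i) in enumerate(zip(x, y)):
        # State feature: source unit x_i paired with label y_i.
        total += weights.get(("state", x_i, y_i), 0.0)
        if i > 0:
            # Transition feature: previous label followed by current label.
            total += weights.get(("trans", y[i - 1], y_i), 0.0)
    return total


def conditional_probability(x, y, weights, labels=LABELS):
    """P(y | x) = exp(score(x, y)) / Z(x), with Z(x) summed over all labelings."""
    z = sum(
        math.exp(score(x, candidate, weights))
        for candidate in itertools.product(labels, repeat=len(x))
    )
    return math.exp(score(x, tuple(y), weights)) / z


toy_weights = {("state", "p", "a"): 1.5, ("trans", "a", "b"): 0.7}
print(conditional_probability(["p", "q"], ["a", "b"], toy_weights))
```

Enumeration is exponential in the word length, so this only serves to show what the normalizing factor sums over.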

Algorithm
The algorithm has three important phases: two offline and one online.
- Offline phases: preprocessing the bilingual corpus and training the model.
- Online phase: takes the input word and generates the transliterations.

Algorithm (cont.): Preprocessing
- The words in the bilingual corpus are prefixed with the symbol B and suffixed with the symbol E.
- The words are segmented into unigrams (lexemes) and aligned using GIZA++.
- Feature pruning: instances where a sequence of target language characters is aligned to a single source language character, and vice versa, are counted, and the 50 most frequent instances are added to the symbol inventory.
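The preprocessing steps above can be sketched as follows. The function names and the `(source_unit, target_unit)` pair format are hypothetical simplifications for illustration; they do not reflect the actual GIZA++ output format, which would need its own parser.

```python
from collections import Counter


def mark_and_segment(word):
    """Prefix B, suffix E, and split the word into single-character units."""
    return ["B"] + list(word) + ["E"]


def frequent_multichar_units(aligned_pairs, top_k=50):
    """Return the top_k most frequent aligned units longer than one character.

    aligned_pairs: iterable of (source_unit, target_unit) strings, assumed to
    have been extracted from the aligner output beforehand.
    """
    counts = Counter()
    for src, tgt in aligned_pairs:
        if len(src) > 1:
            counts[src] += 1
        if len(tgt) > 1:
            counts[tgt] += 1
    return [unit for unit, _ in counts.most_common(top_k)]


print(mark_and_segment("delhi"))
print(frequent_multichar_units([("x", "th"), ("y", "th"), ("z", "sh")], top_k=1))
```

The returned units would then be added to the symbol inventory before the words are segmented again.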

Algorithm (cont.): Preprocessing
- The source and target language words are re-segmented based on the new symbol inventory and aligned again using GIZA++.
- The alignment file from the GIZA++ output is used to generate the training file required by CRF++.
- CRF++ is a simple, customizable, open-source implementation of Conditional Random Fields (CRFs) for segmenting and labeling sequential data. Ref: http://crfpp.sourceforge.net/
- Each source language n-gram aligned to a target language n-gram is called a token and is represented on one line.
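A minimal sketch of writing one token per line in the column layout CRF++ reads (whitespace-separated columns, label in the last column, a blank line between sequences). The input structure and function name are hypothetical; the slides do not show the authors' actual conversion script.

```python
def to_crfpp_training(aligned_words):
    """Serialize aligned words as CRF++ training text.

    aligned_words: list of words, each a list of (source_unit, target_unit)
    pairs. Units must not contain whitespace, since CRF++ splits columns on it.
    """
    lines = []
    for word in aligned_words:
        for src, tgt in word:
            lines.append(f"{src}\t{tgt}")  # column 0: observation, last column: label
        lines.append("")  # an empty line ends the sequence
    return "\n".join(lines) + "\n"


example = [[("B", "B"), ("a", "x"), ("E", "E")]]
print(to_crfpp_training(example), end="")
```

For testing, the same layout would be written without the final label column.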

Algorithm (cont.): Training phase
- Requires a template file that specifies the features to be selected by the model.
- The source language lexemes are used as features with a window length of 5.
- Training uses the limited-memory Broyden-Fletcher-Goldfarb-Shanno method (L-BFGS), a quasi-Newton algorithm for large-scale numerical optimization problems.
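As an illustration of what such a template file might contain, the sketch below generates CRF++ unigram feature macros of the form `U00:%x[row,col]`. It assumes that "window length of 5" means the current source unit plus two units on either side; the slides do not state the exact offsets, so this is an assumption rather than the authors' template.

```python
def window_template(window=5, column=0):
    """Build CRF++ unigram template lines for a symmetric window.

    Assumes the window is centred on the current token, so window=5 yields
    relative row offsets -2, -1, 0, 1, 2 on the given input column.
    """
    half = window // 2
    lines = [
        f"U{idx:02d}:%x[{offset},{column}]"
        for idx, offset in enumerate(range(-half, half + 1))
    ]
    return "\n".join(lines) + "\n"


print(window_template(), end="")
```

Each line asks CRF++ to create one feature from the source unit at the given offset relative to the current position.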

Algorithm (cont.): Testing phase (transliterator)
- The input words that need to be transliterated are converted to the CRF++ test file format.
- The trained model is used to generate the top n most probable target language words.
- CRF++ uses forward Viterbi and backward A* search to produce exact n-best results.

Evaluation
We evaluate our transliteration system in two ways:
- based on transliteration accuracy;
- based on CLIR (Cross Lingual Information Retrieval) performance.
We evaluate our model by comparing it with the model that was developed using HMM only (Jaleel et al., 2003).

Evaluation (cont.): Transliteration accuracy
- The models were trained on 30,000 words and tested on 1,000 words containing Indian city names, family names, and first and last names of persons.
- We evaluated the models on both in-corpus and out-of-corpus words. The out-of-corpus words contain Indian and foreign city names and person names.
- We evaluated the models considering the top 5, 10, 15, 20, and 25 transliterations.

Evaluation (cont.): Transliteration accuracy
Accuracy = (C / N) * 100
- C: number of test words whose correct transliteration appeared within the desired number (5, 10, 15, 20, 25) of transliterations.
- N: total number of test words.

Transliteration accuracy (%) on training data

Model       Top 5   Top 10   Top 15   Top 20   Top 25
HMM         74.2    78.7     81.1     82.1     83.0
HMM & CRF   76.5    83.6     86.5     88.9     89.7
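The accuracy measure can be sketched as a top-k check over ranked candidate lists. The data structures below are hypothetical, and the sketch assumes a single reference transliteration per test word, which the slides do not specify.

```python
def top_k_accuracy(candidates, references, k):
    """Accuracy = (C / N) * 100 for a given cutoff k.

    candidates: dict mapping each test word to its ranked transliterations.
    references: dict mapping each test word to its correct transliteration
    (one reference per word is an assumption of this sketch).
    """
    correct = sum(
        1
        for word, reference in references.items()
        if reference in candidates.get(word, [])[:k]
    )
    return 100.0 * correct / len(references)


toy_candidates = {"w1": ["gautam", "gowtam"], "w2": ["dilli", "delhi"]}
toy_references = {"w1": "gautam", "w2": "delhi"}
for k in (1, 2):
    print(k, top_k_accuracy(toy_candidates, toy_references, k))
```

Increasing k can only keep the accuracy the same or raise it, which matches the monotone rows in the tables.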

Evaluation (cont.)
Transliteration accuracy (%) on testing data

Model       Top 5   Top 10   Top 15   Top 20   Top 25
HMM         69.3    74.3     77.8     80.5     81.3
HMM & CRF   72.1    79.9     83.5     85.6     86.5

Evaluation (cont.): CLIR evaluation
- We tested the systems on the CLEF 2007 documents and 50 topics.
- Only the topics that contained named entities were used for evaluation; there were 13 such topics.

Evaluation (cont.): CLIR evaluation
We developed a basic CLIR system that performs the following steps:
1. Tokenizes the Hindi query and removes the stop words.
2. Performs query translation: each Hindi word is looked up in a Hindi-English dictionary, and all the English meanings of the Hindi word are added to the translated query. For words not found in the dictionary, the top 20 transliterations generated by one of the systems are added to the query.
3. Retrieves relevant documents by running the translated query against the CLEF documents using Lucene.
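The query translation step can be sketched as a dictionary lookup with a transliteration fallback. All names here (`translate_query`, the toy dictionary, the stand-in transliterator) are hypothetical placeholders, and retrieval with Lucene is outside the scope of the sketch.

```python
def translate_query(tokens, stop_words, dictionary, transliterate, n=20):
    """Translate a tokenized query into a flat list of target-language terms.

    dictionary: maps a source word to a list of target-language meanings.
    transliterate: callable returning ranked transliterations for a word;
    only the top n are used for out-of-dictionary words.
    """
    translated = []
    for token in tokens:
        if token in stop_words:
            continue  # stop words are dropped before translation
        if token in dictionary:
            translated.extend(dictionary[token])  # add every listed meaning
        else:
            translated.extend(transliterate(token)[:n])
    return translated


toy_dictionary = {"shahar": ["city", "town"]}
toy_transliterate = lambda word: [word + "_v1", word + "_v2", word + "_v3"]
print(translate_query(["ka", "shahar", "gautam"], {"ka"}, toy_dictionary, toy_transliterate, n=2))
```

Adding all meanings and all top-n transliterations expands the query, trading some precision for coverage of out-of-vocabulary names.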

Evaluation (cont.)
The table below presents standard IR evaluation metrics, such as P10, bpref, and mean average precision (MAP), for the two systems.

Model       P10      tot_rel   tot_rel_ret   MAP      bpref
HMM         0.3308   13000     3493          0.1347   0.2687
HMM & CRF   0.4154   13000     3687          0.1499   0.2836
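For readers unfamiliar with the metrics, here is a minimal sketch of precision at a cutoff and average precision for a single topic, using the standard textbook definitions. It is not the evaluation tool used to produce the table, and the document identifiers are made up.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document occurs.

    Divides by the total number of relevant documents, so relevant documents
    that are never retrieved lower the score.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


ranked_docs = ["d1", "d2", "d3"]
relevant_docs = {"d1", "d3"}
print(precision_at_k(ranked_docs, relevant_docs, k=2))
print(average_precision(ranked_docs, relevant_docs))
```

MAP is the mean of `average_precision` over all evaluated topics.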

Microsoft dataset
- MSR Transliteration Data results for three language pairs: Hindi-English, Tamil-English, Arabic-English.
- Training: 3,000 words; testing: 300 words.

Results

Language   Top 1   Top 5   Top 10   Top 20
Hindi      0.600   0.823   0.897    0.920
Tamil      0.343   0.603   0.690    0.770
Arabic     0.347   0.593   0.703    0.760

Conclusion
- We demonstrated a statistical transliteration system for CLIR using an HMM alignment model and a CRF.
- Important observations:
  - The quality of transliteration seems to be very good when going from a larger alphabet to a smaller alphabet and vice versa.
  - The difference between the accuracies for top n and top n-5 (n > 5) decreases as n increases.

Thank You Questions?