English to Tamil Statistical Machine Translation and Alignment Using HMM

RECENT ADVANCES in NETWORKING, VLSI and SIGNAL PROCESSING (ISSN: 1790-5117, ISBN: 978-960-474-162-5)

S. VETRIVEL, DIANA BABY
Computer Science and Engineering, Karunya University
Karunya Nagar, Coimbatore, Tamilnadu, INDIA
svetrivel@karunya.edu, dianababy86@yahoo.com

Abstract: - This paper describes English to Tamil statistical machine translation and its alignment using a Hidden Markov Model (HMM). Statistical machine translation is a part of natural language processing and is based on probability distributions. Machine translation is a sub-field of computational linguistics that uses computer software to translate text from one natural language to another. Alignment is one of the major challenges in machine translation. The HMM-based alignment described in this paper is more accurate, avoids invalid alignments and improves translation quality. The HMM uses bigram translation probabilities to keep word context in the language model, which produces close to error-free output that reads fluently in the target language.

Key-words: - Bigram translation probability, Hidden Markov model, phrase alignment, Statistical Machine Translation, translation model, word alignment.

1 Introduction
Language is the main form of human communication. Translation is essential for co-operation among communities that speak different languages. Machine Translation refers to the use of computers to automate the task of translation between human languages. A human language can be considered a system of arbitrary symbols whose meanings are defined and adopted by the users of that language for the effective exchange of information. The translation process converts a text in one human language into another while preserving not only the meaning but also the form, effect and style. In some countries more than one language is spoken but not enough human translators are available, so a scheme for automatic translation between two languages is very desirable for social and political interactions.

This paper is concerned with the analysis, design and building of a model for an English to Tamil Statistical Machine Translation (SMT) system. One of the central modeling problems in SMT is alignment between parallel texts. The task of the alignment methodology is to identify translation equivalence between sentences, and between words and phrases within sentences. This paper deals with hidden Markov models (HMMs), which are used for automatic alignment of words and phrases in parallel text. The parameters of a statistical word alignment model are estimated from parallel text, and the model is then used to word-align the same text that was used in estimation. Phrase pairs, that is, short sequences of words that align to each other, are extracted from the word-aligned parallel text for use in translation. The performance of phrase-based SMT is influenced by the quality and quantity of the word-aligned parallel text. HMMs are potentially an attractive alternative to other models used for word alignment and phrase alignment of parallel text [1, 2].

The paper proceeds as follows. Section two explains English to Tamil translation. Section three formally presents the HMM and the alignment methodology. Section four gives the conclusion.

2 English to Tamil Translation
Translation requires extensive linguistic knowledge of both the source and the target language. The linguistic knowledge of a language includes knowledge of its phonology, morphology, syntax, semantics, pragmatics and discourse. Translation also requires comparable knowledge of grammatical and various other correspondences between the source and target language. In addition, basic knowledge of the subject matter of the sentence, together with general knowledge and common sense, is essential for a good translation. Finally, knowledge of the customs and culture of the speakers of both languages helps translators to select the best among alternatives. The phases of translation and other specifications of the Tamil language are explained in the following sections.

2.1 Phases of Translation
The implementation of statistical machine translation is split into two main phases, the training phase and the translation phase. In the training phase a statistical model of translation is built, using a corpus of texts in both the source and target language (English and Tamil). The training phase is split into three parts:
(i) Document collection, which gathers the corpus of texts from which the statistical model will be inferred.
(ii) Building the translation model from the source language to the target language.
(iii) Building the language model for the target language.

[Fig.1 Flow of Implementation. Boxes: Input Sentence; Translation Model; Bag of possible words; Seek improvement by trying other combinations; Language Model; Most Probable Translation.]

Document Collection: There are many possible ways to build the statistical model of translation. One way is to represent the document collection as a file containing several million URLs, where each URL points to an English language text that is used to build a language model. The file also contains URLs pointing to pairs of translated texts in English and in Tamil, which are used to build the translation model [3, 4].

Translation Model: Translation models describe the mathematical relationship between two or more languages. A good translation model is key to many translingual applications such as machine translation. Other applications include cross-language information retrieval, computer-assisted language learning, and various tools for translators. The better models of translational equivalence are built empirically. Computational linguists use machine learning techniques to induce them from bitexts, that is, pairs of texts that are translations of each other. Computers should be able to figure out which expressions are translationally equivalent [3].

Language Model: The language model plays an important role in statistical machine translation. It is the key knowledge source for determining the right word order of the translation. The nominal task of the language model is to guide the search (decoding) procedure towards grammatical output. A standard n-gram based language model predicts the next word based on the immediate left context. Such models work well, are easy to train, require no manual annotation and are well understood. Use of a language model can improve translation quality [4].

The second phase is the translation phase, which uses a heuristic search procedure to find a good translation of a text. The idea of the heuristic search is to consider partial sentences and partial alignments while maintaining a stack of particularly promising candidates. It is this phase that the end user actually uses directly; the training phase happens offline, beforehand.

2.2 Translation Architecture
Three types of translation architecture are used in MT systems: Transformer (Direct), Transfer and Interlingua architectures. The MT system considered here is based on the transformer architecture [4].

[Fig.2 Translation Architecture. Boxes: Source Text in English; English Parser (uses Dictionary and small Grammar to produce English Structure); English to Tamil Transformer (English to Tamil Transformation Rules); Target Text in Tamil.]

2.3 Basic Linguistic Specifications of Tamil
Tamil is an agglutinative language, so Tamil words are combinations of several morphemes. A Tamil word consists of a root combined with other grammatical accretions. The concept of case refers to expressing the reciprocal relations of nouns by means of case terminators such as post-positions or auxiliary words. In English these relations are expressed using prepositions such as in, on, at, by and with. In Tamil, singular and plural forms of nouns take the same case terminators. Tamil uses the crude root of the verb, and Tamil verbs usually carry tense. English verbs are pre-modified by auxiliaries to express the tense, aspect, voice and number of the sentence. The Tamil verb should, in addition to all the above functions, carry information about gender. All this information is represented in the Tamil verb by different grammatical formatives suffixed to it in a pre-defined order. The gender information of a verb may be derived from its terminator. The passive voice of a transitive verb in Tamil is formed by combining the verb with the auxiliary verb padu. In English, negation is introduced by using the word not adverbially. There is no such word in Tamil, though the word illai (no) is sometimes used. The tense of the Tamil negative verb is indeterminate in point of time and is therefore determined by the context [6].

2.4 Word Combination Rules in Tamil
Tamil word combination rules ensure the euphonic and natural composition of adjacent words and govern inflectional and derivational processes. When two Tamil words (or affixes) are combined, the resulting word depends on the boundary syllables of the components. Three types of change are possible: insertion of a new letter, transmutation of letters, and natural composition [6].

2.5 Tamil Information Interchange Code
ASCII is a standard representation widely used for information interchange within computers. ASCII is not sufficient to encode the letters of other alphabets such as Tamil. Therefore a Standard Code for Information Interchange in Tamil (SCIIT) is used. The vowels a to au are assigned the codes 1 to 12. The consonants are assigned multiples of 20 in their order. The space is given the code zero. The code for a vowel-consonant is the sum of the codes of the corresponding vowel and consonant. The SCIIT code preserves Tamil alphabetical order [6].

3 HMM Alignment
The HMM is used for word and phrase alignment of parallel text. A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved state. In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but output dependent on the state is visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states [7].

3.1 Architecture of Hidden Markov Model
Fig.3 shows the general architecture of an instantiated HMM. Each oval shape represents a random variable that can adopt any number of values. The random variable $x(t)$ is the hidden state at time $t$, with $x(t) \in \{x_1, x_2, x_3\}$. The random variable $y(t)$ is the observation at time $t$, with $y(t) \in \{y_1, y_2, y_3, y_4\}$. The arrows in the diagram denote conditional dependencies. The conditional probability distribution of the hidden variable $x(t)$, given the value of the hidden variable $x(t-1)$, depends only on the value of $x(t-1)$: the values at time $t-2$ and before have no influence. This is called the Markov property. Similarly, the value of the observed variable $y(t)$ depends only on the value of the hidden variable $x(t)$ (both at time $t$) [7].

[Fig.3 HMM Architecture. Hidden states x(t-1), x(t), x(t+1), each emitting the observation y(t-1), y(t), y(t+1) respectively.]

3.2 Variables Used for Alignment
The target-language word sequence is treated as an intermediate sequence of target-language phrases, where a phrase is a variable-length word sequence in the target language. The model variables include:
$e = e_1^I$ : source sentence of $I$ words (English)
$t = t_1^J$ : target sentence of $J$ words (Tamil)
$K$ : phrase count variable, that is, the number of phrases into which the target sentence is segmented
$u_1^K$ : target sentence as a sequence of $K$ phrases
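The segmentation of a target sentence into phrases can be made concrete with a small sketch. Assuming a target sentence of J words and a fixed phrase count K, the Python function below enumerates every way to cut the sentence into K contiguous, non-empty phrases. The function name `segmentations` and the placeholder words are hypothetical and are not taken from the paper; the sketch only illustrates the search space, not how the model scores a segmentation.

```python
from itertools import combinations


def segmentations(target_words, k):
    """Yield every split of target_words into k contiguous, non-empty phrases."""
    if k < 1:
        raise ValueError("phrase count k must be at least 1")
    j = len(target_words)
    # A segmentation is fixed by choosing k-1 cut points
    # among the j-1 gaps between adjacent words.
    for cuts in combinations(range(1, j), k - 1):
        bounds = (0, *cuts, j)
        yield [tuple(target_words[a:b]) for a, b in zip(bounds, bounds[1:])]


# Placeholder target sentence with J = 4 words (illustrative only).
t = ["t1", "t2", "t3", "t4"]
for phrases in segmentations(t, 2):  # K = 2
    print(phrases)
```

For J = 4 and K = 2 there are three segmentations, one per possible cut point; in general the count is the binomial coefficient C(J-1, K-1), which is why the alignment model needs a distribution over the phrase count and phrase lengths rather than an exhaustive listing.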

The alignment itself is described by the following variables:
$\phi_k$ : length of the $k$th phrase
$a_1^K$ : alignment process, which aligns target phrases to source words
$e_{a_k}$ : word in the source sentence at position $a_k$
$h_1^K$ : hallucination sequence; if $h_k = 0$ then $u_k$ is aligned to NULL, and if $h_k = 1$ then $u_k$ is aligned to $e_{a_k}$

The hidden variables in the HMM are:
$$a = (\phi_1^K, a_1^K, h_1^K, K) \qquad (1)$$

3.3 Phrase Segmentation, Alignment and Translation Models
The modeling objective is to define a conditional distribution $P(t, a \mid e)$ over the alignments of the source sentence (English) and the target sentence (Tamil). It can be calculated using the equation:
$$P(u_1^K, K, a_1^K, h_1^K, \phi_1^K \mid e) = P(K \mid J, e)\; P(a_1^K, \phi_1^K, h_1^K \mid K, J, e)\; P(u_1^K \mid a_1^K, h_1^K, \phi_1^K, K, J, e) \qquad (2)$$

Phrase Count Distribution: $P(K \mid J, e)$ specifies the distribution over the number of phrases in the target sentence, given the source sentence and the number of words in the target sentence. A single-parameter distribution, $P(K \mid J, e) = P(K \mid J, I) \propto \eta^K$ with $\eta \geq 1$, controls the segmentation of the target sentence into phrases. Larger values of $\eta$ favor target sentence segmentations with many short phrases.

Word-to-Phrase Alignment Distribution: The alignment is modeled as a Markov process that specifies the lengths of the phrases and the alignment of each phrase to one of the source word positions:
$$P(a_1^K, \phi_1^K, h_1^K \mid K, J, e) = \prod_{k=1}^{K} P(a_k, \phi_k, h_k \mid a_{k-1}, \phi_{k-1}, h_{k-1}, K, J, e) \qquad (3)$$
The word-to-phrase alignment $a_k$ is a Markov process over the source sentence word indices, as in word-to-word HMM alignment. It is formulated with a dependency on the hallucination variable so that target phrases can be inserted without disrupting the Markov dependencies of phrases aligned to non-NULL source words.

Word-to-Phrase Translation: The translation of words to phrases is given as
$$P(u_1^K \mid a_1^K, h_1^K, \phi_1^K, K, J, e) = \prod_{k=1}^{K} P(u_k \mid e_{a_k}, h_k, \phi_k) \qquad (4)$$
so that target phrases are conditionally independent given their alignment to individual source words. Specialized translation tables can be maintained for hallucinated phrases, to allow their statistics to differ from those of phrases that arise from direct translation of specific source words. Word context within the target-language phrase is captured via bigram translation probabilities [5].

3.4 Viterbi Algorithm
The Viterbi algorithm is one of the HMM algorithms and is used for the alignment process. Given the parameters of the model and a particular output sequence, the task is to find the state sequence that is most likely to have generated that output sequence. This requires finding a maximum over all possible state sequences, which the Viterbi algorithm solves efficiently [8].

[Fig.4 Working of Algorithm. States q1, q2, q3 with transitions a12, a21, a23; outputs b1, b2, b3; observations o1, o2, o3. Legend: o = observation output, q = state probability, b = output probability, a = transition probability.]

The steps of the Viterbi algorithm are as follows:
Step 1: Initialization. Assume an initial probability and an emission probability, and use them to calculate the probability of the first state.
Step 2: Induction. Calculate the probabilities of the remaining states, that is, all states except the start state.
Step 3: Backtracking. Find the most likely path, the one that produces the highest probability.
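The three steps above can be sketched in Python for a generic discrete HMM. This is a minimal illustration under stated assumptions, not the paper's alignment implementation: the function name `viterbi`, the two states, the three observation symbols and all probability values are made up for the example. In the alignment setting the states would correspond to source word positions and the observations to target phrases.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state path and its probability."""
    # Step 1: initialization, using the initial and emission probabilities.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]

    # Step 2: induction over the remaining time steps.
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at time t.
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev

    # Step 3: backtracking from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model (all numbers are illustrative).
states = ["q1", "q2"]
start_p = {"q1": 0.6, "q2": 0.4}
trans_p = {"q1": {"q1": 0.7, "q2": 0.3}, "q2": {"q1": 0.4, "q2": 0.6}}
emit_p = {
    "q1": {"o1": 0.5, "o2": 0.4, "o3": 0.1},
    "q2": {"o1": 0.1, "o2": 0.3, "o3": 0.6},
}

path, prob = viterbi(["o1", "o2", "o3"], states, start_p, trans_p, emit_p)
print(path, prob)
```

For this toy model the most likely path is q1, q1, q2 with probability of about 0.01512. Practical implementations usually work with log probabilities to avoid numerical underflow on long sequences; plain products are kept here so the code mirrors the three steps directly.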

4 Conclusion
This paper deals with English to Tamil statistical machine translation and its alignment methodology, which is based on computationally practical hidden Markov models. Statistical machine translation is based on probability and can produce more accurate results than other approaches, and statistical alignment models improve translation quality. HMM-based statistical alignment is a powerful model for this task.

References:
[1] F. Och, C. Tillman, and H. Ney, Improved alignment models for statistical machine translation, in Proc. Joint Conf. Empirical Methods Natural Lang. Process. Very Large Corpora, College Park, MD, pp. 20-28, 1999.
[2] P. Koehn, F. Och, and D. Marcu, Statistical phrase-based translation, in Proc. HLT-NAACL, pp. 127-133, 2003.
[3] Hongfei Jiang, Muyun Yang, Tiejun Zhao, Sheng Li and Bo Wang, A Statistical Machine Translation Model Based on a Synthetic Synchronous Grammar, Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pp. 125-128, Suntec, Singapore, 4 August 2009.
[4] Matt Post and Daniel Gildea, Parsers as language models for statistical machine translation, Department of Computer Science, University of Rochester, 2005.
[5] Y. Deng and W. Byrne, MTTK: An alignment toolkit for statistical machine translation, presented at the HLT-NAACL Demonstrations Program, Jun. 2006.
[6] R. Ravi and S. Kailasam, Computer Vision of Single to Multi-Language Translation using Statistical Machine Translation, TIFAC-CORE, Kalasalingam University, Tamilnadu.
[7] M. Ostendorf, V. Digalakis, and O. Kimball, From HMMs to segment models: A unified view of stochastic modeling for speech recognition, IEEE Trans. Acoustics, Speech, Signal Process., vol. ASSP-4, no. 5, pp. 360-378, Sep. 1996.
[8] G.D. Brushe, Robert E. Mahony and John B. Moore, A forward backward algorithm for ML state and sequence estimation, International Symposium on Signal Processing and its Applications, ISSPA, Gold Coast, Australia, 25-30 August, 1996.