English to Tamil Statistical Machine Translation and Alignment Using HMM

RECENT ADVANCES in NETWORKING, VLSI and SIGNAL PROCESSING

S. VETRIVEL, DIANA BABY
Computer Science and Engineering
Karunya University
Karunya Nagar, Coimbatore, Tamilnadu
INDIA

Abstract: - This paper describes English to Tamil statistical machine translation and its alignment using a Hidden Markov Model (HMM). Statistical machine translation is a part of natural language processing and is based on probability distributions. Machine translation is a sub-field of computational linguistics that uses computer software to translate text from one natural language to another. Alignment is one of the major challenges in machine translation. The HMM-based alignment described in this paper aims at more accurate alignment: it avoids invalid alignments and improves translation quality. The HMM uses bigram translation probabilities to keep word context, with the aim of producing close to error-free output that reads fluently in the target language.

Key-words: - Bigram translation probability, Hidden Markov model, phrase alignment, Statistical Machine Translation, translation model, word alignment.

1 Introduction

Language is the main form of human communication. Translation is essential for co-operation among communities that speak different languages. Machine Translation refers to the use of computers to automate the task of translation between human languages. A human language can be considered a system of arbitrary symbols whose meanings are defined and adopted by the users of that language for the purpose of exchanging information effectively. The translation process converts a text in one human language into another while preserving not only the meaning but also the form, effect and style. In some countries more than one language is spoken but not enough human translators are available, so a scheme for automatic translation between two languages is very desirable for social and political interaction.

This paper is concerned with the analysis, design and building of a model for an English to Tamil Statistical Machine Translation (SMT) system. One of the central modeling problems in SMT is alignment between parallel texts. The task of an alignment methodology is to identify translation equivalence between sentences, and between words and phrases within sentences. This paper deals with hidden Markov models (HMMs), which are used for automatic alignment of words and phrases in parallel text. The parameters of a statistical word alignment model are estimated from parallel text, and the model is then used to align words within the same text used for estimation. Phrase pairs, that is, short sequences of words that align to each other, are extracted from the word-aligned parallel text for use in translation. The performance of phrase-based SMT is influenced by the quality and quantity of the word-aligned parallel text. HMMs are potentially an attractive alternative to other models used for word alignment and phrase alignment of parallel text [1, 2].

The paper proceeds as follows. In section two English to Tamil translation is explained. The HMM and alignment methodology is formally presented in section three. The final conclusion is given in section four.

2 English to Tamil Translation

Translation requires extensive linguistic knowledge of both the source and the target language. The linguistic knowledge of a language includes knowledge of its phonology, morphology, syntax, semantics, pragmatics and discourse. Translation also requires comparable knowledge of grammatical and various other correspondences between the source and target languages. In addition, basic knowledge of the subject matter of the sentence, together with general knowledge and common sense, is essential for a good translation. Finally, knowledge of the customs and culture of the speakers of both languages helps translators to select the best among alternatives. The phases of translation and other specifications of the Tamil language are explained in the following sections.

2.1 Phases of Translation

The implementation of statistical machine translation is split into two main phases, called the training phase and the translation phase. In the training phase a statistical model of translation is built, using a corpus of texts in both the source and the target language (English and Tamil). The training phase is split into three parts:

(i) Document collection, which gathers the corpus of texts from which the statistical model will be inferred.
(ii) Building the translation model from the source language to the target language.
(iii) Building the language model for the target language.

Fig.1 Flow of Implementation (blocks: Input Sentence; Translation Model; Bag of possible words; Language Model; Seek improvement by trying other combinations; Most Probable Translation)

Document Collection: There are many possible ways to collect the corpus from which the statistical model of translation is built. One way is to represent it as a file containing several million URLs, where each URL points to an English language text that is used to build a language model. The file also contains URLs pointing to pairs of translated texts in English and in Tamil, which are used to build the translation model [3, 4].

Translation Model: Translation models are used to describe the mathematical relationship between two or more languages. A good translation model is a key to many translingual applications such as machine translation. Other applications include cross-language information retrieval, computer-assisted language learning, and various tools for translators. The better models of translational equivalence are built empirically: computational linguists use machine learning techniques to induce them from bitexts, that is, pairs of texts that are translations of each other. Computers should be able to figure out which expressions are translationally equivalent [3].

Language Model: The language model plays an important role in statistical machine translation. It is the key knowledge source for determining the right word order of the translation. The nominal task of the language model is to guide the search (decoding) procedure towards grammatical output. A standard n-gram based language model predicts the next word from its immediate left context. Such models work well, are easy to train, require no manual annotation and are well understood. Use of a language model can improve translation quality [4].

The second phase is the translation phase, which uses a heuristic search procedure to find a good translation of a text. The idea of the heuristic search is to consider partial sentences and partial alignments while maintaining a stack of particularly promising candidates. It is this phase that is used directly by the end user; the training phase happens offline, beforehand.

2.2 Translation Architecture

Three types of translation architecture are used in MT systems: Transformer (Direct), Transfer and Interlingua architectures. The MT system considered here is based on the transformer architecture [4].

Fig.2 Translation Architecture (blocks: Source Text in English; English Parser: uses dictionary and small grammar to produce English structure; English to Tamil Transformer: English to Tamil transformation rules; Target Text in Tamil)

2.3 Basic Linguistic Specifications of Tamil

Tamil is an agglutinative language, so Tamil words are combinations of several morphemes. A Tamil word consists of a root combined with other grammatical accretions. The concept of case refers to the expression of reciprocal relations between nouns by means of case-terminators such as post-positions or auxiliary words. In English these relations are expressed using prepositions such as in, on, at, by and with. In Tamil, singular and plural forms of nouns take the same case-terminators.

Tamil uses the crude root of the verb, and Tamil verbs usually carry tense. English verbs are pre-modified by auxiliaries to express the tense, aspect, voice and number of the sentence. The Tamil verb must carry all of this information and, in addition, information about gender. All of it is represented in the Tamil verb by different grammatical formatives suffixed to it in a pre-defined order. The gender information of a verb may be derived from its terminator. The passive voice of a transitive verb in Tamil is formed by combining the verb with the auxiliary verb padu. In English, negation is introduced by the adverbial use of the word not. There is no such word in Tamil, though the word illai (no) is sometimes used. The tense of the Tamil negative verb is indeterminate in point of time and is therefore determined by the context [6].

2.4 Word Combination Rules in Tamil

Tamil word combination rules ensure the euphonic and natural composition of adjacent words and of inflectional and derivational processes. When two Tamil words (or affixes) are combined, the resultant word depends on the boundary syllables of the components. Three types of change are possible: insertion of a new letter, transmutation of letters and natural composition [6].

2.5 Tamil Information Interchange Code

ASCII is a standard representation widely used for information interchange within computers. ASCII is not sufficient to encode the letters of other alphabets such as Tamil. Therefore a Standard Code for Information Interchange in Tamil (SCIIT) is used. The vowels a to au are assigned the SCIIT codes 1 to 12. The consonants are assigned multiples of 20 in their order. The space is given the code zero. The code for a vowel-consonant is the sum of the codes of the corresponding vowel and consonant. The SCIIT code preserves Tamil alphabetical order [6].

3 HMM Alignment

The HMM is used for word and phrase alignment of parallel text. A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved state. In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but output dependent on the state is visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states [7].

3.1 Architecture of Hidden Markov Model

The diagram below shows the general architecture of an instantiated HMM. Each oval shape represents a random variable that can adopt any number of values. The random variable $x(t)$ is the hidden state at time $t$, with $x(t) \in \{x_1, x_2, x_3\}$. The random variable $y(t)$ is the observation at time $t$, with $y(t) \in \{y_1, y_2, y_3, y_4\}$. The arrows in the diagram denote conditional dependencies. The conditional probability distribution of the hidden variable $x(t)$ at time $t$, given the value of the hidden variable $x(t-1)$, depends only on the value of $x(t-1)$: the values at time $t-2$ and before have no influence. This is called the Markov property. Similarly, the value of the observed variable $y(t)$ depends only on the value of the hidden variable $x(t)$ (both at time $t$) [7].
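The two dependencies described in Section 3.1 can be made concrete with a small sketch. It assumes a toy HMM with three hidden states and four output symbols, matching the sets used above; every probability value and the helper name `joint_probability` are invented for illustration and are not taken from the paper. Because of the Markov property, the joint probability of a state sequence and an observation sequence is a product of one initial term, one transition term per step and one emission term per time point.

```python
# Toy HMM with 3 hidden states (x1..x3) and 4 output symbols (y1..y4).
# All probability values are illustrative assumptions, not from the paper.
STATES = ["x1", "x2", "x3"]
OUTPUTS = ["y1", "y2", "y3", "y4"]

# P(x(1)): distribution of the first hidden state.
initial = {"x1": 0.5, "x2": 0.3, "x3": 0.2}

# P(x(t) | x(t-1)): depends only on the previous hidden state.
transition = {
    "x1": {"x1": 0.6, "x2": 0.3, "x3": 0.1},
    "x2": {"x1": 0.2, "x2": 0.5, "x3": 0.3},
    "x3": {"x1": 0.1, "x2": 0.4, "x3": 0.5},
}

# P(y(t) | x(t)): depends only on the hidden state at the same time.
emission = {
    "x1": {"y1": 0.7, "y2": 0.1, "y3": 0.1, "y4": 0.1},
    "x2": {"y1": 0.1, "y2": 0.6, "y3": 0.2, "y4": 0.1},
    "x3": {"y1": 0.1, "y2": 0.1, "y3": 0.3, "y4": 0.5},
}


def joint_probability(states, observations):
    """Return P(states, observations) under the two HMM assumptions."""
    if len(states) != len(observations) or not states:
        raise ValueError("need equally long, non-empty sequences")
    # First time point: initial state probability times its emission.
    prob = initial[states[0]] * emission[states[0]][observations[0]]
    # Later time points: one transition and one emission factor each.
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        prob *= transition[prev][cur] * emission[cur][obs]
    return prob
```

For example, `joint_probability(["x1", "x2"], ["y1", "y2"])` multiplies 0.5 (start in x1), 0.7 (x1 emits y1), 0.3 (x1 to x2) and 0.6 (x2 emits y2). In an alignment setting the hidden states would play the role of source word positions and the observations the role of target words, but that mapping is not spelled out in this sketch.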
Fig.3 HMM Architecture (chain of hidden states x(t-1), x(t), x(t+1), each emitting an observation y(t-1), y(t), y(t+1))

3.2 Variables Used for Alignment

The model variables are:

$e = e_1^I$: source sentence of $I$ words (English)
$t = t_1^J$: target sentence of $J$ words (Tamil)

The target language word sequence is treated as a sequence of target language phrases, where a phrase is a variable-length word sequence in the target language. The remaining variables are:

$K$: phrase count variable, that is, the target sentence is segmented into $K$ phrases
$u_1^K$: target sentence as a sequence of $K$ phrases
$\phi_k$: phrase length of the $k$th phrase
$a_1^K$: alignment process, which aligns target phrases to source words
$e_{a_k}$: word at position $a_k$ of the source sentence
$h_1^K$: hallucination sequence; if $h_k = 0$ then $u_k$ is aligned to NULL, and if $h_k = 1$ then $u_k$ is aligned to $e_{a_k}$

The hidden variables in the HMM are:

$$a = (\phi_1^K, a_1^K, h_1^K, K) \quad (1)$$

3.3 Phrase Segmentation, Alignment and Translation Models

The modeling objective is to define a conditional distribution $P(t, a \mid e)$ over the alignments of the source (English) and target (Tamil) sentences. It can be calculated using the equation:

$$P(u_1^K, K, a_1^K, h_1^K, \phi_1^K \mid e) = P(K \mid J, e)\; P(a_1^K, \phi_1^K, h_1^K \mid K, J, e)\; P(u_1^K \mid a_1^K, h_1^K, \phi_1^K, K, J, e) \quad (2)$$

Phrase Count Distribution: $P(K \mid J, e)$ specifies the distribution over the number of phrases in the target sentence given the source sentence and the number of words in the target sentence. A single-parameter distribution $P(K \mid J, e) = P(K \mid J, I) \propto \eta^K$ is used, in which $\eta \geq 1$ controls the segmentation of the target sentence into phrases. Larger values of $\eta$ favor target sentence segmentations with many short phrases.

Word-to-Phrase Alignment Distribution: The alignment is modeled as a Markov process that specifies the lengths of phrases and the alignment of each phrase to one of the source word positions:

$$P(a_1^K, \phi_1^K, h_1^K \mid K, J, e) = \prod_{k=1}^{K} P(a_k, \phi_k, h_k \mid a_{k-1}, \phi_{k-1}, h_{k-1}, K, J, e) \quad (3)$$

The word-to-phrase alignment $a_k$ is a Markov process over the source sentence word indices, as in word-to-word HMM alignment. It is formulated with a dependency on the hallucination variable so that target phrases can be inserted without disrupting the Markov dependencies of phrases aligned to non-NULL source words.

Word-to-Phrase Translation: The translation of words to phrases is given as

$$P(u_1^K \mid a_1^K, h_1^K, \phi_1^K, K, J, e) = \prod_{k=1}^{K} P(u_k \mid e_{a_k}, h_k, \phi_k) \quad (4)$$

so that target phrases are conditionally independent given their alignment to individual source words. Specialized translation tables can be maintained for hallucinated phrases to allow their statistics to differ from those of phrases that arise from direct translation of specific source words. Word context within the target language phrase is captured via bigram translation probabilities [5].

3.4 Viterbi Algorithm

The Viterbi algorithm is one of the algorithms for HMMs and is used here for the alignment process. Given the parameters of the model and a particular output sequence, the task is to find the state sequence that is most likely to have generated that output sequence. This requires finding a maximum over all possible state sequences, which can be done efficiently by the Viterbi algorithm [8].

Fig.4 Working of Algorithm (states q1, q2, q3 with transitions a12, a21, a23; output probabilities b1, b2, b3; observations o1, o2, o3. o = observation output, q = state, b = output probability, a = transition probability)

The steps of the Viterbi algorithm are as follows:

Step 1: Initialization. Assume initial probabilities and emission probabilities, then calculate the probability of the first state using these assumptions.
Step 2: Induction. Calculate the probabilities of the states at every later position, each from the best-scoring predecessor state.
Step 3: Backtracking. Find the most likely path, that is, the state sequence that produces the highest probability.
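The three steps above can be sketched in a few lines of Python. This is a minimal illustration for a generic discrete HMM, not the alignment model of Section 3.3: the state names q1 to q3 and observation names o1 to o3 follow Fig.4, while all probability values and the function name `viterbi` are assumptions made for the example.

```python
# Illustrative parameters (assumed values, not from the paper).
STATES = ["q1", "q2", "q3"]
INITIAL = {"q1": 0.6, "q2": 0.3, "q3": 0.1}
TRANSITION = {  # a: P(next state | current state)
    "q1": {"q1": 0.5, "q2": 0.4, "q3": 0.1},
    "q2": {"q1": 0.2, "q2": 0.5, "q3": 0.3},
    "q3": {"q1": 0.1, "q2": 0.2, "q3": 0.7},
}
EMISSION = {  # b: P(observation | state)
    "q1": {"o1": 0.7, "o2": 0.2, "o3": 0.1},
    "q2": {"o1": 0.1, "o2": 0.7, "o3": 0.2},
    "q3": {"o1": 0.1, "o2": 0.2, "o3": 0.7},
}


def viterbi(observations, states, initial, transition, emission):
    """Return (most likely state path, its joint probability)."""
    if not observations:
        raise ValueError("observation sequence must not be empty")
    # Step 1: initialization with the first observation.
    best = [{s: initial[s] * emission[s][observations[0]] for s in states}]
    back = [{}]
    # Step 2: induction over the remaining observations.
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at position t.
            prev, score = max(
                ((p, best[t - 1][p] * transition[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emission[s][observations[t]]
            back[t][s] = prev
    # Step 3: backtracking from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]
```

With these toy parameters, `viterbi(["o1", "o2", "o3"], STATES, INITIAL, TRANSITION, EMISSION)` selects the path q1, q2, q3. A practical implementation would usually work with log-probabilities to avoid numerical underflow on long sentences; plain products are used here to keep the correspondence with the three steps visible.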

4 Conclusion

This paper deals with English to Tamil statistical machine translation and its alignment methodology based on computationally practical hidden Markov models. Statistical machine translation is based on probability and produces more accurate results than other types of machine translation, and statistical alignment models improve the translation quality. Among these, the HMM-based statistical alignment model is the more powerful.

References:

[1] F. Och, C. Tillman, and H. Ney, Improved alignment models for statistical machine translation, in Proc. Joint Conf. Empirical Methods Natural Lang. Process. Very Large Corpora, College Park, MD.
[2] P. Koehn, F. Och, and D. Marcu, Statistical phrase-based translation, in Proc. HLT-NAACL.
[3] Hongfei Jiang, Muyun Yang, Tiejun Zhao, Sheng Li and Bo Wang, A Statistical Machine Translation Model Based on a Synthetic Synchronous Grammar, Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Suntec, Singapore, 4 August.
[4] Matt Post and Daniel Gildea, Parsers as language models for statistical machine translation, Department of Computer Science, University of Rochester.
[5] Y. Deng and W. Byrne, MTTK: An alignment toolkit for statistical machine translation, presented at the HLT-NAACL Demonstrations Program, Jun.
[6] R. Ravi and S. Kailasam, Computer Vision of Single to Multi-Language Translation using Statistical Machine Translation, TIFAC-CORE, Kalasalingam University, Tamilnadu.
[7] M. Ostendorf, V. Digalakis, and O. Kimball, From HMMs to segment models: A unified view of stochastic modeling for speech recognition, IEEE Trans. Acoustics, Speech, Signal Process., vol. ASSP-4, no. 5, Sep.
[8] G. D. Brushe, Robert E. Mahony and John B. Moore, A forward backward algorithm for ML state and sequence estimation, International Symposium on Signal Processing and its Applications, ISSPA, Gold Coast, Australia, August.
