Transliteration System for English to Sinhala Machine Translation
Budditha Hettige, Department of Statistics and Computer Science, Faculty of Applied Sciences, University of Sri Jayewardenepura, and Asoka S. Karunananda, Faculty of Information Technology, University of Moratuwa, Sri Lanka
Overview. What is machine translation; problems in machine translation; machine transliteration; the Sinhala and English languages; existing approaches and methods; proposed approach (design and modules); conclusion and further work; demonstration.
What is Machine Translation? Machine translation (MT) is the automatic translation of text from one natural language into another.
Machine Translation Process. Source language sentence → source language analysis (uses the source language dictionary) → translation (uses the bilingual dictionary) → target language generation (uses the target language dictionary) → target language sentence.
Source language analysis. Morphological analysis: the source language morphological analyzer analyzes the given sentence word by word and returns morphological information for each word. Syntax analysis: the source language parser identifies the syntactic structure of the given source language sentence.
Translation. The translator translates each source language word into the target language.
Target language generation. Morphological generation: the target language morphological analyzer/generator generates the appropriate target language words with their grammatical information. Syntax generation: the target language parser generates the sentences in the target language.
Problems in Machine Translation. Out-of-vocabulary words: words that are not in the dictionary. Proper noun translation, for example Mahinda Rajapaksha. Handling technical terms, for example Pentium IV Processor. Multiword expressions, for example oil cake (කැවුම්). Semantic and pragmatic issues.
What is Machine Transliteration? Machine transliteration is a method for automatically converting words in one language into phonetically equivalent words in another language. For example, the English word "machine" is transliterated into Sinhala as මැසින්.
Why Machine Transliteration? Machine transliteration can be used to solve the out-of-vocabulary problem and to translate proper nouns.
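The fallback idea on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the toy dictionary and the toy transliteration table are hypothetical and are not part of the described system; the two Sinhala words are the examples used in these slides.

```python
from typing import Callable, Dict


def translate_word(word: str,
                   bilingual_dict: Dict[str, str],
                   transliterate: Callable[[str], str]) -> str:
    """Use the bilingual dictionary when possible, otherwise transliterate."""
    entry = bilingual_dict.get(word.lower())
    if entry is not None:
        return entry
    # Out-of-vocabulary word (for example a proper noun): fall back.
    return transliterate(word)


# Toy data for illustration only (hypothetical, not the system's dictionaries).
toy_dict = {"oil cake": "කැවුම්"}
toy_transliterations = {"machine": "මැසින්"}


def toy_transliterate(word: str) -> str:
    """Stand-in for a real transliteration module; unknown words pass through."""
    return toy_transliterations.get(word.lower(), word)


print(translate_word("Oil cake", toy_dict, toy_transliterate))
print(translate_word("machine", toy_dict, toy_transliterate))
```

A real system would plug the transliteration module in place of `toy_transliterate`; the lookup-then-fallback order is the only point being shown here.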
Design: English to Sinhala Machine Translation System. Input: English sentence. Modules: English morphological analyzer, English dictionary, English parser, transliteration, translator, bilingual dictionary, intermediate editor, Sinhala morphological analyzer, Sinhala dictionary, Sinhala parser. Output: Sinhala sentence.
Transliteration Approaches. Grapheme-based transliteration: direct orthographical mapping from source graphemes to target graphemes. Phoneme-based transliteration: based on pronunciation (the source phonemes) rather than spelling (the source graphemes). Hybrid and correspondence-based transliteration: combine the two approaches above.
Types of Transliteration. Forward transliteration: transliteration of a name from its native script into a foreign one. Backward transliteration: restoration of a previously transliterated name to its native script.
English and Sinhala Languages. English: the alphabet contains 26 letters, 5 of which are vowels. Sinhala: the alphabet consists of 61 letters, comprising 18 vowels, 41 consonants and 2 semiconsonants, which represent 40 sounds: 14 vowel sounds and 26 consonant sounds.
Phonetic Relation between English and Sinhala. The two languages are fundamentally different from each other. English has no strokes. Spoken and written English are equivalent, but written and spoken Sinhala differ. Diphthongs are not used in written Sinhala.
Disambiguation. The two English sounds ʌ and ə are both represented by one Sinhala letter, අ. The International Phonetic Alphabet (IPA) distinguishes two English sounds, ɪ and i, but Sinhala uses one letter, ඉ, for both. Diphthongs are not used in Sinhala, so these sounds are difficult to represent. The two sounds v and w are both represented by one Sinhala letter, ව. The English letters q, x and z have no direct sound in Sinhala. The large number of irregularly pronounced English words adds further difficulty.
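The many-to-one mappings listed on this slide can be made concrete with a short Python sketch. The dictionary below contains only the three pairs named above; the variable and function names are assumptions made for illustration. Inverting the mapping shows why restoring the English sound from the Sinhala letter (backward transliteration) is ambiguous.

```python
from collections import defaultdict
from typing import Dict, List

# English sound (IPA) -> Sinhala letter, restricted to the pairs on this slide.
FORWARD: Dict[str, str] = {
    "ʌ": "අ", "ə": "අ",
    "ɪ": "ඉ", "i": "ඉ",
    "v": "ව", "w": "ව",
}


def invert(mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Group source sounds by the target letter they map to."""
    backward = defaultdict(list)
    for source, target in mapping.items():
        backward[target].append(source)
    return dict(backward)


BACKWARD = invert(FORWARD)

# Every Sinhala letter here has more than one possible English source sound.
ambiguous = {letter: sounds for letter, sounds in BACKWARD.items() if len(sounds) > 1}
for letter, sounds in ambiguous.items():
    print(letter, "<-", ", ".join(sounds))
```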
Available Approaches. Dictionary writers have used a number of methods for English to Sinhala transliteration: a phonetic-based method, based on International Phonetic Alphabet (IPA) sounds, and a non-phonetic-based method, based on letters.
Transliteration Approaches
English | Malalasekara | Rathna | Godage
Aback | D nela | w[d]nela | tnela
Binocular | nb fkdlahq,ad | Ìfk[d]lHq,[¾] | nhsfkdlahq,¾
Quota | laõdwüd | lafjdag | lafjdagd
Volcano | fjd,a flbkadw | fj[d],,aflafkda | fjd,aflafkda
Xenophobia | fizkad*adwì | fi[z]k[d]f*daìh | fifkdaf*daìwd
Zero | iazbd¾dw | [z]isfrda | isfrda
Proposed Approach to English to Sinhala Transliteration. A letter-based transliteration approach that uses a finite state automaton (FSA). Two types of transliteration model are developed. Type 1: original English text, e.g. Computer. Type 2: Sinhala words written using English letters, e.g. Ambepussa.
English to Sinhala Transliteration for Original English Text (Type 1). A letter-based transliteration approach that uses a finite state automaton (FSA).
IPA Chart for English Vowels
IPA | English | Sinhala | Examples
aː | a | ආ | father
ɪ | i | ඉ | sit
ɪ | y | ඉ | city
iː | ee | ඊ | see
ɛ | e | එ | bed
ɛː | ir | ඒ | bird
æ | a | ඇ | lad, cat, ran
ʌ | u, ou | අ (විවෘත) | run, enough
ɒ | o, a | ඔ | not, wasp
ɔː | aw, au | | law, caught
ʊ | u, oo | උ | put, wood
uː | oo, ou | ඌ | soon, through
ə | a | අ (සංවෘත) | about
ə | er | අ (සංවෘත) | winner
IPA Chart for English Consonants
IPA | English | Sinhala | Examples
p | p | ප් | pen, spin, tip
b | b | බ් | but, web
t | t | ට් | two, sting, bet
d | d | ඩ් | do, odd
tʃ | ch, t | ච් | chair, nature, teach
dʒ | d, j, dge | ජ් | gin, joy, edge
k | c, k, q, ck | ක් | cat, kill, skin, queen, thick
ɡ | g | ග් | go, get, beg
f | f, gh | ෆ් | fool, enough, leaf
v | v, ve | ව් | voice, have
θ | th | ත් | thing, teeth
ð | th, the | ද් | this, breathe, father
s | s, c, ss | ස | see, city, pass
z | z, se | ස | zoo, rose
ʒ | s, ge | ස | pleasure, beige
h | h | හ් | ham
m | m | ම් | man, ham
n | n | න් | no, tin
ŋ | ng | ං | singer, ring
l | l, ll | ල් | left, bell
ɹ | r | ර | run, very
w | w | ව | we
j | y | ය | yes
ʍ | wh | ව | what
FST for Type 1 transliteration. [Figure: finite state transducer diagram. States A, B, C and D, vowel states V1 to V4 and consonant states C1 to C8, D1 and D2 are connected by transitions labelled with English letters. The label l0 stands for the consonant set {b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z}.]
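To make the finite-state idea concrete, here is a minimal Python sketch of a letter-based transducer. It is not the transducer from the figure: it only covers a handful of consonant letters and digraphs taken from the consonant chart, it leaves vowels and unknown letters untouched, and all names (`DIGRAPHS`, `SINGLES`, `transduce`) are assumptions made for illustration. The state is either "start" (`None`) or "one letter pending", which is enough to decide between a digraph such as "th" and a single "t".

```python
from typing import List, Optional

# Outputs taken from the consonant chart; a small subset only.
DIGRAPHS = {"ch": "ච්", "th": "ත්", "ck": "ක්", "ng": "ං"}
SINGLES = {
    "p": "ප්", "b": "බ්", "t": "ට්", "d": "ඩ්", "k": "ක්", "c": "ක්",
    "g": "ග්", "f": "ෆ්", "v": "ව්", "h": "හ්", "m": "ම්", "n": "න්", "r": "ර",
}
# Letters that may start a digraph, so their output must wait one step.
PENDING_STATES = {pair[0] for pair in DIGRAPHS}


def transduce(word: str) -> List[str]:
    """Map consonant letters to Sinhala; vowels and unknown letters pass through."""
    output: List[str] = []
    pending: Optional[str] = None  # current state: None means the start state
    for ch in word.lower():
        if pending is not None:
            pair = pending + ch
            if pair in DIGRAPHS:
                output.append(DIGRAPHS[pair])  # digraph recognised
                pending = None
                continue
            output.append(SINGLES[pending])  # pending letter stands alone
            pending = None
        if ch in PENDING_STATES:
            pending = ch  # wait for the next letter before emitting
        else:
            output.append(SINGLES.get(ch, ch))
    if pending is not None:
        output.append(SINGLES[pending])  # flush at end of input
    return output


print(transduce("thick"))
print(transduce("cat"))
```

Context-dependent cases from the chart, such as "c" sounding like "s" in "city", would need additional states and are deliberately left out of this sketch.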
English to Sinhala Transliteration for Sinhala Words Written Using English Letters (Type 2). A letter-based transliteration approach that uses a finite state automaton (FSA).
Sinhala Transliteration Alphabet for Type 2
Sinhala | English | Sinhala | English | Sinhala | English
අ | a | ඞ | nga | ප | pa
ආ | aa | ඟ | nnga | ඵ | pha
ඇ | ae | ච | ca | බ | ba
ඈ | aee | ඡ | cha | භ | bha
ඉ | i | ජ | ja | ම | ma
ඊ | ii | ඣ | jha | ඹ | mba
උ | u | ඤ | nya | ය | ya
ඌ | uu | ඥ | jnya | ර | ra
ඏ | Ị | ඦ | ndja | ල | la
ඐ | Ị | ට | tta | ව | va
ඍ | ŗ | ඨ | ttha | ශ | sha
ඎ | ŗ | ඩ | daa | ෂ | ssa
එ | e | ඪ | daha | ස | sa
ඒ | ee | ණ | nna | හ | ha
ඓ | ai | ඬ | nnda | ළ | lla
ඔ | o | ත | ta | ෆ | fa
ඕ | oo | ථ | tha | ඝ | gha
ඖ | au | ද | da | ඳ | nda
ක | ka | ධ | dha | ග | ga
ඛ | kha | න | na
FST for Type 2 transliteration. [Figure: finite state transducer diagram with separate vowel and consonant parts. Vowel transition label sets: L1 = {a, e, i, o, u, Ǐ, ŕ}, L2 = {a, e, i}. Consonant transition label sets: L1 = {k, g, c, j, t, d, b, m, y, r, f, v, s, h, l, n, p}, L2 = {k, g, c, j, t, d, b, s, p}.]
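As a rough illustration of Type 2 transliteration, the Python sketch below applies greedy longest-match lookup over a small subset of the Type 2 alphabet. This is a simplified stand-in, not the transducer in the figure: the table covers only independent vowels and consonants with the inherent "a", it has no vowel signs, and the names `TYPE2` and `transliterate_type2` are assumptions made for this example. A word such as Ambepussa therefore falls outside what the sketch can handle.

```python
from typing import Dict, List

# Subset of the Type 2 alphabet: romanized unit -> Sinhala letter.
TYPE2: Dict[str, str] = {
    "a": "අ", "aa": "ආ", "ae": "ඇ", "i": "ඉ", "u": "උ", "e": "එ", "o": "ඔ",
    "ka": "ක", "ga": "ග", "ta": "ත", "da": "ද", "na": "න", "pa": "ප",
    "ba": "බ", "ma": "ම", "mba": "ඹ", "ya": "ය", "ra": "ර", "la": "ල",
    "va": "ව", "sa": "ස", "ha": "හ",
}
MAX_KEY = max(len(key) for key in TYPE2)


def transliterate_type2(word: str) -> str:
    """Greedy longest-match lookup; raises ValueError on uncovered input."""
    text = word.lower()
    pieces: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest possible unit first, then shorter ones.
        for size in range(min(MAX_KEY, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in TYPE2:
                pieces.append(TYPE2[chunk])
                i += size
                break
        else:
            raise ValueError(f"no mapping for {text[i:]!r}")
    return "".join(pieces)


print(transliterate_type2("amba"))
print(transliterate_type2("gama"))
```

Trying the longest unit first is what lets "mba" be read as one letter instead of failing on "m"; handling consonants followed by other vowels would require vowel signs, which this sketch does not model.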
Approach in Practice
Demonstration
Conclusion. Handling the pronunciation of English words is a critical problem in English to Sinhala transliteration. The English letter a represents different sounds in Sinhala, such as අ and ඇ (ago අගෝ, America ඇමෙරිකා, ant ඇන්ට්). English words with the same spelling pattern can be pronounced differently: the two words father and fathom pronounce "fath" differently.
Further work Incorporating English IPA into the system
Thank you!