English to Marathi Rule-based Machine Translation of Simple Assertive Sentences


G.V. Garje, G.K. Kharate and M.L. Dhore

Abstract

This paper presents a proposed system for machine translation of simple assertive English sentences into their Marathi counterparts using a rule-based approach. The system takes a simple assertive English sentence as input and performs lexical analysis to produce tokens (lexemes). Each token generated by the lexical analyzer is searched for in the English dictionary (lexicon). If the token is present in the lexicon, it is retrieved along with its morphological information. Local word grouping is then performed, if required, based on the morphological information of each token. The local word groups thus formed are checked against the grammar rules of English for syntactic validity of the input sentence. The syntax is checked using a bottom-up parsing technique. For syntactically correct input, the corresponding target language (Marathi) token is searched for in the target language dictionary. If Marathi tokens are found for all English tokens, the Marathi sentence is generated using Marathi grammar rules. The paper emphasizes the development of production rules for simple English sentences and an innovative rearrangement of target language tokens.

Index Terms

Language Translation, Lexical Analysis, Local Word Grouping, Local Word Separator, Machine Translation, Morphological Analysis, NLP.

I. INTRODUCTION

India is a multilingual, multicultural country where 22 official languages and approximately 2000 dialects are spoken by different communities [2]. English and Hindi are used for official work in the majority of Indian states, while state governments predominantly carry out their official work in their regional languages such as Hindi, Bengali, Marathi, Tamil, Kannada, Telugu, Punjabi, Gujarati and Oriya.
The people of different states use these regional languages for oral as well as written communication. All official documents and reports of the Union government are published in English or in both English and Hindi. Translating these documents manually into a regional language is a very time-consuming and costly task. Hence there is a need to develop good machine translation (MT) systems in order to establish better communication between the state and Union governments and to support the exchange of information among people of different states who speak different regional languages.

G.V. Garje is an Associate Professor and Head of the Department of Computer Engineering, PVG's College of Engineering & Technology, Pune, Maharashtra, India (gvg_comp@pvgcoet.ac.in).
G.K. Kharate is Principal at Matoshri College of Engineering & Research Centre, Eklahare, Nashik, Maharashtra, India (gkkharate@rediffmail.com).
M.L. Dhore is a Professor of Computer Engineering at Vishwakarma Institute of Technology, Pune, Maharashtra, India (manikdhore@vit.edu).

English continues to be the link language in India. Machine translation has great significance in breaking the language barrier within the sociological and regional structure [1]. A few MT systems from English to Indian languages have been developed for specific domains, and the work is still going on [2]. Developing general-purpose English to Indian language MT systems is difficult because of the complex and free word order nature of Indian languages. English has simple, complex and compound sentences. Simple sentences are further subdivided into interrogative, assertive, negative and exclamatory sentences. Developing a tool for each sub-type of simple sentence and integrating these tools into a full-fledged MT tool could be a better option. Marathi is a low-resource Indian language.
A tool for translating simple interrogative English sentences to Marathi has already been attempted [1]. We propose a system for translating simple assertive English sentences into Marathi sentences. Machine translation has different architectures such as Direct, Transfer-Based, Interlingua, Statistical, Example-Based and Hybrid. Each has its advantages and disadvantages, and the approach can be selected based on the application domain. We have selected the transfer-based architecture with a rule-based approach to implementation [2][8].

II. RELATED WORK

The field of Natural Language Processing has emerged in its own right, and a large number of research groups around the world are working on it. In India, too, individual researchers, organizations and consortia have worked continuously over the last 15 years on MT systems for English to Indian languages and between Indian languages.

A few noteworthy systems follow. The Anusaaraka project (1995), started by Akshar Bharati at IIT Kanpur and now continued at IIIT Hyderabad, targets Indian language to Indian language MT and has been tested on translating simple Telugu sentences into the corresponding Hindi sentences. MANTRA (1997, 1999) is an MT system started by Akshar Bharati and further developed by Hemant Darbari and Manish Kumar Pande for the Rajya Sabha Secretariat (the Upper House of the Parliament of India) to translate parliamentary proceedings such as papers to be laid on the Table of the House, Bulletin Part-I and Bulletin Part-II. Anglabharti (2001) and Anglabharti-II (2004) are MT systems developed by R.M.K. Sinha et al. for English to Indian languages. The Shiva and Shakti MT systems (2003) were designed using an example-based approach and a combination of rule-based and statistical approaches, respectively. The Shakti system works for three target languages: Hindi, Marathi and Telugu. Shiva and Shakti are two English to Hindi MT systems developed jointly by Carnegie Mellon University, USA, IIIT Hyderabad and IISc Bangalore, India. The MaTra system (2004, 2006), developed by Ananthakrishnan R. et al., uses a transfer-based approach to translate news, annual reports and technical phrases from English to Hindi.

A consortium of ten institutions, namely C-DAC Mumbai, IISc Bangalore, IIIT Hyderabad, C-DAC Pune, IIT Mumbai, Jadavpur University Kolkata, IIIT Allahabad, Utkal University Bhubaneswar, Amrita University Coimbatore and Banasthali Vidyapeeth, Banasthali, is developing EILMT (2006-), an MT system for English to Indian languages in the tourism and healthcare domains. This project is funded by the Department of Information Technology, MCIT, Government of India. The consortium has developed the Sampark MT system (2009). The role of C-DAC Mumbai is to develop statistical models and resources for a statistical MT (SMT) system from English to Hindi, Marathi and Bengali.

A rule-based English to Urdu MT system using the transfer approach was developed by Naila Ata et al. at Karachi, Pakistan [6]. They handled case phrases and verb postpositions through the concept of Paninian grammar. Dilshan De Silva et al. developed a Sinhala to English translator with various built-in tools such as a grammar tool, dictionary, Unicode fonts, debugging tool and add-word tool [4]. Many more MT systems have been developed in India and abroad.

III. SYSTEM OVERVIEW

English has an SVO (Subject-Verb-Object) structure, whereas Marathi follows an SOV structure and has relatively free word order [1][8]. The overall architecture of the system is depicted in Fig. 1. The input to the system is a simple assertive English sentence such as "He is going to school." The Lexical Analyzer splits the input sentence into tokens (lexemes) separated by delimiters. The Lexicon (dictionary) contains a set of known words along with their complete morphological information such as root, category, case, gender and number. The Morphological Analyzer accepts a token and checks whether that token is present in the lexicon. If the token is present, the system retrieves its complete morphological information.

IV. COMPONENTS OF THE SYSTEM

The proposed system is composed of the following components:
1) Lexical Analyzer
2) Morphological Analyzer
3) Parser
4) Mapping module
5) Local word Grouper
6) Tokens rearrangement
7) Transliteration

Fig. 1. Architecture of proposed system

A. Lexical Analyzer

This module splits the given English sentence into tokens and removes delimiters.
Input: Sentence
Output: Tokens/Lexemes

B. Morphological Analyzer

Given a word, the morphological analyzer identifies the root and the grammatical features of the word. For languages that are not rich in inflections, a simple lookup dictionary that contains all the word forms would be sufficient. Creating such a dictionary for inflectionally rich languages, however, is nontrivial and requires huge storage and high-performance computing. The best alternative is to keep a dictionary of root words and to attach the grammatical prefixes and suffixes during target language generation. Part of Speech (POS) tagging is the process of assigning a part of speech to each word in the sentence.
Identifying the part of speech (noun, verb, adjective, adverb and so on) of each word helps in analyzing the role of each constituent in a sentence.
Input: Token
Output: Gender, Number, Person, Tense, Root

Example:
Input: He is going to school
Output of morphological analyzer:
He [Root: he, Gender: male, Person: third, Number: singular, Category: pronoun]
is [Root: is, Category: auxiliary]
going [Root: go, Category: verb]
to [Root: to, Category: preposition]
school [Root: school, Category: noun, Gender: female, Number: singular, Person: third, Type: common]

C. Shallow Parser

Shallow parsing involves identifying simple noun phrases, verb groups, adjectival phrases and adverbial phrases in a sentence. It also involves identifying the boundaries of chunks and labeling them. Normally in language processing, sentences are parsed to identify their syntactic structure. There are more similarities than differences between Indian languages; for example, the Marathi and Hindi language pair does not require a full parse. In this system, MT is performed without a full sentential parser. Structural transformation is required when the source language structure does not have an equivalent structure in the target language. A partial or shallow parse is sufficient to identify the specific constituents in the sentence that have to undergo transformation. This component also includes the task of transliteration. The transliteration is done among Indian languages. Transliteration allows a word or words to be rendered in the script of the reader.
Input: Tokens
Output: Syntactically checked sentence

D. Mapping Module

In this module, each English word is mapped to the corresponding Marathi word. A pratyay (inflection) is attached to verbs and nouns according to gender, number, person and tense, and to adjectives according to the noun or pronoun they accompany.

E. Local word Grouper

Adjectives are grouped with their corresponding nouns and pronouns (not all of these need be present in every case). If only a pronoun or only a noun exists, it is considered as one group. Adverbs are grouped with their corresponding verbs. Adverbs may be absent; in such cases a single verb acts as one group.
Prepositions are grouped with the respective noun or pronoun.

Transforming Algorithm: In the Local Word Grouper (LWG) [3], all tokens are grouped after morphological information has been assigned to them. The pronoun "your" and the noun "name" become a noun phrase (NP) according to the rules given below. The NP and VP-PPS become NP-VP-PPS. Finally, the pronoun "what", the auxiliary verb "is" and NP-VP-PPS form a sentence S. In this way LWG takes place using a bottom-up parsing technique. The rules used in LWG for "What is your name?" are as follows:

PRONOUN := he | NULL
AUX := is | NULL
NP := name | NULL
NP := PRONOUN NP
VP := VERB
VERB := NULL
PP := PREP NP
PREP := NULL
VP-PPS := VP PP
NP-VP-PPS := NP VP-PPS
S := PRONOUN AUX NP-VP-PPS

Fig. 2. Parse tree generated by syntax analysis

The structure of the valid sentence, represented as a parse tree, is shown in Fig. 2. Syntactic analysis exploits the result of morphological analysis to build a structural description of the sentence. A shallow parser is designed for the source language, and the lexicon is built to store the root words of the source language followed by those of the target language. The rules above form a simple context-free phrase structure grammar for English. The Local Word Separator separates tokens from the sentence generated in LWG so that the corresponding token can be searched for in the target language dictionary. The Mapping Block maps the tokens of the source language (English) to the corresponding tokens of the target language (Marathi).
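The first stages described above (lexical analysis, lexicon lookup and local word grouping) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names LEXICON, tokenize, analyze and group are hypothetical, the lexicon holds just the five entries of the example sentence, and the grouping rule is a simplifying assumption (each group is closed by the next noun, pronoun or verb), so it does not cover every grouping rule of Section E.

```python
import re

# Hypothetical toy lexicon; fields mirror the morphological analyzer example.
LEXICON = {
    "he": {"root": "he", "category": "pronoun", "gender": "male",
           "person": "third", "number": "singular"},
    "is": {"root": "is", "category": "auxiliary"},
    "going": {"root": "go", "category": "verb"},
    "to": {"root": "to", "category": "preposition"},
    "school": {"root": "school", "category": "noun", "gender": "female",
               "number": "singular", "person": "third", "type": "common"},
}

# Assumption: a local word group is closed by one of these categories.
HEAD_CATEGORIES = {"noun", "pronoun", "verb"}


def tokenize(sentence):
    """Lexical analysis: lower-case the sentence and drop delimiters."""
    return re.findall(r"[a-z']+", sentence.lower())


def analyze(tokens):
    """Lexicon lookup: pair each token with its morphological entry.

    Raises KeyError if a token is missing from the lexicon.
    """
    return [(token, LEXICON[token]) for token in tokens]


def group(analyzed):
    """Local word grouping: collect tokens until a head word closes the group."""
    groups, pending = [], []
    for token, info in analyzed:
        pending.append(token)
        if info["category"] in HEAD_CATEGORIES:
            groups.append(pending)
            pending = []
    if pending:  # trailing tokens with no head word form their own group
        groups.append(pending)
    return groups


tokens = tokenize("He is going to school.")
word_groups = group(analyze(tokens))
print(word_groups)
```

Under these assumptions the auxiliary is grouped with its verb and the preposition with its noun, which matches the grouping rules stated for prepositions and verbs.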

Though Marathi is a relatively free word order language, a few units occur in fixed order. Noun phrases (NP) and verb phrases (VP) can be formed using only local information and, more importantly, they provide sufficient information for further processing of the sentence. In this way the LWG provides all the necessary information with minimal computational effort.

F. Tokens Rearrangement Algorithm

The local word groups need to be rearranged in order to generate a valid target language (Marathi) sentence. After studying a number of research papers and analyzing a set of sentences, we observed a regular pattern in most of the sentences and propose a rearrangement algorithm based on it: keeping the first group (the noun phrase) as it is and reversing the sequence of the remaining groups produces a valid Marathi sentence. This technique worked for most of the sentences tested using the algorithm presented in this paper. We found it simpler to understand and implement than the other methods we examined. It constitutes the major part of this research work.

Algorithm:
1. Read in the English sentence
2. Perform lexical analysis, i.e. obtain tokens
3. Perform morphological analysis, i.e. retrieve each token from the dictionary along with its morphology
4. Check the syntax of the input sentence using production rules
   If (syntax is okay) {
      Retrieve corresponding Marathi tokens
      Perform mapping of English tokens to corresponding Marathi tokens
      Attach inflections (pratyays) to Marathi tokens
   } else go to step 1
5. Apply the Tokens rearrangement algorithm
6. Perform transliteration (use any transliteration tool)
7. Generate the target language (Marathi) sentence
8. End

Tokens rearrangement Algorithm:
1. Traverse the input sentence from left to right
2. While (not end of sentence) {
      Keep the position of the NP/Noun intact and traverse the sentence, skipping the NP/Noun, till the end
      Reverse the sentence till the NP/Noun
      Map source language tokens to target language tokens
   }
3. End

Example:
Input English Sentence: I went to the market with my mother to buy apples.
Word to Word Transliterated Mapping for Local Language: mi gelo la bajara barobara mazhya AI NyAsAThi kharida saparchamda
Lexical Word Grouping: mi gelo bajarala mazhya AIbarobara kharidanyasathi saparchamda
After Applying Rearrangement Algorithm: mi saparchamda kharidanyasathi mazhya AIbarobara bajarala gelo
Translation in Marathi using Transliteration:

G. Transliteration

After the rearrangement, the text, which is transliterated in Roman script, is converted into Devanagari script. Many transliteration tools are available to convert source language script to target language script (Roman to Devanagari in this case). Here it is done using the Akshara Bridge tool.

V. MATHEMATICAL MODEL

The overall process of rule-based translation is described by the following mathematical model. The terminology used in the model is:

STT - Source Translation Token, equivalent to a single word of English
TTT - Target Translation Token, equivalent to a word in Marathi
$S = \{STT_1, STT_2, \ldots, STT_n\}$ - legitimate input sentence in the source language English, represented as a set of tokens
$T = \{TTT_1, TTT_2, \ldots, TTT_n\}$ - output sentence in the target language Marathi, represented as a set of tokens
$T_s \subset T$ - subset of tokens of category Noun
$T_v \subset T$ - subset of tokens of category Verb
$T_o \subset T$ - subset of tokens of category Object
$T_a \subset T$ - subset of tokens of category Article
$T_p \subset T$ - subset of tokens of category Preposition
$T_u \subset T$ - subset of tokens of category Auxiliary verb
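The rearrangement rule of Section IV-F (keep the first local word group in place and reverse the order of the remaining groups) can be illustrated with a minimal Python sketch. The function name rearrange_groups and the representation of groups as a list of strings are assumptions made for illustration, not the authors' implementation; the data is the worked example from Section IV-F.

```python
def rearrange_groups(groups):
    """Keep the first group in place and reverse the order of the rest."""
    if not groups:
        return []
    return [groups[0]] + list(reversed(groups[1:]))


# Local word groups from the worked example
# ("I went to the market with my mother to buy apples").
grouped = ["mi", "gelo", "bajarala", "mazhya AIbarobara",
           "kharidanyasathi", "saparchamda"]

rearranged = " ".join(rearrange_groups(grouped))
print(rearranged)
# mi saparchamda kharidanyasathi mazhya AIbarobara bajarala gelo
```

Operating on whole groups rather than single tokens keeps units such as "mazhya AIbarobara" intact while the SVO order of the English source is turned into the SOV order of Marathi.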

The model further uses the following:

$C \subseteq S \times T$ is a bilingual corpus/dictionary of root words.
$M : STT_i \rightarrow TTT_i$ is a translation model used to directly map $STT_i \in S$ to $TTT_i \in T$ using the bilingual corpus $C$.
$LWG(T)$ is the language model used to perform local word grouping using the grammar of the target language Marathi, based on whether an article, preposition or auxiliary verb appears before a TTT.

Step 1: Input sentence $S$.

Step 2: $n = \mathrm{count}(S)$ counts the number of tokens (words) in the sentence $S$.

Step 3: $TTT_i = M(STT_i)$ for $i = 1, \ldots, n$. Translate individual tokens from the source to the destination language using corpus $C$, giving $T = \{TTT_1, \ldots, TTT_n\}$.

Step 4: For $i = 1$ to $n$, tag individual tokens (words) of the translated sentence $T$:

$$
\mathrm{tag}(TTT_i) =
\begin{cases}
s & \text{if } TTT_i \in T_s \\
v & \text{if } TTT_i \in T_v \\
o & \text{if } TTT_i \in T_o \\
a & \text{if } TTT_i \in T_a \\
p & \text{if } TTT_i \in T_p \\
u & \text{if } TTT_i \in T_u
\end{cases}
$$

Step 5: Apply $LWG(T)$ over the tag patterns sa, aps, sap, pv, vp, ps, sp, uv, vu. The language model is used for local word grouping using the grammar of Marathi.

Step 6: $T = TTT_1$ followed by $TTT_i$ for $i = n$ down to $2$, where $TTT_1 \in T$ and $TTT_i \in T$.

Step 7: Translation in Marathi using transliteration.

VI. TARGET LANGUAGE (MARATHI) GENERATION

A. Noun

1. Karant to noun: If a preposition is present in the noun phrase (NP), the karant pratyaya is applied to the NP as per Table I. E.g., "He came with me": for "me" := preposition-noun, maza becomes mazya.

2. Shasti pratyaya to noun: If a noun in the sentence has ('s), then a pratyaya (cha, chi, che etc.) is applied to it depending on the gender of the noun following it. E.g., "That is Goraksha's book": "Goraksha's" becomes Gorakshache.

3. Karant to proper noun: No karant is applied to a proper noun. E.g., "He came for Alekh": here Alekh is a proper noun, hence no karant is applied to it.

Likewise, rules are written for all cases of nouns, pronouns, adjectives and participles. Some of the rules are shown in Table I.

B. Verb and Auxiliary

The pratyaya of a verb depends on the tense of the sentence, gender, number and person, as shown in Table II and Table III. The output is generated in Roman script.
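As an illustration of selecting a verb pratyaya from person and number, the following minimal Python sketch encodes only the masculine present-tense endings that Table III lists for the root ja. The function name attach_pratyaya and the dictionary layout are assumptions for illustration; other tenses and genders are not covered.

```python
# Masculine present-tense endings for the root "ja", taken from Table III.
# Keys are (person, number); the layout itself is a hypothetical choice.
PRESENT_MALE_PRATYAYA = {
    (1, "singular"): "to",
    (1, "plural"): "to",
    (2, "singular"): "tos",
    (2, "plural"): "ta",
    (3, "singular"): "to",
    (3, "plural"): "tat",
}


def attach_pratyaya(root, person, number):
    """Return the Roman-script verb form as root-pratyaya, e.g. ja-tos."""
    return f"{root}-{PRESENT_MALE_PRATYAYA[(person, number)]}"


print("tu", attach_pratyaya("ja", 2, "singular"))   # tu ja-tos
print("te", attach_pratyaya("ja", 3, "plural"))     # te ja-tat
```

A full generator would need one such table per tense and gender, selected from the morphological information retrieved for the sentence.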
The Roman-script output can be converted to Marathi script using existing software.

TABLE I
KARANT

Gender   Rule                                                         Examples
Male     Noun of a-karant changes to aa-karant                        Ram → Rama (singular); Nag → Nagan (plural)
Male     Noun of aa-karant changes to ya-karant                       Amba → Ambya (singular); Killa → Killya (plural)
Female   Noun of a-karant changes to aa in singular and e in plural   Jibh → Jibhe (singular) and Jibha (plural)
Female   Noun of aa-karant changes to e-karant                        Shala → Shale (singular) and Shala (plural); Bhasha → Bhashe (singular) and Bhasha (plural)

TABLE II
ADJECTIVE/PARTICIPLE/PRONOUN

Number     Male   Female   Neuter
Singular   a      i        e
Plural     e      ya       e

TABLE III
VERB FOR PRESENT TENSE IN MARATHI, ROOT JA (MALE)

Person   Singular    Plural
Ist      mi ja-to    amhi ja-to
IInd     tu ja-tos   tumhi ja-ta
IIIrd    to ja-to    te ja-tat

VII. RESULTS

Following are some of the test cases with output according to the algorithm presented in this paper:
1) I am nice
2) She chose him

3) It was done by his mother
4) My father told me to work myself for his movie
5) They themselves will be very enthusiastic for their nice and amazing movie
6) I went to market with my mother to buy apples
7) She has a nice doll having blue eyes
8) My favourite book was stolen
9) My dog should go running for food
10) I have been speaking with him
11) My father had been eating mango
12) I did it for my father

VIII. CONCLUSION AND FUTURE SCOPE

This paper presents a system for machine translation of simple assertive English sentences into their Marathi counterparts. It follows the transfer approach with rule-based translation and emphasizes assertive sentences and a reordering algorithm for target language generation. It is difficult to frame generalized rules for Marathi because the grammars of English and Marathi differ substantially. The system was tested on 115 different simple assertive sentences using our production rules and produced satisfactory results.

The major challenge in machine translation is to resolve ambiguity in the meaning of words in a sentence. For example, in "He is standing near the bank" the word "bank" has two possible senses: the bank of a river or a money bank. Resolving lexical and structural ambiguity will be a big challenge for researchers. English grammar sometimes allows the sequence of words to change without changing the meaning of the sentence; e.g., "Should Ram have gone to the store?" can be written as "Should have Ram gone to the store?". The former sentence is translated correctly by our system but the latter is not. To allow such flexibility, the rules need to be made more general.

The translation of simple assertive sentences discussed in this paper can be extended to other sub-types of simple sentences such as imperative, negative and exclamatory sentences. The scope can be further expanded to compound and complex sentences.

REFERENCES

[1] Goraksh V. Garje, Manisha Marathe, Urmila Adsule, "Translation of Simple Interrogative English Sentences to Marathi Sentences," in Proceedings of ICWET '10, Mumbai, Maharashtra, India, February 26-27, 2010, pp.
[2] G.V. Garje, G.K. Kharate, (2013, Oct.) "Survey of Machine Translation Systems in India," International Journal on Natural Language Computing (IJNLC) [Online], Vol. 2, No. 4, pp. Available:
[3] P. Ray, V. Harish, S. Sarkar, and A. Basu, "Part of Speech Tagging and Local Word Grouping Techniques for Natural Language Parsing in Hindi," in Proceedings of the 1st International Conference on Natural Language Processing (ICON), Mysore, India, 2003, pp.
[4] Dilshan De Silva, Asanga Alahakoon, Imesha Udayangani, Vishva Kumara, Devinda Kolonnage, Harindu Perera, and Samantha Thelijjagoda, "Sinhala to English Language Translator," in Proceedings of the 4th International Conference on Information and Automation for Sustainability (ICIAFS), Colombo, Sri Lanka, 2008, pp.
[5] Dr. Shridhar Shanvare, Abhinav Marathi Vyakaran, Marathi Lekhan, Vidya Vikas Mandal, Nagpur.
[6] Naila Ata, Bushra Jawaid, Amir Kamran, "Rule based English to Urdu Machine Translation," in Proceedings of the Conference on Language and Technology (CLT '07), University of Karachi, Pakistan, 2007, pp.
[7] Rajiv Sangal, Vineet Chaitanya, Natural Language Processing: a Paninian Perspective, Akshar Bharati Group, PHI publication.
[8] R.M.K. Sinha, A. Jain, "AnglaHindi: an English to Hindi Machine-Aided Translation System," in Proceedings of the 9th MT Summit (MTS), New Orleans, USA, Sep. 2003, pp.
[9] Uzair Muhammad, Atif Khan, "Handling Proper Nouns in Machine Translation from English into Urdu," Journal of Information & Communication Technology, Vol. 1, No. 2, 2007, pp.
