BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS

Daffodil International University Institutional Repository
DIU Journal of Science and Technology, Volume 8, Issue 1, January 2013
2013-01
BANGLA TO ENGLISH TEXT CONVERSION USING OPENNLP TOOLS
Uddin, Sk. Borhan
Daffodil International University
http://hdl.handle.net/20.500.11948/868
Downloaded from http://dspace.library.daffodilvarsity.edu.bd, Copyright Daffodil International University Library

Sk. Borhan Uddin (1), Dr. Md. Fokhray Hossain (2) and Kamanashis Biswas (3)
(1) Bangladesh Internet Press Limited, Dhaka
(2) Department of CSE, Daffodil International University
(3) Department of CSE, Daffodil International University
Email: oneof.rebel@gmail.com, drfokhray@daffodilvarsity.edu.bd, ananda@daffodilvarsity.edu.bd
Date of submission: 01.12.2011. Date of acceptance: 19.07.2012.

Abstract- Natural language processing is one of the most difficult areas in artificial intelligence, because languages are built on completely different grammatical rules, which complicates the task. The same problem arises when converting Bangla text to English: Bangla grammar has many rules that make the task harder. Although a good deal of research has been done in areas such as Bangla keyboard layout design and English-to-Bangla translation, very little has addressed translating Bangla text into English. In this paper, we develop a system that performs the conversion using the OpenNLP tools, combining statistical and rule-based conversion. We found that this method translates about 40% of text from Bangla to English correctly.

Keywords- Machine Translation, OpenNLP Library, POS Tagging, Statistical Context Free Description

1. Introduction

Since 1976, the European Commission (EC) has used machine translation (MT) to convert text from one language to another. This broad usage has spread the importance of MT widely, and translation techniques have been developed for regular use. Nowadays, Google Translator [1] is one of the pioneering applications, supporting translation among a number of languages. Although it has been implemented successfully for many languages, its support for Bangla is still in a developing phase. Other translators, e.g. Yahoo Babel Fish [2], SDL Free Translation and Systran Language Translation [3], support many languages, such as Danish, English, Chinese, Italian, Japanese, French, Greek and Korean, but not Bangla. Hence, we developed a system that performs both statistical and rule-based conversion to translate Bangla text to English.

2. Major Tools Used in Translation

2.1 OpenNLP Library

OpenNLP is an organizational center for open source projects related to natural language processing [4]. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects. OpenNLP is a Java-based set of NLP tools that perform sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference resolution. These tools can be integrated with other software to assist in the processing of text.

2.2 Statistical context-free description of compound structures

The statistical context-free description of compound structures is used to analyze compound entries. The linguistic description is a context-free grammar with associated linguistic probabilities. For example, to analyze the Bangla sentence আমি ভাত খেয়েছি as a noun phrase [(NP (PRP) (NN) (NN))], the system uses the following rules:

noun-phrase (NP) => .99 noun
pronoun (PRP) => .20 pronoun
noun + noun => .80 noun

2.3 Set of Rules

A set of rules is defined and used to construct a sentence according to the grammar rules. This operation is performed at the primary level: when a simple sentence is entered, it is matched against predefined common rule patterns. If no rule matches, a pattern (rule) is generated on the basis of the specific tense.

2.4 POS Tagging

POS tagging is the process of assigning a part of speech to each word in a sentence. Here we have used the Penn Treebank, the linguistic corpus developed by the University of Pennsylvania. The POS tagger returns an array of tags corresponding to the tokens.

3. Conversion Process

In this section, five major functions of the system are described: (1) Bangla Grammar Detection, (2) POS Tagging, (3) Bangla Parse Tree Generation, (4) Bangla Parse Tree to English Parse Tree Matching and (5) Bangla to English Text Translation.

[Fig. 1: Bangla to English Conversion Process. Flowchart blocks: Input Source; Translator Machine; Detect Grammars; Retrieve Simple Text in English According to Bangla Parsing; Replace Rules Design; Output Generation.]

3.1 Bangla Grammar Detection

In this step, the first goal is to understand the grammar of the input sentence from its components (person, verb, objects, etc.). From the sentence-formation rules of Bangla grammar, it is known that in simple Bangla sentences the verb always sits at the end while the person remains at the beginning. To detect the tense, therefore, we only have to consider the Bangla verb. For example, in the sentence আমি ভাত খেয়েছি the verb is খেয়েছি, and we want to find its tense. We have a pre-defined database that stores all Bangla verb keywords used to detect tense, as shown in Table 1.

Table 1: Tense Mapping for Bangla sentence (Tense_Mapping)

ID     BangSuffix   TensCode
1.3    খ            12
1.1    ই            11
1.2    ম            12
1.4    ম            13
1.5    য়            13
2.1    ম ল          21
2.2    ম ল          21
2.3    ম            21
2.4    খ ম ল        22

BangSuffix holds the verb keyword, and the tense code identifies the tense; for example, tense code 11 means the Bangla present indefinite tense. The Bangla tense is thus obtained from the table.

In the sentence আমি ভাত খেয়েছি, the word আমি determines the person. A table for person and number detection helps to determine the Bangla person. First, আমি is converted into English, i.e. "I"; the table is then searched to detect the person and number. This table is also used to define pronouns. It helps to optimize the code and the searching of data in the database.

Table 2: Person detection table

Pronoun ID   Pronoun   S_P   Person
1            I         S     1
2            We        S     1
3            You       S/P   2
4            He        S     3
5            She       S     3
6            They      P     3

3.2 POS Tagging

POS tagging is the process of assigning a part of speech to each word in a sentence. Here we have used the Penn Treebank, the linguistic corpus developed by the University of Pennsylvania [5]. A sentence is split into tokens, and the POS tagger returns an array of tags and tokens. We have used OpenNLP for POS tagging. Fig. 2 shows the tag set used by OpenNLP [4].

[Fig. 2: Tag set used by OpenNLP [4]]

3.3 Bangla Parse Tree Generation

This step consists of chunking and parsing. The chunker returns phrase names based on the POS tags. Producing a full parse tree builds on these NLP algorithms by grouping the chunked phrases into a tree diagram that illustrates the structure of the sentence. The parsing algorithm is implemented by the OpenNLP library [6]. The parse tree is generated from the chunk string. Fig. 3 shows the Bangla parse tree.

[Fig. 3: Parse tree of Bangla sentence]

Bangla POS tagging is still at the research level, so we represent the Bangla parse tree in English. It represents the real Bangla sentence structure.

3.4 Bangla Parse Tree to English Parse Tree Matching

Most machine translation systems use Chomsky Normal Form (CNF) [7] to define grammar. Here, we have used a sorting method for defining English grammar. Let us look at the structure of Bangla and English sentences according to CNF.

Bangla grammar in CNF:
1. S = NP + NP + PP + VP
2. S = NP + PP + NP + AP

Corresponding English grammar in CNF:
1. S = NP + VP + PP + NP
2. S = NP + AP + NP + PP

Here,
NP = DET + NOUN + PRONOUN
DET = a | an | the
VP = AUXILIARY + PRINCIPAL

From the above example, for each Bangla rule there must be a corresponding English rule for generating the parse tree. In a CFG, we have to define grammar for different sentences.
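
The correspondence between the Bangla and English rule pairs listed above can be illustrated with a small lookup table. The following is a minimal Python sketch and not the authors' implementation (the paper builds on the Java-based OpenNLP library). The names RULE_MAP and to_english_order are hypothetical, and the sketch assumes that chunks sharing a label (for example the two NPs) are consumed from left to right.

```python
from collections import defaultdict, deque

# Bangla constituent order -> English constituent order,
# copied from the two rule pairs listed in this section.
RULE_MAP = {
    ("NP", "NP", "PP", "VP"): ("NP", "VP", "PP", "NP"),
    ("NP", "PP", "NP", "AP"): ("NP", "AP", "NP", "PP"),
}


def to_english_order(chunks):
    """Reorder (label, text) chunks from Bangla order to English order.

    Assumption: chunks with the same label are consumed left to right.
    Returns None when no rule matches the Bangla pattern.
    """
    pattern = tuple(label for label, _ in chunks)
    target = RULE_MAP.get(pattern)
    if target is None:
        return None
    # Keep one queue per label so repeated labels stay in their original order.
    queues = defaultdict(deque)
    for label, text in chunks:
        queues[label].append(text)
    return [(label, queues[label].popleft()) for label in target]


bangla_chunks = [("NP", "subject"), ("NP", "object"), ("PP", "pp"), ("VP", "verb")]
print(to_english_order(bangla_chunks))
```

A table of this kind needs one entry per sentence pattern, so it grows with the number of patterns to be covered.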
Here, however, we have used a sorting method to generate the English parse tree. It helps to build up the statistical knowledge of the system for further conversion. There are also some pre-defined rules that are applied when a common pattern is matched between Bangla and English. Consider the example আমি ভাত খেয়েছি:

S = NP (PRP আমি) + NP (NN ভাত) + VP (VBPP খেয়েছি)

From the English sentence-formation rule we get Subject + Verb + Object, whose corresponding Bangla rule is Subject + Object + Verb. Synchronizing the Bangla rules with the English rules gives:

Corresponding Bangla sentence syntax: I (PRP) + rice (NN) + have eaten (VBA + VBPP)
English sentence: I (PRP) + (have (VBA) + eaten (VBPP)) + rice (NN)
Subject + Verb (Auxiliary + Principal) + Object

A sorting mechanism is implemented to rearrange the words according to the rules and generate the English parse tree.

3.5 Bangla to English Text Translation

After English grammar detection, we pass all the data (words) as parameters to a module known as the sentence construction module. The sentence is constructed in the following steps.

Step 1: To complete a sentence, we first detect the parts of speech in the sentence, so POS tagging has to be done again. For example, the sentence obtained in the previous step is:

Corresponding Bangla sentence syntax: I rice have eaten.

We have detected the tense, which is present perfect, and the person, which is first person singular.

English sentence with POS tags: I (PRP) rice (NN) have (VBA) eaten (VBPP).

Step 2: Rearrange the words according to the priority of their POS tags. The priority value of each part of speech is taken from Table 3.

Table 3: Priority value of POS tag

POS Sort ID   POSTAG   sortindex
1             DT       1
2             PRP      0
3             NNP      1.5
4             VBA      2
5             VBP      2
6             VBPP     3
7             NN       5

Example: আমি ভাত খেয়েছি

I (PRP) + rice (NN) + have (VBA) + eaten (VBPP)

From the priority table we get the values of the parts of speech: PRP = 0; NN = 5; VBA = 2; VBPP = 3. The English sentence (ES) is therefore ES = 0, 5, 2, 3. We can now form the sentence by rearranging these values in ascending order:

ES = 0, 2, 3, 5
ES = PRP + VBA + VBPP + NN
ES = I + have + eaten + rice

Finally, we get: I have eaten rice.

4. Example Illustrating Sequential Steps

In this section, we show the conversion steps for another simple sentence according to rule-based analysis. First, assume a simple sentence: আমি ভাত খাই

Step 1: Split the sentence.
আমি + ভাত + খাই

Step 2: Detect the tense from the verb.
আমি + ভাত + খাই (Bangla verb: খাই)
Output: From the database and the defined rules, we get that ই represents the present indefinite tense. (Grammar detection)

Step 3: Detect the person by comparing with the data from the database.
আমি + ভাত + খাই (Bangla person: আমি)
Output: Indicates the 1st person, singular number. (Detect person)

Step 4: Insert the corresponding English words from the database.
I + rice + eat
Output: I rice eat. (Data fetch: words are fetched from the database)

Step 5: POS tag the sentence to get the Bangla parse tree, then parse the whole sentence.
PRP/I, NN/rice, VBP/eat
Output: (S (NP (PRP/I) (NN/rice)) (VP (VBP/eat))) (Sentence parsing)

Step 6: Search for the sentence pattern in the database.
BS pattern: PRP + NN + VBP
Corresponding ES pattern: PRP + VBP + NN
Output: (S (NP (PRP/I)) (VP (VBP/eat) (NP (NN/rice)))) (English parse tree generation)

Step 7: Get the sentence from the parse tree.
PRP/I, VBP/eat, NN/rice
Output: I eat rice. (Final output)

5. Algorithm

It is almost impossible to develop an algorithm that is highly efficient at translating one language into another. In particular, languages with a huge number of grammatical rules cause many problems; for instance, a highly accurate result requires more conversion time. Hence, the target is a close translation from which the sentence can be understood.

Step1W: Get a page or input from any source.
Step2W: Split the Bangla sentence into words and find the tense, person, etc.
Step3W: Send the text to the Machine Translator as a parameter.
================MT==================
Step 1: Receive the input as text.
Step 2: Determine the Bangla tense from the Bangla verb (by scanning the last Bangla word) and get the person from the first Bangla word.
Step 3: Find the English words corresponding to the Bangla words.
Step 4: For each sentence do
   a. Tokenize the sentence.
   b. Tag the parts of speech.
   c. Divide into chunks and generate the parse tree.
   d. Find the appropriate rules.
   e. Rearrange the words according to the rules.
Step 5: Print the English sentence.
=============END OF MT==============
Step5W: Replace the Bangla content with the English content.
Step6W: Show the target language.
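
For a simple three-word sentence, the translator steps of this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' system: the paper relies on the Java-based OpenNLP tools for tagging and parsing, whereas here the POS tags are hard-coded in a hypothetical mini-lexicon (LEXICON). The example sentence is assumed to read আমি ভাত খাই ("I eat rice"), the suffix table contains only the one row used in the worked example (ই, tense code 11), and all function names are made up for illustration. The sort indices are those of Table 3.

```python
# Verb suffix -> tense code (single row from Table 1; 11 = present indefinite).
TENSE_BY_SUFFIX = {"ই": 11}

# Subset of Table 2: English pronoun -> (number, person).
PERSON_TABLE = {
    "I": ("S", 1),
    "You": ("S/P", 2),
    "He": ("S", 3),
    "She": ("S", 3),
    "They": ("P", 3),
}

# Table 3: POS tag -> sort index.
SORT_INDEX = {"DT": 1, "PRP": 0, "NNP": 1.5, "VBA": 2, "VBP": 2, "VBPP": 3, "NN": 5}

# Hypothetical mini-lexicon: Bangla word -> (English word, POS tag).
LEXICON = {"আমি": ("I", "PRP"), "ভাত": ("rice", "NN"), "খাই": ("eat", "VBP")}


def detect_tense(verb):
    """Return the tense code whose suffix ends the verb, or None."""
    for suffix, code in TENSE_BY_SUFFIX.items():
        if verb.endswith(suffix):
            return code
    return None


def rearrange(tagged):
    """Sort (word, tag) pairs by sort index; sorted() is stable, so ties keep their order."""
    return sorted(tagged, key=lambda pair: SORT_INDEX[pair[1]])


def translate(sentence):
    """Translate a simple Bangla sentence whose words are all in LEXICON."""
    words = sentence.split()                  # split the sentence
    tense = detect_tense(words[-1])           # the verb is the last word
    tagged = [LEXICON[w] for w in words]      # fetch English words and tags
    person = PERSON_TABLE.get(tagged[0][0])   # the person is the first word
    ordered = rearrange(tagged)               # reorder by sort index
    english = " ".join(word for word, _ in ordered) + "."
    return english, tense, person


print(translate("আমি ভাত খাই"))
```

The same rearrange function reproduces the ordering of the present perfect example, turning I (PRP), rice (NN), have (VBA), eaten (VBPP) into PRP + VBA + VBPP + NN. Words missing from the lexicon raise a KeyError in this sketch; a real system would need a fallback.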
The algorithm that we developed likewise produces a conversion that is not one hundred percent accurate for complex sentences, but the meaning can be understood. The steps of the algorithm are listed above.

6. Conclusion

A number of critical issues always make natural language processing tasks more complex. There are many exceptions that violate the normal rules of grammar, and it is really tough to keep track of all those situations. Hence, the efficiency of translating languages with complex grammatical rules is not very high. We have implemented the system and found that, from a chunk of text, it translates 40% of sentences correctly on average. We observed that for simple sentences the system readily responds with the correct answer (e.g. আপম খ থ হয়ত আসয়?, Where are you coming from?), but for complex sentences (e.g. আ র গত ল বই ক র রয়ত মগয় ম ল, ম ন ত আ র খ মর য়রম ল We went buy book yesterday but we were late.) it requires more time and in some cases cannot translate properly.

6.1 Future Work

The main challenge in Bangla to English text conversion is the grammatical rules. If a complete format for all rules and exceptions can be made, the task will be simpler. Efficient AI techniques and indexing and searching mechanisms will improve the overall system, which may result in more accurate output.

References

[1] Google Translator, www.translate.google.com, accessed on September 12, 2011.
[2] Yahoo Babel Fish, www.babelfish.yahoo.com, accessed on September 12, 2011.
[3] Jin Yang and Elke D. Lange, Systran on Altavista: A User Study on Real-Time Machine Translation on the Internet, Lecture Notes in Computer Science, 1998, Volume 1529/1998, 275-285, DOI: 10.1007/3-540-49478-2_25
[4] OpenNLP, www.maxent.sourceforge.net, accessed on September 12, 2011.
[5] Penn Treebank Project, www.cis.upenn.edu/~treebank/, accessed on September 12, 2011.
[6] The OpenNLP, www.opennlp.sourceforge.net/projects.html, accessed on September 12, 2011.
[7] Chomsky Normal Form, www.en.wikipedia.org/wiki/chomsky_normal_form, accessed on September 12, 2011.

Sk. Borhan Uddin graduated in Computer Science and Engineering from Daffodil International University, Dhaka, Bangladesh. His major research interests include natural language processing, robotics and neural networks. At present, he is working as a software engineer at Bangladesh Internet Press Limited.

Dr. Md. Fokhray Hossain obtained his B.Sc. (Honors) and M.Sc. in Physics from Jahangirnagar University in 1991 and was subsequently appointed as a research fellow at Dhaka University in 1993. Dr. Hossain obtained his Ph.D. from the University of Glamorgan, UK, in 1998 through the Overseas Development Administration Shared Scholarship Scheme (ODASSS). At present, he is working as registrar at Daffodil International University.

Kamanashis Biswas completed his postgraduate studies in security engineering at BTH, Sweden. His major research interests include artificial intelligence, algorithms, and computer and network security issues. At present, he is working as Assistant Professor at Daffodil International University, Dhaka, Bangladesh.