CONTRIBUTIONS TO ENGLISH TO HINDI MACHINE TRANSLATION USING EXAMPLE-BASED APPROACH. by DEEPA GUPTA. Department of Mathematics.

CONTRIBUTIONS TO ENGLISH TO HINDI MACHINE TRANSLATION USING EXAMPLE-BASED APPROACH by DEEPA GUPTA Department of Mathematics Submitted in fulfilment of the requirement of the degree of Doctor of Philosophy to the Indian Institute of Technology Delhi Hauz Khas, New Delhi-110016, India January, 2005

Dedicated to My Parents, My Brother Ashish and My Thesis Supervisor...

Certificate

This is to certify that the thesis entitled "Contributions to English to Hindi Machine Translation Using Example-Based Approach" submitted by Ms. Deepa Gupta to the Department of Mathematics, Indian Institute of Technology Delhi, for the award of the degree of Doctor of Philosophy, is a record of bona fide research work carried out by her under my guidance and supervision. The thesis has reached the standards fulfilling the requirements of the regulations relating to the degree. The work contained in this thesis has not been submitted to any other university or institute for the award of any degree or diploma.

Dr. Niladri Chatterjee
Assistant Professor
Department of Mathematics
Indian Institute of Technology Delhi
Delhi (INDIA)

Acknowledgement

If I say that this is my thesis it would be totally untrue. It is like a dream come true. There are people in this world, some of them so wonderful, who helped in making this dream a product that you are holding in your hand. I would like to thank all of them, and in particular:

Dr. Niladri Chatterjee - mentor, guru and friend, who taught me the basics of research and stayed with me right till the end. His efforts, comments, advice and ideas developed my thinking and improved my way of presentation. Without his constant encouragement, keen interest, inspiring criticism and invaluable guidance, I would not have accomplished my work. I admit that his efforts need much more acknowledgement than expressed here.

I acknowledge and thank the Indian Institute of Technology Delhi and Tata Infotech Research Lab, who funded this research. I sincerely thank all the faculty members of the Department of Mathematics; in particular, I express my gratitude to Prof. B. Chandra and Dr. R. K. Sharma for providing me continuous moral support and help. I thank my SRC members, Prof. Saroj Kaushik and Prof. B. R. Handa, for their time and efforts. I also thank the department administrative staff for their assistance.

I extend my thanks to Prof. R. B. Nair and Dr. Wagish Shukla of IIT Delhi, and Prof. Vaishna Narang, Prof. P. K. Pandey, Prof. G. V. Singh, Dr. D. K. Lobiyal and Dr. Girish Nath Jha of Jawaharlal Nehru University Delhi, for the enlightening discussions on the basics of languages.

I would like to express my sincere thanks to my friends Priya and Dharmendra for many fruitful discussions regarding my research problem. I thank Mr. Gaurav Kashyap for helping me in the implementation of the algorithms. In particular, I would like to thank Inderdeep Singh for his help in writing some part of the thesis.

I want to give special thanks to my friends Sonia, Pranita and Nutan for helping me in both good and bad times. I would like to thank Prabhakhar for his brotherly support. I extend my thanks to Manju, Anita, Sarita, Subhashini and Anju for always cheering me.

Shailly and Geeta - amazing friends who read the manuscript and gave honest comments. Both of them also stayed with me in the process, and handled me, and sometimes my out-of-control emotions, so well. Especially, I wish to extend my thanks to Geeta for letting me stay in her hostel room, and also for her wonderful help when my leg got fractured, when we had known each other for only a month.

I wish to acknowledge Krishna for his constant help, both academic and non-academic, and his continuous encouragement.

I convey my sincere regards to my parents and brothers for the sacrifices they have made, for the patience they have shown, and for the love and blessings they have showered. I thank Arun for his moral support. Most important of all, I would like to express my profound sense of gratitude and appreciation to my sister Neetu. Her irrational and unbreakable belief in me bordered on craziness at times.

I cannot fail to mention my friend Sharad, who deserves more than a little acknowledgement. His constant inspiration and untiring support have sustained my confidence throughout this work.

Finally, I thank GOD for everything.

Deepa Gupta

Abstract

This research focuses on the development of an Example-Based Machine Translation (EBMT) system for English to Hindi. Development of a machine translation (MT) system typically demands a large volume of computational resources. For example, rule-based MT systems require extraction of syntactic and semantic knowledge in the form of rules, while statistics-based MT systems require a huge parallel corpus containing sentences in the source language and their translations in the target language. EBMT requires far fewer such resources. This makes development of EBMT systems feasible for English to Hindi translation, where large-scale computational resources are still scarce.

The primary motivation for this work is the following:

a) Although a small number of English to Hindi MT systems are already available, the outputs produced by them are not always of high quality. Through this work we intend to analyze the difficulties that lead to this below-par performance, and try to provide some solutions for them.

b) There are several other major languages (e.g., Bengali, Punjabi, Gujarati) in the Indian subcontinent. Demand for developing MT systems from English to these languages is increasing rapidly. At the same time, development of computational resources in these languages is still in its infancy. Since many of these languages are similar to Hindi, syntactically as well as lexically, the research carried out here should help in developing MT systems from English to these languages as well.

The major contributions of this research may be described as follows:

1) Development of a systematic adaptation scheme. We propose an adaptation scheme consisting of ten basic operations. These operations work not only at word level, but at suffix level as well. This makes adaptation less expensive in many situations.

2) Study of Divergence. We observe that the occurrence of divergence causes major difficulty for any MT system. In this work we make an in-depth study of the different types of divergence, and categorize them.

3) Development of Retrieval scheme. We propose a novel approach for measuring similarity between sentences. We suggest that a retrieval strategy, with respect to an EBMT system, will be most efficient if it measures similarity on the basis of cost of adaptation. In this work we provide a complete framework for an efficient retrieval scheme on the basis of our studies on "divergence" and "cost of adaptation".

4) Dealing with Complex sentences. Handling complex sentences by an MT system is generally considered to be difficult. In this work we propose a "split and translate" technique for translating complex sentences under an EBMT framework.

We feel that the overall scheme proposed in this research will pave the way for developing an efficient EBMT system for translating from English to Hindi. We hope that this research will also help the development of MT systems from English to other languages of the Indian subcontinent.
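The idea in contributions 1) and 3), retrieving the stored example that would be cheapest to adapt, can be illustrated with a small sketch. This is not the scheme developed in the thesis: the three cost constants, the prefix-based `shares_stem` test (a crude stand-in for real morphological analysis), the weighted edit distance and the two-sentence example base are all assumptions made purely for illustration. The sketch only shows why a suffix-level change can be charged less than a whole-word replacement, and how such a cost can rank examples.

```python
from typing import List, Tuple

# Hypothetical operation costs, chosen only for illustration.
COST_WORD_REPLACE = 1.0    # replace a whole word
COST_SUFFIX_CHANGE = 0.4   # keep the stem, change only the suffix
COST_INSERT_DELETE = 1.0   # add or drop a word

# Toy example base of (English sentence, Hindi translation) pairs.
EXAMPLE_BASE: List[Tuple[str, str]] = [
    ("He eats rice", "वह चावल खाता है"),
    ("They play football", "वे फुटबॉल खेलते हैं"),
]


def shares_stem(a: str, b: str, min_stem: int = 3) -> bool:
    """Crude assumption: two words share a stem if they have a
    common prefix of at least min_stem characters."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common >= min_stem


def substitution_cost(a: str, b: str) -> float:
    """Cost of turning word a into word b."""
    if a == b:
        return 0.0
    if shares_stem(a, b):
        return COST_SUFFIX_CHANGE
    return COST_WORD_REPLACE


def adaptation_cost(input_tokens: List[str], example_tokens: List[str]) -> float:
    """Weighted edit distance from the example to the input sentence."""
    m, n = len(example_tokens), len(input_tokens)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * COST_INSERT_DELETE
    for j in range(1, n + 1):
        dp[0][j] = j * COST_INSERT_DELETE
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + COST_INSERT_DELETE,   # delete an example word
                dp[i][j - 1] + COST_INSERT_DELETE,   # insert an input word
                dp[i - 1][j - 1]
                + substitution_cost(example_tokens[i - 1], input_tokens[j - 1]),
            )
    return dp[m][n]


def retrieve(sentence: str, example_base: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the stored pair whose English side is cheapest to adapt."""
    tokens = sentence.lower().split()
    return min(
        example_base,
        key=lambda pair: adaptation_cost(tokens, pair[0].lower().split()),
    )


if __name__ == "__main__":
    print(retrieve("He eats mangoes", EXAMPLE_BASE))
```

Under these assumed costs, an input that differs from a stored example only in a verb suffix is scored as closer than one that needs a different word, which is the intuition behind measuring similarity by cost of adaptation rather than by surface overlap alone.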

Contents

1 Introduction
  1.1 Description of the Work Done and Summary of the Chapters
  1.2 Some Critical Points

2 Adaptation in English to Hindi Translation: A Systematic Approach
  2.1 Introduction
  2.2 Description of the Adaptation Operations
  2.3 Study of Adaptation Procedure for Morphological Variation of Active Verbs
    2.3.1 Same Tense Same Verb Form
    2.3.2 Different Tenses Same Verb Form
    2.3.3 Same Tense Different Verb Forms
    2.3.4 Different Tenses Different Verb Forms
  2.4 Adaptation Procedure for Morphological Variation of Passive Verbs
  2.5 Study of Adaptation Procedures for Subject/Object Functional Slot
    2.5.1 Adaptation Rules for Variations in the Morpho Tags of @DN>
    2.5.2 Adaptation Rules for Variations in the Morpho Tags of @GN>
    2.5.3 Adaptation Rules for Variations in the Morpho Tags of @QN>
    2.5.4 Adaptation Rules for Variations in the Morpho Tags of Premodifier Adjective @AN>
    2.5.5 Adaptation Rules for Variations in the Morpho Tags of @SUB
  2.6 Adaptation of Interrogative Words
  2.7 Adaptation Rules for Variation in Kind of Sentences
  2.8 Concluding Remarks

3 An FT and SPAC Based Divergence Identification Technique From Example Base
  3.1 Introduction
  3.2 Divergence and Its Identification: Some Relevant Past Work
  3.3 Divergences and Their Identification in English to Hindi Translation
    3.3.1 Structural Divergence
    3.3.2 Categorial Divergence
    3.3.3 Nominal Divergence
    3.3.4 Pronominal Divergence
    3.3.5 Demotional Divergence
    3.3.6 Conflational Divergence
    3.3.7 Possessional Divergence
    3.3.8 Some Critical Comments
  3.4 Concluding Remarks

4 A Corpus-Evidence Based Approach for Prior Determination of Divergence
  4.1 Introduction
  4.2 Corpus-Based Evidences and Their Use in Divergence Identification
    4.2.1 Roles of Different Functional Tags
  4.3 The Proposed Approach
  4.4 Illustrations and Experimental Results
    4.4.1 Illustration 1
    4.4.2 Illustration 2
    4.4.3 Illustration 3
    4.4.4 Experimental Results
  4.5 Concluding Remarks

5 A Cost of Adaptation Based Scheme for Efficient Retrieval of Translation Examples
  5.1 Introduction
  5.2 Brief Review of Related Past Work
  5.3 Evaluation of Cost of Adaptation
    5.3.1 Cost of Different Adaptation Operations
  5.4 Cost Due to Different Functional Slots and Kind of Sentences
    5.4.1 Costs Due to Variation in Kind of Sentences
    5.4.2 Cost Due to Active Verb Morphological Variation
    5.4.3 Cost Due to Subject/Object Functional Slot
    5.4.4 Use of Adaptation Cost as a Measure of Similarity
  5.5 The Proposed Approach vis-a-vis Some Similarity Measurement Schemes
    5.5.1 Semantic Similarity
    5.5.2 Syntactic Similarity
    5.5.3 A Proposed Approach: Cost of Adaptation Based Similarity
    5.5.4 Drawbacks of the Proposed Scheme
  5.6 Two-level Filtration Scheme
    5.6.1 Measurement of Structural Similarity
    5.6.2 Measurement of Characteristic Feature Dissimilarity
  5.7 Complexity Analysis of the Proposed Scheme
  5.8 Difficulties in Handling Complex Sentences
  5.9 Splitting Rules for Converting Complex Sentence into Simple Sentences
    5.9.1 Splitting Rule for the Connectives "when", "where", "whenever" and "wherever"
    5.9.2 Splitting Rule for the Connective "who"
  5.10 Adaptation Procedure for Complex Sentence
    5.10.1 Adaptation Procedure for Connectives "when", "where", "whenever" and "wherever"
    5.10.2 Adaptation Procedure for Connective "who"
  5.11 Illustrations
    5.11.1 Illustration 1
    5.11.2 Illustration 2
  5.12 Concluding Remarks

6 Discussions and Conclusions
  6.1 Goals and Motivation
  6.2 Contributions Made by This Research
  6.3 Possible extensions
  6.4 Epilogue
    6.4.1 Pre-editing and Post-editing
    6.4.2 Evaluation Measures of Machine Translation

Appendices

A
  A.1 English and Hindi Language Variations
  A.2 Verb Morphological and Structure Variations
    A.2.1 Conjugation of Root Verb

B
  B.1 Functional Tags
  B.2 Morpho Tags

C
  C.1 Definitions of Some Non-typical Functional Tags and SPAC Structures

D
  D.1 Semantic Similarity

E
  E.1 Cost Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective

Bibliography