METEOR-Hindi : Automatic MT Evaluation Metric for Hindi as a Target Language


METEOR-Hindi: Automatic MT Evaluation Metric for Hindi as a Target Language
Ankush Gupta, Sriram Venkatapathy and Rajeev Sangal
Language Technologies Research Centre, IIIT-Hyderabad

NEED FOR MT EVALUATION
MT systems are becoming widespread. How well do they work in practice? Are they reliable enough?
Absolute vs. relative quality
MT is a technology still in the research stage: how can we tell whether we are making progress?
We need metrics that can drive experimental development.
12/10/10 2

Human vs. Automatic MT Evaluation
Human MT evaluations: subjective, expensive, time-consuming, cannot be reused, but most reliable.
Automatic MT evaluations: objective, cheap, fast, reusable. But are they highly correlated with subjective evaluation?

Problems with the BLEU Metric (Papineni et al., 2002)
Sentence-level scores are not based on meaning (Liu et al., 2005; Liu and Gildea, 2006)
Only exact matches
Lack of recall
Equal weightage for all matched words
Admits too much variation by using higher-order n-grams for fluency and grammaticality
Geometric averaging of n-gram precisions
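The geometric-averaging problem can be made concrete with a minimal sentence-level BLEU sketch (simplified: no smoothing, no brevity penalty; the example sentences are invented for illustration). If any n-gram order has zero matches, the geometric mean collapses the whole score to zero, even with strong unigram overlap:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions for n = 1..4, with no smoothing and no
    brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(clipped / total)
    if any(p == 0 for p in precisions):
        return 0.0  # geometric mean collapses to zero
    return exp(sum(log(p) for p in precisions) / max_n)

ref = "the boys are playing cricket in the park".split()
hyp = "boys are playing the cricket in park".split()
print(sentence_bleu(ref, hyp))  # 0.0 despite strong unigram overlap
```

This is exactly why sentence-level BLEU scores so often come out as zero, a point the experiments below return to.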

Important Aspects in English-Hindi MT
Word order is not so important: it does not strictly convey the grammatical roles of words.
Correct case marking
Morphological richness
The boys are playing cricket : लड़के क्रिकेट खेल रहे हैं
The boys bought the book from the market : लड़कों ने बाज़ार से किताब खरीदी
Synonym matching

METEOR (Banerjee and Lavie, 2005)
Creates a word alignment between the reference(s) and the test sentence.
Three word-mapping modules are applied in series: Exact match, Stem match, Synonym match. Each module maps only the words not mapped by an earlier stage.
The score is computed as the harmonic mean of unigram precision and recall.
Additional penalties are computed to capture word order.
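The scoring just described can be sketched as follows. This is a toy, exact-match-only illustration using the parameterization from Banerjee and Lavie (2005) (recall weighted 9:1, penalty 0.5·(chunks/matches)³); the chunk count here is a crude positional approximation, not METEOR's real alignment, and the example sentences are invented:

```python
def meteor_sketch(reference, test):
    """Toy METEOR score with exact unigram matching only.
    Fmean = 10PR / (R + 9P); Penalty = 0.5 * (chunks / matches)^3."""
    matched = [w for w in test if w in reference]
    m = len(matched)
    if m == 0:
        return 0.0
    precision = m / len(test)
    recall = m / len(reference)
    fmean = 10 * precision * recall / (recall + 9 * precision)
    # Crude chunk count: runs of matched words whose reference positions
    # are consecutive (ignores duplicates; real METEOR aligns properly).
    pos = [reference.index(w) for w in matched]
    chunks = 1 + sum(1 for a, b in zip(pos, pos[1:]) if b != a + 1)
    penalty = 0.5 * (chunks / m) ** 3
    return fmean * (1 - penalty)

ref = "the boys are playing cricket".split()
print(meteor_sketch(ref, "the boys are playing cricket".split()))  # 0.996
print(meteor_sketch(ref, "cricket playing are boys the".split()))  # 0.5
```

Both test sentences match every unigram, but the scrambled one is broken into five chunks, so the fragmentation penalty halves its score.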

Advantages of METEOR for English-Hindi MT Evaluation
Flexible word matching (exact, stem and synonym).
Unigram matching does not rely entirely on word order, making it suitable for Hindi, a relatively free word-order language.
Uses and emphasizes recall in addition to precision.
Parameterized features can be tuned for different types of human judgements and for different languages (Lavie and Denkowski, 2009).

METEOR-Hindi Aligner
An extended implementation of METEOR to support evaluation of translation into Hindi.
The word alignment algorithm is unchanged.
New stemming module for Hindi: the Hindi Morph Analyzer is used.
New synonym module for Hindi: Hindi Wordnet 1.2 is used.

METEOR-Hindi Parameters
Apart from word-based features, METEOR-Hindi also uses other linguistic features: Local Word Group, Part-of-Speech and Clause match.

METEOR-Hindi Parameters
Local Word Group (LWG) match: an LWG consists of a content word and its associated function word(s).
Reference: बिल्ली ने कुत्ते को मारा (The cat hit the dog)
Test1: कुत्ते को बिल्ली ने मारा
Test2: कुत्ते ने बिल्ली को मारा

Reference    Test1        Test2
बिल्ली ने      बिल्ली ने      बिल्ली को
कुत्ते को      कुत्ते को      कुत्ते ने
मारा          मारा          मारा

METEOR-Hindi Parameters
Part-of-Speech matching: compute the number of matching words that carry the same POS tag. A CRF POS tagger is used.
Reference: राम खेल रहा है
Test: खेल राम रहा है

Reference   POS     Test    POS
राम          NN      खेल      NNP
खेल          VM      राम      NNP
रहा          VAUX    रहा      VM
है           VAUX    है       VAUX
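Counting POS matches reduces to counting (word, tag) pairs that appear in both taggings. A minimal sketch, using transliterated stand-ins for the slide's Hindi example ("Ram is playing") with illustrative tags rather than actual CRF-tagger output:

```python
from collections import Counter

def pos_matches(ref_tagged, test_tagged):
    """Count (word, POS) pairs shared between reference and test taggings,
    clipping repeated pairs to their reference frequency."""
    ref_counts = Counter(ref_tagged)
    test_counts = Counter(test_tagged)
    return sum(min(n, ref_counts[pair]) for pair, n in test_counts.items())

# Transliterated stand-ins; tags are illustrative, not real tagger output.
ref  = [("Ram", "NNP"), ("khel", "VM"), ("raha", "VAUX"), ("hai", "VAUX")]
test = [("khel", "NNP"), ("Ram", "NNP"), ("raha", "VM"), ("hai", "VAUX")]
print(pos_matches(ref, test))  # Ram/NNP and hai/VAUX match -> 2
```

Note that scrambling the word order can change the tags the tagger assigns, which is what makes this a word-order-sensitive signal despite being computed over unigrams.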

METEOR-Hindi Parameters
Clause match: a clause is a phrase containing exactly one verb (finite or non-finite). A Hindi clause boundary identifier is used.
Reference: रोहित स्कूल जाकर कूदने लगा (Rohit went to school and started jumping)
Test: रोहित स्कूल जाकर खेलने लगा (Rohit went to school and started playing)

Reference clauses: [रोहित स्कूल जाकर] [कूदने लगा]
Test clauses: [रोहित स्कूल जाकर] [खेलने लगा]

METEOR-Hindi Scoring Function
Parameters used for scoring in METEOR-Hindi:

Stage              Features
Exact Match        Precision, Recall
Stem Match         Precision, Recall
Synonym Match      Precision, Recall
Local Word Group   Precision, Recall
Part-of-Speech     Precision, Recall
Clause Match       Precision, Recall

Score: s = Σ(Wi * fi) / Σ(Wi)   [Wi : weight of feature i]
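The scoring formula translates directly into code; a minimal sketch of the weighted average (the two feature values in the usage example are invented for illustration):

```python
def meteor_hindi_score(features, weights=None):
    """s = sum(W_i * f_i) / sum(W_i): a weighted average of the
    precision/recall features. With all weights set to 1 this
    reduces to the plain mean of the feature values."""
    if weights is None:
        weights = [1.0] * len(features)
    return sum(w * f for w, f in zip(weights, features)) / sum(weights)

# Two illustrative feature values; uniform weights give their mean.
print(meteor_hindi_score([0.6, 0.5]))  # 0.55
# Non-uniform weights shift the score toward the heavier feature.
print(meteor_hindi_score([1.0, 0.0], weights=[3, 1]))  # 0.75
```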

METEOR-Hindi Scoring Function
The word-order penalty of METEOR is not used in METEOR-Hindi.
The general form of the scoring equation facilitates the use of standard machine learning techniques to tune the weights.
Due to the unavailability of high-quality training data, all weights are currently set to 1.

Example
Reference: रोहित ने बाज़ार से पुस्तक खरीदी
Test: बाज़ार से रोहित किताब खरीदना
Exact matches: बाज़ार, से, रोहित
Stem matches: बाज़ार, से, रोहित, खरीदना
Synonym matches: बाज़ार, से, रोहित, किताब, खरीदना
LWG matches: बाज़ार से, किताब, खरीदना
POS matches: बाज़ार, से, किताब, खरीदना
Clause matches: (none)

Example
Exact match precision: f1 = 3/5
Exact match recall: f2 = 3/6
Stem match precision: f3 = 4/5
Stem match recall: f4 = 4/6
Synonym match precision: f5 = 5/5
Synonym match recall: f6 = 5/6
LWG precision: f7 = 3/4
LWG recall: f8 = 3/4
POS precision: f9 = 4/5
POS recall: f10 = 4/6
Clause precision: f11 = 0/5
Clause recall: f12 = 0/6

Example
Score = Σ(Wi * fi) / Σ(Wi) = (1*3/5 + 1*3/6 + ... + 1*0/6) / 12 = 0.613
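The arithmetic of this worked example can be reproduced exactly with uniform weights over the twelve feature values listed on the previous slide:

```python
from fractions import Fraction as F

# The twelve feature values from the worked example
# (precision, recall for Exact, Stem, Synonym, LWG, POS, Clause):
features = [F(3, 5), F(3, 6), F(4, 5), F(4, 6), F(5, 5), F(5, 6),
            F(3, 4), F(3, 4), F(4, 5), F(4, 6), F(0, 5), F(0, 6)]
weights = [1] * 12  # all weights set to 1, as in the paper

score = sum(w * f for w, f in zip(weights, features)) / sum(weights)
print(score, float(score))  # 221/360, i.e. 0.6138..., the slide's 0.613
```

The exact value is 221/360 ≈ 0.6139, which the slide reports truncated to 0.613.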

Tools Used
Morph Analyzer: Hindi Morph 2.5.2
Hindi Wordnet: Hindi Wordnet 1.2 (Jha et al., 2001)
CRF Part-of-Speech (POS) tagger (PVS and Karthik G, 2007)
Hindi Local Word Grouper (Bharati et al., 1998)
Hindi Clause Boundary Identifier (developed at IIIT-H as part of the ILMT project)

Experiments and Results
Dataset of 100 sentences; 60 test translations from System1 and 40 from System2.

Number of sentences: 100
Avg. test sentence length: 11.24
Avg. reference sentence length: 11.23
Exact matches: 433
Stem matches: 574
Synonym matches: 622
LWG matches: 426
POS matches: 576
Clause matches: 9

Experiments and Results

Metric          Features                                       Correlation
BLEU            -                                              0.271
METEOR-Hindi    Exact                                          0.656
METEOR-Hindi    Exact + Stem                                   0.687
METEOR-Hindi    Exact + Stem + Synonym                         0.700
METEOR-Hindi    Exact + Stem + Synonym + LWG                   0.681
METEOR-Hindi    Exact + Stem + Synonym + POS                   0.703
METEOR-Hindi    Exact + Stem + Synonym + Clause                0.658
METEOR-Hindi    Exact + Stem + Synonym + LWG + POS + Clause    0.666
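The correlation figures above are Pearson coefficients between metric scores and human judgements. A minimal sketch of the computation (the sentence-level scores below are invented for illustration, not the paper's data):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative (invented) sentence-level scores:
metric = [0.61, 0.40, 0.75, 0.20, 0.55]
human  = [0.70, 0.45, 0.80, 0.30, 0.50]
print(round(pearson(metric, human), 3))
```

A coefficient near 1 means the metric ranks sentences much as humans do; BLEU's many zero scores flatten its variance and drag its coefficient down.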

Experiments and Results
The reason for the low correlation of BLEU is that it assigns a score of zero to most sentences.

Experiments and Results
[Chart comparing METEOR-Hindi and human scores]

Experiments and Results
The highest correlation (0.703) was obtained using the Exact, Stem, Synonym and POS features.
Using linguistic features (stemming, synonyms) resulted in better correlation.
Surprisingly, the LWG and Clause features did not increase correlation.
Orthographic errors in some reference sentences (incorrect vowel and case-marker forms) also hurt matching.

Experiments and Results
Adding the Clause feature also decreased correlation:
Only 9 clauses matched across the 100 sentences.
Most sentences have only one verb, so the entire sentence is a single clause.
Since a clause is a much higher-level unit than a word, scores should be penalized less when clauses do not match.

Experiments and Results
[Figure: scatterplot of BLEU vs. human scores]

Experiments and Results
[Figure: scatterplot of METEOR-Hindi vs. human scores]

Experiments and Results

Metric          Average Score
BLEU            0.0815
METEOR-Hindi    0.4919
Human           0.615

Compared MT System1 and System2. METEOR-Hindi correlated better with human judgements than BLEU. Human annotators as well as METEOR-Hindi gave a higher score to System2, while BLEU ranked System1 higher.

Experiments and Results

Metric          System     Average Score    Pearson Correlation
BLEU            System1    0.041            0.343
METEOR-Hindi    System1    0.460            0.712
Human           System1    0.567            -
BLEU            System2    0.032            0.163
METEOR-Hindi    System2    0.540            0.684
Human           System2    0.688            -

Future Work
Train METEOR-Hindi on a large amount of high-quality data and find optimal weights for the various parameters.
Use additional features, such as paraphrase match, to achieve better correlation.
Put the tool online for others to use.

REFERENCES
Ananthakrishnan et al., 2007. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU.
Papineni et al., 2002. BLEU: A Method for Automatic Evaluation of Machine Translation.
Satanjeev Banerjee and Alon Lavie, 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.
Alon Lavie and Abhay Agarwal, 2007. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments.
Alon Lavie and Michael Denkowski, 2009. The METEOR Metric for Automatic Evaluation of Machine Translation.
Bharati et al., 1991. Local Word Grouping and its Relevance to Indian Languages.
Avinesh PVS and Karthik G., 2007. Part-of-Speech Tagging and Chunking using Conditional Random Fields and Transformation Based Learning.

REFERENCES
Chris Callison-Burch et al., 2006. Re-evaluating the Role of BLEU in Machine Translation Research.
Alon Lavie et al., 2004. The Significance of Recall in Automatic Metrics for MT Evaluation.
Liu et al., 2004. Syntactic Features for Evaluation of Machine Translation.
Liu et al., 2005. Stochastic Iterative Alignment for Machine Translation Evaluation.
Jha et al., 2001. A WordNet for Hindi.

THANK YOU