METEOR-Hindi: Automatic MT Evaluation Metric for Hindi as a Target Language
Ankush Gupta, Sriram Venkatapathy and Rajeev Sangal
Language Technologies Research Centre, IIIT-Hyderabad
NEED FOR MT EVALUATION
- MT systems are becoming widespread: how well do they work in practice? Are they reliable enough?
- Absolute vs. relative quality
- MT is a technology still in the research stage: how can we tell whether we are making progress?
- We need metrics that can drive experimental development.
12/10/10 2
Human vs. Automatic MT Evaluation
Human MT evaluation:
- Subjective, expensive, time-consuming, cannot be reused
- But the most reliable
Automatic MT evaluation:
- Objective, cheap, fast, reusable
- Is it highly correlated with subjective (human) evaluation?
Problems with the BLEU Metric (Papineni et al., 2002)
- Sentence-level scores are not based on meaning (Liu et al., 2005; Liu and Gildea, 2006)
- Only exact matches are counted
- Lack of recall
- Equal weightage for all matched words
- Admits too much variation while using higher-order n-grams for fluency and grammaticality
- Geometric averaging of n-gram precisions
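The geometric-averaging problem is easy to see concretely. Below is a minimal, illustrative sketch of the sentence-level BLEU core (clipped n-gram precisions, no brevity penalty; not the official implementation): if any single n-gram order has zero matches, the whole score collapses to zero.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_geometric(reference, test, max_n=4):
    """Simplified sentence-level BLEU core: geometric mean of clipped
    n-gram precisions for n = 1..max_n (brevity penalty omitted)."""
    precisions = []
    for n in range(1, max_n + 1):
        ref, hyp = ngrams(reference, n), ngrams(test, n)
        total = sum(hyp.values())
        matched = sum(min(c, ref[g]) for g, c in hyp.items())
        precisions.append(matched / total if total else 0.0)
    if any(p == 0.0 for p in precisions):
        return 0.0  # one zero precision zeroes out the geometric mean
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the boys are playing cricket in the park".split()
hyp = "the boys play cricket in a park".split()
print(bleu_geometric(ref, hyp))  # 0.0: no 4-gram (or 3-gram) matches
```

This is exactly why many reasonable translations receive a sentence-level BLEU of zero, which in turn depresses BLEU's sentence-level correlation with human judgements.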
Important Aspects in English-Hindi MT
- Word order is not so important: it does not strictly convey the grammatical roles of words.
- Correct case marking
- Morphological richness
- Examples: "The boys are playing cricket" → लड़के क्रिकेट खेल रहे हैं; "The boys brought the book from the market" → लड़कों ने बाज़ार से किताब खरीदी
- Synonym matching
METEOR (Banerjee and Lavie, 2005)
- Creates a word alignment between the reference(s) and the test sentence
- Three word-mapping modules run in series: exact match, stem match, synonym match
- Each module maps only the words left unmapped by the earlier stage(s)
- The score is the harmonic mean of unigram precision and recall
- An additional penalty is computed to capture word order
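The staged matching described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; `stem` and `synonyms` are placeholders for language-specific resources such as a morph analyzer and a wordnet.

```python
def staged_matches(reference, test, stem=None, synonyms=None):
    """Staged unigram matching in the spirit of METEOR: each stage only
    considers words left unmatched by earlier stages, and the final
    score is the harmonic mean of unigram precision and recall."""
    stem = stem or (lambda w: w)
    synonyms = synonyms or (lambda w: {w})
    stages = [
        lambda r, t: r == t,              # 1. exact match
        lambda r, t: stem(r) == stem(t),  # 2. stem match
        lambda r, t: t in synonyms(r),    # 3. synonym match
    ]
    ref_left, test_left = list(reference), list(test)
    matched = 0
    for match in stages:
        for t in list(test_left):
            for r in ref_left:
                if match(r, t):
                    matched += 1
                    ref_left.remove(r)
                    test_left.remove(t)
                    break
    precision = matched / len(test)
    recall = matched / len(reference)
    fmean = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, fmean

# Toy English example with a one-entry synonym table (hypothetical data).
ref = "the boy bought a book".split()
hyp = "a boy purchased the novel".split()
syn = lambda w: {"bought": {"bought", "purchased"}}.get(w, {w})
print(staged_matches(ref, hyp, synonyms=syn))  # precision, recall, harmonic mean
```

Real METEOR additionally searches for the alignment with the fewest crossings and applies a fragmentation penalty; both are omitted here for brevity.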
Advantages of METEOR for Eng-Hin MT Evaluation
- Flexible word matching (exact, stem and synonym)
- Unigram matching: does not rely entirely on word order, so it suits Hindi, a relatively free word-order language
- Uses and emphasizes recall in addition to precision
- Parameterized features: can be tuned for different types of human judgements and for different languages (Lavie and Denkowski, 2009)
METEOR-Hindi Aligner
- Extended implementation of METEOR to support evaluation of translation into Hindi
- The word-alignment algorithm is unchanged
- New stemming module for Hindi: the Hindi morph analyzer is used
- New synonym module for Hindi: Hindi WordNet 1.2 is used
METEOR-Hindi Parameters
- Apart from the word-based features, METEOR-Hindi also uses other linguistic features: local word group (LWG), part-of-speech (POS) and clause match.
METEOR-Hindi Parameters
Local word group (LWG) match: an LWG consists of a content word and its associated function word(s).
Reference: बिल्ली ने कुत्ते को मारा
Test1: कुत्ते को बिल्ली ने मारा
Test2: कुत्ते ने बिल्ली को मारा

Reference | Test1 | Test2
बिल्ली ने | बिल्ली ने | बिल्ली को
कुत्ते को | कुत्ते को | कुत्ते ने
मारा | मारा | मारा

Test1 preserves all the reference LWGs despite the reordering; Test2 swaps the case markers, so its LWGs do not match.
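A rough sketch of how such groups can be formed, under the simplifying assumption that function words (case markers) attach to the preceding content word. The romanized function-word list below is hypothetical and purely illustrative; the actual system uses the rule-based Hindi local word grouper of Bharati et al.

```python
# Hypothetical romanized Hindi case markers, for illustration only.
FUNCTION_WORDS = {"ne", "ko", "se", "me"}

def local_word_groups(tokens):
    """Attach each function word to the preceding content word,
    yielding (content word, function word...) tuples."""
    groups, current = [], []
    for tok in tokens:
        if tok in FUNCTION_WORDS and current:
            current.append(tok)          # case marker joins the open group
        else:
            if current:
                groups.append(tuple(current))
            current = [tok]              # start a new group at a content word
    if current:
        groups.append(tuple(current))
    return groups

print(local_word_groups("billi ne kutte ko maara".split()))
# [('billi', 'ne'), ('kutte', 'ko'), ('maara',)]
```

LWG matching then compares these tuples between reference and test, so a swapped case marker (as in Test2 above) breaks the match even when every individual word is present.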
METEOR-Hindi Parameters
Part-of-speech (POS) match: count the matching words that carry the same POS tag; a CRF POS tagger is used.
Reference: राम खेल रहा है
Test: खेल राम रहा है

Reference | POS | Test | POS
राम | NN | खेल | NNP
खेल | VM | राम | NNP
रहा | VAUX | रहा | VM
है | VAUX | है | VAUX
METEOR-Hindi Parameters
Clause match: a clause is a phrase containing exactly one verb (finite or non-finite); the Hindi clause boundary identifier is used.
Reference: रोहित स्कूल जाकर कूदने लगा
Test: रोहित स्कूल जाकर खेलने लगा

Reference clauses: [रोहित स्कूल जाकर] [कूदने लगा]
Test clauses: [रोहित स्कूल जाकर] [खेलने लगा]
METEOR-Hindi Scoring Function
Parameters used for scoring in METEOR-Hindi (each stage contributes two features):
- Exact match: precision, recall
- Stem match: precision, recall
- Synonym match: precision, recall
- Local word group: precision, recall
- Part-of-speech: precision, recall
- Clause match: precision, recall
Score: s = (Σ W_i * f_i) / (Σ W_i), where W_i is the weight of feature f_i.
METEOR-Hindi Scoring Function
- The METEOR penalty is not used in METEOR-Hindi
- The general form of the scoring equation facilitates the use of standard machine-learning techniques for weight tuning
- Due to the unavailability of high-quality training data, all weights are currently set to 1
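A minimal sketch of this scoring function; with the talk's default of all weights equal to 1, it reduces to a plain average of the feature values.

```python
def meteor_hindi_score(features, weights=None):
    """Weighted average of feature values: s = sum(W_i * f_i) / sum(W_i).
    With all W_i = 1 (the current default) this is the mean of f_i."""
    weights = weights or [1.0] * len(features)
    return sum(w * f for w, f in zip(weights, features)) / sum(weights)

print(meteor_hindi_score([0.5, 1.0]))  # equal weights -> plain mean: 0.75
```

Because the score is linear in the weights, the weights could later be fit to human judgements with standard regression once training data is available.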
Example
Reference: रोहित ने बाज़ार से पुस्तक खरीदी
Test: बाज़ार से रोहित किताब खरीदना
- Exact matches: बाज़ार, से, रोहित
- Stem matches: बाज़ार, से, रोहित, खरीदना
- Synonym matches: बाज़ार, से, रोहित, किताब, खरीदना
- LWG matches: बाज़ार से, किताब, खरीदना
- POS matches: बाज़ार, से, किताब, खरीदना
- Clause matches: none
Example
- Exact match precision f1 = 3/5; recall f2 = 3/6
- Stem match precision f3 = 4/5; recall f4 = 4/6
- Synonym match precision f5 = 5/5; recall f6 = 5/6
- LWG precision f7 = 3/4; recall f8 = 3/4
- POS precision f9 = 4/5; recall f10 = 4/6
- Clause precision f11 = 0/5; recall f12 = 0/6
Example
Score = (Σ W_i * f_i) / (Σ W_i) = (1*3/5 + 1*3/6 + ... + 1*0/6) / 12 = 0.613
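The arithmetic above can be checked directly; with all twelve weights set to 1, the score is just the mean of the feature values.

```python
# Feature values f1..f12 from the worked example.
features = [3/5, 3/6, 4/5, 4/6,   # exact P/R, stem P/R
            5/5, 5/6, 3/4, 3/4,   # synonym P/R, LWG P/R
            4/5, 4/6, 0/5, 0/6]   # POS P/R, clause P/R
score = sum(features) / len(features)   # all W_i = 1, so sum(W_i) = 12
print(round(score, 3))  # 0.614
```

This agrees with the 0.613 reported on the slide up to rounding.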
Tools Used
- Morph analyzer: Hindi Morph 2.5.2
- Hindi WordNet 1.2 (Jha et al., 2001)
- CRF part-of-speech (POS) tagger (PVS and Karthik G, 2007)
- Hindi local word grouper (Bharati et al., 1998)
- Hindi clause boundary identifier (developed at IIIT-H as part of the ILMT project)
Experiments and Results
Dataset: 100 sentences; 60 test translations from System1 and 40 from System2.
- Number of sentences: 100
- Avg. test sentence length: 11.24
- Avg. reference sentence length: 11.23
- Exact matches: 433
- Stem matches: 574
- Synonym matches: 622
- LWG matches: 426
- POS matches: 576
- Clause matches: 9
Experiments and Results
Metric | Features | Correlation
BLEU | - | 0.271
METEOR-Hindi | Exact | 0.656
METEOR-Hindi | Exact + Stem | 0.687
METEOR-Hindi | Exact + Stem + Synonym | 0.700
METEOR-Hindi | Exact + Stem + Synonym + LWG | 0.681
METEOR-Hindi | Exact + Stem + Synonym + POS | 0.703
METEOR-Hindi | Exact + Stem + Synonym + Clause | 0.658
METEOR-Hindi | Exact + Stem + Synonym + LWG + POS + Clause | 0.666
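The correlations above are Pearson coefficients between per-sentence metric scores and human judgements. A minimal sketch of that computation, shown on hypothetical scores rather than the actual evaluation data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists,
    e.g. per-sentence metric scores vs. human judgements."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-sentence scores, for illustration only.
metric = [0.61, 0.45, 0.80, 0.30]
human  = [0.70, 0.50, 0.90, 0.40]
print(round(pearson(metric, human), 3))
```

A coefficient near 1 means the metric ranks sentences nearly the same way human judges do; the table shows METEOR-Hindi variants reaching around 0.70 against BLEU's 0.271.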
Experiments and Results
- The reason for BLEU's low correlation is that it assigns a score of zero to most sentences.
Experiments and Results
[Figure: METEOR-Hindi vs. human scores]
Experiments and Results
- The highest correlation (0.703) is obtained using the Exact, Stem, Synonym and POS features.
- Using the linguistic features (stemming, synonyms) resulted in better correlation.
- Surprisingly, the LWG and Clause features did not increase correlation.
- One cause: spelling errors in the reference sentences themselves (a few words written with incorrect matras), which hurt matching.
Experiments and Results
- Adding the Clause feature also decreased correlation:
- Only 9 clauses matched across the 100 sentences
- Most sentences have only one verb, so the entire sentence is a single clause
- Since a clause is a much higher-level unit than a word, scores should be penalized less when clauses fail to match
Experiments and Results
[Figure: scatterplot of BLEU vs. human scores]
Experiments and Results
[Figure: scatterplot of METEOR-Hindi vs. human scores]
Experiments and Results
Metric | Average score
BLEU | 0.0815
METEOR-Hindi | 0.4919
Human | 0.615
- Compared MT System1 and System2
- METEOR-Hindi correlated better with human judgements than BLEU
- Human annotators and METEOR-Hindi both gave System2 the higher score, while BLEU ranked System1 higher
Experiments and Results
Metric | System | Average score | Pearson correlation
BLEU | System1 | 0.041 | 0.343
METEOR-Hindi | System1 | 0.460 | 0.712
Human | System1 | 0.567 | -
BLEU | System2 | 0.032 | 0.163
METEOR-Hindi | System2 | 0.540 | 0.684
Human | System2 | 0.688 | -
Future Work
- Train METEOR-Hindi on a large amount of high-quality data to find optimal weights for the various parameters
- Use additional features, such as paraphrase match, to achieve better correlation
- Put the tool online for others to use
REFERENCES
- Ananthakrishnan et al., 2007. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU.
- Papineni et al., 2002. BLEU: a Method for Automatic Evaluation of Machine Translation.
- Satanjeev Banerjee and Alon Lavie, 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.
- Alon Lavie and Abhay Agarwal, 2007. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments.
- Alon Lavie and Michael Denkowski, 2009. The METEOR Metric for Automatic Evaluation of Machine Translation.
- Bharati et al., 1991. Local Word Grouping and its Relevance to Indian Languages.
- Avinesh PVS and Karthik G., 2007. Part-of-Speech Tagging and Chunking Using Conditional Random Fields and Transformation Based Learning.
REFERENCES
- Chris Callison-Burch et al., 2006. Re-evaluating the Role of BLEU in Machine Translation Research.
- Alon Lavie et al., 2004. The Significance of Recall in Automatic Metrics for MT Evaluation.
- Liu et al., 2004. Syntactic Features for Evaluation of Machine Translation.
- Liu et al., 2005. Stochastic Iterative Alignment for Machine Translation Evaluation.
- Jha et al., 2001. A WordNet for Hindi.
THANK YOU