METEOR-Hindi : Automatic MT Evaluation Metric for Hindi as a Target Language

1 METEOR-Hindi: Automatic MT Evaluation Metric for Hindi as a Target Language
Ankush Gupta, Sriram Venkatapathy and Rajeev Sangal
Language Technologies Research Centre, IIIT-Hyderabad

2 NEED FOR MT EVALUATION
- MT systems are becoming widespread
  - How well do they work in practice?
  - Are they reliable enough?
  - Absolute vs. relative quality
- MT is a technology still in the research stage
  - How can we tell if we are making progress?
  - Metrics that can drive experimental development
12/10/10

3 Human vs. Automatic MT Evaluation
Human MT evaluations:
- Subjective
- Expensive
- Time-consuming
- Cannot be reused
- Most reliable
Automatic MT evaluations:
- Objective
- Cheap
- Fast
- Reusable
- Highly correlated with subjective evaluation?

4 Problems with BLEU Metric (Papineni et al., 2002)
- Sentence-level scores not based on meaning (Liu et al., 2005; Liu and Gildea, 2006)
- Only exact matches
- Lack of recall
- Equal weightage
- Admits too much variation by using higher-order n-grams for fluency and grammaticality
- Geometric averaging of n-grams

5 Important Aspects in English-Hindi MT
- Word order is not so important: it does not strictly convey the grammatical roles of words
- Correct case marking
- Morphological richness
  The boys are playing cricket
  लड़के क्रिकेट खेल रहे हैं
  The boys bought the book from the market
  लड़कों ने बाज़ार से किताब खरीदी
- Synonym matching

6 METEOR (Banerjee and Lavie, 2005)
- Creates a word alignment between the reference(s) and the test sentence
- Three word-mapping modules applied in series:
  - Exact match
  - Stem match
  - Synonym match
- Each module maps only the words not mapped in an earlier stage
- Score is computed as the harmonic mean of unigram precision and recall
- Additional penalties are computed to capture the order of words
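The staged word mapping described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the METEOR implementation: `toy_stem`, `TOY_SYNONYMS` and the English example sentences are hypothetical stand-ins for a real stemmer and a WordNet lookup, the greedy first-match alignment is a simplification, and the word-order penalty is left out.

```python
def toy_stem(word):
    # Hypothetical stemmer: strips a plural "s" only.
    return word[:-1] if word.endswith("s") else word

# Hypothetical synonym sets standing in for a WordNet lookup.
TOY_SYNONYMS = [{"bought", "purchased"}]

def is_synonym(a, b):
    return any(a in s and b in s for s in TOY_SYNONYMS)

# Modules are tried in this order: exact, then stem, then synonym.
MODULES = [
    ("exact", lambda a, b: a == b),
    ("stem", lambda a, b: toy_stem(a) == toy_stem(b)),
    ("synonym", is_synonym),
]

def staged_alignment(test, ref, modules=MODULES):
    """Greedily map test words to reference words, one module at a time."""
    free_test = list(range(len(test)))   # test positions not yet mapped
    free_ref = list(range(len(ref)))     # reference positions not yet mapped
    alignment = []
    for name, matches in modules:
        for i in list(free_test):        # iterate over a copy; we remove below
            for j in free_ref:
                if matches(test[i], ref[j]):
                    alignment.append((i, j, name))
                    free_test.remove(i)
                    free_ref.remove(j)
                    break                # leave the inner loop after mutating it
    return alignment

def harmonic_mean(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

test = "the boys purchased books".split()
ref = "the boy bought a book".split()
alignment = staged_alignment(test, ref)
precision = len(alignment) / len(test)   # mapped words / test length
recall = len(alignment) / len(ref)       # mapped words / reference length
score = harmonic_mean(precision, recall)
```

Because each later module only sees words left unmapped by the earlier ones, a word matched exactly is never re-matched as a stem or synonym.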

7 Advantages of METEOR for English-Hindi MT Evaluation
- Flexible word matching (exact, stem and synonym)
- Unigram matching: does not rely entirely on word order, so it suits Hindi, a relatively free word-order language
- Uses and emphasizes recall in addition to precision
- Parametrized features: can be tuned for different types of human judgements and for different languages (Lavie and Denkowski, 2009)

8 METEOR-Hindi Aligner
- Extends the METEOR implementation to support evaluation of translation into Hindi
- The word alignment algorithm is unchanged
- New stemming module for Hindi: uses the Hindi Morph Analyzer
- New synonym module for Hindi: uses Hindi Wordnet 1.2

9 METEOR-Hindi Parameters
In addition to word-based features, METEOR-Hindi uses other linguistic features: Local Word Group, Part-of-Speech and Clause Match.

10 METEOR-Hindi Parameters
Local Word Group Match: a local word group consists of a content word and its associated function word(s).
Reference: बल न क त क म र
Test1: क त क बल न म र
Test2: क त न बल क म र

Reference | Test1 | Test2
बल न | बल न | बल क
क त क | क त क | क त न
म र | म र | म र
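A local word group match can be counted by treating each group as one unit and comparing the two sentences as multisets of groups. The sketch below assumes the sentences have already been split into groups by a local word grouper. The romanized tokens are hypothetical stand-ins chosen for illustration, not a transcription of the slide's Hindi.

```python
from collections import Counter

def group_match(test_groups, ref_groups):
    """Return (precision, recall) over word groups, ignoring group order."""
    common = Counter(test_groups) & Counter(ref_groups)  # multiset intersection
    matched = sum(common.values())
    return matched / len(test_groups), matched / len(ref_groups)

# Hypothetical groups: (content word, function word...) as tuples.
reference = [("billi", "ne"), ("kutte", "ko"), ("maaraa",)]
test1 = [("kutte", "ko"), ("billi", "ne"), ("maaraa",)]   # same groups, reordered
test2 = [("kutte", "ne"), ("billi", "ko"), ("maaraa",)]   # case markers swapped

p1, r1 = group_match(test1, reference)
p2, r2 = group_match(test2, reference)
```

Reordering whole groups leaves the score unchanged, while swapping the case markers breaks two of the three groups, which is the behaviour wanted for a free word-order language with case marking.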

11 METEOR-Hindi Parameters
Part-of-Speech Matching: compute the number of matching words that have the same POS tag. A CRF POS tagger was used.
Reference: र म ख ल रह ह
Test: ख ल र म रह ह

Reference | POS | Test | POS
र म | NN | ख ल | NNP
ख ल | VM | र म | NNP
रह | VAUX | रह | VM
ह | VAUX | ह | VAUX

12 METEOR-Hindi Parameters
Clause Match: a clause is a phrase containing exactly one verb (finite or non-finite). A Hindi clause boundary identifier was used.
Reference: र हत सक ल ज कर क दन लग
Test: र हत सक ल ज कर ख लन लग

Reference | र हत सक ल ज कर क दन लग
Test | र हत सक ल ज कर ख लन लग

13 METEOR-Hindi Scoring Function

Stage | Features
Exact Match | Precision, Recall
Stem Match | Precision, Recall
Synonym Match | Precision, Recall
Local Word Group | Precision, Recall
Part-of-Speech | Precision, Recall
Clause Match | Precision, Recall
Parameters Used for Scoring in METEOR-Hindi

Score, s = (Σ_i W_i * f_i) / (Σ_i W_i)   [W_i: weight of feature i]

14 METEOR-Hindi Scoring Function
- The penalty is not used in METEOR-Hindi
- The general scoring equation facilitates the use of standard machine learning techniques
- Because high-quality training data is unavailable, all weights are currently set to 1

15 Example
Reference: र हत न ब ज़ र स प सतक खर द
Test: ब ज़ र स र हत कत ब खर दन
- Exact Matches: ब ज़ र, स, र हत
- Stem Matches: ब ज़ र, स, र हत, खर दन
- Synonym Matches: ब ज़ र, स, र हत, कत ब, खर दन
- LWG Matches: ब ज़ र स, कत ब, खर दन
- POS Matches: ब ज़ र, स, कत ब, खर दन
- Clause Matches: -

16 Example

Feature | Precision | Recall
Exact Match | f1 = 3/5 | f2 = 3/6
Stem Match | f3 = 4/5 | f4 = 4/6
Synonym Match | f5 = 5/5 | f6 = 5/6
LWG | f7 = 3/4 | f8 = 3/4
POS | f9 = 4/5 | f10 = 4/6
Clause | f11 = 0/5 | f12 = 0/6

17 Example
Score = (Σ_i W_i * f_i) / (Σ_i W_i)
      = (1*3/5 + 1*3/6 + ... + 1*0/6) / (12)
      = 0.614
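The arithmetic of this example can be checked with a short script. It is a sketch that assumes the twelve feature values f1 to f12 listed on the previous slide and a weight of 1 for every feature. The dictionary keys and the function name `weighted_score` are illustrative choices, not names from the METEOR-Hindi tool.

```python
from fractions import Fraction as F

# Feature values f1..f12 from the worked example (precision, then recall).
features = {
    "exact_p": F(3, 5), "exact_r": F(3, 6),
    "stem_p": F(4, 5), "stem_r": F(4, 6),
    "synonym_p": F(5, 5), "synonym_r": F(5, 6),
    "lwg_p": F(3, 4), "lwg_r": F(3, 4),
    "pos_p": F(4, 5), "pos_r": F(4, 6),
    "clause_p": F(0, 5), "clause_r": F(0, 6),
}
weights = {name: 1 for name in features}   # all weights set to 1

def weighted_score(features, weights):
    """Weighted average of feature values: sum(W_i * f_i) / sum(W_i)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * value for name, value in features.items()) / total_weight

score = weighted_score(features, weights)
print(score, float(score))
```

With exact fractions the score is 221/360, which rounds to 0.614. Replacing the unit weights with learned ones would only require changing the `weights` dictionary.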

18 Tools Used
- Morph Analyzer: Hindi Morph
- Hindi Wordnet: Hindi Wordnet 1.2 (Jha et al., 2001)
- CRF Part-of-Speech (POS) Tagger (PVS and Karthik G, 2007)
- Hindi Local Word Grouper (Bharati et al., 1998)
- Hindi Clause Boundary Identifier (developed at IIIT-H as part of the ILMT Project)

19 Experiments and Results
Dataset of 100 sentences: 60 test translations from System1 and 40 from System2

Number of Sentences | 100
Avg. Test Sentence length |
Avg. Ref. Sentence length |
Exact Matches | 433
Stem Matches | 574
Synonym Matches | 622
LWG Matches | 426
POS Matches | 576
Clause Matches | 9

20 Experiments and Results

Metric | Features | Correlation
BLEU | |
METEOR-Hindi | Exact |
METEOR-Hindi | Exact + Stem |
METEOR-Hindi | Exact + Stem + Synonym |
METEOR-Hindi | Exact + Stem + Synonym + LWG |
METEOR-Hindi | Exact + Stem + Synonym + POS |
METEOR-Hindi | Exact + Stem + Synonym + Clause |
METEOR-Hindi | Exact + Stem + Synonym + LWG + POS + Clause |

21 Experiments and Results
BLEU correlates poorly with human judgement because it assigns a score of zero to most of the sentences.
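One way a sentence-level BLEU score becomes zero is through the geometric averaging of n-gram precisions: if no 4-gram of the test sentence appears in the reference, the product of the precisions is zero regardless of the lower orders. The sketch below is a simplified, unsmoothed BLEU-like score without the brevity penalty, written only to illustrate this effect. The English sentences are made-up examples.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(test, ref, n):
    """Clipped n-gram precision of the test sentence against one reference."""
    test_counts = Counter(ngrams(test, n))
    ref_counts = Counter(ngrams(ref, n))
    clipped = sum((test_counts & ref_counts).values())
    total = sum(test_counts.values())
    return clipped / total if total else 0.0

def geometric_mean(values):
    # Any zero precision forces the whole geometric mean to zero.
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

ref = "the boys bought the book from the market".split()
test = "from the market the boys bought a book".split()

precisions = [modified_precision(test, ref, n) for n in range(1, 5)]
bleu_like = geometric_mean(precisions)
print(precisions, bleu_like)
```

Here seven of the eight test words appear in the reference, yet the reordered phrases share no 4-gram with it, so the score collapses to zero.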

22 Experiments and Results
[Figure: METEOR-Hindi and Human scores]

23 Experiments and Results
- Highest correlation (0.703) obtained using the Exact, Stem, Synonym and POS features
- Using linguistic features (stemming, synonyms) resulted in better correlation
- Surprisingly, the LWG and Clause features did not increase correlation
- Errors in reference sentences, like ह र क, उनम स and ग ड़य म where it should be ह र क, उनम स and ग ड़य म
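A correlation figure such as the one above can be computed from per-sentence metric scores and human scores. The sketch below assumes the reported correlation is Pearson's r and implements it directly. The score lists are made-up numbers for illustration only, and the human score scale is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up scores for five sentences, for illustration only.
metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
human_scores = [1.0, 2.0, 2.0, 3.0, 4.0]

r = pearson(metric_scores, human_scores)
print(round(r, 3))
```

Comparing feature combinations then amounts to recomputing the metric scores with each combination and calling `pearson` against the same human scores.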

24 Experiments and Results
- Adding the Clause feature also decreased correlation
  - Only 9 clauses matched in 100 sentences
  - Most sentences have only one verb, so the entire sentence is a clause
  - A clause is a much higher-level concept than a word, so scores should be penalized less when clauses do not match

25 Experiments and Results
[Figure: Scatterplot of BLEU and Human Score]

26 Experiments and Results
[Figure: Scatterplot of METEOR-Hindi and Human Score]

27 Experiments and Results

Metric | Average Score
BLEU |
METEOR-Hindi |
Human |
Average Scores

- Compared MT System1 and System2
- METEOR-Hindi correlated better with human scores than BLEU did
- Human annotators as well as METEOR-Hindi gave a higher score to System2, while BLEU ranked System1 higher

28 Experiments and Results

Metric | System | Average Score | Pearson correlation
BLEU | System | |
METEOR-Hindi | System | |
Human | System | |

Metric | System | Average Score | Pearson correlation
BLEU | System | |
METEOR-Hindi | System | |
Human | System | |

29 Future Work
- Train METEOR-Hindi on a large amount of high-quality data and find optimal weights for the various parameters
- Use additional features, such as paraphrase match, to achieve better correlation
- Put the tool online for others to use

30 REFERENCES
- Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for Bleu. Ananthakrishnan et al., 2007
- Bleu: a method for automatic evaluation of machine translation. Papineni et al., 2002
- Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. Satanjeev Banerjee and Alon Lavie, 2005
- Meteor: An automatic metric for MT evaluation with high levels of correlation with human judgments. Alon Lavie and Abhay Agarwal, 2007
- The Meteor metric for automatic evaluation of machine translation. Alon Lavie and Michael Denkowski, 2009
- Local Word Grouping and its relevance to Indian Languages. Bharati et al., 1991
- Part-of-speech tagging and chunking using conditional random fields and transformation based learning. Avinesh PVS and Karthik G., 2007

31 REFERENCES
- Re-evaluating the role of BLEU in machine translation research. Chris Callison-Burch et al.
- The significance of recall in automatic metrics for MT evaluation. Alon Lavie et al., 2004
- Syntactic features for evaluation of machine translation. Liu et al., 2004
- Stochastic iterative alignment for machine translation evaluation. Liu et al., 2005
- A wordnet for Hindi. Jha et al., 2001

32 THANK YOU
