Pre-Retrieval based Strategies for Cross Language News Story Search

Presented by: Aarti Kumar (Research Scholar) and Sujoy Das (Associate Professor), Department of Computer Applications, MANIT, Bhopal

CLINSS 2013
Task: to find news stories that cover the same news event and the same focal event across the English-Hindi language pair. The source collection is a set of 50691 potential source news stories S, written in Hindi. The target collection is a set of 25 target news stories T, written in English.

Objective of Study
To test pre-retrieval strategies, and to compare the dictionary-based and machine-translation-based CLIR approaches.

Approach
Pre-retrieval strategies: (1) a query formed using proper nouns; (2) a query formed using higher-frequency words, i.e. words whose frequency is equal to or higher than the average frequency. These queries are used to retrieve the Hindi news stories.
Translation strategy: dictionary based or machine translation based.
Indexing and retrieval: Terrier 3.5 retrieval engine.

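Pre-Retrieval Approach for CLINSS
[Flow diagram: the English documents are preprocessed; a pre-retrieval strategy (proper nouns, or words with frequency greater than or equal to the average) selects the query terms; the terms are translated by either the dictionary-based CLIR system or the machine-translation-based system; the formulated query is submitted to the retrieval engine over the Hindi documents, which returns the top 100 retrieved Hindi documents.]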

Experiment
1. Preprocessing.
2. Query formulation: pre-retrieval strategies with the dictionary-based approach for MANIT-1-Run-1 and MANIT-1-Run-2; pre-retrieval strategy with the machine-translation-based approach for MANIT-1-Run-3.
3. Indexing and retrieval.

Preprocessing
For all three runs, all the words of <title> and <content> were extracted from each English document. Punctuation was removed during tokenization. Stop-words, verbs and adverbs were removed from the <content> part only, using a list of 430 stop-words [5] and a list of 1514 verbs and adverbs [6], both compiled from the web. Dates and numbers were removed from both <title> and <content>, because trial runs showed that the query started to drift when they were kept. No other preprocessing was applied to the <title>; all of its remaining words were used as they are during query formulation.
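The preprocessing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two word lists are tiny stand-ins for the 430 stop-words and 1514 verbs and adverbs, the regular expression and all function names are hypothetical, and only numeric dates are recognised (month names are left untouched).

```python
import re

# Tiny stand-ins for the real lists (430 stop-words, 1514 verbs and adverbs).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "was", "is"}
VERBS_ADVERBS = {"said", "announced", "quickly", "visited"}

# A token is either a word or a number/numeric date such as 2012 or 12/03/2012.
# Everything else (punctuation, whitespace) is dropped by findall.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:[.,/-]\d+)*")


def tokenize(text):
    """Split text into word and number tokens, dropping punctuation."""
    return TOKEN_RE.findall(text)


def is_number_or_date(token):
    """Assumption: any token that starts with a digit is a number or a date."""
    return token[0].isdigit()


def preprocess_title(title):
    """Titles only lose dates and numbers; every other word is kept."""
    return [t for t in tokenize(title) if not is_number_or_date(t)]


def preprocess_content(content):
    """Content loses dates, numbers, stop-words, verbs and adverbs."""
    kept = []
    for tok in tokenize(content):
        if is_number_or_date(tok):
            continue
        if tok.lower() in STOPWORDS or tok.lower() in VERBS_ADVERBS:
            continue
        kept.append(tok)
    return kept


title_tokens = preprocess_title("Mayawati visited Lucknow on 12/03/2012")
content_tokens = preprocess_content(
    "The minister visited Lucknow on 12 March 2012, quickly."
)
```

Keeping the original capitalisation of the surviving tokens matters here, because the proper-noun run relies on it.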

Query Formulation
Pre-retrieval strategies and dictionary-based approach for MANIT-1-Run-1 and MANIT-1-Run-2.
MANIT-1-Run-1: In this run only proper nouns are extracted from the <content> of the English news story. The rule that a proper noun begins with a capital letter is used to identify proper nouns, instead of a part-of-speech tagger. Proper nouns were chosen for formulating the queries because they remain unchanged when text is translated, and because in news stories they name the important entities.
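A minimal sketch of the capital-letter rule, assuming it is applied to the raw <content> text. The function name and the optional stop-word filter are hypothetical. Note that this simple rule also catches ordinary words that start a sentence, which is one reason a stop-word filter helps.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")


def extract_proper_nouns(content, stopwords=frozenset()):
    """Return capitalised words in order of first appearance, without repeats."""
    seen = set()
    proper_nouns = []
    for tok in WORD_RE.findall(content):
        if not tok[0].isupper():
            continue  # the rule: a proper noun begins with a capital letter
        if tok.lower() in stopwords:
            continue  # drops sentence-initial stop-words such as "The"
        if tok not in seen:
            seen.add(tok)
            proper_nouns.append(tok)
    return proper_nouns


query_terms = extract_proper_nouns(
    "The chief minister Akhilesh Yadav met Mayawati in Lucknow. Akhilesh left early.",
    stopwords={"the"},
)
```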

MANIT-1-Run-2: In this run only those words whose frequency is greater than or equal to the average word frequency of the <content> are selected for query formulation. The rationale is that, among the words that occur at least an average number of times, some are likely to be important for catching the linked documents.
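The frequency threshold can be illustrated as below. The sketch assumes that "average word frequency" means the total number of tokens divided by the number of distinct words, and that counting is case-insensitive; the slides do not state either detail, and the function name is hypothetical.

```python
from collections import Counter


def select_frequent_words(tokens):
    """Keep words whose count is at least the mean count over distinct words."""
    counts = Counter(t.lower() for t in tokens)
    if not counts:
        return []
    average = sum(counts.values()) / len(counts)
    # Counter preserves first-seen order, so the result is deterministic.
    return [word for word, count in counts.items() if count >= average]


tokens = ["election", "election", "election", "party", "party", "rally", "vote"]
selected = select_frequent_words(tokens)
# Counts are 3, 2, 1, 1, so the average is 7 / 4 = 1.75.
```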

Query Formulation continued
In both of these runs the Porter stemmer [10] is used for stemming. A dictionary-based approach is used for translating the query into Hindi. The Shabdanjali dictionary [9] is used for translating English tokens into Hindi, and only the first Hindi translation of each word is considered. Words that did not have a Hindi equivalent in the Shabdanjali dictionary were transliterated using a transliterator developed by us. The translated queries are submitted to the Terrier retrieval engine [11] and the top 100 documents are retrieved.
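The translate-or-transliterate step might look like the sketch below. The dictionary entries, the toy transliteration table and every name in the code are made up for illustration; they stand in for the Shabdanjali dictionary and the authors' transliterator, whose interfaces are not described in the slides. Stemming is left out because the slides do not say at which point it is applied.

```python
def translate_query(tokens, dictionary, transliterate):
    """Translate each English token into Hindi.

    dictionary maps a lowercase English word to a list of Hindi translations;
    only the first translation is used. Words missing from the dictionary are
    passed to transliterate, which may return several candidate spellings.
    """
    hindi_terms = []
    for tok in tokens:
        translations = dictionary.get(tok.lower())
        if translations:
            hindi_terms.append(translations[0])
        else:
            hindi_terms.extend(transliterate(tok))
    return hindi_terms


# Hypothetical toy resources, not the real dictionary or transliterator.
toy_dictionary = {"minister": ["मंत्री", "सचिव"], "election": ["चुनाव"]}


def toy_transliterate(word):
    table = {"kanak": ["कनक"], "anand": ["आनंद", "अनंद"]}
    return table.get(word.lower(), [word])


hindi_query = translate_query(
    ["minister", "Kanak", "election"], toy_dictionary, toy_transliterate
)
```

Returning every candidate from the transliterator reflects the observation that an English spelling can map to several Hindi spellings.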

MANIT-1-Run-3: This run is the same as MANIT-1-Run-1, except that a machine-translation-based approach is used for translating the query words. The freely available online Google Translate [7] is used to translate or transliterate the English query words into Hindi. Words that Google Translate [7] failed to transliterate were transliterated with the online Changathi Hindi transliterator [8]. The process was carried out manually. The purpose of this manual intervention was to obtain the correct Hindi words and then to compare the results with those of the fully automated approaches used for MANIT-1-Run-1 and MANIT-1-Run-2.

Problems with transliteration: a few examples
Banka was transliterated as बांका, but BANKA was not transliterated by Google.
Interpretation of the letter "a" in Hindi:
Kamal      कमल
Mayawati   मायावती
Mulayam    मुलायम
Akriti     आकृति (not transliterated by Google)
Akhilesh   अखिलेश
Interpretation of the bigram "an" in Hindi:
Anubha     अनुभा (not transliterated by Google)
Anshu      अंशु
Anand      आनंद
Kanak      कनक
Janki      जानकी
Pranav     प्रणव
Our transliterator gave 1 to 8 candidate combinations for such words.

Indexing and retrieval
The Hindi documents were indexed, and the linked Hindi news stories for each English document were retrieved, using Terrier 3.5 [11] with the TF-IDF ranking model.
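To show the general idea behind TF-IDF ranking, here is a small textbook-style sketch. It is not Terrier's implementation and does not reproduce Terrier's exact weighting formula; the scoring function (raw term frequency times the logarithm of inverse document frequency), the function name and the romanised toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_tf_idf(query_terms, documents, top_k=100):
    """Rank documents by the sum of tf * log(N / df) over the query terms.

    documents maps a document id to its list of tokens.
    Returns up to top_k (doc_id, score) pairs, best first.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for doc_id, tokens in documents.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                score += tf[term] * math.log(n_docs / doc_freq[term])
        if score > 0:
            scores[doc_id] = score

    # Sort by descending score, breaking ties by document id.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Romanised placeholder tokens standing in for Hindi index terms.
toy_documents = {
    "h1": ["chunav", "mantri", "chunav"],
    "h2": ["mantri", "cricket"],
    "h3": ["cricket", "match"],
}
ranking = rank_tf_idf(["chunav", "mantri"], toy_documents)
```

With `top_k=100` the function mirrors the cut-off of 100 retrieved documents per query used in the runs.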

Result
MANIT-1-Run-1 scores 0.6, 0.545 and 0.5388 for NDCG@1, NDCG@5 and NDCG@10 respectively. MANIT-1-Run-2 scores 0.56, 0.4521 and 0.4828 for NDCG@1, NDCG@5 and NDCG@10 respectively. MANIT-1-Run-3 scores 0.5, 0.4803 and 0.4867 for NDCG@1, NDCG@5 and NDCG@10 respectively. The proper-noun-based pre-retrieval strategy combined with the dictionary-based CLIR approach (MANIT-1-Run-1) scored highest at all three cut-offs. At NDCG@5 and NDCG@10 the Google Translate based approach (MANIT-1-Run-3) came next.

Comparative performance
Run            NDCG@1   NDCG@5   NDCG@10
run-1-manit1   0.6      0.545    0.5388
run-2-manit1   0.56     0.4521   0.4828
run-3-manit1   0.5      0.4803   0.4867
Table 1. Comparative performance of the three runs
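For readers unfamiliar with the measure, NDCG@k can be computed as in the sketch below. The exact variant used by the CLINSS evaluation is not stated in the slides, so the gain function (2^rel - 1) and the log2 discount are assumptions, as are the function names. Relevance grades follow the track's scheme, in which a score of 2 marks "same news event + same focal event".

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k relevance grades."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, all_relevant_gains, k):
    """NDCG@k: DCG of the ranking divided by the DCG of an ideal ranking.

    ranked_gains: relevance grade of each retrieved document, in rank order.
    all_relevant_gains: grades of every relevant document for the query.
    """
    ideal = sorted(all_relevant_gains, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_gains, k) / ideal_dcg


# A ranking that puts a grade-1 document ahead of the grade-2 document.
score_at_1 = ndcg_at_k([1, 2, 0], [2, 1], k=1)
perfect = ndcg_at_k([2, 1, 0], [2, 1], k=3)
```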

Analysis 1
Out of 140 relevant documents:
                          Run-1 (Proper D)   Run-2 (GT)   Run-3 (Proper Google)
Found as 1st              16                 15           14
Found among top 5         38                 31           44
Found among top 10        48                 46           63
Found among top 100       95                 75           120
Not found in the top 100  45                 65           20

Analysis 1 continued
[Bar chart of the Analysis 1 counts for Run-1 (Proper), Run-2 (GT) and Run-3 (Proper Google).]

Analysis 2
Out of the 8 documents with score 2, i.e. documents with "same news event + same focal event", the number of documents retrieved by the different query strategies is:
                  Run-1 (Proper)   Run-2 (GT)   Run-3 (Proper Google)
As 1st document   5                4            5
In top 5          7                5            6
In top 10         8                7            6
MANIT-1-Run-1 performed best here. This difference might be the reason for the degradation in the NDCG performance of the queries formed using Google Translate.

Analysis 2 continued
[Bar chart of the Analysis 2 counts (as 1st document, in top 5, in top 10) for the three runs.]

Analysis 3
English-Hindi Relevant Document Pair                        Linked Hindi Documents
english-document-00002.txt 0 hindi-document-00416.txt 1     For 2 and 23
english-document-00016.txt 0 hindi-document-48171.txt 1     For 16 and 9
english-document-00016.txt 0 hindi-document-29606.txt 1     For 16 and 23
english-document-00016.txt 0 hindi-document-32003.txt 1     For 16 and 13
english-document-00005.txt 0 hindi-document-00414.txt 1     For 5 and 11
english-document-00019.txt 0 hindi-document-10863.txt 1     For 19 and 21
english-document-00019.txt 0 hindi-document-19273.txt 1     For 19 and 25
english-document-00019.txt 0 hindi-document-19272.txt 1     For 19 and 25
english-document-00001.txt 0 hindi-document-16606.txt 1     For 1 and 4
english-document-00001.txt 0 hindi-document-39272.txt 1     For 1 and 4
english-document-00001.txt 0 hindi-document-17481.txt 1     For 1 and 2
english-document-00001.txt 0 hindi-document-08897.txt 1     For 1, 21 and 24
english-document-00001.txt 0 hindi-document-19255.txt 1     For 1 and 8
english-document-00001.txt 0 hindi-document-46293.txt 1     For 1 and 4
english-document-00001.txt 0 hindi-document-08773.txt 1     For 1 and 4
english-document-00017.txt 0 hindi-document-14001.txt 2     For 17, 9 and 12
english-document-00004.txt 0 hindi-document-20282.txt 2     For 4, 10 and 21
english-document-00004.txt 0 hindi-document-37101.txt 1     For 4 and 21

Conclusion
The dictionary-based approach combined with the proper-noun-based pre-retrieval strategy performed better than the other two runs in all three cases. MANIT-1-Run-3, which aimed at getting the right translation and transliteration for the given query words, did not perform well at the NDCG@1 level. In this study, some pre-retrieval strategies for retrieving a subset of source Hindi documents from a large corpus have been studied. Post-processing techniques for linking the exact news stories will be studied in future work.

Acknowledgement
We are thankful to the Terrier group for providing the Terrier retrieval engine used to carry out our research work. One of the presenters, Aarti Kumar, is thankful to Maulana Azad National Institute of Technology, Bhopal for providing financial support to pursue her doctoral work as a full-time research scholar.

References
[1] Paul D. Clough: Measuring Text Reuse in Journalistic Domain. Department of Computer Science, University of Sheffield, England.
[2] Parth Gupta, Paul Clough, Paolo Rosso, Mark Stevenson: PAN@FIRE: Overview of the Cross-Language !ndian News Story Search (CL!NSS) Track. In: Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012).
[3] Yurii Palkovskii, Alexei Belov: Using TF-IDF Weight Ranking Model in CLINSS as Effective Similarity Measure to Identify Cases of Journalistic Text Re-use. In: Overview paper CLINSS 2012, Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012).
[4] Nitish Aggarwal, Kartik Asooja, Paul Buitelaar, Tamara Polajnar, Jorge Gracia: Cross-Lingual Linking of News Stories using ESA. In: Overview paper CLINSS 2012, Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012).
[5] List of stopwords. Available at http://www.ranks.nl/resources/stopwords.html, http://norm.al/2009/04/14/list-of-english-stopwords/, http://www.webconfs.com/stopwords.php, http://jmlr.org/papers/volume5/lewis04a/a11-smartstop-list/english.stop

References continued
[6] List of verbs and adverbs. Available at http://www.englishclub.com/vocabulary/regular-verbslist.htm, http://www.momswhothink.com/reading/list-ofverbs.html, http://www.linguanaut.com/verbs.htm, http://www.acme2k.co.uk/acme/3star%20verbs.htm, http://www.enchantedlearning.com/wordlist/verbs.shtml, http://www.enchantedlearning.com/wordlist/adverbs.shtml
[7] Google Translate. Available at http://translate.google.com/?prev=hp&hl=en&text=&sl=en&tl=hi#en/hi/-
[8] Changathi Transliterator. Available at http://hindi.changathi.com/
[9] Shabdanjali. Available at http://ltrc.iiit.ac.in/onlineservices/dictionaries/dict_frame.html
[10] Porter stemmer. Available at http://ir.dcs.gla.ac.uk/resources/linguistic_utils/porter.java
[11] Terrier 3.5. Available at http://terrier.org/download/