Pre-Retrieval based Strategies for Cross Language News Story Search


1 Pre-Retrieval based Strategies for Cross Language News Story Search
Presented by: Aarti Kumar (Research Scholar) & Sujoy Das (Associate Professor), Department of Computer Applications, MANIT, Bhopal

2 CLINSS 2013 Task: to find news stories that cover the same news event and the same focal event across the English-Hindi language pair. Given: a set of potential source news stories S, written in Hindi, and a set of 25 target news stories T, written in English.

3 Objective of Study To test pre-retrieval strategies. To compare dictionary based and machine translation based CLIR approaches.

4 Approach Pre-retrieval strategies: query formed using proper nouns; query formed using higher-frequency words, i.e. words whose frequency is equal to or higher than the average frequency. The queries are used to retrieve the Hindi news stories. Translation strategy: dictionary based or machine translation based. Indexing and retrieval: Terrier 3.5 retrieval engine.

5 Pre Retrieval Approach for CLINSS
[Flow diagram] English Documents → Preprocessing → Pre Retrieval Strategies (Proper Noun; Greater than or equal to average frequency) → Dictionary based CLIR System or Machine Translation Based System → Formulated Query → Retrieval Engine (over the Hindi Documents) → Retrieved Hindi Documents (Top 100)

6 Experiment
Preprocessing
Query Formulation: Pre-Retrieval Strategies and Dictionary Based approach for MANIT-1-Run-1 and MANIT-1-Run-2; Pre-Retrieval Strategy and Machine Translation Based approach for MANIT-1-Run-3
Indexing and retrieval

7 Preprocessing For all three runs, all the words of <title> and <content> were extracted from each English document. Punctuation was removed during tokenization. Stop-words, verbs and adverbs were removed from the <content> part only, using a list of 430 stop-words [5] and a list of 1514 verbs and adverbs [6] compiled from the web. Dates and numbers were removed from both <title> and <content>, because trial runs showed that queries started drifting when they were kept. No other preprocessing was done on the <title>; all its remaining words were taken as they are at the time of query formulation.
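The preprocessing described above could be sketched roughly as follows. This is a minimal illustration, not the presenters' code: the two word sets are tiny hypothetical stand-ins for the 430-word stop list [5] and the 1514-entry verb and adverb list [6], and treating "dates and numbers" as any token containing a digit is an assumption.

```python
import re

# Hypothetical stand-ins for the stop-word list [5] and the verb/adverb list [6].
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "was", "to"}
VERBS_ADVERBS = {"said", "visited", "quickly", "announced"}

TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Keep runs of letters and digits only, which drops punctuation."""
    return TOKEN_RE.findall(text)


def drop_numbers(tokens):
    """Remove every token that contains a digit (assumed reading of 'dates and numbers')."""
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]


def preprocess_title(title):
    """Title: only punctuation and numeric tokens are removed."""
    return drop_numbers(tokenize(title))


def preprocess_content(content):
    """Content: additionally remove stop-words, verbs and adverbs."""
    tokens = drop_numbers(tokenize(content))
    return [
        t for t in tokens
        if t.lower() not in STOPWORDS and t.lower() not in VERBS_ADVERBS
    ]


print(preprocess_title("Minister visits Bhopal, 2013"))
print(preprocess_content("The minister visited Bhopal on 12 March 2013."))
```

Month names such as "March" survive in this sketch; how textual dates were handled is not stated in the slides.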

8 Query Formulation Pre-Retrieval Strategies and Dictionary Based approach for MANIT-1-Run-1 and MANIT-1-Run-2. MANIT-1-Run-1: In this run only proper nouns are extracted from the <content> of the English news story. The grammar rule that a proper noun begins with a capital letter is used to identify proper nouns, instead of a part-of-speech tagger. Proper nouns were chosen for formulating the queries because they are not changed when text is translated and, in news stories especially, they name the important entities.
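A minimal sketch of the capital-letter rule, assuming plain English text as input. The function name and the choice to keep each proper noun only once are illustrative assumptions; the slides do not say how repeated or sentence-initial words were handled.

```python
import re


def extract_proper_nouns(content):
    """Return words that start with a capital letter, in order of first appearance.

    Sentence-initial common words are also capitalised and would be picked up here;
    this sketch relies on an earlier stop-word removal step to filter most of them.
    """
    seen = set()
    proper_nouns = []
    for token in re.findall(r"[A-Za-z]+", content):
        if token[0].isupper() and token not in seen:
            seen.add(token)
            proper_nouns.append(token)
    return proper_nouns


print(extract_proper_nouns("Akhilesh met Mayawati in Lucknow. Akhilesh left."))
```

The rule is cheap and needs no tagger, at the cost of false positives at sentence starts and misses for lower-cased names.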


9 MANIT-1-Run-2 In this run only those words whose frequency is greater than or equal to the average word frequency of the <content> are selected at the time of query formulation. The rationale is that, among the words that appear at least an average number of times, some must be important for catching the linked documents.
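The frequency threshold can be illustrated with a short sketch. It assumes that "average word frequency" means the total number of tokens divided by the number of distinct words in the <content>; the slides do not define it more precisely, and the function name is hypothetical.

```python
from collections import Counter


def high_frequency_terms(tokens):
    """Keep the distinct words whose count is >= the average count per distinct word."""
    counts = Counter(tokens)
    if not counts:
        return []
    average = sum(counts.values()) / len(counts)
    return [term for term, count in counts.items() if count >= average]


tokens = ["flood", "flood", "flood", "relief", "relief", "bihar", "camp"]
# Counts: flood=3, relief=2, bihar=1, camp=1; average = 7 / 4 = 1.75
print(high_frequency_terms(tokens))
```

With this toy input only "flood" and "relief" reach the 1.75 threshold, so the query shrinks to the words repeated most often in the story.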

10 Query Formulation continued In both of these runs the Porter stemmer [10] is used for stemming. A dictionary based approach is used to translate the query into Hindi. The Shabdanjali dictionary [9] is used to translate English tokens into Hindi, and only the first Hindi translation of each word is considered. Words that did not have a Hindi equivalent in the Shabdanjali dictionary were transliterated using a transliterator developed by us. The translated queries are submitted to the Terrier retrieval engine [11] and the top 100 documents are retrieved.
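The translate-or-transliterate step could look roughly like the sketch below. The dictionary contents, the `toy_transliterate` helper and the function names are hypothetical placeholders; the real system used the Shabdanjali dictionary [9] and the presenters' own transliterator, and Porter stemming [10] is left out here because the slides do not say at which point it was applied.

```python
# Hypothetical miniature English-Hindi dictionary: each entry lists translations in order.
TOY_DICTIONARY = {
    "water": ["पानी", "जल"],
    "gold": ["सोना", "स्वर्ण"],
}

# Hypothetical lookup standing in for a real transliterator.
TOY_TRANSLITERATIONS = {"Kamal": "कमल", "Kanak": "कनक"}


def toy_transliterate(token):
    """Return a Hindi spelling if known, otherwise the token unchanged."""
    return TOY_TRANSLITERATIONS.get(token, token)


def translate_query(tokens, dictionary, transliterate):
    """Use the first dictionary translation; fall back to transliteration."""
    hindi_terms = []
    for token in tokens:
        translations = dictionary.get(token.lower())
        if translations:
            hindi_terms.append(translations[0])  # only the first translation is used
        else:
            hindi_terms.append(transliterate(token))
    return hindi_terms


print(translate_query(["water", "Kamal"], TOY_DICTIONARY, toy_transliterate))
```

Taking only the first translation keeps the query short but ignores word sense, which is one reason proper nouns, which bypass the dictionary, are attractive query terms.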

11 MANIT-1-Run-3 This run is the same as MANIT-1-Run-1, except that a machine translation based approach is used for translating the query words. The freely available online Google Translate [7] is used to translate or transliterate English query words into Hindi. Words that Google Translate [7] failed to transliterate were transliterated with the online Changathi Hindi transliterator [8]. The process was carried out manually. The purpose of this manual intervention was to obtain the correct Hindi words and then compare the results with those of the fully automated approaches used for MANIT-1-Run-1 and MANIT-1-Run-2.

12 Problems with transliteration: few examples
Banka was transliterated as बांका, but BANKA was not transliterated by Google.
Interpretation of the letter a in Hindi: Kamal कमल, Mayawati मायावती, Mulayam मुलायम, Akriti आकृति (not transliterated by Google), Akhilesh अखिलेश.
Interpretation of the bigram an in Hindi: Anubha अनुभा (not transliterated by Google), Anshu अंशु, Anand आनंद, Kanak कनक, Janki जानकी, Pranav प्रणव.
Our transliterator gave 1-8 combinations for such words.

13 Indexing and retrieval The Hindi documents were indexed, and the linked Hindi news stories for each English document were retrieved, using Terrier 3.5 [11] with the TF-IDF ranking model.
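To make the ranking step concrete, here is a small sketch of TF-IDF scoring over tokenized documents. It uses a textbook weighting (raw term frequency times log(N / df)) and is not Terrier's exact TF-IDF formula; the function name, the document identifiers and the placeholder tokens are assumptions for illustration.

```python
import math
from collections import Counter


def rank_tfidf(query_terms, documents, top_k=100):
    """Rank documents by the sum over query terms of tf * log(N / df)."""
    n_docs = len(documents)
    doc_counts = {doc_id: Counter(tokens) for doc_id, tokens in documents.items()}

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for counts in doc_counts.values():
        df.update(counts.keys())

    scores = {}
    for doc_id, counts in doc_counts.items():
        score = 0.0
        for term in query_terms:
            if counts[term]:
                score += counts[term] * math.log(n_docs / df[term])
        if score > 0:
            scores[doc_id] = score

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Placeholder tokens stand in for Hindi index terms.
docs = {
    "h1": ["baadh", "raahat", "baadh"],
    "h2": ["raahat", "shivir"],
    "h3": ["shivir", "chunav"],
}
print(rank_tfidf(["baadh", "raahat"], docs))
```

The `top_k=100` default mirrors the top 100 documents retrieved per query in the runs.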

14 Result MANIT-1-Run-1 gives performance of 0.6, 0.545 and 0.5388 for NDCG@1, NDCG@5 and NDCG@10 respectively. MANIT-1-Run-2 gives performance of 0.56, 0.4521 and 0.… for NDCG@1, NDCG@5 and NDCG@10 respectively. MANIT-1-Run-3 gives performance of 0.5 for NDCG@1. It is observed that the proper noun based pre-retrieval strategy combined with the dictionary based CLIR approach performed fairly well. At NDCG@5 and NDCG@10, the Google Translate based approach performed next.
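For readers unfamiliar with the metric, the sketch below computes NDCG@k for one query. It assumes graded relevance labels (0, 1 or 2) and the common gain form rel / log2(rank + 1); the exact variant used by the CLINSS evaluation is not given in the slides, so the numbers it produces are illustrative only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, all_relevances, k):
    """DCG@k of the ranking divided by the DCG@k of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant document (label 2) exists, and the system returned it at rank 2.
print(ndcg_at_k([0, 2], [2], k=1))
print(ndcg_at_k([0, 2], [2], k=5))
```

In this toy case NDCG@1 is zero because the first result is irrelevant, while NDCG@5 gives partial credit for the relevant document at rank 2. This is the same effect that lets a run score lower at NDCG@1 yet recover at deeper cut-offs.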

15 Comparative performance
Table 1. Comparative performance of the three runs (rows: run-1-manit, run-2-manit, run-3-manit)

16 Analysis 1
Out of 140 relevant documents, per run: Run-1 (Proper D), Run-2 (GT), Run-3 (Proper Google)
Rows: Found as 1st; Found among top 5; Found among top 10; Found among top 100; Not found in the top

17 Analysis 1 continued
[Bar chart comparing Run-1 (Proper), Run-2 (GT) and Run-3 (Proper Google); vertical axis from 0 to 120]

18 Analysis 2 Out of the 8 documents with score 2, i.e. documents with "same news event + same focal event", the number of documents retrieved by the different query strategies is tabulated for Run-1 (Proper), Run-2 (GT) and Run-3 (Proper Google), counted as 1st document, in top 5 and in top 10. MANIT-1-Run-1 performed the best in this. This might be the reason for the degradation in the NDCG performance of the queries formed using Google Translate.

19 Analysis 2 continued
[Chart: score-2 documents retrieved as 1st document, in top 5 and in top 10]

20 Analysis III
Columns: English-Hindi Relevant Document Pair; Linked Hindi Documents
english-document txt 0 hindi-document txt 1: For 2 and 23
english-document txt 0 hindi-document txt 1: For 16 and 9
english-document txt 0 hindi-document txt 1: For 16 and 23
english-document txt 0 hindi-document txt 1: For 16 and 13
english-document txt 0 hindi-document txt 1: For 5 and 11
english-document txt 0 hindi-document txt 1: For 19 and 21
english-document txt 0 hindi-document txt 1: For 19 and 25
english-document txt 0 hindi-document txt 1: For 19 and 25
english-document txt 0 hindi-document txt 1: For 1 and 4
english-document txt 0 hindi-document txt 1: For 1 and 4
english-document txt 0 hindi-document txt 1: For 1 and 2
english-document txt 0 hindi-document txt 1: For 1, 21 and 24
english-document txt 0 hindi-document txt 1: For 1 and 8
english-document txt 0 hindi-document txt 1: For 1 and 4
english-document txt 0 hindi-document txt 1: For 1 and 4
english-document txt 0 hindi-document txt 2: For 17, 9 and 12
english-document txt 0 hindi-document txt 2: For 4, 10 and 21
english-document txt 0 hindi-document txt 1: For 4 and 21


21 Conclusion The dictionary based approach combined with the proper noun based pre-retrieval strategy performed better than the other two runs in all three cases. MANIT-1-Run-3, which aimed at getting the right translation and transliteration for the given query words, did not perform well at the NDCG@1 level. This study examined some pre-retrieval strategies for retrieving a subset of source Hindi documents from a large corpus. Post-processing techniques to link the exact news stories shall be studied in future.

22 Acknowledgement We are thankful to the Terrier group for providing us with the Terrier Retrieval Engine to carry out our research work. One of the presenters, Aarti Kumar, is thankful to Maulana Azad National Institute of Technology, Bhopal, for providing her with financial support to pursue her doctoral work as a full-time research scholar.

23 References
[1] Paul D. Clough, Department of Computer Science, University of Sheffield, England: Measuring Text Reuse in Journalistic Domain
[2] Parth Gupta, Paul Clough, Paolo Rosso, Mark Stevenson: PAN@FIRE: Overview of the Cross-Language !ndian News Story Search (CL!NSS) Track. In: Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012)
[3] Yurii Palkovskii, Alexei Belov: Using TF-IDF Weight Ranking Model in CLINSS as Effective Similarity Measure to Identify Cases of Journalistic Text Re-use. In: Overview paper CLINSS 2012, Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012)
[4] Nitish Aggarwal, Kartik Asooja, Paul Buitelaar, Tamara Polajnar, Jorge Gracia: Cross-Lingual Linking of News Stories using ESA. In: Overview paper CLINSS 2012, Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012)
[5] List of Stopwords. Available on 009/04/14/list-of-english-stopwords/,

24 References continued
[6] List of Verbs and Adverbs. Available on /acme/3star%20verbs.htm, s.shtml,
[8] Changathi Transliterator. Available on
[9] Shabdanjali. Available on
[10] Porter stemmer. Available on
[11] Terrier 3.5. Available on
