Pre-Retrieval based Strategies for Cross Language News Story Search

Presented by: Aarti Kumar (Research Scholar) and Sujoy Das (Associate Professor), Department of Computer Applications, MANIT, Bhopal

CLINSS 2013
Task: find news stories that report the same news event and the same focal event across the English-Hindi language pair.
Source collection S: 50,691 potential source news stories, written in Hindi.
Target collection T: 25 target news stories, written in English.

Objective of Study
To test pre-retrieval strategies.
To compare the dictionary-based and the machine-translation-based CLIR approaches.

Approach
Pre-retrieval strategies: (1) a query formed from proper nouns; (2) a query formed from higher-frequency words, i.e. words whose frequency is equal to or higher than the average word frequency. The resulting queries are used to retrieve the Hindi news stories.
Translation strategy: dictionary based or machine translation based.
Indexing and retrieval: Terrier 3.5 retrieval engine.

Pre-Retrieval Approach for CLINSS
Figure: system flow. English documents -> preprocessing -> pre-retrieval strategies (proper nouns; words with frequency greater than or equal to the average) -> dictionary-based CLIR system or machine-translation-based system -> formulated query -> retrieval engine over the Hindi documents -> top 100 retrieved Hindi documents.

Experiment
Preprocessing
Query formulation:
  Pre-retrieval strategies and dictionary-based approach for MANIT-1-Run-1 and MANIT-1-Run-2
  Pre-retrieval strategy and machine-translation-based approach for MANIT-1-Run-3
Indexing and retrieval

Preprocessing For all three runs, all the words of <title> and <content> were extracted from each English document. Punctuation was removed during tokenization. Stopwords, verbs and adverbs were removed from the <content> part only, using a list of 430 stopwords [5] and a list of 1514 verbs and adverbs [6], both compiled from the web. Dates and numbers were removed from both <title> and <content>, because trial runs showed that the query started to drift when they were kept. No other preprocessing was applied to the <title>: all of its remaining words were used as they are at query formulation time.
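The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the runs: the two word sets are tiny hypothetical stand-ins for the 430-word stopword list and the 1514-entry verb/adverb list, and treating "dates and numbers" as all-digit tokens is an assumption.

```python
import re

# Hypothetical stand-ins for the stopword list [5] and the verb/adverb list [6].
STOPWORDS = {"the", "a", "in", "on", "of", "and", "to"}
VERBS_ADVERBS = {"said", "announced", "visited", "quickly"}


def tokenize(text):
    """Split on anything that is not a letter or digit, which drops punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def preprocess_title(title):
    """Titles keep every word; only all-digit tokens (assumed dates/numbers) go."""
    return [t for t in tokenize(title) if not t.isdigit()]


def preprocess_content(content):
    """Content additionally loses stopwords, verbs and adverbs."""
    removable = STOPWORDS | VERBS_ADVERBS
    return [
        t for t in tokenize(content)
        if not t.isdigit() and t.lower() not in removable
    ]
```

For example, `preprocess_content("The minister said 25 schools opened in Bhopal.")` keeps only `minister`, `schools`, `opened` and `Bhopal` with these toy lists.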

Query Formulation Pre-Retrieval Strategies and Dictionary Based approach for MANIT-1-Run-1 and MANIT-1-Run-2 MANIT-1-Run-1 In this run only proper nouns are extracted from the <content> of the English news story. Instead of a part-of-speech tagger, the grammar rule that a proper noun begins with a capital letter is used to identify them. Proper nouns were chosen for formulating the queries that retrieve the source documents because they are not changed when text is translated, and because in news stories they name the important entities.
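The capital-letter rule can be illustrated with a very small sketch. The function name `proper_nouns` is hypothetical, and the input is assumed to be the already preprocessed <content> tokens; note that a rule this simple also picks up any capitalised sentence-initial word that survives stopword removal.

```python
def proper_nouns(tokens):
    """Keep tokens whose first character is an upper-case letter.

    This mirrors the capital-letter grammar rule; no part-of-speech
    tagger is involved, so some non-proper nouns may slip through.
    """
    return [t for t in tokens if t[:1].isupper()]
```

Applied to `["minister", "Bhopal", "schools", "Akhilesh"]`, the function returns `["Bhopal", "Akhilesh"]`.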

MANIT-1-Run-2 In this run only those words whose frequency is greater than or equal to the average word frequency of the <content> are selected at query formulation time. The rationale is that, among the words that occur at least as often as the average, some are likely to be important for catching the linked documents.
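A possible reading of this selection rule is sketched below. The slides do not define "average word frequency" precisely; the sketch assumes it is the total number of tokens divided by the number of distinct words, and it lower-cases tokens before counting, which is also an assumption. The name `frequent_terms` is illustrative.

```python
from collections import Counter


def frequent_terms(tokens):
    """Return the distinct words whose count is >= the average count."""
    counts = Counter(t.lower() for t in tokens)
    if not counts:
        return []
    # Assumed definition: total occurrences / number of distinct words.
    average = sum(counts.values()) / len(counts)
    return [term for term, count in counts.items() if count >= average]
```

With the tokens `flood, flood, flood, Bihar, Bihar, relief` the average is 6 / 3 = 2, so `flood` and `bihar` are kept and `relief` is dropped.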

Query Formulation continued In both of these runs the Porter stemmer [10] is used for stemming, and a dictionary-based approach is used to translate the query into Hindi. The Shabdanjali dictionary [9] is used to translate English tokens into Hindi, and only the first Hindi translation of each word is considered. Words that did not have a Hindi equivalent in the Shabdanjali dictionary were transliterated using a transliterator developed by us. The translated queries are submitted to the Terrier retrieval engine [11] and the top 100 documents are retrieved.
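The "first translation, else transliterate" step might look like the sketch below. Everything named here is hypothetical: `TOY_DICTIONARY` is a two-entry stand-in for the Shabdanjali dictionary, `toy_transliterate` is a placeholder for the authors' transliterator (it only tags the word), and stemming is left out.

```python
# Hypothetical dictionary excerpt: candidate translations in dictionary order.
TOY_DICTIONARY = {
    "water": ["पानी", "जल"],
    "school": ["विद्यालय", "पाठशाला"],
}


def toy_transliterate(word):
    """Placeholder transliterator: marks the word instead of converting it."""
    return f"<translit:{word}>"


def translate_query(tokens, dictionary, transliterate):
    """Translate each token with its first dictionary sense, else transliterate."""
    hindi_terms = []
    for token in tokens:
        candidates = dictionary.get(token.lower())
        if candidates:
            hindi_terms.append(candidates[0])  # only the first translation is used
        else:
            hindi_terms.append(transliterate(token))  # out-of-dictionary word
    return " ".join(hindi_terms)
```

Calling `translate_query(["water", "Akhilesh"], TOY_DICTIONARY, toy_transliterate)` translates the first token from the dictionary and routes the name, which has no dictionary entry, to the fallback.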

MANIT-1-Run-3 This run is the same as MANIT-1-Run-1, except that a machine-translation-based approach is used to translate the query words. The freely available online Google Translate service [7] is used to translate or transliterate the English query words into Hindi. Words that Google Translate [7] failed to transliterate were transliterated with the online Changathi Hindi transliterator [8]. The process was carried out manually. The purpose of this manual intervention was to obtain the correct Hindi words and then to compare the results with those of the fully automated approaches used for MANIT-1-Run-1 and MANIT-1-Run-2.

Problems with transliteration: a few examples
Banka was transliterated as बांका, but BANKA was not transliterated by Google.
Interpretation of the letter "a" in Hindi:
  Kamal: कमल
  Mayawati: मायावती
  Mulayam: मुलायम
  Akriti: आकृति (not transliterated by Google)
  Akhilesh: अखिलेश
Interpretation of the bigram "an" in Hindi:
  Anubha: अनुभा (not transliterated by Google)
  Anshu: अंशु
  Anand: आनंद
  Kanak: कनक
  Janki: जानकी
  Pranav: प्रणव
Our transliterator gave 1 to 8 combinations for such words.

Indexing and retrieval The Hindi documents were indexed, and the linked Hindi news stories for each English document were retrieved, with Terrier 3.5 [11] using the TF-IDF ranking model.
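To make the ranking step concrete, here is a generic TF-IDF ranking sketch in plain Python. It is not Terrier code and does not reproduce Terrier's weighting formula, which may normalise term frequency differently; it assumes raw term frequency multiplied by log(N / df), and `rank_tfidf` is an illustrative name.

```python
import math
from collections import Counter


def rank_tfidf(query_terms, documents, top_k=100):
    """Rank documents (doc_id -> token list) by a simple TF-IDF score."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for doc_id, tokens in documents.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_freq[term]:
                score += term_freq[term] * math.log(n_docs / doc_freq[term])
        if score > 0:
            scores[doc_id] = score
    # Highest score first, cut at top_k (the runs kept the top 100).
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

In this simplified scoring a term that occurs in every document contributes nothing, because its idf is log(1) = 0.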

Result MANIT-1-Run-1 scores 0.6, 0.545 and 0.5388 for NDCG@1, NDCG@5 and NDCG@10 respectively. MANIT-1-Run-2 scores 0.56, 0.4521 and 0.4828 for NDCG@1, NDCG@5 and NDCG@10 respectively. MANIT-1-Run-3 scores 0.5, 0.4803 and 0.4867 for NDCG@1, NDCG@5 and NDCG@10 respectively. The proper-noun-based pre-retrieval strategy combined with the dictionary-based CLIR approach (MANIT-1-Run-1) scored highest at all three cut-offs. At NDCG@5 and NDCG@10 the Google Translate based approach (MANIT-1-Run-3) came second, whereas at NDCG@1 MANIT-1-Run-2 came second.

Comparative performance
Run            NDCG@1   NDCG@5   NDCG@10
run-1-manit1   0.6      0.545    0.5388
run-2-manit1   0.56     0.4521   0.4828
run-3-manit1   0.5      0.4803   0.4867
Table 1. Comparative performance of the three runs
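For readers unfamiliar with the metric in Table 1, NDCG@k can be sketched as below. The slides do not say which NDCG variant the track used; this sketch assumes linear gain (the relevance grade itself, here 0, 1 or 2) and a log2 rank discount, and the names `dcg` and `ndcg` are illustrative.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(ranked_relevances, all_relevances, k):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(all_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

Here `ranked_relevances` holds the grades of the retrieved documents in rank order and `all_relevances` holds the grades of every judged relevant document for that query. If a grade-1 document is ranked first while a grade-2 document exists, NDCG@1 under these assumptions is 1 / 2 = 0.5.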

Analysis 1 Out of 140 relevant documents:
                           Run-1 (Proper D)   Run-2 (GT)   Run-3 (Proper Google)
Found as 1st               16                 15           14
Found among top 5          38                 31           44
Found among top 10         48                 46           63
Found among top 100        95                 75           120
Not found in the top 100   45                 65           20

Analysis 1 continued
Figure: bar chart of the Analysis 1 counts for Run-1 (Proper), Run-2 (GT) and Run-3 (Proper Google).

Analysis 2 Out of the 8 documents with score 2, i.e. documents with "same news event + same focal event", the number of documents retrieved by the different query strategies is:
                  Run-1 (Proper)   Run-2 (GT)   Run-3 (Proper Google)
As 1st document   5                4            5
In top 5          7                5            6
In top 10         8                7            6
MANIT-1-Run-1 performed the best in this analysis. This might be the reason for the degradation in the NDCG performance of the queries formed using Google Translate.

Analysis 2 continued
Figure: bar chart of the Analysis 2 counts (as 1st document, in top 5, in top 10) for the three runs.

Analysis III
English-Hindi Relevant Document Pair                            Linked Hindi Documents
english-document-00002.txt 0 hindi-document-00416.txt 1         For 2 and 23
english-document-00016.txt 0 hindi-document-48171.txt 1         For 16 and 9
english-document-00016.txt 0 hindi-document-29606.txt 1         For 16 and 23
english-document-00016.txt 0 hindi-document-32003.txt 1         For 16 and 13
english-document-00005.txt 0 hindi-document-00414.txt 1         For 5 and 11
english-document-00019.txt 0 hindi-document-10863.txt 1         For 19 and 21
english-document-00019.txt 0 hindi-document-19273.txt 1         For 19 and 25
english-document-00019.txt 0 hindi-document-19272.txt 1         For 19 and 25
english-document-00001.txt 0 hindi-document-16606.txt 1         For 1 and 4
english-document-00001.txt 0 hindi-document-39272.txt 1         For 1 and 4
english-document-00001.txt 0 hindi-document-17481.txt 1         For 1 and 2
english-document-00001.txt 0 hindi-document-08897.txt 1         For 1, 21 and 24
english-document-00001.txt 0 hindi-document-19255.txt 1         For 1 and 8
english-document-00001.txt 0 hindi-document-46293.txt 1         For 1 and 4
english-document-00001.txt 0 hindi-document-08773.txt 1         For 1 and 4
english-document-00017.txt 0 hindi-document-14001.txt 2         For 17, 9 and 12
english-document-00004.txt 0 hindi-document-20282.txt 2         For 4, 10 and 21
english-document-00004.txt 0 hindi-document-37101.txt 1         For 4 and 21

Conclusion The dictionary-based approach combined with the proper-noun-based pre-retrieval strategy (MANIT-1-Run-1) performed better than the other two runs at all three NDCG cut-offs. MANIT-1-Run-3, which aimed at getting the right translation and transliteration for the given query words, did not perform well at NDCG@1. This study examined some pre-retrieval strategies for retrieving a subset of source Hindi documents from a large corpus. Post-processing techniques for linking the exact news stories will be studied in future work.

Acknowledgement We are thankful to the Terrier group for providing the Terrier retrieval engine used to carry out our research work. One of the presenters, Aarti Kumar, is thankful to Maulana Azad National Institute of Technology, Bhopal, for the financial support to pursue her doctoral work as a full-time research scholar.

References
[1] Paul D. Clough: Measuring Text Reuse in Journalistic Domain. Department of Computer Science, University of Sheffield, England.
[2] Parth Gupta, Paul Clough, Paolo Rosso, Mark Stevenson: PAN@FIRE: Overview of the Cross-Language !ndian News Story Search (CL!NSS) Track. In: Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012)
[3] Yurii Palkovskii, Alexei Belov: Using TF-IDF Weight Ranking Model in CLINSS as Effective Similarity Measure to Identify Cases of Journalistic Text Re-use. In: Overview paper CLINSS 2012, Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012)
[4] Nitish Aggarwal, Kartik Asooja, Paul Buitelaar, Tamara Polajnar, Jorge Gracia: Cross-Lingual Linking of News Stories using ESA. In: Overview paper CLINSS 2012, Forum for Information Retrieval Evaluation, ISI, Kolkata, India (2012)
[5] List of stopwords, available on http://www.ranks.nl/resources/stopwords.html, http://norm.al/2009/04/14/list-of-english-stopwords/, http://www.webconfs.com/stopwords.php, http://jmlr.org/papers/volume5/lewis04a/a11-smartstop-list/english.stop

References continued
[6] List of verbs and adverbs, available on http://www.englishclub.com/vocabulary/regular-verbslist.htm, http://www.momswhothink.com/reading/list-ofverbs.html, http://www.linguanaut.com/verbs.htm, http://www.acme2k.co.uk/acme/3star%20verbs.htm, http://www.enchantedlearning.com/wordlist/verbs.shtml, http://www.enchantedlearning.com/wordlist/adverbs.shtml
[7] Google Translate, http://translate.google.com/?prev=hp&hl=en&text=&sl=en&tl=hi#en/hi/-
[8] Changathi Transliterator, available on http://hindi.changathi.com/
[9] Shabdanjali, available on http://ltrc.iiit.ac.in/onlineservices/dictionaries/dict_frame.html
[10] Porter stemmer, available on http://ir.dcs.gla.ac.uk/resources/linguistic_utils/porter.java
[11] Terrier 3.5, available on http://terrier.org/download/