Hindi and Marathi to English Cross Language Information Retrieval at CLEF 2007

Manoj Kumar Chinnakotla
Joint work with Sagar Ranadive, Pushpak Bhattacharyya and Om P. Damani
Department of Computer Science and Engineering, IIT Bombay, Mumbai, India

Motivation
- English is still the most dominant language on the web; it contributes 72% of the content
- Number of non-English users on the web is steadily rising
- English penetration in India
  - Estimated to be less than 3-4%
  - Presence mostly in the urban educated sections
- CLIR systems are key to enabling access to English content through non-English languages
2007, IIT Bombay

Hindi and Marathi
- Hindi
  - Official language of India
  - Spoken by almost 40% of the population
- Marathi
  - Widely spoken language in Western India
  - Spoken by almost 7% of the population
- Both of them
  - Are written in Devanagari, a phonetic script
  - Derive vocabulary from Sanskrit

System Architecture
[Figure: system architecture diagram]

Language Resources
- Developed at the Center for Indian Language Technologies (CFILT), IIT Bombay
- Stemmer and Morphological Analyzer (MA)
  - Rule-based stemmer and MA
- Bilingual dictionaries
  - Hindi-English
    - 1,15,571 entries (115,571 in Western digit grouping)
    - Available online: http://www.cfilt.iitb.ac.in/~hdict/webinterface_user/dict_search_user.php
  - Marathi-English
    - Relatively low coverage: 6110 entries

Devanagari-English Transliteration
- A simple rule-based transliteration scheme
- Manually created Devanagari-to-English transliteration mapping table, with an entry for each Devanagari letter
- Given a string, scan it from left to right and transliterate each letter using the table
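
The left-to-right table lookup can be sketched as follows. This is only an illustration: the authors' actual mapping table is not given in the slides, so `TRANSLIT_TABLE` and `transliterate` are hypothetical, and the table holds just enough entries for one word. The input is assumed to be the Hindi word for "Australian", written with Unicode escapes so the code points are unambiguous.

```python
# Hypothetical per-letter table (NOT the authors' table): one Latin string
# per Devanagari code point, with no special handling of inherent vowels.
TRANSLIT_TABLE = {
    "\u0911": "aa",  # DEVANAGARI LETTER CANDRA O
    "\u0908": "i",   # DEVANAGARI LETTER II
    "\u0938": "s",   # DEVANAGARI LETTER SA
    "\u091f": "t",   # DEVANAGARI LETTER TTA
    "\u0930": "r",   # DEVANAGARI LETTER RA
    "\u0932": "l",   # DEVANAGARI LETTER LA
    "\u092f": "y",   # DEVANAGARI LETTER YA
    "\u0947": "e",   # DEVANAGARI VOWEL SIGN E
    "\u093f": "i",   # DEVANAGARI VOWEL SIGN I
    "\u093e": "a",   # DEVANAGARI VOWEL SIGN AA
    "\u094d": "",    # DEVANAGARI SIGN VIRAMA: emits nothing
}


def transliterate(devanagari: str) -> str:
    """Scan left to right and replace each code point via the table.

    Code points missing from the table are dropped (a simplification).
    """
    return "".join(TRANSLIT_TABLE.get(ch, "") for ch in devanagari)


# Assumed input: the Hindi word for "Australian".
word = "\u0911\u0938\u094d\u091f\u094d\u0930\u0947\u0932\u093f\u092f\u093e\u0908"
print(transliterate(word))
```

A table like this has no notion of English spelling, which is why its output can be a string that is not a valid English word.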

Devanagari-English Transliteration (Contd.)
- Sometimes leads to invalid English words
- The resulting transliteration is compared with the unique words in the corpus to find the k closest matches
- Closeness is defined in terms of string edit distance (Levenshtein distance)
- In the current experiments, k is set to 3
- Example
  - Simple rule-based transliteration: aastreliyai (invalid word in English)
  - Find the k closest matches in the corpus
  - Final top 3 transliterations: australian, australia, estrella
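
The closest-match step can be sketched as below, assuming a plain dynamic-programming Levenshtein distance and a linear scan over the vocabulary. The function names and the tiny `VOCAB` list are illustrative assumptions; the real system compares against all unique words in the target corpus.

```python
import heapq


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def k_closest(candidate: str, vocabulary, k: int = 3):
    """Return the k vocabulary words with the smallest edit distance."""
    return heapq.nsmallest(k, vocabulary,
                           key=lambda w: levenshtein(candidate, w))


# Toy vocabulary standing in for the unique words of the corpus.
VOCAB = ["australia", "australian", "estrella", "water", "river", "burn"]
print(k_closest("aastreliyai", VOCAB))
```

A linear scan is adequate for a sketch; for a large corpus vocabulary one would likely prune candidates first, for example by length difference.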

Translation Disambiguation
- Disambiguates the various translation choices for each source word based on word-word association measures
- For example
  - Hindi query: (River Water)
  - Translation choices: {River} {Water, to Burn}
  - Choose based on word-word association strength

Iterative Translation Disambiguation Algorithm
- Proposed by Christof Monz and Bonnie J. Dorr (SIGIR 2005)
- Construct graph
  - Nodes: translation choices for a given source word
  - Links: between translations of different source words
- Initialize node weights assuming all translations of a given source word are equally likely
[Figure: graph with source words S_i, S_j, S_k and translation nodes t_{i,1}; t_{j,1}, t_{j,2}, t_{j,3}; t_{k,1}, t_{k,2}]
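
A minimal sketch of the graph construction and uniform initialization, under the assumption that a node is a (source word, translation) pair and that links join only translations of different source words. The identifiers `src_1` and `src_2` are placeholders for the Hindi query words, which are not preserved in this transcript.

```python
from itertools import combinations


def build_graph(translations):
    """Build nodes, links and initial weights for the disambiguation graph.

    translations maps each source word to its list of candidate translations.
    """
    nodes = [(s, t) for s, cands in translations.items() for t in cands]
    # Link two nodes only if they translate different source words.
    links = [(a, b) for a, b in combinations(nodes, 2) if a[0] != b[0]]
    # All translations of one source word start out equally likely.
    weights = {(s, t): 1.0 / len(cands)
               for s, cands in translations.items() for t in cands}
    return nodes, links, weights


# Placeholder source words for the "River Water" example.
query = {"src_1": ["river"], "src_2": ["water", "burn"]}
nodes, links, weights = build_graph(query)
print(links)
print(weights)
```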

Iterative Translation Disambiguation Algorithm (Contd.)
- Link strength between two nodes is computed from term-term co-occurrence statistics
  - Dice Coefficient (Dice)
  - Point-wise Mutual Information (PMI)
- The weight update equation combines three terms
  - Previous weight of the node
  - Link strength to each neighbour
  - Weight of each neighbour
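
The update can be sketched as below. The slide's equation itself is not preserved, so the form used here is an assumption built from its three labelled terms: new weight = previous weight + sum over neighbours of (link strength x neighbour weight), followed by a renormalization over the translations of each source word (also an assumption). The co-occurrence counts are invented toy numbers, and `src_1`/`src_2` are placeholder source words.

```python
import math


def dice(f_xy: float, f_x: float, f_y: float) -> float:
    """Dice coefficient from joint and individual occurrence counts."""
    return 2.0 * f_xy / (f_x + f_y) if (f_x + f_y) else 0.0


def pmi(f_xy: float, f_x: float, f_y: float, n: float) -> float:
    """Point-wise mutual information; unseen pairs score 0 (a simplification)."""
    if f_xy == 0 or f_x == 0 or f_y == 0:
        return 0.0
    return math.log((f_xy * n) / (f_x * f_y))


def disambiguate(translations, link, iterations=10):
    """Iteratively re-weight translation candidates.

    Assumes link() returns non-negative strengths (true for Dice).
    """
    weights = {s: {t: 1.0 / len(c) for t in c}
               for s, c in translations.items()}
    for _ in range(iterations):
        updated = {}
        for s, cands in translations.items():
            scores = {}
            for t in cands:
                # Support from the translations of every other source word.
                support = sum(link(t, t2) * w2
                              for s2, others in weights.items() if s2 != s
                              for t2, w2 in others.items())
                scores[t] = weights[s][t] + support
            total = sum(scores.values())
            updated[s] = {t: v / total for t, v in scores.items()}
        weights = updated
    return weights


# Invented toy counts, for illustration only.
FREQ = {"river": 100, "water": 200, "burn": 50}
CO_FREQ = {frozenset(("river", "water")): 60,
           frozenset(("river", "burn")): 2}


def dice_link(x: str, y: str) -> float:
    return dice(CO_FREQ.get(frozenset((x, y)), 0), FREQ[x], FREQ[y])


query = {"src_1": ["river"], "src_2": ["water", "burn"]}
result = disambiguate(query, dice_link)
print(result["src_2"])
```

With these toy counts, "water" co-occurs with "river" far more often than "burn" does, so its weight grows at each iteration while the weight of "burn" shrinks.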

Results (Summary)
Values in parentheses are percentages of the monolingual baseline.

Experiment             Measure  MAP              Recall           P@20
Hindi, Title           Dice     0.2366 (61.36%)  72.58% (89.16%)  0.2700 (69.05%)
Hindi, Title           PMI      0.2089 (54.17%)  68.53% (84.19%)  0.2390 (61.12%)
Hindi, Title + Desc    Dice     0.2952 (67.06%)  76.55% (87.32%)  0.3150 (73.77%)
Hindi, Title + Desc    PMI      0.2645 (60.08%)  72.76% (82.99%)  0.2950 (69.09%)
Marathi, Title         Dice     0.2163 (56.09%)  62.44% (76.70%)  0.2510 (64.19%)
Marathi, Title         PMI      0.1935 (50.18%)  54.07% (66.42%)  0.2280 (58.31%)
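
As a quick consistency check on the MAP column, the sketch below divides each MAP value by its parenthesized percentage. It assumes the percentages are relative to a monolingual English baseline, as the conclusion slide suggests. The baseline MAP values themselves are not reported in the slides; the numbers printed here are derived, not quoted.

```python
# MAP and percentage-of-baseline pairs copied from the results table.
ROWS = {
    ("Hindi", "Title", "Dice"): (0.2366, 61.36),
    ("Hindi", "Title", "PMI"): (0.2089, 54.17),
    ("Hindi", "Title + Desc", "Dice"): (0.2952, 67.06),
    ("Hindi", "Title + Desc", "PMI"): (0.2645, 60.08),
    ("Marathi", "Title", "Dice"): (0.2163, 56.09),
    ("Marathi", "Title", "PMI"): (0.1935, 50.18),
}


def implied_baseline(map_score: float, percent: float) -> float:
    """MAP of the baseline implied by a score and its percentage of baseline."""
    return map_score / (percent / 100.0)


for key, (map_score, percent) in ROWS.items():
    print(key, round(implied_baseline(map_score, percent), 4))
```

Rows that share the same query fields should imply roughly the same baseline, which gives a simple way to spot transcription errors in the table.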

Results (P-R Curves): Title Only
[Figure: precision-recall curves]

Results (P-R Curves): Title + Desc
[Figure: precision-recall curves]

Conclusion
- A query-translation-based approach was taken for Hindi and Marathi to English CLIR using bilingual dictionaries
- Results are quite encouraging
  - 67.06% of the monolingual baseline for Hindi
  - 56.09% of the monolingual baseline for Marathi
- Simple rule-based transliteration, taking the closest edit-distance-based matches from the corpus, performs well
- Translation disambiguation helps in selecting the correct translation choices

Acknowledgements
- First author supported by the Infosys Fellowship Award
- Project linguists at CFILT, IIT Bombay
- Manish Shrivastava, for help on many stemmer-related issues

References
- Christof Monz and Bonnie J. Dorr, Iterative Translation Disambiguation for Cross-Language Information Retrieval, In SIGIR '05, pages 520-527, New York, USA, ACM Press.
- Nicola Bertoldi and Marcello Federico, Statistical Models for Monolingual and Bilingual Information Retrieval, Information Retrieval, 7 (1-2): 53-72, 2004.
- Martin Braschler and Carol Peters, Cross Language Evaluation Forum: Objectives, Results, Achievements, Information Retrieval, 7 (1-2): 7-31, 2004.
- Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Pearson Education, 2005.
- Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.