CLIA The Third International Joint Conference On Natural Language Processing IJCNLP Proceedings of the Workshop

CLIA 2008
2nd International Workshop on Cross Lingual Information Access (CLIA)
Addressing the Information Need of Multilingual Societies
The Third International Joint Conference On Natural Language Processing IJCNLP 2008
Proceedings of the Workshop
11 January 2008, Hyderabad, India

Asian Federation of Natural Language Processing (AFNLP)

Preface

Welcome to the second international workshop on Cross Lingual Information Access (CLIA 2008), with a focus on "Addressing the Information Need of Multilingual Societies". In this workshop, as in the previous year, our aim is to bring together various trends in cross-lingual and multilingual information retrieval and access. This year we accepted eight papers after a careful review process, and these papers are included in the proceedings.

The workshop will have four sessions, each focusing on a specific theme: Cross Language Information Retrieval; Translations and Transliterations in CLIR; Information Extraction/Summarization in CLIR contexts; and, finally, a session giving an overview of the experiences of Indian research groups in the CLEF-2007 competition.

There are three papers in the first session, on Cross Language Information Retrieval. The first paper explores the effects of language relatedness on multilingual information retrieval. It presents a case study with Indo-European and Semitic languages and addresses some of the challenges that Semitic languages pose for IR. In the paper on Identifying Similar and Co-referring Documents Across Languages, the authors use the Vector Space Model (VSM) and named entities to identify co-reference and similarity. In the paper on finding parallel texts on the web using cross-language information retrieval, CLIR techniques are combined with structural features to retrieve candidate document pairs from the web.

In the second session, on Translations and Transliterations in CLIR, three papers will again be presented. The first paper presents the results of some experiments in Mining Named Entity Transliteration Pairs from Comparable Corpora, employing English-Tamil comparable corpus texts. In the second paper, on Domain-Specific Query Translation for Multilingual Information Access using Machine Translation Augmented with Dictionaries Mined from Wikipedia, the authors demonstrate that effective query translation for CLIA can be achieved in the domain of cultural heritage using a standard MT system, and that domain-specific phrase dictionaries may be automatically mined from the online Wikipedia. The paper Statistical Transliteration for Cross Language Information Retrieval using HMM alignment model and CRF presents a technique that combines HMM and CRF for the transliteration task in CLIR.

In the third session we have two papers. The first paper is Script Independent Word Spotting in Multilingual Documents, which describes a system that accepts a text query from the user and returns a ranked list of word images from a document image corpus, based on similarity with the query word. The second paper is about building a document graph based multi-document summarizer that makes use of a graph model both at offline processing time and at query time.

Finally, in addition to all the refereed papers, we have six invited presentations by various teams focusing on Indian language CLIR. These presentations are based on the work done by these teams for the Ad-hoc task of the Cross Language Evaluation Forum (CLEF) in 2007. Teams from IIT Bombay (focusing on Marathi and Hindi), IIT Kharagpur (Bengali and Hindi), IIIT Hyderabad (Telugu and Hindi), Microsoft Research India (Tamil, Telugu and Hindi) and Jadavpur University (Bengali, Telugu and Hindi) will present their work on CLIR for queries in Indian languages and documents in English. In this special session, a team from ISI, Kolkata will also make a presentation on FIRE (Forum for Information Retrieval Evaluation), a proposed cross language evaluation forum specifically for Indian languages. Abstracts of these presentations are also included in these proceedings.

We would like to thank all authors for the hard work that they have put into submission, rework and presentation. The workshop would not be possible without them. We would also like to thank the program committee and all the reviewers for their valuable feedback. We would like to thank Minhaj Babji for all his help in preparing these proceedings as well as supporting the organizing committee during all phases of the workshop. We hope you enjoy the workshop.

Vasudeva Varma, Pushpak Bhattacharya, Sivaji Bandyopadhyay, A. Kumaran, Sudeshna Sarkar
(Editors, CLIA 2008 Workshop)

Committees

Organizing Committee

Vasudeva Varma, IIIT Hyderabad, India
Pushpak Bhattacharya, IIT Bombay, India
Sudeshna Sarkar, IIT Kharagpur, India
A Kumaran, Microsoft Research, India
Sivaji Bandyopadhyay, Jadavpur University, Kolkata, India

Program Committee

Asanee Kawtrakul, Kasetsart University, Bangkok, Thailand
Carol Peters, Istituto di Scienza e Tecnologie dell'Informazione and CLEF campaign, Italy
Gilles Serasset, GETALP-LIG, Grenoble, France
Kumaran A, Microsoft Research, Bangalore, India
Lucy Vanderwende, Microsoft Research, USA
Mandar Mitra, ISI Kolkata, India
Paolo Rosso, Universidad Politecnica de Valencia (UPV), Spain
Patrick Saint Dizier, IRIT, Universite Paul Sabatier, Toulouse, France
Paul McNamee, Johns Hopkins University, USA
Petri Myllymaki, University of Helsinki, Finland
Pushpak Bhattacharya, IIT Bombay, India
Ralf Steinberger, European Commission - Joint Research Centre, Italy
Sivaji Bandyopadhyay, Jadavpur University, Kolkata, India
Sobha L, AU-KBC, Chennai, India
Sudeshna Sarkar, IIT Kharagpur, India
Vasudeva Varma, IIIT Hyderabad, India

Workshop Program

11 January 2008, Hyderabad, India

08:45-09:00  Workshop Introduction and Opening Remarks

09:00-10:30  Session I: Cross Language Information Retrieval

    The Effects of Language Relatedness on Multilingual Information Retrieval: A Case Study With Indo-European and Semitic Languages
        Peter Chew and Ahmed Abdelali

    Identifying Similar and Co-referring Documents Across Languages
        Pattabhi R K Rao T and Sobha L

    Finding parallel texts on the web using cross-language information retrieval
        Achim Ruopp and Fei Xia

10:30-11:00  Tea Break

11:00-12:30  Session II: Translation and Transliteration in CLIR

    Some Experiments in Mining Named Entity Transliteration Pairs from Comparable Corpora
        K Saravanan and A Kumaran

    Domain-Specific Query Translation for Multilingual Information Access using Machine Translation Augmented With Dictionaries Mined from Wikipedia
        Gareth Jones, Fabio Fantino, Eamonn Newman and Ying Zhang

    Statistical Transliteration for Cross Language Information Retrieval using HMM alignment model and CRF
        Prasad Pingali, Suryaganesh, Sreeharsha Yella and Vasudeva Varma

12:30-14:00  Lunch Break

14:00-15:00  Session III: Cross Language Information Access and Evaluation

    Script Independent Word Spotting in Multilingual Documents
        Anurag Bhardwaj, Damien Jose and Venu Govindaraju

    A Document Graph Based Query Focused Multi-Document Summarizer
        Sibabrata Paladhi and Sivaji Bandyopadhyay

15:00-15:30  Tea Break

15:30-17:30  Session IV: CLIR in Indian Languages - Invited Talks

    Hindi and Marathi to English Cross Language Information Retrieval
        Manoj Kumar Chinnakotla, Sagar Ranadive, Om P. Damani and Pushpak Bhattacharyya

    Bengali and Hindi to English CLIR Evaluation
        Debasis Mandal, Sandipan Dandapat, Mayank Gupta, Pratyush Banerjee, Sudeshna Sarkar

    Bengali, Hindi and Telugu to English Ad-hoc Bilingual task
        Sivaji Bandyopadhyay, Tapabrata Mondal, Sudip Kumar Naskar, Asif Ekbal, Rejwanul Haque, Srinivasa Rao Godavarthy

    Cross-Lingual Information Retrieval System for Indian Languages
        Jagadeesh Jagarlamudi and A Kumaran

    Hindi and Telugu to English CLIR using Query Expansion
        Prasad Pingali, Vasudeva Varma

    FIRE: Forum for Information Retrieval Evaluation
        Mandar Mitra and Prosenjit Majumdar

17:30-17:45  Conclusions and Closing Remarks

Table of Contents

The Effects of Language Relatedness on Multilingual Information Retrieval: A Case Study With Indo-European and Semitic Languages
    Peter Chew and Ahmed Abdelali ... 01

Identifying Similar and Co-referring Documents Across Languages
    Pattabhi R K Rao T and Sobha L ... 10

Finding parallel texts on the web using cross-language information retrieval
    Achim Ruopp and Fei Xia ... 18

Some Experiments in Mining Named Entity Transliteration Pairs from Comparable Corpora
    K Saravanan and A Kumaran ... 26

Domain-Specific Query Translation for Multilingual Information Access using Machine Translation Augmented With Dictionaries Mined from Wikipedia
    Gareth Jones, Fabio Fantino, Eamonn Newman and Ying Zhang ... 34

Statistical Transliteration for Cross Language Information Retrieval using HMM alignment model and CRF
    Prasad Pingali, Suryaganesh Veeravalli, Sreeharsha Yella and Vasudeva Varma ... 42

Script Independent Word Spotting in Multilingual Documents
    Anurag Bhardwaj, Damien Jose and Venu Govindaraju ... 48

A Document Graph Based Query Focused Multi-Document Summarizer
    Sibabrata Paladhi and Sivaji Bandyopadhyay ... 55

CLIR in Indian Languages - Invited Talks

Hindi and Marathi to English Cross Language Information Retrieval
    Manoj Kumar Chinnakotla, Sagar Ranadive, Om P. Damani and Pushpak Bhattacharyya ... 64

Bengali and Hindi to English CLIR Evaluation
    Debasis Mandal, Sandipan Dandapat, Mayank Gupta, Pratyush Banerjee, Sudeshna Sarkar ... 65

Bengali, Hindi and Telugu to English Ad-hoc Bilingual task
    Sivaji Bandyopadhyay, Tapabrata Mondal, Sudip Kumar Naskar, Asif Ekbal, Rejwanul Haque, Srinivasa Rao Godavarthy ... 66

Cross-Lingual Information Retrieval System for Indian Languages
    Jagadeesh Jagarlamudi and A Kumaran ... 67

Hindi and Telugu to English CLIR using Query Expansion
    Prasad Pingali, Vasudeva Varma ... 68

FIRE: Forum for Information Retrieval Evaluation
    Mandar Mitra and Prosenjit Majumdar ... 69