Cross-Lingual Information Retrieval. Language Technology I

Terminology: monolingual, multilingual, cross-lingual
- monolingual: Query (en) against Documents (en)
- multilingual: Query (en) against Documents (en); Query (de) against Documents (de)
- cross-lingual: Query (en) or Query (de) against Documents (en) and Documents (de)

Use Scenarios (I)
A user has no knowledge of a target language, i.e., she cannot search for documents in that language at all.
- With CLIR she can make use of media data pools that are indexed with captions in that language, for example picture pools, music databases, etc.
- With CLIR she can get a pre-selection of documents that can then be passed on to a translator.

Use Scenarios (II)
A user has only passive knowledge of a target language, i.e., she can read documents in that language but cannot actively formulate queries in it.
- With CLIR she can still find and make use of relevant texts.

Use Scenarios (III)
A document collection covers so many languages that it would be impractical to formulate a query in each of them.
- With CLIR one can get relevant documents with a single search query in only one of these languages.

CLIR approaches
- Machine translation: uses NLP tools such as PoS taggers, parsers, morphological analyzers, etc.
- Thesaurus-based approaches
  - manual use of thesauri: controlled vocabulary systems
  - automatic use of thesauri: concept retrieval systems
- Corpus-based methods: work with frequency analysis
  - Implication: the aboutness of the two collections should be similar

MT Approach - Architecture
[Figure: the CLIR problem and three MT-based architectures]
- CLIR problem: Query (en) ??? Index (de), Documents (de)
- Document Translation: Query (en), Index (de), Index (en), Documents (de), Documents (en)
- Index Translation: Query (en), Index (de), Index (en), Documents (de)
- Query Translation: Query (en), Query (de), Index (de), Documents (de)

Document Translation
Problem solved by multiplying the texts
- Make the texts available in all languages
- Multilingual (= several monolingual) retrieval
Feasibility:
- Required in some applications: patents, multilingual states (EG, Belgium, ...)
- Impossible in other areas (Internet)
Evaluation: from costly to impossible
- Results depend on translation quality
- Updates to the translation dictionary invalidate search on the existing document pool (-> retranslate everything)

Index Translation
Idea: multilingual index
- Analyze the query in the query language, translate the terms
- Search with all document-language index terms
- (Problem: retranslation of the hits)
Feasibility: not feasible
- Ambiguity of index terms
- Multiword terms are not in the index
- Context dependency of translations:
  - Fehler: mistake, fault, error, bug
  - nuclear: Kern~, zentral, nuklear
  - power: Macht, Kraft, Strom
  - plant: Pflanze, Unternehmen
=> Organize the index as a special resource!
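The ambiguity problem can be made concrete with a minimal sketch. It assumes a hypothetical word-by-word dictionary `EN_DE` filled with the example entries from the slide and enumerates every term-by-term translation of the multiword term "nuclear power plant". The function name and the data structure are illustrative assumptions, not part of any described system.

```python
from itertools import product

# Hypothetical English-to-German dictionary. The entries are the ambiguity
# examples from the slide; the dict layout is an assumption for illustration.
EN_DE = {
    "nuclear": ["Kern~", "zentral", "nuklear"],
    "power": ["Macht", "Kraft", "Strom"],
    "plant": ["Pflanze", "Unternehmen"],
}

def candidate_translations(terms, dictionary):
    """Return every word-by-word translation of a multiword term."""
    # Unknown terms are passed through unchanged.
    options = [dictionary.get(term, [term]) for term in terms]
    return [" ".join(choice) for choice in product(*options)]

candidates = candidate_translations(["nuclear", "power", "plant"], EN_DE)
print(len(candidates))  # 3 * 3 * 2 = 18 combinations
```

None of the 18 candidates is the compound a German index would actually contain (such as "Kernkraftwerk"), which illustrates why multiword terms and context-dependent translations make plain index translation impractical.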

Query Translation
Approach: translation of the query
- Analyze and translate the query terms
- Search in the (monolingual) backend system
Evaluation:
- The backend database stays unchanged
- Translation changes do not affect the document base
- The cross-lingual component, as system frontend, contains a multilingual linguistic resource
  - which is also usable for retranslation
  - and can be maintained independently
- Cross-linguality is transparent for the users
- Fine-tuning between frontend and backend is required
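The frontend/backend split described above might be sketched as follows. The dictionary `EN_DE`, the German collection `DOCS_DE`, and both function names are hypothetical, and the backend uses a deliberately naive term-overlap ranking; the point is only that the backend never sees anything but German terms.

```python
# Hypothetical bilingual dictionary and German document collection.
EN_DE = {"error": ["Fehler"], "power": ["Strom", "Kraft"], "plant": ["Pflanze"]}
DOCS_DE = {
    "d1": "der Fehler im Strom Netz",
    "d2": "die Pflanze braucht Wasser",
    "d3": "Kraft und Macht",
}

def translate_query(query, dictionary):
    """Frontend: replace each query term by all of its dictionary translations."""
    translated = []
    for term in query.lower().split():
        # Terms missing from the dictionary are kept untranslated.
        translated.extend(dictionary.get(term, [term]))
    return translated

def search_monolingual(terms, docs):
    """Backend: rank documents by the number of tokens that match a query term."""
    wanted = {term.lower() for term in terms}
    scores = {}
    for doc_id, text in docs.items():
        hits = sum(1 for token in text.lower().split() if token in wanted)
        if hits:
            scores[doc_id] = hits
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

results = search_monolingual(translate_query("power error", EN_DE), DOCS_DE)
print(results)
```

Changing `EN_DE` changes what the frontend sends to the backend, while `DOCS_DE` and its ranking function stay untouched.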

MT Approach
pros:
- straightforward (if an MT system is available)
- the user can directly use the retrieved documents
- documents usually have more context, which allows more robust MT than query translation does
cons:
- translation of document collections may be very time-consuming
- offline translation of document collections may require a lot of additional storage
- inherits most weaknesses of MT and of MT system implementations

Thesaurus-Based Approach: Thesauri
thesaurus: a resource which organizes the terminology of a domain of knowledge, i.e., an ontology for terminology
multilingual thesauri encode
- usually: cross-linguistic synonymy
- sometimes: hierarchical relations between terms (hyperonymy, hyponymy, etc.)
- seldom: associative relations between terms
the thesaurus-based approach to CLIR
- uses multilingual thesauri
- has a rather broad definition of a thesaurus
examples of multilingual thesauri used for CLIR:
- simple cross-language synonym lists
- collections of concepts with attached cross-lingual information
- classic syntax and semantics lexicons

Thesaurus-Based Approach: Thesauri
pros:
- very productive, especially for skilled users
- works transparently for the user
- unambiguous mapping between the query and the target document
cons:
- good thesauri are very expensive to create
- target documents must be labeled with concepts
- may be difficult to use for inexperienced users (e.g., because of the manual selection of the intended concept)
- doesn't scale
- restricted to certain domains
- IR queries can only be as precise as the predefined thesaurus concepts
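A minimal sketch of concept-based retrieval over a multilingual thesaurus, under stated assumptions: the concept ids, the labels in `THESAURUS`, and the document labeling in `DOC_CONCEPTS` are all invented for illustration. A query term is mapped to concept ids in its own language, and documents are matched on those ids regardless of the language they are written in.

```python
# Hypothetical multilingual thesaurus: concept id -> labels per language.
THESAURUS = {
    "C1": {"en": ["error", "fault"], "de": ["Fehler"]},
    "C2": {"en": ["power plant"], "de": ["Kraftwerk"]},
}

# Hypothetical collection: each document is labeled with concept ids.
DOC_CONCEPTS = {
    "doc_de_1": {"C1"},
    "doc_de_2": {"C2"},
    "doc_en_1": {"C1", "C2"},
}

def concepts_for(term, lang):
    """Map a term in the given language to the set of matching concept ids."""
    term = term.lower()
    return {
        concept_id
        for concept_id, labels in THESAURUS.items()
        if term in (label.lower() for label in labels.get(lang, []))
    }

def retrieve(term, lang):
    """Return ids of documents labeled with any concept the term denotes."""
    wanted = concepts_for(term, lang)
    return sorted(doc for doc, concepts in DOC_CONCEPTS.items() if concepts & wanted)

print(retrieve("fault", "en"))
print(retrieve("Kraftwerk", "de"))
```

The sketch also shows two of the listed drawbacks: documents without concept labels can never be found, and a term that is not in the thesaurus retrieves nothing.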

Corpus-Based Approach
use of statistical information about term usage from parallel corpora
usually based on two general retrieval principles:
- target documents with frequent usage of query terms are potentially more relevant than target documents with infrequent query term usage
- rare query terms are more useful than query terms that are very frequent in the overall target document collection
pros:
- usage of recent terminology (as provided by the corpora) is possible
cons:
- parallel corpora are needed
- restricted to the domains of the parallel corpora
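The two retrieval principles correspond to what is commonly called term frequency and inverse document frequency. The sketch below uses a plain TF-IDF weighting as one possible instance of these principles; the slides do not name a specific formula, and the toy collection `DOCS` and the function name are assumptions.

```python
import math
from collections import Counter

# Hypothetical target collection (already tokenizable on whitespace).
DOCS = {
    "d1": "retrieval of documents and retrieval of images",
    "d2": "translation of documents",
    "d3": "translation of queries",
}

def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of tf * log(N / df)."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(
            tf[term] * math.log(n_docs / df[term])  # frequent in doc, rare overall
            for term in query.split()
            if term in df  # terms unseen in the collection contribute nothing
        )
    return scores

scores = tf_idf_scores("retrieval of documents", DOCS)
print(sorted(scores, key=scores.get, reverse=True))
```

The term "of" occurs in every document, so its weight is log(3/3) = 0 and it does not influence the ranking, while the rare term "retrieval" dominates the score of `d1`.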

Pseudo-Relevance Feedback
- Enter query terms in French
- Find the top French documents in a parallel corpus
- Construct a query from their English translations
- Perform a monolingual free-text search
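The four steps can be sketched as below. Everything here is a toy assumption: the parallel corpus `PARALLEL`, the English collection `ENGLISH_DOCS`, the stopword list, the overlap-based ranking, and the parameters `top_docs` and `top_terms`. A real system would use a proper retrieval model at both search steps.

```python
from collections import Counter

# Hypothetical parallel corpus: aligned (French, English) document pairs.
PARALLEL = [
    ("le chat dort sur le tapis", "the cat sleeps on the rug"),
    ("le chat mange du poisson", "the cat eats fish"),
    ("la voiture roule vite", "the car drives fast"),
]
# Hypothetical English target collection for the final monolingual search.
ENGLISH_DOCS = {
    "e1": "a cat and a dog",
    "e2": "fast car racing",
    "e3": "fish for the cat",
}
STOPWORDS = {"the", "on", "a", "and", "for"}

def overlap(query_terms, text):
    """Count how often the query terms occur in the text."""
    tokens = text.split()
    return sum(tokens.count(term) for term in query_terms)

def cross_language_prf(french_query, top_docs=2, top_terms=3):
    query = french_query.split()                      # step 1: French query terms
    # Step 2: find the top French documents in the parallel corpus.
    ranked = sorted(PARALLEL, key=lambda pair: -overlap(query, pair[0]))
    top = [pair for pair in ranked[:top_docs] if overlap(query, pair[0]) > 0]
    # Step 3: build an English query from the aligned English sides.
    counts = Counter(
        token for _, english in top for token in english.split()
        if token not in STOPWORDS
    )
    english_query = [term for term, _ in counts.most_common(top_terms)]
    # Step 4: monolingual search over the English collection.
    hits = [d for d in ENGLISH_DOCS if overlap(english_query, ENGLISH_DOCS[d]) > 0]
    hits.sort(key=lambda d: (-overlap(english_query, ENGLISH_DOCS[d]), d))
    return english_query, hits

english_query, hits = cross_language_prf("chat")
print(english_query, hits)
```

No bilingual dictionary is involved: the translation of "chat" emerges only from the English sides of the French documents that matched the query.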

Learning From Document Pairs
- Count how often each term occurs in each pair
- Treat each pair as a single document
[Table: term counts per document pair. Rows: Doc 1 to Doc 5. Columns: English terms E1 to E5 and Spanish terms S1 to S4. Counts in extraction order: 4 2 2 1 8 4 4 2 2 2 2 1 2 1 2 1 4 1 2 1]
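A minimal sketch of the counting step, assuming a tiny hypothetical set of aligned English/Spanish pairs in `PAIRS`. Each pair is merged into one pseudo-document, and terms are prefixed with a language tag (an illustrative choice) so that the two vocabularies stay distinguishable.

```python
from collections import Counter

# Hypothetical aligned document pairs: (English text, Spanish text).
PAIRS = [
    ("dog dog cat", "perro perro gato"),
    ("cat", "gato"),
    ("dog house", "perro casa"),
]

def term_pair_matrix(pairs):
    """Return {term: [count in pair 1, count in pair 2, ...]} over merged pairs."""
    merged = []
    for english, spanish in pairs:
        # Treat the pair as a single document; tag terms with their language.
        tokens = [f"en:{t}" for t in english.split()]
        tokens += [f"es:{t}" for t in spanish.split()]
        merged.append(Counter(tokens))
    vocabulary = sorted(set().union(*merged))
    return {term: [counts[term] for counts in merged] for term in vocabulary}

matrix = term_pair_matrix(PAIRS)
for term, row in matrix.items():
    print(term, row)
```

In this toy data, a term and its translation end up with identical count vectors, which is the regularity that similarity-based methods exploit.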

Similarity-Based Dictionaries
- Automatically developed from aligned documents
- Terms E1 and E3 are used in similar ways
- Terms E1 & S1 (or E3 & S4) are even more similar
- For each term, find the most similar terms in the other language
- Retain only the top few (5 or so)
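One way to sketch this is with cosine similarity between term vectors, where each vector holds a term's counts across aligned document pairs. Cosine is an assumed similarity measure (the slide does not name one), and the vectors in `TERM_VECTORS`, the language prefixes, and the function names are hypothetical.

```python
import math

# Hypothetical term vectors: counts of each term across three aligned pairs.
TERM_VECTORS = {
    "en:dog": [2, 0, 1], "en:cat": [1, 1, 0], "en:house": [0, 0, 1],
    "es:perro": [2, 0, 1], "es:gato": [1, 1, 0], "es:casa": [0, 0, 1],
}

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_dictionary(vectors, source="en:", target="es:", top_k=5):
    """For each source-language term, keep the top_k most similar target terms."""
    dictionary = {}
    for term, vector in vectors.items():
        if not term.startswith(source):
            continue
        scored = [
            (cosine(vector, other_vector), other)
            for other, other_vector in vectors.items()
            if other.startswith(target)
        ]
        # Most similar first; ties are broken alphabetically.
        scored.sort(key=lambda item: (-item[0], item[1]))
        dictionary[term] = [other for _, other in scored[:top_k]]
    return dictionary

best = similarity_dictionary(TERM_VECTORS, top_k=1)
print(best)
```

Retaining only the top few candidates per term keeps the resulting dictionary small and drops weakly associated terms.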

CLIR Research Community
- Text REtrieval Conference (TREC, http://trec.nist.gov/): Arabic, English, Spanish, Chinese, etc.
  - CLIR at TREC: http://www.glue.umd.edu/~dlrg/clir/trec2002/
- Cross-Language Evaluation Forum (CLEF): European languages, http://www.clef-campaign.org/
- NTCIR (NII Test Collection for IR Systems): http://research.nii.ac.jp/ntcir/index-en.html
  - with related workshops
- Information Retrieval for Asian Language (IRAL) international workshop
- and quite a few others