Introduction to Information Retrieval

http://informationretrieval.org
Cross Language IR
Hinrich Schütze, Christina Lioma
Institute for Natural Language Processing, University of Stuttgart
2010-07-05

Outline
1. Introduction
2. Language-specific problems
3. IR problems
4. Translation approaches

Definitions
Crosslingual (a.k.a. cross-language) IR (CLIR): retrieval of documents in a language different from that of the query, e.g., bilingual or trilingual IR
Multilingual (a.k.a. multi-language) IR (MLIR): retrieval of documents in several languages
Motivation
  Internet usage: 29.5% English, 70.5% non-English (Lazarinis et al. 2007)
  User scenarios: monolingual / multilingual users (partly or passively)
  Intelligence: state companies (finding competing companies, finding calls for tenders, etc.)

Language-specific problems
1. Encoding, capitalisation, diacritics, compounding, complicated morphology, ...
2. Lack of script standards, e.g. Khrushchev, Chrustschev, Khrooshtchoff, Chruhszhtchow, Jruchev, Chroesjtjov, Crustsciof
  Transliteration: spelling words from one language with characters from the alphabet of another, usually by character-by-character replacement
  Transcription: representation of the sound of words in a language using any set of symbols, e.g., the International Phonetic Alphabet (IPA)
  Latin script predominance on the Web, e.g. Greeklish
  Often ad hoc use of numbers and symbols, e.g. 8 for θ

Language-specific problems
3. Not always one-to-one correspondence with Latin characters, e.g., standard Hebrew (undotted & unvocalised) orthography
4. Writing order:
  Standard Indo-European: top-to-bottom, left-to-right
  Hebrew: right-to-left; Japanese (traditional vertical writing): columns ordered right-to-left
5. Need for tokenisation
  Arabic, Iranian, Uzbek (use variants of the Arabic script): no capitalisation, no punctuation, hence difficult to detect sentence boundaries.
  Also, letters may be joined: a letter looks different when it stands alone, when it is the first letter of a connected set of letters, when it is somewhere in the middle of a connection, and when it appears at the end of a set of connected letters.
  Tokenisation is costly and may introduce errors.

Language-specific problems
6. Under-represented languages
Example: Armenian
  Uses its own script (its own I-E branch): not widely known in the world
  Small number of native speakers (3 million in Armenia, 8 million abroad)
  Changes in the script: in the 1920s Soviet Armenia reformed spelling, but the reform was rejected by the Armenian diaspora (which significantly outnumbers the country's population)
  Result: the already weak presence of Armenian on the Web lacks uniformity in script, which in practice means noise for search engines.

IR problems
IR problems arising from non-standard script
The same language entities are represented under different forms: no new words are added to the language, only different ways of writing the same words.
Indexing problem:
  Should all these term variants be indexed as one entry or as separate entries?
  Should these terms be normalised in some way, e.g., stemmed?
Matching problem:
  Should a query containing the term in Russian letters be matched to a relevant document containing the term in Latin letters?
  Should a term written in Russian letters receive the same term weight as the same term written in Latin letters?
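One possible answer to the indexing and matching questions is to map every script variant to a single normalised key before indexing. The sketch below is only an illustration: the transliteration table `CYR_TO_LAT` is a tiny, hypothetical mapping (not a transliteration standard), and the two documents are made up.

```python
from collections import defaultdict

# Hypothetical, incomplete Cyrillic-to-Latin table; an assumption for this
# sketch, not an official transliteration scheme.
CYR_TO_LAT = {"х": "kh", "р": "r", "у": "u", "щ": "shch", "ё": "e", "е": "e", "в": "v"}


def normalise(term):
    """Lower-case a term and transliterate the Cyrillic letters we know about."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in term.lower())


def build_index(docs):
    """Inverted index keyed by the normalised form, so variants share one entry."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalise(token)].add(doc_id)
    return index


# Made-up documents: one in Latin letters, one in Russian letters.
docs = {"d1": "Khrushchev speech", "d2": "Хрущёв речь"}
index = build_index(docs)

# A query term in Russian letters now matches both documents.
hits = index[normalise("Хрущёв")]
```

Indexing all variants under one key answers the first indexing question with "one entry"; it leaves the term-weighting question open, since both spellings then contribute to the same document frequency.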

Solution: key problem = translation
Treat as monolingual IR with translation
1. Document translation: translate documents into the query language
  Advantages:
    Translation may be more precise (in principle)
    Documents become readable by the user
  Disadvantages:
    Huge volume to be translated
    Impossible to translate them into all languages (Eng → Fre, Ger, Ita, ...)

Solution: key problem = translation
2. Query translation: translate the query into the document language(s)
  Advantages:
    Flexibility (translation on demand)
    Less text to translate
  Disadvantages:
    Less precise (queries are only 2-3 words long)
    The retrieved documents need to be translated (at least as a gist) to be readable

Integration of translation into IR
Approach 1:
  translate the query into different languages
  retrieve documents in each language
  merge the results into a single result list
Merging strategies:
  round-robin: take the first document from each list, then the second, and so on. Assumption: each list contains a similar number of documents, ranked similarly
  raw score: mix all the lists together and sort by similarity score. Assumption: similar IR method and collection statistics
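The two merging strategies can be sketched in a few lines. This is a minimal illustration, assuming each per-language result list is a list of `(doc_id, score)` pairs already sorted by descending score; the document ids and scores are made up.

```python
from itertools import zip_longest


def round_robin(ranked_lists):
    """Take rank 1 from every list, then rank 2, and so on."""
    merged = []
    for tier in zip_longest(*ranked_lists):
        # zip_longest pads exhausted lists with None, which we skip.
        merged.extend(item[0] for item in tier if item is not None)
    return merged


def raw_score(ranked_lists):
    """Pool all lists and sort by similarity score, ignoring the source list."""
    pooled = [item for ranked in ranked_lists for item in ranked]
    return [doc for doc, _ in sorted(pooled, key=lambda pair: pair[1], reverse=True)]


# Made-up results for a French and a German sub-collection.
french = [("fr1", 0.9), ("fr2", 0.4)]
german = [("de1", 0.7), ("de2", 0.6), ("de3", 0.1)]

by_rank = round_robin([french, german])   # fr1, de1, fr2, de2, de3
by_score = raw_score([french, german])    # fr1, de1, de2, fr2, de3
```

The two outputs differ exactly where the assumptions matter: raw-score merging trusts that 0.6 in the German list means more than 0.4 in the French list, which only holds if both runs use comparable scoring and collection statistics.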

Integration of translation into IR
Approach 2:
  translate the query into all the languages
  concatenate the translations into a mixed query
  run IR with the mixed query on the mixed documents
  avoid merging homographs from different languages (but, pour, ...)
  possible improvement: distinguish languages, e.g. add a language tag to index terms (but_f, pour_e)
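The language-tag idea can be sketched as follows. The tag format `term_lang`, the helper names, and the two toy documents are assumptions made for this illustration; the language of each document is taken as known.

```python
from collections import defaultdict


def tag(term, lang):
    """Attach a language tag to a term, e.g. ("but", "f") -> "but_f"."""
    return f"{term.lower()}_{lang}"


def build_tagged_index(docs):
    """Index a mixed collection; docs maps doc_id -> (language, text)."""
    index = defaultdict(set)
    for doc_id, (lang, text) in docs.items():
        for token in text.split():
            index[tag(token, lang)].add(doc_id)
    return index


# Toy mixed collection: "but" and "pour" occur in both languages.
docs = {
    "e1": ("e", "pour the water but slowly"),
    "f1": ("f", "un but pour tous"),
}
index = build_tagged_index(docs)

# Mixed query: English "goal" plus its French translation "but".
mixed_query = [tag("goal", "e"), tag("but", "f")]
results = set().union(*(index.get(term, set()) for term in mixed_query))
```

Without the tag, the French query term "but" would also retrieve the English document `e1`, where "but" is a conjunction.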

How to translate
1. Machine translation (MT)
2. Bilingual dictionaries, thesauri, lexical resources
3. Parallel texts: translated texts

Approach 1: using MT
Good solution iff translation quality is high
Problems:
  Quality
  Availability
  Development cost

Problems of MT
Translation quality
  Wrong choice of translation word/term (ambiguity)
    organic food → nourriture organique
  Wrong syntax
    Human-assisted machine translation → traduction automatique humain-aideé
  Unknown words
    Personal names
    Transliteration, transcription

Approach 2: using bilingual dictionaries
General form of dict. (e.g. Freedict)
  access: attaque, accéder, entrée, accès
  academic: étudiant, académique
  branch: filiale, succursale, spécialité, branche
  data: données, matériau, data
Approaches
  for each word in a query:
    1. select the best translation word
    2. select all the translation words
  for all query words:
    select the translation words that create the highest cohesion

Cohesion
cohesion ~ frequency of two translation words occurring together
Example
  data: données, matériau, data
  access: attaque, accéder, entrée, accès
Co-occurrence frequencies:
  (accès, données)     152
  (accéder, données)    31
  (données, entrée)     21
  (entrée, matériau)     3
  ...
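Cohesion-based selection can be sketched as a search over all combinations of candidate translations. The dictionary entries and the four co-occurrence counts below are the ones from the example; treating every unlisted pair as 0 and scoring a combination by the sum of its pairwise counts are assumptions of this sketch.

```python
from itertools import combinations, product

# Candidate translations from the example dictionary.
candidates = {
    "data": ["données", "matériau", "data"],
    "access": ["attaque", "accéder", "entrée", "accès"],
}

# Co-occurrence counts from the example; unlisted pairs are assumed to be 0.
cooccurrence = {
    frozenset(("accès", "données")): 152,
    frozenset(("accéder", "données")): 31,
    frozenset(("données", "entrée")): 21,
    frozenset(("entrée", "matériau")): 3,
}


def cohesion(translation):
    """Sum of pairwise co-occurrence counts of the chosen translation words."""
    return sum(cooccurrence.get(frozenset(pair), 0)
               for pair in combinations(translation, 2))


def best_translation(query_words):
    """Pick one translation per query word so that total cohesion is maximal."""
    options = product(*(candidates[word] for word in query_words))
    return max(options, key=cohesion)


best = best_translation(["data", "access"])  # ("données", "accès")
```

The exhaustive search grows exponentially with query length, which is tolerable for the 2-3-word queries discussed earlier but would need a greedy or approximate search for longer ones.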

Approach 3: parallel texts
Parallel texts contain possible translations of query words
Given a query in French:
  Find relevant documents in the parallel corpus
  Extract keywords from their parallel documents, and consider them as a query translation
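The two steps can be sketched with a toy parallel corpus. Everything here is an assumption for illustration: the three sentence pairs, term overlap as the relevance measure, raw frequency as the keyword criterion, and the small stopword list.

```python
from collections import Counter

# Toy French-English parallel corpus (made up).
parallel_corpus = [
    ("le chat dort", "the cat sleeps"),
    ("le chat mange", "the cat eats"),
    ("la voiture roule", "the car drives"),
]

STOPWORDS = frozenset({"the", "a", "of"})


def translate_query(query, corpus, top_docs=2, top_terms=1):
    """Retrieve on the French side, then harvest keywords from the English side."""
    query_terms = set(query.lower().split())

    def overlap(pair):
        return len(query_terms & set(pair[0].split()))

    # Step 1: rank parallel documents by overlap with the query.
    ranked = sorted(corpus, key=overlap, reverse=True)
    relevant = [pair for pair in ranked[:top_docs] if overlap(pair) > 0]

    # Step 2: the most frequent content words of the English sides become
    # the translated query.
    counts = Counter(word for _, english in relevant
                     for word in english.split() if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_terms)]


translated = translate_query("chat", parallel_corpus)
```

No dictionary or translation model is needed, but the quality of the result depends on how well the parallel corpus covers the query topic.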

Parallel texts (cont.)
Training a translation model
Principle:
  Train a statistical translation model from a set of parallel texts: $p(t_j \mid s_i)$
  The more often $s_i$ appears in texts parallel to those containing $t_j$, the higher $p(t_j \mid s_i)$
  Given a query, use the translation words with the highest probabilities as its translation

Principle of model training
$p(t_j \mid s_i)$ is estimated from a parallel training corpus, aligned into parallel sentences
IBM models 1, 2, 3, ...
Process:
  Input = parallel texts
  Sentence alignment $A$: $S_k \leftrightarrow T_h$
  Initial probability assignment: $t(t_j \mid s_i, A)$
  Expectation Maximisation (EM): $p(t_j \mid s_i, A)$
  Final result: $p(t_j \mid s_i) = p(t_j \mid s_i, A)$
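The training process can be illustrated with a stripped-down version of IBM Model 1. This is a sketch under simplifying assumptions: the sentence pairs are already aligned, there is no NULL source word, the initial probabilities are uniform, and the three-pair corpus is made up.

```python
from collections import defaultdict


def train_ibm1(sentence_pairs, iterations=10):
    """Estimate p(t | s) by EM; returns a dict keyed by (target, source)."""
    target_vocab = {t for _, target in sentence_pairs for t in target}
    # Initial probability assignment: uniform over the target vocabulary.
    prob = defaultdict(lambda: 1.0 / len(target_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (t, s) links
        total = defaultdict(float)   # expected count of s being linked at all
        # E-step: distribute each target word over the source words of its sentence.
        for source, target in sentence_pairs:
            for t in target:
                norm = sum(prob[(t, s)] for s in source)
                for s in source:
                    fraction = prob[(t, s)] / norm
                    count[(t, s)] += fraction
                    total[s] += fraction
        # M-step: renormalise the expected counts into probabilities.
        prob = defaultdict(float, {(t, s): c / total[s] for (t, s), c in count.items()})
    return prob


def translate(word, prob, k=1):
    """Return the k target words with the highest p(t | word)."""
    scored = [(p, t) for (t, s), p in prob.items() if s == word]
    return [t for _, t in sorted(scored, reverse=True)[:k]]


# Made-up aligned French-English sentence pairs.
pairs = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
    (["maison"], ["house"]),
]
model = train_ibm1(pairs)
```

Calling `translate("fleur", model)` then picks the most probable translation of a query word, which is the last step of the principle stated above.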

Sentence alignment
Assumptions:
1. The order of sentences in two parallel texts is similar
2. A sentence and its translation have similar length (length-based alignment)
3. A translation contains some known translation words or cognates
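Assumptions 1 and 2 are enough for a simple dynamic-programming aligner. The sketch below is a simplification, not a published algorithm: the cost of an alignment unit is the absolute difference in character length, and any unit other than one-to-one pays a fixed `skip_penalty` (a hypothetical parameter). The example sentences are made up.

```python
# Allowed alignment units: (number of source sentences, number of target sentences).
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]


def total_length(sentences):
    return sum(len(s) for s in sentences)


def align(source, target, skip_penalty=5):
    """Monotone alignment that minimises the summed length differences."""
    inf = float("inf")
    n, m = len(source), len(target)
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                diff = abs(total_length(source[i:ni]) - total_length(target[j:nj]))
                cost = best[i][j] + diff + (0 if (di, dj) == (1, 1) else skip_penalty)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the alignment units.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((source[pi:i], target[pj:j]))
        i, j = pi, pj
    return beads[::-1]


french = ["Bonjour.", "Comment allez-vous aujourd'hui ?", "Merci."]
english = ["Hello.", "How are you doing", "on this day?", "Thanks."]
beads = align(french, english)
```

In this example the long French sentence is paired with two English segments because their combined length is the closest match. Assumption 3 (known translation words or cognates) is not used here; it could be added as an extra term in the cost.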

Effectiveness: mean average precision

                  F-E (TREC6)  F-E (TREC7)  E-F (TREC6)  E-F (TREC7)
monolingual       0.2865       0.3203       0.3686       0.2764
Dict.             0.1707       0.1701       0.2305       0.1352
Systran           0.3098       0.3293       0.2727       0.2327
Hansard PT        0.2166       0.3124       0.2501       0.2587
Hansard PT+dict   0.2560       0.3245       0.3053       0.2649
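CLIR effectiveness is often quoted as a fraction of the monolingual baseline. The short calculation below derives those fractions from the table above; the variable names are arbitrary and the numbers are copied from the table.

```python
columns = ["F-E (TREC6)", "F-E (TREC7)", "E-F (TREC6)", "E-F (TREC7)"]

# Mean average precision values from the table.
scores = {
    "monolingual": [0.2865, 0.3203, 0.3686, 0.2764],
    "Dict.": [0.1707, 0.1701, 0.2305, 0.1352],
    "Systran": [0.3098, 0.3293, 0.2727, 0.2327],
    "Hansard PT": [0.2166, 0.3124, 0.2501, 0.2587],
    "Hansard PT+dict": [0.2560, 0.3245, 0.3053, 0.2649],
}

baseline = scores["monolingual"]

# Each system's score divided by the monolingual score of the same column.
relative = {
    system: [value / mono for value, mono in zip(values, baseline)]
    for system, values in scores.items()
    if system != "monolingual"
}

for system, ratios in relative.items():
    row = "  ".join(f"{column}: {ratio:.0%}" for column, ratio in zip(columns, ratios))
    print(f"{system:16s} {row}")
```

Read this way, the plain dictionary run reaches roughly 60% of the monolingual score on F-E (TREC6), while Systran slightly exceeds the monolingual baseline in both F-E columns.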

Problem of parallel texts
Only a few large parallel corpora (e.g. Canadian Hansards, EU parliament, HK Hansards, UN documents, ...)
Minor languages are not covered
Is it possible to extract parallel texts from the Web?
  STRAND: if a Web page contains two pointers and the anchor text of each pointer identifies a language, then the two referenced pages are treated as parallel
  PTMiner: parallel web pages = URLs that differ only in a tag identifying the language
    index.html vs. index_f.html
    /english/index.html vs. /french/index.html
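The URL-similarity heuristic can be sketched as a set of rewrite rules. This is only an illustration of the idea, not the PTMiner implementation: the two rules mirror the two URL examples given here, and the `example.org` URLs are hypothetical.

```python
import re

# Rewrite rules that turn an English URL into its presumed French counterpart.
# Both rules are assumptions modelled on the two examples above.
RULES = [
    (re.compile(r"/english/"), "/french/"),
    (re.compile(r"(\.html?)$"), r"_f\1"),
]


def candidate_pairs(urls):
    """Return (english_url, french_url) pairs whose URLs differ only by a language tag."""
    known = set(urls)
    pairs = set()
    for url in urls:
        for pattern, replacement in RULES:
            other, substitutions = pattern.subn(replacement, url, count=1)
            if substitutions and other in known:
                pairs.add((url, other))
    return pairs


# Hypothetical crawl result.
urls = [
    "http://example.org/index.html",
    "http://example.org/index_f.html",
    "http://example.org/english/about.html",
    "http://example.org/french/about.html",
    "http://example.org/contact.html",
]
pairs = candidate_pairs(urls)
```

URL matching only proposes candidates; a real miner would still have to check that the paired pages are translations of each other, for example by comparing lengths or structure.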

Mining results (Nie 2003)
French - English
  Exploration of 30% of 5474 candidate sites
  14198 pairs of parallel pages
  135MB French texts and 118MB English texts
Chinese - English
  196 candidate sites
  14820 pairs of parallel pages
  117.2M Chinese texts and 136.5M English texts

CLIR results: F-E

              F-E (TREC6)  F-E (TREC7)  E-F (TREC6)  E-F (TREC7)
monolingual   0.2865       0.3203       0.3686       0.2764
Dict.         0.1707       0.1701       0.2305       0.1352
Systran       0.3098       0.3293       0.2727       0.2327
Hansard PT    0.2166       0.3124       0.2501       0.2587
Web PT        0.2389       0.3146       0.2504       0.2289

Problems of using parallel corpora
Not strictly parallel (Web)
Coverage
In a different domain than the documents to be retrieved
Not applicable to minor languages

Summary
High-quality MT is still the best solution
Translation based on parallel texts can match MT
Dictionary:
  Simple utilisation is not good
  Complex approaches improve quality
The performance of CLIR/MLIR is usually lower than that of monolingual IR (between 50% and 90% of monolingual in general)

Wrap up
Develop better translation tools for IR (e.g. for special types of data such as personal names)
Integrate multiple translation results
Translate non-English languages
Integrate query translation and the retrieval process