CROSS-LINGUAL INFORMATION RETRIEVAL WITH EXPLICIT SEMANTIC ANALYSIS

Philipp Sorg and Philipp Cimiano
Working Notes of the Annual CLEF Meeting, 2008
Tiago Luís

Outline
- Cross-Language IR
- Explicit Semantic Analysis (ESA)
- Cross-Lingual ESA (CL-ESA)
- Implementation
- Evaluation
- Conclusions

Cross-Language IR (CLIR)
- retrieve documents in one language in response to a query in another language
- example: a user may write a query in English but retrieve relevant documents written in French
- two approaches:
  - translation of documents or queries [Hull et al. and Demner-Fushman et al.]
  - mapping of queries and documents into a multilingual space

Cross-Language IR (CLIR)
- approaches based on multilingual spaces:
  - latent model
    - compute latent concepts from data and map documents to these concepts
    - example: LSA (Latent Semantic Analysis) [Dumais et al. 1988]
  - external category model
    - map documents to a set of external categories, topics, or concepts
    - vectors remain constant across different document collections
    - example: ESA (Explicit Semantic Analysis) [Gabrilovich et al. 2007]

Explicit Semantic Analysis (ESA)
- Explicit Semantic Analysis [Gabrilovich et al. 2007] maps documents into a high-dimensional vector space:
  $\Phi_k : T \rightarrow \mathbb{R}^{|W_k|}$, where $\Phi_k(t) = \langle v_1, \ldots, v_{|W_k|} \rangle$
- $|W_k|$ is the number of articles in the Wikipedia $W_k$ corresponding to language $L_k$
- $v_i$ expresses the strength of association between $t$ and the Wikipedia article $a_i$

Explicit Semantic Analysis (cont.)
- the values $v_i$ can be computed as the sum of the association strengths of all words of $t = \langle w_1, \ldots, w_s \rangle$ to the article $a_i$:
  $v_i = \sum_{w_j \in t} as(w_j, a_i)$, where $as(w_j, a_i) = \mathit{tf.idf}_{a_i}(w_j)$
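
The ESA vector computation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `build_tfidf_index` and `esa_vector` are hypothetical, and the weight `tf * log(N / df)` is an assumed plain tf.idf variant; the exact association strength function used in the paper may differ.

```python
import math
from collections import Counter


def build_tfidf_index(articles):
    """Build an inverted index: word -> {article id: tf.idf weight}.

    `articles` maps an article id to its list of (preprocessed) tokens.
    Assumption: weight = tf * log(N / df), a plain tf.idf variant.
    """
    n_articles = len(articles)
    doc_freq = Counter()
    for tokens in articles.values():
        doc_freq.update(set(tokens))  # count each word once per article

    index = {}
    for art_id, tokens in articles.items():
        for word, tf in Counter(tokens).items():
            weight = tf * math.log(n_articles / doc_freq[word])
            index.setdefault(word, {})[art_id] = weight
    return index


def esa_vector(text_tokens, index):
    """Phi(t): for each article a_i, sum as(w_j, a_i) over all words w_j of t."""
    vec = {}
    for word in text_tokens:
        for art_id, weight in index.get(word, {}).items():
            vec[art_id] = vec.get(art_id, 0.0) + weight
    return vec
```

The vector is kept sparse (a dict keyed by article id), since most texts are associated with only a small fraction of all Wikipedia articles.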

Explicit Semantic Analysis (cont.)
[Figure: image retrieved from Philipp Sorg's slides]

Explicit Semantic Analysis (cont.)
- the top-10 ranked articles differ between languages
- this can be explained by differences in cultural background

Cross-Lingual ESA (CL-ESA)
- Wikipedia
  - overwhelming amount of information
  - articles are linked across languages
  - 95% of the cross-lingual links between the German and English Wikipedia are bidirectional [Sorg et al. 2008]
- the authors assume the existence of a mapping function $m_{i \rightarrow j}$ that maps an article of Wikipedia $W_i$ to its corresponding article in Wikipedia $W_j$

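Cross-Lingual ESA (cont.)
- given $n$ languages, there are $n^2$ mapping functions
  $\psi_{i \rightarrow j} : \mathbb{R}^{|W_i|} \rightarrow \mathbb{R}^{|W_j|}$
- where $\psi_{i \rightarrow j}(\langle v_1, \ldots, v_{|W_i|} \rangle) = \langle v'_1, \ldots, v'_{|W_j|} \rangle$
- with $v'_p = \sum_{q \in \{ q^* \mid m_{i \rightarrow j}(a_{q^*}) = a_p \}} v_q$, for $1 \le q \le |W_i|$ and $1 \le p \le |W_j|$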
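
A minimal Python sketch of the mapping function, assuming sparse vectors stored as dicts and the article mapping $m_{i \rightarrow j}$ given as a plain dict. The name `map_vector` is hypothetical, and dropping dimensions that have no cross-language link is an assumption of this sketch.

```python
def map_vector(vec, links):
    """psi_{i->j}: map an ESA vector from Wikipedia W_i into Wikipedia W_j.

    `vec` maps source article -> association value v_q.
    `links` plays the role of m_{i->j}: source article -> target article.
    Values of all source articles linked to the same target are summed.
    """
    mapped = {}
    for source_article, value in vec.items():
        target_article = links.get(source_article)
        if target_article is None:
            continue  # assumption: articles without a link are dropped
        mapped[target_article] = mapped.get(target_article, 0.0) + value
    return mapped
```

For example, two German articles that both link to the same English article contribute the sum of their values to that single English dimension.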

Cross-Lingual ESA (cont.)
- the ESA representation of a document $t$ in language $L_i$ with respect to Wikipedia $W_j$ is simply $\psi_{i \rightarrow j}(\Phi_i(t))$
- queries and documents can be compared with the cosine similarity measure
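
Cosine similarity over sparse concept vectors can be sketched as follows. The dict-based representation and the function name `cosine` are assumptions made for illustration, as is returning 0.0 for an all-zero vector.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {dimension: value}."""
    dot = sum(value * v[key] for key, value in u.items() if key in v)
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: similarity with an empty vector is 0
    return dot / (norm_u * norm_v)
```

In the cross-lingual setting, both vectors must first be expressed with respect to the same Wikipedia, so the query vector is mapped before the comparison.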

Cross-Lingual ESA (cont.)
[Figure: image retrieved from Philipp Sorg's slides]

Cross-Lingual ESA (cont.)
- top-ranked results for the query "Scary Movies"

Cross-Lingual ESA (cont.)
- the English vector and the mapped German vector have common non-zero dimensions
- however, the ranks of these dimensions differ a lot

Implementation
- preprocessing of documents
  - tokenization
  - stop-word filtering
  - stemming
- ESA implementation
  - Wikipedia article preprocessing
    - discard articles with fewer than 100 words or fewer than 5 incoming page links
    - restrict articles to those that have at least one language link to one of the two other languages considered
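
The article filter can be written as a single predicate. This is a sketch under assumed inputs: the function name `keep_article` and its parameters are hypothetical, and how word counts, incoming links, and language links are extracted from a Wikipedia dump is not covered.

```python
def keep_article(word_count, incoming_links, language_links, other_languages):
    """Return True if a Wikipedia article passes the filtering rules.

    word_count:      number of words in the article
    incoming_links:  number of incoming page links
    language_links:  set of language codes the article links to
    other_languages: set of the two other languages under consideration
    """
    if word_count < 100 or incoming_links < 5:
        return False  # too short or too weakly linked
    # Keep only articles linked to at least one of the other languages.
    return any(lang in language_links for lang in other_languages)
```

For an English article in an English/German/French setup, `other_languages` would be `{"de", "fr"}`.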

Implementation (cont.)
- ESA implementation (cont.)
  - ESA vector computation
    - the choice of the association strength function was motivated by its good performance on IR tasks

Implementation (cont.)
- ESA implementation (cont.)
  - multilingual mapping
    - normalizations
    - example: replace cross-language redirect pages with the page to which the redirect leads

Evaluation
- datasets (parallel corpora)
  - JRC-Acquis
    - consists of 21,000 legislative documents of the European Union
    - they randomly selected 3,000 documents as queries
  - Multext JOC Corpus
    - written questions asked by members of the European Parliament
    - 3,100 question/answer pairs in English, German, and French (aligned)
    - they used the English, German, and French documents only

Evaluation (cont.)
- LSI/LDA
  - Wikipedia as parallel corpus
    - linked articles are almost translations of each other
  - training corpus for latent topic extraction
- Cross-Lingual ESA
  - pruning of concept vectors: only the highest m values are used
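
Pruning a concept vector to its highest m values can be sketched as below, assuming the sparse dict representation; the name `prune` is hypothetical and ties are broken arbitrarily.

```python
import heapq


def prune(vec, m):
    """Keep only the m dimensions with the highest values; drop the rest."""
    top = heapq.nlargest(m, vec.items(), key=lambda item: item[1])
    return dict(top)
```

Pruning shortens the vectors that have to be compared, at the cost of discarding weak associations.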

Evaluation (cont.)
- methodology: mate retrieval evaluation
  - use documents in one language as queries to retrieve documents in another language
  - the only relevant document is the translated document
  - no manual relevance assessment is needed
- Mean Reciprocal Rank (MRR)
  - the reciprocal rank is the multiplicative inverse of the rank of the first correct answer; MRR is its mean over all queries in $Q$
  - $\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$
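
A minimal sketch of MRR under mate retrieval, where each query has exactly one relevant document. The name `mean_reciprocal_rank` and the input layout are assumptions; counting a query whose mate is not retrieved at all as 0 is also an assumption of this sketch.

```python
def mean_reciprocal_rank(rankings, mates):
    """Compute MRR = (1 / |Q|) * sum over queries of 1 / rank_i.

    rankings: query id -> list of retrieved document ids, best first
    mates:    query id -> id of the single relevant (translated) document
    """
    total = 0.0
    for query_id, ranked_docs in rankings.items():
        mate = mates[query_id]
        if mate in ranked_docs:
            rank = ranked_docs.index(mate) + 1  # ranks start at 1
            total += 1.0 / rank
        # assumption: a mate that is not retrieved contributes 0
    return total / len(rankings)
```

With three queries whose mates appear at rank 1, at rank 2, and not at all, the reciprocal ranks are 1, 1/2, and 0, so MRR = 0.5.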

Evaluation (cont.)
[Figure: Multext dataset]

Evaluation (cont.)
[Figure: JRC-Acquis dataset]

Conclusions
- presented a cross-lingual extension of the Explicit Semantic Analysis (ESA) approach
- unless LSI/LDA are trained on the document collection itself (instead of on a background collection, i.e., Wikipedia), ESA produces better results than LSI/LDA
- ESA is also computationally more efficient