CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2

Peter A. Chew, Brett W. Bader, Ahmed Abdelali
Proceedings of the 13th SIGKDD, 2007
Presented by Tiago Luís

Outline
- Cross-Language IR (CLIR)
- Latent Semantic Analysis (LSA)
- CLIR using LSA
- Results of CLIR using LSA
- CLIR using PARAFAC2
- Results of CLIR using PARAFAC2
- Conclusions

Cross-Language IR (CLIR)
- Retrieve documents in one language in response to a query in another language.
- Example: a user creates a query in English but retrieves relevant documents written in French.
- Two approaches:
  - translation of documents or queries [Hull et al.; Demner-Fushman et al.]
  - mapping of queries and documents into a multilingual space

Cross-Language IR (CLIR)
Approaches to multilingual spaces:
- Latent model: compute latent concepts from the data and map documents onto these concepts. Example: LSA (Latent Semantic Analysis) [Dumais et al. 1988].
- External category model: map documents onto a set of external categories, topics, or concepts; these vectors remain constant across different document collections. Example: ESA (Explicit Semantic Analysis) [Gabrilovich et al. 2007].

Latent Semantic Analysis
- Analyzes relationships between documents and terms.
- Produces a set of latent concepts related to the documents and terms by merging the dimensions (terms) that have similar meanings.
- Latent concepts correspond to topics that emerge from the document collection.

Latent Semantic Analysis (cont.)
SVD (Singular Value Decomposition) performs the factorization of the term-by-document matrix:
X = U S V^t
- U: term-by-concept matrix
- S: diagonal matrix of singular values (the "strength" of each concept)
- V: document-by-concept matrix
(figure: SVD factorization diagram, retrieved from Wikipedia)
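
To make the factorization concrete, here is a minimal NumPy sketch (not from the paper; the five terms, four documents, and counts are invented) that factors a toy term-by-document matrix and keeps the top two concepts:

```python
# Minimal LSA sketch: factor a toy term-by-document matrix X as X = U S V^t
# and keep the top-R latent concepts. Counts are invented for illustration.
import numpy as np

terms = ["data", "information", "retrieval", "brain", "lung"]
X = np.array([
    [2, 3, 0, 0],   # data        (docs 0-1 are about computer science)
    [1, 2, 0, 0],   # information
    [3, 1, 0, 0],   # retrieval
    [0, 0, 2, 3],   # brain       (docs 2-3 are about medicine)
    [0, 0, 3, 1],   # lung
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

R = 2                                   # number of latent concepts to keep
U_r, S_r, V_r = U[:, :R], np.diag(s[:R]), Vt[:R].T

print("term-by-concept (U):\n", np.round(U_r, 2))
print("singular values (S):", np.round(s[:R], 2))
print("document-by-concept (V):\n", np.round(V_r, 2))
```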

Latent Semantic Analysis (cont.)
Example with two concepts (computer science and medicine): a term-by-document matrix over the terms data, inf. retrieval, brain, and lung is factored into a CS concept and an MD concept.
(figure: worked SVD example, retrieved from Jure Leskovec's recitation)

Latent Semantic Analysis (cont.)
Querying:
- Map the query into the semantic space: q_concept = q U S^-1
- Calculate the cosine similarity between the query and the documents.
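
A sketch of the querying step, reusing the same toy matrix as the previous sketch (again invented, not the paper's data): the query's term-frequency vector is projected with q U S^-1 and compared with the document concept vectors by cosine similarity.

```python
# LSA querying sketch: project a bag-of-words query into the concept space
# with q_concept = q U S^{-1}, then rank documents by cosine similarity.
import numpy as np

# toy term-by-document matrix (same terms and documents as the previous sketch)
X = np.array([[2, 3, 0, 0], [1, 2, 0, 0], [3, 1, 0, 0],
              [0, 0, 2, 3], [0, 0, 3, 1]], dtype=float)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
R = 2
U_r, S_r, V_r = U[:, :R], np.diag(s[:R]), Vt[:R].T

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

q = np.array([1, 0, 1, 0, 0], dtype=float)     # query containing "data" and "retrieval"
q_concept = q @ U_r @ np.linalg.inv(S_r)       # q U S^{-1}

scores = [cosine(q_concept, d) for d in V_r]   # one document-concept row per document
print("ranking:", np.argsort(scores)[::-1])    # the CS documents should rank first
```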

CLIR using LSA
- Trained with a multilingual parallel aligned corpus, e.g. Europarl [Philipp Koehn 2005], JRC-Acquis [Steinberger et al. 2006], etc.
  English: "that is almost a personal record for me this autumn!"
  Portuguese: "é quase o meu recorde pessoal deste semestre!"
- Each training document consists of the concatenation of all the languages [Paul G. Young 1994], so terms from all languages appear in any given document (see the sketch below).
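
A minimal sketch of this training setup, with an invented two-pair English/Portuguese corpus: each aligned pair of translations is concatenated into one pseudo-document before the term-by-document matrix is built, so terms of both languages share the same columns.

```python
# Sketch of the concatenated-document training setup for CLIR with LSA.
# Each column mixes the terms of every language of one aligned document.
# The two aligned pairs below are invented for illustration.
from collections import Counter
import numpy as np

parallel_corpus = [
    {"en": "that is almost a personal record for me this autumn",
     "pt": "é quase o meu recorde pessoal deste semestre"},
    {"en": "the committee approved the report",
     "pt": "a comissão aprovou o relatório"},
]

docs = [" ".join(pair.values()) for pair in parallel_corpus]   # concatenate the languages
counts = [Counter(doc.lower().split()) for doc in docs]

vocab = sorted(set().union(*counts))                           # terms of all languages
X = np.array([[c[t] for c in counts] for t in vocab], dtype=float)  # term-by-document
print(X.shape)   # (multilingual vocabulary size, number of aligned documents)
```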

CLIR using LSA (cont.)
Example with two concepts (computer science and medicine): the multilingual term-by-document matrix now mixes terms from several languages (data, inf. retrieval, brain, lung, información, datos) under the same CS and MD concepts.
(figure: worked multilingual example, retrieved from Jure Leskovec's recitation)

Results of CLIR using LSA
- Used the Bible as a parallel corpus: the world's most widely translated book (2,426 partial translations, 429 full translations).
- How representative is the vocabulary of the Bible of modern vocabulary? Coverage is around 70% (according to their experiments).

Results of CLIR using LSA (cont.)
- Parallel corpus (the Bible) with 77 translations; the term-by-document matrix was 1,454,289 x 31,226 and extremely sparse.
- Number of concepts (dimensions) set to 280.
- Test data: the 114 chapters of the Quran in selected languages: Arabic, English, French, Russian, and Spanish (a total of 570 documents).
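
For a matrix of that size a dense SVD is impractical, so a truncated sparse SVD is the usual route. The sketch below (not from the paper) uses scipy's svds on a random sparse stand-in matrix to extract 280 singular triplets:

```python
# Truncated SVD on a large, very sparse term-by-document matrix, as needed
# for a 1,454,289 x 31,226 matrix; the random matrix is only a stand-in.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

X = sp.random(20000, 3000, density=0.001, format="csc", random_state=0)

R = 280                              # number of concepts (dimensions)
U, s, Vt = svds(X, k=R)              # the R largest singular triplets
order = np.argsort(s)[::-1]          # svds returns singular values in ascending order
U, s, Vt = U[:, order], s[order], Vt[order]
print(U.shape, s.shape, Vt.shape)    # (terms, R), (R,), (R, documents)
```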

Results of CLIR using LSA (cont.)
- Each test document was divided into words and their frequencies were weighted; the resulting vector was multiplied by U S^-1 (i.e., projected into a 300-dimensional LSA space).
- Evaluation measures:
  - precision at 1 document (for a given source and target language)
  - multilingual precision at 5 documents (for 5 languages)

Results of CLIR using LSA (cont.)
- Precision at 1 document: proportion of cases where the translation of the query was retrieved first.
- Multilingual precision at 5 documents: proportion of the top 5 retrieved results that are translations of the query.
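
A simplified sketch of the two measures; the data layout (a chapter id per document, a ranked result list per query) is assumed for illustration and is not taken from the paper.

```python
# Simplified evaluation sketch: a retrieved document counts as correct if it
# is a translation of the query, i.e. it has the same chapter id.
def precision_at_1(query_ids, ranked_results, chapter_of):
    """Proportion of queries whose top-ranked document is a translation."""
    hits = sum(chapter_of[ranked[0]] == chapter_of[q]
               for q, ranked in zip(query_ids, ranked_results))
    return hits / len(query_ids)

def multilingual_precision_at_5(query_ids, ranked_results, chapter_of):
    """Average proportion of the top 5 results that are translations of the
    query; with 5 languages, all 5 slots can in principle be correct."""
    total = 0.0
    for q, ranked in zip(query_ids, ranked_results):
        total += sum(chapter_of[d] == chapter_of[q] for d in ranked[:5]) / 5
    return total / len(query_ids)

# toy usage: documents 0 and 1 are chapter 0, document 2 is chapter 1
chapter_of = {0: 0, 1: 0, 2: 1}
print(precision_at_1([0], [[1, 2, 0]], chapter_of))   # 1.0: doc 1 is a translation of doc 0
```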

Results of CLIR using LSA (cont.)
Precision at 1 document with LSA (280 dimensions): average 0.780.

Results of CLIR using LSA (cont.)
- Good ability to identify translations: the translation is retrieved first almost 80% of the time.
- Low multilingual precision: documents cluster by language and not by topic.

Results of CLIR using LSA (cont.)
- Table illustrating the low multilingual precision (not reproduced here).
- Statistical differences between languages.

Results of CLIR using LSA (cont.)
Advantages of LSA in CLIR:
- Relies only on the ability to tokenize text at the boundaries between words.
Limitations of LSA in CLIR:
- Cannot distinguish homographs across different languages. Example: English "coin" versus French "coin" (corner).
- Clustering documents language-independently: the goal is to group documents about similar topics, but documents end up clustered by language, not by topic.

CLIR using PARAFAC2
- Tries to overcome LSA's problem of being unable to make associations between words in different languages.
- PARAFAC2 is a variant of PARAFAC; PARAFAC is a multi-way generalization of the SVD [Richard A. Harshman 1970].
- Imposes a constraint (not present in LSA): the concepts are the same in all documents of the parallel corpus, regardless of language.

CLIR using PARAFAC2 (cont.)
Form an irregular three-way array: each slice is a separate term-by-document matrix for a single language in the parallel corpus (see the sketch below).
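
A minimal sketch of how such an irregular array might be assembled (the two-language toy corpus is invented): each language contributes a slice with its own vocabulary, and therefore its own number of rows, but the same aligned document columns.

```python
# Build the irregular three-way array for PARAFAC2: one term-by-document
# slice per language, with language-specific rows but shared, aligned columns.
from collections import Counter
import numpy as np

parallel_corpus = {
    "en": ["that is almost a personal record for me this autumn",
           "the committee approved the report"],
    "pt": ["é quase o meu recorde pessoal deste semestre",
           "a comissão aprovou o relatório"],
}

slices = {}
for lang, docs in parallel_corpus.items():
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    # M_k x N matrix: this language's terms by the shared aligned documents
    slices[lang] = np.array([[c[t] for c in counts] for t in vocab], dtype=float)

for lang, Xk in slices.items():
    print(lang, Xk.shape)    # different row counts, same number of columns
```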

CLIR using PARAFAC2 (cont.)
X_k = U_k H S_k V^t
- X_k: the M_k x N term-by-document matrix for the kth slice (language)
- U_k: an M_k x R matrix (R is the number of dimensions of the LSA space)
- H: an R x R matrix
- S_k: an R x R diagonal matrix of weights for the kth slice of X
- V: an N x R factor matrix for the documents
This gives a separate mapping for each language.

CLIR using PARAFAC2 (cont.)
Querying:
- Map the query into the semantic space: q_concept = q U_k S_k^-1. The vector is multiplied by the U_k S_k^-1 specific to the language of the query, rather than a general U S^-1 for all languages.
- Calculate the cosine similarity between the query and the documents.
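
A sketch of this language-specific projection, assuming the factor matrices U_k and S_k (per language) and V (shared across languages) have already been obtained from a PARAFAC2 fit; the random matrices below only stand in for real factors.

```python
# PARAFAC2 querying sketch: q_concept = q U_k S_k^{-1}, using the factors of
# the query's language, then cosine similarity against the shared document
# factors V. The random factors below stand in for a real PARAFAC2 fit.
import numpy as np

rng = np.random.default_rng(0)
R, n_docs = 4, 10
factors = {                      # hypothetical per-language factors
    "en": {"U": rng.random((50, R)), "S": np.diag(rng.random(R) + 0.5)},
    "fr": {"U": rng.random((60, R)), "S": np.diag(rng.random(R) + 0.5)},
}
V = rng.random((n_docs, R))      # shared document-by-concept matrix

def project_query(q, lang):
    f = factors[lang]
    return q @ f["U"] @ np.linalg.inv(f["S"])   # q U_k S_k^{-1}

q_en = rng.integers(0, 2, size=50).astype(float)   # toy English query vector
q_concept = project_query(q_en, "en")

# rank all documents, whatever their language, by cosine similarity
scores = V @ q_concept / (np.linalg.norm(V, axis=1) * np.linalg.norm(q_concept) + 1e-12)
print(np.argsort(scores)[::-1])
```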

Results of CLIR using PARAFAC2
Multilingual precision metrics: PARAFAC2 outperforms LSA by a significant margin (average 0.866 for PARAFAC2 versus 0.760 for LSA).

Results of CLIR using PARAFAC2 (cont.)
Clustering precision metrics: PARAFAC2 also outperforms LSA.

Results of CLIR using PARAFAC2 (cont.)
Disadvantages:
- PARAFAC2 needs more computation to obtain the matrix decomposition.
Advantages:
- The language-specific U_k matrices are smaller than the general one, so the matrix multiplication is faster.
- Can deal with homographs.

Conclusions
- PARAFAC2 is a highly compelling technique: a promising way toward truly language-independent clustering of documents by topic.
- PARAFAC2 is also a good technique for other problems in CLIR, not only for multilingual document clustering.