CROSS-LINGUAL INFORMATION RETRIEVAL WITH EXPLICIT SEMANTIC ANALYSIS
Philipp Sorg and Philipp Cimiano
Working Notes of the Annual CLEF Meeting, 2008
Presented by Tiago Luís
Outline
- Cross-Language IR
- Explicit Semantic Analysis (ESA)
- Cross-Lingual ESA (CL-ESA)
- Implementation
- Evaluation
- Conclusions
Cross-Language IR (CLIR)
- retrieve documents in one language in response to a query in another language
- example: a user may pose a query in English but retrieve relevant documents written in French
- two approaches:
  - translation of documents or queries [Hull et al. and Demner-Fushman et al.]
  - mapping of queries and documents into a multilingual space
Cross-Language IR (CLIR)
- approaches to multilingual spaces:
  - latent model: compute latent concepts from data and map documents to these concepts
    - example: LSA (Latent Semantic Analysis) [Dumais et al. 1988]
  - external category model: map documents to a set of external categories, topics, or concepts
    - vectors remain constant across different document collections
    - example: ESA (Explicit Semantic Analysis) [Gabrilovich et al. 2007]
Explicit Semantic Analysis (ESA)
- Explicit Semantic Analysis [Gabrilovich et al. 2007] maps documents into a high-dimensional vector space
- Φ_k : T → ℝ^{|W_k|}, where Φ_k(t) = ⟨v_1, ..., v_{|W_k|}⟩
- |W_k| is the number of articles in the Wikipedia W_k corresponding to language L_k
- v_i expresses the strength of association between t and the Wikipedia article a_i
Explicit Semantic Analysis (cont.)
- the values v_i can be computed as the sum of the association strengths of all words of t = ⟨w_1, ..., w_s⟩ to the article a_i:
  v_i = Σ_{w_j ∈ t} as(w_j, a_i), where as(w_j, a_i) = tfidf(a_i, w_j), the tf-idf weight of word w_j in article a_i
Explicit Semantic Analysis (cont.)
[image from Philipp Sorg's slides]
Explicit Semantic Analysis (cont.)
- the top-10 ranked articles differ between languages
- this can be explained by differences in cultural background
Cross-Lingual ESA (CL-ESA)
- Wikipedia:
  - overwhelming amount of information
  - articles are linked across languages
  - 95% of the cross-lingual link structure between the German and English Wikipedias is bi-directional [Sorg et al. 2008]
- the authors assume the existence of a mapping function m_{i→j} that maps an article of Wikipedia W_i to its corresponding article in Wikipedia W_j
Cross-Lingual ESA (cont.)
- given n languages, there are n² mapping functions
- ψ_{i→j} : ℝ^{|W_i|} → ℝ^{|W_j|}, where ψ_{i→j}(⟨v_1, ..., v_{|W_i|}⟩) = ⟨v'_1, ..., v'_{|W_j|}⟩
- with v'_p = v_q for q ∈ {q* | m_{i→j}(a_{q*}) = a_p}, 1 ≤ p ≤ |W_j|, 1 ≤ q ≤ |W_i|
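The mapping ψ_{i→j} can be sketched as follows, assuming the alignment m_{i→j} is available as a plain dict from W_i article ids to W_j article ids; the names are illustrative, not the paper's code.

```python
def map_vector(vec_i, m_ij):
    """psi_{i->j}: transfer a sparse ESA vector from Wikipedia W_i to W_j.

    Each dimension a_q of W_i contributes its weight v_q to the
    cross-lingually aligned article a_p = m_ij[a_q] in W_j; dimensions
    without an aligned article are dropped.
    """
    vec_j = {}
    for a_q, v_q in vec_i.items():
        a_p = m_ij.get(a_q)              # aligned article in W_j, if any
        if a_p is not None:
            vec_j[a_p] = vec_j.get(a_p, 0.0) + v_q
    return vec_j
```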
Cross-Lingual ESA (cont.)
- the ESA representation of a document t in language L_i with respect to Wikipedia W_j is simply ψ_{i→j}(Φ_i(t))
- queries and documents can then be compared with the cosine similarity measure
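The comparison step above reduces to cosine similarity over sparse ESA vectors; a minimal sketch (dicts mapping dimension to weight, names illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: dim -> weight)."""
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```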
Cross-Lingual ESA (cont.)
[image from Philipp Sorg's slides]
Cross-Lingual ESA (cont.)
- top-ranked results for the query "Scary Movies"
Cross-Lingual ESA (cont.)
- the English vector and the mapped German vector have common non-zero dimensions
- however, the ranks of these dimensions differ considerably
Implementation
- preprocessing of documents:
  - tokenization
  - stop-word filtering
  - stemming
- ESA implementation:
  - Wikipedia article preprocessing:
    - discard articles with fewer than 100 words or fewer than 5 incoming pagelinks
    - restrict articles to those that have at least one language link to one of the two other languages considered
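The article-filtering rules above can be sketched as a simple predicate; the `Article` data class and its field names are assumptions about the data model, not the authors' actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    word_count: int
    incoming_links: int
    language_links: set = field(default_factory=set)  # e.g. {"de", "fr"}

def keep_article(a, other_langs=frozenset({"de", "fr"})):
    """Pruning rules: >= 100 words, >= 5 incoming pagelinks, and at least
    one language link to one of the two other languages considered."""
    return (a.word_count >= 100
            and a.incoming_links >= 5
            and bool(a.language_links & other_langs))
```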
Implementation (cont.)
- ESA implementation (cont.):
  - ESA vector computation:
    - the choice of the association strength function was motivated by its good performance on IR tasks
Implementation (cont.)
- ESA implementation (cont.):
  - multi-lingual mapping normalizations
    - example: replace cross-language redirect pages with the page to which the redirect leads
Evaluation
- datasets (parallel corpora):
  - JRC-Acquis:
    - consists of 21,000 legislative documents of the European Union
    - 3,000 documents were randomly selected as queries
  - Multext JOC Corpus:
    - written questions asked by members of the European Parliament
    - 3,100 aligned question/answer pairs in English, German, and French
    - only the English, German, and French documents were used
Evaluation (cont.)
- LSI/LDA:
  - Wikipedia as parallel corpus: linked articles are almost translations of each other
  - used as training corpus for latent topic extraction
- Cross-Lingual ESA:
  - pruning of concept vectors: only the m highest values are used
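The concept-vector pruning mentioned above can be sketched as keeping only the m highest-valued dimensions of a sparse vector (m being a tuning parameter; the function name is illustrative):

```python
def prune(vec, m):
    """Keep only the m highest-valued dimensions of a sparse ESA vector."""
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:m]
    return dict(top)
```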
Evaluation (cont.)
- methodology: mate retrieval evaluation
  - use documents in one language as queries to retrieve documents in another language
  - the only relevant document is the translated document, so no manual relevance assessment is needed
- Mean Reciprocal Rank (MRR): the average multiplicative inverse of the rank of the first correct answer
  MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/rank_i
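The MRR formula above is straightforward to compute; in the mate-retrieval setting, `ranks` would hold, for each query document, the (1-based) rank at which its translation appeared:

```python
def mean_reciprocal_rank(ranks):
    """MRR = (1/|Q|) * sum of 1/rank_i over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```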
Evaluation (cont.)
- results on the Multext dataset
Evaluation (cont.)
- results on the JRC-Acquis dataset
Conclusions
- presented a cross-lingual extension to the Explicit Semantic Analysis (ESA) approach
- unless LSI/LDA are trained on the document collection itself (instead of on a background collection, i.e., Wikipedia), ESA produces better results than LSI/LDA
- ESA is also computationally more efficient