CROSS-LINGUAL INFORMATION RETRIEVAL WITH EXPLICIT SEMANTIC ANALYSIS
Philipp Sorg and Philipp Cimiano
Working Notes of the Annual CLEF Meeting, 2008
Presented by Tiago Luís
Outline
- Cross-Language IR
- Explicit Semantic Analysis (ESA)
- Cross-Lingual ESA (CL-ESA)
- Implementation
- Evaluation
- Conclusions
Cross-Language IR (CLIR)
- retrieve documents in one language in response to a query in another language
- example: a user may pose a query in English but retrieve relevant documents written in French
- two approaches:
  - translation of documents or queries [Hull et al. and Demner-Fushman et al.]
  - mapping of queries and documents into a multilingual space
Cross-Language IR (CLIR)
- approaches to multilingual spaces:
  - latent model: compute latent concepts from data and map documents to these concepts
    - example: LSA (Latent Semantic Analysis) [Dumais et al. 1988]
  - external category model: map documents to a set of external categories, topics, or concepts
    - vectors remain constant across different document collections
    - example: ESA (Explicit Semantic Analysis) [Gabrilovich et al. 2007]
Explicit Semantic Analysis (ESA)
- Explicit Semantic Analysis [Gabrilovich et al. 2007] maps documents into a high-dimensional vector space
- Φ_k : T → ℝ^{|W_k|}, where Φ_k(t) = ⟨v_1, ..., v_{|W_k|}⟩
- |W_k| is the number of articles in the Wikipedia W_k corresponding to language L_k
- v_i expresses the strength of association between t and the Wikipedia article a_i
Explicit Semantic Analysis (cont.)
- the values v_i can be computed as the sum of the association strengths of all words of t = ⟨w_1, ..., w_s⟩ to the article a_i:
  v_i = Σ_{w_j ∈ t} as(w_j, a_i), where as(w_j, a_i) = tfidf(a_i, w_j), the tf-idf weight of word w_j in article a_i
Explicit Semantic Analysis (cont.)
[image from Philipp Sorg's slides]
Explicit Semantic Analysis (cont.)
- the top-10 ranked articles differ between languages
- this can be explained by differences in cultural background
Cross-Lingual ESA (CL-ESA)
- Wikipedia:
  - overwhelming amount of information
  - articles are linked across languages
  - 95% of the cross-lingual link structure between the German and English Wikipedias is bi-directional [Sorg et al. 2008]
- the authors assume the existence of a mapping function m_{i→j} that maps an article of Wikipedia W_i to its corresponding article in Wikipedia W_j
Cross-Lingual ESA (cont.)
- given n languages, there are n² mapping functions
- ψ_{i→j} : ℝ^{|W_i|} → ℝ^{|W_j|}, where ψ_{i→j}(⟨v_1, ..., v_{|W_i|}⟩) = ⟨v'_1, ..., v'_{|W_j|}⟩
- with v'_p = v_q for q ∈ {q* | m_{i→j}(a_{q*}) = a_p}, 1 ≤ p ≤ |W_j|, 1 ≤ q ≤ |W_i|
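The mapping ψ_{i→j} can be sketched as follows, assuming the alignment m_{i→j} is available as a plain dict from W_i article ids to W_j article ids; the names are illustrative, not the paper's code.

```python
def map_vector(vec_i, m_ij):
    """psi_{i->j}: transfer a sparse ESA vector from Wikipedia W_i to W_j.

    Each dimension a_q of W_i contributes its weight v_q to the
    cross-lingually aligned article a_p = m_ij[a_q] in W_j; dimensions
    without an aligned article are dropped.
    """
    vec_j = {}
    for a_q, v_q in vec_i.items():
        a_p = m_ij.get(a_q)              # aligned article in W_j, if any
        if a_p is not None:
            vec_j[a_p] = vec_j.get(a_p, 0.0) + v_q
    return vec_j
```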
Cross-Lingual ESA (cont.)
- the ESA representation of a document t in language L_i with respect to Wikipedia W_j is simply ψ_{i→j}(Φ_i(t))
- queries and documents can then be compared with the cosine similarity measure
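The comparison step above reduces to cosine similarity over sparse ESA vectors; a minimal sketch (dicts mapping dimension to weight, names illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: dim -> weight)."""
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```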
Cross-Lingual ESA (cont.)
[image from Philipp Sorg's slides]
Cross-Lingual ESA (cont.)
- top-ranked results for the query "Scary Movies"
Cross-Lingual ESA (cont.)
- the English vector and the mapped German vector have common non-zero dimensions
- however, the ranks of these dimensions differ considerably
Implementation
- preprocessing of documents:
  - tokenization
  - stop-word filtering
  - stemming
- ESA implementation:
  - Wikipedia article preprocessing:
    - discard articles with fewer than 100 words or fewer than 5 incoming pagelinks
    - restrict articles to those that have at least one language link to one of the two other languages considered
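The article-filtering rules above can be sketched as a simple predicate; the `Article` data class and its field names are assumptions about the data model, not the authors' actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    word_count: int
    incoming_links: int
    language_links: set = field(default_factory=set)  # e.g. {"de", "fr"}

def keep_article(a, other_langs=frozenset({"de", "fr"})):
    """Pruning rules: >= 100 words, >= 5 incoming pagelinks, and at least
    one language link to one of the two other languages considered."""
    return (a.word_count >= 100
            and a.incoming_links >= 5
            and bool(a.language_links & other_langs))
```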
Implementation (cont.)
- ESA implementation (cont.):
  - ESA vector computation:
    - the choice of the association strength function was motivated by its good performance on IR tasks
Implementation (cont.)
- ESA implementation (cont.):
  - multi-lingual mapping normalizations
    - example: replace cross-language redirect pages with the page to which the redirect leads
Evaluation
- datasets (parallel corpora):
  - JRC-Acquis:
    - consists of 21,000 legislative documents of the European Union
    - 3,000 documents were randomly selected as queries
  - Multext JOC Corpus:
    - written questions asked by members of the European Parliament
    - 3,100 aligned question/answer pairs in English, German, and French
    - only the English, German, and French documents were used
Evaluation (cont.)
- LSI/LDA:
  - Wikipedia as parallel corpus: linked articles are almost translations of each other
  - used as training corpus for latent topic extraction
- Cross-Lingual ESA:
  - pruning of concept vectors: only the m highest values are used
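The concept-vector pruning mentioned above can be sketched as keeping only the m highest-valued dimensions of a sparse vector (m being a tuning parameter; the function name is illustrative):

```python
def prune(vec, m):
    """Keep only the m highest-valued dimensions of a sparse ESA vector."""
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:m]
    return dict(top)
```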
Evaluation (cont.)
- methodology: mate retrieval evaluation
  - use documents in one language as queries to retrieve documents in another language
  - the only relevant document is the translated document, so no manual relevance assessment is needed
- Mean Reciprocal Rank (MRR): the average multiplicative inverse of the rank of the first correct answer
  MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/rank_i
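The MRR formula above is straightforward to compute; in the mate-retrieval setting, `ranks` would hold, for each query document, the (1-based) rank at which its translation appeared:

```python
def mean_reciprocal_rank(ranks):
    """MRR = (1/|Q|) * sum of 1/rank_i over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```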
Evaluation (cont.)
- results on the Multext dataset
Evaluation (cont.)
- results on the JRC-Acquis dataset
Conclusions
- presented a cross-lingual extension to the Explicit Semantic Analysis (ESA) approach
- unless LSI/LDA are trained on the document collection itself (instead of on a background collection, i.e., Wikipedia), ESA produces better results than LSI/LDA
- ESA is also computationally more efficient