CROSS-LANGUAGE INFORMATION RETRIEVAL USING PARAFAC2
Peter A. Chew, Brett W. Bader, Ahmed Abdelali
Proceedings of the 13th ACM SIGKDD, 2007
Presented by Tiago Luís
Outline
- Cross-Language IR (CLIR)
- Latent Semantic Analysis (LSA)
- CLIR using LSA
- Results of CLIR using LSA
- CLIR using PARAFAC2
- Results of CLIR using PARAFAC2
- Conclusions
Cross-Language IR (CLIR)
- retrieve documents in one language in response to a query in another language
  - example: a user may issue a query in English but retrieve relevant documents written in French
- two approaches:
  - translation of documents or queries [Hull et al.; Demner-Fushman et al.]
  - mapping of queries and documents into a multilingual space
Cross-Language IR (CLIR)
- multilingual space approaches:
  - latent model: compute latent concepts from the data and map documents onto these concepts
    - example: LSA (Latent Semantic Analysis) [Dumais et al. 1988]
  - external category model: map documents onto a set of external categories, topics, or concepts
    - vectors remain constant across different document collections
    - example: ESA (Explicit Semantic Analysis) [Gabrilovich et al. 2007]
Latent Semantic Analysis
- analyzes relationships between documents and terms
- produces a set of latent concepts related to the documents and terms
- merges dimensions (terms) that have similar meanings
- latent concepts correspond to topics that emerge from the document collection
Latent Semantic Analysis (cont.)
- SVD (Singular Value Decomposition) performs the factorization of the term-by-document matrix: X = U S V^T
  - U: term-by-concept matrix
  - S: diagonal matrix of singular values (the strength of each concept)
  - V: document-by-concept matrix
(figure: SVD factorization diagram, from Wikipedia)
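The factorization above can be sketched with NumPy; the tiny term-by-document matrix below is invented purely for illustration (two topics, two terms each):

```python
import numpy as np

# Toy term-by-document matrix (rows: "data", "retrieval", "brain", "lung");
# the counts are invented and chosen so the matrix has rank 2
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Truncated SVD: keep R latent concepts
R = 2
U_full, s, Vt = np.linalg.svd(X, full_matrices=False)
U, S, V = U_full[:, :R], np.diag(s[:R]), Vt[:R].T

# U: term-by-concept, S: concept strengths, V: document-by-concept
X_approx = U @ S @ V.T
print(np.allclose(X, X_approx))  # True: the rank-2 factorization is exact here
```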
Latent Semantic Analysis (cont.)
- example with two concepts: computer science (CS) and medicine (MD)
(figure: term-by-document matrix over the terms "data", "inf. retrieval", "brain", "lung" factorized into CS and MD concepts; from Jure Leskovec's recitation)
Latent Semantic Analysis (cont.)
- querying:
  - map the query into the semantic space: q_concept = q U S^-1
  - calculate the cosine similarity between the query and the documents
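A minimal sketch of the query step, reusing the same kind of toy matrix (all counts invented for illustration):

```python
import numpy as np

# Toy term-by-document matrix (rows: "data", "retrieval", "brain", "lung")
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0, 2.0],
])
R = 2
U_full, s, Vt = np.linalg.svd(X, full_matrices=False)
U, S, V = U_full[:, :R], np.diag(s[:R]), Vt[:R].T

# Map a query (term-frequency vector) into the concept space: q_concept = q U S^-1
q = np.array([1.0, 1.0, 0.0, 0.0])          # query mentions "data" and "retrieval"
q_concept = q @ U @ np.linalg.inv(S)

# Rank documents by cosine similarity in the concept space
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(q_concept, d) for d in V]   # rows of V are document-by-concept vectors
best = int(np.argmax(scores))                # one of the "data/retrieval" documents
```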
CLIR using LSA
- trained with a multilingual parallel aligned corpus
  - examples: Europarl [Philipp Koehn 2005], JRC-Acquis [Steinberger et al. 2006], etc.
  - English: "that is almost a personal record for me this autumn!"
  - Portuguese: "é quase o meu recorde pessoal deste semestre!"
- each document consists of the concatenation of all its language versions [Paul G. Young 1994]
  - terms from all languages appear in any given document
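The training setup above can be sketched in a few lines: each aligned pair becomes one pseudo-document containing both language versions, so the resulting term-by-document matrix mixes terms from all languages in every column. The two-sentence parallel corpus is invented for illustration:

```python
import numpy as np

# Tiny invented English-Portuguese parallel corpus: each aligned pair
# becomes ONE pseudo-document containing both versions concatenated
parallel = [
    ("that is almost a personal record", "é quase o meu recorde pessoal"),
    ("a personal question", "uma questão pessoal"),
]

docs = [(en + " " + pt).split() for en, pt in parallel]
vocab = sorted({w for d in docs for w in d})   # one vocabulary spanning BOTH languages
index = {w: i for i, w in enumerate(vocab)}

# Term-by-document count matrix: terms from all languages appear in any column
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        X[index[w], j] += 1
```

Running the SVD on a matrix built this way is what lets English and Portuguese terms share latent concepts.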
CLIR using LSA (cont.)
- example with two concepts: computer science (CS) and medicine (MD)
(figure: multilingual term-by-document matrix over the terms "data", "inf. retrieval", "brain", "lung", "informacion", "datos"; from Jure Leskovec's recitation)
Results of CLIR using LSA
- used the Bible as a parallel corpus
  - the world's most widely translated book: 2,426 partial translations, 429 full translations
- how representative is the Bible's vocabulary of modern vocabulary?
  - coverage of modern vocabulary is around 70% (according to their experiments)
Results of CLIR using LSA (cont.)
- parallel corpus (the Bible) with 77 translations
  - term-by-document matrix was 1,454,289 x 31,226 (extremely sparse)
  - number of concepts (dimensions) equal to 280
- test data: 114 chapters of the Quran
  - selected languages: Arabic, English, French, Russian, and Spanish (total of 570 documents)
Results of CLIR using LSA (cont.)
- each document was divided into words, and their frequencies were weighted
- each vector was multiplied by U S^-1 (projected into a 300-dimensional LSA space)
- evaluation measures:
  - precision at 1 document (for a given source and target language)
  - multilingual precision at 5 documents (for 5 languages)
Results of CLIR using LSA (cont.)
- precision at 1 document: proportion of cases where the translation was retrieved first
- multilingual precision at 5 documents: proportion of the top 5 retrieved results that are translations of the query
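The two measures above can be sketched directly; the ranked result list below is invented (documents are (chapter_id, language) pairs, and translations of the query share its chapter id):

```python
# Sketch of the two evaluation measures, with an invented retrieval ranking.

def precision_at_1(ranked, query_id):
    """1.0 if the top-ranked document is a translation of the query, else 0.0."""
    return 1.0 if ranked[0][0] == query_id else 0.0

def multilingual_precision_at_5(ranked, query_id):
    """Fraction of the top 5 results that are translations of the query."""
    return sum(1 for doc_id, lang in ranked[:5] if doc_id == query_id) / 5.0

# Invented ranking for a query on chapter 7, collection with 5 languages
ranked = [(7, "en"), (7, "fr"), (3, "es"), (7, "ru"), (9, "ar"), (7, "es")]
print(precision_at_1(ranked, 7))               # 1.0
print(multilingual_precision_at_5(ranked, 7))  # 3 of the top 5 -> 0.6
```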
Results of CLIR using LSA (cont.)
(table: precision at 1 document with LSA, 280 dimensions; average: 0.780)
Results of CLIR using LSA (cont.)
- good ability to identify translations: the translation is retrieved first almost 80% of the time
- low multilingual precision: documents cluster by language, not by topic
Results of CLIR using LSA (cont.)
(table illustrating the low multilingual precision and the statistical differences between languages)
Results of CLIR using LSA (cont.)
- advantages of LSA in CLIR:
  - relies only on the ability to tokenize text at the boundaries between words
- limitations of LSA in CLIR:
  - cannot distinguish homographs across languages (example: English "coin" versus French "coin", meaning "corner")
  - clustering documents language-independently
    - goal: group documents about similar topics
    - problem: documents are clustered by language, not by topic
CLIR using PARAFAC2
- tries to overcome LSA's problem of being unable to make associations between words in different languages
- PARAFAC2 is a variant of PARAFAC
  - PARAFAC is a multi-way generalization of the SVD [Richard A. Harshman 1970]
- imposes a constraint (not present in LSA): the concepts in all documents in the parallel corpus are the same, regardless of language
CLIR using PARAFAC2 (cont.)
- form an irregular three-way array
  - each slice is a separate term-by-document matrix for a single language in the parallel corpus
CLIR using PARAFAC2 (cont.)
- X_k = U_k H S_k V^T
  - X_k: an M_k x N matrix (k denotes the kth slice)
  - U_k: an M_k x R matrix (R is the number of dimensions of the LSA space)
  - H: an R x R matrix
  - S_k: an R x R diagonal matrix of weights for the kth slice of X
  - V: an N x R factor matrix for the documents
- separate mapping for each language
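The slice structure above can be sketched with NumPy. This is not a fitting algorithm; the factors are random and the sizes invented, just to show that each language slice X_k has its own vocabulary size M_k while H and V are shared:

```python
import numpy as np

rng = np.random.default_rng(0)
R, N = 2, 4                      # R concepts, N aligned documents
M = [5, 3]                       # vocabulary size differs per language (slice)

# Factors shared across all languages: H (R x R) and document factors V (N x R)
H = rng.standard_normal((R, R))
V = rng.standard_normal((N, R))

# Per-language factors: U_k (M_k x R, orthonormal columns) and diagonal S_k
X = []
for Mk in M:
    Uk, _ = np.linalg.qr(rng.standard_normal((Mk, R)))   # orthonormal columns
    Sk = np.diag(rng.uniform(1.0, 2.0, R))
    X.append(Uk @ H @ Sk @ V.T)                          # X_k = U_k H S_k V^T

for k, Xk in enumerate(X):
    print(k, Xk.shape)           # each slice is (M_k, N)
```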
CLIR using PARAFAC2 (cont.)
- querying:
  - map the query into the semantic space: q_concept = q U_k S_k^-1
  - the vector is multiplied by the U_k S_k^-1 specific to the language of the query, rather than a general U S^-1 for all languages
  - calculate the cosine similarity between the query and the documents
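A sketch of the language-specific projection, with random factors and invented vocabulary sizes: the query uses the U_k and S_k of its own language, but all documents live in the shared space V, so cross-language comparison is direct:

```python
import numpy as np

rng = np.random.default_rng(1)
R, N = 2, 4
vocab_sizes = {"en": 5, "fr": 3}          # invented per-language vocabulary sizes

# Shared factors and per-language factors (random stand-ins for a fitted model)
H = rng.standard_normal((R, R))
V = rng.standard_normal((N, R))
U, S = {}, {}
for lang, Mk in vocab_sizes.items():
    Q, _ = np.linalg.qr(rng.standard_normal((Mk, R)))
    U[lang], S[lang] = Q, np.diag(rng.uniform(1.0, 2.0, R))

# Project a query with the factors of ITS OWN language: q_concept = q U_k S_k^-1
q = np.zeros(vocab_sizes["fr"]); q[0] = 1.0          # toy French query vector
q_concept = q @ U["fr"] @ np.linalg.inv(S["fr"])

# Documents live in the shared space V, so one cosine ranking covers all languages
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(q_concept, d) for d in V]
print(len(scores))               # one score per document, whatever the query language
```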
Results of CLIR using PARAFAC2
- multilingual precision metrics: PARAFAC2 outperforms LSA by a significant margin
(table: average multilingual precision of 0.866 for PARAFAC2 versus 0.760 for LSA)
Results of CLIR using PARAFAC2 (cont.)
- clustering precision metrics: PARAFAC2 also outperforms LSA
Results of CLIR using PARAFAC2 (cont.)
- disadvantages:
  - PARAFAC2 needs more computation to obtain the matrix decomposition
- advantages:
  - the language-specific U_k matrices are smaller than the general one, so matrix multiplication is faster
  - can deal with homographs
Conclusions
- PARAFAC2 is a highly compelling technique
  - a promising way toward truly language-independent clustering of documents by topic
- PARAFAC2 is also a good technique for other problems in CLIR (not only multilingual document clustering)