A comparison between Latent Semantic Analysis and Correspondence Analysis

A comparison between Latent Semantic Analysis and Correspondence Analysis. Julie Séguéla, Gilbert Saporta (CNAM, Cedric Lab / Multiposting.fr). February 9th 2011, CARME.

Outline: 1. Introduction. 2. Latent Semantic Analysis (Presentation, Method). 3. Application in a real context (Presentation, Methodology, Results and comparisons). 4. Conclusion.

1. Introduction

Context: text representation for the categorization task.

Objectives: comparison of several text representation techniques through theory and application; in particular, comparison between a statistical technique, Correspondence Analysis (CA), and an information retrieval (IR) oriented method, Latent Semantic Analysis (LSA). Is there an optimal technique for performing document clustering?

2. Latent Semantic Analysis

Presentation. Uses of LSA: LSA was patented in 1988 (US Patent 4,839,853) by Deerwester, Dumais, Furnas, Harshman, Landauer, Lochbaum and Streeter. It finds semantic relations between terms, helps to overcome synonymy and polysemy problems, and reduces dimensionality (from several thousand features to 40-400 dimensions). Applications: document clustering and document classification, matching queries to documents with similar topic meaning (information retrieval), text summarization, and more.

Method. LSA theory: how to obtain document coordinates? 1) Document-term matrix T = [f_ij], where f_ij is the frequency of term j in document i. 2) Weighting: T_W = [l_ij(f_ij) * g_j(f_ij)], combining a local weight l_ij and a global weight g_j. 3) SVD: T_W = U Σ V'. 4) Document coordinates in the latent semantic space: C = U_k Σ_k. We need to find the optimal dimension k for the final representation.
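To make the four steps concrete, here is a minimal numpy sketch of steps 3 and 4 applied to an already weighted matrix T_W; the toy matrix and the identity weighting are illustrative only.

```python
import numpy as np

def lsa_coordinates(T_w, k):
    """Steps 3-4: SVD of the weighted document-term matrix, T_W = U Sigma V',
    then document coordinates C = U_k Sigma_k in the k-dimensional latent space."""
    U, s, _ = np.linalg.svd(T_w, full_matrices=False)
    return U[:, :k] * s[:k]

# Toy example: 4 documents x 5 terms of raw frequencies f_ij (step 1),
# with the identity weighting l_ij = f_ij, g_j = 1 (step 2).
T = np.array([[2, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 3, 1],
              [0, 1, 0, 2, 2]], dtype=float)
C = lsa_coordinates(T, k=2)   # one 2-dimensional coordinate vector per document
```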

Method. Common weighting functions. Local weighting: term frequency l_ij(f_ij) = f_ij; binary l_ij(f_ij) = 1 if term j occurs in document i, else 0; logarithm l_ij(f_ij) = log(f_ij + 1). Global weighting: normalisation g_j(f_ij) = 1 / sqrt(sum_i f_ij^2); IDF (Inverse Document Frequency) g_j(f_ij) = 1 + log(n / n_j), where n is the number of documents and n_j the number of documents in which term j occurs; entropy g_j(f_ij) = 1 + sum_i (f_ij / f_.j) log(f_ij / f_.j) / log(n).
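These weighting schemes can be written compactly. The sketch below, assuming a raw documents-by-terms count matrix, implements the local and global weights as listed above (function and argument names are ours).

```python
import numpy as np

def weight_matrix(T, local="log", glob="entropy"):
    """Build T_W = [l_ij(f_ij) * g_j(f_ij)] from raw frequencies T (documents x terms)."""
    T = T.astype(float)
    n = T.shape[0]                                   # number of documents
    # Local weights l_ij(f_ij)
    if local == "tf":
        L = T.copy()
    elif local == "binary":
        L = (T > 0).astype(float)
    else:                                            # logarithm: log(f_ij + 1)
        L = np.log1p(T)
    # Global weights g_j(f_ij)
    if glob == "norm":                               # 1 / sqrt(sum_i f_ij^2)
        g = 1.0 / np.sqrt(np.maximum((T ** 2).sum(axis=0), 1e-12))
    elif glob == "idf":                              # 1 + log(n / n_j)
        n_j = np.maximum((T > 0).sum(axis=0), 1)
        g = 1.0 + np.log(n / n_j)
    else:                                            # entropy: 1 + sum_i p_ij log(p_ij) / log(n)
        P = T / np.maximum(T.sum(axis=0), 1e-12)     # p_ij = f_ij / f_.j
        plogp = np.zeros_like(P)
        mask = P > 0
        plogp[mask] = P[mask] * np.log(P[mask])
        g = 1.0 + plogp.sum(axis=0) / np.log(n)
    return L * g
```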

Method. LSA vs CA. Latent Semantic Analysis: 1) T = [f_ij]; 2) T_W = [l_ij(f_ij) * g_j(f_ij)]; 3) T_W = U Σ V'; 4) C = U_k Σ_k. Correspondence Analysis: 1) T = [f_ij]; 2) T_W = [f_ij / sqrt(f_i. * f_.j)]; 3) T_W = U Σ V'; 3') Ũ = diag(sqrt(f_.. / f_i.)) U; 4) C = Ũ_k Σ_k. CA thus corresponds to the weighting l_ij(f_ij) = f_ij / sqrt(f_i.) and g_j(f_ij) = 1 / sqrt(f_.j).
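For comparison, here is a numpy sketch of the CA steps as reconstructed above (uncentered weighted-SVD formulation, so the trivial first dimension is still included); the function name is ours.

```python
import numpy as np

def ca_coordinates(T, k):
    """Correspondence Analysis document coordinates via the weighted SVD above."""
    T = T.astype(float)
    f_i = T.sum(axis=1)                               # row totals f_i.
    f_j = T.sum(axis=0)                               # column totals f_.j
    f = T.sum()                                       # grand total f_..

    TW = T / np.sqrt(np.outer(f_i, f_j))              # step 2: f_ij / sqrt(f_i. f_.j)
    U, s, _ = np.linalg.svd(TW, full_matrices=False)  # step 3
    U_tilde = np.sqrt(f / f_i)[:, None] * U           # step 3': diag(sqrt(f_.. / f_i.)) U
    return U_tilde[:, :k] * s[:k]                     # step 4: C = U~_k Sigma_k
```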

3. Application in a real context

Presentation. Objectives: corpus of job offers. Find the best representation method to assess "job similarity" between offers in an unsupervised framework. Comparison of several representation techniques; discussion of the optimal number of dimensions to keep; comparison between two similarity measures.

Presentation. Data: offers have been manually labeled by recruiters into 8 categories during the posting procedure. Distribution among job categories: Sales/Business Development 360 (24%), Marketing/Product 141 (10%), R&D/Science 69 (5%), Production/Operations 127 (9%), Accounting/Finance 338 (23%), Human Resources 138 (9%), Logistics/Transportation 118 (8%), Information Systems 192 (13%); total 1483 (100%). We keep only the "title" + "mission description" parts ("firm description" and "profile searched" are excluded).

Methodology. Preprocessing of texts: lemmatisation and tagging; filtering according to grammatical category (we keep nouns, verbs and adjectives); filtering terms occurring in fewer than 5 offers; vector space model ("bag of words").
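A minimal sketch of such a preprocessing pipeline, assuming spaCy with a suitable language model is available (the model name and helper names are illustrative, not part of the original study).

```python
from collections import Counter
import spacy

nlp = spacy.load("fr_core_news_sm")          # illustrative model choice
KEPT_POS = {"NOUN", "VERB", "ADJ"}           # keep nouns, verbs and adjectives

def lemmatize(texts):
    """Lemmatise, POS-tag and filter each offer, returning bags of lemmas."""
    docs = []
    for doc in nlp.pipe(texts):
        docs.append([t.lemma_.lower() for t in doc
                     if t.pos_ in KEPT_POS and t.is_alpha])
    return docs

def filter_rare(docs, min_docs=5):
    """Drop terms occurring in fewer than min_docs offers (document frequency filter)."""
    df = Counter(term for doc in docs for term in set(doc))
    vocab = {t for t, c in df.items() if c >= min_docs}
    return [[t for t in doc if t in vocab] for doc in docs]
```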

Methodology. Several representations are compared. Representation methods: LSA with Term Frequency weighting; LSA with TF-IDF weighting; LSA with Log Entropy weighting; CA. Dissimilarity measures: Euclidean distance between documents i and i'; 1 - cosine similarity between documents i and i'.
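Both dissimilarities can be obtained directly from the latent coordinates; a small scipy sketch (the helper name is ours):

```python
from scipy.spatial.distance import pdist, squareform

def dissimilarity_matrix(C, measure="cosine"):
    """Pairwise dissimilarities between documents from their latent coordinates C:
    'euclidean' gives the Euclidean distance, 'cosine' gives 1 - cosine similarity."""
    return squareform(pdist(C, metric=measure))
```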

Methodology. Clustering method. Clustering steps: computation of the dissimilarity matrix from the document coordinates in the latent semantic space; Hierarchical Agglomerative Clustering down to an 8-class partition; computation of class centroids; K-means clustering initialized from those centroids.
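One possible implementation of these steps with scipy and scikit-learn; the slide does not name the linkage criterion, so average linkage is an assumption here, and k-means is run on the latent coordinates.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

def cluster_documents(C, n_clusters=8, metric="cosine"):
    """HAC on the dissimilarity matrix down to n_clusters, then k-means on the
    latent coordinates, initialised with the centroids of the HAC classes."""
    d = pdist(C, metric=metric)                            # condensed dissimilarity matrix
    hac_labels = fcluster(linkage(d, method="average"),    # linkage choice is an assumption
                          t=n_clusters, criterion="maxclust")
    centroids = np.vstack([C[hac_labels == g].mean(axis=0)
                           for g in range(1, n_clusters + 1)])
    km = KMeans(n_clusters=n_clusters, init=centroids, n_init=1).fit(C)
    return km.labels_
```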

Methodology. Measures of agreement between two partitions. P1, P2: two partitions of n objects with the same number of classes k; N = [n_ij], i = 1,..,k, j = 1,..,k: the corresponding contingency table. Rand index: R = (2 sum_{i,j} n_ij^2 - sum_i n_i.^2 - sum_j n_.j^2 + n^2) / n^2, with 0 <= R <= 1. The Rand index is based on the number of pairs of units which belong to the same clusters; it does not depend on cluster labeling.
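A direct transcription of this formula into numpy, building the contingency table from two label vectors (the function name is ours):

```python
import numpy as np

def rand_index(labels_1, labels_2):
    """Rand index R = (2*sum_ij n_ij^2 - sum_i n_i.^2 - sum_j n_.j^2 + n^2) / n^2,
    computed from the contingency table of two partitions of the same n objects."""
    labels_1, labels_2 = np.asarray(labels_1), np.asarray(labels_2)
    n = labels_1.size
    _, inv_1 = np.unique(labels_1, return_inverse=True)
    _, inv_2 = np.unique(labels_2, return_inverse=True)
    N = np.zeros((inv_1.max() + 1, inv_2.max() + 1))
    np.add.at(N, (inv_1, inv_2), 1)                   # contingency table n_ij
    return (2 * (N ** 2).sum()
            - (N.sum(axis=1) ** 2).sum()
            - (N.sum(axis=0) ** 2).sum()
            + n ** 2) / n ** 2
```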

Methodology. Measures of agreement between two partitions. Cohen's Kappa and F-measure values depend on the cluster labels; to overcome label switching, we look for their maximum values over all label allocations. Cohen's Kappa: κ_opt = max over label allocations of [(1/n) sum_i n_ii - (1/n^2) sum_i n_i. n_.i] / [1 - (1/n^2) sum_i n_i. n_.i], with -1 <= κ <= 1. F-measure: F_opt = max over label allocations of 2 / [1 / ((1/k) sum_i n_ii / n_i.) + 1 / ((1/k) sum_i n_ii / n_.i)], i.e. the harmonic mean of the average recall and the average precision over the matched classes, with 0 <= F <= 1.
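Since both indices depend on how clusters are labelled, the maximisation over label allocations can be done by brute force for k = 8 (8! = 40,320 permutations). A sketch, assuming a square k x k contingency table N (names are ours):

```python
from itertools import permutations
import numpy as np

def optimal_kappa_f(N):
    """Max Cohen's Kappa and F-measure over all column-label permutations of the
    k x k contingency table N, as a guard against label switching."""
    N = np.asarray(N, dtype=float)
    k = N.shape[0]
    n = N.sum()
    row, col = N.sum(axis=1), N.sum(axis=0)
    best_kappa, best_f = -np.inf, -np.inf
    for perm in permutations(range(k)):
        diag = N[np.arange(k), list(perm)]            # n_ii after relabelling
        p_o = diag.sum() / n                          # observed agreement
        p_e = (row * col[list(perm)]).sum() / n ** 2  # expected agreement
        kappa = (p_o - p_e) / (1 - p_e)
        recall = (diag / row).mean()                  # (1/k) sum_i n_ii / n_i.
        precision = (diag / col[list(perm)]).mean()   # (1/k) sum_i n_ii / n_.i
        f = 2 / (1 / recall + 1 / precision)
        best_kappa, best_f = max(best_kappa, kappa), max(best_f, f)
    return best_kappa, best_f
```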

Results and comparisons. Correlation between the coordinates obtained from the different methods.

Results and comparisons. Clustering quality according to the method and the number of dimensions: Rand index.

Results and comparisons. Clustering quality according to the method and the number of dimensions: Cohen's Kappa.

Results and comparisons. Clustering quality according to the method and the number of dimensions: F-measure.

Results and comparisons. Clustering quality according to the dissimilarity function: LSA + Log Entropy.

Results and comparisons. Clustering quality according to the dissimilarity function: CA.

4. Conclusion

Conclusions. CA seems to be less stable than the other methods, but with cosine similarity it provides better results below 100 dimensions. As reported in the literature, cosine similarity between vectors seems better suited to textual data than plain dot-product similarity: a slight increase in efficiency and more stability of the agreement measures. Optimal number of dimensions to keep? It varies with the type of text studied and the method used (around 60 dimensions with CA). We should prefer a dissimilarity measure that gives stable results as the number of kept dimensions changes (in the context of automated tasks, it is problematic if the optimal dimension depends too much on the collection of documents).

Limitations & future work. Limitations of the study: the clusters obtained are compared with categories chosen by recruiters, which are sometimes subjective and could explain some errors; we are working on a very particular type of corpus: short texts of variable length, sometimes very similar but not true duplicates. Future work: test other clustering methods (the representation to adopt may depend on it); repeat the study with a supervised classification algorithm (index values are disappointing in the unsupervised framework); study the effect of using the different parts of job offers for classification.

Some references
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, 391-407.
Greenacre, M. (2007). Correspondence Analysis in Practice, Second Edition. London: Chapman & Hall/CRC.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
Landauer, T. K., McNamara, D., Dennis, S., & Kintsch, W. (Eds.) (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum.
Picca, D., Curdy, B., & Bavaud, F. (2006). Non-linear correspondence analysis in text retrieval: a kernel view. In JADT'06, pp. 741-747.
Wild, F. (2007). An LSA package for R. In LSA-TEL'07, pp. 11-12.

Thanks!