A comparison between Latent Semantic Analysis and Correspondence Analysis
1 A comparison between Latent Semantic Analysis and Correspondence Analysis Julie Séguéla, Gilbert Saporta CNAM, Cedric Lab Multiposting.fr February 9th CARME
2 Outline
1 Introduction
2 Latent Semantic Analysis (Presentation, Method)
3 Application in a real context (Presentation, Methodology, Results and comparisons)
4 Conclusion
A comparison between LSA & CA February 9th CARME 2 / 29
3 Introduction
4 Introduction: Context
Text representation for a categorization task
5 Introduction: Objectives
- Comparison of several text representation techniques through theory and application
- In particular, comparison between a statistical technique, Correspondence Analysis (CA), and an information retrieval (IR) oriented method, Latent Semantic Analysis (LSA)
- Is there an optimal technique for performing document clustering?
6 Latent Semantic Analysis
7 Latent Semantic Analysis, Presentation: Uses of LSA
LSA was patented in 1988 (US Patent 4,839,853) by Deerwester, Dumais, Furnas, Harshman, Landauer, Lochbaum and Streeter.
- Find semantic relations between terms
- Helps to overcome synonymy and polysemy problems
- Dimensionality reduction (from several thousands of features to dimensions)
Applications
- Document clustering and document classification
- Matching queries to documents of similar topic meaning (information retrieval)
- Text summarization...
8 Latent Semantic Analysis, Method: LSA theory
How to obtain document coordinates?
1) Document-term matrix: $T = [f_{ij}]_{i,j}$
2) Weighting: $T_W = [l_{ij}(f_{ij}) \cdot g_j(f_{ij})]_{i,j}$
3) SVD: $T_W = U \Sigma V^{\top}$
4) Document coordinates in the latent semantic space: $C = U_k \Sigma_k$
We need to find the optimal dimension for the final representation.
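The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `lsa_coordinates` and the toy matrix are assumptions made for the example, and plain term-frequency weighting ($T_W = T$) is used for simplicity.

```python
import numpy as np

def lsa_coordinates(T_w, k):
    """Document coordinates C = U_k Sigma_k from a weighted
    document-term matrix T_w (documents in rows, terms in columns)."""
    # Thin SVD: T_w = U diag(s) Vt, singular values sorted in decreasing order
    U, s, Vt = np.linalg.svd(T_w, full_matrices=False)
    # Keep the first k latent dimensions and scale each axis by its singular value
    return U[:, :k] * s[:k]

# Toy document-term counts (4 documents x 5 terms), purely illustrative
T = np.array([[2, 1, 0, 0, 0],
              [1, 2, 0, 0, 1],
              [0, 0, 3, 1, 0],
              [0, 0, 1, 2, 1]], dtype=float)

C = lsa_coordinates(T, k=2)  # term-frequency weighting: T_W = T
```

With all dimensions kept, `C @ C.T` reproduces the document-by-document inner products `T @ T.T`; truncating to `k` dimensions keeps the best rank-`k` approximation of them.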
9 Latent Semantic Analysis, Method: Common weighting functions
Local weighting
- Term frequency: $l_{ij}(f_{ij}) = f_{ij}$
- Binary: $l_{ij}(f_{ij}) = 1$ if term $j$ occurs in document $i$, else $0$
- Logarithm: $l_{ij}(f_{ij}) = \log(f_{ij} + 1)$
Global weighting
- Normalisation: $g_j(f_{ij}) = \frac{1}{\sqrt{\sum_i f_{ij}^2}}$
- IDF (Inverse Document Frequency): $g_j(f_{ij}) = 1 + \log\left(\frac{n}{n_j}\right)$, where $n$ is the number of documents and $n_j$ is the number of documents in which term $j$ occurs
- Entropy: $g_j(f_{ij}) = 1 + \frac{\sum_i \frac{f_{ij}}{f_{.j}} \log\left(\frac{f_{ij}}{f_{.j}}\right)}{\log(n)}$
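As a rough illustration of how a local and a global weight combine, the sketch below computes two of the schemes compared later in the talk, TF-IDF and Log Entropy. The function names are hypothetical, and the entropy weight is written in its usual form $1 + \sum_i p_{ij} \log p_{ij} / \log n$ with $p_{ij} = f_{ij} / f_{.j}$; it assumes every term occurs in at least one document.

```python
import numpy as np

def tf_idf(T):
    """Term-frequency local weight times IDF global weight 1 + log(n / n_j)."""
    n = T.shape[0]                      # number of documents
    n_j = (T > 0).sum(axis=0)           # documents containing each term
    return T * (1.0 + np.log(n / n_j))

def log_entropy(T):
    """Logarithm local weight times entropy global weight."""
    n = T.shape[0]
    local = np.log(T + 1.0)
    p = T / T.sum(axis=0)               # f_ij / f_.j
    # p * log(p) is taken as 0 where p == 0 (inner where avoids log(0))
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=0) / np.log(n)
    return local * g

# Toy counts: term 0 is spread evenly over all documents, term 1 occurs in one
T = np.array([[1.0, 2.0],
              [1.0, 0.0],
              [1.0, 0.0],
              [1.0, 0.0]])
W_tfidf = tf_idf(T)
W_logent = log_entropy(T)
```

In this toy case the evenly spread term receives an entropy weight of 0 (it does not discriminate between documents), while the term concentrated in a single document receives a weight of 1.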
10 Latent Semantic Analysis, Method: LSA vs CA
Latent Semantic Analysis
1) $T = [f_{ij}]_{i,j}$
2) $T_W = [l_{ij}(f_{ij}) \cdot g_j(f_{ij})]_{i,j}$
3) $T_W = U \Sigma V^{\top}$
4) $C = U_k \Sigma_k$
Correspondence Analysis
1) $T = [f_{ij}]_{i,j}$
2) $T_W = \left[ \frac{f_{ij}}{\sqrt{f_{i.} f_{.j}}} \right]_{i,j}$
3) $T_W = U \Sigma V^{\top}$
3') $\tilde{U} = \mathrm{diag}\left( \sqrt{f_{..} / f_{i.}} \right) U$
4) $C = \tilde{U}_k \Sigma_k$
CA therefore corresponds to $l_{ij}(f_{ij}) = f_{ij} / \sqrt{f_{i.}}$ and $g_j(f_{ij}) = 1 / \sqrt{f_{.j}}$.
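The CA column of this comparison can be sketched in the same way as LSA, as an SVD of a specially weighted matrix followed by a row rescaling. The function name and toy data are assumptions for illustration. One detail is not stated on the slide: when the table is not centred, the first singular triplet is trivial (singular value 1, constant coordinates), so this sketch skips it.

```python
import numpy as np

def ca_row_coordinates(T, k):
    """CA document (row) coordinates from a document-term count matrix T."""
    f_i = T.sum(axis=1)                      # row totals f_i.
    f_j = T.sum(axis=0)                      # column totals f_.j
    f = T.sum()                              # grand total f_..
    T_w = T / np.sqrt(np.outer(f_i, f_j))    # f_ij / sqrt(f_i. f_.j)
    U, s, Vt = np.linalg.svd(T_w, full_matrices=False)
    U_tilde = np.sqrt(f / f_i)[:, None] * U  # diag(sqrt(f_.. / f_i.)) U
    # Skip the trivial first dimension, keep the next k
    return U_tilde[:, 1:k + 1] * s[1:k + 1]

# Toy document-term counts (4 documents x 5 terms), purely illustrative
T = np.array([[2, 1, 0, 0, 0],
              [1, 2, 0, 0, 1],
              [0, 0, 3, 1, 0],
              [0, 0, 1, 2, 1]], dtype=float)

C = ca_row_coordinates(T, k=2)
```

A quick sanity check is that the coordinates are centred once documents are weighted by their masses $f_{i.} / f_{..}$.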
11 Application in a real context
12 Application in a real context, Presentation: Objectives
- Corpus of job offers
- Find the best representation method to assess "job similarity" between offers in an unsupervised framework
- Comparison of several representation techniques
- Discussion about the optimal number of dimensions to keep
- Comparison between two similarity measures
13 Application in a real context, Presentation: Data
Offers have been manually labeled by recruiters into 8 categories during the posting procedure.
Distribution among job categories:
Category                      Freq.   %
Sales/Business Development
Marketing/Product
R&D/Science                   69      5
Production/Operations
Accounting/Finance
Human Resources
Logistics/Transportation
Information Systems
Total
We keep only the "title" + "mission description" parts ("firm description" and "profile searched" are excluded).
14 Application in a real context, Methodology: Preprocessing of texts
- Lemmatisation and tagging
- Filtering according to grammatical category (we keep nouns, verbs and adjectives)
- Filtering out terms occurring in fewer than 5 offers
- Vector space model ("bag of words")
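The last two preprocessing steps (document-frequency filtering and the bag-of-words matrix) can be illustrated as follows. This sketch assumes the documents are already lemmatised and filtered by grammatical category; the function name `filter_rare_terms` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_rare_terms(docs, min_docs=5):
    """Keep terms occurring in at least min_docs documents and
    build the document-term count matrix (bag of words)."""
    # Document frequency: count each term once per document
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept = sorted(t for t, df in doc_freq.items() if df >= min_docs)
    index = {t: j for j, t in enumerate(kept)}
    T = [[0] * len(kept) for _ in docs]
    for i, doc in enumerate(docs):
        for term in doc:
            j = index.get(term)
            if j is not None:
                T[i][j] += 1
    return kept, T

# Toy lemmatised documents; a threshold of 2 is used only because the corpus is tiny
docs = [["sales", "manager"],
        ["sales", "sales", "engineer"],
        ["manager", "sales"]]
vocabulary, T = filter_rare_terms(docs, min_docs=2)
```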
15 Application in a real context, Methodology: Several representations are compared
Representation method
- LSA, weighting: Term Frequency
- LSA, weighting: TF-IDF
- LSA, weighting: Log Entropy
- CA
Dissimilarity measure
- Euclidean distance between documents $i$ and $i'$
- 1 - cosine similarity between documents $i$ and $i'$
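The two dissimilarity measures can be computed for all document pairs at once from a coordinate matrix. The sketch below is illustrative only: the function names are assumptions, and it supposes that no document has an all-zero coordinate vector (the cosine would be undefined).

```python
import numpy as np

def euclidean_dissimilarity(C):
    """Matrix of Euclidean distances between the rows of C."""
    sq = (C ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (C @ C.T)
    return np.sqrt(np.maximum(d2, 0.0))   # clip tiny negative rounding errors

def cosine_dissimilarity(C):
    """Matrix of 1 - cosine similarity between the rows of C."""
    unit = C / np.linalg.norm(C, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

# Toy coordinates: documents 0 and 1 point in the same direction
C = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 3.0]])
D_euc = euclidean_dissimilarity(C)
D_cos = cosine_dissimilarity(C)
```

Documents 0 and 1 differ only in length, so their cosine dissimilarity is 0 while their Euclidean distance is 1; this insensitivity to vector length is the main practical difference between the two measures.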
16 Application in a real context, Methodology: Method of clustering
Clustering steps
- Computation of the dissimilarity matrix from document coordinates in the latent semantic space
- Hierarchical Agglomerative Clustering down to an 8-class partition
- Computation of class centroids
- K-means clustering initialized from these centroids
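A possible rendering of these clustering steps with SciPy is sketched below. The slide does not say which linkage criterion was used, so Ward linkage on Euclidean distances is an assumption here, as are the function name and the toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

def hac_then_kmeans(C, n_classes):
    """Hierarchical clustering cut at n_classes, then k-means
    initialised from the centroids of the hierarchical classes."""
    Z = linkage(C, method="ward")                       # assumed linkage criterion
    hac_labels = fcluster(Z, t=n_classes, criterion="maxclust")
    centroids = np.vstack([C[hac_labels == c].mean(axis=0)
                           for c in np.unique(hac_labels)])
    # minit="matrix": the second argument is read as the initial centroids
    _, labels = kmeans2(C, centroids, minit="matrix")
    return labels

# Toy coordinates: three well-separated groups of 10 documents each
rng = np.random.default_rng(0)
C = np.vstack([rng.normal(loc=m, scale=0.1, size=(10, 2))
               for m in (0.0, 5.0, 10.0)])
labels = hac_then_kmeans(C, n_classes=3)
```

For the cosine-based dissimilarity, one option would be to pass a precomputed condensed distance matrix to `linkage` with a criterion such as average linkage, since Ward assumes Euclidean distances.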
17 Application in a real context, Methodology: Measures of agreement between two partitions
$P_1$, $P_2$: two partitions of $n$ objects with the same number of classes $k$
$N = [n_{ij}]_{i=1,\dots,k;\ j=1,\dots,k}$: corresponding contingency table
Rand index
$$R = \frac{2 \sum_i \sum_j n_{ij}^2 - \sum_i n_{i.}^2 - \sum_j n_{.j}^2 + n^2}{n^2}, \qquad 0 \le R \le 1$$
The Rand index is based on the number of pairs of units that are grouped together in both partitions or separated in both. It does not depend on cluster labeling.
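The Rand index in the form $R = (2 \sum_{ij} n_{ij}^2 - \sum_i n_{i.}^2 - \sum_j n_{.j}^2 + n^2) / n^2$ can be computed directly from the contingency table. The helper names below are assumptions for illustration.

```python
import numpy as np

def contingency(labels_a, labels_b):
    """Contingency table N = [n_ij] of two partitions given as label vectors."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    N = np.zeros((a_vals.size, b_vals.size), dtype=np.int64)
    np.add.at(N, (a_idx, b_idx), 1)     # count co-occurrences of labels
    return N

def rand_index(N):
    """Rand index counted over all n^2 ordered pairs of units."""
    n = N.sum()
    return (2.0 * (N ** 2).sum()
            - (N.sum(axis=1) ** 2).sum()
            - (N.sum(axis=0) ** 2).sum()
            + n ** 2) / n ** 2

# Same partition with permuted labels: the index ignores labeling
r_same = rand_index(contingency([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))
# Two crossed partitions of four units
r_cross = rand_index(contingency([0, 0, 1, 1], [0, 1, 0, 1]))
```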
18 Application in a real context, Methodology: Measures of agreement between two partitions
Cohen's Kappa and F-measure values depend on the clusters' labels. To overcome label switching, we look for their maximum values over all label allocations.
Cohen's Kappa
$$\kappa_{opt} = \max \left\{ \frac{\frac{1}{n} \sum_i n_{ii} - \frac{1}{n^2} \sum_i n_{i.} n_{.i}}{1 - \frac{1}{n^2} \sum_i n_{i.} n_{.i}} \right\}, \qquad \kappa \le 1$$
F-measure
$$F_{opt} = \max \left\{ \frac{2 \cdot \frac{1}{k} \sum_i \frac{n_{ii}}{n_{i.}} \cdot \frac{1}{k} \sum_i \frac{n_{ii}}{n_{.i}}}{\frac{1}{k} \sum_i \frac{n_{ii}}{n_{i.}} + \frac{1}{k} \sum_i \frac{n_{ii}}{n_{.i}}} \right\}, \qquad 0 \le F \le 1$$
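Maximising over label allocations can be done by brute force for a small number of classes, by trying every permutation of the columns of the contingency table. The sketch below is one possible reading, not the authors' code: the function names are hypothetical, the F-measure is taken as the harmonic mean of class-averaged precision and recall, and all row and column totals are assumed to be non-zero.

```python
import itertools
import numpy as np

def kappa(N):
    """Cohen's Kappa for a k x k contingency table with matched labels."""
    n = N.sum()
    p_obs = np.trace(N) / n
    p_exp = (N.sum(axis=1) * N.sum(axis=0)).sum() / n ** 2
    return (p_obs - p_exp) / (1.0 - p_exp)

def f_measure(N):
    """Harmonic mean of class-averaged precision and recall."""
    diag = np.diag(N)
    p = np.mean(diag / N.sum(axis=1))
    r = np.mean(diag / N.sum(axis=0))
    return 2.0 * p * r / (p + r) if p + r > 0 else 0.0

def best_over_relabelings(N, score):
    """Maximum of score over all permutations of the column labels."""
    k = N.shape[0]
    return max(score(N[:, list(perm)])
               for perm in itertools.permutations(range(k)))

# Two identical partitions whose labels have been switched
N = np.array([[0, 5, 0],
              [0, 0, 4],
              [6, 0, 0]])
kappa_opt = best_over_relabelings(N, kappa)
f_opt = best_over_relabelings(N, f_measure)
```

With 8 classes there are 8! = 40,320 permutations, which remains tractable for an exhaustive search.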
19 Application in a real context, Results and comparisons: Correlation between coordinates obtained from the different methods
20 Application in a real context, Results and comparisons: Clustering quality according to the method and the number of dimensions: Rand index
21 Application in a real context, Results and comparisons: Clustering quality according to the method and the number of dimensions: Cohen's Kappa
22 Application in a real context, Results and comparisons: Clustering quality according to the method and the number of dimensions: F-measure
23 Application in a real context, Results and comparisons: Clustering quality according to the dissimilarity function: LSA + Log Entropy
24 Application in a real context, Results and comparisons: Clustering quality according to the dissimilarity function: CA
25 Conclusion
26 Conclusion: Conclusions
- CA seems to be less stable than the other methods, but with cosine similarity it provides better results below 100 dimensions
- As reported in the literature, cosine similarity between vectors seems better suited to textual data than the usual dot similarity: slight increase in efficiency and more stable agreement measures
- Optimal number of dimensions to keep? It varies with the type of text studied and the method used (around 60 dimensions with CA)
- We should prefer a dissimilarity measure that gives stable results as the number of dimensions kept varies (for automated tasks, it is problematic if the optimal dimension depends too much on the collection of documents)
27 Conclusion: Limitations & future work
Limitations of the study
- The clusters obtained are compared with categories chosen by recruiters, which are sometimes subjective; this could explain some errors
- We are working on a very particular type of corpus: short texts of variable length, sometimes very similar but not exact duplicates
Future work
- Test other clustering methods (the representation to adopt may depend on the method)
- Repeat the study with a supervised algorithm for classification (index values are disappointing in the unsupervised framework)
- Study the effect of using the different parts of job offers for classification
28 Some references
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41,
- Greenacre, M. (2007). Correspondence Analysis in Practice, Second Edition. London: Chapman & Hall/CRC.
- Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25,
- Landauer, T. K., McNamara, D., Dennis, S., & Kintsch, W. (Eds.) (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum.
- Picca, D., Curdy, B., & Bavaud, F. (2006). Non-linear correspondence analysis in text retrieval: a kernel view. In JADT'06, pp
- Wild, F. (2007). An LSA package for R. In LSA-TEL'07, pp
29 Thanks!
AQUA: An Ontology-Driven Question Answering System Maria Vargas-Vera, Enrico Motta and John Domingue Knowledge Media Institute (KMI) The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom.
More informationarxiv: v1 [math.at] 10 Jan 2016
THE ALGEBRAIC ATIYAH-HIRZEBRUCH SPECTRAL SEQUENCE OF REAL PROJECTIVE SPECTRA arxiv:1601.02185v1 [math.at] 10 Jan 2016 GUOZHEN WANG AND ZHOULI XU Abstract. In this note, we use Curtis s algorithm and the
More information*Net Perceptions, Inc West 78th Street Suite 300 Minneapolis, MN
From: AAAI Technical Report WS-98-08. Compilation copyright 1998, AAAI (www.aaai.org). All rights reserved. Recommender Systems: A GroupLens Perspective Joseph A. Konstan *t, John Riedl *t, AI Borchers,
More informationGeorgetown University at TREC 2017 Dynamic Domain Track
Georgetown University at TREC 2017 Dynamic Domain Track Zhiwen Tang Georgetown University zt79@georgetown.edu Grace Hui Yang Georgetown University huiyang@cs.georgetown.edu Abstract TREC Dynamic Domain
More informationhave to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,
A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994
More informationCorpus Linguistics (L615)
(L615) Basics of Markus Dickinson Department of, Indiana University Spring 2013 1 / 23 : the extent to which a sample includes the full range of variability in a population distinguishes corpora from archives
More informationHandling Sparsity for Verb Noun MWE Token Classification
Handling Sparsity for Verb Noun MWE Token Classification Mona T. Diab Center for Computational Learning Systems Columbia University mdiab@ccls.columbia.edu Madhav Krishna Computer Science Department Columbia
More informationEvaluating Interactive Visualization of Multidimensional Data Projection with Feature Transformation
Multimodal Technologies and Interaction Article Evaluating Interactive Visualization of Multidimensional Data Projection with Feature Transformation Kai Xu 1, *,, Leishi Zhang 1,, Daniel Pérez 2,, Phong
More informationThe Role of String Similarity Metrics in Ontology Alignment
The Role of String Similarity Metrics in Ontology Alignment Michelle Cheatham and Pascal Hitzler August 9, 2013 1 Introduction Tim Berners-Lee originally envisioned a much different world wide web than
More informationDiscovery of Topical Authorities in Instagram
Discovery of Topical Authorities in Instagram Aditya Pal, Amaç Herdağdelen, Sourav Chatterji, Sumit Taank, Deepayan Chakrabarti Facebook {apal,amac,sourav,staank}@fb.com, deepay@utexas.edu ABSTRACT Instagram
More informationSemi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.
Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link
More informationAccuracy (%) # features
Question Terminology and Representation for Question Type Classication Noriko Tomuro DePaul University School of Computer Science, Telecommunications and Information Systems 243 S. Wabash Ave. Chicago,
More informationSpeech Recognition at ICSI: Broadcast News and beyond
Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI
More informationOPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS
OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,
More informationLinking Task: Identifying authors and book titles in verbose queries
Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,
More informationA Graph Based Authorship Identification Approach
A Graph Based Authorship Identification Approach Notebook for PAN at CLEF 2015 Helena Gómez-Adorno 1, Grigori Sidorov 1, David Pinto 2, and Ilia Markov 1 1 Center for Computing Research, Instituto Politécnico
More informationMultilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities
Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities Soto Montalvo GAVAB Group URJC Raquel Martínez NLP&IR Group UNED Arantza Casillas Dpt. EE UPV-EHU Víctor Fresno GAVAB
More information