Text Analytics Using Latent Semantic Analysis
John Martin
Small Bear Technologies, Inc.
<John.Martin@SmallBearTechnologies.com>
www.smallbeartechnologies.com

Overview
- Text Analytics
- Need for automated methods
- What is LSA
- How LSA works
- Applications of LSA
- Misconceptions
- Conclusion - Q&A

2011 Small Bear Technologies, Inc.

What is Text Analytics?
- "A set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation" (Wikipedia)
- Text Analytics Text Mining

Text Analytics
- Derive meaning from (textual) data sources
- Structured data
  - Fixed format
  - Known attributes
- Unstructured data
  - Natural language

Unstructured Text
- News feeds
- Call center logs
- E-mail traffic
- Surveys
- Social network postings
- Publishing
- Observational data

Automated Methods Required
- Volume of information
- Speed of change/production
- Complexity
- Need impartial/consistent analysis

The Goal
- Maintaining and increasing the value of information
- Collecting information is not the problem; gaining understanding is the issue
- Too much information is just as useless as too little information

Some Common Methods
- Lexical matching
- Statistical evaluation
- Vector space models
- Rule-based systems
- Part-of-speech analysis

The Problem
- Failure to capture meaning and provide insight
- Methods not universally applicable
  - Language or domain dependent
- Need for specialized prior knowledge of data
- Require human interaction
  - Tagging, keyword identification, categorizations
- Not practical for large data sets

The Cost
- Too much information is just as useless as too little information
- Failure to understand the information we have leads to:
  - Lost opportunities
  - Unsatisfied customers
  - Inability to fulfill mission
  - Financial repercussions

What is LSA?
- Latent Semantic Analysis
- Latent Semantic Indexing
- The distinction is really one of application, as the same mathematics and computation are employed for both. LSA may be considered to refer to a broad collection of applications, while LSI is more closely associated with information retrieval.

Latent Semantic Analysis
- Theory of meaning [8]
  - Creates a mapping of meaning acquired from the text itself
- Computational model
  - Can perform many of the cognitive tasks that humans do, essentially as well as humans [7]

How LSA Works
- LSA processing constructs a mapping of meaning in a semantic space
- The mapping gives the meaning of words and documents, not vice versa

Compositionality Constraint
- The meaning of a document is the sum of the meaning of its words
- The meaning of a word is defined by the documents in which it appears (and does not appear)
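
The compositionality constraint can be illustrated with a minimal sketch. The three-dimensional word vectors below are hypothetical values chosen for illustration; in LSA they would come from the decomposition of a real corpus, and `document_vector` is an assumed helper name, not part of any LSA library.

```python
# Hypothetical 3-dimensional word vectors (illustrative values only;
# real LSA word vectors are produced by the SVD of a corpus matrix).
word_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "stock": [0.0, 0.1, 0.9],
}


def document_vector(words, vectors):
    """Sum the vectors of the known words in a document."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in words:
        if word in vectors:  # words without a vector contribute nothing
            for k, value in enumerate(vectors[word]):
                total[k] += value
    return total


doc = document_vector(["cat", "dog", "cat"], word_vectors)
reordered = document_vector(["dog", "cat", "cat"], word_vectors)
```

Because the document vector is a plain sum, reordering the words leaves it unchanged up to floating-point rounding.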

LSA Space Construction
- LSA models a document as a simple linear equation
- A collection of documents (corpus) is a large set of simultaneous equations

Processing a Corpus
- Divide text corpus into units (documents)
  - Typically paragraphs of text
- Raw matrix is constructed from units
  - One row for each word type
  - One column for each unit (document)
  - Cells contain the number of times a particular word appears in a particular document
- Weighting functions may be applied [5]
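
The corpus-processing steps can be sketched in a few lines of Python. The slide does not say which weighting function is used, so the log-entropy weighting below is an assumption (it is one scheme commonly paired with LSA). The function names, the whitespace tokenizer, and the three toy documents are illustrative only.

```python
import math
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix): one row per word type, one column per document."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


def log_entropy(matrix):
    """Apply log-entropy weighting (an assumed choice); needs at least two documents."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        total = sum(row)
        # Sum of p * log(p) over the documents that contain the term (always <= 0).
        entropy = sum((tf / total) * math.log(tf / total) for tf in row if tf > 0)
        global_weight = 1.0 + entropy / math.log(n_docs)
        weighted.append([global_weight * math.log(1.0 + tf) for tf in row])
    return weighted


documents = ["the cat sat", "the dog sat down", "the stock fell"]
vocabulary, matrix = term_document_matrix(documents)
weighted = log_entropy(matrix)
```

Under this weighting a word spread evenly over every document (here "the") receives a weight near zero, while a word confined to one document keeps its full local weight.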

Sparse Matrix
- The weighted term by document matrix represents a large set of simultaneous equations
- The term by document matrix is sparse
  - Typically less than 1% of the values are nonzero [2]

Solving Simultaneous Equations
- The system of simultaneous equations is solved for the meaning of each word type and document
- Sparse matrix Singular Value Decomposition
  - Lanczos algorithm is typically used
  - Only solve for a reduced number of dimensions
- Produces vectors representing the meaning of each term and document
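
A reduced-dimension decomposition can be sketched as follows. This uses NumPy's dense `numpy.linalg.svd` and then truncates, which is a substitute for the sparse Lanczos solver named on the slide and is only practical for small matrices. The 5 x 4 matrix is made-up data, and scaling the term and document vectors by the singular values is one common convention, assumed here.

```python
import numpy as np


def truncated_svd(A, k):
    """Return the k largest singular triplets of A (dense stand-in for a Lanczos solver)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


# Hypothetical weighted term-by-document matrix: 5 terms (rows), 4 documents (columns).
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
])

k = 2
U_k, s_k, Vt_k = truncated_svd(A, k)

term_vectors = U_k * s_k       # one row per term, k columns
doc_vectors = Vt_k.T * s_k     # one row per document, k columns
A_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of A
```

The rank-k matrix `A_k` is the best rank-k approximation of `A` in the least-squares sense, so the discarded singular values measure exactly how much of the original matrix is left out.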

Singular Value Decomposition [10]
- $A = U \Sigma V^{T}$
- The rows of matrix U are the vectors for the word types
- Columns of U are the eigenvectors defining the axes for word type space

Singular Value Decomposition [10]
- $A = U \Sigma V^{T}$
- The rows of matrix V are the text unit (document) vectors
- Columns of V are eigenvectors defining the axes for document space
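
The eigenvector statements on the two SVD slides can be checked numerically: the columns of U are eigenvectors of A Aᵀ, the columns of V are eigenvectors of Aᵀ A, and in both cases the eigenvalues are the squared singular values. The small matrix below is arbitrary illustrative data.

```python
import numpy as np

# Arbitrary small matrix standing in for a term-by-document matrix.
A = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 1.0],
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# (A A^T) u_i = s_i^2 u_i for every column u_i of U.
left_ok = np.allclose(A @ A.T @ U, U * s**2)

# (A^T A) v_i = s_i^2 v_i for every column v_i of V.
right_ok = np.allclose(A.T @ A @ V, V * s**2)

# The three factors reproduce A.
reconstruction_ok = np.allclose(U @ np.diag(s) @ Vt, A)
```

This is why the decomposition can be posed as a symmetric eigenproblem on A Aᵀ or Aᵀ A.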

Dimensional Reduction
- Typically solve for 300-500 dimensions [10]
- Dimensional reduction allows comparison of all terms and all documents with each other
  - In the sparse matrix, comparison was not possible
- Dimensions are orthogonal
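
A toy example, built on assumed data, shows why reduction makes comparison possible. Documents d1 and d2 share no terms, so their raw columns have zero similarity, but a third document links the two terms. The two-term vocabulary and the choice of a single dimension are purely illustrative; real spaces use hundreds of dimensions.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm > 0 else 0.0


# Rows: terms "car", "auto". Columns: documents d1, d2, d3.
# d1 contains only "car", d2 only "auto", d3 contains both.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
])

# In the raw matrix d1 and d2 have no term in common.
raw_similarity = cosine(A[:, 0], A[:, 1])

# Keep only the largest singular dimension.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
doc_vectors = Vt[:k, :].T * s[:k]  # one row per document

reduced_similarity = cosine(doc_vectors[0], doc_vectors[1])
```

Here `raw_similarity` is zero, while in the one-dimensional space d1 and d2 point the same way. With a single dimension the cosine can only be +1 or -1, so this shows the direction of the effect rather than a realistic similarity value.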

Dimensional Reduction
- With enough variables every object is different
- With too few variables every object is the same
- [a r f j] [a r c g]
- Consider a geographic map

Semantic Space
- Vectors represent the meaning of a document (or term)
- Items similar in meaning are near each other in the semantic space
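
Nearness in the semantic space is usually measured with the cosine between vectors; the slide does not name a measure, so cosine is an assumption here. The two-dimensional document vectors and their labels are hypothetical, and `nearest` is an illustrative helper name.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(query, items):
    """Return item names sorted from most to least similar to the query vector."""
    return sorted(items, key=lambda name: cosine(query, items[name]), reverse=True)


# Hypothetical document vectors in a 2-dimensional semantic space.
doc_vectors = {
    "pets_1": [0.9, 0.1],
    "pets_2": [0.8, 0.3],
    "finance_1": [0.1, 0.9],
}

ranking = nearest(doc_vectors["pets_1"], doc_vectors)
```

Ranking every item by similarity to a query vector in this way is the basic retrieval operation; the same comparison works between terms, or between a term and a document.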

Computational Issues
- Nontrivial computation
  - Large sparse symmetric eigenproblem
- Scalability concerns [11]
  - Size of document set
  - Speed of processing
- Accuracy issues
  - Finite arithmetic introduces significant error

Operations
- Retrieval
- Clustering
- Comparison
- Interpretation
- Completion

Applications of LSA
- Library Illustration
  - Retrieval
  - Content analysis
  - Evaluation of fit into an existing collection
  - Comparison of multiple collections
  - Indexing of multilingual collections

Applications of LSA
- Repairing/cleaning data
- Education
  - Grading
  - Summarizing
- Non-textual applications
  - Bio-informatics
  - Personality profiles/compatibility analysis

Misconceptions and Misunderstandings
- Driven by term co-occurrence
- Word order issues
- Data collection size
- Content and Meaning

Co-occurrence
- LSA starts with a kind of co-occurrence
- Appearing in the same document does not make words similar
- Similarity is determined by the effect of the word meaning on the system of equations

Word Order
- Word order effects occur almost entirely within single sentences
- Research indicates only around 10% of meaning is word order dependent (for English)
- [ order syntax? Much. Ignoring word Missed by is how ] [8]
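
The scrambled sentence on this slide makes the point that a bag-of-words model cannot see word order. The sketch below compares it with one plausible ordered reading; that ordered sentence is a reconstruction assumed for illustration and does not appear in the slides, and the letters-only tokenizer is likewise an assumption.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens, ignoring punctuation and word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


scrambled = "order syntax? Much. Ignoring word Missed by is how"
# Assumed ordered reading of the scrambled slide text (a reconstruction).
ordered = "How much is missed by ignoring word order syntax?"

same_bag = bag_of_words(scrambled) == bag_of_words(ordered)
```

Both strings yield the same counts, so a term-by-document matrix built from either one would contain an identical column.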

Data Collection Size
- Beware of small data collections
- Generally, use at least 100,000 documents

Content
- LSA builds its notion of meaning from the content of the data collection

The Problem Revisited
- Failure to capture meaning and provide insight
- Methods not universally applicable
- Need for specialized prior knowledge of data
- Require human interaction
- Not practical for large data sets

Conclusion
- LSA offers powerful capabilities for gaining insight and understanding the contents of a data collection
- LSA provides analysis techniques not available with other Text Analytic methods
- Small Bear Technologies provides the core technology, tools, and support for performing Latent Semantic Analysis

Suggested Reading
- Handbook of Latent Semantic Analysis; Landauer, T., McNamara, D., Dennis, S., Kintsch, W., Eds.; Lawrence Erlbaum Associates, Inc.: Mahwah, New Jersey, 2007.
- Indexing by Latent Semantic Analysis. Deerwester, S.; Dumais, S.; Furnas, G.; Landauer, T.; Harshman, R. Journal of the American Society for Information Science 1990, 41, 391-407.
- Improving the retrieval of information from external sources. Dumais, S. Behavior Research Methods, Instruments, & Computers 1991, 23, 229-236.
- An Introduction to Latent Semantic Analysis. Landauer, T.; Foltz, P.; Laham, D. Discourse Processes 1998, 25, 259-284.
- A solution to Plato's problem: The Latent Semantic Analysis Theory of acquisition, induction, and representation of Knowledge. Landauer, T.; Dumais, S. Psychological Review 1997, 104, 211-240.

References
[1] Berry, M.W., Large Sparse Singular Value Computations. In International Journal of Supercomputer Applications, 1992, Vol. 6, pp. 13-49.
[2] Berry, M.W., & Browne, M., Understanding Search Engines: Mathematical Modeling and Text Retrieval (2nd ed.), SIAM, Philadelphia, 2005.
[3] Berry, M.W., & Martin, D., Principal Component Analysis for Information Retrieval Applications. In Statistics: A series of textbooks and monographs: Handbook of Parallel Computing and Statistics, Chapman & Hall/CRC, Boca Raton, 2005, pp. 399-413.
[4] Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R., Indexing by Latent Semantic Analysis. In Journal of the American Society for Information Science, 1990, Vol. 41, pp. 391-407.
[5] Dumais, S., Improving the Retrieval of Information from External Sources. In Behavior Research Methods, Instruments, and Computers, 1991, Vol. 23, pp. 229-236.

References (cont.)
[6] Grimes, S., Text Analytics 2009: User Perspectives on Solutions and Providers. Alta Plana Research, 2009. http://altaplana.com/ta2009
[7] Landauer, T.K., On the Computational Basis of Cognition: Arguments from LSA. In The Psychology of Learning and Motivation, B.H. Ross (Ed.), Academic Press, New York, 2002, pp. 43-84.
[8] Landauer, T.K., LSA as a Theory of Meaning. In The Handbook of Latent Semantic Analysis, Landauer, McNamara, Dennis, & Kintsch (Eds.), Lawrence Erlbaum Associates, Inc., Mahwah, New Jersey, 2007, pp. 3-34.
[9] Landauer, T.K., & Dumais, S., A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. In Psychological Review, 1997, Vol. 104, pp. 211-240.
[10] Martin, D., & Berry, M., Mathematical Foundations Behind Latent Semantic Analysis. In The Handbook of Latent Semantic Analysis, Landauer, McNamara, Dennis, & Kintsch (Eds.), Lawrence Erlbaum Associates, Inc., Mahwah, New Jersey, 2007, pp. 35-55.

References (cont.)
[11] Martin, D., Martin, J., Berry, M., & Browne, M., Out-of-Core SVD Performance for Document Indexing. In Applied Numerical Mathematics, 2007, Vol. 14, No. 10.