
Handout 2: More Similarity Searching; Multidimensional Scaling
36-350: Data Mining, August 30, 2006
Reading: Principles of Data Mining, sec. 14.3 (skip 14.3.3 for now) and 14.4.

Let's recap similarity searching for documents. We represent each document as a bag of words, i.e., a vector giving the number of times each word occurred in the document. This abstracts away all the grammatical structure, context, etc., leaving us with a matrix whose rows are feature vectors: a data frame. To find documents which are similar to a given document Q, we calculate the distance between Q and all the other documents, i.e., the distance between their feature vectors, and return the k closest documents. Today we're going to look at some wrinkles and extensions.

Stemming

As I mentioned in lecture, it is a lot easier to decide what counts as a word in English than in some other languages.[1] Even so, we need to decide whether "car" and "cars" are the same word, for our purposes, or not. Stemming takes derived forms of words (like "cars", "flying") and reduces them to their stem ("car", "fly"). Doing this well requires linguistic knowledge (so the system doesn't think the stem of "potatoes" is "potatoe"), and it can even be harmful (if the document has "Saturns", plural, it's most likely about the cars).

Multidimensional Scaling

The bag-of-words vectors representing our documents generally live in spaces with lots of dimensions, certainly more than three, which are hard for ordinary humans to visualize. However, we can compute the distance between any two vectors, so we know how far apart they are. Multidimensional scaling (MDS) is the general name for a family of algorithms which take high-dimensional vectors and map them down to two- or three-dimensional vectors, trying to preserve all the relevant distances. (See sec. 3.7 in the textbook for some algorithmic details.) There is almost always some distortion. We will see a lot of multidimensional scaling plots.
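The recap above, representing documents as word-count vectors and returning the k closest, can be sketched in a few lines of Python. The toy documents, the vocabulary construction, and the function names here are illustrative, not from the handout:

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Represent a document as a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def k_nearest(query_vec, doc_vecs, k):
    """Indices of the k documents whose vectors are closest to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: distance(query_vec, doc_vecs[i]))
    return order[:k]

docs = ["the car needs new tires",
        "my car is a saturn",
        "the rocket reached orbit"]
vocab = sorted(set(w for d in docs for w in d.split()))
vecs = [bag_of_words(d, vocab) for d in docs]
query = bag_of_words("saturn car problems", vocab)  # the query is itself a small document
print(k_nearest(query, vecs, k=2))  # → [1, 0]: the Saturn post is closest
```

A real system would also apply the normalizations and IDF weighting discussed later in the handout before computing distances.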
[1] The Turkish example I was trying to remember is "yapabilecekdiyseniz", "if you were going to be able to do."
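The MDS idea can be illustrated with a crude sketch. This is not one of the algorithms from sec. 3.7 of the textbook; it is just gradient descent on the mismatch ("stress") between the low-dimensional distances and the target distances, and all names and parameter values are illustrative assumptions:

```python
import random

def mds(D, dim=2, steps=2000, lr=0.05, seed=0):
    """Crude metric MDS: place n points in `dim` dimensions and nudge them
    until their pairwise distances match the target distance matrix D."""
    rng = random.Random(seed)
    n = len(D)
    X = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                diff = [X[i][k] - X[j][k] for k in range(dim)]
                d = sum(c * c for c in diff) ** 0.5 or 1e-9
                g = (d - D[i][j]) / d  # too far apart: pull together; too close: push apart
                for k in range(dim):
                    X[i][k] -= lr * g * diff[k]
    return X

# Target distances from a 3-4-5 right triangle, which embeds exactly in 2-D,
# so in this toy case there is no distortion to speak of.
D = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
X = mds(D)
recovered = [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) ** 0.5
              for j in range(3)] for i in range(3)]
print(max(abs(recovered[i][j] - D[i][j]) for i in range(3) for j in range(3)))
```

With genuinely high-dimensional data the target distances usually cannot all be matched in two dimensions, which is exactly the distortion mentioned above.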

Classification

A very important data-mining task is classifying new pieces of data, that is, assigning them to one of a fixed number of classes. Last time, our two classes were "about automobiles" and "about motorcycles". Usually, new data doesn't come with a class label, so we have to somehow guess the class from the features.[2] With a nearest-neighbor strategy, we guess that the new object is in the same class as the closest already-classified object. (We saw this at the end of the last lecture.) With a prototype strategy, we pick out the most representative member of each class, or perhaps the average of each class, as its prototype, and guess that new objects belong to the class with the closer prototype. We will see many other classifier rules, in addition to these two, but these are ones we can apply as soon as we know how to calculate distances.

Queries Are Documents

I promised that we could avoid having to come up with an initial document. The trick is to realize that a query, whether an actual sentence ("What are the common problems of the 2001 model year Saturn?") or just a list of key words ("problems 2001 model Saturn"), is a small document. If we represent user queries as bags of words, we can use our similarity-searching procedure on them. If this works, we have a search technique which finds mostly relevant things (the precision is high), and which finds most of the relevant items (the recall is high).

Inverse Document Frequency (IDF) Weighting

We are using features (word counts) to identify documents which are relevant to our query. Not all features are going to be equally useful. Some words are so common that they give us almost no ability at all to discriminate between relevant and irrelevant documents. In (most) collections of English documents, looking at "the", "of", "a", etc., is a waste of time.
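The two classifier rules above, nearest-neighbor and prototype, can be sketched as follows; the toy two-dimensional feature vectors and their labels are made up for illustration:

```python
def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def nearest_neighbor(x, labeled):
    """Guess the class of the closest already-classified object."""
    vec, label = min(labeled, key=lambda pair: dist(x, pair[0]))
    return label

def prototype(x, labeled):
    """Average each class's vectors into a prototype, then guess the
    class whose prototype is closer."""
    by_class = {}
    for vec, label in labeled:
        by_class.setdefault(label, []).append(vec)
    protos = {label: [sum(col) / len(vecs) for col in zip(*vecs)]
              for label, vecs in by_class.items()}
    return min(protos, key=lambda label: dist(x, protos[label]))

# Toy 2-D feature vectors, e.g. counts of two discriminating words.
labeled = [([5, 0], "autos"), ([4, 1], "autos"),
           ([0, 4], "motorcycles"), ([1, 5], "motorcycles")]
print(nearest_neighbor([4, 0], labeled), prototype([4, 0], labeled))  # → autos autos
```

Both rules need nothing beyond a distance function, which is why they are usable as soon as we can compare feature vectors.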
We could handle this with a fixed list of stop words, which we just don't count, but this is at once too crude (all or nothing) and too much work (we need to think up the list). Inverse document frequency (IDF) is a more adaptive approach. The document frequency of a word w is the number of documents it appears in, n_w. The IDF weight of w is

    IDF(w) = log(N / n_w)

where N is the total size of our collection. Now when we make our bag-of-words vector for the document Q, the number of times w appears in Q, Q_w, is multiplied by IDF(w). Notice that if w appears in every document, n_w = N and it gets an IDF weight of zero; we won't use it to calculate distances. This takes care of most of the things we'd use a list of stop words for, but it also takes into account, implicitly, the kind of documents we're using. (In a database of papers on genetics, "gene" and "DNA" are going to have IDF weights near zero too.) On the other hand, if w appears in only a few documents, it will get a weight of about log N, and all documents containing w will tend to be close to each other.

[2] If it does come with a label, we read the label.
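A minimal sketch of the IDF formula above, assuming each document has been reduced to its set of distinct words; the function name and the toy collection are illustrative:

```python
import math

def idf_weights(docs):
    """IDF(w) = log(N / n_w): N documents in all, n_w of them containing w."""
    N = len(docs)
    vocab = set().union(*docs)
    return {w: math.log(N / sum(w in d for d in docs)) for w in vocab}

# Toy collection: each document is a set of its distinct words.
docs = [{"the", "car", "engine"},
        {"the", "car", "saturn"},
        {"the", "rocket"}]
idf = idf_weights(docs)
print(idf["the"])               # in every document: weight exactly 0.0
print(round(idf["saturn"], 3))  # in 1 of 3 documents: log(3) ≈ 1.099
```

Multiplying each count Q_w by idf[w] before computing distances gives the weighted vectors used in Table 1.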

    Normalization       Equal weight   IDF weight
    None                     83            79
    Document length          63            60
    Euclidean length         59            21

Table 1: Number of mis-classifications in a larger (199-document) collection of posts from rec.autos and rec.motorcycles, for different normalizations of Euclidean distance, with and without IDF weighting. (Classification is by the nearest-neighbor method.)

Table 1 shows how including IDF weighting improves our ability to classify posts as either about cars or about motorcycles. You could tell a similar story about any increasing function, not just log, but log happens to work very well in practice, in part because it's not very sensitive to the exact number of documents. (This is not the same log we will see in information theory, nor the log in psychophysics.) Notice also that this is not guaranteed to work. Even if w appears in every document, so IDF(w) = 0, it might be common in some of them and rare in others, so we'll ignore what might have been useful information. (Maybe genetics papers about laboratory procedures use "DNA" more often, and papers about hereditary diseases use "gene" more often.)

This is our first look at the problem of feature selection: how do we pick out good, useful features from the very large, perhaps infinite, collection of possible features? We will come back to this in various ways throughout the course. Right now, concentrate on the fact that in search, and in other classification problems, we are looking for features that let us discriminate between the classes.

Feedback

People are much better at telling whether you've found what they're looking for than at explaining what it is that they're looking for. Queries, though, are users trying to explain what they're looking for (to a computer, no less), so they're often not very good. An important idea in data mining is that people should do things at which they are better than computers, and vice versa: here they should be deciders, not explainers.
Rocchio's algorithm takes feedback from the user about which documents were relevant, and then refines the search, giving more weight to what they like, and less to what they don't like. The user gives the system some query, whose bag-of-words vector is Q_t. The system responds with various documents, some of which the user marks as relevant (R) and others as not relevant (NR). The system then modifies the query vector:

    Q_{t+1} = α Q_t + (β/|R|) Σ_{doc in R} doc - (γ/|NR|) Σ_{doc in NR} doc

where |R| and |NR| are the numbers of relevant and non-relevant documents,

and α, β and γ are positive constants. α says how much continuity there is between the old search and the new one; β and γ gauge our preference for recall (we find more of the relevant items) versus precision (more of what we find is relevant). The system then runs another search with Q_{t+1}, and the cycle starts over. As this is repeated, Q_t becomes closer to the bag-of-words vector which best represents what the user has in mind, assuming they have something definite and consistent in mind.

Notice: A word can't appear in a document a negative number of times, so ordinarily bag-of-words vectors have non-negative components. Q_t, however, can easily come to have negative components, representing words whose presence is evidence that the document is not relevant. Returning to the example of problems with used 2001 Saturns, we probably don't want anything which contains "Titan" or "Rhea", since it's either about mythology or astronomy, and giving our query negative components for those words suppresses those documents.

Rocchio's algorithm can be applied to any kind of similarity-based search, not just to text. It is closely related to a lot of algorithms in machine learning which incrementally adjust in the direction of what has worked and away from what has not: the perceptron algorithm for learning linear classifiers, the stochastic approximation algorithm for estimating functions and curves, reinforcement learning for making decisions. These similarities are no accident; they are all variants on the idea of evolution by means of natural selection.
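One round of the Rocchio update can be sketched as follows. The particular values of α, β and γ, and the three-word toy vocabulary, are illustrative assumptions, not from the handout:

```python
def rocchio_update(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Q_{t+1} = alpha*Q_t + (beta/|R|)*sum(R) - (gamma/|NR|)*sum(NR)."""
    def mean(vecs):
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    r_bar, nr_bar = mean(relevant), mean(nonrelevant)
    return [alpha * qi + beta * ri - gamma * ni
            for qi, ri, ni in zip(q, r_bar, nr_bar)]

# Toy vocabulary: ("saturn", "problems", "titan").
q = [1.0, 1.0, 0.0]
relevant = [[2.0, 1.0, 0.0]]     # the user liked a post about Saturn cars
nonrelevant = [[1.0, 0.0, 3.0]]  # and rejected one about Saturn's moons
print(rocchio_update(q, relevant, nonrelevant))
# → [2.25, 1.75, -0.75]: the "titan" component has gone negative,
# which suppresses astronomy documents in the next round of search.
```

Iterating this update with fresh feedback each round is exactly the cycle described above.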

[Figure: MDS plots of the posts using the 10 best words. Panels: un-normalized counts, 1 error (picks 4 for 3); normalized by document length, 1 error (picks 5 for 2); normalized by Euclidean length, no errors.]

[Figure: MDS plots using all 182 words. Panels: equal weighting, 5 errors (as bad as guessing); IDF weighting, 3 errors; and, for comparison, the 10 best words (from last time).]


[Figure: classifying a test document by the nearest-neighbor method and by the prototype method; here each prototype is the average of the already-labeled documents in its class.]