Lecture 9: Classification and algorithmic methods


1/28 Lecture 9: Classification and algorithmic methods Måns Thulin Department of Mathematics, Uppsala University thulin@math.uu.se Multivariate Methods 17/5 2011

2/28 Outline What are algorithmic methods? Algorithmic methods for classification knn classification Decision trees Algorithmic versus probabilistic methods

3/28 Probabilistic methods Previously, we have looked at probabilistic methods for classification: methods based on statistical theory and model assumptions. In a statistical problem, the basic situation is the following: Nature → [black box] → Data. The probabilistic approach is to assume a model for what happens in the black box (normal distribution, ARIMA time series, linear model, Markov chains...). We hope that the models describe the black box accurately enough. "All models are wrong, but some are useful." - George Box Some statisticians, and indeed people from other fields as well, argue that it is time to think outside the box.

4/28 Algorithmic methods Suppose that we have a set of data with known classes. Without any model assumptions, we can use heuristics and good ideas to come up with new methods. We can create algorithms that create rules for classifying new points using the given training data. By splitting the given data into a training set and a test set, we can evaluate the performance of our algorithmic method. "All models are wrong, and increasingly you can succeed without them." - Peter Norvig, research director at Google As a motivating example, we'll look at a situation where it is more or less clear that we don't need fancy methods or model assumptions to classify new observations.

5/28 A toy example Example: consider a data set with two groups: red and blue. [Figure: knn classification scatter plot]

6/28 A toy example How should we classify the new black point? [Figure: knn classification scatter plot]

7/28 A toy example It seems reasonable to classify the point as being blue! [Figure: knn classification scatter plot]

8/28 A toy example How should we classify the new black point? [Figure: knn classification scatter plot]

9/28 A toy example It seems reasonable to classify the point as being red! [Figure: knn classification scatter plot]

10/28 A less nice example But what about this point? [Figure: knn classification scatter plot]

11/28 knn: basic idea In the first two examples, we could easily classify the point since all points in its neighbourhood had the same colour. What should we do when there is more than one colour in the neighbourhood? The knn algorithm classifies the new point by letting the k Nearest Neighbours (the k points that are closest to it) vote about the class of the new point.

12/28 knn: basic idea Look at the k = 1 closest neighbours. The point is classified as being blue, since the nearest neighbour is blue. [Figure: knn classification, k = 1]

13/28 knn: basic idea Look at the k = 2 closest neighbours. It is not clear how to classify the point (no colour has a majority). [Figure: knn classification, k = 2]

14/28 knn: basic idea Look at the k = 3 closest neighbours. The point is classified as being blue (2 votes against 1). [Figure: knn classification, k = 3]
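The voting rule from the last few slides can be sketched in a few lines of Python. The toy data below is made up for illustration (it is not the data from the figures), and the function name is my own:

```python
import math
from collections import Counter

def knn_classify(train, labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest
    training points, using Euclidean distance."""
    neighbours = sorted(zip(train, labels),
                        key=lambda p: math.dist(p[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy data: two groups, red and blue, as in the slides
train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["red", "red", "red", "blue", "blue", "blue"]
print(knn_classify(train, labels, (8.5, 8.5), k=3))  # prints "blue"
```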

15/28 knn: choosing k Clearly, the choice of k is very important. If k is too small, the algorithm becomes sensitive to noise points and outliers. If k is too large, the neighbourhood will probably include points from other classes. How should we choose k? This is a difficult question! There is no right answer. Often a test data set is used to investigate the performance for different k. Typically, we choose the k that has the lowest misclassification rate for the test data.

16/28 knn: no majority In our example, we encountered a problem when k = 2: no colour had a majority. What should we do in such cases? Flip a coin? This ignores some of the information that we have gathered! Let the closest neighbour decide? Or the k − 1 closest? A better solution is probably to use weighted votes, so that the votes from closer neighbours are seen as more important. This idea could be used in all cases, and not just when there is no majority. Look at k + 1 neighbours instead? Essentially, this means that when we don't have enough information to make a decision, we gather more information.
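One common way to weight the votes is by inverse distance, so that closer neighbours count more. The sketch below assumes that weighting scheme (the lecture does not prescribe a particular one), with toy data of my own:

```python
import math
from collections import defaultdict

def weighted_knn(train, labels, x, k):
    """Vote with weight 1/distance, so closer neighbours count more.
    This also resolves ties such as the k = 2 case in the slides."""
    near = sorted(zip(train, labels), key=lambda p: math.dist(p[0], x))[:k]
    score = defaultdict(float)
    for point, label in near:
        d = math.dist(point, x)
        score[label] += 1.0 / (d + 1e-9)   # epsilon guards against d = 0
    return max(score, key=score.get)

train = [(0, 0), (3, 0)]
labels = ["blue", "red"]
print(weighted_knn(train, labels, (1, 0), k=2))  # the closer blue point wins
```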

17/28 knn: some last comments knn is essentially a rank method: we measure the distance to all points in the data set and rank them accordingly. The k points with the lowest ranks are used to classify the new point. An important question is what we mean by close. Which distance measure should we use? Euclidean distance? Statistical distance? Mahalanobis? Should we look at standardized data? Is it meaningful to use distance measures if the data is binary or categorical? If some of the variables are categorical and some are continuous measurements? Are more general similarity measures useful?
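One concrete answer to the standardization question: if the variables live on very different scales, raw Euclidean distance is dominated by the large-scale variable, and rescaling each coordinate to mean 0 and standard deviation 1 (a simple form of statistical distance) fixes this. A small sketch with invented data:

```python
import math
import statistics

def standardize(data):
    """Rescale each coordinate to mean 0, sd 1, so that a variable
    measured on a large scale does not dominate the distance."""
    cols = list(zip(*data))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds))
            for row in data]

# Height in cm and weight in tonnes: raw Euclidean distance is
# driven almost entirely by the height coordinate.
data = [(150, 0.05), (180, 0.09), (160, 0.06), (175, 0.08)]
z = standardize(data)
print(math.dist(data[0], data[1]), math.dist(z[0], z[1]))
```

The Mahalanobis distance goes one step further and also accounts for correlations between the variables.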

18/28 Decision trees: basic idea Another popular algorithmic classification method is the decision tree. Have you ever played the game 20 questions? Decision trees are more or less that game! The idea is to classify the new observation by asking a series of questions. Depending on the answer to the first question, different second questions are asked, and so on. Questions are asked until a conclusion is reached.

19/28 Decision trees: basic idea Consider the following data set with vertebrate data:

Name           Body temp     Gives birth  Has legs  Class
Human          warm-blooded  yes          yes       mammal
Whale          warm-blooded  yes          no        mammal
Cat            warm-blooded  yes          yes       mammal
Cow            warm-blooded  yes          yes       mammal
Python         cold-blooded  no           no        reptile
Komodo dragon  cold-blooded  no           yes       reptile
Turtle         cold-blooded  no           yes       reptile
Salmon         cold-blooded  no           no        fish
Eel            cold-blooded  no           no        fish
Pigeon         warm-blooded  no           yes       bird
Penguin        warm-blooded  no           yes       bird

Decision tree example: see blackboard!

20/28 Decision trees: building the tree Given training data, how can we build the decision tree? There are many algorithms for building the tree. One of the earliest is Hunt's algorithm: Let D_t be the set of observations belonging to a node t. 1. If all observations in D_t are of the same class i, then t is a leaf node labeled as i. 2. Otherwise, use some condition to partition the observations into two smaller subsets. A child node is created for each outcome of the condition and the observations are distributed to the children based on the outcomes. When should the splitting stop? Other criteria are sometimes used, but a simple and reasonable stopping criterion is to stop splitting when all remaining nodes are leaf nodes. How, then, do we choose the condition for partitioning?
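The recursion in Hunt's algorithm can be sketched for categorical attributes. This is a simplified illustration, not the lecture's implementation: it splits on the first remaining attribute, whereas a real tree builder would choose the split with the best impurity reduction (see the next slides):

```python
from collections import Counter

def hunt(rows, labels, attrs):
    """Sketch of Hunt's algorithm for categorical attributes.
    Returns a class label (leaf) or (attribute, {value: subtree})."""
    if len(set(labels)) == 1:              # rule 1: pure node -> leaf
        return labels[0]
    if not attrs:                          # no attributes left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    a = attrs[0]                           # rule 2: split and recurse
    tree = {}
    for value in {row[a] for row in rows}:
        sub = [(r, y) for r, y in zip(rows, labels) if r[a] == value]
        srows, slabels = zip(*sub)
        tree[value] = hunt(list(srows), list(slabels), attrs[1:])
    return (a, tree)

# A cut-down version of the vertebrate data
rows = [
    {"body_temp": "warm", "gives_birth": "yes"},
    {"body_temp": "warm", "gives_birth": "no"},
    {"body_temp": "cold", "gives_birth": "no"},
]
labels = ["mammal", "bird", "reptile"]
tree = hunt(rows, labels, ["body_temp", "gives_birth"])
```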

21/28 Decision trees: the best split Let p(i|t) be the fraction of observations in class i at the node t and let c be the number of classes. The Gini for node t is defined as Gini(t) = 1 − Σ_{i=1}^{c} p(i|t)². Gini is a measure of impurity. If all observations belong to the same class, then Gini(t) = 1 − 1² − 0² − ... − 0² = 0. The Gini is maximized when all classes have the same number of observations at t. One criterion for splitting could be to minimize the Gini in the next level of the tree. That way we will get purer nodes.
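The Gini formula translates directly into code. A minimal sketch, with the two extreme cases from the slide as examples:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["mammal"] * 4))              # pure node: 0.0
print(gini(["mammal", "bird"] * 2))      # even two-class split: 0.5 (the maximum)
```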

22/28 Decision trees: the best split The situation becomes a bit more complicated if we take into account that the children can have different numbers of observations. To account for this, we try to maximize the gain: Gain = Gini(t) − Σ_{j=1}^{k} (n_{v_j}/n_t) Gini(v_j), where the v_j are the children and n_i is the number of observations at node i. This is equivalent to minimizing Σ_{j=1}^{k} (n_{v_j}/n_t) Gini(v_j). Vertebrate example: see blackboard! Sometimes other impurity measures than Gini are used. One example is the entropy: Entropy(t) = − Σ_{i=1}^{c} p(i|t) log₂ p(i|t).
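Both the gain and the entropy are one-liners given the class counts. The example split below is invented (warm-blooded vertebrates split on "gives birth"), not the blackboard example:

```python
import math
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(parent, children):
    """Gain = parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    return gini(parent) - sum(len(ch) / n * gini(ch) for ch in children)

# A perfect split: each child is pure, so the gain equals the parent's Gini
parent = ["mammal"] * 4 + ["bird"] * 2
children = [["mammal"] * 4, ["bird"] * 2]
print(gain(parent, children))
```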

23/28 Decision trees: vertebrate data

Name           Body temp     Gives birth  Has legs  Class
Human          warm-blooded  yes          yes       mammal
Whale          warm-blooded  yes          no        mammal
Cat            warm-blooded  yes          yes       mammal
Cow            warm-blooded  yes          yes       mammal
Python         cold-blooded  no           no        reptile
Komodo dragon  cold-blooded  no           yes       reptile
Turtle         cold-blooded  no           yes       reptile
Salmon         cold-blooded  no           no        fish
Eel            cold-blooded  no           no        fish
Pigeon         warm-blooded  no           yes       bird
Penguin        warm-blooded  no           yes       bird

24/28 Decision trees: extensions Some further remarks: In our example, we only used binary splits, where each internal node has two children. It is also possible to use non-binary splits, where each internal node can have more than two children. When the data is continuous, it is perhaps not as easy to choose the split criteria. Example: animal weight. A node question could be: is weight <10 kg? Is this a better question than is weight <11 kg? or is weight <9 kg? Having looked at two algorithmic methods, we will now compare the merits of algorithmic and probabilistic methods.
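A standard answer to the continuous-split question is to try the midpoints between consecutive sorted values and keep the threshold with the lowest weighted Gini. A sketch under that assumption, with invented weight data:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct sorted values; keep the
    threshold with the lowest size-weighted Gini of the two children."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_t = float("inf"), None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_score, best_t = score, t
    return best_t

weights = [2, 4, 6, 30, 45, 80]          # kg
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
print(best_threshold(weights, labels))   # midpoint between 6 and 30: 18.0
```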

25/28 Algorithmic versus probabilistic methods: pros Probabilistic methods: Mathematical/probabilistic foundation. Possible to derive optimal methods. Often give nice interpretations of the results. Possible to control error rates by choosing significance levels. Algorithmic methods: No need for model assumptions. Can be optimized using the test data. Often have a good heuristic foundation. Some methods work well when p > n.

26/28 Algorithmic versus probabilistic methods: cons Probabilistic methods: May be based on asymptotic results that do not work well when the sample size is small. The model may be a poor description of nature. The conclusions are only about the model's mechanism and not about the true mechanism. Evaluating the model fit can be difficult, especially in higher dimensions. Algorithmic methods: Rely heavily on the training data, which may not be representative. Difficult or impossible to find optimal methods. Likely not as good as a probabilistic method when its model is accurate. Some methods lack solid theoretical support.

27/28 Algorithmic versus probabilistic methods: discussion A paper by Leo Breiman from 2001 (Statistical modeling: the two cultures, Statistical Science, Vol. 16) discusses the use of algorithmic methods in modern statistics. Breiman argues that: The data and the problem at hand should lead to the solution, not prior ideas about what kind of methods are good. The statistician should focus on finding a good solution, regardless of whether that solution uses algorithmic or probabilistic methods. How good a method is should be judged by the predictive accuracy of the method on the test data. This last point is perhaps controversial; we often judge probabilistic methods by theoretical properties.

28/28 Algorithmic versus probabilistic methods: discussion Some further comments and questions: Are algorithmic methods simply even more non-parametric "non-parametric methods"? Today it is not uncommon for new probabilistic methods to be published with nothing but simulation results to back them up (as the underlying mathematics can be quite complicated). Is this any different from the support for algorithmic methods? There are some very interesting research problems in trying to provide probabilistic support for algorithmic methods. Regardless of how we feel about algorithmic methods, we should not be afraid to introduce new tools to our statistical toolbox!