No Free Lunch, Bias-Variance & Ensembles

09s1: COMP9417 Machine Learning and Data Mining
No Free Lunch, Bias-Variance & Ensembles
May 27, 2009

Acknowledgement: Material derived from slides for the book Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997 (http://www-2.cs.cmu.edu/~tom/mlbook.html); slides by Andrew W. Moore, available at http://www.cs.cmu.edu/~awm/tutorials; the book Data Mining, Ian H. Witten and Eibe Frank, Morgan Kaufmann, 2000 (http://www.cs.waikato.ac.nz/ml/weka); the book Pattern Classification, Richard O. Duda, Peter E. Hart, and David G. Stork, copyright (c) 2001 by John Wiley & Sons, Inc.; and the book Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani and Jerome Friedman, (c) 2001, Springer.

Aims

This lecture aims to develop your understanding of some recent advances in machine learning. Following it you should be able to:
- outline the No Free Lunch Theorem
- describe the framework of the bias-variance decomposition
- define the method of bagging
- define the method of boosting

Some questions about Machine Learning
- Are there reasons to prefer one learning algorithm over another?
- Can we expect any method to be superior overall?
- Can we even find an algorithm that is overall superior to random guessing?

Relevant Weka methods: Bagging, Random Forests, AdaBoostM1, Stacking, SMO

No Free Lunch Theorem

- Uniformly averaged over all target functions, the expected off-training-set error for all learning algorithms is the same.
- Even for a fixed training set, averaged over all target functions, no learning algorithm yields an off-training-set error that is superior to any other.

No Free Lunch example

Assuming that the training set D can be learned correctly by all algorithms, averaged over all target functions no learning algorithm gives an off-training-set error superior to any other:

    \sum_F \left[ E_1(E \mid F, D) - E_2(E \mid F, D) \right] = 0

Therefore, all statements of the form "learning algorithm 1 is better than algorithm 2" are ultimately statements about the relevant target functions.

No Free Lunch example (continued)

    x     F    h1   h2
    000   1    1    1   }
    001  -1   -1   -1   } training set D
    010   1    1    1   }
    011  -1    1   -1
    100   1    1   -1
    101  -1    1   -1
    110   1    1   -1
    111   1    1   -1

Off-training-set errors: E_1(E | F, D) = 0.4 and E_2(E | F, D) = 0.6.

BUT if we have no prior knowledge about which F we are trying to learn, neither algorithm is superior to the other:
- both fit the training data D correctly, but there are 2^5 target functions consistent with D
- for each such target function there is exactly one other function whose output is inverted on each of the off-training-set patterns, so the performance of algorithms 1 and 2 will be inverted as well, ensuring an average error difference of zero
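For this particular example the claim can be verified directly by enumeration. The sketch below is not from the slides; the pattern labels and the predictions of h1 and h2 are read off the table above, and the off-training-set error difference is averaged over all 2^5 target functions consistent with D.

```python
# Verify the No Free Lunch claim for the example above: averaged over all
# 2^5 target functions consistent with D, the off-training-set errors of
# h1 and h2 are identical.
from itertools import product

# Off-training-set patterns and the fixed predictions of the two algorithms
ots_patterns = ["011", "100", "101", "110", "111"]
h1 = {"011": 1, "100": 1, "101": 1, "110": 1, "111": 1}
h2 = {"011": -1, "100": -1, "101": -1, "110": -1, "111": -1}

diffs = []
for labels in product([-1, 1], repeat=len(ots_patterns)):
    F = dict(zip(ots_patterns, labels))   # one target function consistent with D
    e1 = sum(h1[x] != F[x] for x in ots_patterns) / len(ots_patterns)
    e2 = sum(h2[x] != F[x] for x in ots_patterns) / len(ots_patterns)
    diffs.append(e1 - e2)

print(sum(diffs) / len(diffs))   # 0.0: neither algorithm is superior on average
```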

A Conservation Theorem of Generalization Performance

For every possible learning algorithm for binary classification, the sum of performance over all possible target functions is exactly zero.
- on some problems we get positive performance, so there must be other problems for which we get an equal and opposite amount of negative performance
- it is the assumptions about the learning domains that are relevant

Ugly Duckling Theorem

In the absence of assumptions there is no privileged or "best" feature representation. In fact, even the notion of similarity between patterns depends on assumptions: using a finite number of predicates to distinguish any two patterns, the number of predicates shared by any two such patterns is constant and independent of those patterns. Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.

Bias-variance decomposition

- Theoretical tool for analyzing how much a specific training set affects the performance of a classifier
- Assume we have an infinite number of classifiers built from different training sets of size n
- The bias of a learning scheme is the expected error of the combined classifier on new data
- The variance of a learning scheme is the expected error due to the particular training set used
- Total expected error: bias + variance

Bias-variance: a trade-off

Easier to see with regression, as in the following figure(1) (to see the details you will have to zoom in in your viewer):
- each column represents a different model class g(x), shown in red
- each row represents a different set of n = 6 training points, D_i, randomly sampled from the target function F(x) with noise, shown in black
- probability functions of the mean squared error E are shown

(1) from: Elements of Statistical Learning by Hastie, Tibshirani and Friedman (2001)
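The decomposition can also be estimated by simulation. A minimal sketch (my own illustration, not from the slides): it assumes a target F(x) = sin(2*pi*x) with Gaussian noise and a cubic least-squares fit, roughly in the spirit of model c) in the figure; many training sets of size n = 6 are drawn, the spread of the fitted curves across training sets estimates the variance, and the error of their average estimates the squared bias.

```python
# Monte Carlo estimate of bias^2 and variance for a cubic fit to noisy data.
# Assumptions (not fixed by the slides): target sin(2*pi*x), noise sd 0.3.
import numpy as np

rng = np.random.default_rng(0)
n, n_sets, degree, noise = 6, 500, 3, 0.3
x_test = np.linspace(0, 1, 50)
f_test = np.sin(2 * np.pi * x_test)          # noise-free target on a test grid

preds = np.empty((n_sets, x_test.size))
for i in range(n_sets):
    x = rng.uniform(0, 1, n)                  # a fresh training set D_i of size n
    y = np.sin(2 * np.pi * x) + rng.normal(0, noise, n)
    coeffs = np.polyfit(x, y, degree)         # fit the model g(x) to D_i
    preds[i] = np.polyval(coeffs, x_test)

bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)  # error of the average model
variance = np.mean(preds.var(axis=0))                  # spread across training sets
print(f"bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```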

Bias-variance: a trade-off (continued)

- a) is very poor: a linear model with fixed parameters independent of the training data; high bias, zero variance
- b) is better: a linear model with fixed parameters independent of the training data; slightly lower bias, zero variance
- c) is a cubic model with parameters trained by mean-squared error on the training data; low bias, moderate variance
- d) is a linear model with parameters adjusted to fit each training set; intermediate bias and variance
- training with more data (n approaching infinity) would give c) a bias approaching a small value due to noise, but not d); the variance of all models would approach zero

Ensembles: combining multiple models

- Basic idea of ensembles (or "meta" learning schemes): build different experts and let them vote
- Advantage: often improves predictive performance
- Disadvantage: produces output that is very hard to interpret
- Notable schemes: bagging, boosting, stacking; these can be applied to both classification and numeric prediction problems

Bootstrap error estimation

Estimating the error rate of a learning method on a data set by sampling from the data set with replacement:
- e.g. sample from n instances, with replacement, n times to generate another data set of n instances
- the new data set (almost certainly) contains some duplicate instances and does not contain others; the instances left out are used as the test set
- the chance of an instance not being picked is (1 - 1/n)^n ≈ e^{-1} ≈ 0.368, so about 0.632 of the instances end up in the training set
- 0.632 bootstrap error estimate = 0.632 × err_test + 0.368 × err_train
- repeat and average with different bootstrap samples
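A minimal sketch of the 0.632 bootstrap estimate described above. It assumes a scikit-learn-style classifier with fit/predict methods and numpy arrays X, y; neither is specified on the slides.

```python
# 0.632 bootstrap error estimate: train on a bootstrap sample, test on the
# instances that were never picked, combine the two error rates, and average
# over repeated bootstrap samples.
import numpy as np

def bootstrap_632(clf, X, y, n_repeats=100, rng=np.random.default_rng(0)):
    n = len(y)
    estimates = []
    for _ in range(n_repeats):
        idx = rng.integers(0, n, size=n)          # sample n instances with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # instances never picked (~36.8%)
        if oob.size == 0:
            continue
        clf.fit(X[idx], y[idx])
        err_train = np.mean(clf.predict(X[idx]) != y[idx])
        err_test = np.mean(clf.predict(X[oob]) != y[oob])
        estimates.append(0.632 * err_test + 0.368 * err_train)
    return float(np.mean(estimates))
```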

Bootstrap Aggregation (Bagging)

- Employs the simplest way of combining predictions: voting/averaging
- Each model receives equal weight
- Generalized version of bagging:
  - sample several training sets of size n (instead of just having one training set of size n)
  - build a classifier for each training set
  - combine the classifiers' predictions
- This improves performance in almost all cases if the learning scheme is unstable (e.g. decision trees)

Bagging

- Bagging reduces variance by voting/averaging, thus reducing the overall expected error
- In the case of classification there are pathological situations where the overall error might increase
- Usually, the more classifiers the better
- Problem: we only have one dataset! Solution: generate new datasets of size n by sampling with replacement from the original dataset
- Can help a lot if the data is noisy

Bagging algorithm

Learning (model generation):
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances with replacement from the training set.
    Apply the learning algorithm to the sample.
    Store the resulting model.

Classification:
  For each of the t models:
    Predict the class of the instance using the model.
  Return the class that has been predicted most often.

Bagging trees: an experiment with simulated data

- training sample of size n = 30, two classes, five features
- Pr(y = 1 | x_1 <= 0.5) = 0.2 and Pr(y = 1 | x_1 > 0.5) = 0.8
- test sample of size 2000 from the same population
- fit classification trees to the training sample, using 200 bootstrap samples
- the trees are different (tree induction is unstable) and therefore have high variance; averaging reduces variance and leaves bias unchanged
- (graph: test error for the original and bagged trees; green = vote, purple = average probabilities)
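A minimal sketch of the bagging algorithm above (an illustration, not the Weka implementation): it assumes scikit-learn's DecisionTreeClassifier as the unstable base learner and integer class labels 0..k-1, neither of which is fixed by the pseudocode.

```python
# Bagging: train t trees on bootstrap samples, classify by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleBagger:
    def __init__(self, t=200, rng=np.random.default_rng(0)):
        self.t, self.rng, self.models = t, rng, []

    def fit(self, X, y):
        n = len(y)
        for _ in range(self.t):
            idx = self.rng.integers(0, n, size=n)   # bootstrap sample of size n
            self.models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.models])   # shape (t, n_samples)
        # return, per instance, the class predicted most often (assumes int labels >= 0)
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```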

Bagging trees

(Slides 19-22, titled "Bagging trees", contain figures only: the graphs referred to on the previous slide.)

Bagging trees (continued)

The news is not all good: when we bag a model, any simple structure is lost.
- this is because a bagged tree is no longer a tree... but a forest
- this drastically reduces any claim to comprehensibility
- stable models like nearest neighbour are not very affected by bagging; unstable models like trees are most affected
- usually, their design for interpretability (bias) leads to instability
- more recently, random forests (see Breiman's web-site)

Boosting

- Also uses voting/averaging, but each model is weighted according to its performance
- Iterative procedure: new models are influenced by the performance of previously built ones
- A new model is encouraged to become an expert for instances classified incorrectly by earlier models
- Intuitive justification: models should be experts that complement each other
- There are several variants of this algorithm...

The strength of weak learnability

Boosting a weak learner reduces error. Schapire (1990) gave the first boosting algorithm and showed that weak learners can be boosted into strong learners. The original setting:
- the weak learner learns an initial hypothesis h_1 from N examples
- it next learns a hypothesis h_2 from a new set of N examples, half of which are misclassified by h_1
- it then learns a hypothesis h_3 from N examples on which h_1 and h_2 disagree
- the boosted hypothesis h gives a voted prediction on instance x: if h_1(x) = h_2(x) then return the agreed prediction, else return h_3(x)
- if h_1 has error α < 0.5 then the error of h is bounded by 3α² - 2α³, i.e. better than α
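A quick numeric check of the bound (a worked example, not from the slides):

```python
alpha = 0.3                          # error of the weak learner h1, must be < 0.5
print(3 * alpha**2 - 2 * alpha**3)   # 0.216 < 0.3: the boosted hypothesis is strictly better
```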

AdaBoost.M1

Learning (model generation):
  Assign equal weight to each training instance.
  For each of t iterations:
    Apply the learning algorithm to the weighted dataset and store the resulting model.
    Compute the error e of the model on the weighted dataset and store the error.
    If e is equal to zero, or e is greater than or equal to 0.5:
      Terminate model generation.
    For each instance in the dataset:
      If the instance is classified correctly by the model:
        Multiply the weight of the instance by e / (1 - e).
    Normalize the weights of all instances.

Classification:
  Assign a weight of zero to all classes.
  For each of the t (or fewer) models:
    Add -log(e / (1 - e)) to the weight of the class predicted by the model.
  Return the class with the highest weight.

More on boosting

- Can be applied without weights, using resampling with probability determined by the weights
  - disadvantage: not all instances are used
  - advantage: resampling can be repeated if the error exceeds 0.5
- Stems from computational learning theory
- Theoretical result: training error decreases exponentially
- Also: works if the base classifiers are not too complex and their error doesn't become too large too quickly

A bit more on boosting

- Puzzling fact: generalization error can decrease long after training error has reached zero
- This seems to contradict Occam's Razor! However, the problem disappears if the margin (confidence) is considered instead of the error
- Margin: the difference between the estimated probability for the true class and that of the most likely other class (lies between -1 and 1)
- Boosting works with weak learners: the only condition is that the error α doesn't exceed 0.5 (slightly better than random guessing)
- LogitBoost: a more sophisticated boosting scheme in Weka (based on additive logistic regression)
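A minimal sketch of AdaBoost.M1 as described above (an illustration, not Weka's AdaBoostM1): scikit-learn decision stumps are assumed as the weak learner, which the pseudocode leaves open, and `classes` is the list of possible labels, e.g. [0, 1].

```python
# AdaBoost.M1: reweight instances each round, then classify by weighted vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, t=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # assign equal weight to each instance
    models, alphas = [], []
    for _ in range(t):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = stump.predict(X) != y
        e = w[wrong].sum() / w.sum()             # weighted error on the training set
        if e == 0 or e >= 0.5:                   # terminate model generation
            break
        models.append(stump)
        alphas.append(-np.log(e / (1 - e)))      # model weight used at classification time
        w[~wrong] *= e / (1 - e)                 # shrink weights of correctly classified instances
        w /= w.sum()                             # normalize the weights
    return models, alphas

def adaboost_predict(models, alphas, X, classes):
    scores = np.zeros((len(X), len(classes)))    # weight of zero for all classes
    for m, a in zip(models, alphas):
        pred = m.predict(X)
        for j, c in enumerate(classes):
            scores[pred == c, j] += a            # add -log(e/(1-e)) to the predicted class
    return np.array(classes)[scores.argmax(axis=1)]   # class with the highest weight
```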

Boosting reduces error

AdaBoost applied to a weak learning system can reduce the training error exponentially as the number of component classifiers is increased.
- it focuses on difficult patterns: the training error of each successive classifier on its own weighted training set is generally larger than its predecessor's
- the training error of the ensemble will decrease
- typically, the test error of the ensemble will decrease also

Boosting enlarges the model class

A two-dimensional, two-category classification task with three component linear classifiers:
- the final classification, by voting over the component classifiers, gives a non-linear decision boundary
- each component is a weak learner (error only slightly below 0.5)
- the ensemble classifier has lower error than any single component
- the ensemble classifier has lower error than a single classifier trained on the complete training set
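The exponential decrease in training error can be made precise with a standard bound from the boosting literature (not stated on the slides): if the t-th component classifier has weighted error ε_t = 1/2 - γ_t, then after T rounds AdaBoost's training error satisfies

\[
\mathrm{err}_{\mathrm{train}} \;\le\; \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1-\varepsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
\;\le\; \exp\!\Bigl(-2\sum_{t=1}^{T} \gamma_t^2\Bigr),
\]

so as long as every weak learner beats random guessing by some margin γ_t ≥ γ > 0, the bound falls exponentially in T.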

Boosting enlarges the model class: an experiment with simulated data

- 100 instances, two features, two classes; the target decision boundary is x_1 + x_2 = 1
- learned classifier: a single split in x_1 or x_2 giving the largest decrease in training-set misclassification error
- voting or averaging probabilities does not help over many single splits
- however, repeated iterations of boosting give a closer and closer approximation to the diagonal

Stacking

- Hard to analyze theoretically: "black magic"
- Uses a meta learner instead of voting to combine the predictions of base learners
- Predictions of the base learners (level-0 models) are used as input for the meta learner (level-1 model)
- Base learners are usually different learning schemes
- Predictions on the training data can't be used to generate data for the level-1 model! A cross-validation-like scheme is employed instead (see the sketch below)

Stacking (continued)

- If the base learners can output probabilities, it's better to use those as input to the meta learner
- Which algorithm should be used to generate the meta learner? In principle, any learning scheme can be applied; David Wolpert suggests a "relatively global, smooth" model
  - the base learners do most of the work
  - this reduces the risk of overfitting
- Stacking can also be applied to numeric prediction (and density estimation)
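A minimal sketch of stacking with out-of-fold predictions used as the level-1 training data (an illustration only: the choice of a decision tree and k-NN as level-0 learners and logistic regression as the level-1 learner is mine, not the slides').

```python
# Stacking: level-0 probability predictions become features for a level-1 model.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def stack_fit(X, y):
    base = [DecisionTreeClassifier(), KNeighborsClassifier()]   # level-0 models
    # Out-of-fold predictions, so the level-1 model never sees predictions made
    # on data the base learner was trained on (the cross-validation-like scheme).
    level1_X = np.hstack([
        cross_val_predict(m, X, y, cv=5, method="predict_proba") for m in base
    ])
    meta = LogisticRegression(max_iter=1000).fit(level1_X, y)   # level-1 model
    base = [m.fit(X, y) for m in base]                          # refit on all the data
    return base, meta

def stack_predict(base, meta, X):
    level1_X = np.hstack([m.predict_proba(X) for m in base])
    return meta.predict(level1_X)
```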

Stacking

(Slide 39, titled "Stacking", contains no transcribed text.)

Summary Points

1. No Free Lunch and Ugly Duckling Theorems: no magic bullet
2. The bias-variance decomposition breaks down the error and illustrates the match of a learning method to a problem
3. Bagging is a simple way to run ensemble methods
4. Boosting often works better, but can be susceptible to very noisy data
5. Stacking: not widely investigated, but useful for combining different learners
6. Kernel methods: around for a long time in statistics
7. SVMs: a modular approach to machine learning with a choice of different kernels; many applications
8. Currently the most favoured off-the-shelf classifiers: boosting, SVMs