COMP 551 Applied Machine Learning Lecture 11: Ensemble learning


COMP 551 Applied Machine Learning Lecture 11: Ensemble learning Instructor: Herke van Hoof (herke.vanhoof@mcgill.ca) Slides mostly by: Joelle Pineau (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.

Main types of machine learning problems Supervised learning Classification Regression Ensemble methods Unsupervised learning Reinforcement learning 2

Next topic: Ensemble methods. Recently seen supervised learning methods: Logistic regression, Naïve Bayes, LDA/QDA (Lectures 4, 5: Linear Classification); Decision trees, Instance-based learning (Lecture 7: Decision Trees). Decision trees build complex classifiers from simpler ones (linear separators). Ensemble methods use this idea with other simple methods. Several ways to do this: Bagging, Random forests, Boosting, Stacking (next lecture). 3

Ensemble learning in general Key idea: Run one or more base learning algorithms multiple times, then combine the predictions of the different learners to get a final prediction. What's a base learning algorithm? Naïve Bayes, LDA, Decision trees, SVMs, etc. 4

Ensemble learning in general Key idea: Run one or more base learning algorithms multiple times, then combine the predictions of the different learners to get a final prediction. What's a base learning algorithm? Naïve Bayes, LDA, Decision trees, SVMs, etc. First attempt: Construct several classifiers independently: Bagging; Randomizing the test selection in decision trees (Random forests); Using a different subset of input features to train different trees. 5

Ensemble learning in general Key idea: Run one or more base learning algorithms multiple times, then combine the predictions of the different learners to get a final prediction. What's a base learning algorithm? Naïve Bayes, LDA, Decision trees, SVMs, etc. First attempt: Construct several classifiers independently: Bagging; Randomizing the test selection in decision trees (Random forests); Using a different subset of input features to train different trees. More complex approach: Coordinate the construction of the hypotheses in the ensemble. 6
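To make the key idea concrete, here is a minimal sketch (not from the slides) that trains a few base learners independently and combines their predictions by majority vote; the synthetic dataset and the particular model choices are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification data (placeholder for any real dataset)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Train several base learners independently on the same data
base_learners = [GaussianNB(),
                 LinearDiscriminantAnalysis(),
                 DecisionTreeClassifier(random_state=0)]
for h in base_learners:
    h.fit(X_tr, y_tr)

# Combine by majority vote (works for {0,1} labels with an odd number of learners)
votes = np.stack([h.predict(X_te) for h in base_learners])   # shape (n_learners, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("ensemble accuracy:", (majority == y_te).mean())
```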

Ensemble methods in general Training models independently on the same dataset tends to yield the same result! For an ensemble to be useful, the trained models need to be different: 1. Use slightly different (randomized) datasets 2. Use (slightly) different (e.g. randomized) training procedures 7

Recall bootstrapping (Lecture 6: Evaluation). Given dataset D, construct a bootstrap replicate of D, called D_k, which has the same number of examples, by drawing samples from D with replacement. Use the learning algorithm to construct a hypothesis h_k by training on D_k. Compute the prediction of h_k on each of the remaining points, from the set T_k = D \ D_k. Repeat this process K times, where K is typically a few hundred. 8

Estimating bias and variance For each point x, we have a set of estimates h_1(x), ..., h_K(x) (possibly fewer than K, since x might not appear in some replicates). The average empirical prediction for x is ĥ(x) = (1/K) Σ_{k=1}^{K} h_k(x). We estimate the bias as y - ĥ(x). We estimate the variance as (1/(K-1)) Σ_{k=1}^{K} (ĥ(x) - h_k(x))². 9
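A minimal sketch of this bootstrap procedure for the regression case, assuming NumPy arrays and a generic scikit-learn regressor as the learner; the learner and the function name are placeholders, not something specified in the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bootstrap_bias_variance(X, y, K=200, seed=0):
    """Estimate per-point bias and variance of a learner from K bootstrap replicates.

    X, y are NumPy arrays; predictions for a point only come from replicates
    in which that point was held out (T_k = D \\ D_k)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    preds = np.full((K, n), np.nan)              # preds[k, i] = h_k(x_i) if x_i not in D_k
    for k in range(K):
        idx = rng.integers(0, n, size=n)         # bootstrap replicate D_k (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)    # remaining points T_k = D \ D_k
        h_k = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds[k, oob] = h_k.predict(X[oob])
    h_bar = np.nanmean(preds, axis=0)            # average empirical prediction ĥ(x)
    bias = y - h_bar                             # estimated bias per point
    variance = np.nanvar(preds, axis=0, ddof=1)  # estimated variance per point
    return bias, variance
```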

Bagging: Bootstrap aggregation If we did all the work to get the hypotheses h_k, why not use all of them to make a prediction (as opposed to just estimating bias/variance/error)? All hypotheses get a vote. For classification: pick the majority class. For regression: average all the predictions. Which hypothesis classes would benefit most from this approach? 10

Bagging For each point x, we have a set of estimates h_1(x), ..., h_K(x) (possibly fewer than K, since x might not appear in some replicates). The average empirical prediction for x is ĥ(x) = (1/K) Σ_{k=1}^{K} h_k(x). We estimate the bias as y - ĥ(x), and the variance as (1/(K-1)) Σ_{k=1}^{K} (ĥ(x) - h_k(x))². In theory, bagging eliminates variance altogether. In practice, bagging tends to reduce variance and increase bias. Use this with unstable learners that have high variance, e.g. decision trees, neural networks, nearest-neighbour. 11
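A hedged sketch of bagging for classification, assuming decision trees as the unstable base learner and integer class labels; scikit-learn's BaggingClassifier packages the same idea, but the explicit loop makes the bootstrap-plus-majority-vote structure visible.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, K=100, seed=0):
    """Train K trees, each on a bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    ensemble = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)                  # sample n points with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Majority vote over the trees (assumes class labels 0..C-1)."""
    votes = np.stack([h.predict(X) for h in ensemble])    # shape (K, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```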

Random forests (Breiman, 2001) Basic algorithm: Use K bootstrap replicates to train K different trees. At each node, pick m variables at random (use m < M, where M is the total number of features). Determine the best test (using normalized information gain). Recurse until the tree reaches maximum depth (no pruning). 12

Random forests (Breiman, 2001) Basic algorithm: Use K bootstrap replicates to train K different trees. At each node, pick m variables at random (use m < M, where M is the total number of features). Determine the best test (using normalized information gain). Recurse until the tree reaches maximum depth (no pruning). Comments: Each tree has high variance, but the ensemble uses averaging, which reduces variance. Random forests are very competitive in both classification and regression, but still subject to overfitting. 13
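In practice one rarely codes this from scratch; below is a hedged usage sketch with scikit-learn's RandomForestClassifier, where n_estimators plays the role of K and max_features the role of m (the dataset and parameter values are arbitrary choices for illustration, and scikit-learn splits on Gini impurity or entropy rather than the normalized information gain mentioned above).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# K = 200 bootstrap-trained trees; at each node consider m = sqrt(M) random features
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print("mean CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```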

Extremely randomized trees (Geurts et al., 2006) Basic algorithm: Construct K decision trees. Pick m attributes at random (without replacement) and pick a random test involving each attribute. Evaluate all tests (using a normalized information gain metric) and pick the best one for the node. Continue until a desired depth or a desired number of instances (n_min) at the leaf is reached. 14

Extremely randomized trees (Geurts et al., 2006) Basic algorithm: Construct K decision trees. Pick m attributes at random (without replacement) and pick a random test involving each attribute. Evaluate all tests (using a normalized information gain metric) and pick the best one for the node. Continue until a desired depth or a desired number of instances (n_min) at the leaf is reached. Comments: Very reliable method for both classification and regression. The smaller m is, the more randomized the trees are; small m is best, especially with large levels of noise. Small n_min means less bias and more variance, but variance is controlled by averaging over trees. Compared to single trees, can pick smaller n_min (less bias). 15
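scikit-learn also ships an implementation of this method; a brief hedged sketch follows, where max_features corresponds to m and min_samples_leaf to n_min (values chosen arbitrarily for illustration). Note that, matching the original method, scikit-learn's extra trees are trained on the full dataset by default rather than on bootstrap replicates.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Smaller max_features -> more randomized trees; small min_samples_leaf -> less bias, more variance
extra = ExtraTreesClassifier(n_estimators=200, max_features=0.3,
                             min_samples_leaf=2, random_state=0)
print("mean CV accuracy:", cross_val_score(extra, X, y, cv=5).mean())
```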

Randomization For an ensemble to be useful, the trained models need to be different: 1. Use slightly different (randomized) datasets: Bootstrap Aggregation (Bagging) 2. Use slightly different (randomized) training procedures: Extremely randomized trees, Random Forests 16

Randomization in general Instead of searching very hard for the best hypothesis, generate lots of random ones, then average their results. Examples: Random feature selection Random projections. Advantages? Disadvantages? 17

Randomization in general Instead of searching very hard for the best hypothesis, generate lots of random ones, then average their results. Examples: Random feature selection Random projections. Advantages? Very fast, easy, can handle lots of data. Can circumvent difficulties in optimization. Averaging reduces the variance introduced by randomization. Disadvantages? 18

Randomization in general Instead of searching very hard for the best hypothesis, generate lots of random ones, then average their results. Examples: Random feature selection Random projections. Advantages? Very fast, easy, can handle lots of data. Can circumvent difficulties in optimization. Averaging reduces the variance introduced by randomization. Disadvantages? New prediction may be more expensive to evaluate (go over all trees). Still typically subject to overfitting. Low interpretability compared to standard decision trees. 19
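As an example of the random projections idea, the hedged sketch below (my own illustration, not from the slides) fits several cheap classifiers, each on its own random Gaussian projection of the inputs, then averages their predicted probabilities; averaging reduces the variance the random projections introduce.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection

# High-dimensional toy data (placeholder for a real dataset)
X, y = make_classification(n_samples=1000, n_features=500, n_informative=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Several cheap models, each on its own random low-dimensional projection of the inputs
models = [make_pipeline(GaussianRandomProjection(n_components=50, random_state=s),
                        LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
          for s in range(10)]

# Average the predicted class-1 probabilities over the randomized models
avg_proba = np.mean([m.predict_proba(X_te)[:, 1] for m in models], axis=0)
print("accuracy:", ((avg_proba > 0.5).astype(int) == y_te).mean())
```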

Randomization For an ensemble to be useful, the trained models need to be different: 1. Use slightly different (randomized) datasets: Bootstrap Aggregation (Bagging) 2. Use slightly different (randomized) training procedures: Extremely randomized trees, Random Forests 3. An alternative to randomization? 20

Additive models In an ensemble, the output on any instance is computed by averaging the outputs of several hypotheses. Idea: Don't construct the hypotheses independently. Instead, new hypotheses should focus on instances that are problematic for existing hypotheses. If an example is difficult, more components should focus on it. 21

Boosting Boosting: Use the training set to train a simple predictor. Re-weight the training examples, putting more weight on examples that were not properly classified by the previous predictor. Repeat n times. Combine the simple hypotheses into a single, accurate predictor. [Diagram: the original data is re-weighted into datasets D1, D2, ..., Dn; each Di is fed to a weak learner producing hypothesis Hi; the final hypothesis is a combination F(H1, H2, ..., Hn).] 22

Notation Assume that examples are drawn independently from some probability distribution P on the set of possible data D. Let J_P(h) be the expected error of hypothesis h when data is drawn from P: J_P(h) = Σ_{<x,y>} J(h(x), y) P(<x,y>), where J(h(x), y) could be the squared error or the 0/1 loss. 23

Weak learners Assume we have some weak binary classifiers: a decision stump is a single-node decision tree testing x_i > t; a single-feature Naïve Bayes classifier; a 1-nearest-neighbour classifier. "Weak" means J_P(h) < 1/2 - γ (assuming 2 classes), where γ > 0, so the true error of the classifier is only slightly better than random guessing. Questions: How do we re-weight the examples? How do we combine many simple predictors into a single classifier? 24
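As an illustration of the first kind of weak learner, here is a minimal decision stump written from scratch, under the assumption of ±1 labels and real-valued features; it accepts example weights so it can be re-weighted by a boosting loop. This is a sketch of the general idea, not code from the slides.

```python
import numpy as np

class DecisionStump:
    """Weak learner: predicts sign * (+1 if x[feature] > threshold else -1)."""

    def fit(self, X, y, sample_weight):
        n, d = X.shape
        best_err = np.inf
        for j in range(d):                               # try every feature
            for t in np.unique(X[:, j]):                 # and every observed threshold
                for sign in (+1, -1):                    # and both orientations
                    pred = sign * np.where(X[:, j] > t, 1, -1)
                    err = np.sum(sample_weight * (pred != y))
                    if err < best_err:
                        best_err = err
                        self.feature, self.threshold, self.sign = j, t, sign
        return self

    def predict(self, X):
        return self.sign * np.where(X[:, self.feature] > self.threshold, 1, -1)
```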

Example 25

Example: First step 26

Example: Second step 27

Example: Third step 28

Example: Final hypothesis 29

AdaBoost (Freund & Schapire, 1995) [Slides 30-34 step through the AdaBoost algorithm, shown as images in the original slides; the highlighted quantity is the weight of weak learner t.]
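Since the algorithm itself appears only as images in the slides, here is a hedged from-scratch sketch of the standard AdaBoost update for ±1 labels, using depth-1 scikit-learn trees as decision stumps; the names (alpha for the weight of weak learner t, w for the example weights, T for the number of rounds) are my own and may differ from the slides' notation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """AdaBoost with decision stumps; y must contain labels in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                                   # initial example weights
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)             # weighted error of this weak learner
        if err >= 0.5:                                        # no longer better than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))       # weight of weak learner t
        w *= np.exp(-alpha * y * pred)                        # up-weight misclassified examples
        w /= w.sum()
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of the weak learners."""
    score = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(score)
```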

Why these equations? Loss function: L = Σ_{i=1}^{N} exp(-m_i), where the margin is m_i = y_i Σ_{k=1}^{K} α_k h_k(x_i). This loss has a gradient, is an upper bound on the classification (0/1) loss, gives a stronger signal for wrong classifications, and a stronger signal still if the prediction is wrong and far from the boundary. 35

Why these equations? Loss function: L = Σ_{i=1}^{N} exp(-m_i), with m_i = y_i Σ_{k=1}^{K} α_k h_k(x_i). The update equations are derived from this loss function. 36
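A tiny numeric check of these properties (my own illustration, not from the slides): the exponential loss exp(-m) upper-bounds the 0/1 loss and grows rapidly as the margin m becomes more negative.

```python
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])    # m_i = y_i * sum_k alpha_k * h_k(x_i)
exp_loss = np.exp(-margins)                         # exponential loss per example
zero_one = (margins <= 0).astype(float)             # 0/1 classification loss per example

print(exp_loss)    # [7.389 1.649 1.    0.607 0.135] -> larger when wrong and far from boundary
print(zero_one)    # [1. 1. 1. 0. 0.]
print(np.all(exp_loss >= zero_one))                 # True: exp(-m) upper-bounds the 0/1 loss
```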

Properties of AdaBoost Compared to other boosting algorithms, main insight is to automatically adapt the error rate at each iteration. 37

Properties of AdaBoost Compared to other boosting algorithms, the main insight is to automatically adapt the weights at each iteration. The training error of the final hypothesis is at most Π_t √(1 - 4γ_t²) ≤ exp(-2 Σ_t γ_t²), where γ_t is how much better than random h_t is. AdaBoost thus drives the training error down exponentially fast. 38

Real data set: Text categorization [Figure: boosting results on text-categorization benchmark datasets; labels garbled in the transcription] 39

Boosting empirical evaluation [Figure: test error curves comparing boosting with C4.5 (Lecture 7: Decision Trees)] 40

Bagging vs Boosting [Figure: scatter plots of test error (0-30%) for boosting C4.5 and for bagging C4.5, each compared against C4.5 alone] 41

Bagging vs Boosting Bagging is typically faster, but may get a smaller error reduction (not by much). Bagging works well with reasonable classifiers. Boosting works with very simple classifiers, e.g. BoosTexter: text classification using decision stumps based on single words. Boosting may have a problem if a lot of the data is mislabeled, because it will focus on those examples a lot, leading to overfitting. 42

Why does boosting work? 43

Why does boosting work? Weak learners have high bias. By combining them, we get more expressive classifiers. Hence, boosting is a bias-reduction technique. 44

Why does boosting work? Weak learners have high bias. By combining them, we get more expressive classifiers. Hence, boosting is a bias-reduction technique. AdaBoost minimizes an upper bound on the misclassification error, within the space of functions that can be captured by a linear combination of the base classifiers. What happens as we run boosting longer? Intuitively, we get more and more complex hypotheses. How would you expect bias and variance to evolve over time? 45

A naïve (but reasonable) analysis of error Expect the training error to continue to drop (until it reaches 0). Expect the test error to increase as we get more voters and h_f becomes too complex. [Figure: hypothesized training and test error curves as the ensemble grows] 46

Actual typical run of AdaBoost Test error does not increase even after 1000 rounds (more than 2 million decision nodes!). Test error continues to drop even after training error reaches 0! These are consistent results across many sets of experiments! Conjecture: Boosting does not overfit! [Figure: training and test error versus number of rounds, 10 to 1000 on a log scale] 47

Other methods Random forests, extremely randomized trees, boosting and bagging all combine many learners of a single type. Advantage: we have a recipe for generating many classifiers by randomizing the dataset or the training procedure. Disadvantage: since the classifiers are from the same family, they might make similar errors. Next lecture: combining different types of learners (e.g. combine SVM + decision tree + LDA). 48
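As a preview, here is a hedged sketch of that idea with scikit-learn's VotingClassifier, combining the three learner types the slide mentions by hard (majority) vote; the dataset and model settings are arbitrary assumptions, and the next lecture's stacking approach is more general than a plain vote.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hard-voting ensemble of three different model families: SVM + decision tree + LDA
vote = VotingClassifier([("svm", SVC()),
                         ("tree", DecisionTreeClassifier(random_state=0)),
                         ("lda", LinearDiscriminantAnalysis())])
print("mean CV accuracy:", cross_val_score(vote, X, y, cv=5).mean())
```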

What you should know Ensemble methods combine several hypotheses into one prediction. They work better than the best individual hypothesis from the same class because they reduce bias or variance (or both). Random forests, extremely randomized trees and bagging average over multiple independently trained classifiers, and thus lower variance; bagging is therefore useful for complex hypotheses, and can use more aggressive settings that would normally overfit (lower bias). The classifiers in boosting are coordinated to lower error: boosting focuses on harder examples and gives a weighted vote to the hypotheses, reducing the bias of simple hypotheses (not so useful for complex models). 49