Classification: Ensemble Methods
Jeff Howbert
Introduction to Machine Learning, Winter 2012

Ensemble methods
Basic idea: combining the predictions of multiple competing models often gives better predictive accuracy than any individual model.
- Empirically successful in a wide variety of applications (see the table on p. 294 of the textbook).
- There is also now some theory that explains why it works.

Building and using an ensemble
1) Train multiple, separate models on the training data.
2) Predict the outcome for a previously unseen sample by aggregating the predictions made by those models (see the sketch below).
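To make the two steps concrete, here is a minimal sketch using scikit-learn's VotingClassifier; the library choice and the particular base models are illustrative assumptions, not something prescribed by the slides.

```python
# A minimal sketch of the two-step recipe above.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 1: train multiple, separate models on the training data.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("svm", SVC()),
    ],
    voting="hard",  # step 2: aggregate by majority vote
)
ensemble.fit(X, y)

# Step 2: aggregate the base models' predictions for unseen samples.
print(ensemble.predict(X[:5]))
```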

Estimation surfaces of five model types
[figure]

Ensemble methods
Useful for classification or regression.
- For classification, aggregate predictions by voting.
- For regression, aggregate predictions by averaging.
Model types can be:
- Heterogeneous. Example: a neural net combined with an SVM combined with a decision tree.
- Homogeneous (most common in practice). The individual models are referred to as base classifiers (or regressors). Example: an ensemble of 1000 decision trees.
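As a quick illustration of the two aggregation rules (the arrays below are made-up predictions, not data from the slides):

```python
import numpy as np

# Made-up predictions from three base models for several samples.
class_preds = np.array([[0, 1, 1, 0],    # base classifier 1
                        [0, 1, 0, 0],    # base classifier 2
                        [1, 1, 1, 0]])   # base classifier 3
reg_preds = np.array([[2.1, 0.4],        # base regressor 1
                      [1.9, 0.6],        # base regressor 2
                      [2.3, 0.5]])       # base regressor 3

# Classification: majority vote across base classifiers (column-wise).
votes = (class_preds.sum(axis=0) > class_preds.shape[0] / 2).astype(int)
print(votes)                    # [0 1 1 0]

# Regression: average across base regressors (column-wise).
print(reg_preds.mean(axis=0))   # [2.1 0.5]
```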

Committee methods (classifier ensembles)
- m base classifiers trained independently on different samples of the training data
- Predictions combined by unweighted voting
- Performance: E[error]_ave / m < E[error]_committee < E[error]_ave
- Example: bagging

Adaptive methods
- m base classifiers trained sequentially, with reweighting of instances in the training data
- Predictions combined by weighted voting
- Performance: E[error] ≤ E[error]_train + O( [ m·d / n ]^(1/2) )
- Example: boosting

Building and using a committee ensemble
TRAINING
1) Create samples of the training data.
2) Train one base classifier on each sample.
USING
1) Make predictions on test or new data with each base classifier separately.
2) Combine the predictions by voting.
[Figure: three training samples each yield one base classifier. For test points 1-4 the three classifiers predict A B A B, A A A B, and B A A B respectively; the unweighted vote gives the final predictions 1: A, 2: A, 3: A, 4: B.]
A from-scratch sketch of this recipe follows.
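A from-scratch sketch of the committee recipe, assuming scikit-learn decision trees as base classifiers and simple bootstrap samples (both are illustrative choices, not mandated by the slide):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:250], y[:250], X[250:]

# TRAINING: create samples of the training data, train one base classifier on each.
base_classifiers = []
for _ in range(25):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
    clf = DecisionTreeClassifier(max_depth=3)
    clf.fit(X_train[idx], y_train[idx])
    base_classifiers.append(clf)

# USING: predict with each base classifier separately, then combine by unweighted vote.
all_preds = np.stack([clf.predict(X_test) for clf in base_classifiers])  # shape (25, n_test)
committee_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
print(committee_pred[:10])
```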

Binomial distribution (a digression)
The most commonly used discrete probability distribution.
Given:
- a random process with two outcomes, referred to as success and failure (just a convention)
- the probability p that the outcome is success; the probability of failure is 1 - p
- n trials of the process
The binomial distribution describes the probability that m of the n trials are successes, for values of m in the range 0 ≤ m ≤ n.

Binomial distribution
  P(m successes) = C(n, m) · p^m · (1 - p)^(n - m),  where C(n, m) = n! / ( m! (n - m)! )
Example: p = 0.9, n = 5, m = 4
  P(4 successes) = C(5, 4) · 0.9^4 · 0.1^1 = 0.328
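A quick numeric check of the example (a sketch; the slide itself does no computation):

```python
from math import comb

p, n, m = 0.9, 5, 4
prob = comb(n, m) * p**m * (1 - p)**(n - m)
print(round(prob, 3))  # 0.328
```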

Why do ensembles work?
A highly simplified example:
- Suppose there are 21 base classifiers.
- Each classifier is correct with probability p = 0.70.
- Assume the classifiers are independent.
The ensemble predicts by majority vote, so it is correct whenever at least 11 of the 21 classifiers are correct. The probability that the ensemble classifier makes a correct prediction is therefore
  Σ_{i=11..21} C(21, i) · p^i · (1 - p)^(21 - i) ≈ 0.97
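The tail sum can be verified directly (a sketch, not part of the original slides):

```python
from math import comb

p, n = 0.70, 21
p_ensemble = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(11, n + 1))
print(round(p_ensemble, 4))  # ≈ 0.9736, the 0.97 quoted on the slide
```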

Why do ensembles work?
[Figure: voting by 21 independent classifiers, each correct with p = 0.7. The plot shows the probability that exactly k of the 21 classifiers are correct, assuming each classifier is correct with p = 0.7 and makes its predictions independently of the others; the region k ≤ 10 is where the ensemble vote makes a wrong prediction.]

Ensemble vs. base classifier error
As long as the base classifier is better than random (error < 0.5), the ensemble will be superior to the base classifier.

Why do ensembles work? In real applications
The assumptions of the simplified example meet reality as follows:
- Suppose there are 21 base classifiers: you do have direct control over the number of base classifiers.
- Each classifier is correct with probability p = 0.70: base classifiers will have variable accuracy, but you can establish post hoc the mean and variability of that accuracy.
- Assume the classifiers are independent: base classifiers always have some significant degree of correlation in their predictions.

Why do ensembles work? In real applications
- Assume the classifiers are independent: base classifiers always have some significant degree of correlation in their predictions.
- But the expected performance of the ensemble is still guaranteed to be no worse than the average of the individual classifiers:
  E[error]_ave / m < E[error]_committee < E[error]_ave
- The more uncorrelated the individual classifiers are, the better the ensemble.

Base classifiers: important properties
- Diversity (lack of correlation)
- Accuracy
- Computationally fast

Base classifiers: important properties
Diversity
- Predictions vary significantly between classifiers.
- Usually attained by using unstable classifiers: a small change in the training data (or in the initial model weights) produces a large change in model structure.
- Examples of unstable classifiers: decision trees, neural nets, rule-based classifiers.
- Examples of stable classifiers: linear models (logistic regression, linear discriminant, etc.).

Diversity in decision trees
[Figure: bagging trees on a simulated dataset. The top left panel shows the original tree; eight trees grown on bootstrap samples are also shown.]

Base classifiers: important properties
Accuracy
- The error rate of each base classifier must be better than random.
- There is a tension between diversity and accuracy.
Computationally fast
- Usually need to train large numbers of classifiers.

How to create diverse base classifiers
- Random initialization of model parameters (e.g. network weights)
- Resample / subsample the training data
  - Sample instances: randomly with replacement (e.g. bagging), randomly without replacement, or disjoint partitions
  - Sample features (random subspace approach): randomly prior to training, or randomly during training (e.g. random forest)
  - Sample both instances and features
- Random projection to a lower-dimensional space
- Iterative reweighting of the training data
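A minimal sketch of two of these techniques, instance subsampling with replacement and feature subsampling (random subspace); the NumPy-based helper below is an illustrative assumption, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # toy data: 100 instances, 20 features

def diverse_view(X, n_features=10):
    """Return a bootstrap sample of instances restricted to a random feature subset."""
    rows = rng.integers(0, X.shape[0], size=X.shape[0])            # instances, with replacement
    cols = rng.choice(X.shape[1], size=n_features, replace=False)  # features, without replacement
    return X[np.ix_(rows, cols)], cols  # keep cols so the base model knows its subspace

X_view, used_features = diverse_view(X)
print(X_view.shape, used_features)
```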

Common ensemble methods
- Bagging
- Boosting

Bootstrap sampling
Given: a set S containing N samples.
Goal: a sampled set T containing N samples.
Bootstrap sampling process:
  for i = 1 to N:
      randomly select one sample from S, with replacement
      place that sample in T
If S is large, T will contain ~ (1 - 1/e) = 63.2% of the unique samples in S.
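A small sketch of the process and of the 63.2% figure (an illustration, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
S = np.arange(N)                         # a large set S of N samples
T = rng.choice(S, size=N, replace=True)  # bootstrap: N draws with replacement

unique_fraction = len(np.unique(T)) / N
print(round(unique_fraction, 3))         # ≈ 0.632, i.e. ~ 1 - 1/e
```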

Bagging
Bagging = bootstrap + aggregation
1. Create k bootstrap samples. Example:
   original data: 1 2 3 4 5 6 7 8 9 10
   bootstrap 1:   7 8 10 8 2 5 10 10 5 9
   bootstrap 2:   1 4 9 1 2 3 2 7 3 2
   bootstrap 3:   1 8 5 10 5 5 9 6 3 7
2. Train a classifier on each bootstrap sample.
3. Vote (or average) the predictions of the k models (see the sketch below).
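The whole recipe is available off the shelf; here is a minimal sketch using scikit-learn's BaggingClassifier (the library and base estimator are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# k = 100 bootstrap samples, one decision tree per sample, predictions combined by voting.
# Note: the parameter is named `estimator` in recent scikit-learn versions
# (`base_estimator` in older ones).
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=0,
)
print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```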

Bagging with decision trees
[figures]

Boosting
Key difference:
- Bagging: individual classifiers are trained independently.
- Boosting: the training process is sequential and iterative. Look at the errors from previous classifiers to decide what to focus on in the next training iteration; each new classifier depends on its predecessors.
Result: more weight on hard samples (the ones misclassified in previous iterations).

Boosting
- Initially, all samples have equal weights.
- Samples that are wrongly classified have their weights increased.
- Samples that are classified correctly have their weights decreased.
- Samples with higher weights have more influence in subsequent training iterations.
- Boosting adaptively changes the training data distribution.
Example (each round resamples according to the current weights):
   original data:      1 2 3 4 5 6 7 8 9 10
   boosting (round 1): 7 3 2 8 7 9 4 10 6 3
   boosting (round 2): 5 4 9 4 2 5 1 7 4 2
   boosting (round 3): 4 4 8 10 4 5 4 6 3 4
Sample 4 is hard to classify, so its weight is increased and it appears more and more often (see the sketch below).
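A small sketch of how higher weights translate into more influence when the next round's training set is drawn (illustrative NumPy code, not the slide's own algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = np.arange(1, 11)            # samples 1..10
weights = np.full(10, 0.1)            # initially all equal

# Suppose sample 4 keeps getting misclassified: increase its weight, then renormalize.
weights[3] *= 4.0
weights /= weights.sum()

# Draw the next round's training set according to the current weights:
# sample 4 now shows up far more often than the others.
next_round = rng.choice(samples, size=10, replace=True, p=weights)
print(next_round)
```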

Boosting example
[figures]

AdaBoost
Training data has N samples.
K base classifiers: C_1, C_2, ..., C_K.
Error rate ε_i of the i-th classifier:
  ε_i = (1/N) Σ_{j=1..N} w_j · δ( C_i(x_j) ≠ y_j )
where w_j is the weight on the j-th sample and δ is the indicator function:
  δ( C_i(x_j) = y_j ) = 0  (no error for a correct prediction)
  δ( C_i(x_j) ≠ y_j ) = 1  (error = 1 for an incorrect prediction)

AdaBoost
Importance of classifier i:
  α_i = (1/2) · ln( (1 - ε_i) / ε_i )
α_i is used in:
- the formula for updating the sample weights
- the final weighting of the classifiers in the ensemble vote
[Figure: relationship of classifier importance α to training error ε]

AdaBoost
Weight updates:
  w_j^(i+1) = ( w_j^(i) / Z_i ) · exp( -α_i )  if C_i(x_j) = y_j
  w_j^(i+1) = ( w_j^(i) / Z_i ) · exp( +α_i )  if C_i(x_j) ≠ y_j
where Z_i is a normalization factor (chosen so that the updated weights sum to 1).
If any intermediate iteration produces an error rate greater than 50%, the weights are reverted to 1/N and the reweighting procedure is restarted.

AdaBoost
Final classification model:
  C*(x) = argmax_y Σ_{i=1..K} α_i · δ( C_i(x) = y )
i.e. for a test sample x, choose the class label y which maximizes the importance-weighted vote across all K classifiers.
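Putting the last few slides together, here is a compact from-scratch sketch of the AdaBoost recipe for two classes (labels ±1), using decision stumps as base classifiers; the stump choice and the normalization convention (weights kept summing to 1) are illustrative assumptions consistent with, but not copied from, the slides:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)          # labels in {-1, +1} simplify the weighted vote

N, K = len(X), 50                      # N training samples, K base classifiers
w = np.full(N, 1.0 / N)                # initially, all samples have equal weights
classifiers, alphas = [], []

for i in range(K):
    # Train base classifier C_i under the current weight distribution (a decision stump here).
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error: total weight of misclassified samples (weights sum to 1).
    eps = np.sum(w * (pred != y))
    if eps > 0.5:                      # per the slide: revert the weights and restart the round
        w = np.full(N, 1.0 / N)
        continue
    eps = max(eps, 1e-10)              # avoid division by zero in the log below

    # Importance alpha_i = 1/2 * ln((1 - eps_i) / eps_i).
    alpha = 0.5 * np.log((1 - eps) / eps)

    # Weight update: exp(-alpha) for correct samples, exp(+alpha) for wrong ones, then renormalize.
    w = w * np.exp(-alpha * y * pred)
    w /= w.sum()                       # division by Z_i, the normalization factor

    classifiers.append(stump)
    alphas.append(alpha)

# Final model: importance-weighted vote, C*(x) = sign( sum_i alpha_i * C_i(x) ).
def adaboost_predict(X_new):
    scores = sum(a * clf.predict(X_new) for a, clf in zip(alphas, classifiers))
    return np.sign(scores)

print(np.mean(adaboost_predict(X) == y))  # training accuracy of the ensemble
```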

Illustrating AdaBoost
[Figures: data points for training, with the initial weights for each data point]

Summary: bagging and boosting
Bagging
- Resamples data points
- Weight of each classifier is the same
- Only reduces variance
- Robust to noise and outliers
- Easily parallelized
Boosting
- Reweights data points (modifies the data distribution)
- Weight of a classifier depends on its accuracy
- Reduces both bias and variance
- Noise and outliers can hurt performance

Bias-variance decomposition
  expected error = bias² + variance + noise
where "expected" refers to the average behavior of models trained on all possible samples from the underlying data distribution.
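A rough simulation sketch of the decomposition for a regression estimator; all modeling choices here (the sine target, the quadratic fit, the fixed test point) are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
noise_sd = 0.3
x_test = 1.0
true_f = np.sin                                # true function; noise is added on top of it

# Train the same model class on many independent training sets and
# record its prediction at a fixed test point x_test.
preds = []
for _ in range(2000):
    x = rng.uniform(0, np.pi, size=30)
    y = true_f(x) + rng.normal(0, noise_sd, size=30)
    coeffs = np.polyfit(x, y, deg=2)           # a (somewhat biased) quadratic fit
    preds.append(np.polyval(coeffs, x_test))
preds = np.array(preds)

bias2 = (preds.mean() - true_f(x_test)) ** 2   # (average prediction - truth)^2
variance = preds.var()                         # spread of predictions across training sets
noise = noise_sd ** 2                          # irreducible error
print(bias2, variance, noise)                  # expected squared error ≈ bias² + variance + noise
```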

Bias-variance decomposition
[Figure: an analogy from the Society for Creative Anachronism]

Bias-variance decomposition
Examples of its utility for understanding classifiers:
- Decision trees generally have low bias but high variance.
- Bagging reduces the variance but not the bias of a classifier.
- Therefore, expect decision trees to perform well in bagging ensembles.

Bias-variance decomposition
[Figure: general relationship of bias and variance to model complexity]