Lecture 13: Ensemble Methods

Lecture 13: Ensemble Methods
- What are ensemble methods?
- Bagging
- Bias-variance decomposition: how ensembles work
Part of the slides are based on talks by Dietterich and Schapire.

Horse race prediction
- Ask a professional for advice.
- Presented with a set of races, the professional can give rules of thumb that are better than random, but cannot specify a single rule that is very accurate.
- How can you make money?

Main idea
- Derive simple rules of thumb (i.e., classifiers) based on the data.
- Combine their predictions (e.g., by some voting scheme; see the short numeric sketch below).
This works well if:
- the individual classifiers are accurate, i.e., their true error is better than random guessing;
- the individual classifiers are diverse, i.e., they make independent errors.

Ensemble methods
Several popular approaches:
1. Manipulating the training examples
   - Bagging (Breiman, 1996)
   - Boosting (Freund & Schapire, 1995)
2. Injecting randomness
   - Randomized splits in decision trees
   - Random initial weights in neural networks
3. Using feature subsets
4. Changing the class label encoding
   - E.g., error-correcting codes (Dietterich, 1996)
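As a quick illustration of why accuracy plus independence pays off (this example is not from the slides), the snippet below computes the error of a majority vote over T independent classifiers that each err with probability p. The error rate 0.3 and the values of T are arbitrary choices for the illustration.

    # Majority-vote error of T independent classifiers, each with error rate p.
    # Illustration only: real ensemble members are rarely fully independent.
    from math import comb

    def majority_vote_error(p, T):
        """Probability that more than half of T independent classifiers are wrong."""
        return sum(comb(T, k) * p**k * (1 - p)**(T - k) for k in range(T // 2 + 1, T + 1))

    for T in (1, 5, 11, 21):
        print(T, round(majority_vote_error(0.3, T), 4))
    # prints roughly: 1 0.3, 5 0.1631, 11 0.0782, 21 0.0264

With members that are individually weak but better than random and independent, the voted error drops quickly as T grows; if the errors are correlated, the benefit shrinks.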

Why is this a good idea?
Three main reasons (Dietterich, 2000):
- Statistical: reducing the risk of finding the wrong classifier
- Computational: avoiding local minima
- Representational: being able to represent hypotheses outside the language space we are considering

Bagging (Bootstrap Aggregating)
Given a training set S of size n:
1. Construct several bootstrap replicates S_1, ..., S_T, i.e., each S_i is obtained by drawing n samples with replacement from S (typically the replicate has the same size as S). That means the same example can be drawn multiple times.
2. Construct a classifier h_i based on every set S_i.
Classifying new instances is done by a majority vote among the classifiers h_1, ..., h_T (see the sketch below).
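A minimal from-scratch sketch of the two steps above, assuming scikit-learn's DecisionTreeClassifier as the base learner, NumPy arrays for the data, and integer class labels; the function names and the number of replicates are illustrative, not part of the lecture.

    # Minimal bagging sketch: bootstrap replicates + majority vote.
    # Assumes integer class labels 0..K-1 and a fit/predict-style base learner.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, n_estimators=25, random_state=0):
        rng = np.random.default_rng(random_state)
        n = len(X)
        models = []
        for _ in range(n_estimators):
            idx = rng.integers(0, n, size=n)      # bootstrap replicate: n draws with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        votes = np.stack([m.predict(X) for m in models])    # shape (n_estimators, n_samples)
        # majority vote per instance
        return np.array([np.bincount(votes[:, j]).argmax() for j in range(votes.shape[1])])

Usage would be along the lines of models = bagging_fit(X_train, y_train) followed by y_hat = bagging_predict(models, X_test), where X_train, y_train, and X_test are hypothetical NumPy arrays.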

Experiment: Bagging decision trees
[Scatter plot comparing the error rate of C4.5 with the error rate of bagged C4.5 across datasets (both axes in %).]
Why does bagging work?

Bias-variance theory
The expected error of a classifier can be decomposed into bias and variance components. (Note that we are talking here about statistical bias!)
[Figure: a target concept c(x) approximated with large bias vs. with large variance.]
- Bias comes from not having good hypotheses in the considered class.
- Variance results from the hypothesis class containing too many hypotheses.

Bias-variance decomposition
There are several definitions of bias and variance for classification tasks (e.g., Friedman; Kong & Dietterich; Tibshirani; Kohavi & Wolpert). We will use the Kong & Dietterich (1995) decomposition.
- Imagine we have infinitely many training sets S_1, S_2, ..., all of size n, drawn according to the same probability distribution.
- We train a learning algorithm on each set and obtain a series of hypotheses h_1, h_2, ...
- Consider a particular instance x and use each classifier to predict its class: h_1(x), h_2(x), ...
- Let p(x) be the proportion of these predictions that are incorrect.

Bias
The bias of the learning algorithm on instance x for training sets of size n is:
  B(x) = 1 if p(x) > 1/2, and B(x) = 0 otherwise.
The bias captures the systematic errors of a learning algorithm. An algorithm is said to be biased on x if it misclassifies x more often than it correctly classifies it.

Sources of bias
- Inability to represent certain decision boundaries, e.g., linear threshold units, naive Bayes, decision trees
- Incorrect assumptions, e.g., failure of the independence assumption in naive Bayes
- Classifiers that are too global (too smooth), e.g., a single linear separator, a small decision tree

Variance
Unbiased variance:
  Vu(x) = p(x) if the algorithm is unbiased at x, and Vu(x) = 0 otherwise.
Intuitively, this measures the error rate on examples where the algorithm is unbiased.
Biased variance:
  Vb(x) = 1 - p(x) if the algorithm is biased at x, and Vb(x) = 0 otherwise.
This measures the error overestimate on examples on which the algorithm is biased.

Sources of variance
Statistical sources:
- Classifiers that are too local and can easily fit the data, e.g., nearest neighbor, large decision trees, RBF networks
- Classifiers with large VC dimension
Computational sources:
- Making decisions based on small subsets of the data, e.g., decision tree splits near the leaves
- Randomization in the learning algorithm, e.g., neural nets with random initial weights
- Learning algorithms that make sharp decisions can be unstable (e.g., the decision boundary can change if one training example changes)

Bias-variance error decomposition
The expected error of the algorithm at x is:
  E(x) = B(x) + Vu(x) - Vb(x)
If you do the computation, E(x) = p(x), but this way of writing it emphasizes the source of the errors (see the simulation sketch below).
Two important lessons:
- There is usually a trade-off between bias and variance.
- Just increasing the expressive power of the hypotheses does not necessarily improve the accuracy of the classifiers!

Effect of k in k-nearest neighbor on bias and variance
[Two plots: loss (%) as a function of k, with curves for L (total loss), B (bias), V (variance), Vu (unbiased variance), and Vb (biased variance) on two datasets.]
Increasing k reduces variance and increases bias.
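The sketch below estimates these quantities by simulation at a single test point, in the spirit of the "many training sets of the same size" setup. The data-generating process, the 1-nearest-neighbour learner, and all constants are assumptions made for the illustration; the point is only that the printed value of B + Vu - Vb matches p, the error rate at x.

    # Monte Carlo sketch of the Kong & Dietterich quantities at one test point x.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    x_test, true_label = np.array([[0.2]]), 1        # assumed target concept: y = 1 iff x > 0

    def sample_training_set(n=10):
        X = rng.uniform(-2, 2, size=(n, 1))
        return X, (X[:, 0] > 0).astype(int)

    preds = []
    for _ in range(2000):                            # many training sets of the same size
        X, y = sample_training_set()
        h = KNeighborsClassifier(n_neighbors=1).fit(X, y)
        preds.append(h.predict(x_test)[0])

    p  = np.mean(np.array(preds) != true_label)      # proportion of incorrect predictions at x
    B  = 1.0 if p > 0.5 else 0.0                     # bias
    Vu = p if B == 0.0 else 0.0                      # unbiased variance
    Vb = (1.0 - p) if B == 1.0 else 0.0              # biased variance
    print(f"p={p:.3f}  B={B}  Vu={Vu:.3f}  Vb={Vb:.3f}  B+Vu-Vb={B + Vu - Vb:.3f}")

Swapping in a heavily biased learner (e.g., one that always predicts class 0) would make B = 1 at this point and move the error mass from Vu to the B and Vb terms.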

Effect of decision tree depth on bias and variance
[Two plots: loss (%) as a function of tree depth (level), with curves for L, B, V, Vu, and Vb on two datasets.]
Deeper decision trees reduce bias and increase variance.

Why does bagging work?
- Bagging takes several classifiers and averages their predictions (see the sketch below).
- Averaging decreases variance.
- Bagging is a variance reduction technique.
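A numeric illustration of the averaging argument (a toy model under assumptions, not the decision-tree experiment above): treat each ensemble member's output as a noisy estimate of a common target. For independent members, averaging T of them divides the variance by T; bootstrap replicates are correlated, so bagging's reduction is smaller, but the direction is the same.

    # Averaging independent noisy predictions reduces their variance by a factor of T.
    import numpy as np

    rng = np.random.default_rng(0)
    target, noise_sd, T = 1.0, 0.5, 25
    single   = target + rng.normal(0, noise_sd, size=100_000)
    averaged = target + rng.normal(0, noise_sd, size=(100_000, T)).mean(axis=1)

    print(f"variance of a single predictor: {single.var():.4f}")   # about 0.25
    print(f"variance of the average of {T}: {averaged.var():.4f}") # about 0.25 / 25 = 0.01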