Ensemble Learning CS534


Ensemble Learning

How to generate ensembles? A wide range of methods has been developed. We will study two popular approaches: bagging and boosting. Both methods take a single (base) learning algorithm and generate ensembles from it.

Base Learning Algorithm We are given a black-box learning algorithm Learn, referred to as the base learner.

Bootstrap Aggregating (Bagging) Leo Breiman, Bagging Predictors, Machine Learning, 24, 123-140 (1996). Consider creating many training data sets by drawing instances from some distribution and then using Learn to output a hypothesis for each data set. The resulting hypotheses will likely vary in performance due to variation in the training sets. What happens if we combine these hypotheses using a majority vote?

Bagging Algorithm Given a training set S, bagging works as follows: 1. Create T bootstrap samples S_1, ..., S_T of S: for each t, randomly draw |S| examples from S with replacement. 2. For each t, train h_t = Learn(S_t). 3. Output H = <{h_1, ..., h_T}, majority vote>. With large |S|, each S_t will contain about 1 - 1/e ≈ 63.2% of the unique examples in S.
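
A minimal sketch of this procedure in Python (the choice of scikit-learn decision trees as the base learner, T = 100, and NumPy-array inputs are illustrative assumptions, not part of the slides):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, T=100, make_learner=DecisionTreeClassifier):
    """Train T hypotheses, each on a bootstrap sample of (X, y)."""
    n = len(X)
    hypotheses = []
    for _ in range(T):
        idx = np.random.choice(n, size=n, replace=True)   # draw |S| examples with replacement
        hypotheses.append(make_learner().fit(X[idx], y[idx]))
    return hypotheses

def predict_majority(hypotheses, X):
    """Combine the hypotheses by an unweighted majority vote (integer class labels assumed)."""
    votes = np.array([h.predict(X) for h in hypotheses])          # shape (T, n_test)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)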

Stability of Learn A learning algorithm is unstable if small changes in the training data can produce large changes in the output hypothesis (otherwise stable). Clearly bagging will have little benefit when used with stable base learning algorithms (i.e., most ensemble members will be very similar). Bagging generally works best when used with unstable yet relatively accurate base learners

The Bias-Variance Decomposition Bagging reduces the variance of a classifier. It is most appropriate for classifiers with low bias and high variance (e.g., decision trees).
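
One rough way to see this empirically (the synthetic data set and all parameter choices below are purely illustrative assumptions) is to compare a single unpruned tree against a bagged ensemble of the same trees:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree :", single.score(X_te, y_te))   # high-variance base learner
print("bagged trees:", bagged.score(X_te, y_te))   # averaging reduces the variance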

[Figure: target concept vs. a single decision tree vs. 100 bagged decision trees]

Boosting Key difference compared to bagging? It is iterative. Bagging: individual classifiers are trained independently. Boosting: look at the errors of the previous classifiers to decide what to focus on in the next iteration over the data, so each successive classifier depends on its predecessors. Result: more weight on hard examples (the ones on which we made mistakes in previous iterations).

Some Boosting History The idea of boosting began with a learning theory question first asked in the late 80s. The question was answered in 1989 by Robert Schapire, resulting in the first theoretical boosting algorithm. Schapire and Freund later developed a practical boosting algorithm called AdaBoost. Many empirical studies show that AdaBoost is highly effective (it very often outperforms ensembles produced by bagging).

History: Strong vs weak learning Strong = weak?

Strong = Weak PAC Learning The key idea is that we can learn a little on every distribution. Produce 3 hypotheses as follows: h_1 is the result of applying Learn to all training data. h_2 is the result of applying Learn to a filtered data distribution on which h_1 has only 50% accuracy (e.g., to generate an example, flip a coin: if heads, draw examples until h_1 makes an error and give that example to Learn; if tails, draw examples until h_1 is correct and give that example to Learn). h_3 is the result of applying Learn to the training data on which h_1 and h_2 disagree. We can then let the three hypotheses vote; the resulting error rate will be improved. We can repeat this until reaching the target error rate.

Consider E = <{h_1, h_2, h_3}, majority vote>. If h_1, h_2, and h_3 each have an error rate less than ε, it can be shown that the error rate of E is upper bounded by g(ε) = 3ε² - 2ε³. This fact leads to a recursive algorithm that creates a hypothesis of arbitrary accuracy from weak hypotheses. Assume we desire an error rate less than e. The three hypotheses need only achieve an error rate ε with 3ε² - 2ε³ ≤ e, which is a weaker requirement than error e itself (since g(ε) < ε for ε < 1/2). As we move down the recursion tree, the error rate that each sub-hypothesis needs to achieve increases, and eventually the required error rate is attainable by the weak learner.
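
As a quick numerical check (the value ε = 0.4 is chosen purely for illustration): if each of h_1, h_2, h_3 has error rate 0.4, then the majority vote E has error at most 3(0.4)² - 2(0.4)³ = 0.48 - 0.128 = 0.352, strictly better than 0.4, and applying the construction recursively drives the error down further.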

AdaBoost The boosting algorithm derived from the original proof is impractical: it requires too many calls to Learn (though only polynomially many). AdaBoost is a practically efficient boosting algorithm that makes more effective use of each call to Learn.

Specifying Input Distributions AdaBoost works by invoking Learn many times on different distributions over the training data set. We therefore need to modify the base learner protocol to accept a training set distribution D as an input. D(i) can be viewed as indicating to the base learner Learn the importance of correctly classifying the i-th training instance.

AdaBoost (High-level steps) AdaBoost performs L boosting rounds; the operations in boosting round l are: 1. Call Learn on data set S with distribution D_l to produce the l-th ensemble member h_l, where D_l is the distribution of round l. 2. Compute the round-(l+1) distribution D_{l+1} by putting more weight on the instances that h_l makes mistakes on. 3. Compute a voting weight α_l for h_l. The ensemble hypothesis returned is the weighted vote H = <(h_1, α_1), ..., (h_L, α_L)>.
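
A compact sketch of these rounds in Python, assuming binary labels in {-1, +1}; the decision-stump base learner, L = 50, and all variable names are my own illustrative choices:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, L=50):
    """y must be a NumPy array with values in {-1, +1}. Returns hypotheses and voting weights."""
    n = len(X)
    D = np.full(n, 1.0 / n)                         # round-1 distribution: uniform
    hs, alphas = [], []
    for _ in range(L):
        # 1. call the base learner with the current distribution D
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.sum(D[pred != y])                  # weighted training error of h
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-12))   # 3. voting weight
        # 2. reweight: up-weight the instances h gets wrong, down-weight the rest
        D = D * np.exp(-alpha * y * pred)
        D = D / D.sum()
        hs.append(h)
        alphas.append(alpha)
    return hs, alphas

def predict(hs, alphas, X):
    """Weighted vote of the ensemble members."""
    return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, hs)))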

Learning with Weights It is often straightforward to convert a base learner to take an input distribution D into account. Decision trees? Neural nets? Logistic regression? When it is not straightforward, we can resample the training data according to D, as sketched below.
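
Many off-the-shelf learners can use D directly (e.g., scikit-learn estimators that accept a sample_weight argument in fit). When that is not available, a common workaround, sketched here as a rough illustration, is to draw a weighted bootstrap sample and train on it as ordinary unweighted data:

import numpy as np

def resample_by_distribution(X, y, D, size=None):
    """Draw a training set whose composition approximates the distribution D (D must sum to 1)."""
    size = size if size is not None else len(X)
    idx = np.random.choice(len(X), size=size, replace=True, p=D)
    return X[idx], y[idx]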

[Figure: letter recognition experiment (Schapire 1989)]

Margin-Based Error Bound (Schapire, Freund, Bartlett and Lee, 1998) Boosting increases the margin very aggressively, since it concentrates on the hardest examples. If the margin is large, more weak learners agree, so running more rounds does not necessarily mean that the final classifier is getting more complex. The bound is independent of the number of rounds T! Boosting can still overfit if the margin is too small, if the weak learners are too complex, or if they perform arbitrarily close to random guessing.
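
For reference, the quantity this bound is stated in terms of is the normalized margin of a training example (written here for concreteness, using the α_t, h_t notation from the AdaBoost slides above; this is the standard definition rather than one given explicitly on the slide):

\[
  \operatorname{margin}(x_i, y_i) \;=\; \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\sum_{t=1}^{T} \alpha_t} \;\in\; [-1, 1].
\]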

AdaBoost as an Additive Model We will now derive AdaBoost in a way that can be adapted in various directions. This recipe will let you derive boosting-style algorithms for particular learning settings of interest, e.g., general misprediction costs or semi-supervised learning. These boosting-style algorithms will not generally be boosting algorithms in the theoretical sense, but they often work quite well.

AdaBoost: Iterative Learning of Additive Models Consider the final hypothesis: it takes the sign of an additive expansion of a set of base classifiers, H_T(x) = sign(Σ_t α_t h_t(x)). At each iteration t, AdaBoost finds a base classifier h_t and a weight α_t to add to the current model H_{t-1}. The goal is to minimize a loss function on the training examples; ideally the 0/1 classification error, which is difficult to optimize directly.

Instead, AdaBoost can be viewed as minimizing an exponential loss function, which is a smooth upper bound on the 0/1 error: L(H) = Σ_i exp(-y_i H(x_i)), where H(x) = Σ_t α_t h_t(x).

Fix the current model H_{t-1} and optimize the exponential loss over the new pair (α_t, h_t).
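
Carrying this optimization out (a standard derivation, sketched here to fill in the step rather than reproduced from the slides) recovers exactly the AdaBoost quantities from the earlier slides:

\[
\begin{aligned}
(\alpha_t, h_t) &= \arg\min_{\alpha, h} \sum_{i} \exp\!\big(-y_i [H_{t-1}(x_i) + \alpha\, h(x_i)]\big)
                 = \arg\min_{\alpha, h} \sum_{i} D_t(i)\, e^{-\alpha y_i h(x_i)}, \\
h_t &= \arg\min_{h} \ \varepsilon_t(h), \qquad \varepsilon_t(h) = \sum_{i} D_t(i)\,\mathbf{1}[h(x_i) \neq y_i], \\
\alpha_t &= \tfrac{1}{2} \ln \frac{1-\varepsilon_t}{\varepsilon_t}, \qquad
D_{t+1}(i) \propto D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}.
\end{aligned}
\]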

Pitfall of Boosting: sensitive to noise and outliers

Summary: Bagging and Boosting. Bagging: resamples data points; gives each classifier the same weight; reduces variance only; is robust to noise and outliers. Boosting: reweights data points (modifies the data distribution); gives each classifier a voting weight that depends on its accuracy; reduces both bias and variance; can hurt performance in the presence of noise and outliers.