Ensembles. CS 478 - Ensembles


Ensembles
CS 478 - Ensembles

A Holy Grail of Machine Learning
[Diagram: just a data set, or just an explanation of the problem, is given to an automated learner, which produces a hypothesis mapping input features to outputs.]

Ensembles
Multiple diverse models (inductive biases) are trained on the same problem, and their outputs are then combined to produce a final output.
The specific overfit of each learning model can be averaged out.
If the models are diverse (uncorrelated errors), then even if the individual models are weak generalizers, the ensemble can be very accurate.
Many different ensemble approaches:
  Stacking, Gating/Mixture of Experts, Bagging, Boosting, Wagging, Mimicking, Heuristic Weighted Voting, Combinations
[Diagram: models M_1, M_2, M_3, ..., M_n feed into a combining technique.]

Ensembles are Scriptural
Mosiah 29:26, 27
Now it is not common that the voice of the people desireth anything contrary to that which is right; but it is common for the lesser part of the people to desire that which is not right; therefore this shall ye observe and make it your law--to do your business by the voice of the people. And if the time comes that the voice of the people doth choose iniquity, then is the time that the judgments of God will come upon you; yea, then is the time he will visit you with great destruction even as he has hitherto visited this land.

Bias vs. Variance
Learning models can have error arising from two basic issues: bias and variance.
  "Bias" measures the basic capacity of a learning approach to fit the task.
  "Variance" measures the extent to which different hypotheses trained using a learning approach will vary based on initial conditions, training set, etc.
MLPs trained with backprop have lower bias error because they can fit the task well, but relatively high variance error because each model might fall into odd nuances (overfit) depending on training set choice, initial weights, and other parameters.
  This is typical of the more complex models we want.
Naïve Bayes has high bias error (it doesn't fit that well) but very little variance error.
We would like low bias error and low variance error.
Ensembles using multiple trained models with high-variance and low-bias error can average out the variance, leaving just the bias.
  Less worry about overfit with the base models (stopping criteria, etc.).
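A standard identity makes "average out the variance" concrete. It is an idealization that the slide does not state: assume each of the n model outputs f_i has the same variance σ² and that every pair of outputs has the same correlation ρ. Both symbols are introduced here only for illustration.

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} f_i\right)
  = \frac{1}{n^2}\Bigl(n\sigma^2 + n(n-1)\rho\sigma^2\Bigr)
  = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
```

With uncorrelated errors (ρ = 0) the variance of the average falls as σ²/n. With identical models (ρ = 1) nothing is gained. Averaging leaves the bias term untouched in both cases.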

Some classifiers
[Figure panels: Gaussian, Quadratic, Multilayer Neural Network, Linear, Bayes, Simple Perceptron, Nearest Neighbor, Support Vector Machine. Slide dated 3/2/2012.]

Classifier Bias and Variance
[Figure: two training sets of X and 0 points (training set #1 and training set #2) drawn around the true decision boundary, with plots of error versus number of training samples for a complex classifier and a simple classifier, each split into bias and variance.]
Classifier bias and variance don't add! Any classifier can be shown to be better than any other.

Amplifying Weak Learners
Combining weak learners:
Assume n induced models that are independent of each other, each with an accuracy of about 60% on a two-class problem.
While one model is not dependable, if a good majority of the group leans in one direction, then we can have high confidence.
If all n give the same class output, then you can be confident it is correct with probability $1 - (1 - 0.6)^n$. For n = 10, confidence would be about 99.99%.
Normally the models are not independent (e.g. similar training sets). If all n were the same model, then no advantage could be gained.
Also, it is unlikely that all n would give the same output, but if a majority did, we would still get an overall accuracy better than the base accuracy of the models.
If m models say class 1 and w models say class 2, then
$P(\text{majority\_class}) = 1 - \text{Binomial}(n, \min(m, w), 0.6)$
$P(r) = \frac{n!}{r!\,(n-r)!}\, p^r (1-p)^{n-r}$
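The binomial formula above can be turned into a quick calculation of how majority voting amplifies a weak learner. This is a minimal sketch under the slide's idealized assumptions (independent models, two classes, every model with the same accuracy p). The function name `majority_vote_accuracy` is made up for illustration, and n is assumed odd so that ties cannot occur.

```python
from math import comb


def majority_vote_accuracy(n: int, p: float) -> float:
    """Probability that more than half of n independent models are correct.

    Assumes a two-class problem, independent errors, per-model accuracy p,
    and odd n (so a tie is impossible).
    """
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    # Sum the binomial terms P(r) = C(n, r) p^r (1 - p)^(n - r) for r >= need.
    return sum(comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(need, n + 1))


if __name__ == "__main__":
    for n in (1, 3, 11, 51):
        print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the ensemble accuracy rises with n. With correlated models, as the slide notes, the gain is smaller.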

Bagging
Bootstrap aggregating (Bagging)
Induce m learners starting with the same initial parameters, with each training set chosen uniformly at random with replacement from the original data set.
  Training sets might be 2/3 of the data set; you still need to save some separate data for testing.
All m hypotheses have an equal vote for classifying novel instances.
Great way to improve overall accuracy by decreasing variance. Consistent, significant empirical improvement.
Does not overfit (whereas boosting may), but may be more conservative overall on accuracy improvements.
The bigger m the better (with diminishing returns), but the efficiency trade-off must be considered.
Often used with the same learning algorithm, and thus best for algorithms that tend to give more diverse hypotheses based on initial random conditions.
Could use other schemes to improve the diversity between learners:
  Different initial parameters, sampling approaches, etc.
  Different learning algorithms
  The more diversity the better (yet bagging is most often used with the same learning algorithm and just different training sets).
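The bagging procedure can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: `train_stump` is a hypothetical one-dimensional threshold learner standing in for whatever base algorithm is used, and the sample size defaults to the size of the data set (an assumption; the slide mentions that smaller training sets may be used, which the `size` parameter allows).

```python
import random
from collections import Counter


def bootstrap_sample(data, rng, size=None):
    """Draw `size` items uniformly at random WITH replacement."""
    size = len(data) if size is None else size
    return [rng.choice(data) for _ in range(size)]


def train_stump(sample):
    """Hypothetical weak learner: best single threshold on a 1-D feature.

    `sample` is a list of (x, label) pairs with labels 0 or 1.
    """
    best = None
    for thr, _ in sample:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= thr else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right


def bagging(data, m, size=None, seed=0):
    """Induce m models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng, size)) for _ in range(m)]


def predict(models, x):
    """Every model gets an equal vote; return the most common label."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

For example, with `data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]`, calling `predict(bagging(data, m=11), 18)` takes the majority vote of eleven stumps.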

Boosting
Boosting by resampling: each training set TS_t is chosen randomly with distribution D_t, with replacement, from the original training data.
  D_1 has all instances equally likely to be chosen. Typically each TS_t is the same size as the original data set.
Induce the first model. Change D_{t+1} so that instances which are misclassified by the current model on its current TS have a higher probability of being chosen for future training sets.
Keep training new models until a stopping criterion is met:
  M models induced
  Overall accuracy levels out
  Most recent model has accuracy less than 50% on its TS
All models vote, but each model's vote is scaled by its accuracy on the training set it was trained on.
Boosting is more aggressive than bagging on accuracy, but in some cases it can overfit and do worse; it can theoretically converge to the training set.
  On average better than bagging, but worse for some tasks.
  In rare cases can be worse than the non-ensemble approach.
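The resampling loop can be sketched as follows. The slide does not give the exact rule for changing D_{t+1}, so this sketch assumes a simple one: multiply the probability of each misclassified instance by a hypothetical factor `boost` and renormalize. AdaBoost, for comparison, derives its factor from the model's error rate. `train_stump` is again a made-up one-dimensional base learner used only for illustration.

```python
import random


def train_stump(sample):
    """Hypothetical weak learner: best single threshold on a 1-D feature."""
    best = None
    for thr, _ in sample:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= thr else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right


def boost_by_resampling(data, train, max_models=10, boost=2.0, seed=0):
    """Return a list of (model, vote_weight) pairs.

    `boost` is an assumed reweighting factor, not taken from the slides.
    """
    rng = random.Random(seed)
    n = len(data)
    dist = [1.0 / n] * n  # D_1: all instances equally likely
    ensemble = []
    for _ in range(max_models):
        # TS_t: n indices drawn with replacement according to D_t
        idx = rng.choices(range(n), weights=dist, k=n)
        sample = [data[i] for i in idx]
        model = train(sample)
        acc = sum(model(x) == y for x, y in sample) / n
        if acc < 0.5:  # stopping criterion: worse than chance on its own TS
            break
        ensemble.append((model, acc))  # vote scaled by accuracy on its TS
        # Raise the probability of instances misclassified on the current TS.
        missed = {i for i in idx if model(data[i][0]) != data[i][1]}
        dist = [w * boost if i in missed else w for i, w in enumerate(dist)]
        total = sum(dist)
        dist = [w / total for w in dist]
    return ensemble


def weighted_vote(ensemble, x):
    """Each model votes for its predicted label with its stored weight."""
    scores = {}
    for model, weight in ensemble:
        label = model(x)
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)
```

Passing the learner in as `train` keeps the loop independent of the base algorithm, so any function that maps a sample to a callable model can be substituted.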

Boosting
Another approach to boosting is to have each base model train on the entire training set, but have the ML algorithm take each instance's current weighting into account during learning.
How might you do that for:
  MLPs: scale the learning rate by the instance weight
  Decision trees: instance membership is scaled by weight
  k-NN: each neighbor's vote is scaled by its weight
Then still have each model vote, weighted by its overall accuracy.
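As one illustration of taking instance weights into account, here is a minimal sketch of the k-NN case, where each neighbor's vote is scaled by its instance weight. The function name, the one-dimensional feature, and the absolute-difference distance are assumptions made for brevity.

```python
def weighted_knn_predict(train, weights, query, k=3):
    """Classify `query` by a weighted vote of its k nearest neighbors.

    train:   list of (x, label) pairs with numeric x
    weights: one boosting weight per training instance (same order as train)
    """
    # Indices of training instances sorted by distance to the query.
    order = sorted(range(len(train)), key=lambda i: abs(train[i][0] - query))
    scores = {}
    for i in order[:k]:
        label = train[i][1]
        # Each neighbor contributes its instance weight instead of a count of 1.
        scores[label] = scores.get(label, 0.0) + weights[i]
    return max(scores, key=scores.get)
```

With equal weights this reduces to ordinary k-NN voting. Raising the weight of a previously misclassified instance lets it outvote several lighter neighbors.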

Ensemble Creation Approaches
A good goal is to get less correlated errors between models.
Injecting randomness: initial weights, different learning parameters, etc.
Different training sets: bagging, boosting, different features, etc.
Forcing differences: different objective functions, auxiliary tasks
Different machine learning models
  Obvious, but surprisingly less used
  One aspect of COD (Classifier Output Distance) research: which algorithms are most different and thus most appropriate to ensemble

Ensemble Combining Approaches
Unweighted voting (e.g. bagging)
Weighted voting, based on accuracy, etc. (e.g. boosting)
Stacking: learn the combination function
  Higher-order possibilities
  Which algorithm should be used for the stacker?
  Must match the input/output data types between models
  Stacking the stack, etc.
Gating function/Mixture of Experts
  The gating function uses the input features to decide which expert or combination (weights) of experts to use in the vote, with experts being strong in different parts of the input space.
Heuristic weighted voting: the weighting differs for each instance
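To make "learn the combination function" concrete, here is a deliberately simple stacking sketch. The stacker is a lookup table from the tuple of base-model outputs to the label seen most often with that tuple. Both the table stacker and the use of a separate held-out set to fit it are illustrative assumptions. Any learning algorithm whose input types match the base-model outputs could serve as the stacker instead.

```python
from collections import Counter, defaultdict


def fit_stacker(base_models, holdout):
    """Learn a combination function over the outputs of trained base models.

    base_models: list of callables, model(x) -> hashable output
    holdout:     list of (x, label) pairs not used to train the base models
    """
    table = defaultdict(Counter)
    for x, y in holdout:
        key = tuple(model(x) for model in base_models)
        table[key][y] += 1
    # Fallback for output combinations never seen in the holdout data.
    overall = Counter(y for _, y in holdout).most_common(1)[0][0]

    def stacked(x):
        key = tuple(model(x) for model in base_models)
        if key in table:
            return table[key].most_common(1)[0][0]
        return overall

    return stacked
```

Unlike a vote, a learned combiner can exploit disagreement. For instance, it can learn that "model 1 says yes, model 2 says no" signals the positive class even though neither model alone predicts it well.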

Ensemble Summary
Other models: Random Forests, Boosted Stumps, Cascading, Arbitration, Delegation, PDDAGs (Parallel Decision DAGs), Bayesian Model Averaging and Combination, Clustering Ensemble, etc.
Efficiency issues
  Wagging (Weight Averaging): multi-layer?
  Mimicking: oracle learning, semi-supervised
Great way to decrease variance/overfit
Almost always gain accuracy improvements with ensembles