Combining multiple models

Combining multiple models
- Basic idea of meta learning schemes: build different experts and let them vote
- Advantage: often improves predictive performance
- Disadvantage: produces output that is very hard to analyze
- Schemes we will discuss: bagging, boosting, stacking, and error-correcting output codes
- The first three can be applied to both classification and numeric prediction problems

Bagging
- Employs the simplest way of combining predictions: voting/averaging
- Each model receives equal weight
- Idealized version of bagging:
  - Sample several training sets of size n (instead of just having one training set of size n)
  - Build a classifier for each training set
  - Combine the classifiers' predictions
- This improves performance in almost all cases if the learning scheme is unstable (e.g. decision trees)

Bias-variance decomposition
- Theoretical tool for analyzing how much the specific training set affects the performance of a classifier
- Assume we have an infinite number of classifiers built from different training sets of size n
- The bias of a learning scheme is the expected error of the combined classifier on new data
- The variance of a learning scheme is the expected error due to the particular training set used
- Total expected error: bias + variance
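The decomposition can be illustrated empirically. The sketch below (not part of the original slides) estimates bias and variance for decision stumps on a synthetic two-class problem by training many classifiers on independent training sets of size n and combining them by voting; the data generator, the choice of stumps, and the number of training sets are illustrative assumptions.

```python
# Minimal Monte Carlo sketch of the decomposition above (illustrative only):
# bias     = error of the combined (voted) classifier on new data
# variance = extra error attributable to the particular training set used
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def sample_data(n):
    """Synthetic two-class problem with 10% label noise (an assumption)."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    flip = rng.random(n) < 0.1
    return X, np.where(flip, 1 - y, y)

n, n_train_sets = 100, 200                    # "infinite" is approximated by 200
X_test, y_test = sample_data(10_000)          # large test set stands in for "new data"

models = [DecisionTreeClassifier(max_depth=1).fit(*sample_data(n))
          for _ in range(n_train_sets)]       # one classifier per training set

preds = np.array([m.predict(X_test) for m in models])   # shape: (n_train_sets, n_test)
vote = (preds.mean(axis=0) > 0.5).astype(int)            # combined classifier (majority vote)

bias = np.mean(vote != y_test)                           # error of the combined classifier
total = np.mean(preds != y_test)                         # expected error of a single classifier
variance = total - bias                                  # error due to the particular training set
print(f"bias {bias:.3f}  variance {variance:.3f}  total {total:.3f}")
```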

More on bagging
- Bagging reduces variance by voting/averaging, thus reducing the overall expected error
- In the case of classification there are pathological situations where the overall error might increase
- Usually, the more classifiers the better
- Problem: we only have one dataset!
- Solution: generate new datasets of size n by sampling with replacement from the original dataset
- Can help a lot if the data is noisy

Bagging classifiers

Model generation:
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances with replacement from the training set.
    Apply the learning algorithm to the sample.
    Store the resulting model.

Classification:
  For each of the t models:
    Predict the class of the instance using the model.
  Return the class that has been predicted most often.
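A minimal Python sketch of this pseudocode, assuming scikit-learn decision trees as the learning algorithm and non-negative integer class labels; neither assumption comes from the slides, and any classifier with fit/predict would do.

```python
# Sketch of the bagging pseudocode above (assumes non-negative integer class labels).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t, seed=0):
    """Model generation: t bootstrap samples of size n, one model per sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)               # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classification: return the class predicted most often by the t models."""
    preds = np.array([m.predict(X) for m in models])   # shape: (t, n_instances)
    return np.array([np.bincount(col).argmax() for col in preds.T])
```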

Boosting
- Also uses voting/averaging, but models are weighted according to their performance
- Iterative procedure: new models are influenced by the performance of previously built ones
- New model is encouraged to become an expert for instances classified incorrectly by earlier models
- Intuitive justification: models should be experts that complement each other
- There are several variants of this algorithm

AdaBoost.M1

Model generation:
  Assign equal weight to each training instance.
  For each of t iterations:
    Apply the learning algorithm to the weighted dataset and store the resulting model.
    Compute the error e of the model on the weighted dataset and store the error.
    If e is equal to zero, or e is greater than or equal to 0.5:
      Terminate model generation.
    For each instance in the dataset:
      If the instance was classified correctly by the model:
        Multiply the weight of the instance by e / (1 - e).
    Normalize the weights of all instances.

Classification:
  Assign a weight of zero to all classes.
  For each of the t (or fewer) models:
    Add -log(e / (1 - e)) to the weight of the class predicted by the model.
  Return the class with the highest weight.
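A sketch of this procedure in Python, using weighted decision stumps as the base learner and assuming class labels 0, ..., k-1; both are assumptions for illustration. The e = 0 case, which would give an infinite voting weight, is handled here by simply stopping, a simplification of the pseudocode above.

```python
# Sketch of AdaBoost.M1 as described above (illustrative, not a reference implementation).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t):
    n = len(X)
    w = np.full(n, 1.0 / n)                         # assign equal weight to each instance
    models, alphas = [], []
    for _ in range(t):
        model = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = model.predict(X) != y
        e = w[wrong].sum() / w.sum()                # error on the weighted dataset
        if e == 0 or e >= 0.5:                      # terminate model generation
            break
        models.append(model)
        alphas.append(-np.log(e / (1 - e)))         # this model's voting weight
        w[~wrong] *= e / (1 - e)                    # down-weight correctly classified instances
        w /= w.sum()                                # normalize weights
    return models, alphas

def adaboost_m1_predict(models, alphas, X, n_classes):
    votes = np.zeros((len(X), n_classes))           # weight of zero for every class
    for model, a in zip(models, alphas):
        votes[np.arange(len(X)), model.predict(X)] += a   # assumes labels 0..n_classes-1
    return votes.argmax(axis=1)                     # class with the highest weight
```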

More on boosting
- Boosting can be applied without weights by using resampling, with each instance's selection probability determined by its weight
- Disadvantage: not all instances are used
- Advantage: resampling can be repeated if the error exceeds 0.5
- Boosting stems from computational learning theory
- Theoretical result: the training error decreases exponentially
- Also: boosting works if the base classifiers are not too complex and their error doesn't become too large too quickly
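The resampling variant mentioned in the first bullet can be sketched in a couple of lines; the numpy-based helper below is an illustration, not part of the slides.

```python
# Draw a new training set in which each instance's chance of being picked
# is proportional to its current boosting weight (sampling with replacement).
import numpy as np

def weighted_resample(X, y, w, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=len(X), replace=True, p=w / w.sum())
    return X[idx], y[idx]
```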

A bit more on boosting
- Puzzling fact: the generalization error can decrease long after the training error has reached zero
- Seems to contradict Occam's razor!
- However, the problem disappears if the margin (confidence) is considered instead of the error
- Margin: difference between the estimated probability for the true class and that of the most likely other class (a value between -1 and 1)
- Boosting works with weak learners: the only condition is that their error doesn't exceed 0.5
- LogitBoost: a more sophisticated boosting scheme
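A small helper makes the margin definition concrete; the probability vectors in the usage lines are made-up examples.

```python
# Margin of one instance: estimated probability of the true class minus the
# largest probability assigned to any other class (a value in [-1, 1]).
import numpy as np

def margin(probs, true_class):
    others = np.delete(probs, true_class)
    return probs[true_class] - others.max()

print(margin(np.array([0.7, 0.2, 0.1]), true_class=0))   #  0.5: confident and correct
print(margin(np.array([0.3, 0.6, 0.1]), true_class=0))   # -0.3: incorrect prediction
```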

Stacking
- Hard to analyze theoretically: "black magic"
- Uses a meta learner instead of voting to combine the predictions of the base learners
- Predictions of the base learners (level-0 models) are used as input for the meta learner (level-1 model)
- The base learners are usually different learning schemes
- Predictions on the training data can't be used to generate data for the level-1 model!
- A cross-validation-like scheme is employed instead

More on stacking
- If the base learners can output probabilities, it's better to use those as input to the meta learner
- Which algorithm should be used to generate the meta learner? In principle, any learning scheme can be applied
- David Wolpert: use a "relatively global, smooth" model
  - The base learners do most of the work
  - This reduces the risk of overfitting
- Stacking can also be applied to numeric prediction (and density estimation)
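Putting the last two slides together, here is a compact stacking sketch: level-0 models output class probabilities, out-of-fold predictions form the level-1 training data, and a logistic regression plays the role of the "relatively global, smooth" level-1 model. The particular base learners, the number of folds, and the use of scikit-learn are assumptions for illustration.

```python
# Stacking sketch: level-0 predictions generated by a cross-validation-like
# scheme become the training data for the level-1 (meta) model.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def stacking_fit(X, y, base_learners, n_folds=5):
    n_classes = len(np.unique(y))
    meta_X = np.zeros((len(X), len(base_learners) * n_classes))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    # Level-1 training data comes only from held-out folds, so the meta learner
    # never sees predictions a model made on its own training data.
    # (Assumes every class appears in every training fold so that the
    # probability columns line up across folds.)
    for train_idx, hold_idx in kf.split(X):
        for j, learner in enumerate(base_learners):
            model = clone(learner).fit(X[train_idx], y[train_idx])
            meta_X[hold_idx, j * n_classes:(j + 1) * n_classes] = \
                model.predict_proba(X[hold_idx])
    meta_model = LogisticRegression(max_iter=1000).fit(meta_X, y)
    level0 = [clone(l).fit(X, y) for l in base_learners]   # refit level-0 on all data
    return level0, meta_model

def stacking_predict(level0, meta_model, X):
    meta_X = np.hstack([m.predict_proba(X) for m in level0])
    return meta_model.predict(meta_X)

# Usage, e.g.:
#   level0, meta = stacking_fit(X, y, [DecisionTreeClassifier(), GaussianNB()])
#   y_hat = stacking_predict(level0, meta, X_new)
```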