Lecture 12. Ensemble methods. Interim Revision

Lecture 12. Ensemble methods. Interim Revision. COMP90051 Statistical Machine Learning, Semester 2, 2017. Lecturer: Andrey Kan. Copyright: University of Melbourne.

Ensemble methods. This lecture: bagging and random forest; boosting and stacking; frequentist supervised learning (interim summary); discussion. (Art: OpenClipartVectors at pixabay.com, CC0)

Ensemble Methods Overview of model combination approaches 3

Choosing a model. Thus far, we have mostly discussed individual models and considered each of them in isolation or in competition. We know how to evaluate each model's performance (via accuracy, F-measure, etc.), which allows us to choose the best model for a dataset overall. This best model is still likely to make errors on some instances, and models that are worse overall might still be superior on some instances!

Panel of experts. Consider a panel of 3 experts that make a classification decision independently. Each expert makes a mistake with probability 0.3, and the consensus decision is a majority vote. What is the probability of a mistake in the consensus decision? The slide offers the candidate answers 0.189, 0.79 and 0.216; the consensus is wrong when at least two of the three experts err, so the correct answer is 3 × 0.3^2 × 0.7 + 0.3^3 = 0.189 + 0.027 = 0.216. (Art: OpenClipartVectors at pixabay.com, CC0)
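
As a quick check of the arithmetic (not part of the slides), a couple of lines of Python reproduce the binomial calculation:

```python
from math import comb

# Majority of 3 independent experts is wrong when at least 2 of them err.
p = 0.3
p_majority_wrong = comb(3, 2) * p**2 * (1 - p) + comb(3, 3) * p**3
print(p_majority_wrong)   # 0.216
```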

Combining models. Model combination (aka ensemble learning) constructs a set of base models (aka base learners) from a given set of training data and aggregates their outputs into a single meta-model: classification via (weighted) majority vote, regression via (weighted) averaging. More generally: meta-model = f(base models). Recall the bias-variance trade-off: E[l(Y, f(x_0))] = (E[Y] − E[f])^2 + Var[f] + Var[Y], i.e., test error = (bias)^2 + variance + irreducible error. Averaging k independent and identically distributed predictions reduces variance: Var[f_avg] = Var[f] / k. How can we generate multiple learners from a single training dataset?

Bagging (bootstrap aggregating; Breiman, 1994). Method: construct novel datasets via sampling with replacement. Generate k datasets, each of size n, sampled from the training data with replacement; build a base classifier on each constructed dataset; combine predictions via voting/averaging.
Original training dataset: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Bootstrap samples:
{7, 2, 6, 7, 5, 4, 8, 8, 1, 0} (out-of-sample: 3, 9)
{1, 3, 8, 0, 3, 5, 8, 0, 1, 9} (out-of-sample: 2, 4, 6, 7)
{2, 9, 4, 2, 7, 9, 3, 0, 1, 0} (out-of-sample: 5, 6, 8)
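
The sampling-with-replacement step is easy to sketch in code. Below is a minimal illustration (not from the lecture), assuming scikit-learn style base learners that expose fit/predict and integer class labels; the base_learner argument is a hypothetical factory callable.

```python
import numpy as np

def bagging_fit(X, y, base_learner, k):
    """Train k base models, each on a bootstrap sample of the training data."""
    n = len(y)
    rng = np.random.default_rng(0)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)              # sample n indices with replacement
        models.append(base_learner().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine base predictions by majority vote (use the mean for regression)."""
    votes = np.stack([m.predict(X) for m in models])  # shape (k, n_test)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

With a decision tree as the base learner, this is essentially the random forest of the next slide, minus the random feature subsets.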

Refresher on decision trees. [Figure: a small decision tree that tests x_1 ≤ θ_1 and then x_2 ≤ θ_2, shown next to the corresponding axis-aligned partition of the (x_1, x_2) plane into regions labelled A and B.] Training criterion: purity of each final partition. Optimisation: heuristic greedy iterative approach. Model complexity is defined by the depth of the tree. Deep trees: very finely tuned to the specific data, so high variance and low bias. Shallow trees: crude approximation, so low variance and high bias.

Bagging example: Random forest. Just bagged trees! Algorithm (parameters: number of trees k, number of features l ≤ m):
1. Initialise the forest as empty
2. For c = 1, ..., k:
   a) Create a new bootstrap sample of the training data
   b) Select a random subset of l of the m features
   c) Train a decision tree on the bootstrap sample using only the l features
   d) Add the tree to the forest
3. Make predictions via majority vote or averaging
Works well in many practical settings.
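
A minimal sketch of the algorithm above (illustrative, not the lecture's code), using scikit-learn's DecisionTreeClassifier as the base tree. Note that library forests such as sklearn.ensemble.RandomForestClassifier typically re-sample the feature subset at every split rather than once per tree, so this follows the slide's formulation rather than the library's.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, k=100, l=None, seed=0):
    """Grow k trees; each sees a bootstrap sample restricted to l random features."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    l = l or max(1, int(np.sqrt(m)))                  # common heuristic: sqrt(m) features
    forest = []
    for _ in range(k):
        rows = rng.integers(0, n, size=n)             # bootstrap sample of the data
        cols = rng.choice(m, size=l, replace=False)   # random subset of l features
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))                   # remember which features the tree uses
    return forest

def random_forest_predict(forest, X):
    """Majority vote over the trees in the forest."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```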

Putting out-of-sample data to use. At each draw, a particular training example has probability 1 − 1/n of not being selected, so the probability that it is left out of a bootstrap sample of size n is (1 − 1/n)^n. For large n this probability approaches e^{-1} ≈ 0.368, so on average only about 63.2% of the data are included in each training dataset. This can be used for an error estimate of the ensemble, essentially cross-validation: evaluate each base classifier on its corresponding out-of-sample data (roughly 36.8% of the examples) and average these accuracies.
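
A few lines of Python (not part of the slides) confirm the limit quoted above:

```python
import numpy as np

# Probability that a given example is never drawn into a bootstrap
# sample of size n: (1 - 1/n)^n, which tends to e^{-1} ~ 0.368.
for n in (10, 100, 1000, 10000):
    print(n, (1 - 1 / n) ** n)
print("limit:", np.exp(-1))   # ~ 0.3679
```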

Bagging: Reflections. A simple method based on sampling and voting, and the computation of the individual base classifiers can be parallelised. Highly effective over noisy datasets. Performance is generally significantly better than that of the base classifiers, and never substantially worse. Improves unstable classifiers by reducing variance.

Boosting. Intuition: focus the attention of base classifiers on examples that are hard to classify. Method: iteratively change the distribution over examples to reflect the performance of the classifier on the previous iteration. Start with each training instance having a 1/n probability of being included in the sample; over k iterations, train a classifier and update the weight of each instance according to the classifier's ability to classify it; finally, combine the base classifiers via weighted voting.

Boosting: Sampling example. Original training dataset: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Boosting samples:
Iteration 1: {7, 2, 6, 7, 5, 4, 8, 8, 1, 0}. Suppose that example 2 was misclassified.
Iteration 2: {1, 3, 8, 2, 3, 5, 2, 0, 1, 9}. Suppose that example 2 was still misclassified.
Iteration 3: {2, 9, 2, 2, 7, 9, 3, 2, 1, 0}. The hard example 2 is sampled more and more often.

Boosting example: AdaBoost.
1. Initialise the example distribution P_1(i) = 1/n, i = 1, ..., n
2. For c = 1, ..., k:
   a) Train base classifier A_c on a sample drawn with replacement from P_c
   b) Set confidence α_c = (1/2) ln((1 − ε_c) / ε_c) for the classifier's error rate ε_c
   c) Update the example distribution and normalise: P_{c+1}(i) ∝ P_c(i) exp(−α_c) if A_c(i) = y_i, and P_c(i) exp(α_c) otherwise
3. Classify by majority vote weighted by the confidences: argmax_y Σ_{c=1}^{k} α_c δ(A_c(x) = y)

AdaBoost (cont.) [Figure: the confidence weight α_c plotted against the error rate ε_c; α_c grows large as ε_c approaches 0 and falls to 0 at ε_c = 0.5.] Technicality: reinitialise the example distribution whenever ε_c > 0.5. Base learners: often decision stumps or trees, anything "weak". A decision stump is a decision tree with one splitting node.
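
A minimal sketch of the AdaBoost loop (illustrative only), assuming labels in {-1, +1} and scikit-learn decision stumps as the weak learners. For simplicity it reweights examples through sample_weight instead of resampling from P_c, which is a common implementation choice rather than the exact procedure on the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=50):
    """AdaBoost with decision stumps; y is assumed to take values in {-1, +1}."""
    n = len(y)
    P = np.full(n, 1 / n)                    # step 1: uniform example distribution
    classifiers, alphas = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=P)     # step 2a: train on the weighted examples
        pred = stump.predict(X)
        eps = P[pred != y].sum()             # weighted error rate epsilon_c
        if eps > 0.5:                        # technicality from the slide
            P = np.full(n, 1 / n)            # reinitialise the distribution
            continue
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # step 2b: confidence
        P = P * np.exp(-alpha * y * pred)    # step 2c: up-weight misclassified examples
        P = P / P.sum()                      # normalise
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Step 3: confidence-weighted vote; the sign gives the predicted class."""
    scores = sum(a * c.predict(X) for c, a in zip(classifiers, alphas))
    return np.sign(scores)
```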

Boosting: Reflections. A method based on iterative sampling and weighted voting; more computationally expensive than bagging. The method has guaranteed performance in the form of error bounds over the training data. In practical applications, boosting can overfit.

Bagging vs Boosting.
Bagging: parallel sampling; minimises variance; simple voting; classification or regression; not prone to overfitting.
Boosting: iterative sampling, targeting "hard" instances; weighted voting; classification or regression; prone to overfitting (unless the base learners are simple).

Stacking. Intuition: smooth errors over a range of algorithms with different biases. Method: train a meta-model over the outputs of the base learners, training the base- and meta-learners using cross-validation. A simple meta-classifier: logistic regression. Stacking is a generalisation of bagging and boosting.
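
A minimal sketch of stacking under the following assumptions: binary classification, scikit-learn base learners that provide predict_proba, and logistic regression as the meta-classifier. Out-of-fold predictions from cross_val_predict form the meta-features, so the meta-learner never sees predictions a base learner made on its own training folds.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y, base_learners, meta=None):
    """Train a meta-model on cross-validated base-learner predictions."""
    meta = meta or LogisticRegression()
    # One meta-feature column per base learner: its out-of-fold P(y=1|x).
    Z = np.column_stack([
        cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
        for m in base_learners
    ])
    meta.fit(Z, y)
    fitted_bases = [m.fit(X, y) for m in base_learners]   # refit bases on all data
    return fitted_bases, meta

def stacking_predict(fitted_bases, meta, X):
    Z = np.column_stack([m.predict_proba(X)[:, 1] for m in fitted_bases])
    return meta.predict(Z)

# Hypothetical usage with heterogeneous base learners (different biases):
#   from sklearn.tree import DecisionTreeClassifier
#   from sklearn.svm import SVC
#   bases = [DecisionTreeClassifier(max_depth=3), SVC(probability=True), LogisticRegression()]
#   fitted, meta = stacking_fit(X_train, y_train, bases)
```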

Stacking: Reflections. Compare this to ANNs and basis expansion. A mathematically simple but computationally expensive method. Able to combine heterogeneous classifiers with varying performance. With care, stacking results in performance as good as or better than the best of the base classifiers.

Supervised Learning. Interim summary of the frequentist supervised learning methods covered so far.

Supervised learning*
1. Assume a model (e.g., a linear model). Denote the parameters of the model as θ; model predictions are f(x, θ).
2. Choose a way to measure the discrepancy between predictions and training data, e.g., the sum of squared residuals ||y − Xθ||^2.
3. Training = parameter estimation = optimisation: θ̂ = argmin_{θ ∈ Θ} L(data, θ).
*This is the setup of what's called frequentist supervised learning. A different view on parameter estimation/training will be presented later in the subject.
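
A toy illustration (not from the lecture) of step 3: fit a linear model by explicitly minimising the squared-loss objective with a generic optimiser, and check the result against the analytic least-squares solution. The data and parameter values here are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # bias column + one feature
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + 0.1 * rng.normal(size=50)

loss = lambda theta: np.sum((y - X @ theta) ** 2)        # L(data, theta)
theta_hat = minimize(loss, x0=np.zeros(2)).x             # argmin over theta
print(theta_hat)                                         # close to [1.0, 2.0]
print(np.linalg.lstsq(X, y, rcond=None)[0])              # analytic solution agrees
```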

Supervised learning methods (1/3).
Linear Regression (Galton, Pearson). Model: Y = x'w + ε, where ε ~ N(0, σ^2). Loss function: squared loss. Optimisation: analytic solution (the normal equations). Notes: can also be optimised iteratively.
Logistic Regression (Cox). Model: p(y|x) = Bernoulli(y | θ(x)), where θ(x) = 1 / (1 + exp(−x'w)). Loss function: cross-entropy (aka log loss). Optimisation: iterative, 2nd-order method.
Perceptron (Rosenblatt). Model: label is based on the sign of w_0 + w'x. Loss function: perceptron loss. Optimisation: stochastic gradient descent. Notes: provable convergence for linearly separable data.

Supervised learning methods (2/3).
Artificial Neural Networks (Hinton, LeCun). Model: defined by the network topology. Loss function: varies. Optimisation: variations of gradient descent. Notes: backpropagation is used to compute the partial derivatives.
Support Vector Machines (Vapnik). Model: label is based on the sign of b + w'x. Loss function: hard-margin SVM loss; hinge loss. Optimisation: quadratic programming. Notes: specialised optimisation algorithms (e.g., SMO, chunking).
Random Forest (Breiman). Model: average of decision trees (a combination of piece-wise constant models). Loss function: cross-entropy (aka log loss); squared loss. Optimisation: greedy growth of each tree. Notes: an example of model averaging.

Supervised learning methods (3/3).
The Next Super-Method (You) (that is, if you really need a new one). What are the aims of the method? What is the scope of the method? Intended use? Assumptions? Model: analytically or algorithmically defined? Loss function: what is the relevant goodness criterion? Optimisation: is there an efficient method for training?

Basis expansion.
All methods: manually craft a feature-space transformation (e.g., polynomial basis, RBF basis) before applying the method.
Artificial Neural Networks: the earlier layers can be viewed as a transformation; the topology needs to be pre-defined, but the weights are learned from data.
Linear Regression, Logistic Regression, Perceptron, Support Vector Machines (name a common aspect of these methods): kernelise, using an implicit transformation obtained by choosing a kernel.
Ensemble methods, including Random Forest: the base models act as a (learned) feature-space transformation.

Regularisation. Can be used for various purposes: add resilience to (nearly) collinear features; introduce prior knowledge into the process of learning; control model complexity (the ability to generalise is reflected in the test error: simple models underfit, with high bias and low variance; complex models overfit, with low bias and high variance).
Method 1: analytically, by adding a data-independent term to the objective function, e.g., ridge regression, the Lasso.
Method 2: algorithmically, by not allowing the model to fine-tune, e.g., early stopping in ANNs, weight sharing in CNNs, restricting tree depth in random forests.
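
A minimal sketch (not from the slides) of Method 1 for linear regression: ridge regression, where the data-independent penalty lam * ||w||^2 shows up as lam * I in the normal equations and stabilises (nearly) collinear features.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Solve (X'X + lam I) w = X'y; lam = 0 recovers ordinary least squares."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Larger lam shrinks the weights towards zero, trading a little bias
# for lower variance (less fine-tuning to the training data).
```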

What is Machine Learning?

Ensemble methods. This lecture: bagging and random forest; boosting and stacking; frequentist supervised learning (interim summary); discussion. (Art: OpenClipartVectors at pixabay.com, CC0)