Machine Learning: Ensemble Learning


1 Ensemble Learning

2 Introduction

In our daily life we routinely rely on a mixture of experts: asking several doctors for their opinions before undergoing a major surgery, or reading user reviews before purchasing a product. There are countless examples where we consider the decision of a mixture of experts. Ensemble systems follow exactly the same approach to data analysis.

Problem Definition
Given:
- a training data set D for supervised learning, drawn from a common instance space X;
- a collection of inductive learning algorithms (inducers);
- the hypotheses produced by applying the inducers to s(D), where s maps a data set to a transformed data set (by sampling, transformation, partitioning, etc.).
Return: a new classification algorithm for x in X that combines the outputs of the collection of classification algorithms (not necessarily one of the individual hypotheses).
Desired property: guarantees on the performance of the combined prediction.

Two Solution Approaches
- Train and apply each classifier, then learn the combiner function(s) from the results.
- Train the classifiers and the combiner function(s) concurrently.

3 Why Do We Combine Classifiers? [1]

Reasons for Using Ensemble-Based Systems

Statistical Reasons
- A set of classifiers with similar training performance may have different generalization performance, and classifiers with similar performance may perform differently in the field (depending on the test data).
- In this case, averaging (combining) may reduce the overall risk of the decision, although it may or may not beat the performance of the single best classifier.

Large Volumes of Data
- Training a single classifier on a very large volume of data is usually not practical. A more efficient approach is to:
  - partition the data into smaller subsets,
  - train different classifiers on different partitions,
  - combine their outputs using an intelligent combination rule.

Too Little Data
- We can use resampling techniques to produce overlapping random training sets, each of which is used to train a different classifier.

Data Fusion
- Data may come from multiple sources (sensors, domain experts, etc.) and must be combined systematically.
- Example: a neurologist may order several tests: an MRI scan, an EEG recording, and a blood test.
- A single classifier cannot be used to classify data from different sources (heterogeneous features).

4 Why Do We Combine Classifiers? [2]

Divide and Conquer
- Regardless of the amount of data, certain problems are too difficult for a single classifier to solve.
- Complex decision boundaries can be implemented with ensemble learning by combining classifiers with simpler boundaries.

5 Diversity

Strategy of ensemble systems: create many classifiers and combine their outputs in such a way that the combination improves upon the performance of a single classifier.

Requirements
- The individual classifiers must make errors on different inputs. If the errors are different, a strategic combination of the classifiers can reduce the total error.
- We need classifiers whose decision boundaries are adequately different from those of the others. Such a set of classifiers is said to be diverse.

Classifier diversity can be obtained by:
- using different training data sets to train different classifiers;
- using unstable classifiers;
- using different training parameters (such as different topologies for a neural network);
- using different feature sets (such as the random subspace method).

G. Brown, J. Wyatt, R. Harris, and X. Yao, "Diversity creation methods: a survey and categorisation," Information Fusion, Vol. 6, pp. 5-20, 2005.
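The last diversity-creation method listed above, the random subspace method, is easy to sketch: each classifier is trained on a different random subset of the features. The code below is an illustrative sketch, not from the slides; the function names are my own.

```python
import random

def random_subspaces(n_features, n_classifiers, subspace_size, seed=0):
    """Draw one random feature subset per classifier (no repeats inside a
    subset, independent draws across classifiers)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subspace_size))
            for _ in range(n_classifiers)]

def project(X, feature_idx):
    """Restrict each row of X to the chosen feature subset."""
    return [[row[f] for f in feature_idx] for row in X]

# Each of the 3 classifiers would be trained on its own 2-feature view of X:
subspaces = random_subspaces(n_features=4, n_classifiers=3, subspace_size=2)
X = [[0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 4.0]]
views = [project(X, fs) for fs in subspaces]
```

Because each classifier sees a different projection of the data, their decision boundaries (and therefore their errors) tend to differ, which is exactly the diversity the slide asks for.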

6 Classifier diversity using different training sets

7 Diversity Measures (1)

Pairwise measures (assuming that we have T classifiers). For a pair of classifiers h_i and h_j, let a, b, c, d be the fractions of instances on which:

                 | h_j is correct | h_j is incorrect
  h_i is correct |       a        |        b
  h_i is incorrect |     c        |        d

Correlation (maximum diversity is obtained when ρ = 0):
    ρ_{i,j} = (ad − bc) / sqrt((a + b)(c + d)(a + c)(b + d)),   0 ≤ |ρ| ≤ 1

Q-statistic (maximum diversity is obtained when Q = 0):
    Q_{i,j} = (ad − bc) / (ad + bc),   with the same sign as ρ and |ρ| ≤ |Q|

Disagreement measure (the probability that the two classifiers disagree):
    D_{i,j} = b + c

Double-fault measure (the probability that both classifiers are incorrect):
    DF_{i,j} = d

For a team of T classifiers, a pairwise measure is averaged over all pairs:
    D_avg = (2 / (T(T − 1))) Σ_{i=1}^{T−1} Σ_{j=i+1}^{T} D_{i,j}
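The pairwise measures above can be computed from nothing more than the two classifiers' correctness flags on a common test set. A minimal sketch (illustrative names and data, not the slides' code):

```python
def pairwise_diversity(ci, cj):
    """ci, cj: lists of 0/1 flags, 1 = that classifier was correct on the
    instance. Returns the four pairwise diversity measures."""
    n = len(ci)
    a = sum(1 for x, y in zip(ci, cj) if x == 1 and y == 1) / n  # both correct
    b = sum(1 for x, y in zip(ci, cj) if x == 1 and y == 0) / n  # only i correct
    c = sum(1 for x, y in zip(ci, cj) if x == 0 and y == 1) / n  # only j correct
    d = sum(1 for x, y in zip(ci, cj) if x == 0 and y == 0) / n  # both wrong
    q = (a * d - b * c) / (a * d + b * c)                  # Q-statistic
    rho = (a * d - b * c) / (
        ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5)    # correlation
    return {"Q": q, "rho": rho, "disagreement": b + c, "double_fault": d}

h1 = [1, 1, 0, 1, 0, 1, 1, 0]   # correctness of classifier h_i
h2 = [1, 0, 1, 1, 0, 0, 1, 1]   # correctness of classifier h_j
print(pairwise_diversity(h1, h2))
```

For these toy vectors a = 3/8, b = c = 2/8, d = 1/8, so the disagreement is 0.5 and the double fault 0.125; both Q and ρ come out slightly negative, i.e. the pair is mildly diverse.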

8 Diversity Measures (2)

Non-pairwise measures (assuming that we have T classifiers):
- Entropy measure: based on the assumption that diversity is highest when half of the classifiers are correct on an instance and the remaining ones are incorrect.
- Kohavi-Wolpert variance
- Measure of difficulty

(The slide compares the behaviour of the different diversity measures.)
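The first two non-pairwise measures can be sketched directly from Kuncheva and Whitaker's formulations. Here counts[j] is the number of the T classifiers that are correct on instance j; this is illustrative code, not the slides'.

```python
import math

def entropy_measure(counts, T):
    """1.0 when about half of the T classifiers are correct on each
    instance (maximal diversity), 0.0 when they are unanimous everywhere."""
    n = len(counts)
    return sum(min(c, T - c) for c in counts) / (n * (T - math.ceil(T / 2)))

def kohavi_wolpert_variance(counts, T):
    """Average per-instance variance of the correct/incorrect outcome."""
    n = len(counts)
    return sum(c * (T - c) for c in counts) / (n * T * T)

# Unanimous ensemble (all correct or all wrong): no diversity.
print(entropy_measure([3, 0, 3], T=3))          # 0.0
print(kohavi_wolpert_variance([3, 0, 3], T=3))  # 0.0
# Ensemble split roughly in half on every instance: maximal entropy measure.
print(entropy_measure([1, 2], T=3))             # 1.0
```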

9 Diversity Measures (3)

No Free Lunch Theorem: no classification algorithm is universally superior.
Conclusion: likewise, there is no diversity measure that consistently correlates with higher ensemble accuracy.
Suggestion: in the absence of additional information, the Q-statistic is suggested because of its intuitive meaning and simple implementation.

References:
L. I. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy," Machine Learning, Vol. 51, pp. 181-207, 2003.
R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, "Ensemble diversity measures and their application to thinning," Information Fusion, Vol. 6, pp. 49-62, 2005.

10 Design of Ensemble Systems

Two key components of an ensemble system:
1. Creating an ensemble of weak learners
   - Bagging
   - Boosting
   - Stacked generalization
   - Mixture of experts
2. Combining the classifiers' outputs
   - Majority voting
   - Weighted majority voting
   - Averaging

What is a weak classifier? One that may do only slightly better than random guessing (accuracy near 1 / number of classes). Goal: combine multiple weak classifiers to obtain one at least as accurate as the strongest of them.

Combination rules can be trainable vs. non-trainable, and can operate on class labels vs. continuous outputs.

11 Combination Rule [1]

In ensemble learning, a rule is needed to combine the outputs of the classifiers.

Classifier Selection
- Each classifier is trained to become an expert in some local area of the feature space.
- The combination of classifiers is based on the given feature vector: the classifier that was trained with the data closest to the vicinity of the feature vector is given the highest credit.
- One or more local classifiers can be nominated to make the decision.

Classifier Fusion
- Each classifier is trained over the entire feature space.
- Classifier combination involves merging the individual weak classifiers to obtain a single strong classifier.

12 Combination Rule [2]: Majority Voting

Majority-Based Combiners
- Unanimous voting: all classifiers agree on the class label.
- Simple majority: more than half of the classifiers agree on the class label.
- Plurality (majority) voting: the class label that receives the highest number of votes wins.

Weight-Based Combiners
- Collect votes from the pool of classifiers for each training example.
- Decrease the weight associated with each classifier that guessed wrong.
- The combiner predicts the weighted-majority label.
- How do we assign the weights? Based on training error, or using a validation set (an estimate of the classifier's future performance).

Other combination rules: behavior knowledge space, Borda count, mean rule, weighted average.
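The two label-based rules above fit in a few lines. The classifier weights in the example are illustrative; as the slide notes, in practice they would be derived from training or validation error.

```python
from collections import Counter

def majority_vote(labels):
    """Plurality voting: return the label with the most votes."""
    return Counter(labels).most_common(1)[0][0]

def weighted_majority_vote(labels, weights):
    """Add each classifier's weight to the score of its predicted label."""
    scores = {}
    for label, w in zip(labels, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

votes = ["spam", "ham", "spam"]
print(majority_vote(votes))                            # spam (2 votes vs 1)
print(weighted_majority_vote(votes, [0.2, 0.9, 0.3]))  # ham (0.9 vs 0.5)
```

Note how the weighted rule can overturn the plain majority: the single highly weighted classifier outvotes the two lightly weighted ones.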

13 Bootstrap Aggregating (Bagging) [1]

Bagging is an application of bootstrap sampling.
- Given: a set D containing m training examples.
- Create S[i] by drawing m examples at random with replacement from D.
- S[i] is of size m, but is expected to contain only about 63.2% of the distinct examples of D (each example is left out with probability (1 − 1/m)^m ≈ 0.368).

Bagging
- Create k bootstrap samples S[1], S[2], ..., S[k].
- Train a distinct inducer on each S[i] to produce k classifiers.
- Classify a new instance by classifier vote (majority vote).

Variations
- Random forests: built from decision trees whose parameters (e.g., the features considered at each split) vary randomly.
- Pasting small votes (for large data sets):
  - RVotes: creates the data sets randomly.
  - IVotes: creates the data sets based on the importance of the instances, from easy to hard.
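The bagging procedure above can be sketched as follows. The base learner here is a deliberately toy one-dimensional classifier invented for illustration (a threshold halfway between the two class means); any learner with fit/predict methods could be substituted.

```python
import random
from collections import Counter

class MeanThreshold:
    """Toy base learner for 1-D binary data: threshold midway between the
    two class means (an illustrative stand-in for a real inducer)."""
    def fit(self, X, y):
        ones = [x for x, t in zip(X, y) if t == 1]
        zeros = [x for x, t in zip(X, y) if t == 0]
        if not ones:            # degenerate bootstrap sample: one class only
            self.t = float("inf")
        elif not zeros:
            self.t = float("-inf")
        else:
            self.t = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
        return self

    def predict_one(self, x):
        return 1 if x >= self.t else 0

def bagging_fit(X, y, k, make_learner, seed=0):
    """Train k learners, each on a bootstrap sample (m draws w/ replacement)."""
    rng = random.Random(seed)
    m = len(X)
    models = []
    for _ in range(k):
        idx = [rng.randrange(m) for _ in range(m)]
        models.append(make_learner().fit([X[i] for i in idx],
                                         [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Majority vote of the ensemble on one instance."""
    return Counter(m.predict_one(x) for m in models).most_common(1)[0][0]

X = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]
ensemble = bagging_fit(X, y, k=7, make_learner=MeanThreshold)
print(bagging_predict(ensemble, 1), bagging_predict(ensemble, 12))
```

Each bootstrap sample shifts the learned threshold a little, and the majority vote smooths out those fluctuations, which is the variance-reduction effect bagging is used for.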

14 Bagging [2]

15 Bagging: Pasting small votes (IVotes)

16 Boosting

Schapire proved that a weak learner (an algorithm that generates classifiers that merely do better than random guessing) can be turned into a strong learner that generates a classifier that correctly classifies all but an arbitrarily small fraction of the instances.

In boosting, the training data are ordered from easy to hard: easy samples are classified first, and hard samples are classified later.
- The first classifier is created as in bagging.
- The second classifier is trained on a training set of which only half is correctly classified by the first classifier, and the other half is misclassified.
- The third classifier is trained on the instances on which the first two disagree.

Variations: AdaBoost.M1, AdaBoost.R

17 Boosting

18 AdaBoost.M1
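A compact sketch of the AdaBoost idea for binary labels in {−1, +1}, using weighted decision stumps on a single feature. One caveat: the update below is the standard binary AdaBoost form (α = ½ ln((1 − ε)/ε)); AdaBoost.M1 as stated by Freund and Schapire works with β = ε/(1 − ε) and hypothesis weights ln(1/β), which is equivalent up to a constant factor. All names and data are illustrative.

```python
import math

def stump_fit(X, y, w):
    """Best threshold/polarity stump under instance weights w."""
    best = (float("inf"), None, None)
    for t in X:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (1 if sign * (xi - t) >= 0 else -1) != yi)
            if err < best[0]:
                best = (err, t, sign)
    return best  # (weighted error, threshold, polarity)

def adaboost_fit(X, y, rounds=10):
    m = len(X)
    w = [1.0 / m] * m                       # uniform initial weights
    ensemble = []                           # (alpha, threshold, polarity)
    for _ in range(rounds):
        eps, t, sign = stump_fit(X, y, w)
        if eps <= 0 or eps >= 0.5:          # perfect, or no longer weak: stop
            if eps <= 0:
                ensemble.append((1.0, t, sign))
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, t, sign))
        # Reweight: boost the weight of misclassified instances, renormalize.
        w = [wi * math.exp(-alpha * yi * (1 if sign * (xi - t) >= 0 else -1))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    s = sum(alpha * (1 if sign * (x - t) >= 0 else -1)
            for alpha, t, sign in ensemble)
    return 1 if s >= 0 else -1

X = [1, 2, 3, 10, 11, 12]
y = [-1, -1, -1, 1, 1, 1]
ens = adaboost_fit(X, y)
print(adaboost_predict(ens, 2), adaboost_predict(ens, 11))  # -1 1
```

The reweighting step is what realizes the easy-to-hard ordering of the previous slide: instances the current ensemble gets wrong carry more weight in the next round, so later stumps concentrate on the hard cases.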

19 Stacked Generalization

Stacked Generalization (Stacking): Intuitive Idea
- Train multiple level-0 learners (may be neural networks, decision trees, etc.), each on a subsample of D.
- Train a combiner (level-1 learner) on a validation segment, using the level-0 learners' predictions as its inputs.

(The slide shows a stacked generalization network: the inducers at the bottom receive the inputs x and emit predictions y, which feed one or more combiners whose output is the ensemble prediction.)
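A minimal stacking sketch under toy assumptions: two level-0 learners (a mean-threshold classifier and a 1-nearest-neighbour classifier, both invented here for illustration) are trained on a training split, and the level-1 combiner is then trained on their predictions over a held-out validation split, as the slide prescribes. The table-based combiner used here is essentially the behavior-knowledge-space idea mentioned on slide 12.

```python
from collections import Counter, defaultdict

class MeanThreshold:
    """Toy level-0 learner: threshold midway between the two class means."""
    def fit(self, X, y):
        ones = [x for x, t in zip(X, y) if t == 1]
        zeros = [x for x, t in zip(X, y) if t == 0]
        self.t = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
        return self
    def predict_one(self, x):
        return 1 if x >= self.t else 0

class OneNN:
    """Toy level-0 learner: label of the nearest training point."""
    def fit(self, X, y):
        self.data = list(zip(X, y))
        return self
    def predict_one(self, x):
        return min(self.data, key=lambda p: abs(p[0] - x))[1]

class TableCombiner:
    """Level-1 learner: maps each tuple of level-0 predictions to the label
    most often associated with that tuple on the validation split."""
    def fit(self, pred_tuples, y):
        buckets = defaultdict(Counter)
        for p, label in zip(pred_tuples, y):
            buckets[p][label] += 1
        self.table = {p: c.most_common(1)[0][0] for p, c in buckets.items()}
        self.default = Counter(y).most_common(1)[0][0]  # unseen tuples
        return self
    def predict_one(self, p):
        return self.table.get(p, self.default)

# Level 0: train on the training split.
X_tr, y_tr = [1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]
level0 = [MeanThreshold().fit(X_tr, y_tr), OneNN().fit(X_tr, y_tr)]

# Level 1: train the combiner on level-0 predictions over a validation split.
X_val, y_val = [2.5, 3.5, 9.5, 11.5], [0, 0, 1, 1]
val_preds = [tuple(m.predict_one(x) for m in level0) for x in X_val]
combiner = TableCombiner().fit(val_preds, y_val)

def stacked_predict(x):
    return combiner.predict_one(tuple(m.predict_one(x) for m in level0))

print(stacked_predict(1.5), stacked_predict(12.5))  # 0 1
```

Training the combiner on a separate validation split is the key point: it lets the level-1 learner see how the level-0 learners behave on data they were not fitted to, rather than on their (optimistic) training predictions.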

20 Mixture of Experts

Intuitive Idea
- Train multiple expert learners (may be neural networks, decision trees, etc.), each on a subsample of D.
- A gating network (usually itself a neural network) weights the experts' outputs.

(The slide shows the mixture-of-experts architecture: the input x feeds the expert networks, whose outputs y_1, y_2 are combined in a Σ node using the gating weights g_1, g_2 produced by the gating network.)
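The Σ node in the diagram computes a gate-weighted sum of the expert outputs, with the gating weights normalized by a softmax. In the sketch below the experts and the gate are fixed toy functions rather than trained networks; all names are illustrative.

```python
import math

def softmax(zs):
    """Normalize raw gate scores into weights that sum to 1."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def mixture_predict(x, experts, gate):
    """Gate-weighted combination of the expert outputs: y = sum_i g_i(x) y_i(x)."""
    g = softmax(gate(x))            # gating weights g_1, g_2, ...
    ys = [f(x) for f in experts]    # expert outputs y_1, y_2, ...
    return sum(gi * yi for gi, yi in zip(g, ys))

experts = [lambda x: 2 * x, lambda x: -x]
neutral_gate = lambda x: [0.0, 0.0]     # equal weights: plain average
print(mixture_predict(1.0, experts, neutral_gate))  # 0.5 * 2 + 0.5 * (-1) = 0.5
```

In a trained mixture of experts the gate's scores depend on x, so different experts dominate in different regions of the input space, which is the classifier-selection idea of slide 11 made differentiable.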

21 Cascading

- Cascade the learners in order of increasing complexity.
- Use classifier d_j only if the preceding (simpler) ones are not confident.
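The cascade rule above is just a fall-through loop over increasingly complex classifiers. In this sketch each stage returns a (label, confidence) pair; the confidence rule and the threshold are illustrative assumptions, not from the slides.

```python
def cascade_predict(stages, x, threshold=0.8):
    """stages: list of functions x -> (label, confidence in [0, 1]),
    ordered from cheapest/simplest to most complex."""
    label = None
    for predict in stages:
        label, conf = predict(x)
        if conf >= threshold:
            return label          # this stage is confident: stop here
    return label                  # fall back to the last stage's answer

# Toy stages: stage 1 is only confident on small inputs, stage 2 always is.
stage1 = lambda x: ("even" if x % 2 == 0 else "odd", 0.95 if x < 10 else 0.5)
stage2 = lambda x: ("even" if x % 2 == 0 else "odd", 1.0)
print(cascade_predict([stage1, stage2], 4))   # even (decided by stage 1)
print(cascade_predict([stage1, stage2], 13))  # odd (deferred to stage 2)
```

The payoff is cost: most instances are handled by the cheap early stages, and the expensive classifier d_j is evaluated only for the minority of inputs on which its predecessors are unsure.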

22 Reading

T. G. Dietterich, "Machine Learning Research: Four Current Directions," AI Magazine, 18(4):97-136, 1997.
T. G. Dietterich, "Ensemble Methods in Machine Learning," Multiple Classifier Systems, 2000.
R. Meir and G. Rätsch, "An Introduction to Boosting and Leveraging," Advanced Lectures on Machine Learning, 2003.
D. Opitz and R. Maclin, "Popular Ensemble Methods: An Empirical Study," Journal of Artificial Intelligence Research, 11:169-198, 1999.
L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. New York, NY: Wiley-Interscience, 2004.