Ensemble Learning

Overview
"Using multiple models/individuals to obtain better predictive performance than could be obtained from the constituent models."
- We use the same technique routinely in our lives by asking the opinion of several experts before making a decision.
- Diversity is key.
- Used for classification or regression (and many other things).
- Combines the predictions of a collection of hypotheses.
- Example: generate a hundred different decision trees from the same training set and have them vote on the best classification for a new example.

Movie Trivia
- One person versus the world: illustrates ensemble learning.
- Who wins? How does the "world" come to a consensus?

Motivation
- Improving performance.

Model selection
- Which model/parameters/etc. to use? Low training/testing error can be misleading, and what about models that all have the same testing performance (possibly infinitely many of them)?
- An ensemble prevents a poor model choice and the consequences of one. It may not always beat the best single classifier, but it reduces the risk of a bad choice, and sometimes it does beat it.

Confidence estimation
- Falls out naturally from voting.
- Useful to know how sure the ensemble is about a classification.

Data surplus or shortage
- Too much data: partition it and train each ensemble member on one piece.
- Too little data: use bootstrapping to train multiple individuals on subsets sampled with replacement.

Appropriate solution outside the model's hypothesis space
- Linear classifiers cannot learn a circular classification (an ensemble of them can).
- Circular classifiers cannot learn a blob classification (an ensemble of them can).
- Example: three linear threshold hypotheses, each classifying positively on the unshaded side. We classify an example as positive only if all three classify it positively. The resulting hypothesis is the triangle, which is not expressible in the original hypothesis space (see the sketch at the end of this motivation list).

Data fusion
- Data comes from various sources and is unrelated.
- It cannot be used to train a single classifier because the sources have different features.
- Even with the same features, a single classifier probably won't yield good results.
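Returning to the three-linear-thresholds example above: the following minimal sketch (plain Python/NumPy; the particular lines and test points are illustrative, not from the slides) shows three linear threshold hypotheses whose unanimous vote carves out a triangular positive region that no single linear classifier in the original space can express.

```python
# A minimal sketch: three half-plane classifiers combined by unanimous vote.
import numpy as np

# Each hypothesis classifies positively when w . x + b > 0.
# These three half-planes intersect in a triangle around the origin.
hypotheses = [
    (np.array([0.0, 1.0]), 1.0),    # y > -1
    (np.array([1.0, -1.0]), 1.0),   # x - y > -1
    (np.array([-1.0, -1.0]), 1.0),  # -x - y > -1
]

def h(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else 0

def ensemble(x):
    # Positive only if every member classifies positively (the "triangle").
    return 1 if all(h(x, w, b) for w, b in hypotheses) else 0

print(ensemble(np.array([0.0, 0.0])))  # inside the triangle  -> 1
print(ensemble(np.array([5.0, 5.0])))  # outside the triangle -> 0
```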
Example: majority voting
- Take an ensemble of M = 5 hypotheses and combine them by majority vote.
- The ensemble misclassifies only if at least three of the five hypotheses misclassify (hopefully less likely than a single hypothesis erring).
- Suppose each hypothesis h_i misclassifies a randomly chosen example with probability e, and assume the errors are independent.
- If e is small, the probability that the ensemble misclassifies is minuscule. With M = 5 and e = 0.1, any particular three hypotheses all err with probability (1/10)^3 = 1/1000; summing over every way in which at least three of the five can err gives about 0.0086, i.e. less than 1 in 100 (Exercise 18.14 in Russell & Norvig). A numerical check appears after the baseline example below.
- In practice, errors are highly correlated rather than independent: many hypotheses tend to be misled in the same way by the data. The goal is to reduce error correlation.

Ingredients
- Diversity (most important): errors on different examples, uncorrelated errors. Obtain it by using different models (MLPs, decision trees, nearest neighbor classifiers, support vector machines), training on different subsets of the data, or using different training parameters (for example, a neural network's weight initialization or number of neurons).
- Simplicity: we normally want to use simple models (for obvious reasons); ensembles let us do so.
- Efficiency: more individuals to train.

Ensemble methods: none (baseline, a single classifier)
- Original data (X, Y -> class): (1, 3) -> 0, (2, 0) -> 1.
- Learned split: X > 1.
- New data: (1, 4) -> 0, (3, 2) -> 1.
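A quick numerical check of the majority-vote error argument above (it relies on the stated assumption that the errors are independent):

```python
# With M = 5 voters that each err with probability e = 0.1, the majority vote
# is wrong only if at least 3 of the 5 err.
from math import comb

M, e = 5, 0.1
p_ensemble_error = sum(comb(M, k) * e**k * (1 - e)**(M - k) for k in range(3, M + 1))
print(p_ensemble_error)  # ~0.00856, i.e. less than 1 in 100 (vs. 0.1 for a single model)
```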
Bagging (bootstrap aggregation)
- Reduces variance and overfitting.
- Usually used with decision trees (but can be used with any model).
- Needs many comparable classifiers.
- One of the first effective methods, and one of the simplest.

Method
1. Create M new subsets of the training data by sampling with replacement (bootstrap).
2. Train M models, one per subset.
3. Combine the outputs by averaging (regression) or voting (classification).

Example (same toy data, three bootstrap subsets)
- Subset 1 learns the split X > 1, subset 2 learns Y > 1, subset 3 learns X > 2.
- New data (1, 4): votes (0, 1, 0) -> majority 0.
- New data (3, 2): votes (1, 1, 1) -> majority 1.
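A minimal bagging sketch along the lines of the method above. The toy data, the choice of M = 5, and the use of scikit-learn decision stumps are illustrative assumptions, not taken from the slides.

```python
# Bootstrap-sample the training set M times, fit one model per sample,
# and combine the outputs by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.array([[1, 3], [2, 0], [0, 2], [3, 1], [1, 1], [3, 3]])
y = np.array([0, 1, 0, 1, 0, 1])

M = 5
models = []
for _ in range(M):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    model = DecisionTreeClassifier(max_depth=1)  # a single decision stump
    model.fit(X[idx], y[idx])
    models.append(model)

X_new = np.array([[1, 4], [3, 2]])
votes = np.array([m.predict(X_new) for m in models])  # shape (M, 2)
prediction = (votes.sum(axis=0) > M / 2).astype(int)  # majority vote
print(prediction)  # typically [0 1] on this toy data
```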
Boosting
Using a collection of weak learners to make a strong learner.

General approach
- Iteratively generate weak learners.
- Each example in the training set has a weight; the higher the weight, the more importance it gets during training.
- Combine the weak learners, weighted by their performance, into a strong learner.
- Boosting variants differ in how the weighting is done.

AdaBoost

Method
1. Create a weak classifier (slightly better than random guessing).
2. Iteratively train weak classifiers on a dataset in which points misclassified by the previous model are weighted more heavily.
3. Weight each model according to its success.
4. Combine the outputs using voting/averaging with those weights.

Information
- The most popular boosting method; "one of the most powerful learning ideas introduced in the last twenty years."
- Advantages: reuses the same training set (so it can be small); can combine any number of base learners.
- Disadvantage: cannot train in parallel.

Example (same toy data, two rounds)
- Classifier 1 (split Y > 3) gets model weight 3/5; after this round its correctly classified training points carry example weight 0.5 and its misclassified points carry weight 2.
- Classifier 2 (split X > 1), trained on the re-weighted data, gets model weight 5/5.
- New data (1, 4): classifier 1 votes 1 with weight 3/5, classifier 2 votes 0 with weight 5/5 -> 0.
- New data (3, 2): classifier 1 votes 0 with weight 3/5, classifier 2 votes 1 with weight 5/5 -> 1.
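A compact sketch of the AdaBoost method above (discrete AdaBoost with labels in {-1, +1}; the toy data, number of rounds, scikit-learn stumps, and the standard alpha = 0.5 * log((1 - err) / err) weighting are illustrative assumptions rather than details from the slides).

```python
# Each round re-weights the examples the previous stump got wrong; the final
# prediction is a performance-weighted vote of all stumps.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1, 3], [2, 0], [0, 2], [3, 1], [1, 1], [3, 3]])
y = np.array([-1, 1, -1, 1, -1, 1])

M = 3
N = len(X)
w = np.full(N, 1 / N)              # start with uniform example weights
stumps, alphas = [], []
for _ in range(M):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
    alpha = 0.5 * np.log((1 - err) / err)  # model weight: better -> larger
    w *= np.exp(-alpha * y * pred)         # up-weight misclassified examples
    w /= np.sum(w)
    stumps.append(stump)
    alphas.append(alpha)

X_new = np.array([[1, 4], [3, 2]])
scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
print(np.sign(scores))  # weighted vote; typically [-1  1] on this toy data
```

Note how the exp(-alpha * y * pred) update raises the weight of misclassified points and lowers the weight of correctly classified ones, which is exactly step 2 of the method.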
Demo (AdaBoost)
http://cseweb.ucsd.edu/~yfreund/adaboost/index.html
- Settings: prediction sum, training set.
- Hypothesis space: decision stumps (a single vertical or horizontal threshold line).

Stacking (stacked generalization)

Information
- Makes good use of the training data.
- Uses a meta-learner.
- An attractive idea, but less widely used than bagging and boosting.
- Can be (and normally is) used to combine models of different types (unlike bagging and boosting).

Method
1. Split the data into two disjoint sets (training and testing).
2. Train several base learners on the first set.
3. Test the base learners on the second set.
4. Using the predictions from step 3 as inputs, and the correct responses as outputs, train a higher-level learner.
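A minimal stacking sketch following the four steps above. The particular base learners, the meta-learner, and the synthetic data are illustrative assumptions.

```python
# Base learners of different types are trained on one split; their predictions
# on a second, disjoint split become the input features for a higher-level learner.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy target

X_a, y_a = X[:100], y[:100]               # set 1: train the base learners
X_b, y_b = X[100:], y[100:]               # set 2: train the meta-learner

base_learners = [
    DecisionTreeClassifier(max_depth=2).fit(X_a, y_a),
    KNeighborsClassifier(n_neighbors=3).fit(X_a, y_a),
    LogisticRegression().fit(X_a, y_a),
]

# Base predictions on the held-out set become the meta-learner's input features.
meta_features = np.column_stack([m.predict(X_b) for m in base_learners])
meta_learner = LogisticRegression().fit(meta_features, y_b)

X_new = np.array([[1.0, 1.0], [-1.0, -1.0]])
new_meta = np.column_stack([m.predict(X_new) for m in base_learners])
print(meta_learner.predict(new_meta))     # typically [1 0] on this toy data
```

The disjoint split in step 1 matters: the meta-learner should see base-learner predictions on data those learners did not train on, otherwise it learns from overly optimistic inputs.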
Homework & Project
[Described in separate document.]

Resources
- Wikipedia article: http://en.wikipedia.org/wiki/ensemble_learning
- Scholarpedia article: http://www.scholarpedia.org/article/ensemble_learning
- Russell & Norvig, Artificial Intelligence: A Modern Approach, Section 18.4.
- Ensemble learning survey: Martin Sewell, "Ensemble Learning", 2008.
- AdaBoost demo: http://cseweb.ucsd.edu/~yfreund/adaboost/index.html