CptS 570 Machine Learning
School of EECS, Washington State University
- No one learner is always best (No Free Lunch)
- A combination of learners can overcome individual weaknesses
- How to choose learners that complement one another?
- How to combine their outputs to maximize accuracy?
- Ensemble: weighted majority vote of several learners
Ways to generate diverse base learners:
- Different algorithms (e.g., parametric vs. non-parametric)
- Different parameter settings (e.g., random initial weights in a neural network)
- Different input representations
  - E.g., feature selection
  - E.g., multi-modal training data (e.g., audio and video)
- Different training sets
  - Bagging: different samples of the same training set
  - Boosting/cascading: weight more heavily the examples missed by the previously learned classifier
  - Partitioning: mixture of experts
Ways to combine learner outputs:
- All learners generate an output: voting, stacking
- One or a few learners generate the output, chosen by a gating function: mixture of experts
- Learner output weighted by accuracy and complexity: cascading, boosting
- L learners, K outputs; $d_{ji}(x)$ is the prediction of learner $j$ for output $i$
- Regression: $y_i = \sum_{j=1}^{L} w_j d_{ji}(x)$, where $w_j \ge 0$ and $\sum_{j=1}^{L} w_j = 1$
- Classification: choose $C_i$ if $y_i = \max_{k=1}^{K} y_k$
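A minimal sketch of this weighted vote in Python; the learner outputs and weights below are made-up placeholder values:

```python
import numpy as np

# d[j, i] = prediction of learner j for class i on one instance x
# (hypothetical posteriors from L = 3 learners over K = 2 classes).
d = np.array([[0.7, 0.3],
              [0.6, 0.4],
              [0.2, 0.8]])

# Weights w_j >= 0 with sum_j w_j = 1; uniform weights = majority voting.
w = np.array([1/3, 1/3, 1/3])

y = w @ d                # y_i = sum_j w_j * d_ji
choice = np.argmax(y)    # choose C_i such that y_i = max_k y_k
print(y, "-> class", choice)
```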
Choosing the weights $w_j$:
- Majority voting: $w_j = 1/L$
- If a learner produces posteriors $P(C_i \mid x)$, then use these as weights after normalization
- Set $w_j$ to the accuracy of learner $j$ on a validation set
- Learn the weights (stacked generalization)
Example: [figure]
Bayesian view:
$$P(C_i \mid x) = \sum_{\text{all models } M_j} P(C_i \mid x, M_j)\, P(M_j)$$
where $d_{ji} = P(C_i \mid x, M_j)$ and $w_j = P(M_j)$
- Majority voting implies a uniform prior over models
- Can't include all models, so choose a few with suspected high probability
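A toy sketch of this model averaging, assuming we stand in for the model prior $P(M_j)$ with normalized validation accuracies (a heuristic, not a true posterior):

```python
import numpy as np

# Hypothetical per-model class posteriors P(C_i | x, M_j) for one instance x.
p_class_given_model = np.array([[0.90, 0.10],
                                [0.55, 0.45],
                                [0.30, 0.70]])

# Heuristic stand-in for P(M_j): validation accuracies, normalized to sum to 1.
val_acc = np.array([0.85, 0.70, 0.60])
p_model = val_acc / val_acc.sum()

# P(C_i | x) = sum over models of P(C_i | x, M_j) * P(M_j)
p_class = p_model @ p_class_given_model
print(p_class, "-> class", np.argmax(p_class))
```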
- Assuming each learner is independent and better than random
- Then adding more learners will maintain bias, but reduce variance (i.e., error):
$$E[y] = E\left[\frac{1}{L}\sum_j d_j\right] = \frac{1}{L}\, L\, E[d] = E[d]$$
$$\mathrm{Var}(y) = \mathrm{Var}\left(\frac{1}{L}\sum_j d_j\right) = \frac{1}{L^2}\,\mathrm{Var}\left(\sum_j d_j\right) = \frac{1}{L^2}\, L\, \mathrm{Var}(d) = \frac{1}{L}\,\mathrm{Var}(d)$$
General case:
$$\mathrm{Var}(y) = \mathrm{Var}\left(\frac{1}{L}\sum_j d_j\right) = \frac{1}{L^2}\left[\sum_j \mathrm{Var}(d_j) + 2\sum_{i<j} \mathrm{Cov}(d_i, d_j)\right]$$
- If the learners are positively correlated, variance (and error) increases
- If the learners are negatively correlated, variance (and error) decreases, but bias increases
- Voting is a form of smoothing that maintains low bias but decreases variance
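A quick simulation of the independent case; the noise model below (unit-variance Gaussian errors around a true value of 1.0) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
L, trials = 10, 100_000

# Each of L independent learners outputs the true value 1.0 plus noise (Var = 1).
d = 1.0 + rng.standard_normal((trials, L))

y = d.mean(axis=1)           # ensemble output: average of the L learners
print(d[:, 0].var())         # ~1.0: variance of a single learner
print(y.var())               # ~0.1: Var(d)/L for L = 10
```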
Bagging:
- Given a training set X of size N
- Generate L different training sets, each of size N, by sampling with replacement from X (called bootstrapping)
- Use one learning algorithm to learn L classifiers from the different training sets
- The learning algorithm must be unstable, i.e., small changes in the training set result in different classifiers (e.g., decision trees, neural networks)
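A minimal bagging sketch, assuming scikit-learn decision trees on synthetic data; for brevity it votes and scores on the training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, random_state=0)
N, L = len(X), 25

trees = []
for _ in range(L):
    idx = rng.integers(0, N, size=N)   # bootstrap: N draws with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over the L trees.
votes = np.stack([t.predict(X) for t in trees])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```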
Boosting:
- Similar to bagging, but the L training sets are chosen to increase negative correlation
- Use one learning algorithm to learn L classifiers
- The training set for classifier j is biased toward examples missed by classifier j-1
- The learning algorithm should be weak (not too accurate)
- Adaptive Boosting (AdaBoost)
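A compact AdaBoost sketch with decision stumps as the weak learners, on the same kind of synthetic data (assumes the weighted error stays strictly between 0 and 1):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = 2 * y01 - 1                    # AdaBoost convention: labels in {-1, +1}
N, L = len(X), 20

w = np.full(N, 1 / N)              # example weights, initially uniform
stumps, alphas = [], []
for _ in range(L):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()               # weighted error of this weak learner
    alpha = 0.5 * np.log((1 - err) / err)  # its vote weight
    w *= np.exp(-alpha * y * pred)         # up-weight the examples it missed
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final classifier: sign of the alpha-weighted vote of the stumps.
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(F) == y).mean())
```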
[Figure: each point represents 1 of 27 test domains. From Dietterich, "Machine Learning Research: Four Current Directions," AI Magazine, Winter 1997.]
Mixture of experts:
- The weights depend on the test instance: $y = \sum_{j=1}^{L} w_j(x)\, d_j(x)$
- Competitive learning: the weight $w_j(x)$ is driven toward 1 (and the others toward 0) for the learner best at the region near $x$
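A sketch of this combination rule with a softmax gate; the two 1-D experts and the gating parameters are hypothetical and fixed, whereas in a real mixture of experts both are learned jointly:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two hypothetical 1-D experts, each intended for a different region of x.
experts = [lambda x: 2 * x,      # expert 0
           lambda x: x + 3]      # expert 1

# Hypothetical gating parameters: per-expert gate scores, linear in x.
V = np.array([-1.0, 1.0])

def moe_predict(x):
    w = softmax(V * x)           # w_j(x): instance-dependent weights
    return sum(wj * e(x) for wj, e in zip(w, experts))

print(moe_predict(-2.0), moe_predict(2.0))
```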
Stacking:
- The combining function $f(\cdot)$ is learned
- Train $f$ on data not used to train the base learners
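A minimal stacking sketch, assuming two scikit-learn base learners and a logistic-regression combiner trained on a held-out split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, random_state=0)
# Base learners and the combiner f are trained on disjoint splits.
X_base, X_meta, y_base, y_meta = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)

bases = [DecisionTreeClassifier().fit(X_base, y_base),
         KNeighborsClassifier().fit(X_base, y_base)]

# Meta-features: base-learner posteriors on data the bases never saw.
Z = np.column_stack([b.predict_proba(X_meta)[:, 1] for b in bases])
f = LogisticRegression().fit(Z, y_meta)
print("combiner accuracy:", f.score(Z, y_meta))
```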
- The ensemble need not be fixed: it can be modified to improve accuracy or to reduce the correlation of the base learners
- Subset selection: add/remove base learners while performance improves
- Meta-learners: stack learners to construct new features
Cascading:
- Use classifier $d_j$ only if the previous classifiers lacked confidence
- Order the classifiers by increasing complexity
- Differs from boosting: both errant and uncertain examples are passed to the next learner
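A two-stage cascade sketch; the confidence threshold and the cheap-then-expensive model pair are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
# Cascade ordered by increasing complexity: cheap model first, costly fallback.
simple = LogisticRegression(max_iter=1000).fit(X, y)
fallback = RandomForestClassifier(random_state=0).fit(X, y)
THRESHOLD = 0.9    # hypothetical confidence cutoff

def cascade_predict(x):
    p = simple.predict_proba(x.reshape(1, -1))[0]
    if p.max() >= THRESHOLD:                 # confident: stop at stage 1
        return int(np.argmax(p))
    return int(fallback.predict(x.reshape(1, -1))[0])   # else defer to stage 2

print(cascade_predict(X[0]))
```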
- Typically, the hypothesis space H does not contain the target function f
- Weighted combinations of several approximations may represent classifiers outside of H
[Figures: decision surfaces defined by individually learned decision trees, vs. the decision surface defined by a vote over the learned decision trees.]
The Netflix Prize:
- $1M to the team improving Netflix's movie recommender by 10%
- Won by team "BellKor's Pragmatic Chaos," which combined classifiers from 3 teams: BellKor, BigChaos, Pragmatic Theory
- Second place, "The Ensemble," combined classifiers from 23 other teams
- The solutions were effectively ensembles of over 800 classifiers
- www.netflixprize.com
[Figure from Toscher et al., "The BigChaos Solution to the Netflix Grand Prize," 2009.]
Summary:
- Combining learners can overcome the weaknesses of individual learners
- Base learners must do better than random and have uncorrelated errors
- Ensembles typically take a majority vote of the base classifiers; variants include boosting and stacking
- Application to recommender systems: the Netflix Prize