Combining Multiple Models


Combining Multiple Models

Lecture Outline:
- Bagging
- Boosting
- Stacking
- Using Unlabeled Data

Reading: Chapter 7.5, Witten and Frank, 2nd ed.; Nigam, McCallum, Thrun & Mitchell, "Text Classification from Labeled and Unlabeled Data using EM", Machine Learning, 39, 2000.

COM3250

Combining Multiple Models

- When making critical decisions, people usually consult several experts rather than just one.
- A model generated by an ML technique over some training data can be viewed as an expert.
- It is natural to ask: can we combine the judgements of multiple models to get a decision that is more reliable than that of any single one on its own?
- The answer is yes (though not always).
- A disadvantage is that the resulting combined models may be hard to understand and analyse.

Why Combining Models Works

- Suppose (ideally) we have an infinite number of independent training sets of the same size, from which we train an infinite number of classifiers (using one learning scheme), which are used to classify a test instance via majority vote.
- Such a combined classifier will still make errors, depending on how well the ML method fits the problem and on noise in the data.
- If we average the error rate of the combined classifier across an infinite number of independently chosen test examples, we arrive at the bias of the learning algorithm for the learning problem: the residual error that cannot be eliminated regardless of the number of training sets.
- A second source of error arises from the use, in practice, of finite data sets, which inevitably are not fully representative of the entire instance population.
- The average of this error over all training sets of a given size and all test sets is the variance of the learning method for the problem.
- Total error is the sum of bias and variance (the bias-variance decomposition).
- Combining classifiers reduces the variance component of the error.
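The variance-reduction effect of majority voting can be illustrated with a small simulation. This is a minimal sketch, not from the lecture: it assumes each of 11 classifiers is independently correct with probability 0.7, and estimates how often a strict majority of them is correct.

```python
import random

random.seed(0)

def vote_accuracy(p_correct, n_voters, trials=10000):
    """Estimate the accuracy of a majority vote over n_voters independent
    classifiers, each correct with probability p_correct."""
    wins = 0
    for _ in range(trials):
        # count how many of the voters classify this instance correctly
        correct = sum(random.random() < p_correct for _ in range(n_voters))
        if correct > n_voters / 2:      # strict majority is correct
            wins += 1
    return wins / trials

single = vote_accuracy(0.7, 1)
ensemble = vote_accuracy(0.7, 11)
print(single, ensemble)    # the 11-voter ensemble is clearly more accurate
```

The independence assumption is the idealisation in the argument above; in practice classifiers trained on overlapping data make correlated errors, so the gain is smaller.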

Bagging

- Stands for "bootstrap aggregating".
- A process whereby a single combined classifier is constructed from a number of classifiers.
- Each classifier is learned by applying a single learning scheme to multiple artificial training datasets that are derived from a single, original training dataset.
- The artificial datasets are obtained by randomly sampling with replacement from the original dataset, creating new datasets of the same size.
- The sampling procedure deletes some instances and replicates others.
- E.g. a decision tree learner could be applied to k artificial datasets derived by this random sampling procedure, resulting in k decision trees.
- The combined classifier works by applying each of the learned classifiers (e.g. the k decision trees) to novel instances and deciding their classification by majority vote.
- For numeric prediction, final values are determined by averaging classifier outputs.

A Bagging Algorithm

Model generation:
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances with replacement from the training data.
    Apply the learning algorithm to the sample.
    Store the resulting model.

Classification:
  For each of the t models:
    Predict the class of the instance using the model.
  Return the class that has been predicted most often.

Bagging produces a combined model that often performs significantly better than a single model built from the original data set, and never performs substantially worse.
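The algorithm above can be sketched directly in Python. This is an illustrative sketch, not lecture code: `learn` stands for any learning scheme that maps a dataset to a model, and `majority_class_learner` is a deliberately trivial stand-in learner invented for the example.

```python
import random
from collections import Counter

def bagging_predict(train, learn, x, t=10, seed=42):
    """Bagging sketch: train t models on bootstrap samples of `train`,
    then classify instance x by majority vote.  `learn` maps a dataset
    to a model (a callable from instance to class label)."""
    rng = random.Random(seed)
    n = len(train)
    models = []
    for _ in range(t):
        # sample n instances with replacement from the training data
        sample = [train[rng.randrange(n)] for _ in range(n)]
        models.append(learn(sample))
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]   # class predicted most often

# Toy "learner": always predict the majority class of its sample.
def majority_class_learner(sample):
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

data = [(0, "a"), (1, "a"), (2, "a"), (3, "a"), (4, "b")]
pred = bagging_predict(data, majority_class_learner, x=99)
print(pred)   # "a"
```

Replacing the toy learner with a real one (e.g. a decision tree inducer) gives bagged trees as described above.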

Randomisation

- Bagging generates an ensemble of classifiers by introducing randomness into the learner's input.
- Some learning algorithms have randomness built in. For example, perceptrons start out with randomly assigned connection weights which are then adjusted during training. One way to make such algorithms more stable is to run them several times with different random number seeds and combine the classifier predictions by voting/averaging.
- A random element can be added to most learning algorithms. E.g. for decision trees, instead of picking the best attribute to split on at each node, randomly pick one of the best n attributes.
- Randomisation requires more work than bagging, because the learning algorithm has to be modified; however, it can be applied to a wider range of learners. For example:
  - Bagging fails with stable learners, those whose output is insensitive to small changes in input, such as kNN.
  - However, randomisation can be applied by, e.g., selecting different randomly chosen subsets of attributes on which to base the classifiers.
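The last point, randomising a stable learner by giving each ensemble member a random subset of the attributes, can be sketched as follows. This is a hypothetical illustration: the learner is a bare 1-nearest-neighbour classifier, and the function names and toy data are invented for the example.

```python
import random
from collections import Counter

def random_subspace_ensemble(train, x, n_models=5, n_feats=1, seed=0):
    """Randomisation sketch for a stable learner (1-NN): each ensemble
    member sees only a random subset of the attributes, so the members
    differ even though 1-NN itself barely changes under resampling."""
    rng = random.Random(seed)
    dims = list(range(len(x)))
    votes = []
    for _ in range(n_models):
        feats = rng.sample(dims, n_feats)         # random attribute subset
        def dist(a):                              # distance on chosen attributes only
            return sum((a[f] - x[f]) ** 2 for f in feats)
        _, label = min(train, key=lambda inst: dist(inst[0]))
        votes.append(label)
    return Counter(votes).most_common(1)[0][0]    # majority vote

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
pred = random_subspace_ensemble(train, (0.05, 0.1))
print(pred)   # "a"
```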

Boosting

- Like bagging, works by combining, via voting or averaging, multiple models produced by a single learning scheme.
- Unlike bagging, does not derive models from artificially produced datasets generated by random sampling. Instead it builds models iteratively, where each model takes into account the performance of the last.
- Boosting encourages subsequent models to emphasise examples badly handled by earlier ones; it builds classifiers whose strengths complement each other.
- In AdaBoost.M1 this is achieved by using the notion of a weighted instance:
  - Error is computed by taking into account the weights of misclassified instances, rather than just the proportion of misclassified instances.
  - By increasing the weight of misclassified instances following the training of one model, the next model can be made to attend to these instances.
  - Final classification is determined by weighted voting across all the classifiers, where the weighting is based on classifier performance: in the AdaBoost.M1 case, on the error of the individual classifiers.

The AdaBoost.M1 Algorithm

Model generation:
  Assign equal weight to each training instance.
  For each of t iterations:
    Apply the learning algorithm to the weighted dataset and store the resulting model.
    Compute the error e of the model on the weighted dataset and store the error.
    If e = 0 or e >= 0.5 then terminate model generation.
    For each instance i in the dataset:
      If i is classified correctly by the model
        then weight_i <- weight_i * e / (1 - e).
    Normalise the weights of all instances so that their summed weight remains constant.

Classification:
  Assign a weight of zero to all classes.
  For each of the t (or fewer) models:
    Add -log(e / (1 - e)) to the weight of each class predicted by the model.
  Return the class with the highest weight.
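The pseudocode above can be sketched in Python. This is an illustrative sketch under stated assumptions: `learn(X, y, w)` stands for any weak learner that accepts instance weights, and the `stump` learner and toy data below are invented for the example (with separable data the first stump is already perfect, so the loop stops early, which exercises the e = 0 branch).

```python
import math
from collections import defaultdict

def adaboost_m1(X, y, learn, t=10):
    """AdaBoost.M1 sketch.  `learn(X, y, w)` returns a model (callable)
    trained on the weighted dataset.  Returns a weighted-vote classifier."""
    n = len(X)
    w = [1.0 / n] * n                          # equal initial weights
    models = []
    for _ in range(t):
        model = learn(X, y, w)
        wrong = [model(x) != label for x, label in zip(X, y)]
        e = sum(wi for wi, bad in zip(w, wrong) if bad)   # weighted error
        if e >= 0.5:
            break                              # too weak: discard and stop
        if e == 0:
            models.append((model, 1.0))        # perfect model: store and stop
            break
        beta = e / (1 - e)
        # down-weight correctly classified instances, then renormalise
        w = [wi * (beta if not bad else 1.0) for wi, bad in zip(w, wrong)]
        s = sum(w)
        w = [wi / s for wi in w]
        models.append((model, math.log(1 / beta)))   # vote weight -log(e/(1-e))

    def classify(x):
        scores = defaultdict(float)
        for model, alpha in models:
            scores[model(x)] += alpha
        return max(scores, key=scores.get)
    return classify

# Toy weak learner: best single-threshold decision stump on a 1-D attribute.
def stump(X, y, w):
    best = None
    for thr in X:
        for lo, hi in (("a", "b"), ("b", "a")):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (lo if xi < thr else hi) != yi)
            if best is None or err < best[0]:
                best = (err, thr, lo, hi)
    _, thr, lo, hi = best
    return lambda x: lo if x < thr else hi

X = [0, 1, 2, 3, 4, 5]
y = ["a", "a", "a", "b", "b", "b"]
clf = adaboost_m1(X, y, stump, t=5)
print([clf(x) for x in X])
```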

Boosting: Observations

- Boosting often performs substantially better than bagging.
- Unlike bagging, which never produces a combined classifier that is substantially worse than a single classifier built from the same data, boosting can sometimes do so (overfitting).
- Interestingly, performing more boosting iterations after the error on the training data has dropped to zero can further improve performance on new test data.
  - This seems to contradict Occam's razor (prefer the simpler hypothesis), since more iterations lead to a more complex hypothesis which does not explain the training data any better.
  - However, more iterations improve the classifier's confidence in its predictions: the difference between the estimated probability of the true class and that of the most likely predicted class other than the true class (called the margin).
- Boosting allows powerful combined classifiers to be built from simple ones, provided they achieve less than 50% error on the reweighted data.
  - Such simple learners are called weak learners. Examples are decision stumps (one-level decision trees) or OneR (a single conjunctive rule).
- Good example: Weka's decision stump on the mushroom data; try it without, then with, boosting.

Stacking (1)

- Bagging and boosting combine multiple models produced by one learning scheme.
- Stacking is normally used to combine models built by different learning algorithms.
- Rather than simply voting, stacking attempts to learn which classifiers are the reliable ones, using a metalearner.
- Inputs to the metalearner are instances built from the outputs of the level-0, or base-level, models.
- These level-1 instances consist of one attribute for each level-0 learner: the class the level-0 learner predicts for the level-0 instance.
- From these instances the level-1 model makes the final prediction.
- During training, the level-1 model is given instances consisting of the level-0 predictions for level-0 instances plus the actual class of the instance.
- However, if the predictions of the level-0 learners over the data they were trained on are used, the result will be a metalearner trained to prefer classifiers that overfit the training data.

Stacking (2)

- To avoid overfitting, the level-1 instances must be formed either from level-0 predictions over instances that were held out from level-0 training, or from predictions on the instances in the test folds, if cross-validation was used for training at level 0.
- Stacking can be extended to deal with level-0 classifiers that produce probability distributions over output class labels, and with numeric prediction rather than classification.
- While any ML algorithm could be used at level 1, simple level-1 algorithms such as linear regression have proved best.
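The held-out variant of stacking can be sketched as follows. This is a hypothetical illustration, not from the lecture: the two base learners and the metalearner (which simply learns to trust whichever level-0 model was most accurate on the held-out set) are invented toys standing in for real learning algorithms.

```python
def stack_train(train_set, holdout, base_learners, meta_learn):
    """Stacking sketch.  Level-0 models are trained on `train_set`; their
    predictions on the held-out instances (plus the true classes) become
    the level-1 training instances for the metalearner."""
    level0 = [learn(train_set) for learn in base_learners]
    meta_X = [[m(x) for m in level0] for x, _ in holdout]  # one attribute per level-0 model
    meta_y = [label for _, label in holdout]
    meta = meta_learn(meta_X, meta_y)
    return lambda x: meta([m(x) for m in level0])

# Toy metalearner: trust the level-0 model most accurate on the held-out set.
def best_model_meta(meta_X, meta_y):
    k = len(meta_X[0])
    accs = [sum(row[i] == label for row, label in zip(meta_X, meta_y))
            for i in range(k)]
    best = max(range(k), key=lambda i: accs[i])
    return lambda preds: preds[best]

# Two toy base "learners" with very different reliability.
def always_a(data):
    return lambda x: "a"

def threshold_learner(data):
    return lambda x: "a" if x < 2 else "b"

train_set = [(0, "a"), (3, "b")]
holdout = [(1, "a"), (4, "b"), (5, "b")]
clf = stack_train(train_set, holdout, [always_a, threshold_learner],
                  best_model_meta)
print(clf(10))   # "b": the metalearner learned to trust the threshold model
```

A real metalearner (e.g. linear regression over class probabilities, as the slide suggests) would weight the base models rather than pick just one.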

Using Unlabeled Data

- Labeled training data, i.e. data with an associated target class, is always limited: it frequently requires extensive/expensive manual annotation or cleaning.
- However, large amounts of unlabeled data may be readily available: pre-classified text is hard to get (e.g. catalogued news articles), while unclassified text is very easy to get.
- Is there any way we can utilise unlabeled training data to improve a classifier?

Using Unlabeled Data: Clustering for Classification

- One possibility is to couple a probabilistic classifier, such as Naïve Bayes, with Expectation-Maximisation (EM) iterative probabilistic clustering.
- Suppose we have labelled training data L plus unlabelled training data U. Proceed as follows:
  - Train a Naïve Bayes classifier on L.
  - Repeat until convergence:
    - (E-step) Use the current classifier to estimate the component mixture for each instance in U (i.e. the probability that each mixture component generated each instance).
    - (M-step) Re-estimate the classifier using the estimated component mixture for each instance in L + U.
  - Output a classifier that predicts labels for unlabelled instances.
  (after Nigam et al. 2000)
- Experiments show such a learner can attain performance equivalent to a traditional learner using less than 1/3 of the labeled training examples, together with 5 times as many unlabeled examples.
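The E-step/M-step loop above can be sketched with a deliberately simplified model. This is an illustrative sketch, not Nigam et al.'s system: instead of Naïve Bayes over text, the "classifier" here is just a weighted class mean per 1-D instance with a Gaussian-style soft assignment, and all names and data are invented for the example.

```python
import math

def train(data):
    """M-step stand-in: weighted class means from (instance, class
    distribution) pairs.  Hard labels are one-point distributions."""
    tot, wsum = {}, {}
    for x, dist in data:
        for c, p in dist.items():
            tot[c] = tot.get(c, 0.0) + p * x
            wsum[c] = wsum.get(c, 0.0) + p
    return {c: tot[c] / wsum[c] for c in tot}

def predict_proba(model, x):
    """E-step stand-in: class distribution proportional to exp(-(x - mean)^2)."""
    scores = {c: math.exp(-(x - m) ** 2) for c, m in model.items()}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def em_semi_supervised(L, U, n_iter=10):
    """EM loop sketch (after Nigam et al. 2000): train on labelled data L,
    then alternate E-steps (soft-label U) and M-steps (retrain on L + U)."""
    model = train(L)                                       # labelled data only
    for _ in range(n_iter):
        soft = [(x, predict_proba(model, x)) for x in U]   # E-step
        model = train(L + soft)                            # M-step on L + U
    return model

L = [(0.0, {"a": 1.0}), (10.0, {"b": 1.0})]   # one labelled example per class
U = [0.5, 1.0, 1.5, 9.0, 9.5, 10.5]           # plenty of unlabelled data
model = em_semi_supervised(L, U)
final = predict_proba(model, 1.2)
print(max(final, key=final.get))   # "a"
```

The unlabelled points pull each class mean towards the bulk of its cluster, which is the mechanism by which unlabelled data sharpens the classifier.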

Using Unlabeled Data: Co-training

- Suppose there are two independent perspectives, or views (feature sets), on a classification task. E.g. for web page classification: the web page's content; links to the web page from other pages.
- Co-training exploits these two perspectives:
  - Train model A using perspective 1 on the labelled data.
  - Train model B using perspective 2 on the labelled data.
  - Label the unlabeled data using model A and model B separately.
  - For each model, select the example it most confidently labels positively and the one it most confidently labels negatively, and add these to the pool of labeled examples.
  - Repeat the whole process, training both models on the augmented pool of labeled examples, until there are no more unlabeled examples.
- There is some experimental evidence to indicate that co-training using Naïve Bayes as the learner outperforms an approach that learns a single model using all the features from both perspectives.
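The co-training loop can be sketched as follows. This is a minimal illustration with invented names and data: instances are pairs (one value per view), each "learner" is a 1-nearest-neighbour model on its own view whose confidence shrinks with distance, and each model promotes its most confident example per class rather than per positive/negative label.

```python
def co_train(labeled, unlabeled, learn_a, learn_b, rounds=5):
    """Co-training sketch: each model is trained on its own view of the
    labelled pool, then promotes its most confidently labelled example
    per class from the unlabelled pool.  `learn_*` map the pool to a
    model returning (class, confidence) for an instance."""
    pool, U = list(labeled), list(unlabeled)
    classes = sorted({c for _, c in pool})
    for _ in range(rounds):
        for learn in (learn_a, learn_b):
            if not U:
                return pool
            model = learn(pool)
            preds = {x: model(x) for x in U}
            for target in classes:       # most confident example per class
                cands = [x for x in U if preds[x][0] == target]
                if cands:
                    best = max(cands, key=lambda x: preds[x][1])
                    pool.append((best, target))
                    U.remove(best)
    return pool

def make_view_learner(view):
    """1-NN on a single view; confidence decays with distance."""
    def learn(pool):
        def model(x):
            d, c = min((abs(x[view] - p[view]), lab) for p, lab in pool)
            return c, 1.0 / (1.0 + d)
        return model
    return learn

labeled = [((0, 0), "a"), ((9, 9), "b")]
unlabeled = [(1, 2), (2, 1), (8, 7), (7, 8)]
pool = co_train(labeled, unlabeled, make_view_learner(0), make_view_learner(1))
print(sorted(pool))
```

Each view bootstraps the other: examples one model labels confidently become training data for both on the next round.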

Using Unlabeled Data: Co-EM

- Co-EM:
  - trains model A using perspective 1 on the labeled data
  - uses model A to probabilistically label all the unlabeled data
  - trains model B using perspective 2 on the original labeled data plus the unlabeled data tentatively labeled using model A
  - uses model B to probabilistically relabel all the data for use in retraining model A
  - iterates until the classifiers converge.
- Co-EM appears to perform consistently better than co-training (because it does not commit to class labels, but re-estimates their probabilities at each iteration).
- Co-training/co-EM are limited to applications where multiple perspectives on the data are available:
  - there is some recent evidence that this split perspective can be artificially manufactured (e.g. by random selection of features, though feature independence is preferred)
  - there are some recent arguments/evidence that co-training using models derived by different classifiers (instead of from different feature sets) also works.

Summary

- Multiple learned models may be combined in various ways to produce classifiers whose performance is superior to that of a single model on its own:
  - Bagging trains multiple models using a single learning scheme on multiple training sets artificially derived from a single data set through random deletion and repetition of instances; the final classification is arrived at by simple majority voting.
  - Boosting builds multiple models using a single learning scheme iteratively over a single data set, where the instances are re-weighted between iterations so that subsequent models pay more attention to instances misclassified by earlier models; the final classification is arrived at by weighted voting of all the classifiers, each vote weighted by classifier performance.
  - Stacking combines the models built by different learning schemes by training a metalearner to decide amongst the predictions of the base-level learners.
- Unlabeled data can be utilised to improve the performance of classifiers, or to allow them to attain equivalent performance using less (expensive) labeled training data. Approaches include:
  - learning over probabilistically clustered unlabeled data (Naïve Bayes + EM)
  - co-training and co-EM, which assume different perspectives (feature views) over the same data, with models/estimates iteratively improved over the unlabeled data.


More information

Gender and socioeconomic differences in science achievement in Australia: From SISS to TIMSS

Gender and socioeconomic differences in science achievement in Australia: From SISS to TIMSS Gender and socioeconomic differences in science achievement in Australia: From SISS to TIMSS, Australian Council for Educational Research, thomson@acer.edu.au Abstract Gender differences in science amongst

More information

Evolution of Symbolisation in Chimpanzees and Neural Nets

Evolution of Symbolisation in Chimpanzees and Neural Nets Evolution of Symbolisation in Chimpanzees and Neural Nets Angelo Cangelosi Centre for Neural and Adaptive Systems University of Plymouth (UK) a.cangelosi@plymouth.ac.uk Introduction Animal communication

More information