Trees: Themes and Variations
Prof. Mari Ostendorf

Outline:
- Preface
- Decision Trees
- Bagging
- Boosting
- BoosTexter
Preface: Vector Classifiers

Today we again deal with vector classifiers and supervised training: given a labeled training set {(x_i, c_i)} with vector observations, learn a classifier ĉ = F(x). (In the remainder of the notes, I'll omit boldface for vectors to simplify things, but everything will be a vector.)

Classification issues and concepts that we'll touch on:
- Multiclass problems: some classifiers are good for > 2 classes; others are developed for binary decisions and need special tricks for multiclass problems.
- Features playing a key role, including: complicating the classifier, feature selection, feature analysis.
- Direct probabilistic model: on Tuesday we used p(x|c)p(c); today we will learn p(c|x) directly.
- Bias and variance: in statistical learning, models (or classifiers) are learned from a random sample of data (the training set). Because the data is random, the resulting classifier is random, i.e. it can give slightly different answers if trained on a different data set. A high-variance classifier is one that is very sensitive to the training sample (not a good thing).
- Class distribution skew: when you have a lot more data from one class than another, and all errors are treated equally, classifiers tend to put their effort into the more popular classes. This makes sense from a minimum-error perspective, but sometimes it makes it hard to learn to predict rare events.
Decision Trees

A decision tree is an ordered (tree-structured) sequence of questions that are asked about the features x_i in the vector x. The feature vector passes from the root of the tree to a specific leaf based on the answers to each successive question. Questions correspond to internal nodes of the tree, and the leaf nodes (terminal nodes) are associated with the classifier decision and/or the predicted class posterior p(c|T(x)), where T(.) is the tree.

Typically questions are binary. They may take many forms and handle different types of features, e.g.:
- is x_i > T? (for numeric features)
- is x_i = green? (for categorical features, attribute-value questions)
- is x_i in A? (for categorical features, set membership)

Decision trees are one of the most popular methods of machine learning, in part because:
- They easily handle multiclass problems.
- They easily handle heterogeneous features (categorical and numeric) without requiring independence assumptions.
- They take care of feature selection automatically (x_i is only asked about if it is useful), and account for the relative importance of features (fewer questions about less important features).
- The sequence of questions learned is easy to interpret, so trees can be used for data analysis.
- They allow you to combine knowledge engineering (question design) and ignorance modeling (statistical learning).
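The three question types above can be combined in a single hand-built tree. A minimal sketch in Python (the feature names, thresholds, and classes here are hypothetical, chosen only for illustration):

```python
# A small hand-built decision tree mixing the three question types:
# numeric threshold, attribute-value, and set membership.
# All feature names, values, and posteriors are made up.

def classify(x):
    """Route a feature dict from root to leaf; return (class, posterior)."""
    # Root: numeric threshold question
    if x["height"] > 1.5:
        # Internal node: attribute-value question on a categorical feature
        if x["color"] == "green":
            return "tree", 0.9       # leaf: decision + p(c | T(x))
        return "pole", 0.7
    # Internal node: set-membership question
    if x["color"] in {"red", "yellow", "orange"}:
        return "flower", 0.8
    return "shrub", 0.6

label, posterior = classify({"height": 2.0, "color": "green"})
```

Each leaf stores both the decision and an estimate of p(c|T(x)), as in the notes.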
Learning a Decision Tree

There are two main steps:
- Tree growing
- Tree pruning (determining the right size)

Tree growing is based on a greedy algorithm for improving some objective function, such as minimum entropy of p(c|T(x)) (which is the same as maximum mutual information) or minimum error rate:

For each leaf node t in the current tree:
  For each possible question q:
    For each possible parameter a of the question:
      compute the objective function gain G(t, q, a)
    Find the best parameter for q and t: a* = argmax_a G(t, q, a)
  Find the best question for t: q* = argmax_q G(t, q, a*)
Find the best node to split: t* = argmax_t G(t, q*, a*)
Split that node and repeat. (Note: you can save the q* and a* information so that you don't need to redo all the tests.)

The greedy approach is used because the optimal search is far too slow. However, since the search is greedy, it is often better to use objective functions other than minimum error rate.

Like any learning problem, if you learn a model with too many parameters relative to the amount of data (overtraining), then the model won't generalize well to new samples. It is easy to overtrain decision trees, so you need a mechanism to pick the right size.
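The innermost loops of the growing algorithm can be sketched for a single numeric feature with entropy as the objective. This is a toy illustration, assuming the question form "is x > a?" and a made-up four-sample data set:

```python
# One greedy growing step: scan candidate thresholds a for the question
# "is x > a?" and pick the one with the largest entropy gain G.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values()) if n else 0.0

def gain(data, a):
    """Entropy reduction from splitting (x, c) pairs on the question x > a."""
    left  = [c for x, c in data if x <= a]
    right = [c for x, c in data if x > a]
    n = len(data)
    return (entropy([c for _, c in data])
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

def best_split(data):
    """Greedy step: a* = argmax_a G(t, q, a) over candidate thresholds."""
    candidates = sorted({x for x, _ in data})[:-1]  # splitting above the max is useless
    return max(candidates, key=lambda a: gain(data, a))

data = [(0.1, "A"), (0.4, "A"), (0.6, "B"), (0.9, "B")]
a_star = best_split(data)   # a threshold separating the A's from the B's
```

A real tree grower repeats this over all features, all leaves, and all question types, keeping the best (t, q, a) triple each round.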
Learning the Right-Sized Tree

Important concepts:
- It is better to prune back a big tree than to stop the growing process early, since big gains can follow small gains. Consider a 2-class problem with 2 modes per class, arranged along one feature as: A B B A. The first split doesn't change the predictions, but the second split allows you to predict the classes perfectly.
- You need to use different data for growing vs. pruning. If you have a lot of data, just use a held-out set. If you don't have a lot of data, use cross-validation.

Cross-validation pruning: partition the training data into N sets. Rotate through each set, training on all but the i-th set and pruning with that set. Find the cost/complexity trade-off for each case, and average to come up with the optimal pruning point. Then retrain a tree on the full data set, and prune according to this cost/complexity criterion (loss in G relative to the number of nodes pruned).

Most decision tree software takes care of this for you, but you need to remember to enable pruning.
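The rotation through the N partitions can be sketched as follows; the growing and pruning steps themselves are left abstract, since they depend on the tree learner (a minimal sketch, not tied to any particular toolkit):

```python
# N-fold rotation for cross-validation pruning: each fold is held out once
# for pruning while the remaining folds are used for growing.

def cv_folds(indices, n_folds):
    """Partition sample indices into n_folds roughly equal sets."""
    return [indices[i::n_folds] for i in range(n_folds)]

def rotate(indices, n_folds):
    """Yield (grow_set, prune_set) pairs, holding out each fold in turn."""
    folds = cv_folds(indices, n_folds)
    for i in range(n_folds):
        grow = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield grow, folds[i]

# For each (grow, prune) pair you would grow a tree on `grow`, prune it on
# `prune`, and record the cost/complexity trade-off; the recorded trade-offs
# are then averaged to set the pruning point for the final full-data tree.
splits = list(rotate(list(range(10)), n_folds=5))
```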
Knowledge Engineering and Tree Design

A good sequence of questions is learned automatically from data, but the set of possible questions can be improved by a human.

Questions that software packages can think of:
- if you specify that the feature is numeric: is x_i > T?
- if you specify that the feature is categorical (including binary): is x_i = a? (attribute-value questions); is x_i in A? (set membership, with A learned automatically; only in some toolkits, and only when the number of possible values of x_i is small, e.g. < 10)

In theory, the decision tree learns complex questions through the sequence it asks (set membership, combinations of variables), BUT in practice limited data impedes that learning. The answer: knowledge engineering.
- Set membership: the human designer incorporates questions (or features, depending on the software) that are flags for different sets that might be useful.
- Design simple combinations of categorical features by hand.
- Outside of tree design, learn a good linear transformation (x' = w^T x) of a subvector of continuous variables using principal component analysis (PCA) or linear discriminant analysis (LDA). Use the new feature x' and let tree design learn the threshold. [covered next week]

The decision tree will pick, so err on the side of too many such groups and feature combinations instead of too few.
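The PCA option can be sketched with NumPy: the direction w is the top eigenvector of the sample covariance matrix, and projecting onto it gives the new scalar feature x' = w^T x (toy data; the details here are illustrative assumptions, not from the notes):

```python
# Derive a new scalar feature x' = w^T x from two correlated continuous
# features via PCA; the tree would then learn a threshold on x'.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
# Toy data: second feature strongly correlated with the first
X = np.column_stack([x1, 0.8 * x1 + 0.2 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                  # center before PCA
cov = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
w = eigvecs[:, -1]                       # top principal direction
x_new = Xc @ w                           # new feature x' = w^T x
```

The tree design step then only needs a single threshold question on x_new instead of an axis-aligned approximation of the oblique boundary.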
Interpreting Decision Trees

Decision trees have the advantage that they are easy to interpret.
- The most important prediction variables are in the first questions of the tree (near the root node).
- Variables that are associated with questions in many places in the tree are usually important (though this can also be a reflection of the need for complex questions).
- Some decision tree software provides output that scores variables for their importance based on the information gain associated with each question in training.

BUT, because of the complex interactions and instability of tree design, feature analysis and selection often benefit from further analysis, e.g.:
- Design trees with individual features (or subgroups of features): how much information does this feature give on its own?
- Design trees leaving out one feature at a time (or subgroups of features): how much does this feature give in combination with other features?
Limitations of Decision Trees

- Decision trees divide up the training data with each question that is learned, which is good when there are dependencies but not good for variables that are independent. (This also motivates the use of complex questions.) Subsequent decisions are based on less data, so they may be less reliable, and feature selection is not perfect.
- If samples from an infrequent class get split up, it may be impossible to learn questions that predict that class.
- Decision trees are high variance (not stable): a change in the data sample could cause a very different tree to be learned. This is particularly a problem when there is not a lot of training data.
- Decision trees can have trouble learning to predict infrequent classes.

So what do we do if we like the positive aspects of decision trees?
- Downsample the more frequent classes to learn p(c|T(x)), then compensate for the change (so as to correctly weight the more frequent classes) by reweighting: p~(c|T(x)) proportional to p(c|T(x)) p_0(c), where p_0(c) is the empirical (skewed) class prior.
- Bagging (to deal with instability and the underutilization of data in downsampling)
- Boosting (another way to deal with skew)
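The compensation rule for downsampling can be sketched directly: multiply the leaf posterior learned from class-balanced data by the skewed prior p_0(c) and renormalize (a minimal sketch; the numbers are made up):

```python
# Prior compensation after training on downsampled (class-balanced) data:
# p~(c | T(x)) is proportional to p(c | T(x)) * p_0(c).

def reweight(posterior, prior):
    """Multiply leaf posterior by the skewed prior and renormalize."""
    unnorm = {c: posterior[c] * prior[c] for c in posterior}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Leaf posterior from the balanced tree, and the empirical skewed prior
leaf = {"rare": 0.6, "common": 0.4}
p0 = {"rare": 0.05, "common": 0.95}
adjusted = reweight(leaf, p0)
```

Note how the correction can flip the decision: the balanced tree favors "rare" at this leaf, but after reweighting by the true prior, "common" wins.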
Bagging

Bagging is a general approach for designing lower-variance classifiers, but it is especially popular for decision trees.

Repeat for i = 1, ..., N:
- Randomly sample (with replacement) from the training data to create a smaller training sample S_i.
- Train tree T_i on this data sample, providing p(c|T_i(x)).

Apply all N classifiers to a test sample and average the class posteriors:

  p(c|x) = (1/N) sum_{i=1}^{N} p(c|T_i(x))

Then make a decision according to ĉ = argmax_c p(c|x).

Typically, each individual sampled training subset S_i would be about 70% of the size of the full sample, and N would be fairly large, chosen based on a development set. If you resample to balance the class distribution, then the sampled subset would be smaller, and you would probably want a larger N. Since bigger N is more costly in terms of both memory and compute, you don't want it bigger than it needs to be for good performance.

Does bagging always help? Not necessarily. The approach trades off the increased model error associated with having a smaller training set against the reduced variance due to averaging. For stable classifiers, bagging often isn't worth the added cost.
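The bagging loop can be sketched with a stand-in base learner (a single fixed-threshold stump returning leaf posteriors) in place of a full tree; the data and learner are made up for illustration:

```python
# Bagging sketch: N bootstrap samples, N base classifiers, averaged posteriors.
import random

def train_stump(sample, threshold=0.5):
    """Trivial base learner: leaf posteriors p(c | x <= thr) and p(c | x > thr)."""
    def posterior(labels):
        n = max(len(labels), 1)
        return {c: labels.count(c) / n for c in ("A", "B")}
    left  = [c for x, c in sample if x <= threshold]
    right = [c for x, c in sample if x > threshold]
    return lambda x: posterior(left) if x <= threshold else posterior(right)

def bag(data, n_models, subset_frac=0.7, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Sample with replacement, ~70% of the full sample size
        sample = [rng.choice(data) for _ in range(int(subset_frac * len(data)))]
        models.append(train_stump(sample))
    def predict(x):
        # Average the class posteriors over all N classifiers, then argmax
        avg = {c: sum(m(x)[c] for m in models) / n_models for c in ("A", "B")}
        return max(avg, key=avg.get)
    return predict

data = [(0.1, "A"), (0.2, "A"), (0.3, "A"), (0.7, "B"), (0.8, "B"), (0.9, "B")]
predict = bag(data, n_models=25)
```

With real trees, each model would differ in structure as well as in its leaf statistics, which is where the variance reduction comes from.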
AdaBoost

Like bagging, boosting is a general method for improving the accuracy of a given learning algorithm. It is similar to bagging in that you combine a bunch of classifiers, but the classifiers are designed by reweighting (rather than resampling) the data. A practical boosting algorithm is AdaBoost:

Let D_1(i) = 1/m be the initial weight of the i-th data sample.
For t = 1, ..., T:
- Train a weak learner h_t(x) using distribution D_t (to weight samples, or for sampling if the learner can't use weighted samples). Get a weak hypothesis with error ε_t on the training data.
- Choose α_t = (1/2) ln[(1 − ε_t)/ε_t]. (Note: we assume each weak learner does better than chance on its weighted data, i.e. ε_t < 0.5, so α_t > 0 for all t.)
- Update D_{t+1}(i) by a factor of e^{±α_t} according to whether that sample was correctly classified, i.e. increase the weight for incorrectly classified samples and decrease it for correctly classified samples. If decisions h_t(x_i) and class labels y_i take on values ±1, then the new weight is:

  D_{t+1}(i) = (1/Z_t) D_t(i) exp(−α_t y_i h_t(x_i))

where Z_t is a normalization term chosen so that D_{t+1} will be a valid distribution (sum to 1).

The final classifier is a weighted combination of the weak learners, sum_{t=1}^{T} α_t h_t(x) (take its sign for a ±1 decision), where T is determined empirically.
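The AdaBoost loop above can be sketched on 1-D data with threshold stumps as the weak learners and ±1 labels (a minimal sketch; the toy data is made up):

```python
# AdaBoost sketch: weighted stump learner + the weight-update rule above.
from math import log, exp

def stump_learner(data, weights):
    """Weak learner: pick the threshold/polarity minimizing weighted error."""
    best = None
    for thr in sorted({x for x, _ in data}):
        for pol in (1, -1):
            h = lambda x, thr=thr, pol=pol: pol if x > thr else -pol
            err = sum(w for (x, y), w in zip(data, weights) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best  # (weighted error eps_t, hypothesis h_t)

def adaboost(data, rounds):
    m = len(data)
    D = [1.0 / m] * m                        # D_1(i) = 1/m
    ensemble = []
    for _ in range(rounds):
        eps, h = stump_learner(data, D)
        eps = max(eps, 1e-10)                # guard against a perfect stump
        alpha = 0.5 * log((1 - eps) / eps)   # alpha_t = (1/2) ln[(1-eps)/eps]
        ensemble.append((alpha, h))
        # D_{t+1}(i) = D_t(i) exp(-alpha_t y_i h_t(x_i)) / Z_t
        D = [d * exp(-alpha * y * h(x)) for (x, y), d in zip(data, D)]
        z = sum(D)
        D = [d / z for d in D]
    def classify(x):                         # sign of the weighted combination
        return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1
    return classify

# Toy data a single stump cannot fit, but a boosted combination can
data = [(0.1, -1), (0.3, -1), (0.5, 1), (0.7, 1), (0.9, -1)]
clf = adaboost(data, rounds=10)
```

Note how the data (-, -, +, +, -) is not separable by any single threshold, but a few weighted stumps together carve out the middle interval.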
Notes on Boosting

- AdaBoost works well for 2-class problems, but not always for multiclass problems. If the initial learner is too weak, then you need to implement multiclass decisions as a combination of binary decisions.
- The theory of boosting aimed at showing how to make weak learners strong, but you can use AdaBoost to make good learners better as well as making weak learners better.
- AdaBoost tends to be less sensitive to problems of skewed priors, because it boosts up the weight on infrequent classes without dividing the data as decision trees do.
BoosTexter

BoosTexter is AdaBoost specially designed for text classification problems.
- The weak learner is a single-question decision tree (called a "decision stump"), so typically a large T is required. This makes BoosTexter very fast and often gives good results, but it may be possible to do better by boosting on top of decision trees.
- The features can include almost anything (as with decision trees), but the software easily incorporates word and word-pair features, since it is designed for text problems.

BoosTexter has been used with success for problems like:
- Topic classification
- Sentence segmentation and punctuation prediction
- Dialog act tagging
- Sentence extraction for information distillation

For more information, see the paper by Schapire and Singer, Machine Learning, 39(2/3):135-168, 2000.