Trees: Themes and Variations

Prof. Mari Ostendorf

Outline:
- Preface
- Decision Trees
- Bagging
- Boosting
- BoosTexter

Preface: Vector Classifiers

Today we again deal with vector classifiers and supervised training: given a labeled training set {(x_i, c_i)} with vector observations, learn a classifier ĉ = F(x). (In the remainder of the notes, I'll omit boldface for vectors to simplify things, but everything will be a vector.)

Classification issues and concepts that we'll touch on:

- Multiclass problems: Some classifiers are good for more than 2 classes; others are developed for binary decisions and need special tricks for multiclass problems.
- Features play a key role, including: complicating the classifier, feature selection, and feature analysis.
- Direct probabilistic model: On Tuesday we used p(x|c)p(c); today we will learn p(c|x) directly.
- Bias and variance: In statistical learning, models (or classifiers) are learned from a random sample of data (the training set). Because the data is random, the resulting classifier is random, i.e. it can give slightly different answers if trained on a different data set. A high-variance classifier is one that is very sensitive to the training sample (not a good thing).
- Class distribution skew: When you have a lot more data from one class than another, and all errors are treated equally, classifiers tend to put their effort into the more popular classes. This makes sense from a minimum-error perspective, but it can make it hard to learn to predict rare events.

Decision Trees

A decision tree is an ordered (tree-structured) sequence of questions asked about the features x_i in the vector x. The feature vector passes from the root of the tree to a specific leaf based on the answers to each successive question. Questions correspond to nodes of the tree, and the leaf nodes (terminal nodes) are associated with the classifier decision and/or the predicted class posterior p(c|T(x)), where T(·) is the tree.

Typically questions are binary. They may take many forms and handle different types of features, e.g.:

- is x_i > T? (for numeric features)
- is x_i = green? (for categorical features, attribute-value questions)
- is x_i ∈ A? (for categorical features, set membership)

Decision trees are one of the most popular methods of machine learning, in part because:

- They easily handle multiclass problems.
- They easily handle heterogeneous features (categorical and numeric) without requiring independence assumptions.
- They take care of feature selection automatically (x_i is only asked about if it is useful) and account for the relative importance of features (fewer questions about less important features).
- The learned sequence of questions is easy to interpret, so trees can be used for data analysis.
- They allow you to combine knowledge engineering (question design) and ignorance modeling (statistical learning).
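
To make the three question types concrete, here is a minimal hand-coded sketch (the features, thresholds, and posteriors are hypothetical, not taken from the lecture) of how a feature vector is routed from root to leaf and mapped to a class posterior:

```python
# Minimal hand-coded decision tree sketch; features, thresholds, and
# posteriors are hypothetical and chosen only to illustrate the question types.

def classify(x):
    """Route a feature dict x to a leaf and return its class posterior p(c|T(x))."""
    if x["age"] > 30:                      # numeric question: is x_i > T?
        if x["color"] == "green":          # attribute-value question: is x_i = green?
            return {"A": 0.9, "B": 0.1}
        return {"A": 0.3, "B": 0.7}
    if x["state"] in {"WA", "OR", "CA"}:   # set-membership question: is x_i in A?
        return {"A": 0.2, "B": 0.8}
    return {"A": 0.6, "B": 0.4}

print(classify({"age": 42, "color": "green", "state": "WA"}))  # {'A': 0.9, 'B': 0.1}
```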

Learning a Decision Tree

There are two main steps:

- Tree growing
- Tree pruning (determining the right size)

Tree growing is based on a greedy algorithm for improving some objective function, such as minimum entropy of p(c|T(x)) (which is the same as maximum mutual information) or minimum error rate:

For each leaf node t in the current tree:
    For each possible question q:
        For each possible parameter a of the question, compute the objective function gain G(t, q, a).
        Find the best parameter for q and t: a* = argmax_a G(t, q, a).
    Find the best question for t: q* = argmax_q G(t, q, a*).
Find the best node to split: t* = argmax_t G(t, q*, a*).
Split that node and repeat. (Note: you can save the q* and a* information so that you don't need to redo all the tests.)

The greedy approach is used because the optimal search is far too slow. However, since it is greedy, it is often better to use objective functions other than minimum error rate.

Like any learning problem, if you learn a model with too many parameters relative to the amount of data (overtraining), then the model won't generalize very well to new samples. It is easy to overtrain decision trees, so you need a mechanism to pick the right size.
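
As a rough illustration of the greedy search (my own simplified sketch, not code from the lecture), the snippet below scores every threshold question "is x_i > T?" on a small numeric data set by entropy reduction and returns the best (feature, threshold) pair for a single split:

```python
import numpy as np

def entropy(labels):
    """Empirical entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(X, y):
    """Greedy search over questions 'is x_i > T?' for the largest entropy gain G."""
    base = entropy(y)
    best = (None, None, -np.inf)             # (feature index, threshold, gain)
    for i in range(X.shape[1]):              # loop over questions (features)
        for t in np.unique(X[:, i])[:-1]:    # loop over candidate parameters (thresholds)
            left, right = y[X[:, i] <= t], y[X[:, i] > t]
            post = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if base - post > best[2]:
                best = (i, t, base - post)
    return best

X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y = np.array([0, 1, 0, 1])
print(best_split(X, y))   # (1, 2.0, 1.0): the best single question is "is x_2 > 2.0?"
```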

Learning the Right-Sized Tree

Important concepts:

- It is better to prune back a big tree than to stop the growing process early, since big gains can follow small gains. Consider a 2-class problem with 2 modes per class, arranged along one dimension as A | B | B | A: the first split (down the middle) doesn't change the predictions, but the subsequent splits allow you to predict the classes perfectly.
- You need to use different data for growing vs. pruning. If you have a lot of data, just use a held-out set. If you don't have a lot of data, use cross-validation.

Cross-validation pruning: Partition the training data into N sets. Rotate through the sets, training on all but the i-th set and pruning with that set. Find the cost/complexity trade-off for each case, and average to come up with the optimal pruning point. Then retrain a tree on the full data set, and prune according to this cost/complexity criterion (loss in G relative to the number of nodes pruned).

Most decision tree software takes care of this for you, but you need to remember to enable pruning.
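
A rough modern analogue of this recipe (my sketch, assuming scikit-learn; not the lecture's software) uses cost-complexity pruning: enumerate candidate pruning strengths, score each by cross-validation, and retrain on the full data with the best value:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate cost/complexity trade-off points from a tree grown on all the data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]          # drop the last alpha (it prunes to a single node)

# Score each pruning strength by cross-validation (the "rotate through the sets" step).
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean() for a in alphas]

# Retrain on the full data set with the best cost/complexity trade-off.
best_alpha = alphas[int(np.argmax(scores))]
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())
```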

Knowledge Engineering and Tree Design

A good sequence of questions is learned automatically from data, but the set of possible questions can be improved by a human.

Questions that software packages can think of:

- If you specify that the feature is numeric: is x_i > T?
- If you specify that the feature is categorical (including binary): is x_i = a? (attribute-value questions), or is x_i ∈ A? (set membership, with A learned automatically; only in some toolkits, and only when the number of possible values of x_i is small, e.g. < 10).

In theory, the decision tree learns complex questions through the sequence it asks (set membership, combinations of variables), BUT in practice limited data impedes learning. The answer: knowledge engineering.

- Set membership: the human designer incorporates questions (or features, depending on the software) that are flags for different sets that might be useful.
- Design simple combinations of categorical features by hand.
- Outside of tree design, learn a good linear transformation (x' = w^T x) of a subvector of continuous variables using principal component analysis (PCA) or linear discriminant analysis (LDA). Use the new feature x' and let tree design learn the threshold. [covered next week]

The decision tree will pick, so err on the side of too many such groups and feature combinations instead of too few. (Examples of these engineered features are sketched below.)
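
Here is a small sketch of the three kinds of engineered features (the data, sets, and column names are hypothetical; pandas and scikit-learn are assumed purely for illustration):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical raw features, purely for illustration.
df = pd.DataFrame({
    "word":     ["the", "bank", "river", "loan"],
    "prev_pos": ["START", "DT", "DT", "DT"],
    "pos":      ["DT", "NN", "NN", "NN"],
    "f1":       [0.2, 1.5, 0.7, 2.1],
    "f2":       [1.0, 0.3, 0.9, 0.1],
})

# 1) Hand-designed set-membership flag (is x_i in A?).
finance_words = {"bank", "loan", "interest"}
df["is_finance_word"] = df["word"].isin(finance_words).astype(int)

# 2) Simple hand-designed combination of categorical features.
df["pos_pair"] = df["prev_pos"] + "_" + df["pos"]

# 3) Linear transform x' = w^T x of a continuous subvector (here via PCA);
#    tree growing is then left to learn a threshold on x'.
df["pca_feature"] = PCA(n_components=1).fit_transform(df[["f1", "f2"]]).ravel()

print(df)
```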

Interpreting Decision Trees

Decision trees have the advantage that they are easy to interpret.

- The most important prediction variables correspond to the first questions in the tree (near the root node).
- Variables that are associated with questions in many places in the tree are usually important (though this can also be a reflection of the need for complex questions).
- Some decision tree software provides output that scores variables for their importance based on the information gain associated with each question in training.

BUT, because of the complex interactions and instability of tree design, feature analysis and selection often benefit from further analysis, e.g.:

- Design trees with individual features (or subgroups of features): how much information does this feature give on its own?
- Design trees leaving out one feature at a time (or subgroups of features): how much does this feature give in combination with other features?

(A sketch of both analyses follows.)
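
Both analyses can be run mechanically, for example like this (a sketch assuming scikit-learn and a toy data set; cross-validated accuracy stands in for the information measure):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def tree_score(X_sub):
    """Cross-validated accuracy of a small tree trained on a feature subset."""
    return cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                           X_sub, y, cv=5).mean()

for i in range(X.shape[1]):
    alone = tree_score(X[:, [i]])                  # how much does feature i give on its own?
    without = tree_score(np.delete(X, i, axis=1))  # how much is lost when feature i is left out?
    print(f"feature {i}: alone = {alone:.3f}, leave-one-out = {without:.3f}")
```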

Limitations of Decision Trees

- Decision trees divide up the training data with each question that is learned, which is good when there are dependencies but not good for variables that are independent. (This also motivates the use of complex questions.)
- Subsequent decisions are based on less data, so they may be less reliable.
- Feature selection is not perfect.
- If samples from an infrequent class get split up, it may be impossible to learn questions that predict that class.
- Decision trees are high variance (not stable): a change in the data sample could cause a very different tree to be learned. This is particularly a problem when there is not a lot of training data.
- Decision trees can have trouble learning to predict infrequent classes.

So what do we do if we like the positive aspects of decision trees?

- Downsample the more frequent classes to learn p̃(c|T(x)), then compensate for the change (so as to correctly weight the more frequent classes) by

  p(c|T(x)) ∝ p̃(c|T(x)) p_0(c)

  where p̃(c|T(x)) is the posterior estimated from the downsampled data and p_0(c) is the empirical (skewed) class prior. (A sketch of this correction follows.)
- Bagging (to deal with instability and the underutilization of data in downsampling)
- Boosting (another way to deal with skew)
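
The prior correction amounts to multiplying the leaf posterior learned on downsampled data by the skewed prior and renormalizing; here is a minimal sketch with made-up numbers:

```python
import numpy as np

def reweight_posterior(p_downsampled, p0):
    """Correct a leaf posterior learned on downsampled (class-balanced) data:
    p(c|T(x)) is proportional to p~(c|T(x)) * p0(c); renormalize afterwards."""
    p = p_downsampled * p0
    return p / p.sum()

# Made-up numbers: leaf posterior from the downsampled tree, and the
# empirical (skewed) class prior of the full data, for classes [frequent, rare].
p_leaf = np.array([0.55, 0.45])
p0 = np.array([0.95, 0.05])

print(reweight_posterior(p_leaf, p0))   # weight shifts back toward the frequent class
```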

Bagging

Bagging is a general approach for designing lower-variance classifiers, but it is especially popular for decision trees.

Repeat for i = 1, ..., N:
- Randomly sample (with replacement) from the training data to create a smaller training sample i.
- Train tree T_i on this data sample, providing p(c|T_i(x)).

Apply all N classifiers to a test sample and average the class posteriors:

  p(c|x) = (1/N) Σ_{i=1}^{N} p(c|T_i(x))

Then make a decision according to

  c* = argmax_c p(c|x)

Typically, each individual sampled training subset i would be about 70% of the size of the full sample, and N would be fairly large, chosen based on a development set. If you resample to balance the class distribution, then the sampled subsets would be smaller, and you would probably want a larger N. Since a bigger N is more costly in terms of both memory and computation, you don't want it bigger than it needs to be for good performance.

Does bagging always help? Not necessarily. The approach trades off the increased model error associated with having a smaller training set against the reduced variance due to averaging. For stable classifiers, bagging often isn't worth the added cost.
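
A minimal sketch of this procedure (assuming scikit-learn trees and a toy data set; the 70% subset size and N = 25 are placeholder choices, not values from the lecture):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
N = 25                                   # number of bagged trees (placeholder choice)
m = int(0.7 * len(X))                    # each subset ~70% of the full sample

trees = []
for _ in range(N):
    idx = rng.choice(len(X), size=m, replace=True)      # sample with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Average the class posteriors p(c|T_i(x)) over the N trees, then take the argmax.
x_test = X[:5]
p_avg = np.mean([t.predict_proba(x_test) for t in trees], axis=0)
print(p_avg.argmax(axis=1))
```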

AdaBoost

Like bagging, boosting is a general method for improving the accuracy of a given learning algorithm. It is similar to bagging in that you combine a set of classifiers, but the classifiers are designed by reweighting (rather than resampling) the data.

A practical boosting algorithm is AdaBoost:

Let D_1(i) = 1/m be the initial weight of the i-th data sample.

For t = 1, ..., T:
- Train a weak learner h_t(x) using distribution D_t (to weight samples, or for sampling if the learner can't use weighted samples). Get a weak hypothesis with error ε_t on the training data.
- Choose α_t = (1/2) ln[(1 - ε_t)/ε_t]. (Note: we assume that each weak learner gives ε_t < 0.5, so α_t > 0 for all t.)
- Update D_{t+1}(i) by a factor of e^{±α_t} according to whether that sample was correctly classified, i.e. increase the weight for incorrectly classified samples and decrease it for correctly classified samples. If the decisions h_t(x_i) and class labels y_i take on values ±1, then the new weight is

    D_{t+1}(i) = (1/Z_t) D_t(i) exp(-α_t y_i h_t(x_i))

  where Z_t is a normalization term chosen so that D_{t+1} will be a valid distribution (sums to 1).

The final classifier is a weighted combination of the weak learners,

  Σ_{t=1}^{T} α_t h_t(x)

(the decision is its sign), where T is determined empirically.
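
A compact sketch of these updates (my own illustration, assuming scikit-learn decision stumps as the weak learner and ±1 labels on a synthetic data set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=200, random_state=0)
y = 2 * y01 - 1                          # class labels in {-1, +1}
m = len(y)

D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
learners, alphas = [], []

for t in range(50):                      # T boosting rounds (placeholder value)
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    h = stump.predict(X)
    eps = D[h != y].sum()                # weighted training error epsilon_t
    if eps >= 0.5 or eps == 0:
        break
    alpha = 0.5 * np.log((1 - eps) / eps)
    D = D * np.exp(-alpha * y * h)       # up-weight mistakes, down-weight correct samples
    D = D / D.sum()                      # Z_t: renormalize so D_{t+1} sums to 1
    learners.append(stump)
    alphas.append(alpha)

# Final classifier: sign of the weighted combination of weak learners.
F = sum(a * s.predict(X) for a, s in zip(alphas, learners))
print("training accuracy:", np.mean(np.sign(F) == y))
```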

Notes on boosting:

- AdaBoost works well for 2-class problems, but not always for multiclass problems. If the initial learner is too weak, then you need to implement multiclass decisions as a combination of binary decisions.
- The theory of boosting was aimed at showing how to make weak learners strong, but you can use AdaBoost to make good learners better as well as making weak learners better.
- AdaBoost tends to be less sensitive to problems of skewed priors, because it boosts up the weight on infrequent classes without dividing the data as decision trees do.

BoosTexter

BoosTexter is AdaBoost specially designed for text classification problems.

- The weak learner is a single-question decision tree (called a "decision stump"), so typically a large T is required. This makes BoosTexter very fast and often gives good results, but it may be possible to do better by boosting on top of decision trees.
- The features can include almost anything (like decision trees), but the software easily incorporates word and word-pair features since it is designed for text problems.

BoosTexter has been used with success for problems like:
- Topic classification
- Sentence segmentation and punctuation prediction
- Dialog act tagging
- Sentence extraction for information distillation

For more information, see the paper by Schapire and Singer, Machine Learning, 39(2/3):135-168, 2000.
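
For flavor, here is a rough analogue in scikit-learn (not the BoosTexter software itself): boosted decision stumps over word and word-pair presence features on a made-up toy corpus:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and labels, purely for illustration.
texts = ["great movie loved it", "terrible plot bad acting",
         "loved the acting great plot", "bad movie terrible acting"]
labels = [1, 0, 1, 0]

# Word and word-pair (bigram) presence features, in the spirit of BoosTexter.
X = CountVectorizer(ngram_range=(1, 2), binary=True).fit_transform(texts)

# AdaBoost whose default weak learner is a depth-1 tree (a decision stump);
# as with BoosTexter, the number of rounds T (n_estimators) is typically large.
model = AdaBoostClassifier(n_estimators=200).fit(X, labels)
print(model.predict(X))
```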