Learning Imbalanced Data with Random Forests

Learning Imbalanced Data with Random Forests
Chao Chen (Stat., UC Berkeley), chenchao@stat.berkeley.edu
Andy Liaw (Merck Research Labs), andy_liaw@merck.com
Leo Breiman (Stat., UC Berkeley), leo@stat.berkeley.edu
Interface 2004, Baltimore

Outline
- Imbalanced data
- Common approaches and recent work
- Balanced random forests
- Weighted random forests
- Some comparisons
- Conclusion

Imbalanced Data
- Data for many classification problems are inherently imbalanced: one large, normal class (negative) and one small, rare, interesting class (positive).
- Examples: rare diseases, fraud detection, compound screening in drug discovery, etc.
- Why is this a problem? Most machine learning algorithms focus on overall accuracy, and break down with even moderate imbalance in the data.
- Even some cost-sensitive algorithms don't work well when the imbalance is extreme.

Common Approaches
- Up-sampling the minority class: random sampling with replacement, or strategically adding cases that reduce the error.
- Down-sampling the majority class: random sampling, or strategically omitting cases that do not help.
- Cost-sensitive learning: build the misclassification cost into the algorithm.
- Down-sampling tends to work better empirically, but it loses some information, as not all training data are used.
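The two random-resampling approaches above can be sketched in a few lines of numpy. This is a minimal illustration on hypothetical toy data (the array names and sizes are my own, not from the slides): up-sample the minority class with replacement to the majority size, or down-sample the majority class without replacement to the minority size.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy imbalanced data: 95 negatives, 5 positives
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(size=(100, 3))

pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)

# Up-sample the minority class with replacement to match the majority
up = rng.choice(pos, size=len(neg), replace=True)
X_up = np.vstack([X[neg], X[up]])
y_up = np.concatenate([y[neg], y[up]])

# Down-sample the majority class without replacement to match the minority
down = rng.choice(neg, size=len(pos), replace=False)
X_down = np.vstack([X[down], X[pos]])
y_down = np.concatenate([y[down], y[pos]])
```

Up-sampling keeps all the data but duplicates minority cases; down-sampling discards most of the majority class, which is the information loss noted above.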

Recent Work
- One-sided sampling
- SMOTE: Synthetic Minority Oversampling TEchnique (Chawla et al., 2002)
- SMOTEBoost
- SHRINK

Random Forest
A supervised learning algorithm, constructed by combining multiple decision trees (Breiman, 2001):
- Draw a bootstrap sample of the data.
- Grow an un-pruned tree; at each node, only a small, random subset of the predictor variables is tried to split that node.
- Repeat as many times as you'd like.
- Make predictions using all trees.
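The procedure above can be sketched by combining bootstrap sampling with scikit-learn's decision trees. This is an illustrative re-implementation, not the authors' Fortran code; the function names and the toy data are mine, and `max_features="sqrt"` stands in for the "small random subset of predictors per node".

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rf_fit(X, y, n_trees=25, seed=0):
    """Grow un-pruned trees on bootstrap samples; each tree tries only a
    random subset of sqrt(p) predictors at every node."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(2**31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def rf_predict(trees, X):
    """Majority vote over all trees (binary 0/1 labels assumed)."""
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Toy usage: two well-separated Gaussian clouds
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
forest = rf_fit(X, y)
acc = (rf_predict(forest, X) == y).mean()
```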

Balanced Random Forest
- A natural integration of down-sampling the majority class and ensemble learning.
- For each tree in the forest, down-sample the majority class to the same size as the minority class.
- Given enough trees, all training data are used, so there is no loss of information.
- Computationally efficient, since each tree sees only a small sample.
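A sketch of the per-tree balanced sampling, again as an illustration rather than the authors' implementation (the toy data and the choice to bootstrap the minority class as well are my assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def brf_fit(X, y, n_trees=50, seed=0):
    """Balanced RF sketch: for each tree, draw a bootstrap sample from the
    minority class and an equal-sized bootstrap sample from the majority."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    k = len(minority)
    trees = []
    for _ in range(n_trees):
        idx = np.concatenate([rng.choice(minority, k, replace=True),
                              rng.choice(majority, k, replace=True)])
        tree = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(2**31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

# Imbalanced toy data: 300 negatives vs. 20 positives, well separated
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (300, 4)), rng.normal(3, 1, (20, 4))])
y = np.array([0] * 300 + [1] * 20)
trees = brf_fit(X, y)
votes = np.stack([t.predict(X) for t in trees]).mean(axis=0)
tpr = ((votes >= 0.5)[y == 1]).mean()  # recall on the minority class
```

Each tree is trained on only 2k cases (k per class), which is why the method stays cheap even on large majority classes.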

Weighted Random Forest
Incorporate class weights in several places of the RF algorithm:
- Weighted Gini for split selection.
- Class-weighted votes at terminal nodes to determine the node class.
- Weighted votes over all trees, using the average weights at terminal nodes.
Using weighted Gini alone isn't sufficient.
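The first ingredient, weighted Gini, can be shown with a small worked example. This is one standard way to weight the Gini impurity (scale each class's count by its weight before forming proportions); the counts and weights below are hypothetical:

```python
def weighted_gini(counts, weights):
    """Gini impurity with class weights: class counts are scaled by their
    weights before computing the class proportions."""
    total = sum(weights[c] * n for c, n in counts.items())
    return 1.0 - sum((weights[c] * n / total) ** 2 for c, n in counts.items())

# A node holding 90 majority and 10 minority cases
counts = {0: 90, 1: 10}
plain = weighted_gini(counts, {0: 1, 1: 1})     # ordinary Gini: 0.18
weighted = weighted_gini(counts, {0: 1, 1: 9})  # reciprocal class-ratio weights: 0.5
```

With weight 9 on the minority class, the 90/10 node looks maximally impure (Gini 0.5 instead of 0.18), so splits that isolate minority cases become much more attractive.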

Performance Assessment
Confusion matrix:

                      Actual Positive       Actual Negative
Predicted Positive    True Positive (TP)    False Positive (FP)
Predicted Negative    False Negative (FN)   True Negative (TN)

- True Positive Rate (TPR): TP / (TP + FN)
- True Negative Rate (TNR): TN / (TN + FP)
- Precision: TP / (TP + FP)
- Recall: same as TPR
- g-mean: (TPR × TNR)^(1/2)
- F-measure: (2 × Precision × Recall) / (Precision + Recall)
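The metrics above follow directly from the four confusion-matrix counts. A minimal helper, with hypothetical counts chosen for a round example:

```python
def imbalance_metrics(tp, fp, fn, tn):
    """Compute the metrics on the slide from confusion-matrix counts."""
    tpr = tp / (tp + fn)            # true positive rate = recall
    tnr = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    g_mean = (tpr * tnr) ** 0.5
    f_measure = 2 * precision * tpr / (precision + tpr)
    return tpr, tnr, precision, g_mean, f_measure

# Hypothetical counts: 40 TP, 10 FP, 10 FN, 940 TN
tpr, tnr, precision, g_mean, f_measure = imbalance_metrics(40, 10, 10, 940)
```

Note that overall accuracy here would be 980/1000 = 98% even though a fifth of the positives are missed, which is exactly why TPR, g-mean, and F-measure are preferred for imbalanced data.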

Benchmark Data

Dataset       No. of Var.   No. of Obs.   % Minority
Oil Spill     50            937           4.4
Mammography   6             11183         2.3
SatImage      36            6435          9.7

Oil Spill Data

Method             TPR    TNR    Precision   G-mean   F-meas.
1-sided sampling   76.0   86.6   20.5        81.13    32.3
SHRINK             82.5   60.9   8.85        70.9     16.0
SMOTE              89.5   78.9   16.4        84.0     27.7
BRF                73.2   91.6   28.6        81.9     41.1
WRF                92.7   82.4   19.4        87.4     32.1

Performance for 1-sided sampling, SHRINK, and SMOTE taken from Chawla et al. (2002).

Mammography Data

Method        TPR    TNR    Precision   G-mean   F-meas.
RIPPER        48.1   99.6   74.7        69.2     58.1
SMOTE         62.2   99.0   60.5        78.5     60.4
SMOTE-Boost   62.6   99.5   74.5        78.9     68.1
BRF           76.5   98.2   50.5        86.7     60.8
WRF           72.7   99.2   69.7        84.9     71.1

Performance for RIPPER, SMOTE, and SMOTE-Boost taken from Chawla et al. (2003).

Satimage Data

Method        TPR    TNR    Precision   G-mean   F-meas.
RIPPER        47.4   97.6   67.9        68.0     55.5
SMOTE         74.9   91.3   48.1        82.7     58.3
SMOTE-Boost   67.9   97.2   72.7        81.2     70.2
BRF           77.0   93.6   56.3        84.9     65.0
WRF           77.5   94.6   60.5        85.6     68.0

Performance for RIPPER, SMOTE, and SMOTE-Boost taken from Chawla et al. (2003).

A Simple Experiment: 2Norm
- Fix the size of one class at 100; vary the size of the other class among 5e3, 1e4, 5e4, and 1e5.
- Train both WRF and BRF; predict on a test set of the same size.
- WRF: use the reciprocal of the class ratio as the weights.
- BRF: draw 100 cases from each class with replacement to grow each tree.
- With the usual prediction, BRF has the better false negative rate; WRF has the better true positive rate.
- Compare cumulative gain to see the difference.
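The experimental setup above can be reproduced with a small generator for the 2Norm benchmark. As I understand Breiman's 2Norm, each class is a 20-dimensional unit-variance Gaussian with mean +a or -a, a = 2/√d; treat the exact constant as an assumption if you re-run this.

```python
import numpy as np

def make_twonorm(n_pos, n_neg, d=20, seed=0):
    """2Norm benchmark sketch: each class is a d-dimensional unit-variance
    Gaussian, with means at +a and -a where a = 2 / sqrt(d)."""
    rng = np.random.default_rng(seed)
    a = 2.0 / np.sqrt(d)
    X = np.vstack([rng.normal(+a, 1.0, size=(n_pos, d)),
                   rng.normal(-a, 1.0, size=(n_neg, d))])
    y = np.array([1] * n_pos + [0] * n_neg)
    return X, y

# As on the slide: fix the minority class at 100, vary the majority size
X, y = make_twonorm(n_pos=100, n_neg=5000)
```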

Comparing Cumulative Gain
[Figure: cumulative-gain curves (% positives found vs. number of cases examined, 0 to 500) comparing WRF and BRF at class ratios 1:50, 1:100, 1:500, and 1:1000.]

To Wrap Up
- We propose two methods of learning imbalanced data with random forests: BRF down-samples the majority class for each tree; WRF incorporates class weights in several places of the algorithm.
- Both show improvements over existing methods.
- The two are about equally effective on real data; it is hard to pick a winner.
- Further study is needed to see if, when, and why one works better than the other.

Free Software
- Random Forest (Breiman & Cutler): Fortran code, implements WRF; available at http://stat-www.berkeley.edu/users/breiman/randomforests/
- randomForest (Liaw & Wiener): add-on package for R (based on the Fortran code above), implements BRF; available on CRAN (e.g. http://cran.us.r-project.org/src/contrib/packages.html)

Acknowledgment
- Adele Cutler (Utah State)
- Vladimir Svetnik, Chris Tong, Ting Wang (BR)
- Matt Wiener (ACSM)