Competition II: Springleaf

Sha Li (team leader), Xiaoyan Chong, Minglu Ma, Yue Wang
CAMCOS, Fall 2015, San Jose State University

Agenda
- Kaggle competition: Springleaf dataset introduction
- Data preprocessing
- Classification methodologies & results: logistic regression, random forest, XGBoost, stacking
- Summary & conclusion

Kaggle Competition: Springleaf
Objective: predict whether customers will respond to a direct-mail loan offer.
- Customers: 145,231
- Independent variables: 1,932 anonymous features
- Dependent variable: target = 0 (did not respond) or target = 1 (responded)
- Training set: 96,820 obs.; testing set: 48,411 obs.

Dataset facts
- R package used to read the file: data.table::fread
- Class counts: target = 0: 111,458 obs. (76.7%); target = 1: 33,773 obs. (23.3%)
- Numerical variables: 1,876; character variables: 51; constant variables: 5
- Variable level counts: 67.0% of columns have at most 100 levels
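A minimal R sketch of loading the data as on this slide; the file name train.csv and the na.strings codes are assumptions:

    # Load the Springleaf training data with data.table::fread.
    # "train.csv" and the na.strings list are assumptions.
    library(data.table)

    train <- fread("train.csv", na.strings = c("NA", "", "[]", "-1"))
    dim(train)            # 145,231 rows; ID, target, and 1,932 features
    table(train$target)   # class counts: 111,458 zeros vs. 33,773 ones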

Missing values
- NA: 0.6%; [] and -1: 2.0%; sentinel codes (-99999, 96…, 999…, 99999999): 24.9%
- 25.3% of columns have missing values
[Plots: count of NAs in each column; count of NAs in each row]

Challenges for classification
- Huge dataset (145,231 × 1,932)
- Anonymous features
- Uneven distribution of the response variable
- 27.6% missing values
- Both numerical and categorical variables, with an undetermined portion of categorical variables
- Data preprocessing complexity

Data preprocessing
- Remove ID and target
- Replace [] and -1 with NA
- Remove duplicate columns
- Recode character columns
- Handle NAs: replace with the median, treat NA as a new group, or replace randomly
- Remove low-variance columns
- Normalize: log(1 + |x|)
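A compact R sketch of these steps, continuing from the fread call above; the exact sentinel codes (two are truncated on the slide), the median-imputation branch, and the variance threshold are assumptions:

    # Sketch of the preprocessing pipeline; details the slide leaves
    # open (sentinel list, thresholds) are assumptions.
    library(data.table)

    y <- train$target
    train[, c("ID", "target") := NULL]           # remove ID and target

    char_cols <- names(train)[sapply(train, is.character)]
    for (col in char_cols)                       # recode character columns as integers
      set(train, j = col, value = as.integer(as.factor(train[[col]])))

    train <- train[, lapply(.SD, as.numeric)]    # work on one all-numeric table
    sentinels <- c(-99999, 99999999)             # two sentinel codes; others truncated on the slide
    for (col in names(train)) {
      set(train, i = which(train[[col]] %in% sentinels), j = col, value = NA_real_)
      med <- median(train[[col]], na.rm = TRUE)  # impute NA with the median
      set(train, i = which(is.na(train[[col]])), j = col, value = med)
    }

    train <- train[, !duplicated(as.list(train)), with = FALSE]  # drop duplicate columns
    train <- train[, sapply(train, var) > 1e-8, with = FALSE]    # drop low-variance columns
    train <- train[, lapply(.SD, function(x) log1p(abs(x)))]     # normalize: log(1 + |x|)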

Principal Component Analysis
About 400 principal components are enough to explain 90% of the variance. [Plot: variance explained vs. number of components]
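A sketch of this step with base R's prcomp on the preprocessed table; centering and scaling choices are assumptions:

    # PCA on the preprocessed numeric table; scale. = TRUE is an assumption.
    pca <- prcomp(as.matrix(train), center = TRUE, scale. = TRUE)
    cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    which(cum_var >= 0.90)[1]   # per the slide, roughly 400 components
    scores <- pca$x[, 1:400]    # reduced features for the later classifiers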

LDA: Linear Discriminant Analysis
We are interested in the most discriminatory direction, not the direction of maximum variance: find the direction that best separates the two classes. When the class means µ1 and µ2 are close and the variances Var1 and Var2 are large, the classes overlap significantly.
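A minimal sketch with MASS::lda; running it on the PCA scores rather than the raw columns is an assumption:

    # Fisher LDA: find the single most discriminatory direction.
    library(MASS)
    lda_fit <- lda(x = scores, grouping = y)
    w <- lda_fit$scaling        # direction that best separates the two classes
    proj <- scores %*% w        # 1-D projection of the data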

Methodology
- K-nearest neighbors (KNN)
- Support vector machine (SVM)
- Logistic regression
- Random forest
- XGBoost (extreme gradient boosting)
- Stacking

K-Nearest Neighbors (KNN)
Accuracy (%) as a function of K:
K            3      5      7      11     15     21     39
Overall      72.1   73.9   75.0   76.1   76.5   76.8   77.0
Target = 1   22.8   18.3   15.3   12.1   10.5   9.4    7.5
Overall accuracy rises slowly with K while accuracy on target = 1 drops sharply.
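A sketch of one KNN run with the class package; the train/test split and the use of PCA scores as features are assumptions:

    # KNN for a single K; distances are computed in the PCA score space.
    library(class)
    set.seed(1)
    idx  <- sample(nrow(scores), 96820)      # 96,820 train / 48,411 test as on the slides
    pred <- knn(train = scores[idx, ], test = scores[-idx, ],
                cl = factor(y[idx]), k = 39)
    mean(pred == y[-idx])                    # overall accuracy (77.0% at K = 39 above)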

Support Vector Machine (SVM)
- Expensive: takes a long time for each run
- Good results for numerical data
Confusion matrix (rows: truth; columns: prediction):
           pred 0   pred 1
truth 0    19,609      483
truth 1     5,247      803
Accuracy: overall 78.1%; target = 1: 13.3%; target = 0: 97.6%
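A sketch with e1071::svm; the RBF kernel and the training subsample (a full run is expensive at this scale, as the slide notes) are assumptions:

    # SVM on a subsample to keep the run time manageable.
    library(e1071)
    sub <- sample(idx, 20000)
    svm_fit  <- svm(x = scores[sub, ], y = factor(y[sub]), kernel = "radial")
    svm_pred <- predict(svm_fit, scores[-idx, ])
    table(truth = y[-idx], prediction = svm_pred)   # confusion matrix as above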

Logistic Regression
Logistic regression is a regression model for a categorical dependent variable. It measures the relationship between the dependent variable and the independent variables by estimating class probabilities.

Logistic Regression
[Plots: overall and target = 1 accuracy as the number of principal components varies from 2 to 320]
Confusion matrix (rows: truth; columns: prediction):
           pred 0   pred 1
truth 0    53,921    3,159
truth 1    12,450    4,853
Accuracy: overall 79.2%; target = 1: 28.1%; target = 0: 94.5%
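A sketch of logistic regression on the leading principal components with base R's glm; fixing 200 components is an assumption (the slide sweeps the count from 2 to 320):

    # Logistic regression on the first 200 principal components (assumed).
    lr_data <- data.frame(scores[, 1:200], target = y)
    lr_fit  <- glm(target ~ ., data = lr_data[idx, ], family = binomial)
    lr_prob <- predict(lr_fit, lr_data[-idx, ], type = "response")  # P(target = 1)
    lr_pred <- as.integer(lr_prob > 0.5)
    table(truth = y[-idx], prediction = lr_pred)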

Random Forest
- Machine learning ensemble algorithm: combines multiple predictors
- Based on tree models; works for both regression and classification
- Automatic variable selection
- Handles missing values
- Robust: improves model stability and accuracy

Random Forest
Training pipeline: draw bootstrap samples from the training data, build a random tree on each sample, predict with every tree, and combine the predictions by majority vote.

Random Forest
[Plot: misclassification error vs. number of trees (up to 500) for overall, target = 0, and target = 1]
Confusion matrix (rows: truth; columns: prediction):
           pred 0   pred 1
truth 0    36,157    1,181
truth 1     8,850    2,223
Accuracy: overall 79.3%; target = 1: 20.1%; target = 0: 96.8%
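A sketch with the randomForest package; ntree = 500 follows the slide's error plot, and the remaining settings are left at their defaults:

    # Random forest: bootstrap samples + random trees + majority vote.
    library(randomForest)
    rf_fit  <- randomForest(x = train[idx, ], y = factor(y[idx]), ntree = 500)
    rf_pred <- predict(rf_fit, train[-idx, ])
    table(truth = y[-idx], prediction = rf_pred)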

XGBoost
- Additive tree model: new trees are added to complement the trees already built
- The response is the optimal linear combination of all the decision trees
- Popular in Kaggle competitions for its efficiency and accuracy
[Plot: error vs. number of trees for the greedy additive algorithm]

XGBoost
[Plots: training and test error vs. number of boosting rounds]
Confusion matrix (rows: truth; columns: prediction):
           pred 0   pred 1
truth 0    35,744    1,467
truth 1     8,201    2,999
Accuracy: overall 80.0%; target = 1: 26.8%; target = 0: 96.1%
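A sketch with the xgboost package; the parameter values are illustrative assumptions, not the team's tuned settings:

    # Gradient-boosted trees: each new tree complements those already built.
    library(xgboost)
    dtrain <- xgb.DMatrix(as.matrix(train[idx, ]),  label = y[idx])
    dtest  <- xgb.DMatrix(as.matrix(train[-idx, ]), label = y[-idx])
    params <- list(objective = "binary:logistic", eta = 0.05, max_depth = 8,
                   subsample = 0.8, colsample_bytree = 0.8)
    xgb_fit  <- xgb.train(params, dtrain, nrounds = 500,
                          watchlist = list(train = dtrain, test = dtest))
    xgb_pred <- as.integer(predict(xgb_fit, dtest) > 0.5)
    table(truth = y[-idx], prediction = xgb_pred)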

Methods Comparison
[Bar chart comparing the classifiers: overall accuracy 77.0, 78.1, 77.8, 79.0, 79.2, and 80.0%; target = 1 accuracy 6.6, 13.3, 19.0, 20.1, 28.1, and 26.8%]

Winner or Combination?

Stacking
Main idea: learn and combine multiple classifiers.
Labeled data is fed to base learners C1, C2, …, Cn; their predictions become meta features for a meta learner, which makes the final prediction. The same pipeline is applied to the train and test sets.

Generating Base and Meta Learners
Base models need efficiency, accuracy, and diversity, obtained by:
- Sampling training examples
- Sampling features
- Using different learning models
Meta learner options:
- Unsupervised: majority voting, weighted averaging, k-means
- Supervised: a higher-level classifier (XGBoost)

Stacking Model
❶ Base learners: XGBoost, logistic regression, and random forest are trained on different views of the total data (sparse, condensed, and low-level PCA); their predictions become meta features.
❷ Combined data: the meta features are joined with the total data.
❸ Meta learner: XGBoost trained on the combined data produces the final prediction.
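A sketch of one base learner's contribution to step ❶, using out-of-fold predictions as meta features; the 5-fold scheme and the reuse of `params` from the XGBoost sketch are assumptions:

    # ❶ Out-of-fold predictions from one base learner become a meta feature.
    library(xgboost)
    folds <- sample(rep(1:5, length.out = length(idx)))
    meta  <- numeric(length(idx))
    for (f in 1:5) {
      dtr <- xgb.DMatrix(as.matrix(train[idx[folds != f], ]),
                         label = y[idx[folds != f]])
      fit <- xgb.train(params, dtr, nrounds = 300)
      meta[folds == f] <- predict(fit, as.matrix(train[idx[folds == f], ]))
    }
    # ❷ Combine the meta feature(s) with the total data, then
    # ❸ train the XGBoost meta learner on the combined data.
    combined <- xgb.DMatrix(cbind(as.matrix(train[idx, ]), meta), label = y[idx])
    meta_fit <- xgb.train(params, combined, nrounds = 300)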

Stacking Results
Base models (overall accuracy / target = 1 accuracy):
- XGBoost + total data: 80.0% / 28.5%
- XGBoost + condensed data: 79.5% / 27.9%
- XGBoost + low-level data: 79.5% / 27.7%
- Logistic regression + sparse data: 78.2% / 26.8%
- Logistic regression + condensed data: 79.1% / 28.1%
- Random forest + PCA: 77.6% / 20.9%
Meta models (overall accuracy / target = 1 accuracy):
- XGBoost: 81.11% / 29.21%
- Averaging: 79.44% / 27.31%
- K-means: 77.45% / 23.91%

Summary and Conclusion
- A real-world data mining project: huge and noisy data
- Data preprocessing: feature encoding; several treatments of missing values (new level, median/mean, or random assignment)
- Classification techniques: categorical variables are dominant, so distance-based classifiers are not suitable and classifiers that handle mixed variable types are preferred
- Stacking provides a further improvement; the biggest gains came from model selection, parameter tuning, and stacking
- Result comparison: winner's result 80.4%; our result 79.5%

Acknowledgements
We would like to express our deep gratitude to the following people and organizations:
- Profs. Bremer and Simic, for the proposal that made this project possible
- The Woodward Foundation, for funding
- Prof. Simic and CAMCOS, for all the support
- Prof. Chen, for his guidance, valuable comments, and suggestions

QUESTIONS?