Elie Kawerk Data Scientist


MACHINE LEARNING WITH TREE-BASED MODELS IN PYTHON Bagging Elie Kawerk Data Scientist

Ensemble Methods Voting Classifier: same training set, different algorithms. Bagging: one algorithm, different subsets of the training set.

Bagging Bagging: Bootstrap Aggregation. Uses a technique known as the bootstrap. Reduces variance of individual models in the ensemble.
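The bootstrap mentioned above draws a sample of the same size as the training set, with replacement. The following is a minimal stdlib-only sketch; the toy training set, the helper name `bootstrap_sample`, and the seed are assumptions made for illustration, not part of the original slides.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
training_set = ["A", "B", "C", "D", "E"]  # toy training instances (assumed)

# Each bootstrap sample has the same size as the training set, but some
# instances may appear several times and others may be missing.
samples = [bootstrap_sample(training_set, rng) for _ in range(3)]
for i, sample in enumerate(samples, start=1):
    print(f"bootstrap sample {i}: {sample}")
```

In bagging, each of these samples would be used to train one copy of the same algorithm.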

Bootstrap

Bagging: Training

Bagging: Prediction

Bagging: Classification & Regression Classification: Aggregates predictions by majority voting. BaggingClassifier in scikit-learn. Regression: Aggregates predictions through averaging. BaggingRegressor in scikit-learn.
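The two aggregation rules named above can be written out by hand. This is only an illustration of majority voting and averaging on made-up predictions; the function names are hypothetical and the code is not scikit-learn's internal implementation.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(predictions):
    """Majority vote: return the label predicted by the most models.

    Ties are resolved in favour of the label seen first (an arbitrary choice).
    """
    return Counter(predictions).most_common(1)[0][0]


def aggregate_regression(predictions):
    """Averaging: return the mean of the individual models' predictions."""
    return mean(predictions)


class_votes = [1, 0, 1, 1, 0]   # labels predicted by five models (made up)
reg_outputs = [2.0, 3.0, 4.0]   # values predicted by three models (made up)

print(aggregate_classification(class_votes))  # 1
print(aggregate_regression(reg_outputs))      # 3.0
```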

Bagging Classifier in sklearn (Breast-Cancer dataset)

    # Import models and utility functions
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Set seed for reproducibility
    SEED = 1

    # Split data into 70% train and 30% test
    # (X and y, the feature matrix and labels, are assumed to be loaded already)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=SEED)

Bagging Classifier in sklearn (Breast-Cancer dataset)

    # Instantiate a classification-tree 'dt'
    dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16,
                                random_state=SEED)

    # Instantiate a BaggingClassifier 'bc'
    # (scikit-learn 1.2 and later name this parameter `estimator`)
    bc = BaggingClassifier(base_estimator=dt, n_estimators=300, n_jobs=-1)

    # Fit 'bc' to the training set
    bc.fit(X_train, y_train)

    # Predict test set labels
    y_pred = bc.predict(X_test)

    # Evaluate and print test-set accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print('Accuracy of Bagging Classifier: {:.3f}'.format(accuracy))
    # Prints: Accuracy of Bagging Classifier: 0.936


Out Of Bag Evaluation

Bagging Some instances may be sampled several times for one model, while other instances may not be sampled at all.

Out Of Bag (OOB) instances On average, for each model, 63% of the training instances are sampled. The remaining 37% constitute the OOB instances.
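The 63% figure follows from sampling with replacement: a given instance is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e (about 0.37) as n grows. The sketch below checks this numerically; the sample size of 1000 and the seed are arbitrary assumptions.

```python
import math
import random


def expected_sampled_fraction(n):
    """Probability that a given instance appears at least once
    in a bootstrap sample of size n drawn from n instances."""
    return 1 - (1 - 1 / n) ** n


limit = 1 - 1 / math.e  # large-n limit of the sampled fraction

# Empirical check: draw one bootstrap sample of indices and count
# how many distinct instances it contains.
rng = random.Random(1)
n = 1000
drawn = {rng.randrange(n) for _ in range(n)}
empirical_fraction = len(drawn) / n

print(f"theoretical (n={n}): {expected_sampled_fraction(n):.3f}")
print(f"limit 1 - 1/e:       {limit:.3f}")
print(f"empirical:           {empirical_fraction:.3f}")
```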

OOB Evaluation

OOB Evaluation in sklearn (Breast Cancer Dataset)

    # Import models and split utility function
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Set seed for reproducibility
    SEED = 1

    # Split data into 70% train and 30% test
    # (X and y, the feature matrix and labels, are assumed to be loaded already)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=SEED)

OOB Evaluation in sklearn (Breast Cancer Dataset)

    # Instantiate a classification-tree 'dt'
    dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16,
                                random_state=SEED)

    # Instantiate a BaggingClassifier 'bc'; set oob_score=True
    # (scikit-learn 1.2 and later name the first parameter `estimator`)
    bc = BaggingClassifier(base_estimator=dt, n_estimators=300,
                           oob_score=True, n_jobs=-1)

    # Fit 'bc' to the training set
    bc.fit(X_train, y_train)

    # Predict the test set labels
    y_pred = bc.predict(X_test)

OOB Evaluation in sklearn (Breast Cancer Dataset)

    # Evaluate test set accuracy
    test_accuracy = accuracy_score(y_test, y_pred)

    # Extract the OOB accuracy from 'bc'
    oob_accuracy = bc.oob_score_

    # Print test set accuracy
    print('Test set accuracy: {:.3f}'.format(test_accuracy))
    # Prints: Test set accuracy: 0.936

    # Print OOB accuracy
    print('OOB accuracy: {:.3f}'.format(oob_accuracy))
    # Prints: OOB accuracy: 0.925
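To make the idea behind OOB evaluation concrete, here is a stdlib-only sketch on a toy one-dimensional dataset. Everything in it is an assumption for illustration: the synthetic data, the 1-nearest-neighbour base learner, and the choice to score each instance by a majority vote among the models that never saw it. It is one plausible way to compute an OOB accuracy, not a copy of scikit-learn's code.

```python
import random
from collections import Counter

rng = random.Random(1)

# Toy 1-D dataset (assumed): class 0 centred at 0.0, class 1 centred at 3.0.
X = [rng.gauss(0.0, 1.0) for _ in range(30)] + \
    [rng.gauss(3.0, 1.0) for _ in range(30)]
y = [0] * 30 + [1] * 30
n = len(X)


def predict_1nn(train_idx, x):
    """Toy base learner: return the label of the nearest training point."""
    nearest = min(train_idx, key=lambda i: abs(X[i] - x))
    return y[nearest]


# "Train" 50 models: each model is fully described by the set of
# bootstrap indices it was given.
models = [{rng.randrange(n) for _ in range(n)} for _ in range(50)]

# Score every instance using only the models for which it is out of bag.
correct = 0
evaluated = 0
for i in range(n):
    votes = [predict_1nn(idx, X[i]) for idx in models if i not in idx]
    if not votes:  # instance was sampled by every model; skip it
        continue
    evaluated += 1
    majority = Counter(votes).most_common(1)[0][0]
    if majority == y[i]:
        correct += 1

oob_accuracy = correct / evaluated
print(f"OOB accuracy on {evaluated} instances: {oob_accuracy:.3f}")
```

Because each instance is scored only by models that did not train on it, no separate validation split is needed.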


Random Forests

Bagging Base estimator: Decision Tree, Logistic Regression, Neural Net, ... Each estimator is trained on a distinct bootstrap sample of the training set. Estimators use all features for training and prediction.

Further Diversity with Random Forests Base estimator: Decision Tree. Each estimator is trained on a different bootstrap sample having the same size as the training set. RF introduces further randomization in the training of individual trees: d features are sampled at each node without replacement (d < total number of features).
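The per-node feature sampling described above can be sketched as follows. The feature names, the value d = 2, and the helper `candidate_features` are assumptions chosen for illustration; a real tree learner would then search for the best split among the sampled features only.

```python
import random


def candidate_features(all_features, d, rng):
    """Sample d distinct features (without replacement) to consider at one node."""
    if not 0 < d < len(all_features):
        raise ValueError("d must satisfy 0 < d < total number of features")
    return rng.sample(all_features, d)


rng = random.Random(1)
features = [f"feature_{i}" for i in range(6)]  # placeholder feature names
d = 2                                          # assumed number of features per node

# Each node of each tree draws its own random subset of d features.
node_subsets = [candidate_features(features, d, rng) for _ in range(4)]
for node_id, subset in enumerate(node_subsets):
    print(f"node {node_id}: split candidates = {subset}")
```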

Random Forests: Training

Random Forests: Prediction

Random Forests: Classification & Regression Classification: Aggregates predictions by majority voting. RandomForestClassifier in scikit-learn. Regression: Aggregates predictions through averaging. RandomForestRegressor in scikit-learn.

Random Forests Regressor in sklearn (auto dataset)

    # Basic imports
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error as MSE

    # Set seed for reproducibility
    SEED = 1

    # Split dataset into 70% train and 30% test
    # (X and y, the features and target, are assumed to be loaded already)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=SEED)

Random Forests Regressor in sklearn (auto dataset)

    # Instantiate a random forests regressor 'rf' with 400 estimators
    rf = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.12,
                               random_state=SEED)

    # Fit 'rf' to the training set
    rf.fit(X_train, y_train)

    # Predict the test set labels 'y_pred'
    y_pred = rf.predict(X_test)

    # Evaluate the test set RMSE
    rmse_test = MSE(y_test, y_pred)**(1/2)

    # Print the test set RMSE
    print('Test set RMSE of rf: {:.2f}'.format(rmse_test))
    # Prints: Test set RMSE of rf: 3.98
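The RMSE reported above is the square root of the mean squared error. As a small worked example with made-up numbers (not the auto dataset), the same quantity can be computed with the standard library alone; the function name `rmse` is an assumption.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


y_true_toy = [3.0, 5.0, 7.0]  # made-up targets
y_pred_toy = [2.0, 5.0, 9.0]  # made-up predictions

# Squared errors are 1, 0 and 4, so MSE = 5/3 and RMSE = sqrt(5/3).
print(f"RMSE: {rmse(y_true_toy, y_pred_toy):.2f}")
```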

Feature Importance Tree-based methods enable measuring the importance of each feature in prediction. In sklearn, importance reflects how much the tree nodes use a particular feature (weighted average) to reduce impurity. It is accessed using the attribute feature_importances_.
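One way to read "how much the tree nodes use a feature to reduce impurity" is as a normalized sum of weighted impurity decreases. The sketch below works this out by hand for a tiny hypothetical tree with Gini impurity; the tree, the feature names "a" and "b", and the helper functions are all assumptions, and the code is not scikit-learn's implementation.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity reduction of one split, weighted by the fraction
    of all training samples that reach the split node."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)


# Hypothetical tree on 8 samples: the root splits on feature "a",
# both children split on feature "b".
root = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]
splits = [
    ("a", root, left, right),
    ("b", left, [0, 0, 0], [1]),
    ("b", right, [0], [1, 1, 1]),
]

# Sum the weighted decreases per feature, then normalize to sum to 1.
raw = {}
for feature, parent, l_child, r_child in splits:
    decrease = weighted_impurity_decrease(parent, l_child, r_child, len(root))
    raw[feature] = raw.get(feature, 0.0) + decrease
total = sum(raw.values())
importances = {feature: value / total for feature, value in raw.items()}

print(importances)  # {'a': 0.25, 'b': 0.75}
```

In a forest, such per-tree values would be averaged over all trees.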

Feature Importance in sklearn

    import pandas as pd
    import matplotlib.pyplot as plt

    # Create a pd.Series of feature importances
    # (X is assumed to be a DataFrame, so X.columns holds the feature names)
    importances_rf = pd.Series(rf.feature_importances_, index=X.columns)

    # Sort importances_rf
    sorted_importances_rf = importances_rf.sort_values()

    # Make a horizontal bar plot
    sorted_importances_rf.plot(kind='barh', color='lightgreen')
    plt.show()

Feature Importance in sklearn
