
Machine Learning
http://datascience.tntlab.org
Module 12

Today's Agenda
How You're Already Using Machine Learning Models
Overview of Statistical Analysis vs. Machine Learning
  Terminology differences
  Model selection
Technical Walkthrough
  Walkthrough of caret
  Some starter algorithms: tree, random forest, LASSO, ridge
  Cross-validation and model comparisons

Key to Understanding Machine Learning
You've already learned a lot of these concepts in statistics classes. A lot of new terms are used for things you already have words for.
  "Training dataset" = "dataset"
  "Train" = "provide data to create a predictive model"
The key to understanding machine learning is relating it back to what you already know, and extending those ideas where concepts are genuinely new.

You Are Already Using Machine Learning
Have you ever created an OLS linear regression model? Congrats; you're a data scientist.
You may have also used more advanced machine learning algorithms without realizing it.
  EM imputation: uses an expectation-maximization algorithm to predict missing data.

Statistical Analysis vs. Machine Learning
Statistical Analysis
  Focus on interpretability
  Assumption checking (i.e., integrity of the mathematical approach)
  Interpretability of component parts/predictors
  "Given this theoretical model, how well do the data describe y?"
  Goal: draw conclusions about predictors
Machine Learning
  Focus on generalizable prediction
  Intention to take an algorithm developed in one context and use it in another
  "Given these data, what algorithm will predict y most consistently in other datasets with similar generative characteristics?"
  Goal: predict as strongly as possible, equally well in the future
You will see many of the same predictive modeling techniques in both.

Types of Machine Learning (by Process)
Supervised Learning (the focus in DataCamp)
  Regression models: continuous DVs
  Classification models: discrete DVs
  Decision tree models
  Neural networks
Semi-supervised Learning
Unsupervised Learning
  K-means clustering (which you might already know)
Reinforcement Learning

Classic Machine Learning Model Cheat Sheet
http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Also see: https://github.com/rstudio/cheatsheets/raw/master/caret.pdf

Problems Supervised Machine Learning Solves
When you have many predictors, overall effect estimates like R² will be inflated.
  In psychology, we usually use adjusted R² to account for this; it estimates the amount of shrinkage in R² likely to be seen due to local overfitting (see the sketch below).
  However, adjusted R² will always reveal relatively poor prediction; sometimes extremely poor, and always poorer with more predictors.
Machine learning is designed to better predict "true" variance despite the noise created by a complex predictor space.
If N is orders of magnitude larger than k, you probably don't need machine learning (or rather, it won't get you much better prediction anyway).
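For reference, a quick R sketch of that shrinkage adjustment (this is just the standard adjusted R² formula; the example values are invented; n = sample size, k = number of predictors):

adj_r2 <- function(r2, n, k) {
  # Standard adjusted R-squared: penalizes R2 as k grows relative to n
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}
adj_r2(r2 = .30, n = 100, k = 20)   # about .12: substantial shrinkage with many predictors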

Problems Supervised Machine Learning Creates
There are many ways to model the relationship between y and a set of x: researcher degrees of freedom.
  Model selection: linear regression, random forest, support vector machines, etc.
  Parameter selection: variable selection (not to be confused with parameter estimation)
  Hyperparameter selection: configuration options for the model
    Model selection itself might be considered a hyperparameter
    Most models are optimized to a loss function; hyperparameters can themselves be optimized
If you're trying to eke out every last bit of true variance, regardless of where it comes from, you're going to need to get creative.
This comes with a distinct interpretability vs. prediction tradeoff.
Every one of these algorithms has a literature at least as complex as, if not more complex than, the full courses you've taken on ANOVA or regression.

Linear Regression as Machine Learning
You probably learned one-predictor OLS linear regression as the solution to a mathematical formula.
But what if you didn't know the formulas? What would you do?
1. Guess the value of b, predict your data, and look at the mean squared residual (MSR)
2. Change b in one direction, predict again, and see if MSR changed
3. If it went down, change b further that way; if it didn't, go the other way
4. Repeat until you can't get MSR any smaller
In machine learning terms, we call the MSR for regression the "cost function." An iterative procedure like this for finding the minimum cost is called gradient descent (stochastic gradient descent when each step uses a random subset of the data); a minimal sketch follows below.
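To make the idea concrete, here is a minimal R sketch of that search for a single slope b (intercept omitted; the toy x and y are invented for illustration). It is a naive hill-climbing version of the procedure above, not production gradient descent:

# Toy data: y is roughly 2*x plus noise
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

msr <- function(b) mean((y - b * x)^2)   # the "cost function": mean squared residual

b <- 0          # step 1: guess b
step <- 0.1
for (i in 1:1000) {
  if (msr(b + step) < msr(b)) {          # steps 2-3: move in whichever direction lowers MSR
    b <- b + step
  } else if (msr(b - step) < msr(b)) {
    b <- b - step
  } else {
    step <- step / 2                     # step 4: shrink the step when no move helps
  }
}
b                     # close to the OLS slope
coef(lm(y ~ 0 + x))   # compare with the closed-form OLS answer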

Why Use caret? What is caret?
  Does not actually contain any machine learning algorithms
  Provides a common framework/syntax to access many other packages that do contain machine learning algorithms
  Centralizes tuning of hyperparameters across algorithms
  Automates mathematical modeling that you'd normally need to do by hand (by code)
  You need to know how to use those packages to use their functions in caret; each one is slightly different

Supervised Learning with caret
Basic Regression
  model <- train(
    formula, data,
    method = "lm",
    preProcess = c("center", "scale", "zv"),
    trControl = trainControl(method = "cv", number = 10, verboseIter = TRUE)
  )
Basic Classification
  model <- train(
    formula, data,
    method = "glmnet",
    preProcess = c("center", "scale", "zv", "conditionalX"),
    trControl = trainControl(method = "cv", number = 10, verboseIter = TRUE,
                             summaryFunction = twoClassSummary, classProbs = TRUE)
  )
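As a self-contained illustration of the regression template (the built-in mtcars data is used here purely as an example; it is not from the slides):

library(caret)
set.seed(42)
# Predict fuel economy (mpg) from all other mtcars variables with 10-fold CV
model <- train(
  mpg ~ .,
  data = mtcars,
  method = "lm",
  preProcess = c("center", "scale", "zv"),
  trControl = trainControl(method = "cv", number = 10)
)
model                       # cross-validated RMSE, R-squared, and MAE
summary(model$finalModel)   # the underlying lm fit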

Pre-Processing: Missing Values
Add preProcess = "" to caret syntax.
A reminder about missing values:
  NMAR: not missing at random
  MAR: missing at random
  MCAR: missing completely at random
Imputation of missing values:
  Median imputation: assumes MCAR, so just don't
  K-nearest neighbors: assumes MAR, so use it if you're comfortable with that
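A minimal sketch of requesting imputation through preProcess ("knnImpute" and "medianImpute" are caret's built-in imputation options; my_data and y are hypothetical names invented for illustration):

library(caret)
model <- train(
  y ~ ., data = my_data,      # hypothetical data frame with missing predictor values
  method = "lm",
  preProcess = "knnImpute",   # or "medianImpute" if you truly believe MCAR
  na.action = na.pass,        # let the NAs reach the imputation step instead of erroring
  trControl = trainControl(method = "cv", number = 10)
)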

Pre-Processing Options
Centering and standardizing: preProcess = c("center", "scale")
Box-Cox transformation for non-linearity: preProcess = "BoxCox"
Removal of non-varying (zero-variance) predictors: preProcess = "zv"   # or near-zero variance with "nzv"
Run a PCA and use components as predictors instead: preProcess = "pca"
Remove highly correlated pairs of variables (by default, r > .9): preProcess = "corr"
In classification, remove predictors if there is no variance within a class: preProcess = "conditionalX"
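A short sketch combining several of these options in one call (the built-in iris data is used only for illustration; caret applies the pre-processing steps in its own fixed internal order, regardless of how you list them):

library(caret)
set.seed(42)
model <- train(
  Species ~ ., data = iris,
  method = "glmnet",
  preProcess = c("zv", "corr", "center", "scale"),   # drop constants and r > .9 pairs, then standardize
  trControl = trainControl(method = "cv", number = 10)
)
model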

Machine Learning Models and Their Hyperparameters
To tune hyperparameters in caret, add:
  tuneLength = 3            # number of levels to try for the default tuning parameter(s)
  tuneGrid = expand.grid()  # change many tuning parameters yourself
method = "ranger"  # random forests
  tuneLength = 10  # changes mtry, which you could also set by hand
method = "glmnet"  # ridge and LASSO regression
  mygrid <- expand.grid(alpha = c(0, 1),                        # 0 = ridge, 1 = LASSO
                        lambda = seq(0.0001, 0.1, length = 10))  # complexity penalties
Find a full list of methods with names(getModelInfo()).
Find explanations, including permitted hyperparameters, at http://topepo.github.io/caret/train-models-by-tag.html
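A hedged end-to-end example of passing a custom grid to train() (again using the built-in mtcars data purely for illustration):

library(caret)
set.seed(42)
mygrid <- expand.grid(alpha  = c(0, 0.5, 1),                   # ridge, elastic net, LASSO
                      lambda = seq(0.0001, 0.1, length = 10))  # penalty strengths
model <- train(
  mpg ~ ., data = mtcars,
  method = "glmnet",
  tuneGrid = mygrid,
  trControl = trainControl(method = "cv", number = 10)
)
model$bestTune   # the alpha/lambda pair with the best cross-validated RMSE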

Machine Learning: Decision Trees
Can be regression trees or classification trees (in general: CART).
A very simple classification tree roughly works like this:
  Look at y and each x to see which split creates the two most homogeneous groups; this becomes the root node.
  For each group created by the root node, repeat the homogeneity test.
    If you get better overall prediction with the new split, create it.
    If you don't, stop following this path (i.e., this is a leaf).
  Continue this process (recursively) until you hit a stopping rule, such as too little additional variance predicted.
  Refine the model by pruning, which removes leaves that don't contribute much to overall prediction.
Regression trees work similarly, but creating groups is more complex. (A short example follows below.)
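A minimal sketch of fitting a classification tree through caret (method "rpart" is caret's CART implementation; the built-in iris data is used only as an example):

library(caret)
set.seed(42)
tree <- train(
  Species ~ ., data = iris,
  method = "rpart",   # CART via the rpart package
  tuneLength = 5,     # tries 5 values of the complexity parameter cp, which controls pruning
  trControl = trainControl(method = "cv", number = 10)
)
tree$finalModel       # prints the splits of the selected tree
# plot(tree$finalModel); text(tree$finalModel)   # quick base-graphics view of the tree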

Machine Learning: Decision Trees
[Example decision tree diagrams, from https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052 and https://computersciencesource.files.wordpress.com/2010/01/detresb.png]

Machine Learning: Random Forests
Many decision trees; each is grown from a random subset of cases and predictors to minimize the chance of overly influential cases (and thus overfitting).
Because they involve later splits conditional on earlier splits, they by definition model interactions without explicit interaction terms.
  (The pictured example shows an interaction between sibsp, age, and sex, plus their main effects.)
  Thus, transformations are less important here too.
Ultimately they use an "ensemble method" to combine trees; in this case (classification), the mode of the trees' predictions.
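A minimal caret sketch of a random forest (method "ranger" is one of caret's random forest engines; the built-in iris data is again used only as an example):

library(caret)
set.seed(42)
rf <- train(
  Species ~ ., data = iris,
  method = "ranger",   # fast random forest implementation
  tuneLength = 3,      # tries a few values of mtry (predictors sampled at each split)
  trControl = trainControl(method = "cv", number = 10)
)
rf                     # cross-validated accuracy across the tuning values tried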

Machine Learning: LASSO/Ridge Regression
LASSO and ridge combine automated predictor selection with regression.
In ML, the goal is to minimize the result of the "cost" function.
  Remember: in lm, this is the mean squared residual.
  In LASSO and ridge, it is the mean squared residual plus a penalty ("bias") term.
  Gradient descent is used to create parameter estimates.
LASSO performs "L1 regularization," in which the penalty term is lambda times the sum of the absolute values of the parameters, which can force some to exactly zero.
Ridge performs "L2 regularization," in which the penalty term is lambda times the sum of the squared parameters, which shrinks all parameters.
Elastic net performs "L1/L2 regularization," i.e., a weighted mix of both the L1 and L2 penalties.
Hyperparameters (see the sketch below):
  alpha: the balance between LASSO and ridge
  lambda: the strength of the penalty used to either drop (LASSO) or down-weight (ridge) predictors
The power of ML comes from testing combinations of hyperparameters.
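As a hedged sketch of what those cost functions look like in R (function names invented here; glmnet's actual internal objective differs slightly in its scaling conventions):

# Penalized cost for a coefficient vector b, given outcome y and predictor matrix X
cost_lasso <- function(b, X, y, lambda) mean((y - X %*% b)^2) + lambda * sum(abs(b))   # L1
cost_ridge <- function(b, X, y, lambda) mean((y - X %*% b)^2) + lambda * sum(b^2)      # L2
cost_enet  <- function(b, X, y, lambda, alpha)                                         # weighted mix
  mean((y - X %*% b)^2) + lambda * (alpha * sum(abs(b)) + (1 - alpha) * sum(b^2))
# e.g., with the toy x and y from the gradient-descent sketch earlier:
# cost_lasso(b = 2, X = matrix(x), y = y, lambda = 0.1)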

Holdout vs. Cross-Validation
Validation does not tell you which model to choose; it gives you a generalizability estimate assuming consistent data generation.
Holdout Validation
  Randomly select some subset of the data to use for model training (fitting) and the rest for testing (predicting).
  Accuracy is determined by comparing trained predictions and test values.
k-fold Cross-Validation
  Randomly split the dataset into k subsets (folds), which will be used as both training and test datasets.
  Each fold is used once as a test set, with all other folds as the training set.
  Accuracy is determined by comparing predictions and observed values in each held-out fold, then averaging across folds.
  2-fold cross-validation is not the same as holdout validation. Why?
  N-fold cross-validation (one fold per case) is also called leave-one-out cross-validation.
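A short sketch of both approaches in caret (the built-in mtcars data is used only for illustration; createDataPartition and postResample are caret functions):

library(caret)
set.seed(42)

# Holdout: 75% of rows for training, the remainder held out for testing
in_train <- createDataPartition(mtcars$mpg, p = 0.75, list = FALSE)
training <- mtcars[in_train, ]
testing  <- mtcars[-in_train, ]

fit <- train(mpg ~ ., data = training, method = "lm",
             trControl = trainControl(method = "none"))   # no resampling: plain holdout
postResample(predict(fit, testing), testing$mpg)           # RMSE / R-squared on the holdout set

# 10-fold CV: the same train/test idea, repeated across folds inside train()
cv_fit <- train(mpg ~ ., data = mtcars, method = "lm",
                trControl = trainControl(method = "cv", number = 10))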

Quantifying Accuracy
Regression
  R² is easy and universal; just ask for model output (or plot it).
Classification
  Confusion matrices: like Type I and Type II errors at the case level
    confusionMatrix(predicted, true)
    Accuracy: proportion of correct predictions
    Sensitivity: proportion of actual positives correctly predicted positive
    Specificity: proportion of actual negatives correctly predicted negative
  Receiver operating characteristic (ROC) curve
    colAUC(predicted, actual, plotROC = TRUE) from caTools
    Area under the curve (AUC) ranges from 0 to 1; it is the area under the plot of sensitivity against 1 - specificity.
caret will generally select the best-performing hyperparameters for you, but you should know where they came from and what they do.
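A hedged end-to-end sketch of these accuracy measures (a two-class subset of the built-in iris data is used purely as an example; for brevity the metrics are computed on the training data, whereas in practice you would use held-out data):

library(caret)
library(caTools)
set.seed(42)

iris2 <- droplevels(subset(iris, Species != "setosa"))   # two-class problem
fit <- train(Species ~ ., data = iris2, method = "glmnet", metric = "ROC",
             trControl = trainControl(method = "cv", number = 10,
                                      summaryFunction = twoClassSummary, classProbs = TRUE))

pred_class <- predict(fit, iris2)                                 # predicted classes
pred_prob  <- predict(fit, iris2, type = "prob")[, "virginica"]   # predicted probabilities
confusionMatrix(pred_class, iris2$Species)                        # accuracy, sensitivity, specificity
colAUC(pred_prob, iris2$Species, plotROC = TRUE)                  # ROC curve and AUC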

Comparing Models
If using k-fold cross-validation, keep your fold composition the same across models via trControl.
  Inside a single trainControl() definition that you run one time: index = createFolds(outcomeVar, k = 10)
Use resamples() to compare output directly:
  summary(resamples(list(model1, model2)))
Use plots to look at differences in AUC across models:
  dotplot(resamples(list(model1, model2)), metric = "ROC")
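A minimal sketch of that comparison workflow (the built-in mtcars data is used for illustration; since these are regression models, RMSE rather than ROC is compared, and returnTrain = TRUE makes each list element a training-set index as trainControl's index argument expects):

library(caret)
set.seed(42)

# Shared folds so both models are fit and evaluated on identical resamples
folds <- createFolds(mtcars$mpg, k = 10, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

model_lm     <- train(mpg ~ ., data = mtcars, method = "lm",     trControl = ctrl)
model_glmnet <- train(mpg ~ ., data = mtcars, method = "glmnet", trControl = ctrl)

comparison <- resamples(list(lm = model_lm, glmnet = model_glmnet))
summary(comparison)                     # RMSE / R-squared across the shared folds
dotplot(comparison, metric = "RMSE")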

Want to Practice Building Predictive Models?
Many datasets to practice on here: http://archive.ics.uci.edu/ml/index.php
Enter competitions that test your ability to build high-quality predictive models here: http://www.kaggle.com
Note that these competitions do not test your ability to use such models in a production environment, i.e., the real world.