Practical Advice for Building Machine Learning Applications


Machine Learning, Fall 2017
Based on lectures and papers by Andrew Ng, Pedro Domingos, Tom Mitchell and others

This lecture: ML and the world
- Bias vs Variance
- Diagnostics of your learning algorithm
- Error analysis
- Injecting machine learning into Your Favorite Task
Making ML work in the world
- Mostly experiential advice
- Also based on what other people have said
- See readings on class website

Bias and variance
Every learning algorithm requires assumptions about the hypothesis space.
Eg: "My hypothesis space is ..."
- linear
- decision trees with 5 nodes
- a deep neural network with 12 layers
Bias is the true error (loss) of the best predictor in the hypothesis set.
What will the bias be if the hypothesis set cannot represent the target function? (High or low?)
- Bias will be non-zero, possibly high
Underfitting: when bias is high

Bias and variance
The performance of a classifier depends on the specific training set we have.
- Perhaps the model will change if we slightly change the training set
Variance: describes how much the best classifier depends on a specific training set
Overfitting: high variance
Variance
- Increases when the classifiers become more complex
- Decreases with larger training sets

Bias variance tradeoff
Error = bias + variance (+ noise)
- High bias ⇒ both training and test error can be high
  Arises when the classifier cannot represent the data
- High variance ⇒ training error can be low, but test error will be high
  Arises when the learner overfits the training set
The bias variance tradeoff has been studied extensively in the context of regression
- Generalized to classification (Pedro Domingos, 2000)
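For squared loss in regression, the setting where the tradeoff was first studied, the decomposition can be written out term by term. This is one common formalization, not necessarily the one used in the lecture: the symbols are assumptions introduced here, with f the target function, y = f(x) + ε an observed label with zero-mean noise of variance σ², and \hat{f}_D the predictor learned from a training set D. Note that under squared loss the bias term appears squared.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
= \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
```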

Managing bias and variance
Ensemble methods can reduce both bias and variance
- Multiple classifiers are combined
- Eg: Bagging, boosting
Decision trees of a fixed depth
- Increasing depth decreases bias, increases variance
SVMs
- Stronger regularization increases bias, decreases variance
- Higher degree polynomial kernels decrease bias, increase variance
K nearest neighbors
- Increasing k generally increases bias, reduces variance
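The k nearest neighbors claim can be checked with a small simulation. The sketch below is an illustration under assumptions that are not in the lecture: a made-up 1-D regression task (a noisy sine curve), a single query point x0 = 0.25, and hypothetical helper names. It retrains a k-NN regressor on many fresh training sets and estimates squared bias and variance of the prediction at x0.

```python
import math
import random
import statistics


def true_fn(x):
    # Hypothetical target function, chosen only for this illustration.
    return math.sin(2 * math.pi * x)


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def bias_variance_at(x0, k, n_train=30, n_repeats=200, noise=0.3, seed=0):
    """Estimate squared bias and variance of k-NN at x0 over many training sets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        train = []
        for _ in range(n_train):
            x = rng.random()
            train.append((x, true_fn(x) + rng.gauss(0, noise)))
        preds.append(knn_predict(train, x0, k))
    mean_pred = statistics.mean(preds)
    bias_sq = (mean_pred - true_fn(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


results = {k: bias_variance_at(0.25, k) for k in (1, 25)}
for k, (bias_sq, variance) in results.items():
    print(f"k={k:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

With k = 1 the prediction follows a single noisy neighbor, so it varies a lot between training sets; with k = 25 (out of 30 points) it averages over most of the curve, which smooths out the noise but pulls the prediction away from the true value at x0.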

Debugging machine learning
Suppose you train an SVM or a logistic regression classifier for spam detection.
You obviously follow best practices for finding hyper-parameters (such as cross-validation).
Your classifier is only 75% accurate.
What can you do to improve it?

Different ways to improve your model
More training data
Features
  1. Use more features
  2. Use fewer features
  3. Use other features
Better training
  1. Run for more iterations
  2. Use a different algorithm
  3. Use a different classifier
  4. Play with regularization
Tedious! And prone to errors, dependence on luck.
Let us try to make this process more methodical.

First, diagnostics
Easier to fix a problem if you know where it is.
Some possible problems:
  1. Over-fitting (high variance)
  2. Under-fitting (high bias)
  3. Your learning does not converge
  4. Your loss function is not good enough
  5. Are you measuring the right thing?

Detecting over- or under-fitting
Over-fitting: the training accuracy is much higher than the test accuracy
- The model explains the training set very well, but generalizes poorly
Under-fitting: both accuracies are unacceptably low
- The model cannot represent the concept well enough
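The two checks above can be turned into a rough rule of thumb. This is only a sketch: the function name and both thresholds (`target_acc`, `max_gap`) are illustrative assumptions, and what counts as "much higher" or "unacceptably low" depends on the task.

```python
def diagnose(train_acc, test_acc, target_acc=0.90, max_gap=0.05):
    """Rough diagnosis from train/test accuracy; thresholds are assumptions."""
    if train_acc - test_acc > max_gap:
        # Training accuracy is much higher than test accuracy.
        return "over-fitting (high variance)"
    if train_acc < target_acc and test_acc < target_acc:
        # Both accuracies are below what we consider acceptable.
        return "under-fitting (high bias)"
    return "no obvious over- or under-fitting"


print(diagnose(0.99, 0.75))
print(diagnose(0.76, 0.75))
```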

Detecting high variance using learning curves
- Test error keeps decreasing as the training set increases ⇒ more data will help
- Large gap between train and test error
- Typically seen for more complex models
[Figure: learning curves; x-axis: size of training data, y-axis: error; curves: generalization error/test error and training error]

Detecting high bias using learning curves
- Both train and test error are unacceptable (but the model seems to converge)
- Typically seen for simpler models
[Figure: learning curves; x-axis: size of training set, y-axis: error; curves: generalization error/test error and training error]
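A learning curve is computed by training on increasingly large prefixes of the training data and recording both errors each time. The sketch below assumes a made-up 1-D task with 10% label noise and uses a 1-nearest-neighbor classifier as the learner; neither comes from the lecture, and all function names are hypothetical. Because 1-NN memorizes its training set, its training error stays at zero while the test error does not, which gives the gap described for high-variance models.

```python
import random


def make_data(n, rng, label_noise=0.1):
    """Hypothetical task: label is 1 when x > 0.5, flipped with some probability."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < label_noise:
            y = 1 - y
        data.append((x, y))
    return data


def one_nn_predict(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def error_rate(train, data):
    """Fraction of examples in data that 1-NN (fit on train) gets wrong."""
    wrong = sum(one_nn_predict(train, x) != y for x, y in data)
    return wrong / len(data)


def learning_curve(sizes, n_test=500, seed=0):
    """Return (training size, training error, test error) for each size."""
    rng = random.Random(seed)
    test = make_data(n_test, rng)
    pool = make_data(max(sizes), rng)
    curve = []
    for n in sizes:
        train = pool[:n]
        curve.append((n, error_rate(train, train), error_rate(train, test)))
    return curve


curve = learning_curve([10, 50, 200, 500])
for n, train_err, test_err in curve:
    print(f"n={n:4d}  train error={train_err:.2f}  test error={test_err:.2f}")
```

Plotting the two error columns against n gives the curves sketched in the slides; swapping in a different learner only requires replacing `one_nn_predict`.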

Different ways to improve your model
More training data (helps with over-fitting)
Features
  1. Use more features (helps with under-fitting)
  2. Use fewer features (helps with over-fitting)
  3. Use other features (could help with over-fitting and under-fitting)
Better training
  1. Run for more iterations
  2. Use a different algorithm
  3. Use a different classifier
  4. Play with regularization
  (3 and 4 could help with over-fitting and under-fitting)

Diagnostics
Easier to fix a problem if you know where it is.
Some possible problems:
  1. Over-fitting (high variance) ✓
  2. Under-fitting (high bias) ✓
  3. Your learning does not converge
  4. Your loss function is not good enough (if we want to build a classifier, we should aim for the 0-1 loss)
  5. Are you measuring the right thing?

Does your learning algorithm converge?
If learning is framed as an optimization problem, track the objective.
[Figure: objective vs. iterations, annotated "Not yet converged here" and "Converged here"]
Not always easy to decide.
[Figure: objective vs. iterations, annotated "Not yet converged here" and "How about here?"]
Tracking the objective helps to debug
- If we are doing gradient descent on a convex function with a small enough step size, the objective can't increase
- (Caveat: for SGD, the objective will slightly increase occasionally, but not by much)
[Figure: objective vs. iterations, annotated "Something is wrong"]
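Tracking the objective takes only a few lines. The sketch below is a minimal illustration, assuming plain gradient descent on a made-up convex objective (w - 3)^2; the helper names and the two step sizes are assumptions chosen to show one healthy run and one run where the objective rises.

```python
def gradient_descent(grad, objective, w0, step, n_iters):
    """Run gradient descent and record the objective after every update."""
    w = w0
    trace = [objective(w0)]
    for _ in range(n_iters):
        w = w - step * grad(w)
        trace.append(objective(w))
    return w, trace


def first_increase(trace, tol=1e-12):
    """Index of the first iteration where the objective rose, or None."""
    for t in range(1, len(trace)):
        if trace[t] > trace[t - 1] + tol:
            return t
    return None


def objective(w):
    return (w - 3.0) ** 2


def grad(w):
    return 2.0 * (w - 3.0)


_, good = gradient_descent(grad, objective, w0=0.0, step=0.1, n_iters=50)
_, bad = gradient_descent(grad, objective, w0=0.0, step=1.1, n_iters=50)

print(first_increase(good))  # None: the objective never goes up
print(first_increase(bad))   # 1: the step size is too large, something is wrong
```

For SGD the same trace would show small occasional increases, so a check like `first_increase` would need a larger tolerance or a smoothed trace.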

Different ways to improve your model
More training data (helps with over-fitting)
Features
  1. Use more features (helps with under-fitting)
  2. Use fewer features (helps with over-fitting)
  3. Use other features (could help with over-fitting and under-fitting)
Better training
  1. Run for more iterations
  2. Use a different algorithm
  (1 and 2: track the objective for convergence)
  3. Use a different classifier
  4. Play with regularization
  (3 and 4 could help with over-fitting and under-fitting)

What if a different objective is better?
Try out both objectives A and B (eg: SVM and logistic regression)
- Run both to convergence
- Remember that lower is better because we are minimizing
- That is, we hope that the lower objective gives better performance
If the optimum value of A > the optimum value of B, but the generalization error of A < the generalization error of B,
then we know that B does not capture the problem well enough.

What to measure
Accuracy of prediction is the most common measurement.
But if your data set is unbalanced, accuracy may be misleading
- 1000 positive examples, 1 negative example
- A classifier that always predicts positive will get 99.9% accuracy. Has it really learned anything?
Unbalanced labels → measure label-specific precision, recall and F-measure
- Precision for a label: among the examples that are predicted with the label, what fraction are correct
- Recall for a label: among the examples with the given ground truth label, what fraction are predicted correctly
- F-measure: harmonic mean of precision and recall
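The unbalanced example can be worked through directly. In this sketch the function name `label_metrics` and the label strings are assumptions, and precision or recall is reported as 0.0 when its denominator is zero, which is one common convention rather than the only one.

```python
def label_metrics(gold, predicted, label):
    """Precision, recall and F-measure for one label."""
    true_pos = sum(g == label and p == label for g, p in zip(gold, predicted))
    n_predicted = sum(p == label for p in predicted)
    n_gold = sum(g == label for g in gold)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# 1000 positive examples, 1 negative example, classifier always says positive.
gold = ["pos"] * 1000 + ["neg"]
predicted = ["pos"] * 1001

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(f"accuracy = {accuracy:.3f}")
print("pos:", label_metrics(gold, predicted, "pos"))
print("neg:", label_metrics(gold, predicted, "neg"))
```

Accuracy comes out at about 99.9%, yet every metric for the negative label is zero, which is exactly the failure that accuracy alone hides.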

Machine Learning in this class
[Figure: a single box labeled "ML code"]

Machine Learning in context
[Figure from Sculley et al., NIPS 2015]

Error Analysis
Generally, machine learning plays a small role in a larger application
- Pre-processing
- Feature extraction (possibly by other ML based methods)
- Data transformations
How much does each of these contribute to the error?
Error analysis tries to explain why a system is not performing perfectly.

Example: A typical text processing pipeline
Text → Words → Parts-of-speech → Parse trees → An ML-based application
- Each of these could be ML driven
- Or deterministic, but still error prone
How much does each of these contribute to the error of the final application?

Tracking errors in a complex system
Plug in the ground truth for the intermediate components and see how much the accuracy of the final system changes.
  System                            Accuracy
  End-to-end predicted              55%
  With ground truth words           60%
  + ground truth parts-of-speech    84%
  + ground truth parse trees        89%
  + ground truth final output       100%
Error in the part-of-speech component hurts the most.
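The conclusion follows from the accuracy gained each time one more component is replaced by ground truth. A short sketch of that arithmetic, using the numbers from the table (the variable names are assumptions):

```python
# (configuration, end-to-end accuracy in percent), in the order of the table.
stages = [
    ("end-to-end predicted", 55),
    ("ground truth words", 60),
    ("ground truth parts-of-speech", 84),
    ("ground truth parse trees", 89),
    ("ground truth final output", 100),
]

# Gain from each substitution = accuracy with it minus accuracy without it.
gains = {
    name: acc - prev_acc
    for (_, prev_acc), (name, acc) in zip(stages, stages[1:])
}

for name, gain in gains.items():
    print(f"{name:30s} +{gain} points")

worst_component = max(gains, key=gains.get)
print("largest gain:", worst_component)
```

Fixing the part-of-speech stage recovers 24 points, against 5 for words, 5 for parse trees and 11 for the final predictor, so that is where improvement effort pays off most.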

Ablative study
Explains the difference in performance between a strong model and a much weaker one (a baseline)
- Usually seen with features
Suppose we have a collection of features and our system does well, but we don't know which features are giving us the performance.
Evaluate simpler systems that progressively use fewer and fewer features to see which features give the highest boost.
It is not enough to have a classifier that works; it is useful to know why it works.
- Helps interpret predictions, diagnose errors, and can provide an audit trail
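One simple variant of an ablation removes a single feature group at a time and records how much the score drops. The sketch below assumes a hypothetical `evaluate` callback that trains and scores a system on a given set of feature groups; `toy_evaluate` and its numbers are made up purely so the example runs, and are not results from any real system.

```python
def ablation_study(feature_groups, evaluate):
    """Score the full system, then the system with each group removed."""
    all_groups = set(feature_groups)
    full_score = evaluate(all_groups)
    drops = {}
    for group in feature_groups:
        drops[group] = full_score - evaluate(all_groups - {group})
    return full_score, drops


# Made-up stand-in for "train a model on these features and evaluate it".
TOY_CONTRIBUTION = {"words": 0.20, "pos_tags": 0.08, "parse": 0.02}


def toy_evaluate(groups):
    return 0.50 + sum(TOY_CONTRIBUTION[g] for g in groups)


full_score, drops = ablation_study(["words", "pos_tags", "parse"], toy_evaluate)
for group, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"without {group:9s} score drops by {drop:.2f}")
```

In a real study `evaluate` would retrain the model for every configuration, and interacting features may call for removing groups cumulatively rather than one at a time.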

Classifying fish
Say you want to build a classifier that identifies whether a real physical fish is salmon or tuna. How do you go about this?
The slow approach
  1. Carefully identify features, get the best data, the software architecture, maybe design a new learning algorithm
  2. Implement it and hope it works
  Advantage: Perhaps a better approach, maybe even a new learning algorithm. Research.
The hacker's approach
  1. First implement something
  2. Use diagnostics to iteratively make it better
  Advantage: Faster release, will have a solution for your problem quicker
Be wary of premature optimization.
Be equally wary of prematurely committing to a bad path.

What to watch out for
- Do you have the right evaluation metric? And does your loss function reflect it?
- Beware of contamination: ensure that your training data is not contaminated with the test set
  - Learning = generalization to new examples
  - Do not look at your test set either. You may inadvertently contaminate the model
- Beware of contaminating your features with the label! (Be suspicious of perfect predictors)
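The most basic contamination check is to look for test examples that also occur in the training data. The sketch below assumes examples are hashable values such as strings, and the function name is hypothetical; it only catches verbatim duplicates, not near-duplicates or label leakage through features.

```python
def find_contamination(train_examples, test_examples):
    """Return the test examples that also appear verbatim in the training data."""
    seen = set(train_examples)
    return [x for x in test_examples if x in seen]


train = ["cheap pills now", "meeting at noon", "you won a prize"]
test = ["lunch tomorrow?", "you won a prize"]

leaked = find_contamination(train, test)
print(leaked)
```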

What to watch out for
- Be aware of the bias vs. variance tradeoff (or over-fitting vs. under-fitting)
- Be aware that intuitions may not work in high dimensions
  - No proof by picture
  - Curse of dimensionality
- A theoretical guarantee may only be theoretical
  - May make invalid assumptions (eg: if the data is separable)
  - May only be legitimate with infinite data (eg: estimating probabilities)
  - Experiments on real data are equally important

Big data is not enough
- But more data is always better
- Cleaner data is even better
Remember that learning is impossible without some bias that simplifies the search
- Otherwise, no generalization
Learning requires knowledge to guide the learner
- Machine learning is not a magic wand

What knowledge?
- Which model is the right one for this task?
  - Linear models, decision trees, deep neural networks, etc
- Which learning algorithm?
- Does the data violate any crucial assumptions that were used to define the learning algorithm or the model?
  - Does that matter?
- Feature engineering is crucial
Implicitly, these are all claims about the nature of the problem.

Miscellaneous advice
- Learn simpler models first
  - If nothing else, at least they form a baseline that you can improve upon
- Ensembles seem to work better
- Think about whether your problem is learnable at all
  - Learning = generalization

ML and system building
Several recent papers about how ML fits in the context of large software systems

Making machine learning matter
Challenges to the greater ML community
  1. A law passed or legal decision made that relies on the result of an ML analysis
  2. $100M saved through improved decision making provided by an ML system
  3. A conflict between nations averted through high quality translation provided by an ML system
  4. A 50% reduction in cybersecurity break-ins through ML defenses
  5. A human life saved through a diagnosis or intervention recommended by an ML system
  6. Improvement of 10% in one country's Human Development Index attributable to an ML system

A retrospective look at the course

Learning = generalization
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Tom Mitchell (1999)

We saw different models
Or: what kind of a function should a learner learn
- Linear classifiers
- Decision trees
- Non-linear classifiers, feature transformations, neural networks
- Ensembles of classifiers

Different learning protocols
Supervised learning
- A teacher supplies a collection of examples with labels
- The learner has to learn to label new examples using this data
We did not see
- Unsupervised learning
  - No teacher, learner has only unlabeled examples
  - Data mining
- Semi-supervised learning
  - Learner has access to both labeled and unlabeled examples

Learning algorithms
Online algorithms: the learner can access only one labeled example at a time
- Perceptron
Batch algorithms: the learner has access to the entire dataset
- Naïve Bayes
- Support vector machines, logistic regression
- Decision trees and nearest neighbors
- Boosting
- Neural networks

Representing data
What is the best way to represent data for a particular task?
- Features
- Dimensionality reduction (we didn't cover this, but do look at the material if you are interested)

The theory of machine learning
Mathematically defining learning
- Online learning
- Probably Approximately Correct (PAC) Learning
- Bayesian learning

Representation, optimization, evaluation
[Table from Domingos, 2012]

Machine learning is too easy!
- Remarkably diverse collection of ideas
- Yet, in practice many of these approaches work roughly equally well
  - Eg: SVM vs logistic regression vs averaged perceptron

What we did not see
Machine learning is a large and growing area of scientific study.
We did not cover
- Kernel methods
- Unsupervised learning, clustering
- Hidden Markov models
- Multiclass support vector machines
- Topic models
- Structured models
But we saw the foundations of how to think about machine learning.
Several classes that can follow (or are related to) this course:
- Data Mining
- Clustering
- Structured Prediction
- Theory of Machine Learning
- Various applications (NLP, vision, ...)
- Data visualization

This course
Focus on the underlying concepts and algorithmic ideas in the field of machine learning
Not about
- Using a specific machine learning tool
- Any single learning paradigm

What we saw
  1. A broad theoretical and practical understanding of machine learning paradigms and algorithms
  2. Ability to implement learning algorithms
  3. Identify where machine learning can be applied and make the most appropriate decisions (about algorithms, models, supervision, etc)