Bias-Variance Tradeoff

What's learning, revisited
Overfitting
Generative versus Discriminative
Logistic Regression

Machine Learning 10701/15781
Carlos Guestrin
Carnegie Mellon University
September 19th, 2007

Bias-Variance Tradeoff
The choice of hypothesis class introduces a learning bias: a more complex class means less bias, but a more complex class also means more variance.

Training set error
Given a dataset (training data), choose a loss function, e.g., squared error (L2) for regression. The training set error is the loss function evaluated on the training data for a particular set of parameters w:
error_train(w) = (1/N_train) Σ_{j=1..N_train} ( t(x_j) - Σ_i w_i h_i(x_j) )²
[Figure: training set error as a function of model complexity]
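
To make the definition concrete, here is a minimal Python sketch (not from the lecture): it fits a polynomial and computes the training set error under squared loss. The target function t(x), the noise level, and the degree are made-up assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=20)
t = lambda x: np.sin(3 * x)                    # hypothetical target function t(x)
y_train = t(x_train) + rng.normal(0, 0.1, 20)  # noisy training observations

w = np.polyfit(x_train, y_train, deg=3)        # fit polynomial parameters w
y_hat = np.polyval(w, x_train)

# Training set error: average squared loss on the training data
error_train = np.mean((y_train - y_hat) ** 2)
print(f"error_train = {error_train:.4f}")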

Prediction error
The training set error can be a poor measure of the quality of the solution. The prediction error is what we really care about: the error over all possible input points, not just the training data:
error_true(w) = E_x[ ( t(x) - Σ_i w_i h_i(x) )² ] = ∫ ( t(x) - Σ_i w_i h_i(x) )² p(x) dx
[Figure: prediction error as a function of model complexity]

Computing prediction error
Computing the prediction error requires solving a hard integral, and we may not know t(x) for every x. Monte Carlo integration (a sampling approximation) helps: sample a set of i.i.d. points {x_1, ..., x_M} from p(x), and approximate the integral with the sample average:
error_true(w) ≈ (1/M) Σ_{j=1..M} ( t(x_j) - Σ_i w_i h_i(x_j) )²
This sampling approximation of the prediction error and the training error are very similar equations! So why is the training set error a bad measure of prediction error?
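
Before answering that question, here is a minimal sketch of the Monte Carlo estimate, under the assumptions (made up for illustration) that p(x) is uniform on [-1, 1], that t(x) is known so the loss can be evaluated, and that w is some fixed parameter vector:

import numpy as np

rng = np.random.default_rng(1)
t = lambda x: np.sin(3 * x)          # hypothetical true function t(x)
w = np.array([2.5, 0.0, -1.5, 0.0])  # some fixed polynomial coefficients (made up)

# Sample i.i.d. points x_1..x_M from p(x) and average the squared loss over them
M = 100_000
x = rng.uniform(-1, 1, size=M)
error_true_mc = np.mean((t(x) - np.polyval(w, x)) ** 2)
print(f"Monte Carlo estimate of prediction error: {error_true_mc:.4f}")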

Why doesn't training set error approximate prediction error? Because you cheated! Training error is a good estimate of prediction error for a single, fixed w. But you optimized w with respect to the training error, and found a w that is good for this particular set of samples. Training error is therefore an (optimistically) biased estimate of prediction error.

Test set error
Given a dataset, randomly split it into two parts: training data {x_1, ..., x_{N_train}} and test data {x_1, ..., x_{N_test}}. Use the training data to optimize the parameters w. The test set error evaluates the final solution w* on the held-out data:
error_test(w*) = (1/N_test) Σ_{j=1..N_test} ( t(x_j) - Σ_i w*_i h_i(x_j) )²
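
A minimal sketch of this protocol (split sizes and model are illustrative assumptions): optimize w on the training part only, then evaluate on the held-out part.

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + rng.normal(0, 0.1, 200)

# Randomly split the dataset into training and test parts
perm = rng.permutation(len(x))
n_train = 150
train_idx, test_idx = perm[:n_train], perm[n_train:]

w_star = np.polyfit(x[train_idx], y[train_idx], deg=3)  # w optimized on training data only

# Test set error: evaluate the final solution w* on points never used for learning
error_test = np.mean((y[test_idx] - np.polyval(w_star, x[test_idx])) ** 2)
print(f"error_test = {error_test:.4f}")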

[Figure: test set error as a function of model complexity]

Overfitting
A learning algorithm overfits the training data if it outputs a solution w when there exists another solution w′ such that:
error_train(w) < error_train(w′)  and  error_true(w′) < error_true(w)
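
A minimal sketch of this phenomenon (data and degrees are made-up illustrations): as the polynomial degree grows, training error keeps falling while test error, our stand-in for error_true, eventually rises.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(0, 0.2, 60)
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

for deg in (1, 3, 6, 12):
    w = np.polyfit(x_tr, y_tr, deg)
    err_tr = np.mean((y_tr - np.polyval(w, x_tr)) ** 2)
    err_te = np.mean((y_te - np.polyval(w, x_te)) ** 2)
    print(f"degree {deg:2d}: train error {err_tr:.4f}, test error {err_te:.4f}")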

How many points do I use for training/testing?
A very hard question to answer! With too few training points, the learned w is bad; with too few test points, you never know whether you reached a good solution. Bounds such as Hoeffding's inequality can help (more on this later this semester), but the question is still hard to answer. Typically: if you have a reasonable amount of data, pick a test set large enough for a reasonable estimate of the error, and use the rest for learning; if you have little data, then you need to pull out the big guns, e.g., bootstrapping.

Error estimators
The training error error_train(w) and the test error error_test(w) can both be viewed as estimators of the true prediction error error_true(w).
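
As an illustration of the bootstrapping option above, here is a minimal sketch (the number of rounds, the model, and the out-of-bag evaluation scheme are my assumptions, not the lecture's): resample the dataset with replacement, refit, and evaluate each refit model on the points left out of its resample.

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + rng.normal(0, 0.2, 30)

errors = []
for _ in range(200):
    idx = rng.integers(0, len(x), size=len(x))  # resample indices with replacement
    oob = np.setdiff1d(np.arange(len(x)), idx)  # "out-of-bag" points not in the resample
    if oob.size == 0:
        continue
    w = np.polyfit(x[idx], y[idx], deg=3)
    errors.append(np.mean((y[oob] - np.polyval(w, x[oob])) ** 2))

print(f"bootstrap (out-of-bag) error estimate: {np.mean(errors):.4f}")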

[Figure: error as a function of the number of training examples for a fixed model complexity, from little data to infinite data]

Error estimators: be careful! The test error is only unbiased if you never, never, never, never do any, any, any, any learning on the test data. For example, if you use the test set to select the degree of the polynomial, it is no longer unbiased! (We will address this problem later in the semester.)
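
A minimal sketch of this pitfall (all sizes and degrees made up): select the degree that minimizes error on a reused "test" set, then compare that error against error on a large fresh sample that was never touched during learning. The reused test error is typically the more optimistic of the two.

import numpy as np

rng = np.random.default_rng(5)

def make_data(n):
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(3 * x) + rng.normal(0, 0.3, n)

x_tr, y_tr = make_data(30)
x_te, y_te = make_data(20)    # "test" set, reused here to select the degree
x_fr, y_fr = make_data(5000)  # fresh data, never used for any learning decision

best = min(range(1, 13), key=lambda d: np.mean(
    (y_te - np.polyval(np.polyfit(x_tr, y_tr, d), x_te)) ** 2))
w = np.polyfit(x_tr, y_tr, best)
err_te = np.mean((y_te - np.polyval(w, x_te)) ** 2)
err_fr = np.mean((y_fr - np.polyval(w, x_fr)) ** 2)
print(f"degree {best}: reused-test error {err_te:.4f} vs fresh error {err_fr:.4f}")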

Announcements
The first homework is out: a programming part and an analytic part. Remember the collaboration policy: you can discuss questions, but you need to write your own solutions and code. Remember that you are not allowed to look at previous years' solutions, search the web for solutions, use someone else's solutions, etc. Due Oct. 3rd at the beginning of class. Start early! Recitation this week: Bayes optimal classifiers, Naïve Bayes.

What's (supervised) learning, more formally
Given:
a dataset of instances {⟨x_1, t(x_1)⟩, ..., ⟨x_N, t(x_N)⟩}, e.g., ⟨x_i, t(x_i)⟩ = ⟨(GPA=3.9, IQ=120, MLscore=99), 150K⟩;
a hypothesis space H, e.g., polynomials of degree 8;
a loss function that measures the quality of a hypothesis h ∈ H, e.g., squared error for regression.
Obtain:
a learning algorithm that finds an h ∈ H minimizing the loss function, e.g., using matrix operations for regression.
We want to minimize the prediction error, but can only minimize the error on the dataset.
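
As a concrete instance of "matrix operations for regression," here is a minimal sketch (data and basis functions are made-up assumptions) of the least-squares solution via the normal equations H^T H w = H^T t:

import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=50)
t = 2.0 - 3.0 * x + rng.normal(0, 0.1, 50)  # hypothetical targets t(x_i)

H = np.column_stack([np.ones_like(x), x, x**2])  # basis functions h_j(x_i)
w = np.linalg.solve(H.T @ H, H.T @ t)            # w minimizing the squared loss
print("learned w:", w)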

Types of (supervised) learning problems, revisited
Regression, e.g., dataset: ⟨position, temperature⟩; hypothesis space: e.g., polynomials; loss function: e.g., squared error.
Density estimation, e.g., dataset: grades; hypothesis space: e.g., a family of probability distributions; loss function: e.g., (negative) log-likelihood.
Classification, e.g., dataset: ⟨brain image, {verb v. noun}⟩; hypothesis space: e.g., decision boundaries; loss function: e.g., 0/1 classification error.

Learning is (simply) function approximation!
The general (supervised) learning problem: given some data (including features), a hypothesis space, and a loss function, learning is no magic! We are simply trying to find a function that fits the data, whether for regression, density estimation, or classification. (Not surprisingly) seemingly different problems have very similar solutions.

What is NB really optimizing?
The Naïve Bayes assumption: features are independent given the class:
P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
More generally: P(X_1, ..., X_n | Y) = Π_i P(X_i | Y)
The NB classifier: h_NB(x) = argmax_y P(y) Π_i P(x_i | y)

MLE for the parameters of NB
Given a dataset, let Count(A=a, B=b) be the number of examples where A=a and B=b. The MLE for NB is simply:
Prior: P(Y=y) = Count(Y=y) / N
Likelihood: P(X_i = x_i | Y = y) = Count(X_i = x_i, Y = y) / Count(Y = y)
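
A minimal sketch of this counting recipe on a tiny made-up dataset with two binary features X1, X2 and binary labels Y:

import numpy as np

X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1], [0, 0]])
Y = np.array([1, 1, 0, 0, 1, 0])
N = len(Y)

for y in (0, 1):
    print(f"P(Y={y}) = {np.mean(Y == y):.3f}")  # prior: Count(Y=y) / N
    for i in range(X.shape[1]):
        for xi in (0, 1):
            # likelihood: Count(X_i=x_i, Y=y) / Count(Y=y)
            lik = np.sum((X[:, i] == xi) & (Y == y)) / np.sum(Y == y)
            print(f"  P(X{i+1}={xi} | Y={y}) = {lik:.3f}")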

What is NB really optimizing? Let's use an example.

Generative v. Discriminative classifiers: Intuition
We want to learn h: X → Y, where X is the features and Y the target classes. The Bayes optimal classifier is P(Y | X).
A generative classifier, e.g., Naïve Bayes: assume some functional form for P(X | Y) and P(Y), estimate the parameters of P(X | Y) and P(Y) directly from the training data, and use Bayes rule to calculate P(Y | X = x). This is a generative model: it computes P(Y | X) only indirectly, through Bayes rule, but it can generate a sample of the data, since P(X) = Σ_y P(y) P(X | y).
A discriminative classifier, e.g., Logistic Regression: assume some functional form for P(Y | X) and estimate its parameters directly from the training data. This is the discriminative model: it learns P(Y | X) directly, but cannot generate a sample of the data, because P(X) is not available.
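
A minimal sketch of the generative route, with hand-picked (entirely made-up) parameters for a single binary feature X: Bayes rule turns P(X | Y) and P(Y) into P(Y | X), and the same parameters also let us sample data.

import numpy as np

rng = np.random.default_rng(7)
p_y = np.array([0.6, 0.4])           # P(Y=0), P(Y=1)
p_x1_given_y = np.array([0.2, 0.7])  # P(X=1 | Y=0), P(X=1 | Y=1)

# Bayes rule: P(Y=y | X=1) = P(X=1 | y) P(y) / Σ_y' P(X=1 | y') P(y')
joint = p_x1_given_y * p_y
posterior = joint / joint.sum()
print("P(Y | X=1) =", posterior)

# Because the model is generative, we can also sample data: P(X) = Σ_y P(y) P(X | y)
y_samples = rng.choice([0, 1], size=5, p=p_y)
x_samples = (rng.random(5) < p_x1_given_y[y_samples]).astype(int)
print("sampled (x, y) pairs:", list(zip(x_samples, y_samples)))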