Project organization
Project proposals are due March 14 (about 1.5 weeks away). I would like to make sure everyone has a team, so I am adding a new deadline: by TODAY, please go to the link posted on Piazza (https://goo.gl/p5ntxb) and add your team's details to the spreadsheet:
- team members
- tentative project title
- campus(es) where team members are located
- number of team members
- whether you are potentially open to adding more members

Exam details (Wed 3/7/18)
Coverage: HW #1-3, plus lectures through the lecture on the VC bound (Feb 19). The midterm will not cover lecture material after Feb 19. The following are not on the exam: Regression, Tikhonov Regularization, Bias and Variance of Regression Function Sets, LASSO, etc.
- A single sheet of notes (front and back) is allowed
- 75 minute time limit (3:00 PM - 4:15 PM)
- No calculators allowed
- Sample questions are posted

Approximation-generalization tradeoff
Given a hypothesis set, we want to find a function that minimizes the out-of-sample error.
- A more complex hypothesis set gives a better chance of approximating the ideal classifier/function.
- A less complex hypothesis set gives a better chance of generalizing to new data (out of sample).
We must carefully limit complexity to avoid overfitting. (Figure: error versus complexity of the hypothesis set; the in-sample error decreases with complexity while the out-of-sample error eventually rises, and the gap between them is the generalization error.)
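To make the tradeoff concrete, here is a minimal sketch (my own, not from the slides) that fits polynomials of increasing degree to synthetic data; the data, degrees, and split are arbitrary choices. As the degree grows, the in-sample error keeps falling while the out-of-sample error eventually rises.

# Approximation-generalization tradeoff on synthetic data (illustrative sketch only).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * x[:, 0]) + 0.3 * rng.standard_normal(60)   # noisy target
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

for degree in [1, 3, 5, 10, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    e_in = mean_squared_error(y_train, model.predict(x_train))
    e_out = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:2d}: E_in = {e_in:.3f}, E_out = {e_out:.3f}")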

Approximation-generalization tradeoff / Learning curves
(Figures: the error-versus-complexity tradeoff annotated with bias and variance, and learning curves for a simple model and a complex model. Each learning curve plots expected error against the number of data points, with the out-of-sample error decreasing and the in-sample error increasing toward a common level set by the bias; the gap between the curves reflects the variance. The complex model reaches a lower level but needs more data points to get there.)

Bias-variance decomposition: what is it good for?
Practically, it is impossible to compute the bias and variance exactly, but we can estimate them empirically (sketched below):
- split the data into training and test sets
- split the training data into many different subsets and estimate a classifier/regressor on each
- compute the bias and variance using the resulting models and the test set
In reality, just like with the VC bound, the decomposition is more useful as a conceptual tool than as a practical technique.
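The empirical recipe above can be sketched in a few lines. This is my own illustration, not the course code: the model, data, and number of rounds are arbitrary, and since we only have noisy test labels the "bias" term here also absorbs the label noise.

# Rough empirical bias/variance estimate: refit on many random training subsets,
# then measure how far the average prediction is from the test labels (bias^2 + noise)
# and how much individual predictions scatter around that average (variance).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(3 * x[:, 0]) + 0.2 * rng.standard_normal(500)
x_train, y_train = x[:400], y[:400]
x_test, y_test = x[400:], y[400:]

n_rounds, subset_size = 100, 100
preds = np.zeros((n_rounds, len(x_test)))
for b in range(n_rounds):
    idx = rng.choice(len(x_train), size=subset_size, replace=False)
    model = DecisionTreeRegressor(max_depth=4).fit(x_train[idx], y_train[idx])
    preds[b] = model.predict(x_test)

avg_pred = preds.mean(axis=0)
bias_sq = np.mean((avg_pred - y_test) ** 2)   # approximate squared bias (plus noise)
variance = np.mean(preds.var(axis=0))         # average variance across test points
print(f"bias^2 (plus noise) ~ {bias_sq:.3f}, variance ~ {variance:.3f}")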

Developing a good learning model
The bias-variance decomposition gives us a useful way to think about how to develop improved learning models.
Reduce variance (without significantly increasing the bias):
- limit model complexity (e.g., the polynomial order in regression)
- regularization; this can be counterintuitive (e.g., Stein's paradox)
- typically can be done through general-purpose techniques
Reduce bias (without significantly increasing the variance):
- exploit prior information to steer the model in the correct direction
- typically application specific

Example
Least squares is an unbiased estimator, but it can have high variance. Tikhonov regularization deliberately introduces bias into the estimator (shrinking it towards the origin). The slight increase in bias can buy us a huge decrease in the variance, especially when some variables are highly correlated. The trick is figuring out just how much bias to introduce. (A small numerical sketch follows below.)

Model selection
In statistical learning, a model is a mathematical representation of a function such as a classifier, a regression function, or a density. In many cases, we have one (or more) free parameters that are not automatically determined by the learning algorithm, and the values chosen for these free parameters often have a significant impact on the algorithm's output. The problem of selecting values for these free parameters is called model selection.

Examples
Method                     Free parameter
polynomial regression      polynomial degree
ridge regression / LASSO   regularization parameter
robust regression          loss function parameter
SVMs                       regularization parameter (margin violation cost)
kernel methods             kernel choice / parameters
regularized LR             regularization parameter
k-nearest neighbors        number of neighbors
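The Tikhonov example above can be seen numerically with a quick sketch (mine, not the instructor's); the near-collinear features, noise level, and ridge parameter alpha=1.0 are arbitrary choices. With highly correlated features, ordinary least squares coefficients swing wildly from one data set to the next, while ridge trades a little bias for a much smaller spread.

# Coefficient variability of OLS vs ridge under near-collinearity (illustrative sketch).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
true_w = np.array([1.0, 1.0])

def fit_many(model_factory, n_trials=200, n=30):
    """Refit the model on fresh correlated data; return coefficient spread and mean."""
    coefs = []
    for _ in range(n_trials):
        x1 = rng.standard_normal(n)
        x2 = x1 + 0.05 * rng.standard_normal(n)      # nearly collinear with x1
        X = np.column_stack([x1, x2])
        y = X @ true_w + 0.5 * rng.standard_normal(n)
        coefs.append(model_factory().fit(X, y).coef_)
    return np.std(coefs, axis=0), np.mean(coefs, axis=0)

for name, factory in [("OLS", LinearRegression), ("ridge", lambda: Ridge(alpha=1.0))]:
    std, mean = fit_many(factory)
    print(f"{name:5s}: mean coef = {mean.round(2)}, std of coef = {std.round(2)}")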

Model selection dilemma
We need to select appropriate values for the free parameters, and all we have is the training data, so we must use the training data to select them. However, these free parameters usually control the balance between underfitting and overfitting. They were left free precisely because we don't want to let the training data influence their selection, as this almost always leads to overfitting. For example, if we let the training data determine the degree in polynomial regression, we will just end up choosing the maximum degree and doing interpolation.

Big picture
For much of this class, we have focused on trying to understand learning via decompositions of the form "out-of-sample error = in-sample error + complexity penalty," with the penalty quantified via the VC dimension or controlled via regularization. Validation takes another approach: after we have selected a hypothesis, why not just try (a little harder) to estimate the out-of-sample error directly?

Validation
Suppose that in addition to our training data, we also have a validation set of size K. Use the validation set to form an estimate E_val of the out-of-sample error E_out.

Accuracy of validation
What can we say about the accuracy of E_val? In the case of classification, each pointwise error is just a Bernoulli random variable, so Hoeffding's inequality applies:
P(|E_val(g) - E_out(g)| > ε) <= 2 exp(-2 ε^2 K)

Examples
Classification: pointwise error e(g(x), y) = 1 if g(x) differs from y, and 0 otherwise.
Regression: pointwise error e(g(x), y) = (g(x) - y)^2.
More generally, whatever the pointwise error, the variance of E_val shrinks like 1/K as the validation set grows.
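As a quick sanity check on the Hoeffding bound above, the following sketch (my own) plugs in a few validation-set sizes; the particular epsilon and delta values are arbitrary.

# Numbers for the bound P(|E_val - E_out| > eps) <= 2 exp(-2 eps^2 K).
import math

def hoeffding_bound(K, eps):
    """Upper bound on the probability that E_val is off from E_out by more than eps."""
    return 2 * math.exp(-2 * eps**2 * K)

def required_K(eps, delta):
    """Smallest K guaranteeing P(|E_val - E_out| > eps) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

for K in [25, 100, 400, 1600]:
    print(f"K={K:5d}: P(estimate off by more than 0.05) <= {hoeffding_bound(K, 0.05):.3f}")

print("K needed for eps=0.05, delta=0.05:", required_K(0.05, 0.05))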

Accuracy of validation
In either case, this shows us that E_val(g) = E_out(g) up to an error of order 1/sqrt(K). Thus, we can get as accurate an estimate of E_out as we like using a validation set, as long as K is large enough.

Validation vs training
We are given a single data set of N points, so where is this validation set coming from? We split the data into a training set of N - K points and a validation (holdout) set of K points. Remember, the hypothesis is ultimately something we learned from the training data alone. The validation error is then:
- K small: a bad estimate
- K large: an accurate estimate, but of what? The hypothesis was trained on only N - K points.
(Figure: learning curve of expected error versus number of data points, showing the out-of-sample and in-sample errors.)

Can we have our cake and eat it too?
After we've used our validation set (of size K) to estimate the error, re-train on the whole data set of N points.
- A large K lets us say: "We are very confident that we have selected a terrible hypothesis."
- A small K gives a bad estimate of the out-of-sample error, but the hypothesis it evaluates is close to the one trained on all the data.
- A large K gives a good estimate, but of a hypothesis trained on few points.
Rule of thumb: set K = N/5.
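A minimal sketch (not the instructor's code) of this holdout recipe, with made-up data and an arbitrary model: hold out a fifth of the data, record the validation error, then re-train on everything.

# Holdout validation, then re-training on the full data set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.standard_normal(500) > 0).astype(int)

# Hold out K = N/5 points for validation (the rule of thumb mentioned above).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
e_val = 1.0 - model.score(X_val, y_val)          # estimate of the out-of-sample error
print(f"validation error estimate: {e_val:.3f}")

# Having recorded the estimate, re-train on all N points for the final hypothesis.
final_model = LogisticRegression().fit(X, y)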

Validation vs testing
We call this validation, but how is it any different from simply testing? Typically, the validation estimate is used to make learning choices. If an estimate of the error affects learning, i.e., it impacts which hypothesis we choose, then it is no longer a test set; it becomes a validation set. What's the difference? A test set is unbiased, whereas a validation set will have an (overly) optimistic bias (remember the coin-tossing experiments?).

Example
Suppose we have two hypotheses with the same out-of-sample error, and suppose that our error estimates for them, denoted e1 and e2, are independent and equally likely to fall above or below that true error. We pick the hypothesis that minimizes the estimated error. It is easy to argue that the expected value of min(e1, e2) is below the true error. Why? 75% of the time at least one of the two estimates falls below the true error, so the estimate attached to the selected hypothesis carries an optimistic bias.

Using validation for model selection
Suppose we have M models. We select among them using the validation set: train each model on the training set (N - K points), evaluate each resulting hypothesis on the validation set (K points), and pick the best.

The bias
The validation error of the selected model is a biased (optimistic) estimate of its out-of-sample error. (Figure: expected error versus validation set size K for the selected model.)
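Here is a hedged sketch (my own) of model selection with a validation set: each candidate is trained on the training split and scored on the validation split, and the best one is kept. The candidate family (polynomial degrees with a small ridge penalty) is an arbitrary stand-in for the table of methods above.

# Model selection by validation error over a small set of candidates.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * x[:, 0]) + 0.2 * rng.standard_normal(200)

x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.2, random_state=0)

candidates = {f"degree {d}": make_pipeline(PolynomialFeatures(d), Ridge(alpha=0.1))
              for d in [1, 3, 5, 9]}

val_errors = {}
for name, model in candidates.items():
    model.fit(x_tr, y_tr)
    val_errors[name] = mean_squared_error(y_val, model.predict(x_val))

best = min(val_errors, key=val_errors.get)
print(val_errors)
print("selected model:", best)   # note: its validation error is optimistically biased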

We've seen this before: quantifying the bias
For M models, we use a data set of size K to pick the model that does best out of the M candidates. Back to Hoeffding! A union bound over the M finalists gives P(|E_val - E_out| > ε) <= 2M exp(-2 ε^2 K) for the selected model. Or, if the models correspond to a few continuous parameters, we can use the VC approach to argue a similar bound.

Data contamination
We have now discussed three different kinds of estimates of the risk (out-of-sample error). These three estimates have different degrees of contamination, which manifests itself as a (deceptively) optimistic bias:
- Training set: totally contaminated
- Testing set: totally clean (requires strict discipline)
- Validation set: slightly contaminated
We will return in a bit to the issue of data contamination.

Validation dilemma
Back to our core dilemma in validation: we would like the validation error to track the out-of-sample error of the hypothesis we ultimately use. One approximation requires the validation set size K to be small (so the hypothesis we validate is close to the one we keep), the other requires K to be large (so the estimate is accurate), and we need both to be small simultaneously. Can we do this? Yes!

Leave one out
We need K to be small, so let's set K = 1! Select a hypothesis using the data set with a single point held out, and record the validation error on that one point. We set K too small, so on its own this is a terrible estimate. But we can repeat this for all possible choices of the held-out point and average! The result is called the leave-one-out cross validation error.
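A short sketch (mine, not from the slides) of the leave-one-out procedure using scikit-learn's LeaveOneOut splitter; the ridge model and data are placeholders.

# Leave-one-out cross validation: hold out each point in turn, train on the rest, average.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(40)

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = Ridge(alpha=0.1).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append(mean_squared_error(y[test_idx], pred))   # error on the single held-out point

e_cv = np.mean(errors)   # leave-one-out cross validation error
print(f"E_cv = {e_cv:.3f}")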

Example: fitting a line to 3 data points
(Figure: leave-one-out cross validation illustrated by fitting a model to 3 data points, holding out each point in turn.)

Leave more out
Leave-one-out: train N times, on N - 1 points each time.
k-fold cross validation: train k times, on N - N/k points each time. Example: with 5 folds, each block of the data serves once as the validation set while the rest is used for training; iterate over all 5 choices of validation set and average. Common choices are k = 5 or k = 10. (Note: on this slide, k is the number of folds; the size of each validation fold is N/k.)

Remarks
For k-fold cross validation, the estimate depends on the particular choice of partition. It is common to form several estimates based on different random partitions and then average them. When using k-fold cross validation for classification, you should ensure that each of the k sets contains training data from each class in the same proportion as in the full data set: this is stratified cross validation. Scikit-learn can do all of this for you for any of the built-in learning methods (a short sketch appears below).

The bootstrap
What else can you do when your training set is really small? You really need as much training data as possible to get reasonable results. Fix B. For b = 1, ..., B, let D_b be a subset of size N obtained by sampling with replacement from the full data set.
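Before continuing with the bootstrap, here is the cross validation sketch promised above (my own code, with placeholder data and classifier), using scikit-learn's StratifiedKFold and cross_val_score, including averaging over several random partitions.

# Stratified k-fold cross validation, plus averaging over several random partitions.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(300) > 0.8).astype(int)  # uneven classes

model = LogisticRegression()

# Stratified 5-fold CV: each fold keeps the class proportions of the full data set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print("per-fold accuracy:", scores.round(3))
print("cross-validation error estimate:", (1 - scores.mean()).round(3))

# Averaging over several random partitions reduces dependence on any single split.
repeated = [
    1 - cross_val_score(model, X, y,
                        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=s)).mean()
    for s in range(10)
]
print("error averaged over 10 random partitions:", np.mean(repeated).round(3))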

The bootstrap error estimate
Define g_b to be the model learned from the data set D_b. The bootstrap error estimate is then formed by averaging, over the B rounds, the error of g_b on the points that do not appear in D_b.

Bootstrap in practice
Typically, B must be large for the estimate to be accurate, so the bootstrap can be rather computationally demanding. The bootstrap error estimate tends to be pessimistic, so it is common to combine the training and bootstrap error estimates; a common choice is the 0.632 bootstrap estimate, E_0.632 = 0.632 E_boot + 0.368 E_train (a sketch follows below). The balanced bootstrap chooses the D_b such that each input-output pair appears exactly B times across all of the bootstrap samples. The bootstrap can be used to estimate confidence intervals of basically anything.

Data snooping
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised. This is by far the most common trap that people fall into in practice. It leads to serious overfitting, it can be very subtle, and there are many ways to slip up.

Example
Suppose we plan to use an SVM with a quadratic kernel on our data set. What is the VC dimension of the hypothesis set in this case?
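Returning to the bootstrap, the following sketch (my own, not the course code) computes the bootstrap error on the points left out of each resample and combines it with the training error using the 0.632 weights mentioned above; the data, model, and B are arbitrary.

# Bootstrap error estimate and the 0.632 combination, using sklearn.utils.resample.
import numpy as np
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 3))
y = (X[:, 0] - X[:, 1] + 0.4 * rng.standard_normal(80) > 0).astype(int)

B = 200
oob_errors = []
for b in range(B):
    idx = resample(np.arange(len(X)), replace=True, n_samples=len(X), random_state=b)
    oob = np.setdiff1d(np.arange(len(X)), idx)          # points left out of this resample
    if len(oob) == 0:
        continue
    model = LogisticRegression().fit(X[idx], y[idx])
    oob_errors.append(1 - model.score(X[oob], y[oob]))  # error on the left-out points

e_boot = np.mean(oob_errors)                            # pessimistic bootstrap estimate
e_train = 1 - LogisticRegression().fit(X, y).score(X, y)
e_632 = 0.632 * e_boot + 0.368 * e_train                # the 0.632 bootstrap estimate
print(f"E_boot = {e_boot:.3f}, E_train = {e_train:.3f}, E_0.632 = {e_632:.3f}")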

Reuse of the data set
If you try one model after another on the same data set, you will eventually succeed: "If you torture the data long enough, it will confess." You need to think about the VC dimension/complexity of the total learning model, which may include models you only considered in your mind, and may include models tried by others!
Remedies:
- Avoid data snooping (strict discipline)
- Test on new data that no one has seen before
- Account for data snooping

Puzzle: Time-series forecasting
Suppose we wish to predict whether the price of a stock is going to go up or down tomorrow.
- Take the price history over a long period of time
- Normalize the time series to zero mean, unit variance
- Form all possible input-output pairs, with input = the previous 20 days of stock prices and output = the price movement on the 21st day
- Randomly split the data into training and testing data
- Train on the training data only, test on the testing data only
Based on the test data, it looks like we can consistently predict the price movement direction with accuracy of roughly 52%. Are we going to be rich?
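One concrete way snooping can creep into a pipeline like the one in the puzzle is preprocessing: statistics computed on the full data set (such as the normalization step) let test-set information leak into training. The sketch below (my own, with synthetic noise data, so the reported numbers themselves stay near chance) contrasts the snooped ordering with the disciplined one.

# Preprocessing-based data snooping: scale-then-split vs split-then-scale.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))              # stand-in for 20-day price windows
y = rng.integers(0, 2, size=1000)                # pure noise: nothing is predictable

# Snooped pipeline: scale using the full data set, then split.
X_scaled = StandardScaler().fit_transform(X)
Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(X_scaled, y, test_size=0.3, random_state=0)
snooped = LogisticRegression().fit(Xa_tr, ya_tr).score(Xa_te, ya_te)

# Clean pipeline: split first, fit the scaler on the training data only.
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(Xb_tr)
clean = LogisticRegression().fit(scaler.transform(Xb_tr), yb_tr).score(
    scaler.transform(Xb_te), yb_te)

print(f"snooped accuracy: {snooped:.3f}, clean accuracy: {clean:.3f}")
# With pure-noise labels both hover near 0.5; the point is the pipeline structure.
# With real, autocorrelated price data, the snooped ordering can look deceptively
# better than chance even when there is nothing to learn.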