Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics
Ali Harakeh, WAVE Lab, University of Waterloo
ali.harakeh@uwaterloo.ca
May 1, 2017

Overview

1. Learning Algorithms
2. Capacity, Overfitting, and Underfitting
3. Hyperparameters and Validation Sets
4. Estimators, Bias and Variance
5. ML and MAP Estimators
6. Gradient Based Optimization
7. Challenges That Motivate Deep Learning

Section 1: Learning Algorithms

A machine learning algorithm is an algorithm that is able to learn from data. A machine is said to have learned from experience E with respect to some task T, as measured by a performance measure P, if its performance on T, as measured by P, improves with E.

The Task T

Example T: vehicle detection in Lidar data. Approach 1: hard-code what a vehicle is in Lidar data based on human experience. Approach 2: learn what a vehicle is in Lidar data. Machine learning allows us to tackle tasks that are too difficult to be hard-coded by humans.

The Task T

Machine learning algorithms are usually described in terms of how the algorithm should process an example x ∈ R^n. Each entry x_j of x is called a feature. Example: the features of an image can be its pixel values.

Common Machine Learning Tasks

Classification: find f: R^n → {1, ..., k} that maps examples x to one of k classes. Regression: find f: R^n → R that maps examples to the real line.
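As a concrete sketch, the two task types differ only in their output space. The linear forms below are purely illustrative assumptions, not part of the lecture:

```python
import numpy as np

# Hypothetical linear examples of the two task types (illustrative only).
def classify(x: np.ndarray, W: np.ndarray) -> int:
    """f: R^n -> {0, ..., k-1}: pick the class whose score W[c] @ x is largest."""
    return int(np.argmax(W @ x))

def regress(x: np.ndarray, w: np.ndarray) -> float:
    """f: R^n -> R: map the example to a real number."""
    return float(w @ x)

x = np.array([1.0, 2.0])                              # n = 2 features
W = np.array([[0.5, -1.0], [1.0, 1.0], [-0.5, 0.2]])  # k = 3 classes
w = np.array([2.0, -1.0])
print(classify(x, W))  # 1 (the class with the largest score)
print(regress(x, w))   # 0.0
```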

The Performance Measure P

A quantitative measure of performance is required in order to evaluate a machine's ability to learn. P depends on the task T. Classification: P is usually the accuracy of the model. An equivalent measure is the error rate (also called the expected 0-1 loss), which equals one minus the accuracy.
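A minimal check of the two measures, with made-up labels:

```python
import numpy as np

# Accuracy and error rate (expected 0-1 loss) on toy labels.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

accuracy = float(np.mean(y_pred == y_true))
error_rate = float(np.mean(y_pred != y_true))  # the expected 0-1 loss
print(accuracy, error_rate)  # 0.8 0.2, and they always sum to 1
```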

The Experience E

Machine learning algorithms can be classified into two classes, supervised and unsupervised, based on what kind of experience they are allowed to have during the learning process. Machine learning algorithms are usually allowed to experience an entire dataset.

Categorizing Algorithms Based on E

Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.

Dataset Splits

We usually split our dataset into three subsets: train, val, and test. E usually consists of experiencing the train and val sets. P is usually evaluated on the test set.

Section 2: Capacity, Overfitting, and Underfitting

The main challenge in machine learning is that the algorithm must perform well on new, unseen input data. This ability is called generalization. We usually have access to the training set, and we try to minimize some error measure on it called the training error. This is standard optimization. What differentiates machine learning from standard optimization is that we also care about minimizing the generalization error, the error evaluated on the test set.

The Data Generating Distribution p_data

Is minimizing the training set error guaranteed to provide parameters that minimize the test set error? Under the i.i.d. assumption on train and test examples, the answer is yes in expectation: the expected training error of a fixed model equals its expected test error.

The factors that determine how well a machine learning algorithm performs are its ability to:
- Make the training error small.
- Make the gap between training and test error small.

Overfitting, Underfitting, and Capacity

Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large. Capacity is a model's ability to fit a wide variety of functions.

Overfitting, Underfitting, and Capacity

There is a direct relation between a model's capacity and whether it will overfit or underfit. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.

Controlling Capacity: The Hypothesis Space

Hypothesis space: the set of functions that the learning algorithm is allowed to select as being the solution. We can increase a model's capacity by expanding its hypothesis space.


Controlling Capacity: The Hypothesis Space

From statistical learning theory: the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases (Vapnik and Chervonenkis, 1971). This is an intellectual justification that machine learning algorithms can work! Note: while simpler functions are more likely to generalize (to have a small gap between training and test error), we must still choose a sufficiently complex hypothesis to achieve low training error.


Bayes Error

The ideal model is an oracle that simply knows the true probability distribution that generates the data. The error incurred by an oracle making predictions from the true distribution p(x, y) is called the Bayes error. It can be nonzero because, in supervised learning, the mapping from x to y may be inherently stochastic, or y may be a deterministic function that involves variables other than those included in x.

The No Free Lunch Theorem

Averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. What are the consequences of this theorem?

Controlling Capacity: Regularization

The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but also by the specific identity of those functions. Regularization can be used as a way to give preference to one solution in our hypothesis space over another (more general than restricting the space itself). Weight decay: add the penalty λ w^T w to the cost function.

Controlling Capacity: Regularization

More formally, regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
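A minimal sketch of the weight decay penalty λ w^T w in linear regression, with made-up data and an arbitrary λ. The closed-form solve is the standard ridge-regression normal equation, used here as an assumption, not something the lecture specifies:

```python
import numpy as np

# Minimizing (1/m)||Xw - y||^2 + lam * (w @ w) has the closed form
# (X^T X + m * lam * I) w = X^T y; a larger lam shrinks the weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
m = len(y)

def fit(lam):
    return np.linalg.solve(X.T @ X + m * lam * np.eye(3), X.T @ y)

w_plain, w_decay = fit(0.0), fit(0.3)
print(np.linalg.norm(w_decay) < np.linalg.norm(w_plain))  # True: decay shrinks w
```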

Section 3: Hyperparameters and Validation Sets

Hyperparameters

Hyperparameters are any variables that affect the behavior of the learning algorithm but are not adapted by the algorithm itself.

Importance of the Validation Set

In a train-val-test split, learning is performed on the train set, and the choice of hyperparameters is made by evaluating on the val set. Construction of a train-val-test split: split the dataset into train and test at a 1:1 ratio, then split the train set into train and val at a 4:1 ratio.
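The construction above can be sketched directly (the dataset size and seed are arbitrary):

```python
import numpy as np

# Train-val-test split as described: 1:1 train/test, then 4:1 train/val.
rng = np.random.default_rng(0)
idx = rng.permutation(1000)   # shuffle 1000 example indices

test = idx[:500]              # 1:1 split: half the data for testing
rest = idx[500:]
val = rest[:100]              # 4:1 split of the remaining half
train = rest[100:]
print(len(train), len(val), len(test))  # 400 100 500
```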

What happens when the same test set has been used repeatedly to evaluate the performance of different algorithms over many years?

Section 4: Estimators, Bias and Variance

Point Estimation

Point estimation is the attempt to provide the single best prediction θ̂ of some quantity of interest θ. This quantity might be a scalar, vector, matrix, or even a function. Usually, point estimation is done using a set of data points:

θ̂ = g(x^(1), ..., x^(m))

Note that g does not need to return a value close to θ; it might not even have the same set of allowable values.

Bias

The bias of an estimator is:

bias(θ̂) = E[θ̂] - θ

Bias measures the expected deviation of the estimate from the true value of the function or parameter. We say an estimator is unbiased if its bias is 0, and asymptotically unbiased if lim_{m→∞} bias(θ̂) = 0.
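A standard worked example: the 1/m sample-variance estimator is biased, while dividing by m - 1 removes the bias. The simulation below checks this empirically (the sample sizes and seed are arbitrary):

```python
import numpy as np

# Draw many datasets of size m from N(0, 1) (true variance 1) and
# average each variance estimator across datasets.
rng = np.random.default_rng(0)
m, trials = 5, 200_000
samples = rng.normal(size=(trials, m))

biased = samples.var(axis=1, ddof=0).mean()    # divides by m; E = (m-1)/m = 0.8
unbiased = samples.var(axis=1, ddof=1).mean()  # divides by m - 1; E = 1.0
print(round(biased, 2), round(unbiased, 2))  # 0.8 1.0
```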

Variance

The variance Var(θ̂) of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data generating process.

The Bias-Variance Trade-Off

How do we choose between two estimators, one with large bias and the other with large variance? The mean squared error of the estimates incorporates both components:

MSE = E[(θ̂ - θ)^2] = Bias(θ̂)^2 + Var(θ̂)
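The decomposition can be verified numerically. The shrunk-mean estimator below is an arbitrary illustration (deliberately biased), not something from the lecture:

```python
import numpy as np

# For estimates e of theta, the empirical identity
# mean((e - theta)^2) = (mean(e) - theta)^2 + var(e) holds exactly.
rng = np.random.default_rng(0)
theta = 2.0
# A deliberately biased estimator: half the sample mean of 10 points.
estimates = 0.5 * rng.normal(theta, 1.0, size=(100_000, 10)).mean(axis=1)

mse = np.mean((estimates - theta) ** 2)
bias = estimates.mean() - theta
var = estimates.var()
print(np.isclose(mse, bias**2 + var))  # True
```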

Relation to Machine Learning

The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting, and overfitting. How?

Consistency

Consistency is a desirable property of estimators. It ensures that as the number of data points in our dataset increases, our point estimate converges to the true value of θ. More formally, consistency states that:

plim_{m→∞} θ̂ = θ

The convergence here is in probability. Consistency of an estimator ensures that the bias will diminish as our training dataset grows. It is better to choose consistent estimators with large bias over estimators with small bias and large variance. Why?

Section 5: ML and MAP Estimators

Maximum Likelihood Estimation

Maximum likelihood (ML) is a principle used to derive estimators. Given m examples X = {x^(1), ..., x^(m)} drawn independently from the data generating distribution p_data:

θ_ML = argmax_θ p_model(X; θ)

p_model(x; θ) maps any configuration x to a real number, and thereby tries to estimate the true data distribution p_data.

Maximum Likelihood Estimation

After some mathematical manipulation:

θ_ML = argmax_θ E_{x ~ p̂_data} log p_model(x; θ)

Ideally, we would like to take this expectation over p_data. Unfortunately, we only have access to the empirical distribution p̂_data from the training data. Maximum likelihood can be viewed as minimizing the dissimilarity between p̂_data and p_model. How?
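A standard worked example, assuming Gaussian data with known variance: maximizing the log-likelihood over the mean recovers (up to grid resolution) the sample mean.

```python
import numpy as np

# ML estimation of a Gaussian mean by grid search over the
# negative log-likelihood (equivalent to minimizing squared error).
rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, size=1000)

def nll(mu):
    return float(np.sum((x - mu) ** 2) / 2.0)  # sigma = 1, up to a constant

grid = np.linspace(0.0, 6.0, 601)  # step 0.01
mu_ml = grid[np.argmin([nll(mu) for mu in grid])]
print(abs(mu_ml - x.mean()) < 0.01)  # True: grid point nearest the sample mean
```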

Maximum Likelihood Estimation

Maximum likelihood can be shown to be the best estimator asymptotically, in terms of its rate of convergence as m → ∞. The estimator derived by ML is consistent. However, certain conditions are required for consistency to hold: The true distribution p_data must lie within the model family p_model(·; θ); otherwise, no estimator can recover p_data even with infinite training examples. There must also exist a unique θ; otherwise, ML will recover p_data but will not be able to determine the true value of θ used in the data generation process. Under these conditions, you are guaranteed to improve the performance of your estimator with more training data.


Maximum A Posteriori Estimation

Bayesian statistics: the dataset is directly observed and so is not random. On the other hand, the true parameter θ is unknown or uncertain and thus is represented as a random variable. Before observing data, we represent our knowledge of θ using the prior probability distribution p(θ). After observing data, we use Bayes' rule to compute the posterior distribution p(θ | x^(1), ..., x^(m)).

Maximum A Posteriori Estimation

Usually, priors are chosen to be high entropy distributions such as uniform or Gaussian distributions; such distributions are described as broad. From Bayes' rule we have:

p(θ | x^(1), ..., x^(m)) = p(x^(1), ..., x^(m) | θ) p(θ) / p(x^(1), ..., x^(m))

Maximum A Posteriori Estimation

To predict the distribution over new input data, marginalize over θ:

p(x^new | x^(1), ..., x^(m)) = ∫ p(x^new | θ) p(θ | x^(1), ..., x^(m)) dθ

Example: Bayesian linear regression.

Maximum A Posteriori Estimation

Maximum a posteriori (MAP) estimation tries to overcome the intractability of the full Bayesian treatment by providing a point estimate using the posterior probability:

θ_MAP = argmax_θ p(θ | x) = argmax_θ [log p(x | θ) + log p(θ)]

Like full Bayesian inference, MAP estimation has the advantage of leveraging information that is brought by the prior and cannot be found in the training data.
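A minimal sketch for a Gaussian mean with a N(0, τ²) prior (all the numbers are made up): argmax_θ [log p(x | θ) + log p(θ)] has a closed form that shrinks the ML estimate toward the prior mean.

```python
import numpy as np

# MAP for the mean of N(mu, sigma^2) data under a N(0, tau^2) prior:
# mu_MAP = sum(x) / (m + sigma^2 / tau^2), a shrunk sample mean.
rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, size=20)
sigma2, tau2 = 1.0, 0.5   # assumed-known likelihood and prior variances
m = len(x)

mu_ml = x.mean()
mu_map = x.sum() / (m + sigma2 / tau2)
print(0 < mu_map < mu_ml)  # True: the prior pulls the estimate toward 0
```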

Section 6: Gradient Based Optimization

Optimization

Optimization refers to the task of either minimizing or maximizing some function f(x) by altering the value of x. f(x) is called the objective function; in the context of machine learning, it is also called the loss, cost, or error function. Notation: x* = argmin_x f(x) is the value of x that minimizes f(x).

Using the Derivative for Optimization

The derivative of a function specifies how to scale a small change in input in order to obtain the corresponding change in output:

f(x + ε) ≈ f(x) + ε ∇_x f(x)

The derivative is useful for optimization because it tells us how to change x to improve f(x). Example: f(x - ε sign(∇_x f(x))) ≤ f(x) for small enough ε.
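The sign trick can be checked on any differentiable function; the f below is an arbitrary choice:

```python
# A small step against the sign of the derivative decreases f.
def f(x):
    return x**2 + 2 * x

def df(x):
    return 2 * x + 2

x, eps = 1.5, 0.01
step = -eps * (1.0 if df(x) > 0 else -1.0)  # move opposite the derivative's sign
print(f(x + step) < f(x))  # True for small enough eps
```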

Critical Points

A critical point or stationary point is a point x with ∇_x f(x) = 0.

Global vs Local Optimal Points

Gradient Descent

Gradient descent proposes to update the parameter according to:

x ← x - ε ∇_x f(x)

ε is referred to as the learning rate. Gradient descent converges when all elements of the gradient are approximately zero.
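A minimal gradient descent loop on f(x) = (x - 3)², with a hand-picked learning rate, stopping once the gradient is nearly zero:

```python
# Gradient descent x <- x - eps * grad_f(x) on f(x) = (x - 3)^2.
def grad_f(x):
    return 2 * (x - 3.0)

x, eps = 0.0, 0.1
steps = 0
while abs(grad_f(x)) > 1e-6:
    x = x - eps * grad_f(x)
    steps += 1
print(round(x, 6), steps)  # converges to the minimizer x = 3.0
```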


Stochastic Gradient Descent

Nearly all of deep learning is powered by one optimization algorithm: SGD. Motivation behind SGD: the cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function:

J(θ) = E_{(x,y) ~ p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^m L(x^(i), y^(i), θ)

Stochastic Gradient Descent

To minimize the loss over θ, the gradient needs to be computed:

∇_θ J(θ) = (1/m) Σ_{i=1}^m ∇_θ L(x^(i), y^(i), θ)

What is the computational cost of computing the gradient above?

Stochastic Gradient Descent

SGD relies on the fact that the gradient is an expectation, and hence can be approximated with a small set of samples. Let B = {x^(1), ..., x^(m')} be a minibatch of m' examples drawn uniformly from our training data, and estimate the gradient as:

g = (1/m') Σ_{i=1}^{m'} ∇_θ L(x^(i), y^(i), θ)

The SGD update rule becomes:

θ ← θ - ε g
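Putting the pieces together on a toy linear regression problem (all sizes, the learning rate, and the batch size are arbitrary choices):

```python
import numpy as np

# Minibatch SGD: theta <- theta - eps * g, with g the average of
# per-example squared-error gradients over a random minibatch.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true + 0.01 * rng.normal(size=1000)

theta, eps, m_prime = np.zeros(2), 0.05, 32
for step in range(2000):
    i = rng.integers(0, len(X), size=m_prime)           # draw a minibatch
    g = 2.0 * X[i].T @ (X[i] @ theta - y[i]) / m_prime  # gradient estimate
    theta = theta - eps * g
print(np.round(theta, 1))  # close to theta_true
```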

Section 7: Challenges That Motivate Deep Learning

Major Obstacles for Traditional Machine Learning

The development of deep learning was motivated by the failure of traditional ML algorithms when applied to central problems in AI: the mechanisms used to achieve generalization in traditional machine learning are insufficient to learn complicated functions in high-dimensional spaces, and the challenge of generalizing to new examples becomes exponentially more difficult when working with high-dimensional data.

The Curse of Dimensionality

Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This is because the number of distinct configurations of a set of variables increases exponentially as the number of variables increases. How does that affect ML algorithms?
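The exponential growth is easy to make concrete; the choice of 10 values per variable is arbitrary:

```python
# With v distinguishable values per variable, n variables have v**n
# distinct configurations, which is exponential in n.
v = 10
for n in (1, 2, 5, 10):
    print(n, v**n)
# A dataset of, say, a million examples cannot place even one example
# in most of the 10**10 cells of the 10-D grid.
```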


Local Constancy and Smoothness Regularization

In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. Among the most widely used priors is the smoothness or local constancy prior: a function is said to be locally constant if it does not change much within a small region of space. The simpler the machine learning algorithm, the more it tends to rely on this prior. Example: k-nearest neighbors.

Local Constancy and Smoothness Regularization

In general, traditional learning algorithms require O(k) examples to distinguish O(k) regions in space. Is there a way to represent a complex function that has many more regions to be distinguished than the number of training examples?

Local Constancy and Smoothness Regularization

Key insight: even though the number of regions of a function can be very large, say O(2^k), the function can be defined with O(k) examples as long as we introduce additional dependencies between regions via generic assumptions. Result: non-local generalization is actually possible.

Local Constancy and Smoothness Regularization

Example assumption: the data was generated by the composition of factors or features, potentially at multiple levels in a hierarchy (the core idea in deep learning). To a certain point, the exponential advantages conferred by the use of deep, distributed representations counter the exponential challenges posed by the curse of dimensionality. Many other generic, mild assumptions allow an exponential gain in the relationship between the number of examples and the number of regions that can be distinguished.

Manifold Learning

A manifold is a connected region in space. Mathematically, it is a set of points associated with a neighborhood around each point. From any point, the surface of the manifold locally appears to be a Euclidean space. Example: we observe the world around us as a 2-D plane, whereas in fact it is a spherical manifold in 3-D space.


Manifold Learning

Most AI problems seem hopeless if we expect algorithms to learn interesting variation over all of R^n. The manifold learning hypothesis: most of R^n consists of invalid input, and interesting input occurs only along a collection of manifolds embedded in R^n. Conclusion: probability mass is highly concentrated.

Manifold Learning

Fortunately, there is evidence to support the above assumptions. Observation 1: probability distributions over natural data (images, text strings, and sound) are highly concentrated. Observation 2: examples encountered in natural data are connected to each other by other examples, with each example being surrounded by similar data.

Manifold Learning

Training examples from the QMUL Multiview Face Dataset.

Conclusion

Deep learning presents a framework for solving tasks that cannot be solved by traditional ML algorithms. Next lecture: Feedforward Neural Networks.