Machine Learning 101a. Jan Peters, Gerhard Neumann



Purpose of this Lecture: a statistics and math refresher, and the foundations of machine learning tools for robotics. We focus on regression methods and general principles, which are often needed in robotics. More on machine learning in general is covered in Machine Learning: Statistical Approaches 1.

Content of this Lecture: Math and Statistics Refresher; What is Machine Learning?; Model Selection; Linear Regression: Gauss Approach, Frequentist Approach, Bayesian Approach.

Statistics Refresher: Sweet memories from high school... What is a random variable? A variable whose value x is subject to variations due to chance. What is a distribution? It describes the probability that the random variable takes a certain value. What is an expectation?

Statistics Refresher: Sweet memories from high school... What is a joint, a conditional, and a marginal distribution? What is independence of random variables? What does marginalization mean? And finally, what is Bayes' theorem?

Math Refresher: Some more fancy math. From now on, matrices are your friends, and derivatives too. Some more matrix calculus. Need more? See Wikipedia on Matrix Calculus or The Matrix Cookbook.

Math Refresher: Inverses of matrices. How can we invert a matrix that is not square? Left pseudo-inverse: works if J has full column rank. Right pseudo-inverse: works if J has full row rank.
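The pseudo-inverse formulas themselves did not survive the transcription; as a reminder, the standard forms are (a hedged reconstruction using the slide's matrix J, not a verbatim copy of the slide):

```latex
\text{left pseudo-inverse: } J^{\#} = (J^{\top}J)^{-1}J^{\top}, \quad J^{\#}J = I
\qquad
\text{right pseudo-inverse: } J^{\#} = J^{\top}(JJ^{\top})^{-1}, \quad JJ^{\#} = I
```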

Statistics Refresher: Meet some old friends. The Gaussian distribution: the covariance matrix captures linear correlation; the product of Gaussians stays Gaussian; the mean is also the mode.

Statistics Refresher: Meet some old friends. Building the joint Gaussian from a marginal and a conditional; recovering the marginal and conditional Gaussians from the joint.
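Not from the slides: a minimal numpy sketch of reading off the marginal and computing the conditional from a two-dimensional joint Gaussian (the numbers are made up for illustration).

```python
# Joint: p(x, y) = N([mu_x, mu_y], [[S_xx, S_xy], [S_yx, S_yy]])
import numpy as np

mu = np.array([1.0, 2.0])                  # [mu_x, mu_y]
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])             # joint covariance
mu_x, mu_y = mu
S_xx, S_xy, S_yx, S_yy = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]

# Marginal: p(x) = N(mu_x, S_xx) -- just read off the corresponding block.
print("marginal p(x):", mu_x, S_xx)

# Conditional: p(x | y) = N(mu_x + S_xy S_yy^-1 (y - mu_y), S_xx - S_xy S_yy^-1 S_yx)
y_obs = 2.5
cond_mean = mu_x + S_xy / S_yy * (y_obs - mu_y)
cond_var = S_xx - S_xy / S_yy * S_yx
print("conditional p(x | y=2.5):", cond_mean, cond_var)
```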

Statistics Refresher: Meet some old friends. Bayes' theorem for Gaussians; the damped pseudo-inverse. Not enough? Find more material here (credit to Marc Toussaint).

May I introduce you? The good old logarithm. It is monotonic but not boring: products become easy, log(ab) = log a + log b; division is a piece of cake, log(a/b) = log a - log b; and exponents too, log(a^b) = b log a.

Content of this Lecture: Math and Statistics Refresher; What is Machine Learning?; Model Selection; Linear Regression: Gauss Approach, Frequentist Approach, Bayesian Approach.

Why Machine Learning? "We are drowning in information and starving for knowledge." - John Naisbitt. Era of big data: in 2008 there were about 1 trillion web pages; 20 hours of video were uploaded to YouTube every minute; Walmart handles more than 1M transactions per hour and has databases containing more than 2.5 petabytes (2.5 × 10^15 bytes) of information. No human being can deal with the data avalanche!

Why Machine Learning? "I keep saying the sexy job in the next ten years will be statisticians and machine learners. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s? The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades." Hal Varian, 2009, Chief Economist at Google.

Types of Machine Learning: predictive (supervised), descriptive (unsupervised), and active (e.g., reinforcement learning).

Prediction Problem (= Supervised Learning): What will be the CO2 concentration in the future? Different prediction models are possible: linear, or exponential with seasonal trends.

Formalization of Predictive Problems. In predictive problems, we have a data set of input-output pairs. The two most prominent examples are: 1. Classification: discrete outputs or labels; we predict the most likely class. 2. Regression: continuous outputs or labels; we predict the expected output.
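The formal definitions were lost in the transcript; a hedged reconstruction in standard notation (the symbols are mine, not necessarily the slides'):

```latex
\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}, \qquad
\text{classification: } \hat{y}(\mathbf{x}) = \arg\max_{c} p(c \mid \mathbf{x}), \qquad
\text{regression: } \hat{y}(\mathbf{x}) = \mathbb{E}[\,y \mid \mathbf{x}\,]
```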

Examples of Classification: document classification (e.g., spam filtering); image classification (classifying flowers, face detection, face recognition, handwriting recognition, ...).

Examples of Regression: Predict tomorrow's stock market price given current market conditions and other possible side information. Predict the amount of prostate-specific antigen (PSA) in the body as a function of a number of different clinical measurements. Predict the temperature at any location inside a building using weather data, time, door sensors, etc. Predict the age of a viewer watching a given video on YouTube. Many problems in robotics can be addressed by regression!

Types of Machine Learning: predictive (supervised), descriptive (unsupervised), and active (e.g., reinforcement learning).

Formalization of Descriptive Problems. In descriptive problems, we only have inputs, without labeled outputs. Three prominent examples are: 1. Clustering: find groups of data which belong together. 2. Dimensionality reduction: find the latent dimensions of your data. 3. Density estimation: find the probability of your data.

Old Faithful data (duration of eruption vs. time to next eruption): the points fall into clearly separated groups. This is called clustering!

Dimensionality Reduction (figure: original data, its 2D projection, and its 1D projection). This is called dimensionality reduction!

Dimensionality Reduction Example: Eigenfaces. How many faces do you need to characterize these?

Example: density of glu (plasma glucose concentration) for diabetes patients. Estimate the relative occurrence of a data point. This is called density estimation!

The bigger picture... "When we're learning to see, nobody's telling us what the right answers are; we just look. Every so often, your mother says 'that's a dog', but that's very little information. You'd be lucky if you got a few bits of information, even one bit per second, that way. The brain's visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it's no use learning one bit per second. You need more like 10^5 bits per second. And there's only one place you can get that much information: from the input itself." Geoffrey Hinton, 1996.

Types of Machine Learning: predictive (supervised), descriptive (unsupervised), and active (e.g., reinforcement learning). That will be the main topic of the lecture!

How to attack a machine learning problem? Machine learning problems are essentially always about two entities: (i) data and model assumptions: understand your problem, generate good features which make the problem easier, determine the model class, and pre-process your data; (ii) algorithms that can deal with (i): estimating the parameters of your model. We are going to do this for regression...

Content of this Lecture: Math and Statistics Refresher; What is Machine Learning?; Model Selection; Linear Regression: Gauss Approach, Frequentist Approach, Bayesian Approach.

Important Questions: What does the data look like? Are you really learning a function? What data types do our outputs have? Outliers: are there data points in China? What is our model (the relationship between inputs and outputs)? Do you have features? What type of noise / which distribution models our outputs? How many parameters? Is your model sufficiently rich? Is it robust to overfitting?

Important Questions: requirements for the solution: accurate, efficient to obtain (computation/memory), interpretable.

Example Problem: a data set. Task: describe the outputs as a function of the inputs (regression).

Model Assumptions: Noise + Features. Additive Gaussian noise, with an equivalent probabilistic model. Let's keep it simple: linear in the features.
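The model equations did not survive the transcription; a minimal sketch of the standard linear-in-features model with additive Gaussian noise, in my own notation (phi, theta, sigma^2 are assumptions, not verbatim from the slide):

```latex
y = \boldsymbol{\phi}(x)^{\top}\boldsymbol{\theta} + \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, \sigma^{2})
\quad\Longleftrightarrow\quad
p(y \mid x, \boldsymbol{\theta}) = \mathcal{N}\big(y \mid \boldsymbol{\phi}(x)^{\top}\boldsymbol{\theta},\, \sigma^{2}\big)
```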

Important Questions: What does the data look like? What data types do our outputs have? Outliers: are there data points in China? NO. Are you really learning a function? YES. What is our model? Do you have features? What type of noise / which distribution models our outputs? How many parameters? Is your model sufficiently rich? Is it robust to overfitting?

Let us fit our model... We need to answer: How many parameters? Is your model sufficiently rich? Is it robust to overfitting? We assume a model class: polynomials of degree n (see the sketch below).
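Not from the slides: a small numpy sketch of this experiment, with made-up data standing in for the lecture's data set; np.polyfit plays the role of least-squares fitting a degree-n polynomial.

```python
# Fit polynomials of increasing degree to noisy samples of a smooth function
# and watch the training error shrink while the fit starts chasing the noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)   # noisy data

for n in (0, 1, 3, 9):                        # polynomial degree
    coeffs = np.polyfit(x, y, deg=n)          # least-squares fit
    y_hat = np.polyval(coeffs, x)
    train_mse = np.mean((y - y_hat) ** 2)
    print(f"degree {n}: training MSE = {train_mse:.4f}")
```

The training error keeps dropping as the degree grows, which is exactly why it cannot be used for model selection.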

Fitting the model (figures): an easy model with n = 0; adding a feature, n = 1; more features; up to n = 200 (zoomed in), where we see overfitting and numerical problems.

A prominent example of overfitting... Is there a tank in the picture? (DARPA Neural Network Study, 1988-89, AFCEA International Press.)

Test Error vs. Training Error (figure regimes: underfitting, about right, overfitting). Does a small training error lead to a good model? NO! We need to do model selection.

Occam's Razor and Model Selection. Model selection: How can we choose the number of features/parameters? How do we choose the type of features? How do we prevent overfitting? Some insight: always choose the model that fits the data and has the smallest model complexity. This is called Occam's razor.

Bias-Variance Tradeoff. Typically, you cannot minimize both! Bias / structural error: error because our model cannot do better. Variance / approximation error: error because we estimate the parameters on a limited data set.

" How do choose the model? Goal: Find a good model Split the dataset into: (e.g., good set of features) Training Validation Test 1. Training Set: Fit Parameters 2. Validation Set: Choose model class or single parameters 3. Test Set: Estimate prediction error of trained model Error needs to be estimated on independent set! 48

Model Selection: K-fold Cross-Validation. Partition the data into K sets; use K-1 sets for training and 1 set for validation. For all possible ways of partitioning, compute the validation error J (computationally expensive!). Choose the model with the smallest average validation error.
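Not from the slides: a numpy sketch of K-fold cross-validation for choosing the polynomial degree; the data, the candidate degrees, and K = 5 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)

def kfold_validation_error(x, y, degree, K=5):
    """Average validation MSE of a degree-n polynomial over K folds."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]                                         # held-out fold
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[train], y[train], deg=degree)    # fit on training folds
        pred = np.polyval(coeffs, x[val])
        errors.append(np.mean((y[val] - pred) ** 2))
    return np.mean(errors)

for n in (0, 1, 3, 9):
    print(f"degree {n}: avg validation MSE = {kfold_validation_error(x, y, n):.4f}")
```

Unlike the training error, the average validation error rises again once the model overfits, so it can be used to pick the degree.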

Content of this Lecture: Math and Statistics Refresher; What is Machine Learning?; Model Selection; Linear Regression: Gauss Approach, Frequentist Approach, Bayesian Approach.

How to find the parameters? Gauss: let's find the parameters through a cost function! The objective is defined by minimizing a certain cost function.

Gauss' view: Least Squares. The classical cost function is the least-squares cost. Using matrix notation, we can rewrite it as a scalar product and solve it in closed form. The least-squares solution contains the left pseudo-inverse.
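The formulas on this slide were lost in transcription; a hedged reconstruction in my notation (Phi is the feature matrix with rows phi(x_i)^T, y the vector of outputs):

```latex
E(\boldsymbol{\theta})
 = \sum_{i=1}^{N}\big(y_i - \boldsymbol{\phi}(x_i)^{\top}\boldsymbol{\theta}\big)^{2}
 = (\mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\theta})^{\top}(\mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\theta}),
\qquad
\boldsymbol{\theta}^{*} = (\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\top}\mathbf{y}
```

The factor (Phi^T Phi)^{-1} Phi^T is exactly the left pseudo-inverse of Phi.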

Physical Interpretation: the energy of springs is proportional to their squared lengths, so least squares corresponds to minimizing the energy of a spring system attached to the data points.

Geometric Interpretation (figure): the fit is an orthogonal projection that minimizes the projection error; the true (unknown) function value is shown for comparison.

Robotics Example: Rigid-Body Dynamics with known features: inertial forces, Coriolis forces, centripetal forces, and gravity.

Robotics Example: Rigid-Body Dynamics. We realize that rigid-body dynamics is linear in the parameters: we can rewrite it with features built from accelerations, velocities, and sin/cos terms, and parameters built from masses, lengths, inertias, ... To find the parameters we can apply even the first machine learning method that comes to mind: least-squares regression.
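The equation itself is missing from the transcript; in standard notation (my symbols), rigid-body dynamics can be written linearly in a vector of inertial parameters theta:

```latex
\boldsymbol{\tau}
 = \mathbf{M}(\mathbf{q})\,\ddot{\mathbf{q}} + \mathbf{c}(\mathbf{q},\dot{\mathbf{q}}) + \mathbf{g}(\mathbf{q})
 = \boldsymbol{\Phi}(\mathbf{q},\dot{\mathbf{q}},\ddot{\mathbf{q}})\,\boldsymbol{\theta}
```

Here the regressor matrix Phi collects the known kinematic features (accelerations, velocities, sin/cos terms), so theta can be estimated by least squares from measured torques.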

Cost Function II: Ridge Regression. We penalize the magnitude of the parameters, which controls the model complexity. This yields ridge regression with a regularization term lambda, where lambda is called the ridge parameter. For features normalized by their variance, typically lambda lies in [10^-9, ..., 10^-5]. Numerically, this is much more stable, even with redundant features!
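Not from the slides: a numpy sketch of the closed-form ridge solution theta = (Phi^T Phi + lambda I)^{-1} Phi^T y, with hypothetical polynomial features and illustrative lambda values.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)

degree = 9
Phi = np.vander(x, degree + 1, increasing=True)        # feature matrix, rows phi(x_i)^T

def ridge_fit(Phi, y, lam):
    # theta = (Phi^T Phi + lam * I)^-1 Phi^T y, solved without forming the inverse explicitly
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

for lam in (1e-8, 1e-4, 1e-1):
    theta = ridge_fit(Phi, y, lam)
    print(f"lambda = {lam:g}: ||theta|| = {np.linalg.norm(theta):.2f}")
```

A larger lambda shrinks the parameter vector (lower model complexity); a smaller lambda lets the fit grow more flexible.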

Ridge regression with n = 15: the influence of the regularization constant.

MAP: Back to the Overfitting Problem (figure regimes: overfitting, about right, underfitting). We can also scale the model complexity with the regularization parameter! A smaller lambda means higher model complexity.

Content of this Lecture: Math and Statistics Refresher; What is Machine Learning?; Model Selection; Linear Regression: Gauss Approach, Frequentist Approach, Bayesian Approach.

How to find the parameters? Frequentist: Fisher. Probabilities are frequencies of a repeated experiment. There are some true parameters of the experiment which we cannot observe; they reveal themselves through the frequency (i.e., likelihood) at which we can reproduce the outcome of the experiment. We can obtain good parameters by maximizing the likelihood of the outcome!

Maximum-Likelihood (ML) Estimate. We can maximize the likelihood of the outcome directly: that's hard! Do the "log trick" and maximize the log-likelihood instead: that's easy! The least-squares solution is equivalent to the ML solution with Gaussian noise!
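The derivation on the slide did not survive; a hedged reconstruction with the Gaussian model and notation used above (i.i.d. data assumed):

```latex
\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta})
 = \sum_{i=1}^{N} \log \mathcal{N}\big(y_i \mid \boldsymbol{\phi}(x_i)^{\top}\boldsymbol{\theta},\, \sigma^{2}\big)
 = -\frac{1}{2\sigma^{2}} \sum_{i=1}^{N} \big(y_i - \boldsymbol{\phi}(x_i)^{\top}\boldsymbol{\theta}\big)^{2} + \text{const}
```

Maximizing the log-likelihood in theta is therefore exactly minimizing the least-squares cost.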

Content of this Lecture: Math and Statistics Refresher; What is Machine Learning?; Model Selection; Linear Regression: Gauss Approach, Frequentist Approach, Bayesian Approach.

Does this make sense? Maximizing the likelihood basically means we only care about the accuracy of the reproduction of the outcomes. What if there is no single fully true parameter vector? In this case, we rather need to study the probability of different parameter values. Thus, our quantity of interest is the distribution over the parameters given the data. But how can we obtain this quantity?

How to find the parameters? Bayesian: Bayes. Parameters are just random variables, and we can encode our subjective belief in the prior. Bayes' theorem combines the likelihood and the prior into the posterior, normalized by the evidence. Intuition: if you assign each parameter estimate a probability of being right, the average of these parameter estimates will be better than any single one.

Maximum a Posteriori (MAP) Estimate. Put a prior on our parameters, e.g., that they should be small. Find the parameters that maximize the posterior. Do the log trick again.

Maximum a Posteriori (MAP) Estimate. In the log domain, the prior is just an additive cost. Let's put in our model: ridge regression is equivalent to the MAP estimate with a Gaussian prior.
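The slide's equations are missing; a hedged reconstruction with a zero-mean Gaussian prior p(theta) = N(0, alpha^2 I) (alpha is my symbol, not the slide's):

```latex
\log p(\boldsymbol{\theta} \mid \mathbf{y}, \mathbf{X})
 = \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}) + \text{const}
 = -\frac{1}{2\sigma^{2}}\,\lVert \mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\theta} \rVert^{2}
   - \frac{1}{2\alpha^{2}}\,\lVert \boldsymbol{\theta} \rVert^{2} + \text{const}
```

Maximizing this is the ridge-regression objective with ridge parameter lambda = sigma^2 / alpha^2.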

Predictions with the Model. We found an amazing parameter set (the parameter estimate, e.g., ML or MAP); let's do predictions! For a test input we obtain a predicted function value, with a predictive mean and a predictive variance.
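The formulas are missing from the transcript; for a point estimate of the parameters, the standard predictions are (my notation):

```latex
\mu(x_{*}) = \boldsymbol{\phi}(x_{*})^{\top}\hat{\boldsymbol{\theta}},
\qquad
\operatorname{var}(y_{*}) = \sigma^{2}
```

Note that with a single point estimate the predictive variance is just the noise variance, independent of the test input; the fully Bayesian treatment below changes this.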

Comparing different data sets with the same input data, but different output values (due to noise): our parameter estimate is also noisy! It depends on the noise in the data.

Comparing different data sets: can we also estimate our uncertainty in the parameters? Compute the probability of the parameters given the data.

How to get the posterior? Use Bayes' theorem for Gaussians. For our model, we combine the data likelihood with a prior over the parameters to obtain the posterior over the parameters.
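Not from the slides: a numpy sketch of this Gaussian posterior for Bayesian linear regression with prior N(0, alpha2 * I) and noise variance sigma2; data, degree, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)

degree, sigma2, alpha2 = 5, 0.04, 1.0
Phi = np.vander(x, degree + 1, increasing=True)            # feature matrix

# Posterior p(theta | D) = N(m_N, S_N)
S_N = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(degree + 1) / alpha2)
S_N = 0.5 * (S_N + S_N.T)                                  # symmetrize for numerical safety
m_N = S_N @ Phi.T @ y / sigma2

# Sampling parameter vectors (and hence functions) from the posterior, as on the next slides:
theta_samples = rng.multivariate_normal(m_N, S_N, size=5)
print("posterior mean:", np.round(m_N, 2))
print("one posterior sample:", np.round(theta_samples[0], 2))
```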

What to do with the posterior? We could sample from it to estimate our uncertainty.

Can we avoid the parameters? Bayesian fundamentalism: we should not! We don't care about parameters; we care about predictions!

Full Bayesian Regression. We can also do that in closed form: integrate out all possible parameters, combining the likelihood with the parameter posterior to get the predicted function value at a test input given the training data. Intuition: if you assign each parameter estimate a probability of being right, the average of these parameter estimates will be better than any single one.

Full Bayesian Regression (continued). The predictive distribution is again a Gaussian, with a state-dependent variance!
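The integral and its closed form were lost in transcription; a hedged reconstruction using the posterior mean m_N and covariance S_N from the sketch above (my notation):

```latex
p(y_{*} \mid x_{*}, \mathcal{D})
 = \int p(y_{*} \mid x_{*}, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}
 = \mathcal{N}\big(y_{*} \mid \boldsymbol{\phi}(x_{*})^{\top}\mathbf{m}_{N},\;
   \sigma^{2} + \boldsymbol{\phi}(x_{*})^{\top}\mathbf{S}_{N}\,\boldsymbol{\phi}(x_{*})\big)
```

The second variance term depends on the test input, which is exactly the state-dependent uncertainty mentioned on the slide.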

Integrating out the parameters: the variance depends on the information in the data!

Quick Summary. Models that are linear in the parameters: overfitting is bad; do model selection (e.g., leave-one-out cross-validation). Parameter estimation in regression: frequentist vs. Bayesian; cost functions like least squares go back to Gauss; least squares ~ maximum likelihood estimation (ML; frequentist); ridge regression ~ maximum a posteriori estimation (MAP; Bayesian); full Bayesian regression integrates out the parameters when predicting, giving state-dependent uncertainty.