Lecture 1: Introduction. Bastian Leibe, Visual Computing Institute, RWTH Aachen University

Advanced Machine Learning Lecture 1 Introduction 20.10.2015 Bastian Leibe Visual Computing Institute RWTH Aachen University http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de

Organization Lecturer Prof. Bastian Leibe (leibe@vision.rwth-aachen.de) Teaching Assistants Umer Rafi (rafi@vision.rwth-aachen.de) Lucas Beyer (beyer@vision.rwth-aachen.de) Course webpage http://www.vision.rwth-aachen.de/teaching/ Slides will be made available on the webpage There is also an L2P electronic repository Please subscribe to the lecture on the Campus system! Important to get email announcements and L2P access! 2

Language The official course language will be English, if at least one English-speaking student is present. If not, you can choose. However: Please tell me when I'm talking too fast or when I should repeat something in German for better understanding! You may at any time ask questions in German! You may turn in your exercises in German. You may take the oral exam in German. 3

Relationship to Previous Courses Lecture Machine Learning (past summer semester) Introduction to ML Classification Graphical models This course: Advanced Machine Learning Natural continuation of the ML course Deeper look at the underlying concepts But: will try to make it accessible also to newcomers Quick poll: Who hasn't heard the ML lecture? This year: Lots of new material Large lecture block on Deep Learning First time for us to teach this (so, bear with us...) 4

New Content This Year Deep Learning 5

Organization Structure: 3V (lecture) + 1Ü (exercises) 6 ECTS credits Part of the area Applied Computer Science Place & Time Lecture/Exercises: Mon 14:15-15:45, room UMIC 025 Lecture/Exercises: Thu 10:15-11:45, room UMIC 025 Exam Oral or written exam, depending on the number of participants Towards the end of the semester, there will be a proposed date 6

Course Webpage Monday: Matlab tutorial http://www.vision.rwth-aachen.de/teaching/ 7

Exercises and Supplementary Material Exercises Typically 1 exercise sheet every 2 weeks. Pen & paper and programming exercises Matlab for early topics Theano for Deep Learning topics Hands-on experience with the algorithms from the lecture. Send your solutions the night before the exercise class. Supplementary material Research papers and book chapters Will be provided on the webpage. 8

Textbooks Most lecture topics will be covered in Bishop's book. Some additional topics can be found in Rasmussen & Williams. Christopher M. Bishop Pattern Recognition and Machine Learning Springer, 2006 (available in the library's "Handapparat") Research papers will be given out for some topics. Tutorials and deeper introductions. Application papers Carl E. Rasmussen, Christopher K.I. Williams Gaussian Processes for Machine Learning MIT Press, 2006 (also available online: http://www.gaussianprocess.org/gpml/) 9

How to Find Us Office: UMIC Research Centre Mies-van-der-Rohe-Strasse 15, room 124 Office hours If you have questions about the lecture, come see us. My regular office hours will be announced. Send us an email beforehand to confirm a time slot. Questions are welcome! 10

Machine Learning Statistical Machine Learning Principles, methods, and algorithms for learning and prediction on the basis of past evidence Already everywhere Speech recognition (e.g. speed-dialing) Computer vision (e.g. face detection) Hand-written character recognition (e.g. letter delivery) Information retrieval (e.g. image & video indexing) Operating systems (e.g. caching) Fraud detection (e.g. credit cards) Text filtering (e.g. email spam filters) Game playing (e.g. strategy prediction) Robotics (e.g. prediction of battery lifetime) Slide credit: Bernt Schiele 11

What Is Machine Learning Useful For? Automatic Speech Recognition Slide adapted from Zoubin Ghahramani 12

What Is Machine Learning Useful For? Computer Vision (Object Recognition, Segmentation, Scene Understanding) Slide adapted from Zoubin Ghahramani 13

What Is Machine Learning Useful For? Information Retrieval (Retrieval, Categorization, Clustering,...) Slide adapted from Zoubin Ghahramani 14

What Is Machine Learning Useful For? Financial Prediction (Time series analysis,...) Slide adapted from Zoubin Ghahramani 15

What Is Machine Learning Useful For? Medical Diagnosis (Inference from partial observations) Slide adapted from Zoubin Ghahramani 16 Image from Kevin Murphy

What Is Machine Learning Useful For? Bioinformatics (Modelling gene microarray data,...) Slide adapted from Zoubin Ghahramani 17

What Is Machine Learning Useful For? Robotics (DARPA Grand Challenge,...) Slide adapted from Zoubin Ghahramani 18 Image from Kevin Murphy

Machine Learning: Core Questions Learning to perform a task from experience Task Can often be expressed through a mathematical function y = f(x; w) x: Input y: Output w: Parameters (this is what is learned) Classification vs. Regression Regression: continuous y Classification: discrete y, e.g. class membership, sometimes also the posterior probability Slide credit: Bernt Schiele 19

Machine Learning: Core Questions y = f(x; w) w: characterizes the family of functions w: indexes the space of hypotheses w: vector, connection matrix, graph, ... Slide credit: Bernt Schiele 20

A Look Back: Lecture Machine Learning Fundamentals Bayes Decision Theory Probability Density Estimation Classification Approaches Linear Discriminant Functions Support Vector Machines Ensemble Methods & Boosting Randomized Trees, Forests & Ferns Generative Models Bayesian Networks Markov Random Fields 21

This Lecture: Advanced Machine Learning Extending lecture Machine Learning from last semester Regression Approaches Linear Regression Regularization (Ridge, Lasso) Gaussian Processes Learning with Latent Variables EM and Generalizations Approximate Inference Deep Learning Neural Networks CNNs, RNNs, RBMs, etc.

Let's Get Started Some of you already have basic ML background Who hasn't? We'll start with a gentle introduction I'll try to make the lecture also accessible to newcomers We'll review the main concepts before applying them I'll point out chapters to review from the ML lecture whenever knowledge from there is needed/helpful But please tell me when I'm moving too fast (or too slow) 23

Topics of This Lecture Regression: Motivation Polynomial fitting General Least-Squares Regression Overfitting problem Regularization Ridge Regression Recap: Important Concepts from ML Lecture Probability Theory Bayes Decision Theory Maximum Likelihood Estimation Bayesian Estimation A Probabilistic View on Regression Least-Squares Estimation as Maximum Likelihood 24

Regression Learning to predict a continuous function value Given: training set X = {x_1, ..., x_N} with target values T = {t_1, ..., t_N}. Learn a continuous function y(x) to predict the function value for a new input x. Steps towards a solution Choose a form of the function y(x, w) with parameters w. Define an error function E(w) to optimize. Optimize E(w) for w to find a good solution. (This may involve math.) Derive the properties of this solution and think about its limitations. 25

Example: Polynomial Curve Fitting Toy dataset Generated by the function sin(2πx) A small level of random noise with Gaussian distribution added (blue dots) Goal: fit a polynomial function y(x, w) = w_0 + w_1 x + ... + w_M x^M to this data Note: this is a nonlinear function of x, but a linear function of the coefficients w_j. 26 Image source: C.M. Bishop, 2006

Error Function How to determine the values of the coefficients w? We need to define an error function to be minimized. This function specifies how a deviation from the target value should be weighted. Popular choice: sum-of-squares error (definition written out below) We'll discuss the motivation for this particular function later 27 Image source: C.M. Bishop, 2006
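Written out (the slide shows it only as a figure), the sum-of-squares error in Bishop's notation is

E(w) = 1/2 · Σ_{n=1}^{N} ( y(x_n, w) − t_n )²

i.e., half the sum of squared deviations between the prediction y(x_n, w) and the target value t_n.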

Minimizing the Error How do we minimize the error? Solution (Always!) Compute the derivative and set it to zero. Since the error is a quadratic function of w, its derivative will be linear in w. Minimization has a unique solution. 28

Least-Squares Regression We have given Training data points: X = {x_1 ∈ R^d, ..., x_n} Associated function values: T = {t_1 ∈ R, ..., t_n} Start with a linear regressor: y(x; w) = w^T x + w_0 Try to enforce t_i = w^T x_i + w_0 One linear equation for each training data point / label pair. This is the same basic setup used for least-squares classification! Only the values are now continuous. Slide credit: Bernt Schiele 29

Least-Squares Regression Setup Step 1: Define x̃_i = (x_i, 1)^T and w̃ = (w, w_0)^T Step 2: Rewrite the constraints as t_i = w̃^T x̃_i Step 3: Matrix-vector notation: X̃ w̃ = t, with X̃ = (x̃_1, ..., x̃_n)^T and t = (t_1, ..., t_n)^T Step 4: Find the least-squares solution, i.e. minimize ||X̃ w̃ − t||² Solution: w̃ = (X̃^T X̃)^(−1) X̃^T t Slide credit: Bernt Schiele 30
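To make Step 4 concrete, here is a minimal NumPy sketch of the least-squares solution (the exercises use Matlab, but the idea carries over directly); the toy numbers are invented for illustration:

import numpy as np

# Toy data: n points in d dimensions with continuous target values (made-up numbers)
X = np.array([[0.1], [0.4], [0.7], [1.0]])       # n x d input matrix
t = np.array([0.9, 0.6, -0.4, -0.8])             # n target values

# Steps 1-3: append a constant 1 to each input so that w0 is absorbed into w~
X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])

# Step 4: least-squares solution w~ = (X~^T X~)^(-1) X~^T t
# (np.linalg.lstsq solves the same problem with a numerically more stable decomposition)
w_tilde, *_ = np.linalg.lstsq(X_tilde, t, rcond=None)
print(w_tilde)                                   # first d entries: w, last entry: w0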

Regression with Polynomials How can we fit arbitrary polynomials using least-squares regression? We introduce a feature transformation (as before in ML): y(x) = w^T φ(x) = Σ_{i=0}^{M} w_i φ_i(x), where we assume φ_0(x) = 1 and the φ_i are the basis functions. E.g.: fitting a cubic polynomial: φ(x) = (1, x, x², x³)^T Slide credit: Bernt Schiele 31
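As a hedged illustration of this feature transformation, the sketch below builds the basis matrix Φ for a cubic polynomial and reuses the same least-squares solution; the noisy sin(2πx) samples mimic the toy dataset but are otherwise made up:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)   # noisy samples of sin(2*pi*x)

M = 3                                            # polynomial order (cubic)
Phi = np.vander(x, M + 1, increasing=True)       # rows phi(x_n) = (1, x_n, x_n^2, x_n^3)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # least-squares coefficients w_0 ... w_M

y_fit = Phi @ w                                  # fitted values y(x_n) = w^T phi(x_n)
print(w)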

Varying the Order of the Polynomial Which one should we pick? The M = 9 fit shows massive overfitting! 32 Image source: C.M. Bishop, 2006

Analysis of the Results Results for different values of M Best representation of the original function sin(2πx) with M = 3. Perfect fit to the training data with M = 9, but poor representation of the original function. Why is that??? After all, M = 9 contains M = 3 as a special case! 33 Image source: C.M. Bishop, 2006

Overfitting Problem Training data contains some noise Higher-order polynomial fitted perfectly to the noise. We say it was overfitting to the training data. Goal is a good prediction of future data Our target function should fit well to the training data, but also generalize. Measure generalization performance on independent test set. 34

Measuring Generalization E.g., Root Mean Square Error (RMS), plotted for training and test sets (the test curve shoots up for large M: overfitting!). Motivation Division by N lets us compare different data set sizes. The square root ensures E_RMS is measured on the same scale (and in the same units) as the target variable t. 35 Image source: C.M. Bishop, 2006
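The definition shown only graphically on the slide is, in Bishop's notation,

E_RMS = √( 2 E(w*) / N )

where w* denotes the fitted coefficients: the factor 2E(w*) undoes the 1/2 in the sum-of-squares error, the division by N makes data sets of different size comparable, and the square root brings the result back to the scale and units of t.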

Analyzing Overfitting Example: Polynomial of degree 9 Relatively little data: overfitting is typical Enough data: good estimate Overfitting becomes less of a problem with more data. 36 Slide adapted from Bernt Schiele Image source: C.M. Bishop, 2006

What Is Happening Here? The coefficients get very large: Fitting the data from before with various polynomials, the coefficient values (tabulated in the figure) grow enormously with M. Slide credit: Bernt Schiele 37 Image source: C.M. Bishop, 2006

Regularization What can we do then? How can we apply the approach to data sets of limited size? We still want to use relatively complex and flexible models. Workaround: Regularization Penalize large coefficient values Here we've simply added a quadratic regularizer, which is simple to optimize. The resulting form of the problem is called Ridge Regression. (Note: w_0 is often omitted from the regularizer.) 38
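As a sketch (not the lecture's exercise code), ridge regression adds a penalty λ‖w‖² to the sum-of-squares error, which changes the closed-form solution to w = (λI + Φ^T Φ)^(−1) Φ^T t; the λ value below is an arbitrary choice for illustration:

import numpy as np

def ridge_fit(Phi, t, lam):
    # Minimize ||Phi w - t||^2 + lam * ||w||^2 via the closed-form solution
    d = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Phi.T @ Phi, Phi.T @ t)

# Degree-9 polynomial on 10 noisy points: unregularized least squares overfits badly here
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, 10, increasing=True)          # basis 1, x, ..., x^9

w_ridge = ridge_fit(Phi, t, lam=1e-3)
print(np.abs(w_ridge).max())                     # coefficients stay moderate instead of exploding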

Results with Regularization (M=9) 39 Image source: C.M. Bishop, 2006

RMS Error for Regularized Case Effect of regularization The trade-off parameter λ now controls the effective model complexity and thus the degree of overfitting. 40 Image source: C.M. Bishop, 2006

Summary We've seen several important concepts Linear regression Overfitting Role of the amount of data Role of model complexity Regularization How can we approach this more systematically? Would like to work with complex models. How can we prevent overfitting systematically? How can we avoid the need for validation on separate test data? What does it mean to do linear regression? What does it mean to do regularization? 41

Topics of This Lecture Regression: Motivation Polynomial fitting General Least-Squares Regression Overfitting problem Regularization Ridge Regression Recap: Important Concepts from ML Lecture Probability Theory Bayes Decision Theory Maximum Likelihood Estimation Bayesian Estimation A Probabilistic View on Regression Least-Squares Estimation as Maximum Likelihood 42

Recap: The Rules of Probability Basic rules: the Sum Rule and the Product Rule. From those, we can derive Bayes' Theorem (written out below). 43
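Written out (the slide shows them only as images), these rules are:

Sum rule: p(X) = Σ_Y p(X, Y)
Product rule: p(X, Y) = p(Y|X) p(X)
Bayes' theorem: p(Y|X) = p(X|Y) p(Y) / p(X), where p(X) = Σ_Y p(X|Y) p(Y)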

Recap: Bayes Decision Theory [Figure: class likelihoods p(x|a) and p(x|b); likelihoods times priors p(x|a)p(a) and p(x|b)p(b) with the decision boundary; posteriors p(a|x) and p(b|x).] Posterior = Likelihood × Prior / Normalization Factor Slide credit: Bernt Schiele 44

Recap: Gaussian (or Normal) Distribution One-dimensional case Mean μ, variance σ²: N(x|μ, σ²) = 1/√(2πσ²) · exp{ −(x−μ)² / (2σ²) } Multi-dimensional case Mean μ, covariance Σ: N(x|μ, Σ) = 1/((2π)^(D/2) |Σ|^(1/2)) · exp{ −(1/2) (x−μ)^T Σ^(−1) (x−μ) } 45 Image source: C.M. Bishop, 2006

Side Note Notation In many situations, it will be necessary to work with the inverse of the covariance matrix Σ: Λ := Σ^(−1) We call Λ the precision matrix. We can therefore also write the Gaussian as N(x|μ, Λ^(−1)). 46

Recap: Parametric Methods Given Data X = {x_1, x_2, ..., x_N} Parametric form of the distribution with parameters θ E.g. for a Gaussian distribution: θ = (μ, σ) Learning Estimation of the parameters θ Likelihood of θ Probability that the data X have indeed been generated from a probability density with parameters θ: L(θ) = p(X|θ) Slide adapted from Bernt Schiele 47

Recap: Maximum Likelihood Approach Computation of the likelihood Single data point: p(x_n|θ) Assumption: all data points X = {x_1, ..., x_n} are independent: L(θ) = p(X|θ) = ∏_{n=1}^{N} p(x_n|θ) Negative log-likelihood: E(θ) = −ln L(θ) = −Σ_{n=1}^{N} ln p(x_n|θ) Estimation of the parameters θ (Learning) Maximize the likelihood (= minimize the negative log-likelihood) Take the derivative and set it to zero: ∂E(θ)/∂θ = −Σ_{n=1}^{N} [ ∂p(x_n|θ)/∂θ ] / p(x_n|θ) = 0 Slide credit: Bernt Schiele 48
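For the Gaussian case, setting this derivative to zero gives the familiar closed-form estimates; the small NumPy sketch below (with invented sample values) illustrates them:

import numpy as np

x = np.array([2.1, 1.9, 2.4, 2.0, 1.7])          # observed samples (made-up)

mu_ml = x.mean()                                  # ML estimate of the mean
var_ml = np.mean((x - mu_ml) ** 2)                # ML estimate of the variance (divides by N, not N-1)

# Note: dividing by N systematically underestimates the true variance,
# which is exactly the ML limitation discussed on the next slide.
print(mu_ml, var_ml)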

Recap: Maximum Likelihood Limitations Maximum Likelihood has several significant limitations It systematically underestimates the variance of the distribution! E.g. consider the case N = 1, X = {x_1} Maximum-likelihood estimate: μ̂ = x_1 and σ̂ = 0! We say ML overfits to the observed data. We will still often use ML, but it is important to know about this effect. Slide adapted from Bernt Schiele 49

Recap: Deeper Reason Maximum Likelihood is a Frequentist concept In the Frequentist view, probabilities are the frequencies of random, repeatable events. These frequencies are fixed, but can be estimated more precisely when more data is available. This is in contrast to the Bayesian interpretation In the Bayesian view, probabilities quantify the uncertainty about certain states or events. This uncertainty can be revised in the light of new evidence. Bayesians and Frequentists do not like each other too well 50

Recap: Bayesian Learning Approach Bayesian view: Consider the parameter vector θ as a random variable. When estimating the parameters, what we compute is p(x|X) = ∫ p(x, θ|X) dθ with p(x, θ|X) = p(x|θ, X) p(θ|X) = p(x|θ) p(θ|X) Assumption: given θ, x doesn't depend on X anymore; it is entirely determined by the parameter θ (i.e. by the parametric form of the pdf). Hence p(x|X) = ∫ p(x|θ) p(θ|X) dθ Slide adapted from Bernt Schiele 51

Recap: Bayesian Learning Approach Discussion p(x|X) = ∫ p(x|θ) · [ L(θ) p(θ) / ∫ L(θ) p(θ) dθ ] dθ where p(x|θ) is the estimate for x based on the parametric form θ, L(θ) is the likelihood of the parametric form θ given the data set X, p(θ) is the prior for the parameters θ, and ∫ L(θ) p(θ) dθ is the normalization: we integrate over all possible values of θ. The more uncertain we are about θ, the more we average over all possible parameter values. 52

Topics of This Lecture Regression: Motivation Polynomial fitting General Least-Squares Regression Overfitting problem Regularization Ridge Regression Recap: Important Concepts from ML Lecture Probability Theory Bayes Decision Theory Maximum Likelihood Estimation Bayesian Estimation A Probabilistic View on Regression Least-Squares Estimation as Maximum Likelihood 53

Next lecture 54

References and Further Reading More information, including a short review of probability theory and a good introduction to Bayes Decision Theory, can be found in Chapters 1.1, 1.2 and 1.5 of Christopher M. Bishop Pattern Recognition and Machine Learning Springer, 2006 63