Machine Learning Lecture 1

Machine Learning Lecture 1 Introduction 12.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de

Organization Lecturer Prof. Bastian Leibe (leibe@vision.rwth-aachen.de) Assistants Francis Engelmann (engelmann@vision.rwth-aachen.de) Paul Voigtlaender (voigtlaender@vision.rwth-aachen.de) Course webpage http://www.vision.rwth-aachen.de/courses/ Slides will be made available on the webpage and in L2P Lecture recordings as screencasts will be available via L2P Please subscribe to the lecture on the Campus system! Important to get email announcements and L2P access! 2

Language Official course language will be English if at least one English-speaking student is present. If not, you can choose. However, please tell me when I'm talking too fast or when I should repeat something in German for better understanding! You may at any time ask questions in German! You may turn in your exercises in German. You may answer exam questions in German. 3

Organization Structure: 3V (lecture) + 1Ü (exercises), 6 EECS credits. Part of the area Applied Computer Science. Place & Time: Lecture/Exercises Mon 10:15-11:45, room UMIC 025; exercise slots 08:30-10:00 AH IV (?) and 16:15-17:45 AH I (?). Lecture/Exercises Thu 14:15-15:45, H02 (C.A.R.L). Exam: written exam. 1st try: TBD. 2nd try: Thu 29.03., 10:30-13:00. 4

Exercises and Supplementary Material Exercises Typically 1 exercise sheet every 2 weeks. Pen & paper and programming exercises Matlab for first exercise slots TensorFlow for Deep Learning part Hands-on experience with the algorithms from the lecture. Send your solutions the night before the exercise class. Need to reach 50% of the points to qualify for the exam! Teams are encouraged! You can form teams of up to 3 people for the exercises. Each team should only turn in one solution via L2P. But list the names of all team members in the submission. 5

Course Webpage First exercise on 30.10. http://www.vision.rwth-aachen.de/courses/ 6

Textbooks The first half of the lecture is covered in Bishop's book. For Deep Learning, we will use Goodfellow & Bengio. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (available in the library's Handapparat). I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. Research papers will be given out for some topics: tutorials and deeper introductions, application papers. 7

How to Find Us Office: UMIC Research Centre, Mies-van-der-Rohe-Strasse 15, room 124. Office hours: If you have questions about the lecture, contact Francis or Paul. My regular office hours will be announced (additional slots are available upon request). Send us an email beforehand to confirm a time slot. Questions are welcome! 8

Machine Learning Statistical Machine Learning Principles, methods, and algorithms for learning and prediction on the basis of past evidence Already everywhere Speech recognition (e.g. Siri) Machine translation (e.g. Google Translate) Computer vision (e.g. face detection) Text filtering (e.g. email spam filters) Operating systems (e.g. caching) Fraud detection (e.g. credit cards) Game playing (e.g. AlphaGo) Robotics (everywhere) Slide credit: Bernt Schiele 9

What Is Machine Learning Useful For? Automatic Speech Recognition Slide adapted from Zoubin Gharamani 10

What Is Machine Learning Useful For? Computer Vision (Object Recognition, Segmentation, Scene Understanding) Slide adapted from Zoubin Gharamani 11

What Is Machine Learning Useful For? Slide adapted from Zoubin Gharamani Information Retrieval (Retrieval, Categorization, Clustering,...) 12

What Is Machine Learning Useful For? Slide adapted from Zoubin Gharamani Financial Prediction (Time series analysis,...) 13

What Is Machine Learning Useful For? Slide adapted from Zoubin Gharamani Medical Diagnosis (Inference from partial observations) 14 Image from Kevin Murphy

What Is Machine Learning Useful For? Slide adapted from Zoubin Gharamani Bioinformatics (Modelling gene microarray data,...) 15

What Is Machine Learning Useful For? Slide adapted from Zoubin Gharamani Autonomous Driving (DARPA Grand Challenge,...) 16 Image from Kevin Murphy

And you might have heard of Deep Learning 17

Machine Learning Goal Machines that learn to perform a task from experience Why? Crucial component of every intelligent/autonomous system Important for a system s adaptability Important for a system s generalization capabilities Attempt to understand human learning Slide credit: Bernt Schiele 18

Machine Learning: Core Questions Learning to perform a task from experience Learning Most important part here! We do not want to encode the knowledge ourselves. The machine should learn the relevant criteria automatically from past observations and adapt to the given situation. Tools Statistics Probability theory Decision theory Information theory Optimization theory Slide credit: Bernt Schiele 19

Machine Learning: Core Questions Learning to perform a task from experience Task Can often be expressed through a mathematical function y = f(x; w) x: input, y: output, w: parameters (this is what is "learned") Classification vs. Regression Regression: continuous y Classification: discrete y, e.g. class membership, sometimes also a posterior probability Slide credit: Bernt Schiele 20
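
As a minimal sketch (not part of the slides; the linear form and all numbers are assumptions for illustration), the same parametric function y = f(x; w) can be read as a regressor or, after thresholding its output, as a classifier:

```python
import numpy as np

# Illustrative sketch: one parametric function y = f(x; w) used for both
# regression (continuous y) and classification (discrete y).
def f(x, w):
    """Simple linear model: inner product of input x and parameters w."""
    return np.dot(x, w)

w = np.array([0.5, -1.0])             # parameters w (what is learned)
x = np.array([2.0, 1.0])              # input x

y_regression = f(x, w)                # continuous output -> regression
y_classification = int(f(x, w) > 0)   # thresholded output -> discrete class label
print(y_regression, y_classification)
```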

Example: Regression Automatic control of a vehicle: the input x is mapped through f(x; w) to the output y. Slide credit: Bernt Schiele 21

Examples: Classification Email filtering: x ∈ {a, ..., z}* (the text), y ∈ {important, spam}. Character recognition. Speech recognition. Slide credit: Bernt Schiele 22

Machine Learning: Core Problems Input x: Features Invariance to irrelevant input variations Selecting the right features is crucial Encoding and use of domain knowledge Higher-dimensional features are more discriminative. Curse of dimensionality Complexity increases exponentially with number of dimensions. Slide credit: Bernt Schiele 23

Machine Learning: Core Questions Learning to perform a task from experience Performance measure: Typically one number % correctly classified letters % games won % correctly recognized words, sentences, answers Generalization performance Training vs. test All data Slide credit: Bernt Schiele 24

Machine Learning: Core Questions Learning to perform a task from experience Performance: 99% correct classification Of what??? Characters? Words? Sentences? Speaker/writer independent? Over what data set? The car drives without human intervention 99% of the time on country roads Slide adapted from Bernt Schiele 25

Machine Learning: Core Questions Learning to perform a task from experience What data is available? Data with labels: supervised learning Images / speech with target labels Car sensor data with target steering signal Data without labels: unsupervised learning Automatic clustering of sounds and phonemes Automatic clustering of web sites Some data with, some without labels: semi-supervised learning Feedback/rewards: reinforcement learning Slide credit: Bernt Schiele 26

Machine Learning: Core Questions Learning to perform a task from experience Learning Most often learning = optimization Search in hypothesis space Search for the best function / model parameter w I.e. maximize y = f(x; w) w.r.t. the performance measure Slide credit: Bernt Schiele 27

Machine Learning: Core Questions Learning is optimization of y = f(x; w) w: characterizes the family of functions w: indexes the space of hypotheses w: vector, connection matrix, graph, Slide credit: Bernt Schiele 28

Course Outline Fundamentals Bayes Decision Theory Probability Density Estimation Classification Approaches Linear Discriminants Support Vector Machines Ensemble Methods & Boosting Randomized Trees, Forests & Ferns Deep Learning Foundations Convolutional Neural Networks Recurrent Neural Networks 29

Note: Updated Lecture Contents New section on Deep Learning this year! Previously covered in the Advanced ML lecture. This lecture will contain an updated and consolidated version of the Deep Learning lecture block. If you took the Advanced ML lecture last semester, you may experience some overlap! Lecture contents on Probabilistic Graphical Models (i.e., Bayesian Networks, MRFs, CRFs, etc.) will be moved to Advanced ML. Reasons for this change: Deep learning has become essential for many current applications, and I will not be able to offer an Advanced ML lecture this academic year due to other teaching duties. 30

Topics of This Lecture Review: Probability Theory Probabilities Probability densities Expectations and covariances Bayes Decision Theory Basic concepts Minimizing the misclassification rate Minimizing the expected loss Discriminant functions 31

Probability Theory Probability theory is nothing but common sense reduced to calculation. Pierre-Simon de Laplace, 1749-1827 32 Image source: Wikipedia

Probability Theory Example: apples and oranges. We have two boxes to pick from. Each box contains both types of fruit. What is the probability of picking an apple? Formalization: Let B ∈ {r, b} be a random variable for the box we pick. Let F ∈ {a, o} be a random variable for the type of fruit we get. Suppose we pick the red box 40% of the time. We write this as p(B = r) = 0.4, p(B = b) = 0.6. The probability of picking an apple given a choice for the box is p(F = a | B = r) = 0.25, p(F = a | B = b) = 0.75. What is the probability of picking an apple, p(F = a)? 33 Image source: C.M. Bishop, 2006
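
A quick numerical check of this example (a sketch, not part of the slides), using the probabilities stated above:

```python
# Apples-and-oranges example with the numbers from the slide.
p_B = {"r": 0.4, "b": 0.6}                        # prior over boxes
p_F_given_B = {"r": {"a": 0.25, "o": 0.75},       # fruit probabilities per box
               "b": {"a": 0.75, "o": 0.25}}

# Marginalize over the box: p(F = a) = sum_B p(F = a | B) p(B)
p_apple = sum(p_F_given_B[box]["a"] * p_B[box] for box in p_B)
print(p_apple)   # 0.25 * 0.4 + 0.75 * 0.6 = 0.55
```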

Probability Theory More general case: Consider two random variables X ∈ {x_i} and Y ∈ {y_j}. Consider N trials and let n_ij = #{X = x_i ∧ Y = y_j}, c_i = #{X = x_i}, r_j = #{Y = y_j}. Then we can derive the joint probability p(X = x_i, Y = y_j) = n_ij / N, the marginal probability p(X = x_i) = c_i / N, and the conditional probability p(Y = y_j | X = x_i) = n_ij / c_i. 34 Image source: C.M. Bishop, 2006
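
The count-based definitions translate directly into code. The following sketch (illustrative counts, not from the lecture) estimates the joint, marginal, and conditional probabilities from a table of counts n_ij:

```python
import numpy as np

n = np.array([[3, 1],     # n_ij = #{X = x_i and Y = y_j}
              [2, 4]])
N = n.sum()               # total number of trials

p_joint = n / N                                   # p(X = x_i, Y = y_j) = n_ij / N
p_X = n.sum(axis=1) / N                           # p(X = x_i) = c_i / N
p_Y_given_X = n / n.sum(axis=1, keepdims=True)    # p(Y = y_j | X = x_i) = n_ij / c_i

# Product rule check: p(X, Y) = p(Y | X) p(X)
assert np.allclose(p_joint, p_Y_given_X * p_X[:, None])
```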

Probability Theory Rules of probability Sum rule: p(X) = Σ_Y p(X, Y). Product rule: p(X, Y) = p(Y | X) p(X). 35 Image source: C.M. Bishop, 2006

The Rules of Probability Thus we have the Sum Rule p(X) = Σ_Y p(X, Y) and the Product Rule p(X, Y) = p(Y | X) p(X). From those, we can derive Bayes' Theorem p(Y | X) = p(X | Y) p(Y) / p(X), where p(X) = Σ_Y p(X | Y) p(Y). 36
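
Applied to the fruit example from above (a sketch using the slide's numbers), Bayes' theorem reverses the conditioning: given that we picked an apple, how likely is it that it came from the red box?

```python
p_B_r, p_B_b = 0.4, 0.6            # priors p(B = r), p(B = b)
p_a_r, p_a_b = 0.25, 0.75          # likelihoods p(F = a | B = r), p(F = a | B = b)

p_a = p_a_r * p_B_r + p_a_b * p_B_b      # normalization factor p(F = a) = 0.55
p_r_given_a = p_a_r * p_B_r / p_a        # posterior p(B = r | F = a)
print(round(p_r_given_a, 3))             # ~0.182
```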

Probability Densities Probabilities over continuous variables are defined via their probability density function (pdf) p(x). The probability that x lies in the interval (-∞, z) is given by the cumulative distribution function P(z) = ∫_{-∞}^{z} p(x) dx. 37 Image source: C.M. Bishop, 2006
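
A numerical sketch of this relation (assuming a standard Gaussian density, which is not specified on the slide): the cumulative distribution at z is obtained by integrating the density up to z.

```python
import numpy as np

def p(x):
    """Standard Gaussian pdf (illustrative choice of density)."""
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

z, dx = 1.0, 1e-4
xs = np.arange(-10.0, z, dx)       # -10 stands in for -infinity here
P_z = np.sum(p(xs)) * dx           # P(z) = integral of p(x) dx over (-inf, z)
print(round(P_z, 4))               # ~0.8413 for the standard Gaussian
```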

Expectations The average value of some function f(x) under a probability distribution p(x) is called its expectation: E[f] = Σ_x p(x) f(x) in the discrete case, E[f] = ∫ p(x) f(x) dx in the continuous case. If we have a finite number N of samples drawn from the pdf, then the expectation can be approximated by E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n). We can also consider a conditional expectation E_x[f | y] = Σ_x p(x | y) f(x). 38
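
The sample-based approximation is exactly a Monte Carlo estimate; a small sketch (the distribution and the function f are assumed here, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=100_000)   # x_n drawn from p(x) = N(0, 1)

f = lambda x: x**2
E_f = np.mean(f(samples))    # E[f] ~ (1/N) * sum_n f(x_n)
print(round(E_f, 3))         # close to 1.0, since E[x^2] = 1 for N(0, 1)
```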

Variances and Covariances The variance provides a measure of how much variability there is in f(x) around its mean value: var[f] = E[(f(x) - E[f(x)])^2]. For two random variables x and y, the covariance is defined by cov[x, y] = E_{x,y}[(x - E[x])(y - E[y])]. If x and y are vectors, the result is a covariance matrix cov[x, y] = E_{x,y}[(x - E[x])(y^T - E[y^T])]. 39
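
A short sketch of the empirical versions of these quantities (synthetic data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)    # y correlated with x

var_x = np.mean((x - x.mean())**2)                  # var[x] = E[(x - E[x])^2]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # cov[x, y]
cov_matrix = np.cov(np.stack([x, y]), bias=True)    # 2x2 covariance matrix
print(round(var_x, 2), round(cov_xy, 2))            # roughly 1.0 and 2.0
```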

Bayes Decision Theory Thomas Bayes, 1701-1761 The theory of inverse probability is founded upon an error, and must be wholly rejected. R.A. Fisher, 1925 40 Image source: Wikipedia

Bayes Decision Theory Example: handwritten character recognition Goal: Classify a new letter such that the probability of misclassification is minimized. 41 Slide credit: Bernt Schiele Image source: C.M. Bishop, 2006

Bayes Decision Theory Concept 1: Priors (a priori probabilities) p(C_k) What we can tell about the probability before seeing the data. Example: C_1 = a, C_2 = b with p(C_1) = 0.75, p(C_2) = 0.25. In general: Σ_k p(C_k) = 1. Slide credit: Bernt Schiele 42

Bayes Decision Theory Concept 2: Conditional probabilities p(x | C_k) Let x be a feature vector. x measures/describes certain properties of the input, e.g. number of black pixels, aspect ratio, ... p(x | C_k) describes its likelihood for class C_k. [Figure: likelihood curves p(x | a) and p(x | b) over x] Slide credit: Bernt Schiele 43

Bayes Decision Theory Example: Question: which class does x = 15 belong to? Since p(x | b) is much smaller than p(x | a) there, the decision should be a here. Slide credit: Bernt Schiele 44

Bayes Decision Theory Example: Question: which class does x = 25 belong to? Since p(x | a) is much smaller than p(x | b) there, the decision should be b here. Slide credit: Bernt Schiele 45

Bayes Decision Theory Example: Question: which class does x = 20 belong to? Remember that p(a) = 0.75 and p(b) = 0.25, i.e., the decision should again be a. How can we formalize this? Slide credit: Bernt Schiele 46

Bayes Decision Theory Concept 3: Posterior probabilities p(C_k | x) We are typically interested in the a posteriori probability, i.e. the probability of class C_k given the measurement vector x. Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x) = p(x | C_k) p(C_k) / Σ_i p(x | C_i) p(C_i). Interpretation: Posterior = Likelihood × Prior / Normalization Factor. Slide credit: Bernt Schiele 47
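
A sketch of the posterior computation for the letter example. The Gaussian likelihood shapes below are assumptions (the slides only show the curves); the priors p(a) = 0.75 and p(b) = 0.25 are taken from the slides.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

priors = {"a": 0.75, "b": 0.25}
likelihood = {"a": lambda x: gaussian(x, 15.0, 5.0),   # assumed class-conditionals
              "b": lambda x: gaussian(x, 25.0, 5.0)}

x = 20.0
unnormalized = {k: likelihood[k](x) * priors[k] for k in priors}
evidence = sum(unnormalized.values())                  # p(x) = sum_i p(x|C_i) p(C_i)
posterior = {k: v / evidence for k, v in unnormalized.items()}
print(posterior)   # equal likelihoods at x = 20, so the prior decides: class a
```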

Bayes Decision Theory [Figure: likelihoods p(x | a) and p(x | b); likelihood × prior p(x | a) p(a) and p(x | b) p(b) with the decision boundary; posteriors p(a | x) and p(b | x)] Posterior = Likelihood × Prior / Normalization Factor Slide credit: Bernt Schiele 48

Bayesian Decision Theory Goal: Minimize the probability of a misclassification. The green and blue regions stay constant; only the size of the red region varies! p(error) = ∫_{R_1} p(C_2 | x) p(x) dx + ∫_{R_2} p(C_1 | x) p(x) dx 49 Image source: C.M. Bishop, 2006

Bayes Decision Theory Optimal decision rule: Decide for C_1 if p(C_1 | x) > p(C_2 | x). This is equivalent to p(x | C_1) p(C_1) > p(x | C_2) p(C_2), which is again equivalent to the likelihood-ratio test p(x | C_1) / p(x | C_2) > p(C_2) / p(C_1), where the right-hand side is the decision threshold. Slide credit: Bernt Schiele 50
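
A minimal sketch of this rule as a function (the likelihood and prior values in the call are made-up numbers):

```python
def decide(p_x_c1, p_x_c2, p_c1, p_c2):
    """Two-class Bayes decision as a likelihood-ratio test."""
    return 1 if p_x_c1 / p_x_c2 > p_c2 / p_c1 else 2   # threshold = p(C2)/p(C1)

print(decide(0.20, 0.05, 0.75, 0.25))   # ratio 4.0 > threshold 0.33 -> class 1
```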

Generalization to More Than 2 Classes Decide for class k whenever it has the greatest posterior probability of all classes: p(C_k | x) > p(C_j | x) ∀ j ≠ k, equivalently p(x | C_k) p(C_k) > p(x | C_j) p(C_j) ∀ j ≠ k. Likelihood-ratio test: p(x | C_k) / p(x | C_j) > p(C_j) / p(C_k) ∀ j ≠ k. Slide credit: Bernt Schiele 51
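
For K classes the rule is simply an argmax over the unnormalized posteriors; a short sketch with illustrative numbers:

```python
import numpy as np

likelihoods = np.array([0.05, 0.20, 0.10])   # p(x | C_k), k = 1..3 (illustrative)
priors      = np.array([0.50, 0.30, 0.20])   # p(C_k)

k_star = int(np.argmax(likelihoods * priors)) + 1   # p(x) cancels in the comparison
print(k_star)   # -> 2
```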

Classifying with Loss Functions Generalization to decisions with a loss function: Differentiate between the possible decisions and the possible true classes. Example: medical diagnosis. Decisions: sick or healthy (or: further examination necessary). Classes: patient is sick or healthy. The cost may be asymmetric: loss(decision = healthy | patient = sick) >> loss(decision = sick | patient = healthy). Slide credit: Bernt Schiele 52

Classifying with Loss Functions In general, we can formalize this by introducing a loss matrix L_kj, with L_kj = loss for decision C_j if the truth is C_k. Example: cancer diagnosis, with rows of L indexing the true class and columns indexing the decision. 53

Classifying with Loss Functions Loss functions may be different for different actors. Example: a stock trader and a bank evaluating a subprime investment use different loss matrices L_stocktrader(subprime) and L_bank(subprime) over the decisions "invest" and "don't invest". Different loss functions may lead to different Bayes optimal strategies. 54

Minimizing the Expected Loss The optimal solution is the one that minimizes the loss. But: the loss function depends on the true class, which is unknown. Solution: Minimize the expected loss E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx. This can be done by choosing the regions R_j such that each x is assigned to the decision j that minimizes Σ_k L_kj p(C_k | x), which is easy to do once we know the posterior class probabilities p(C_k | x). 55

Minimizing the Expected Loss Example: 2 classes C_1, C_2 and 2 decisions α_1, α_2. Loss function: L(α_j | C_k) = L_kj. Expected loss (= risk R) for the two decisions: R(α_1 | x) = L_11 p(C_1 | x) + L_21 p(C_2 | x), R(α_2 | x) = L_12 p(C_1 | x) + L_22 p(C_2 | x). Goal: Decide such that the expected loss is minimized, i.e. decide α_1 if R(α_2 | x) > R(α_1 | x). Slide credit: Bernt Schiele 56

Minimizing the Expected Loss R(α_2 | x) > R(α_1 | x) ⇔ L_12 p(C_1 | x) + L_22 p(C_2 | x) > L_11 p(C_1 | x) + L_21 p(C_2 | x) ⇔ (L_12 - L_11) p(C_1 | x) > (L_21 - L_22) p(C_2 | x) ⇔ (L_12 - L_11) / (L_21 - L_22) > p(C_2 | x) / p(C_1 | x) = p(x | C_2) p(C_2) / (p(x | C_1) p(C_1)) ⇔ p(x | C_1) / p(x | C_2) > (L_21 - L_22) / (L_12 - L_11) · p(C_2) / p(C_1). Adapted decision rule taking into account the loss. Slide credit: Bernt Schiele 57
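
A sketch of the loss-adjusted decision in code: compute the risk R(α_j | x) = Σ_k L_kj p(C_k | x) for every decision and pick the minimum (the loss values and posteriors below are illustrative).

```python
import numpy as np

L = np.array([[0.0, 10.0],     # L_kj: rows = true class C_k, columns = decision j
              [1.0,  0.0]])    # asymmetric loss, as in the medical example
posterior = np.array([0.3, 0.7])   # p(C_1 | x), p(C_2 | x)

risk = posterior @ L               # R(alpha_j | x) = sum_k L_kj p(C_k | x)
decision = int(np.argmin(risk)) + 1
print(risk, decision)              # risks [0.7, 3.0] -> decide alpha_1
```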

The Reject Option Classification errors arise from regions where the largest posterior probability p(c k jx) is significantly less than 1. These are the regions where we are relatively uncertain about class membership. For some applications, it may be better to reject the automatic decision entirely in such a case and e.g. consult a human expert. 58 Image source: C.M. Bishop, 2006
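
The reject option can be sketched as a simple threshold on the largest posterior (the threshold value 0.9 is an arbitrary choice for illustration):

```python
import numpy as np

def classify_or_reject(posteriors, theta=0.9):
    """Return the most probable class index, or 'reject' if we are too uncertain."""
    k = int(np.argmax(posteriors))
    return k if posteriors[k] >= theta else "reject"

print(classify_or_reject(np.array([0.55, 0.45])))   # -> 'reject'
print(classify_or_reject(np.array([0.97, 0.03])))   # -> 0
```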

Discriminant Functions Formulate classification in terms of comparisons: Discriminant functions y_1(x), ..., y_K(x). Classify x as class C_k if y_k(x) > y_j(x) ∀ j ≠ k. Examples (Bayes Decision Theory): y_k(x) = p(C_k | x), y_k(x) = p(x | C_k) p(C_k), y_k(x) = log p(x | C_k) + log p(C_k). Slide credit: Bernt Schiele 59

Different Views on the Decision Problem y_k(x) ∝ p(x | C_k) p(C_k): First determine the class-conditional densities for each class individually and separately infer the prior class probabilities, then use Bayes' theorem to determine class membership. Generative methods. y_k(x) = p(C_k | x): First solve the inference problem of determining the posterior class probabilities, then use decision theory to assign each new x to its class. Discriminative methods. Alternative: Directly find a discriminant function y_k(x) which maps each input x directly onto a class label. 60

Next Lectures Ways to estimate the probability densities p(x | C_k): Non-parametric methods: histograms, k-nearest neighbor, kernel density estimation. Parametric methods: Gaussian distribution, mixtures of Gaussians. Discriminant functions: linear discriminants, support vector machines. 61

References and Further Reading More information, including a short review of probability theory and a good introduction to Bayes Decision Theory, can be found in Chapters 1.1, 1.2 and 1.5 of Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. 62