ECE-271A Statistical Learning I

ECE-271A Statistical Learning I
Nuno Vasconcelos
ECE Department, UCSD

The course
- This is an introductory-level course in statistical learning. By introductory I mean that you will not need any previous exposure to the field, not that it is basic.
- We will cover the foundations of Bayesian, or generative, learning.
- 271B is a follow-up course on discriminant learning, offered in alternating years. More on generative vs. discriminant later.

Logistics
- Exams: one mid-term (35%) and one final (45%, covers everything).
- Homework (20%): one problem set every week.
- Each set will include a small computational problem. By small, I mean in terms of concepts, thinking, etc.; some computational problems will require a fair amount of computer power, e.g. a few hours on a low-end PC. Be sure to start early.
- Homework counts only 20%, but the exams are almost impossible without it. It will give you the hands-on experience needed to be able to claim that you really know learning!

Homework policies
- Homework is individual. It is OK to work on problems with someone else, but you have to write your own solution and write down the names of those you collaborated with.
- Homework is due one week after it is issued.

Cheetah
- Statistical learning only makes sense when you try it on data. We will test what we learn on an image processing problem: given the cheetah image, can we teach a computer to segment it into foreground (object) and background?
- The question will be answered with different techniques, typically one problem per week, for a total of 5 computer problems. (A minimal sketch of the per-pixel classification idea follows below.)
- Try to keep an eye on the big picture, e.g. did this improve over what we had done before?
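To make the problem concrete, here is a minimal sketch of per-pixel foreground/background classification, assuming grayscale intensity as the only feature and one Gaussian class-conditional density per class; the training pixels below are synthetic stand-ins, and the features and models used in the actual course problems will differ.

import numpy as np

def fit_gaussian(samples):
    # Maximum-likelihood estimates of mean and variance.
    return samples.mean(), samples.var()

def log_gauss(x, mu, var):
    # Log of a Gaussian density evaluated at x.
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def segment(image, fg_samples, bg_samples, prior_fg=0.5):
    # Label a pixel foreground when its log-posterior beats background's.
    mu_f, var_f = fit_gaussian(fg_samples)
    mu_b, var_b = fit_gaussian(bg_samples)
    log_post_f = log_gauss(image, mu_f, var_f) + np.log(prior_fg)
    log_post_b = log_gauss(image, mu_b, var_b) + np.log(1 - prior_fg)
    return log_post_f > log_post_b  # boolean mask: True = foreground

# Synthetic stand-ins for labeled training pixels and a test image:
rng = np.random.default_rng(0)
fg = rng.normal(0.7, 0.1, 500)       # hypothetical brighter object pixels
bg = rng.normal(0.3, 0.1, 500)       # hypothetical darker background pixels
image = rng.normal(0.5, 0.2, (8, 8))
mask = segment(image, fg, bg)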

Resources
- Course web page: http://www.svcl.ucsd.edu/~nuno (all handouts, problem sets, and code will be available there)
- TA: TBA
- Me: Nuno Vasconcelos, nuno@ece.ucsd.edu, EBU1-5603
- Office hours: TA: TBA; mine: Fridays, 9:30-10:30AM. For homework, talk to the TA first; for everything else, see me.
- My assistant: Travis Spackman (tspackman@ece.ucsd.edu), outside my office, may sometimes be involved in administrative issues.

Texts
- Required: Pattern Classification, Duda, Hart, and Stork, John Wiley & Sons, 2001. We will follow it closely, with hand-outs where needed.
- Various other good, but optional, texts:
  - Pattern Recognition and Machine Learning, Bishop, 2006
  - The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman, 2001
  - Bayesian Data Analysis, Gelman, Carlin, Stern, and Rubin, 2003
  - A Probabilistic Theory of Pattern Recognition, Devroye, Györfi, and Lugosi, 1996 (more than what we need)
- Stuff you must know really well:
  - Linear Algebra, Gilbert Strang, 1988
  - Fundamentals of Applied Probability, Drake, McGraw-Hill, 1967

The course
- Why statistical learning? There are many processes in the world that are ruled by deterministic equations, e.g. f = ma; linear systems and convolution, Fourier analysis, etc.; various chemical laws. There may be some noise, error, or variability, but we can live with those; we don't need statistical learning.
- Learning is needed when we must make predictions about variables in the world, Y, that depend on factors (other variables), X, in a way that is impossible or too difficult to derive an equation for.

Examples
- Data-mining view: large amounts of data that do not follow deterministic rules. E.g. given a history of thousands of customer records and some questions that I can ask you, how do I predict that you will pay on time? It is impossible to derive an equation for this; it must be learned. While many associate learning with data-mining, it is by no means the only or most important application.
- Signal processing view: signals combine in ways that depend on hidden structure (e.g. speech waveforms depend on language, grammar, etc.), and signals are usually subject to significant amounts of noise (which sometimes means "things we do not know how to model").

Examples (cont'd)
- Signal processing view: e.g. the cocktail party problem. Although there are all these people talking, I can figure everything out; how do I build a chip to separate the speakers? Model the hidden dependence as a linear combination of independent sources plus noise (see the model below).
- Many other examples in the areas of wireless, communications, signal restoration, etc.
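For reference, the standard source-separation form of that model (not derived in these slides): the observed signals are a noisy linear mixture of independent sources,

\[
  x = A s + n, \qquad x \in \mathbb{R}^{m}, \; s \in \mathbb{R}^{k},
\]

where the components of s are statistically independent and A is an unknown mixing matrix; separating the speakers amounts to recovering s (equivalently, estimating A) from the statistics of x alone.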

Examples (cont'd)
- Perception/AI view: it is a complex world; I cannot model everything in detail. Rely on probabilistic models that explicitly account for the variability, and use the laws of probability to make inferences, e.g.: P(burglar | alarm, no earthquake) is high; P(burglar | alarm, earthquake) is low. (A toy numerical example follows below.)
- A whole field studies perception as Bayesian inference. Perception really just confirms what you already know: priors + observations = robust inference.
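A toy numerical version of that inference, with all probabilities invented purely for illustration:

# Toy Bayes-rule calculation for the burglar/alarm example.
# All numbers below are made up for illustration only.
p_burglar = 0.01                  # prior P(burglar)
p_alarm = {                       # P(alarm | burglar, earthquake)
    (True,  False): 0.95,
    (False, False): 0.01,
    (True,  True):  0.97,
    (False, True):  0.30,
}

def posterior_burglar(quake):
    # P(burglar | alarm, quake) by Bayes rule; the denominator
    # sums over burglar present/absent.
    num = p_alarm[(True, quake)] * p_burglar
    den = num + p_alarm[(False, quake)] * (1 - p_burglar)
    return num / den

print(posterior_burglar(quake=False))  # ~0.49: the alarm is strong evidence
print(posterior_burglar(quake=True))   # ~0.03: the earthquake "explains away"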

Examples (cont'd)
- Communications view: detection problems. X → channel → Y: I see Y, and know something about the statistics of the channel. What was X? (The MAP rule is stated below.)
- This is the canonical detection problem that appears all over learning. For example, face detection in computer vision: I see pixel array Y. Is it a face?
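In symbols, the standard MAP solution to this detection problem (stated here for reference):

\[
  \hat{x}(y) = \arg\max_{x} P_{X|Y}(x \mid y)
             = \arg\max_{x} P_{Y|X}(y \mid x)\, P_X(x),
\]

i.e. pick the input that is most probable given the observed output and the channel statistics.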

Statistical learning
- Goal: given a function

  x → f(·) → y = f(x)

  and a collection of example data points, learn what the function f(·) is. This is called training (in symbols below).
- Two major types of learning:
  - unsupervised: only X is known; usually referred to as clustering;
  - supervised: both X and Y are known during training, and only X is known at test time; usually referred to as classification or regression.
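In symbols, the supervised training set is a collection of input-output examples,

\[
  \mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}, \qquad y_i = f(x_i),
\]

while the unsupervised case keeps only the x_i.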

Supervised learning
- X can be anything, but the type of Y dictates the type of supervised learning problem:
  - Y ∈ {0,1}: referred to as detection;
  - Y ∈ {0, ..., M-1}: referred to as classification;
  - Y real: referred to as regression.
- The theory is quite similar, and the algorithms are similar most of the time. We will emphasize classification, but will talk about regression when it is particularly insightful.

Example
- Classifying fish: fish roll down a conveyor belt, a camera takes a picture, and the goal is: is this a salmon or a sea bass?
- Q: what is X? What features do I use to distinguish between the two fish?
- This is something of an art form. Frequently, the best approach is to ask experts: e.g. "obvious! use length and scale width!"

Classification/detection
- Two major types of classifiers:
  - discriminant: directly recover the decision boundary that best separates the classes;
  - generative: fit a probability model to each class and then analyze the models to find the border.
- A lot more on this later! The focus will be on generative learning; discriminant learning will be covered by 271B.

Caution
- How do we know learning worked? We care about generalization, i.e. accuracy outside the training set.
- Models that are too powerful can lead to over-fitting: e.g. in regression I can always fit n points exactly with a polynomial of order n-1. Is this good? How likely is the error to be small outside the training set? (See the sketch below.) There is a similar problem for classification.
- Fundamental LAW: only test set results matter!!!
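A minimal sketch of that polynomial example, on synthetic data invented for illustration:

# Over-fitting sketch: a degree n-1 polynomial fits n training points
# exactly, yet can generalize poorly. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(1)
n = 8
x_train = np.linspace(0, 1, n)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, n)

# Degree n-1 interpolates the n training points (zero training error).
coeffs = np.polyfit(x_train, y_train, deg=n - 1)

x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"training MSE: {train_mse:.2e}")  # essentially zero
print(f"test MSE:     {test_mse:.2e}")   # typically much larger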

Generalization
- Good generalization requires controlling the trade-off between training and test error:
  - training error large, test error large;
  - training error smaller, test error smaller;
  - training error smallest, test error largest.
- This trade-off is known by many names. In the generative classification world it is usually due to the bias-variance trade-off of the class models (stated below). We will look at this in detail.
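For reference, the standard bias-variance decomposition of the expected squared error (given here without derivation, for the regression case):

\[
  \mathbb{E}\big[(\hat{f}(x) - y)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \sigma^2,
\]

where σ² is the irreducible noise: more powerful models typically lower the bias but raise the variance.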

Class-modeling
- Each class is characterized by a probability density function (the class-conditional density).
- A model is adopted, e.g. a Gaussian, and training data are used to estimate the model parameters.
- Overall, the process is referred to as density estimation. The simplest example would be to use histograms (see the sketch below).
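A minimal histogram-based density estimate, on synthetic data standing in for one class:

# Histogram estimate of a class-conditional density p(x | class),
# the simplest form of density estimation. Data is synthetic.
import numpy as np

rng = np.random.default_rng(2)
samples = rng.normal(5.0, 1.5, 1000)   # stand-in training data for one class

counts, edges = np.histogram(samples, bins=20, density=True)

def density(x):
    # Piecewise-constant estimate of p(x | class).
    i = np.searchsorted(edges, x, side="right") - 1
    return counts[i] if 0 <= i < len(counts) else 0.0

print(density(5.0))   # near the mode: relatively high
print(density(10.0))  # far in the tail: zero or near zero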

Density estimation
- There are, however, much better models. Usually, the problem has two components: selecting a model, and estimating the model parameters.
- Models: we will cover the whole gamut, from the exponential family (e.g. Gaussian) to kernel-based density estimates (sketched below), including mixture models and non-parametric approaches (nearest neighbors, histograms, etc.).
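As a taste of the kernel-based end of that gamut, a minimal kernel density estimate on synthetic data, using scipy's gaussian_kde with its default bandwidth rule:

# Kernel density estimate, one of the model families listed above.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
samples = np.concatenate([rng.normal(0, 1.0, 400),
                          rng.normal(5, 0.5, 200)])

kde = gaussian_kde(samples)            # one Gaussian kernel per sample
print(kde.evaluate([0.0, 2.5, 5.0]))   # high near the modes, low between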

Parameter estimation
- Two main camps: maximum likelihood (ML) and Bayesian estimates (both in symbols below).
- ML: we will devote most attention to the quality of the estimates and the bias/variance trade-off.
- A lot more emphasis on Bayes: subjective probability (what really is a prior?); the mechanics: predictive distribution, MAP estimates, etc.; priors: conjugate, non-informative, improper; and why is the exponential family special?
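For reference, the two estimates in symbols, together with the predictive distribution that Bayesian estimation is built around:

\[
  \hat{\theta}_{ML} = \arg\max_{\theta}\, p(\mathcal{D} \mid \theta), \qquad
  \hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid \mathcal{D}),
\]
\[
  p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta.
\]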

Decision rules
- Given class models, Bayesian decision theory provides us with optimal rules for classification. Optimal here means, for example, minimum probability of error (the rule is stated below).
- We will study BDT in detail, establish connections to other decision principles (e.g. linear discriminants), show that Bayesian decisions are usually intuitive, and derive optimal rules for a range of classifiers.
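In symbols, the Bayes decision rule that achieves minimum probability of error:

\[
  g^*(x) = \arg\max_{i}\, P_{Y|X}(i \mid x)
         = \arg\max_{i}\, p_{X|Y}(x \mid i)\, P_Y(i),
\]

i.e. pick the class with the largest posterior probability given the observation.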

Reasons to take the course
- Statistical learning has a tremendous amount of theory, but things invariably go wrong: too little data, noise, too many dimensions, training sets that do not reflect all possible variability, etc.
- Good learning solutions require: knowledge of the domain (e.g. "these are the features to use"), and knowledge of the available techniques and their limitations (e.g. "here a Gaussian is enough, but there I need a mixture"). In the absence of either of these, you will fail!
- We will cover the basics, but will also talk about quite advanced concepts, in an easier scenario in which to understand them.

Reasons to take the course
- Theory together with hands-on experience: we will cover all the theory, with 5-6 problems every week.
- Hands-on component: one computational problem per week, centered around cheetah segmentation. This allows evaluation of the benefits of more advanced techniques as they are introduced, forces you to deal with real, noisy data, and exposes you to working in a new domain.

Cheetah Day
- In the last class, we will have Cheetah Day.
- What: 5 teams; each team will write a report on the 5 cheetah problems, and each team will give a presentation on one of the problems.
- Why: to make sure that we get the big picture out of all this work, and because presenting is always good practice.

Cheetah Day
- How much: 10% of the final grade (5% report, 5% presentation).
- What to talk about:
  - report: a comparative analysis of all the solutions to the problems, as if you were writing a conference paper;
  - presentation: on one single problem. Review what the solution was. What did this problem teach us about learning? What tricks did we learn solving it? How well did this solution do compared to others?
- We will talk about this in due time.