INTRODUCTION TO MACHINE LEARNING
Some content courtesy of Professor Andrew Ng of Stanford University

INTRODUCTION TO MACHINE LEARNING
Some content courtesy of Professor Andrew Ng of Stanford University
IQS2: Spring 2013

Machine Learning Definition 2
Arthur Samuel (1959). Machine Learning: field of study that gives computers the ability to learn without being explicitly programmed.
Samuel's claim to fame: checkers. He had his program play against itself tens of thousands of times, noting board positions that tended to lead to wins and those that tended to lead to losses. In time the program became much better at checkers than Samuel ever was!

Machine Learning Definition 3
Problem with Samuel's definition: it is too informal. How do we know when the definition has been satisfied?
Tom Mitchell (1998). Well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. 4
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What are T, P, and E?

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. 5
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What are T, P, and E?
T: Classifying emails as spam or not spam.
E: Watching you label emails as spam or not spam.
P: The number (or fraction) of emails correctly classified as spam/not spam.
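As a minimal sketch of the performance measure P above (the label lists are invented for illustration, not data from the lecture), P can be computed as the fraction of emails whose predicted label matches the user's label:

```python
# Hypothetical spam-filter evaluation: P = fraction of emails classified correctly.
true_labels = ["spam", "ham", "spam", "ham", "ham"]   # E: labels the user provided
predicted   = ["spam", "ham", "ham",  "ham", "ham"]   # T: the program's classifications

# P: fraction of emails correctly classified as spam/not spam
correct = sum(t == p for t, p in zip(true_labels, predicted))
P = correct / len(true_labels)
print(P)  # 4 of 5 correct -> 0.8
```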

Machine Learning Algorithms 6
Supervised learning (what we'll do: learning is supervised in the sense that we provide training data). Currently the most common type of machine learning.
Unsupervised learning. Examples: clustering algorithms; some data mining.
Others: recommender systems (think Netflix).

Terminology: Feature Vector 7
Feature vector, or simply features: the characteristics of the studied phenomenon that provide input to the machine learning algorithm.
Ex.: In the housing-prices example, there is only one feature: the size of the house in square feet. We could use more features, such as whether the house has a garage, whether it has central air conditioning, the number of bathrooms, etc. Then we would have an entire vector of features.
NOTE: These are not parameters. We do not change their values to optimize our model. We do, however, try to select sufficient features to allow us to meet our goals.
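To make the distinction concrete, here is a small sketch (the house values are invented for illustration) of a single-feature input versus a richer feature vector:

```python
# One feature: house size in square feet.
x_single = [2104.0]

# A richer feature vector for the same house:
# [size in sq ft, has garage (0/1), has central A/C (0/1), number of bathrooms]
x_vector = [2104.0, 1.0, 0.0, 2.5]

# Features are inputs we measure and choose, not parameters we optimize.
print(len(x_single), len(x_vector))  # 1 4
```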

Supervised Learning 8
Supervised learning: correct answers are given.
Ex. Linear regression: correct housing prices are given for some square-foot values. Note: nothing prevents two houses of the same size from having two different prices.
Ex. Digit classification: sample images are given along with the digit they represent. Note that in classification problems, though the answer is a class, it is often coded as a discrete numerical value. E.g., if code is classified as malicious, code it 1; else code it 0. In our digit classifier, the coded value is actually NOT the digit: it's a 10 x 1 column vector, each of whose entries is between 0 and 1 (typical of multi-class classification).
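A sketch of the coding described above: the digit label is represented not as the digit itself but as a 10-entry vector (a plain Python list standing in for the 10 x 1 column vector):

```python
def one_hot(digit, num_classes=10):
    """Code a digit label as a 10-element 0/1 vector, as in multi-class classification."""
    vec = [0.0] * num_classes
    vec[digit] = 1.0
    return vec

y = one_hot(3)
print(y)  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```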

Supervised Learning Terminology 9
Training set: the data that is used to teach your program.
Test set: the data that will be used to test your program (this should definitely NOT be the same as your training data). Cross-validation.
VERY IMPORTANT: Just because your classifier works well (or even perfectly) on your training data, this does NOT mean that it will perform well on other data! In fact, classifiers that work perfectly on the training data are often overfitted (more on this later).
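A minimal sketch of keeping the training set and test set disjoint (the data and the 80/20 split ratio are arbitrary choices for illustration):

```python
import random

data = list(range(100))        # stand-in for 100 labeled examples
random.seed(0)                 # fixed seed so the split is reproducible
random.shuffle(data)

split = int(0.8 * len(data))   # 80/20 split: a common but arbitrary choice
train_set, test_set = data[:split], data[split:]

# The test set must not overlap the training set.
assert not set(train_set) & set(test_set)
print(len(train_set), len(test_set))  # 80 20
```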

Cost Function 10
Recall that machine learning, according to our formal definition, requires measured performance. In virtually all cases, this is provided by a cost function.
Recall the cost function from linear regression. In this example (as in most), lower cost means a better solution. So improving performance means finding values of the parameters θ0 and θ1 that minimize the cost function (or at least create a lower cost than our earlier choices).

Cost Function 11
Recall that machine learning, according to our formal definition, requires measured performance. In virtually all cases, this is provided by a cost function.
Recall the cost function from linear regression. Note here that θ0 and θ1 are the only variables in this example, in the sense that the values of the xi and yi are provided to us by the training data.
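The slide's formula did not survive transcription; as a sketch, here is the standard squared-error cost from linear regression (the training data below are invented, and the factor 1/(2m) is the usual convention, assumed rather than read off the slide):

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = (1/(2m)) * sum((theta0 + theta1*x - y)^2),
    the standard linear-regression cost function."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Tiny made-up training set where y = 2x exactly.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cost(0.0, 2.0, xs, ys))       # perfect fit -> cost 0.0
print(cost(0.0, 1.0, xs, ys) > 0)   # worse parameters -> strictly higher cost
```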

Your Cost Function 12

13
Where:
m = number of training examples
K = output dimension (10 for us)
h is the neural network output, which we'll discuss later
Θ^(l) is a family of matrices (actually just 2 matrices):
Θ^(1) is a 257 x 256 matrix of weights
Θ^(2) is a 257 x 10 matrix of weights
The matrix entries are the parameters. So I lied: there are not 400 parameters. There are 68,362.
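The parameter count on the slide can be checked directly from the stated matrix dimensions:

```python
# Theta^(1): 257 x 256 weights; Theta^(2): 257 x 10 weights, per the slide.
params_layer1 = 257 * 256
params_layer2 = 257 * 10
print(params_layer1 + params_layer2)  # 68362, matching the slide
```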

Don't Panic! 14
I showed you that to try to condition you to it. Sort of like shock therapy. I won't show it again for at least a few days.

Don't Panic! 15
I showed you that to try to condition you to it. Sort of like shock therapy. I won't show it again for at least a few days. And when I do, you'll have a better understanding of what it all means.

Don't Panic! 16
I showed you that to try to condition you to it. Sort of like shock therapy. I won't show it again for at least a few days. And when I do, you'll have a better understanding of what it all means. BUT, your head will probably still hurt a bit when you work with it.

Regularization 17
Deals with underfitting and overfitting, a.k.a. bias and variance.
Underfitting (a.k.a. high bias): too few parameters to accurately capture the real phenomenon. Ex:

Solution? 18
Add more parameters to the model (in this case, allow for higher-degree polynomial fits).
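One way to "add parameters," as the slide suggests, is to expand a single input into polynomial features, so the fit can use a higher-degree polynomial (the degrees below are arbitrary illustrations):

```python
def poly_features(x, degree):
    """Expand scalar x into [1, x, x^2, ..., x^degree]; each power gets its own
    parameter theta_j, so a higher degree means more parameters in the model."""
    return [x ** d for d in range(degree + 1)]

print(poly_features(2.0, 1))  # [1.0, 2.0]: a straight-line model, underfit-prone
print(poly_features(2.0, 4))  # [1.0, 2.0, 4.0, 8.0, 16.0]: a more flexible fit
```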

Problem: Overfitting 19
Overfitting (a.k.a. high variance): fitting the model too closely to the data can sometimes create poor model outcomes.

Problem: Overfitting 20
Overfitting: fitting the model too closely to the data can sometimes create poor model outcomes. What if this is the true value of the data point you want to test?

Example: Nearest-Neighbor 21
The k-nearest-neighbor algorithm classifies a given point by looking at the k training data values nearest to the point. The majority class wins.
1-nearest neighbor will always classify the training set perfectly! But that will rarely be what you want to use to classify new values (it's classic overfitting).
k-nearest neighbor, for k bigger than 1, will not always classify the training set correctly, but it will do a much better job of classifying unknown data.
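A minimal 1-D sketch of k-nearest neighbor (the toy points and labels are invented), showing that k = 1 chases individual training points while a larger k votes among neighbors:

```python
from collections import Counter

def knn_classify(point, train, k):
    """Classify `point` by majority vote among the k nearest training examples.
    `train` is a list of (value, label) pairs; distance is 1-D absolute difference."""
    nearest = sorted(train, key=lambda vl: abs(vl[0] - point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: class "a" clusters near 0, class "b" near 10,
# with one mislabeled outlier ("a" at 9.0) sitting inside the "b" cluster.
train = [(0.0, "a"), (1.0, "a"), (2.0, "a"),
         (9.0, "a"), (10.0, "b"), (11.0, "b"), (12.0, "b")]

print(knn_classify(9.1, train, k=1))  # "a": 1-NN chases the outlier (overfitting)
print(knn_classify(9.1, train, k=5))  # "b": a larger k votes the outlier down
```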

22 Example: 1-nearest neighbor

23 Example: 15-nearest neighbor

For Emphasis 24
The issue with overfitting is that if your model has too many parameters (or too many features), you may fit the training set so well that your model fails to generalize to new examples. Since it's new examples you want to classify, this is a problem!

In General 25
Machine learning experts use various statistical tools to determine the likelihood that their model is experiencing overfitting or underfitting (and which of the two is the case). When identified, there are methods for remedying the situation. In general, this is beyond the scope of our foray into machine learning.
For overfitting, I'll mention one obvious remedy: get rid of some features. But which ones?

Regularization 26
A method for dealing with overfitting.
Basic idea: keep all the features or parameters, but reduce their magnitude. As an intuitive example, think in general about what reducing coefficients does to a polynomial.
Works well when we have a lot of features, each of which contributes a little bit to predicting the class of the data.

So... 27 If we have the following cost function, and want to force θ3 and θ4 to be small, how can we do this?

So... 28 If we have the following cost function, and want to force θ3 and θ4 to be small, how can we do this? How about like this?:

Intuitively 29
It's a bit easier to see why shrinking θ3 and θ4 simplifies the model and reduces the effect of overfitting than it is to see why shrinking the values of ALL the parameters has a similar effect.
But shrinking ALL the parameters does do what we want: it creates a simpler model and helps avoid the bad effects of overfitting.
The easiest way to see this is to play around and see the effects for yourself. But it's not necessary... unless you want to someday work in the machine learning arena.

Our Cost Function (Once Again) 30 The regularization term

Summary 31
We've seen the basic concepts involved in machine learning. We've discussed the problems of underfitting and overfitting. We've discussed regularization. We've looked at the cost function we'll be using. And have possibly been traumatized by it.