COMS 4771 Introduction to Machine Learning Nakul Verma

Machine learning: what? The study of making machines learn a concept without having to explicitly program it. Constructing algorithms that can learn from input data and make predictions, and that can find interesting patterns in data. Analyzing these algorithms to understand the limits of learning.

Machine learning: why? We are smart programmers, so why can't we just write some code with a set of rules to solve a particular problem? Try writing down a set of rules in code to distinguish these two faces: it is far from obvious how. And what if we don't even know the explicit task we want to solve?

Machine learning: problems in the real world Recommendation systems (Netflix, Amazon, Overstock) Stock prediction (Goldman Sachs, Morgan Stanley) Risk analysis (Credit card, Insurance) Face and object recognition (Cameras, Facebook, Microsoft) Speech recognition (Siri, Cortana, Alexa, Dragon) Search engines and content filtering (Google, Yahoo, Bing)

Machine learning: how? So, how do we actually do it? This is what we will focus on in this class!

This course We will learn: How to study a prediction problem in an abstract manner and come up with a solution that is applicable to many problems simultaneously. Different types of paradigms and algorithms that have been successful in prediction tasks. How to systematically analyze how good an algorithm is for a prediction task.

Prerequisites Mathematical prerequisites: basics of probability and statistics, linear algebra, calculus. Computational prerequisites: basics of algorithm and data structure design, ability to program in a high-level language.

Administrivia Website: http://www.cs.columbia.edu/~verma/classes/sp18/coms4771/ The team: Instructor: Nakul Verma (me) TAs Students: you! Evaluation: Homeworks (40%) Exam 1 (30%) Exam 2 (30%)

Policies Homeworks: No late homework. Must type your homework (no handwritten homework). Please include your name and UNI. Submit a PDF copy of the assignment via Gradescope (98644J). Except for HW0, students are encouraged to work in groups (at most 3 people). We encourage discussing the problems (on Piazza), but please don't copy.

Announcement! Visit the course website. Review the basics (prerequisites). HW0 is out! Sign up on Piazza & Gradescope. Students have access to the recitation section on Fri 1:10-2:25p in Math 207.

Let's get started!

Machine Learning: the basics A closer look at some prediction problems. Handwritten character recognition: {0, 1, 2, ..., 9}. Spam filtering: {spam, not spam}. Object recognition: {building, tree, car, road, sky, ...}.

Machine Learning: the basics Commonalities in a prediction problem. Input: an example x from an input space X (e.g., an image of a handwritten digit). To learn: a mapping f from the input space X to the output space Y. Output: a prediction y in Y = {0, 1, 2, ..., 9} (here, the digit 5).

Machine Learning: the basics Supervised learning. Data: labeled examples (x_1, y_1), ..., (x_n, y_n). Assumption: there is a (relatively simple) function f* such that f*(x_i) = y_i for most i. Learning task: given n examples from the data, find an approximation f of f*. Goal: f gives mostly correct predictions on unseen examples. Training phase: labeled training data (n examples from the data) is fed to a learning algorithm, which outputs a classifier f. Testing phase: unlabeled test data (unseen / future data) is fed to the classifier f, which outputs a prediction.

Machine Learning: the basics Unsupervised learning. Data: unlabeled examples x_1, ..., x_n. Assumption: there is an underlying structure in the data. Learning task: discover that structure given n examples from the data. Goal: come up with a summary of the data using the discovered structure. More on this later in the course.

Supervised Machine Learning Statistical modeling approach: labeled training data (n examples) is drawn independently from a fixed underlying distribution (the i.i.d. assumption). The learning algorithm selects, from a pool of candidate models, the classifier f that maximizes label agreement on the training data. How to select f? Maximum likelihood (best fits the data), maximum a posteriori (best fits the data, but incorporates prior assumptions), or optimization of a loss criterion (best discriminates the labels).

Maximum Likelihood Estimation (MLE) Given some data x_1, ..., x_n drawn i.i.d. (let's forget about the labels for now), say we have a model class {p_θ}, i.e., each model p can be described by a set of parameters θ. The goal is to find the parameter setting θ that best fits the data. If each model p_θ is a probability model, then we can find the best-fitting probability model via likelihood estimation! Likelihood: L(θ) = p_θ(x_1, ..., x_n) = ∏_{i=1..n} p_θ(x_i), using the i.i.d. assumption. Interpretation: how probable (or how likely) is the data given the model p_θ? The MLE is the parameter setting θ that maximizes the likelihood.
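A minimal numerical sketch of this idea (not from the slides): evaluate the log-likelihood of a tiny dataset under a few candidate parameter settings of a Gaussian model and keep the candidate with the highest value. The data values and the candidate grid below are made up for illustration.

import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # Density of N(mu, sigma2) evaluated at x
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def log_likelihood(data, mu, sigma2):
    # log L(theta) = sum_i log p_theta(x_i), using the i.i.d. assumption
    return np.sum(np.log(gaussian_pdf(data, mu, sigma2)))

data = np.array([60.0, 62.0, 53.0, 58.0])                 # toy sample
candidates = [(55.0, 4.0), (58.0, 10.0), (65.0, 25.0)]    # (mu, sigma^2) guesses
best = max(candidates, key=lambda t: log_likelihood(data, *t))
print("best candidate (mu, sigma^2):", best)

In practice we do not search over a grid; for many model classes (including the Gaussian, next slides) the maximizing θ has a closed form.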

MLE Example Fitting a statistical probability model to heights of females. Height data (in inches): 60, 62, 53, 58, ... ∈ R. Model class: Gaussian models on R, with µ = mean parameter and σ² = variance parameter (> 0). So, what is the MLE for the given data X?

MLE Example (contd.) Height data (in inches): x_1, ..., x_n ∈ R. Model class: Gaussian models on R. Maximizing the likelihood ∏_i p_θ(x_i) over θ directly? Good luck! Trick #1: work with the log-likelihood, log ∏_i p_θ(x_i) = Σ_i log p_θ(x_i), which turns the product into a sum. Trick #2: finding the max (or other extreme values) of a function is simply analyzing the stationary points of the function, that is, the values at which its derivative is zero!

MLE Example (contd. 2) Let's calculate the best-fitting θ = (µ, σ²). Write out the log-likelihood Σ_i log p_θ(x_i) using the i.i.d. assumption. Maximizing over µ: set the partial derivative with respect to µ to zero. Maximizing over σ²: set the partial derivative with respect to σ² to zero.
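For reference, here is the standard Gaussian derivation this slide sketches, written out (not verbatim from the slides):

\[
\log \prod_{i=1}^{n} p_\theta(x_i) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2
\]
\[
\frac{\partial}{\partial \mu}: \quad \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0 \;\Rightarrow\; \hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i
\]
\[
\frac{\partial}{\partial \sigma^2}: \quad -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2 = 0 \;\Rightarrow\; \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat\mu)^2
\]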

MLE Example So, for the female height data 60, 62, 53, 58, ... ∈ R, the best-fitting Gaussian model is the one with parameters µ̂ = (1/n) Σ_i x_i (the sample mean) and σ̂² = (1/n) Σ_i (x_i − µ̂)² (the sample variance). What about other model classes?
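A quick sketch of computing these estimates in code; the four height values are the ones visible on the slide, standing in for the full dataset (which is not listed there):

import numpy as np

heights = np.array([60.0, 62.0, 53.0, 58.0])    # partial sample from the slide

mu_hat = heights.mean()                          # MLE mean: (1/n) * sum_i x_i
sigma2_hat = ((heights - mu_hat) ** 2).mean()    # MLE variance: (1/n) * sum_i (x_i - mu_hat)^2
# Note: the MLE divides by n, not n-1 (np.var(heights, ddof=0) gives the same value).

print(mu_hat, sigma2_hat)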

Other popular probability models Bernoulli model (coin tosses): scalar-valued. Multinomial model (dice rolls): scalar-valued. Poisson model (rare counting events): scalar-valued. Gaussian model (most common phenomenon): scalar-valued. But most machine learning data is vector-valued! Multivariate Gaussian model: vector-valued. Multivariate versions of the other scalar-valued models are also available.

Multivariate Gaussian Univariate (on R): µ = mean parameter, σ² = variance parameter (> 0). Multivariate (on R^d): µ = mean vector, Σ = covariance matrix (positive definite).
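A sketch of the corresponding MLE in the multivariate case (standard formulas, not spelled out on the slide): the mean vector is the per-coordinate average, and the covariance matrix averages the outer products of the centered data. The numbers below are hypothetical (height, weight) rows.

import numpy as np

X = np.array([[61.0, 130.0],    # hypothetical (height, weight) examples
              [64.0, 140.0],
              [58.0, 120.0],
              [66.0, 150.0]])

n = X.shape[0]
mu_hat = X.mean(axis=0)                     # MLE mean vector
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n       # MLE covariance matrix (divide by n)
# Equivalent: np.cov(X, rowvar=False, bias=True)

print(mu_hat)
print(Sigma_hat)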

From MLE to Classification MLE sounds great; how do we use it to do classification with labelled data? Predict the most probable label for a given input: f(x) = argmax_y P[Y = y | X = x]. (Why this particular choice? More on that later.) By Bayes rule, P[Y = y | X = x] = P[X = x | Y = y] P[Y = y] / P[X = x], and since the denominator is independent of y, f(x) = argmax_y P[X = x | Y = y] P[Y = y]. Class prior P[Y = y]: simply the probability of a data sample occurring from a given category. Class conditional P[X = x | Y = y]: use a separate probability model for each individual category/class type. We can find the appropriate parameters for these models using MLE!

Classification via MLE Example Task: learn a classifier to distinguish males from females based on, say, height and weight measurements. Classifier: f(x) = argmax_y P[X = x | Y = y] P[Y = y], with y ∈ {male, female}. Using labelled training data, learn all the parameters. Learning the class priors: P[Y = male] is the fraction of training data labelled as male, and P[Y = female] is the fraction labelled as female. Learning the class conditionals: θ(male) = MLE using only the male data, θ(female) = MLE using only the female data.

What are we doing geometrically? Data geometry: [Figure: scatter plot of the training data, weight vs. height, with male and female data points forming two clusters.]

What are we doing geometrically? Data geometry: [Figure: the same weight vs. height scatter plot, with a test point x and the contours of the MLE Gaussian (male) and the MLE Gaussian (female) overlaid.]

Classification via MLE Example (recap) Task: learn a classifier to distinguish males from females based on height and weight measurements. Classifier: f(x) = argmax_y P[X = x | Y = y] P[Y = y]. Using labelled training data, learn all the parameters. Learning the class priors: the fraction of training data labelled as male, and the fraction labelled as female. Learning the class conditionals: θ(male) = MLE using only the male data, θ(female) = MLE using only the female data.
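A compact sketch of this whole recipe, with made-up height/weight numbers standing in for the labelled training data:

import numpy as np

# Hypothetical labelled training data: rows are (height, weight)
male = np.array([[70.0, 170.0], [68.0, 160.0], [72.0, 185.0], [69.0, 165.0]])
female = np.array([[63.0, 125.0], [61.0, 112.0], [65.0, 138.0], [62.0, 121.0]])

def fit_gaussian(X):
    # MLE for a multivariate Gaussian: mean vector and covariance (divide by n)
    mu = X.mean(axis=0)
    C = (X - mu).T @ (X - mu) / X.shape[0]
    return mu, C

def log_gaussian(x, mu, C):
    # Log density of N(mu, C) evaluated at x
    d = len(mu)
    diff = x - mu
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(C))
                   + diff @ np.linalg.solve(C, diff))

params = {"male": fit_gaussian(male), "female": fit_gaussian(female)}
n_total = len(male) + len(female)
priors = {"male": len(male) / n_total, "female": len(female) / n_total}

def classify(x):
    # f(x) = argmax_y  log P[X=x | Y=y] + log P[Y=y]
    return max(params, key=lambda y: log_gaussian(x, *params[y]) + np.log(priors[y]))

print(classify(np.array([66.0, 150.0])))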

Classification via MLE Example We just made our first predictor! But why classify via f(x) = argmax_y P[Y = y | X = x]?

Why the particular f(x) = argmax_y P[Y = y | X = x]? Define the accuracy of a classifier g as P[g(X) = Y]. Assume binary classification (for simplicity), with labels {0, 1}. Let f be the Bayes classifier f(x) = argmax_y P[Y = y | X = x], and let g be any classifier. Theorem: P[f(X) = Y] ≥ P[g(X) = Y]. The Bayes classifier is optimal!!!

Optimality of Bayes classifier Theorem: for any classifier h, P[f(X) = Y] ≥ P[h(X) = Y]. Observation: for any classifier h, conditioned on X = x, the probability of being correct is P[h(X) = Y | X = x] = 1[h(x) = 1] P[Y = 1 | X = x] + 1[h(x) = 0] P[Y = 0 | X = x]. So P[f(X) = Y | X = x] − P[h(X) = Y | X = x] ≥ 0, by the choice of f (it always picks the label with the larger conditional probability). Integrate over X to remove the conditioning.
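Writing η(x) = P[Y = 1 | X = x], the key step can be spelled out as follows (a standard argument, filling in notation the slide leaves implicit):

\[
\begin{aligned}
P[h(X)=Y \mid X=x] &= \mathbf{1}[h(x)=1]\,\eta(x) + \mathbf{1}[h(x)=0]\,(1-\eta(x)),\\
P[f(X)=Y \mid X=x] - P[h(X)=Y \mid X=x] &= \bigl(\mathbf{1}[f(x)=1]-\mathbf{1}[h(x)=1]\bigr)\bigl(2\eta(x)-1\bigr) \;\ge\; 0,
\end{aligned}
\]

since f(x) = 1 exactly when η(x) ≥ 1/2, so the two factors always have the same sign. Taking the expectation over X removes the conditioning and gives P[f(X) = Y] ≥ P[h(X) = Y].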

So is classification a solved problem? We know that the Bayes classifier is optimal. So have we solved all classification problems? Not even close! Why? How do we estimate P[Y | X]? How do we estimate P[X | Y]? How good is the model class? The quality of estimation degrades as the dimension of X increases! This is an active area of research!

Classification via Prob. Models: Variation Naïve Bayes classifier. Naïve Bayes assumption: the individual features/measurements are independent given the class label, i.e., P[X = x | Y = y] = ∏_j P[X^(j) = x^(j) | Y = y], a product over the individual coordinates of x. Advantages: computationally very simple model; quick to code. Disadvantages: does not properly capture the interdependence between features, which can give bad estimates.
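A minimal sketch of a Gaussian naïve Bayes classifier under this assumption (hypothetical data; each feature gets its own univariate Gaussian per class instead of a full covariance matrix):

import numpy as np

# Hypothetical labelled data: rows are feature vectors (height, weight)
data = {
    "male":   np.array([[70.0, 170.0], [68.0, 160.0], [72.0, 185.0], [69.0, 165.0]]),
    "female": np.array([[63.0, 125.0], [61.0, 112.0], [65.0, 138.0], [62.0, 121.0]]),
}

n_total = sum(len(X) for X in data.values())
priors = {y: len(X) / n_total for y, X in data.items()}
# Naive Bayes: one univariate Gaussian per (class, feature) pair
params = {y: (X.mean(axis=0), X.var(axis=0)) for y, X in data.items()}

def log_class_conditional(x, mu, var):
    # sum_j log N(x_j; mu_j, var_j): features treated as independent given the class
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

def classify(x):
    return max(data, key=lambda y: log_class_conditional(x, *params[y]) + np.log(priors[y]))

print(classify(np.array([66.0, 150.0])))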

How to evaluate the quality of a classifier? Your friend claims: "My classifier is better than yours." How can you evaluate this statement? Given a classifier f, we essentially need to compute its accuracy, P[f(X) = Y], under the underlying distribution. But we don't know the underlying distribution. We can use the training data to estimate it, but that severely overestimates the accuracy! Why? The training data was already used to construct f, so it is NOT an unbiased estimator.

How to evaluate the quality of a classifier? General strategy: divide the labelled data into training and test sets FIRST. Only use the training data for learning f. Then the test data can be used as an unbiased estimator for gauging the predictive accuracy of f. (Training phase: labeled training data goes to the learning algorithm, which outputs the classifier f. Testing phase: the held-out test data goes to the classifier f, whose predictions are compared against the true labels.)
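A sketch of this strategy in code, assuming hypothetical labelled arrays X and y and any classifier with a classify(x) function (such as the ones sketched above):

import numpy as np

def train_test_split(X, y, test_fraction=0.25, seed=0):
    # Shuffle once, then hold out a test_fraction of the examples for testing
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]

def test_accuracy(classify, X_test, y_test):
    # Unbiased estimate of P[f(X) = Y], since the test data was never used to build f
    predictions = np.array([classify(x) for x in X_test])
    return np.mean(predictions == y_test)

# Usage (with labelled data X, y):
# X_tr, y_tr, X_te, y_te = train_test_split(X, y)
# ... learn the classifier f from (X_tr, y_tr) only ...
# print(test_accuracy(f, X_te, y_te))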

What we learned Why machine learning Basics of Supervised Learning Maximum Likelihood Estimation Learning a classifier via probabilistic modelling Optimality of Bayes classifier Naïve Bayes classifier How to evaluate the quality of a classifier

Questions?

Next time Direct ways of finding the discrimination boundary

Remember Visit the course website http://www.cs.columbia.edu/~verma/classes/sp18/coms4771/ Review the basics (prerequisites) HW0 is out Sign up on Piazza & Gradescope Recitation section on Fri 1:10-2:25p Math 207 (optional)