ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning Topics: Classification: Naïve Bayes Readings: Barber 10.1-10.3 Stefan Lee Virginia Tech

Administrativia HW2 Due: Friday 09/28, 10/3, 11:55pm Implement linear regression, Naïve Bayes, Logistic Regression Next Tuesday's Class Review of topics Assigned readings on convexity with an optional (useful) video Might be on the exam, so brush up on this and on stochastic gradient descent. (C) Dhruv Batra 2

Administrativia Midterm Exam When: October 6th, class timing Where: In class Format: Pen-and-paper. Open-book, open-notes, closed-internet. No sharing. What to expect: a mix of multiple-choice / true-false questions, "prove this statement", "what would happen for this dataset?" Material: everything from the beginning of class to Tuesday's lecture (C) Dhruv Batra 3

New Topic: Naïve Bayes (your first probabilistic classifier) x → Classification → y, with y discrete (C) Dhruv Batra 4

Error Decomposition Approximation/Modeling Error: you approximated reality with a model. Estimation Error: you tried to learn the model with finite data. Optimization Error: you were lazy and couldn't/didn't optimize to completion. Bayes Error: reality just sucks. (C) Dhruv Batra 5

Classification Learn: h: X → Y X – features Y – target classes Suppose you know P(Y|X) exactly, how should you classify? Bayes classifier: Why? Slide Credit: Carlos Guestrin
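The decision rule the slide refers to, in its standard form (the original equation did not survive extraction):

\[ h_{\text{Bayes}}(x) = \arg\max_{y} \; P(Y = y \mid X = x) \]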

Optimal classification Theorem: the Bayes classifier h_Bayes is optimal! That is, its error is no larger than that of any other classifier. Proof: Slide Credit: Carlos Guestrin
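Stated precisely, the claim is that no classifier h can achieve lower error:

\[ P\big(h_{\text{Bayes}}(X) \neq Y\big) \;\le\; P\big(h(X) \neq Y\big) \quad \text{for every classifier } h. \]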

Generative vs. Discriminative Generative Approach: estimate p(x|y) and p(y), use Bayes rule to predict y. Discriminative Approach: estimate p(y|x) directly, OR learn a discriminant function h(x). (C) Dhruv Batra 8

Generative vs. Discriminative Generative Approach: assume some functional form for P(X|Y), P(Y); estimate P(X|Y) and P(Y); use Bayes rule to calculate P(Y|X=x). Indirect computation of P(Y|X) through Bayes rule, but can generate a sample of the data: P(X) = Σ_y P(y) P(X|y). Discriminative Approach: estimate p(y|x) directly, OR learn a discriminant function h(x). Direct, but cannot obtain a sample of the data, because P(X) is not available. (C) Dhruv Batra 9

Generative vs. Discriminative Generative: Today: Naïve Bayes. Discriminative: Next: Logistic Regression. NB & LR are related to each other. (C) Dhruv Batra 10

How hard is it to learn the optimal classifier? Categorical Data How do we represent these? How many parameters? Class-Prior, P(Y): suppose Y is composed of k classes. Likelihood, P(X|Y): suppose X is composed of d binary features. Complex model → high variance with limited data!!! Slide Credit: Carlos Guestrin
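One way to make the counting concrete, assuming a full table is stored for the likelihood:

\[ \underbrace{k-1}_{P(Y)} \;+\; \underbrace{k\,(2^{d}-1)}_{P(X \mid Y)} \quad \text{parameters,} \]

i.e., exponential in the number of features d.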

Independence to the rescue (C) Dhruv Batra Slide Credit: Sam Roweis 12

The Naïve Bayes assumption Naïve Bayes assumption: features are independent given the class: P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y). More generally, the likelihood factorizes over all d features (see below). How many parameters now? Suppose X is composed of d binary features. (C) Dhruv Batra Slide Credit: Carlos Guestrin 13
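With the factorization, the parameter count drops from exponential to linear in d (still assuming binary features):

\[ P(X_1,\dots,X_d \mid Y) = \prod_{i=1}^{d} P(X_i \mid Y) \quad\Rightarrow\quad k-1 \;+\; k\,d \ \text{parameters, versus } k-1 + k(2^{d}-1) \text{ for the full model.} \]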

The Naïve Bayes Classifier Given: class-prior P(Y), d conditionally independent features X given the class Y. For each X_i, we have the likelihood P(X_i|Y). Decision rule: pick the class maximizing the prior times the product of per-feature likelihoods. If the assumption holds, NB is the optimal classifier! (C) Dhruv Batra Slide Credit: Carlos Guestrin 14
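The decision rule written out in its standard form:

\[ y^{*} = h_{\text{NB}}(x) = \arg\max_{y}\; P(Y = y) \prod_{i=1}^{d} P(X_i = x_i \mid Y = y). \]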

MLE for the parameters of NB Given a dataset, let Count(A=a, B=b) be the number of examples where A=a and B=b. MLE for NB is simply: Class-Prior: P(Y=y) = Count(Y=y) / (number of examples). Likelihood: P(X_i=x_i | Y=y) = Count(X_i=x_i, Y=y) / Count(Y=y). (C) Dhruv Batra 15
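A minimal sketch of these count-based estimates, assuming binary features stored in a NumPy array X (one row per example) and integer labels y; the names and shapes are illustrative, not from the lecture.

import numpy as np

def nb_mle(X, y, n_classes):
    # Count-based MLE for Naive Bayes with binary features.
    n, d = X.shape
    prior = np.zeros(n_classes)            # P(Y = c)
    likelihood = np.zeros((n_classes, d))  # P(X_i = 1 | Y = c)
    for c in range(n_classes):
        in_class = (y == c)
        prior[c] = in_class.sum() / n             # Count(Y=c) / n
        likelihood[c] = X[in_class].mean(axis=0)  # Count(X_i=1, Y=c) / Count(Y=c)
    return prior, likelihood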

HW1 (C) Dhruv Batra 16

In-class demo: Naïve Bayes Variables: Y = {haven't watched an American football game, have watched an American football game}, X_1 = {domestic, international}, X_2 = {< 2 years at VT, >= 2 years}. Estimate: P(Y=1); P(X_1=0|Y=0), P(X_2=0|Y=0); P(X_1=0|Y=1), P(X_2=0|Y=1). Prediction: argmax_y P(Y=y) P(x_1|Y=y) P(x_2|Y=y) (C) Dhruv Batra 17
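The prediction step of the demo in code; the probability values below are hypothetical placeholders, not the numbers collected in class.

import numpy as np

# Hypothetical estimates (placeholders, not the in-class numbers)
prior     = np.array([0.4, 0.6])   # P(Y=0), P(Y=1)
p_x1_is_0 = np.array([0.7, 0.3])   # P(X_1=0 | Y=0), P(X_1=0 | Y=1)
p_x2_is_0 = np.array([0.5, 0.2])   # P(X_2=0 | Y=0), P(X_2=0 | Y=1)

x1, x2 = 0, 1                      # observed feature values for one student
score = prior.copy()
score *= p_x1_is_0 if x1 == 0 else 1 - p_x1_is_0
score *= p_x2_is_0 if x2 == 0 else 1 - p_x2_is_0
prediction = int(np.argmax(score))  # argmax_y P(Y=y) P(x1|y) P(x2|y)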

Subtleties of NB classifier 1 Violating the NB assumption Usually, features are not conditionally independent: P(X_1, ..., X_d | Y) ≠ ∏_i P(X_i | Y). The estimated probabilities P(Y|X) are then often biased towards 0 or 1. Nonetheless, NB is a very popular classifier; NB often performs well, even when the assumption is violated. [Domingos & Pazzani 96] discuss some conditions for good performance. (C) Dhruv Batra Slide Credit: Carlos Guestrin 18

Subtleties of NB classifier 2 Insufficient training data What if you never see a training instance where X_1=a when Y=c? e.g., Y = {NonSpamEmail}, X_1 = {'Nigeria'}. Then the MLE gives P(X_1=a|Y=c) = 0, and thus, no matter what values X_2, ..., X_d take: P(Y=c | X_1=a, X_2, ..., X_d) = 0. What now??? (C) Dhruv Batra Slide Credit: Carlos Guestrin 19

Recall MAP for Bernoulli-Beta MAP: use the most likely parameter under the posterior. A Beta prior is equivalent to extra coin flips. (C) Dhruv Batra Slide Credit: Carlos Guestrin 20
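For reference, with a Beta(α, β) prior over the heads probability and observed counts of heads and tails, the MAP estimate is

\[ \hat{\theta}_{\text{MAP}} = \frac{n_{\text{heads}} + \alpha - 1}{n_{\text{heads}} + n_{\text{tails}} + \alpha + \beta - 2}, \]

so the prior acts exactly like α−1 extra heads and β−1 extra tails.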

Bayesian learning for NB parameters a.k.a. smoothing Put a prior on the parameters: Dirichlet all the things! Use the MAP estimate. Now, even if you never observe a feature/class combination, the estimated probability is never zero. (C) Dhruv Batra Slide Credit: Carlos Guestrin 21
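A common concrete instance is add-α (Laplace) smoothing, which is the posterior-mean estimate under a symmetric Dirichlet(α) prior:

\[ \hat{P}(X_i = x \mid Y = y) = \frac{\text{Count}(X_i = x,\, Y = y) + \alpha}{\text{Count}(Y = y) + \alpha\,V_i}, \]

where V_i is the number of values X_i can take; with α > 0 the estimate is never zero.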

Text classification Classify e-mails: Y = {Spam, NotSpam}. Classify news articles: Y = {what is the topic of the article?}. Classify webpages: Y = {student, professor, project, ...}. What about the features X? The text! (C) Dhruv Batra Slide Credit: Carlos Guestrin 22

Features X are the entire document, with X_i the i-th word in the article. (C) Dhruv Batra Slide Credit: Carlos Guestrin 23

NB for Text classification P(X|Y) is huge!!! An article is at least 1000 words, X = {X_1, ..., X_1000}. X_i represents the i-th word in the document, i.e., the domain of X_i is the entire vocabulary, e.g., Webster's Dictionary (or more), 10,000 words, etc. The NB assumption helps a lot!!! P(X_i=x_i|Y=y) is just the probability of observing word x_i in a document on topic y. (C) Dhruv Batra Slide Credit: Carlos Guestrin 24

Bag of Words model Typical additional assumption: position in the document doesn't matter: P(X_i=a|Y=y) = P(X_k=a|Y=y). Bag-of-words model: the order of words on the page is ignored. Sounds really silly, but often works very well! Example sentence: "When the lecture is over, remember to wake up the person sitting next to you in the lecture room." (C) Dhruv Batra Slide Credit: Carlos Guestrin 25
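Under this position-independence assumption, the document likelihood depends only on word counts:

\[ P(X \mid Y = y) \;=\; \prod_{i=1}^{L} P(X_i = x_i \mid y) \;=\; \prod_{w \in \text{vocabulary}} P(w \mid y)^{\,\text{count}(w)}, \]

where L is the document length and count(w) is how often word w appears in the document.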

Bag of Words model Typical additional assumption: position in the document doesn't matter: P(X_i=a|Y=y) = P(X_k=a|Y=y). Bag-of-words model: the order of words on the page is ignored. Sounds really silly, but often works very well! The same sentence as a bag of words: in is lecture lecture next over person remember room sitting the the the to to up wake when you (C) Dhruv Batra Slide Credit: Carlos Guestrin 26

Bag of Words model Example word-count vector: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, Zaire 0. (C) Dhruv Batra Slide Credit: Carlos Guestrin 27
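A minimal sketch of turning a document into such counts and scoring one class with them; it assumes per-class log word probabilities have already been estimated (e.g., with the smoothed counts above), and the small floor for unseen words is a simplification, not part of the lecture.

from collections import Counter
import math

def bag_of_words(text):
    # Lowercased word counts; word order is discarded.
    return Counter(text.lower().split())

def nb_log_score(counts, log_prior, log_p_word):
    # log P(Y=y) + sum_w count(w) * log P(w | y) for one class y.
    floor = math.log(1e-9)  # crude handling of words unseen for this class
    return log_prior + sum(c * log_p_word.get(w, floor) for w, c in counts.items())

counts = bag_of_words("oil prices in Zaire and Africa about to rise about now")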

Object Bag of words (C) Dhruv Batra Slide Credit: Fei Fei Li 28

(C) Dhruv Batra Slide Credit: Fei Fei Li 29

Visual bag-of-words pipeline (learning and recognition): feature detection & representation → codewords dictionary → image representation → category models (and/or) classifiers → category decision. (C) Dhruv Batra Slide Credit: Fei Fei Li 30

What if we have continuous X_i? E.g., character recognition: X_i is the i-th pixel. Gaussian Naïve Bayes (GNB): model each P(X_i|Y=y) as a Gaussian. Sometimes assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ). (C) Dhruv Batra Slide Credit: Carlos Guestrin 31
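The class-conditional density this refers to, per feature i and class y:

\[ P(X_i = x \mid Y = y) = \frac{1}{\sqrt{2\pi}\,\sigma_{iy}} \exp\!\left( -\frac{(x - \mu_{iy})^{2}}{2\sigma_{iy}^{2}} \right) \]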

Estimating Parameters: Y discrete, X_i continuous Maximum likelihood estimates: (C) Dhruv Batra 32
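The standard estimates, since the slide's equations did not survive extraction (N_y is the number of training examples with label y, and x_i^{(j)} is feature i of example j):

\[ \hat{\mu}_{iy} = \frac{1}{N_y} \sum_{j:\, y^{(j)} = y} x_i^{(j)}, \qquad \hat{\sigma}_{iy}^{2} = \frac{1}{N_y} \sum_{j:\, y^{(j)} = y} \big( x_i^{(j)} - \hat{\mu}_{iy} \big)^{2}. \]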

What you need to know about NB Optimal decision using the Bayes classifier. Naïve Bayes classifier: what the assumption is, why we use it, how we learn it, and why Bayesian estimation of the NB parameters is important. Text classification: bag-of-words model. Gaussian NB: features are still conditionally independent; each feature has a Gaussian distribution given the class. (C) Dhruv Batra 33

Generative vs. Discriminative Generative Approach: estimate p(x|y) and p(y), use Bayes rule to predict y. Discriminative Approach: estimate p(y|x) directly, OR learn a discriminant function h(x). (C) Dhruv Batra 34

Today: Logistic Regression Main idea: think about a 2-class problem {0,1}. Can we regress to P(Y=1|X=x)? Meet the logistic or sigmoid function: it crunches real numbers down to (0,1). Model: in linear regression, y ~ N(w·x, λ²); in logistic regression, y ~ Bernoulli(σ(w·x)). (C) Dhruv Batra 35

Understanding the sigmoid σ(w_0 + Σ_i w_i x_i) = 1 / (1 + e^{-(w_0 + Σ_i w_i x_i)}) [Plots of the sigmoid over x in [-6, 6] for w_0=2, w_1=1; w_0=0, w_1=1; and w_0=0, w_1=0.5.] (C) Dhruv Batra Slide Credit: Carlos Guestrin 36
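A small sketch reproducing the three curves described above (the parameter settings are from the slide; the plotting itself is left out):

import numpy as np

def sigmoid(x, w0, w1):
    # Logistic function of a 1-D input: 1 / (1 + exp(-(w0 + w1 * x)))
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))

xs = np.linspace(-6, 6, 200)
curves = {
    (2, 1):   sigmoid(xs, 2, 1),    # w0 = 2, w1 = 1: curve shifted left
    (0, 1):   sigmoid(xs, 0, 1),    # w0 = 0, w1 = 1: standard sigmoid
    (0, 0.5): sigmoid(xs, 0, 0.5),  # w0 = 0, w1 = 0.5: shallower slope
}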