CSE 258 Lecture 3. Web Mining and Recommender Systems. Supervised learning: Classification

Last week Last week we started looking at supervised learning problems

Last week We studied linear regression, in order to learn linear relationships between features and parameters to predict real-valued outputs: y ≃ X θ, where X is the matrix of features (data), y is the vector of outputs (labels), and θ is the vector of unknowns (which features are relevant)

Last week (figure: predicting ratings from a matrix of features)

Four important ideas from last week: 1) Regression can be cast in terms of maximizing a likelihood

Four important ideas from last week: 2) Gradient descent for model optimization: 1. Initialize θ at random; 2. While (not converged): take a step against the gradient, θ := θ − α f'(θ)
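A minimal sketch of this procedure for least-squares regression (the step size alpha, tolerance, and iteration cap below are illustrative assumptions, not part of the original slides):

```python
import numpy

def gradient_descent(X, y, alpha=0.01, tol=1e-6, max_iters=10000):
    # Minimize f(theta) = ||y - X theta||^2 by gradient descent
    theta = numpy.random.random(X.shape[1])      # 1. Initialize at random
    for _ in range(max_iters):                   # 2. While (not converged) do
        gradient = -2 * X.T.dot(y - X.dot(theta))
        theta_new = theta - alpha * gradient     #    step against the gradient
        if numpy.linalg.norm(theta_new - theta) < tol:
            break                                #    converged
        theta = theta_new
    return theta
```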

Four important ideas from last week: 3) Regularization & Occam's razor. Regularization is the process of penalizing model complexity during training. How much should we trade off accuracy versus complexity?

Four important ideas from last week: 4) Regularization pipeline: 1. Training set: select model parameters; 2. Validation set: choose amongst models (i.e., hyperparameters); 3. Test set: just for testing!
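A minimal sketch of this pipeline for ridge regression (the candidate values of lambda and the closed-form solver are illustrative assumptions):

```python
import numpy

def ridge_fit(X, y, lamb):
    # Closed-form ridge regression: theta = (X^T X + lambda*I)^-1 X^T y
    return numpy.linalg.solve(X.T.dot(X) + lamb * numpy.eye(X.shape[1]), X.T.dot(y))

def mse(X, y, theta):
    return numpy.mean((y - X.dot(theta)) ** 2)

def select_model(X_train, y_train, X_valid, y_valid, X_test, y_test):
    best = None
    for lamb in [0.01, 0.1, 1.0, 10.0, 100.0]:
        theta = ridge_fit(X_train, y_train, lamb)   # 1. fit parameters on the training set
        err = mse(X_valid, y_valid, theta)          # 2. compare models on the validation set
        if best is None or err < best[0]:
            best = (err, lamb, theta)
    _, lamb, theta = best
    return lamb, mse(X_test, y_test, theta)         # 3. report error on the test set, only once
```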

Model selection A few theorems about training, validation, and test sets: (1) the training error increases as lambda increases; (2) the validation and test error are at least as large as the training error (assuming infinitely large random partitions); (3) the validation/test error will usually have a "sweet spot" between under- and over-fitting

Today How can we predict binary or categorical variables? {0,1}, {True, False}, {1, ..., N}

Today Will I purchase this product? (yes) Will I click on this ad? (no)

Today What animal appears in this image? (mandarin duck)

Today What are the categories of the item being described? (book, fiction, philosophical fiction)

Today We'll attempt to build classifiers that make decisions according to rules of the form: predict a positive label if X_i · θ > 0, and a negative label otherwise

This week 1. Naïve Bayes: assumes that the features are conditionally independent given the class label, and learns a simple model by counting 2. Logistic regression: adapts the regression approaches we saw last week to binary problems 3. Support Vector Machines: learns to classify items by finding a hyperplane that separates them

This week Ranking results in order of how likely they are to be relevant

This week Evaluating classifiers: false positives are nuisances but false negatives are disastrous (or vice versa); some classes are very rare; sometimes we only care about the most confident predictions (e.g. which of these bags contains a weapon?)

Naïve Bayes We want to associate a probability with a label and its negation: p(label | data) and p(¬label | data) (classify according to whichever probability is greater than 0.5) Q: How far can we get just by counting?

Naïve Bayes e.g. p(movie is action | Schwarzenegger in cast). Just count! #films with Arnold = 45, #action films with Arnold = 32, so p(movie is action | Schwarzenegger in cast) = 32/45
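A minimal sketch of this counting estimate; the list of movies below is made-up data purely for illustration:

```python
# Hypothetical data: each movie is (set of cast members, set of genres)
movies = [
    ({"Arnold Schwarzenegger", "Linda Hamilton"}, {"Action", "Sci-Fi"}),
    ({"Arnold Schwarzenegger", "Danny DeVito"}, {"Comedy"}),
]

with_arnold = [m for m in movies if "Arnold Schwarzenegger" in m[0]]
action_with_arnold = [m for m in with_arnold if "Action" in m[1]]

# p(movie is action | Schwarzenegger in cast), estimated just by counting
p = len(action_with_arnold) / len(with_arnold)
```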

Naïve Bayes What about: p(movie is action | Schwarzenegger in cast and release year = 2017 and MPAA rating = PG and budget < $1,000,000)? #(training) films with Arnold, released in 2017, rated PG, with a budget below $1M = 0; #(training) action films with Arnold, released in 2017, rated PG, with a budget below $1M = 0

Naïve Bayes Q: If we ve never seen this combination of features before, what can we conclude about their probability? A: We need some simplifying assumption in order to associate a probability with this feature combination

Naïve Bayes Naïve Bayes assumes that features are conditionally independent given the label

Naïve Bayes (conditional independence) p(feature_1, ..., feature_N | label) = p(feature_1 | label) × p(feature_2 | label) × ... × p(feature_N | label)

Conditional independence? (a is conditionally independent of b, given c): p(b | a, c) = p(b | c), i.e. if you know c, then knowing a provides no additional information about b

Naïve Bayes (Bayes' rule): p(label | features) = p(label) × p(features | label) / p(features)

Naïve Bayes In this expression, p(label | features) is the posterior, p(label) is the prior, p(features | label) is the likelihood, and p(features) is the evidence

Naïve Bayes The denominator doesn't matter, because we really just care about p(label | features) vs. p(¬label | features), both of which have the same denominator p(features)

Example 1 Amazon editorial descriptions: 50k descriptions: http://jmcauley.ucsd.edu/cse258/data/amazon/book_descriptions_50000.json

Example 1 p(book is a children's book | 'wizard' is mentioned in the description and 'witch' is mentioned in the description) Code available on: http://jmcauley.ucsd.edu/cse258/code/week2.py

Example 1 Conditional independence assumption: if you know a book is for children, then knowing that wizards are mentioned provides no additional information about whether witches are mentioned (obviously ridiculous)
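A minimal sketch of this naïve Bayes estimate (not the course's week2.py); the field names 'description' and 'categories', and the category string "Children's Books", are assumptions about the JSON records:

```python
import json

def children_given_wizard_and_witch(path):
    data = [json.loads(line) for line in open(path)]

    # Hypothetical fields: d['description'] is text, d['categories'] is a list of labels
    def is_child(d): return "Children's Books" in d.get('categories', [])
    def mentions(d, w): return w in d.get('description', '').lower()

    pos = [d for d in data if is_child(d)]
    neg = [d for d in data if not is_child(d)]

    # Naive Bayes numerator: prior * product of per-feature likelihoods
    def score(docs):
        prior = len(docs) / len(data)
        p_wizard = sum(mentions(d, 'wizard') for d in docs) / len(docs)
        p_witch = sum(mentions(d, 'witch') for d in docs) / len(docs)
        return prior * p_wizard * p_witch

    # Classify as a children's book if the positive score exceeds the negative one
    return score(pos) > score(neg)
```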

Double-counting Q: What would happen if we trained two regressors, and attempted to naively combine their parameters?

Double-counting

Double-counting A: Since both features encode essentially the same information, we'll end up double-counting their effect

Logistic regression Logistic regression also aims to model p(label | data), by training a classifier based on the real-valued expression X_i · θ

Logistic regression Last week: regression. This week: logistic regression

Logistic regression Q: How to convert a real-valued expression (X_i · θ, which can be any value in (−∞, ∞)) into a probability (p_θ(label | data), which must lie in [0, 1])?

Logistic regression A: sigmoid function: σ(t) = 1 / (1 + e^(−t))

Logistic regression Training: the predicted probability σ(X_i · θ) should be maximized when the label y_i is positive and minimized when it is negative

Logistic regression How to optimize? Take the logarithm; subtract a regularizer; compute the gradient; solve using gradient ascent (derivation on the blackboard)

Logistic regression Log-likelihood (with an L2 regularizer): l(θ) = Σ_{i : y_i = 1} log σ(X_i · θ) + Σ_{i : y_i = 0} log(1 − σ(X_i · θ)) − λ ||θ||²

Logistic regression Gradient: ∂l/∂θ_k = Σ_i X_{ik} (y_i − σ(X_i · θ)) − 2 λ θ_k
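A minimal sketch of training such a model with scipy's L-BFGS routine (scipy minimizes, so we negate the regularized log-likelihood); the regularization strength lam and the choice of fmin_l_bfgs_b are illustrative, not necessarily what the course code does:

```python
import numpy
from scipy.optimize import fmin_l_bfgs_b

def sigmoid(t):
    return 1.0 / (1 + numpy.exp(-t))

def f(theta, X, y, lam):
    # Negative regularized log-likelihood; y is a 0/1 vector, X a feature matrix
    t = X.dot(theta)
    loglik = numpy.sum(y * t - numpy.logaddexp(0, t))   # sum_i log p_theta(y_i | X_i)
    return -(loglik - lam * numpy.sum(theta ** 2))

def fprime(theta, X, y, lam):
    # Gradient of the negative regularized log-likelihood
    grad = X.T.dot(y - sigmoid(X.dot(theta))) - 2 * lam * theta
    return -grad

def train(X, y, lam=1.0):
    theta, _, _ = fmin_l_bfgs_b(f, numpy.zeros(X.shape[1]), fprime, args=(X, y, lam))
    return theta
```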

Multiclass classification The most common way to generalize binary classification (output in {0,1}) to multiclass classification (output in {1, ..., N}) is simply to train a binary predictor for each class, e.g. based on the description of this book: Is it a Children's book? {yes, no} Is it a Romance? {yes, no} Is it Science Fiction? {yes, no} In the event that predictions are inconsistent, choose the one with the highest confidence
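A minimal one-vs-all sketch, using scikit-learn's LogisticRegression as the binary predictor (an illustrative choice; the entries of `classes` are whatever labels your data uses):

```python
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, labels, classes):
    # One binary classifier per class: "is it class c?" vs. "is it not?"
    return {c: LogisticRegression().fit(X, [l == c for l in labels]) for c in classes}

def predict(x, models):
    # If the binary predictions disagree, pick the most confident class,
    # i.e. the one whose classifier puts x furthest on the positive side
    return max(models, key=lambda c: models[c].decision_function([x])[0])
```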

Questions? Further reading: "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naïve Bayes" (Ng & Jordan '01); the Broyden-Fletcher-Goldfarb-Shanno algorithm (BFGS)

CSE 258 Lecture 3. Web Mining and Recommender Systems. Supervised learning: SVMs

Logistic regression Q: Where would a logistic regressor place the decision boundary for these features? (figure: positive and negative examples in feature space, with two candidate boundaries labeled a and b)

Logistic regression Q: Where would a logistic regressor place the decision boundary for these features? (figure, annotated: points far from boundary b are easy to classify; points near it are hard to classify)

Logistic regression Logistic regressors don't optimize the number of mistakes: no special attention is paid to the difficult instances; every instance influences the model, and easy instances can affect the model (in a bad way!) How can we develop a classifier that optimizes the number of mislabeled examples?

Support Vector Machines This is essentially the intuition behind Support Vector Machines (SVMs): train a classifier that focuses on the difficult examples by minimizing the misclassification error. We still want a classifier of the form X_i · θ > 0, but we want to minimize the number of misclassifications (points that end up on the wrong side of the boundary)

Support Vector Machines

Support Vector Machines Simple (separable) case: there exists a perfect classifier

Support Vector Machines The classifier is defined by the separating hyperplane θ · x = 0

Support Vector Machines Q: Is one of these classifiers preferable to the others?

Support Vector Machines A: Choose the classifier that maximizes the distance d to the nearest point

Support Vector Machines Distance from a point to a line? For the hyperplane θ · x = 0, the distance from a point x_0 is |θ · x_0| / ||θ||

Support Vector Machines minimize ½ ||θ||² such that y_i (X_i · θ) ≥ 1 for all i (labels y_i in {−1, +1}); the points that satisfy the constraint with equality, i.e. lie exactly on the margin, are the support vectors

Support Vector Machines This is known as a quadratic program (QP) and can be solved using standard techniques. See e.g. Nocedal & Wright, "Numerical Optimization", 2006

Support Vector Machines But: is finding such a separating hyperplane even possible?

Support Vector Machines Or: is it actually a good idea?

Support Vector Machines Want the margin to be as wide as possible, while penalizing points on the wrong side of it

Support Vector Machines Soft-margin formulation: minimize ½ ||θ||² + C Σ_i ξ_i such that y_i (X_i · θ) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i
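Eliminating the slack variables gives the equivalent hinge-loss objective ½||θ||² + C Σ_i max(0, 1 − y_i (X_i · θ)). A minimal (sub)gradient-descent sketch of minimizing it (step size and iteration count are illustrative; labels y are −1/+1):

```python
import numpy

def svm_subgradient(X, y, C=1.0, alpha=0.001, iters=1000):
    # Minimize 0.5 * ||theta||^2 + C * sum_i max(0, 1 - y_i * (X_i . theta))
    theta = numpy.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * X.dot(theta)
        violated = margins < 1     # points inside the margin or on the wrong side
        subgrad = theta - C * X[violated].T.dot(y[violated])
        theta -= alpha * subgrad
    return theta
```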

Judging a book by its cover [0.723845, 0.153926, 0.757238, 0.983643, ...] 4096-dimensional image features. Image features are available for each book on http://jmcauley.ucsd.edu/cse258/data/amazon/book_images_5000.json (see http://caffe.berkeleyvision.org/)

Judging a book by its cover Example: train an SVM to predict whether a book is a children's book from its cover art (code available on http://jmcauley.ucsd.edu/cse258/code/week2.py)
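A minimal sketch of such a classifier using scikit-learn (not necessarily how week2.py is written); the field names 'image_feature' and 'categories', the category string, and C=1000 are illustrative assumptions:

```python
import json
from sklearn import svm

data = [json.loads(line) for line in open("book_images_5000.json")]

# Hypothetical fields: a 4096-d image feature vector and a list of category labels
X = [d['image_feature'] for d in data]
y = ["Children's Books" in d['categories'] for d in data]

# Linear SVM; C trades off margin width against misclassification of training points
clf = svm.LinearSVC(C=1000)
clf.fit(X[:2500], y[:2500])                 # train on the first half of the data

predictions = clf.predict(X[2500:])         # evaluate on the second half
accuracy = sum(p == l for p, l in zip(predictions, y[2500:])) / len(predictions)
```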

Judging a book by its cover The number of errors we made was extremely low, yet our classifier doesn't seem to be very good. Why? (stay tuned for the next lecture!)

Summary The classifiers we've seen today all attempt to make decisions by associating weights (theta) with features (x) and classifying according to whether x · θ > 0

Summary Naïve Bayes: a probabilistic model (fits p(label | data)); makes a conditional independence assumption of the form p(feature_i | label, other features) = p(feature_i | label), allowing us to define the model by computing p(feature_i | label) for each feature; simple to compute just by counting. Logistic regression: fixes the double-counting problem present in naïve Bayes. SVMs: non-probabilistic: optimizes the classification error rather than the likelihood

Questions?