ECS171: Machine Learning Lecture 1: Overview of class, LFD 1.1, 1.2 Cho-Jui Hsieh UC Davis Jan 8, 2018
Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ECS171_Winter2018/main.html and Canvas My office: Mathematical Sciences Building (MSB) 4232 Office hours: Tuesday 1pm-2pm, MSB 4232 (starting next week) TAs: Patrick Chen (phpchen@ucdavis.edu), Xuanqing Liu (xqliu@ucdavis.edu) Office hour: Thursday 10AM-11AM, Kemper 55 (starting next week) My email: chohsieh@ucdavis.edu
Course Information Course Material: Part I (before midterm exam): Use the book Learning from Data (LFD) by Abu-Mostafa, Magdon-Ismail and Hsuan-Tien Lin Foundations of machine learning: why can we learn from data? overfitting, underfitting, training vs. testing, regularization 11 lectures Most slides are based on Yaser Abu-Mostafa (Caltech): http://work.caltech.edu/lectures.html#lectures and Hsuan-Tien Lin (NTU): https://www.csie.ntu.edu.tw/~htlin/course/mlfound17fall/ Part II: Introduce some practical machine learning models: deep learning, kernel methods, boosting, tree-based approaches, clustering, dimensionality reduction
Grading Policy Midterm (30%) Written exam for Part I Homework (30%) 2 or 3 homeworks Final project (40%) Competition?
Final project Group of 4 students We will announce the dataset and task Kaggle-style competition: upload your model/prediction online and our website will report the accuracy Final report: report the algorithms you have tested and the implementation details, and discuss your findings
The Learning Problem
From learning to machine learning What is learning? Learning: observations → learning → skill Machine learning: data → machine learning → skill Machine learning automates the learning process! Skill: how to make a decision (action), e.g., classify an image, predict bitcoin price...
Example: movie recommendation Data: user-movie ratings Skill: predict how a user rates an unrated movie Known as the Netflix problem: a competition held by Netflix in 2006 100 million ratings, 480K users, 17K movies 10% improvement over the baseline won a 1 million dollar prize
Movie rating - a solution Each viewer and each movie is associated with a latent factor vector Prediction: rating ≈ inner product of the viewer and movie factors Learning: fit the viewer/movie factors to the known ratings
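As a sketch, the latent-factor prediction above can be written in a few lines of NumPy. The factor matrices U, V, their dimension k, and the sizes are made up for illustration; fitting them to the known ratings (e.g., by gradient descent on squared error) is omitted here.

```python
import numpy as np

# Hypothetical latent factors: each viewer and each movie gets a
# k-dimensional vector; a rating is predicted by their inner product.
k = 3
rng = np.random.default_rng(0)
U = rng.normal(size=(4, k))   # factors for 4 viewers (made-up values)
V = rng.normal(size=(5, k))   # factors for 5 movies (made-up values)

def predict(viewer, movie):
    """Predicted rating = inner product of the two latent-factor vectors."""
    return float(U[viewer] @ V[movie])

# Learning would adjust U and V so that predict(i, j) matches the
# known ratings; that optimization step is not shown in this sketch.
print(predict(0, 1))
```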
Credit Approval Problem Customer record: attributes of the applicant To be learned: is approving the credit card good for the bank?
Formalize the Learning Problem Input: x ∈ X (customer application), e.g., x = [23, 1, 1000000, 1, 0.5, 200000] Output: y ∈ Y (good/bad after approving the credit card) Target function to be learned: f : X → Y (ideal credit approval formula) Data (historical records in the bank): D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} Hypothesis (function): g : X → Y (learned formula to be used)
Basic Setup of Learning Problem
Learning Model A learning model has two components: The hypothesis set H: the set of candidate hypotheses (functions) The learning algorithm: picks a hypothesis (function) from H Usually an optimization algorithm (choose the best function by minimizing the training error)
Perceptron Our first ML model: the perceptron (1957) Learns a linear function A single-layer neural network Next, we introduce the two components of the perceptron: What's the hypothesis space? What's the learning algorithm?
Perceptron Hypothesis Space Define the hypothesis set H For input x = (x_1, ..., x_d), the attributes of a customer: Approve credit if Σ_{i=1}^d w_i x_i > threshold, deny credit if Σ_{i=1}^d w_i x_i < threshold Define Y = {+1 (good), −1 (bad)} Linear hypothesis space H: all h of the form h(x) = sign(Σ_{i=1}^d w_i x_i − threshold) (perceptron hypothesis)
Perceptron Hypothesis Space (cont'd) Introduce an artificial coordinate x_0 = 1 and set w_0 = −threshold: h(x) = sign(Σ_{i=1}^d w_i x_i − threshold) = sign(Σ_{i=0}^d w_i x_i) = sign(w^T x) (vector form) Customer features x: points in R^d (d-dimensional space) Labels y: +1 or −1 Hypotheses h: linear hyperplanes
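The vector form h(x) = sign(w^T x) is easy to sketch in code. The weight values below are made up for illustration, with w[0] playing the role of −threshold.

```python
import numpy as np

def perceptron_h(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x), after prepending the
    artificial coordinate x_0 = 1 so that w[0] acts as -threshold."""
    x = np.concatenate(([1.0], x))   # artificial coordinate x_0 = 1
    return 1 if w @ x > 0 else -1

# Made-up weights: threshold 2, feature weights 1 and 0.5
w = np.array([-2.0, 1.0, 0.5])
print(perceptron_h(w, np.array([3.0, 1.0])))   # 1*3 + 0.5*1 - 2 = 1.5 > 0 → +1
print(perceptron_h(w, np.array([0.0, 0.0])))   # -2 < 0 → -1
```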
Select g from H H: all possible linear hyperplanes How to select the best one? We want g(x_n) ≈ f(x_n) = y_n for most n = 1, ..., N Naive approach: test every h ∈ H and choose the one minimizing the training error: train error = (1/N) Σ_{n=1}^N I(h(x_n) ≠ y_n) (I(·): indicator function) Difficulty: H is of infinite size
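The training error above is just the fraction of misclassified training points; a minimal sketch, using a made-up fixed hypothesis h and toy data for illustration:

```python
import numpy as np

def train_error(h, X, y):
    """Training error: (1/N) * sum over n of I(h(x_n) != y_n),
    i.e., the fraction of training points h misclassifies."""
    preds = np.array([h(x) for x in X])
    return float(np.mean(preds != y))

# Toy 1-D data (made up for illustration)
X = np.array([[1.0], [-1.0], [2.0]])
y = np.array([1, -1, -1])
h = lambda x: 1 if x[0] > 0 else -1   # a fixed hypothesis, not learned
print(train_error(h, X, y))           # misclassifies only the third point → 1/3
```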
Perceptron Learning Algorithm Perceptron Learning Algorithm (PLA): Initialize w (e.g., w = 0) For t = 1, 2, ...: Find a misclassified point n(t): sign(w^T x_{n(t)}) ≠ y_{n(t)} Update the weight vector: w ← w + y_{n(t)} x_{n(t)}
PLA Iteratively: find a misclassified point, then rotate the hyperplane according to that misclassified point
Perceptron Learning Algorithm Converges in the linearly separable case Linearly separable: there exists a perceptron (linear) hypothesis f with 0 training error PLA is then guaranteed to find such a hypothesis (it stops when there are no more misclassified points)
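Putting the pieces together, a minimal PLA sketch on a made-up linearly separable toy set. The update rule and stopping condition follow the slides; treating sign(0) as −1 is an assumed convention, and max_iter is a safety cap for non-separable data.

```python
import numpy as np

def pla(X, y, max_iter=1000):
    """Perceptron Learning Algorithm. Prepends the artificial coordinate
    x_0 = 1, then repeatedly picks a misclassified point n and updates
    w <- w + y_n * x_n; stops at zero training error, which is guaranteed
    to happen when the data are linearly separable."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # artificial coordinate x_0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        preds = np.sign(Xb @ w)
        preds[preds == 0] = -1                  # assumed convention: sign(0) → -1
        wrong = np.flatnonzero(preds != y)
        if wrong.size == 0:
            return w                            # no misclassified points: stop
        n = wrong[0]
        w = w + y[n] * Xb[n]                    # rotate hyperplane toward x_n
    return w

# Made-up linearly separable toy data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = pla(X, y)
Xb = np.hstack([np.ones((len(X), 1)), X])
print((np.sign(Xb @ w) == y).all())   # True: zero training error
```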
Binary classification Data: Features for each training example: {x_n}_{n=1}^N, each x_n ∈ R^d Labels for each training example: y_n ∈ {+1, −1} Goal: learn a function f : R^d → {+1, −1} Examples: Credit approve/disapprove Email spam/not spam Patient sick/not sick ...
Other types of labels - Multi-class Multi-class classification: y_n ∈ {1, ..., C} (C-way classification) Example: coin recognition Classify coins by two features (size, mass), so x_n ∈ R^2 y_n ∈ Y = {1c, 5c, 10c, 25c} (Y = {1, 2, 3, 4}) Other examples: hand-written digits, ...
Other types of labels - Regression Regression: y_n ∈ R (the output is a real number) Examples: stock price prediction, movie rating prediction
Other types of labels - structure prediction Example sentence: I (pronoun) love (verb) ML (noun) Multiclass classification for each word (word → word class) does not use information from the whole sentence Structure prediction problem: sentence → structure (class of each word) Other examples: speech recognition, image captioning, ...
Machine Learning Problems Machine learning problems can usually be categorized into: Supervised learning: every x_n comes with a label y_n (see also semi-supervised learning) Unsupervised learning: only x_n, no y_n Reinforcement learning: examples contain (input, some output, grade for this output)
Unsupervised Learning (no y_n) Clustering: given examples x_1, ..., x_N, group them into K classes Other unsupervised learning problems: Outlier detection: {x_n} → unusual(x) Dimensionality reduction ...
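The slide only names clustering in general; as one concrete (assumed) instance, a minimal K-means sketch on made-up 2-D data. K-means alternates between assigning each point to its nearest center and recomputing each center as the mean of its assigned points.

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Minimal K-means: alternate nearest-center assignment and
    center recomputation. Initialization and iteration count are
    simple illustrative choices, not a production setup."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]   # K random points
    for _ in range(iters):
        # Squared distance from every point to every center: shape (N, K)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                       # nearest center
        for k in range(K):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

# Two obvious groups of made-up points
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, _ = kmeans(X, K=2)
print(labels[0] == labels[1] and labels[2] == labels[3])  # True: groups recovered
```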
Semi-supervised learning Only some (few) of the x_n have labels y_n Labeled data is much more expensive than unlabeled data
Reinforcement Learning Used widely in game AI and robotic control The agent observes state S_t The agent takes action A_t (chosen by the ML model based on S_t) The environment gives the agent reward R_t The environment gives the agent the next state S_{t+1} Only the grade for the chosen action is observed (the best action is not revealed) Example: ads system, with examples of the form (customer, ad choice, click or not)
Conclusions Two components of ML: Set up a hypothesis space (potential functions) Develop an algorithm to choose a good hypothesis based on training examples The perceptron algorithm (linear classification) Supervised vs. unsupervised learning Next class: LFD 1.3, 1.4 Questions?