CS 6140: Machine Learning, Spring 2017. Instructor: Lu Wang, College of Computer and Information Science, Northeastern University. Webpage: www.ccs.neu.edu/home/luwang. Email: luwang@ccs.neu.edu
Time and Location. Time: Thursdays from 6:00 pm to 9:00 pm. Location: Forsyth Building 129
Course Webpage: http://www.ccs.neu.edu/home/luwang/courses/cs6140_sp2017.html
Prerequisites. Programming: able to write code proficiently in some programming language (e.g., Python, Java, C/C++, Matlab). Courses: algorithms; probability and statistics; linear algebra. A quiz: 22 simple questions, 20 of them True or False (relevant to probability, statistics, and linear algebra). The purpose of the quiz is to indicate the expected background of students; 80% of the questions should be easy to answer. Not counted in your final score!
Textbook and References. Main textbook: Kevin Murphy, "Machine Learning: A Probabilistic Perspective", MIT Press, 2012. Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006. Other textbooks: Tom Mitchell, "Machine Learning", McGraw Hill, 1997. Machine learning lectures
Content of the Course. Regression: linear regression, logistic regression. Dimensionality Reduction: Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis. Probabilistic Models: Naive Bayes, maximum likelihood estimation. Statistical Learning Theory: VC dimension. Kernels: Support Vector Machines (SVMs), kernel tricks, duality. Sequential and Structural Models: Hidden Markov Models (HMMs), Conditional Random Fields (CRFs). Clustering: spectral clustering, hierarchical clustering. Latent Variable Models: K-means, mixture models, expectation-maximization (EM) algorithms, Latent Dirichlet Allocation (LDA), representation learning. Deep Learning: feedforward neural networks, restricted Boltzmann machines, autoencoders, recurrent neural networks, convolutional neural networks. Reinforcement Learning: Markov decision processes, Q-learning and others, including advanced topics for machine learning in natural language processing and text analysis
The Goal. Scientific understanding of machine learning models; how to apply and design learning methods for novel problems. Not only what, but also why!
Grading. Assignments: 3 assignments, 10% each (30%). Quizzes: 10 in-class tests, 1% each (10%). Exam: 1 exam, 30%. Project: 1 project, 27%. Participation: 3% (in class and on Piazza).
Exam. Open book. April 20, 2017
Course Project. A machine-learning-relevant research project; 2-3 students per team
Topics. Machine learning relevant: natural language processing, computer vision, robotics, bioinformatics, health informatics
Course Project Grading. We want to see novel and interesting projects! The problem should be well-defined, novel, and useful; the project should apply practical machine learning techniques and report reasonable results and observations.
Projects from Last Year: Predicting Follow-back Behavior in Instagram Users; Predicting Grasp Points Using Convolutional Neural Networks; Artificial Neural Networks for Drug Response Prediction in Tailored Therapy; Threat Detection from Twitter; Player Ranking in Popular Games
Course Project Grading. Three reports: proposal (2%); progress, with code (10%); final, with code (10%). One presentation: in class (5%).
Submission and Late Policy. Each assignment or report, both electronic copy and hard copy, is due at the beginning of class on the corresponding due date. Programming language: Python, Java, C/C++, or Matlab. Electronic version: on Blackboard. Hard copy: in class.
Submission and Late Policy. An assignment or report turned in late will be charged 10 points (out of 100) for each late day (i.e., 24 hours). Each student has a budget of 5 late days for the semester before the penalty is applied.
How to find us? Course webpage: http://www.ccs.neu.edu/home/luwang/courses/cs6140_sp2017.html. Office hours: Lu Wang, Thursdays from 4:30 pm to 5:30 pm, or by appointment, 448 WVH; Rui Dong (TA), Tuesdays from 4:00 pm to 5:00 pm, or by appointment, 466B WVH. Piazza: http://piazza.com/northeastern/spring2017/cs614002. All course-relevant questions go here.
What is Machine Learning? A set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty.
Real World Applications. (A series of example slides.)
Relations with Other Areas: natural language processing, computer vision, robotics, and a lot of other areas
Today's Outline. Basic concepts in machine learning; K-nearest neighbors; linear regression; ridge regression
Supervised vs. Unsupervised Learning. Supervised learning: a training set of training samples, each with a gold-standard label. Classification if the label is categorical; regression if it is numerical.
Supervised Learning. Goal: generalize to new input samples. Overfitting vs. underfitting; one solution is to use probabilistic models. Typical setup: Step 1: features. Step 2: training set, test set, development set. Step 3: evaluation.
Supervised Learning. Regression: predicting stock price, predicting temperature, predicting revenue
Supervised vs. Unsupervised Learning. Unsupervised learning: more about knowledge discovery
Unsupervised Learning. Dimension reduction, e.g., principal component analysis
Unsupervised Learning. Clustering (e.g., graph mining). RolX: Role Extraction and Mining in Large Networks, by Henderson et al., 2011
Unsupervised Learning Topic modeling
Parametric vs. Non-parametric Models. Does the model have a fixed number of parameters? If yes, it is parametric. Does the number of parameters grow with the amount of training data? If yes, it is non-parametric. The choice affects computational tractability.
Today's Outline. Basic concepts in machine learning; K-nearest neighbors (supervised learning, a non-parametric classifier); linear regression; ridge regression
A non-parametric classifier: K-nearest neighbors (KNN). Basic idea: memorize all the training samples; the more training data you have, the more the model has to remember. Nearest neighbor (or 1-nearest neighbor): at testing time, find the closest training sample and return its label. K-nearest neighbors: at testing time, find the K nearest neighbors and return the majority vote of their labels.
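A minimal sketch of the KNN testing phase in Python/NumPy (illustrative only, not the course's reference code; the function name and the Euclidean metric are assumptions):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify a query point x by majority vote among its k nearest training samples."""
    # Euclidean distance from x to every training sample (one row per sample)
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```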
About K. K=1 gives a piecewise-constant labeling of the input space; K=N always returns the globally most frequent class.
Problems of KNN. Can be slow when the training data is big: searching for the neighbors takes time. Needs lots of memory to store the training data. Needs tuning of K and the distance function. Does not produce a probability distribution.
Problems of KNN: the choice of distance function. Euclidean distance treats all dimensions equally; Mahalanobis distance puts weights on the components.
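Written out (standard definitions; the slide's own formulas were lost in conversion), for D-dimensional inputs:

```latex
d_{\mathrm{Euclidean}}(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{j=1}^{D} (x_j - x'_j)^2},
\qquad
d_{\mathrm{Mahalanobis}}(\mathbf{x}, \mathbf{x}') = \sqrt{(\mathbf{x} - \mathbf{x}')^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \mathbf{x}')}
```

Euclidean distance is the special case Σ = I; a diagonal Σ weights each component separately.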
Probabilistic KNN. We prefer a probabilistic output because sometimes we may get an uncertain result: 1 sample votes yes, 199 vote no → ? 99 samples vote yes, 101 vote no → ? Probabilistic KNN returns a distribution over labels instead of a single vote.
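The probabilistic KNN estimate (as in Murphy, ch. 1): the probability of class c is the fraction of the K nearest neighbors carrying label c:

```latex
p(y = c \mid \mathbf{x}, \mathcal{D}, K) = \frac{1}{K} \sum_{i \in N_K(\mathbf{x}, \mathcal{D})} \mathbb{I}(y_i = c)
```

So 99-yes/101-no yields p(yes) ≈ 0.495, exposing the uncertainty that a bare majority vote hides.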
Probabilistic KNN. (Figure: 3-class synthetic training data.)
Smoothing. Class 1: 3, class 2: 0, class 3: 1. Original probability: p(y=1)=3/4, p(y=2)=0/4, p(y=3)=1/4. Add-1 smoothing: class 1: 3+1, class 2: 0+1, class 3: 1+1, so p(y=1)=4/7, p(y=2)=1/7, p(y=3)=2/7.
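In general, with C classes and N neighbors, add-1 (Laplace) smoothing gives:

```latex
p(y = c) = \frac{N_c + 1}{N + C}
```

Here N=4 and C=3, which reproduces the 4/7, 1/7, 2/7 values above.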
Softmax. Class 1: 3, class 2: 0, class 3: 1. Original probability: p(y=1)=3/4, p(y=2)=0/4, p(y=3)=1/4. A softmax redistributes probability mass over the different classes.
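The slide's definition was an image; the standard softmax over per-class scores z_c is:

```latex
\mathrm{softmax}(\mathbf{z})_c = \frac{e^{z_c}}{\sum_{c'=1}^{C} e^{z_{c'}}}
```

Every class receives non-zero probability, and larger scores get exponentially more mass.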
Today's Outline. Basic concepts in machine learning; K-nearest neighbors; linear regression (supervised learning, a parametric model); ridge regression
A parametric model: linear regression. Assumption: the response is a linear function of the inputs, i.e., the inner product between input sample x and weight vector w. The residual error is the difference between the prediction and the true label; we assume it has a normal distribution.
A parametric model: linear regression. We can further apply a basis function expansion to capture non-linear relationships.
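In symbols (standard formulation, matching Murphy's notation):

```latex
y = \mathbf{w}^{\top}\mathbf{x} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)
\;\Longrightarrow\;
p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid \mathbf{w}^{\top}\mathbf{x}, \sigma^2),
```

and with a basis function expansion the input x is replaced by a non-linear feature map:

```latex
p(y \mid \mathbf{x}, \boldsymbol{\theta}) = \mathcal{N}(y \mid \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}), \sigma^2)
```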
A parametric classifier: linear regression Ver@cal: temperature Horizontal: loca@on within a room
Learning with Maximum Likelihood Estimation (MLE). Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood (NLL). With our normal-distribution assumption, minimizing the NLL reduces to minimizing the residual sum of squares (RSS).
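Spelled out (standard derivation; the slide equations did not survive conversion):

```latex
\ell(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}),
\qquad
\mathrm{NLL}(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta})
= \frac{1}{2\sigma^2}\,\mathrm{RSS}(\mathbf{w}) + \frac{N}{2}\log(2\pi\sigma^2),
```

where

```latex
\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{N} \left(y_i - \mathbf{w}^{\top}\mathbf{x}_i\right)^2 .
```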
Derivation of MLE for Linear Regression. Rewrite the objective function in matrix form, take the derivative (gradient) with respect to w, and set it to 0; this yields the ordinary least squares solution.
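In matrix form (standard steps, as in Murphy sec. 7.3):

```latex
\mathrm{NLL}(\mathbf{w}) \propto \tfrac{1}{2}(\mathbf{y} - \mathbf{X}\mathbf{w})^{\top}(\mathbf{y} - \mathbf{X}\mathbf{w}),
\qquad
\nabla_{\mathbf{w}}\,\mathrm{NLL}(\mathbf{w}) = \mathbf{X}^{\top}\mathbf{X}\mathbf{w} - \mathbf{X}^{\top}\mathbf{y},
```

and setting the gradient to zero gives the ordinary least squares (OLS) solution

```latex
\hat{\mathbf{w}}_{\mathrm{OLS}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}.
```

In NumPy this is one line, e.g. `np.linalg.solve(X.T @ X, X.T @ y)`, which avoids forming the explicit inverse.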
Feature weights w: overfitting. (Large learned weights are a sign of an overfit model, motivating a prior on the weights.)
A Prior on the Weights. Zero-mean Gaussian prior on w, giving a new objective function.
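Concretely (standard MAP formulation, as in Murphy sec. 7.5; λ is the regularization strength): placing a zero-mean Gaussian prior on each weight and maximizing the posterior is equivalent to minimizing

```latex
J(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^{\top}\mathbf{x}_i\right)^2 + \lambda \lVert \mathbf{w} \rVert_2^2,
\qquad
p(\mathbf{w}) = \prod_{j} \mathcal{N}(w_j \mid 0, \tau^2), \quad \lambda = \sigma^2 / \tau^2 .
```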
Today's Outline. Basic concepts in machine learning; K-nearest neighbors; linear regression; ridge regression
Ridge Regression. We want to minimize the RSS plus an L2 regularization term; this gives a new estimate for the weights. The proof is left for Assignment 1!
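The closed form itself is standard (the derivation is what Assignment 1 asks for):

```latex
\hat{\mathbf{w}}_{\mathrm{ridge}} = (\lambda \mathbf{I} + \mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}
```

A minimal NumPy sketch (illustrative; the function name is hypothetical):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression weights: (lam*I + X^T X)^{-1} X^T y."""
    D = X.shape[1]
    # Solve the regularized normal equations instead of inverting explicitly
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)
```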
What we learned: basic concepts in machine learning; K-nearest neighbors; linear regression; ridge regression
Homework. Reading: Murphy ch. 1, ch. 2, and ch. 7 (only the sections covered in the lecture). Sign up on Piazza: http://piazza.com/northeastern/spring2017/cs614002. Start thinking about the course project and find a team! Project proposal due Jan 26. Next time: logistic regression, decision trees, generative models (Naive Bayes). Reading: Murphy ch. 3, 8.1-8.3, 8.6, 16.2