10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Midterm Exam Review
Matt Gormley
Lecture 14
March 6, 2017

Reminders
- Midterm Exam (Evening Exam): Tue, Mar. 07, 7:00pm to 9:30pm
- See Piazza for details about location

Outline
- Midterm Exam Logistics
- Sample Questions
- Classification and Regression: The Big Picture
- Q&A

MIDTERM EXAM LOGISTICS

Midterm Exam Logistics
- Evening Exam: Tue, Mar. 07, 7:00pm to 9:30pm
- 8-9 Sections
- Format of questions:
  - Multiple choice
  - True / False (with justification)
  - Derivations
  - Short answers
  - Interpreting figures
- No electronic devices
- You are allowed to bring one 8½ x 11 sheet of notes (front and back)

How to Prepare for the Midterm Exam
- Attend the midterm review session: Thu, March 2 at 6:30pm (PH 100)
- Attend the midterm review lecture: Mon, March 6 (in class)
- Review the prior year's exam and solutions (we'll post them)
- Review this year's homework problems

Midterm Exam Advice (for during the exam)
- Solve the easy problems first (e.g. multiple choice before derivations); if a problem seems extremely complicated, you're likely missing something.
- Don't leave any answer blank!
- If you make an assumption, write it down.
- If you look at a question and don't know the answer: we probably haven't told you the answer, but we've told you enough to work it out; imagine arguing for some answer and see if you like it.

Topics for Midterm
- Foundations: Probability, MLE, MAP, Optimization
- Classifiers: KNN, Naive Bayes, Logistic Regression, Perceptron, SVM
- Regression: Linear Regression
- Important Concepts: Kernels, Regularization and Overfitting, Experimental Design

SAMPLE QUESTIONS

Sample Questions
1.4 Probability
Assume we have a sample space. Answer each question with T or F.
(a) [1 pts.] T or F: If events A, B, and C are disjoint then they are independent.
(b) [1 pts.] T or F: $P(A \mid B) \propto P(A)\,P(B \mid A)$. (The sign $\propto$ means "is proportional to".)

Sample Questions
4 K-NN [12 pts]
Now we will apply K-Nearest Neighbors using Euclidean distance to a binary classification task. We assign the class of the test point to be the class of the majority of the k nearest neighbors. A point can be its own neighbor.
[Figure 5: K-NN toy dataset; not reproduced here.]
3. [2 pts] What value of k minimizes leave-one-out cross-validation error for the dataset shown in Figure 5? What is the resulting error?
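As a review aid (not part of the exam excerpt), a minimal NumPy sketch of leave-one-out cross-validation for k-NN; the toy dataset is made up, standing in for Figure 5, and the "a point can be its own neighbor" convention does not apply to the held-out point itself:

```python
import numpy as np
from collections import Counter

def loocv_error_knn(X, y, k):
    """Leave-one-out CV error for k-NN with Euclidean distance."""
    n = len(X)
    errors = 0
    for i in range(n):
        # Hold out point i; all remaining points are candidate neighbors.
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf  # exclude the held-out point itself
        neighbors = np.argsort(dists)[:k]
        vote = Counter(y[j] for j in neighbors).most_common(1)[0][0]
        errors += (vote != y[i])
    return errors / n

# Hypothetical toy data (NOT the actual Figure 5):
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
best_k = min(range(1, len(X)), key=lambda k: loocv_error_knn(X, y, k))
print(best_k, loocv_error_knn(X, y, best_k))
```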

Sample Questions
1.2 Maximum Likelihood Estimation (MLE)
Assume we have a random sample that is Bernoulli distributed, $X_1, \ldots, X_n \sim \text{Bernoulli}(\theta)$. We are going to derive the MLE for $\theta$. Recall that a Bernoulli random variable $X$ takes values in $\{0, 1\}$ and has probability mass function given by
$$P(X; \theta) = \theta^X (1 - \theta)^{1 - X}.$$
(a) [2 pts.] Derive the likelihood, $L(\theta; X_1, \ldots, X_n)$.
(c) Extra Credit: [2 pts.] Derive the following formula for the MLE: $\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} X_i$.
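For review, a sketch of the standard derivation of this MLE:

```latex
L(\theta; X_1, \ldots, X_n)
  = \prod_{i=1}^{n} \theta^{X_i} (1-\theta)^{1-X_i}
  = \theta^{\sum_i X_i} (1-\theta)^{n - \sum_i X_i}
\\[4pt]
\ell(\theta) = \log L(\theta)
  = \Big(\sum_i X_i\Big) \log \theta + \Big(n - \sum_i X_i\Big) \log(1-\theta)
\\[4pt]
\frac{d\ell}{d\theta}
  = \frac{\sum_i X_i}{\theta} - \frac{n - \sum_i X_i}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} X_i
```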

Sample Questions
1.3 MAP vs MLE
Answer each question with T or F and provide a one sentence explanation of your answer:
(a) [2 pts.] T or F: In the limit, as n (the number of samples) increases, the MAP and MLE estimates become the same.

Sample Questions
1.1 Naive Bayes
You are given a data set of 10,000 students with their sex, height, and hair color. You are trying to build a classifier to predict the sex of a student, so you randomly split the data into a training set and a testing set. Here are the specifications of the data set:
- sex ∈ {male, female}
- height ∈ [0, 300] centimeters
- hair ∈ {brown, black, blond, red, green}
- 3240 men in the data set
- 6760 women in the data set
Under the assumptions necessary for Naive Bayes (not the distributional assumptions you might naturally or intuitively make about the dataset), answer each question with T or F and provide a one sentence explanation of your answer:
(a) [2 pts.] T or F: As height is a continuous valued variable, Naive Bayes is not appropriate since it cannot handle continuous valued variables.
(c) [2 pts.] T or F: $P(\text{height} \mid \text{sex}, \text{hair}) = P(\text{height} \mid \text{sex})$.
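Relevant to part (a), a minimal sketch (with made-up numbers, not the actual dataset) of how Naive Bayes can handle a continuous feature by fitting a per-class Gaussian to it:

```python
import numpy as np

# Hypothetical heights (cm) with labels (0 = male, 1 = female);
# the numbers are invented for illustration only.
heights = np.array([178.0, 183.0, 171.0, 162.0, 158.0, 165.0])
labels  = np.array([0, 0, 0, 1, 1, 1])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# One Gaussian per class: exactly the class-conditional p(height | sex)
# that Naive Bayes needs for a continuous feature.
params = {c: (heights[labels == c].mean(), heights[labels == c].std(ddof=1))
          for c in (0, 1)}
priors = {c: np.mean(labels == c) for c in (0, 1)}

def predict(x):
    # argmax over classes of p(c) * p(height = x | c)
    scores = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in (0, 1)}
    return max(scores, key=scores.get)

print(predict(175.0))  # -> 0 with these made-up numbers
```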

Sample Questions
3.1 Linear regression
Consider the dataset S plotted in Fig. 1 along with its associated regression line. For each of the altered data sets S_new plotted in Fig. 3, indicate which regression line (relative to the original one) in Fig. 2 corresponds to the regression line for the new data set. Write your answers in the table below.

[Figure 1: An observed data set and its associated regression line.]
[Figure 2: New regression lines for altered data sets S_new.]

Altered data sets in Fig. 3:
(a) Adding one outlier to the original data set.
(c) Adding three outliers to the original data set. Two on one side and one on the other side.
(d) Duplicating the original data set.
(e) Duplicating the original data set and adding four points that lie on the trajectory of the original regression line.

Dataset         | (a) | (b) | (c) | (d) | (e)
Regression line |     |     |     |     |
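Not part of the exam, but a quick way to build intuition for these questions: a minimal NumPy sketch (with made-up data, since the figures aren't reproduced) showing how an outlier shifts the least-squares line, while duplicating the data leaves it unchanged:

```python
import numpy as np

# Hypothetical data standing in for Fig. 1: points near y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

slope0, intercept0 = np.polyfit(x, y, deg=1)

# (a) Add one outlier far below the line and refit.
x_out = np.append(x, 5.0)
y_out = np.append(y, -20.0)
slope1, intercept1 = np.polyfit(x_out, y_out, deg=1)

# (d) Duplicating the data set leaves the least-squares fit unchanged,
# since every residual is simply counted twice.
x_dup = np.concatenate([x, x])
y_dup = np.concatenate([y, y])
slope2, intercept2 = np.polyfit(x_dup, y_dup, deg=1)

print(slope0, intercept0)   # original fit
print(slope1, intercept1)   # pulled toward the outlier
print(slope2, intercept2)   # identical to the original fit
```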

Sample Questions
3.2 Logistic regression
Given a training set $\{(x_i, y_i), i = 1, \ldots, n\}$ where $x_i \in \mathbb{R}^d$ is a feature vector and $y_i \in \{0, 1\}$ is a binary label, we want to find the parameters $\hat{w}$ that maximize the likelihood for the training set, assuming a parametric model of the form
$$p(y = 1 \mid x; w) = \frac{1}{1 + \exp(-w^T x)}.$$
The conditional log likelihood of the training set is
$$\ell(w) = \sum_{i=1}^{n} y_i \log p(y_i \mid x_i; w) + (1 - y_i) \log(1 - p(y_i \mid x_i; w)),$$
and the gradient is
$$\nabla \ell(w) = \sum_{i=1}^{n} (y_i - p(y_i \mid x_i; w))\, x_i.$$
(b) [5 pts.] What is the form of the classifier output by logistic regression?
(c) [2 pts.] Extra Credit: Consider the case with binary features, i.e., $x \in \{0, 1\}^d \subseteq \mathbb{R}^d$, where feature $x_1$ is rare and happens to appear in the training set with only label 1. What is $\hat{w}_1$? Is the gradient ever zero for any finite $w$? Why is it important to include a regularization term to control the norm of $\hat{w}$?
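As a review aid (not from the slides), a minimal NumPy sketch of batch gradient ascent on this conditional log likelihood; the step size and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Batch gradient ascent on the conditional log likelihood.
    X: (n, d) feature matrix, y: (n,) array of 0/1 labels."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)      # p(y=1 | x_i; w) for every example
        grad = X.T @ (y - p)    # sum_i (y_i - p_i) x_i
        w += lr * grad / n      # averaged for a stable step size
    return w

# The resulting classifier is linear: predict 1 when
# p(y=1 | x; w) >= 1/2, i.e. when w^T x >= 0.
def predict(X, w):
    return (X @ w >= 0).astype(int)
```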

Sample Questions
2.1 Train and test errors
In this problem, we will see how you can debug a classifier by looking at its train and test errors. Consider a classifier trained to convergence on some training data D_train, and tested on a separate test set D_test. You look at the test error and find that it is very high. You then compute the training error and find that it is close to 0.
1. [4 pts] Which of the following is expected to help? Select all that apply.
(a) Increase the training data size.
(b) Decrease the training data size.
(c) Increase model complexity (for example, if your classifier is an SVM, use a more complex kernel; or if it is a decision tree, increase the depth).
(d) Decrease model complexity.
(e) Train on a combination of D_train and D_test and test on D_test.
(f) Conclude that Machine Learning does not work.

Sample Questions
2.1 Train and test errors (continued)
4. [1 pts] Say you plot the train and test errors as a function of the model complexity. Which of the following two plots is your plot expected to look like?
[Plots (a) and (b): train and test error vs. model complexity; not reproduced here.]

Sample Questions
4.1 True or False
Answer each of the following questions with T or F and provide a one line justification.
(a) [2 pts.] Consider two datasets $D^{(1)}$ and $D^{(2)}$ where $D^{(1)} = \{(x^{(1)}_1, y^{(1)}_1), \ldots, (x^{(1)}_n, y^{(1)}_n)\}$ and $D^{(2)} = \{(x^{(2)}_1, y^{(2)}_1), \ldots, (x^{(2)}_m, y^{(2)}_m)\}$ such that $x^{(1)}_i \in \mathbb{R}^{d_1}$, $x^{(2)}_i \in \mathbb{R}^{d_2}$. Suppose $d_1 > d_2$ and $n > m$. Then the maximum number of mistakes a perceptron algorithm will make is higher on dataset $D^{(1)}$ than on dataset $D^{(2)}$.
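A fact worth having at hand for this question: the perceptron mistake bound from lecture depends on the margin and the radius of the data, not directly on the dimension or the number of examples. For linearly separable data:

```latex
% If \|x_i\|_2 \le R for all i and some unit vector w^* separates the data
% with margin \gamma (i.e. y_i \, w^{*\top} x_i \ge \gamma for all i), then
% the perceptron makes at most
\left(\frac{R}{\gamma}\right)^{2}
% mistakes, regardless of d or n.
```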

Sample Questions
4.3 Analysis
(a) [4 pts.] In one or two sentences, describe the benefit of using the kernel trick.
(b) [4 pts.] The concept of margin is essential in both SVM and Perceptron. Describe why a large margin separator is desirable for classification.

Sample Questions
(c) [4 pts.] Extra Credit: Consider the dataset in Fig. 4. Under the SVM formulation in section 4.2(a): (1) Draw the decision boundary on the graph. (2) What is the size of the margin? (3) Circle all the support vectors on the graph.
[Figure 4: SVM toy dataset; not reproduced here.]

Sample Questions
3. [Extra Credit: 3 pts.] One formulation of the soft-margin SVM optimization problem is:
$$\begin{aligned}
\min_{w} \quad & \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \xi_i \\
\text{s.t.} \quad & y_i (w^\top x_i) \ge 1 - \xi_i \quad \forall i = 1, \ldots, N \\
& \xi_i \ge 0 \quad \forall i = 1, \ldots, N \\
& C \ge 0
\end{aligned}$$
where $(x_i, y_i)$ are training samples and $w$ defines a linear decision boundary. Derive a formula for $\xi_i$ when the objective function achieves its minimum (no steps necessary). Note it is a function of $y_i w^\top x_i$. Sketch a plot of $\xi_i$ with $y_i w^\top x_i$ on the x-axis and the value of $\xi_i$ on the y-axis. What is the name of this function?
[Figure 2: space to plot your answer.]
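For review, the standard argument for the optimal slack (a sketch, not the posted solution):

```latex
% At the minimum, each \xi_i is the smallest value satisfying both
% constraints \xi_i \ge 1 - y_i w^\top x_i and \xi_i \ge 0, since the
% objective pushes each \xi_i down. Hence
\xi_i = \max\!\left(0,\; 1 - y_i\, w^\top x_i\right),
% the hinge loss: zero for y_i w^\top x_i \ge 1, and a line of slope -1
% to the left of 1.
```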

The Big Picture
CLASSIFICATION AND REGRESSION

Classification and Regression: The Big Picture
Whiteboard:
- Decision Rules / Models (probabilistic generative, probabilistic discriminative, perceptron, SVM, regression)
- Objective Functions (likelihood, conditional likelihood, hinge loss, mean squared error)
- Regularization (L1, L2, priors for MAP)
- Update Rules (SGD, perceptron)
- Nonlinear Features (preprocessing, kernel trick)

Q&A