Linear classifiers: Scaling up learning via SGD
Linear classifiers: Scaling up learning via SGD
Emily Fox, University of Washington, January 27, 2017

Stochastic gradient descent: Learning, one data point at a time
Stochastic gradient ascent

Update the coefficients repeatedly: w^(t) → w^(t+1) → w^(t+2) → w^(t+3) → ... Use only small subsets of the data to compute each gradient; this gives many updates for each pass over the data.

Stochastic gradient ascent for logistic regression:

init w^(1) = 0, t = 1
until converged:
  for i = 1, ..., n        (each iteration picks a different data point i)
    for j = 0, ..., d
      partial[j] = h_j(x_i) (1[y_i = +1] − P(y = +1 | x_i, w^(t)))
    w_j^(t+1) ← w_j^(t) + η partial[j]
    t ← t + 1

Unlike full gradient ascent, there is no sum over data points: each update uses the contribution of a single point.
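The loop above can be sketched in Python. This is a minimal illustration, not the course's reference code; the function and variable names are my own, and it assumes labels in {+1, −1} with features as a NumPy array (a column of ones can encode the intercept).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, eta=0.1, n_passes=10):
    """Stochastic gradient ascent for logistic regression.
    X: (n, d) feature matrix; y: (n,) labels in {+1, -1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_passes):
        for i in np.random.permutation(n):       # pick a different data point each time
            indicator = 1.0 if y[i] == 1 else 0.0
            p = sigmoid(X[i] @ w)                # P(y = +1 | x_i, w)
            partial = X[i] * (indicator - p)     # gradient contribution of one point
            w = w + eta * partial                # update from one point, not a sum over n
    return w
```

Note the update inside the inner loop touches one row of `X` at a time, which is what makes each pass cheap.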
Why would stochastic gradient ever work?

The gradient is the direction of steepest ascent. It is the best direction, but any direction that goes uphill would be useful.
In ML, the steepest direction is a sum of "little directions," one from each data point. For most data points, the contribution points uphill. Stochastic gradient therefore picks a single data point and moves in its direction; most of the time, the total likelihood will still increase.
Stochastic gradient ascent: most iterations increase the likelihood, but some decrease it. On average, the algorithm makes progress:

until converged:
  for i = 1, ..., n
    for j = 0, ..., d
      w_j^(t+1) ← w_j^(t) + η partial[j]
    t ← t + 1

[Figure: convergence path of stochastic gradient in coefficient space.]
Convergence paths: gradient vs. stochastic gradient

Stochastic gradient makes noisy progress: it achieves a higher likelihood sooner, but its convergence is noisier. Full gradient usually increases the likelihood smoothly. [Figure: average log-likelihood vs. time, where total time is proportional to the number of passes over the data; higher is better.]
Eventually, full gradient catches up. Note: you should only trust the average quality of stochastic gradient. Stochastic gradient will eventually oscillate around a solution, so the last coefficients may be really good or really bad (e.g., w^(1005) was good, but w^(1000) was bad). How do we minimize the risk of picking bad coefficients? Minimize the noise: don't return the last learned coefficients; output their average instead:

ŵ = (1/T) Σ_{t=1}^{T} w^(t)
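One way to compute this average without storing every iterate is a running mean. A small sketch (the function name is my own):

```python
import numpy as np

def running_average(weight_iter):
    """Running mean of SGD iterates w^(1), ..., w^(T): returns (1/T) * sum_t w^(t)
    without storing the whole history."""
    avg, t = None, 0
    for w in weight_iter:
        t += 1
        # incremental mean: avg_t = avg_{t-1} + (w_t - avg_{t-1}) / t
        avg = np.asarray(w, dtype=float).copy() if avg is None else avg + (w - avg) / t
    return avg
```

Feeding it the sequence of coefficient vectors produced during training returns the averaged solution the slide recommends.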
Summary of why stochastic gradient works:
- Gradient finds the direction of steepest ascent
- The gradient is a sum of contributions from each data point
- Stochastic gradient uses the direction from a single data point
- On average it increases the likelihood, though it sometimes decreases it
- Stochastic gradient has noisy convergence

Online learning: Fitting models from streaming data
Batch vs. online learning

Batch learning: all data is available at the start of training time, and the ML algorithm trains on it at once.
Online learning: data arrives (streams in) over time, so the model must be trained as the data arrives; at each time step t = 1, 2, 3, 4, ... new data reaches the ML algorithm, which updates its coefficients.

Online learning example: ad targeting. A website must choose which ads to suggest (Ad1, Ad2, Ad3). Input x_t: user info, page text. The user clicked on Ad2, so the observed output is y_t = Ad2, and the ML algorithm updates its coefficients from w^(t) to w^(t+1).
Online learning problem

Data arrives over each time step t:
- Observe input x_t (info of the user, text of the webpage)
- Make a prediction ŷ_t (which ad to show)
- Observe the true output y_t (which ad the user clicked on)

We need an ML algorithm that updates the coefficients at each time step, and stochastic gradient ascent can be used for exactly this:

init w^(1) = 0, t = 1
Each time step t:
- Observe input x_t
- Make a prediction ŷ_t
- Observe true output y_t
- Update coefficients:
  for j = 0, ..., d
    w_j^(t+1) ← w_j^(t) + η partial[j]
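The observe/predict/observe/update cycle above can be sketched as a single loop over a stream. This is an illustrative sketch with hypothetical names, assuming a logistic model with labels in {+1, −1}; note the prediction is made before the true label is used to update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic(stream, d, eta=0.1):
    """Online learning via stochastic gradient ascent.
    stream yields (x_t, y_t) pairs one at a time, y_t in {+1, -1}."""
    w = np.zeros(d)
    predictions = []
    for x_t, y_t in stream:
        y_hat = 1 if x_t @ w > 0 else -1                     # make a prediction first
        predictions.append(y_hat)
        indicator = 1.0 if y_t == 1 else 0.0                 # then observe the truth
        w = w + eta * x_t * (indicator - sigmoid(x_t @ w))   # and update immediately
    return w, predictions
```

Because each update uses only the current point, the model keeps improving as the stream continues, with no need to store past data.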
Summary of online learning:
- Data arrives over time
- Must make a prediction every time a new data point arrives
- Observe the true class after the prediction is made
- Want to update the parameters immediately

Summary of stochastic gradient descent
What you can do now:
- Significantly speed up a learning algorithm using stochastic gradient
- Describe the intuition behind why stochastic gradient works
- Apply stochastic gradient in practice
- Describe online learning problems
- Relate stochastic gradient to online learning

Decision Trees
Emily Fox, University of Washington, January 27, 2017
Predicting potential loan defaults

What makes a loan risky? "I want to buy a new house!" A loan application includes: credit history, income, loan term, and personal info.
Credit history explained: Did I pay previous loans on time? Example: excellent, good, or fair.
Income: What's my income? Example: $80K per year.
Loan terms: How soon do I need to pay the loan? Example: 3 years, 5 years, ...
Personal information: age, reason for the loan, marital status, ... Example: home loan for a married couple.
Intelligent application: an intelligent loan application review system processes incoming loan applications.

Classifier review: input x_i (a loan application) goes into the classifier model, which outputs a predicted class ŷ_i (e.g., ŷ_i = +1).
This module: decision trees

[Tree diagram: from Start, an internal node asks Credit? with branches excellent, fair, poor; the branches lead to further internal nodes Term? (3 years / 5 years) and Income? (high / low), which may ask Term? again.]

Scoring a loan application: given x_i = (Credit = poor, Income = high, Term = 5 years), start at the root, follow the branch for Credit = poor, then the branches matching the remaining feature values until a leaf is reached; the leaf's value is the prediction ŷ_i.
Decision tree learning task

Training data: N observations (x_i, y_i).

Credit     Term   Income  y
excellent  3 yrs  high    safe
fair       5 yrs  low     risky
fair       3 yrs  high    safe
poor       5 yrs  high    risky
excellent  3 yrs  low     risky
fair       5 yrs  low     safe
poor       3 yrs  high    risky
poor       5 yrs  low     safe
fair       3 yrs  high    safe

Goal: optimize a quality metric on the training data to learn a tree T(X).
Quality metric: Classification error

Error measures the fraction of mistakes:

Error = (# incorrect predictions) / (# examples)

- Best possible value: 0.0
- Worst possible value: 1.0

How do we find the best tree? The exponentially large number of possible trees T_1(X), T_2(X), T_3(X), ... makes decision tree learning hard: learning the smallest decision tree is an NP-hard problem [Hyafil & Rivest 76].
Greedy decision tree learning

Our training data table: assume N = 40 examples and 3 features (Credit, Term, Income), with rows as in the table above (Credit excellent/fair/poor, Term 3 or 5 yrs, Income high/low, label safe/risky).
Start with all the data: the root node contains all N = 40 examples. Compact visual notation: a node shows the loan status counts (# of safe loans, # of risky loans).
Decision stump: single-level tree. Split on Credit:
- excellent: 9 safe, 0 risky (subset of data with Credit = excellent)
- fair: 9 safe, 4 risky (subset with Credit = fair)
- poor: 4 safe, 14 risky (subset with Credit = poor)

Visual notation: these branch nodes are the intermediate nodes of the tree.
Making predictions with a decision stump: for each intermediate node, set ŷ = the majority value of the labels in that node.

Selecting the best feature to split on
How do we learn a decision stump? Find the best feature to split on! How do we select the best feature?

Choice 1: Split on Credit          Choice 2: Split on Term
excellent: 9 safe, 0 risky         3 years: 16 safe, 4 risky
fair: 9 safe, 4 risky              5 years: 6 safe, 14 risky
poor: 4 safe, 14 risky
How do we measure the effectiveness of a split? Idea: calculate the classification error of the decision stump:

Error = (# mistakes) / (# data points)

Calculating classification error:
Step 1: ŷ = class of the majority of data in the node.
Step 2: calculate the classification error of predicting ŷ for this data.

At the root (22 safe, 18 risky), ŷ = safe (the majority class), giving 22 correct and 18 mistakes:

Error = 18 / 40 = 0.45

Tree     Classification error
(root)   0.45
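This two-step calculation (predict the majority class, then count mistakes) is small enough to write out. A sketch, with a function name of my own choosing:

```python
from collections import Counter

def node_error(labels):
    """Classification error when predicting the majority class of a node.
    labels: list of class labels, e.g. ['safe', 'risky', ...]."""
    if not labels:
        return 0.0
    majority_count = Counter(labels).most_common(1)[0][1]   # Step 1: majority class size
    mistakes = len(labels) - majority_count                 # Step 2: everything else is a mistake
    return mistakes / len(labels)
```

On the root node above (22 safe, 18 risky) this returns 18/40 = 0.45, matching the slide.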
Choice 1: Split on Credit history? Does a split on Credit reduce the classification error below 0.45?

excellent: 9 safe, 0 risky → 0 mistakes
fair: 9 safe, 4 risky → 4 mistakes
poor: 4 safe, 14 risky → 4 mistakes

Error = (0 + 4 + 4) / 40 = 0.2

Tree              Classification error
(root)            0.45
Split on credit   0.2
Choice 2: Split on Term?

3 years: 16 safe, 4 risky → 4 mistakes
5 years: 6 safe, 14 risky → 6 mistakes

Error = (4 + 6) / 40 = 0.25

Tree              Classification error
(root)            0.45
Split on credit   0.2
Split on term     0.25
Choice 1 vs. Choice 2: comparing the split on Credit (error 0.2) with the split on Term (error 0.25), Credit wins.

Feature split selection algorithm:
Given a subset of data M (a node in a tree), for each feature h_i(x):
1. Split the data of M according to feature h_i(x)
2. Compute the classification error of the split
Choose the feature h*(x) with the lowest classification error.
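The selection rule can be sketched directly: compute each stump's error, take the minimum. The function names below are my own; run on the nine-row training table shown earlier, this picks Credit.

```python
from collections import Counter

def split_error(data, feature):
    """Classification error of a decision stump that splits on `feature`.
    data: list of (features_dict, label) pairs."""
    groups = {}
    for x, y in data:
        groups.setdefault(x[feature], []).append(y)
    # each branch predicts its majority class; everything else is a mistake
    mistakes = sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())
    return mistakes / len(data)

def best_feature(data, features):
    """Choose the feature whose stump has the lowest classification error."""
    return min(features, key=lambda f: split_error(data, f))
```

On the nine-row table, splitting on Credit makes 3 mistakes out of 9, while Term and Income each make 4, so Credit is selected.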
Recursion & stopping conditions

We've learned a decision stump; what next? In the Credit = excellent branch (9 safe, 0 risky), all data points have the same label, so there is nothing else to do with this subset of data: it becomes a leaf node.
Tree learning = recursive stump learning. After splitting on Credit (excellent: 9/0, fair: 9/4, poor: 4/14), build a decision stump with the subset of data where Credit = fair, and another with the subset where Credit = poor.

Second level: the fair branch splits on Term (3 years: 9 safe, 0 risky; 5 years: the rest), the poor branch splits on Income (high: 4 safe, 5 risky; low: the rest); then build another stump for the remaining impure data points.
Final decision tree:

Credit?
- excellent (9 safe, 0 risky): leaf
- fair (9 safe, 4 risky): Term? with branches 3 years and 5 years
- poor (4 safe, 14 risky): Income?
  - high (4 safe, 5 risky): Term? with branches 3 years and 5 years
  - low (0 safe, 9 risky): leaf

Simple greedy decision tree learning:
- Pick the best feature to split on
- Learn a decision stump with this split
- For each leaf of the decision stump, recurse

When do we stop???
Stopping condition 1: all data agrees on y. When all data in a node have the same y value (e.g., the excellent node with 9 safe, 0 risky, or the low-income node with 0 safe, 9 risky), there is nothing more to do.

Stopping condition 2: already split on all features. When a branch has already split on every possible feature, there is nothing more to do.
Greedy decision tree learning:
Step 1: Start with an empty tree
Step 2: Select a feature to split the data on (pick the feature split leading to the lowest classification error)
For each split of the tree:
Step 3: If there is nothing more to split on (stopping conditions 1 & 2), make predictions
Step 4: Otherwise, go to Step 2 and continue (recurse) on this split

Is this a good idea? Proposed stopping condition 3: stop if no split reduces the classification error.
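Steps 1-4 with stopping conditions 1 & 2 fit in a short recursive sketch. The code and names below are illustrative, not the course's reference implementation; it assumes categorical features stored in dicts and breaks error ties by feature order.

```python
from collections import Counter

def build_tree(data, features):
    """Greedy decision tree learning (Steps 1-4 above).
    data: list of (features_dict, label); features: names not yet split on.
    Returns a leaf label, or a (feature, {value: subtree}) pair."""
    labels = [y for _, y in data]
    # Stopping condition 1: all data agrees on y.
    # Stopping condition 2: already split on all features.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]   # majority-class prediction

    def stump_mistakes(f):
        groups = {}
        for x, y in data:
            groups.setdefault(x[f], []).append(y)
        return sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())

    best = min(features, key=stump_mistakes)          # Step 2: lowest-error split
    remaining = [f for f in features if f != best]
    subsets = {}
    for x, y in data:
        subsets.setdefault(x[best], []).append((x, y))
    # Step 4: recurse on each branch of the stump.
    return (best, {v: build_tree(subset, remaining) for v, subset in subsets.items()})

def predict(tree, x):
    """Traverse the tree for one example (assumes feature values seen in training)."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[x[feature]]
    return tree
```

On the nine-row loan table, the resulting tree splits on Credit at the root and classifies, e.g., (Credit = poor, Income = high) as risky.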
Stopping condition 3: stop if the error doesn't decrease?

y = x[1] xor x[2]

x[1]   x[2]   y
False  False  False
False  True   True
True   False  True
True   True   False

y values at the root: 2 True, 2 False.

Error = 2 / 4 = 0.5

Consider a split on x[1]: the x[1] = True branch has 1 True, 1 False; the x[1] = False branch has 1 True, 1 False.

Error = 2 / 4 = 0.5

Tree            Classification error
(root)          0.5
Split on x[1]   0.5
Consider a split on x[2]: again, each branch has 1 True, 1 False.

Error = 2 / 4 = 0.5

Tree            Classification error
(root)          0.5
Split on x[1]   0.5
Split on x[2]   0.5

Neither feature improves the training error. Stop now??? The final tree with stopping condition 3 is just the root: predict True, with classification error 0.5.
Without stopping condition 3: split on x[1], then split each branch on x[2]. Every leaf is then pure, so the classification error is 0, versus 0.5 for the tree with stopping condition 3. Condition 3 (stopping when the training error doesn't improve) is therefore not recommended!

Decision tree learning: real-valued features
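The XOR argument above is easy to check numerically: no single-feature split beats the root, yet a depth-2 rule is perfect. A small sketch (helper name is my own; tuple indices 0 and 1 stand in for x[1] and x[2]):

```python
def stump_error(data, j):
    """Error of a one-level split on boolean feature j, for (x, y) pairs."""
    mistakes = 0
    for branch in (True, False):
        labels = [y for x, y in data if x[j] == branch]
        if labels:
            # each branch predicts its majority value
            mistakes += len(labels) - max(labels.count(True), labels.count(False))
    return mistakes / len(data)

xor_data = [((False, False), False), ((False, True), True),
            ((True, False), True), ((True, True), False)]

# Neither single split improves on the root error of 0.5 ...
assert stump_error(xor_data, 0) == 0.5
assert stump_error(xor_data, 1) == 0.5
# ... yet splitting on both features (the depth-2 tree) classifies perfectly:
depth2 = lambda x: x[0] != x[1]
assert all(depth2(x) == y for x, y in xor_data)
```

This is exactly the failure mode of proposed stopping condition 3: greedily judging each split in isolation misses interactions between features.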
How do we use real-valued inputs?

Income   Credit     Term   y
$105K    excellent  3 yrs
$112K    good       5 yrs
$73K     fair       3 yrs
$69K     excellent  5 yrs
$217K    excellent  3 yrs
$120K    good       5 yrs
$64K     fair       3 yrs
$340K    excellent  5 yrs
$60K     good       3 yrs
38 Finding the best threshold split Infinite possible values of t Income = t * Income Income < t * Income >= t * $10K $120K 75 Consider a threshold between points Same classification error for any threshold split between v A and v B Income v A v B $10K $120K 76 38
Consider a threshold between points: the classification error is the same for any threshold split between two consecutive values v_A and v_B, so we only need to consider mid-points. This leaves a finite number of splits to consider.

Threshold split selection algorithm:
Step 1: Sort the values of feature h_j(x); let {v_1, v_2, v_3, ..., v_N} denote the sorted values.
Step 2: For i = 1, ..., N-1:
- Consider the split t_i = (v_i + v_{i+1}) / 2
- Compute the classification error for the threshold split h_j(x) >= t_i
Choose the t* with the lowest classification error.
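The two steps above translate directly to code. A sketch with an illustrative function name, assuming each side of the threshold predicts its majority class:

```python
def best_threshold(values, labels):
    """Threshold split selection: sort values, try the midpoint between each pair
    of consecutive distinct values, return (best threshold, its error)."""
    pairs = sorted(zip(values, labels))          # Step 1: sort by feature value
    n = len(pairs)
    best_t, best_mistakes = None, float('inf')
    for i in range(n - 1):                       # Step 2: consider each midpoint
        if pairs[i][0] == pairs[i + 1][0]:
            continue                             # identical values leave no gap to split
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        mistakes = 0
        for side in ([y for v, y in pairs if v < t],
                     [y for v, y in pairs if v >= t]):
            counts = {}
            for y in side:
                counts[y] = counts.get(y, 0) + 1
            mistakes += len(side) - max(counts.values())   # non-majority labels
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t, best_mistakes / n
```

For incomes [10, 20, 30, 100, 120] (in $K) with labels risky, risky, risky, safe, safe, the chosen threshold is the midpoint 65 with error 0.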
Visualizing the threshold split: in the (Age, Income) plane, the threshold split Age >= 38 is the vertical line Age = 38; predict one class for Age >= 38 and the other for Age < 38. [Figure: scatter plot with Income on the vertical axis ($0K-$80K) and Age on the horizontal axis.]
Depth 2: split on Income >= $60K. Within one Age branch, the second threshold split is the horizontal line Income = $60K. Each split partitions the 2-D space into axis-aligned regions, e.g., Age >= 38; Age < 38 with Income >= $60K; Age < 38 with Income < $60K.
Summary of decision trees

What you can do now:
- Define a decision tree classifier
- Interpret the output of a decision tree
- Learn a decision tree classifier using a greedy algorithm
- Traverse a decision tree to make predictions (majority class predictions, probability predictions, multiclass classification)
More informationImproving Fairness in Memory Scheduling
Improving Fairness in Memory Scheduling Using a Team of Learning Automata Aditya Kajwe and Madhu Mutyam Department of Computer Science & Engineering, Indian Institute of Tehcnology - Madras June 14, 2014
More informationMath 098 Intermediate Algebra Spring 2018
Math 098 Intermediate Algebra Spring 2018 Dept. of Mathematics Instructor's Name: Office Location: Office Hours: Office Phone: E-mail: MyMathLab Course ID: Course Description This course expands on the
More informationA Version Space Approach to Learning Context-free Grammars
Machine Learning 2: 39~74, 1987 1987 Kluwer Academic Publishers, Boston - Manufactured in The Netherlands A Version Space Approach to Learning Context-free Grammars KURT VANLEHN (VANLEHN@A.PSY.CMU.EDU)
More informationBridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models
Bridging Lexical Gaps between Queries and Questions on Large Online Q&A Collections with Compact Translation Models Jung-Tae Lee and Sang-Bum Kim and Young-In Song and Hae-Chang Rim Dept. of Computer &
More informationSCT Banner Student Fee Assessment Training Workbook October 2005 Release 7.2
SCT HIGHER EDUCATION SCT Banner Student Fee Assessment Training Workbook October 2005 Release 7.2 Confidential Business Information --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
More informationIterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages
Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages Nuanwan Soonthornphisaj 1 and Boonserm Kijsirikul 2 Machine Intelligence and Knowledge Discovery Laboratory Department of Computer
More informationLaboratorio di Intelligenza Artificiale e Robotica
Laboratorio di Intelligenza Artificiale e Robotica A.A. 2008-2009 Outline 2 Machine Learning Unsupervised Learning Supervised Learning Reinforcement Learning Genetic Algorithms Genetics-Based Machine Learning
More informationENME 605 Advanced Control Systems, Fall 2015 Department of Mechanical Engineering
ENME 605 Advanced Control Systems, Fall 2015 Department of Mechanical Engineering Lecture Details Instructor Course Objectives Tuesday and Thursday, 4:00 pm to 5:15 pm Information Technology and Engineering
More informationMulti-label classification via multi-target regression on data streams
Mach Learn (2017) 106:745 770 DOI 10.1007/s10994-016-5613-5 Multi-label classification via multi-target regression on data streams Aljaž Osojnik 1,2 Panče Panov 1 Sašo Džeroski 1,2,3 Received: 26 April
More informationAP Calculus AB. Nevada Academic Standards that are assessable at the local level only.
Calculus AB Priority Keys Aligned with Nevada Standards MA I MI L S MA represents a Major content area. Any concept labeled MA is something of central importance to the entire class/curriculum; it is a
More informationTruth Inference in Crowdsourcing: Is the Problem Solved?
Truth Inference in Crowdsourcing: Is the Problem Solved? Yudian Zheng, Guoliang Li #, Yuanbing Li #, Caihua Shan, Reynold Cheng # Department of Computer Science, Tsinghua University Department of Computer
More informationAn Introduction to Simulation Optimization
An Introduction to Simulation Optimization Nanjing Jian Shane G. Henderson Introductory Tutorials Winter Simulation Conference December 7, 2015 Thanks: NSF CMMI1200315 1 Contents 1. Introduction 2. Common
More informationContinual Curiosity-Driven Skill Acquisition from High-Dimensional Video Inputs for Humanoid Robots
Continual Curiosity-Driven Skill Acquisition from High-Dimensional Video Inputs for Humanoid Robots Varun Raj Kompella, Marijn Stollenga, Matthew Luciw, Juergen Schmidhuber The Swiss AI Lab IDSIA, USI
More informationManipulative Mathematics Using Manipulatives to Promote Understanding of Math Concepts
Using Manipulatives to Promote Understanding of Math Concepts Multiples and Primes Multiples Prime Numbers Manipulatives used: Hundreds Charts Manipulative Mathematics 1 www.foundationsofalgebra.com Multiples
More informationAn Effective Framework for Fast Expert Mining in Collaboration Networks: A Group-Oriented and Cost-Based Method
Farhadi F, Sorkhi M, Hashemi S et al. An effective framework for fast expert mining in collaboration networks: A grouporiented and cost-based method. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 27(3): 577
More informationPredicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks
Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview
More informationarxiv: v1 [math.at] 10 Jan 2016
THE ALGEBRAIC ATIYAH-HIRZEBRUCH SPECTRAL SEQUENCE OF REAL PROJECTIVE SPECTRA arxiv:1601.02185v1 [math.at] 10 Jan 2016 GUOZHEN WANG AND ZHOULI XU Abstract. In this note, we use Curtis s algorithm and the
More informationABSTRACT. A major goal of human genetics is the discovery and validation of genetic polymorphisms
ABSTRACT DEODHAR, SUSHAMNA DEODHAR. Using Grammatical Evolution Decision Trees for Detecting Gene-Gene Interactions in Genetic Epidemiology. (Under the direction of Dr. Alison Motsinger-Reif.) A major
More informationRule Learning With Negation: Issues Regarding Effectiveness
Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United
More informationComment-based Multi-View Clustering of Web 2.0 Items
Comment-based Multi-View Clustering of Web 2.0 Items Xiangnan He 1 Min-Yen Kan 1 Peichu Xie 2 Xiao Chen 3 1 School of Computing, National University of Singapore 2 Department of Mathematics, National University
More informationActivity 2 Multiplying Fractions Math 33. Is it important to have common denominators when we multiply fraction? Why or why not?
Activity Multiplying Fractions Math Your Name: Partners Names:.. (.) Essential Question: Think about the question, but don t answer it. You will have an opportunity to answer this question at the end of
More informationCal s Dinner Card Deals
Cal s Dinner Card Deals Overview: In this lesson students compare three linear functions in the context of Dinner Card Deals. Students are required to interpret a graph for each Dinner Card Deal to help
More informationTeam Formation for Generalized Tasks in Expertise Social Networks
IEEE International Conference on Social Computing / IEEE International Conference on Privacy, Security, Risk and Trust Team Formation for Generalized Tasks in Expertise Social Networks Cheng-Te Li Graduate
More informationNavigating the PhD Options in CMS
Navigating the PhD Options in CMS This document gives an overview of the typical student path through the four Ph.D. programs in the CMS department ACM, CDS, CS, and CMS. Note that it is not a replacement
More informationEmotional Variation in Speech-Based Natural Language Generation
Emotional Variation in Speech-Based Natural Language Generation Michael Fleischman and Eduard Hovy USC Information Science Institute 4676 Admiralty Way Marina del Rey, CA 90292-6695 U.S.A.{fleisch, hovy}
More informationAlgorithms and Data Structures (NWI-IBC027)
Algorithms and Data Structures (NWI-IBC027) Frits Vaandrager F.Vaandrager@cs.ru.nl Institute for Computing and Information Sciences 7th September 2017 Frits Vaandrager 7th September 2017 Lecture 1 1 /
More informationDegreeWorks Advisor Reference Guide
DegreeWorks Advisor Reference Guide Table of Contents 1. DegreeWorks Basics... 2 Overview... 2 Application Features... 3 Getting Started... 4 DegreeWorks Basics FAQs... 10 2. What-If Audits... 12 Overview...
More information