Covariate Shift: Consequences and Good Practice. Covariate shift, re-weighting training data, active sampling. Joyce Wang, Software Engineer, Sep 2017




2 Motivation
Validation accuracy = 0.96, query accuracy = 0.67. What is going on here?

3 Outline
What is covariate shift? Why would it occur? What consequences would it have?
How to detect covariate shift: visualization method, quantitative method.
Strategies to handle covariate shift: training data reweighting, active learning.

4 Covariate Shift
When the distributions of the training and test/query sets do not match, we are facing covariate shift, or sample selection bias. This violates a fundamental assumption: both the training and query data should be drawn from the same population / distribution.
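One common way to state this in symbols is given below. It is an assumption added here for clarity (the slide gives only the verbal definition): the input distribution differs between the two sets while the relationship between inputs and labels stays the same.

```latex
p_{\text{train}}(x) \neq p_{\text{query}}(x),
\qquad
p_{\text{train}}(y \mid x) = p_{\text{query}}(y \mid x)
```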

5 Distribution Mismatch
Two cases: training data and query data are drawn from almost the same population; training data and query data are drawn from completely different populations.

6 Covariate Shift - Commonplace
Lack of randomness; inadequate samples; biased sampling rules.

7 Covariate Shift - Consequence
Overfitting on training examples; unreliable predictions.
Example: binary classification. [Figure: training set and query set with actual labels 0 and 1, showing the wrong decision boundary and the optimal decision boundary.]

8 Detect Covariate Shift

9 Detect Covariate Shift
Visualization; membership modelling; uncertainty quantification.

10 Visualize Training and Query Data
[Figure: query set distribution and training set distribution.]
What if I have high-dimensional data? Per-dimension visualization; dimensionality reduction (PCA, t-SNE).
We need more robust methods.

11 Membership Modelling
We apply a model to predict the probability that a new point is a member of the training set. For example, a one-class SVM could classify new data as similar to or different from the training set.
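A minimal sketch of membership modelling follows. To stay dependency-free it swaps the one-class SVM for a simpler stand-in, a k-nearest-neighbour distance test; the function names, `k`, and the 95% quantile are assumptions made for illustration, not part of the slides. A library one-class SVM could take the place of the scoring function.

```python
import math
import random

def kth_neighbor_distance(point, reference, k):
    """Distance from `point` to its k-th nearest neighbour in `reference`."""
    distances = sorted(math.dist(point, other) for other in reference)
    return distances[k - 1]

def fit_membership_threshold(train, k=3, quantile=0.95):
    """Leave-one-out k-NN distances on the training set, reduced to a cut-off."""
    scores = []
    for i, point in enumerate(train):
        others = train[:i] + train[i + 1:]
        scores.append(kth_neighbor_distance(point, others, k))
    scores.sort()
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]

def is_similar_to_training(point, train, threshold, k=3):
    """True when the point lies as close to the training data as training points do."""
    return kth_neighbor_distance(point, train, k) <= threshold

# Synthetic training cloud (hypothetical data for the demo).
random.seed(0)
train = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
threshold = fit_membership_threshold(train)

print(is_similar_to_training((0.1, -0.2), train, threshold))  # point near the cloud
print(is_similar_to_training((8.0, 8.0), train, threshold))   # point far from the cloud
```

The fraction of query points flagged as different gives a rough quantitative signal of shift.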

12 Uncertainty Quantification
1. Fit a probabilistic model to the training set.
2. Every prediction has an uncertainty (confidence interval) associated with it.
3. Determine covariate shift from the uncertainty of the predictions.
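The three steps above can be sketched with the simplest probabilistic model available: a straight-line least-squares fit with its classical prediction interval. The slides do not say which model they use, so the linear model, the synthetic data, and all names below are assumptions.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, keeping what the interval needs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return {"a": intercept, "b": slope, "n": n,
            "x_mean": x_mean, "sxx": sxx, "sigma": sigma}

def predict_with_uncertainty(model, x0):
    """Return (prediction, standard error of a new observation at x0)."""
    mean = model["a"] + model["b"] * x0
    spread = model["sigma"] * math.sqrt(
        1 + 1 / model["n"] + (x0 - model["x_mean"]) ** 2 / model["sxx"]
    )
    return mean, spread

# Step 1: fit to (synthetic) training data drawn from x in [0, 1].
random.seed(1)
train_x = [random.uniform(0, 1) for _ in range(50)]
train_y = [2 * x + random.gauss(0, 0.1) for x in train_x]
model = fit_line(train_x, train_y)

# Steps 2 and 3: compare the uncertainty inside and outside the training range.
_, spread_inside = predict_with_uncertainty(model, 0.5)
_, spread_outside = predict_with_uncertainty(model, 5.0)
print(spread_inside, spread_outside)
```

Query points whose spread is much larger than the spread typical of the training range are candidates for covariate shift; where to draw that line is a judgment call.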

13 Uncertainty Quantification
[Figure: prediction value with upper and lower bounds across the query points.]
Low uncertainty: similar to the training dataset. High uncertainty: not similar to the training dataset.

14 Handle Covariate Shift

15 Handle Covariate Shift
Training sample reweighting: make the distribution of the training data look like the distribution of the query data.
Active sampling: help the model gain understanding of the query data and learn effectively.

16 Sample Reweighting
Build a classifier, e.g. logistic regression, to distinguish the training set from the query set.
Color training points by their probability of being in the query set (low, medium, high).

17 Sample Reweighting
Reweight every training point in the learning process.
[Figure: training samples 1 to n, each with its probability of being in the query set, mapped to weights w_1, ..., w_n.]
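A minimal sketch of how such weights could be computed: a one-feature logistic regression separates training points from query points, and each training point is weighted by the odds p / (1 - p) of belonging to the query set. The slides do not say which transformation of the probability they use; the odds form is a common density-ratio estimate and is an assumption here, as are all names and the synthetic data.

```python
import math
import random

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_domain_classifier(train_x, query_x, steps=500, lr=0.5):
    """Logistic regression on one feature: label 0 = training, 1 = query."""
    data = [(x, 0.0) for x in train_x] + [(x, 1.0) for x in query_x]
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, label in data:
            error = sigmoid(w * x + b) - label
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b

def importance_weights(train_x, query_x):
    """Weight each training point by its odds of belonging to the query set."""
    w, b = fit_domain_classifier(train_x, query_x)
    prior_ratio = len(train_x) / len(query_x)  # corrects for unequal set sizes
    weights = []
    for x in train_x:
        p_query = sigmoid(w * x + b)
        weights.append(prior_ratio * p_query / (1.0 - p_query))
    return weights

# Synthetic example: the query set is shifted to the right of the training set.
random.seed(2)
train_x = [random.gauss(0.0, 1.0) for _ in range(200)]
query_x = [random.gauss(1.0, 1.0) for _ in range(200)]
weights = importance_weights(train_x, query_x)
```

Training points that look like query points receive larger weights, so a learner that honours the weights concentrates on the region where predictions will be needed.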

18 Overlap
Overlap between the training and query sets is essential for sample re-weighting.

19 Active Learning
1. Train a probabilistic model.
2. Predict the query set with the trained model.
3. Find the query point that is expected to improve the model most.
4. Get the target value for that most useful point.
5. Put the point into the training set.
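The loop above can be sketched as follows. This is an illustration under stated assumptions: the probabilistic model is a least-squares line with its prediction standard error, "most useful" is taken to mean "highest predictive uncertainty", and `oracle` is a hypothetical stand-in for whatever experiment or annotator supplies target values.

```python
import math

def fit(points):
    """Least-squares line through (x, y) pairs; returns what the interval needs."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in points) / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((y - intercept - slope * x) ** 2 for x, y in points)
    sigma = math.sqrt(sse / (n - 2))
    return n, x_mean, sxx, sigma

def uncertainty(model, x0):
    """Standard error of a new observation at x0 under the fitted line."""
    n, x_mean, sxx, sigma = model
    return sigma * math.sqrt(1 + 1 / n + (x0 - x_mean) ** 2 / sxx)

def active_learning(train, pool, oracle, rounds):
    """Repeatedly label the pool point the current model is least sure about."""
    train, pool = list(train), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = fit(train)                                      # train the model
        pick = max(pool, key=lambda x: uncertainty(model, x))   # most uncertain point
        pool.remove(pick)
        train.append((pick, oracle(pick)))                      # get its target, add it
    return train, pool

def oracle(x):
    """Hypothetical labelling function used only for this demo."""
    return 2.0 * x + 0.1 * math.sin(5.0 * x)

train = [(x / 10, oracle(x / 10)) for x in range(6)]  # labelled points, x in 0.0..0.5
pool = [3.0, 1.0, 0.2, 2.0]                            # unlabelled query points
train, pool = active_learning(train, pool, oracle, rounds=2)
print([x for x, _ in train[-2:]], pool)
```

With this model the points farthest from the training data are labelled first, which is the behaviour that lets active learning work without overlap between the two sets.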

20-26 Active Learning - Demo
[Demo figures on slides 20 to 26; no text recoverable.]

27 Comparison of Strategies for Handling Covariate Shift
Sample reweighting. Advantages: achievable if you cannot get more samples. Disadvantages: needs overlap between training and query sets; less understanding of the data.
Active learning. Advantages: no need for overlap; gain more understanding about query data. Disadvantages: not achievable if you cannot get more samples.

28 Thank you


30 Uncertainty Quantification
[Figure: probability of positive label.]

31 Sample Reweighting
Reweight every training point when minimizing the loss function.
[Equation and figure not recoverable: training samples 1 to n, each with its probability of being in the query set, mapped to weights w_1, ..., w_n.]
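For orientation, a standard form of importance-weighted training is given below. It is offered as an assumption, not as a reconstruction of the slide's own formula; the symbols \theta, f, and \ell (model parameters, model, and per-sample loss) are introduced here for illustration.

```latex
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} w_i \, \ell\big(f(x_i; \theta),\, y_i\big),
\qquad \text{where } w_i = \frac{p_{\text{query}}(x_i)}{p_{\text{train}}(x_i)}
```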

32 Acquisition Function
Reduce the maximum uncertainty.
Reduce the maximum upper confidence bound.
Reduce the total uncertainty.
Utility function, if the policy is known.
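Two of these rules are easy to show side by side. The sketch below assumes each candidate query point already has a predicted mean and standard deviation; the candidate values and the multiplier `kappa` are hypothetical.

```python
def pick_max_uncertainty(candidates):
    """candidates: list of (name, mean, std). Choose the widest interval."""
    return max(candidates, key=lambda c: c[2])[0]

def pick_max_upper_bound(candidates, kappa=2.0):
    """Choose the highest upper confidence bound, mean + kappa * std."""
    return max(candidates, key=lambda c: c[1] + kappa * c[2])[0]

# Hypothetical predictions for three query points.
candidates = [("a", 0.9, 0.05), ("b", 0.2, 0.40), ("c", 0.6, 0.25)]

print(pick_max_uncertainty(candidates))
print(pick_max_upper_bound(candidates))
```

The two rules can disagree: the first explores wherever the model knows least, while the second also favours points whose predicted value is already high.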

33 Detect Covariate Shift - Comparison
Visualization. Advantage: quick. Disadvantage: subjective, open to interpretation.
Membership modelling. Advantage: informative, quantitative. Disadvantage: sensitive to tuning parameters.
Uncertainty quantification. Advantage: informative, quantitative, makes predictions. Disadvantage: difficult to work with large-size data.

34 Sample Reweighting
Apply the trained classifier to obtain the probability of each training point being inside the query set.
Use cross-validation to avoid over-fitting. [Figure: the training samples are split into folds; each fold in turn is held out while the classifier is trained on the rest.]
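A minimal sketch of the cross-validation idea: each training point receives its probability from a classifier that never saw that point while fitting. The `fit` and `predict` callables are placeholders for a real classifier; the toy midpoint classifier used in the demo is an assumption chosen only to keep the example short.

```python
import math

def k_fold_indices(n, k):
    """Partition range(n) into k interleaved folds."""
    return [list(range(start, n, k)) for start in range(k)]

def out_of_fold_probabilities(train_x, query_x, k, fit, predict):
    """Query-membership probability for every training point, computed out of fold."""
    probabilities = [None] * len(train_x)
    for fold in k_fold_indices(len(train_x), k):
        held_out = set(fold)
        kept = [x for i, x in enumerate(train_x) if i not in held_out]
        model = fit(kept, query_x)          # the held-out points are not used here
        for i in fold:
            probabilities[i] = predict(model, train_x[i])
    return probabilities

# Toy stand-in classifier: a logistic curve centred midway between the set means.
def fit_midpoint(train_part, query_part):
    train_mean = sum(train_part) / len(train_part)
    query_mean = sum(query_part) / len(query_part)
    return (train_mean + query_mean) / 2, query_mean - train_mean

def predict_midpoint(model, x):
    midpoint, direction = model
    return 1 / (1 + math.exp(-direction * (x - midpoint)))

train_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
query_x = [0.9, 1.1, 1.3, 1.5]
probs = out_of_fold_probabilities(train_x, query_x, k=3,
                                  fit=fit_midpoint, predict=predict_midpoint)
print(probs)
```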

35 Glossary
Training data: split into a training set (90%) and a hold-out / development set (10%); the hold-out set is used to validate the model (optional).
Test data / query data: the model is applied to predict the y value.

36 Sample Reweighting
Reweight every training point in the learning process: scale each training point by its weight (importance level).
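As an illustration of what scaling by weight does to a learner, the sketch below fits a straight line by weighted least squares. The data (y = x^2 sampled at five points) and the weights are hypothetical; the point is only that up-weighting one region pulls the fit toward it.

```python
def weighted_line_fit(xs, ys, weights):
    """Minimise sum_i w_i * (y_i - a - b * x_i)^2 and return (a, b)."""
    total = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / total
    y_mean = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [x * x for x in xs]                      # curved relationship

uniform = [1.0] * len(xs)                     # ordinary least squares
shifted = [0.1, 0.1, 0.1, 1.0, 1.0]           # emphasise the region near x = 2

_, uniform_slope = weighted_line_fit(xs, ys, uniform)
_, shifted_slope = weighted_line_fit(xs, ys, shifted)
print(uniform_slope, shifted_slope)
```

The weighted fit is steeper because it follows the curve where the weights are large, which is the intended effect when the query data live in that region.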
