Machine Learning Practical


1 Machine Learning Practical Pamela K Douglas UCLA August 6, 2015 Pamela K. Douglas, University of California, Los Angeles 2015 NITP Summer Course

2 Overview. Part I: Weka. Part II: MVPA. Machine Learning Exercises.

3 Why WEKA? 1. It allows you to easily test many machine learning (ML) classifiers that have been vetted by the ML community. 2. Many feature selection methods are available. 3. Running cross-validation (and nested cross-validation) is fast and easy.

4 Inductive Bias. In both supervised and unsupervised ML (and regression problems), the data by themselves are not sufficient to find a unique solution from the hypothesis class of all possible models. Machine learning is therefore an ill-posed problem, and additional assumptions are required. These assumptions are called the model's inductive bias. Line? Curve? Higher-order polynomial?

5 Inductive Bias. The inductive bias can be related to an assumption made about the underlying distributions of the data (parametric models). Alternatively, the model can assume a form for the discriminant or boundary used to separate data exemplars from different classes (nonparametric models). Figure panels: Parametric; Nonparametric.

6 1. Weka: Testing Multiple Classifiers. With any modelling technique, it is good practice to test multiple model hypotheses. In ML specifically, the No Free Lunch Theorem states that no single classifier works best across all domains and data sets (Wolpert & Macready). Trying out different classifiers is therefore a good idea. Figure: Performance Comparison of ML Classifiers, % accuracy (10-fold cross-validation) versus number of ICs (PK Douglas et al., NeuroImage, 56(2)).

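7 Classifying Decision Making With Decision Trees. Figure: decision tree classifying Belief versus Disbelief from independent components (IC 5, IC 13, IC 15, IC 19), shown with IC spatial masks (legend: B>DB, Common to B, DB, Common to DB).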

8 Features for Decoding. Pattern classifiers operate on features. What are features? Features (or attributes) are descriptive variable categories that measure certain attributes of the data. Features can take on string (nominal) values; for example, sex can be either male or female. With neuroimaging data, however, features are often numerical. A typical training example may contain nominal or numeric values for each of the feature categories.


9 Many Types of Features Used in Decoding Neuroimaging Data. Voxels (Cox & Savoy 2003); Searchlights (Kriegeskorte et al. 2006); Fc-MRI Matrices (Dosenbach et al. 2010); Graph Theory Metrics (Colby et al. 2012); Effective Connectivity (Brodersen et al. 2011); Independent Components (Douglas et al. 2011).

10 2. Weka: Many Feature Selection Algorithms. Is feature selection important?


11 ADHD 200 Initiative. Public release of data from n=973 subjects, including structural, resting-state fMRI, and demographic information from ADHD subtypes. Our team (3rd place) and others had 1,000s of neuroimaging features. A brief survey of Kaggle Big Data ML competitions shows that winning teams use feature selection (FS). The winning team used only demographic features!
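
As one illustration of a simple feature selection (filter) step, the sketch below ranks features by the absolute value of Welch's t-statistic between two classes. It is a minimal example under stated assumptions: the function names, class labels, and toy data are hypothetical, and this is not the method used by any ADHD-200 team or by Weka. To avoid peeking, a ranking like this should be computed on the training data only.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def rank_features(rows, labels):
    """Return feature indices sorted by decreasing |t| between the two classes."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles two classes only"
    scores = []
    for j in range(len(rows[0])):
        a = [r[j] for r, y in zip(rows, labels) if y == classes[0]]
        b = [r[j] for r, y in zip(rows, labels) if y == classes[1]]
        scores.append(abs(welch_t(a, b)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Toy training set (hypothetical): feature 0 separates the classes, feature 1 is noise.
train_rows = [[1.0, 0.3], [1.2, 0.9], [0.9, 0.5], [3.1, 0.4], [2.8, 0.8], [3.3, 0.6]]
train_labels = ["class_a", "class_a", "class_a", "class_b", "class_b", "class_b"]
ranking = rank_features(train_rows, train_labels)
print(ranking)
```

Keeping only the first few indices of `ranking` would reduce thousands of features to a small subset before classification.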

12 Regularization. In regularization, we write an augmented error function: cost = data misfit + λ × complexity. Regularization also limits the influence of outliers (Rätsch et al. 2001; Lemm et al. 2011), which may arise due to movement and have been shown to be particularly problematic when using rs-fcMRI features (Powers et al. 2011).
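
The augmented error function can be made concrete with a one-parameter ridge model. This is a sketch under assumptions not in the slides: squared error is used as the data misfit, the squared weight as the complexity term, and the data are invented.

```python
def ridge_cost(w, xs, ys, lam):
    """Augmented error: data misfit (squared error) + lam * complexity (w squared)."""
    misfit = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    complexity = w ** 2
    return misfit + lam * complexity


def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of ridge_cost for the one-parameter model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 12.0]  # the last point is an outlier (hypothetical data)

for lam in (0.0, 1.0, 10.0):
    w = ridge_slope(xs, ys, lam)
    print(f"lambda={lam:5.1f}  w={w:.3f}  cost={ridge_cost(w, xs, ys, lam):.3f}")
```

In this toy case, increasing λ shrinks the fitted slope, pulling it back from the value inflated by the outlier.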

13 SVM: Inherent Regularization. SVM provides an internal regularization step through its C parameter. When C is large (tending towards the hard margin), it penalizes error points more strongly, which results in a smaller margin, typically with fewer support vectors, and a stronger tendency to overfit. No single C value fits all data best, so this parameter must be tuned appropriately. Different radial basis function kernel sizes are highly relevant to the size of the searchlight. Figure: test error curves for an SVM with a radial kernel (γ = 5, 1, 0.5, 0.1) plotted against C = 1/λ (log scale, 1e-01 to 1e+03); in some panels the most regularized model wins, in others the least regularized is best (Hastie et al. 2013).

14 The Need for Reduction: Controversial? 2013 Pamela Douglas, UCLA NITP


15 Chu et al. 2012 Revisited. Figure (from Kerr et al., A Comment on Chu et al. 2012): a reproduction of Chu et al.'s Fig. 9E (C. Chu et al. / NeuroImage 60 (2012); t-test+RFE, sample size 134; accuracy versus varying C), where the added shading indicates the 95% confidence interval for the no-feature-selection accuracy using the normal approximation of the binomial distribution. Accuracy using all voxelized features was not significantly higher than data-driven feature selection accuracy at the optimum C, C*. At multiple non-optimum C values, the accuracy using data-driven feature selection was significantly higher than using all voxelized features.
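
The 95% confidence interval mentioned in the caption can be computed with the normal approximation to the binomial distribution. The sketch below shows the formula only; the accuracy and the number of test predictions are hypothetical and are not values from Chu et al. or Kerr et al.

```python
import math


def binomial_ci_normal(accuracy, n, z=1.96):
    """Approximate 95% CI for a proportion (normal approximation to the binomial)."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# Hypothetical example: 85% accuracy measured on 100 test predictions.
low, high = binomial_ci_normal(0.85, 100)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Two accuracies whose difference falls inside such an interval would not be judged significantly different by this criterion.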

16 3. Cross Validation

17 Hyperparameters. Hyperparameter: a value that is critical to the structure of a model but is not optimized jointly with the model's inherent parameters.

18 K Nearest Neighbor. The number of neighbors that influence the decision is a hyperparameter.


19 Hyperparameter Optimization via Nested Cross Validation. All neighbors have an equal vote; K is generally an odd number to avoid a tie vote. The number of neighbors that vote should be optimized (Alpaydin 2004). Nested cross validation was used to simultaneously tune the number of features and the blending ratio (akin to the number of neighbours)!
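
A minimal sketch of the equal-vote k-nearest-neighbour rule follows. The function name, the Euclidean distance, and the toy points and labels are assumptions made for illustration.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by an equal-weight majority vote among its k nearest neighbours."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    order = sorted(range(len(train_points)),
                   key=lambda i: squared_distance(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training examples.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.2), (1.1, 0.8)]
labels = ["face", "face", "house", "house", "house"]

# With two classes, an odd k guarantees that the vote cannot tie.
for k in (1, 3, 5):
    print(k, knn_predict(points, labels, (0.1, 0.2), k))
```

The prediction for the same query changes once k grows to include all five examples, which is why k has to be tuned rather than fixed in advance.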

20 3. Cross Validation

21 Nested Cross-Validation: Useful Practice for Tuning Hyperparameters
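
Nested cross-validation can be sketched as two loops: the inner loop chooses the hyperparameter using training data only, and the outer loop estimates accuracy on data that played no part in that choice. The example below assumes a one-dimensional k-nearest-neighbour classifier, interleaved folds, and synthetic Gaussian data; all names are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training examples closest to `query` (1-D feature)."""
    nearest = sorted(train, key=lambda example: abs(example[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test examples classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def folds(data, n_folds):
    """Yield (train, test) splits; fold i takes every n_folds-th example."""
    for i in range(n_folds):
        test = data[i::n_folds]
        train = [ex for j, ex in enumerate(data) if j % n_folds != i]
        yield train, test


def nested_cv(data, candidate_ks, outer_folds=5, inner_folds=4):
    """Outer loop estimates accuracy; inner loop picks k using training data only."""
    outer_scores, chosen_ks = [], []
    for outer_train, outer_test in folds(data, outer_folds):
        def inner_score(k):
            scores = [accuracy(tr, va, k) for tr, va in folds(outer_train, inner_folds)]
            return sum(scores) / len(scores)

        best_k = max(candidate_ks, key=inner_score)
        chosen_ks.append(best_k)
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return sum(outer_scores) / len(outer_scores), chosen_ks


# Synthetic data: two overlapping 1-D Gaussian classes (purely illustrative).
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), "A") for _ in range(40)] + \
       [(rng.gauss(2.0, 1.0), "B") for _ in range(40)]
rng.shuffle(data)

mean_accuracy, chosen_ks = nested_cv(data, candidate_ks=[1, 3, 5, 7])
print(f"nested-CV accuracy: {mean_accuracy:.2f}, k chosen per outer fold: {chosen_ks}")
```

The chosen k may differ between outer folds; the reported accuracy describes the whole tuning procedure rather than one fixed value of k.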

22 Nested Cross-Validation

23 Interpretation Beyond Accuracy is Tricky. Although tempting, interpreting decoding weight vectors as meaningful may lead to false conclusions (Guyon & Elisseeff 2002; Haufe et al. 2014). A feature with almost zero class-specific information can be given a higher weight than a feature containing a high degree of information (Haufe et al. 2014). PK Douglas 12/8/14

24 Geometry of the SVM. The decision boundary is g(x) = w·x + w_0 = 0, with g(x) > 0 on one side and g(x) < 0 on the other. Choose C1 (stimulus class 1) if g(x) > 0, else C2 (stimulus class 2). Support vectors are outlined in red. The SVM seeks the maximum-margin hyperplane, which maximizes the distance between the support vectors, or difficult points (examples near the boundary), on either side.
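
The decision rule on this slide can be written out directly. In the sketch below the weight vector and offset are hypothetical values chosen by hand, not the result of training an SVM; the distance function uses the standard point-to-hyperplane formula |g(x)| / ||w||.

```python
import math


def g(x, w, w0):
    """Linear discriminant g(x) = w . x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(x, w, w0):
    """Choose C1 if g(x) > 0, otherwise C2."""
    return "C1" if g(x, w, w0) > 0 else "C2"


def distance_to_hyperplane(x, w, w0):
    """Geometric distance of x from the boundary g(x) = 0."""
    return abs(g(x, w, w0)) / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical weights for a two-feature problem (not fitted to any data).
w, w0 = [1.0, 1.0], -3.0

for x in ([3.0, 2.0], [1.0, 1.0], [2.0, 1.5]):
    print(x, classify(x, w, w0), round(distance_to_hyperplane(x, w, w0), 3))
```

Examples with the smallest distance to the boundary are the difficult points that a trained SVM would treat as support vectors.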

25 Interpretation Beyond Accuracy is Tricky. Although tempting, interpreting decoding weight vectors as meaningful may lead to false conclusions (Guyon & Elisseeff 2002; Haufe et al. 2014). A number of factors can influence interpretability, including: - feature covariance (or lack thereof) - kernel applied - classifier used, etc.

26 Overview. Part I: Weka. Part II: MVPA. Machine Learning Exercises.

27 Benefits of Using MVPA. 1. Loads brain images directly and extracts voxel features for you. 2. Many feature selection methods are available that are specifically designed for fMRI data (Searchlight, etc.). 3. It also has built-in tools for parameter tuning via nested cross-validation, etc. 4. It has user-friendly tools for running permutation tests. 5. Runs on Matlab or Python (PyMVPA).

28 MVPA Has Many Options. 1. One may extract the signal at the time of the estimated HRF peak as the feature, or use the mean of a few points near the estimated peak. 2. Alternatively, fits to the expected HRF response can be generated, and the features for that trial can be: - the beta value for that trial - t values derived from the beta values - or % signal change beta values.
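
The first option (mean of a few points near the estimated peak) and the percent-signal-change conversion can be sketched as follows. This is not code from any MVPA toolbox: the function names, the 5-second peak delay, the 2.5-second sampling interval, and the toy time series are all assumptions.

```python
def peak_window_mean(timeseries, onset_index, tr_seconds,
                     peak_delay_seconds=5.0, half_width=1):
    """Mean signal in a small window centred on the expected HRF peak after onset."""
    peak_index = onset_index + round(peak_delay_seconds / tr_seconds)
    start = max(0, peak_index - half_width)
    stop = min(len(timeseries), peak_index + half_width + 1)
    window = timeseries[start:stop]
    return sum(window) / len(window)


def percent_signal_change(value, baseline):
    """Express a signal value as percent change relative to a baseline mean."""
    return 100.0 * (value - baseline) / baseline


# Toy voxel time series sampled every 2.5 s, with a trial starting at index 2.
voxel = [100.0, 100.0, 100.0, 101.0, 103.0, 102.0, 100.5, 100.0]
baseline = sum(voxel[:3]) / 3
feature = peak_window_mean(voxel, onset_index=2, tr_seconds=2.5)
print(feature, percent_signal_change(feature, baseline))
```

Applying the same extraction to every voxel and every trial yields one feature vector per trial, which is the form a classifier expects.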

29 MVPA: Many Options for Permutation Tests. To double check that you have not unknowingly peeked, you can run a permutation test. Figure: training phase and test phase.

30 Permutation: A Good Sanity Check! To double check that you have not unknowingly peeked, you can run a permutation test. In this case, the labels of the data exemplars are scrambled or shuffled randomly. Figure: training phase and test phase.

31 Permutation: A Good Sanity Check! To double check that you have not unknowingly peeked, you can run a permutation test. In this case, the labels of the data exemplars are scrambled or shuffled randomly. You should then verify that the resulting accuracy is at chance level. Figure: training phase and test phase.
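
A permutation test of this kind can be sketched in a few lines. The example assumes a nearest-class-mean classifier, a fixed train/test split, and invented one-dimensional data; it is not the permutation tooling of any MVPA toolbox. Only the training labels are shuffled, so the test phase is left untouched.

```python
import random


def fit_class_means(values, labels):
    """Training phase: store the mean feature value of each class."""
    sums, counts = {}, {}
    for v, y in zip(values, labels):
        sums[y] = sums.get(y, 0.0) + v
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def accuracy(means, values, labels):
    """Test phase: assign each example to the class with the nearest mean."""
    hits = sum(min(means, key=lambda c: abs(v - means[c])) == y
               for v, y in zip(values, labels))
    return hits / len(values)


def permutation_test(train_x, train_y, test_x, test_y, n_permutations=500, seed=0):
    """Compare the observed accuracy with accuracies after shuffling training labels."""
    rng = random.Random(seed)
    observed = accuracy(fit_class_means(train_x, train_y), test_x, test_y)
    null_accuracies = []
    for _ in range(n_permutations):
        shuffled = train_y[:]
        rng.shuffle(shuffled)
        null_accuracies.append(accuracy(fit_class_means(train_x, shuffled), test_x, test_y))
    # Add-one correction so the p-value is never exactly zero.
    exceed = sum(a >= observed for a in null_accuracies)
    return observed, null_accuracies, (exceed + 1) / (n_permutations + 1)


# Hypothetical, well-separated data for two classes.
train_x = [0.1, 0.4, 0.2, 0.5, 0.3, 1.9, 2.2, 2.0, 2.4, 2.1]
train_y = ["A"] * 5 + ["B"] * 5
test_x = [0.0, 0.6, 0.35, 1.8, 2.3, 2.05]
test_y = ["A"] * 3 + ["B"] * 3

observed, null_accuracies, p_value = permutation_test(train_x, train_y, test_x, test_y)
print(observed, sum(null_accuracies) / len(null_accuracies), p_value)
```

If the average accuracy under shuffled labels sits well above chance, information about the test data has probably leaked into training.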

32 Overview. Part I: Weka. Part II: MVPA. Machine Learning Exercises.

33 Lab Practical. Brief feature selection exercise. Explore classification of the Haxby et al. data using a combination of Weka and MVPA code.

34 Data: Haxby et al. (2001) Science Paper. Six subjects, 12 runs each. Each run consisted of viewing 8 object categories. Each object category was shown for 24 s (500 ms on, 1500 ms rest).


35 Weka Usage Notes
In Weka, features are called attributes. The input file format is the attribute-relation file format (.arff).
Useful flags:
-t <name of training file>  Sets the training file.
-T <name of test file>  Sets the test file. If missing, a cross-validation is performed.
-x <number of folds>  Specifies the number of cross-validation folds (default: 10).
-split-percentage <%>  Sets the percentage for the train/test set split, e.g., 66.
-preserve-order  Preserves the order in the percentage split.
-Xmx  Asks the Java virtual machine for more memory (useful!), e.g., -Xmx2g (2 GB).
More detail on Weka is available via online video MOOCs.
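
To show what an .arff file looks like and how the flags above fit together, the sketch below builds a small ARFF document and assembles, without running it, a Weka command line. The relation name, feature names, class values, and the paths `weka.jar` and `train.arff` are placeholders; J48 is Weka's decision tree classifier, used here only as an example.

```python
def to_arff(relation, feature_names, class_values, rows):
    """Render numeric features plus a nominal class attribute as ARFF text."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in feature_names]
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for *features, label in rows:
        lines.append(",".join(str(v) for v in features) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical examples: two voxel features and a class label per row.
rows = [(0.12, -1.3, "face"), (0.40, 0.8, "house"), (-0.05, -0.9, "face")]
arff_text = to_arff("toy_voxels", ["voxel_1", "voxel_2"], ["face", "house"], rows)
print(arff_text)

# Command line for 10-fold cross-validation with the J48 decision tree.
# "weka.jar" and "train.arff" are placeholder paths; -Xmx2g is a JVM option.
command = ["java", "-Xmx2g", "-cp", "weka.jar",
           "weka.classifiers.trees.J48", "-t", "train.arff", "-x", "10"]
print(" ".join(command))
```

Because no `-T` test file is given, Weka would fall back to cross-validation on the training file, here with the fold count set explicitly by `-x`.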
