Machine Learning Practical


Machine Learning Practical. Pamela K. Douglas, University of California, Los Angeles. August 6, 2015. 2015 NITP Summer Course.

Overview. Part I: Weka. Part II: MVPA. Machine Learning Exercises.

Why WEKA? 1. It allows you to easily test many machine learning (ML) classifiers that have been vetted by the ML community. 2. Many feature selection methods are available. 3. Running cross-validation (and nested cross-validation) is fast and easy.

" Inductive Bias In both supervised and unsupervised ML (and regression problems), the data by themselves are not sufficient find a unique solution from the hypothesis class of all possible models Machine learning is therefore an ill-posed problem. Additional assumptions are therefore required. These are called the model s InducEve Bias. - - - Line? Curve? Higher Order Polynomial? PK Douglas 2015, University of California, Los Angeles

" Inductive Bias The inductive bias can be related to an assumption made about the underlying distributions of the data (Parametric models) Alternatively, the model can assume a form for discriminant or boundary used to separate data exemplars from different classes. (Nonparametric) Parametric Nonparametric PK Douglas 2015, University of California, Los Angeles

1. Weka: Testing Multiple Classifiers. With any modelling technique, it is good practice to test multiple model hypotheses. Specifically, in ML there is the No Free Lunch Theorem: there is no single classifier that universally works best across all domains and data sets (Wolpert & Macready). (Figure: performance comparison of ML classifiers, % accuracy under 10-fold cross-validation as a function of the number of ICs; PK Douglas et al., NeuroImage, 56(2): 544-553, 2011.) Trying out different classifiers can be a good idea.
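
As a rough illustration of this point, here is a hedged scikit-learn sketch (not from the course materials) that compares several classifiers with 10-fold cross-validation; the synthetic dataset stands in for real neuroimaging features.

```python
# Compare several classifiers with 10-fold CV, in the spirit of the
# Weka experiment described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

classifiers = {
    "naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM (RBF)": SVC(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold CV accuracy
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```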

Classifying Decision Making With Decision Trees. (Figure: a decision tree over IC spatial masks, e.g. ICs 5, 13, 15, and 19, splitting belief vs. disbelief trials into nodes common to B, common to DB, and B > DB.)

" Features for Decoding Pattern classifiers operate on features. What are features? Features (or a/ributes) are descriptive variable categories that measure certain attributes of the data. Features can take on strings assignments. For example, sex can be either male or female. With neuroimaging data, however, features are often numerical. A typical training example may contain nominal or numeric values for each of the feature categories. PK Douglas 2015, University of California, Los Angeles

Many Types of Features Used in Decoding Neuroimaging Data. (Figure: example feature types and sources, including voxels (Cox & Savoy 2003), searchlights (Kriegeskorte et al. 2006), fc-MRI matrices (Dosenbach et al. 2010), graph theory metrics (Colby et al. 2012), effective connectivity (Brodersen et al. 2011), and independent components (Douglas et al. 2011).)

2. Weka: Many Feature Selection Algorithms Is feature selection important?

ADHD 200 Initiative. Public release of n = 973 subjects, including structural and resting-state fMRI data and demographic information from ADHD subtypes. Our team (3rd place) and others had thousands of neuroimaging features. A brief survey of Kaggle big-data ML competitions shows that winning teams use feature selection; here, the winning team used only demographic features! (www.brainmapping.org)

" Regularization In regularization, we write an augmented error function: cost = data misfit + λ complexity Regularization also limits the influence of outliers (Rätsch et al. 2001;Lemm et al. 2011) which may arise due to movement, and have shown to be particularly problematic if using rs-fcmri features (Powers et al. 2011). PK Douglas 2015, University of California, Los Angeles

" SVM: Inherent Regularization SVM provides an internal regularization step with its C parameter When C is large (tending towards the hard margin), it penalizes the error points more strongly and results in a smaller margin typically with more support vectors and therefore with a stronger tendency to overfit There is no one C parameter that fits all data best so this parameter must be tuned appropriately Test Error Curves SVM with Radial Kernel γ =5 γ =1 γ =0.5 γ =0.1 Different radial basis funceon kernel sizes highly relevant to size of searchlight Test Error 0.20 0.25 0.30 0.35 1e 01 1e+01 1e+03 1e 01 1e+01 1e+03 1e 01 1e+01 1e+03 1e 01 1e+01 1e+03 Most Regularized Wins PK Douglas 2015, University of California, Los Angeles C =1/λ Least Regularized is best HasEe et al. 2013

The Need for Reduction: Controversial?

Chu et al. 2012 Revisited. (Figure: a reproduction of Chu et al.'s Fig. 9E (NeuroImage 60 (2012) 59-70), comparing accuracy using all voxels against t-test + RFE feature selection at sample size 134, across varying C values C0-C5 and the optimum C*. From Kerr et al. 2014, a comment on Chu et al. 2012: the added shading indicates the 95% confidence interval for the no-feature-selection accuracy using the normal approximation of the binomial distribution. Accuracy using all voxelized features was not significantly higher than data-driven feature selection accuracy at the optimum C, C*; at multiple non-optimum C values, accuracy using data-driven feature selection was significantly higher than using all voxelized features.)

3. Cross Validation

Hyperparameters. Hyperparameter: a value critical to the structure of a model, but one that is not optimized jointly with the model's inherent parameters.

K Nearest Neighbor. The number of neighbors, K, that influence the decision is a hyperparameter.

Hyperparameter Optimization via Nested Cross Validation. All neighbors have an equal vote, and K is generally an odd number to avoid a tie vote; the number of neighbors that vote should be optimized (Alpaydin 2004). Nested cross-validation can be used to simultaneously tune the number of features and the blending ratio (akin to the number of neighbours), as in the sketch below.
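
A minimal nested cross-validation sketch (synthetic data, scikit-learn): the inner loop tunes K over odd values, the outer loop estimates generalization accuracy with the tuning included.

```python
# Nested CV: inner folds tune the hyperparameter, outer folds evaluate.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

inner = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5, 7, 9, 11]},  # odd K only, to avoid tie votes
    cv=5,                                   # inner folds: tune K
)
outer_scores = cross_val_score(inner, X, y, cv=5)  # outer folds: evaluate
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```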

3. Cross Validation

Nested Cross-Validation: A Useful Practice for Tuning Hyperparameters

Nested Cross-Validation

" Interpretation Beyond Accuracy is Tricky. Although tempting, interpreting decoding weight vectors as meaningful may lead to false conclusions (Guyon & Eisseloff 2002; Haufe et al. 2014). A feature with almost zero class-specific information is given a higher weight than a feature containing a high degree of information (Haufe et al. 2014) PK Douglas 12/8/14

" Geometry of the SVM g(x)=w x+ w 0 =0 g(x)<0 g(x)>0 S;mulus class 2 Choose: C1 if g(x)>0 C2 else Support Vectors outlined in red S;mulus Class 1 SVM is interested in finding the maximum margin hyperplane that shatters the distance between support vectors or difficult points (examples near the boundary) on either side. PK Douglas 2015, University of California, Los Angeles

" Interpretation Beyond Accuracy is Tricky. Although tempting, interpreting decoding weight vectors as meaningful may lead to false conclusions (Guyon & Eisseloff 2002; Haufe et al. 2014). PK Douglas 12/8/14 A number of factors can influence interpretability including: - feature covariance (or lack thereof) - Kernel applied - Classifier used, etc.

Overview. Part I: Weka. Part II: MVPA. Machine Learning Exercises.

Benefits of Using MVPA. 1. Loads brain images directly and extracts voxel features for you. 2. Many feature selection methods available that are specifically designed for fMRI data (searchlight, etc.). 3. Built-in tools for parameter tuning via nested cross-validation, etc. 4. User-friendly tools for running permutation tests. 5. Runs on Matlab or Python (PyMVPA).

MVPA Has Many Options. 1. One may extract the time point at the estimated HRF peak as a feature, or use the mean of a few points near the estimated peak. 2. Alternatively, fits to the expected HRF response can be generated, and the features for that trial can be: the beta value for that trial, t values derived from the beta values, or % signal change beta values.
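
A hedged sketch of the first two options above; the TR, the 6 s peak delay, and the array shapes are assumptions for illustration, not values from the course.

```python
# Two simple HRF-based feature options: the value at the estimated peak,
# or the mean of a few time points around it.
import numpy as np

TR = 2.0          # repetition time in seconds (assumed)
PEAK_DELAY = 6.0  # canonical HRF peak ~6 s after stimulus onset (assumed)

def peak_feature(voxel_ts, onset_idx):
    """Single time point at the estimated HRF peak."""
    return voxel_ts[onset_idx + int(PEAK_DELAY / TR)]

def mean_peak_feature(voxel_ts, onset_idx, half_width=1):
    """Mean of a few points around the estimated peak."""
    p = onset_idx + int(PEAK_DELAY / TR)
    return voxel_ts[p - half_width : p + half_width + 1].mean()

ts = np.random.default_rng(0).normal(size=100)  # fake voxel time series
print(peak_feature(ts, onset_idx=10), mean_peak_feature(ts, onset_idx=10))
```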

" MVPA: Many Options for Permutation Tests In order to double check that you have not unknowingly peeked you can run a permutation test?? Training Phase Test Phase PK Douglas 2015, University of California, Los Angeles

" Permutation A good sanity check! In order to double check that you have not unknowingly peeked you can run a permutation test In this case the labels of the data exemplars are scrambled or shuffled randomly?? Training Phase Test Phase PK Douglas 2015, University of California, Los Angeles

" Permutation A good sanity check! In order to double check that you have not unknowingly peeked you can run a permutation test In this case the labels of the data exemplars are scrambled or shuffled randomly You now should verify that the accuracy outcome is at chance level?? Training Phase Test Phase PK Douglas 2015, University of California, Los Angeles

Overview. Part I: Weka. Part II: MVPA. Machine Learning Exercises.

Lab Practical. Brief feature selection exercise. Explore classification of the Haxby et al. 2001 data using a combination of Weka and MVPA code.

Data: Haxby et al. (2001) Science Paper. Six subjects, 12 runs each. Each run consisted of viewing 8 object categories; each object category was shown for 24 sec (500 ms on, 1500 ms rest).

Weka Usage Notes. In Weka, features are called attributes, and the input file format is the attribute-relation file format (.arff). Useful flags:
-t <name of training file>  Sets the training file.
-T <name of test file>  Sets the test file; if missing, cross-validation is performed.
-x <number of folds>  Specifies the number of cross-validation folds (default: 10).
-split-percentage <%>  Sets the percentage for the train/test set split, e.g., 66.
-preserve-order  Preserves the order in the percentage split.
-Xmx<size>  JVM flag to ask for more memory (useful!), e.g., -Xmx2g for 2 GB.
More detail on Weka is available via online video MOOCs.
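
For instance, the flags above might be combined as follows. This is a hedged sketch: the classifier class (J48 is a standard Weka classifier), the .arff file name, and the weka.jar location are placeholders for your own setup.

```python
# Invoke Weka's command-line interface with the flags listed above.
import subprocess

cmd = [
    "java", "-Xmx2g",              # ask the JVM for 2 GB of memory
    "-cp", "weka.jar",             # path to your Weka install's jar
    "weka.classifiers.trees.J48",  # any Weka classifier class
    "-t", "train.arff",            # training file (attribute-relation format)
    "-x", "10",                    # 10-fold cross-validation (no -T given)
]
subprocess.run(cmd, check=True)
```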