Spatial regularization and sparsity for brain mapping

Spatial regularization and sparsity for brain mapping
Bertrand Thirion, INRIA Saclay-Île-de-France, Parietal team
http://parietal.saclay.inria.fr
bertrand.thirion@inria.fr

FMRI data analysis pipeline: a complex, metabolic-pathway-like chain of processing steps (figure).

Statistical inference & MVPA
Question 1: Is there any effect? Omnibus test. MVPA: can I discriminate between the two conditions?
Question 2: What regions actually display a difference between the two conditions? MVPA: what is the support of the discriminative pattern?

Outline
- Machine learning techniques for MVPA in neuroimaging
- Improving the decoder: smoothness and sparsity
- Recovery and randomness

Reverse inference: combining the information from different regions. It aims at decoding brain activity, i.e. predicting a cognitive variable from activation patterns [Dehaene et al. 1998], [Haxby et al. 2001], [Cox et al. 2003].

Predictive linear model
y = f(X, w, b) + noise
- y is the behavioral variable
- X ∈ R^(n×p) is the data matrix, i.e. the activation maps
- (w, b) are the parameters to be estimated
- n activation maps (samples), p voxels (features)
Regression setting: y ∈ R^n, f(X, w, b) = Xw + b
Classification setting: y ∈ {-1, 1}^n, f(X, w, b) = sign(Xw + b), where sign denotes the sign function.
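
As an illustration of such a linear decoder, here is a minimal sketch using scikit-learn; the synthetic data and the particular estimators (Ridge, LinearSVC) are illustrative assumptions, not prescriptions from the talk.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVC

# Synthetic stand-in for fMRI data: n activation maps, p voxels (p >> n)
rng = np.random.RandomState(0)
n, p = 100, 5000
X = rng.randn(n, p)

# Regression setting: y in R^n, f(X, w, b) = Xw + b
y_reg = X[:, :10].sum(axis=1) + rng.randn(n)
reg = Ridge(alpha=1.0).fit(X, y_reg)       # w = reg.coef_, b = reg.intercept_

# Classification setting: y in {-1, 1}^n, f(X, w, b) = sign(Xw + b)
y_clf = np.sign(X[:, :10].sum(axis=1))
clf = LinearSVC(C=1.0).fit(X, y_clf)       # w = clf.coef_, b = clf.intercept_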

Curse of dimensionality in MVPA
Problem: p >> n, so the model overfits the noise in the training data.
Solutions:
- Prior region selection: prior selection of brain regions, but the result is bound to that prior
- Data-driven feature selection (e.g. ANOVA, RFE): univariate methods (ANOVA) have no optimality guarantee; multivariate methods face a combinatorial problem and a high computational cost
- Regularization (e.g. Lasso, Elastic net): shrink w according to your prior
A small feature-selection pipeline is sketched after this list.
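
A minimal sketch of data-driven (ANOVA) feature selection followed by a linear classifier, using scikit-learn; the number of voxels kept (k=500) is an illustrative assumption.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC

# Univariate (ANOVA) screening of voxels, then a linear SVM on the survivors
anova_svm = Pipeline([
    ("anova", SelectKBest(f_classif, k=500)),  # keep the 500 most discriminative voxels
    ("svm", LinearSVC(C=1.0)),
])
anova_svm.fit(X, y_clf)  # X, y_clf as in the previous sketch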

Training a predictive model
Learning w from a given training set (y, X).
Choice of the loss:
- Regression: least-squares, hinge, Huber
- Classification: hinge, logistic
Choice of the regularizer:
- Convex setting: a norm on w
- Bayesian setting: a prior distribution on w

Evaluation of the decoding
Prediction accuracy:
- Regression: coefficient of determination R²
- Classification: classification accuracy κ
These quantify the amount of information shared by the pattern and y.
Layout of the resulting maps of weights: do we have any guarantee to recover the true discriminative pattern?
Common hypothesis = segregation into functionally specific territories:
- sparse: few relevant regions are implied
- compact structure: grouping into connected clusters
A cross-validated evaluation is sketched below.
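
A minimal sketch of how such scores are typically obtained by cross-validation in scikit-learn; the 5-fold split is an illustrative assumption, and the estimators are those from the earlier sketches.

from sklearn.model_selection import cross_val_score

# Classification accuracy, estimated by 5-fold cross-validation
acc = cross_val_score(anova_svm, X, y_clf, cv=5, scoring="accuracy")

# Coefficient of determination R^2 for the regression decoder
r2 = cross_val_score(reg, X, y_reg, cv=5, scoring="r2")

print(acc.mean(), r2.mean())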

You said: recovery?
MVPA cannot recover the true sources, as it aims at finding a good discriminative model ("filters"), not at estimating the signal. A correction taking the covariance structure into account is necessary. However, this can be improved by choosing relevant priors: you might want a discriminative model that makes sense to you [Haufe et al. NIMG 2013].

Outline
- Machine learning techniques for MVPA in neuroimaging
- Improving the decoder: smoothness and sparsity
- Recovery and randomness

Regularization framework
w is the discriminative pattern: constrain w to select few parameters that explain the data well.
Penalized regression: ŵ = argmin_w ℓ(y, Xw) + λ J(w), where ℓ(y, Xw) is the loss function (usually least squares for regression) and λ J(w) is the penalization term:
- Ridge (no sparsity)
- Lasso (very sparse)
- Elastic net (sparsity + grouping)
- Smooth lasso (sparsity + smoothness)
- Total variation (piecewise-constant sparsity)
A sketch comparing some of these penalties follows this list.
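
A minimal sketch comparing some of these penalties on the same data with scikit-learn estimators; the regularization strengths are illustrative assumptions, and smooth lasso / total variation are not in scikit-learn, so they are omitted here.

from sklearn.linear_model import Ridge, Lasso, ElasticNet

models = {
    "ridge": Ridge(alpha=1.0),                    # no sparsity
    "lasso": Lasso(alpha=0.1),                    # very sparse
    "enet": ElasticNet(alpha=0.1, l1_ratio=0.5),  # sparsity + grouping
}
for name, model in models.items():
    model.fit(X, y_reg)                           # X, y_reg as in the earlier sketch
    n_nonzero = (model.coef_ != 0).sum()
    print(name, "non-zero coefficients:", n_nonzero)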

Priors and penalization: brain decoding = an engineering problem?
A prior on the relevant activation maps corresponds to a penalization in regularized regression, i.e. the design of a norm on w to be minimized.
Example: Total Variation penalization [Michel et al. 2011]

Do we need to bother about sparsity?
Is brain activation (connectivity, ...) sparse? No! But...
In neuroscience, people estimate discriminative patterns that look like (figure), while in a neuroimaging article the result will look more like (figure). If you want to show the truly discriminative pattern, you need it to be sparse!

Solution: (F)ISTA
- From w(t), take a gradient descent step on the smooth terms
- Obtain w(t+1) by projection onto the non-smooth constraints (proximal step)
- Lasso: the proximal operator is simply soft-thresholding
- FISTA = accelerated ISTA (much faster convergence)
A toy ISTA implementation for the Lasso is sketched below.
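
A toy NumPy implementation of ISTA for the Lasso, a minimal sketch to make the two steps concrete; the step size, regularization value, and iteration count are illustrative assumptions, and the FISTA acceleration is omitted.

import numpy as np

def soft_threshold(v, threshold):
    """Proximal operator of the l1 norm: element-wise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def ista_lasso(X, y, lam=0.1, n_iter=200):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 with ISTA."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2            # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                          # gradient step on the least-squares term
        w = soft_threshold(w - step * grad, step * lam)   # proximal (soft-thresholding) step
    return w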

The smooth lasso: the proximal operator combines smoothness and sparsity (figure: weight maps for an increasingly strong penalty).

Sparse total variation: the proximal operator combines a small TV term with sparsity (figure: weight maps for an increasingly strong penalty).

What do the results look like? (figure: encoding, elastic net decoding, sparse flat decoding [Gramfort et al. PRNI 2013]) They can nevertheless be improved with adapted techniques.

Performance on recovery (simulation). Example of recovery on simulated data: the TV-l1 prior outperforms the alternatives.

Caveat: the resulting map depends on the convergence tolerance. With the TV-l1 estimator, stricter convergence yields a different, sparser map! [Dohmatob et al. PRNI 2014]

Discussion
- Bayesian alternatives (ARD, smooth ARD) [Sabuncu et al.]: you lose convexity, but empirical Bayes adapts well to new data
- Cost of these methods: convergence monitoring is hard
- Smoothing + ANOVA selection + SVM is a good competitor...
- Other approaches: use of clustering for structured sparsity [Jenatton et al. SIAM 2012], even more costly!

Outline
- Machine learning techniques for MVPA in neuroimaging
- Improving the decoder: smoothness and sparsity
- Recovery and randomness

Recovery: prediction vs. identification
- Prediction: estimate ŵ that maximizes the prediction accuracy
- Identification or recovery: estimate ŵ such that supp(ŵ) = supp(w)
Compressive sensing: detection of k relevant signals out of p (voxels) with only n << p observations. Problem: the data are correlated.
How to measure the recovery of the set of regions? How to improve recovery?
A small support-recovery metric is sketched below.
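
One simple way to quantify support recovery, as a minimal sketch; using the Jaccard overlap between supports as the score is an illustrative choice, not a metric from the talk.

import numpy as np

def support_recovery(w_hat, w_true, tol=1e-8):
    """Jaccard overlap between the estimated and true supports."""
    s_hat = np.abs(w_hat) > tol
    s_true = np.abs(w_true) > tol
    inter = np.logical_and(s_hat, s_true).sum()
    union = np.logical_or(s_hat, s_true).sum()
    return inter / union if union else 1.0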

Small-sample recovery on the [Haxby, Science 2001] dataset: trying to discriminate faces vs. houses, what level of performance is achieved with a limited number of samples? (figure)

Randomization
(figure: Lasso path vs. stability path of the Lasso)
Stability selection = randomization of the features + bootstrap on the samples. Improved feature recovery... but only for few, weakly correlated features [Meinshausen and Bühlmann, 2009].
A small stability-selection sketch follows.
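
A minimal sketch of stability selection with the Lasso, assuming subsampling of half the samples plus random feature rescaling as the perturbation; the number of resamplings, the rescaling range, and the selection threshold are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, lam=0.1, n_resamplings=100, threshold=0.6, seed=0):
    rng = np.random.RandomState(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamplings):
        idx = rng.choice(n, size=n // 2, replace=False)   # subsample half of the samples
        scaling = rng.uniform(0.5, 1.0, size=p)           # randomly rescale (perturb) the features
        lasso = Lasso(alpha=lam).fit(X[idx] * scaling, y[idx])
        counts += lasso.coef_ != 0                        # accumulate selected features
    selection_freq = counts / n_resamplings
    return selection_freq >= threshold                    # stable support estimate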

Hierarchical clustering and randomized selection
Algorithm Randomized-Ward-Logistic:
(1) Loop: randomly perturb the data
(2) Ward agglomeration to form q features
(3) sparse linear model on the reduced features
(4) accumulate the non-zero features
(5) threshold the map of selection counts
[Gramfort et al. MLINI 2011]
A sketch of this scheme built from scikit-learn components follows.
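
A minimal sketch of this scheme using scikit-learn building blocks; the number of clusters q, the number of iterations, and the use of plain subsampling as the data perturbation are illustrative assumptions, so the original algorithm may differ in its details.

import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.linear_model import LogisticRegression

def randomized_ward_logistic(X, y, q=200, n_iter=50, threshold=0.5, seed=0):
    rng = np.random.RandomState(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(0.8 * n), replace=False)  # (1) perturb the data by subsampling
        ward = FeatureAgglomeration(n_clusters=q)               # (2) Ward agglomeration to q features
        X_red = ward.fit_transform(X[idx])
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        clf.fit(X_red, y[idx])                                  # (3) sparse linear model on reduced features
        selected = ward.inverse_transform(clf.coef_.ravel()) != 0
        counts += selected                                      # (4) accumulate non-zero features
    return counts / n_iter >= threshold                         # (5) threshold the selection counts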

Simulation study (figure: ground truth vs. F test vs. randomized Ward logistic maps).

The best approach for feature recovery depends on the characteristics of the problem: smoothness (coupling between signal and noise) and clustering (redundancy of features). (figure: results for 128 and 256 samples) [Varoquaux et al. ICML 2012]

Simulation study: identification accuracy and prediction accuracy (figure). The approach improves both prediction and identification!

Examples on real data: a regression task [Jimura et al. 2011] and a classification task [Haxby et al. 2001] (figures).

Conclusion
- SVM and sparse models are less powerful than univariate methods for recovery
- Sparsity + clustering + randomization: excellent recovery
- Multivariate brain mapping: simultaneous prediction and recovery
- High computational cost (parameter setting)

Acknowledgements
Many thanks to my co-workers: V. Michel, G. Varoquaux, A. Gramfort, F. Pedregosa, P. Fillard, J.B. Poline, V. Fritsch, V. Siless, S. Medina, R. Bricquet.
And to the people who provided data: E. Eger, R. Poldrack, K. Jimura, J. Haxby.

All this will land into... nilearn: machine learning for neuroimaging, http://nilearn.github.io
- Scikit-learn-like API
- BSD license, Python, open-source software
- Classification of neuroimaging data (decoding)
- Functional connectivity analysis
A minimal decoding sketch with nilearn is given after this list.
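
A minimal decoding sketch with nilearn, assuming its Haxby dataset fetcher and NiftiMasker; attribute and module names may differ across nilearn versions, and this is not code from the talk.

import pandas as pd
from nilearn import datasets
from nilearn.input_data import NiftiMasker   # nilearn.maskers.NiftiMasker in recent versions
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

haxby = datasets.fetch_haxby()                            # one subject of the Haxby 2001 dataset
behavioral = pd.read_csv(haxby.session_target[0], sep=" ")
conditions = behavioral["labels"]
keep = conditions.isin(["face", "house"])                 # faces vs. houses discrimination

# Extract voxel time series within the ventral temporal mask
masker = NiftiMasker(mask_img=haxby.mask_vt[0], standardize=True)
X = masker.fit_transform(haxby.func[0])[keep.values]
y = conditions[keep].values

print(cross_val_score(LinearSVC(), X, y, cv=5).mean())    # cross-validated decoding accuracy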

Thank you for your attention. http://parietal.saclay.inria.fr