Deep Learning: An Overview Bradley J Erickson, MD PhD Mayo Clinic, Rochester Medical Imaging Informatics and Teleradiology Conference 1:30-2:05pm June 17, 2016
Disclosures Relationships with commercial interests: Board of OneMedNet Board of VoiceIT
What is Machine Learning? It is a part of Artificial Intelligence. It finds patterns in data: patterns that reflect properties of examples (supervised), or patterns that separate examples (unsupervised). (Other types of artificial intelligence include rule-based systems.)
Machine Learning Classes: Supervised (ANN, SVM, Random Forest, Bayes, DNN); Unsupervised (Clustering, Adaptive Resonance); Reinforced
Machine Learning History: Artificial Neural Networks (ANN) were the starting point of machine learning, but early versions didn't work well. Other machine learning methods: Naïve Bayes, Support Vector Machine (SVM), Random Forest Classifier (RFC)
Artificial Neural Network/Perceptron [Diagram, built up over four slides: an input layer (T1 Pre, T1 Post, T2 intensities, e.g. 45, 322, 128), a hidden layer of nodes each applying an activation function f(σ) to a weighted sum (example values 57, 418, -68, 34, 312), and an output layer whose thresholded nodes fire 1 (Tumor) or 0 (Brain).]
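The forward pass illustrated on these slides can be sketched in a few lines. This is a minimal sketch: the input intensities (45, 322, 128) follow the slide, but the weights and the threshold values are invented for illustration.

```python
def step(x, threshold=0.0):
    """Threshold activation: fire (1) if the weighted sum exceeds the threshold."""
    return 1 if x > threshold else 0

def forward(inputs, weight_layers):
    """Propagate inputs through successive fully connected layers."""
    activations = inputs
    for weights in weight_layers:
        # Each node's value: sum of prior-layer values times that node's weights.
        activations = [step(sum(a * w for a, w in zip(activations, node)))
                       for node in weights]
    return activations

inputs = [45.0, 322.0, 128.0]                    # T1 Pre, T1 Post, T2 intensities
hidden = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]    # two hidden nodes (illustrative weights)
output = [[1.0, -1.0], [-1.0, 1.0]]              # "tumor" and "brain" output nodes
print(forward(inputs, [hidden, output]))         # → [0, 1]: the "brain" node fires
```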
How ANNs Learn. Propagation: multiply each prior-layer node value by its weight and sum; then apply an activation function, e.g. threshold the sum. Weight update: compute error = actual output − expected output; weight gradient = error × input value; new weight = old weight − learning rate × gradient
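A minimal sketch of that weight-update step for a single weight, with invented values (note the new weight moves against the gradient: old weight minus learning rate times gradient):

```python
def update_weight(old_weight, input_value, actual, expected, learning_rate=0.01):
    error = actual - expected        # how far off the node's output was
    gradient = error * input_value   # gradient of the squared error w.r.t. this weight (up to a constant)
    return old_weight - learning_rate * gradient   # step against the gradient

w = update_weight(old_weight=0.5, input_value=2.0, actual=1.0, expected=0.0)
print(w)   # 0.5 - 0.01 * (1.0 * 2.0) = 0.48
```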
Learning = Optimization Problem. Learning depends on: correct gradient directions; correct gradient multiplier (learning rate). [Plot: an error surface showing a global minimum, a local minimum, and a small-gradient region.]
Support Vector Machines: map the input data to a new space via a kernel function; create the hyperplane that best separates the classes in that space
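The "map to a new space" idea can be shown with a toy example (invented data, no SVM library): the 1-D points below are not separable by a single threshold, but mapping each x to (x, x·x) makes a separating line possible in 2-D.

```python
# Class A surrounds class B on the number line: no 1-D threshold separates them.
points = [(-2, "A"), (-1, "B"), (1, "B"), (2, "A")]

# Kernel-style mapping into 2-D: x -> (x, x*x).
mapped = [((x, x * x), label) for x, label in points]

# In the mapped space the horizontal line y = 2.5 separates the classes:
# class A has x*x = 4 (above the line), class B has x*x = 1 (below it).
predict = lambda x: "A" if x * x > 2.5 else "B"
print([predict(x) == label for x, label in points])   # → [True, True, True, True]
```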
Deep Learning: Why the Hype? Performance in the ImageNet Challenge:

Team / Software                       Year   Error Rate
XRCE (not deep learning)              2011   25.8%
SuperVision (AlexNet)                 2012   16.4%
Clarifai                              2013   11.7%
GoogLeNet (Inception)                 2014   6.66%
Andrej Karpathy (human comparison)    2014   5.1%
BN-Inception (arXiv)                  2015   4.9%
Inception-v3 (arXiv)                  2015   3.46%
What is Deep Learning? Deep because it uses many layers: classic ANNs typically had 3 or fewer layers; DNNs have 15+ layers
Types of DNNs: Convolutional Neural Network (CNN). Early layers take windows of the image as input; each window is multiplied element-wise by a kernel and summed to produce one output value. This is known as a convolution. [Worked example, built up over several slides: a 5×5 image convolved with the 3×3 kernel [[1,2,1],[2,4,2],[1,2,1]], each windowed sum divided by 9 — the first two output values are 53 and 67.]
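The convolution worked through on these slides can be sketched directly; the image, kernel, and the divide-by-9 normalization all follow the slide values.

```python
image = [
    [22, 13,  0, 31, 71],
    [14, 27, 28, 43, 21],
    [18, 64, 89, 65, 32],
    [44, 55, 32, 41,  4],
    [21, 32, 15, 33,  7],
]
kernel = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]

def convolve(image, kernel, norm=9):
    """Valid 2-D convolution (no padding): slide the kernel over the image,
    multiply element-wise, sum, and normalize by `norm`."""
    k = len(kernel)
    out_size = len(image) - k + 1
    out = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            total = sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
            row.append(round(total / norm))
        out.append(row)
    return out

result = convolve(image, kernel)
print(result[0][:2])   # → [53, 67], matching the slide's first two outputs
```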
Why the Excitement Now? Advances That Addressed Problems Many layers -> Overfitting Implement sparsity in weights: Dropout
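Dropout can be sketched in one function. This is the "inverted dropout" variant, a common formulation: each activation is zeroed with probability p during training, and survivors are scaled so the expected activation is unchanged at test time. The layer values are invented.

```python
import random

def dropout(activations, p=0.5, rng=random):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

random.seed(0)
layer = [0.3, 1.2, 0.7, 0.9]
print(dropout(layer, p=0.5))   # roughly half the values become 0.0, the rest double
```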
Why the Excitement Now? Advances That Addressed Problems: Many layers -> Vanishing Gradients. Dropout partially addresses this. Can also use pre-trained weights for the early layers and fix (freeze) those, while the weights of later layers remain free to learn higher-level features
Typical CNNs Convolution Pooling Pooling Convolution Pooling Fully Connected
Typical CNNs Andrej Karpathy: http://karpathy.github.io/2015/10/25/selfie/
Why the Excitement Now? Batch Normalization. What should the initial weights connecting nodes be? All the same = no gradients. Random — but what range of values? BatchNorm: after each convolutional layer, subtract the mean and divide by the standard deviation. Simple but effective
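The normalization step itself is a few lines. A minimal sketch on an invented batch of values (real BatchNorm normalizes per feature over a mini-batch and also learns a scale and shift, omitted here):

```python
def batch_norm(batch, eps=1e-5):
    """Subtract the batch mean, divide by the batch standard deviation."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]

print(batch_norm([2.0, 4.0, 6.0, 8.0]))   # zero-mean, unit-variance values
```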
Why the Excitement Now? Residual Networks. A residual (skip) connection defines if and how data pass through unchanged from layer to layer. Makes construction of very deep networks reliable. *Targ, ICLR 2016
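The core of a residual block is a single addition. A minimal sketch, where the layer transformation is a stand-in for a real conv/activation stack:

```python
def layer_transform(x):
    """Stand-in for a convolution/activation stack; illustrative only."""
    return [0.5 * v for v in x]

def residual_block(x):
    """Output = F(x) + x: the input skips around the transformation,
    giving gradients a direct path through the addition."""
    fx = layer_transform(x)
    return [a + b for a, b in zip(fx, x)]

print(residual_block([1.0, 2.0, 3.0]))   # → [1.5, 3.0, 4.5]
```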
Why the Excitement Now? Deep Neural Network Theory. Exponential Compute Power Growth
Moore's Law: Computing performance doubles approximately every 18 months
Exponentials In Real Life. If you put 1 drop of water into a football stadium and then double the number of drops each minute: at 5 minutes, you will have 32 drops; at 45 minutes, the field will be covered 1 inch deep; at 55 minutes, the stadium will be full. It is not natural for humans to grasp exponential growth
Deep Learning Works Well on GPUs. Naturally parallel. Lower precision (single-precision FP) can actually be an advantage. Now building cards with no video output, optimized for deep learning (e.g. the P100)
GPUs are Beating Moore's Law [Chart: compute performance on a log scale, 2000–2020, with CPU growth lagging GPU, FPGA, and TPU curves.]
Deep Learning Myths You Need Millions of Exams to Train and Use Deep Learning Methods
Ways To Avoid Need For Large Data Sets: Data Augmentation. Essentially, creating variants of the data that are different enough to be learnable, yet similar enough that the teaching point is kept. Mirror/Flip/Rotate/Contrast/Crop
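A few of those augmentations sketched on a tiny invented "image" (a 2-D list); real pipelines also jitter contrast and crop:

```python
def mirror(img):     # left-right mirror
    return [row[::-1] for row in img]

def flip(img):       # up-down flip
    return img[::-1]

def rotate90(img):   # rotate 90 degrees clockwise
    return [list(row) for row in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]
print(mirror(img))     # → [[2, 1], [4, 3]]
print(flip(img))       # → [[3, 4], [1, 2]]
print(rotate90(img))   # → [[3, 1], [4, 2]]
```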
Ways To Avoid Need For Large Data Sets: Transfer Learning. [Diagram, built up over three slides: Image → Conv, Conv, MaxPool (×3) → Fully Connected (×3) → SoftMax.] First train on a large corpus like ImageNet; then freeze the early (convolutional) layers and train only the later layers on your own data
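The freeze-early/train-late idea can be sketched with a per-layer "trainable" flag: gradient updates are applied only to unfrozen layers. Layer names and values here are invented for illustration.

```python
layers = {
    "conv1": {"weight": 0.5, "trainable": False},   # frozen, pre-trained
    "conv2": {"weight": 0.3, "trainable": False},   # frozen, pre-trained
    "fc":    {"weight": 2.0, "trainable": True},    # trained on our own data
}

def apply_updates(layers, gradients, lr=0.5):
    """Gradient-descent step that skips frozen layers."""
    for name, layer in layers.items():
        if layer["trainable"]:
            layer["weight"] -= lr * gradients[name]

apply_updates(layers, {"conv1": 1.0, "conv2": 1.0, "fc": 1.0})
print(layers["conv1"]["weight"], layers["fc"]["weight"])   # → 0.5 1.5
```

Only the fully connected layer moved; the frozen convolutional weights kept their pre-trained values.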
Take Home Point: Deep Learning learns features and connections, vs. just connections. [Diagram: traditional pipeline = hand-crafted feature extraction → classifier learning; deep learning pipeline = learned feature extractor → classifier.]
Examples of CNN in Medical Imaging: Body Part *Roth, Arxiv 2016
Examples of CNN in Medical Imaging: Segmentation *Moeskops, IEEE-TMI, 2016
Mayo: AutoEncoder for Segmentation. Dataset: trained on BraTS 2015, FLAIR enhancing signal. Preprocessing: N4 bias correction, Nyúl intensity normalization. Autoencoders trained on 110,000 ROIs (size = 12). Time: 1 hour for 155 slices (a DNN would take days or weeks). Korfiatis, Submitted
What is an AutoEncoder? Korfiatis, Submitted
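The autoencoder idea in miniature: an encoder compresses the input to a smaller code, a decoder reconstructs it, and training would minimize the reconstruction error. The weights below are fixed and invented, not trained — this only illustrates the structure.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

encoder = [[0.5, 0.5, 0.0, 0.0],    # 4 inputs -> 2-value code (the bottleneck)
           [0.0, 0.0, 0.5, 0.5]]
decoder = [[1.0, 0.0],              # 2-value code -> 4 reconstructed outputs
           [1.0, 0.0],
           [0.0, 1.0],
           [0.0, 1.0]]

x = [1.0, 1.0, 3.0, 3.0]
code = matvec(encoder, x)                 # compressed representation: [1.0, 3.0]
reconstruction = matvec(decoder, code)    # back to 4 values
error = sum((a - b) ** 2 for a, b in zip(x, reconstruction))
print(code, error)   # → [1.0, 3.0] 0.0 — this input reconstructs perfectly
```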
Dice = 0.92 on the BraTS dataset. Korfiatis, Submitted
Machine Learning & Radiomics. Computers find textures reflecting genomics: 1p19q. 85 subjects with FISH results; computed multiple textures; classified with SVM:

Method        # Features   Sens   Spec   F-score   Accuracy
SVM           10           0.91   0.87   0.93      0.91
SVM           10           0.95   0.93   0.96      0.95
Naïve Bayes   12           0.95   0.77   0.92      0.89

Erickson, Proc ASNR, 2016
Machine Learning & Radiomics 155 Subjects, GBM, MGMT Methylation Compute textures (T2 was best) -> SVM Korfiatis, Med Phys, 2016
Deep Learning: MGMT Methylation. Same set of patients, using VGGNet with transfer learning: Az = 0.86. An autoencoder gives nearly as good performance and trains about 10× faster. Now testing DeepMedic and RNNs. Korfiatis, unpublished
The Pace of Change
Will Computers Replace Radiologists? Deep Learning will likely be able to create reports for diagnostic images in the future. 5 years: mammo & CXR. 10 years: CT head, chest, abd, pelvis; MR head, knee, shoulder; US: liver, thyroid, carotids. 15–20 years: most diagnostic imaging. Computers will likely see more than we do today. This will allow radiologists to focus on patient interaction and invasive procedures
How Might Medicine Best Embrace Deep Learning
How Might Medicine Best Embrace Deep Learning? Algorithms for machine learning are rapidly improving: CNNs are not the only game in town. Hardware for machine learning is REALLY rapidly improving. The amount of change in 20 years will be unbelievable
How Might Medicine Best Embrace Deep Learning? Medicine needs to remain flexible about hardware and software. The VALUE is in the data and metadata. Physicians are OBLIGATED to make sure the data are properly handled: improper interpretation of data will lead to bad implementations and poor patient care, and non-cooperation is also counter-productive