Training Neural Networks, Part I. Fei-Fei Li & Justin Johnson & Serena Yeung. Lecture 6-1

Lecture 6: Training Neural Networks, Part I Lecture 6-1

Administrative: Assignment 1 due Thursday (today), 11:59pm on Canvas. Assignment 2 out today. Project proposal due Tuesday April 25. Notes on backprop for a linear layer and vector/tensor derivatives are linked to Lecture 4 on the syllabus. Lecture 6-2

Where we are now... Computational graphs. [Figure: a graph with inputs x and W, a multiply node producing the scores s, a hinge-loss node, a regularization term R(W), and a sum node giving the total loss L.] Lecture 6-3

Where we are now... Neural Networks. Linear score function: f = Wx. 2-layer Neural Network: f = W2 max(0, W1 x), e.g. x has 3072 dimensions, the hidden layer h has 100, and the scores s have 10. Lecture 6-4

Where we are now... Convolutional Neural Networks Illustration of LeCun et al. 1998 from CS231n 2017 Lecture 1 Lecture 6-5

Where we are now... Convolutional Layer: convolve (slide) a 5x5x3 filter over all spatial locations of a 32x32x3 image, producing a 28x28x1 activation map. Lecture 6-6

Where we are now... Convolutional Layer: for example, if we had six 5x5 filters, we'll get 6 separate 28x28 activation maps. We stack these up to get a new "image" of size 28x28x6! Lecture 6-7

Where we are now... Learning network parameters through optimization Landscape image is CC0 1.0 public domain Walking man image is CC0 1.0 public domain Lecture 6-8

Where we are now... Mini-batch SGD Loop: 1. Sample a batch of data 2. Forward prop it through the graph (network), get loss 3. Backprop to calculate the gradients 4. Update the parameters using the gradient Lecture 6-9
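
A minimal sketch of that loop in NumPy-flavoured Python, assuming a hypothetical model object whose loss(X, y) method does the forward and backward pass and whose params dict holds the weights (none of these names are the course's actual API):

import numpy as np

def sgd_step(model, X_batch, y_batch, learning_rate=1e-3):
    # Forward prop the batch through the model; backprop gives the gradients.
    loss, grads = model.loss(X_batch, y_batch)
    # Update each parameter with plain gradient descent.
    for name in model.params:
        model.params[name] -= learning_rate * grads[name]
    return loss

# Training loop (sketch):
# for it in range(num_iterations):
#     idx = np.random.choice(X_train.shape[0], batch_size, replace=False)  # 1. sample a batch
#     loss = sgd_step(model, X_train[idx], y_train[idx])                   # 2-4. forward, backprop, update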

Next: Training Neural Networks Lecture 6-10

Overview 1. One time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking 2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization 3. Evaluation: model ensembles Lecture 6-11

Part 1: Activation Functions, Data Preprocessing, Weight Initialization, Batch Normalization, Babysitting the Learning Process, Hyperparameter Optimization Lecture 6-12

Activation Functions Lecture 6-13

Activation Functions Lecture 6-14

Activation Functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU Lecture 6-15

Activation Functions: Sigmoid, σ(x) = 1/(1 + e^(-x)) - Squashes numbers to the range [0,1] - Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron. Lecture 6-16

Activation Functions: Sigmoid - Squashes numbers to the range [0,1] - Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron. 3 problems: 1. Saturated neurons "kill" the gradients. Lecture 6-17

[Diagram: a sigmoid gate with input x and an upstream gradient flowing back through it.] What happens when x = -10? What happens when x = 0? What happens when x = 10? Lecture 6-18
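
To make the saturation concrete, a small sketch (not lecture code) that evaluates the sigmoid's local gradient σ(x)(1 - σ(x)) at those three inputs:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    local_grad = s * (1 - s)          # d(sigmoid)/dx
    print(x, s, local_grad)
# x = -10 or 10: local gradient ~4.5e-5, so the upstream gradient is essentially "killed"
# x = 0: local gradient = 0.25, which is its maximum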

Activation Functions: Sigmoid - Squashes numbers to the range [0,1] - Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron. 3 problems: 1. Saturated neurons "kill" the gradients. 2. Sigmoid outputs are not zero-centered. Lecture 6-19

Consider what happens when the input to a neuron (x) is always positive: What can we say about the gradients on w? Lecture 6-20

Consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? They are always all positive or all negative :( (this is also why you want zero-mean data!) [Figure: the allowed gradient update directions span only two quadrants, forcing a zig-zag path toward the hypothetical optimal w vector.] Lecture 6-21

Activation Functions: Sigmoid - Squashes numbers to the range [0,1] - Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron. 3 problems: 1. Saturated neurons "kill" the gradients. 2. Sigmoid outputs are not zero-centered. 3. exp() is a bit compute expensive. Lecture 6-22

Activation Functions: tanh(x) [LeCun et al., 1991] - Squashes numbers to the range [-1,1] - Zero-centered (nice) - Still kills gradients when saturated :( Lecture 6-23

Activation Functions: ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012] - Computes f(x) = max(0,x) - Does not saturate (in + region) - Very computationally efficient - Converges much faster than sigmoid/tanh in practice (e.g. 6x) - Actually more biologically plausible than sigmoid. Lecture 6-24

Activation Functions: ReLU (Rectified Linear Unit) - Computes f(x) = max(0,x) - Does not saturate (in + region) - Very computationally efficient - Converges much faster than sigmoid/tanh in practice (e.g. 6x) - Actually more biologically plausible than sigmoid. Problems: - Not zero-centered output - An annoyance (hint: what is the gradient when x < 0?). Lecture 6-25

[Diagram: a ReLU gate with input x and an upstream gradient flowing back through it.] What happens when x = -10? What happens when x = 0? What happens when x = 10? Lecture 6-26
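
The analogous check for the ReLU gate (a sketch): the local gradient is 1 for x > 0 and 0 for x < 0 (exactly at 0 it is undefined; implementations commonly use 0):

import numpy as np

def relu_grad(x):
    return (np.asarray(x) > 0).astype(float)   # subgradient: 0 for x <= 0, 1 for x > 0

print(relu_grad([-10.0, 0.0, 10.0]))   # [0., 0., 1.]
# x = -10: gradient 0, nothing flows back (cf. the "dead ReLU" problem)
# x = 10:  gradient 1, the upstream gradient passes through unchanged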

[Figure: the data cloud with two ReLU hyperplanes, an active ReLU whose active half-space covers the data and a dead ReLU whose active half-space misses it.] A dead ReLU will never activate => never update. Lecture 6-27

[Same figure.] => People like to initialize ReLU neurons with slightly positive biases (e.g. 0.01) to reduce the chance of dead ReLUs. An active ReLU updates; a dead ReLU will never activate => never update. Lecture 6-28

[Maas et al., 2013] [He et al., 2015] Activation Functions: Leaky ReLU, f(x) = max(0.01x, x) - Does not saturate - Computationally efficient - Converges much faster than sigmoid/tanh in practice! (e.g. 6x) - Will not "die". Lecture 6-29

[Maas et al., 2013] [He et al., 2015] Activation Functions: Leaky ReLU, f(x) = max(0.01x, x) - Does not saturate - Computationally efficient - Converges much faster than sigmoid/tanh in practice! (e.g. 6x) - Will not "die". Parametric Rectifier (PReLU): f(x) = max(αx, x), where we backprop into α (a learned parameter). Lecture 6-30

[Clevert et al., 2015] Activation Functions: Exponential Linear Units (ELU), f(x) = x for x > 0 and α(exp(x) - 1) for x <= 0 - All benefits of ReLU - Closer to zero-mean outputs - The negative saturation regime, compared with Leaky ReLU, adds some robustness to noise - Computation requires exp(). Lecture 6-31

[Goodfellow et al., 2013] Maxout "Neuron": max(w1^T x + b1, w2^T x + b2) - Does not have the basic form of dot product -> nonlinearity - Generalizes ReLU and Leaky ReLU - Linear regime! Does not saturate! Does not die! Problem: doubles the number of parameters per neuron :( Lecture 6-32
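
For reference, a sketch of these last few activations as plain NumPy functions; the 0.01 slope and alpha = 1.0 are common defaults rather than values fixed by the lecture, and the maxout weights W1, b1, W2, b2 are placeholders:

import numpy as np

def leaky_relu(x, slope=0.01):
    # PReLU has the same form, but the slope is learned via backprop
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # max of two linear functions; note the doubled parameter count per neuron
    return np.maximum(x @ W1 + b1, x @ W2 + b2)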

TLDR: In practice: - Use ReLU. Be careful with your learning rates. - Try out Leaky ReLU / Maxout / ELU. - Try out tanh but don't expect much. - Don't use sigmoid. Lecture 6-33

Data Preprocessing Lecture 6-34

Step 1: Preprocess the data (Assume X [NxD] is data matrix, each example in a row) Lecture 6-35
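
A minimal sketch of the usual zero-center / normalize recipe for an [N x D] data matrix (the random X here is just a stand-in; the epsilon guard is an addition for safety):

import numpy as np

X = np.random.randn(50, 3072)          # stand-in data matrix, one example per row

X -= np.mean(X, axis=0)                # zero-center: subtract the per-dimension mean
X /= np.std(X, axis=0) + 1e-8          # normalize: divide by the per-dimension std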

Remember: consider what happens when the input to a neuron is always positive... What can we say about the gradients on w? They are always all positive or all negative :( (this is also why you want zero-mean data!) [Figure: the allowed gradient update directions span only two quadrants, forcing a zig-zag path toward the hypothetical optimal w vector.] Lecture 6-36

Step 1: Preprocess the data (Assume X [NxD] is data matrix, each example in a row) Lecture 6-37

Step 1: Preprocess the data. In practice, you may also see PCA (decorrelated data: diagonal covariance matrix) and whitening (covariance matrix is the identity matrix). Lecture 6-38
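
A sketch of what PCA and whitening look like on a zero-centered data matrix, following the standard SVD-of-the-covariance recipe (the 1e-5 term only guards against division by zero):

import numpy as np

X = np.random.randn(500, 100)
X -= np.mean(X, axis=0)                       # PCA/whitening assume zero-mean data

cov = X.T @ X / X.shape[0]                    # [D x D] covariance matrix
U, S, _ = np.linalg.svd(cov)                  # eigenvectors U, eigenvalues S

X_rot = X @ U                                 # decorrelated data: diagonal covariance
X_white = X_rot / np.sqrt(S + 1e-5)           # whitened data: ~identity covariance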

TLDR: In practice for images: center only. E.g. consider the CIFAR-10 example with [32,32,3] images: - Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array) - Subtract the per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers) It is not common to normalize the variance, or to do PCA or whitening. Lecture 6-39

Weight Initialization Lecture 6-40

- Q: what happens when W=0 init is used? (Every neuron then computes the same thing and sees the same gradient, so all the weights get the same update and the symmetry is never broken.) Lecture 6-41

- First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation) Lecture 6-42

- First idea: Small random numbers (Gaussian with zero mean and 1e-2 standard deviation). Works ~okay for small networks, but causes problems with deeper networks. Lecture 6-43

Let's look at some activation statistics. E.g. a 10-layer net with 500 neurons in each layer, using tanh nonlinearities, and initialized as described on the last slide. Lecture 6-44
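
A condensed sketch of that experiment: random data pushed through 10 tanh layers of 500 units, with weights drawn as 0.01 * randn as on the previous slide (the layer sizes follow the lecture's example; the rest of the code is illustrative):

import numpy as np

np.random.seed(0)
D = np.random.randn(1000, 500)                      # random input data
hidden_sizes = [500] * 10                           # 10 layers, 500 neurons each

acts = {}
x = D
for i, fan_out in enumerate(hidden_sizes):
    fan_in = x.shape[1]
    W = 0.01 * np.random.randn(fan_in, fan_out)     # "small random numbers" init
    x = np.tanh(x @ W)                              # tanh nonlinearity
    acts[i] = x

for i, a in acts.items():
    print('layer %d: mean %+.5f, std %.5f' % (i + 1, a.mean(), a.std()))
# With this init the std shrinks layer by layer and the activations collapse toward zero.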

[Figure: per-layer means/stds and histograms of the tanh activations for this network.] Lecture 6-45

All activations become zero! Q: think about the backward pass. What do the gradients look like? Hint: think about backward pass for a W*X gate. Lecture 6-46

Using *1.0 instead of *0.01: almost all neurons are completely saturated, at either -1 or 1. Gradients will be all zero. Lecture 6-47

Xavier initialization [Glorot et al., 2010]: scale the weights by 1/sqrt(fan_in). A reasonable initialization. (The mathematical derivation assumes linear activations.) Lecture 6-48
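
In NumPy terms, Xavier initialization amounts to scaling a standard Gaussian by 1/sqrt(fan_in); a one-line sketch (the fan_in/fan_out values here are arbitrary):

import numpy as np

fan_in, fan_out = 500, 500
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)   # keeps activation variance roughly constant across layers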

but when using the ReLU nonlinearity it breaks. Lecture 6-49

He et al., 2015 (note additional /2) Lecture 6-50

He et al., 2015 (note additional /2) Lecture 6-51
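
And the He et al. variant for ReLU networks, with the additional /2 inside the square root (same caveats as above):

import numpy as np

fan_in, fan_out = 500, 500
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2.0)   # the /2 compensates for ReLU zeroing half the units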

Proper initialization is an active area of research: "Understanding the difficulty of training deep feedforward neural networks", Glorot and Bengio, 2010; "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", Saxe et al., 2013; "Random walk initialization for training very deep feedforward networks", Sussillo and Abbott, 2014; "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification", He et al., 2015; "Data-dependent Initializations of Convolutional Neural Networks", Krähenbühl et al., 2015; "All you need is a good init", Mishkin and Matas, 2015. Lecture 6-52

Batch Normalization Lecture 6-53

Batch Normalization [Ioffe and Szegedy, 2015]. "You want unit gaussian activations? Just make them so." Consider a batch of activations at some layer. To make each dimension unit gaussian, apply x̂^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)]). This is a vanilla differentiable function... Lecture 6-54

Batch Normalization [Ioffe and Szegedy, 2015]. "You want unit gaussian activations? Just make them so." Consider a batch of activations arranged as an N x D matrix (N examples, D dimensions): 1. Compute the empirical mean and variance independently for each dimension. 2. Normalize. Lecture 6-55

Batch Normalization [Ioffe and Szegedy, 2015]: usually inserted after Fully Connected or Convolutional layers, and before the nonlinearity (... -> FC -> BN -> tanh -> FC -> BN -> tanh -> ...). Lecture 6-56

Batch Normalization [Ioffe and Szegedy, 2015]: usually inserted after Fully Connected or Convolutional layers, and before the nonlinearity (... -> FC -> BN -> tanh -> FC -> BN -> tanh -> ...). Problem: do we necessarily want a unit gaussian input to a tanh layer? Lecture 6-57

Batch Normalization [Ioffe and Szegedy, 2015]. Normalize: x̂^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)]). And then allow the network to squash the range if it wants to: y^(k) = γ^(k) x̂^(k) + β^(k). Note, the network can learn γ^(k) = sqrt(Var[x^(k)]) and β^(k) = E[x^(k)] to recover the identity mapping. Lecture 6-58
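
A training-time sketch of that transform for a batch x of shape [N x D]; gamma and beta are the learnable per-dimension scale and shift, and eps = 1e-5 is an assumed small constant:

import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                       # per-dimension empirical mean
    var = x.var(axis=0)                       # per-dimension empirical variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize to ~unit gaussian
    return gamma * x_hat + beta               # learnable squash/shift; gamma=sqrt(var), beta=mu recovers the identity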

[Ioffe and Szegedy, 2015] Batch Normalization - Improves gradient flow through the network - Allows higher learning rates - Reduces the strong dependence on initialization - Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe. Lecture 6-59

Batch Normalization [Ioffe and Szegedy, 2015]. Note: at test time the BatchNorm layer functions differently: the mean/std are not computed from the batch. Instead, fixed empirical statistics of the activations collected during training are used (e.g. estimated during training with running averages). Lecture 6-60
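
A sketch of that test-time behaviour: running averages are accumulated during training and then used in place of batch statistics (the momentum value 0.9 is an assumption, not something the lecture fixes):

import numpy as np

def update_running_stats(mu, var, running_mu, running_var, momentum=0.9):
    # called once per training batch to accumulate fixed statistics for test time
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batchnorm_test(x, gamma, beta, running_mu, running_var, eps=1e-5):
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)   # no batch statistics here
    return gamma * x_hat + beta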

Babysitting the Learning Process Lecture 6-61

Step 1: Preprocess the data (Assume X [NxD] is data matrix, each example in a row) Lecture 6-62

Step 2: Choose the architecture: say we start with one hidden layer of 50 neurons. Input layer: CIFAR-10 images, 3072 numbers; hidden layer: 50 neurons; output layer: 10 neurons, one per class. Lecture 6-63

Double check that the loss is reasonable: with regularization disabled, the loss function (which returns the loss and the gradient for all parameters) gives ~2.3, which is correct for 10 classes. Lecture 6-64

Double check that the loss is reasonable: crank up regularization; the loss went up, good. (sanity check) Lecture 6-65
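
The arithmetic behind that sanity check, as a sketch: with random weights a softmax over 10 classes should be roughly uniform, so the expected loss is about -log(1/10), and any added regularization term can only push the total loss up. The 0.5 * reg convention in the comment is one common formulation, not necessarily the course's exact code:

import numpy as np

num_classes = 10
expected_loss = -np.log(1.0 / num_classes)    # ~2.3026 for CIFAR-10's 10 classes
print(expected_loss)

# With regularization cranked up, the objective gains a positive term,
# e.g. loss_total = data_loss + 0.5 * reg * np.sum(W * W), so it should go up.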

Let's try to train now. Tip: make sure that you can overfit a very small portion of the training data. The code on the slide: - takes the first 20 examples from CIFAR-10 - turns off regularization (reg = 0.0) - uses simple vanilla SGD Lecture 6-66

Let's try to train now. Tip: make sure that you can overfit a very small portion of the training data. Very small loss, train accuracy 1.00, nice! Lecture 6-67

Let's try to train now. Start with small regularization and find a learning rate that makes the loss go down. Lecture 6-68

Let's try to train now. Start with small regularization and find a learning rate that makes the loss go down. Loss barely changing. Lecture 6-69

Let's try to train now. Start with small regularization and find a learning rate that makes the loss go down. Loss not going down: learning rate too low. Loss barely changing: the learning rate is probably too low. Lecture 6-70

Let's try to train now. Start with small regularization and find a learning rate that makes the loss go down. Loss not going down: learning rate too low. Loss barely changing: the learning rate is probably too low. Notice the train/val accuracy goes to 20% though, what's up with that? (Remember this is softmax: the scores are still diffuse so the loss barely moves, but the correct class score can creep up enough to become the max, so accuracy improves.) Lecture 6-71

Let's try to train now. Start with small regularization and find a learning rate that makes the loss go down. Loss not going down: learning rate too low. Now let's try learning rate 1e6. Lecture 6-72

Let's try to train now. Start with small regularization and find a learning rate that makes the loss go down. Loss not going down: learning rate too low. Loss exploding: learning rate too high. cost: NaN almost always means a high learning rate... Lecture 6-73

Let's try to train now. Start with small regularization and find a learning rate that makes the loss go down. Loss not going down: learning rate too low. Loss exploding: learning rate too high. 3e-3 is still too high; the cost explodes. => A rough range for the learning rate we should be cross-validating is somewhere in [1e-5, 1e-3]. Lecture 6-74

Hyperparameter Optimization Lecture 6-75

Cross-validation strategy: coarse -> fine cross-validation in stages. First stage: only a few epochs to get a rough idea of what params work. Second stage: longer running time, finer search (repeat as necessary). Tip for detecting explosions in the solver: if the cost is ever > 3 * the original cost, break out early. Lecture 6-76

For example: run a coarse search for 5 epochs. Note it's best to optimize in log space! A few of the results already look nice. Lecture 6-77
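
Optimizing in log space means sampling the exponent uniformly, for example (a sketch; the ranges are plausible starting points rather than prescribed values):

import numpy as np

for _ in range(5):
    lr = 10 ** np.random.uniform(-6, -3)      # learning rate spread evenly across orders of magnitude
    reg = 10 ** np.random.uniform(-5, 5)      # regularization strength, same idea
    print('lr %.2e, reg %.2e' % (lr, reg))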

Now run the finer search... adjust the range. 53% accuracy: relatively good for a 2-layer neural net with 50 hidden neurons. Lecture 6-78

Now run the finer search... adjust the range. 53% accuracy: relatively good for a 2-layer neural net with 50 hidden neurons. But this best cross-validation result is worrying. Why? (The best hyperparameters sit near the edge of the range we searched, so the true optimum may lie outside it.) Lecture 6-79

Random Search vs. Grid Search. [Figure: grid layout vs. random layout over an important parameter (x-axis) and an unimportant parameter (y-axis); random search samples more distinct values of the important parameter.] "Random Search for Hyper-Parameter Optimization", Bergstra and Bengio, 2012. Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017. Lecture 6-80

Hyperparameters to play with: - network architecture - learning rate, its decay schedule, update type - regularization (L2/Dropout strength) [Image: a "neural networks practitioner" turning knobs, captioned "music = loss function". This image by Paolo Guereta is licensed under CC-BY 2.0.] Lecture 6-81

Cross-validation command center Lecture 6-82

Monitor and visualize the loss curve Lecture 6-83

[Plot: loss (y-axis) vs. time (x-axis).] Lecture 6-84

[Plot: loss (y-axis) vs. time (x-axis); with this shape, bad initialization is a prime suspect.] Lecture 6-85

Monitor and visualize the accuracy: a big gap between training and validation accuracy = overfitting => increase regularization strength? No gap => increase model capacity? Lecture 6-86

Track the ratio of weight updates / weight magnitudes. Ratio between the update norms and the value norms: ~0.0002 / 0.02 = 0.01 (about okay); you want this to be somewhere around 0.001 or so. Lecture 6-87
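
A sketch of how that ratio might be computed for a single parameter tensor; W, dW, and learning_rate are stand-ins for whatever your training loop actually holds:

import numpy as np

W = 0.01 * np.random.randn(500, 500)          # stand-in weights
dW = 0.0001 * np.random.randn(500, 500)       # stand-in gradient
learning_rate = 1e-1

param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW                  # vanilla SGD update for this parameter
update_scale = np.linalg.norm(update.ravel())
print(update_scale / param_scale)             # want roughly ~1e-3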

Summary / TLDRs. We looked in detail at: - Activation Functions (use ReLU) - Data Preprocessing (images: subtract the mean) - Weight Initialization (use Xavier init) - Batch Normalization (use it) - Babysitting the Learning Process - Hyperparameter Optimization (random sample hyperparams, in log space when appropriate) Lecture 6-88

Next time: Training Neural Networks, Part 2 - Parameter update schemes - Learning rate schedules - Gradient checking - Regularization (Dropout etc.) - Evaluation (Ensembles etc.) - Transfer learning / fine-tuning Lecture 6-89