Dropout Training (Hinton et al. 2012)


Dropout Training (Hinton et al. 2012)
Aaron Courville
IFT6135 - Representation Learning
Slide Credit: Some slides were taken from Ian Goodfellow

Dropout training

Introduced in Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Dropout recipe:
- Each time we present data example x, randomly delete each hidden node with 0.5 probability.
- This is like sampling from 2^h different architectures.
- At test time, use all nodes but divide the weights by 2.

[Figure: a network with inputs x1, x2, x3, hidden layers h11-h13 and h21-h23, and output y.]

Effect 1: Reduce overfitting by preventing co-adaptation.
Effect 2: Ensemble model averaging via bagging.
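A minimal NumPy sketch of this recipe (not the authors' code; the layer sizes, weights, and forward function below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, W2, train=True, p_drop=0.5):
    """One hidden layer with dropout on the hidden units.

    Training: each hidden unit is kept with probability 1 - p_drop.
    Test: all units are kept and the activations are scaled by 1 - p_drop
    (equivalent to dividing the outgoing weights by 2 when p_drop = 0.5).
    """
    h = np.maximum(0.0, x @ W1)               # hidden activations (ReLU)
    if train:
        mask = rng.random(h.shape) >= p_drop  # 1 = keep, 0 = drop
        h = h * mask                          # randomly delete hidden units
    else:
        h = h * (1.0 - p_drop)                # "divide the weights by 2" at test time
    return h @ W2                             # output logits

# Hypothetical shapes: 3 inputs, 4 hidden units, 2 classes.
x  = rng.normal(size=(5, 3))
W1 = rng.normal(size=(3, 4)) * 0.1
W2 = rng.normal(size=(4, 2)) * 0.1

train_logits = forward(x, W1, W2, train=True)   # a different sub-network each call
test_logits  = forward(x, W1, W2, train=False)  # deterministic "mean network"
```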

[Figure: the same slide, with the network diagram showing a sampled dropout mask; crossed-out units are the ones deleted for this presentation of x.]

Dropout: TIMIT phone recognition

Dropout helps. Dropout + pretraining helps more.

Method                                           Phone Error Rate %
Neural Net (6 layers) [12]                       23.4
Dropout Neural Net (6 layers)                    21.8
DBN-pretrained Neural Net (4 layers)             22.7
DBN-pretrained Neural Net (6 layers) [12]        22.4
DBN-pretrained Neural Net (8 layers) [12]        20.7
mcRBM-DBN-pretrained Neural Net (5 layers) [2]   20.5
DBN-pretrained Neural Net (4 layers) + dropout   19.7
DBN-pretrained Neural Net (8 layers) + dropout   19.7

Dropout: MNIST digit recognition

Dropout is effective on MNIST, particularly with input dropout. Comparison against other regularizers:

Method                              MNIST classification error %
L2                                  1.62
L1 (towards the end of training)    1.60
KL-sparsity                         1.55
Max-norm                            1.35
Dropout                             1.25
Dropout + Max-norm                  1.05

The unreasonable effectiveness of dropout

A simple 2D example. Decision surfaces after training:

[Figure: three panels showing the training data, the decision surface without dropout, and the decision surface with dropout.]

Claim: Dropout is approximate model averaging

Hinton et al. (2012):
- Dropout approximates geometric model averaging.

Arithmetic mean: $\frac{1}{N}\sum_{i=1}^{N} x_i$
Geometric mean: $\left(\prod_{i=1}^{N} x_i\right)^{1/N}$

Claim: Dropout is approximate model averaging

In networks with a single hidden layer of N units and a softmax output layer:
- Using the mean network is exactly equivalent to taking the geometric mean of the probability distributions over labels predicted by all 2^N possible networks.
- For deep networks, it's an approximation.
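A small numerical check of the single-hidden-layer claim, assuming dropout is applied to the hidden units only and biases are omitted; the network sizes and the softmax helper below are illustrative, not from the paper:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy single-hidden-layer network: dropout on the N hidden units only.
N, C = 6, 3                               # hidden units, classes (small so 2^N is cheap)
h = np.maximum(0.0, rng.normal(size=N))   # hidden activations for one input
W = rng.normal(size=(N, C))               # hidden-to-softmax weights

# Normalized geometric mean over all 2^N dropout masks.
log_probs = []
for mask in itertools.product([0.0, 1.0], repeat=N):
    log_probs.append(np.log(softmax((np.array(mask) * h) @ W)))
geo = np.exp(np.mean(log_probs, axis=0))
geo = geo / geo.sum()                     # renormalize the geometric mean

# "Mean network": keep all units, halve the outgoing weights.
mean_net = softmax(h @ (W / 2.0))

print(np.allclose(geo, mean_net))         # True: exact equivalence in this setting
```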

Bagging predictors

Bagging: a method of model averaging.
- Used to reduce overfitting (decrease the variance of the estimator).

Methodology: given a standard training set D of size n,
- Bagging generates m new training sets, each of size n, by sampling from D uniformly and with replacement.
- Train m models on these m datasets and combine them by averaging the outputs (for regression) or voting (for classification).
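A sketch of this procedure; the scikit-learn decision tree base learner and the toy data are arbitrary choices for illustration, not part of the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # arbitrary base learner for illustration

rng = np.random.default_rng(2)

def bagging_fit(X, y, m=10):
    """Train m models, each on a bootstrap sample of size n drawn with replacement."""
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)          # sample D uniformly, with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine by voting (classification); for regression, average the outputs instead."""
    votes = np.stack([m.predict(X) for m in models])   # shape (m, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

# Hypothetical toy data: two noisy Gaussian blobs.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
ensemble = bagging_fit(X, y, m=25)
print(bagging_predict(ensemble, X[:5]))
```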

Bagging predictors

[Figure: two bootstrap samples, Bag 1 and Bag 2, drawn with replacement from the training set.]

Dropout training

[Figure: units 1-5 shown over three presentations of the data; a different random subset of units (marked X) is dropped each time.]

Dropout as bagging

[Figure: the sub-networks sampled by dropout viewed as an ensemble; crossed-out (X) units are the ones dropped in each member.]

Is dropout performing bagging?

There are a few important differences:
1. The model averaging is only approximate for deep networks.
2. Bagging is typically done with an arithmetic mean; dropout approximates the geometric mean.
3. In dropout, the members of the ensemble are not independent: there is significant weight sharing.

Dropout geometric mean?

How accurate is the weight-scaling trick as an approximation to the geometric mean?
- How does the use of this approximation impact classification performance?

How does the geometric mean compare to the arithmetic mean?
- Conventionally, the arithmetic mean is used with ensemble methods.

Dropout geometric mean?

Small-network experiments:
- Exhaustive computation of exponential quantities is possible.
- Two hidden layers (rectified linear), 10 hidden units each, 20 hidden units total.
- 2^20 = 1,048,576 possible dropout masks (for simplicity, the input is not dropped).

Benchmark on 7 simplified binary classification tasks:
- 2 binary classification subtasks from CoverType
- 4 binary classification subtasks from MNIST
- 1 synthetic task in 2 dimensions ("Diamond")
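A scaled-down sketch of the quantities being compared, using made-up, untrained weights and 4 + 4 hidden units (2^8 masks) instead of the 10 + 10 units and 2^20 masks in the actual experiments; it only illustrates how the arithmetic mean, geometric mean, and weight-scaled predictions would be computed:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy two-hidden-layer ReLU network (illustrative weights, one input vector).
n_in, n_h, n_out = 3, 4, 2
W1 = rng.normal(size=(n_in, n_h)) * 0.5
W2 = rng.normal(size=(n_h, n_h)) * 0.5
W3 = rng.normal(size=(n_h, n_out)) * 0.5
x = rng.normal(size=n_in)

def predict(m1, m2, scale=1.0):
    """Forward pass with binary masks m1, m2 on the hidden layers (input never dropped)."""
    h1 = np.maximum(0.0, x @ W1) * m1 * scale
    h2 = np.maximum(0.0, h1 @ W2) * m2 * scale
    return softmax(h2 @ W3)

# Exhaustive enumeration of all 2^(n_h + n_h) dropout masks.
probs = np.array([predict(np.array(m1), np.array(m2))
                  for m1 in itertools.product([0.0, 1.0], repeat=n_h)
                  for m2 in itertools.product([0.0, 1.0], repeat=n_h)])

arith = probs.mean(axis=0)                                           # arithmetic mean
geo = np.exp(np.log(probs + 1e-12).mean(axis=0)); geo /= geo.sum()   # normalized geometric mean
scaled = predict(np.ones(n_h), np.ones(n_h), scale=0.5)              # weight-scaling approximation

print(arith, geo, scaled)  # in deep nets the three generally differ slightly
```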

Geometric Mean vs. Arithmetic Mean

No systematic advantage to using the arithmetic mean over all possible sub-networks rather than the geometric mean.
- Each dot represents a different randomly sampled hyperparameter configuration.
- No statistically significant differences in test errors across hyperparameter configurations on any task (Wilcoxon signed-rank test).

Quality of the Geometric Mean Approximation

With ReLUs, weight-scaled predictions perform as well as or better than exhaustively computed geometric-mean predictions on these tasks.
- Each dot represents a different randomly sampled hyperparameter configuration.
- No statistically significant differences in test errors across hyperparameter configurations on any task (Wilcoxon signed-rank test).

Dropout vs. Untied-Weight Ensembles

How does the implicit ensemble trained by dropout compare to an ensemble of networks trained with independent weights?
- The explicit ensemble is drawn from the same distribution (i.e. masked copies of the original architecture).
- Experiment on MNIST: average test error for varying sizes of untied-weight ensembles...
- Key observation: bagging untied networks yields some benefit, but dropout performs better. Dropout weight sharing has an impact!