Understanding Deep Learning Requires Rethinking Generalization


Transcription:

Understanding Deep Learning Requires Rethinking Generalization. Chiyuan Zhang (Massachusetts Institute of Technology, chiyuan@mit.edu), Samy Bengio (Google Brain, bengio@google.com), Moritz Hardt (Google Brain, mrtz@google.com), Benjamin Recht (University of California, Berkeley, brecht@berkeley.edu), Oriol Vinyals (Google DeepMind, vinyals@google.com). ICLR Best Paper Award 2017. Presentation by Rodney LaLonde at the University of Central Florida's (UCF) Center for Research in Computer Vision (CRCV).

Presentation Outline: Motivation, Background, Experimental Findings, Discussion.

Motivation. Question: What is it, then, that distinguishes neural networks that generalize well from those that don't?

Background: Generalization Error, Model Capacity, Regularization.

Generalization Error: the difference between training error and test error. (Three illustrative slides show models with generalization error of 0.1, 0.2, and 0.3.)

Universal Approximation Theorem. A feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$.* George Cybenko proved this for sigmoid activation functions.[1] The theorem does not address the algorithmic learnability of those parameters. * Some mild assumptions about the activation function must be met. [1] Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals, and Systems, 2(4), 303-314.
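As a toy numerical illustration of the single-hidden-layer form $g(x) = \sum_j \alpha_j\,\sigma(w_j x + b_j)$ behind Cybenko's result (my own sketch, not part of the slides; the grid size, hidden width, and weight scales are arbitrary choices), the snippet below fits a sum of randomly placed sigmoids to $\sin(2\pi x)$ on $[0, 1]$ by solving for the output weights with least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target: a continuous function on the compact set [0, 1].
x = np.linspace(0.0, 1.0, 200)
f = np.sin(2 * np.pi * x)

# One hidden layer of 50 sigmoid units with random weights/biases;
# only the output weights alpha are fit (linear least squares) in this toy demo.
H = 50
w = rng.normal(scale=20.0, size=H)
b = rng.uniform(-20.0, 20.0, size=H)
Phi = sigmoid(np.outer(x, w) + b)          # hidden activations, shape (200, H)
alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)

print("max |error| on the grid:", np.max(np.abs(Phi @ alpha - f)))
```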

Vapnik-Chervonenkis Theory: VC Dimension. A classification model $f$ with some parameter vector $\theta$ is said to shatter a set of data points $x_1, \ldots, x_n$ if, for all assignments of labels to those points, there exists a $\theta$ such that the model $f$ makes no errors when evaluating that set of data points.

Vapnik-Chervonenkis Theory: VC Dimension. The VC dimension of a model $f$ is the maximum number of data points that can be arranged so that $f$ shatters them.

VC Dimension and Statistical Learning Theory. Probabilistic upper bound on the test error: with probability $1 - \eta$,

$$\text{test error} \;\le\; \text{training error} + \sqrt{\frac{1}{N}\left[D\left(\ln\frac{2N}{D} + 1\right) - \ln\frac{\eta}{4}\right]},$$

where $D$ is the VC dimension of the model and $N$ is the number of training samples. Valid only when $D \ll N$; not useful for deep neural networks, where typically $D \gg N$.
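To make the "not useful" point concrete, here is a small worked example of my own (the VC-dimension figures are illustrative assumptions, not from the slides): plugging numbers into the bound above shows it is informative when $N \gg D$ and vacuous when $D$ is comparable to or larger than $N$.

```python
import math

def vc_gap_bound(D, N, eta=0.05):
    """Upper bound on (test error - training error) from the VC inequality above."""
    return math.sqrt((D * (math.log(2 * N / D) + 1) - math.log(eta / 4)) / N)

# Classical regime, N >> D: the bound is informative.
print(vc_gap_bound(D=100, N=1_000_000))      # about 0.03

# Overparameterized regime, D comparable to or larger than N (e.g., a network whose
# VC dimension is at least 100k vs. 50k CIFAR-10 images): the bound exceeds 1,
# so it says nothing about the error gap.
print(vc_gap_bound(D=100_000, N=50_000))     # about 1.4
```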

Regularization. Explicit regularization: weight decay, dropout, data augmentation. Implicit regularization: early stopping, batch normalization, SGD.

L2 Regularization (Weight Decay). Standard weight update: $w \leftarrow w - \eta\,\nabla_w L(w)$. New weight update: $w \leftarrow w - \eta\,\nabla_w L(w) - \eta\lambda w$.* Forces the weights to become small, i.e., to decay. * Krogh, Anders, and John A. Hertz. "A simple weight decay can improve generalization." Advances in Neural Information Processing Systems, pp. 950-957, 1992.
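A minimal numpy sketch of the two update rules above (my own illustration, not from the slides; names such as `grad_loss`, `lr`, and `lam` are placeholders):

```python
import numpy as np

def sgd_step(w, grad_loss, lr=0.1):
    """Standard update: w <- w - lr * dL/dw."""
    return w - lr * grad_loss(w)

def sgd_step_weight_decay(w, grad_loss, lr=0.1, lam=1e-3):
    """Weight-decay update: w <- w - lr * dL/dw - lr * lam * w.
    The extra -lr*lam*w term shrinks ("decays") the weights at every step."""
    return w - lr * grad_loss(w) - lr * lam * w

# Toy example: quadratic loss centered at w* = [3, -2], so grad L(w) = w - w*.
target = np.array([3.0, -2.0])
grad = lambda w: w - target
w = np.zeros(2)
for _ in range(200):
    w = sgd_step_weight_decay(w, grad)
print(w)   # close to [3, -2] but shrunk slightly toward zero by the decay term
```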

Dropout Randomly drop neurons from layers in the network. Removes reliance on individual neurons. Figure taken from: Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: A simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15, no. 1 (2014): 1929-1958.

Dropout Learns redundancies. Learns a more nuanced set of feature detectors. Figure taken from: Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: A simple way to prevent neural networks from overfitting." The Journal of Machine Learning Research 15, no. 1 (2014): 1929-1958.
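A hedged sketch of the dropout idea at training time (my own illustration using the inverted-dropout variant, not the paper's exact formulation; the rate and array names are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during training and
    rescale the survivors by 1/(1-p) so the expected activation is unchanged.
    At test time the layer is left untouched."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones((2, 8))                # a batch of 2 examples, 8 hidden units
print(dropout(h, p=0.5))           # roughly half the units zeroed, the rest scaled to 2.0
print(dropout(h, training=False))  # unchanged at test time
```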

Data Augmentation. Domain-specific transformations of the input data. For images: shown in figure (also random noise, shear, zoom, elastic deformations, etc.). Figure from: Taylor, Luke, and Geoff Nitschke. "Improving Deep Learning using Generic Data Augmentation." arXiv preprint arXiv:1708.06020 (2017).

Data Augmentation. Effectively enlarges the training set to cover more of the input space (i.e., all possible images we care about). Figure from: Taylor, Luke, and Geoff Nitschke. "Improving Deep Learning using Generic Data Augmentation." arXiv preprint arXiv:1708.06020 (2017).
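A minimal numpy sketch of two common image augmentations of the kind mentioned above, horizontal flips and random crops (my own illustration; the padding size and function names are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_flip(img):
    """Flip the image left-right with probability 0.5."""
    return img[:, ::-1, :] if rng.random() < 0.5 else img

def random_crop(img, pad=4):
    """Pad the image by `pad` pixels on each side, then crop back to the
    original size at a random offset (a standard CIFAR-style augmentation)."""
    h, w, c = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]

img = rng.random((32, 32, 3))              # a fake 32x32 RGB image
augmented = random_crop(random_flip(img))
print(augmented.shape)                     # (32, 32, 3): same size, new "view" of the input
```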

Experimental Findings

Randomization Tests. Label and input corruptions applied to the training data: true labels, partially corrupted labels, random labels, shuffled pixels, random pixels, Gaussian noise inputs. (The slides illustrate these with example images labeled Human, Monkey, Bird, Cat, Deer, Dog, Frog, Building, Ship, Truck.)
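A minimal sketch of how the label- and pixel-corruption variants above could be set up (my own illustration, not the authors' code; `corrupt_frac` and the array shapes are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(labels, corrupt_frac, num_classes=10):
    """Replace a fraction `corrupt_frac` of the labels with uniformly random classes.
    corrupt_frac=0.0 keeps the true labels; corrupt_frac=1.0 gives fully random labels."""
    labels = labels.copy()
    idx = rng.choice(len(labels), size=int(corrupt_frac * len(labels)), replace=False)
    labels[idx] = rng.integers(0, num_classes, size=len(idx))
    return labels

def shuffle_pixels(images, fixed_permutation=True):
    """'Shuffled pixels': one fixed random permutation applied to every image;
    with fixed_permutation=False, a fresh permutation per image ('random pixels')."""
    n, h, w, c = images.shape
    flat = images.reshape(n, h * w, c)
    if fixed_permutation:
        return flat[:, rng.permutation(h * w), :].reshape(n, h, w, c)
    return np.stack([f[rng.permutation(h * w)] for f in flat]).reshape(n, h, w, c)

y = rng.integers(0, 10, size=50_000)          # stand-in for CIFAR-10 training labels
y_random = corrupt_labels(y, corrupt_frac=1.0)
print((y == y_random).mean())                 # ~0.1: agreement is only at chance level
```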

Results of Randomization Tests

Conclusions & Implications Conclusion: Deep neural networks easily fit random labels. Implications: The effective capacity of neural networks is sufficient for memorizing the entire data set. Even optimization on random labels remains easy.

Explicit Regularization Tests

Conclusions & Implications. Conclusions: Augmenting the data is more powerful than weight decay alone. Bigger gains come from changing the model architecture. Implications: Explicit regularization may improve generalization, but it is neither necessary nor by itself sufficient.

Implicit Regularization Findings Early stopping could potentially improve generalization. Batch normalization improves generalization.

Authors' Conclusions. Both explicit and implicit regularizers could help to improve generalization performance. However, it is unlikely that the regularizers are the fundamental reason for generalization.

Finite-Sample Expressivity of Neural Networks. At the population level, depth-$k$ networks are typically more powerful than depth-$(k-1)$ networks. Given a finite sample of size $n$, however, even a two-layer neural network can represent any function once the number of parameters exceeds $n$.

Finite-Sample Expressivity of Neural Networks. Theorem 1: There exists a two-layer neural network with ReLU activations and $2n + d$ weights that can represent any function on a sample of size $n$ in $d$ dimensions.

Finite-Sample Expressivity of Neural Networks. A network $C$ can represent any function of a sample of size $n$ in $d$ dimensions if: for every sample $S \subseteq \mathbb{R}^d$ with $|S| = n$ and every function $f : S \to \mathbb{R}$, there exists a setting of the weights of $C$ such that $C(x) = f(x)$ for every $x \in S$. Can be extended to depth-$k$ networks with width $O(n/k)$ at each layer.
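The construction behind the theorem is essentially a width-$n$ ReLU layer whose activation matrix is triangular, so any labels can be solved for exactly. Below is a hedged numpy sketch of that idea (my own paraphrase, not the authors' code): a random projection $a$ ($d$ weights), biases placed between the sorted projections ($n$ weights), and output weights obtained by solving a triangular linear system ($n$ weights), for roughly $2n + d$ weights in total.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))      # n sample points in d dimensions
y = rng.normal(size=n)           # arbitrary real labels to be represented exactly

a = rng.normal(size=d)           # random projection: the a.x_i are distinct almost surely
z = X @ a
order = np.argsort(z)
X, y, z = X[order], y[order], z[order]

b = np.empty(n)                  # n biases, interleaved with the sorted projections
b[0] = z[0] - 1.0
b[1:] = (z[:-1] + z[1:]) / 2.0

# Hidden activations A[i, j] = ReLU(a.x_i - b_j) form a lower-triangular matrix
# with a nonzero diagonal, so the n output weights w have a unique exact solution.
A = np.maximum(z[:, None] - b[None, :], 0.0)
w = np.linalg.solve(A, y)

assert np.allclose(A @ w, y)     # the ~(2n + d)-weight network interpolates every label
```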

Appeal to Linear Models. Consider $n$ data points $(x_i, y_i)$, where the $x_i$ are $d$-dimensional feature vectors and the $y_i$ are labels. Solve: $\min_w \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(w^\top x_i, y_i)$ (Eq. 2). If $d \ge n$, we can fit any labeling.

Appeal to Linear Models. Let $X$ denote the $n \times d$ matrix whose $i$-th row is $x_i^\top$. If $X$ has rank $n$ (and $d > n$), then the system $Xw = y$ has an infinite number of solutions. We can find a global minimum of Eq. (2) by solving this linear system.
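A quick numpy check of this overparameterized setting (my own toy example, not from the slides): when $d > n$ and $X$ has rank $n$, $Xw = y$ is solvable for any labels $y$, and adding any null-space direction of $X$ gives yet another exact solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # fewer data points than features
X = rng.normal(size=(n, d))         # rank n with probability 1
y = rng.normal(size=n)              # any labels, even random ones

w0, *_ = np.linalg.lstsq(X, y, rcond=None)   # one exact solution
print(np.allclose(X @ w0, y))                # True: the labels are fit perfectly

# Any null-space direction added to w0 is another exact solution.
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[-1]                            # a direction with X @ null_dir ~ 0
print(np.allclose(X @ (w0 + 10.0 * null_dir), y))   # True as well
```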

Investigating SGD. The SGD updates take the form $w_{t+1} = w_t - \eta_t e_t x_{i_t}$, and with $w_0 = 0$ we get $w = \sum_i \alpha_i x_i$ for some coefficients $\alpha_i$; therefore $w$ lies in the span of the data points. Substituting $w = X^\top \alpha$ into the condition that the labels are perfectly interpolated, $Xw = y$, gives $XX^\top \alpha = y$, which has a unique solution.

Investigating SGD. Forming the kernel (Gram) matrix $K = XX^\top$ and solving $K\alpha = y$ for $\alpha$ yields a perfect fit on the labels. It turns out this kernel solution is exactly the minimum-$\ell_2$-norm solution to $Xw = y$; hence SGD converges to the solution with minimum norm. Minimum norm, however, is not predictive of generalization performance.
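A small numpy sketch of this minimum-norm claim (my own illustration): solving the Gram system and mapping back through $X^\top$ gives the same $w$ as the pseudoinverse, i.e., the minimum-$\ell_2$-norm, solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Kernel / Gram-matrix route: solve K alpha = y, then w = X^T alpha.
K = X @ X.T
alpha = np.linalg.solve(K, y)
w_kernel = X.T @ alpha

# Minimum-l2-norm solution to X w = y via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print(np.allclose(X @ w_kernel, y))          # perfect fit on the labels
print(np.allclose(w_kernel, w_min_norm))     # identical to the minimum-norm solution
```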

Final Conclusions. The effective capacity of neural networks is large: successful networks are big enough to shatter the training data. Optimization continues to be easy even when generalization is poor.

Final Conclusions. SGD may be performing implicit regularization by converging to solutions with minimum $\ell_2$-norm. Traditional measures of model complexity struggle to explain the generalization of large neural networks.

Thank You! Questions and Discussions?