ECE521 Lecture 10: Deep Learning


Learning fully connected multi-layer neural networks
For a single data point, we can write the hidden activations of a fully connected neural network as a recursive computation in vector notation (forward propagation), where f() is the output activation function. The output of the network is then used to compute the loss function on the training data.
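A standard way to write the forward-propagation recursion described above, assuming W^(l) and b^(l) denote the layer-l weight matrix and bias and σ the hidden activation function (the slide's own notation is not reproduced here, so these symbols are assumptions):

```latex
\begin{aligned}
\mathbf{h}^{(0)} &= \mathbf{x} \\
\mathbf{h}^{(\ell)} &= \sigma\!\left(W^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right), \qquad \ell = 1, \dots, L-1 \\
\hat{\mathbf{y}} &= f\!\left(W^{(L)} \mathbf{h}^{(L-1)} + \mathbf{b}^{(L)}\right)
\end{aligned}
```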

Learning fully connected multi-layer neural networks
For a single training example, computing the gradient w.r.t. the weight matrices is also a recursive procedure (back-propagation). Remember: back-propagation is similar to running the neural network backwards using the transposes of the weight matrices. What about the expression for the bias units?
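In the same assumed notation, the backward recursion can be sketched as follows; the last equality also answers the bias question, since each bias enters its layer's pre-activation z^(l) with coefficient one:

```latex
\begin{aligned}
\mathbf{z}^{(\ell)} &= W^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad
\boldsymbol{\delta}^{(L)} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \odot f'\!\left(\mathbf{z}^{(L)}\right) \\
\boldsymbol{\delta}^{(\ell)} &= \left(W^{(\ell+1)\top} \boldsymbol{\delta}^{(\ell+1)}\right) \odot \sigma'\!\left(\mathbf{z}^{(\ell)}\right) \\
\frac{\partial \mathcal{L}}{\partial W^{(\ell)}} &= \boldsymbol{\delta}^{(\ell)} \, \mathbf{h}^{(\ell-1)\top},
\qquad
\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(\ell)}} = \boldsymbol{\delta}^{(\ell)}
\end{aligned}
```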

Outline
- Bag of tricks for deep neural networks
  - Local minima and initialization
  - Early stopping and regularization
  - Dataset normalization
- Types of neural networks
  - Building blocks
  - Convolutional neural networks
  - Recurrent neural networks

Hyper-parameters
Before training, several choices must be made: How many hidden units in each hidden layer? How many layers in total? Which hidden activation function?
Good answer: decide these hyper-parameters using a validation set (see the sketch below).
Best practical answer: around 500-2000 hidden units, 2-3 layers, and ReLU activations, which often lead to fast convergence.
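A minimal sketch of choosing these hyper-parameters on a validation set; train_and_evaluate is a hypothetical helper standing in for whatever training loop is being used, and the candidate grid simply mirrors the ranges suggested on the slide.

```python
import itertools

def select_hyperparameters(train_and_evaluate):
    """Pick the architecture with the lowest validation loss.

    `train_and_evaluate(n_hidden, n_layers, activation)` is a hypothetical
    function that trains a network and returns its validation loss.
    """
    candidates = itertools.product(
        [500, 1000, 2000],      # hidden units per layer (slide's suggested range)
        [2, 3],                 # number of hidden layers
        ["relu"],               # ReLU often converges fastest
    )
    best = None
    for n_hidden, n_layers, activation in candidates:
        val_loss = train_and_evaluate(n_hidden, n_layers, activation)
        if best is None or val_loss < best[0]:
            best = (val_loss, n_hidden, n_layers, activation)
    return best
```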

Parameter initialization
The loss functions of neural networks are in general non-convex w.r.t. the weight matrices, and different weight-initialization schemes can lead to significant differences in final performance. Use random initialization (e.g. a zero-mean Gaussian with std. 0.01) and avoid constant initialization.
Non-convex optimization is not crazy: most of the local minima in a neural network's loss function are not bad, as long as the model has enough capacity to model the data. Stochastic gradient descent can usually find good local minima.

Some local optima generalize better than others
Consider two neural networks that each achieve a low error rate on the training set. We prefer the model that learns the underlying statistical patterns of the data and can generalize to unseen examples at test time. The test loss and the training loss are almost always slightly different. Wide, shallow basins typically generalize better than deep, narrow local minima, and the subsampling noise from SGD helps find such shallow basins.
(Figure: training loss vs. test loss; a wide local minimum generalizes better, a deep narrow local minimum does not.)

Careful initialization
For really deep neural networks (more than about 5 layers), random initialization from constant-variance Gaussian noise will not work well: the back-propagated partial derivatives will likely be too small to learn anything useful.
Simple fix: initialize the weight matrices to identity matrices if you can (Le, Jaitly and Hinton, 2015).
More elaborate fix: adapt the std. of the zero-mean Gaussian initialization to sqrt(3 / (#inputs + #outputs)), a.k.a. Xavier initialization (Glorot and Bengio, 2010).
Oftentimes it is beneficial to initialize the weights from another model that was pre-trained on some other task, e.g. an auto-encoder or an ImageNet model. Weight matrices in the early layers are transferable and help generalization.
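A minimal NumPy sketch of the two Gaussian schemes mentioned above; the std formula follows the slide as written (Glorot and Bengio's paper states the scaling slightly differently), and the layer sizes are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_init(n_in, n_out, std=0.01):
    """Plain random initialization: zero-mean Gaussian with a small fixed std."""
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_init(n_in, n_out):
    """Scale the std with the layer's fan-in/fan-out, as on the slide."""
    std = np.sqrt(3.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

# Illustrative layer sizes: 784 inputs -> 1000 hidden units.
W_shallow = gaussian_init(784, 1000)   # fine for shallow nets
W_deep = xavier_init(784, 1000)        # keeps gradients from shrinking in deep nets
```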

Pre-training and the deep learning hypothesis
One hypothesis for why deep learning works so well is that, given enough data, deep nets learn appropriate low-, mid- and high-level features. Lower layers learn local but general feature detectors, so it makes sense to transfer the early hidden layers and expect those low-level features to work well. If we do not have enough data for the current task, we can expect a win by training a model on a similar but not identical task with millions of training examples and then transferring the model back to the current task. [from Andrew Ng's slides]

Regularize the capacity of deep neural networks
Given enough hidden units, neural networks can overfit the training dataset. Regularizing a neural network is equivalent to restricting its capacity. Reducing the number of hidden units to limit model capacity has a sound statistical justification: you need more training examples than weights to have enough statistical power when estimating the unknown model parameters. Alternatively, one can use an over-parameterized model and deal with overfitting through a very strong regularizer such that most of the weights are close to zero, e.g. weight decay. In deep learning applications, the strong-regularizer approach often works much better. A good heuristic is to prefer a globally simple prediction function with some irregular local structure.
(Figure: training loss and test loss vs. number of epochs.)
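As a concrete instance of the strong-regularizer idea, weight decay adds an L2 penalty on the weights to the training loss. A minimal sketch of a single SGD update with weight decay, assuming grad_W is the data-loss gradient already computed by back-propagation:

```python
import numpy as np

def sgd_step_with_weight_decay(W, grad_W, lr=0.01, weight_decay=1e-4):
    """One SGD update on L(W) + (weight_decay / 2) * ||W||^2.

    The L2 penalty contributes the extra gradient term weight_decay * W,
    which continually shrinks the weights towards zero.
    """
    return W - lr * (grad_W + weight_decay * W)

# Example usage with dummy values.
W = np.random.default_rng(0).normal(0, 0.01, size=(5, 3))
grad_W = np.zeros_like(W)          # pretend the data gradient is zero here
W = sgd_step_with_weight_decay(W, grad_W)
```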

Early stopping
A very simple trick to keep most of the weights as close to zero as possible is to monitor the validation loss and stop learning at the minimum of the validation-loss curve (a.k.a. early stopping), before reaching the minimum training loss. The goal of machine learning is to generalize well to test data, not to find the minimum training loss. Since the weights are typically initialized around zero, early stopping terminates the learning process before the weights grow too large, and thus limits the capacity of the model. It is by far the most commonly used regularization technique.
(Figure: training, validation and test loss vs. number of epochs; the early-stopping point is the minimum of the validation loss.)
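A minimal sketch of early stopping with a patience counter; train_one_epoch, validation_loss and the model's copy_weights/set_weights methods are hypothetical helpers standing in for the actual training and evaluation code.

```python
def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=200, patience=10):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_weights = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validation_loss(model)
        if val_loss < best_loss:
            best_loss = val_loss
            best_weights = model.copy_weights()   # hypothetical snapshot method
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                             # early stopping point

    model.set_weights(best_weights)               # roll back to the best epoch
    return model
```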

Dropout
Another simple trick to effectively regularize deep NNs is to remove hidden units randomly during training. Dropout prevents co-adaptation of the hidden units and encourages independence among the neurons. It can be understood as stochastic training over the many network architectures obtained by removing different subsets of hidden units, all sharing the same weight matrices, which efficiently shares statistical power among an ensemble of networks. At test time, the mean network is used to make predictions: if each hidden unit is dropped out 50% of the time during training, we compensate at test time by scaling the weight matrix down by a factor of 2, so that the expected activation of each hidden unit stays the same for both training and testing.

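A minimal NumPy sketch of the train/test behaviour described above: units are dropped with probability 0.5 during training, and at test time the activations (equivalently, the weights) are scaled by the keep probability so that expected activations match.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, drop_prob=0.5, training=True):
    """Apply dropout to a layer's activations h.

    Training: zero out each unit independently with probability drop_prob.
    Test: keep all units but scale by (1 - drop_prob), i.e. the 'mean network'.
    """
    if training:
        mask = rng.random(h.shape) >= drop_prob   # 1 = keep, 0 = drop
        return h * mask
    return h * (1.0 - drop_prob)

h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_forward(h, training=True))    # some units zeroed at random
print(dropout_forward(h, training=False))   # all units kept, halved
```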

Statistical power of the MLE vs. neural networks
The traditional view of machine learning is that we need more training examples than parameters, e.g. fitting a linear regression with 100 inputs on thousands of training examples, or fitting the hyper-parameters (weight decay) on hundreds of validation examples. Neural networks typically have millions of learnable weights, and there is never enough data (we would just keep building fancier and larger models once we had more training examples). It is almost always better to over-parameterize and then prevent overfitting with a very strong regularizer (e.g. dropout), or to transfer statistical efficiency from other pre-trained models.

Summary of regularization options
- Lower the number of units and depth: often done for computational reasons.
- Weight decay: you should always have a small amount of weight decay.
- Early stopping: very effective; always apply it.
- Dropout: the strength of the regularization can be increased or decreased via the dropout probability.
- Fine-tuning pre-trained models: initialize your network from a model pre-trained on millions of data points.

Dataset normalization
Learning can often be made easier by pre-processing the dataset before running the learning algorithm. The simplest normalization scheme is to centre each input dimension and remove its variance. This fixes the scaling discrepancy among the input dimensions (e.g. it removes the units of measurement); SGD works much better when the inputs share the same scale. Why? For each of the N input dimensions, estimate its mean and variance from the training data, then normalize all the training examples (see the sketch below). We can further remove the covariance among the input dimensions, i.e. whiten the data using Principal Component Analysis (PCA): next lecture.
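A minimal sketch of the per-dimension standardization step, assuming the data is stored as an (examples x dimensions) NumPy array; the same training-set mean and std must be reused for validation and test data.

```python
import numpy as np

def normalize(X_train, X_test, eps=1e-8):
    """Standardize each input dimension using training-set statistics."""
    mu = X_train.mean(axis=0)               # per-dimension mean
    sigma = X_train.std(axis=0) + eps       # per-dimension std (eps avoids /0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])
X_train_n, X_test_n = normalize(X_train, X_test)
```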

Outline
- Bag of tricks for deep neural networks
  - Local minima and initialization
  - Early stopping and regularization
  - Dataset normalization
- Types of neural networks
  - Building blocks
  - Convolutional neural networks
  - Recurrent neural networks

Basic neural building blocks
Here are some basic computational units that we will use to construct deep neural networks with more advanced connectivity patterns. All of these basic computational units have well-defined partial derivatives.
(Figure: basic units such as σ, max, + and ×.)

Sparse local connectivity
While keeping the general stacked layer-wise neural network architecture, we can use specialized connectivity patterns between the hidden layers:
- Local connectivity: each hidden unit operates on a local image patch (e.g. 3 instead of 7 connections per hidden unit).
- Weight sharing: filter each chunk of the signal the same way (e.g. process each patch of the image the same way). Parameter sharing improves statistical power and reduces overfitting, and lets us use many more neurons (# neurons vs. # weights). It works well for matching a pattern anywhere in an image.
[from Percy Liang's slides and Andrej Karpathy's blog]

ConvNet: 2D convolution
In a convolutional architecture, each depth column is produced from a localized region (in height/width) of the layer below.
[from Percy Liang's slides and Andrej Karpathy's blog]
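A minimal NumPy sketch of a single-channel 2D convolution (valid padding, stride 1), just to make the "localized region, shared weights" idea concrete; real ConvNet layers additionally have multiple input/output channels and a bias per filter.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image (valid padding, stride 1)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]      # localized region
            out[i, j] = np.sum(patch * kernel)     # same weights everywhere
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # tiny edge-like filter
print(conv2d(image, kernel))                       # 4x4 feature map
```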

ConvNet: Max-pooling
We can reduce the dimension of the input without introducing extra parameters by pooling the maximum value of each neighbourhood. The pooling layer tests whether there is a strong pattern in a local neighbourhood while suppressing the others. The max-pooling layer also helps prevent overfitting because it has no learnable parameters.
[from Percy Liang's slides and Andrej Karpathy's blog]
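A matching NumPy sketch of 2x2 max-pooling with stride 2, assuming the input height and width are even; note that it introduces no parameters at all.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Take the maximum over non-overlapping 2x2 neighbourhoods."""
    H, W = feature_map.shape
    assert H % 2 == 0 and W % 2 == 0, "assumes even height/width"
    return feature_map.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fm = np.array([[1.0, 3.0, 2.0, 0.0],
               [4.0, 2.0, 1.0, 5.0],
               [0.0, 1.0, 6.0, 2.0],
               [3.0, 2.0, 1.0, 4.0]])
print(max_pool_2x2(fm))   # [[4., 5.], [3., 6.]]
```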

CNN: object recognition
AlexNet: a typical modern CNN. [Krizhevsky et al., 2012]

CNN: neural style transfer
[Gatys et al., 2015]

Recurrent neural network
Weight sharing is a general technique that can be applied to any neural network. In CNNs, weight sharing is applied to the neurons within the same hidden convolutional layer. Here, we instead share entire weight matrices: the same weight matrix recurs between the hidden layers. Sharing the same weight matrix in a very deep fully connected neural network allows us to build an autoregressive process, one in which future values are estimated from past values using the same model. An autoregressive process operates under the premise that past values have an effect on current values.
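A minimal NumPy sketch of the recurrence, with assumed names for the shared matrices: the same W_xh and W_hh are reused at every time step, which is exactly the weight sharing described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_input, n_hidden = 8, 16

# The shared parameters, reused at every time step.
W_xh = rng.normal(0, 0.01, size=(n_hidden, n_input))
W_hh = rng.normal(0, 0.01, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

def rnn_forward(xs):
    """Run a vanilla RNN over a variable-length sequence of input vectors."""
    h = np.zeros(n_hidden)                       # initial hidden state
    states = []
    for x in xs:                                 # one step per input, any length
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # same weights at every step
        states.append(h)
    return states

sequence = [rng.normal(size=n_input) for _ in range(5)]
hidden_states = rnn_forward(sequence)
```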

RNN: statistical machine translation
RNNs can deal with variable-length input sequences, i.e. the input to an RNN no longer has to be a fixed-length vector. Machine translation example: x: "je crains l'homme de un seul livre." y: "fear the man of one book."
[from Percy Liang's slides and Andrej Karpathy's blog]

RNN: statistical machine translation
Large-scale Google Neural Machine Translation: deep 8-layer RNNs. [Wu et al., 2016]

Combining CNN with RNN: caption generation
The model consists of a convolutional network that consumes a fixed-size input (a fixed-size image). The last hidden layer of the CNN is connected to an RNN that outputs a sequence of words describing the input image. [from Xu et al., 2014]