Session 4: Regularization (Chapter 7)


Tapani Raiko (Aalto University)
30 September 2015

Table of Contents
- Background
- Regularization methods
- Exercises

Goal of Regularization
- Neural networks are very powerful (universal approximators).
- It is easy to perform great on the training set (overfitting).
- Regularization improves generalization to new data at the expense of increased training error.
- Use held-out validation data to choose hyperparameters (e.g. regularization strength).
- Use held-out test data to evaluate performance.

Example
- Without regularization, the training error goes to zero and learning stops.
- With noise regularization, the test error keeps dropping.

Expressivity demo: training the first layer only
- No regularization, training only W^(1) and b^(1).
- 0.2% error on the training set, 2% error on the test set.

What is overfitting?
- Probability theory states how we should make predictions of y_test using a model with unknowns θ and data X = {x_train, y_train, x_test}:
  P(y_test | X) = ∫ P(y_test, θ | X) dθ = ∫ P(y_test | θ, X) P(θ | X) dθ.
- The probability of observing y_test is obtained by summing or integrating over all different explanations θ.
- The term P(y_test | θ, X) is the probability of y_test given a particular explanation θ, and it is weighted by the probability of that explanation, P(θ | X).
- However, this computation is intractable.
- If we want to choose a single θ to represent all the probability mass, it is better not to overfit to the highest probability peak, but to find a good representative of the mass: the posterior probability mass matters (the center of gravity rather than the maximum).


Regularization methods
- Limited size of network
- Early stopping
- Weight decay
- Data augmentation
- Injecting noise
- Parameter sharing (e.g. convolutional)
- Sparse representations
- Ensemble methods
- Auxiliary tasks (e.g. unsupervised)
- Probabilistic treatment (e.g. variational methods)
- Adversarial training, ...

Limited size of network
- Rule of thumb: when the number of parameters is ten times smaller than #outputs × #examples, overfitting will not be severe (see the sketch below).
- Reducing the input dimensionality (e.g. by PCA) helps to reduce the number of parameters.
- Pros: easy, low computational complexity. Cons: other methods give better accuracy.
- Data augmentation increases #examples, parameter sharing decreases #parameters, and auxiliary tasks increase #outputs.
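As a rough illustration of the rule of thumb, the following Python sketch counts the parameters of a one-hidden-layer MLP and compares them with the budget #outputs × #examples / 10; the layer sizes and dataset size are hypothetical, chosen only for the example.

```python
# Rule of thumb: #parameters should stay below (#outputs * #examples) / 10
# for overfitting not to be severe. Sizes below are illustrative only.
n_inputs, n_hidden, n_outputs = 784, 500, 10   # an MNIST-sized MLP (assumption)
n_examples = 50000

n_params = (n_inputs * n_hidden + n_hidden) + (n_hidden * n_outputs + n_outputs)
budget = n_outputs * n_examples / 10

print(f"parameters: {n_params}, budget: {budget:.0f}")
print("rule of thumb satisfied" if n_params <= budget else "network may overfit")
```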

Early stopping
- Monitor validation performance during training.
- Stop when it starts to deteriorate.
- With other regularization, it might never start to deteriorate.
- Keeps the solution close to the initialization.
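A minimal sketch of early stopping with a patience counter; the train_one_epoch and validation_error callables are assumptions standing in for whatever training loop and validation metric you already have.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=10):
    """Stop training when validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)                 # one pass over the training data
        error = validation_error(model)        # monitor held-out validation performance
        if error < best_error:
            best_error = error
            best_model = copy.deepcopy(model)  # remember the best parameters so far
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                          # validation error started to deteriorate
    return best_model, best_error
```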

Weight decay (Tikhonov, 1943)
- Add a penalty term to the training cost: C = C₀ + Ω(θ), where C₀ is the unregularized training cost. Note: Ω is a function of the parameters θ only, not of the data.
- L2 regularization: Ω(θ) = (λ/2) ||θ||², with hyperparameter λ for the strength. Gradient: ∂Ω(θ)/∂θ_i = λ θ_i.
- L1 regularization: Ω(θ) = λ ||θ||₁. Gradient: ∂Ω(θ)/∂θ_i = λ sign(θ_i). Induces sparsity: often many parameters become exactly zero.
- Max-norm: constrain the row vectors w_i of the weight matrices to ||w_i||₂ ≤ c.
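A small numpy sketch of the penalties and gradients written above; the parameter vector, λ, and the max-norm constant c are arbitrary example values.

```python
import numpy as np

theta = np.array([0.5, -1.2, 0.0, 3.0])    # example parameter vector
lam = 0.01                                 # regularization strength (hyperparameter)

# L2 weight decay: Omega = (lam / 2) * ||theta||^2, gradient = lam * theta
omega_l2 = 0.5 * lam * np.sum(theta ** 2)
grad_l2 = lam * theta

# L1 regularization: Omega = lam * ||theta||_1, (sub)gradient = lam * sign(theta)
omega_l1 = lam * np.sum(np.abs(theta))
grad_l1 = lam * np.sign(theta)

def max_norm_project(W, c=3.0):
    """Rescale each row w_i of a weight matrix back to ||w_i||_2 <= c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale
```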

Weight decay (figure): L2 penalty (left) and L1 penalty (right), marking the unregularized solution and the regularized solution.

Weight decay as a Bayesian prior
- Consider the maximum a posteriori (MAP) solution. Bayes' rule gives P(θ | X) ∝ P(X | θ) P(θ); written on the negative log scale, C = −log P(X | θ) − log P(θ).
- Assuming a Gaussian prior P(θ) = N(0, λ⁻¹ I), we get Ω(θ) = −Σ_i log exp(−θ_i² / (2λ⁻¹)) = (λ/2) ||θ||², i.e. L2 regularization.
- Correspondences: L2 regularization ↔ Gaussian prior; L1 regularization ↔ Laplace prior; max-norm regularization ↔ uniform prior with finite support; Ω = 0 ↔ maximum likelihood.
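Written out as a short derivation (just expanding the slide's steps, with i.i.d. Gaussian prior components of variance λ⁻¹ and dropping the normalization constant):

```latex
\begin{align}
-\log P(\theta)
  &= -\sum_i \log \frac{1}{\sqrt{2\pi\lambda^{-1}}}
      \exp\!\Big(-\frac{\theta_i^2}{2\lambda^{-1}}\Big) \\
  &= \frac{\lambda}{2}\sum_i \theta_i^2 + \text{const}
   = \frac{\lambda}{2}\,\lVert\theta\rVert_2^2 + \text{const},
\end{align}
```

so the MAP cost C = −log P(X | θ) − log P(θ) is the maximum-likelihood cost plus the L2 weight-decay term.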

Data augmentation
- Image from (Dosovitskiy et al., 2014): augmented data produced by image-specific transformations.
- E.g. cropping by just 2 pixels gets you 9 times the data!
- Infinite MNIST: http://leon.bottou.org/projects/infimnist
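A minimal numpy sketch of the 2-pixel cropping idea: removing a 2-pixel border from a 28×28 image leaves a 26×26 crop window that fits in 3 × 3 = 9 positions, hence 9 times the data. The 28×28 shape is an assumption matching MNIST-sized images.

```python
import numpy as np

def two_pixel_crops(image):
    """Return the 9 crops obtained by removing a 2-pixel border at every offset."""
    h, w = image.shape
    crops = []
    for dy in range(3):            # vertical offset of the crop window: 0, 1, 2
        for dx in range(3):        # horizontal offset: 0, 1, 2
            crops.append(image[dy:h - 2 + dy, dx:w - 2 + dx])
    return np.stack(crops)

# Example with a random array standing in for an MNIST digit.
augmented = two_pixel_crops(np.random.rand(28, 28))
print(augmented.shape)             # (9, 26, 26)
```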

Injecting noise (Sietsma and Dow, 1991)
- Inject random noise during training, separately in each epoch.
- Can be applied to the input data, to the hidden activations, or to the weights.
- Can be seen as data augmentation.
- Simple and effective.
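A minimal sketch of input-noise injection, drawing fresh Gaussian noise in every epoch; the train_step(x, y) callable and the noise level are assumptions, not part of the slides.

```python
import numpy as np

def train_with_input_noise(train_step, X, Y, n_epochs=10, noise_std=0.1, seed=0):
    """Add fresh additive Gaussian noise to the inputs in every epoch."""
    rng = np.random.default_rng(seed)
    for epoch in range(n_epochs):
        noisy_X = X + noise_std * rng.standard_normal(X.shape)  # new noise each epoch
        for x, y in zip(noisy_X, Y):
            train_step(x, y)       # ordinary gradient step on the noisy input
```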

Injecting noise to inputs (analysis)
- Inject small additive Gaussian noise at the inputs and assume a least-squares error at the output y.
- A Taylor series expansion around x shows that this corresponds to penalizing the squared norm of the Jacobian, ||J||², where J = dy/dx is the matrix of partial derivatives with entries ∂y_i/∂x_j (i = 1, ..., c outputs, j = 1, ..., d inputs).
- For linear networks, this reduces to an L2 penalty.
- Rifai et al. (2011) penalize the Jacobian directly.
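Written out, the expansion argument looks as follows (a sketch assuming input noise ε ~ N(0, σ²I) and target t, which the slide does not spell out):

```latex
\begin{align}
\mathbb{E}_{\epsilon}\,\bigl\lVert y(x+\epsilon) - t \bigr\rVert^2
  &\approx \mathbb{E}_{\epsilon}\,\bigl\lVert y(x) + J\epsilon - t \bigr\rVert^2 \\
  &= \lVert y(x) - t \rVert^2 + \sigma^2 \lVert J \rVert_F^2,
  \qquad J_{ij} = \frac{\partial y_i}{\partial x_j},
\end{align}
```

since the cross term vanishes (E[ε] = 0) and E[εᵀJᵀJε] = σ² tr(JᵀJ) = σ² ||J||²_F.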

Parameter sharing
- Force sets of parameters to be equal.
- Reduces the number of (unique) parameters.
- Important in convolutional networks (next week).
- Auto-encoders sometimes share weights between the encoder and the decoder (Oct 28 session).

Sparse representations
- Penalize the representation h with a term Ω(h) to make it sparse.
- An L1 penalty on the weights makes W sparse; similarly, an L1 penalty can make h sparse.
- It is also possible to set a desired sparsity level.
- Sparse coding is common in image processing.
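A small numpy sketch of an L1 penalty on the hidden representation h; the layer size and λ are illustrative assumptions.

```python
import numpy as np

def sparsity_penalty(h, lam=0.001):
    """L1 penalty Omega(h) = lam * ||h||_1 on the representation and its (sub)gradient."""
    omega = lam * np.sum(np.abs(h))
    grad_h = lam * np.sign(h)      # added to dC/dh during backpropagation
    return omega, grad_h

h = np.maximum(0.0, np.random.randn(128))  # e.g. ReLU activations of one example
omega, grad_h = sparsity_penalty(h)
```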

Ensemble methods
- Train several models and average their outputs (also known as bagging or model averaging).
- It helps to make the individual models different by: varying the models or algorithms, varying the hyperparameters, varying the data (dropping examples or dimensions), varying the random seed.
- It is possible to train a single final model to mimic the performance of the ensemble, for test-time computational efficiency (Hinton et al., 2015).
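A minimal sketch of output averaging over an ensemble; the models are assumed to expose a predict_proba-style method returning class probabilities, which is an assumption rather than anything specified on the slide.

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the predicted class probabilities of several models."""
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)    # predicted class = highest average probability
```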

Dropout (Hinton et al., 2012)
- Each time we present a data example x, randomly delete each hidden node with probability 0.5.
- Can be seen as injecting noise or as an ensemble: multiplicative binary noise; training an ensemble of 2^h networks with weight sharing.
- At test time, use all nodes but divide the weights by 2.
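A minimal numpy sketch of dropout on one hidden layer: at training time each hidden unit is deleted with probability 0.5; at test time all units are kept and the outgoing signal is halved (scaling the activations here, which is equivalent to dividing the outgoing weights by 2). The layer shape and ReLU nonlinearity are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_layer(x, W, b, train=True, p_drop=0.5):
    """One hidden layer with dropout applied to its output."""
    h = np.maximum(0.0, W @ x + b)            # ReLU activations
    if train:
        mask = rng.random(h.shape) >= p_drop  # multiplicative binary noise
        return h * mask
    else:
        return h * (1.0 - p_drop)             # test time: keep all units, halve the output
```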

Dropout training (figure).

Dropout as bagging (figure).

Auxiliary tasks
- Multi-task learning: parameter sharing between multiple tasks.
- E.g. speech recognition and speaker identification could share low-level representations.
- Layer-wise pretraining (Hinton and Salakhutdinov, 2006) can be seen as using unsupervised learning as an auxiliary task (Nov 4 session).

Probabilistic treatment
- Variational methods are starting to appear in deep learning research.
- See T-61.5140 Machine Learning: Advanced Probabilistic Methods.
- Jyri Kivinen might discuss these in the Nov 11 session.

Adversarial training (Szegedy et al., 2014)
- Search for an input x̃ near a data point x that would have a very different output ỹ from y.
- Adversarial examples can be found surprisingly close to the data!
- Miyato et al. (2015) build a very effective regularizer from this idea.
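A minimal sketch of constructing an adversarial input near a data point by stepping along the sign of the loss gradient (a fast-gradient-sign style construction; the loss_grad_x callable and step size are assumptions, and the slide does not specify how the search is done).

```python
import numpy as np

def adversarial_example(x, loss_grad_x, epsilon=0.01):
    """Perturb x slightly in the direction that increases the loss the most."""
    return x + epsilon * np.sign(loss_grad_x(x))  # stays within an L-infinity ball of radius epsilon
```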


Exercises
- Read Chapter 7 (Regularization) and Chapter 9 (Convolutional Networks).
- Read the Theano tutorial on regularization: http://deeplearning.net/tutorial/gettingstarted.html#regularization
- Extend your MNIST classifier to include regularization. Consider at least L2 weight decay and additive Gaussian noise injected in the inputs. Choose a good regularization strength using a held-out validation set.