Adaptive Hyperparameter Search for Regularization in Neural Networks

Devin Lu
Stanford University, Department of Statistics
devinlu@stanford.edu
June 13, 2017

Abstract

In this paper, we consider whether a regularization constant that varies during training can achieve comparable or better validation accuracy than a fixed hyperparameter. We also consider the problem of developing a policy that changes the regularization parameter based on feedback from training rather than following a fixed schedule.

1 Introduction

Traditionally, regularization constants and other hyperparameters are fixed for a model throughout training. These hyperparameters are usually optimized by splitting off a portion of the training data into a development (dev) set, separate from the test data, that is used specifically for hyperparameter optimization. The model is then trained repeatedly with different hyperparameter settings, and the best setting is chosen based on performance on the dev set. The downside of this approach is that it is often very expensive, for example when training the model is costly or when the data volume is large enough that the training examples cannot all be held in memory. These issues often arise with neural networks.

Here, we examine whether a regularization parameter that varies during training can accelerate the process of reaching a given level of accuracy, or even yield a model that performs better than one obtained through traditional hyperparameter optimization. We also examine whether we can develop a policy that determines changes in regularization, rather than using a fixed schedule.

2 Previous Work

The concept of an adaptive regularization parameter has been considered before in a different domain, in "Adaptive regularization parameter adjustment for reconstruction problems" by Watzenig et al. (2004).
In that paper, the authors considered reconstruction problems and used an approach based on the condition number.

3 Problem and Data

To examine the problem of adaptive regularization, we chose an image classification task using the CIFAR-10 dataset. This dataset contains 60,000 images, each 32 x 32 pixels. Each image is in color, so with three color channels each image consists of 32 x 32 x 3 = 3072 floating point values. Each image falls into one of 10 classes. The dataset was split into a training set of 50,000 images and a test set of 10,000 images. The total size of the dataset in memory is approximately 163MB.

4 Neural Network Comparison

4.1 Pull Down/Up Schedule

As a first test, we implemented a two-layer feed-forward neural network for the classification task. We compared a pull-down regularization schedule (starting with high regularization and subsequently decreasing it) and a pull-up regularization schedule (starting with low regularization and subsequently increasing it) against training with each individual regularization value in the schedule held fixed. Fig 1 shows the increase in accuracy from using a varying schedule over using the best fixed regularization parameter.

To make the comparison fair, each comparison model was trained for as many iterations in total as the model trained with the varying schedule. To illustrate, suppose we used a regularization schedule [3.0, 2.0, 1.0, 0.0], training the neural net for 1000 iterations at each value. In total, this means we trained the neural net for 4000 iterations. To compare, we would then train four separate neural networks, one each with regularization 3.0, 2.0, 1.0, and 0.0, and train each of these for 4000 iterations. Fig 1 shows the gain in accuracy from a pull-down regularization schedule over the best model found this way with a constant regularization term. This is equivalent to the gain from using a varying regularization schedule over identifying the best constant regularization parameter with grid search and using that as the baseline.

Because the varying schedule uses the same number of iterations as a single grid-search run, matching the grid-search result in this way is effectively a speedup by a factor of n, where n is the number of regularization parameters to search over. Interestingly, in addition to the speedup, we find an absolute performance gain from the pull-down schedule. We also found that a pull-down schedule generally worked better than a pull-up schedule.
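The iteration accounting behind this comparison can be sketched in a few lines of Python. This is only an illustration of the bookkeeping described above, not the code used in the experiments; the helper names `expand_schedule` and `iteration_budgets` are hypothetical.

```python
from typing import List, Tuple


def expand_schedule(reg_values: List[float], iters_per_value: int) -> List[float]:
    """Return the regularization strength to use at each training iteration."""
    per_iteration: List[float] = []
    for reg in reg_values:
        # Hold each constant for the same number of iterations.
        per_iteration.extend([reg] * iters_per_value)
    return per_iteration


def iteration_budgets(reg_values: List[float], iters_per_value: int) -> Tuple[int, int]:
    """Total iterations for one scheduled run and for the matched grid search."""
    n = len(reg_values)
    schedule_total = n * iters_per_value   # one model, n phases
    grid_total = n * schedule_total        # n models, each trained for the full budget
    return schedule_total, grid_total


pull_down = [3.0, 2.0, 1.0, 0.0]           # example schedule from the text
pull_up = list(reversed(pull_down))        # same constants, increasing order

per_iter = expand_schedule(pull_down, 1000)
schedule_total, grid_total = iteration_budgets(pull_down, 1000)
print(schedule_total, grid_total, grid_total // schedule_total)
```

With the example schedule, the scheduled run costs 4000 iterations while the matched grid search costs 16000, which is where the factor of n = 4 comes from.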
Figure 2 shows the generalization accuracy of a pull-up vs. pull-down regularization schedule as a function of the number of hidden units, for a fixed 1000 iterations per constant.

4.2 Analysis

As shown in Figure 2, a pull-down schedule (high regularization to start, decreasing with time) generally performed better than a pull-up schedule (low regularization to start, increasing with time). We suspect this is because at the beginning of training, essentially all learning is overfitting: the model has seen only a few batches and has nothing else to fit. In other words, at the beginning of training, the model will tend to overfit to the idiosyncrasies of the early batches. If regularization is strong at this point, the weights will not be updated to fit the spurious signals that exist simply because of the random composition of the initial batches. While it is theoretically possible to eventually learn to ignore these false signals, it takes longer to unlearn them than to simply not learn them in the first place, especially if learning rate decay is heavy. Therefore, heavy early regularization prevents more of these false starts when the model is most vulnerable to them, resulting in faster convergence to better accuracy.

As can be seen in Figure 1, the varying regularization schedule generally helped more as the number of hidden units increased. We believe this is because as the number of hidden units increases, the complexity of the loss surface increases exponentially. It is thus likely that there will be more bad local minima that exist simply because of the particularities of the training data. Since changing the regularization changes the loss surface, weak local minima are likely to stop being critical points when the regularization changes. The network will then probably escape these points, an effect that would be more pronounced with more hidden units.
Finally, we observed that the gain from the varying regularization schedules tended to decrease as the iterations per constant increased. This suggests that part of the benefit of varying regularization schedules lies in accelerating convergence to a certain accuracy, even though a vanilla grid search could eventually reach similar performance given sufficient computational resources.
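The claim that changing the regularization moves critical points can be made concrete with a short derivation. It assumes, as an illustration only, that the regularized objective has the form $J_\lambda(w) = L(w) + \lambda \|w\|_2^2$; the section above does not state the exact penalty used in these experiments.

```latex
\begin{align*}
\nabla J_\lambda(w) &= \nabla L(w) + 2\lambda w \\
\nabla J_\lambda(w^*) = 0 \;&\Longrightarrow\; \nabla L(w^*) = -2\lambda w^* \\
\nabla J_{\lambda'}(w^*) &= \nabla L(w^*) + 2\lambda' w^* = 2(\lambda' - \lambda)\, w^*
\end{align*}
```

Under this assumption, a point $w^* \neq 0$ that is critical for $\lambda$ has a nonzero gradient for any $\lambda' \neq \lambda$, so a gradient step taken after the schedule switches constants moves the weights away from it.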

Figure 1: Accuracy gain from a pull-down regularization schedule, as a function of the number of hidden units and iterations per constant.

Figure 2: Difference in final generalization accuracy for a pull-down (Large to Small) vs pull-up (Small to Large) schedule. Iteration number fixed at 1000.

5 Adaptive Regularization

In addition to using a fixed regularization schedule, we also experimented with an adaptive regularization parameter. In this approach, we continuously observe the training and validation accuracy during training and use them as feedback for deciding how to adjust the regularization parameter. Specifically, if the training accuracy is significantly higher than the validation accuracy, we adjust the regularization constant up, because this indicates overfitting. If the training accuracy is very close to or lower than the validation accuracy, we adjust the regularization constant down.

We examined this approach both with the feed-forward neural network from Section 4 and with a three-layer convolutional neural network. Figure 3 shows the results for the feed-forward neural net. In general, there is a positive gain from using this adaptive method, but the gain tends to decrease with the number of hidden units. Figure 4 shows the results for a convolutional neural network with fixed architecture as a function of the epoch number. The adaptive method usually gives a gain, especially in the earlier epochs, but the gap tends to close as the model trains for many iterations. Note, however, that the adaptive method reaches its final validation accuracy earlier, while the fixed schedule tends to continue improving, eventually reaching roughly the same level as the adaptively trained model. This is similar to what we saw in Section 4, which suggested that the varying regularization schedules were speeding up convergence.

6 Nonconvex Regularization

Usually, the $l_1$ or $l_2$ norm is used for regularization. The reason is that the $l_1$ and $l_2$ norms are both convex functions. Convex functions are preferred in many machine learning applications because they have at most one locally optimal value, which coincides with the global optimum, if it exists.
Additionally, the set of points achieving this optimal value is a connected, convex set. Therefore, any optimization method that eventually converges to a local optimum (like stochastic gradient descent) also converges to a global optimum. Convex functions are thus easy to optimize. Many traditional machine learning algorithms, such as logistic regression, are formulated with convex loss functions for this reason. The space of convex functions has the nice property that it is closed under addition, so adding an $l_1$ or $l_2$ regularization term to a convex loss term keeps the loss convex. If the regularization term used a non-convex function, such as a penalty of the form $l_m$ for $m < 1$, there is no guarantee that the resulting loss function would be convex, and so there could exist multiple local minima.

Figure 3: Accuracy gain over the feed-forward neural network as a function of the number of hidden units.

Figure 4: Accuracy gain using an adaptive regularization schedule over a fixed schedule for a three-layer convolutional neural network.

However, neural network losses are already highly non-convex, so the reason for restricting regularization terms to $l_1$ or $l_2$ norms is already lost. We therefore also experimented with an $l_{1/2}$ loss, i.e., a penalty of the form $\|x\|_{1/2} = |x|^{1/2}$ for each weight. We expect this to have a sparsifying effect on our weights. To see why, consider Figures 5-7. Suppose a feature has weight $w$. In $l_1$ regularization, reducing any particular feature weight by $\delta$ reduces the regularization loss by an amount linear in $\delta$. In $l_2$ regularization, $\frac{d}{dw} w^2 = 2w$, so reducing a feature weight by $\delta$ reduces the regularization loss by $O(w\delta)$. However, in $l_{1/2}$ regularization, $\frac{d}{dw} w^{1/2} = \frac{1}{2\sqrt{w}}$, so reducing a feature weight by $\delta$ reduces the regularization loss by $O(\delta/\sqrt{w})$, which for fixed $\delta$ is $O(1/\sqrt{w})$. Therefore, as $w \to 0$, $l_2$ regularization rewards reducing $w$ less and less, $l_1$ keeps the reward constant, and $l_{1/2}$ rewards it more and more. We would therefore expect $l_{1/2}$ regularization to promote sparsity even more than $l_1$. However, we observed that both a pull-up and a pull-down schedule harmed final accuracy with $l_{1/2}$ regularization (Fig 8). Generally, the larger the number of hidden units, the greater the loss we encountered.

Figure 5: Loss of $l_1$ regularization.

Figure 6: Loss of $l_2$ regularization.

Figure 7: Loss of $l_{1/2}$ regularization.

7 Conclusion

We completed a study on using a regularization parameter that changes over time instead of a fixed parameter. We found that it can help in certain neural network models, either by accelerating convergence to a particular accuracy level or, sometimes, by achieving an accuracy level beyond what can be obtained by grid-search-based hyperparameter selection over the same parameters as used in the varying schedule. We also examined $l_{1/2}$ regularization and found that a varying regularization schedule tended to hurt performance. Future work is needed to develop more sophisticated feedback policies for adaptively changing the regularization parameter and to understand why $l_{1/2}$ regularization produced the opposite result to the typical $l_2$ case.
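As a reference point for such feedback policies, the simple rule from Section 5 can be written as a small update function. The sketch below is one possible reading of that rule, not the policy used in the experiments: the thresholds `gap_high` and `gap_low`, the multiplicative `factor`, and the clamping bounds are all hypothetical, since the text does not say how large a gap counts as significant or by how much the constant is changed.

```python
def adjust_regularization(reg: float, train_acc: float, val_acc: float,
                          gap_high: float = 0.05, gap_low: float = 0.01,
                          factor: float = 1.5,
                          reg_min: float = 1e-6, reg_max: float = 10.0) -> float:
    """Return an updated regularization constant from accuracy feedback.

    All default values are illustrative assumptions, not values from the paper.
    """
    gap = train_acc - val_acc
    if gap > gap_high:
        # Training accuracy well above validation accuracy: likely overfitting.
        reg = reg * factor
    elif gap <= gap_low:
        # Training accuracy close to or below validation accuracy: relax.
        reg = reg / factor
    # Gaps between the two thresholds leave the constant unchanged.
    return min(max(reg, reg_min), reg_max)


# Example: feed in (train, validation) accuracy pairs observed during training.
reg = 1.0
for train_acc, val_acc in [(0.60, 0.50), (0.55, 0.52), (0.52, 0.52)]:
    reg = adjust_regularization(reg, train_acc, val_acc)
    print(train_acc, val_acc, reg)
```

A multiplicative update is used here so that the constant stays positive and moves on a log scale; an additive step would be an equally plausible reading of the description.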

Figure 8: Accuracy loss from a varying regularization schedule under $l_{1/2}$ loss. Loss grows more severe with greater number of hidden units.
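The sparsity argument from Section 6 can be checked numerically. The sketch below compares how much each penalty drops when a positive weight is reduced by a fixed step; it assumes a per-weight penalty of the form $|w|^p$ with no coefficient, and the helper name `penalty_reduction`, the step size, and the sample weights are chosen only for illustration.

```python
def penalty_reduction(w: float, delta: float, power: float) -> float:
    """Decrease in the penalty |w|**power when weight w is reduced by delta."""
    return abs(w) ** power - abs(w - delta) ** power


delta = 0.001  # fixed step, smaller than every weight tried below
for w in (1.0, 0.1, 0.01):
    l2 = penalty_reduction(w, delta, 2.0)       # shrinks as w -> 0
    l1 = penalty_reduction(w, delta, 1.0)       # stays equal to delta
    l_half = penalty_reduction(w, delta, 0.5)   # grows as w -> 0
    print(f"w={w:<5} l2={l2:.6f}  l1={l1:.6f}  l1/2={l_half:.6f}")
```

The three columns follow the pattern described in Section 6: the $l_2$ reward for shrinking a weight fades as the weight approaches zero, the $l_1$ reward is constant, and the $l_{1/2}$ reward increases.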

8 References

Watzenig, et al. (2004). Adaptive regularization parameter adjustment for reconstruction problems. IEEE Transactions on Magnetics, Volume 40, Issue 2, March 2004.