Hyper-parameter Optimization for Deep Learning. Tianxiang Gao, Feb 16, 2016


[Diagram: hyper-parameters -> space of hyper-parameters -> evaluation function]

Objective
1. What are the hyper-parameters in deep learning?
2. How do we explore the hyper-parameter space?
3. How do we evaluate the best hyper-parameter setting?
Discuss experiences in hyper-parameter search/optimization.


Typical steps for training a deep network
1. Data pre-processing (none / PCA / normalization)
2. Select the network structure (number of nodes, number of layers, activation function)
3. Select a weight initialization strategy
4. Select a regularization penalty
5. Set learning-related parameters (learning rate, annealing rate, momentum coefficient, mini-batch size, drop-out rate, total iterations)
6. Choose an evaluation scheme (cross-validation, held-out set)
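To make the list concrete, here is a minimal sketch of where each step above shows up as an explicit hyper-parameter; all names and values are hypothetical examples, not recommendations.

```python
# Hypothetical configuration for one training run; every entry below is
# a hyper-parameter from the six steps above. Values are illustrative.
config = {
    # 1. data pre-processing
    "preprocessing": "normalization",    # none / pca / normalization
    # 2. network structure
    "layer_sizes": [784, 500, 500, 10],  # width of each layer
    "activation": "sigmoid",
    # 3. weight initialization
    "init": "fan-in-fan-out",            # or small Gaussian, or pre-trained
    # 4. regularization penalty
    "l2_penalty": 1e-4,
    "dropout_rate": 0.5,
    # 5. learning-related parameters
    "learning_rate": 0.1,
    "annealing_rate": 0.99,              # per-epoch learning-rate decay
    "momentum": 0.9,
    "batch_size": 128,
    "total_iterations": 200,
    # 6. evaluation
    "evaluation": "cross-validation",    # or a held-out set
}
```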

Two types of hyper-parameters
1. Training-related hyper-parameters
2. Model-related hyper-parameters

Training-related hyper-parameters
1. Learning rate
   a. adaptive learning rates
2. Batch size
   a. a small batch size leads to more stochastic behavior
   b. a large batch size increases memory requirements and computation time
   c. mostly a computational concern
3. Momentum
   a. can help pass through local minima
4. Weight-update rule
   a. SGD, CG, L-BFGS; more complex rules bring more hyper-parameters
5. Stopping criteria
   a. patience: stop if the validation error has not improved for a while (see the sketch below)
   b. early stopping is related to regularization
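As a concrete illustration of item 5a, here is a minimal sketch of a patience-based stopping rule; `train_one_epoch` and `validation_error` are hypothetical callables standing in for your own training and evaluation code.

```python
def train_with_patience(train_one_epoch, validation_error,
                        patience=10, max_epochs=1000):
    """Stop when validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                    # one pass of SGD (placeholder)
        error = validation_error()           # error on held-out data
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0   # reset the patience counter
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                            # early stop: no recent progress
    return best_error
```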

Model-related hyper-parameters
1. Network architecture
   a. depth, width, layer-specific structures
2. Initial weights
   a. fan-in/fan-out scaling: draw uniformly from ±4·sqrt(6 / (fan_in + fan_out)) for sigmoid units (units with more inputs should have smaller weights)
   b. pre-trained weights (next slide)
3. Weight decay
   a. L1 and L2 penalties
4. Drop-out rate
LeCun, Yann A., et al. "Efficient BackProp." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 9-48.
Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." International Conference on Artificial Intelligence and Statistics. 2010.
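A short sketch of the fan-in/fan-out rule quoted in item 2a (the normalized initialization of Glorot and Bengio for sigmoid units); the layer sizes are arbitrary examples.

```python
import numpy as np

def init_sigmoid_layer(fan_in, fan_out, seed=0):
    """Uniform weights in [-b, b] with b = 4 * sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng(seed)
    bound = 4.0 * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

W1 = init_sigmoid_layer(784, 500)  # e.g. MNIST input -> first hidden layer
```

A larger fan-in gives a smaller bound, matching the note that units with more inputs should have smaller weights.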

Weight Pre-training
1. Directly training a deep network from random initial weights can be very hard.
2. Idea: use unsupervised stacked Restricted Boltzmann Machines (RBMs) to pre-train the weights.
3. This ensures that the higher-level representations can reconstruct the input information.
Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in Neural Information Processing Systems 19 (2007): 153.
http://www.cs.toronto.edu/~rsalakhu/deeplearning/yoshua_icml2009.pdf

Pre-training using Stacked Autoencoders (Weight-Tying)
Source: http://www.cs.toronto.edu/~rsalakhu/deeplearning/yoshua_icml2009.pdf
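A minimal sketch of the weight-tying idea from this slide: the decoder reuses the transpose of the encoder weights, so one matrix serves both directions. The sizes and sigmoid nonlinearity are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 784, 500
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))  # shared weights
b_hidden = np.zeros(n_hidden)
b_visible = np.zeros(n_visible)

x = rng.random(n_visible)              # stand-in input vector
h = sigmoid(x @ W + b_hidden)          # encode
x_hat = sigmoid(h @ W.T + b_visible)   # decode with the *tied* weights W.T
reconstruction_error = np.mean((x - x_hat) ** 2)
```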

Learning Trajectories in Function Space
Source: http://www.cs.toronto.edu/~rsalakhu/deeplearning/yoshua_icml2009.pdf

Is my network structure OK?
network_dim = [500, 500, 500]
How do we choose the numbers in the above setting? What are the intuitions for choosing a good structure?
Larochelle, Hugo, et al. "Exploring strategies for training deep neural networks." The Journal of Machine Learning Research 10 (2009): 1-40.
The paper tests 1-4 layers with 500, 1000, or 2000 nodes per layer on the MNIST dataset.

Network structures -- depth Larochelle, Hugo, et al. "Exploring strategies for training deep neural networks." The Journal of Machine Learning Research 10 (2009): 1-40.

Network structures -- width Larochelle, Hugo, et al. "Exploring strategies for training deep neural networks." The Journal of Machine Learning Research 10 (2009): 1-40.

Some notes on the results
1. A larger-than-optimal network does not hurt performance very much, since deep networks are regularized (early stopping, weight decay).
2. If pre-training is applied, more layers are needed (the unsupervised training extracts many features that are irrelevant to the specific supervised task).
3. The optimal number of layers is larger for more complex datasets.
4. Given a fixed total number of nodes, a network with equal width in all layers performs best. This setting also yields the most parameters.
5. An overcomplete first hidden layer (more units than input nodes) works better than an undercomplete one.
Bengio, Yoshua. "Practical recommendations for gradient-based training of deep architectures." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 437-478.

Objective
1. What are the hyper-parameters in deep learning?
2. How do we explore the hyper-parameter space?
3. How do we evaluate the best hyper-parameter setting?
Discuss experiences in hyper-parameter search/optimization.

Hyper-parameter optimization
The full hyper-parameter space is far too large to explore exhaustively! With limited time and resources, how should we explore it efficiently?
Some parameters can be pre-determined from experience (stopping criteria, batch size).
Some parameters may not affect performance very much (momentum).
Some parameters need careful selection (network structure).
What about layer-specific hyper-parameters?

General strategies
Manual search: start with some setting and gradually change each parameter to find the best value.
Grid search: set a range for each hyper-parameter, then evaluate all combinations.
Random search: randomly sample hyper-parameter settings from the ranges.
Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305.
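The difference between the last two strategies fits in a few lines. In this sketch, `evaluate` is a hypothetical function that trains a network with the given learning rate and L2 penalty and returns its validation error; the ranges are illustrative.

```python
import random

def grid_search(evaluate):
    # 3 x 3 = 9 trials, but only 3 distinct values per parameter
    learning_rates = [0.001, 0.01, 0.1]
    l2_penalties = [1e-5, 1e-4, 1e-3]
    trials = [(lr, l2) for lr in learning_rates for l2 in l2_penalties]
    return min(trials, key=lambda t: evaluate(*t))

def random_search(evaluate, n_trials=9, seed=0):
    # 9 trials, and 9 distinct values per parameter
    rng = random.Random(seed)
    trials = [(10 ** rng.uniform(-3, -1),   # log-uniform learning rate
               10 ** rng.uniform(-5, -3))   # log-uniform L2 penalty
              for _ in range(n_trials)]
    return min(trials, key=lambda t: evaluate(*t))
```

With the same budget of nine trials, random search tests nine distinct values of whichever parameter turns out to matter, while the grid tests only three.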

Random search for hyper-parameter optimization
Motivation: not all hyper-parameters are equally important.
Insight: by sampling at random, we explore many more distinct values of the important parameters than a grid does.

Random search vs Grid search Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305.

Experiment setting
Find the best combination of 7 hyper-parameters.
For grid search: 100 trials.
For random search: trials drawn at random from a pool of 256 candidate settings.

Datasets
MNIST variants: basic, rotated, random-noise background, image background, rotated + image background
Rectangles variants: basic, image background
Convex

Experiment results
[Figure; the blue dashed line marks grid search with 100 trials.]

Notes
The best performance converges very quickly as the number of trials increases.
Random search with 8 trials finds, on average, a model with test error similar to grid search with 100 trials.
With more hyper-parameters, the number of grid-search trials grows exponentially, while random search can locate a comparable or even better result much faster. This matters especially when some hyper-parameter settings are harmful.
This also suggests that only a few hyper-parameters may be truly important.

Relevance

Some other hints for hyper-parameter search
1. If the best result lies on the border of the search space, we need a larger space.
2. Both grid search and random search can run in parallel, but the results of independent random-search runs are much easier to integrate (see the sketch below).
3. In a long-running hyper-parameter optimization task we accumulate intermediate results. We can even learn the relationship between hyper-parameters and test error, so that we can select the next settings more wisely. However, such methods are much more complicated than random or grid search.
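A tiny sketch of point 2 above: independent random-search runs combine by simple concatenation, whereas two partial grids generally do not merge into one coherent finer grid. The settings and errors here are made-up examples.

```python
# (setting, validation error) pairs from two independent machines
run_a = [({"lr": 0.03}, 0.12), ({"lr": 0.10}, 0.15)]
run_b = [({"lr": 0.01}, 0.11), ({"lr": 0.30}, 0.20)]

all_trials = run_a + run_b                            # integration is trivial
best_setting, best_error = min(all_trials, key=lambda t: t[1])
```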

Other references for automated hyper-parameter optimization
Bergstra, James S., et al. "Algorithms for hyper-parameter optimization." Advances in Neural Information Processing Systems. 2011.
Snoek, Jasper, et al. "Scalable Bayesian Optimization Using Deep Neural Networks." arXiv preprint arXiv:1502.05700 (2015).
Hutter, Frank. "Automated configuration of algorithms for solving hard computational problems." (2009).
Hutter, Frank, Holger H. Hoos, and Kevin Leyton-Brown. "Sequential model-based optimization for general algorithm configuration." Learning and Intelligent Optimization. Springer Berlin Heidelberg, 2011. 507-523.
Srinivasan, Ashwin, and Ganesh Ramakrishnan. "Parameter screening and optimisation for ILP using designed experiments." The Journal of Machine Learning Research 12 (2011): 627-662.
Most of these are black-box methods, applicable not only to deep learning.
https://github.com/HIPS/Spearmint/blob/master/README.md

Have you tuned well enough?
A website that tracks the best score from published papers:
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

Objective
1. What are the hyper-parameters in deep learning?
2. How do we explore the hyper-parameter space?
3. How do we evaluate the best hyper-parameter setting?
Discuss experiences in hyper-parameter search/optimization.

How do we evaluate the best hyper-parameter setting?
1. Generally, for each hyper-parameter setting, we fully train the network and evaluate it on a validation set.
2. Do we really need to train fully on the dataset?
3. A single validation set may be biased. Cross-validation is fairer but needs more time and resources.
A fast structure-search method for CNNs:
Saxe, Andrew, et al. "On random weights and unsupervised feature learning." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
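For reference, a sketch of the standard (expensive) evaluation loop from point 1, which the following slides try to shortcut; `train_fully` and `validation_error` are hypothetical placeholders for your own code.

```python
def select_best_setting(settings, train_fully, validation_error):
    """Fully train one model per setting and keep the lowest validation error."""
    scored = []
    for setting in settings:
        model = train_fully(setting)      # one complete, costly training run
        scored.append((setting, validation_error(model)))
    return min(scored, key=lambda pair: pair[1])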

On random weights and unsupervised feature learning
Saxe, Andrew, et al. "On random weights and unsupervised feature learning." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
Studies showed that in CNNs, random weights in the lower layers can still give good predictions: convolution + pooling alone already extract useful features.

Experiment setting
11 random architectures, varying filter sizes {4x4, 8x8, 12x12, 16x16}, pooling sizes {3x3, 5x5, 9x9}, and filter strides {1, 2}
10 sets of random weights per architecture on NORB
5 sets of random weights per architecture on CIFAR-10
Compare the prediction accuracy of random weights against pretrained + fine-tuned weights
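The idea can be sketched in a simplified form. The paper scores real convolution + pooling architectures; in the sketch below, a random projection with a ReLU stands in as the untrained feature extractor, purely to illustrate "rank architectures with random weights, then spend the training budget on the winner."

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def random_weight_score(n_hidden, X_train, y_train, X_valid, y_valid, seed=0):
    """Accuracy of a linear classifier on features from random, untrained weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / np.sqrt(X_train.shape[1]),
                   size=(X_train.shape[1], n_hidden))   # never trained
    feats_train = np.maximum(X_train @ W, 0.0)          # ReLU random features
    feats_valid = np.maximum(X_valid @ W, 0.0)
    clf = LogisticRegression(max_iter=1000).fit(feats_train, y_train)
    return clf.score(feats_valid, y_valid)

# Rank candidate architectures by their random-weight score (averaged over
# a few seeds), then fully train only the best-ranked one.
```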

Random weights vs Fully learned weights Saxe, Andrew, et al. "On random weights and unsupervised feature learning." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.

Notes
The performance with random weights is correlated with the performance of fully trained weights across architectures.
The architecture itself contributes a great deal to a network's performance.
We can therefore use random weights for a fast, approximate search over architectures.

Random weights vs other methods
It is important to distinguish the contribution of the architecture from the contribution of the training.

Objective
1. What are the hyper-parameters in deep learning?
2. How do we explore the hyper-parameter space?
3. How do we evaluate the best hyper-parameter setting?

Summary
We can optimize hyper-parameters efficiently using the following techniques:
1. [Knowledge and priors about the parameters] Pre-set some parameters from manual trials and experience, and spend more resources estimating the other, more important hyper-parameters.
2. [Use trials efficiently when exploring hyper-parameters] Not all parameters are equally important, so random search gives the important parameters more chances to be explored.
3. [Evaluate hyper-parameter settings more efficiently] We do not always need to fully train the model for every setting to determine the best one; approximate evaluations can save time and shrink the candidate space.

Some further ideas
1. Evaluation uses cross-validation error, but evaluating a single trial is very time-consuming, so leave-one-out cross-validation is infeasible in deep learning. Are there better ways to obtain an unbiased validation metric?
2. Are all training data equally important? Could some training samples matter more than others? (Curriculum learning)
3. Given a long enough time budget (as with Google DeepMind vs. professional Go players), should we spend it training the network with more data, or exploring more hyper-parameters?

Objective
1. What are the hyper-parameters in deep learning?
2. How do we explore the hyper-parameter space?
3. How do we evaluate the best hyper-parameter setting?
Discuss experiences in hyper-parameter search/optimization. What is your experience with hyper-parameter settings?