Hyper-parameter Optimization for Deep Learning
Tianxiang Gao
Feb 16, 2016
Hyper-parameters
Space of hyper-parameters
Evaluation function
Objective
1. What are the hyper-parameters in deep learning?
2. How do we explore the space of hyper-parameters?
3. How do we evaluate the best hyper-parameters?
Discuss experiences in hyper-parameter search/optimization.
Typical steps for training a deep network
1. Data pre-processing (none / PCA / normalization)
2. Select the network structure (number of nodes, number of layers, activation function)
3. Select a weight-initialization strategy
4. Select the regularization penalty
5. Learning-related parameters (learning rate, annealing rate, momentum coefficient, mini-batch size, drop-out rate, total iterations)
6. Evaluation (cross-validation, held-out set)
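For concreteness, the choices above can be gathered into one configuration object. A minimal Python sketch follows; every name and value is an illustrative assumption, not a recommendation from these slides:

```python
# A minimal configuration sketch covering the steps above. All names and
# values are illustrative assumptions, not recommendations from these slides.
config = {
    "preprocessing": "normalization",  # none / pca / normalization
    "network_dim": [500, 500, 500],    # width of each hidden layer
    "activation": "sigmoid",
    "init": "fan_in_fan_out",          # weight-initialization strategy
    "l2_penalty": 1e-4,                # regularization penalty
    "learning_rate": 0.1,
    "annealing_rate": 0.99,            # multiplicative decay per epoch
    "momentum": 0.9,
    "batch_size": 128,
    "dropout_rate": 0.5,
    "max_iterations": 100000,
    "evaluation": "cross_validation",  # cross-validation / held-out set
}
```

Keeping all hyper-parameters in one place makes a later grid or random search over them straightforward.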
Two types of hyper-parameters
1. Training-related hyper-parameters
2. Model-related hyper-parameters
Training-related hyper-parameters
1. Learning rate
   a. adaptive learning rates
2. Batch size
   a. a small batch size leads to stochastic behavior
   b. a large batch size affects memory requirements and computational time
   c. mostly a computational concern
3. Momentum
   a. can help pass through local minima
4. Weight update
   a. SGD, CG, L-BFGS; more complex methods bring more hyper-parameters
5. Stopping criteria
   a. patience (stop if the validation error has not improved for a while; see the sketch below)
   b. early stopping is related to regularization
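A minimal sketch of patience-based early stopping (item 5a above). `train_one_epoch` and `validation_error` are hypothetical stand-ins for the reader's own training and evaluation routines:

```python
# A sketch of patience-based early stopping. `train_one_epoch` and
# `validation_error` are hypothetical callables supplied by the user.
def train_with_patience(train_one_epoch, validation_error,
                        patience=10, max_epochs=200):
    best_error, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = validation_error()
        if err < best_error:
            best_error, best_epoch = err, epoch  # improved: reset the clock
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop early
    return best_error
```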
Model-related hyper-parameters
1. Network architecture
   a. depth, width, layer-specific structures
2. Initial weights
   a. fan-in/fan-out: for sigmoid units, sample uniformly from [-4*sqrt(6/(fan_in + fan_out)), 4*sqrt(6/(fan_in + fan_out))] (units with more inputs should have smaller weights; see the sketch below)
   b. pre-trained weights (next slide)
3. Weight decay
   a. L1 and L2 penalties
4. Drop-out rate
LeCun, Yann A., et al. "Efficient backprop." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 9-48.
Glorot, Xavier, and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." International Conference on Artificial Intelligence and Statistics. 2010.
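A sketch of the fan-in/fan-out rule quoted in item 2a, for sigmoid units (Glorot & Bengio, 2010). The factor 4 is specific to sigmoid units; tanh units use the same interval without it:

```python
import numpy as np

def glorot_sigmoid_init(fan_in, fan_out, rng=np.random):
    """Uniform init on [-4*sqrt(6/(fan_in+fan_out)), +4*sqrt(6/(fan_in+fan_out))],
    the interval quoted above for sigmoid units (Glorot & Bengio 2010)."""
    bound = 4.0 * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

W = glorot_sigmoid_init(784, 500)  # e.g. a 784-input, 500-unit layer
```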
Weight pre-training
1. Directly training a deep network from random initial weights can be very hard
2. Idea: use unsupervised stacked Restricted Boltzmann Machines (RBMs) to pre-train the weights
3. This ensures that the higher-level representations can reconstruct the input information
Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in Neural Information Processing Systems 19 (2007): 153.
http://www.cs.toronto.edu/~rsalakhu/deeplearning/yoshua_icml2009.pdf
Pre-training using stacked autoencoders (weight tying)
http://www.cs.toronto.edu/~rsalakhu/deeplearning/yoshua_icml2009.pdf
Learning trajectories in function space
http://www.cs.toronto.edu/~rsalakhu/deeplearning/yoshua_icml2009.pdf
Is my network structure OK?
network_dim = [500, 500, 500]
How do we choose the numbers in the setting above? What are the intuitions for choosing a good structure?
Test: 1-4 layers, with 500, 1000, or 2000 nodes per layer, on the MNIST dataset.
Larochelle, Hugo, et al. "Exploring strategies for training deep neural networks." The Journal of Machine Learning Research 10 (2009): 1-40.
Network structures -- depth Larochelle, Hugo, et al. "Exploring strategies for training deep neural networks." The Journal of Machine Learning Research 10 (2009): 1-40.
Network structures -- width Larochelle, Hugo, et al. "Exploring strategies for training deep neural networks." The Journal of Machine Learning Research 10 (2009): 1-40.
Some notes on the results
1. A larger-than-optimal network does not hurt performance very much, since training a deep network already involves regularization (early stopping, weight decay).
2. If pre-training is applied, more layers are needed (in unsupervised training, many learned features are irrelevant to the specific supervised task).
3. The optimal number of layers is larger for more complex datasets.
4. Given a fixed total number of nodes, a network with equal width in all layers performs best. This setting also yields the most parameters.
5. An overcomplete first hidden layer (more units than inputs) works better than an undercomplete one.
Bengio, Yoshua. "Practical recommendations for gradient-based training of deep architectures." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg, 2012. 437-478.
Objective
1. What are the hyper-parameters in deep learning?
2. How do we explore the space of hyper-parameters?
3. How do we evaluate the best hyper-parameters?
Discuss experiences in hyper-parameter search/optimization.
Hyper-parameter optimization
The full hyper-parameter space is far too large to explore exhaustively!
With limited time/resources, how should we explore the hyper-parameters efficiently?
Some can be pre-determined from experience (stopping criteria, batch size).
Some may not affect performance very much (momentum).
Some need careful selection (network structure).
What about layer-specific hyper-parameters?
General strategies
Manual search: start from some setting, then gradually change each parameter to find the best setting.
Grid search: set a range for each hyper-parameter, then evaluate all combinations.
Random search: randomly sample hyper-parameter settings from the ranges (see the sketch below).
Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305.
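To make the contrast concrete, here is a minimal Python sketch of both strategies over two hyper-parameters. `evaluate` is a hypothetical function returning the validation error for a (learning rate, L2 penalty) pair, and the ranges are illustrative:

```python
import itertools
import random

lr_grid = [0.3, 0.1, 0.03, 0.01]   # illustrative ranges
l2_grid = [1e-3, 1e-4, 1e-5]

def grid_search(evaluate):
    # evaluate every combination: 4 * 3 = 12 trials
    return min(itertools.product(lr_grid, l2_grid),
               key=lambda p: evaluate(*p))

def random_search(evaluate, n_trials=12):
    # draw each parameter independently, log-uniformly over its range
    trials = [(10 ** random.uniform(-2.5, -0.5),   # learning rate
               10 ** random.uniform(-5.0, -3.0))   # L2 penalty
              for _ in range(n_trials)]
    return min(trials, key=lambda p: evaluate(*p))
```

With the same 12 trials, grid search tests only 4 distinct learning rates, while random search tests 12: this is why random search covers the important dimensions more densely.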
Random search for hyper-parameter optimization
Motivation: not all hyper-parameters are equally important; some matter more than others.
Insight: with random search, every trial tries a fresh value of each parameter, so the important dimensions are covered much more densely than with a grid.
Random search vs Grid search Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The Journal of Machine Learning Research 13.1 (2012): 281-305.
Experiment setting
Find the best combination of 7 hyper-parameters.
Grid search: 100 trials.
Random search: trials drawn at random, sub-sampled from a pool of 256 random trials.
Datasets
MNIST variants: rotated, random-noise background, image background, rotated + image background.
Rectangles, rectangles with image background, convex.
Experiment result Blue dashed line: grid search with 100 trials
Notes
The best performance converges very quickly as the number of trials increases.
Random search with 8 trials, on average, finds a model with test error similar to grid search with 100 trials.
With more hyper-parameters, the number of grid-search trials grows exponentially, while random search locates a good result quickly, and often a better one. This matters when some hyper-parameter settings are harmful.
The results also suggest that only a few hyper-parameters may really matter.
Relevance
(Figure: estimated relevance of individual hyper-parameters per dataset; Bergstra & Bengio 2012.)
Some other hints on hyper-parameter search
1. If the best result sits on the border of the search space, we need a larger space.
2. Both grid search and random search can run in parallel, but the results of independent random-search trials are especially easy to integrate (see the sketch below).
3. In a long-running hyper-parameter optimization task we accumulate intermediate results. We can even learn the relationship between hyper-parameters and test error, so that the next settings are chosen more wisely. However, such methods are much more complicated than random/grid search.
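A minimal sketch of why random-search results are easy to integrate (hint 2): every trial is an independent draw from the same distribution, so trials from parallel workers can simply be pooled. The `(params, error)` pair format here is an assumption for illustration:

```python
# Trials from independent random-search workers are i.i.d. draws from the
# same distribution, so pooling them is all the integration needed.
# Each worker contributes a hypothetical list of (params, error) pairs.
def merge_random_search(*worker_results):
    pooled = [trial for results in worker_results for trial in results]
    return min(pooled, key=lambda trial: trial[1])  # best (params, error)
```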
Other references on automated parameter optimization
Bergstra, James S., et al. "Algorithms for hyper-parameter optimization." Advances in Neural Information Processing Systems. 2011.
Snoek, Jasper, et al. "Scalable Bayesian Optimization Using Deep Neural Networks." arXiv preprint arXiv:1502.05700 (2015).
Hutter, Frank. "Automated configuration of algorithms for solving hard computational problems." (2009).
Hutter, Frank, Holger H. Hoos, and Kevin Leyton-Brown. "Sequential model-based optimization for general algorithm configuration." Learning and Intelligent Optimization. Springer Berlin Heidelberg, 2011. 507-523.
Srinivasan, Ashwin, and Ganesh Ramakrishnan. "Parameter screening and optimisation for ILP using designed experiments." The Journal of Machine Learning Research 12 (2011): 627-662.
Most of these are black-box methods, applicable beyond deep learning.
https://github.com/hips/spearmint/blob/master/readme.md
Have you tuned well enough?
A website that tracks the best published score for each dataset:
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
Objective
1. What are the hyper-parameters in deep learning?
2. How do we explore the space of hyper-parameters?
3. How do we evaluate the best hyper-parameters?
Discuss experiences in hyper-parameter search/optimization.
How do we evaluate the best hyper-parameters?
1. Generally, for each hyper-parameter setting we fully train the network and evaluate on a validation set.
2. Do we really need to fully train for every setting?
3. A single validation set may be biased. Cross-validation is fairer but needs more time/resources.
Next: a fast structure-search method for CNNs.
Saxe, Andrew, et al. "On random weights and unsupervised feature learning." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
On random weights and unsupervised feature learning
Studies showed that in a CNN, random weights in the lower layers can still give good predictions: convolution + pooling alone already extract useful features.
Saxe, Andrew, et al. "On random weights and unsupervised feature learning." Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
Experiment setting
11 random architectures, varying filter sizes {4x4, 8x8, 12x12, 16x16}, pooling sizes {3x3, 5x5, 9x9}, and filter strides {1, 2}.
10 sets of random weights per architecture on NORB; 5 sets per architecture on CIFAR-10.
Compare prediction accuracy between random weights and pretrained + fine-tuned weights.
Random weights vs Fully learned weights Saxe, Andrew, et al. "On random weights and unsupervised feature learning." Proceedings of the 28th international conference on machine learning (ICML-11). 2011.
Notes
The performance of random weights correlates with that of fully trained weights across architectures.
The architecture itself contributes a great deal to a network's performance.
We can therefore use random weights for a fast, approximate search over architectures (see the sketch below).
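A minimal sketch of that approximate search, assuming hypothetical `build_network` and `validation_accuracy` helpers; the candidate grid mirrors the ranges from the experiment setting above:

```python
# Rank candidate architectures with *random* (untrained) weights, then fully
# train only the winner. `build_network` and `validation_accuracy` are
# hypothetical stand-ins for a real CNN constructor and evaluator.
def random_weight_search(candidates, build_network, validation_accuracy,
                         seeds_per_arch=5):
    def score(arch):
        # average over several random initializations to reduce variance
        accs = [validation_accuracy(build_network(arch, seed=s))
                for s in range(seeds_per_arch)]
        return sum(accs) / len(accs)
    return max(candidates, key=score)

# candidate grid mirroring the ranges above
candidates = [{"filter": f, "pool": p, "stride": s}
              for f in (4, 8, 12, 16) for p in (3, 5, 9) for s in (1, 2)]
```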
Random weights vs other methods
It's important to distinguish the contribution of the architecture from the contribution of the training.
Objective
1. What are the hyper-parameters in deep learning?
2. How do we explore the space of hyper-parameters?
3. How do we evaluate the best hyper-parameters?
Summary
We can optimize hyper-parameters efficiently with the following techniques:
1. [Knowledge and priors about the parameters] Pre-set some parameters from manual trials and experience, and spend more resources estimating the other, more important hyper-parameters.
2. [Use trials efficiently when exploring hyper-parameters] Not all parameters are equally important, so random search gives the important ones more chances to be explored.
3. [Make hyper-parameter evaluation more efficient] We often do not need to fully optimize the model for each hyper-parameter setting to pick the best one; approximate estimates can save time and shrink the candidate space.
Some further ideas
1. Evaluation uses cross-validation error. Since a single trial is very time-consuming, leave-one-out cross-validation is impractical in deep learning. Are there better estimation methods for an unbiased validation metric?
2. Are all training data equally important, or might some training samples matter more than others? (Curriculum learning.)
3. Given long enough (like Google DeepMind vs. professional Go players), should we spend more time training the network with more data, or exploring more hyper-parameters?
Objective
1. What are the hyper-parameters in deep learning?
2. How do we explore the space of hyper-parameters?
3. How do we evaluate the best hyper-parameters?
Discuss experiences in hyper-parameter search/optimization.
What are your experiences with hyper-parameter settings?