Stochastic Gradient Descent


Stochastic Gradient Descent. EE807: Recent Advances in Deep Learning, Lecture 2. Slides made by Insu Han and Jongheon Jeong, KAIST EE.

Table of Contents
1. Introduction: Empirical risk minimization (ERM)
2. Gradient Descent Methods: Gradient descent (GD), Stochastic gradient descent (SGD)
3. Momentum and Adaptive Learning Rate Methods: Momentum methods, Learning rate scheduling, Adaptive learning rate methods (AdaGrad, RMSProp, Adam)
4. Changing Batch Size: Increasing the batch size instead of decaying the learning rate
5. Summary

Empirical Risk Minimization (ERM)
Given a training set $\{(x_i, y_i)\}_{i=1}^n$ and a prediction function $f(x;\theta)$ parameterized by $\theta$, empirical risk minimization finds a parameter that minimizes the loss function
$$\min_{\theta} \; L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i;\theta),\, y_i\big),$$
where $\ell$ is a loss function, e.g., MSE or cross entropy. For example, for a neural network, $\theta$ collects all the weights and biases of the layers.
Next: how to solve ERM?
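To make the definition concrete, here is a minimal sketch of the empirical risk for a hypothetical linear model with MSE loss (the toy data, model, and dimensions are illustrative assumptions, not from the lecture):

```python
# Minimal sketch: empirical risk (1/n) * sum_i (f(x_i; theta) - y_i)^2 for a toy linear model.
import numpy as np

def empirical_risk(theta, X, y):
    preds = X @ theta                  # f(x; theta) = x^T theta (toy linear model)
    return np.mean((preds - y) ** 2)   # average MSE loss over the training set

rng = np.random.default_rng(0)         # toy data: n = 100 samples, d = 5 features
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
print(empirical_risk(np.zeros(5), X, y))
```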

Gradient Descent (GD)
Gradient descent (GD) updates the parameters iteratively by taking a step along the negative gradient:
$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t),$$
where $\theta$ are the parameters, $L$ is the loss function, and $\eta$ is the learning rate.
(+) Converges to the global (local) minimum for convex (non-convex) problems.
(-) Not efficient in computation time and memory for huge $n$. For example, the ImageNet dataset has $n = 1{,}281{,}167$ training images: 1.2M 256x256 RGB images, roughly 236 GB of memory.
Next: efficient GD
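As an illustration of the update rule, a minimal full-batch gradient descent loop on a hypothetical linear least-squares problem (the step size, step count, and data are arbitrary assumptions):

```python
# Minimal sketch of full-batch GD: theta <- theta - eta * grad L(theta), using all n samples per step.
import numpy as np

def gradient_descent(X, y, eta=0.1, steps=100):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 / X.shape[0] * X.T @ (X @ theta - y)  # gradient of the MSE empirical risk
        theta -= eta * grad
    return theta

rng = np.random.default_rng(0)                           # toy regression data
X = rng.normal(size=(100, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=100)
print(gradient_descent(X, y))                            # recovers roughly [1, 2, 3, 4, 5]
```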

Stochastic Gradient Descent (SGD)
Stochastic gradient descent (SGD) uses a small random sample (minibatch) $B_t$ of the training set to approximate the GD update:
$$\theta_{t+1} = \theta_t - \eta\, \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta\, \ell\big(f(x_i;\theta_t),\, y_i\big).$$
In practice, minibatch sizes are typically 32/64/128.
Main practical challenges and current solutions:
1. SGD can be too noisy and might be unstable → momentum
2. It is hard to find a good learning rate → adaptive learning rates
Next: momentum
*source: https://lovesnowbest.site/2018/02/16/improving-deep-neural-networks-assignment-2/
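For comparison with the full-batch loop above, a minimal minibatch SGD sketch on the same kind of toy problem (batch size, learning rate, and epoch count are hypothetical choices):

```python
# Minimal sketch of minibatch SGD: each update uses a random minibatch instead of all n samples.
import numpy as np

def sgd(X, y, eta=0.05, batch_size=32, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)                             # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(idx) * Xb.T @ (Xb @ theta - yb)  # noisy minibatch gradient estimate
            theta -= eta * grad
    return theta
```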

Momentum Methods
1. Momentum gradient descent: add decaying previous gradients (momentum):
$$v_{t+1} = \mu v_t - \eta\, \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$
where $\mu$ is the momentum preservation ratio. The update is equivalent to a weighted sum of previous updates, each scaled by a power of the fraction $\mu$.
(+) Momentum reduces oscillation and accelerates convergence: plain SGD fluctuates vertically, while SGD + momentum adds friction to the vertical fluctuation and acceleration toward the minimum (to the left in the figure).
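A minimal sketch of the momentum update above; `grad_fn` is a hypothetical callable that returns the (minibatch) gradient at the current parameters:

```python
# Minimal sketch of momentum: v <- mu * v - eta * grad L(theta), theta <- theta + v.
import numpy as np

def momentum_step(theta, v, grad_fn, eta=0.01, mu=0.9):
    g = grad_fn(theta)
    v = mu * v - eta * g      # decaying accumulation of previous gradients
    theta = theta + v
    return theta, v

# usage sketch on a toy quadratic loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta, v = np.ones(3), np.zeros(3)
for _ in range(100):
    theta, v = momentum_step(theta, v, grad_fn=lambda th: th)
```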

Momentum Methods: Nesterov's Momentum
1. Momentum gradient descent: add decaying previous gradients (momentum), with momentum preservation ratio $\mu$.
(-) Momentum can fail to converge even for simple convex optimization problems.
Nesterov's accelerated gradient (NAG) [Nesterov 1983] uses the gradient at the approximate future position, i.e., the lookahead gradient:
$$v_{t+1} = \mu v_t - \eta\, \nabla_\theta L(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}.$$

Momentum Methods: Nesterov's Momentum (cont.)
Nesterov's accelerated gradient (NAG) [Nesterov 1983] uses the gradient at the approximate future position $\theta_t + \mu v_t$ (figure: trajectories of SGD, SGD + momentum, and NAG).
Quiz: fill in the pseudocode of Nesterov's accelerated gradient (one possible answer is sketched below).
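One possible answer to the quiz, sketched under the same notation as the momentum step (`grad_fn` is again a hypothetical gradient callable):

```python
# Nesterov's accelerated gradient (NAG): evaluate the gradient at the lookahead point theta + mu * v.
import numpy as np

def nag_step(theta, v, grad_fn, eta=0.01, mu=0.9):
    g = grad_fn(theta + mu * v)   # lookahead gradient at the approximate future position
    v = mu * v - eta * g
    theta = theta + v
    return theta, v
```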

Adaptive Learning Rate Methods
2. Learning rate scheduling. The learning rate is critical for minimizing the loss!
- Too high: updates may jump over narrow valleys, and training can diverge.
- Too low: updates may fall into a poor local minimum, and convergence is slow.
Next: learning rate scheduling
*source: http://cs231n.github.io/neural-networks-3/

Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: decay methods. A naive choice is a constant learning rate. Common learning rate schedules include time-based, step, and exponential decay:
- Time-based decay: $\eta_t = \eta_0 / (1 + k t)$
- Exponential decay: $\eta_t = \eta_0 e^{-k t}$
- Step decay (most popular in practice): decrease the learning rate by a fixed factor every few epochs. Typically $\eta_0 = 0.01$, dropped by half every 10 epochs.
(figure: step decay vs. exponential decay, and the resulting accuracy curves)
*source: https://towardsdatascience.com/
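Minimal sketches of the three decay schedules as functions of the epoch t; the initial rate, decay constant, drop factor, and drop period below are hypothetical choices:

```python
# Time-based, exponential, and step learning rate decay schedules.
import math

def time_based_decay(t, eta0=0.01, k=0.01):
    return eta0 / (1.0 + k * t)

def exponential_decay(t, eta0=0.01, k=0.1):
    return eta0 * math.exp(-k * t)

def step_decay(t, eta0=0.01, drop=0.5, every=10):
    return eta0 * drop ** (t // every)   # e.g., halve the rate every 10 epochs
```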

Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: cycling method. [Smith 2015] proposed the cyclical learning rate (triangular schedule). Why a cyclical learning rate? Sometimes increasing the learning rate helps escape saddle points. It can also be combined with exponential or periodic decay of the cycle amplitude. A sketch of the triangular schedule follows.
(figure: cycling (triangular) schedule with decaying amplitude)
*source: https://github.com/bckenstler/clr
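A minimal sketch of the triangular schedule, in the spirit of [Smith 2015]; the boundary learning rates and step size are hypothetical:

```python
# Triangular cyclical learning rate: sweeps linearly between base_lr and max_lr every 2 * step_size iterations.
def triangular_clr(it, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    cycle = it // (2 * step_size)
    x = abs(it / step_size - 2 * cycle - 1)   # position within the current cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```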

Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: cycling method. [Loshchilov 2017] uses cosine cycling and restarts the learning rate at its maximum at the start of each cycle. Why cosine? It decays slowly during the first half of the cycle and drops quickly during the rest.
(+) The optimizer can climb down and up the loss surface, and thus traverse several local minima.
(+) It amounts to restarting at good points with the initial learning rate. See the sketch below.
*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017
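A minimal sketch of cosine cycling with restarts in the spirit of SGDR; SGDR also lets the cycle length grow after each restart, but a fixed cycle length T is assumed here for simplicity, and all values are hypothetical:

```python
# Cosine annealing with warm restarts: half a cosine from eta_max down to eta_min per cycle,
# then the rate jumps back to eta_max at the start of the next cycle.
import math

def cosine_with_restarts(t, T=50, eta_max=0.1, eta_min=1e-4):
    t_cur = t % T                      # position within the current cycle
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / T))
```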

Adaptive Learning Rate Methods: Learning rate annealing
2. Learning rate scheduling: cycling method. [Loshchilov 2017] also proposed warm restarts for the cyclical learning rate. *Warm restart: restart frequently in the early iterations.
(+) It helps escape saddle points, since training is more likely to get stuck early on.
(figure: step decay vs. cycling with no restart vs. cycling with restart)
But there is no perfect learning rate schedule! It depends on the specific task. Next: adaptive learning rates
*source: Loshchilov et al., SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017

Adaptive Learning Rate Methods: AdaGrad, RMSProp
3. Adaptively changing the learning rate (AdaGrad, RMSProp). AdaGrad [Duchi 2011] downscales the learning rate by the magnitude of previous gradients:
$$G_t = \sum_{s \le t} g_s^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t,$$
where $G_t$ is the (elementwise) sum of all previous squared gradients.
(-) The learning rate strictly decreases and becomes too small for large iteration counts.
RMSProp [Tieleman 2012] instead uses a moving average of squared gradients:
$$G_t = \rho\, G_{t-1} + (1-\rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t,$$
with preservation ratio $\rho$. Other variants also exist, e.g., Adadelta [Zeiler 2012].
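Minimal sketches of the two updates above; g is the current gradient, and the hyperparameters and epsilon (to avoid division by zero) are hypothetical defaults:

```python
# AdaGrad: divide by the root of the SUM of all past squared gradients (keeps shrinking).
# RMSProp: divide by the root of a MOVING AVERAGE of squared gradients (does not vanish).
import numpy as np

def adagrad_step(theta, g, accum, eta=0.01, eps=1e-8):
    accum = accum + g ** 2
    theta = theta - eta * g / (np.sqrt(accum) + eps)
    return theta, accum

def rmsprop_step(theta, g, avg, eta=0.001, rho=0.9, eps=1e-8):
    avg = rho * avg + (1.0 - rho) * g ** 2
    theta = theta - eta * g / (np.sqrt(avg) + eps)
    return theta, avg
```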

Adaptive Learning Rate Methods
Visualization of the algorithms (animations: optimization on a saddle point, optimization near a local optimum). Adaptive learning-rate methods, e.g., Adadelta and RMSprop, are the most suitable and provide the best convergence in these scenarios.
Next: momentum + adaptive learning rate
*source: animations from Alec Radford's blog

Adaptive Learning Rate Methods: Adam
3. Combination of momentum and adaptive learning rate. Adam (ADAptive Moment estimation) [Kingma 2015] keeps both a momentum term and an average of squared gradients; it can be seen as momentum + RMSProp. Other variants exist, e.g., Adamax [Kingma 2015], Nadam [Dozat 2016]. A sketch of the update follows.
*source: Kingma and Ba. Adam: A method for stochastic optimization. ICLR 2015
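A minimal sketch of the Adam update with bias correction; the default hyperparameters follow common practice but are assumptions here:

```python
# Adam: first moment (momentum) + second moment (average of squared gradients), bias-corrected.
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g            # first moment estimate
    v = beta2 * v + (1.0 - beta2) * g ** 2       # second moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias correction (t starts at 1)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```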

Decaying the Learning Rate = Increasing the Batch Size
In practice, SGD + momentum and Adam work well in many applications, but scheduling the learning rate is still critical (it should decay appropriately). [Smith 2017] shows that decaying the learning rate is equivalent to increasing the batch size.
(+) A larger batch size means fewer parameter updates, which enables more parallelism! A sketch of such a schedule follows.
*source: Smith et al., "Don't Decay the Learning Rate, Increase the Batch Size." ICLR 2018
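A minimal sketch of the idea: wherever a step-decay schedule would divide the learning rate by some factor, multiply the batch size by that factor instead. The growth factor, period, and cap below are hypothetical, not values from the paper:

```python
# Grow the batch size on a step schedule instead of decaying the learning rate.
def batch_size_schedule(epoch, base_batch=128, factor=2, every=30, max_batch=8192):
    growth = factor ** (epoch // every)
    return min(base_batch * growth, max_batch)
```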

Summary
- SGD has been an essential algorithm for deep learning, together with backpropagation.
- Momentum methods improve the performance of gradient descent algorithms: Nesterov's momentum.
- Learning rate annealing is critical for minimizing the training loss: exponential, harmonic, and cyclic decay methods.
- Adaptive learning rate methods: RMSProp, AdaGrad, AdaDelta, Adam, etc.
- In practice, SGD + momentum shows successful results, outperforming Adam, e.g., in NLP (Huang et al., 2017) and machine translation (Wu et al., 2016).

References
[Nesterov 1983] Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). 1983. link: http://mpawankumar.info/teaching/cdt-big-data/nesterov83.pdf
[Duchi et al. 2011] Duchi et al. Adaptive subgradient methods for online learning and stochastic optimization. JMLR 2011. link: http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
[Tieleman 2012] Geoff Hinton's Lecture 6e of the Coursera class. link: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
[Zeiler 2012] Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. link: https://arxiv.org/pdf/1212.5701.pdf
[Smith 2015] Smith, Leslie N. Cyclical learning rates for training neural networks. link: https://arxiv.org/pdf/1506.01186.pdf
[Kingma and Ba 2015] Kingma and Ba. Adam: A method for stochastic optimization. ICLR 2015. link: https://arxiv.org/pdf/1412.6980.pdf
[Dozat 2016] Dozat, T. (2016). Incorporating Nesterov Momentum into Adam. ICLR Workshop. link: http://cs229.stanford.edu/proj2015/054_report.pdf
[Smith et al. 2017] Smith, Samuel L., Pieter-Jan Kindermans, and Quoc V. Le. Don't Decay the Learning Rate, Increase the Batch Size. ICLR 2018. link: https://openreview.net/pdf?id=b1yy1bxcz
[Loshchilov et al. 2017] Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017. link: https://arxiv.org/pdf/1608.03983.pdf