Training Neural Networks


VISION: Accelerate innovation by unifying data science, engineering and business
PRODUCT: Unified Analytics Platform powered by Apache Spark
WHO WE ARE:
  Founded by the original creators of Apache Spark
  Contributes 75% of the open source code, 10x more than any other company
  Trained 100k+ Spark users on the Databricks platform

About our speaker
Denny Lee, Technical Product Marketing Manager
Former:
  Senior Director of Data Sciences Engineering at SAP Concur
  Principal Program Manager at Microsoft
    Azure Cosmos DB Engineering Spark and Graph Initiatives
    Isotope Incubation Team (currently known as HDInsight)
    Bing's Audience Insights Team
    Yahoo!'s 24TB Analysis Services cube

Deep Learning Fundamentals Series
This is a three-part series:
  Introduction to Neural Networks
  Training Neural Networks
  Applying your Convolutional Neural Network
This series will make use of Keras (TensorFlow backend), but as it is a fundamentals series, we are focusing primarily on the concepts.

Previous Session: Introduction to Neural Networks
  What is Deep Learning?
  What can Deep Learning do for you?
  What are artificial neural networks?
  Let's start with a perceptron
  Understanding the effect of activation functions

Current Session: Training Neural Networks
  Tuning training
  Training Algorithms
  Optimization (including Adam)
  Convolutional Neural Networks

Upcoming Session: Applying Neural Networks
  Diving further into CNNs
  CNN Architectures
  Convolutions at Work!

Convolutional Neural Networks
[Figure: CNN architecture diagram. Feature extraction stage: 28 x 28 input, convolution with 32 filters, convolution with 64 filters, subsampling with stride (2,2) down to 14 x 14, dropout. Classification stage: fully connected layer, dropout, output classes 0 to 9.]

Tuning Training

Hyperparameters
Network
  How many layers?
  How many neurons in each layer?
  What activation functions to use?
Learning algorithm
  What's the best value of the learning rate?
  How quickly should the learning rate decay?
  Momentum?
  What type of loss function should I use?
  What batch size?
  How many iterations are enough?

Overfitting and underfitting


Hyperparameters: Network
Generally, the more layers and the more units in each layer:
  The greater the capacity of the artificial neural network
  The greater the risk of overfitting, when your goal is to build a generalized model
From a practical perspective, a good starting point is:
  The number of input units equals the dimension of the features
  The number of output units equals the number of classes (e.g. in the MNIST dataset there are 10 possible values representing the digits 0-9, hence there are 10 output units)
  Start with one hidden layer that has 2x the number of input units
A good reference is Andrew Ng's Coursera Machine Learning course.
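
The starting-point rule above is simple arithmetic, and a small sketch may make it concrete. The function name starting_layer_sizes and its hidden_factor parameter are made up for illustration; the MNIST input dimension assumes the usual 28 x 28 grayscale images flattened into a vector.

```python
def starting_layer_sizes(n_features, n_classes, hidden_factor=2):
    """Return (input, hidden, output) unit counts for a one-hidden-layer network."""
    return (n_features, hidden_factor * n_features, n_classes)


# MNIST: 28 x 28 pixels flattened to 784 features, 10 digit classes (0-9)
mnist_sizes = starting_layer_sizes(28 * 28, 10)
print(mnist_sizes)  # (784, 1568, 10)
```

This is only a first guess to iterate from; the overfitting risk mentioned above is the reason to treat the hidden layer size as something to tune.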

Hyperparameters: Activation Functions?
Good starting point: ReLU
  Note that many neural network samples use it: Keras MNIST, TensorFlow CIFAR10 Pruning, etc.
Note that each activation function has its own strengths and weaknesses. A good quote on activation functions from CS231N summarizes the choice well:
  "What neuron type should I use? Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of dead units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout."
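
For reference, here is a minimal sketch of the scalar activation functions named in the quote, written in plain Python. The leaky ReLU slope alpha=0.01 is a commonly used value chosen here as an assumption, and Maxout is left out because it needs several sets of weights per unit.

```python
import math


def sigmoid(x):
    """Squashes x into (0, 1); the output is never negative, so it is not zero-centered."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes x into (-1, 1); zero-centered."""
    return math.tanh(x)


def relu(x):
    """Passes positive values through and clamps negative values to zero."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Like ReLU, but keeps a small slope (alpha, an assumed value) for negative inputs."""
    return x if x > 0 else alpha * x


for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu), ("leaky_relu", leaky_relu)]:
    print(name, [round(fn(x), 3) for x in (-2.0, 0.0, 2.0)])
```

A ReLU unit whose input stays negative outputs zero and receives zero gradient, which is the "dead unit" the quote warns about; the leaky variant keeps a small gradient in that region.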

DEMO Neurons Activate!

Hyperparameters
Learning algorithm
  What's the best value of the learning rate?
  How quickly should the learning rate decay?
  Momentum?
  What type of loss function should I use?
  What batch size?
  How many iterations are enough?

Training Algorithms

Cost function
For this linear regression example, to determine the best p (the slope of the line) for y = x · p, we can calculate a cost function such as Mean Square Error, Mean absolute error, Mean bias error, SVM Loss, etc.
For this example, we'll use the sum of squared absolute differences:
$\text{cost} = \sum |t - y|^2$
Source: https://bit.ly/2ioagzl
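
As a minimal sketch of this cost calculation, the snippet below evaluates the cost for a few candidate slopes. It assumes that t denotes the target value for each input x, and it uses made-up data points lying on the line y = 2x; neither assumption comes from the slide.

```python
def predict(x, p):
    """Linear model from the slide: y = x * p."""
    return x * p


def cost(xs, ts, p):
    """Sum of squared absolute differences between targets t and predictions y."""
    return sum(abs(t - predict(x, p)) ** 2 for x, t in zip(xs, ts))


# Hypothetical data generated from the line y = 2x
xs = [1.0, 2.0, 3.0]
ts = [2.0, 4.0, 6.0]

costs = {p: cost(xs, ts, p) for p in (1.0, 2.0, 3.0)}
print(costs)  # {1.0: 14.0, 2.0: 0.0, 3.0: 14.0}
```

The cost is zero at the slope that generated the data and grows on either side of it, which is the bowl shape that gradient descent walks down.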

Gradient Descent Optimization Source: https://bit.ly/2ioagzl

Small Learning Rate Source: https://bit.ly/2ioagzl


Simplified Two-Layer ANN
[Figure: two inputs (1, 1) connected to three hidden units]
$h_1 = \sigma(1 \times 0.8 + 1 \times 0.6) = 0.80$
$h_2 = \sigma(1 \times 0.2 + 1 \times 0.9) = 0.75$
$h_3 = \sigma(1 \times 0.7 + 1 \times 0.1) = 0.69$

Simplified Two-Layer ANN
[Figure: the three hidden units (0.80, 0.75, 0.69) connected to one output unit]
$out = \sigma(0.2 \times 0.8 + 0.8 \times 0.75 + 0.5 \times 0.69) = \sigma(1.105) = 0.75$
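
The forward pass on these two slides can be reproduced in a few lines. This is a sketch using the weights shown on the slides; the variable names are chosen here for illustration. The slide rounds the hidden activations before computing the output, whereas this code keeps full precision, so intermediate values differ slightly in the third decimal place.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


inputs = [1.0, 1.0]
# One row of weights per hidden unit (h1, h2, h3), as on the slide
hidden_weights = [[0.8, 0.6], [0.2, 0.9], [0.7, 0.1]]
# One weight per hidden unit feeding the single output unit
output_weights = [0.2, 0.8, 0.5]

hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in hidden_weights]
out = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

print([round(h, 2) for h in hidden])  # [0.8, 0.75, 0.69]
print(round(out, 2))                  # 0.75
```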

Backpropagation
[Figure: network with input, hidden, and output layers]
Backpropagation calculates the gradient of the cost function in a neural network
It is used by the gradient descent optimization algorithm to adjust the weights of the neurons
It is also known as backward propagation of errors, as the error is calculated and distributed back through the network of layers
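
To make "calculated and distributed back" concrete, here is a sketch of backpropagation for the small network from the forward-pass slides. The squared-error cost and the target value of 1.0 are assumptions made for this example, and the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden activations and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, out


def backprop(x, target, w_hidden, w_out):
    """Gradients of cost = (target - out)^2 with respect to every weight."""
    hidden, out = forward(x, w_hidden, w_out)
    # Output layer: d(cost)/d(out) = -2 (target - out), sigmoid'(z) = out (1 - out)
    delta_out = -2.0 * (target - out) * out * (1.0 - out)
    grad_out = [delta_out * h for h in hidden]
    # Hidden layer: push the output error back through each outgoing weight
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out


x = [1.0, 1.0]
w_hidden = [[0.8, 0.6], [0.2, 0.9], [0.7, 0.1]]
w_out = [0.2, 0.8, 0.5]
target = 1.0  # hypothetical target, not taken from the slides

grad_hidden, grad_out = backprop(x, target, w_hidden, w_out)
print(grad_out)
print(grad_hidden)
```

Because the output (about 0.75) is below the assumed target, every gradient here is negative, so a gradient descent step would increase all of the weights.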

Sigmoid function (continued)
Output is not zero-centered: if all the values coming into a neuron are positive, then during backpropagation the gradients on its weights will be all positive or all negative, creating zig-zagging dynamics in the gradient descent updates.
Source: https://bit.ly/2ioagzl

Learning Rate Callouts
Too small: it may take too long to reach the minimum
Too large: it may skip over the minimum altogether
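
Both callouts can be seen on a one-parameter problem. The sketch below runs plain gradient descent on a sum-of-squares cost for the linear model y = x · p. The data points (taken from the line y = 2x) and the three learning rates are made-up values, chosen only to show the three regimes.

```python
def gradient(xs, ts, p):
    """d(cost)/dp for cost = sum((t - x * p)^2)."""
    return sum(-2.0 * x * (t - x * p) for x, t in zip(xs, ts))


def gradient_descent(xs, ts, learning_rate, steps=20, p=0.0):
    """Take a fixed number of gradient steps and return the final slope."""
    for _ in range(steps):
        p -= learning_rate * gradient(xs, ts, p)
    return p


# Hypothetical data generated from the line y = 2x, so the best slope is 2
xs = [1.0, 2.0, 3.0]
ts = [2.0, 4.0, 6.0]

p_small = gradient_descent(xs, ts, learning_rate=0.001)  # still far from 2 after 20 steps
p_good = gradient_descent(xs, ts, learning_rate=0.03)    # settles at 2
p_large = gradient_descent(xs, ts, learning_rate=0.08)   # overshoots further on every step

print(p_small, p_good, p_large)
```

With the largest rate each step lands further from the minimum than the one before, so the estimate diverges instead of merely converging slowly.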

Optimization

Optimization Overview
After backpropagation, the parameters are updated based on the gradients calculated
There are several approaches in this area of active research; we will focus on:
  Stochastic Gradient Descent
  Momentum, NAG
  Per-parameter adaptive learning rate methods

Stochastic Gradient Descent
(Batch) Gradient Descent is computed on the full dataset (not efficient for large-scale models and datasets).
Stochastic Gradient Descent instead updates the parameters using one training example (or a small mini-batch) at a time:
  It often converges faster because it performs updates more frequently
  But due to the frequent updates, this may complicate convergence to the exact minimum
For more information, refer to:
  Andrew Ng's 2. Stochastic Gradient (https://goo.gl/bnrjbx)
  Types of Optimization Algorithms used in Neural Networks and Ways to Optimize Gradient Descent (https://goo.gl/tb2e7s)
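
The difference in update frequency can be sketched on a one-parameter linear model. The data, learning rate, and epoch count below are made up for illustration. To keep the run reproducible, the stochastic version visits the examples in a fixed order, whereas real implementations usually shuffle them each epoch.

```python
def batch_gradient_descent(xs, ts, learning_rate, epochs, p=0.0):
    """One update per epoch, using the mean gradient over the full dataset."""
    updates = 0
    for _ in range(epochs):
        grad = sum(-2.0 * x * (t - x * p) for x, t in zip(xs, ts)) / len(xs)
        p -= learning_rate * grad
        updates += 1
    return p, updates


def stochastic_gradient_descent(xs, ts, learning_rate, epochs, p=0.0):
    """One update per training example, so len(xs) updates per epoch."""
    updates = 0
    for _ in range(epochs):
        for x, t in zip(xs, ts):
            grad = -2.0 * x * (t - x * p)
            p -= learning_rate * grad
            updates += 1
    return p, updates


# Hypothetical data generated from the line y = 2x
xs = [1.0, 2.0, 3.0]
ts = [2.0, 4.0, 6.0]

p_batch, batch_updates = batch_gradient_descent(xs, ts, learning_rate=0.03, epochs=10)
p_sgd, sgd_updates = stochastic_gradient_descent(xs, ts, learning_rate=0.03, epochs=10)
print(p_batch, batch_updates)
print(p_sgd, sgd_updates)
```

On this noise-free toy data the stochastic version ends closer to the true slope after the same number of passes, simply because it has taken three times as many steps. With noisy data its path would jitter around the minimum, which is the convergence caveat mentioned above.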

Gradient Descent Source: https://goo.gl/vux2zs

Momentum and NAG
Momentum: obtain faster convergence by helping the parameter vector build up velocity, i.e. use the momentum of the gradient to converge faster
Nesterov Accelerated Gradient (NAG): an optimized version of Momentum
  Typically works better in practice than Momentum
Source: http://cs231n.github.io/neural-networks-3
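
A minimal sketch of the two update rules, following the form given in the CS231n notes cited above: classical momentum evaluates the gradient at the current position, while Nesterov evaluates it at the "looked-ahead" position. The toy objective f(x) = x^2, the starting point, and the hyperparameter values are assumptions chosen for illustration.

```python
def grad(x):
    """Gradient of the toy objective f(x) = x^2."""
    return 2.0 * x


def momentum_step(x, v, learning_rate=0.1, mu=0.9):
    """Classical momentum: velocity accumulates past gradients."""
    v = mu * v - learning_rate * grad(x)
    return x + v, v


def nesterov_step(x, v, learning_rate=0.1, mu=0.9):
    """Nesterov momentum: gradient is taken where the velocity is about to carry us."""
    x_ahead = x + mu * v
    v = mu * v - learning_rate * grad(x_ahead)
    return x + v, v


x_mom, v_mom = 5.0, 0.0
x_nag, v_nag = 5.0, 0.0
for _ in range(200):
    x_mom, v_mom = momentum_step(x_mom, v_mom)
    x_nag, v_nag = nesterov_step(x_nag, v_nag)

print(x_mom, x_nag)  # both end close to the minimum at 0
```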

Annealing the learning rate
i.e. slow down the learning rate to prevent the parameters from bouncing around too much
Referred to as the decay parameter (i.e. the learning rate decay over each update), which reduces the kinetic energy
Note that this is different from rho (i.e. the exponentially weighted average, or exponentially weighted decay, of past gradients), which smooths the descent trajectory
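
One common way to apply a decay parameter is time-based decay, where the learning rate shrinks with the number of updates taken so far. The sketch below shows that schedule as one possible reading of the decay parameter. The function name and the example values are made up, and other schedules (step decay, exponential decay) are equally valid.

```python
def decayed_learning_rate(initial_rate, decay, iteration):
    """Time-based decay: the rate shrinks as 1 / (1 + decay * iteration)."""
    return initial_rate / (1.0 + decay * iteration)


# Hypothetical settings: start at 0.1 and halve the rate after 100 updates
schedule = [decayed_learning_rate(0.1, 0.01, i) for i in (0, 100, 300, 900)]
print(schedule)
```

With decay set to zero the function returns the initial rate unchanged, so the schedule reduces to a constant learning rate.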

Per-parameter adaptive learning rate methods
Adaptively tune learning rates at the parameter level. Popular methods include:
  Adaptive Gradient Algorithm (AdaGrad): improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).
  Root Mean Square Propagation (RMSProp): maintains per-parameter learning rates based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy ones).
  AdaDelta: a per-dimension learning rate method for gradient descent that has minimal computational overhead, requires no manual tuning, and is quite robust.
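
As a rough sketch of how the first two methods differ, the functions below apply one AdaGrad step and one RMSProp step to a single parameter, following the update rules as given in the CS231n notes. The hyperparameter values and the toy gradient are assumptions for illustration. In a real network each parameter keeps its own cache value, which is what makes the learning rate per-parameter.

```python
import math


def adagrad_step(x, cache, grad, learning_rate=0.1, eps=1e-8):
    """AdaGrad: the cache sums all past squared gradients, so steps only shrink."""
    cache += grad * grad
    x -= learning_rate * grad / (math.sqrt(cache) + eps)
    return x, cache


def rmsprop_step(x, cache, grad, learning_rate=0.1, decay_rate=0.9, eps=1e-8):
    """RMSProp: the cache is a moving average, so old gradients are forgotten."""
    cache = decay_rate * cache + (1.0 - decay_rate) * grad * grad
    x -= learning_rate * grad / (math.sqrt(cache) + eps)
    return x, cache


# Toy objective f(x) = x^2 with gradient 2x, starting from x = 5
x_ada, cache_ada = adagrad_step(5.0, 0.0, grad=2.0 * 5.0)
x_rms, cache_rms = rmsprop_step(5.0, 0.0, grad=2.0 * 5.0)
print(x_ada, cache_ada)
print(x_rms, cache_rms)
```

In both cases the gradient is divided by a running measure of its own magnitude, so the step size depends much less on the raw gradient scale than in plain SGD.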

Which Optimizer?
"In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative."
Andrej Karpathy, et al., CS231n
Source: https://goo.gl/2da4wy
[Figure: Comparison of Adam to Other Optimization Algorithms Training a Multilayer Perceptron. Taken from Adam: A Method for Stochastic Optimization, 2015.]
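
For completeness, here is a sketch of a single-parameter Adam update with bias correction. The default hyperparameters are the ones suggested in the Adam paper; the toy objective f(x) = x^2 and the larger learning rate used in the loop are assumptions for illustration.

```python
import math


def adam_step(x, m, v, t, grad, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad         # moving average of gradients (momentum-like)
    v = beta2 * v + (1.0 - beta2) * grad * grad  # moving average of squared gradients (RMSProp-like)
    m_hat = m / (1.0 - beta1 ** t)               # correct the bias toward zero in early steps
    v_hat = v / (1.0 - beta2 ** t)
    x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


# Toy objective f(x) = x^2 with gradient 2x, starting from x = 5
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 11):
    x, m, v = adam_step(x, m, v, t, grad=2.0 * x, learning_rate=0.1)
print(x)  # has moved from 5 toward the minimum at 0
```

Adam combines the two ideas from the previous slides: the first moment plays the role of momentum and the second moment provides the per-parameter scaling.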

Optimization on loss surface contours
[Animation; image credit: Alec Radford]
Adaptive algorithms converge quickly and find the right direction for the parameters.
In comparison, SGD is slow
Momentum-based methods overshoot
Source: http://cs231n.github.io/neural-networks-3/#hyper

Optimization on saddle point
[Animation; image credit: Alec Radford]
Notice how SGD gets stuck near the top
Meanwhile, adaptive techniques optimize the fastest
Source: http://cs231n.github.io/neural-networks-3/#hyper

Good References
  Suki Lau's Learning Rate Schedules and Adaptive Learning Rate Methods for Deep Learning
  CS231n Convolutional Neural Networks for Visual Recognition
  Fundamentals of Deep Learning
  ADADELTA: An Adaptive Learning Rate Method
  Gentle Introduction to the Adam Optimization Algorithm for Deep Learning

Convolutional Networks

Convolutional Neural Networks
Similar to Artificial Neural Networks, but CNNs (or ConvNets) make the explicit assumption that the inputs are images
Regular neural networks do not scale well to images
  E.g. CIFAR-10 images are 32x32x3 (32 wide, 32 high, 3 color channels) = 3072 weights per fully connected neuron (somewhat manageable)
  A larger image of 200x200x3 = 120,000 weights
CNNs have neurons arranged in 3D: width, height, depth.
Neurons in a layer are only connected to a small region of the layer before it, i.e. NOT to all of its neurons in a fully-connected manner.
The final output layer for CIFAR-10 is 1x1x10, as we reduce the full image into a single vector of class scores arranged along the depth dimension
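
The weight counts above are per neuron in a fully connected first hidden layer, and the arithmetic is easy to check. The function name below is made up for illustration, and bias terms are ignored.

```python
def weights_per_neuron(width, height, channels):
    """Weights needed by one fully connected neuron that sees every input value."""
    return width * height * channels


cifar_weights = weights_per_neuron(32, 32, 3)
large_weights = weights_per_neuron(200, 200, 3)
print(cifar_weights)  # 3072
print(large_weights)  # 120000
```

These figures are for a single neuron, so a hidden layer with many such neurons multiplies them accordingly, which is why full connectivity stops scaling as images grow.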

CNNs / ConvNets
[Figure: a regular 3-layer neural network compared with a ConvNet, which arranges its neurons in 3 dimensions; a 3D input results in a 3D output]
Source: https://cs231n.github.io/convolutional-networks/


Convolutional Neural Networks
Input: pixel values of 32x32x3 (32 width, 32 height, 3 color channels (RGB))

Convolutional Neural Networks
Convolutions: compute the output of neurons that are connected to a small local region of the input (a dot product between their weights and that region). If we use 32 filters, then the output is 28x28x32 (using a 5x5 filter)
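
The 28x28x32 figure follows from the standard output-size formula for a convolution, (W - F + 2P) / S + 1. The sketch below assumes a 32x32 input, stride 1, and no padding, which is what makes a 5x5 filter produce a 28x28 map; the function name is illustrative.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1


side = conv_output_size(32, 5)  # 5x5 filter over a 32-pixel side
n_filters = 32                  # each filter contributes one slice of output depth
output_volume = (side, side, n_filters)
print(output_volume)  # (28, 28, 32)
```

With padding of 2 the same filter would preserve the 32x32 spatial size, which is a common alternative design choice.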

Convolutional Neural Networks
Pooling: perform a downsampling operation along the spatial dimensions (w, h), resulting in a reduced volume, e.g. 14x14x2.
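
A small sketch of the downsampling step, using max pooling on a single 2D feature map. The 2x2 window is an assumption (the diagram only states a stride of (2,2)), and the function name is illustrative. Pooling is applied to each depth slice independently, so only the width and height shrink.

```python
def max_pool(feature_map, size=2, stride=2):
    """Max pooling over a 2D list of numbers; returns the downsampled 2D list."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, stride)
        ]
        for r in range(0, rows - size + 1, stride)
    ]


small = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled_small = max_pool(small)
print(pooled_small)  # [[6, 8], [14, 16]]

# A 28 x 28 map shrinks to 14 x 14, matching the sizes in the diagram
pooled_large = max_pool([[0] * 28 for _ in range(28)])
print(len(pooled_large), len(pooled_large[0]))  # 14 14
```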

Convolutional Neural Networks
Fully Connected: neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks.

ConvNetJS MNIST Demo https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

DEMO Neurons Activate!

I'd like to thank

Great References
  Andrej Karpathy's ConvNetJS MNIST Demo
  What is back propagation in neural networks?
  CS231n: Convolutional Neural Networks for Visual Recognition
    Syllabus and Slides
    Course Notes
    YouTube
    With particular focus on CS231n: Lecture 7: Convolution Neural Networks
  Neural Networks and Deep Learning
  TensorFlow
  Deep Visualization Toolbox
  Back Propagation with TensorFlow
  TensorFrames: Google TensorFlow with Apache Spark
  Integrating deep learning libraries with Apache Spark
  Build, Scale, and Deploy Deep Learning Pipelines with Ease

Attribution
  Tomek Drabas
  Brooke Wenig
  Timothee Hunter
  Cyrielle Simeone

Q&A

What's next?
Applying your Convolutional Neural Network
  October 25, 2018, 10:00 PDT
  https://dbricks.co/2o2c4bz
State Of The Art Deep Learning On Apache Spark
  October 31, 2018, 09:00 PDT
  https://dbricks.co/2nqogip