Adaptive Activation Functions for Deep Networks


Adaptive Activation Functions for Deep Networks
Michael Dushkoff, Raymond Ptucha
Rochester Institute of Technology
IS&T International Symposium on Electronic Imaging 2016, Computational Imaging, Feb 16, 2016

Convolutional Neural Networks Have Revolutionized Computer Vision and Pattern Recognition
(Taigman et al., 2014; Simonyan et al., 2014; Szegedy et al., 2014; Karpathy et al., 2014)

Deep Learning Is Surpassing the Visual Cortex's Object Detection and Recognition Capability
[Figure: top-5 error on ImageNet by year, comparing traditional computer vision and machine learning with deep convolutional neural networks (CNNs); after the introduction of deep learning, trained CNNs fell below the error of a trained human (genius intellect) by 2015.] A similar effect has been demonstrated on voice and pattern recognition.

Outline
Introduction, Background, Methodology, Datasets, Results, Conclusions

The Human Brain
We've learned more about the brain in the last 5 years than we learned in the previous 5000 years! It controls every aspect of our lives, but we still don't understand exactly how it works.

Neurons in Brain vs. Computer
[Figure: an artificial neuron with inputs x_0, x_1, ..., x_n, weights, a summation, an activation function g(·), and output h(x).] The brain has billions of cells called neurons. Each is connected to up to 10K others, forming a network of 100T connections. If the sum of a neuron's inputs exceeds a threshold, the neuron fires. Artificial neurons, inspired by biology, compute a weighted sum of their inputs and pass it through a nonlinear activation function. Artificial neural networks are formed by connecting thousands to millions of these artificial neurons together.
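
A minimal Python sketch of such an artificial neuron, assuming NumPy and a tanh activation for illustration (the names and choice of activation are illustrative, not from the talk):

    import numpy as np

    def artificial_neuron(x, theta, activation=np.tanh):
        # Prepend the bias unit x_0 = 1, take the weighted sum of the inputs,
        # then pass it through a nonlinear activation function.
        x = np.concatenate(([1.0], x))
        return activation(np.dot(theta, x))

    # Example: three inputs, four weights (one per input plus the bias).
    print(artificial_neuron(np.array([0.5, -1.2, 3.0]),
                            np.array([0.1, 0.4, -0.3, 0.2])))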

Three Most Common Activation Functions
Sigmoid: constrains 0 ≤ out ≤ 1; gradient saturates to 0; inputs centered on 0, but output centered on 0.5; gradient easy to calculate.
Tanh: constrains -1 ≤ out ≤ 1; gradient saturates to 0; input and output centered on 0; gradient easy to calculate.
Rectified Linear Unit (ReLU): max(0, x); unbounded upper range with no gradient saturation; empirically faster and better results; neurons can die if allowed to grow unconstrained.

Tanh vs. ReLU on the CIFAR-10 dataset [Krizhevsky 12]
[Figure: training error vs. epoch for ReLU and tanh.] ReLU reaches 25% error 6× faster than tanh. Note: learning rates optimized for each, no regularization, four-layer CNN.
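
For reference, the three activations in plain Python/NumPy (standard textbook definitions, not code from the talk):

    import numpy as np

    def sigmoid(x):
        # Output constrained to (0, 1); gradient saturates for large |x|.
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # Output constrained to (-1, 1) and centered on 0; gradient also saturates.
        return np.tanh(x)

    def relu(x):
        # max(0, x): unbounded above, no saturation for positive inputs,
        # but units stuck at 0 stop learning ("dead" neurons).
        return np.maximum(0.0, x)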

Lots of Other Activation Functions
- Non-monotonic functions [Dawson 92]
- Adaptive cubic splines [Vecci 98]
- Adaptive parameters [Nakayama 98]
- Monotonic and non-monotonic mixtures [Wong 02]
- Gated adaptive functions [Scheler 04]
- Periodic functions [Kang 05]
- Maxout and Leaky ReLUs [Goodfellow 13]
- Adaptive Leaky ReLUs [He 15]

Contributions
Prior work was either constrained to small networks or forced all nodes in a layer to share the same activation function. This work learns activation functions on a node-by-node basis (for images, every pixel can have its own activation function) and experiments on larger datasets. It finds that allowing nodes to adaptively learn their own activation functions results in faster convergence and higher accuracies.

Traditional Artificial Neuron
h_θ(x) = g(Σ_i θ_i x_i), where x is the input, h_θ(x) is the output, and g(·) is the activation function. Note that x_0 is the bias unit, x_0 = 1.

Proposed Method
Adaptive activation functions are defined by
y = Σ_j σ(π_j) g_j(x)
where x is the input, y is the output, each g_j(·) is a unique activation function, σ(·) is a convex (sigmoid) limiting function, and π_j is the gating factor, which is learned.
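
A minimal PyTorch sketch of one plausible reading of this definition: each node's output is a sum over a small bank of candidate activations, each scaled by a sigmoid-limited, learned gating factor. The module, parameter names, and candidate set are assumptions for illustration, not the authors' code:

    import torch
    import torch.nn as nn

    class AdaptiveActivation(nn.Module):
        """Sigmoid-gated mixture of candidate activations, one gate per node and function."""
        def __init__(self, num_nodes, functions=(torch.relu, torch.tanh, torch.sigmoid)):
            super().__init__()
            self.functions = functions
            # pi[j, n]: learned gating factor for candidate function j at node n.
            self.pi = nn.Parameter(torch.zeros(len(functions), num_nodes))

        def forward(self, x):
            # x: (batch, num_nodes). Gates are limited to (0, 1) by a sigmoid.
            gates = torch.sigmoid(self.pi)
            out = torch.zeros_like(x)
            for j, f in enumerate(self.functions):
                out = out + gates[j] * f(x)   # gates[j] broadcasts over the batch
            return out

    # Example: a layer of 512 nodes, batch of 8.
    act = AdaptiveActivation(num_nodes=512)
    y = act(torch.randn(8, 512))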

Proposed Method
[Figure: an artificial neuron whose inputs x_0, x_1, ..., x_n feed a weighted sum followed by gated candidate activation functions g(·), with one learned gating factor per function.]

Architecture
VGG-like network structure [1]. Forward and back propagation were modified to handle adaptive activation functions. Batch normalization follows each convolution [2]. [Figure: a stack of 3×3 convolution blocks whose feature maps grow 64 → 128 → 256 → 512 as spatial resolution halves from 32×32 down to 2×2, followed by two 512-unit fully connected layers and a 100-class output; the slide labels it as a 64×64 input, 100-class example.]

[1] Simonyan and Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556, 2014.
[2] Ioffe and Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. CoRR abs/1502.03167, 2015.
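
A sketch of this kind of block in PyTorch, assuming 3×3 convolutions with batch normalization after each convolution; the layer counts and widths below are assumptions, not the exact network from the talk, and a fixed ReLU stands in where an adaptive activation module would be dropped in:

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch, activation=None):
        # 3x3 convolution -> batch normalization -> activation.
        # Pass an adaptive activation module here to replace the default ReLU.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            activation if activation is not None else nn.ReLU(inplace=True),
        )

    # Illustrative VGG-like stack for 32x32x3 inputs and 100 classes.
    model = nn.Sequential(
        conv_block(3, 64),    conv_block(64, 64),   nn.MaxPool2d(2),  # 32x32 -> 16x16
        conv_block(64, 128),  conv_block(128, 128), nn.MaxPool2d(2),  # 16x16 -> 8x8
        conv_block(128, 256), conv_block(256, 256), nn.MaxPool2d(2),  # 8x8  -> 4x4
        conv_block(256, 512), conv_block(512, 512), nn.MaxPool2d(2),  # 4x4  -> 2x2
        nn.Flatten(),
        nn.Linear(512 * 2 * 2, 512), nn.ReLU(inplace=True),
        nn.Linear(512, 100),
    )
    logits = model(torch.randn(8, 3, 32, 32))  # (8, 100)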

Technical Approach
Adaptive functions were used only on certain layers: the first n layers vs. the last n layers. [Figure: the same VGG-like architecture, indicating which convolution blocks use adaptive activations.]

Datasets
CIFAR-100: 100 classes; 32×32×3 pixels/image; 500 training and 100 testing images per class.
CalTech256: 257 classes; roughly 300×200×3 pixels/image, resampled to 64×64×3; 80 to 827 images per class.
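
One way to realize the "last n layers adaptive" variant, as a sketch: build the convolution blocks with a fixed activation, then swap the activation of the last n blocks for a gated one. The GatedActivation below is a simplified per-channel stand-in (a sigmoid-gated blend of ReLU and tanh), not the authors' per-node formulation:

    import torch
    import torch.nn as nn

    class GatedActivation(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.pi = nn.Parameter(torch.zeros(channels))  # one learned gate per channel

        def forward(self, x):
            g = torch.sigmoid(self.pi).view(1, -1, 1, 1)   # broadcast over batch, H, W
            return g * torch.relu(x) + (1 - g) * torch.tanh(x)

    def swap_last_n_activations(blocks, n):
        # blocks: iterable of nn.Sequential(conv, batchnorm, activation) modules.
        for block in list(blocks)[-n:]:
            block[2] = GatedActivation(block[0].out_channels)

    blocks = nn.ModuleList(
        nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                      nn.BatchNorm2d(c_out),
                      nn.ReLU(inplace=True))
        for c_in, c_out in [(3, 64), (64, 128), (128, 256), (256, 512)]
    )
    swap_last_n_activations(blocks, n=2)  # only the last two blocks become adaptive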

Results: CIFAR-100
ReLU baseline: 57.4%.
Adaptive case, first 7 layers adaptive: 51.5%.
Adaptive case, last 7 layers adaptive: 59.8%.
[Figure: comparison of baseline vs. adaptive training curves on CIFAR-100.]

Additional CIFAR-100 Results
[Figure: usage statistics of the learned activation functions.]
[Figure: randomly selected adaptive functions.]

Results: Caltech 256
Baseline: 32.5%.
Adaptive, last 5 layers adaptive: 32.6%.
[Figure: comparison of baseline vs. adaptive training curves on Caltech 256.]

Conclusions
Adaptive activation functions improve accuracy over ReLU on CIFAR-100, but not on Caltech 256. On both datasets, training converges faster with adaptive activation functions. Additional training strategies can be implemented to keep the adaptive function parameters from dominating the optimization problem.

Next Steps
Implement a new training method (ON/OFF training with gradient scaling). Apply non-monotonic functions to the adaptive definition to allow for more complex nonlinear behavior.

Thank you!! rwpeec@rit.edu