Reduced-memory training and deployment of deep residual networks by stochastic binary quantization

Reduced-memory training and deployment of deep residual networks by stochastic binary quantization
Mark D. McDonnell 1, Ruchun Wang 2 and André van Schaik 2
cls-lab.org
1 Computational Learning Systems Laboratory, School of Information Technology & Mathematical Sciences, University of South Australia
2 BENS Laboratory, MARCS Institute, Western Sydney University, Australia

Motivation and Background

Background
- Deep convolutional neural networks: many parameters, many sequential layers
- Following training: the learnt parameters occupy ~10-100 MB
- During training with BP+SGD: can easily max out the 12 GB of RAM in GPUs
- This is mainly temporary storage from the forward pass (FP) for use in backpropagation (BP)
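To make the memory pressure concrete, here is a rough, illustrative estimate (not from the slides) of how much RAM the activations cached during the forward pass can consume; the batch size, widths and layer shapes below are assumptions chosen only to show the order of magnitude and the factor-of-32 saving from 1-bit storage.

```python
# Rough, illustrative estimate (not from the slides) of why training memory is
# dominated by activations cached during the forward pass (FP) for reuse in
# backpropagation (BP). The layer shapes, widths and batch size are assumptions.

def activation_bytes(batch, channels, height, width, bits=32):
    """Bytes needed to cache one layer's activation tensor for the backward pass."""
    return batch * channels * height * width * bits // 8

batch = 128
# A 20-layer wide-ResNet-like stack on 32x32 inputs (hypothetical widths)
layers = [(160, 32, 32)] * 6 + [(320, 16, 16)] * 6 + [(640, 8, 8)] * 6

total = sum(activation_bytes(batch, c, h, w) for (c, h, w) in layers)
print(f"~{total / 2**20:.0f} MB of cached activations at 32-bit precision")
print(f"~{total / 32 / 2**20:.0f} MB if each cached value is stored with 1 bit")
```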

Motivation
- How can we minimize the memory (MB) required during training with BP+SGD?
- This is a different goal to model compression following training, but we consider that too
- Model compression methods offer ways to reduce RAM access, if not usage, during BP+SGD
- We refer to this goal as compressed learning

Benefits of reducing RAM use during BP+SGD
- Train larger models on a single GPU
- Run BP+SGD for large models on mobile devices
- Is it always possible/desirable to train at the data center?
  - Personalized or highly secure fine-tuning
  - Rapid re-training
  - Remote deployment: no comms
  - Continuous learning with streaming data

Low bit-width deep CNNs: Prior results
- Iandola et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1 MB model size, arXiv:1602.07360, 2016.
- Courbariaux, Bengio and David, BinaryConnect: Training deep neural networks with binary weights during propagations, arXiv:1511.00363, 2015.
- Hubara et al., Quantized neural networks: Training neural networks with low precision weights and activations, arXiv:1609.07061, 2016.
- Merolla et al., Deep neural networks are robust to weight binarization and other non-linear distortions, arXiv:1606.01981, 2016.
- Rastegari et al., XNOR-Net: ImageNet classification using binary convolutional neural networks, arXiv:1603.05279, 2016.

Low bit-width deep CNNs: Prior results
1. Model compression
   - It is easy to compress convolution parameters to a single bit following training, with little accuracy penalty
2. Compressed learning
   - Model compression doesn't help much: parameters are updated using full precision
   - Gradients: need 6-12 bits
   - Activations: use binary nonlinearity layers instead of ReLUs; this incurs an accuracy penalty

Our Approach

Our approach for model compression
- Similar to others: use the sign of the weights for FP and BP, but use full-precision weights for updates (see the sketch below)
- Different to others: we found no need to normalise [Rastegari et al.], and we use new tricks from full-precision CNN training
- Net result: large improvements on CIFAR-10
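As a concrete illustration of "sign of the weights for FP and BP, full-precision weights for updates", here is a minimal NumPy sketch for a single fully connected layer; the toy data, squared-error loss, learning rate and the [-1, 1] clipping (as in BinaryConnect) are assumptions, not the authors' actual training code.

```python
import numpy as np

# Minimal sketch: the *sign* of the weights is used in the forward and backward
# passes, while the SGD update is applied to the full-precision weights.
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((10, 32))   # full-precision "master" weights
x = rng.standard_normal((4, 32))           # toy mini-batch
t = rng.standard_normal((4, 10))           # toy regression targets
lr = 0.01

for step in range(100):
    Wb = np.sign(W)                # 1-bit weights used for propagation
    y = x @ Wb.T                   # forward pass (FP) with binary weights
    err = y - t                    # gradient of 0.5 * squared error w.r.t. y
    grad_W = err.T @ x / len(x)    # backward pass (BP) also sees only Wb
    W -= lr * grad_W               # ...but the update hits the full-precision W
    W = np.clip(W, -1.0, 1.0)      # keep weights near the binarization range
```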

Our approach for model compression
Our improvements come from:
- Using wide ResNets [1] as a baseline
- Using standard light data augmentation
- Using a warm-restart learning-rate schedule (sketched below)
[1] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv:1605.07146, 2016.
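A sketch of the kind of warm-restart (SGDR-style) learning-rate schedule referred to above, assuming cosine annealing with a first cycle of one epoch and period doubling; under those assumptions the cycle lengths sum to 1+2+4+8+16+32 = 63 or, with one more cycle, 127 epochs, which lines up with the epoch budgets quoted in the results tables. The specific lr_max and lr_min values are placeholders.

```python
import math

def warm_restart_lr(epoch, lr_max=0.1, lr_min=0.0, t0=1, t_mult=2):
    """Cosine-annealed learning rate with warm restarts (SGDR-style sketch).

    lr_max, lr_min, the first cycle length t0 and the period-doubling factor
    t_mult are placeholder assumptions, not values taken from the talk.
    """
    t_cur, t_i = epoch, t0
    while t_cur >= t_i:            # locate the position within the current cycle
        t_cur -= t_i
        t_i *= t_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_i))

# With t0=1 and t_mult=2 the cycle lengths are 1, 2, 4, 8, 16, 32 (63 epochs
# in total), or one further cycle of 64 for 127 epochs.
schedule = [round(warm_restart_lr(e), 3) for e in range(63)]
print(schedule[:10])
```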

Our approach for compressed learning
- Inspiration from computational neuroscience: feedback alignment
- Key points:
  - Forward propagation remains unchanged
  - BP with inexact gradient calculations

Feedback alignment
- Lillicrap et al., Random synaptic feedback weights support error backpropagation for deep learning, Nature Communications, vol. 7, p. 13276, 2016.
- CINE: Computation-inspired neurobiological elements!
- Thought-provoking 2016 Hinton talk: Can the brain do backpropagation?
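For readers unfamiliar with feedback alignment, the following toy NumPy sketch (assumptions: a two-layer sigmoid network on random regression data) shows the core idea referred to here: the backward pass replaces the transpose of the forward weights with a fixed random matrix, giving inexact gradients while leaving forward propagation untouched.

```python
import numpy as np

# Toy sketch of feedback alignment (Lillicrap et al., 2016): the backward pass
# uses a fixed random matrix B instead of W2.T, so gradients are inexact, yet
# the forward pass is unchanged and the network still learns.
# Architecture, data and hyperparameters are illustrative assumptions.
rng = np.random.default_rng(0)
n_in, n_hid, n_out, lr = 8, 16, 4, 0.05
W1 = 0.1 * rng.standard_normal((n_hid, n_in))
W2 = 0.1 * rng.standard_normal((n_out, n_hid))
B = 0.1 * rng.standard_normal((n_hid, n_out))   # fixed random feedback weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal((32, n_in))
t = rng.standard_normal((32, n_out))

for step in range(200):
    h = sigmoid(x @ W1.T)              # forward pass (unchanged)
    y = h @ W2.T
    e = y - t                          # output error
    delta_h = (e @ B.T) * h * (1 - h)  # inexact backward pass: B replaces W2.T
    W2 -= lr * e.T @ h / len(x)
    W1 -= lr * delta_h.T @ x / len(x)
```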

Our approach for compressed learning
- Key points we borrow from feedback alignment:
  - Forward propagation remains unchanged
  - BP with inexact gradient calculations
- Different to others:
  - We keep ReLU activations, A, for the forward pass
  - We convert to a single bit, A_q, only for use in the backward pass
- Our single-bit quantization of activations is stochastic: A_q = I(A + noise > 1)
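A minimal sketch of the stochastic single-bit activation quantization A_q = I(A + noise > 1) described above, assuming uniform noise on [0, 1); the slide does not specify the noise distribution or the activation scaling, so those details are illustrative.

```python
import numpy as np

# Stochastic 1-bit quantization of activations for the backward pass only:
# the forward pass keeps full-precision ReLUs, but only the boolean A_q is
# stored in memory during training. Noise distribution is an assumption.
rng = np.random.default_rng(0)

def quantize_for_backward(a):
    """Return a 1-bit (boolean) version of activations `a` for storage during BP."""
    return a + rng.random(a.shape) > 1.0

x = rng.standard_normal((4, 32))
a = np.maximum(x, 0.0)            # full-precision ReLU used for the forward pass
a_q = quantize_for_backward(a)    # only this 1-bit tensor is kept for the backward pass

# In the backward pass the stored a_q stands in for a, e.g. for a linear layer
# y = a @ W.T the weight gradient becomes an approximation:
grad_out = rng.standard_normal((4, 10))
grad_W_approx = grad_out.T @ a_q.astype(np.float32) / len(x)
```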

Our approach for compressed learning
- Benefits, e.g. for a 20-layer ResNet on ImageNet:
  - 32-bit precision: BP+SGD needs 1.8 GB
  - 1-bit precision: 1.8 GB -> ~56 MB (a 32x reduction)

Our Results

Our Results: Model compression for CIFAR (single-bit weights following training)

Method | Depth | Width | #params | CIFAR-10 | CIFAR-100
32-bit Wide ResNet | 28 | 10 | 36.5M | 4.00% | 19.25%
BinaryConnect (VGG net) [1] | 9 | 8 | 10.3M | 8.27% | N/A
Weight binarization (VGG net) [2] | 8 | 8 | 11.7M | 8.25% | N/A
BWN (VGG net) [3] | 8 | 8 | 11.7M | 9.88% | N/A
Our Wide ResNet | 20 | 4 | 4.3M | 6.34% | 23.79%
Our Wide ResNet | 20 | 10 | 26.8M | 4.48% | 22.28%

We used only 63 epochs for width = 4 and 127 for width = 10.

[1] Courbariaux et al., BinaryConnect: Training deep neural networks with binary weights during propagations, arXiv:1511.00363, 2015.
[2] Hubara et al., Quantized neural networks: Training neural networks with low precision weights and activations, arXiv:1609.07061, 2016.
[3] Rastegari et al., XNOR-Net: ImageNet classification using binary convolutional neural networks, arXiv:1603.05279, 2016.

Our Results: Model compression for ImageNet (single-bit weights following training)

Method | Depth | Width | #params | Top-1 | Top-5
32-bit ResNet | 20 | 1 | 11.5M | 30.70% | 10.80%
BNN (GoogLeNet) [1] | 13 | - | - | 52.9% | 30.90%
BWN (ResNet) [2] | 20 | 1 | 11.5M | 39.2% | 17.0%
Our ResNet | 20 | 1 | 11.5M | 44.48% | 20.9%

We need to train for longer.

[1] Hubara et al., Quantized neural networks: Training neural networks with low precision weights and activations, arXiv:1609.07061, 2016.
[2] Rastegari et al., XNOR-Net: ImageNet classification using binary convolutional neural networks, arXiv:1603.05279, 2016.

Our Results: Compressed learning for CIFAR

Method | Depth | Width | #params | CIFAR-10 | CIFAR-100
32-bit Wide ResNet | 28 | 10 | 36.5M | 4.00% | 19.25%
BNN (GoogLeNet) [1] | 9 | 8 | 10.3M | 10.15% | N/A
XNOR-Net (ResNet) [2] | 8 | 8 | 11.7M | 10.17% | N/A
Our Wide ResNet | 20 | 4 | 4.3M | 6.86% | 25.93%
Our Wide ResNet | 20 | 10 | 26.8M | 5.43% | 23.01%
Our Wide ResNet + model compression | 20 | 10 | 26.8M | 5.55% | 23.7%

[1] Hubara et al., Quantized neural networks: Training neural networks with low precision weights and activations, arXiv:1609.07061, 2016.
[2] Rastegari et al., XNOR-Net: ImageNet classification using binary convolutional neural networks, arXiv:1603.05279, 2016.

Summary

Model compression
- We achieved SOTA error rates on CIFAR-10 when using 1-bit weights at test time
- Same error rates as for full precision!
- Achieved using far fewer training epochs

Compressed learning
- 32x reduced memory during BP+SGD
- Error rates increased by only ~1% (absolute)
- Drawback: cannot use the XNOR approach
- Advantage: better and faster learning

Next steps
- More training on ImageNet
- Faster BP+SGD using improved methods of feedback alignment
- Theory for why our approach works
- Add low bit-width gradients and updates
- Ultimately: low-power hardware BP+SGD
- Applications: not just supervised classifiers!

Thanks for your attention!
mark.mcdonnell@unisa.edu.au
cls-lab.org
Mark D. McDonnell 1, Ruchun Wang 2 and André van Schaik 2