Computer Arithmetic in Deep Learning Bryan Catanzaro

What do we want AI to do? Guide us to content. Keep us organized. Help us find things. Help us communicate. Drive us to work. Serve drinks?

OCR-based Translation App (Baidu IDL)

Medical Diagnostics App (Baidu BDL). AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.

Image Captioning (Baidu IDL). Example captions: "A yellow bus driving down a road with green trees and green grass in the background." "Living room with white couch and blue carpeting. Room in apartment gets some afternoon sun."

Image Q&A Baidu IDL Sample questions and answers

Natural User Interfaces. Goal: make interacting with computers as natural as interacting with humans. AI problems: speech recognition, emotion recognition, semantic understanding, dialog systems, speech synthesis.

Demo Deep Speech public API

Computer vision: Find coffee mug [Andrew Ng]

Why is computer vision hard? The camera sees: (the image as a matrix of raw pixel values). [Andrew Ng]

Artificial Neural Networks. (Figure: neurons in the brain; a deep learning neural network mapping input to output.) [Andrew Ng]

Supervised learning (learning from tagged data): X = input image, Y = output tag, Yes/No (is it a coffee mug?). Learning X → Y mappings is hugely useful. [Andrew Ng]

Machine learning in practice: progress is bound by the latency of hypothesis testing. The loop: idea → think really hard → hack it up in Matlab → code → run on a workstation → test.

Deep Neural Net: a very simple universal approximator! One layer computes $y_j = f\left(\sum_i w_{ij} x_i\right)$ with the nonlinearity $f(x) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}$. Stacking layers gives a deep neural net.
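
A minimal NumPy sketch of that one-layer formula, stacked into a small deep net (random weights, purely illustrative):

```python
import numpy as np

def relu(x):
    # the slide's nonlinearity: f(x) = 0 for x < 0, x for x >= 0
    return np.maximum(0.0, x)

def layer(W, x):
    # one layer: y_j = f(sum_i w_ij * x_i)
    return relu(W @ x)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
for W in (rng.standard_normal((64, 64)) * 0.1 for _ in range(3)):
    x = layer(W, x)   # stacking layers makes it a deep neural net
```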

Why Deep Learning? 1. Scale matters: bigger models usually win. 2. Data matters: more data means less cleverness necessary. (Chart: accuracy vs. data & compute; deep learning keeps improving where many previous methods plateau.) 3. Productivity matters: teams with better tools can try out more ideas.

Training Deep Neural Networks. Each layer computes $y_j = f\left(\sum_i w_{ij} x_i\right)$, so computation is dominated by dot products. Multiple inputs, multiple outputs, and batching turn this into GEMM: compute bound. Convolutional layers are even more compute bound.
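
A sketch of how batching turns the forward pass into GEMM (NumPy; shapes are illustrative):

```python
import numpy as np

batch, n_in, n_out = 128, 1024, 1024
X = np.random.randn(batch, n_in).astype(np.float32)  # a minibatch of inputs
W = np.random.randn(n_in, n_out).astype(np.float32)  # one layer's weights

# One example at a time would be a GEMV per input (memory bound);
# the whole minibatch at once is a single GEMM (compute bound).
Y = np.maximum(0.0, X @ W)
```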

Computational Characteristics. High arithmetic intensity: arithmetic operations per byte of data is O(exaflops) / O(terabytes) ≈ 10^6, so training is math limited and arithmetic matters. Medium size datasets generally fit on one node. Training one model: ~20 exaflops.
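
A back-of-the-envelope version of that ratio (orders of magnitude from the slide, not measured numbers):

```python
arithmetic_ops = 1e18   # O(exaflops) of arithmetic
data_bytes = 1e12       # O(terabytes) of training data
print(arithmetic_ops / data_bytes)   # 1e6 operations per byte: math limited
```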

Speech Recognition: Traditional ASR. Getting higher performance is hard: improve each stage by expert engineering. (Chart: accuracy vs. data + model size; traditional ASR plateaus.)

Speech Recognition: Traditional ASR. Huge investment in features for speech! Decades of work to get very small improvements: spectrogram, MFCC, flux, ...

Speech Recognition 2: Deep Learning! Since 2011, deep learning for features: features → acoustic model → HMM → language model → transcription ("The quick brown fox jumps over the lazy dog.").

Speech Recognition 2: Deep Learning! With more data, DL acoustic models perform better than traditional models. (Chart: accuracy vs. data + model size; DL V1 for speech overtakes traditional ASR.)

Speech Recognition 3: Deep Speech. End-to-end learning: audio → transcription ("The quick brown fox jumps over the lazy dog.").

Speech Recognition 3: Deep Speech. We believe end-to-end DL works better when we have big models and lots of data. (Chart: accuracy vs. data + model size; Deep Speech above DL V1 for speech, above traditional ASR.)

End-to-end speech with DL. A deep neural network predicts characters directly from audio: T H _ E ... D O G ...

Recurrent Network. RNNs model temporal dependence; various flavors (LSTM, GRU, bidirectional, ...) are used in many applications, especially on sequential data (time series, text, etc.). Sequential dependence complicates parallelism; feedback complicates arithmetic.
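
A vanilla RNN step as a sketch (Deep Speech uses larger recurrent layers, but the sequential dependence is the same):

```python
import numpy as np

def rnn_step(h, x, W_h, W_x, b):
    # h_t depends on h_{t-1}: this feedback serializes the time steps
    return np.tanh(W_h @ h + W_x @ x + b)

hidden, n_in = 256, 128
rng = np.random.default_rng(0)
W_h = rng.standard_normal((hidden, hidden)) * 0.01
W_x = rng.standard_normal((hidden, n_in)) * 0.01
b = np.zeros(hidden)

h = np.zeros(hidden)
for x in rng.standard_normal((50, n_in)):   # 50 time steps of input
    h = rnn_step(h, x, W_h, W_x, b)
```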

Connectionist Temporal Classification (a cost function for end-to-end learning). We compute this in log space, because the probabilities are tiny.
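
Why log space: per-path probabilities underflow ordinary floating point, so sums of probabilities become log-sum-exp. A sketch:

```python
import numpy as np

def logsumexp(log_probs):
    # log(sum_i exp(log_probs[i])), computed stably
    m = np.max(log_probs)
    return m + np.log(np.sum(np.exp(log_probs - m)))

# 1000 path log-probabilities around exp(-800):
log_p = -800.0 + np.random.randn(1000) * 0.1
print(np.exp(log_p).sum())   # 0.0: the direct sum underflows even in fp64
print(logsumexp(log_p))      # a finite log-probability, around -793
```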

Training sets. Train on 45k hours (~5 years) of data, still growing. Languages: English, Mandarin. End-to-end deep learning is key to assembling large datasets.

Performance for RNN training. (Chart: TFLOP/s, log scale from 1 to 512, vs. number of GPUs from 1 to 128, for one node and multi node; a typical training run is marked.) 55% of GPU FMA peak using a single GPU; ~48% of peak using 8 GPUs in one node. This scalability is key to large models & large datasets.

Computer Arithmetic for training. Standard practice: FP32. But there are big efficiency gains from smaller arithmetic: e.g. NVIDIA GP100 has 21 TFLOPS of 16-bit FP but 10.5 TFLOPS of 32-bit FP. Expect a continued push to lower precision. Some people report success with very low precision training, down to 1 bit, but results are quite dependent on the problem/dataset.

Training: Stochastic Gradient Descent. $w' = w - \eta \sum_i \nabla_w Q(x_i, w)$. A simple algorithm; add momentum to power through local minima. Compute the gradient by backpropagation. Operates on minibatches, which makes it a GEMM problem instead of GEMV. Choose minibatches stochastically: important to avoid memorizing the training order. Difficult to parallelize: it prefers lots of small steps, and increasing the minibatch size is not always helpful.
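
A minimal momentum-SGD update in NumPy (a sketch on a toy loss, not the production implementation):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=1e-4, momentum=0.9):
    # v accumulates a running direction; momentum helps power through
    # local minima and noise in the minibatch gradient
    v = momentum * v - lr * grad
    return w + v, v

# toy quadratic loss Q(w) = ||w||^2 / 2, so grad = w
w = np.ones(4)
v = np.zeros_like(w)
for _ in range(1000):
    w, v = sgd_momentum_step(w, v, grad=w)
```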

Training: Learning rate. $w' = w - \eta \sum_i \nabla_w Q(x_i, w)$, where $\eta$ is very small (1e-4). We learn by making many very small updates to the parameters. The terms in this equation are often very lopsided: a computer arithmetic problem.
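
To see why lopsided terms are an arithmetic problem, here is what FP16 does to an update of that size (NumPy sketch):

```python
import numpy as np

w = np.float16(1.0)
update = np.float16(1e-4)
# FP16 values near 1.0 are spaced 2**-10, about 1e-3, so an update of
# 1e-4 is below half a spacing and rounds away entirely:
print(w + update)                           # 1.0 -- the update is lost
print(np.float32(1.0) + np.float32(1e-4))   # 1.0001 survives in FP32
```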

Cartoon optimization problem: $Q = (w-3)^2 + 3$, so $\frac{\partial Q}{\partial w} = 2(w-3)$, with learning rate $\eta = .01$. [Erich Elsen]

Cartoon Optimization Problem. (Plot: $Q$ and $\partial Q / \partial w$ as functions of $w$.) [Erich Elsen]

Rounding is not our friend: as $w$ nears the optimum, the update $\eta \, \partial Q / \partial w$ falls below the resolution of FP16. [Erich Elsen]
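
A sketch of the cartoon problem run entirely in FP16: gradient descent stalls once the update drops below the FP16 spacing around $w$ (illustrative, not from the talk):

```python
import numpy as np

w = np.float16(0.0)
lr = np.float16(0.01)
for _ in range(10000):
    grad = np.float16(2.0) * (w - np.float16(3.0))   # dQ/dw = 2(w - 3)
    w_new = w - lr * grad
    if w_new == w:      # update rounded away: optimization stalls
        break
    w = w_new
print(w)   # stalls short of the optimum w = 3
```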

Solution 1: Stochastic Rounding [S. Gupta et al., 2015]. Round up or down with probability related to the distance to the neighboring grid points. With $x = 100$, $y = 0.01$, grid spacing $\epsilon = 1$: $x + y = \begin{cases} 100 & \text{w.p. } 0.99 \\ 101 & \text{w.p. } 0.01 \end{cases}$. Efficient to implement: just need a bunch of random numbers and an FMA instruction with round-to-nearest-even. [Erich Elsen]
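
A sketch of stochastic rounding on an integer grid (the paper rounds to FP16 grid points; an integer grid keeps the example simple):

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, eps=1.0):
    # round x to a multiple of eps: up with probability equal to the
    # fractional distance to the upper grid point, down otherwise
    lower = np.floor(x / eps) * eps
    frac = (x - lower) / eps
    return lower + eps * (rng.random() < frac)

# adding 0.01 to 100, 100 times: round-to-nearest-even stays at 100
# forever, stochastic rounding reaches 101 in expectation
w = 100.0
for _ in range(100):
    w = stochastic_round(w + 0.01)
print(w)
```

Because each rounding is unbiased, the expected value after the loop is exactly 101, matching the next slide.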

Stochastic Rounding. After adding .01 to 100, 100 times: with round-to-nearest-even we will still have 100; with stochastic rounding we expect to have 101. This allows us to make optimization progress even when the updates are small. [Erich Elsen]

Solution 2: High precision accumulation. Keep two copies of the weights: one in high precision (FP32), one in low precision (FP16). Accumulate updates into the high precision copy; round the high precision copy to low precision and perform computations with it. [Erich Elsen]

High precision accumulation. After adding .01 to 100, 100 times, we will have exactly 101 in the high precision weights, which rounds to 101 in the low precision weights. This allows accurate accumulation while keeping the benefits of FP16 computation. It requires more weight storage, but weights are usually a small part of the memory footprint. [Erich Elsen]
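
A sketch of the two-copy scheme (FP32 master weights plus an FP16 compute copy; the variable names are mine, not from the talk):

```python
import numpy as np

lr = 1.0
master_w = np.full(4, 100.0, dtype=np.float32)     # FP32 master weights

for _ in range(100):
    w16 = master_w.astype(np.float16)              # FP16 copy; would feed the
                                                   # forward/backward passes
    grad = np.float16(-0.01)                       # stand-in backprop gradient
    # accumulate in FP32, where a 0.01 update is not rounded away
    master_w -= lr * grad.astype(np.float32)

print(master_w)                      # ~101 in FP32
print(master_w.astype(np.float16))   # rounds to 101 in FP16
```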

Deep Speech Training Results: FP16 storage, FP32 math. (Training curve figure.) [Erich Elsen]

Deployment. Once a model is trained, we need to deploy it. Technically a different problem: no more SGD, just forward propagation. Arithmetic can be even smaller for deployment. We currently use FP16; 8-bit fixed point can work with small accuracy loss, but you need to choose scale factors for each layer. Higher precision accumulation is very helpful. Although all of this is ad hoc.
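
A sketch of per-layer 8-bit quantization with a scale factor and higher precision accumulation (a common recipe; the talk does not specify the exact scheme used):

```python
import numpy as np

def quantize(x, num_bits=8):
    # choose a per-tensor scale so the max magnitude maps to the int8 range
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)
x = rng.standard_normal(128).astype(np.float32)

Wq, w_scale = quantize(W)
xq, x_scale = quantize(x)

# accumulate the int8 products in int32 (higher precision accumulation),
# then rescale back to float
acc = Wq.astype(np.int32) @ xq.astype(np.int32)
y = acc.astype(np.float32) * (w_scale * x_scale)
```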

Magnitude distributions. (Histogram: frequency vs. log_2(magnitude) for the parameters, input, and output of Dense Layer 1.) Peaked power law distributions. [M. Shoeybi]

Determinism. Determinism is very important: with so much randomness, it's hard to tell if you have a bug, and networks train despite bugs, although accuracy is impaired. Reproducibility is important for the usual scientific reasons: progress is not possible without it. We use synchronous SGD.

Conclusion. Deep Learning is solving many hard problems, and there are many interesting computer arithmetic issues in it. The DL community could use your help understanding them: pick the right format, mix formats, build better arithmetic hardware.

Thanks: Andrew Ng, Adam Coates, Awni Hannun, Patrick LeGresley, Erich Elsen, Greg Diamos, Chris Fougner, Mohammed Shoeybi, and all of SVAIL.