Phase-Change-Memory Devices for Non-von Neumann Computing in the AI Era
Evangelos Eleftheriou, IBM Fellow, IBM Research - Zurich
Application Trends
[Chart: computational complexity, from O(N) to O(N^3), versus data volume, from MB to PB, for application classes ranging from classical HPC and uncertainty quantification to database queries, dimensionality reduction, deep learning, graph analytics, knowledge-graph creation, and information retrieval (Hadoop/HPC regimes).]
Performance and Power Efficiency Trends
[Chart: performance (petaflops/s) and power efficiency (gigaflops/W), 2002-2016.]
Increasing gap between performance and power efficiency
Diminishing performance/power-efficiency gains from technology scaling
AI Computational Requirements: Very Challenging
~10^18 FLOPS for image classification
~10^7 hours for speech training in several languages
Classical Scaling Alone: Not the Solution
Key focus:
Decrease the power density and power consumption
Overcome the CPU/memory bottleneck of conventional computing architectures
Design new AI algorithms, accelerators, interconnects, and software technologies
Merolla et al., Science, 2014
Improve von Neumann Computing
Storage-class memory
Near-memory computing
Monolithic 3D integration of memory and CMOS processing units
Minimize the time and distance to memory access
Burr et al., IBM J. Res. Dev., 2008; Vermij et al., Proc. ACM CF, 2016; Wong, Salahuddin, Nature Nanotechnology, 2015
Go Beyond von Neumann Computing
Neuromorphic computing
Computational memory
LeCun, Bengio, Hinton, Nature, 2015; Merolla et al., Science, 2014; Indiveri, Liu, Proc. IEEE, 2015; Borghetti et al., Nature, 2010; Di Ventra and Pershin, Scientific American, 2015; Hosseini et al., Electron Dev. Lett., 2015; Sebastian et al., Nature Communications, 2017
Neural Hardware: Digital or Analog?
Fully digital: Google TPU, IBM SyNAPSE, Manchester Univ.
Analog/hybrid: Stanford, UZH/ETH, Heidelberg Univ.
[Chart: area/power for SRAM- vs. RRAM-based neurons and synapses in a 1-million-neuron network. Rajendran et al., IEEE Trans. Electron Dev., 2013]
Large improvements in power, area, and learning performance for memristive neural hardware
Truly non-von Neumann computations
Potential of a memristive neuron/synapse
Resistive Memory Devices
From charge-based to resistance-based memory/storage:
Spin-transfer torque magnetic random access memory (STT-MRAM)
Metal-oxide resistive random access memory (ReRAM)
Conductive-bridge random access memory (CBRAM)
Phase-change memory (PCM)
Significant impact on the memory/storage hierarchy
Monolithic integration of memories and computation units
Sufficient richness of dynamics for non-von Neumann computing
Phase-Change Memory (PCM)
Amorphize to the high-resistance state; crystallize to the low-resistance state
Use two distinct solid phases of a Ge-Sb-Te alloy to store a bit
Use intermediate phases to obtain a continuum of different states or resistance levels
Transition between phases by controlled heating and cooling
Phase-Change Devices in Spiking Neural Networks
[Diagram: PCM synapse receiving an input and generating a postsynaptic potential that drives a PCM neuron.]
Ovshinsky, E\PCOS, 2004; Wright, Adv. Mater., 2011; Kuzum et al., Nano Lett., 2012; Jackson et al., ACM JETCS, 2013; Tuma et al., Nature Nanotechnology, 2016; Pantazi et al., Nanotechnology, 2016; Tuma et al., IEEE Electron Dev. Lett., 2016
All-PCM architecture: areal/energy efficiency
Can we exploit some unique physical attributes?
Phase-Change Neurons
Stochastic phase-change neurons: T. Tuma, A. Pantazi, M. Le Gallo, A. Sebastian & E. Eleftheriou, Nature Nanotechnology, Aug. 2016
The internal state of the neuron is stored in the phase configuration of a PCM device
Neuronal dynamics are emulated using the physics of crystallization
The neurons exhibit inherent stochasticity, which is key for neuronal population coding
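The following is a minimal sketch of the idea behind a stochastic phase-change neuron, not the device model of Tuma et al.: the neuron's "membrane potential" is represented by a conductance that grows with each input through partial crystallization, a spike is emitted when a threshold is crossed, and the device is then reset (amorphized). All parameter values are illustrative assumptions.

```python
# Sketch (illustrative parameters, not measured device values): a stochastic
# integrate-and-fire neuron whose internal state is a PCM-like conductance.
import numpy as np

rng = np.random.default_rng(0)

class PhaseChangeNeuron:
    def __init__(self, g_reset=0.1, g_threshold=1.0, dg=0.05, sigma=0.02):
        self.g = g_reset              # conductance = internal neuron state
        self.g_reset = g_reset
        self.g_threshold = g_threshold
        self.dg = dg                  # mean crystallization step per input
        self.sigma = sigma            # cycle-to-cycle variability (stochasticity)

    def step(self, weighted_input):
        # Crystallization grows the conductance; the random term models the
        # inherent stochasticity of the phase transition.
        self.g += weighted_input * (self.dg + self.sigma * rng.standard_normal())
        if self.g >= self.g_threshold:
            self.g = self.g_reset     # "amorphize": reset the internal state
            return 1                  # emit a spike
        return 0

# Drive the neuron with a constant input: the inter-spike intervals vary from
# cycle to cycle because each crystallization step is stochastic.
neuron = PhaseChangeNeuron()
spike_times = [t for t in range(200) if neuron.step(0.5)]
print(spike_times)
```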
Neuronal Population Coding
How does the brain store and represent complex stimuli given the slowness, unreliability, and uncertainty of individual neurons?
High-speed, information-rich stimuli (motion, vision, sound) are encoded by slow (~10 Hz), stochastic, unreliable neurons.
"As in any good democracy, individual neurons count for little; it is population activity that matters. For example, as with control of eye and arm movements, visual discrimination is much more accurate than would be predicted from the responses of single neurons." Averbeck et al., Nature Reviews, 2006
[Figure: spiking activity of a population of phase-change neurons. T. Tuma et al., Nature Nanotechnology, 2016]
Application of an SNN: Temporal Correlation Detection
Algorithmic goals:
Determine whether some data streams are statistically correlated
Observe variations in the activity of the correlated inputs
React quickly to the occurrence of correlated inputs
Continuously and dynamically re-evaluate the learned statistics
Use only unsupervised learning and consume very low power
Application domains: finance, science, medicine, big data, and more
Learning Patterns with a Spiking Neural Network
[Figure: input pattern, synaptic weights, and output spikes of neuron #1 and neuron #2. A. Pantazi et al., Nanotechnology, 2016]
Purely unsupervised neuromorphic computation: no counting, no transfers between memory and CPU!
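As a software analogue of this slide (not the PCM hardware or the exact learning rule of Pantazi et al.), the sketch below shows one integrate-and-fire neuron with a simple STDP-like update that learns, without labels, which inputs fire together in a recurring pattern. All sizes, rates, and learning parameters are illustrative assumptions.

```python
# Illustrative unsupervised pattern learning with a single spiking neuron.
import numpy as np

rng = np.random.default_rng(1)
n_inputs, T = 20, 2000
pattern = np.arange(8)                      # inputs 0..7 fire together (the "pattern")

w = rng.uniform(0.2, 0.4, n_inputs)         # synaptic weights (would live in PCM)
v, v_thresh, leak = 0.0, 2.0, 0.5
lr = 0.02

for t in range(T):
    x = (rng.random(n_inputs) < 0.05).astype(float)   # background noise spikes
    if t % 50 == 0:
        x[pattern] = 1.0                              # pattern presented periodically
    v = leak * v + w @ x                              # leaky integration of inputs
    if v >= v_thresh:                                 # postsynaptic spike
        v = 0.0
        # STDP-like update: potentiate synapses that just saw an input spike,
        # depress the others; weights are clipped to [0, 1].
        w = np.clip(w + lr * np.where(x > 0, 1.0, -0.5), 0.0, 1.0)

print(np.round(w, 2))   # the pattern inputs end up with markedly larger weights
```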
Computational Memory
[Diagram: processing unit with conventional memory vs. processing unit with computational memory.]
Borghetti et al., Nature, 2010; Di Ventra and Pershin, Scientific American, 2015; Hosseini et al., Electron Dev. Lett., 2015; Sebastian et al., Nature Communications, 2017
The concept: perform certain computational tasks in the memory itself, without the need to transfer data back and forth in the process
PCM to Perform Analog Matrix-Vector Multiplications
Map the matrix elements to conductance values, map the input vector to read voltages, and decipher the result from the measured currents (Burr et al., Adv. Phys.: X, 2017)
Matrix multiplication exploits the multi-level storage capability of PCM together with Kirchhoff's current law and Ohm's law
A crossbar array performs fast matrix-vector multiplication in O(1), without data movement
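A minimal simulation of this mapping is sketched below, under assumed (not measured) conductance ranges and noise levels: matrix entries become differential conductance pairs, the input vector is applied as read voltages, and the summed column currents give the product. The relative error it prints also illustrates the point made on the next slide about limited precision.

```python
# Sketch of analog matrix-vector multiplication on a crossbar (illustrative).
import numpy as np

rng = np.random.default_rng(2)

def crossbar_matvec(A, x, g_max=1e-4, noise=0.03):
    # Map the matrix to differential conductance pairs so that negative
    # entries can be represented: A_ij -> (G+_ij, G-_ij).
    scale = g_max / np.max(np.abs(A))
    g_pos = np.clip(A, 0, None) * scale
    g_neg = np.clip(-A, 0, None) * scale
    # Device variability modeled as multiplicative conductance noise (assumed).
    g_pos *= 1 + noise * rng.standard_normal(A.shape)
    g_neg *= 1 + noise * rng.standard_normal(A.shape)
    v = x                        # input vector applied as read voltages
    i = (g_pos - g_neg) @ v      # summed column currents (Kirchhoff + Ohm)
    return i / scale             # decipher: rescale currents back to A @ x units

A = rng.standard_normal((64, 64))
x = rng.standard_normal(64)
print(np.linalg.norm(crossbar_matvec(A, x) - A @ x) / np.linalg.norm(A @ x))
```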
How Precise Is the Multiplication?
Experimental results of in-memory matrix-vector multiplication
But: owing to device variability, stochasticity, etc., the matrix-vector multiplication is not highly precise
Application 1: Optimization Solvers
Compressed sensing: reconstruction of a high-dimensional signal from a small number of compressed measurements (Le Gallo et al., Proc. IEDM, 2017)
Used in applications such as MRI, facial recognition, holography, audio restoration, and mobile-phone camera sensors
[Figure: input signal, compressed measurements, and reconstructed signal.]
Compressed Sensing/Recovery Using Computational Memory
Complexity reduction from O(N^2) to O(N); potential 10^6 speed-up on a 1000x1000-pixel image
Le Gallo et al., Proc. IEDM, 2017
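The sketch below illustrates the principle with iterative hard thresholding, where the two matrix-vector products per iteration go through a noisy analog-multiply model standing in for the PCM crossbar. Signal sizes, sparsity, step size, and noise level are illustrative assumptions, not the setup of the IEDM 2017 experiment.

```python
# Compressed-sensing recovery with approximate (analog) matrix-vector products.
import numpy as np

rng = np.random.default_rng(3)

def analog_matvec(M, v, noise=0.02):
    # Stand-in for the crossbar: exact product plus multiplicative error.
    return (M @ v) * (1 + noise * rng.standard_normal(M.shape[0]))

n, m, k = 256, 128, 10                            # signal length, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
Phi = rng.standard_normal((m, n)) / np.sqrt(m)    # measurement matrix (held in PCM)
y = Phi @ x_true                                  # compressed measurements

x = np.zeros(n)
for _ in range(100):
    r = y - analog_matvec(Phi, x)                 # residual (crossbar product)
    x = x + 0.5 * analog_matvec(Phi.T, r)         # gradient step (crossbar product)
    x[np.argsort(np.abs(x))[:-k]] = 0.0           # keep only the k largest entries

print("NMSE:", np.sum((x - x_true) ** 2) / np.sum(x_true ** 2))
```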
Image Reconstruction with Computational Memory
[Plot: reconstruction error (NMSE, 10^0 down to 10^-3) versus iteration for the PCM chip, 4x4-bit fixed-point, and floating-point implementations.]
Experimental result: 128x128 image, 50% sampling rate, computational memory unit with 131,072 PCM devices (Le Gallo et al., Proc. IEDM, 2017)
Estimated power reduction of 50x compared with an optimized 4-bit FPGA matrix-vector multiplier that delivers the same reconstruction accuracy at the same speed
Can We Compute with the Dynamics of PCM?
A PCM device acts as a nanoscale non-volatile integrator (Sebastian et al., Nature Communications, 2017)
Can we exploit the crystallization dynamics for computational memory?
Application 2: Correlation Detection
Goal: detect temporal correlations between event-based data streams.
Each process is assigned to a single PCM device. Whenever the process takes the value 1, a SET pulse is applied to its device, with the amplitude or width of the pulse proportional to the instantaneous sum of all processes. The conductances of the devices then reveal the correlated groups.
Sebastian et al., Nature Communications, 2017
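Below is a software analogue of this scheme, with the device physics replaced by a simple additive conductance model: each binary process owns one "device", and whenever the process is 1 its conductance is incremented by an amount proportional to the instantaneous sum of all processes. The number of processes, firing rate, and correlation strength are illustrative assumptions.

```python
# Correlation detection via accumulative conductance updates (illustrative model).
import numpy as np

rng = np.random.default_rng(4)
n_proc, n_corr, T, c = 1000, 100, 5000, 0.1   # illustrative sizes; c = correlation

# Generate sparse binary processes; the first n_corr share a common driver.
p = 0.01
common = rng.random(T) < p
x = (rng.random((T, n_proc)) < p).astype(np.int8)
x[:, :n_corr] |= (common[:, None] & (rng.random((T, n_corr)) < c)).astype(np.int8)

g = np.zeros(n_proc)                          # device conductances (the result)
for t in range(T):
    s = x[t].sum()                            # instantaneous sum of all processes
    g += x[t] * s                             # SET pulse proportional to s, on active devices

# Correlated devices tend to be active exactly when the sum is large, so their
# final conductance separates from that of the uncorrelated ones.
print(g[:n_corr].mean(), g[n_corr:].mean())
```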
Experimental Results (1 Million PCM Devices)
[Figure: input processes and resulting device conductances.]
Very weak correlation of c = 0.01 detected
No shuttling of data back and forth
Massively parallel
Comparative Study
Reference implementation on an IBM POWER8+ architecture (Sebastian et al., Nature Communications, 2017)
200x improvement in computation time!
Peak dynamic power on the order of watts, compared with hundreds of watts
What if Arbitrarily High Precision Is Needed?
Mixed-precision computing to the rescue!
Bulk of the computation in low-precision computational memory
Refinement in a high-precision digital processing engine
Application 3: Linear Equation Solver
Digital processor (high precision) + computational memory (low precision)
The solution is iteratively updated with a low-precision error-correction term
The error-correction term is obtained using an inexact inner solver
The matrix-vector multiplications in the inner solver are performed using a PCM array
Le Gallo et al., Mixed-Precision In-Memory Computing, arXiv, 2017
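The sketch below captures the mixed-precision structure under assumed parameters (noise level, inner-solver choice, iteration counts): the solution of A x = b is refined in high precision, while each error-correction term comes from an inexact inner solver whose matrix-vector products use a noisy multiply standing in for the PCM array. The inner solver here is plain Richardson iteration, chosen for brevity rather than taken from the paper.

```python
# Mixed-precision iterative refinement for a linear system (illustrative).
import numpy as np

rng = np.random.default_rng(5)

def lowprec_matvec(A, v, noise=0.03):
    # Stand-in for the PCM array: product with multiplicative error.
    return (A @ v) * (1 + noise * rng.standard_normal(A.shape[0]))

def inexact_inner_solve(A, r, iters=20):
    # A few Richardson iterations with inexact products yield an approximate
    # correction z ~ A^{-1} r; exactness is not required.
    z = np.zeros_like(r)
    omega = 1.0 / np.linalg.norm(A, 2)
    for _ in range(iters):
        z = z + omega * (r - lowprec_matvec(A, z))
    return z

n = 50
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)               # well-conditioned SPD test matrix
b = rng.standard_normal(n)

x = np.zeros(n)
for k in range(15):
    r = b - A @ x                         # residual computed in high precision
    x = x + inexact_inner_solve(A, r)     # low-precision error-correction term
    print(k, np.linalg.norm(r))           # residual norm shrinks each outer step
```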
Linear Equation Solver: Experimental Results
Le Gallo et al., Mixed-Precision In-Memory Computing, arXiv, 2017
Mixed-precision computing provides a pathway to arbitrarily precise computation using computational memory.
System-Level Performance Analysis
POWER8 CPU as the high-precision processing unit; simulated computational memory unit
Significant improvement in the time/energy-to-solution metric
The higher the accuracy of the computational memory, the higher the gain
Application 4: Mixed-Precision Deep Learning
Synaptic weights stored in computational memory
The matrix-vector multiplications associated with forward/backward propagation are performed in place, with low precision
The desired weight updates are accumulated in high precision
Nandakumar et al., arXiv:1712.01192, 2017
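A minimal sketch of this accumulate-and-transfer idea follows (illustrative toy problem and parameters, not the network or device model of Nandakumar et al.): weights live in memory with a coarse programming granularity eps, gradient updates are accumulated in a high-precision variable chi, and only whole multiples of eps are transferred to the memory.

```python
# Mixed-precision weight-update accumulation for a toy linear layer (illustrative).
import numpy as np

rng = np.random.default_rng(6)

n_in, n_out, lr, eps = 8, 4, 0.1, 0.05           # eps = assumed device update granularity
W = rng.standard_normal((n_out, n_in)) * 0.1     # weights stored in computational memory
chi = np.zeros_like(W)                           # high-precision update accumulator

x_batch = rng.standard_normal((100, n_in))
y_batch = x_batch @ rng.standard_normal((n_in, n_out))   # random linear targets

for epoch in range(200):
    y_pred = x_batch @ W.T                       # forward pass (in-memory matvecs)
    err = y_pred - y_batch
    grad = err.T @ x_batch / len(x_batch)        # backward pass (in-memory matvecs)
    chi -= lr * grad                             # accumulate update in high precision
    n_pulses = np.trunc(chi / eps)               # whole device-granularity steps
    W += eps * n_pulses                          # program the memory in coarse steps
    chi -= eps * n_pulses                        # keep the unexpressed remainder

print("final MSE:", np.mean((x_batch @ W.T - y_batch) ** 2))
```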
Simulation Results
Nandakumar et al., Mixed-Precision Training of Deep Neural Networks Using Computational Memory, arXiv, 2017
Two PCM devices in a differential configuration represent each synapse
The device-model-based network simulation achieves 97.78% test accuracy
Additional accuracy drops from read noise (0.26%) and analog-to-digital converters (0.12%)