ML/Hardware Co-design: Overview, Preliminary Result, and Open Opportunities Ce Zhang

ML/Hardware Co-design: Overview, Preliminary Result, and Open Opportunities Ce Zhang (ce.zhang@inf.ethz.ch)

Machine Learning: Why should we care?

plus some other (equally important) reasons!

Machine Learning Needs @ ETH (Some Samples)

ONE SIMPLE REASON
If people who have 20 Science & Nature papers think machine learning is important for treating us when we are sick, we'd better help them with that.

Overview of This Lecture
1. (Whiteboard - 20 mins) Overview of Machine Learning from a System Perspective
   - How many of you know Linear Regression?
   - How many of you know Support Vector Machine?
   - How many of you know Gradient Descent?
   - How many of you know Stochastic Gradient Descent?
2. (15 mins) A sample of our previous work related to hardware & machine learning
   - NUMA
   - CPU vs. GPU
3. (10 mins) Ongoing
   - Low Precision Arithmetic

A Short Tutorial: Linear Regression
Predict: Patient Survival Time
Features: Age, Weight, Stage

Data:
Age   Weight   Stage   Days
25    105.5    2       250    (Training)
54    106.2    4       70     (Training)
64    107.2    1       ?      (Testing)

A Short Tutorial: Linear Regression
Data A (Millions/Billions of rows!) with labels b:
Age   Height   Stage   Days
25    5.5      2       250
54    6.2      4       70
64    7.2      1       ?

Model x: one weight per feature (Age, Height, Stage)
Assumption: the prediction for the r-th patient is $a_r^\top x$, where $a_r$ is the r-th row of A.
Goal of Training: minimize the Loss (sum of prediction errors).
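To make the assumption concrete, here is a minimal sketch of the prediction and the loss using the two training rows from the slide. The slide only says "sum of prediction errors"; using the squared error here is an assumption (it is the usual choice for linear regression), and the model values in `x` are arbitrary illustrative numbers, not fitted ones.

```python
# Training rows from the slide: (Age, Height, Stage) -> survival Days.
A = [[25.0, 5.5, 2.0],
     [54.0, 6.2, 4.0]]
b = [250.0, 70.0]

def predict(a_r, x):
    """Prediction for one patient: dot product of the row a_r with the model x."""
    return sum(a_ri * x_i for a_ri, x_i in zip(a_r, x))

def loss(A, b, x):
    """Sum of squared prediction errors (the square is an assumption)."""
    return sum((predict(a_r, x) - b_r) ** 2 for a_r, b_r in zip(A, b))

x = [1.0, 10.0, -5.0]       # arbitrary model, one weight per feature
print(predict(A[0], x))     # 25*1 + 5.5*10 + 2*(-5)
print(loss(A, b, x))
```

Training then means searching for the `x` that makes `loss(A, b, x)` as small as possible.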

More General than Linear Regression
Many machine learning tasks can be written as
$\min_x \sum_{r=1}^{N} f(x, a_r)$
where N can be in the Billions!

Whiteboard
1. Gradient Descent
2. Stochastic Gradient Descent
3. Analytical Comparison between SGD and GD
4. Other Smarter ways to design the gradient estimator
   1. Mini-batch
   2. Low-precision
5. A Taxonomy of System Bottlenecks (Single Worker)
6. Distributed Setting
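The whiteboard material is not part of the transcript, so the following is only a rough sketch of the first two items on a small least-squares problem. The synthetic data, step size, and step counts are assumptions chosen for illustration. The point from a system perspective is that each gradient descent step reads every row, while each stochastic step reads a single sampled row.

```python
import random

def row_gradient(a_r, b_r, x):
    """Gradient of (a_r . x - b_r)^2 with respect to x."""
    err = sum(ai * xi for ai, xi in zip(a_r, x)) - b_r
    return [2.0 * err * ai for ai in a_r]

def loss(A, b, x):
    """Sum of squared prediction errors over all rows."""
    return sum((sum(ai * xi for ai, xi in zip(a_r, x)) - b_r) ** 2
               for a_r, b_r in zip(A, b))

def gradient_descent(A, b, steps, lr):
    """Each step reads every row of A to form the exact gradient."""
    x = [0.0] * len(A[0])
    for _ in range(steps):
        g = [0.0] * len(x)
        for a_r, b_r in zip(A, b):
            g = [gi + di for gi, di in zip(g, row_gradient(a_r, b_r, x))]
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def stochastic_gradient_descent(A, b, steps, lr, rng):
    """Each step reads one uniformly sampled row.
    Scaling by len(A) makes the estimate unbiased for the full gradient."""
    x = [0.0] * len(A[0])
    n = len(A)
    for _ in range(steps):
        r = rng.randrange(n)
        g = row_gradient(A[r], b[r], x)
        x = [xi - lr * n * gi for xi, gi in zip(x, g)]
    return x

# Synthetic, noise-free data (an assumption for this sketch).
rng = random.Random(0)
x_true = [2.0, -1.0]
A = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]
b = [sum(ai * xi for ai, xi in zip(a_r, x_true)) for a_r in A]

x_gd = gradient_descent(A, b, steps=200, lr=0.005)
x_sgd = stochastic_gradient_descent(A, b, steps=2000, lr=0.005, rng=rng)
print("GD  loss:", loss(A, b, x_gd), "rows read:", 200 * len(A))
print("SGD loss:", loss(A, b, x_sgd), "rows read:", 2000)
```

In this toy setting SGD takes ten times as many steps but reads five times fewer rows in total, which is the kind of tradeoff the analytical comparison is about.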

Low Precision Arithmetic

Limited Precision
Can we do low-precision data (not gradient)? More difficult. (Can you see why?)
Whiteboard: Single Bit Gradient?
NVLINK, DGX-1: Can we do near memory computation here?!

Linear Regression: 32-bit Data => 1-bit Data
[Figure: training loss (lower is better) vs. # iterations for 32-bit, 1.5-bit Double Sampling, and 1-bit Random Rounding.]
32/1.5 = 21x Bandwidth Reduction?!
(We got this curve today @5am :)
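The slide names "Random Rounding" and "Double Sampling" without defining them. The sketch below is one plausible reading, stated as an assumption: random rounding rounds a value in [0, 1] up with probability equal to the value, and double sampling uses two independent roundings of the same data item inside the gradient. For a one-dimensional least-squares term 0.5 * (a*x - b)^2 the expectations can be enumerated exactly, which shows why low-precision data is harder than low-precision gradients: the data appears twice in the gradient.

```python
from itertools import product

def rounding_outcomes(a):
    """1-bit random rounding of a in [0, 1] as [(value, probability), ...].
    Rounds up to 1 with probability a, so the expectation is exactly a."""
    return [(1.0, a), (0.0, 1.0 - a)]

def expectation(outcomes):
    """Expected value of a finite list of (value, probability) pairs."""
    return sum(value * prob for value, prob in outcomes)

a, x, b = 0.3, 2.0, 0.5
true_grad = a * (a * x - b)   # gradient of 0.5 * (a*x - b)^2 w.r.t. x

# One rounded sample reused in both places: biased, since E[Q*Q] != a*a.
single = [(q * (q * x - b), p) for q, p in rounding_outcomes(a)]

# Two independent rounded samples: unbiased, since E[Q1*Q2] = a*a.
double = [(q1 * (q2 * x - b), p1 * p2)
          for (q1, p1), (q2, p2) in product(rounding_outcomes(a), repeat=2)]

print("true gradient:         ", true_grad)
print("single-sample estimate:", expectation(single))
print("double-sample estimate:", expectation(double))
```

The price of the second sample is extra bits per value, which may be why the curve is labeled 1.5-bit rather than 1-bit; that link is a guess, not something the slide states.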

Deep Learning: CPU vs. GPU

Whiteboard Convolution & Neural Network

Single-Machine: CPU vs. GPU
On which machine should we run? CPU or GPU?
- I have a GPU Cluster
- I have 5000 CPU cores
- I have $100K to spend on the cloud
EC2 c4.4xlarge: 8 cores @ 2.90 GHz, 0.7 TFLOPS
EC2 g2.2xlarge: 1.5K cores @ 800 MHz, 1.2 TFLOPS
Not a 10x gap? Can we close this gap?

Caffe con Troll
A prototype system to study the CPU/GPU tradeoff. Same-input-same-output as Caffe.
http://github.com/hazyresearch/caffecontroll

What we found
[Figure: bar chart of Relative Speed (axis from 0 to 1.1) for Caffe CPU and CcT CPU on c4.4xlarge ($0.68/h), Caffe GPU on g2.2xlarge ($0.47/h), and Caffe CPU and CcT CPU on c4.8xlarge ($1.37/h).]
Proportional to FLOPs!
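If speed is proportional to FLOPs, the expected gap follows directly from the instance figures quoted on the slides. The short calculation below uses only those quoted peak numbers and hourly prices; treating peak TFLOPS as a proxy for achieved throughput is an assumption.

```python
instances = {
    # name: (peak TFLOPS from the slide, price in $/hour from the chart labels)
    "c4.4xlarge (CPU)": (0.7, 0.68),
    "g2.2xlarge (GPU)": (1.2, 0.47),
}

cpu_tflops, cpu_price = instances["c4.4xlarge (CPU)"]
gpu_tflops, gpu_price = instances["g2.2xlarge (GPU)"]

# Expected CPU speed relative to the GPU if speed tracks peak FLOPS.
flops_ratio = cpu_tflops / gpu_tflops
print(f"CPU/GPU peak FLOPS ratio: {flops_ratio:.2f}")

# Cost of one hour of one peak TFLOPS on each instance.
cost_per_tflops_hour = {name: price / tflops
                        for name, (tflops, price) in instances.items()}
for name, cost in cost_per_tflops_hour.items():
    print(f"{name}: ${cost:.2f} per TFLOPS-hour")
```

On these numbers the CPU instance should reach a bit under 0.6x the GPU speed, far from a 10x gap, although the GPU instance remains cheaper per peak TFLOPS.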

Four Shallow Ideas Described in Four Pages
arXiv:1504.04343

One of the four shallow ideas
3 CPU Cores, 3 Images: Strategy 1 vs. Strategy 2
If the amount of data is too small for each core, the process might not be CPU bound.
For AlexNet over Haswell CPUs, Strategy 2 is 3-4x faster.
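The figure that distinguishes the two strategies is not in the transcript, so the sketch below does not claim to reproduce either one. It only illustrates the general batching idea under an assumption: when convolution is expressed as a matrix product, three small per-image products can be replaced by one larger product over the stacked images, giving each core more data per call without changing the result. The tiny "lowered" image matrices and the kernel are hypothetical.

```python
def matmul(X, W):
    """Plain matrix product of X (n x k) and W (k x m)."""
    return [[sum(x_ik * W[k][j] for k, x_ik in enumerate(row))
             for j in range(len(W[0]))] for row in X]

# Three hypothetical "lowered" images, each a 2 x 3 matrix of patches.
images = [
    [[1, 2, 3], [4, 5, 6]],
    [[0, 1, 0], [2, 0, 2]],
    [[3, 3, 3], [1, 0, 1]],
]
kernel = [[1, 0], [0, 1], [1, 1]]   # 3 x 2 kernel matrix shared by all images

# One small product per image.
per_image = [matmul(img, kernel) for img in images]

# One large product over the stacked batch, then split back per image.
stacked = [row for img in images for row in img]
big = matmul(stacked, kernel)
rows = len(images[0])
batched = [big[i * rows:(i + 1) * rows] for i in range(len(images))]

print(batched == per_image)
```

Pure Python shows no speed difference here; the benefit in a real system would come from the matrix-multiply library working on larger operands.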

Ongoing Work

Where to run Machine Learning?

Big Data in Small Pockets: A Vision
How fast can I train an SVM over 100GB data?
- Hypothesis 1: 2min on Jetson TX1
- PCIe Gen2 4x, 6sec = 1 Epoch
- Jetson TX1: 1 TFLOPS, 4GB RAM, 15W
- FPGA Data Compression?
- Phone Energy Efficiency?
Go Beyond Deep Learning: How fast can I train SVM+LR over 100GB?
- Hypothesis 2: Same as 1
- SVM (Bandwidth Bound)
- Batching?
- Model Compression (Hashing Trick for Training)?
- Which operations should happen on host/device?
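The slide mentions the hashing trick only by name. As a hedged illustration of the generic technique (not necessarily what the author has in mind), the sketch below hashes feature names into a fixed number of weight buckets, so the model's memory footprint stays constant no matter how many distinct features appear. The feature names, bucket count, and learning rate are hypothetical.

```python
import zlib

def hashed_index(feature_name, num_buckets):
    """Map a feature name to a bucket with a deterministic hash (CRC32)."""
    return zlib.crc32(feature_name.encode("utf-8")) % num_buckets

def hashed_predict(weights, features):
    """Linear score using one shared weight per bucket."""
    return sum(weights[hashed_index(name, len(weights))] * value
               for name, value in features.items())

def sgd_step(weights, features, label, lr):
    """One SGD step on squared error, updating only the touched buckets."""
    err = hashed_predict(weights, features) - label
    for name, value in features.items():
        weights[hashed_index(name, len(weights))] -= lr * 2.0 * err * value

weights = [0.0] * 8                       # fixed-size model, 8 buckets
example = {"age": 0.25, "stage=2": 1.0}   # hypothetical sparse features
for _ in range(100):
    sgd_step(weights, example, label=1.0, lr=0.1)

print(hashed_predict(weights, example))
print(len(weights))
```

Colliding features simply share a weight, which trades a little accuracy for a model small enough to fit in device memory.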

Precision-aware Network Protocol

How Should Limited Precision Primitives be Supported?
Low-precision Network Primitives? TCP/IP for ML?
Stack: TensorFlow, CNTK, Caffe, MXNet, SINGA -> Precision Aware Network Protocol -> Network
Need FPGA to be fast?

How to Program Machine Learning?

Three Worlds: One System?
- Machine Learning: Tensor, Tensor, Tensor; Real Algebra
- Database: Relation; Relational Algebra
- Spark: Tuples; Map Reduce
Logically, not different, all DataCube. Can we bridge the gap of physical representation?

Matrix Factorization
Partial Schema (Static Analysis to figure out the rest?):
    let matrix be float[][]
    let result be (P, Q) where P is float[][] and Q is float[][]
SQL/Spark:
    COPY matrix from matrix.tsv
    COPY result from `zero()`
Real-Algebra View:
    v_{i,j} = P_{i,-} * Q_{j,-}
Real-Algebra Query:
    loss = \sum_{i,j}(\sqrt(v_{i,j} - matrix_{i,j}))
    min loss over P and Q
LaTeX-compliant (Math should look like Math)
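For comparison with the declarative program above, here is a small imperative sketch of the same matrix factorization task. It assumes the intended loss is the squared difference between v_{i,j} and matrix_{i,j}, and it uses plain SGD with a random positive initialization instead of `zero()` (an all-zero start would leave the gradient at zero). The function names, step size, and example matrix are hypothetical.

```python
import random

def factorize(matrix, rank, steps, lr, seed=0):
    """Fit matrix[i][j] ~ dot(P[i], Q[j]) by SGD on the squared error."""
    rng = random.Random(seed)
    n, m = len(matrix), len(matrix[0])
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(m)]
    for _ in range(steps):
        i, j = rng.randrange(n), rng.randrange(m)
        v_ij = sum(p * q for p, q in zip(P[i], Q[j]))   # v_{i,j} = P_{i,-} * Q_{j,-}
        err = v_ij - matrix[i][j]
        for k in range(rank):
            p_ik, q_jk = P[i][k], Q[j][k]               # use old values for both updates
            P[i][k] -= lr * 2.0 * err * q_jk
            Q[j][k] -= lr * 2.0 * err * p_ik
    return P, Q

def loss(matrix, P, Q):
    """Sum over all entries of (v_{i,j} - matrix_{i,j})^2."""
    return sum((sum(p * q for p, q in zip(P[i], Q[j])) - matrix[i][j]) ** 2
               for i in range(len(matrix)) for j in range(len(matrix[0])))

matrix = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]   # a rank-1 example
P0, Q0 = factorize(matrix, rank=1, steps=0, lr=0.02)
P, Q = factorize(matrix, rank=1, steps=5000, lr=0.02)
print("loss before:", loss(matrix, P0, Q0))
print("loss after: ", loss(matrix, P, Q))
```

The declarative version leaves the choice of optimizer and physical layout to the system, which is exactly the gap the slide asks about.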

How to Guide Users to Use Machine Learning?

Cheat Sheet
Are you serious?
What does it mean to be a good Cheat sheet? HCI => I want a quantitative measure
If X then Y:
- How to give Premise?
- How to give Consequence?

How to Take Advantage of Machine Learning?

More Intelligent/Robust Systems
What would happen if you ask today's DB System a query that it was not programmed to answer?
- Given q, find d: min $|q \cap d| / |d|$
- Given q, find d: min $|q \cap d| / |q \cup d|$
DB System: "WTF!@#%#@ I don't understand!!"
SMT Solver to automatically figure out how to answer new queries.
Simple, but we made the production system of a leading security company 100x faster.
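The two queries differ only in the denominator, which a brute-force sketch makes visible. This is not the SMT-based approach the slide refers to; it is a naive scan over candidate sets, with the vertical bars read as set sizes (an assumption, since they are missing from the transcript) and with hypothetical example sets.

```python
def containment(q, d):
    """|q & d| / |d|: the share of d that also appears in q."""
    return len(q & d) / len(d)

def jaccard(q, d):
    """|q & d| / |q | d|: overlap relative to the union."""
    return len(q & d) / len(q | d)

def find(q, candidates, score):
    """Return the candidate d that minimizes score(q, d), as on the slide."""
    return min(candidates, key=lambda d: score(q, d))

q = {1, 2, 3}
d_a = {1, 10, 11, 12}
d_b = {1, 2, 20, 21, 22, 23, 24, 25, 26}
candidates = [d_a, d_b]

print(find(q, candidates, containment) == d_b)   # 2/9 < 1/4
print(find(q, candidates, jaccard) == d_a)       # 1/6 < 2/10
```

A system hard-wired for the first score cannot answer the second without new code, even though the two are this close; that is the gap an automatic rewriting step would need to close.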

Even Crazier
Can we beat humans in generating Assembly?
Not a new dream: Superoptimizer (Henry Massalin, 1987)
Beat Humans in Go! BUT we are at the right time to make it come true!

How to change the world with Machine Learning?

Applications
- Cyber security: FireEye & IBM Watson
- Astrophysics: ETH
- Patient Stratification: UZH Hospital & ETH
- Genomics: ETH
- Aneurysm: ZHAW
- Social Science: ETH
- Dialogue System: BMW
- More to come!
How to support 100 applications with sublinear cost (development time, machine time, management time)? Our final goal in the next K years.