Neural Networks. Robert Platt Northeastern University. Some images and slides are used from: 1. CS188 UC Berkeley

Problem we want to solve. The essence of machine learning: a pattern exists; we cannot pin it down mathematically; but we have data on it. In other words: a pattern exists, we don't know it, and we have data from which to learn it. Learning from data gives us information that can make predictions.

Problem we want to solve. Applicant information: Age: 23 years. Gender: male. Annual salary: $30,000. Years in residence: 1 year. Years in job: 1 year. Current debt: $15,000. Approve credit?

Problem we want to solve. Formalization: Input: x (customer application). Output: y (good/bad customer?). Target function: f : X → Y (ideal credit approval formula). Data: (x1, y1), (x2, y2), ..., (xn, yn) (historical records). Hypothesis: g : X → Y (formula/classifier to be used).

Problem we want to solve. The unknown target function f (ideal credit approval function) generates the training examples (x1, y1), ..., (xn, yn) (historical records of credit customers). A learning algorithm A searches a hypothesis set (set of candidate formulas) and outputs a final hypothesis g (final credit approval formula).

Applications. We will focus on some applications and ignore others, such as image segmentation, speech-to-text, and natural language processing... but deep learning has been applied in lots of ways...

Example of a deep neural network

The multi-layer perceptron. A single neuron (i.e. unit) computes a weighted summation of its inputs plus a bias and passes the result through an activation function: a = f(w · x + b), where w are the weights, b is the bias, and f is the activation function.
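To make the computation concrete, here is a minimal NumPy sketch of a single unit: it sums the weighted inputs plus a bias and applies a sigmoid activation. The input, weight, and bias values are made-up illustrative numbers, not values from the slides.

import numpy as np

def sigmoid(z):
    # logistic activation: squashes the summation into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # a single unit: weighted sum of the inputs plus a bias, then an activation
    return sigmoid(np.dot(w, x) + b)

# illustrative values only
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.05                         # bias
print(neuron(x, w, b))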

The multi-layer perceptron. Different activation functions: sigmoid, tanh, rectified linear unit (ReLU).

The multi-layer perceptron. Different activation functions: sigmoid, tanh, rectified linear unit (ReLU). The gradient of the sigmoid is σ'(z) = σ(z)(1 − σ(z)).

The multi-layer perceptron. Different activation functions: sigmoid, tanh, rectified linear unit (ReLU). ReLU is relatively new; it is efficient to evaluate and enables more layers because it attenuates the gradient less.
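As a reference sketch, here are the three activation functions and their gradients in NumPy. The comments note why ReLU attenuates the gradient less: its derivative is exactly 1 wherever the unit is active, while the sigmoid and tanh derivatives shrink toward zero away from the origin.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # at most 0.25, so gradients shrink layer after layer

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2  # at most 1, shrinks quickly away from zero

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)  # exactly 1 wherever the unit is active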

The multi-layer perceptron. A one-layer neural network has a simple interpretation: linear classification. x_1 == symmetry, x_2 == average intensity, y == class label (binary). What do w and b correspond to in this picture?

Training. Given a dataset (x^1, y^1), ..., (x^n, y^n). Define a loss function: l(x^i, y^i) = (f(x^i) − y^i)^2 / 2 (a quadratic loss).

Training. Given a dataset (x^1, y^1), ..., (x^n, y^n). Define a loss function: l(x^i, y^i) = (f(x^i) − y^i)^2 / 2 (a quadratic loss). The loss function tells us how well the network classified x^i.

Training. Given a dataset (x^1, y^1), ..., (x^n, y^n). Define a loss function: l(x^i, y^i) = (f(x^i) − y^i)^2 / 2 (a quadratic loss). The loss function tells us how well the network classified x^i. Method of training: adjust w, b so as to minimize the net loss over the dataset, i.e. adjust w, b so as to minimize Σ_i l(x^i, y^i). If the sum of losses is zero, then the network has classified the dataset perfectly.
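A minimal sketch of the net loss, assuming the quadratic per-example loss used in this lecture; f stands for whatever function the network computes on an input x.

import numpy as np

def quadratic_loss(y_pred, y_true):
    # per-example quadratic loss
    return 0.5 * (y_pred - y_true) ** 2

def net_loss(f, xs, ys):
    # sum of the per-example losses over the whole dataset;
    # a net loss of zero means every example was classified perfectly
    return sum(quadratic_loss(f(x), y) for x, y in zip(xs, ys))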

Training. Method of training: adjust w, b so as to minimize the net loss over the dataset, i.e. adjust w, b so as to minimize Σ_i l(x^i, y^i).

Training. Method of training: adjust w, b so as to minimize the net loss over the dataset, i.e. adjust w, b so as to minimize Σ_i l(x^i, y^i). How?

Time out for gradient descent. Suppose someone gives you an unknown function F(x): you want to find a minimum of F, but you do not have an analytical description of F(x). Use gradient descent! All you need is the ability to evaluate F(x) and its gradient at any point x: 1. pick x_0 at random 2. x_{k+1} = x_k − α ∇F(x_k) 3. repeat until converged.
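A minimal sketch of the gradient descent loop described above, assuming we are handed F and its gradient as black boxes. The quadratic F used at the bottom is just an illustrative stand-in for the "unknown" function.

import numpy as np

def gradient_descent(F, grad_F, x0, alpha=0.1, tol=1e-6, max_iters=10000):
    # step 1: pick a starting point (here the caller supplies it, e.g. at random)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        step = alpha * grad_F(x)            # move a small step downhill
        x = x - step
        if np.linalg.norm(step) < tol:      # converged: the steps have become tiny
            break
    return x

# illustrative stand-in: F(x) = ||x - 3||^2, whose minimum is at [3, 3]
F = lambda x: np.sum((x - 3.0) ** 2)
grad_F = lambda x: 2.0 * (x - 3.0)
print(gradient_descent(F, grad_F, x0=np.random.randn(2)))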

Training. Method of training: adjust w, b so as to minimize the net loss over the dataset, i.e. adjust w, b so as to minimize Σ_i l(x^i, y^i). Do gradient descent on the dataset: 1. repeat 2. w ← w − α ∂/∂w Σ_i l(x^i, y^i) 3. b ← b − α ∂/∂b Σ_i l(x^i, y^i) 4. until converged, where α is the step size.

Training. Method of training: adjust w, b so as to minimize the net loss over the dataset, i.e. adjust w, b so as to minimize Σ_i l(x^i, y^i). This is similar to logistic regression: logistic regression uses a cross-entropy loss, whereas we are using a quadratic loss. Do gradient descent on the dataset: 1. repeat 2. w ← w − α ∂/∂w Σ_i l(x^i, y^i) 3. b ← b − α ∂/∂b Σ_i l(x^i, y^i) 4. until converged, where α is the step size.
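Putting the pieces together, here is a sketch of training a single sigmoid unit by gradient descent on the quadratic loss. The tiny AND dataset and the step size are made up for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_unit(xs, ys, alpha=1.0, epochs=10000):
    # xs: (n, d) inputs, ys: (n,) binary labels
    n, d = xs.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        a = sigmoid(xs @ w + b)            # forward pass on all examples
        err = (a - ys) * a * (1.0 - a)     # d(quadratic loss)/d(summation) per example
        w -= alpha * xs.T @ err / n        # gradient step on the weights
        b -= alpha * err.mean()            # gradient step on the bias
    return w, b

# made-up linearly separable data: logical AND
xs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
ys = np.array([0.0, 0.0, 0.0, 1.0])
w, b = train_unit(xs, ys)
print(sigmoid(xs @ w + b))                 # predictions should approach [0, 0, 0, 1]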

Training example

Going deeper: a one-layer network. Input layer, hidden layer, output layer. Each hidden node is connected to every input.

Multi-layer evaluation works similarly. The vector of hidden layer activations is a1, a2, a3, a4. Single activation: a_j = f(Σ_i w_ji x_i + b_j).

Multi-layer evaluation works similarly. The vector of hidden layer activations is a1, a2, a3, a4. Single activation: a_j = f(Σ_i w_ji x_i + b_j). Vector of activations: a = f(W x + b), where W is the weight matrix and b is the vector of biases.

Multi-layer evaluation works similarly. The vector of hidden layer activations is a1, a2, a3, a4. Single activation: a_j = f(Σ_i w_ji x_i + b_j). Vector of activations: a = f(W x + b), where W is the weight matrix and b is the vector of biases. This is called forward propagation because the activations are propagated forward through the network...

Can create networks of arbitrary depth... Input layer, Hidden layer 1, Hidden layer 2, Hidden layer 3, Output layer. Forward propagation works the same for a network of any depth. Whereas a single output node corresponds to linear classification, adding hidden nodes makes the classification non-linear.
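A minimal sketch of forward propagation through a network of arbitrary depth: each layer applies the vectorized rule a = f(W a_prev + b) from the previous slide. The 2-3-1 layer sizes and random weights are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers, activation=sigmoid):
    # layers is a list of (W, b) pairs, one per layer;
    # the activations are propagated forward through the network
    a = x
    for W, b in layers:
        a = activation(W @ a + b)
    return a

# illustrative 2-3-1 network with random weights
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((3, 2)), np.zeros(3)),   # hidden layer
          (rng.standard_normal((1, 3)), np.zeros(1))]   # output layer
print(forward(np.array([0.5, -0.5]), layers))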

How do we train multi-layer networks? Almost the same as in the single-node case... Do gradient descent on the dataset: 1. repeat 2. take a gradient step on every weight 3. take a gradient step on every bias 4. until converged. Now we're doing gradient descent on all weights/biases in the network, not just a single layer; this is called backpropagation.

Backpropagation http://ufldl.stanford.edu/tutorial/supervised/multilayerneuralnetworks/
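The UFLDL tutorial linked above derives the update rules in full. As a hedged illustration, here is one backpropagation gradient step for a network with a single hidden layer, sigmoid activations, and the quadratic loss used in this lecture; the error is propagated from the output back toward the input, and every weight and bias gets a gradient step.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, alpha=0.5):
    # forward pass, keeping the intermediate activations
    a1 = sigmoid(W1 @ x + b1)                   # hidden layer activations
    a2 = sigmoid(W2 @ a1 + b2)                  # output activations
    # backward pass: propagate the error backward through the network
    delta2 = (a2 - y) * a2 * (1 - a2)           # d(quadratic loss)/d(output summation)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # d(quadratic loss)/d(hidden summation)
    # gradient step on every weight and bias in the network
    W2 -= alpha * np.outer(delta2, a1)
    b2 -= alpha * delta2
    W1 -= alpha * np.outer(delta1, x)
    b1 -= alpha * delta1
    return W1, b1, W2, b2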

Training in mini-batches. 1. repeat 2. randomly sample a mini-batch (a batch is typically between 32 and 128 samples) 3. take a gradient step on the mini-batch 4. until converged. Training in mini-batches helps because: you don't have to load the entire dataset into memory; training is still relatively stable; random sampling of batches helps avoid local minima.
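A sketch of that loop, assuming xs and ys are NumPy arrays and gradient_step is a placeholder for whatever update the network uses on one batch (its exact form depends on the network, so it is passed in as a function here).

import numpy as np

def train_minibatch(params, xs, ys, gradient_step, batch_size=64, epochs=10):
    # batch_size is typically between 32 and 128 samples
    n = len(xs)
    rng = np.random.default_rng()
    for _ in range(epochs):
        order = rng.permutation(n)                 # random sampling of batches
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            params = gradient_step(params, xs[idx], ys[idx])
    return params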

Convolutional layers. Deep multi-layer perceptron networks are general purpose but involve huge numbers of weights. We want: a special-purpose network for image and NLP data, with fewer parameters and fewer local minima. Answer: convolutional layers!

Convolutional layers. A convolutional hidden layer is no longer densely connected! Each hidden unit sees only a small patch of the image, determined by the filter size and the stride (both measured in pixels).

Convolutional layers. Two-dimensional example: Why do you think they call this convolution?
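A naive sketch of the two-dimensional operation a convolutional layer performs: slide a filter across the image and take a dot product at each position. (Strictly speaking, most deep learning libraries compute a cross-correlation, i.e. they do not flip the filter.) The filter values, image size, and stride below are arbitrary illustrative choices.

import numpy as np

def conv2d(image, filt, stride=1):
    # slide the filter across the image, taking a dot product at each position
    H, W = image.shape
    fh, fw = filt.shape
    out_h = (H - fh) // stride + 1
    out_w = (W - fw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + fh, j * stride:j * stride + fw]
            out[i, j] = np.sum(patch * filt)
    return out

# illustrative vertical-edge filter applied to a random 8x8 "image"
image = np.random.rand(8, 8)
filt = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
print(conv2d(image, filt, stride=1).shape)   # (6, 6)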

Convolutional layers

Example: MNIST digit classification with LeNet. MNIST dataset: images of 10,000 handwritten digits. Objective: classify each image as the corresponding digit.

Example: MNIST digit classification with LeNet. LeNet: two convolutional layers (conv, ReLU, pooling) followed by two fully connected layers (ReLU); the last layer has a logistic activation function.

Example: MNIST digit classification with LeNet Load dataset, create train/test splits

Example: MNIST digit classification with LeNet. Define the neural network structure: Input → Conv1 → Conv2 → FC1 → FC2.
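The lecture's example is written in Matlab; as a rough equivalent, here is a LeNet-style structure sketched with the Keras API. The filter counts and layer widths are illustrative choices, not the exact values from the slides; the sigmoid output matches the logistic last layer mentioned above.

import tensorflow as tf

# LeNet-style network: two conv layers (conv, relu, pooling), then two fully connected layers
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(20, 5, activation='relu', input_shape=(28, 28, 1)),  # Conv1
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(50, 5, activation='relu'),                           # Conv2
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(500, activation='relu'),                              # FC1
    tf.keras.layers.Dense(10, activation='sigmoid'),                            # FC2, logistic output
])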

Example: MNIST digit classification with LeNet. Train the network, classify the test set, and measure accuracy. Notice we test on a different set (a holdout set) than we trained on. Using the GPU makes a huge difference...
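A hedged sketch of the corresponding train-and-evaluate step, continuing the Keras model defined above; the quadratic (mean squared error) loss mirrors the loss used earlier in the lecture, the optimizer and epoch count are illustrative, and accuracy is measured on the held-out test split rather than the training data.

import tensorflow as tf

# load MNIST, scale pixels to [0, 1], add a channel dimension, one-hot encode the labels
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0
y_train_oh = tf.keras.utils.to_categorical(y_train, 10)
y_test_oh = tf.keras.utils.to_categorical(y_test, 10)

model.compile(optimizer='adam', loss='mse', metrics=['categorical_accuracy'])
model.fit(x_train, y_train_oh, batch_size=64, epochs=2)
test_loss, test_acc = model.evaluate(x_test, y_test_oh)   # accuracy on the holdout set
print(test_acc)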

Deep learning packages. You don't need to use Matlab (obviously). TensorFlow is probably the most popular platform; Caffe and Theano are also big.

Another example: image classification w/ AlexNet. ImageNet dataset: millions of images of objects. Objective: classify each image as the corresponding object (1k categories in ILSVRC).

Another example: image classification w/ AlexNet. AlexNet has 8 layers: five convolutional followed by three fully connected.

Another example: image classification w/ AlexNet. AlexNet won the 2012 ILSVRC challenge and sparked the deep learning craze.

What exactly are deep conv networks learning? (Visualizations of the features learned at successive convolutional layers.)

What exactly are deep conv networks learning? FC layer 6

What exactly are deep conv networks learning? FC layer 7

What exactly are deep conv networks learning? Output layer

Finetuning. AlexNet has 60M parameters; therefore, you need a very large training set (like ImageNet). Suppose we want to train on our own images, but we only have a few hundred? AlexNet will drastically overfit such a small dataset (it won't generalize at all).

Finetuning. Idea: 1. pretrain on ImageNet 2. finetune on your own dataset. AlexNet has 60M parameters; therefore, you need a very large training set (like ImageNet). Suppose we want to train on our own images, but we only have a few hundred? AlexNet will drastically overfit such a small dataset (it won't generalize at all).
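A hedged sketch of that recipe using a Keras model pretrained on ImageNet; ResNet50 stands in for AlexNet (which Keras does not ship), and num_classes, my_images, and my_labels are hypothetical placeholders for your own small dataset. The pretrained weights are frozen and only a new classification head is trained.

import tensorflow as tf

# 1. pretrain on ImageNet: load a network with ImageNet weights, minus its classifier head
base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, pooling='avg')
base.trainable = False                      # freeze the pretrained parameters

# 2. finetune on your own dataset: train only a small new head
num_classes = 5                             # illustrative
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# model.fit(my_images, my_labels, epochs=10)   # a few hundred images is enough to fit the head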