CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh February 28, 2017


Announcements: HW2 due Thursday. Office hours on Thursday: 4:15pm-5:45pm. Talk at 3pm: http://www.sam.pitt.edu/arc-2017/arc2017-schedule/ Exam: mean 53.04 (76%), median 56.50 (81%).

Plan for the next few lectures: Neural network basics (architecture, biological inspiration, loss functions, training with gradient descent and backpropagation); practical matters (overfitting prevention, transfer learning, software packages); convolutional neural networks (CNNs): special operations for processing images; recurrent neural networks (RNNs): special operations for processing sequences (e.g. language).

Neural network definition. Activations of the hidden units: a_j = Σ_i w_ji^(1) x_i + w_j0^(1). A nonlinear activation function h (e.g. sigmoid, tanh, ReLU) is applied to each activation: z_j = h(a_j). Figure from Christopher Bishop

Neural network definition, layers 2 and 3 (final). The next layer computes activations from the hidden outputs, a_k = Σ_j w_kj^(2) z_j + w_k0^(2). Outputs: a sigmoid of the final activation for a binary problem, a softmax over the final activations for a multiclass problem. Finally, the network output is the composition of these layer-wise transformations.

Activation functions: sigmoid; tanh: tanh(x); ReLU: max(0, x); Leaky ReLU: max(0.1x, x); Maxout; ELU. Andrej Karpathy
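To make the list above concrete, here is a minimal NumPy sketch of the elementwise activation functions named on the slide (illustrative only; deep learning packages ship their own implementations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)            # max(0, x)

def leaky_relu(x, alpha=0.1):
    return np.maximum(alpha * x, x)    # max(0.1x, x) for alpha = 0.1

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```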

A multi-layer neural network Nonlinear classifier Can approximate any continuous function to arbitrary accuracy given sufficiently many hidden units Lana Lazebnik

Inspiration: Neuron cells Neurons accept information from multiple inputs transmit information to other neurons Multiply inputs by weights along edges Apply some function to the set of inputs at each node If output of function over threshold, neuron fires Text: HKUST, figures: Andrej Karpathy
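A tiny sketch of this neuron model: weighted inputs, an activation function, and a firing threshold. The weights, inputs, choice of sigmoid, and threshold value below are made-up illustrations, not from the slides:

```python
import numpy as np

def neuron_fires(x, w, b, threshold=0.5):
    """Return True if the neuron's output exceeds the threshold."""
    a = np.dot(w, x) + b                # multiply inputs by weights and sum
    output = 1.0 / (1.0 + np.exp(-a))   # apply a (sigmoid) function to the sum
    return output > threshold           # fire if over threshold

print(neuron_fires(x=np.array([0.2, 0.8]), w=np.array([1.5, -0.3]), b=0.1))
```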

Multilayer networks: cascade neurons together; the output from one layer is the input to the next; each layer has its own set of weights. HKUST

Feed-forward networks Predictions are fed forward through the network to classify HKUST

Deep neural networks: lots of hidden layers, with weights to learn at every layer; depth = power (usually). Figure from http://neuralnetworksanddeeplearning.com/chap5.html

How do we train them? There is no analytical solution for the weights. We will iteratively find a set of weights that allows the outputs to match the desired outputs. We want to minimize a loss function (a function of the weights in the network). For now, let's simplify and assume there's a single layer of weights in the network.

Softmax loss: the scores s = f(x_i; W) are the unnormalized log probabilities of the classes, where P(Y = k | X = x_i) = exp(s_k) / Σ_j exp(s_j). We want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class: L_i = -log P(Y = y_i | X = x_i). Example scores: cat 3.2, car 5.1, frog -1.7. Andrej Karpathy

Softmax loss, worked example (adapted from Andrej Karpathy):
unnormalized log probabilities (scores): cat 3.2, car 5.1, frog -1.7
exp -> unnormalized probabilities: 24.5, 164.0, 0.18
normalize -> probabilities: 0.13, 0.87, 0.00
L_i = -log(0.13) = 0.89
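A NumPy sketch that reproduces this worked example; the scores and the correct-class index come from the slide, everything else is illustrative. Note that the slide's value 0.89 corresponds to a base-10 logarithm; with the natural logarithm (used below and in most implementations) the loss is about 2.04:

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])        # cat, car, frog
correct_class = 0                          # cat

unnormalized = np.exp(scores)              # ~24.5, 164.0, 0.18
probs = unnormalized / unnormalized.sum()  # ~0.13, 0.87, 0.00
loss = -np.log(probs[correct_class])       # negative log likelihood (natural log)
print(probs, loss)
```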

Regularization: L1, L2 regularization (weight decay); dropout: randomly turn off some neurons, which allows individual neurons to independently be responsible for performance. Dropout: A Simple Way to Prevent Neural Networks from Overfitting [Srivastava et al., JMLR 2014]. Adapted from Jia-bin Huang
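A sketch of "inverted" dropout at training time, assuming a keep probability p; at test time no mask is applied because of the 1/p scaling. This is illustrative only; the packages listed later implement dropout for you:

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True):
    """Randomly turn off neurons in the hidden activations h."""
    if not train:
        return h
    mask = (np.random.rand(*h.shape) < p) / p  # zero out ~(1 - p) of the units, rescale the rest
    return h * mask

h = np.random.randn(4, 8)                      # a batch of hidden activations
print(dropout_forward(h, p=0.5))
```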

Gradient descent: we'll update the weights by moving in the direction opposite to the gradient of the loss L with respect to the weights, w <- w - α ∂L/∂w, where α is the learning rate. Figure from Andrej Karpathy

Mini-batch gradient descent: rather than computing the gradient of the loss over all training examples, we can use only a subset (a mini-batch) of the data for each gradient update. We cycle through all the training examples multiple times; each pass through all of them is called an epoch. Mini-batches allow faster training (e.g. on GPUs) and parallelization. Figure from Andrej Karpathy
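A sketch of the mini-batch loop just described. The data arrays, batch size, learning rate, and the loss_and_gradient callback are hypothetical placeholders, not definitions from the lecture:

```python
import numpy as np

def sgd(w, X, y, loss_and_gradient, lr=0.01, batch_size=32, num_epochs=10):
    n = X.shape[0]
    for epoch in range(num_epochs):               # one epoch = one pass over all training data
        order = np.random.permutation(n)          # shuffle the training examples
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            loss, grad = loss_and_gradient(w, X[idx], y[idx])  # gradient on the mini-batch only
            w = w - lr * grad                     # move opposite to the gradient
    return w
```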

Gradient descent in multi-layer nets How to update the weights at all layers? Answer: backpropagation of error from higher layers to lower layers Figure from Andrej Karpathy

How to compute the gradient? In a neural network, a_j = Σ_i w_ji z_i and z_j = h(a_j). The gradient of the per-example error E_n with respect to a weight is ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji). Denote the errors as δ_j ≡ ∂E_n/∂a_j. Also ∂a_j/∂w_ji = z_i, so ∂E_n/∂w_ji = δ_j z_i.

Backpropagation: error terms. For output units (e.g. identity output, least squares loss): δ_k = y_k - t_k. For hidden units, the backprop formula is δ_j = h'(a_j) Σ_k w_kj δ_k.

Example (identity output function). Two-layer network with tanh at the hidden layer: h(a) = tanh(a), with derivative h'(a) = 1 - h(a)^2. Minimize the squared error E_n = (1/2) Σ_k (y_k - t_k)^2. Forward propagation: a_j = Σ_i w_ji^(1) x_i, z_j = tanh(a_j), y_k = Σ_j w_kj^(2) z_j.

Example (identity output function), continued. Errors at the outputs: δ_k = y_k - t_k. Errors at the hidden units: δ_j = (1 - z_j^2) Σ_k w_kj δ_k. Derivatives w.r.t. the weights: ∂E_n/∂w_ji^(1) = δ_j x_i and ∂E_n/∂w_kj^(2) = δ_k z_j.
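To make the worked example concrete, here is a minimal NumPy sketch of one forward/backward pass for this two-layer tanh network with identity outputs and squared-error loss. The variable names (W1, W2, lr) and the dimensions are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 3, 5, 2                       # input dim, hidden units, output units
W1 = 0.1 * rng.standard_normal((H, D))  # first-layer weights w_ji^(1)
W2 = 0.1 * rng.standard_normal((K, H))  # second-layer weights w_kj^(2)

x = rng.standard_normal(D)              # one training input
t = rng.standard_normal(K)              # its target

# Forward propagation: a_j = sum_i w_ji x_i, z_j = tanh(a_j), y_k = sum_j w_kj z_j
a = W1 @ x
z = np.tanh(a)
y = W2 @ z
E = 0.5 * np.sum((y - t) ** 2)          # squared-error loss

# Backpropagation, matching the error equations above
delta_k = y - t                             # errors at the outputs
delta_j = (1 - z ** 2) * (W2.T @ delta_k)   # errors at the hidden units
dW2 = np.outer(delta_k, z)                  # dE/dw_kj^(2) = delta_k z_j
dW1 = np.outer(delta_j, x)                  # dE/dw_ji^(1) = delta_j x_i

# One gradient-descent step
lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2
```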

Backpropagation, graphic example: first calculate the error of the output units and use this to change the top layer of weights (the weights into the output layer). [Figure: input units i, hidden units j, output units k.] Adapted from Ray Mooney, equations from Chris Bishop

Backpropagation, graphic example (continued): next calculate the error for the hidden units based on the errors of the output units they feed into. [Figure: input units i, hidden units j, output units k.] Adapted from Ray Mooney, equations from Chris Bishop

Backpropagation, graphic example (continued): finally update the bottom layer of weights (the weights into the hidden units) based on the errors calculated for the hidden units. [Figure: input units i, hidden units j, output units k.] Adapted from Ray Mooney, equations from Chris Bishop

Comments on the training algorithm: Not guaranteed to converge to zero training error; may converge to local optima or oscillate indefinitely. However, in practice it does converge to low error for many large networks on real data. Thousands of epochs (epoch = the network sees all training data once) may be required; training can take hours or days. To avoid local-minima problems, run several trials starting from different random weights (random restarts), and take the results of the trial with the lowest training set error. It may be hard to set the learning rate and to select the number of hidden units and layers. Neural networks had fallen out of fashion in the 90s and early 2000s; they are back with a new name and significantly improved performance (deep networks trained with dropout and lots of data). Ray Mooney, Carlos Guestrin, Dhruv Batra

Over-training prevention: running too many epochs can result in over-fitting. [Plot: error vs. # training epochs, with curves for training data and test data.] Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error. Adapted from Ray Mooney
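A sketch of this hold-out early-stopping rule; the two callbacks (one pass over the training data, and evaluation on the validation set) are hypothetical helpers supplied by the caller, not part of the lecture:

```python
def train_with_early_stopping(train_one_epoch, validation_error, max_epochs=1000):
    """train_one_epoch() runs one pass over the training data and returns the
    current weights; validation_error(weights) returns error on the hold-out set."""
    best_err, best_weights = float("inf"), None
    for epoch in range(max_epochs):
        weights = train_one_epoch()
        err = validation_error(weights)
        if err < best_err:
            best_err, best_weights = err, weights   # validation error still improving
        else:
            break                                   # additional epochs increase validation error
    return best_weights
```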

Determining the best number of hidden units: too few hidden units prevents the network from adequately fitting the data; too many hidden units can result in over-fitting. [Plot: error vs. # hidden units, with curves for training data and test data.] Use internal cross-validation to empirically determine an optimal number of hidden units. Ray Mooney

Effect of number of neurons more neurons = more capacity Andrej Karpathy

Effect of regularization: do not use the size of the neural network as a regularizer; use stronger regularization instead. (You can play with this demo over at ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html) Andrej Karpathy
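As a small illustration of "stronger regularization", here is a sketch of adding an L2 penalty of strength lam to a loss and its gradient; data_loss and data_grad stand in for the unregularized objective and are assumptions for illustration:

```python
import numpy as np

def l2_regularized(data_loss, data_grad, W, lam=1e-3):
    loss = data_loss + 0.5 * lam * np.sum(W * W)   # add 0.5 * lambda * ||W||^2
    grad = data_grad + lam * W                     # its contribution to the gradient
    return loss, grad
```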

Hidden unit interpretation Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space. On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors, etc. However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature. Ray Mooney

Transfer learning: addresses the worry that you need a lot of data if you want to train/use deep nets. Adapted from Andrej Karpathy

Transfer learning, motivation: the more weights you need to learn, the more data you need. That's why, with a deeper network, you need more data for training than for a shallower network. But if you have sparse data, you can just train the last few layers of a deep net: set the earlier layers to the already learned weights from another network, and learn only the last layers on your own task.

Transfer learning. Source task: e.g. classification of animals; target task: e.g. classification of cars. 1. Train on the source (large dataset). 2. If the target dataset is small: freeze the earlier layers and train only the last layer(s). 3. If the target dataset is medium-sized: finetune; more data = retrain more of the network (or all of it). Another option: use the network as a feature extractor and train an SVM/logistic regression on the extracted features for the target task. Adapted from Andrej Karpathy
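A minimal sketch of the "freeze the earlier layers, train a new last layer" recipe using Keras (one of the packages listed later). The choice of VGG16, the new head, and num_classes are assumptions for illustration, not part of the lecture:

```python
from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

num_classes = 10                                    # hypothetical target task (e.g. car types)
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))             # 1. network trained on the source (ImageNet)
for layer in base.layers:
    layer.trainable = False                         # 2. small target dataset: freeze these layers

x = Flatten()(base.output)
out = Dense(num_classes, activation='softmax')(x)   # ...and train this new last layer
model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='sgd', loss='categorical_crossentropy')
# 3. medium target dataset: set some later base layers back to trainable (finetuning)
```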

Pre-training on ImageNet: we have a source domain and a target domain. Train a network to classify ImageNet classes (coarse classes and ones with fine distinctions, e.g. dog breeds). Remove the last layers and train new layers in their place that predict the target classes. Oquab et al., Learning and Transferring Mid-Level Image Representations, CVPR 2014

Transfer learning with CNNs is pervasive: image captioning (Karpathy and Fei-Fei, Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR 2015) and object detection (Ren et al., Faster R-CNN, NIPS 2015) both build on a CNN pretrained on ImageNet. Adapted from Andrej Karpathy

Another solution for sparse data: augmentation. Create virtual training samples: horizontal flip, random crop, color casting, geometric distortion. Deep Image [Wu et al. 2015]; slide from Jia-bin Huang
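A sketch of two of the listed augmentations (horizontal flip and random crop) on an image stored as an H x W x 3 NumPy array; the crop size and the stand-in image are illustrative assumptions:

```python
import numpy as np

def horizontal_flip(img):
    return img[:, ::-1, :]                  # flip along the width axis

def random_crop(img, crop_h, crop_w):
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop_h + 1)
    left = np.random.randint(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

img = np.random.rand(256, 256, 3)           # stand-in for a training image
virtual_sample = horizontal_flip(random_crop(img, 224, 224))
```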

Packages Caffe and Caffe Model Zoo Torch Theano with Keras/Lasagne MatConvNet TensorFlow

Learning Resources http://deeplearning.net/ http://cs231n.stanford.edu (CNNs, vision) http://cs224d.stanford.edu/ (RNNs, language)

Summary: Feed-forward network architecture. Training deep neural nets: we need an objective function that measures and guides us towards good performance; we need a way to minimize the loss function: (stochastic, mini-batch) gradient descent; we need backpropagation to propagate error to all layers and change the weights at those layers. Practices for preventing overfitting and for training with little data.