
CS-E4050 - Deep Learning, Session 2: Introduction to Deep Learning, Deep Feedforward Networks. Jyri Kivinen, Aalto University, 16 September 2015. Presentation largely based on material in LeCun et al. (2015) and in Goodfellow et al. (2016, Chapters 1 and 6).

Table of Contents: Introduction to Deep Learning; Deep Feed-Forward Networks (Background; Terminology, Properties, Example; Architecture Variants; Parameter Optimization); Home exercises.

What is Deep Learning? An approach to Artificial Intelligence (AI), primarily via Artificial Neural Networks (ANNs) (?). History: Cybernetics (1940s-1960s), Connectionism [/Parallel Distributed Processing] (1980s-1990s), Deep Learning (2006-); all ANN-based. Scope: deep learning is a subset of representation learning, which is a subset of machine learning, which is a subset of AI. Central properties: the use of multilayer-organizable parameterized computational models, with at least two layers having adaptive parameters that are adjusted based on training data, for data modelling and analysis (?); end-to-end learning.

What is Deep Learning? "Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics." - Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553): 436-444, May 2015.

Graphical Representations of Example Models (Demo)

What is Deep Learning? Deep vs. shallow models; distributed vs. local representations? Divide-and-conquer via compositionality? Classical models and parameter-learning techniques include the Multi-Layer Perceptron (MLP) and the back-propagation of errors algorithm, respectively. More recent important conceptual developments include the Deep Belief Networks of Hinton, Osindero and Teh (2006). The increasing availability of training data, together with the growing capacity and effectiveness of computational resources (software and hardware), has brought problem-solving improvements via deep learning. Example model types that have contributed to effective recent application results are the so-called (deep) convolutional networks (ConvNets) and recurrent neural networks (RNNs). Deep learning has recently been a very active area of research and development, with, e.g., a rapidly expanding research literature and software resources. LeCun et al. (2015) predicted that unsupervised learning will grow in importance; two expected large-impact areas over the following few years included computational vision and natural language understanding.

Table of Contents: Introduction to Deep Learning; Deep Feed-Forward Networks (Background; Terminology, Properties, Example; Architecture Variants; Parameter Optimization); Home exercises.

Table of Contents: Introduction to Deep Learning; Deep Feed-Forward Networks (Background; Terminology, Properties, Example; Architecture Variants; Parameter Optimization); Home exercises.

Deep Feed-Forward Networks Artificial neural network models with multiple layers of computational units connected together in a feed-forward manner; a classical example is the MLP. In the literature they appear both stand-alone and in supportive roles within other models. The class of models allows for highly flexible non-linear function approximation; there is theory (see e.g. Goodfellow et al., 2016) suggesting that even appropriate single-hidden-layer networks have so-called universal function approximation properties. Many machine learning problems can be cast as function approximation. Obtaining good performance in practical applications (often) requires the adjustment of many parameters, which can be a difficult optimization task. They have been applied (very effectively) to several kinds of tasks, from supervised learning tasks such as categorization and regression to unsupervised learning tasks such as density estimation and feature discovery.

Table of Contents: Introduction to Deep Learning; Deep Feed-Forward Networks (Background; Terminology, Properties, Example; Architecture Variants; Parameter Optimization); Home exercises.

Basic Terminology (notation used later) Units and layers: input (x), hidden (h), output (y); units in layers are sometimes grouped/arranged into so-called channels (as e.g. in convolutional networks). Width and depth: the number of units in a layer defines its width; the number of layers defines the model depth. Feed-forward connections: input to hidden, hidden to hidden, and hidden to output; layers can be skipped, too. Parameters (Θ): connection weights, unit biases. Functions: activation function, objective (/cost) function (C).

Example Networks: Demo

Table of Contents: Introduction to Deep Learning; Deep Feed-Forward Networks (Background; Terminology, Properties, Example; Architecture Variants; Parameter Optimization); Home exercises.

Unit Types, Activation Functions A set of common unit activation functions:
- h(z) = linear(z) = z; fully differentiable, unbounded; used e.g. to obtain a continuous-valued network output (note that a composition of linear functions is linear).
- h(z) = sigmoid(z) = (1 + exp(-z))^(-1); fully differentiable, bounded; tanh(z) = 2 sigmoid(2z) - 1 is sometimes preferred over it.
- Rectified linear unit: relu(z) = max(0, z); not fully differentiable (only within the pieces, as for classical threshold units), not bounded, yet an excellent default choice for several models.
- h(z) = softmax(z)_i = exp(z_i) / Σ_j exp(z_j); fully differentiable; used e.g. when the target is one-of-k.
The (unit) input is often of the form z = b + Σ_i W_i u_i.
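A minimal NumPy sketch of these activation functions (illustrative only, not from the slides; the max-subtraction in softmax is a standard numerical-stability trick not mentioned above):

    import numpy as np

    def linear(z):
        return z

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        return 2.0 * sigmoid(2.0 * z) - 1.0      # equivalent to np.tanh(z)

    def relu(z):
        return np.maximum(0.0, z)

    def softmax(z):
        e = np.exp(z - np.max(z))                # subtract the max for numerical stability
        return e / e.sum()

    def unit_input(W, u, b):
        # z = b + sum_i W_i u_i, computed for a whole layer at once
        return b + W @ u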

Unit Types, Activation Functions: Demo

Cost Functions A classical cost function is the mean-squared error (between the produced output and the target). Networks can be developed to produce as their output an encoding of the parameters of a distribution, say of the conditional distribution p(y | x). Such techniques have been used e.g. in Density Networks (see e.g. MacKay, 1995) and in Variational Autoencoders (we'll discuss these later in the course). A natural and well-justifiable cost function is then the negative log-likelihood, i.e. C({x^(n), y^(n)}_{n=1}^N) = -(1/N) Σ_{n=1}^N log p(y^(n) | x^(n)).
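As an aside (not on the slide), the two examples above are connected: if the network output is interpreted as the mean of a unit-variance Gaussian, the negative log-likelihood equals the mean-squared error up to an additive constant. A small NumPy check, with hypothetical function names:

    import numpy as np

    def nll_gaussian(y, mean):
        # mean over examples of -log Normal(y; mean, 1)
        return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (y - mean) ** 2)

    def mse(y, mean):
        return np.mean(0.5 * (y - mean) ** 2)

    y, m = np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.5, 2.0])
    # The difference is the constant 0.5 * log(2 * pi), independent of the parameters,
    # so minimizing the Gaussian NLL and minimizing MSE give the same optimum.
    print(nll_gaussian(y, m) - mse(y, m), 0.5 * np.log(2 * np.pi))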

Cost Functions: Demo

Enforcing and Encouraging Constraints Examples:
- Enforcing parameter properties: tied parameters: fixed parameter values, e.g. weights set to zero outside a region of the input (e.g. to obtain locality); adapted but shared parameter values, e.g. parameters shared across units (e.g. to obtain translation equivariance).
- Enforcing unit properties: unit value clamping.
- Encouraging parameter and unit properties: a term in the cost function that implicitly affects the parameters or unit activations in a certain way, e.g. an additional term imposing L2 decay on the weights (we'll discuss these later; a small sketch follows below).
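As a small illustration of the last point (a sketch with hypothetical names, not the course's code): an L2 weight-decay term adds 0.5 * lam * ||W||^2 to the cost and therefore lam * W to each weight gradient.

    import numpy as np

    def cost_with_l2(data_cost, weights, lam=1e-4):
        # C_total = C_data + (lam / 2) * sum of squared weights
        return data_cost + 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

    def grads_with_l2(data_grads, weights, lam=1e-4):
        # dC_total/dW = dC_data/dW + lam * W
        return [g + lam * W for g, W in zip(data_grads, weights)]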

Table of Contents: Introduction to Deep Learning; Deep Feed-Forward Networks (Background; Terminology, Properties, Example; Architecture Variants; Parameter Optimization); Home exercises.

Parameter Optimization The adaptive parameters need to be tuned, and the usual route for parameter adjustment is iterative, derivative/gradient-based learning. One widely used algorithm for the optimization is stochastic gradient descent. A key part is computing the gradient vector ∇_Θ C(x, y, Θ), i.e. computing ∂C(x, y, Θ)/∂θ_i for each index i, where the full set of (adaptive) parameters is Θ = {θ_i}_{i=1}^I. Parameter update (with learning rate η): θ_i ← θ_i - η ∂C(x, y, Θ)/∂θ_i.
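A minimal sketch of the update rule as plain stochastic gradient descent (hypothetical helper names; grad_fn stands for whatever computes the gradients, e.g. backpropagation):

    import numpy as np

    def sgd(params, grad_fn, data, lr=0.01, epochs=10):
        # params : list of NumPy arrays (the adaptive parameters Theta)
        # grad_fn: function (x, y, params) -> list of gradients, one per parameter array
        # data   : iterable of (x, y) training pairs
        for _ in range(epochs):
            for x, y in data:
                grads = grad_fn(x, y, params)
                for theta, g in zip(params, grads):
                    theta -= lr * g      # theta_i <- theta_i - eta * dC/dtheta_i
        return params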

Gradient Computation In feed-forward networks, computing the partial derivatives of an objective function with respect to the parameters is (usually) implemented efficiently via the backpropagation algorithm. The different partial derivatives (making up the full gradient) have functionally (and thus computationally) shared parts, and the algorithm exploits this to avoid redundancy in the full gradient computation. In the algorithm, the partial derivatives are computed in a single layer-by-layer pass, proceeding from the output layer to the input layer, computing and then distributing the common parts along the way. Software such as Theano (see e.g. Theano Development Team, 2016; Bergstra et al., 2010) allows for automatic differentiation built on the same underlying techniques, which can bring e.g. prototyping-speed benefits (including gradient-correctness checking). Numerical estimates based on the so-called central-differences approach can (also) be used for gradient checking (Bishop, 1995); a sketch follows below.
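A sketch of such a central-differences gradient check (hypothetical names; cost_fn maps a parameter vector to a scalar cost):

    import numpy as np

    def check_gradient(cost_fn, theta, analytic_grad, eps=1e-5):
        # Compare an analytic gradient (e.g. from backpropagation) against
        # central-difference estimates (C(theta + eps*e_i) - C(theta - eps*e_i)) / (2*eps).
        numeric = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = eps
            numeric[i] = (cost_fn(theta + e) - cost_fn(theta - e)) / (2 * eps)
        return np.max(np.abs(numeric - analytic_grad))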

Gradient Computation Example (Credits: Tapani Raiko) Two-hidden-layer network: x, h^(1), h^(2), and y are the sets of unit values of the input layer, the first hidden layer, the second hidden layer, and the output layer, respectively. Connection weights θ^(1) (x → h^(1)), θ^(2) (h^(1) → h^(2)), and θ^(3) (h^(2) → y), together forming the parameters Θ. Partial derivatives in the network:
∂C/∂θ^(3)_ij = (∂C/∂y_i)(∂y_i/∂θ^(3)_ij)
∂C/∂θ^(2)_jk = Σ_i (∂C/∂y_i)(∂y_i/∂h^(2)_j)(∂h^(2)_j/∂θ^(2)_jk)
∂C/∂θ^(1)_kl = Σ_{i,j} (∂C/∂y_i)(∂y_i/∂h^(2)_j)(∂h^(2)_j/∂h^(1)_k)(∂h^(1)_k/∂θ^(1)_kl)

Backpropagation (Credits: Tapani Raiko) Dynamic programming avoids exponential complexity. Store the intermediate results
∂C/∂h^(2)_j = Σ_i (∂C/∂y_i)(∂y_i/∂h^(2)_j)
∂C/∂h^(1)_k = Σ_j (∂C/∂h^(2)_j)(∂h^(2)_j/∂h^(1)_k)
to get, for every layer L, simply
∂C/∂θ^(L)_ij = (∂C/∂h^(L)_i)(∂h^(L)_i/∂θ^(L)_ij)
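A minimal NumPy sketch of these two slides for the same two-hidden-layer network, assuming sigmoid hidden units, a linear output, and squared-error cost (biases omitted for brevity; these concrete choices are mine, not from the slides). The per-layer deltas are stored and reused rather than recomputed:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_backward(x, y, W1, W2, W3):
        # Forward pass: store the intermediate activations.
        h1 = sigmoid(W1 @ x)                 # first hidden layer
        h2 = sigmoid(W2 @ h1)                # second hidden layer
        out = W3 @ h2                        # linear output layer
        cost = 0.5 * np.sum((out - y) ** 2)

        # Backward pass: compute each layer's delta once and reuse it below.
        delta_out = out - y                                  # dC/d(output)
        delta_2 = (W3.T @ delta_out) * h2 * (1 - h2)         # dC/d(pre-activation of h2)
        delta_1 = (W2.T @ delta_2) * h1 * (1 - h1)           # dC/d(pre-activation of h1)

        grad_W3 = np.outer(delta_out, h2)
        grad_W2 = np.outer(delta_2, h1)
        grad_W1 = np.outer(delta_1, x)
        return cost, (grad_W1, grad_W2, grad_W3)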

Table of Contents: Introduction to Deep Learning; Deep Feed-Forward Networks (Background; Terminology, Properties, Example; Architecture Variants; Parameter Optimization); Home exercises.

Home Exercises Read any parts of Chapter 6 not read yet. Derive the gradient computation (computing the partial derivatives of an objective function with respect to the model parameters) using the back-propagation algorithm, assuming one of the two following objective functions: C_i(x, y; Θ) = -Σ_{n=1}^N log p_i(y_n | x_n, Θ), i ∈ {1, 2}. Alternative 1: p_1(y | x, Θ) = Normal(y; FFnet_1(x; Θ), 1), a univariate normal distribution with variance 1 and mean FFnet_1(x; Θ), where FFnet_1(x; Θ) denotes a fully-connected two-hidden-layer feed-forward network mapping a two-dimensional continuous-valued x onto a continuous scalar; the network has parameters Θ. Alternative 2: p_2(y | x, Θ) = Bernoulli(y; sigmoid(FFnet_2(x; Θ))), a Bernoulli distribution with activation (success) probability sigmoid(FFnet_2(x; Θ)), where FFnet_2(x; Θ) denotes a fully-connected single-hidden-layer feed-forward network mapping a two-dimensional continuous-valued x onto a continuous scalar; the network has parameters Θ. Choose the hidden unit activation functions, but assume they are non-linear. Also choose the number of hidden units, but use at least three of them per layer. The derivation needs to be part of your first report.

Home Exercises Next time we will have a session using Theano, where you will, among other things, be implementing gradient computation, with and without automatic (symbolic) differentiation. Before the session, (in addition to the gradient derivation) also familiarize yourself with Theano: follow the tutorial at http://deeplearning.net/software/theano/tutorial/index.html, considering (at least) the following parts: Prerequisites; Basics: Baby Steps - Algebra; and Basics: More Examples. We are expecting to go through some parts of the Deep Learning Summer School, Montreal 2016 "Introduction to Theano" tutorial by Pascal Lamblin, available via VideoLectures.NET at http://videolectures.net/deeplearning2016_lamblin_theano/. View the presentation and come to the session with any questions.
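For orientation before the tutorial, a minimal Theano example of the kind of symbolic differentiation involved (a sketch; the toy cost is my own choice, not part of the exercises):

    import theano
    import theano.tensor as T

    w = T.dvector('w')
    x = T.dvector('x')
    cost = T.dot(w, x) ** 2                 # a toy scalar cost C(w, x) = (w . x)^2
    grad = theano.grad(cost, w)             # symbolic gradient dC/dw

    f = theano.function([w, x], [cost, grad])
    print(f([1.0, 2.0], [3.0, 4.0]))        # cost = 121.0, grad = [66.0, 88.0]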

References
J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: A CPU and GPU Math Expression Compiler. In Proc. Python for Scientific Computing Conference (SciPy), 2010.
C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2016. URL: http://www.deeplearningbook.org.
G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In Proc. International Conference on Learning Representations (ICLR), 2014 (arXiv:1312.6114v10 [stat.ML]).
Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553): 436-444, May 2015.
D. J. C. MacKay. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A, 354(1): 73-80, 1995.
Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.