CS-E4050 - Deep Learning
Session 2: Introduction to Deep Learning, Deep Feedforward Networks
Jyri Kivinen, Aalto University, 16 September 2015
Presentation largely based on material in LeCun et al. (2015) and in Goodfellow et al. (2016, Chapters 1 and 6).
Table of Contents
Introduction to Deep Learning
Deep Feed-Forward Networks
  Background
  Terminology, Properties, Example Architecture Variants
  Parameter Optimization
Home exercises
What is Deep Learning?
An approach to Artificial Intelligence (AI), primarily via Artificial Neural Networks (ANNs).
History: Cybernetics (1940s-1960s), Connectionism [/Parallel Distributed Processing] (1980s-1990s), Deep Learning (2006-); ANNs throughout.
Deep learning ⊂ Representation learning ⊂ Machine learning ⊂ AI.
Central properties: the use of multilayer-organizable parameterized computational models, with at least two layers having adaptive parameters that are adjusted based on training data, for data modelling and analysis; end-to-end learning.
What is Deep Learning?
"Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics."
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, May 2015.
Graphical Representations of Example Models (Demo)
What is Deep Learning?
Deep vs. shallow models; distributed vs. local representations. Divide-and-conquer via compositionality.
Classical models and parameter-learning techniques include the Multi-Layer Perceptron (MLP) and the back-propagation-of-errors algorithm, respectively. More recent important conceptual developments include the Deep Belief Networks of Hinton, Osindero and Teh (2006).
The increasing availability of training data, together with the capacity and effectiveness of computational resources (software, hardware), has brought problem-solving improvements via deep learning. Example model types that have been part of producing effective recent application results are the so-called (deep) convolutional networks (ConvNets) and recurrent neural networks (RNNs).
Deep learning has recently been a very active area of research and development, with e.g. a rapidly expanding research literature and software resources. LeCun et al. (2015) predicted that unsupervised learning will increase in importance; two expected large-impact areas over the following few years included computational vision and natural language understanding.
Deep Feed-Forward Networks
Artificial neural network models with multiple layers of computational units connected together in a feed-forward manner; a classical example is the MLP. They can be found in the literature both in stand-alone roles and in supportive roles within other models.
This class of models allows for highly flexible non-linear function approximation; there is theory (see e.g. Goodfellow et al., 2016) suggesting that even appropriate single-hidden-layer networks have so-called universal function approximation properties. Many machine learning problems can be cast as function approximation.
Obtaining good performance in practical applications (often) requires the adjustment of many parameters, which can be a difficult optimization task.
They have been applied (very effectively) to several kinds of tasks, from supervised learning tasks such as categorization and regression, to unsupervised learning tasks such as density estimation and feature discovery.
Basic Terminology (notation used later)
Units and layers: input (x), hidden (h), output (y); units in layers are sometimes grouped/arranged into so-called channels (as e.g. in convolutional networks).
Width and depth: the number of units in a layer defines its width; the number of layers defines the model depth.
Feed-forward connections: input to hidden, hidden to hidden, and hidden to output; layers can be skipped, too.
Parameters (Θ): connection weights, unit biases.
Functions: activation function, objective (/cost) function (C).
Example Networks: Demo
Unit Types, Activation Functions
A set of common unit activation functions:
h(z) = linear(z) = z; fully differentiable, unbounded; e.g. for obtaining a continuous-valued network output (note that a composition of linear functions is linear).
h(z) = sigmoid(z) = (1 + exp(−z))^(−1); fully differentiable, bounded; tanh(z) = 2 sigmoid(2z) − 1 is sometimes preferred over it.
Rectified linear unit: relu(z) = max(0, z); not fully differentiable (only within the pieces; the same holds for classical threshold units), not bounded, yet an excellent default choice for several models.
h(z) = softmax(z)_i = exp(z_i) / Σ_j exp(z_j); fully differentiable; e.g. when the target is one-of-k.
The (unit) input is often of the form z = b + Σ_i W_i u_i.
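The activations above can be sketched in plain Python (scalar versions for illustration; the function names are ours, and the max-shift inside softmax is a standard numerical-stability trick not shown in the formulas):

```python
import math

def linear(z):
    # Identity activation: differentiable everywhere, unbounded.
    return z

def sigmoid(z):
    # Logistic sigmoid: (1 + exp(-z))^-1, bounded in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_sigmoid(z):
    # tanh expressed through the sigmoid: tanh(z) = 2*sigmoid(2z) - 1.
    return 2.0 * sigmoid(2.0 * z) - 1.0

def relu(z):
    # Rectified linear unit: max(0, z); differentiable within each piece.
    return max(0.0, z)

def softmax(zs):
    # softmax(zs)_i = exp(z_i) / sum_j exp(z_j); subtracting max(zs)
    # leaves the result unchanged but avoids overflow in exp.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]
```
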
Unit Types, Activation Functions: Demo
Cost Functions
A classical cost function is the mean-squared error (between the produced output and the target).
Networks can be developed to produce as their output an encoding of the parameters of a distribution, say of the conditional distribution p(y|x). Such techniques have been used e.g. in Density Networks (see e.g. MacKay, 1995) and in Variational Autoencoders (we'll discuss these later in the course). A natural and well-justifiable cost function is then the negative log-likelihood, i.e.
C({x^(n), y^(n)}_{n=1}^N) = −(1/N) Σ_{n=1}^N log p(y^(n) | x^(n)).
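As a concrete instance, when the network output encodes the mean of a unit-variance normal distribution over y, the negative log-likelihood reduces to the squared error plus a constant. A minimal sketch (the function name and interface are illustrative, not from the slides):

```python
import math

def gaussian_nll(targets, means, var=1.0):
    # Mean negative log-likelihood under y ~ Normal(mean, var):
    #   -log p(y|x) = 0.5*log(2*pi*var) + (y - mean)^2 / (2*var)
    # With var = 1 this is the squared error (scaled) plus a constant.
    n = len(targets)
    total = sum(0.5 * math.log(2 * math.pi * var)
                + (y - m) ** 2 / (2 * var)
                for y, m in zip(targets, means))
    return total / n
```
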
Cost Functions: Demo
Enforcing and Encouraging Constraints
Examples:
Enforcing parameter properties:
  Tied parameters:
    fixed parameter values: e.g. weights are zero outside a region of the input (e.g. to obtain locality),
    adapted but shared parameter values: e.g. shared parameters across units (e.g. to obtain translation equivariance).
Enforcing unit properties: unit value clamping.
Encouraging parameter and unit properties: a term in the cost function that implicitly affects the parameters or unit activations in a certain way; e.g. an additional term imposing L2-decay on the weights (we'll discuss these later).
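The "encouraging" case can be sketched as an additive penalty on the cost; a minimal illustration of the L2-decay term mentioned above (the function name and the default weight `lam` are illustrative choices):

```python
def cost_with_l2(data_cost, weights, lam=1e-4):
    # Total objective = data term + lam * sum of squared weights.
    # The penalty encourages small weights (a soft preference),
    # rather than enforcing a hard constraint on them.
    return data_cost + lam * sum(w * w for w in weights)
```
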
Parameter Optimization
The adaptive parameters need to be tuned, and the usual route for the parameter adjustment is via iterative, derivative/gradient-based learning. One widely used algorithm for the optimization is stochastic gradient descent.
A key part is computing the gradient vector ∇_Θ C(x, y, Θ), i.e. computing ∂C(x, y, Θ)/∂θ_i for each index i, where the full set of (adaptive) parameters is Θ = {θ_i}_{i=1}^I.
Parameter update (with learning rate η): θ_i ← θ_i − η ∂C(x, y, Θ)/∂θ_i.
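The update rule itself is one line per parameter; a minimal sketch (parameters and gradients flattened into lists for simplicity):

```python
def sgd_step(params, grads, lr=0.1):
    # theta_i <- theta_i - eta * dC/dtheta_i, applied elementwise.
    # In stochastic gradient descent, grads is computed on a single
    # example (or a small minibatch) rather than the full training set.
    return [theta - lr * g for theta, g in zip(params, grads)]
```
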
Gradient Computation
In feed-forward networks, computing the partial derivatives of an objective function with respect to the parameters is (usually) effectively implemented via the backpropagation algorithm. The different partial derivatives (of the full gradient) have functionally (and hence computationally) shared parts, and the algorithm takes this into account to avoid redundancy in the full gradient computation. In the algorithm, the partial derivatives are computed in a single layer-by-layer pass, proceeding from the output layer to the input layer, computing and then distributing the common parts along the way.
Software such as Theano (see e.g. Theano Development Team, 2016; Bergstra et al., 2010) allows for automatic differentiation, building on the same underlying techniques for effectiveness. This can bring e.g. prototyping-speed benefits (including gradient correctness checking).
Numerical estimates based on the so-called central-differences approach can (also) be used for gradient checking (Bishop, 1995).
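The central-differences check is easy to sketch: perturb each parameter by ±ε and difference the cost (here `f` is any scalar cost as a function of a parameter list; names are illustrative):

```python
def numeric_grad(f, theta, eps=1e-5):
    # Central differences:
    #   dC/dtheta_i ≈ (f(theta + eps*e_i) - f(theta - eps*e_i)) / (2*eps)
    # Error is O(eps^2), versus O(eps) for a one-sided difference.
    grad = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad
```

This costs two cost evaluations per parameter, so it is far too slow for training; it is only used to verify a backpropagation implementation on a small model.
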
Gradient Computation Example (Credits: Tapani Raiko)
Two-hidden-layer network: x, h^(1), h^(2), and y are the sets of unit values of the input layer, the first hidden layer, the second hidden layer, and the output layer, respectively. Connection weights θ^(1) (x → h^(1)), θ^(2) (h^(1) → h^(2)), and θ^(3) (h^(2) → y), together forming the parameters Θ.
Partial derivatives in the network:
∂C/∂θ^(3)_ij = (∂C/∂y_i) (∂y_i/∂θ^(3)_ij)
∂C/∂θ^(2)_jk = Σ_i (∂C/∂y_i) (∂y_i/∂h^(2)_j) (∂h^(2)_j/∂θ^(2)_jk)
∂C/∂θ^(1)_kl = Σ_{i,j} (∂C/∂y_i) (∂y_i/∂h^(2)_j) (∂h^(2)_j/∂h^(1)_k) (∂h^(1)_k/∂θ^(1)_kl)
Backpropagation (Credits: Tapani Raiko)
Dynamic programming avoids exponential complexity. Store the intermediate results
∂C/∂h^(2)_j = Σ_i (∂C/∂y_i) (∂y_i/∂h^(2)_j)
∂C/∂h^(1)_k = Σ_j (∂C/∂h^(2)_j) (∂h^(2)_j/∂h^(1)_k)
to get, for all layers L, simply
∂C/∂θ^(L)_ij = (∂C/∂h^(L)_i) (∂h^(L)_i/∂θ^(L)_ij)
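A minimal sketch of this scheme in plain Python, under assumptions not fixed by the slides: tanh hidden units, a linear output, squared-error cost, and biases omitted. The reused `delta` plays the role of the stored intermediate results above:

```python
import math

def forward(x, Ws):
    # Fully connected layers: z^(l) = W^(l) h^(l-1); tanh on hidden
    # layers, linear on the last. Returns all layer activations.
    acts = [x]
    for l, W in enumerate(Ws):
        z = [sum(w * a for w, a in zip(row, acts[-1])) for row in W]
        acts.append(z if l == len(Ws) - 1 else [math.tanh(v) for v in z])
    return acts

def backprop(acts, target, Ws):
    # Squared-error cost C = 0.5 * sum_i (y_i - t_i)^2.
    # delta holds dC/dz for the current layer and is reused for the
    # layer below: the dynamic-programming step that avoids the
    # exponential blow-up of expanding every chain-rule path.
    delta = [y - t for y, t in zip(acts[-1], target)]  # dC/dy (= dC/dz, linear output)
    grads = []
    for l in range(len(Ws) - 1, -1, -1):
        # dC/dW^(l)_ij = delta_i * h^(l-1)_j
        grads.append([[d * a for a in acts[l]] for d in delta])
        if l > 0:
            # Push delta through W^(l) and the tanh derivative (1 - h^2).
            delta = [(1 - acts[l][k] ** 2)
                     * sum(Ws[l][i][k] * delta[i] for i in range(len(delta)))
                     for k in range(len(acts[l]))]
    return list(reversed(grads))
```

Note the single output-to-input pass: each layer's gradient uses only the `delta` computed at the layer above, exactly as in the stored-intermediate-results scheme.
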
Home Exercises
Read any parts of Chapter 6 not read yet.
Derive the gradient computation (computing the partial derivatives of an objective function w.r.t. the model parameters) using the back-propagation algorithm, assuming one of the two following objective functions:
C_i(x, y; Θ) = −Σ_{n=1}^N log p_i(y_n | x_n, Θ), i ∈ [1, 2]
Alternative 1: p_1(y | x, Θ) = Normal(y; FFnet_1(x; Θ), 1), a univariate normal distribution with variance 1 and mean FFnet_1(x; Θ), where FFnet_1(x; Θ) denotes a fully-connected two-hidden-layer feed-forward network mapping from a two-dimensional continuous-valued x onto a continuous scalar; the network has parameters Θ.
Alternative 2: p_2(y | x, Θ) = Bernoulli(y; sigmoid(FFnet_2(x; Θ))), a Bernoulli distribution with activation (success) probability sigmoid(FFnet_2(x; Θ)), where FFnet_2(x; Θ) denotes a fully-connected single-hidden-layer feed-forward network mapping from a two-dimensional continuous-valued x onto a continuous scalar; the network has parameters Θ.
Choose the hidden unit activation functions, but assume they are non-linear. Choose also the number of hidden units, but have at least three of them per layer. The derivation needs to be part of your first report.
Home Exercises
Next time we have a session using Theano, where you will be, among other things, implementing gradient computation, with and without automatic (symbolic) differentiation. Before the session, (in addition to the gradient derivation) also familiarize yourself with Theano:
Follow the tutorial at http://deeplearning.net/software/theano/tutorial/index.html, considering (at least) the following parts: Prerequisites, Basics: Baby Steps - Algebra, and Basics: More Examples.
We are expecting to go through some parts of the Deep Learning Summer School, Montreal 2016 "Introduction to Theano" tutorial by Pascal Lamblin, available via VideoLectures.NET at http://videolectures.net/deeplearning2016_lamblin_theano/. View the presentation and come to the session with any questions.
References
J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: A CPU and GPU Math Expression Compiler. In Proc., Python for Scientific Computing Conference (SciPy), 2010.
C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2016. URL: http://www.deeplearningbook.org.
G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In Proc., International Conference on Learning Representations (ICLR), 2014 (arXiv:1312.6114v10 [stat.ML]).
Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436-444, May 2015.
D. J. C. MacKay. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A, 354(1):73-80, 1995.
Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.