DEEP LEARNING AND ITS APPLICATION NEURAL NETWORK BASICS

Argument on AI 1. Symbolism 2. Connectionism 3. Actionism Kai Yu. SJTU Deep Learning Lecture. 2

Argument on AI: 1. Symbolism
Origin: mathematical logic
Cognitive element: symbol
Core of AI: knowledge and knowledge-based theoretical systems
Based on: the hypothesis of a symbol operation system and the principle of limited (bounded) rationality
Representatives: Newell, Shaw, Simon and Nilsson

Argument on AI: 2. Connectionism
Origin: bionics
Cognitive element: neuron
Core of AI: the working mode of the brain
Based on: neural networks, and the connection mechanisms and learning algorithms between them
Representatives: McCulloch, Pitts, Hopfield and Rumelhart

Argument on AI: 3. Actionism
Origin: mathematical logic
Cognitive element: perception and actions
Core of AI: perception-action working mode
Based on: cybernetics and the control principle of perception-action
Representatives: Wiener, Brooks

Biological Neuron
About 10 billion neurons in the human brain
Summation of input stimuli
  Spatial (signals)
  Temporal (pulses)
Threshold over composed inputs
Constant firing strength
About 1,000,000 billion synapses in the human brain
Chemical transmission and modulation of signals
  Inhibitory synapses
  Excitatory synapses

Biological Neural Networks
About 100,000 synapses per neuron
Computational power = connectivity
Plasticity
  New connections are formed
  The strength of existing connections is modified

Neural Dynamics
[Figure: membrane potential (mV) versus time (ms), marking rest, activation, action potential and refractory time]
Action potential: 100 mV
Threshold potential: -20 to -30 mV
Rest potential: -65 mV
Spike time: 1-2 ms
Refractory time: 10-20 ms

Connectionist Model

What is an Artificial Neural Network?
An artificial neural network is a network of many simple processors (neurons, units)
Units are linked by connections
Each connection has a weight associated with it
Units operate only locally, on their weights and the inputs received through connections

What is an Artificial Neural Network?
An ANN is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects:
Knowledge is acquired by the network from its environment through a learning process
Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge

First Generation NN
1943, McCulloch and Pitts developed basic models of neurons.
Perceptron with no hidden layer.

First Generation NN
1948, Wiener: cybernetics
1949, Hebb: learning rule (Hebb's rule)
1958, Rosenblatt: perceptron model and perceptron convergence algorithm
1960, Widrow-Hoff: least mean square algorithm
1969, Minsky-Papert: limitations of the perceptron (it cannot solve problems that are not linearly separable)

Second Generation NN
Multi-Layer Perceptron (MLP) (1980s-1990s).
Back-Propagation.

Second Generation NN
1980s, Stephen Grossberg: adaptive resonance theory
1982, Hopfield: energy function, recurrent network model
1982, Kohonen: self-organizing maps
1986, Rumelhart, Hinton et al.: back-propagation
1990s: Decline
  Requires experience and skills
  Easy to over-train or to be trapped in a local optimum
  Hard to go deep

Renaissance of NN
2006, Geoffrey Hinton invented Deep Belief Networks (DBN) to allow fast and effective deep neural network learning.
Pre-train each layer from the bottom up
Each pair of layers is a Restricted Boltzmann Machine (RBM)
Jointly fine-tune all layers using back-propagation

Perceptron: The base for ANN
Input variable:
Output variable:
Weights:

Activation Functions
Hard limit (step function)

Decision Surface of a Perceptron
A perceptron represents a decision surface in a d-dimensional space as a hyper-plane
It works only for sets of examples that are linearly separable
Many Boolean functions can be represented by a perceptron: AND, OR, NAND, NOR

Example
[Figure: decision boundaries for AND, NAND, OR and NOR. Solid circle: 1. Hollow circle: 0.]
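The four Boolean functions named above can each be realised by a single hard-limit perceptron. The sketch below is illustrative only: the function names and the particular weight and bias values are assumptions, and many other values would work equally well.

```python
def perceptron(weights, bias, inputs):
    """Hard-limit perceptron: returns 1 if w.x + b > 0, otherwise 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0

# Hypothetical (weights, bias) pairs, one separating hyper-plane per gate.
GATES = {
    "AND":  ([1.0, 1.0], -1.5),
    "OR":   ([1.0, 1.0], -0.5),
    "NAND": ([-1.0, -1.0], 1.5),
    "NOR":  ([-1.0, -1.0], 0.5),
}

def truth_table(name):
    """Outputs for the inputs (0,0), (0,1), (1,0), (1,1), in that order."""
    weights, bias = GATES[name]
    return [perceptron(weights, bias, (a, b)) for a in (0, 1) for b in (0, 1)]

for name in GATES:
    print(name, truth_table(name))
```

Each gate differs only in where its hyper-plane sits and which side counts as 1.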

Example

Error Gradient Descent
Given a loss function $E(X, t, w)$
Ideal approach: closed-form solution of $\nabla_w E(X, t, w) = 0$; solving for $w$ will be troublesome if not impossible.
Practical approach: gradient descent
Start at some value of the weights
Update the weights iteratively using $w^{n} = w^{n-1} - \eta \nabla_w E(X, t, w)$, where $\eta$ is the learning rate
If there is only one local optimum, GD is guaranteed to converge to it, provided the learning rate is small enough

Gradient Descent Example
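As a minimal sketch of the iterative update, consider a one-dimensional loss chosen purely for illustration, E(w) = (w - 3)^2, whose single optimum is at w = 3. The function names, starting point, learning rate and step count are all assumptions.

```python
def gradient_descent(grad, w_init, learning_rate=0.1, steps=100):
    """Repeatedly apply w <- w - learning_rate * grad(w)."""
    w = w_init
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

def grad_E(w):
    """Gradient of the illustrative loss E(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)

w_star = gradient_descent(grad_E, w_init=0.0)
print(w_star)  # close to 3.0
```

With this loss every step shrinks the distance to the optimum by a factor of 0.8, so the iterate approaches 3 geometrically.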

Stochastic Gradient Descent
Gradient descent (one update with all data):
$w^{n} = w^{n-1} - \eta \nabla_w E(X, t, w) = w^{n-1} - \eta \sum_{m=1}^{M} \nabla_w E(x_m, t_m, w)$
Stochastic gradient descent (one update with a randomly selected single data sample, $m \in \{1, \dots, M\}$):
$w^{n} = w^{n-1} - \eta \nabla_w E(x_m, t_m, w)$
SGD is much faster than GD
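The difference between the two update rules can be sketched on a toy problem. Everything below is an assumption made for illustration: the data (generated by t = 2x), the per-sample loss E(x, t, w) = 0.5 (w x - t)^2, the learning rate, and the function names.

```python
import random

# Hypothetical toy data generated by t = 2 * x (no noise).
DATA = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

def grad_single(w, x, t):
    """Gradient of E(x, t, w) = 0.5 * (w * x - t)**2 with respect to w."""
    return (w * x - t) * x

def batch_gd(w, eta, updates):
    """Each update sums the gradient over all M samples."""
    for _ in range(updates):
        w -= eta * sum(grad_single(w, x, t) for x, t in DATA)
    return w

def sgd(w, eta, updates, seed=0):
    """Each update uses one randomly selected sample."""
    rng = random.Random(seed)
    for _ in range(updates):
        x, t = rng.choice(DATA)
        w -= eta * grad_single(w, x, t)
    return w

w_gd = batch_gd(0.0, eta=0.1, updates=50)
w_sgd = sgd(0.0, eta=0.1, updates=200)
print(w_gd, w_sgd)  # both approach 2.0
```

A single SGD update touches one sample instead of all M, which is where the speed advantage per update comes from; the price is a noisier path towards the optimum.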

Perceptron Algorithm
Consider a perceptron with n inputs: $x \in \mathbb{R}^n$ (vector input) and n+1 weights: $w \in \mathbb{R}^n$ and $w_0$
For a linearly separable data set $X = \{(x_1, t_1), \dots, (x_M, t_M)\}$ with $x_m \in \mathbb{R}^n$, $t_m \in \{+1, -1\}$
How can we find $w$ and $w_0$ under the criterion:
$E(X, t, w) = \sum_{m=1}^{M} E(x_m, t_m, w) = -\sum_{m \in N_{err}} t_m (w^T x_m + w_0)$
where $N_{err}$ is the set of misclassified samples

Perceptron Convergence Theorem
Stochastic gradient descent:
$E(x_m, t_m, w) = -t_m (w^T x_m + w_0)$, $m \in N_{err}$
$w^{n} = w^{n-1} - \eta \nabla_w E(x_m, t_m, w) = w^{n-1} + \eta t_m x_m$
$w_0^{n} = w_0^{n-1} - \eta \nabla_{w_0} E(x_m, t_m, w) = w_0^{n-1} + \eta t_m$
If the training data is linearly separable, the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.

Perceptron Algorithm
1. Initialize weights and learning rate
2. Compute perceptron outputs
3. Outputs == targets?
   No: apply SGD to update weights, then go back to step 2
   Yes: output learned weights

Perceptron example
Training set:
Initial weights:
Learning rate is set to 1.

Perceptron example
First iteration:

Perceptron example
Second iteration:

Perceptron example
Check:
Output: W, b
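The loop of computing outputs, checking them against the targets and updating on mistakes can be sketched as follows. The training set here is hypothetical (the AND function with targets in {+1, -1}), the weights start at zero, and the samples are visited in a fixed order rather than drawn at random; all names are illustrative.

```python
def train_perceptron(samples, eta=1.0, max_epochs=100):
    """samples: list of (x, t) pairs with x a tuple of numbers and t in {+1, -1}."""
    n = len(samples[0][0])
    w, w0 = [0.0] * n, 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            a = sum(wi * xi for wi, xi in zip(w, x)) + w0
            if t * a <= 0:  # misclassified, or exactly on the boundary
                w = [wi + eta * t * xi for wi, xi in zip(w, x)]
                w0 += eta * t
                errors += 1
        if errors == 0:  # every output matches its target
            return w, w0
    raise RuntimeError("no convergence; the data may not be linearly separable")

# Hypothetical training set: AND with targets in {+1, -1}.
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]
w, w0 = train_perceptron(AND_DATA)
print(w, w0)
```

The update inside the loop is the rule w <- w + eta * t * x, w0 <- w0 + eta * t, applied only to misclassified samples, with eta = 1 as in the example above.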

Example: XOR?
There is no way to get a solution with a perceptron: the constraints on the weights are self-contradictory!
Remedy: more non-linear layers
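The contradiction can be written out explicitly. The derivation below assumes a hard-limit perceptron that outputs 1 when $w_1 x_1 + w_2 x_2 + w_0 > 0$ and 0 otherwise; the symbols are chosen here for illustration.

```latex
\begin{aligned}
(x_1, x_2) = (0, 0) \mapsto 0 &: \quad w_0 \le 0 \\
(x_1, x_2) = (1, 0) \mapsto 1 &: \quad w_1 + w_0 > 0 \\
(x_1, x_2) = (0, 1) \mapsto 1 &: \quad w_2 + w_0 > 0 \\
(x_1, x_2) = (1, 1) \mapsto 0 &: \quad w_1 + w_2 + w_0 \le 0 \\[4pt]
\text{Adding the two middle lines:} &\quad w_1 + w_2 + 2 w_0 > 0
  \;\Rightarrow\; w_1 + w_2 + w_0 > -w_0 \ge 0
\end{aligned}
```

The last implication contradicts the fourth constraint, so no choice of $(w_1, w_2, w_0)$ satisfies all four.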

Hidden Units: Multi-Layer NN
Multi-Layer Perceptron (MLP)

Expressive Capabilities of NNs
Boolean functions:
  Every Boolean function can be represented by a network with a single hidden layer
  But it might require a number of hidden units that is exponential in the number of inputs
Continuous functions:
  Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al 1989]
  Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

Expressive Capabilities of NNs
Rough proof for Boolean functions: how to construct such a 2-layer MLP
x1 x2 x3 | y
0  0  1  | 1
0  1  0  | 1
1  1  1  | 1
...      | 0
Selector cells (hidden layer, inputs x1 x2 x3): w11=[0,0,1], w12=?, w13=?
OR cell (output): w21=[1,1,1]
y = [(NOT x1) AND (NOT x2) AND (x3)] OR [(NOT x1) AND (x2) AND (NOT x3)] OR [(x1) AND (x2) AND (x3)]
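One way to realise this construction is sketched below: one hidden selector cell per input pattern that should produce 1, followed by an OR cell. The selector weights used here (+1 where the pattern has a 1, -1 where it has a 0, plus a bias) and the OR-cell bias of -0.5 are assumptions for illustration, not necessarily the values intended on the slide.

```python
from itertools import product

def step(a):
    """Hard-limit activation."""
    return 1 if a > 0 else 0

def selector(pattern):
    """Hidden unit that fires only when the input equals `pattern` (tuple of 0/1)."""
    weights = [1 if bit else -1 for bit in pattern]
    bias = 0.5 - sum(pattern)
    return lambda x: step(sum(w * xi for w, xi in zip(weights, x)) + bias)

def build_network(true_patterns):
    """2-layer network: one selector per true row, then an OR cell."""
    hidden = [selector(p) for p in true_patterns]
    # OR cell: all-ones weights with an assumed bias of -0.5.
    return lambda x: step(sum(h(x) for h in hidden) - 0.5)

# Rows of the truth table for which y = 1.
TRUE_ROWS = [(0, 0, 1), (0, 1, 0), (1, 1, 1)]
y = build_network(TRUE_ROWS)
for x in product((0, 1), repeat=3):
    print(x, y(x))
```

Because a function of n inputs can have up to 2^n true rows, this construction can need exponentially many hidden units.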

Non-linear Activation Function (Hidden Layer)
Sigmoid
Hyperbolic tangent (tanh)
When the activation function is non-linear, a two-layer neural network can be proven to be a universal function approximator.
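For reference, the two activation functions and their derivatives (which back-propagation needs) can be sketched as below. The function names are illustrative.

```python
import math

def sigmoid(a):
    """Logistic sigmoid: squashes a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(a)
    return s * (1.0 - s)

def tanh_prime(a):
    """Derivative of tanh, expressed through its own output."""
    return 1.0 - math.tanh(a) ** 2

for a in (-2.0, 0.0, 2.0):
    print(a, sigmoid(a), math.tanh(a), sigmoid_prime(a), tanh_prime(a))
```

The two functions are closely related: tanh(a) = 2 * sigmoid(2a) - 1, so tanh is a rescaled sigmoid with outputs in (-1, 1).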

Output function: Logistic Regression
The structure of logistic regression: post-processing outside the NN
$x^{(i)} \in \mathbb{R}^n$, $y^{(i)} \in \{0, 1\}$
The logistic (sigmoid) function denotes the confidence of the target class

Output function: Softmax
Training data
The softmax function constructs a probability distribution over a K-dimensional output (K classes)

NN Output Function Summary
Linear output: j is the input feature index, i is the output class index
Logistic regression: denotes the confidence of each class
Softmax: denotes the probability of each class
When K=2, softmax reduces to logistic regression applied to the difference of the two class scores
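A minimal sketch of the softmax output, together with a numerical check of the two-class case. The scores are arbitrary illustrative numbers, and subtracting the maximum score is a common stability trick rather than something stated above.

```python
import math

def softmax(scores):
    """Map K real-valued scores to a probability distribution over K classes."""
    m = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

probs = softmax([2.0, 1.0, 0.1])      # three classes; probabilities sum to 1
two_class = softmax([1.3, -0.4])      # K = 2
print(probs, sum(probs))
print(two_class[0], sigmoid(1.3 - (-0.4)))  # the two values agree
```

For two classes, the probability of the first class depends only on the score difference, which is exactly the logistic function of that difference.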

NN Loss Function (Criterion for Parameter Estimation)
Consider M output targets, N data samples
Sum of square error:
Cross-entropy:
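One common way to write these two criteria is given below. The notation is an assumption made here: $y_{nm}$ is the network output and $t_{nm}$ the target for output $m$ of sample $n$, and the factor $\tfrac{1}{2}$ is a convention that simplifies the derivative.

```latex
E_{\mathrm{SSE}} = \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{M} \left( y_{nm} - t_{nm} \right)^2
\qquad
E_{\mathrm{CE}} = - \sum_{n=1}^{N} \sum_{m=1}^{M} t_{nm} \ln y_{nm}
```

For the cross-entropy form, the targets $t_{nm}$ are typically one-hot vectors and the outputs $y_{nm}$ are class probabilities, for example from a softmax.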

Matching Output Function with Loss Function
Regression
  Output: linear
  Loss function: sum of square error
Binary classification
  Output: logistic (sigmoid) / softmax
  Loss function: cross-entropy
Multi-class classification
  Output: softmax
  Loss function: cross-entropy
Q: For binary classification, what is the loss function for logistic and for softmax output, respectively?

Error Back-Propagation for Multi-Layer NN
While numerical gradient computation can be used to estimate the gradient and thereby adjust the weights of the neural net, doing so is not very efficient.
A more efficient, if slightly more confusing, method of computing the gradient is back-propagation.
Back-propagation (BP) is the most widely used parameter update approach for multi-layer neural networks

Back-propagation Algorithm(1)
Review of multi-layer neural networks
The feed-forward operation is a chain of function calculations

Back-propagation Algorithm(2)
Loss function example: square error
NN example: a simple one-layer linear model:
So the derivative of the loss function (single sample) is:

Back-propagation Algorithm(3)
General unit activation in a multilayer network:
Activation function
Forward propagation: calculate the activation and the output of each unit (input/output of the hidden layer)
The loss L depends on a weight only through the activation of the unit it feeds into:
Error signal

Back-propagation Algorithm(4)
Output unit with linear output function:
Hidden unit that sends inputs to downstream units: check all nodes connected to it and apply the chain rule
Update weights (learning rate $\eta$):
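The quantities referred to above can be written in a standard textbook notation. The symbols below are assumptions introduced here: $a_j$ is the activation of unit $j$, $z_j = h(a_j)$ its output, $w_{ji}$ the weight from unit $i$ to unit $j$, $\delta_j$ the error signal, and the loss for one sample is taken to be $L = \tfrac{1}{2} \sum_k (y_k - t_k)^2$.

```latex
\begin{aligned}
\text{Forward propagation:} \quad & a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j) \\
\text{Error signal:} \quad & \delta_j \equiv \frac{\partial L}{\partial a_j}, \qquad
  \frac{\partial L}{\partial w_{ji}} = \frac{\partial L}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i \\
\text{Output unit (linear output):} \quad & \delta_k = y_k - t_k \\
\text{Hidden unit (chain rule over its successors } k\text{):} \quad & \delta_j = h'(a_j) \sum_k w_{kj} \delta_k \\
\text{Weight update:} \quad & w_{ji} \leftarrow w_{ji} - \eta \, \delta_j z_i
\end{aligned}
```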

Back-propagation Algorithm(5)
The BP algorithm for a multi-layer NN can be decomposed into the following four steps:
I. Feed-forward computation
II. Back-propagation to the output layer
III. Back-propagation to the hidden layer
IV. Weight updates
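The four steps can be sketched for a small network with one tanh hidden layer and a single linear output, trained on one sample with the squared error L = 0.5 (y - t)^2. The architecture, the parameter values, the sample and all function names are assumptions chosen for illustration.

```python
import math

def forward(x, W1, b1, W2, b2):
    """I. Feed-forward: tanh hidden layer, single linear output."""
    a = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    z = [math.tanh(aj) for aj in a]
    y = sum(w * zj for w, zj in zip(W2, z)) + b2
    return z, y

def backprop(x, t, W1, b1, W2, b2):
    """Gradients of L = 0.5 * (y - t)**2 for one sample."""
    z, y = forward(x, W1, b1, W2, b2)
    delta_out = y - t                                  # II. output layer
    delta_hid = [(1.0 - zj ** 2) * w * delta_out       # III. hidden layer
                 for zj, w in zip(z, W2)]
    gW2 = [delta_out * zj for zj in z]
    gb2 = delta_out
    gW1 = [[dj * xi for xi in x] for dj in delta_hid]
    gb1 = list(delta_hid)
    return gW1, gb1, gW2, gb2

def sgd_step(x, t, W1, b1, W2, b2, eta=0.1):
    """IV. Weight update; returns the new parameters."""
    gW1, gb1, gW2, gb2 = backprop(x, t, W1, b1, W2, b2)
    W1 = [[w - eta * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    b1 = [b - eta * g for b, g in zip(b1, gb1)]
    W2 = [w - eta * g for w, g in zip(W2, gW2)]
    b2 = b2 - eta * gb2
    return W1, b1, W2, b2

# Hypothetical parameters and a single training sample.
W1 = [[0.1, -0.2], [0.4, 0.3]]
b1 = [0.0, 0.1]
W2 = [0.5, -0.6]
b2 = 0.05
x, t = [1.0, 2.0], 0.7

def loss(W1, b1, W2, b2):
    return 0.5 * (forward(x, W1, b1, W2, b2)[1] - t) ** 2

before = loss(W1, b1, W2, b2)
after = loss(*sgd_step(x, t, W1, b1, W2, b2))
print(before, after)  # the loss on this sample decreases after one update
```

A useful sanity check for any such implementation is to compare the back-propagated gradient of one weight against a finite-difference estimate of the same derivative.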

Example of BP: Sigmoid
Consider a 2-dimensional neuron (inputs x and weights w) that uses the sigmoid activation function.
Differentiate sub-functions in the expression.

Example of BP: Sigmoid
The inputs are [x0, x1] and the (learnable) weights are [w0, w1, w2]. The forward pass computes values from inputs to output (green). The backward pass then performs back-propagation to compute the gradients (red).
Forward values: -1.00 = 1.00 * (-1); 1.37 = 0.37 + 1
Backward gradients: -0.53 = 1.00 * (-1/1.37^2); -0.20 = -0.53 * exp(-1)
If learning rate = 1, updated weights: w0 = 1.8, w1 = -3.39, w2 = -2.8
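The numbers above can be reproduced with a short script. The initial values are an assumption: w = [2, -3, -3] and x = [-1, -2], as in the widely used CS231n sigmoid-neuron example, which matches the intermediate values 0.37, 1.37, -0.53 and -0.20.

```python
import math

# Assumed initial values (not stated in the text above).
w = [2.0, -3.0, -3.0]   # w0, w1, w2 (w2 acts as the bias)
x = [-1.0, -2.0]        # x0, x1

# Forward pass.
dot = w[0] * x[0] + w[1] * x[1] + w[2]        # 1.0
f = 1.0 / (1.0 + math.exp(-dot))              # sigmoid(1.0), about 0.73

# Backward pass: the sigmoid's derivative is f * (1 - f).
ddot = f * (1.0 - f)                          # about 0.20
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]   # about [-0.20, -0.39, 0.20]
dx = [w[0] * ddot, w[1] * ddot]               # about [0.39, -0.59]

learning_rate = 1.0
w_plus = [wi + learning_rate * gi for wi, gi in zip(w, dw)]
w_minus = [wi - learning_rate * gi for wi, gi in zip(w, dw)]
print([round(v, 2) for v in w_plus])
print([round(v, 2) for v in w_minus])
```

Under these assumptions, adding the gradient (`w_plus`) gives the updated weights quoted above, which is a step that increases the neuron's output; subtracting it (`w_minus`) would be the step that decreases the output.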

Patterns in Backward Propagation
The add gate distributes the gradient equally to all of its inputs.
The max gate routes the gradient unchanged to exactly one of its inputs: the one with the highest value in the forward pass.
The multiply gate: its local gradients are the switched input values, which are multiplied by the gradient on its output during the chain rule.
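The three patterns can be seen on a tiny hand-written circuit. The circuit f = (x + y) * max(z, u) and its input values are hypothetical, chosen only so that each gate type appears once.

```python
# Hypothetical inputs for the circuit f = (x + y) * max(z, u).
x, y, z, u = 3.0, -4.0, 2.0, -1.0

# Forward pass.
s = x + y            # add gate      -> -1.0
m = max(z, u)        # max gate      ->  2.0
f = s * m            # multiply gate -> -2.0

# Backward pass, starting from df/df = 1.
df = 1.0
ds = m * df          # multiply gate: the gradient on s is the other input, m
dm = s * df          # multiply gate: the gradient on m is the other input, s
dx = ds              # add gate: passes the gradient on unchanged ...
dy = ds              # ... to both of its inputs
dz = dm if z >= u else 0.0   # max gate: only the larger input gets the gradient
du = dm if u > z else 0.0

print(dx, dy, dz, du)
```

Here z wins the max, so it receives the whole gradient and u receives none.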

Computational Efficiency
The back-propagation algorithm is computationally more efficient than standard numerical differentiation.
Suppose that $W$ is the total number of weights and biases in the network.
Back-propagation: the evaluation is $O(W)$ for large $W$, as there are many more weights than units.
Standard approach: perturb each weight in turn, and forward propagate to compute the change in the loss. This requires $W$ forward propagations of $O(W)$ each, so the total complexity is $O(W^2)$.

Application: Classify points
Task: use a 2-layer MLP to classify 3 classes of 2-dimensional points

Application: Classify points
Structure of NN
Training
  Number of epochs
  Learning rate
DEMO

Application: Approximate y=sin(x)
Task: use a 2-layer MLP to approximate y = sin(x)

Application: Approximate y=sin(x)
Structure of NN
Training
  Number of epochs
  Learning rate
DEMO
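A rough stand-in for such a demo, in plain Python, is sketched below. Every setting is an assumption rather than the configuration used in the lecture: 8 tanh hidden units, a linear output, 40 training points on [-pi, pi], per-sample SGD on the squared error, 500 epochs and a learning rate of 0.01.

```python
import math
import random

def train_sin_mlp(hidden=8, epochs=500, eta=0.01, seed=0):
    """Train a 1-hidden-layer MLP on y = sin(x); returns (predict, mse_before, mse_after)."""
    rng = random.Random(seed)
    xs = [-math.pi + 2.0 * math.pi * i / 39 for i in range(40)]
    data = [(x, math.sin(x)) for x in xs]
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]   # input -> hidden weights
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]   # hidden biases
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]   # hidden -> output weights
    b2 = 0.0                                           # output bias

    def predict(x):
        return sum(w2[j] * math.tanh(w1[j] * x + b1[j]) for j in range(hidden)) + b2

    def mse():
        return sum((predict(x) - t) ** 2 for x, t in data) / len(data)

    mse_before = mse()
    for _ in range(epochs):
        rng.shuffle(data)
        for x, t in data:
            z = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]  # feed-forward
            y = sum(w2[j] * z[j] for j in range(hidden)) + b2
            d_out = y - t                                   # output error signal
            for j in range(hidden):
                d_hid = (1.0 - z[j] ** 2) * w2[j] * d_out   # hidden error signal
                w2[j] -= eta * d_out * z[j]
                w1[j] -= eta * d_hid * x
                b1[j] -= eta * d_hid
            b2 -= eta * d_out
    return predict, mse_before, mse()

predict, mse_before, mse_after = train_sin_mlp()
print(mse_before, mse_after)
print(predict(math.pi / 2), math.sin(math.pi / 2))
```

The number of epochs and the learning rate are the two training settings to experiment with; a larger hidden layer or more epochs would be the natural next things to try if the fit is poor.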

Types of NNs
DNN (deep neural networks)
A fancy playground: http://playground.tensorflow.org/

Types of NNs
CNN (convolutional neural networks)

Types of NNs
RNN (recurrent neural networks)

DL Assignments
https://github.com/caodi0207/deep-learning-Course-2017
New assignments will be uploaded to this repo.
Assignment submission:
File name pattern: StudentID-YourName-AssignmentID.zip, e.g. 12345-小明-as2.zip.
Upload your zip file to ftp://202.120.38.125 (public account: dl2016/dl2016).
Be careful to upload to the corresponding folder.


DL Assignments: Discussion and Q&A
If you encounter any trouble or find any bugs, feel free to discuss them and help others in those issues.
Contact TAs: caodi0207@sjtu.edu.cn (Cao Di, 曹迪)

dl_assignment1
Softmax
Two-layer MLP