
Evolution of Neural Networks October 20, 2017

Timeline: 1957, Single Layer Perceptron (Frank Rosenblatt).

Single Layer Perceptron: The perceptron was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. The New York Times reported the perceptron to be «the embryo of an electronic computer that will be able to walk, talk, see, write, reproduce itself and be conscious of its existence». Reference: Rosenblatt, Frank (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Cornell Aeronautical Laboratory, Psychological Review, 65(6), pp. 386-408. doi:10.1037/h0042519. Image Source: http://sebastianraschka.com/articles/2015_singlelayer_neurons.html

Single Layer Perceptron: A supervised learning algorithm for binary classification; like regression, it is a linear model. It mimics how a single neuron in the brain works: it either fires or not. How does it work? It receives multiple input signals and sums them; if the sum exceeds a certain threshold, it returns a signal. The aim of the perceptron algorithm is to draw a linear decision boundary. Image Source: http://sebastianraschka.com/articles/2015_singlelayer_neurons.html
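The fire-or-not rule and the search for a linear decision boundary can be sketched in a few lines of Python. This is a minimal illustration, not Rosenblatt's original formulation: the function names, the learning rate, the number of epochs, and the choice of the logical AND as a toy dataset are all assumptions made for the example.

```python
def predict(weights, bias, x):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, labels, lr=1.0, epochs=20):
    """Perceptron learning rule: nudge weights and bias after every misclassified sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Toy dataset (assumption): logical AND, which is linearly separable.
AND_INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND_LABELS = [0, 0, 0, 1]

weights, bias = train_perceptron(AND_INPUTS, AND_LABELS)
predictions = [predict(weights, bias, x) for x in AND_INPUTS]
print(predictions)
```

The learned weights and bias define a straight line in the input plane; points on one side are classified as 1 and points on the other side as 0.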

Timeline: 1957, Single Layer Perceptron (Frank Rosenblatt). 1969, Minsky and Papert: «incapable of usefully representing or approximating functions outside a very narrow and special class».

Minsky M. L. and Papert S. A. 1969. Perceptrons. Cambridge, MA: MIT Press. Image Source: https://pmirla.github.io/2016/08/16/ai-winter.html

Timeline: 1957, Single Layer Perceptron (Frank Rosenblatt). 1969, Minsky and Papert: «incapable of usefully representing or approximating functions outside a very narrow and special class». 1971, Rosenblatt died; AI Winter.

Timeline: 1957, Single Layer Perceptron (Frank Rosenblatt). 1969, Minsky and Papert: «incapable of usefully representing or approximating functions outside a very narrow and special class». 1971, Rosenblatt died; AI Winter. 1982, the Canadian Institute for Advanced Research (CIFAR) is founded.

Who ended the AI Winter? CIFAR and Canadian researchers (Geoffrey Hinton, Yann LeCun, Yoshua Bengio) and others. "Perceptrons - Expanded Edition" (Minsky and Papert) was reprinted in 1987, with some errors in the original text shown and corrected. Source: AI Winter. How Canadians contributed to end it?, https://pmirla.github.io/2016/08/16/ai-winter.html

Timeline: 1957, Single Layer Perceptron (Frank Rosenblatt). 1969, Minsky and Papert: «incapable of usefully representing or approximating functions outside a very narrow and special class». 1971, Rosenblatt died; AI Winter. 1982, the Canadian Institute for Advanced Research (CIFAR) is founded. 1982-1990, MultiLayer FeedForward Neural Networks.

Multi-layer Perceptrons: The multi-layer perceptron (Werbos 1974; Rumelhart, McClelland, Hinton 1986) is a feedforward neural network with one or more layers between the input and output layers. Data flows in one direction, from the input layer to the output layer (forward). It is trained with the backpropagation learning algorithm and can solve problems which are not linearly separable. Reference: Rumelhart, David E., Geoffrey E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation". In David E. Rumelhart, James L. McClelland, and the PDP Research Group (editors), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press, 1986. Image Source: https://en.wikipedia.org/wiki/feedforward_neural_network#/media/file:xor_perceptron_net.png
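XOR is the classic problem that is not linearly separable, so a single perceptron cannot solve it, while one hidden layer is enough. The sketch below only shows the forward pass of such a network. The weights are chosen by hand for illustration (an assumption; they are not learned with backpropagation and are not necessarily the ones in the figure), and the step activation and function names are likewise illustrative.

```python
def step(z):
    """Threshold activation: 1 if the pre-activation is positive, else 0."""
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    """Forward pass through a 2-2-1 feedforward network with hand-set weights."""
    # Hidden layer: one unit behaves like OR, the other like AND.
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output layer: fires when OR is on but AND is off, which is XOR.
    return step(1.0 * h_or - 1.0 * h_and - 0.5)


xor_outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)
```

Each hidden unit draws its own linear boundary; the output unit combines the two, which is what a single-layer perceptron cannot do.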

Image Source: http://www.di.unito.it/~cancelli/retineu11_12/fnn.pdf

Timeline: 1957, Single Layer Perceptron (Frank Rosenblatt). 1969, Minsky and Papert: «incapable of usefully representing or approximating functions outside a very narrow and special class». 1971, Rosenblatt died; AI Winter. 1982, the Canadian Institute for Advanced Research (CIFAR) is founded. 1982-1990, MultiLayer FeedForward Neural Networks. 1990, Elman: «no memory, think from scratch every second».

Problems of Multi-layer Feed Forward Networks: Humans don't start thinking from scratch every second; they understand each word based on their understanding of previous words. MLPs can't do this, since they do not have a memory: an MLP cannot use its reasoning about previous events to inform later ones. Source: http://colah.github.io/posts/2015-08-understanding-lstms/ Image Source: https://psychology.iresearchnet.com/social-psychology/social-cognition/memory/

Nature of Recurrent Neural Networks, Jordan (1986): The recurrent connections allow the network's hidden units to see its own previous output, so that subsequent behavior can be shaped by previous responses. These recurrent connections are what give the network memory (Jordan 1986, as explained by Elman, 1990). References: Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211. http://doi.org/10.1016/0364-0213(90)90002-e. Jordan, M. I. (1986). Serial order: A parallel distributed processing approach (Tech. Rep. No. 8604). San Diego: University of California, Institute for Cognitive Science.

Nature of Recurrent Neural Networks, Elman (1990): A context layer is added to the model. At time t, activations in the hidden layer are copied to the context layer on a one-for-one basis. Thus, at time t+1, the context units contain values which are exactly the hidden unit values at time t. These context units are also hidden, in the sense that they interact exclusively with other nodes internal to the network, and not with the outside world. Reference: Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211. http://doi.org/10.1016/0364-0213(90)90002-e.

How RNN works? At time t, the input units receive the first input in the sequence. Both the input units and the context units activate the hidden units. The hidden units then feed forward to activate the output units. The hidden units also feed back to activate the context units. This constitutes the forward activation. Reference: Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211. http://doi.org/10.1016/0364-0213(90)90002-e. Image Source: http://www.lund.irf.se/helioshome/elman.html

How RNN works? If there is learning, the output is compared with a teacher input, and backpropagation of error is used to adjust connection strengths incrementally. At time t+1, the above sequence is repeated; now the context units contain values which are exactly the hidden unit values at time t. These context units thus provide the network with memory. Reference: Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211. http://doi.org/10.1016/0364-0213(90)90002-e. Image Source: http://www.lund.irf.se/helioshome/elman.html
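The forward activation of an Elman-style network can be sketched as follows. This is a simplified illustration under several assumptions: the layer sizes, the weight values, the tanh activation, and all names are made up for the example, and the learning step (comparison with a teacher input and backpropagation) is left out.

```python
import math


def elman_step(x, context, w_in, w_ctx, w_out):
    """One forward activation: input and context units drive the hidden units."""
    hidden = []
    for j in range(len(w_in)):
        total = sum(w * xi for w, xi in zip(w_in[j], x))
        total += sum(w * ci for w, ci in zip(w_ctx[j], context))
        hidden.append(math.tanh(total))
    # Hidden units feed forward to the output units.
    output = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    # Hidden units are copied one-for-one into the context units for time t+1.
    new_context = list(hidden)
    return output, new_context


# Hypothetical weights: 1 input unit, 2 hidden units, 2 context units, 1 output unit.
W_IN = [[0.5], [-0.3]]
W_CTX = [[0.1, 0.2], [0.0, 0.4]]
W_OUT = [[1.0, -1.0]]


def run_sequence(sequence):
    """Feed a sequence one element at a time, carrying the context forward."""
    context = [0.0, 0.0]
    outputs = []
    for x in sequence:
        out, context = elman_step(x, context, W_IN, W_CTX, W_OUT)
        outputs.append(out[0])
    return outputs


# Both sequences end with the same input, but their histories differ.
outputs_a = run_sequence([[1.0], [0.0]])
outputs_b = run_sequence([[0.0], [0.0]])
print(outputs_a, outputs_b)
```

The second output differs between the two runs even though the second input is identical, because the context units still hold the hidden state produced by the first input.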

RNN for Machine Translation Image Source: http://cs224d.stanford.edu/lectures/cs224d-lecture8.pdf

Interim Summary: Traditional neural networks don't have memory; they start thinking from scratch every time. Recurrent neural networks have context layers (loops) that allow information to persist. The hidden layer of an RNN represents all previous history, not just the n-1 previous words, thus the model can theoretically represent long context patterns (Mikolov, 2012).

Timeline: 1957, Single Layer Perceptron (Frank Rosenblatt). 1969, Minsky and Papert: «incapable of usefully representing or approximating functions outside a very narrow and special class». 1971, Rosenblatt died; AI Winter. 1982, the Canadian Institute for Advanced Research (CIFAR) is founded. 1982-1990, MultiLayer FeedForward Neural Networks. 1990, Elman: «no memory, think from scratch every second». 1990-1994, Recurrent Neural Networks. 1994, Bengio et al.: «in practice, RNNs are considered difficult to train due to the so-called vanishing and exploding gradient problems».

Problems with RNNs: An RNN faces an increasingly difficult problem as the duration of the dependencies to be captured increases. The RNN learning algorithm computes the gradient of a cost function with respect to the weights of the network. This gradient sometimes vanishes and sometimes explodes; these are the vanishing and exploding gradient problems. Reference: Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166. Image Source: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/recurrent_neural_networks.html
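A toy calculation shows why the gradient behaves this way. Assume, purely for illustration, a linear recurrent unit with a single scalar weight w, so that each step back in time multiplies the gradient by w; real networks have weight matrices and nonlinearities, but the exponential effect is similar. The function name and the chosen values are hypothetical.

```python
def gradient_through_time(recurrent_weight, steps):
    """Gradient of the state at the last step with respect to the state `steps` earlier,
    for the toy recurrence h_t = recurrent_weight * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_weight  # one multiplication per time step
    return grad


vanishing = gradient_through_time(0.5, 50)  # weight below 1: shrinks toward 0
exploding = gradient_through_time(1.5, 50)  # weight above 1: grows without bound
print(vanishing, exploding)
```

After only 50 steps the first gradient is too small to drive learning and the second is too large to be usable, which is the difficulty described for long dependencies.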

Problems with RNNs. Exploding gradients: a large increase in the norm of the gradient during training. Such events are due to the explosion of the long-term components, which can grow exponentially more than the short-term ones. Solution: clipping the gradient's temporal components when they exceed a fixed threshold in absolute value, or Long Short-Term Memory. Vanishing gradients: the long-term components go exponentially fast to norm 0, making it impossible for the model to learn correlations between temporally distant events. Solution: Long Short-Term Memory. Reference: Pascanu, R., Mikolov, T., & Bengio, Y. (2013, February). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (pp. 1310-1318).
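Gradient clipping can be sketched in two common variants. The first clips each component when its absolute value exceeds a fixed threshold; the second rescales the whole gradient when its norm exceeds the threshold. The function names, the example gradient, and the threshold are assumptions chosen for illustration, not values from the cited paper.

```python
import math


def clip_elementwise(grad, threshold):
    """Clip every component into the interval [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]


def clip_by_norm(grad, threshold):
    """Rescale the gradient so its Euclidean norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]


exploded = [30.0, -40.0]  # hypothetical exploded gradient with norm 50
clipped_values = clip_elementwise(exploded, 5.0)
clipped_norm = clip_by_norm(exploded, 5.0)
print(clipped_values, clipped_norm)
```

Norm clipping keeps the direction of the gradient and only shortens it, whereas element-wise clipping can change the direction. Neither variant helps with vanishing gradients.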

The Problem of Long-Term Dependencies: A task displays long-term dependencies if prediction of the desired output at time t depends on the input presented at a much earlier time T ≪ t. When the lag between T and t becomes large, it is extremely difficult to attain convergence (Bengio et al., 1994). Previous text might inform the understanding of the present text: "The clouds are in the ___." (sky) "The clouds and the stars are in the ___." (sky) Here the gap between the relevant information and the place where it's needed is small, so RNNs can learn to use the past information. Reference: Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166. Source: http://colah.github.io/posts/2015-08-understanding-lstms/

The Problem of Long-Term Dependencies Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

The Problem of Long-Term Dependencies: "I grew up in France ... I speak fluent ___." (French) It's entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information. Source: http://colah.github.io/posts/2015-08-understanding-lstms/

The Problem of Long-Term Dependencies Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

Timeline: 1957, Single Layer Perceptron (Frank Rosenblatt). 1969, Minsky and Papert: «incapable of usefully representing or approximating functions outside a very narrow and special class». 1971, Rosenblatt died; AI Winter. 1982, the Canadian Institute for Advanced Research (CIFAR) is founded. 1982-1990, MultiLayer FeedForward Neural Networks. 1990, Elman: «no memory, think from scratch every second». 1990-1994, Recurrent Neural Networks. 1994, Bengio et al.: «in practice, RNNs are considered difficult to train due to the so-called vanishing and exploding gradient problems». 1997, LSTMs, Hochreiter & Schmidhuber: «LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow».

Long Short-Term Memory: Training becomes very difficult because of vanishing gradients: the influence of short-term dependencies dominates the gradient with respect to the weights. Solution: an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow. Reference: S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.

Long Short Term Memory Networks (LSTMs): The idea of LSTM is to modify the architecture of the hidden units by introducing gates, which explicitly control the flow of information as a function of both the state and the input. Specifically, the signal stored in a hidden unit must be explicitly erased by a forget gate and is otherwise stored indefinitely. This allows information to be carried over long periods of time. For each memory cell, the network computes the output of four gates: an update gate, an input gate, a forget gate, and an output gate. Source: http://colah.github.io/posts/2015-08-understanding-lstms/

RNNs Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

LSTMs Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

LSTMs Step by Step, CELL STATE: The cell state has a key role. It runs straight down the entire chain, with only some minor linear interactions (multiply and add). It's very easy for information to just flow along it unchanged. Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

LSTMs Step by Step, GATES: The LSTM removes information from and adds information to the cell state via gates. Gates are a way to optionally let information through. When the sigmoid function outputs 0, it lets nothing through; when it outputs 1, it lets everything through. An LSTM has four of these gates, to protect and control the cell state. Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/
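A gate of this kind can be sketched as a sigmoid output multiplied element by element with the values it guards. The function names and the pre-activation values below are assumptions picked to show the two extremes and the midpoint.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def apply_gate(gate_pre_activations, values):
    """Scale each value by its gate: near 0 blocks it, near 1 lets it through."""
    gate = [sigmoid(z) for z in gate_pre_activations]
    return [g * v for g, v in zip(gate, values)]


# Hypothetical pre-activations: strongly closed, half open, strongly open.
gated = apply_gate([-20.0, 0.0, 20.0], [3.0, 3.0, 3.0])
print(gated)
```

Because the sigmoid is smooth, a gate can also let a fraction of the information through, which keeps the whole mechanism differentiable and trainable with gradients.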

LSTMs Step by Step, FORGET GATE: Firstly, the LSTM has to decide what information to throw away from the cell state; this is done by the forget gate layer. It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. 1 represents "completely keep this"; 0 represents "completely get rid of this". Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

LSTMs Step by Step, INPUT AND UPDATE GATES: Secondly, the LSTM decides what new information to store in the cell state. This has two parts. First, a sigmoid layer called the input gate layer decides which values to update. Then, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, these two are combined to create an update to the state. Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

LSTMs Step by Step, FORGET, INPUT AND UPDATE GATES: Thirdly, action! It's time to update the old cell state, C_{t-1}, into the new cell state C_t: multiply the old state by f_t, forgetting the things decided on earlier, then add i_t * C̃_t. Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

LSTMs Step by Step, OUTPUT GATE: Finally, the LSTM has to decide what to output. It runs a sigmoid layer which decides what parts of the cell state to output, puts the cell state through tanh (to push the values to be between -1 and 1), and multiplies it by the output of the sigmoid gate, so that it only outputs the parts it decided to. Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/
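The four steps (forget, input and candidate, cell update, output) can be put together in one small sketch. To keep it readable it assumes a single memory cell with scalar input, hidden state, and cell state; the parameter layout, the names, and the weight values are hypothetical and chosen so that the cell simply remembers its state.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a single scalar memory cell.

    `params` maps each layer name to (weight on h_prev, weight on x_t, bias).
    """
    def pre_activation(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("forget"))            # what to throw away
    i_t = sigmoid(pre_activation("input"))             # which values to update
    c_candidate = math.tanh(pre_activation("candidate"))  # new candidate value
    c_t = f_t * c_prev + i_t * c_candidate             # multiply, then add
    o_t = sigmoid(pre_activation("output"))            # what to expose
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


# Hypothetical parameters: forget gate held near 1, input gate held near 0.
PARAMS = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (0.5, 0.5, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.3, -0.7, 0.9]:
    h, c = lstm_step(x, h, c, PARAMS)
print(h, c)
```

With the forget gate open and the input gate closed, the cell state passes through all three steps almost unchanged regardless of the inputs, which illustrates how information can flow along the cell state over long spans.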

Language Example: I grew up in France... She was beautiful. Her eyes...

LSTMs Summary: LSTMs avoid the long-term dependency problem thanks to their complex inner structure of four layers. The forget gate layer decides which information is thrown away. Gates add new information to the cell state. The old state is updated by multiplying it by the forget gate and adding the new candidate values. Finally, the output is decided. Source: http://colah.github.io/posts/2015-08-understanding-lstms/

Timeline: 1957, Single Layer Perceptron (Frank Rosenblatt). 1969, Minsky and Papert: «incapable of usefully representing or approximating functions outside a very narrow and special class». 1971, Rosenblatt died; AI Winter. 1982, the Canadian Institute for Advanced Research (CIFAR) is founded. 1982-1990, MultiLayer FeedForward Neural Networks. 1990, Elman: «no memory, think from scratch every second». 1990-1994, Recurrent Neural Networks. 1994, Bengio et al.: «in practice, RNNs are considered difficult to train due to the so-called vanishing and exploding gradient problems». 1997, LSTMs, Hochreiter & Schmidhuber: «LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow». 1997 - now, Deep Learning Era.

Success of Deep Learning http://people.idsia.ch/~juergen/impact-on-most-valuable-companies.html

QUESTIONS? CONCERNS? COMMENTS?
