Evolution of Neural Networks. October 20, 2017


Timeline: 1957: Single Layer Perceptron (Frank Rosenblatt).

Single Layer Perceptron. The perceptron was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. The New York Times reported the perceptron to be "the embryo of an electronic computer that will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." Rosenblatt, Frank (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6, pp. 386–408. doi:10.1037/h0042519. Image Source: http://sebastianraschka.com/articles/2015_singlelayer_neurons.html

Single Layer Perceptron. Supervised learning, binary classification; like regression, it is a linear classifier. It mimics how a single neuron in the brain works: it either fires or it does not. How it works: the unit receives multiple input signals, the signals are summed, and if the sum exceeds a certain threshold the unit returns a signal. The aim of the perceptron algorithm is to draw a linear decision boundary (a minimal code sketch follows). Image Source: http://sebastianraschka.com/articles/2015_singlelayer_neurons.html
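The mechanics described on this slide (weighted sum, fixed threshold, weight updates, linear decision boundary) fit in a few lines. The following is a minimal illustrative numpy sketch, not code from the cited article; the AND example and the hyperparameters are assumptions chosen for clarity.

```python
# Minimal single-layer perceptron sketch (illustrative; not from the slides).
import numpy as np

def perceptron_train(X, y, epochs=10, lr=0.1):
    """Classic perceptron learning rule for binary labels y in {0, 1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            # Sum the weighted inputs; "fire" (output 1) if the sum exceeds the threshold.
            prediction = 1 if np.dot(w, xi) + b > 0 else 0
            # Nudge the weights only when the prediction is wrong.
            update = lr * (target - prediction)
            w += update * xi
            b += update
    return w, b

# Example: the linearly separable AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])  # expected: [0, 0, 0, 1]
```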

Timeline: 1957: Single Layer Perceptron (Frank Rosenblatt); 1969: Minsky and Papert, «incapable of usefully representing or approximating functions outside a very narrow and special class».

Minsky M. L. and Papert S. A. 1969. Perceptrons. Cambridge, MA: MIT Press. Image Source: https://pmirla.github.io/2016/08/16/ai-winter.html

Timeline: 1957: Single Layer Perceptron (Frank Rosenblatt); 1969: Minsky and Papert, «incapable of usefully representing or approximating functions outside a very narrow and special class»; 1971: Rosenblatt died, AI WINTER.

Timeline: 1957: Single Layer Perceptron (Frank Rosenblatt); 1969: Minsky and Papert, «incapable of usefully representing or approximating functions outside a very narrow and special class»; 1971: Rosenblatt died, AI WINTER; 1982: The Canadian Institute for Advanced Research (CIFAR) is founded.

Who ended the AI Winter? CIFAR and Canadian researchers (Geoffrey Hinton, Yann LeCun, Yoshua Bengio), among others. "Perceptrons: Expanded Edition" (Minsky and Papert) was reprinted in 1987, where some errors in the original text are shown and corrected. AI Winter. How Canadians contributed to end it: https://pmirla.github.io/2016/08/16/ai-winter.html

Timeline: 1957: Single Layer Perceptron (Frank Rosenblatt); 1969: Minsky and Papert, «incapable of usefully representing or approximating functions outside a very narrow and special class»; 1971: Rosenblatt died, AI WINTER; 1982: The Canadian Institute for Advanced Research (CIFAR) is founded; 1982–1990: multilayer feedforward neural networks.

Multi-layer Perceptrons. A multi-layer perceptron (Werbos 1974; Rumelhart, McClelland, Hinton 1986) is a feedforward neural network with one or more layers between the input and output layer. Data flows in one direction, from input to output layer (forward). It is trained with the backpropagation learning algorithm and can solve problems which are not linearly separable (a minimal training sketch follows below). Rumelhart, David E., Geoffrey E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation". In David E. Rumelhart, James L. McClelland, and the PDP research group (editors), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press, 1986. Image Source: https://en.wikipedia.org/wiki/feedforward_neural_network#/media/file:xor_perceptron_net.png

Image Source: http://www.di.unito.it/~cancelli/retineu11_12/fnn.pdf
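To make the "not linearly separable" point concrete, here is a minimal illustrative sketch of a feedforward network with one hidden layer trained by backpropagation on XOR, the classic problem a single-layer perceptron cannot solve. This is assumed numpy code, not taken from the slides or the cited sources; the layer sizes, learning rate and iteration count are illustrative choices.

```python
# Minimal MLP-with-backpropagation sketch on XOR (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR: not linearly separable

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 8 units between the input and output layers.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

for step in range(10000):
    # Forward pass: data flows from input to output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the output error back through the layers.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates of the weights and biases.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())   # should end up close to [0, 1, 1, 0]
```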

Timeline: 1957: Single Layer Perceptron (Frank Rosenblatt); 1969: Minsky and Papert, «incapable of usefully representing or approximating functions outside a very narrow and special class»; 1971: Rosenblatt died, AI WINTER; 1982: The Canadian Institute for Advanced Research (CIFAR) is founded; 1982–1990: multilayer feedforward neural networks; 1990: Elman, «no memory, think from scratch every second».

Problems of Multi-layer Feedforward Networks. Humans don't start thinking from scratch every second; we understand each word based on our understanding of previous words. MLPs can't do this, since they do not have a memory: an MLP cannot use its reasoning about previous events to inform later ones. http://colah.github.io/posts/2015-08-understanding-lstms/ Image Source: https://psychology.iresearchnet.com/social-psychology/social-cognition/memory/

Nature of Recurrent Neural Networks. Jordan (1986): the recurrent connections allow the network's hidden units to see their own previous output, so that subsequent behavior can be shaped by previous responses. These recurrent connections are what give the network memory (Jordan 1986, as explained by Elman, 1990). Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211. http://doi.org/10.1016/0364-0213(90)90002-e. Jordan, M. I. (1986). Serial order: A parallel distributed processing approach (Tech. Rep. No. 8604). San Diego: University of California, Institute for Cognitive Science.

Nature of Recurrent Neural Networks. Elman (1990): a context layer is added to the model, and activations in the hidden layer are copied to the context layer on a one-for-one basis at time t. Thus, at time t+1, the context units contain values which are exactly the hidden unit values at time t. These context units are also hidden in the sense that they interact exclusively with other nodes internal to the network, and not with the outside world. Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211. http://doi.org/10.1016/0364-0213(90)90002-e.

How does an RNN work? At time t, the input units receive the first input in the sequence. Both the input units and the context units activate the hidden units; the hidden units then feed forward to activate the output units, and also feed back to activate the context units. This constitutes the forward activation. Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211. http://doi.org/10.1016/0364-0213(90)90002-e. Image Source: http://www.lund.irf.se/helioshome/elman.html

How does an RNN work? If there is learning, the output is compared with a teacher input, and backpropagation of error is used to adjust connection strengths incrementally. At time t+1 the above sequence is repeated; now the context units contain values which are exactly the hidden unit values at time t. These context units thus provide the network with memory (a minimal sketch of one forward step follows). Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211. http://doi.org/10.1016/0364-0213(90)90002-e. Image Source: http://www.lund.irf.se/helioshome/elman.html
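A minimal sketch of the forward step just described, with the context layer acting as a copy of the hidden units from the previous time step. This is illustrative numpy code under assumed toy dimensions (and a tanh nonlinearity), not the network from the figures.

```python
# Minimal Elman-style forward step (illustrative; dimensions are assumptions).
import numpy as np

def elman_step(x_t, context, W_xh, W_ch, W_hy, b_h, b_y):
    """Input units and context units activate the hidden units; the hidden
    units feed forward to the outputs and are copied back into the context
    layer for use at time t+1."""
    h_t = np.tanh(W_xh @ x_t + W_ch @ context + b_h)   # hidden activation at time t
    y_t = W_hy @ h_t + b_y                             # output activation at time t
    return y_t, h_t.copy()                             # the copy becomes the new context

# Toy dimensions: 3 input units, 5 hidden/context units, 2 output units.
rng = np.random.default_rng(0)
W_xh, W_ch, W_hy = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
b_h, b_y = np.zeros(5), np.zeros(2)

context = np.zeros(5)                   # empty memory at the start of a sequence
for x_t in rng.normal(size=(4, 3)):     # a toy sequence of 4 inputs
    y_t, context = elman_step(x_t, context, W_xh, W_ch, W_hy, b_h, b_y)
```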

RNN for Machine Translation Image Source: http://cs224d.stanford.edu/lectures/cs224d-lecture8.pdf

Interim Summary. Traditional neural networks don't have memory; they start thinking from scratch every time. Recurrent neural networks have context layers (loops) that allow information to persist. The hidden layer of an RNN represents all previous history, not just the n−1 previous words, so the model can theoretically represent long context patterns (Mikolov, 2012).

Timeline: 1957: Single Layer Perceptron (Frank Rosenblatt); 1969: Minsky and Papert, «incapable of usefully representing or approximating functions outside a very narrow and special class»; 1971: Rosenblatt died, AI WINTER; 1982: The Canadian Institute for Advanced Research (CIFAR) is founded; 1982–1990: multilayer feedforward neural networks; 1990: Elman, «no memory, think from scratch every second»; 1990–1994: recurrent neural networks; 1994: Bengio et al., «in practice, RNNs are considered difficult to train due to the so-called vanishing and exploding gradient problems».

Problems with RNNs. An RNN faces an increasingly difficult problem as the duration of the dependencies to be captured increases. The RNN learning algorithm computes the gradient of a cost function with respect to the weights of the network; this gradient sometimes vanishes and sometimes explodes, the so-called vanishing and exploding gradients. Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166. Image Source: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/recurrent_neural_networks.html

Problems with RNNs. Exploding gradients: a large increase in the norm of the gradient during training. Such events are due to the explosion of the long-term components, which can grow exponentially more than the short-term ones. Solution: clipping the gradient when its norm exceeds a fixed threshold (a sketch follows), or Long Short-Term Memories. Vanishing gradients: the long-term components go exponentially fast to norm 0, making it impossible for the model to learn correlations between temporally distant events. Solution: Long Short-Term Memories. Pascanu, R., Mikolov, T., & Bengio, Y. (2013, February). On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (pp. 1310-1318).
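A minimal sketch of the gradient-clipping remedy mentioned above: rescale the gradient whenever its norm exceeds a fixed threshold, in the spirit of Pascanu et al. (2013). This is illustrative numpy code; the threshold value is an assumption, and deep learning frameworks ship built-in equivalents.

```python
# Minimal gradient-norm clipping sketch (illustrative; threshold is an assumption).
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale the gradient so its norm never exceeds the fixed threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])     # an "exploding" gradient with norm 50
print(clip_gradient(g))         # rescaled to norm 5: [ 3. -4.]
```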

The Problem of Long-Term Dependencies. A task displays long-term dependencies if prediction of the desired output at time t depends on input presented at a much earlier time. When that gap becomes large, it is extremely difficult to attain convergence (Bengio et al., 1994). Previous text can inform the understanding of the present text: "The clouds are in the ___" (sky); "The clouds and the stars are in the ___" (sky). Here the gap between the relevant information and the place where it is needed is small, and RNNs can learn to use the past information. Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166. Source: http://colah.github.io/posts/2015-08-understanding-lstms/

The Problem of Long-Term Dependencies Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

The Problem of Long-Term Dependencies. "I grew up in France … I speak fluent ___" (French). It's entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as that gap grows, RNNs become unable to learn to connect the information. Source: http://colah.github.io/posts/2015-08-understanding-lstms/

The Problem of Long-Term Dependencies Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

Timeline: 1957: Single Layer Perceptron (Frank Rosenblatt); 1969: Minsky and Papert, «incapable of usefully representing or approximating functions outside a very narrow and special class»; 1971: Rosenblatt died, AI WINTER; 1982: The Canadian Institute for Advanced Research (CIFAR) is founded; 1982–1990: multilayer feedforward neural networks; 1990: Elman, «no memory, think from scratch every second»; 1990–1994: recurrent neural networks; 1994: Bengio et al., «in practice, RNNs are considered difficult to train due to the so-called vanishing and exploding gradient problems»; 1997: LSTMs, Hochreiter & Schmidhuber, «LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow».

Long Short-Term Memory. Training becomes very difficult due to vanishing gradients, because the influence of short-term dependencies dominates the weight gradients. Solution: an efficient, gradient-based algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error flow. S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.

Long Short-Term Memory Networks (LSTMs). The key idea of the LSTM is to modify the architecture of the hidden units by introducing gates, which explicitly control the flow of information as a function of both the state and the input. Specifically, the signal stored in a hidden unit must be explicitly erased by a forget gate and is otherwise stored indefinitely; this allows information to be carried over long periods of time. For each memory cell, the network computes the output of four gates: an update gate, an input gate, a forget gate, and an output gate. http://colah.github.io/posts/2015-08-understanding-lstms/

RNNs Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

LSTMs Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

LSTMs Step by Step: CELL STATE. The cell state has a key role: it runs straight down the entire chain, with only some minor linear interactions (a pointwise multiply and an add). It's very easy for information to just flow along it unchanged. Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

LSTMs Step by Step: GATES. The LSTM handles removing and adding information to the cell state via gates. Gates are a way to optionally let information through: a sigmoid output of 0 lets nothing through, while an output of 1 lets everything through. An LSTM has four of these gates, to protect and control the cell state. Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/

LSTMs Step by Step: FORGET GATE. First, the LSTM has to decide what information to throw away from the cell state; this is the forget gate layer. It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state: 1 represents "completely keep this", 0 represents "completely get rid of this" (in symbols below). Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/
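In the notation of the cited colah post, the forget gate is a sigmoid layer over the previous hidden state and the current input:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)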

LSTMs Step by Step: INPUT AND UPDATE GATES. Second, decide what new information to store in the cell state. This has two parts: first, a sigmoid layer called the input gate layer decides which values to update; then, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, these two are combined to create an update to the state (in symbols below). Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/
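In the same notation, the input gate and the candidate values are:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)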

LSTMs Step by Step: FORGET, INPUT AND UPDATE GATES. Third, it is time to update the old cell state, C_{t-1}, into the new cell state C_t: multiply the old state by f_t (forgetting the things decided earlier), then add i_t · C̃_t (in symbols below). Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/
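In the same notation, the cell state update combines the two previous steps:

C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t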

LSTMs Step by Step: OUTPUT GATE. Finally, the LSTM has to decide what the output will be: run a sigmoid layer which decides what parts of the cell state to output, put the cell state through tanh (to push the values to be between −1 and 1), and multiply it by the output of the sigmoid gate, so that only the chosen parts are output (in symbols below). Image Source: http://colah.github.io/posts/2015-08-understanding-lstms/
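In the same notation, the output gate and the new hidden state are:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t \ast \tanh(C_t)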

Language Example: "I grew up in France …" "She was beautiful. Her eyes …"

LSTMs Summary. LSTMs avoid the long-term dependency problem thanks to their gated inner structure of four layers: the forget gate layer decides what information to throw away; the input and candidate layers decide what new information to add to the cell state; the old state is updated by multiplying it by the forget gate output and adding the new candidate values; and the output gate decides the output (a combined code sketch follows). http://colah.github.io/posts/2015-08-understanding-lstms/
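Putting the four steps together, here is a minimal illustrative numpy sketch of a single LSTM cell step following the standard formulation from the cited post. The toy dimensions and random parameters are assumptions; real implementations in deep learning frameworks fuse and optimize these operations.

```python
# Minimal single LSTM cell step (illustrative; toy dimensions are assumptions).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b hold the parameters of the forget (f), input (i), candidate (g)
    and output (o) transforms, indexed by those letters."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate: 0 erase, 1 keep
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate: which values to update
    c_tilde = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # new candidate values
    c_t = f_t * c_prev + i_t * c_tilde                          # update the cell state
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    h_t = o_t * np.tanh(c_t)                                    # filtered cell state as output
    return h_t, c_t

# Toy dimensions: 3 inputs, 4 hidden units.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(4, 3)) for k in 'figo'}
U = {k: rng.normal(size=(4, 4)) for k in 'figo'}
b = {k: np.zeros(4) for k in 'figo'}

h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):       # run over a toy 5-step sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```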

Timeline: 1957: Single Layer Perceptron (Frank Rosenblatt); 1969: Minsky and Papert, «incapable of usefully representing or approximating functions outside a very narrow and special class»; 1971: Rosenblatt died, AI WINTER; 1982: The Canadian Institute for Advanced Research (CIFAR) is founded; 1982–1990: multilayer feedforward neural networks; 1990: Elman, «no memory, think from scratch every second»; 1990–1994: recurrent neural networks; 1994: Bengio et al., «in practice, RNNs are considered difficult to train due to the so-called vanishing and exploding gradient problems»; 1997: LSTMs, Hochreiter & Schmidhuber, «LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow»; 1997–now: Deep Learning Era.

Success of Deep Learning http://people.idsia.ch/~juergen/impact-on-most-valuable-companies.html

QUESTIONS? CONCERNS? COMMENTS?