Backpropagation in recurrent MLP

Backpropagation in recurrent MLP
Training and design issues in MLP

Reference: Christopher Bishop, Pattern Recognition and Machine Learning, Springer, 2006, Chapter 5.3

Remember: local minima

[Figure: error surface illustrating local minima]

http://w3.ualg.pt/~jvo/ml

Local minima and weight initialization

Backpropagation performs gradient descent, so it finds a local, not necessarily global, error minimum.
- Run backpropagation N times with different small random initial weights.
- Heuristic: the initial weight range should be approximately +/- 1/(number of weights coming into a node).

The momentum term

Δw[k] = η δ[k] x[k] + α Δw[k-1], with α ∈ [0,1]

It smooths the weight adjustments over time by avoiding sudden changes in the weights; a minimal code sketch follows below.
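The sketch below shows one momentum step in Python (the function name and the default values of η and α are illustrative assumptions; -η·grad stands in for the η δ[k] x[k] term computed by backpropagation):

def momentum_update(w, grad, prev_delta, eta=0.1, alpha=0.5):
    # Δw[k] = -η * gradient + α * Δw[k-1]; -grad plays the role of
    # the δ[k] x[k] error signal from backpropagation.
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

Keeping the previous step prev_delta is the only extra state the method needs, which is why the momentum term is so cheap to add.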

Typical error evolution during training

[Figure: three plots of total error E versus number of iterations]
- Steady, rapid decline in total error.
- Reduce the learning parameters; this shape may indicate the data is not learnable.
- Seldom a true local minimum: reduce the learning rate or momentum parameter, or re-initialize the weights and re-run.

Typical training parameters (highly application dependent)

Parameter         Typical  Range
learning rate, η  0.1      0.001-0.99
momentum, α       0.5      0.1-0.9

Better: during training, automatically adjust an individual learning rate parameter for each weight.

Individual adaptive learning rate parameters
- Each weight w_k,j has its own rate η_k,j.
- If Δw_k,j remains in the same direction, increase η_k,j.
- If Δw_k,j changes direction, decrease η_k,j.
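A sketch of this rule in Python (the growth and shrink factors 1.05 and 0.7 are illustrative assumptions; published variants such as delta-bar-delta differ in the exact schedule):

import numpy as np

def adapt_learning_rates(etas, delta, prev_delta, up=1.05, down=0.7):
    # Grow η_k,j where the weight change kept its sign,
    # shrink it where the sign flipped.
    same_direction = np.sign(delta) == np.sign(prev_delta)
    return np.where(same_direction, etas * up, etas * down)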

Experimental comparison

Training on the XOR problem (batch mode); 25 simulations with random initial weights; success if E averaged over 50 consecutive epochs is less than 0.04.

Method                 Simulations  Successes  Mean epochs
BP                     25           24         16,859.8
BP with momentum       25           25         2,056.3
BP with adaptive etas  25           22         447.3

Faster convergence

There are other optimization methods with faster convergence than gradient descent:
- Newton's method uses a quadratic approximation (2nd-order Taylor expansion; see the sketch below):
  F(x + Δx) ≈ F(x) + ∇F(x)ᵀ Δx + ½ Δxᵀ ∇²F(x) Δx + ...
- Conjugate gradients
- Levenberg-Marquardt algorithm
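A minimal sketch of the Newton idea (it assumes the gradient and Hessian are available as arrays; conjugate gradients and Levenberg-Marquardt avoid forming or inverting the full Hessian, which this sketch does not):

import numpy as np

def newton_step(w, grad, hessian):
    # Minimizing the quadratic model F + ∇Fᵀ Δw + ½ Δwᵀ ∇²F Δw
    # gives Δw = -(∇²F)⁻¹ ∇F.
    return w - np.linalg.solve(hessian, grad)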

When is a neural network trained?

Objective: to achieve generalization (accuracy on new examples/cases) by preventing over-fitting/over-training.

The over-fitting/over-training problem: the trained net fits the training samples perfectly (E reduced to 0) but does not give accurate outputs for inputs outside the training set.
- Train the network using a training set plus a test set.
- Validate the trained network against a separate set, hereafter referred to as the production set.
- Monitor the error on the test set as the network trains.

Large-sample method (a large data set is available):
- The available examples are divided randomly: 70% training set, 30% test set, plus a production set.
- The training and test sets are used to develop one ANN model; compute the test error.
- Generalization error = test error.

Cross-validation (when the available data set is small):
- The available examples are divided randomly: 90% training set, 10% test set, plus a production set.
- Repeat 10 times to develop K different ANN models, accumulating the test errors.
- The generalization error is determined by the mean test error and its standard deviation.
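A sketch of the cross-validation splits in Python (the generator interface is an assumption; the slide's repeated 90/10 split corresponds to k = 10):

import numpy as np

def kfold_splits(n_examples, k=10, seed=0):
    # Shuffle the example indices and yield k train/test splits;
    # each fold serves exactly once as the test set.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_examples), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]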

Preventing over-fitting/over-training

Stop network training just before the over-fit error occurs (early stopping; a sketch follows below).

How to select between two ANN models?

A statistical test of hypothesis is required to ensure that a significant difference exists between the error rates of the two ANN models:
- If the large-sample method has been used, apply McNemar's test.
- If cross-validation has been used, apply a paired t-test for the difference of two proportions.
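A minimal early-stopping loop (the get_weights/set_weights model interface and the train_step and valid_error callables are hypothetical, not from the slides):

def train_with_early_stopping(model, train_step, valid_error,
                              patience=10, max_epochs=1000):
    # Stop once the held-out error has not improved for `patience`
    # consecutive epochs, and restore the best weights seen so far.
    best_err, best_w, wait = float("inf"), model.get_weights(), 0
    for _ in range(max_epochs):
        train_step(model)
        err = valid_error(model)
        if err < best_err:
            best_err, best_w, wait = err, model.get_weights(), 0
        else:
            wait += 1
            if wait >= patience:
                break
    model.set_weights(best_w)
    return model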

Design issues

Network architecture: how many nodes?
- Open issues: How many layers? How many nodes per layer?
- Automated methods: augmentation (cascade correlation); weight pruning and elimination (optimal brain damage).

Structure of artificial neurons

Choice of input integration: summed; squared and summed; multiplied.
Choice of activation (transfer) function (sketches below): logistic, hyperbolic tangent, Gaussian, linear, soft-max.

Selecting a learning rule
- Backpropagation (stochastic vs. batch version)
- Momentum term
- Adaptive learning rates
- Faster-convergence techniques: Newton's method with its quadratic approximation F(x + Δx) ≈ F(x) + ∇F(x)ᵀ Δx + ½ Δxᵀ ∇²F(x) Δx + ..., conjugate gradients, and the Levenberg-Marquardt algorithm
- Variety of performance (cost) functions
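Sketches of the listed activation functions in Python (the vectorized NumPy form is an assumption, not code from the slides):

import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def gaussian(a):
    return np.exp(-a ** 2)

def linear(a):
    return a

def softmax(a):
    e = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return e / e.sum()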

Weight decay / regularization

Adjust the error function to penalize unnecessary growth of the weights:

E = ½ Σ_j (t_j - y_j)² + (λ/2) Σ_i Σ_j w_ij²

Δw_ij = Δw_ij - λ w_ij

where λ is the weight-cost parameter (a code sketch follows the summary).

Summary
- Backpropagation in recurrent MLP
- Training and design issues in MLP
- Weight initialization
- Momentum term
- Typical training parameters
- Adaptive individual learning rate parameters
- Ensuring generalization
- Network architecture
- Structure of artificial neurons
- Learning rules
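A sketch of one gradient step on the penalized error (the η and λ defaults are illustrative; folding λ into the η-scaled update differs slightly from the slide's Δw_ij = Δw_ij - λ w_ij convention):

def weight_decay_step(w, grad, eta=0.1, lam=1e-4):
    # The derivative of (λ/2)·Σ w² adds λ·w to the data gradient,
    # shrinking every weight toward zero at each step.
    return w - eta * (grad + lam * w)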