Backpropagation in recurrent MLP
Training and design issues in MLP
Christopher Bishop, Pattern Recognition and Machine Learning, Springer, 2006, Chapter 5.3
Remember: local minima (2-Aug-13, http://w3.ualg.pt/~jvo/ml)
Local minima and weight initialization
Backpropagation performs gradient descent and finds a local, not necessarily a global, error minimum.
Run backpropagation N times with different small random initial weights.
Heuristic: the weight range should be approximately ±1/(number of weights coming into a node).
The momentum term:
Δw[k] = η δ[k] x[k] + α Δw[k-1], with α ∈ [0, 1]
It smooths the effect of the weight adjustments over time by avoiding sudden changes in the weights.
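The momentum update above can be sketched as follows (a minimal illustration; α = 0.5 is the typical value from the table below, and the constant descent term is a hypothetical scalar):

```python
def momentum_update(eta_delta_x, velocity, alpha=0.5):
    """One momentum step: Δw[k] = η δ[k] x[k] + α Δw[k-1].

    eta_delta_x -- the plain gradient-descent term η δ[k] x[k]
    velocity    -- the previous weight change Δw[k-1]
    Returns the new weight change Δw[k].
    """
    return eta_delta_x + alpha * velocity

# Repeated steps in the same direction accumulate, so the descent
# is smoothed and accelerated rather than changing abruptly.
dw = 0.0
history = []
for _ in range(3):
    dw = momentum_update(-0.1, dw)  # constant descent term
    history.append(dw)
# history: [-0.1, -0.15, -0.175]
```

Note how each change builds on the previous one: a sudden sign flip in the gradient is damped by the accumulated velocity.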
Typical error evolution during training (three characteristic curves of total error E vs. number of iterations):
- Steady, rapid decline in total error: training is progressing well.
- Error barely decreases: reduce the learning parameters; may indicate the data is not learnable.
- Error gets stuck: seldom a local minimum; reduce the learning rate or momentum parameter, or re-initialize the weights and re-run.
Typical training parameters (highly application dependent):

parameter         typical   range
learning rate η   0.1       0.001-0.99
momentum α        0.5       0.1-0.9

Better: during training, automatically adjust an individual learning rate parameter for each weight.
Individual adaptive learning rate parameters
Each weight w_{k,j} has its own learning rate η_{k,j}:
- If Δw_{k,j} remains in the same direction, increase η_{k,j}.
- If Δw_{k,j} changes direction, decrease η_{k,j}.
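A minimal sketch of this rule, using the sign of successive weight changes as the direction test; the up/down factors (1.05 and 0.5) are illustrative choices, not values from the slides:

```python
def adapt_rate(dw, prev_dw, eta, up=1.05, down=0.5):
    """Per-weight rate adaptation for one weight w_{k,j}.

    If the weight change keeps its sign (same direction), grow
    this weight's eta_{k,j}; if the sign flips, shrink it.
    """
    if dw * prev_dw > 0:
        eta *= up      # same direction: speed up
    elif dw * prev_dw < 0:
        eta *= down    # direction changed: slow down
    return eta

eta = adapt_rate(0.2, 0.1, eta=0.1)    # same sign  -> 0.105
eta = adapt_rate(-0.2, 0.2, eta=eta)   # sign flip  -> 0.0525
```

Each weight thus speeds up along flat, consistent directions of the error surface and slows down where the gradient oscillates.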
Experimental comparison
Training on the XOR problem (batch mode); 25 simulations with random initial weights. Success if E averaged over 50 consecutive epochs is less than 0.04.

method                  simulations   successes   mean epochs
BP                      25            24          16,859.8
BP with momentum        25            25          2,056.3
BP with adaptive etas   25            22          447.3

Faster convergence
There are other optimization methods with faster convergence than gradient descent:
- Newton's method uses a quadratic approximation (2nd-order Taylor expansion):
  F(x+Δx) ≈ F(x) + ∇F(x)ᵀ Δx + ½ Δxᵀ ∇²F(x) Δx + …
- Conjugate gradients
- Levenberg-Marquardt algorithm
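As a worked illustration of why the quadratic approximation converges fast: for a quadratic F the 2nd-order Taylor expansion is exact, so a single Newton step x ← x − [∇²F(x)]⁻¹ ∇F(x) lands on the minimum (1-D toy function chosen for illustration):

```python
def newton_step(x, grad, hess):
    """One Newton step in 1-D: x <- x - grad(x) / hess(x)."""
    return x - grad(x) / hess(x)

# F(x) = (x - 3)^2, minimized at x = 3:
x = newton_step(10.0, grad=lambda x: 2 * (x - 3), hess=lambda x: 2.0)
# a single step reaches the minimum x = 3.0
```

For non-quadratic error surfaces the step must be iterated, and computing or inverting the Hessian is what conjugate gradients and Levenberg-Marquardt avoid or approximate.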
When is a neural network trained?
Objective: to achieve generalization (accuracy on new examples/cases) by preventing over-fitting/over-training.
Over-fitting/over-training problem: the trained net fits the training samples perfectly (E reduced to 0) but does not give accurate outputs for inputs not in the training set.
- Train the network using a training set + test set.
- Validate the trained network against a separate held-out set, hereafter referred to as the production set.
- Monitor the error on the test set as the network trains.
Large sample method (a large data set is available):
- Available examples are divided randomly: 70% training set, 30% test set (plus a production set).
- Used to develop one ANN model; compute the test error.
- Generalization error = test error.
Cross-validation (when the available data set is small):
- Available examples are split 90% training set / 10% test set (plus a production set); repeat 10 times.
- Used to develop K different ANN models; accumulate the test errors.
- Generalization error is determined by the mean test error and its standard deviation.
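The cross-validation scheme above can be sketched as an index-splitting helper (the 10-fold / 90%-10% numbers follow the slide; the function name and shuffling seed are my own choices):

```python
import random

def k_fold_splits(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs over n examples.

    Each fold serves once as the 10% test set, with the remaining
    90% as the training set; the k test errors are then accumulated
    and summarized as mean and stddev (the generalization error).
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_splits(20))
# 10 splits; every example appears in exactly one test fold
```

Because every example is tested exactly once, the mean test error uses all the scarce data while each model is still evaluated on examples it never saw.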
Preventing over-fitting/over-training
Stop network training just prior to the over-fit error occurring (early stopping).
How to select between two ANN models? A statistical hypothesis test is required to ensure that a significant difference exists between the error rates of the two models:
- If the large sample method was used, apply McNemar's test.
- If cross-validation was used, apply a paired t-test for the difference of two proportions.
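A sketch of early stopping driven by the monitored test error; the two callbacks and the `patience` criterion for "just prior to over-fitting" are hypothetical placeholders, not from the slides:

```python
def train_with_early_stopping(train_epoch, test_error,
                              max_epochs=100, patience=3):
    """Run training epochs while watching the test-set error.

    Stop once `patience` epochs pass without improvement, i.e.
    just before the net starts over-fitting the training samples.
    Returns the best (lowest) test error seen.
    """
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        train_epoch()
        err = test_error()
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best

# Simulated test-error curve that turns upward as over-fitting sets in:
errs = iter([0.5, 0.4, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35])
best = train_with_early_stopping(lambda: None, lambda: next(errs))
# training halts near the turning point; best == 0.3
```

In practice one also saves the weights at the best epoch, so the returned model is the one from just before the test error began to rise.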
Design issues
Network architecture: how many nodes?
Open issues: how many layers? How many nodes per layer?
Automated methods:
- augmentation (cascade correlation)
- weight pruning and elimination (optimal brain damage)
Structure of artificial neurons
Choice of input integration: summed; squared and summed; multiplied.
Choice of activation (transfer) function: logistic, hyperbolic tangent, Gaussian, linear, soft-max.
Selecting a learning rule:
- Backpropagation (stochastic vs. batch version)
- Momentum term
- Adaptive learning rates
- Faster convergence techniques:
  - Newton's method uses a quadratic approximation (2nd-order Taylor expansion):
    F(x+Δx) ≈ F(x) + ∇F(x)ᵀ Δx + ½ Δxᵀ ∇²F(x) Δx + …
  - Conjugate gradients
  - Levenberg-Marquardt algorithm
- Variety of performance (cost) functions
Weight Decay/Regularization
Adjust the error function to penalize the unnecessary growth of weights:
E = ½ Σ_j (t_j − y_j)² + (λ/2) Σ_i Σ_j w_ij²
Δw_ij = Δw_ij − λ w_ij
where λ is the weight-cost parameter.
Summary
- Backpropagation in recurrent MLP
- Training and design issues in MLP:
  - Weight initialization
  - Momentum term
  - Typical training parameters
  - Adaptive individual learning rate parameters
  - Ensuring generalization
  - Network architecture
  - Structure of artificial neurons
  - Learning rules
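The decayed weight-change rule from the Weight Decay slide, as a one-line sketch (following the slide's convention where λ multiplies w_ij directly inside the weight change; the function name is mine):

```python
def weight_decay_step(w, backprop_dw, lam=0.01):
    """Apply weight decay to one weight: Δw_ij <- Δw_ij - λ w_ij.

    The penalty (λ/2) Σ w² adds -λ w_ij to each weight change,
    shrinking weights that the error term does not actively need.
    """
    return w + backprop_dw - lam * w

# With no error gradient, a weight simply decays toward zero:
w = weight_decay_step(1.0, backprop_dw=0.0, lam=0.1)
# w == 0.9
```

Unnecessary weights therefore drift toward zero over training, which keeps the network from memorizing the training set with large, brittle weights.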