Learning of Open-Loop Neural Networks with Improved Error Backpropagation Methods

J. Pihler
University of Maribor, Faculty of Electrical Engineering and Computer Sciences, Slovenia (e-mail: joze.pihler@unimb.si)

Abstract: The paper describes the learning of artificial neural networks with improved error backpropagation methods. Artificial neural networks were included in the algorithm of differential protection in order to recognise patterns of a transformer's inrush current, with the purpose of improving the reliability of the protection system. The quality of the improved learning methods is assessed by comparing the individual methods while the number of hidden neuron layers, the number of neurons in each layer, the number of learning epochs, the learning time, the learning error and the test error are varied. The learned neural networks were tested with patterns that were used for learning, as well as with patterns that were not used in the learning process.

Keywords: Neural network, Error backpropagation method, Improved learning methods, Transformer inrush.

I. INTRODUCTION

An artificial neural network (ANN) is a tool for solving real problems for which classical analytical methods are not sufficient or where the problem cannot be generalised further. In electric power systems, which are conditioned by the two additional criteria of security and reliability of operation, it is used in taking the final decisions. An ANN consists of a large number of neurons that are mutually connected and process data in parallel, with regard to the dynamic condition of the neural network and to its external inputs. Since it is able, by learning, to adapt to input information and to given requirements, it is classified among adaptive systems. Learning is also related to the characteristics of associativity and generalisation. An ANN is a robust system: some neurons (processing units) can be removed and the network will still operate correctly, only with slightly worse results. The characteristics of robustness, learning, associativity and generalisation give the ANN a high degree of flexibility.

In electric power engineering, open-loop (feedforward) ANNs with the accompanying error backpropagation learning algorithm are most commonly used [1]. After successful learning, such neural networks are able to provide reasonable answers to input data they have never encountered before, to approximate functions with a finite number of discontinuities, and to sort input vectors in a user-defined way. However, the basic gradient method of the error backpropagation learning algorithm, i.e. the gradient descent method, in many cases does not converge to the solution fast enough and often does not reach it at all. In order to accelerate and improve the efficiency of learning open-loop neural networks, the authors of [2], [3] and others developed improved learning methods with error backpropagation, which can converge even 100 times faster.

In the second section of the paper the ANN is presented. The third section gives an overview of the improved learning methods. The last, fourth section brings a comparison of these methods on an actual example of power transformer inrush current recognition with an ANN [4], [6]. As is well known, when an unloaded transformer is switched on, the current in the primary winding can rise to several times its nominal value. The protective relay would then operate, although a fault did not occur.
To avoid this problem, elements of artificial intelligence, i.e. artificial neural networks, are included in the existing algorithms of transformer protection.

II. ARTIFICIAL NEURAL NETWORK

An ANN is a parallel information-distribution structure composed of processing elements (neurons) that are mutually connected with signal connections. The limits of the capabilities of artificial neural networks lie far above the capabilities of a single neuron. In the vast majority of applications the so-called "feedforward" neural networks are used. For this kind of neural network the term topology comprises the number of network layers and the number of neurons in these layers. We distinguish between networks having one, two or more layers of neurons. Typical representatives of open-loop ANNs are single-layer and multilayer perceptrons. A single-layer perceptron can only be used for simple examples of classification of patterns in a plane. Multilayer perceptrons have a much wider area of use; they represent an expansion of the single-layer perceptron with an output layer and several hidden neuron layers. Outputs from the first neuron layer are at the same time inputs to the second layer, and outputs from the second layer are inputs to the third neuron layer. Outputs from the third layer can be written in matrix form as

    y = F^{[3]}( W^{[3]} F^{[2]}( W^{[2]} F^{[1]}( W^{[1]} x ) ) ),    (1)

where x is the vector of inputs to the ANN, y is the vector of outputs from the ANN, W^{[n]} are the matrices of weights of the individual layers, and F^{[n]} is the matrix of transfer functions of the individual neuron layers.

III. LEARNING OF ANN WITH IMPROVED METHODS

Learning of an ANN is an optimisation of the weights with regard to a certain objective function. In this learning method the desired output from the neural network is given together with each input value. Thus it is possible to determine exactly the transformation from inputs to outputs that has to be performed by the ANN. Supervised learning is, in principle, the application of optimisation methods to the ANN. In addition to the parameters (weights), the inputs to the network also change. These inputs form the learning set, which should carry as many characteristics of the total set (of all possible inputs) as possible. The function that is minimised is a multiparameter error function of one learning repetition over the outputs of the learning set.

The method of error backpropagation is used as the learning rule in multilayer perceptrons. This method enables the calculation of errors in the individual hidden layers on the basis of the error of the ANN output layer. Fig. 1 illustrates it on the example of a three-layer ANN.

Fig. 1  Composition of a three-layer ANN with the error backpropagation algorithm

The first layer has R inputs, weighted with weights W^{[1]} and connected to S_1 neurons, while the second and third layers contain S_2 and S_3 neurons, respectively, with adequately weighted connections. Input signals x_i (i = 1, ..., R) and desired output signals d_j (j = 1, ..., S_3) participate in the learning process. The task of the learning process is the adaptation of all ANN weights W^{[n]} (n = 1, 2, 3) in such a way that the deviation between the desired outputs d and the actual outputs y, averaged over all p learning patterns, is minimal. For minimisation of the sum of squared errors the standard gradient procedure, i.e. the gradient descent method, was used [3]. The total squared error is defined by the equation

    E = \sum_{p} \sum_{j=1}^{S_3} ( d_{jp} - y_{jp} )^2.    (2)

The basic learning procedure with the error backpropagation principle consists of the following sequence: initialisation of all weights w_{ij}^{[n]} and thresholds b_{ij}; calculation of the outputs of all neurons for all input patterns (the so-called feed-forward calculation); definition of the desired outputs and calculation of the local errors \delta_j^{[n]} for all layers (the backpropagation calculation); adaptation of the weights, i.e. calculation of the new weights. These steps are repeated until the sum of squared errors E reaches a prescribed value and the ANN converges, or until the maximum number of learning epochs is reached. One learning epoch comprises one calculation of the outputs of all neuron layers, the calculation of the sum of squared errors E, the backpropagation calculation of the partial errors \delta, and the calculation of the weight changes as well as of the new weights and thresholds for the next epoch.

A. Improved method of error backpropagation: the momentum gradient method

The weakness of the classical error backpropagation algorithm with the gradient method is that for some combinations of initial weight values the learning of the neural network can end in a local instead of the global minimum.
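To make the sequence concrete, the following sketch (not from the paper; a minimal NumPy illustration) implements the feed-forward pass of eq. (1) and one standard gradient-descent learning epoch that accumulates the sum of squared errors E of eq. (2). The tangent sigmoid hidden layers and the linear output layer follow the choice described later in Section IV; the learning constant and the data layout are assumptions.

```python
import numpy as np

def tansig(a):
    # Tangent sigmoid transfer function (used in the hidden layers).
    return np.tanh(a)

def tansig_deriv(a):
    return 1.0 - np.tanh(a) ** 2

def forward(x, W, b):
    """Feed-forward pass of a three-layer perceptron, eq. (1):
    y = F3(W3 F2(W2 F1(W1 x)))."""
    activations = [x]   # layer outputs, starting with the input vector
    pre = []            # pre-activation values, needed for backpropagation
    for n in range(3):
        a = W[n] @ activations[-1] + b[n]
        pre.append(a)
        # linear transfer function in the output layer, tansig in hidden layers
        activations.append(a if n == 2 else tansig(a))
    return pre, activations

def backprop_epoch(patterns, targets, W, b, eta=0.01):
    """One learning epoch of standard gradient descent:
    accumulate the sum of squared errors E (eq. (2)) and adapt the weights."""
    E = 0.0
    for x, d in zip(patterns, targets):
        pre, act = forward(x, W, b)
        err = d - act[-1]
        E += float(err @ err)
        # local errors delta^[n]: output layer is linear, hidden layers use tansig'
        delta = [None, None, -2.0 * err]
        delta[1] = (W[2].T @ delta[2]) * tansig_deriv(pre[1])
        delta[0] = (W[1].T @ delta[1]) * tansig_deriv(pre[0])
        for n in range(3):
            W[n] -= eta * np.outer(delta[n], act[n])
            b[n] -= eta * delta[n]
    return E
```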
To avoid the local minimum, the change of weights is defined in discrete form as

    \Delta w_{ij}^{[n]}(k+1) = m_c \, \Delta w_{ij}^{[n]}(k) + (1 - m_c) \, \eta \, \delta_j^{[n]}(k) \, x_i^{[n]}(k),    (3)

where \Delta w_{ij} is the change of the weights, m_c the momentum constant, \eta the learning constant, k the calculation step, \delta_j the partial error from the backpropagation calculation and x_i the inputs to the neural network.

B. Learning with an adaptive learning constant

The learning constant in the standard learning method with gradient descent remains unchanged during learning. If too high a learning constant is chosen, the sum of squared errors may oscillate as a function of the number of learning epochs. On the other hand, if the selected learning constant is too low, the time to converge may become too long. Therefore the best approach is to vary the learning constant during the learning process. The basic idea of this method is to use the gradient procedure to calculate two new points instead of one [3]. The point with the lower error is then used in the next iteration.

C. Jumping method with error backpropagation

Multilayer neural networks usually use sigmoid functions in the hidden layers. These functions map infinite values at the input into a finite range at the output. Increasing input values drive their gradient towards zero. This causes a problem when sigmoid functions are used in learning of neural networks with the gradient descent method, since the value of the gradient may be very low. The result is too small a change of the weights, although they might be far from the optimum values. The principle of the jumping method with error backpropagation is to remove these negative effects of the values of the partial derivatives. Only the sign of the derivative is used to define the direction in which the weights will be changed [3]; the absolute value of the derivative has no effect on the change of the weights. If the sign remains the same in two iterations, learning is accelerated; on the contrary, it is decelerated if the sign changes in two iterations. If the derivative is zero, the learning constant remains unchanged.

D. Conjugate gradient method

The basic gradient descent method changes the weights in the direction of the steepest descent (the negative direction of the gradient). This is the direction in which the function decreases most steeply. Nevertheless, the fact that the function has the steepest descent in this direction does not necessarily lead to the fastest convergence. In the conjugate gradient method [2] the search is performed along conjugate (derived) directions, which enable faster convergence than the direction of the steepest descent.

E. Pseudo-Newton methods

The Newton method is an alternative to the conjugate gradient method for quick optimisation. It belongs to the group of second-order algorithms, which consider more data on the form of the error function than only the size of the gradient [3]. Second-order methods use a quadratic approximation of the error function. If w^{(k)} is the weight vector in the k-th iteration, then the new vector of weights is

    w^{(k+1)} = w^{(k)} - H^{-1}(w^{(k)}) \, \nabla E(w^{(k)}),    (4)

where H is the Hessian matrix of the second-order partial derivatives of the error function [3]. Unfortunately, the calculation of the Hessian matrix for an open-loop neural network is complicated and time consuming. For this reason a simplified inverted Hessian matrix is used: the non-diagonal elements are all set to zero and only the diagonal elements are calculated. Such a method is called the pseudo-Newton method [3]. In this case the update equation for the individual weights becomes

    w_i^{(k+1)} = w_i^{(k)} - \frac{\partial E / \partial w_i}{\partial^2 E / \partial w_i^2}.    (5)

Although the pseudo-Newton method converges in fewer iterations, it requires more computation in each iteration than the conjugate gradient method. The simplified Hessian matrix of dimension n x n, where n is the number of weights and thresholds in the network, needs to be saved in each iteration. For larger networks it is therefore more convenient to use the jumping method or the conjugate gradient method.

F. One-step secant method

From the second-order information only one-dimensional minimisation steps are used, together with information on the curvature of E in the direction of change, obtained from the current and the previous partial derivative of E in this direction. The secant step method is based on independent optimisation steps for the individual weights. It uses a quadratic one-dimensional approximation of the error function. The change of an individual weight in the k-th step is defined as

    \Delta w_i^{(k)} = \frac{\partial E^{(k)} / \partial w_i}{\partial E^{(k-1)} / \partial w_i - \partial E^{(k)} / \partial w_i} \, \Delta w_i^{(k-1)},    (6)

where it is assumed that \partial E / \partial w_i is calculated in steps (k-1) and k with the use of the weight change \Delta w_i^{(k-1)} obtained from the previous secant or standard gradient step.

G. Levenberg-Marquardt method

The Levenberg-Marquardt method is, similarly to the pseudo-Newton method, designed to reach second-order learning speed without the necessity to compute the Hessian matrix [5]. When the minimised function (equation (2)) has the form of a sum of squares (typical for the learning of open-loop networks), it can be written in the following form:

    E = e^T e,    (7)

    e^T = [ e_{11} ... e_{S_3 1}, e_{12} ... e_{S_3 2}, ..., e_{1p} ... e_{S_3 p} ],   e_{kp} = d_{kp} - y_{kp},   k = 1, ..., S_3,

where p is the number of learning cases, S_3 the number of neurons in the output layer, and e the error vector (over all p). The values of the new weights are calculated using the following equation:

    w_{ij}^{(k+1)} = w_{ij}^{(k)} - ( J^T J + \eta I )^{-1} J^T e,    (8)

where J is the Jacobian matrix containing the first derivatives of the network error function with respect to the weights and thresholds, e is the vector of network errors, I is the identity matrix and \eta is the learning constant.
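As an illustration (not the authors' code; a minimal sketch in which the hyper-parameter values and the resilient-style step-size bounds are assumptions, and the sign-based rule is a common variant of the jumping method rather than the exact formulation of [3]), the momentum rule of eq. (3), a sign-based jumping update and the Levenberg-Marquardt step of eq. (8) could be written as follows:

```python
import numpy as np

def momentum_update(dw_prev, delta, x, m_c=0.9, eta=0.05):
    """Momentum gradient step, eq. (3):
    dw(k+1) = m_c * dw(k) + (1 - m_c) * eta * delta_j * x_i."""
    return m_c * dw_prev + (1.0 - m_c) * eta * np.outer(delta, x)

def jumping_update(step, grad, grad_prev, inc=1.2, dec=0.5,
                   step_min=1e-6, step_max=50.0):
    """Jumping (sign-based) update: only the sign of the derivative is used.
    The step size grows when the sign repeats and shrinks when it flips;
    it stays unchanged when the derivative is zero."""
    same = np.sign(grad) * np.sign(grad_prev)
    step = np.where(same > 0, np.minimum(step * inc, step_max),
           np.where(same < 0, np.maximum(step * dec, step_min), step))
    return step, -np.sign(grad) * step   # new step sizes and weight changes

def levenberg_marquardt_step(w, J, e, eta=1e-3):
    """One Levenberg-Marquardt step, eq. (8):
    w(k+1) = w(k) - (J^T J + eta I)^{-1} J^T e,
    where J is the Jacobian of the error vector e with respect to w."""
    n = w.size
    dw = np.linalg.solve(J.T @ J + eta * np.eye(n), J.T @ e)
    return w - dw
```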
If \eta equals zero, equation (8) becomes the Newton method with a simplified Hessian matrix. For higher values of \eta, equation (8) approaches the gradient descent method with a small step. In each iteration \eta is adapted in such a way that convergence is ensured.

IV. COMPARISON OF ANN LEARNING WITH IMPROVED METHODS IN THE CASE OF POWER TRANSFORMER INRUSH CURRENT RECOGNITION

One of the essential properties of the ANN is the inclusion of expert knowledge obtained from the analysis of the operation of the power transformer in steady-state conditions, during transients and in the case of faults. This knowledge is included in the preparation of the characteristic patterns for learning of the ANN. In the composition of the learning patterns all the above-mentioned forms of current have to be taken into consideration. An important piece of information in the composition of the patterns is the number of discrete values that describe one pattern. This number depends on the sampling time of the primary and secondary currents during protection system operation. Fig. 2 shows the time behaviour of the twelve above-mentioned current forms and of the currents in stationary conditions, which are used for learning of the ANN.
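The moving-window construction of the learning patterns described in the following paragraphs can be sketched as below. This is not the authors' code: the function name is hypothetical, the window length of 20 samples matches the pattern size stated later in the text, and the input waveform is only a random placeholder for the recorded currents.

```python
import numpy as np

def build_pattern_matrix(current_samples, window=20):
    """Build the ANN learning-pattern matrix with a moving window:
    each column holds `window` consecutive samples of the sampled
    primary current, so M samples yield M - window + 1 patterns."""
    n_patterns = len(current_samples) - window + 1
    patterns = np.empty((window, n_patterns))
    for k in range(n_patterns):
        patterns[:, k] = current_samples[k:k + window]
    return patterns

# Hypothetical usage: 1319 samples would give a 20 x 1300 pattern matrix,
# matching the dimensions reported for the learning set in the paper.
samples = np.random.randn(1319)        # placeholder for real recorded currents
P = build_pattern_matrix(samples, window=20)
print(P.shape)                          # (20, 1300)
```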

Fig. 2  Time behaviour of the primary winding currents and the corresponding desired outputs for learning of the ANN in recognition of the transformer inrush current

The figure also shows the desired outputs for the individual forms (d = 0, no inrush current, or d = 1, inrush current), which are also needed for learning of the ANN. From the patterns, each containing 20 samples of the signals shown in Fig. 2, a set of patterns stored in a matrix of dimension 20x1300 is created using the moving-window principle. This means that the neural network is learned with 1300 patterns, each of them containing 20 samples. Tangent sigmoid transfer functions were used in all hidden layers, while a linear transfer function was used in the output layer.

To make the comparison of the learning results easier, we decided to present the results in graphical form. The numbers from 1 to 8 on the horizontal axis of all graphs represent the learning methods (1 standard gradient, 2 momentum gradient, 3 adaptive learning constant, 4 jumping gradient, 5 conjugate gradient, 6 pseudo-Newton, 7 one-step secant, 8 Levenberg-Marquardt method). In the first step we found the optimum number of neurons in the hidden layers (two-layer 9-1, three-layer 20-9-1, four-layer 13-9-6-1 neurons) on the basis of the minimum errors at 200 learning epochs. The number of learning epochs was then increased to 1000, and for it the learning error, the test error and the learning time were calculated. The learning error for all learning methods is shown in Fig. 3.

Fig. 3  Values of the learning error for the different learning methods

The test error, shown in Fig. 4, represents the error between the actual output and the desired output and was calculated on the basis of equation (9) as

    E_{test} = \frac{1}{P_{test}} \sum_{p=1}^{P_{test}} E_p = \frac{1}{P_{test}} \sum_{p=1}^{P_{test}} \sum_{j=1}^{S_3} ( d_{jp} - y_{jp} )^2,    (9)

where P_{test} is the number of all patterns that were used for testing the neural network. In contrast to the learning errors, the test errors are higher for the second-order methods.

Fig. 4  Values of the test errors for the different learning methods

The second-order methods also have the longest learning times, as shown in Fig. 5.

Fig. 5  Learning times for the different learning methods

Lower learning errors obtained by means of the improved learning methods do not necessarily mean lower test errors, which is just another proof that it is very difficult to set adequate learning conditions and criteria before learning. This is most obvious for the Levenberg-Marquardt method, which has the lowest learning errors but at the same time the highest test errors.

V. CONCLUSION

A comparison of the learning and testing results yielded by the different improved methods was made on the case of transformer inrush current recognition with an ANN. As expected before the analyses, the best learning results were obtained by the second-order methods, such as the Levenberg-Marquardt, pseudo-Newton and one-step secant methods. A significant improvement in learning was also shown by the jumping and conjugate gradient methods. The standard gradient, momentum gradient and adaptive constant methods proved to be methods with a rather poor convergence rate.

An important aspect in the learning of the ANN was to include all characteristics of the transient and steady-state conditions of the power transformer in a representative way. This enabled the ANN to recognise well enough the current patterns that were not used in the learning process.

The decisive criterion of learning was the square of the error between the actual and the desired output for a single pattern. It was the lowest for the Levenberg-Marquardt method in all selected network topologies, which demonstrates that this method is at present undoubtedly the best method for learning open-loop neural networks. Nevertheless, on the basis of this piece of information alone it is not possible to draw an adequate conclusion about the operation of the already learned neural network. The test error of this method is the highest in all network topologies. The ANN was able to remember the learning patterns very well, but did not learn to generalise to new testing patterns. This is a consequence of the so-called excessive adaptation (over-fitting), which occurs when too many neurons are used in the hidden layers. The same problem also occurred with all the other methods that were used for learning of the ANN. The test error was lower for the other methods because the accuracy of learning was lower, which caused a higher degree of generalisation of the results. This somewhat compensated the influence of the excessive adaptation caused by the too high number of neurons in the hidden layers, and thus brought satisfactory results, from the point of view of differential protection, only in the testing of the ANN that was learned with the jumping gradient method.

The guidelines indicated by the presented results and by the findings of other authors reveal some new insights into learning that could be of great help in understanding such a complex problem as the building and learning of an ANN.

REFERENCES

[1] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning internal representations by error propagation, MIT Press, Cambridge, MA (1986).
[2] Neural Network Toolbox: for use with MATLAB, http://www.mathworks.com/access/helpdesk/help/toolbox/nnet/nnet.shtml (2004).
[3] R. Rojas, Neural networks: a systematic introduction, Springer, Berlin (1996).
[4] J. Pihler, "Power transformer protection with neural network", doctoral dissertation, Technical Faculty of Maribor (1995).
[5] B. M. Wilamowski, S. Iplikci, O. Kaynak, M. Efe, An Algorithm for Fast Convergence in Training Neural Networks, http://nn.uidaho.edu/pap/ (2006).
[6] S. Šuster, "Learning the artificial neural networks with improved error back-propagation methods", diploma thesis, University of Maribor, Faculty of Electrical Engineering and Computer Science (2006).