OPTIMIZATION OF TRAINING SETS FOR HEBBIAN-LEARNING-BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7, 701 03 Ostrava 1 Czech Republic vaclav.kocian@osu.cz Abstract: The article deals with possibilities of optimizing classifiers based on neural networks that use the Hebbian learning mechanism. An experimental study was conducted. The study shows that badly designed learning patterns can, under certain circumstances, prevent the network from learning. The new term of irrelevant items of input vectors is introduced in the article. We have also introduced an optimization method. This method helps to avoid the problems caused by the so-called irrelevant items of input vectors and thus makes the learning algorithm more robust. The method is separate from the classification algorithm itself; thanks to that, it is very easy to equip any arbitrary algorithm with it. Keywords: Neural networks, Hebbian learning, irrelevant items, patterns optimization, pattern preprocessing 1 Hebbian Networks Hebbian learning theory can be summarized in the following rule: "Cells that fire together, wire together." [2] The rule seeks to explain "associative learning", in which simultaneous activation of cells leads to a strengthening of their links. The main advantage of the Hebbian algorithm is its simplicity and thus its speed. The basic variant of the algorithm only needs the operations of addition and multiplication of integers. In addition, we can consider the repeatability of the calculation as an advantage (calculations in the Hebbian algorithm are not burdened with randomness). This makes it relatively easy to study the behavior of the algorithm on specific training sets. In addition, there is a possibility that the discovered regularities will be applicable to some other types of networks.
For a description of the learning process, we consider the trivial model network with one input and one output neuron connected by a single connection (see Fig. 1). In complex networks, these rules apply to all such triplets (input, output, connection). Neural networks are taught in so-called cycles. During each such cycle, all the training patterns are presented to the network once. We derive formulas for calculating the value of the weight w after the submission of the n-th pattern. At the start, the weight w is initialized with the value I (I = 0 according to [3]):

w_0 = I.

After the presentation of each (the n-th) pattern, the current value of w is raised by the product of the appropriate input and output:

w_n = w_{n-1} + x_n * y_n,  n > 0.

Therefore we can express the weight value w at the end of the first cycle, i.e. after the presentation of all m patterns, as:

w_m = I + SUM_{i=1..m} x_i * y_i,

where the sum SUM_{i=1..m} x_i * y_i means the change of w after one cycle. Since the set of patterns presented to the network in each cycle is always the same, we can label this sum as C (the change of w after one learning cycle). The weight value at the end of the first cycle can then be written as w_1 = I + C. To calculate the value of w after the p-th cycle, we can use the expression:

w_p = I + p * C.
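The derivation above can be checked numerically. Below is a minimal sketch in Python (the pattern values are illustrative, not taken from the experiments):

```python
# Minimal sketch of the Hebbian update for a single connection.
# After p cycles over the same m patterns: w = I + p*C, C = sum(x_i * y_i).

def hebbian_weight(patterns, cycles, initial=0):
    """Update one weight by w += x*y for each (x, y) pattern, cycle by cycle."""
    w = initial
    for _ in range(cycles):
        for x, y in patterns:
            w += x * y
    return w

# Illustrative bipolar (x, y) pairs:
patterns = [(1, 1), (1, 1), (-1, -1), (1, -1)]
C = sum(x * y for x, y in patterns)   # change of w per cycle; here C = 2
p = 3
assert hebbian_weight(patterns, p) == 0 + p * C
```

Because the update is purely additive, the weight after p cycles is fully determined by I and C, which is exactly the repeatability advantage mentioned above.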
Fig. 1: Trivial neural network considered in the description of the learning process

Fig. 2: General topology of the classifier. Weights of connections w11-wij are modified in accordance with the Hebbian learning rule.

2 The original experimental study - motivation

We noticed an unexpected behavior of the classifier during our experiments with adaptation aimed at pattern recognition in time series [1]. It inspired us to study the influence of the shape of learning patterns on the ability of a neural network to adapt properly. The aim of our original experimental work was to test the ability of Hebbian networks to learn the fundamental trends from typical time series (rising, descent, resistance, support). We used two sets of artificially generated patterns. Both sets P1 (see Fig. 3, Table 1) and P2 (see Fig. 4, Table 2) contain patterns with the same meaning but derived from the original data using different methods of binarization. Pattern bitmaps (bit arrays of size 8x8) were converted into one-dimensional vectors with a length of 64 bits by concatenation of successive rows of the bitmap matrix. Each of the output vectors T with 4 bits had only one of the bits active, which determined the number of the class assigned to the pattern, i.e. 1 - Rising; 2 - Descent; 3 - Resistance; and 4 - Support.

Fig. 3: Patterns from set P1 - input pattern bitmaps (lower square) and required responses (upper rectangle)

Fig. 4: Patterns from set P2 - input pattern bitmaps (lower square) and required responses (upper rectangle)

Table 1: P1 - vectors T and S. Values of -1 are written using the character '-' and values of +1 using the character '+'. Each group of 8 characters in S is one row of the 8x8 bitmap.

Pat. T    S (8 rows of 8 bits)
1    +--- -------+ ------+- -----+-- ----+--- ---+---- --+----- -+------ +-------
2    -+-- +------- -+------ --+----- ---+---- ----+--- -----+-- ------+- -------+
3    --+- ---++--- ---++--- --+--+-- --+--+-- -+----+- -+----+- +------+ +------+
4    ---+ +------+ +------+ -+----+- -+----+- --+--+-- --+--+-- ---++--- ---++---

Table 2: P2 - vectors T and S.
Values of -1 are written using the character '-' and values of +1 using the character '+'.

Pat. T    S (8 rows of 8 bits)
1    +--- -------+ ------++ -----+++ ----++++ ---+++++ --++++++ -+++++++ ++++++++
2    -+-- ++++++++ -+++++++ --++++++ ---+++++ ----++++ -----+++ ------++ -------+
3    --+- ---++--- ---++--- --++++-- --++++-- -++++++- -++++++- ++++++++ ++++++++
4    ---+ ++++++++ ++++++++ -++++++- -++++++- --++++-- --++++-- ---++--- ---++---

2.1 The original experiment procedure

First, patterns from the set P1 (Fig. 3, Table 1) were presented to the network. The network was able to recognize only two of the four submitted patterns. Then, patterns from the set P2 (Fig. 4, Table 2) were presented to the network. The network was able to learn all patterns correctly.
Finally, patterns from the set P1 were presented (in active mode) to the network which had been adapted to P2. The network was then able to classify all patterns from the set P1 correctly. The original motivation for creating the set P2 was to verify the assumption that the presentation of "flat" patterns allows the network to obtain more general "knowledge" about the nature of the patterns. Such a network is then better able to detect sequences with a lower amplitude or a different slope of the curve. The experimental study seemed to confirm the correctness of this assumption. Moreover, if the "correct" patterns are presented to the network, it can learn to recognize patterns which it was previously impossible to learn separately.

3 Projection of the problem into simpler patterns

When analyzing the behavior described above, we decided to repeat our experimental study with simpler patterns. We created two sets R1 (Fig. 5, Table 3) and R2 (Fig. 6, Table 4). Each of them contains four patterns. The input pattern's length is 6 and the output pattern's length is 4. The behavior of the network working with these two sets was analogous to the original experiment. First, the network was not able to learn set R1. When the network was adapted with R2, it was then able to correctly classify all patterns from both R2 and R1.

Fig. 5: Set R1. The network is not able to learn it.

Table 3: Set R1, vectors T and S. Values of -1 are written using the character '-' and values of +1 using the character '+'.

Pat. T    S
1    +--- +-----
2    -+-- -+----
3    --+- --+---
4    ---+ ---+--

Fig. 6: Set R2. Once the network learns patterns from R2, it can also classify patterns from R1.

Table 4: Set R2, vectors T and S. Values of -1 are written using the character '-' and values of +1 using the character '+'.

Pat. T    S
1    +--- +---+-
2    -+-- -+--+-
3    --+- --+--+
4    ---+ ---+-+

Patterns in both sets R1 and R2 differ only in the values of the 5-th and 6-th input items.
While the values of these items are the same in all patterns of R1, they differ among the patterns of R2. Looking more carefully at both sets R1 and R2, we can see that the outputs just "copy" the first four inputs, regardless of the values of the 5-th and 6-th input items. We can intuitively say that the 5-th and the 6-th item are both irrelevant.

3.1 Adaptation

The network topology which we used is shown in Fig. 7. For each of the sets R1 and R2, a separate instance of the classifier was created. Table 5 shows the network adaptation during the first learning cycle. Since the patterns in R1 and R2 differ only in the 5-th and 6-th input bits, the first six columns of Table 5 are the same for both sets. Columns 7 and 8 show the values for the 5-th and 6-th item from R1; columns 9 and 10 show the values for the 5-th and 6-th item from R2. The closing rows of Table 5 show the weight values after the first learning cycle for both sets R1 and R2. We can state the following:

1. A total of twelve connections end the adaptation with zero weight values. Such connections can be considered redundant in terms of the network's capacity to remember or recognize patterns.
2. Every connection related to the 5-th and 6-th items has a non-zero weight value, i.e. these connections affect the work of the classifiers during both the adaptive and the active mode.
3. All connection weights w11, w22, w33 and w44 have the same value 4.
4. All connection weights wb1-wb4 (bias) have the same value -2.
For better illustration, we present the structure of the neural network without the connections with zero weight value (marked as redundant) in Fig. 8. Between the adaptations to R1 and R2, the difference is only in the weight values on the connections related to the 5-th and 6-th inputs.

Table 5: Evolution of weight values during the learning process on sets R1 and R2. The columns wb and w1-w4 are identical for both sets; the last four columns show the weights of the 5-th and 6-th inputs, first for R1 and then for R2.

Initialization:
     wb   w1   w2   w3   w4 | w5   w6 (R1) | w5   w6 (R2)
Y1    0    0    0    0    0 |  0    0      |  0    0
Y2    0    0    0    0    0 |  0    0      |  0    0
Y3    0    0    0    0    0 |  0    0      |  0    0
Y4    0    0    0    0    0 |  0    0      |  0    0

1. Step:
Y1    1    1   -1   -1   -1 | -1   -1      |  1   -1
Y2   -1   -1    1    1    1 |  1    1      | -1    1
Y3   -1   -1    1    1    1 |  1    1      | -1    1
Y4   -1   -1    1    1    1 |  1    1      | -1    1

2. Step:
Y1    0    2   -2    0    0 |  0    0      |  0    0
Y2    0   -2    2    0    0 |  0    0      |  0    0
Y3   -2    0    0    2    2 |  2    2      | -2    2
Y4   -2    0    0    2    2 |  2    2      | -2    2

3. Step:
Y1   -1    3   -1   -1    1 |  1    1      |  1   -1
Y2   -1   -1    3   -1    1 |  1    1      |  1   -1
Y3   -1   -1   -1    3    1 |  1    1      | -3    3
Y4   -3    1    1    1    3 |  3    3      | -1    1

4. Step:
Y1   -2    4    0    0    0 |  2    2      |  2   -2
Y2   -2    0    4    0    0 |  2    2      |  2   -2
Y3   -2    0    0    4    0 |  2    2      | -2    2
Y4   -2    0    0    0    4 |  2    2      | -2    2

Fig. 7: Topology of the neural network for processing patterns from training sets R1 and R2 (B = 1).

Fig. 8: Structure of the neural network (from Fig. 7) after adaptation on sets R1 and R2. Connections with zero weight values were omitted. In the case of R1, the values of the dotted and the dashed connections are identical (2); in the case of R2 they are opposite.
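The weight evolution in Table 5 follows from the plain Hebbian update w_ij := w_ij + x_i * y_j applied pattern by pattern. A minimal sketch in Python (with the bias B = 1 treated as an extra constant input) reproduces the closing rows of the table for the set R1:

```python
def hebbian_train(patterns, n_in, n_out, cycles=1):
    """Hebbian adaptation: w[i][j] += x[i] * y[j] for every pattern (y, x).
    Row index n_in of w holds the bias weights (constant input B = 1)."""
    w = [[0] * n_out for _ in range(n_in + 1)]
    for _ in range(cycles):
        for y, x in patterns:
            xb = x + [1]                      # append the bias input B = 1
            for i in range(n_in + 1):
                for j in range(n_out):
                    w[i][j] += xb[i] * y[j]
    return w

# Set R1 from Table 3 encoded as bipolar (+1/-1) vectors: (T, S) pairs.
R1 = [([+1, -1, -1, -1], [+1, -1, -1, -1, -1, -1]),
      ([-1, +1, -1, -1], [-1, +1, -1, -1, -1, -1]),
      ([-1, -1, +1, -1], [-1, -1, +1, -1, -1, -1]),
      ([-1, -1, -1, +1], [-1, -1, -1, +1, -1, -1])]

w = hebbian_train(R1, n_in=6, n_out=4)
# Closing rows of Table 5: diagonal weights 4, bias weights -2 ...
assert all(w[j][j] == 4 for j in range(4))
assert all(w[6][j] == -2 for j in range(4))
# ... and the weights of the irrelevant 5th and 6th inputs end non-zero:
assert all(w[4][j] == 2 and w[5][j] == 2 for j in range(4))
```

Running the same loop on R2 yields the opposite-signed w5/w6 values shown in the last two columns of Table 5.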
3.2 Analysis of adaptation results

Looking at the closing rows of Table 5 and at Fig. 8, it is possible to express the activation of the output neuron Y_j after one pass of the training set as follows:

Y_j = X_j * w_jj + X_5 * w_5j + X_6 * w_6j + B * w_bj    (1)

Substituting the weight values after the adaptation of R1 into equation (1), and using the fact that X_5 = X_6 = -1 in all patterns of R1, we obtain the following relation (2):

Y_j = X_j * 4 + (-1) * 2 + (-1) * 2 + 1 * (-2)    (2)

Now we can generalize equation (2); after the n-th pass we get the neuron activation expressed by formula (3):

Y_j = X_j * 4n + (-1) * 2n + (-1) * 2n + 1 * (-2n)    (3)

which can be reduced to (4):

Y_j = n * (X_j * 4 - 6)    (4)

From equation (4) it is clear that the activation value for the set R1 can never be positive: X_j takes either the value -1 or +1, therefore Y_j can only have the values -10n or -2n. As X_5 = X_6 = -1 for all patterns of the R1 set, the sum of their contributions to the value of each output neuron is the same (-4n) for every pattern, and the network will never be able to successfully learn the patterns from the R1 set.

Substituting the values related to the R2 set into equation (1) in the same way as we did with R1, we get, for the two outputs whose weights w_5j and w_6j match the signs of X_5 and X_6 in the presented pattern:

Y_j = n * (X_j * 4 + 2)    (5)

Formula (5) shows that the values of X_5 and X_6 help to deduce the correct class of the presented pattern (they restrict the choice to two possible classes). Their values in patterns 1 and 2 increase Y_1 and Y_2 by the value 4 while reducing Y_3 and Y_4 by the value 4. Their values in patterns 3 and 4 do the opposite. The weights of the connections related to the 5-th and 6-th inputs are exactly opposite. It follows that if the values of X_5 and X_6 are the same in all patterns (the case of the R1 set), their contribution to the activation value of each output in each pattern is zero. Therefore, the network adapted to R2 correctly identifies the patterns from R1 too.

4 Optimization of the classifier

As we have shown in the previous example, the difficulty with the training set R1 lies in the components X_5 and X_6, which have the same value in all patterns. Therefore, these components do not help us to assign the proper classes to the patterns.
We can describe these components as excessive (irrelevant). Moreover, during the learning process, the connections related to these components acquire non-zero weight values, which distorts the activations, and the network loses its ability to learn. Based on our experimental study, we proposed a method of evaluating the relevance of the input vector components. The principles of the method are simple:

1. Before the adaptation, the algorithm walks through the training set and identifies as irrelevant all the items whose value is the same in all patterns.
2. The weights of the connections related to the irrelevant items are ignored during the adaptation.
3. Thanks to that, such weights remain 0.

The algorithm that marks the irrelevant items can be written as follows:

1. Mark all items as irrelevant.
2. Load the input vector of the first pattern and remember the values of its items.
3. Repeat for all successive patterns:
   a. Load the input vector.
   b. Mark every irrelevant item as relevant if its actual value differs from that in the first pattern.
4. End.
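The marking algorithm above can be sketched in a few lines of Python (input vectors are assumed to be given as lists of +1/-1 values):

```python
def find_irrelevant(inputs):
    """Return a boolean mask: True where an item has the same value
    in every input vector of the training set (i.e. is irrelevant)."""
    first = inputs[0]
    irrelevant = [True] * len(first)      # 1. mark all items as irrelevant
    for vec in inputs[1:]:                # 3. walk the successive patterns
        for i, v in enumerate(vec):
            if v != first[i]:             # value differs from first pattern:
                irrelevant[i] = False     #    the item is relevant
    return irrelevant

# Input vectors S of set R1 (Table 3): items 5 and 6 are constant (-1).
R1_inputs = [[+1, -1, -1, -1, -1, -1], [-1, +1, -1, -1, -1, -1],
             [-1, -1, +1, -1, -1, -1], [-1, -1, -1, +1, -1, -1]]
assert find_irrelevant(R1_inputs) == [False, False, False, False, True, True]
```

The pass is linear in the size of the training set and, as the method description says, completely independent of the classifier that is adapted afterwards.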
It is now possible to adapt this modified classifier to both sets R1 and R2. Using this preprocessing, the neural network becomes more specialized to the actual training set, i.e. it loses some of its generalization ability. Fig. 9 shows the topology of the network, which uses the proposed algorithm for the identification of irrelevant items; these items are highlighted in gray. The related connections (dashed) are then ignored during the adaptation process. Fig. 10 shows the structure of the neural network after adaptation to R1. Connections with zero weight values were excluded.

Fig. 9: Network topology for the R1 set after preprocessing. Items X5 and X6 are marked as irrelevant. The weights of the related connections remain zero during the whole adaptation.

Fig. 10: The structure of the neural network after its adaptation to R1. Connections with zero weight values were excluded.

Finally, both original data sets P1 (see Fig. 3, Table 1) and P2 (see Fig. 4, Table 2) were presented to the adjusted classifier. Looking at Fig. 11 and Fig. 12, we can see the irrelevant items in both sets marked in gray. As expected, the classifier can now learn and correctly classify all training patterns of both sets P1 and P2. In this case, the adaptation to the P2 set no longer leads to a correct classification of the patterns of P1, but the network behavior is in line with expectations: due to the elimination of the redundant items from the training sets, the network has lost some of its generalization ability.

Fig. 11: Patterns from the set P1 showing irrelevant components (gray)

Fig. 12: Patterns from the set P2 showing irrelevant components (gray)

In the final step of our experimental study, the training set P3 was designed, which includes all patterns from the sets P1 and P2 (i.e. P3 = P1 ∪ P2). No irrelevant components were found in this united training set, and its adaptation proceeded correctly in accordance with expectations: all patterns from the P3 set were learned correctly.
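Putting the pieces together, the optimized classifier of this section can be sketched end-to-end: irrelevant items are detected first, and the corresponding connections are simply skipped during the Hebbian adaptation, so their weights stay zero. A minimal sketch under the paper's conventions (bipolar values, bias input B = 1, the output with the highest activation wins):

```python
def train_with_mask(patterns, n_in, n_out, cycles=1):
    """Hebbian adaptation that ignores connections from irrelevant inputs
    (inputs whose value is identical across all training patterns)."""
    inputs = [x for _, x in patterns]
    irrelevant = [all(v[i] == inputs[0][i] for v in inputs)
                  for i in range(n_in)]
    w = [[0] * n_out for _ in range(n_in + 1)]   # last row: bias (B = 1)
    for _ in range(cycles):
        for y, x in patterns:
            for i in range(n_in):
                if irrelevant[i]:
                    continue                     # weight remains zero
                for j in range(n_out):
                    w[i][j] += x[i] * y[j]
            for j in range(n_out):
                w[n_in][j] += y[j]               # bias update, B = 1
    return w, irrelevant

# Set R1 (Table 3) as bipolar (T, S) pairs:
R1 = [([+1, -1, -1, -1], [+1, -1, -1, -1, -1, -1]),
      ([-1, +1, -1, -1], [-1, +1, -1, -1, -1, -1]),
      ([-1, -1, +1, -1], [-1, -1, +1, -1, -1, -1]),
      ([-1, -1, -1, +1], [-1, -1, -1, +1, -1, -1])]

w, irr = train_with_mask(R1, n_in=6, n_out=4)
assert irr == [False, False, False, False, True, True]
assert all(w[4][j] == 0 and w[5][j] == 0 for j in range(4))
# Active mode: Y_j = sum_i X_i*w_ij + B*w_bj; the correct class now wins.
for y, x in R1:
    acts = [sum(x[i] * w[i][j] for i in range(6)) + w[6][j] for j in range(4)]
    assert acts.index(max(acts)) == y.index(1)
```

With the mask in place, the constant items X5 and X6 can no longer push every activation below zero, which is exactly why the modified classifier learns R1 where the plain one failed.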
5 Conclusion

In this experimental study we have managed to explain the cause of the unexpected behavior of the neural network which we had seen in previous time-series-related experiments [1]. We have designed, theoretically justified and experimentally tested a new method for the preprocessing of a training set. This method enhances the ability of a neural network to learn and classify patterns.

References

[1] Janošek, M., Kocian, V., Kotyrba, M., Volná, E.: Pattern recognition and system adaptation. In Kováčová, M. (ed.): Proceedings of the 10th International Conference on Applied Mathematics, Aplimat 2011, Bratislava, Slovakia, 2011, pp. 1217-1226.
[2] Doidge, N.: The Brain That Changes Itself. Viking Press, 2007.
[3] Fausett, L. V.: Fundamentals of Neural Networks: Architectures, Algorithms and Applications. Prentice Hall, 1994.
[4] de Castro, L. N.: Fundamentals of Natural Computing. Chapman & Hall, 2006.
[5] Bishop, C. M.: Neural Networks for Pattern Recognition. Oxford University Press, 1997.