Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2014), February 17-19, 2014, Innsbruck, Austria

GRADUAL INFORMATION MAXIMIZATION IN INFORMATION ENHANCEMENT TO EXTRACT IMPORTANT INPUT NEURONS

Ryotaro Kamimura
IT Education Center and School of Science and Technology, Tokai University
4-1-1 Kitakaname, Hiratsuka, Kanagawa, Japan
email: ryo@keyaki.cc.u-tokai.ac.jp

Ryozo Kitajima
School of Science and Technology, Tokai University
4-1-1 Kitakaname, Hiratsuka, Kanagawa, Japan
email: 3btad4@mail.tokai-u.jp

ABSTRACT
In this paper, we propose a new information-theoretic method called gradual information maximization to detect important input neurons (variables) in self-organizing maps. The information enhancement method was developed to detect important components in neural networks. However, we have found that the information enhancement method does not necessarily acquire enough information to detect important neurons. Gradual information maximization aims to acquire as much as possible of the information generated in the course of learning, so that information accumulated at every stage of learning can be used to detect important neurons. We applied the method to the analysis of a public opinion poll on a city government in the Tokyo metropolitan area. The method clearly extracted one important variable, meeting places. By carefully examining the city's public documents, we found that the problem of meeting places was considered one of the city's most serious financial problems. Thus, the finding obtained by gradual information maximization represents an important problem in the city.

KEY WORDS
Gradual information maximization, information enhancement, SOM, information-theoretic method, public opinion poll.
1 Introduction

1.1 Problems of Information Enhancement

We have proposed an information-theoretic method called information enhancement [1], [2] to detect the importance of components in neural networks. The method is based on the supposition that competitive learning [3], [4], [5] is a realization of mutual information maximization between output neurons and input patterns [6], [7], [8]. In computing the information enhancement, we focus on, or enhance, a component in a neural network and compute mutual information. If mutual information increases under this enhancement, the component is considered important. We applied the method to the detection of the importance of input neurons or variables [1], [2]. Variable selection is an important area in neural networks as well as machine learning [9], [10], [11]. When variable selection is applied to unsupervised learning such as competitive learning and the SOM, a serious problem arises, because there are no explicit criteria to measure the importance of input variables. In information enhancement learning, the explicit criterion for importance is mutual information between output neurons and input patterns. We have applied the information enhancement based on mutual information to many problems [12], [13], [1], [2], [14], [15]. However, we found that we could not necessarily increase the information content in neurons enough to extract a small number of important input variables. In particular, when the problems become complex, the information enhancement method is not good at increasing information and extracting a small number of important input variables.

1.2 Gradual Information Maximization

As stated above, the information enhancement method does not necessarily increase information and extract the important input variables. This is due to its inability to maximize the information content in output neurons as well as input neurons.
We found through experiments that the information contained in input and output neurons can be obtained by a gradual change in the spread parameter, that is, by a slow acquisition of information content. This approach is well known among information-theoretic methods as the annealing method. However, we call it gradual information maximization to stress that our method tries to increase information. Though the information enhancement method also used iterative procedures to obtain important neurons [2], [14], [15], those procedures were restricted to a few iteration steps. We extend these restricted operations to a more general approach for obtaining sufficient information.

DOI: 10.2316/P.2014.816-8
1.3 Outline

In Section 2, we first show that competitive learning can be described by mutual information maximization. Then, we present how to compute the information enhancement. Finally, gradual information maximization is explained intuitively. In Section 3, we apply the method to the analysis of an opinion poll by a local government in the Tokyo metropolitan area. In the experiments, we show that information increased rapidly with gradual information maximization, while with standard information maximization it increased very slowly. In addition, with gradual information maximization, a clearer class structure was revealed and only one input neuron fired, while all the others ceased to do so.

2 Theory and Computational Methods

2.1 Information-Theoretic Competitive Learning

We have found that competitive learning can be realized by maximizing mutual information between output neurons and input patterns [6], [7], [8]. The information enhancement method is based on the supposition that competitive learning is a realization of this mutual information maximization. As shown in Figure 1, let $p(s)$ denote the probability of occurrence of the $s$th input pattern and $p(j \mid s)$ the firing probability of the $j$th output neuron for the $s$th input pattern. Then mutual information is

MI = \sum_{s=1}^{S} \sum_{j=1}^{M} p(s)\, p(j \mid s) \log \frac{p(j \mid s)}{p(j)},   (1)

where

p(j) = \sum_{s=1}^{S} p(s)\, p(j \mid s).   (2)

When this mutual information is maximized, just one neuron fires for each input pattern, while all the others cease to do so. Thus, mutual information is expected to correspond to competitive learning. The importance of components in a neural network can be immediately determined with respect to this mutual information: if a component contributes more to mutual information, it is considered more important; if a component does not contribute to mutual information, it is considered less important.
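The mutual information of Equations (1) and (2) can be computed directly from the table of firing probabilities. The following is a minimal NumPy sketch; the function name and the array layout (one row per input pattern) are our own conventions, not from the paper:

```python
import numpy as np

def mutual_information(p_j_given_s, p_s=None):
    """Mutual information between output neurons and input patterns.

    p_j_given_s : (S, M) array; row s holds the firing probabilities p(j|s).
    p_s         : (S,) array of pattern probabilities p(s); uniform if None.
    """
    S, M = p_j_given_s.shape
    if p_s is None:
        p_s = np.full(S, 1.0 / S)
    # Eq. (2): p(j) = sum_s p(s) p(j|s)
    p_j = p_s @ p_j_given_s
    # Eq. (1): MI = sum_s sum_j p(s) p(j|s) log(p(j|s) / p(j));
    # terms with p(j|s) = 0 contribute nothing, so replace their ratio by 1.
    ratio = np.where(p_j_given_s > 0, p_j_given_s / p_j, 1.0)
    return float(np.sum(p_s[:, None] * p_j_given_s * np.log(ratio)))
```

When each pattern fires a distinct single neuron, MI reaches its maximum log M; when all neurons fire uniformly, MI is zero, matching the interpretation of Equation (1) as a measure of competition.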
2.2 Information Enhancement Method

We briefly present how to enhance a specific input neuron (variable) and compute mutual information. In the information enhancement, we try to determine the importance of input neurons with respect to the information in output neurons, as shown in Figure 1.

Figure 1. Network architecture for gradual information maximization.

The $s$th input pattern is represented by $\mathbf{x}^s = [x^s_1, x^s_2, \ldots, x^s_L]^T$, $s = 1, 2, \ldots, S$. Connection weights into the $j$th output neuron are $\mathbf{w}_j = [w_{1j}, w_{2j}, \ldots, w_{Lj}]^T$, $j = 1, 2, \ldots, M$. The output from the $j$th output neuron, with the $k$th input neuron enhanced, is defined by

v^s_{j,k} = \exp\left( -\sum_{l=1}^{L} \frac{(x^s_l - w_{lj})^2}{2\sigma_{kl}^2} \right),   (3)

where $\sigma_{kl}$ denotes the spread parameter, controlled by the parameter $\beta$ ($\beta > 0$):

\sigma_{kl} = \begin{cases} 1/\beta, & k = l \ (\text{enhanced}) \\ \beta, & \text{otherwise.} \end{cases}

When we enhance the $k$th input neuron, we use the parameter $1/\beta$, while the remaining input neurons are relaxed by the parameter $\beta$. By normalizing this output, we have the firing probability

p(j \mid s; k) = \frac{v^s_{j,k}}{\sum_{m=1}^{M} v^s_{m,k}}.   (4)

By using this probability, we have the mutual information when the $k$th input neuron is enhanced,

MI(k) = \sum_{s=1}^{S} \sum_{j=1}^{M} p(s)\, p(j \mid s; k) \log \frac{p(j \mid s; k)}{p(j; k)},   (5)

where

p(j; k) = \frac{1}{S} \sum_{s=1}^{S} p(j \mid s; k).   (6)

With this mutual information, we can determine the importance of input neurons, expressed as the firing rates

p(k) = \frac{MI(k)}{\sum_{l=1}^{L} MI(l)}.   (7)
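Equations (3)-(7) can be sketched as follows; this is a minimal NumPy illustration under the assumption of uniform pattern probabilities $p(s) = 1/S$, and the function names and array shapes are our own:

```python
import numpy as np

def _mutual_information(p_js):
    """MI of Eq. (5) for an (S, M) table of firing probabilities,
    assuming uniform pattern probabilities p(s) = 1/S."""
    S = p_js.shape[0]
    p_j = p_js.mean(axis=0)                       # Eq. (6)
    ratio = np.where(p_js > 0, p_js / p_j, 1.0)   # zero terms contribute 0
    return float(np.sum(p_js * np.log(ratio)) / S)

def enhanced_firing(X, W, k, beta):
    """Firing probabilities p(j|s; k) with the kth input enhanced, Eqs. (3)-(4).

    X : (S, L) input patterns; W : (L, M) connection weights; beta > 0.
    """
    L = X.shape[1]
    sigma = np.full(L, float(beta))   # relaxed inputs: sigma_kl = beta
    sigma[k] = 1.0 / beta             # enhanced input: sigma_kk = 1/beta
    diff2 = (X[:, :, None] - W[None, :, :]) ** 2            # (S, L, M)
    v = np.exp(-(diff2 / (2.0 * sigma[None, :, None] ** 2)).sum(axis=1))
    return v / v.sum(axis=1, keepdims=True)                 # Eq. (4)

def importance(X, W, beta):
    """Relative importance p(k) of each input neuron, Eq. (7)."""
    mi = np.array([_mutual_information(enhanced_firing(X, W, k, beta))
                   for k in range(X.shape[1])])
    return mi / mi.sum()
```

Enhancing a discriminative input shrinks its spread to $1/\beta$, so that input dominates the distances and sharpens the competition, yielding a larger $MI(k)$ than enhancing an uninformative input.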
When the firing probability becomes higher, mutual information becomes larger. The importance of an input neuron corresponds to how much it can increase mutual information. We then consider another kind of information, defined for input neurons. The input information is defined as the decrease from the maximum uncertainty to the observed uncertainty of input neurons,

I = \log L + \sum_{k=1}^{L} p(k) \log p(k).   (8)

When this input information increases, fewer input neurons fire. When it is maximized, only one input neuron fires, while all the others cease to do so.

2.3 Gradual Information Maximization

In gradual information maximization, connection weights are obtained through two nested learning cycles, namely, the inner and the outer learning cycle. In the outer learning cycle, the spread parameter $\beta$ is gradually increased, using the connection weights obtained at the previous, $(\beta-1)$th step, where $\beta = 1, 2, \ldots$. In the inner learning cycle, the parameter $\beta$ is fixed, and learning continues until no change in connection weights can be seen. The steps in the inner learning cycle are denoted by $\theta = 1, 2, \ldots, \theta_f$, where $\theta_f$ is the final step.

Let us show how to compute the connection weights. At the $\beta$th stage of outer learning, the firing probabilities of input neurons and the connection weights from the $(\beta-1)$th stage of outer learning are used. Then, the inner learning begins: winners are determined and connection weights are updated until they cease to change, namely, until the learning cycle reaches the $\theta_f$ step. Then, the $(\beta+1)$th outer learning cycle begins with the same inner learning procedures. In more detail, the parameter $\beta$ is set in the outer learning cycle, and then the inner learning cycle begins with the fixed value of the parameter.
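The input information of Equation (8) is simply $\log L$ minus the entropy of the firing rates $p(k)$; a one-line sketch (the function name is ours):

```python
import numpy as np

def input_information(p):
    """Input information, Eq. (8): log L minus the entropy of the firing
    rates p(k). It is 0 for uniform rates and log L when a single input
    neuron carries all the firing."""
    p = np.asarray(p, dtype=float)
    L = p.size
    nz = p[p > 0]          # 0 * log 0 is taken as 0
    return float(np.log(L) + np.sum(nz * np.log(nz)))
```

Maximizing this quantity therefore drives the firing rates toward a single dominant input neuron, which is exactly the variable-selection behavior the method aims for.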
Let x s and (β,θ f ) w j denote input and weight column vectors at the βth outer learning cycle and at the θ f final inner learning step, then distance between input patterns and connection weights at the (β, 1)th cycle, namely, at the βth outer cycle and the first inner cycle is (β,1) x s w j 2 = L (β 1,θ f ) p(k)(x s k (β 1,θ f ) w kj ) 2. k=1 The (β,1) c s th winning neuron is computed by (9) (β,1) c s = argmin j (β,1) x s w j. (1) Let us consider the following neighborhood function usually used in self-organizing maps ( h j (β,1) c = exp r ) j r(β,1) c s 2 s 2σγ 2, (11) where r j and r(β,1) c denote the position of the jth and s the (β,1) c s th neuron on the output space and σ γ is a spread parameter. Then, the re-estimation equation in the batch mode becomes S (β,1) s=1 w j = h j (β,1) c sxs S s=1 h. (12) j (β,1) c s As mentioned, the inner learning cycle continues until a certain stopping criterion is met, namely, until the inner learning cycle reaches its final step of (β, θ f ). We consider the inner learning cycle to be finished when distances between connection weights at the present and at the previous learning inner learning cycle are less than.1. Then, we must increase the value of the parameter β and then again a new inner learning cycle begins. This method lies in the accumulation of information obtained in learning. More specifically, the present learning process is based on the information obtained in the previous steps. 3 Results and Discussion We applied the method to the analysis of a public opinion poll by a local government of Tama city in the Tokyo metropolitan area. The public opinion poll data was recorded between 1981 and 211. The principal objective of this experiment is to examine whether our method can extract the small number of input neurons (variables) and whether the importance of these variables can be explained by the actual events or the problems in the city. 
3.1 Quantitative Evaluation

Figure 2 shows information (a), quantization errors (b) and topographic errors (c) when the parameter β was increased from two to ten. With gradual information maximization, information increased rapidly, as shown in Figure 2(a1), while with standard information maximization, information increased very slowly, as shown in Figure 2(b1). The standard information maximization method is one where parameter values are given directly, without considering previous states. The quantization error decreased gradually with gradual information maximization in Figure 2(a2), while with standard information maximization it decreased only slightly, with a sharp drop when the parameter was six, in Figure 2(b2). With gradual information maximization, the topographic error remained zero until the parameter β was increased to eight in Figure 2(a3); it then increased sharply when the parameter β was increased from nine in Figure 2(a3). On the other hand, with standard information maximization, the topographic error was zero for any value of the parameter β in Figure 2(b3).

The results show that gradual information maximization increased information sufficiently, while standard information maximization could not increase information to the same level. The increase did not affect the quantization errors; however, it was accompanied by a large increase in topographic errors. Thus, we can say that the information enhancement aims to extract more information from input patterns even at the expense of topological preservation. This suggests that we need to choose the parameter β carefully to keep the map quality appropriate when using the information enhancement by gradual information maximization.

Figure 2. Input information (INF), quantization (QE) and topographic errors (TE) by gradual and standard information maximization.

3.2 Visual Evaluation

We then evaluated visual performance by computing the standard U-matrix representing distances between neurons. The U-matrix method has been used to detect class boundaries in the SOM [16]. Figure 3 shows the U-matrix (a) and labels (b) by the conventional SOM. Though a class boundary seemed to be present in the middle of the matrix, it was rather weak. Figure 4 shows U-matrices by gradual information maximization when the parameter β was increased from two (a) to ten (e). When the parameter β was two in Figure 4(a), the U-matrix was the same as that of the SOM in Figure 3(a). When the parameter was increased from four in Figure 4(b) to eight in Figure 4(d), the class boundary in the middle of the matrix, shown in warmer colors, became apparent. The class boundary then became weaker when the parameter β reached ten in Figure 4(e). As shown in Figure 2(a3), until the parameter β was increased to eight, the topographic error was zero; when the parameter β was increased from nine to ten in Figure 2(a3), the topographic error increased rapidly.
This shows that we must carefully choose the parameter β for visualization, paying due attention to topographic and quantization errors. Figure 5 shows U-matrices by standard information maximization. As can be seen in the figure, though the class boundary in warmer colors became clearer, it was weaker than that by gradual information maximization in Figure 4. Figure 6 shows the U-matrix and labels by gradual information maximization when the parameter β was eight. As shown in the figure, the clear class boundary in warmer colors divided the input patterns into two classes, namely, before and after 2000. This means that between before and after 2000, there existed a sharp gap in public opinion in the city.

Figure 3. U-matrix and labels by SOM.

Figures 7(a) and (b) show the firing rates by gradual and standard information maximization. When the parameter β was increased from two in Figure 7(a1) to ten in Figure 7(a5), only input neuron No. 14 won the competition and fired strongly, while all the other neurons ceased to do so. On the other hand, with standard information maximization in Figure 7(b), though input neuron No. 14 gradually became stronger, it was weaker than with gradual information maximization in Figure 7(a). This result shows that gradual information maximization condensed much of the information in the patterns into one input neuron, while with standard information maximization it was impossible to detect a small number of input neurons.
Figure 4. U-matrices by gradual information maximization, when the parameter β was increased from two (a) to ten (e).

Figure 5. U-matrices by standard information maximization, when the parameter β was increased from two (a) to ten (e).

3.3 Discussion

The gradual information maximization procedures were successful in gradually accumulating information in input neurons in the course of learning. The obtained information content was far larger than that by standard information maximization without the accumulation of information, as shown in Figure 2. However, we can point out two problems of the method, namely, the choice of the parameter and heavy computation. First, in gradual information maximization, the parameter β was increased gradually while checking quantization and topographic errors. When too much information is accumulated, topological preservation tends to be violated and topographic errors tend to increase, as shown in Figure 2. At the present stage of research, we do not yet have any criteria for obtaining an optimal value of the parameter β. Thus, we need to examine the relations between the parameter and topological preservation more exactly to determine the optimal value of the parameter. Second, gradual information maximization is computationally expensive, because we must compute mutual information for each input neuron, every time the parameter is increased. Thus, we need to simplify the computational procedures as much as possible, in particular when the method is applied to large-scale practical problems.

Figure 6. U-matrix and labels by gradual information maximization, when the parameter β was eight.
We have seen that the opinion poll data was divided into two periods, namely, before and after 2000, in Figure 6, and that the most important variable was No. 14, representing meeting places, in Figure 7. This means that the variable meeting places had much influence on the public opinion poll. We tried to find some evidence to support this finding by gradual information maximization. We then
found a white paper published by the city in 2003 on a financial problem of the city. The white paper stated that the number of meeting places in the city was much larger than that of the other neighboring cities, and that the majority of the meeting places were very old and should be rebuilt. This finding in the white paper implies that the problem of meeting places became serious in the city around 2000. This fact certainly supports the importance of input variable No. 14 detected by our method.

4 Conclusion

We have proposed a new computational method called gradual information maximization to improve the ability of the information enhancement method to detect important input neurons (variables). The essence of the new method lies in the gradual change in the parameter β (outer learning cycle) and in the firing rates (inner learning cycle), in order to accumulate information content. By using this computational method, information is gradually accumulated in the course of learning. We applied the method to the analysis of a public opinion poll of a city in the Tokyo metropolitan area. We found that the input variable meeting places played an important role in the public opinion poll. The white paper by the city confirmed the importance of input variable No. 14, because the problem became a serious financial one in the city. Thus, the finding by our method corresponds well to the facts and problems in the city. Finally, because gradual information maximization needs the expensive computation of mutual information, we need to simplify the computational procedures as much as possible for more practical problems.

References

[1] R. Kamimura, Information-theoretic enhancement learning and its application to visualization of self-organizing maps, Neurocomputing, vol. 73, no. 13-15, pp. 2642-2664, 2010.

[2] R.
Kamimura, Double enhancement learning for explicit internal representations: unifying self-enhancement and information enhancement to incorporate information on input variables, Applied Intelligence, pp. 1-23, 2011.

[3] D. E. Rumelhart and D. Zipser, Feature discovery by competitive learning, in Parallel Distributed Processing (D. E. Rumelhart, G. E. H., et al., eds.), vol. 1, pp. 151-193, Cambridge: MIT Press, 1986.

[4] T. Kohonen, Self-Organization and Associative Memory. New York: Springer-Verlag, 1988.

[5] T. Kohonen, Self-Organizing Maps. Springer-Verlag, 1995.

[6] R. Kamimura, T. Kamimura, and T. R. Shultz, Information theoretic competitive learning and linguistic rule acquisition, Transactions of the Japanese Society for Artificial Intelligence, vol. 16, no. 2, pp. 287-298, 2001.

Figure 7. Firing probabilities of input neurons when the parameter β is increased from two (1) to ten (5).
[7] R. Kamimura, T. Kamimura, and O. Uchida, Flexible feature discovery and structural information control, Connection Science, vol. 13, no. 4, pp. 323-347, 2001.

[8] R. Kamimura, Information-theoretic competitive learning with inverse Euclidean distance output units, Neural Processing Letters, vol. 18, pp. 163-184, 2003.

[9] I. Guyon and A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.

[10] A. Rakotomamonjy, Variable selection using SVM-based criteria, Journal of Machine Learning Research, vol. 3, pp. 1357-1370, 2003.

[11] S. Perkins, K. Lacker, and J. Theiler, Grafting: Fast, incremental feature selection by gradient descent in function space, Journal of Machine Learning Research, vol. 3, pp. 1333-1356, 2003.

[12] R. Kamimura, Information loss to extract distinctive features in competitive learning, in IEEE International Conference on Systems, Man and Cybernetics (ISIC), pp. 1217-1222, IEEE, 2007.

[13] R. Kamimura, Conditional information and information loss for flexible feature extraction, in IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence), pp. 274-283, IEEE, 2008.

[14] R. Kamimura, Self-enhancement learning: target-creating learning and its application to self-organizing maps, Biological Cybernetics, pp. 1-34, 2011.

[15] R. Kamimura, Selective information enhancement learning for creating interpretable representations in competitive learning, Neural Networks, vol. 24, no. 4, pp. 387-405, 2011.

[16] A. Ultsch, Maps for the visualization of high-dimensional data spaces, in Proceedings of the 4th Workshop on Self-Organizing Maps, pp. 225-230, 2003.