GRADUAL INFORMATION MAXIMIZATION IN INFORMATION ENHANCEMENT TO EXTRACT IMPORTANT INPUT NEURONS


Proceedings of the IASTED International Conference on Artificial Intelligence and Applications (AIA 2014), February 17-19, 2014, Innsbruck, Austria

GRADUAL INFORMATION MAXIMIZATION IN INFORMATION ENHANCEMENT TO EXTRACT IMPORTANT INPUT NEURONS

Ryotaro Kamimura, IT Education Center and School of Science and Technology, Tokai University, 4-1-1 Kitakaname, Hiratsuka, Kanagawa, Japan, email: ryo@keyaki.cc.u-tokai.ac.jp
Ryozo Kitajima, School of Science and Technology, Tokai University, 4-1-1 Kitakaname, Hiratsuka, Kanagawa, Japan, email: 3btad4@mail.tokai-u.jp

ABSTRACT
In this paper, we propose a new type of information-theoretic method called gradual information maximization to detect important input neurons (variables) in self-organizing maps. The information enhancement method was developed to detect important components in neural networks. However, we have found that the information enhancement method does not necessarily acquire enough information to detect important neurons. Gradual information maximization aims to acquire as much as possible of the information generated in the course of learning, which means that information accumulated at every stage of learning can be used for the detection of important neurons. We applied the method to the analysis of a public opinion poll on a city government in the Tokyo metropolitan area. The method clearly extracted one important variable, meeting places. By carefully examining the public documents of the city, we found that the problem of meeting places was considered one of the city's most serious financial problems. Thus, the finding obtained by gradual information maximization represents an important problem in the city.

KEY WORDS
Gradual information maximization, information enhancement, SOM, information-theoretic method, public opinion poll.

1 Introduction

1.1 Problems of Information Enhancement

We have proposed a new type of information-theoretic method called information enhancement [1], [2] to detect the importance of components in neural networks. The information enhancement method is based on the supposition that competitive learning [3], [4], [5] is a realization of mutual information maximization between output neurons and input patterns [6], [7], [8]. In computing the information enhancement, we focus on, or enhance, a component in a neural network and compute mutual information. If mutual information increases under this enhancement, the component is considered important. We applied the method to the detection of the importance of input neurons, or variables [1], [2]. Variable selection is an important area in neural networks as well as in machine learning [9], [10], [11]. When variable selection is applied to unsupervised learning such as competitive learning and the SOM, we face a serious problem, because there are no explicit criteria for measuring the importance of input variables. In information enhancement learning, the explicit criterion for importance is the mutual information between output neurons and input patterns. We have applied information enhancement based on mutual information to many problems [12], [13], [1], [2], [14], [15]. However, we found that we could not necessarily increase the information content in neurons enough to extract a small number of important input variables. In particular, when the problems become complex, the information enhancement method is not good at increasing information and extracting a small number of important input variables.
1.2 Gradual Information Maximization

We have stated that information enhancement does not necessarily increase information and extract the important input variables. This is due to the inability to maximize the information content in output neurons as well as in input neurons. We found through experiments that the information contained in input and output neurons can be obtained by a gradual change in the spread parameter, that is, by the slow acquisition of information content. This approach is well known in information-theoretic methods as an annealing method. However, we call it gradual information maximization to stress that our method tries to increase information. Though the information enhancement method used iterative procedures to obtain important neurons [2], [14], [15], those procedures were restricted to a few iterations. We extend these restricted operations into a more general approach for obtaining sufficient information.

1.3 Outline

In Section 2, we first show that competitive learning can be described by mutual information maximization. Then, we present how to compute the information enhancement. Finally, gradual information maximization is explained intuitively. In Section 3, we apply the method to the analysis of an opinion poll by a local government in the Tokyo metropolitan area. In the experiments, we show that information increased rapidly under gradual information maximization, while under standard information maximization it increased very slowly. In addition, with gradual information maximization, a clearer class structure was revealed and only one input neuron fired, while all the others ceased to do so.

2 Theory and Computational Methods

2.1 Information-Theoretic Competitive Learning

We have found that competitive learning can be realized by maximizing mutual information between output neurons and input patterns [6], [7], [8]. The information enhancement method is based on the supposition that competitive learning is a realization of this mutual information maximization. As shown in Figure 1, let p(s) denote the probability of occurrence of the sth input pattern and p(j|s) the firing probability of the jth output neuron for the sth input pattern; then mutual information is

MI = \sum_{s=1}^{S} \sum_{j=1}^{M} p(s)\, p(j \mid s) \log \frac{p(j \mid s)}{p(j)},   (1)

where

p(j) = \sum_{s=1}^{S} p(s)\, p(j \mid s).   (2)

When this mutual information is maximized, just one neuron fires, while all the others cease to do so. Thus, mutual information is expected to correspond to competitive learning. The importance of components in a neural network can be immediately determined with respect to this mutual information. If a component contributes more to mutual information, it can be considered more important. If another component does not contribute to the mutual information, it should be considered less important.

2.2 Information Enhancement Method

We briefly present how to enhance a specific input neuron (variable) and how to compute mutual information. In this information enhancement, we try to determine the importance of input neurons with respect to the information in output neurons, as shown in Figure 1.

Figure 1. Network architecture for gradual information maximization: input neurons x_k^s with input information p(k), connection weights w_kj, and output neurons with firing probabilities p(j|s) and mutual information.

The sth input pattern is represented by x^s = [x_1^s, x_2^s, ..., x_L^s]^T, s = 1, 2, ..., S. Connection weights into the jth output neuron are w_j = [w_1j, w_2j, ..., w_Lj]^T, j = 1, 2, ..., M. The output of the jth output neuron, with the kth input neuron enhanced, is defined by

v_{j,k}^{s} = \exp\left( -\sum_{l=1}^{L} \frac{(x_l^{s} - w_{lj})^2}{2\sigma_{kl}^2} \right),   (3)

where σ denotes the spread parameter. The spread parameter is changed by using the parameter β (β > 0):

\sigma_{kl} = \begin{cases} 1/\beta, & l = k \ \text{(enhanced)} \\ \beta, & \text{otherwise.} \end{cases}

When we enhance the kth input neuron, we use the parameter 1/β; the remaining neurons are relaxed by the parameter β. By normalizing this output, we obtain the firing probability

p(j \mid s; k) = \frac{v_{j,k}^{s}}{\sum_{m=1}^{M} v_{m,k}^{s}}.   (4)

Using this probability, we have the mutual information when the kth input neuron is enhanced,

MI(k) = \sum_{s=1}^{S} \sum_{j=1}^{M} p(s)\, p(j \mid s; k) \log \frac{p(j \mid s; k)}{p(j; k)},   (5)

where

p(j; k) = \frac{1}{S} \sum_{s=1}^{S} p(j \mid s; k).   (6)

With this mutual information, we can determine the importance of input neurons, considered as the firing rates

p(k) = \frac{MI(k)}{\sum_{l=1}^{L} MI(l)}.   (7)

When the firing probability becomes higher, mutual information becomes larger; the importance of an input neuron corresponds to how much mutual information it can increase. We then consider another kind of information, defined for input neurons. The input information is defined as the decrease from the maximum uncertainty to the observed uncertainty of the input neurons,

I = \log L + \sum_{k=1}^{L} p(k) \log p(k).   (8)

When this input information increases, fewer input neurons fire. When the input information is maximized, only one input neuron fires, while all the others cease to do so.
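As a rough illustration of Eqs. (3)-(8), the following NumPy sketch computes the enhanced firing probabilities, the enhanced mutual information MI(k), the importance p(k), and the input information for a given set of patterns and weights. The function name, array shapes, and the small constant added inside the logarithms are assumptions made for this sketch, not part of the original method.

```python
import numpy as np

def enhanced_importance(X, W, beta, p_s=None):
    """Sketch of the information enhancement step (Eqs. 3-8).

    X : (S, L) input patterns; W : (L, M) connection weights;
    beta : enhancement parameter (beta > 0).
    Returns MI(k) for every input neuron k, the importance
    p(k) = MI(k) / sum_l MI(l), and the input information I.
    """
    S, L = X.shape
    p_s = np.full(S, 1.0 / S) if p_s is None else p_s   # p(s), uniform by default
    eps = 1e-12                                         # numerical guard for the logs

    mi = np.zeros(L)
    for k in range(L):
        # Spread parameters: 1/beta for the enhanced neuron k, beta otherwise (Eq. 3).
        sigma = np.full(L, float(beta))
        sigma[k] = 1.0 / beta
        # Enhanced outputs v^s_{j,k} = exp(-sum_l (x^s_l - w_lj)^2 / (2 sigma_kl^2)).
        diff2 = (X[:, :, None] - W[None, :, :]) ** 2                     # (S, L, M)
        v = np.exp(-np.sum(diff2 / (2.0 * sigma[None, :, None] ** 2), axis=1))
        # Firing probabilities p(j|s; k) (Eq. 4) and p(j; k) (Eq. 6, uniform p(s)).
        p_j_given_s = v / v.sum(axis=1, keepdims=True)                   # (S, M)
        p_j = (p_s[:, None] * p_j_given_s).sum(axis=0)                   # (M,)
        # Enhanced mutual information MI(k) (Eq. 5).
        mi[k] = np.sum(p_s[:, None] * p_j_given_s
                       * np.log(p_j_given_s / p_j[None, :] + eps))
    p_k = mi / mi.sum()                                 # importance / firing rates (Eq. 7)
    input_info = np.log(L) + np.sum(p_k * np.log(p_k + eps))  # input information (Eq. 8)
    return mi, p_k, input_info
```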

2.3 Gradual Information Maximization

In gradual information maximization, connection weights are obtained through two nested stages of learning, namely, an inner and an outer learning cycle. In the outer learning cycle, the parameter β is gradually increased, using the connection weights from the previous (β−1)th step, where β = 1, 2, .... In the inner learning cycle, the parameter β is fixed, and learning is continued until no change in the connection weights can be seen. The steps in the inner learning cycle are denoted by θ = 1, 2, ..., θ_f, where θ_f is the final step of the inner learning cycle.

Let us show how to compute the connection weights. At the βth stage of outer learning, the firing probabilities of the input neurons and the connection weights from the (β−1)th stage of outer learning are used. Then the inner learning begins, in which winners are determined and connection weights are updated until the connection weights cease to change, namely, until the learning cycle reaches step θ_f. Then the (β+1)th outer learning cycle begins with the same inner learning procedure.

Let us explain this in more detail. The parameter β is set in the outer learning cycle, and then the inner learning cycle begins with this fixed value of the parameter. Let x^s and w_j^(β,θ_f) denote the input and weight column vectors at the βth outer learning cycle and the final inner learning step θ_f; then the distance between input patterns and connection weights at the (β,1)th cycle, namely, at the βth outer cycle and the first inner cycle, is

\left\| \mathbf{x}^{s} - \mathbf{w}_j \right\|^2_{(\beta,1)} = \sum_{k=1}^{L} p^{(\beta-1,\theta_f)}(k)\, \left( x_k^{s} - w_{kj}^{(\beta-1,\theta_f)} \right)^2.   (9)

The winning neuron c_s^(β,1) is computed by

c_s^{(\beta,1)} = \arg\min_j \left\| \mathbf{x}^{s} - \mathbf{w}_j \right\|_{(\beta,1)}.   (10)

We use the following neighborhood function, as usually used in self-organizing maps:

h_{j,\, c_s^{(\beta,1)}} = \exp\left( -\frac{\left\| \mathbf{r}_j - \mathbf{r}_{c_s^{(\beta,1)}} \right\|^2}{2\sigma_\gamma^2} \right),   (11)

where r_j and r_{c_s} denote the positions of the jth neuron and the winning neuron on the output space, and σ_γ is a spread parameter. The re-estimation equation in batch mode then becomes

\mathbf{w}_j^{(\beta,1)} = \frac{\sum_{s=1}^{S} h_{j,\, c_s^{(\beta,1)}}\, \mathbf{x}^{s}}{\sum_{s=1}^{S} h_{j,\, c_s^{(\beta,1)}}}.   (12)

As mentioned, the inner learning cycle continues until a stopping criterion is met, namely, until the inner learning cycle reaches its final step (β, θ_f). We consider the inner learning cycle finished when the distance between the connection weights at the present and the previous inner learning step is less than 0.1. Then the value of the parameter β is increased and a new inner learning cycle begins. The essence of this method lies in the accumulation of the information obtained during learning: the present learning process is based on the information obtained in the previous steps.
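The two nested cycles of Section 2.3 can be put together as in the sketch below, which implements the weighted distance and winner selection of Eqs. (9)-(10), the neighborhood function of Eq. (11), and the batch re-estimation of Eq. (12), and re-uses the enhanced_importance sketch above to refresh the firing rates p(k) between outer cycles. The weight initialization, the map grid argument, the β schedule, and the iteration cap are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def gradual_information_maximization(X, grid, beta_max=10, sigma_gamma=1.0, tol=0.1):
    """Sketch of the outer/inner learning cycles (Section 2.3).

    X : (S, L) input patterns; grid : (M, 2) positions of the output
    neurons on the map.  Each outer cycle raises beta and re-uses the
    weights and firing rates p(k) from the previous stage (Eq. 9);
    each inner cycle is a batch SOM update (Eqs. 10-12) run until the
    weights move less than `tol`.
    """
    S, L = X.shape
    M = grid.shape[0]
    rng = np.random.default_rng(0)
    W = X[rng.choice(S, M)].copy()          # (M, L) initial weights, sampled from the data
    p_k = np.full(L, 1.0 / L)               # uniform importance at the start

    for beta in range(2, beta_max + 1):     # outer learning cycle
        for _theta in range(200):           # inner learning cycle (theta = 1, 2, ...)
            # Weighted distances (Eq. 9) and winners (Eq. 10).
            d = np.sum(p_k[None, None, :] * (X[:, None, :] - W[None, :, :]) ** 2, axis=2)
            c = np.argmin(d, axis=1)                                        # (S,)
            # Neighborhood function on the map (Eq. 11).
            h = np.exp(-np.sum((grid[:, None, :] - grid[None, c, :]) ** 2, axis=2)
                       / (2.0 * sigma_gamma ** 2))                          # (M, S)
            # Batch re-estimation of the weights (Eq. 12).
            W_new = (h @ X) / h.sum(axis=1, keepdims=True)
            converged = np.max(np.linalg.norm(W_new - W, axis=1)) < tol
            W = W_new
            if converged:
                break
        # Refresh the firing rates p(k) with the enhancement step (Eqs. 3-7)
        # before the next outer cycle; enhanced_importance is the sketch above.
        _, p_k, _ = enhanced_importance(X, W.T, beta)
    return W, p_k
```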
3 Results and Discussion

We applied the method to the analysis of a public opinion poll by a local government, Tama city, in the Tokyo metropolitan area. The public opinion poll data were recorded between 1981 and 2011. The principal objective of this experiment is to examine whether our method can extract a small number of input neurons (variables) and whether the importance of these variables can be explained by actual events or problems in the city.

3.1 Quantitative Evaluation

Figure 2 shows information (a), quantization errors (b) and topographic errors (c) when the parameter β was increased from two to ten. With gradual information maximization, information increased rapidly, as shown in Figure 2(a1), while with standard information maximization, information increased very slowly, as shown in Figure 2(b1). The standard information maximization method is one in which parameter values are given directly, without considering previous states. The quantization error decreased gradually under gradual information maximization in Figure 2(a2), while under standard information maximization it decreased only very slightly, with a sharp drop when the parameter was six, in Figure 2(b2). Under gradual information maximization, the topographic error remained zero until the parameter β was increased to eight in Figure 2(a3); it then increased sharply when the parameter β was increased from nine. On the other hand, under standard information maximization, the topographic error was zero for every value of the parameter β in Figure 2(b3). The results show that gradual information maximization increased information sufficiently, while standard information maximization could not increase information to the same level. The increase did not affect the quantization errors, but it was accompanied by a large increase in topographic errors. Thus, we can say that the information enhancement aims to extract more information from input patterns even at the expense of topological preservation. This suggests that the parameter β must be chosen carefully to keep the map quality appropriate when using information enhancement with gradual information maximization.

Figure 2. Input information (INF), quantization (QE) and topographic errors (TE) by gradual and standard information maximization.
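The paper does not give formulas for the quantization and topographic errors; the sketch below uses the standard SOM definitions (mean distance from each pattern to its best-matching unit, and the fraction of patterns whose two best-matching units are not adjacent on the map grid), which we assume are the measures reported in Figure 2.

```python
import numpy as np

def som_quality(X, W, grid):
    """Standard SOM quality measures (assumed, not defined in the paper).

    X : (S, L) patterns; W : (M, L) weights; grid : (M, 2) integer map
    positions.  Returns (quantization error, topographic error).
    """
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)   # (S, M) distances
    order = np.argsort(d, axis=1)
    bmu1, bmu2 = order[:, 0], order[:, 1]                       # best and second-best units
    qe = d[np.arange(len(X)), bmu1].mean()                      # quantization error
    # Units count as neighbors if their grid coordinates differ by at most one step.
    adjacent = np.all(np.abs(grid[bmu1] - grid[bmu2]) <= 1, axis=1)
    te = 1.0 - adjacent.mean()                                  # topographic error
    return qe, te
```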

3.2 Visual Evaluation

We then evaluated visual performance by computing the standard U-matrix, which represents the distances between neighboring neurons. The U-matrix method has been used to detect class boundaries in the SOM [16]. Figure 3 shows the U-matrix (a) and the labels (b) obtained by the conventional SOM. Though a class boundary seemed to be present in the middle of the matrix, it was rather weak.

Figure 3. U-matrix and labels (survey years) by SOM.

Figure 4 shows U-matrices by gradual information maximization when the parameter β was increased from 2 (a) to 10 (e). When the parameter β was two, in Figure 4(a), the U-matrix was the same as that of the SOM in Figure 3(a). When the parameter was increased from four in Figure 4(b) to eight in Figure 4(d), the class boundary in the middle of the matrix, shown in warmer colors, became apparent. The class boundary then became weaker when the parameter β reached ten in Figure 4(e). As shown in Figure 2(a3), the topographic error was zero until the parameter β was increased to eight; when the parameter β was increased from nine to ten, the topographic error increased rapidly. This shows that we must choose the parameter β carefully for visualization, paying due attention to the topographic and quantization errors. Figure 5 shows U-matrices by standard information maximization. As can be seen in the figure, though the class boundary in warmer colors became clearer, it was weaker than that obtained by gradual information maximization in Figure 4.

Figure 6 shows the U-matrix and labels by gradual information maximization when the parameter β was eight. As shown in the figure, the clear class boundary in warmer colors divided the input patterns into two classes, namely, the years before and after 2000. This means that between the periods before and after 2000 there existed a sharp gap in public opinion in the city.
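For reference, a minimal sketch of the standard U-matrix computation [16] used for this visual evaluation is given below: each map unit is assigned the average distance between its weight vector and those of its grid neighbors, so large values mark class boundaries. The row-major grid ordering of the weights is an assumption of this sketch.

```python
import numpy as np

def u_matrix(W, rows, cols):
    """U-matrix of a (rows x cols) SOM whose weights W have shape
    (rows * cols, L) in row-major grid order (an assumption)."""
    W = W.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    dists.append(np.linalg.norm(W[i, j] - W[ni, nj]))
            U[i, j] = np.mean(dists)   # average distance to grid neighbors
    return U
```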
Figure 7(a) and (b) show the firing rates obtained by gradual and standard information maximization. When the parameter β was increased from two in Figure 7(a1) to ten in Figure 7(a5), only input neuron No. 14 won the competition and fired strongly, while all the other neurons ceased to do so. On the other hand, under standard information maximization in Figure 7(b), though input neuron No. 14 gradually became stronger, it remained weaker than under gradual information maximization in Figure 7(a). This result shows that gradual information maximization condensed much of the information in the input patterns into one input neuron, while with standard information maximization it was impossible to detect a small number of input neurons.

Figure 4. U-matrices by gradual information maximization when the parameter β was increased from two (a) to ten (e).

Figure 5. U-matrices by standard information maximization when the parameter β was increased from two (a) to ten (e).

3.3 Discussion

The gradual information maximization procedure was successful in gradually accumulating information in the input neurons in the course of learning. The information content obtained was far larger than that obtained by standard information maximization without the accumulation of information, as shown in Figure 2. However, we can point out two problems with the method, namely, the choice of the parameter and heavy computation. First, in gradual information maximization, the parameter β was increased gradually while checking the quantization and topographic errors. When too much information is accumulated, topological preservation tends to be violated and topographic errors tend to increase, as shown in Figure 2. At the present stage of research, we do not yet have any criterion for obtaining an optimal value of the parameter β. Thus, we need to examine the relation between the parameter and topological preservation more closely to determine how an optimal value of the parameter should be chosen. Second, gradual information maximization is computationally expensive, because we must compute mutual information for each input neuron, and every time the parameter is increased, the mutual information must be recomputed. Thus, we need to simplify the computational procedures as much as possible, in particular if the method is to be applied to large-scale practical problems.

Figure 6. U-matrix and labels (survey years) by gradual information maximization when the parameter β was eight; the class boundary separates the years before and after 2000.

We have seen that the opinion poll data were divided into two periods, namely, before and after 2000 (Figure 6), and that the most important variable was No. 14, representing meeting places (Figure 7). This means that the variable meeting places had a strong influence on the public opinion poll. We tried to find some evidence to support this finding by gradual information maximization.

We then found a white paper published by the city in 2003 on a financial problem of the city. The white paper stated that the number of meeting places in the city was much larger than in the other neighboring cities, and that the majority of the meeting places were very old and should be rebuilt. This finding in the white paper implies that the problem of meeting places became serious in the city around 2000. This fact certainly supports the importance of input variable No. 14 detected by our method.

Figure 7. Firing rates of input neurons when the parameter β is increased from two to ten: (a) gradual information maximization, (b) standard information maximization. Under gradual information maximization, input neuron No. 14 (meeting places) comes to fire almost exclusively.

4 Conclusion

We have proposed a new computational method called gradual information maximization to improve the information enhancement method for detecting important input neurons (variables). The new method lies in the gradual change of the parameter β (outer learning cycle) and of the firing rates (inner learning cycle) in order to accumulate information content. With this computational method, information is gradually accumulated in the course of learning. We applied the method to the analysis of a public opinion poll of a city in the Tokyo metropolitan area and found that the input variable meeting places played an important role in the public opinion poll. A white paper issued by the city confirmed the importance of input variable No. 14, because the problem of meeting places had become a serious financial one for the city. Thus, the finding obtained by our method corresponds well to an actual fact and problem in the city. Finally, because gradual information maximization requires the expensive computation of mutual information, we need to simplify the computational procedures as much as possible for more practical problems.

References

[1] R. Kamimura, Information-theoretic enhancement learning and its application to visualization of self-organizing maps, Neurocomputing, vol. 73, no. 13-15, pp. 2642-2664, 2010.

[2] R. Kamimura, Double enhancement learning for explicit internal representations: unifying self-enhancement and information enhancement to incorporate information on input variables, Applied Intelligence, pp. 1-23, 2011.

[3] D. E. Rumelhart and D. Zipser, Feature discovery by competitive learning, in Parallel Distributed Processing (D. E. Rumelhart and G. E. H. et al., eds.), vol. 1, pp. 151-193, Cambridge: MIT Press, 1986.

[4] T. Kohonen, Self-Organization and Associative Memory. New York: Springer-Verlag, 1988.

[5] T. Kohonen, Self-Organizing Maps. Springer-Verlag, 1995.

[6] R. Kamimura, T. Kamimura, and T. R. Shultz, Information theoretic competitive learning and linguistic rule acquisition, Transactions of the Japanese Society for Artificial Intelligence, vol. 16, no. 2, pp. 287-298, 2001.

[7] R. Kamimura, T. Kamimura, and O. Uchida, Flexible feature discovery and structural information control, Connection Science, vol. 13, no. 4, pp. 323-347, 2001.

[8] R. Kamimura, Information-theoretic competitive learning with inverse Euclidean distance output units, Neural Processing Letters, vol. 18, pp. 163-184, 2003.

[9] I. Guyon and A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.

[10] A. Rakotomamonjy, Variable selection using SVM-based criteria, Journal of Machine Learning Research, vol. 3, pp. 1357-1370, 2003.

[11] S. Perkins, K. Lacker, and J. Theiler, Grafting: Fast, incremental feature selection by gradient descent in function space, Journal of Machine Learning Research, vol. 3, pp. 1333-1356, 2003.

[12] R. Kamimura, Information loss to extract distinctive features in competitive learning, in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (ISIC), pp. 1217-1222, IEEE, 2007.

[13] R. Kamimura, Conditional information and information loss for flexible feature extraction, in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence), pp. 274-283, IEEE, 2008.

[14] R. Kamimura, Self-enhancement learning: target-creating learning and its application to self-organizing maps, Biological Cybernetics, pp. 1-34, 2011.

[15] R. Kamimura, Selective information enhancement learning for creating interpretable representations in competitive learning, Neural Networks, vol. 24, no. 4, pp. 387-405, 2011.

[16] A. Ultsch, Maps for the visualization of high-dimensional data spaces, in Proceedings of the 4th Workshop on Self-Organizing Maps, pp. 225-230, 2003.