Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013

A New Approach to Three Ensemble Neural Network Rule Extraction Using Recursive-Rule eXtraction Algorithm

Yoichi Hayashi, Ryusuke Sato, and Sushmita Mitra

Yoichi Hayashi and Ryusuke Sato are with the Department of Computer Science, Meiji University, Tama-ku, Kawasaki 214-8571, Japan (e-mail: hayashiy@cs.meiji.ac.jp; sally36505@hotmail.com). Sushmita Mitra is with the Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India (e-mail: sushmita@isical.ac.in).

Abstract: In this paper, we propose a three-ensemble neural network rule extraction algorithm. We then investigate Hayashi's first open question: can the Ensemble-Recursive-Rule eXtraction (E-Re-RX) algorithm be extended to an ensemble neural network consisting of three or more MLPs and still extract comprehensible rules? The E-Re-RX algorithm is an effective rule extraction algorithm for data sets that mix discrete and continuous attributes. Using the experimental results, we examine the three-MLP ensemble Re-RX algorithm from various points of view. Finally, we present provisional positive conclusions.

I. INTRODUCTION

A. Rule Extraction from Neural Network Ensembles

In 1990, Hansen and Salamon [1] showed that the generalization ability of learning systems based on artificial neural networks can be significantly improved through ensembles of artificial neural networks, i.e., by training multiple networks and combining their predictions via voting. Since ensembles of networks work remarkably well, they became a very popular topic in both the neural network and machine learning communities [2]. Although many authors have generated comprehensible models from individual networks, much less work has been done to explain neural network ensembles [3].

Bologna proposed the Discretized Interpretable Multi-Layer Perceptron (DIMLP), a model that generates rules from neural network ensembles [4]. The DIMLP is a special neural network model for which symbolic rules are generated to clarify the knowledge embedded within its connections and neuron activations. Bologna described how to translate symbolic rules into the DIMLP and how to extract rules from one or several combined neural networks.

The Rule Extraction From Network Ensemble (REFNE) approach [2] proposed by Zhou et al. is designed to extract symbolic rules from trained neural network ensembles that perform classification tasks. REFNE utilizes the trained ensemble to generate a number of instances and then extracts rules from those instances. REFNE can gracefully break the ties made by individual neural networks in prediction [2].

Zhou et al. [5] analyzed the relationship between an ensemble and its component neural networks in the context of both regression and classification. Their work revealed that it may be better to ensemble many, instead of all, of the available neural networks. This result is interesting because most approaches ensemble all available neural networks for prediction.

In 2012, Hara and Hayashi proposed two ensemble neural network rule extraction algorithms: the former for two-class classification [6] and the latter for multiple-class classification [7]. Both algorithms use standard MLPs and the Re-RX algorithm proposed by Setiono [8]. The recognition accuracy of these algorithms is very high.
II. PURPOSE OF THIS PAPER

In 2013, Hayashi delivered a survey on neural data analysis using the ensemble concept and posed three open questions [9] as future work regarding the Ensemble-Recursive-Rule eXtraction (E-Re-RX) algorithm. In this paper, we investigate the first of these: can the E-Re-RX algorithm be extended to an ensemble neural network consisting of three or more MLPs and still extract comprehensible rules?

III. STRUCTURE OF THE E-RE-RX ALGORITHM

A. Re-RX Algorithm

The Re-RX algorithm [8] is designed to generate classification rules from data sets that have both discrete and continuous attributes. The algorithm is recursive in nature and generates hierarchical rules. The rule conditions for discrete attributes are kept disjoint from those for continuous attributes; continuous attributes appear only in the conditions of the rules lowest in the hierarchy. The outline of the algorithm is as follows.

Algorithm Re-RX(S, D, C)
Input: Data set S with discrete attributes D and continuous attributes C.
Output: A set of classification rules.
1. Train and prune a neural network [10] using the data set S and all of its attributes D and C.
2. Let D' and C' be the sets of discrete and continuous attributes, respectively, still present in the network, and let S' be the set of data samples correctly classified by the pruned network.
3. If D' is empty, generate a hyperplane to split the samples in S' according to the values of the continuous attributes C', then stop. Otherwise, using only the discrete attributes D', generate the set of classification rules R for the data set S'.
4. For each rule Ri generated: if support(Ri) > δ1 and error(Ri) > δ2, then
   - let Si be the set of data samples that satisfy the condition of rule Ri, and let Di be the set of discrete attributes that do not appear in the rule condition of Ri;
   - if Di is empty, generate a hyperplane to split the data in Si according to the values of its continuous attributes Ci, then stop; otherwise, call Re-RX(Si, Di, Ci).

Here, the support of Ri is the proportion of the data set covered by rule Ri, and the error of Ri is the proportion of the data it incorrectly classifies.
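To make the control flow concrete, the following is a minimal Python sketch of Re-RX as outlined above. It is not the authors' implementation: train_and_prune, extract_rules, split_by_hyperplane, net.uses, net.predict, rule.matches, rule.label, rule.attributes, and sample.label are hypothetical stand-ins for the network training and pruning step [10], the decision-tree rule induction (J4.8 [12] in this paper), and the hyperplane split over continuous attributes. The default thresholds mirror the δ1 = δ2 = 0.09 used in Section IV.

# Minimal sketch of the Re-RX control flow; all helper names are hypothetical.

def support(rule, samples):
    # Proportion of the data set covered by the rule's condition.
    return sum(rule.matches(s) for s in samples) / len(samples)

def error(rule, samples):
    # Proportion of the covered samples that the rule misclassifies.
    covered = [s for s in samples if rule.matches(s)]
    return sum(s.label != rule.label for s in covered) / len(covered) if covered else 0.0

def re_rx(S, D, C, delta1=0.09, delta2=0.09):
    net = train_and_prune(S, D, C)                  # step 1: train, then prune [10]
    D2 = [d for d in D if net.uses(d)]              # step 2: attributes surviving pruning
    C2 = [c for c in C if net.uses(c)]
    S2 = [s for s in S if net.predict(s) == s.label]
    if not D2:                                      # step 3: no discrete attributes left
        return [split_by_hyperplane(S2, C2)]
    rules = extract_rules(S2, D2)                   # rules over discrete attributes only
    out = []
    for r in rules:                                 # step 4: recurse on broad, inaccurate rules
        if support(r, S2) > delta1 and error(r, S2) > delta2:
            Si = [s for s in S2 if r.matches(s)]
            Di = [d for d in D2 if d not in r.attributes]
            out.extend(re_rx(Si, Di, C2) if Di else [split_by_hyperplane(Si, C2)])
        else:
            out.append(r)
    return out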

B. Three Ensemble-Re-RX (E-Re-RX) Algorithm

We previously proposed the two-ensemble Recursive-Rule eXtraction (E-Re-RX) algorithm [6], [7], [9]. In this algorithm, primary rules are generated, followed by secondary rules that handle only those instances that do not satisfy the primary rules, and then these rules are integrated. We showed that this reduces the complexity of using multiple neural networks. The method achieves extremely high recognition accuracy, even on multiclass problems.

In this paper, we restrict ourselves to backpropagation neural networks (MLPs) with a single hidden layer, because such networks have been shown to possess a universal approximation property. With a neural network ensemble, the final output is determined by integrating the outputs of the three neural networks. With the E-Re-RX algorithm, the overall final output is determined by integrating the extracted rules. This rule integration reduces both the number of neural networks and the number of irrelevant rules. The essentials of the three-ensemble E-Re-RX algorithm are outlined as follows.

Fig. 1. Schematic diagram of the Three Ensemble-Re-RX algorithm.

Three Ensemble-Re-RX algorithm
Inputs: Learning data sets LD, LDf, LDff
Outputs: Primary, secondary, and tertiary rule sets
1) Randomly extract data samples of an arbitrary proportion from learning data set LD, and name the set of extracted samples LD'.
2) Train and prune [10] the first neural network using LD'.
3) Apply the Re-RX algorithm to the output of step 2, and output the primary rule set.
4) Based on the primary rules, create the set LDf from the data samples that do not satisfy those rules.
5) Randomly extract data samples of an arbitrary proportion from LDf, and name the set of extracted samples LDf'.
6) Train and prune [10] the second neural network using LDf'.
7) Apply the Re-RX algorithm to the output of step 6, and output the secondary rule set.
8) Integrate the primary rule set and the secondary rule set.
9) Based on the rules integrated in step 8, create the set LDff of data samples that do not satisfy these rules.
10) Randomly extract data samples of an arbitrary proportion from LDff, and name the set of extracted samples LDff'.
11) Train and prune [10] the third neural network using LDff'.
12) Apply the Re-RX algorithm to the output of step 11, and output the tertiary rule set.
13) Integrate the primary, secondary, and tertiary rule sets.
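The step list above amounts to a loop over three train-prune-extract stages. Below is a hedged Python sketch of that pipeline; re_rx_stage, integrate, and satisfied are hypothetical wrappers for steps 2-3/6-7/11-12 (train, prune [10], and extract rules via Re-RX [8]), the integration procedure described later, and the test of whether a sample satisfies the current integrated rules.

import random

# Hedged sketch of the three-stage E-Re-RX pipeline (steps 1-13).
def ensemble_re_rx(LD, proportion, n_stages=3):
    remaining = list(LD)
    integrated = []
    for stage in range(n_stages):
        if not remaining:
            break
        # steps 1/5/10: random sample of an arbitrary proportion
        sample = random.sample(remaining, int(proportion * len(remaining)))
        # steps 2-3/6-7/11-12: train, prune, and extract one rule set
        rule_set = re_rx_stage(sample)
        # steps 8/13: integrate with the rules obtained so far
        integrated = integrate(integrated, rule_set)
        # steps 4/9: keep only the samples not satisfied by the integrated rules
        remaining = [s for s in remaining if not satisfied(s, integrated)]
    return integrated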
We now provide supplemental descriptions of the respective steps. First, in step 1, LD' is created by taking an arbitrary proportion of the learning data set LD; the data samples for LD' are randomly selected. The main purpose of selecting an arbitrary proportion is to study the influence that the selected proportion has on the three-ensemble neural network. Moreover, extracting data at an arbitrary proportion enables learning with a smaller learning data set than conventional methods require. This provides the benefit of preventing overfitting, because a smaller number of data samples generally yields fewer local optima; when the search does fall into a local solution, the total number of such solutions is small, and sufficient recognition accuracy can still be obtained. Data samples that cause overfitting can also be eliminated: samples that trigger overfitting are considered disparate from the rest of the data, and a smaller learning data set lessens this disparity from the whole and thus suppresses overfitting.

In step 2, both learning and pruning use the LD' set created in step 1. The learning process and pruning algorithm follow the cited literature [10]. Note that the network has a single hidden unit, which means that the attributes left after pruning are those strongly required for classifying the classes. Thus, the overall rules are expressed with fewer attributes, which helps to secure recognition accuracy.
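The train-and-prune step follows Setiono's penalty-function approach [10]. As a rough sketch only (the exact penalty function and stopping criteria differ), the following PyTorch fragment trains a one-hidden-unit MLP with a plain L1 penalty standing in for the penalty function and then zeroes small-magnitude weights; the layer sizes, learning rate, penalty strength, and threshold are all assumptions.

import torch
import torch.nn as nn

# X: float tensor of inputs (N, d); y: long tensor of class indices (N,).
def train_and_prune(X, y, n_hidden=1, penalty=1e-3, threshold=1e-2, epochs=500):
    net = nn.Sequential(nn.Linear(X.shape[1], n_hidden), nn.Sigmoid(),
                        nn.Linear(n_hidden, 2))        # one hidden unit, as in the text
    opt = torch.optim.Adam(net.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(X), y)
        # L1 weight penalty standing in for Setiono's penalty function [10]
        loss = loss + penalty * sum(w.abs().sum() for w in net.parameters())
        loss.backward()
        opt.step()
    with torch.no_grad():
        for w in net.parameters():
            w[w.abs() < threshold] = 0.0               # prune small-magnitude weights
    return net

Under this reading, the input attributes whose connections are all zeroed count as pruned; those still connected to the hidden unit are the ones "strongly required" for the classification.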

In step 3, rules are extracted according to the Re-RX algorithm [8]. The extracted rules are appraised, and rules to be re-extracted are selected. The J4.8 algorithm [12] is used to extract the rules. The rule set ultimately obtained here is the primary rule set.

In step 4, LDf is created using the primary rule set obtained in step 3. The LDf samples are those selected from the learning data set that do not satisfy the primary rule set. Samples satisfying the primary rule set are not considered, because using them for learning, pruning, and rule extraction would lead to extraction of an identical rule set, which would be entirely redundant. By learning and pruning with samples that do not satisfy the primary rule set, input-output relations and rules can be extracted for the coverage that the primary rule set does not provide.

In step 5, LDf' is created by taking an arbitrary proportion of LDf. In step 6, learning and pruning are performed in the same manner as in step 2, using LDf'. In step 7, the obtained output is appraised according to the Re-RX algorithm, and rules to be re-extracted are selected in the same manner as in step 3, yielding the secondary rule set.

In step 8, the obtained primary and secondary rule sets are integrated. The following rules are defined and followed when integrating the rule sets:
1) When all of the attributes and their values appearing in a rule are exactly identical and the class label is identical, the rule in the secondary rule set is merged into the primary rule set.
2) When all of the attributes and their values appearing in one rule also appear in another rule with identical values, the shorter rule is absorbed into the encompassing rule, regardless of the class label.
3) Conflicting rules are resolved in favor of the primary rule set.

These integration rules are illustrated with some examples; a Python sketch of the integration follows the examples. First, rule 1) is obvious: if the attributes and their values appearing in two rules are entirely identical and their class labels are the same, the rules are identical, so the secondary rule is merged into the primary rule set. For example, assume that the primary rule set contains
R: If D42 = 0, then predict Class 1,
and the secondary rule set contains
Rf: If D42 = 0, then predict Class 1.
The attributes, values, and class labels coincide, so the rules are deemed identical and Rf is merged with R.

Next, consider the second integration rule. The following example shows an extracted rule pair that satisfies it. Assume the primary rule set contains:
R: If D42 = 1 and D38 = 0 and D43 = 0 and D27 = 0 and D24 = 0 and D45 = 0 and D2 = 0 and D21 = 1, then predict Class 1.
The secondary rule set contains:
Rf: If D24 = 0 and D2 = 0 and D45 = 0 and D21 = 1, then predict Class 2.
Here the attributes appearing in Rf are encompassed by R, and their values are identical. Although the class labels of R and Rf differ, integration is possible from the viewpoint that Rf is included in R, because these if-then rules were originally derived from decision trees. A single rule can be regarded as a branch of a decision tree, and two branches that differ in this way derive the same input-output relation. If the remaining attributes of the longer rule are considered to exert a decisive influence on the class label, then the longer rule includes the shorter one. Thus, integration is possible under this second rule.

Finally, consider the third integration rule, which handles conflicts between the primary and secondary rule sets. Two situations are envisioned: first, the attributes and their values appearing in the rules are entirely identical, but the class labels differ; second, the attributes appearing in the rules are identical but take different values, while the class labels are identical. In both cases the conflicting rule in the secondary rule set is merged into the corresponding rule in the primary rule set. The grounds for this lie in the decision-tree generation method of the J4.8 algorithm [12]: tree generation is driven by the data volume and the entropy of the data set, and branches are generated neither in excess nor under low probability. Because the primary rule set is always induced from more data than the secondary rule set, its decision tree better reflects the entire learning set in terms of the branches generated. For example, assume the primary rule set contains
R: If D42 = 0, then predict Class 1,
and the secondary rule set contains
Rf: If D42 = 0, then predict Class 2.
These rules have identical attributes and values but different class labels; in this case Rf is merged into R, which remains the sole rule. Here rule Rf of the secondary rule set is merged into rule R of the primary rule set; naturally, the reverse situation, in which R would be merged into Rf, may also arise and needs to be monitored.
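Assuming each rule is represented as a pair (conditions, label), with conditions a dict mapping attribute names to required values, the three integration rules can be sketched as follows. This is an illustration, not the authors' code; in particular, it always keeps the primary rule and does not model the reverse absorption noted above.

# Sketch of the rule-set integration in steps 8 and 13.
def subsumes(long_conds, short_conds):
    # True if every attribute test of the short rule appears, with the
    # same value, in the long rule.
    return all(a in long_conds and long_conds[a] == v
               for a, v in short_conds.items())

def integrate(primary, secondary):
    merged = list(primary)
    for conds, label in secondary:
        keep = True
        for p_conds, p_label in primary:
            if p_conds == conds:
                keep = False      # rules 1) and 3): identical attribute tests
            elif subsumes(p_conds, conds) or subsumes(conds, p_conds):
                keep = False      # rule 2): one rule encompasses the other
            elif set(p_conds) == set(conds) and p_label == label:
                keep = False      # rule 3): same attributes, different values
        if keep:
            merged.append((conds, label))
    return merged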

In step 9, the rules integrated in step 8 are used to prepare the data set LDff of samples that do not satisfy those rules. In step 10, LDff' is created by taking an arbitrary proportion of LDff. In step 11, learning and pruning are conducted in the same manner as in step 2, using LDff'. In step 12, the obtained output is appraised according to the Re-RX algorithm, and the rules to be re-extracted are selected in the same manner as in step 3, yielding the tertiary rule set. In step 13, the obtained primary, secondary, and tertiary rule sets are integrated. Execution of the above 13 steps completes the E-Re-RX algorithm.

IV. EXPERIMENTS AND RESULTS

A. Characteristics of the Data Sets

The data sets used for testing were obtained from the UCI Repository [11]. This study used data sets having both discrete and continuous attributes, for two reasons: Re-RX must be shown to be valid for such mixed data sets, and many real-world data sets consist of both discrete and continuous attributes. In essence, a method demonstrated to be valid on data sets with both attribute types is more practical. Table 1 shows, for each data set, the number of data samples, discrete attributes, continuous attributes, and classes.

TABLE 1. DATA SET CHARACTERISTICS

Data set                    Samples  Attributes  Discrete  Continuous  Classes
Card (0.5)                  690      51          45        6           2
Card (0.7)                  690      51          45        6           2
German Credit Card (0.65)   1000     63          56        7           2

The Card data set [11] was tested in two ways, with proportions of 0.5 and 0.7 when LD' was created, in order to compare the influence of the extracted proportion on the recognition accuracy and the number of rules. The German Credit Card data set [11] also concerns credit screening. As obtained from the UCI Repository, it consisted of clustered alphabetic categories and discrete values; the clustered categories were substituted with 0s and 1s, and the discrete attributes were normalized to continuous values in the range between 0 and 1. The German Credit Card data set was tested with the proportion set to 0.65 when LD' was created.

B. Results for the German Credit Data

The German Credit Card data samples were divided into training and test data: 50% of all 1,000 samples (500 samples) for training, and the remaining 50% (500 samples) for testing. Learning and pruning of the first neural network used LD', a 65% proportion of the training data, i.e., 325 samples. From the results, rules were extracted from a decision tree according to J4.8, and rules were re-extracted according to the Re-RX algorithm. The thresholds for rule re-extraction in the Re-RX algorithm were δ1 = δ2 = 0.09.
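For concreteness, the split just described (1,000 samples, 50/50 train/test, then a 65% random sample of the training half, i.e., 325 samples) can be reproduced as follows; load_german_credit is a hypothetical loader for the preprocessed UCI data [11].

import random

data = load_german_credit()               # 1,000 preprocessed samples
random.shuffle(data)
train, test = data[:500], data[500:]      # 50% / 50% split
LD_prime = random.sample(train, int(0.65 * len(train)))   # 325 samples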
The following primary rule set was obtained.

Primary rule set:
R1: If D1=0 and D29=0 and D49=0 and D15=0 and D9=0 and D20=0 and D51=0 and D32=0 and D43=0, then
  R1a: If C5 <= 0.382353, then Class 1
  R1b: If C5 > 0.382353, then Class 2
R2: If D1=0 and D29=0 and D49=0 and D15=0 and D9=0 and D20=0 and D51=0 and D32=0 and D43=1, then Class 2
R3: If D1=0 and D29=0 and D49=0 and D15=0 and D9=0 and D20=0 and D51=0 and D32=1, then Class 2
R4: If D1=0 and D29=0 and D49=0 and D15=0 and D9=0 and D20=0 and D51=1, then Class 1
R5: If D1=0 and D29=0 and D49=0 and D15=0 and D9=0 and D20=1, then Class 1
R6: If D1=0 and D29=0 and D49=0 and D15=0 and D9=1, then Class 2
R7: If D1=0 and D29=0 and D49=0 and D15=1, then Class 2
R8: If D1=0 and D29=0 and D49=1, then Class 2
R9: If D1=0 and D29=1, then Class 1
R10: If D1=1, then Class 1

The recognition rate of the test data set under the primary rule set was 71.8%. LDf was created based on this primary rule set and comprised 139 samples. LDf' was randomly extracted from LDf at a proportion of 65%, i.e., 90 samples, and was used for learning and pruning of the second neural network. Re-extraction of the rules was performed in the same manner as for the first neural network. Consequently, the following secondary rule set was obtained.

Secondary rule set:
Rf1: If D20=0 and D2=0 and D15=0, then Class 2
Rf2: If D20=0 and D2=0 and D15=1, then Class 1
Rf3: If D20=0 and D2=1 and D21=0, then Class 1
Rf4: If D20=0 and D2=1 and D21=1, then Class 2
Rf5: If D20=1 and D1=0, then Class 1 (included in R6)
Rf6: If D20=1 and D1=1, then Class 2
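The recognition rates quoted in this section are the fraction of test samples that a rule set classifies correctly. A minimal sketch follows, assuming flat rules evaluated in order with the first matching rule deciding the class; the paper does not specify the matching order, and hierarchical rules such as R1's split on C5 would need an additional predicate.

def recognition_rate(rules, samples):
    # rules: ordered list of (conditions, label) pairs;
    # samples: list of (attributes dict, true label) pairs.
    correct = 0
    for attrs, true_label in samples:
        for conds, pred in rules:
            if all(attrs.get(a) == v for a, v in conds.items()):
                correct += (pred == true_label)   # first matching rule decides
                break
    return correct / len(samples)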

A comparison between the primary rule set and the secondary rule set showed that rule Rf5 was included in R6, so Rf5 was integrated with R6 at this point. The rule set integrated up to the secondary stage was as follows.

Integrated rules (primary and secondary rule sets):
R1-R10: unchanged from the primary rule set listed above.
R11: If D20=0 and D2=0 and D15=0, then Class 2
R12: If D20=0 and D2=0 and D15=1, then Class 1
R13: If D20=0 and D2=1 and D21=0, then Class 1
R14: If D20=0 and D2=1 and D21=1, then Class 2
R15: If D20=1 and D1=1, then Class 2

The recognition rate of the test data set under this integrated rule set was 92.8%. LDff was created based on this rule set and consisted of 27 samples. LDff' was randomly extracted from LDff at a proportion of 65%, i.e., 18 samples, and was used for learning and pruning of the third neural network. Re-extraction of the rules was performed in the same manner as for the first neural network. Consequently, the following tertiary rule set was obtained.

Tertiary rule set:
Rff1: If D4=0 and D18=0 and D21=0, then Class 2
Rff2: If D4=0 and D18=0 and D21=1, then Class 1
Rff3: If D4=0 and D18=1, then Class 1
Rff4: If D4=1, then
  Rff4a: If D20 = 0, then Class 1
  Rff4b: If D20 = 1, then Class 2

A comparison between the integrated rules (primary and secondary rule sets) and the tertiary rule set revealed no conflicts and no encompassing rules. Thus, the following integrated rules were obtained.

Integrated rules (primary, secondary, and tertiary rule sets):
R1-R15: unchanged from the integrated rule set listed above.
R16: If D4=0 and D18=0 and D21=0, then Class 2
R17: If D4=0 and D18=0 and D21=1, then Class 1
R18: If D4=0 and D18=1, then Class 1
R19: If D4=1, then
  R19a: If D20 = 0, then Class 1
  R19b: If D20 = 1, then Class 2

The recognition rate of the test data set under the rules integrated up to the tertiary rule set was 97.4%. The results thus far are summarized in Table 2.

TABLE 2. GERMAN (0.65) EXPERIMENTAL RESULTS
[Table body not preserved in this copy of the paper.]

V. DISCUSSION

We examine the obtained results from various points of view for the case in which three ensemble neural networks were used. The test results for the number of rules, the recognition rates, and the number of input data are summarized in Tables 3-5.

TABLE 3. COMPARISON OF THE NUMBER OF RULES FOR EACH DATA SET
[Table body not preserved in this copy of the paper.]

A review of Table 3 shows that the number of extracted rules increased as ensemble neural networks were added. For Card, however, no rules were extracted at the three-ensemble stage. As the number of neural networks increased, the proportion of the increase became smaller.

TABLE 4. COMPARISON OF RECOGNITION RATES
[Table body not preserved in this copy of the paper.]

Recognition rates rose as neural networks were added. The margin of increase was large from the first network to the second, but it is not proportional to the rate of increase in the number of rules. This is explained by the following comparison of the networks' input data counts.

TABLE 5. COMPARISON OF THE NUMBER OF INPUT DATA SAMPLES
[Table body not preserved in this copy of the paper.]

The number of input data samples decreased as the number of neural networks increased. At the transfer to each subsequent network, the data that did not satisfy the previously extracted rules were selected and used as input. Since the amount of input decreased in considerably high proportion from the first to the second ensemble network for both Card and German, it follows that a large amount of data satisfied the extracted rules, and the recognition rate increased accordingly. Even if many rules are extracted, the recognition rate does not increase if little data satisfies them. It can be said that this testing extracted good rules, each of which efficiently covers a large amount of data.

Next, we examine the possibility of rule extraction, which is the purpose of this testing. A review of the rule extraction results in Table 3 shows that rules were extracted when three neural networks were used for German; for Card, however, no rules were extracted. The comparison of input counts in Table 5 shows that at the stage of the third network, Card had only a single-digit number of inputs, whereas German still had a two-digit number. Repeating the test not only on these data sets but also on various others, we found that unless the input to the third neural network remained at least in the two-digit range, all attributes were pruned during learning and pruning, and no rules were extracted. Whether a two-digit input count can serve as the threshold for using three ensemble neural networks cannot be concluded unconditionally, since not every data set was tested with every network configuration. Nonetheless, the following three points were learned from the testing:
1) When there was little input, learning and pruning of a neural network caused most attributes to be pruned, and no rules were extracted.
2) Because Re-RX uses only correctly recognized data for rule extraction, when there was little input, learning became insufficient, unrecognized data increased, and the data available for rule extraction became small.
3) When there was little input and the data was biased, no rules were extracted.
These outcomes occurred many times in the course of repeated testing when the number of inputs was 10 or fewer.

Next, recognition rates are related to the possibility of rule extraction. As shown in Table 4, the recognition rate of Card surpassed 80% at the first-network stage in both cases, whereas for German it was a low 70%. In the E-Re-RX algorithm, the input to the next neural network is the collection of data that did not satisfy the rules extracted by the previous network. As the recognition rate improves, the amount of data that does not satisfy a rule shrinks, and the input to the next network decreases. Since the recognition rate of the single network was already fairly high for the Card data set, the proportion of input remaining for the two-ensemble network was lower for Card than for German. Consequently, no tertiary rule set was extracted for Card, while for German, which did yield a tertiary rule set, the recognition rate ultimately became higher. Thus, regarding the possibility of rule extraction as a function of recognition rate, data sets with low single-network recognition rates tend to yield rules more easily from ensembles of two or more networks. Comparing the complexity of the two data sets, Card consists of 690 samples with 51 attributes, and German consists of 1,000 samples with 63 attributes; when the data is more complicated, the effectiveness of extracting rules with two- and three-ensemble networks should improve.

VI. CONCLUSION

The test results reported in this paper show that no rules were extracted by the three-ensemble neural network for the Card data, whereas rules were extracted for the German data. The complexity of the data (the number of samples and attributes), the recognition rates, and the number of inputs were the predominant causes. When the data was complicated, the recognition rate of the primary rule set was not high; consequently, the amount of data that did not satisfy the rules increased, and the possibility of rule extraction increased for the secondary and tertiary rule sets.

The E-Re-RX algorithm, described in recent research [6], [7], [9], can achieve high recognition rates when rules are extracted. The current testing was limited to three neural networks, but even higher recognition rates can be expected when four or five networks are applied. Conditions on the data sets become necessary, however, to apply four or five networks. Data sets that can be classified at the stage of the primary rule set, like the Card data, are not suitable for a large ensemble; if the data is classified well at an early stage, there is little room for additional ensemble networks to demonstrate their effectiveness. The recognition rates using two ensembles were already considerably high, as the experimental results show, but the foremost advantage of using three ensemble neural networks was a recognition rate even higher than that obtained with two. For data sets that are not simple and are unlikely to see much increase in recognition rate with a single neural network, the effectiveness of using multiple ensemble neural networks can be expected to increase.

Empirically, these tests confirmed that rule extraction with the three-ensemble approach was successful for a complicated two-class data set such as German. However, rule extraction using three ensemble neural networks on a multi-class data set such as Thyroid [11] was not necessarily successful; such testing will be a future research topic.

A possible development beyond the current testing is dynamically varying and optimizing the proportion of the split between training and test data. For the German data, for example, the split was set at 50%. When the training data was instead taken at a proportion of 75% during the course of testing, there was too much training: the attributes generally were not pruned from the first neural network, and an enormous number of primary rules were extracted.

Finally, as additional future work, we should investigate the following open question on the Three Ensemble-Re-RX algorithm: can it be extended to an ensemble neural network consisting of three or more MLPs and extract comprehensible rules for multi-class classification problems?

REFERENCES
[1] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 12, No. 10, pp. 993-1001, 1990.
[2] Z.-H. Zhou, "Extracting symbolic rules from trained neural network ensembles," AI Communications, Vol. 16, pp. 3-15, 2003.
[3] G. Bologna, "Is it worth generating rules from neural network ensembles?" J. of Applied Logic, Vol. 2, pp. 325-348, 2004.
[4] G. Bologna, "A model for single and multiple knowledge based networks," Artificial Intelligence in Medicine, Vol. 28, pp. 141-163, 2003.
[5] Z.-H. Zhou, "Ensembling neural networks: Many could be better than all," Artificial Intelligence, Vol. 137, pp. 239-263, 2002.
[6] A. Hara and Y. Hayashi, "Ensemble neural network rule extraction using Re-RX algorithm," in Proc. WCCI (IJCNN) 2012, Brisbane, Australia, June 10-15, 2012, pp. 604-609.
[7] A. Hara and Y. Hayashi, "A new neural data analysis approach using ensemble neural network rule extraction," in Proc. 22nd Int. Conf. Artificial Neural Networks (Lecture Notes in Computer Science, Vol. 7552), Lausanne, Switzerland, September 2012, pp. 515-522.
[8] R. Setiono, B. Baesens, and C. Mues, "Recursive neural network rule extraction for data with mixed attributes," IEEE Trans. Neural Networks, Vol. 19, pp. 299-307, 2008.
[9] Y. Hayashi, "Neural data analysis: Ensemble neural network rule extraction approach and its theoretical and historical backgrounds," in Proc. 12th Int. Conf. Artificial Intelligence and Soft Computing (Lecture Notes in Artificial Intelligence), Keynote Speech, Zakopane, Poland, June 9-13, 2013 (accepted as an invited paper).
[10] R. Setiono, "A penalty-function approach for pruning feedforward neural networks," Neural Computation, Vol. 9, No. 1, pp. 185-204, 1997.
[11] University of California, Irvine, Machine Learning Repository, http://archive.ics.uci.edu/ml/
[12] http://www.cs.waikato.ac.nz/~ml/weka/index_downloading.html