Use of Neural Networks for Data Mining in Official Statistics


Jana Juriová
Institute of Informatics and Statistics (INFOSTAT), juriova@infostat.sk

Abstract

One of the main challenges raised by data mining applications is dealing with data sets that have a large number of variables, so-called high-dimensional data. In this respect, officially surveyed statistical data (especially large data sets and data sets with outliers) can be regarded as so-called gold mines that can be analysed to gain useful additional information hidden inside. Besides presenting the basic principles of neural networks, the main goal of this paper is to show in which cases and for what kind of databases neural networks can be appropriate in the field of statistics (with a focus on official statistics, given the inclusion of this topic in the FP7 project BLUE-ETS). An example of using a neural network for classification of data is demonstrated by experimental results on a set of test statistical data (from enterprise and trade statistics).

Keywords: neural networks, classification, BLUE-ETS

1. Introduction

The national statistical institutes of many countries collect and store vast amounts of data (survey data or administrative data) that represent potential sources of new information. This is why official statistical databases should be further analysed and examined. To analyse such large amounts of data, new sophisticated techniques of data mining and knowledge discovery have to be researched, developed and tested. Different soft computing tools, and hybrids of them, can be used. However, each soft computing methodology has its advantages and disadvantages, as well as data sets and cases for which it is suitable (Mitra et al., 2002). Every technique that uses mechanisms inspired by biological evolution and human perception belongs to the class of techniques called soft computing.
In fact, the goal of these methods is to extract useful regularities from large databases. This paper focuses solely on one class of such techniques: neural networks. First, the basic principles of neural networks are presented in the second section. The third section concentrates on their advantages that can be beneficially used in official statistics: how this technique differs from traditionally used statistical analysis, and for which cases and types of databases neural networks can be appropriate. The potential usage of neural networks for official statistics is oriented on enterprise and trade statistics, given the inclusion of this topic in the FP7 project BLUE-ETS, in which our institution (INFOSTAT) has been participating. One subsection also describes the project BLUE-ETS and the role of neural networks in it. Finally, the fourth section introduces a simple example of using neural networks for classification of data in the field of enterprise and trade statistics surveyed by the Statistical Office of the Slovak Republic.

2. The basic concept of neural networks

The basic concept of artificial neural networks is that they simulate a representation of the non-transparent knowledge and abilities of the human brain that we cannot express in words but can nevertheless use. In fact, these are abilities gained by training. A neural network is a computational model whose construction is based on an abstraction of the attributes of biological neural systems. In general, neural networks are sets of simple computational units that are highly interconnected and operate in parallel. As in nature, the connections between elements largely determine the network function. A neural network can be trained to perform a particular function by adjusting the values of the connections (weights) between elements.

Picture 1 represents a simple three-layer neural network. The first layer represents the inputs (original data) and its only function is to distribute the input signals. Actual computation is performed only by the other two layers: the hidden layer and the outputs (new information). Networks with more layers also exist, but this paper considers only the three-layer neural network.

Picture 1: A simple three-layer neural network

According to the direction in which signals spread through the network, two basic groups of neural networks are known.
In feed-forward neural networks the signal spreads from input neurons (neurons whose inputs are signals from the environment) through hidden neurons (neurons connected to other neurons by both their inputs and their outputs; some kinds of neural networks have none) towards output neurons (neurons whose outputs lead into the environment). Thus, in this network the information moves in only one direction, forward, and there are no cycles or loops in the network. The second group, recurrent neural networks, is a class of neural networks in which connections between units form a directed cycle. The feed-forward neural network was the first and is still the simplest type of artificial neural network.

Why are neural networks desirable to use? As they are based on the principles of the human brain, their main attribute is the ability to generalize. This is their big advantage, and it can be used, for example, for recognizing patterns in the presence of noise or for making decisions about current problems based on prior experience.

3. The use of neural networks in the field of statistics

On the basis of the description given in Section 2, neural networks can solve problems that are not easily solvable by the usual, traditionally used computing techniques. In fact, neural networks were not originally intended for, or suited to, data mining (Lu et al., 1996). However, this approach differs from traditional statistical data analysis in two main ways (Bengio et al., 2000). First, neural networks rely less on statistical assumptions about the actual distribution of the data. Second, they rely less on models that allow simple mathematical analysis and instead use sophisticated models that can learn complicated nonlinear dependencies from large data sets. In general, neural networks fit best for problems where the classification rules are not simple to define, for analysing dependencies between variables, or when the compiled mathematical model is too complicated to solve. Neural networks can therefore be useful for detecting possible patterns or classes of similar data. They offer a new alternative for classification of data in cases when a simple filter cannot be used, or an alternative to fuzzy logic when the borders of classes are not determined exactly.
To summarize their possible usage, neural networks in statistics are suitable in the following cases: classification to the nearest pattern in memory (the advantage is that neural networks are able to classify without an exact match), prediction of future events based on past experience, prediction of latent variables that are not easily measured, and solving non-linear regression problems.

The statistical terminology applied to neural networks is the following: the inputs are regarded as explanatory variables, the neurons are the individual units in the hidden layer (or layers) of a neural network, and the outputs are response variables (or predictions in some specific cases). An activation function is applied in the hidden layer or layers between inputs and outputs. The weights are the parameters estimated by minimizing an objective function during training (usually the sum of squared errors). In the process of training, the optimal network parameters (weights) for performing a particular task are searched for.

Generally, there are two kinds of data sets for which the usage of neural networks is appropriate. The first kind is huge data sets with unknown distributions; the second is smaller data sets with outliers, as neural networks are resistant to outliers. In general, however, one of the basic conditions for the successful use of a neural network is a large enough database, as neural networks need a lot of data to be trained sufficiently to achieve the required accuracy rate. Afterwards the trained network can be applied to new data, or to the rest of the database, to classify or predict data or to construct latent variables.
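The training process described above (searching for weights that minimize the sum of squared errors) can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the procedure used in the paper: the update rule is ordinary per-sample gradient descent with backpropagation, and the names `forward`, `sse`, `train`, the learning rate, the number of epochs and the toy data are all hypothetical choices made for this example.

```python
import math
import random


def logistic(x):
    """Logistic activation, y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Forward pass: inputs -> hidden layer -> outputs.

    Each weight row holds one weight per incoming signal plus a final bias.
    """
    hidden = [logistic(sum(w * v for w, v in zip(row, x + [1.0])))
              for row in w_hidden]
    output = [logistic(sum(w * h for w, h in zip(row, hidden + [1.0])))
              for row in w_out]
    return hidden, output


def sse(data, w_hidden, w_out):
    """Sum of squared errors over all (input, target) pairs."""
    total = 0.0
    for x, target in data:
        _, out = forward(x, w_hidden, w_out)
        total += sum((o - t) ** 2 for o, t in zip(out, target))
    return total


def train(data, n_hidden=10, epochs=2000, rate=0.5, seed=0):
    """Fit the weights by per-sample gradient descent (assumed method)."""
    rng = random.Random(seed)
    n_in, n_out = len(data[0][0]), len(data[0][1])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
             for _ in range(n_out)]
    for _ in range(epochs):
        for x, target in data:
            hidden, out = forward(x, w_hidden, w_out)
            # Output error signal: squared-error gradient (up to a constant
            # factor) times the derivative of the logistic function.
            d_out = [(o - t) * o * (1 - o) for o, t in zip(out, target)]
            # Error signal propagated back to each hidden unit.
            d_hid = [h * (1 - h) * sum(d_out[k] * w_out[k][j]
                                       for k in range(n_out))
                     for j, h in enumerate(hidden)]
            for k in range(n_out):
                for j, v in enumerate(hidden + [1.0]):
                    w_out[k][j] -= rate * d_out[k] * v
            for j in range(n_hidden):
                for i, v in enumerate(x + [1.0]):
                    w_hidden[j][i] -= rate * d_hid[j] * v
    return w_hidden, w_out


# Toy data (made up for illustration): two explanatory variables per unit,
# two response variables encoding membership of class 1 or class 2.
toy_data = [([0.1, 0.2], [1.0, 0.0]), ([0.2, 0.1], [1.0, 0.0]),
            ([0.8, 0.9], [0.0, 1.0]), ([0.9, 0.8], [0.0, 1.0])]
initial = train(toy_data, epochs=0)   # untrained weights, same seed
trained = train(toy_data)
print("SSE before training:", sse(toy_data, *initial))
print("SSE after training: ", sse(toy_data, *trained))
```

In statistical terms, `x` holds the explanatory variables, the returned output list holds the response variables, and `w_hidden` and `w_out` are the estimated parameters. Real survey data would normally be rescaled before training, which this sketch leaves out.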

3.1 Project BLUE-ETS

The project BLUE-ETS is funded by the European Commission under the 7th Framework Programme. BLUE-ETS is the acronym for BLUE-Enterprise and Trade Statistics, as it concerns official business statistics and aims at providing high-quality and robust statistical information for better policy and socio-economic research. Its main goal is to develop new tools and knowledge on selected key statistical issues and methodologies. The project consists of several interconnected work packages. One of the work packages, led by our institution INFOSTAT, focuses on innovative methods, tools and procedures that need to be developed to exploit the potential of administrative data better and more efficiently. In this work package new data mining techniques should be used for data analysis: fuzzy logic, genetic programming and neural networks. All these approaches should be tested on data from the Intrastat system and then applied to specific problems in the field of enterprise and trade statistics, for data analysis and knowledge discovery, in order to lower the response burden on enterprises.

4. An example of data classification by means of neural network

As already mentioned, one of the data mining problems is classification. Classification by neural networks can be useful for large data sets and when the classification rules are not simple to define. The disadvantage is that articulating the classification rules becomes a difficult problem (Lu et al., 1996), which complicates the computing process. However, the main contribution of neural networks to data mining stems precisely from rule extraction and clustering (Mitra et al., 2002). Typically, a network is first trained to achieve the required accuracy rate. Redundant connections of the network are then removed using a pruning algorithm.
The link weights and activation values of the hidden units in the network are analysed and classification rules are generated (Tickle et al., 1998). In this paper, however, the focus is only on the analysis and results, not on extracting classification rules.

At the very beginning of our research only a simple neural network was used for testing purposes. A three-layer feed-forward neural network was constructed and trained for classification of data into two groups. Several common network structures exist: the Kohonen model, the Hopfield model (Rojas, 1996) and the Boltzmann machine (a stochastic version of the Hopfield model used for optimization problems). In this first step of research, however, a simple neural network system was used to illustrate the possibilities and application of neural networks. Users can also build networks designed for a particular problem.

The input data are individual data surveyed for statistical units (small enterprises). In this example only two indicators are used: the registered number of employees and the registered number of female employees. Two indicators were chosen so that the results can be depicted in two-dimensional space. For the construction and training of the neural network, individual data from two branches (according to the classification NACE II) are used, so two classification groups are created. The first group of data is from the branch Warehousing and storage (NACE 52, Graph 1) and the second group from the branch Motion picture, video and television (NACE 59, Graph 2). Each group or class is characterized by a certain structure of employees: the number of all employees and the number of employed women.

Graph 1: Structure of employees

Graph 2: Structure of employees

First, a three-layer neural network was created with two inputs (data from two-dimensional space), two outputs (the two classification groups) and ten hidden neurons. The so-called logistic function was chosen as the output function of the neurons:

$$ y = \frac{1}{1 + e^{-x}} \tag{1} $$

In neural networks the logistic function is often used as an output function, especially for classification of data into two groups. If the output is classified into more than two groups, the softmax function is more suitable. (The softmax activation function is a biologically plausible approximation to the maximum operation.)

After the neural network was constructed, it was trained for classification into two groups. The trained network can be used for classification of the original data; this step serves as verification of the model. The borders of the groups created by the neural network are not exact, as the network computes the probabilities with which a given unit belongs to each class. Graph 3 depicts the function (green line) that divides the two-dimensional space into two classes (red and blue points). As we can see, the borders are not really exact. Our test data sets were not very large (for simplicity), but they contain outliers. Nevertheless, the probability of inclusion in a given class is relatively high (around 80 per cent).

Once the network is trained, new data can be run through it and classified. The network classifies new data based on the data it was trained with. If an exact match cannot be found, it matches the closest pattern in memory. In our case, for example, data from another branch can be chosen and we can analyse to which of the two classes it belongs.
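The logistic function of equation (1), the softmax alternative for more than two groups, and the step of assigning new units to the most probable class can be sketched as follows. This is an illustrative sketch only: the function names and the output scores are hypothetical, and the paper does not state how the output values of its network were turned into class probabilities, so the softmax normalization used here is an assumption.

```python
import math


def logistic(x):
    """Equation (1): y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Turn a list of output scores into probabilities that sum to one."""
    shift = max(values)                  # subtract the maximum for stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return (index of the most probable class, its probability)."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def mean_class_probability(all_scores, class_index):
    """Average probability of one class over all units of a data set."""
    probs = [softmax(scores)[class_index] for scores in all_scores]
    return sum(probs) / len(probs)


# Hypothetical output scores for three units of a new branch (made-up values,
# one score per class; index 0 is the first class, index 1 the second).
new_branch_scores = [[2.1, -0.4], [1.3, 0.2], [0.5, 0.9]]
labels = [classify(scores)[0] for scores in new_branch_scores]
share_first = mean_class_probability(new_branch_scores, 0)
print("assigned classes:", labels)
print("mean probability of the first class:", share_first)
```

For two classes the softmax of scores `a` and `b` gives the same value as the logistic function applied to `a - b`, which is why the logistic function suffices in the two-group case.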
The result is information on the similarity of the employee structure of both groups. For example, we have data from the branch Transport (NACE 49, Graph 4) and we would like to analyse which of the two branches mentioned above it resembles in its structure of employees. After applying the constructed neural network, it is clear that the branch Transport is very similar in its employee structure to the first class, the branch Warehousing and storage. The probability of the first class is 91% on average.

Graph 3: Borders of classes

Graph 4: Structure of employees

The presented approach can naturally be extended to more indicators and several classification groups. It is also worth mentioning that although neural networks are able to solve classification problems, the way they reach the solution is hidden inside the network (neural networks are regarded as so-called black boxes). In cases when the user knows the potential functional relationship between the dependent and independent variables, fuzzy systems could provide a better solution. Other situations also occur in practice: the user may have some indications of the relationship between the dependent and independent variables, but no specific structure. In that case a model can be created by a fuzzy system and a neural network can adjust the parameters of the classification model (Fullér, 1995).

Another possible use of neural networks in the field of statistics, and at the same time a big advantage, is searching for a model, that is, for the most suitable functional form. This approach is especially suitable when the user has no idea of the functional relationship between the dependent and independent variables. Otherwise a regression model is more suitable. Neural networks are one way of letting the data define their functional form themselves, and they can even find relations that are not expected at all. The disadvantage is that the estimated model is harder to interpret; this approach is therefore useful for prediction of non-linear data, but not in cases when an explanation of the relations between data is needed. On the other hand, neural networks have further advantages for prediction. They can work not only with numeric data but also with symbolic (nominal) values, and they enable predictions several steps ahead.

Further research will therefore be oriented on searching for suitable models in problematic cases by means of neural networks. It could also be interesting to examine the applicability of neuro-fuzzy systems in the field of official statistics.

5. Conclusion

This paper introduces and summarizes the potential use of neural networks (as one of the soft computing tools) in the field of official statistics for analysing large databases.
They can be useful in various cases, but the most important are classification and the modelling of nonlinear relations in large data sets or in data sets with outliers. Recently, various soft computing methodologies have been applied to different data mining problems (fuzzy logic, genetic algorithms and programming, neural networks). Each tool contributes a distinct methodology for addressing problems in its domain. In the future these methods and techniques should be used in a cooperative manner. The result would be a more intelligent and robust system providing the most suitable approximate solution.

References

Mitra S., Pal S. K., Mitra P. (2002) Data Mining in Soft Computing Framework: A Survey, IEEE Trans. Neural Networks, Vol. 13, No. 1.

Bengio Y., Buhmann J. M., Embrechts M., Zurada J. M. (2000) Introduction to the Special Issue on Neural Networks for Data Mining and Knowledge Discovery, IEEE Trans. Neural Networks, Vol. 11, pp

Fullér R. (1995) Neural Fuzzy Systems, Åbo Akademi University, Åbo.

Lu H. J., Setiono R., Liu H. (1996) Effective Data Mining Using Neural Networks, IEEE Trans. Knowledge Data Eng., Vol. 8, pp

Rojas R. (1996) Neural Networks: A Systematic Introduction, Springer-Verlag, Berlin.

Tickle A. B., Andrews R., Golea M., Diederich J. (1998) The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks, IEEE Trans. Neural Networks, Vol. 9, pp
