Volgenau School of Engineering
Final Report of Project ECE 699-002

Title: Evaluation of Learning Algorithms on the Data of a Self-Organizing Network to Select a Model for Predicting the Next Call Blocking Probability

Authors: Hosein Mohammadi Makrani (G#00975445), Mohammed Rahbini (G#00764831)
Professor: Dr. Monson H. Hayes
Fall 2015

Table of Contents
1- Introduction
2- Main Idea
3- Implementation
  3-1- Data Set
  3-2- Bayesian Neural Network
  3-3- Kernel Regression
  3-4- Ensemble Learning
4- Conclusion
5- References

1- Introduction
The increasing availability of large amounts of historical data and the need to perform accurate forecasting of future behavior in several scientific and applied domains demand efficient techniques able to infer from observations (such as time series) the stochastic dependency between past and future [1]. A time series is a set of observations x, each one recorded at a specified time t [2]. Examples of time series are [3]:
Meteorology: weather variables such as temperature, pressure, and wind.
Economy and finance: economic factors (GNP), financial indexes, exchange rates, spreads.
Marketing: business activity, sales.
Industry: electric load, power consumption, voltage, sensor readings.
Biomedicine: physiological signals (EEG), heart rate, patient temperature.
Web: clicks, logs.
Genomics: time series of gene expression during the cell cycle.
Time series are studied for several purposes, such as forecasting the future based on knowledge of the past, understanding the phenomenon underlying the measurements, or simply producing a succinct description of the salient features of the series [1]. There are three types of series variation. The first is trend, a long-term change in the mean level. The second is seasonal variation, which is periodic: over time the data show the same behavior. The last is irregular variation. Figure 1 [3] shows these variations.
Figure 1

2- Main Idea
In a Self-Organizing Network (SON), it is interesting to predict the call blocking probability at the next interval. Based on this prediction, the network can adapt itself to the new condition and prepare for likely congestion. Prediction can be done using machine learning methods. For prediction, it is essential to study and analyze the patterns that arrive in a certain period. After applying these machine learning algorithms successfully, it is feasible to learn from such time series to predict empty blocks in the spectrum and the probability of congestion in the SON in the future. Therefore, if a model of call blocking over time is available, one can easily predict upcoming delays and blockings, and essentially make the SON much more efficient at allocating its services. In this project we evaluate some learning algorithms and compare them in order to select the best algorithm to model a time series, which would allow us to predict the next observation interval from the current time. Our goal is to use machine learning techniques to learn and thus predict the future congestion in an SON. There are several candidate algorithms in the technical literature; our nominated algorithms are Bayesian neural networks, kernel regression, and ensemble learning [4]. These algorithms will be applied to a data set (a time series), and the corresponding results will be compared with each other. Eventually, the selected algorithm will be the one with the better prediction and the minimum average prediction error over some period of time (as in figure 2 [5]). The criteria for selecting an algorithm are training error, complexity, and validation error.

Figure 2
Let's define the problem of this project more clearly. In an SON, we wish to learn properties of the time series formed by the call blocking probabilities measured over a fixed time interval. We then wish to predict the congestion in the network in the short period that follows. Basically, we have access to the information of one cell of the SON. We apply each of the selected algorithms to model the call blocking, or congestion rate, probabilities measured over the finite interval. Thus, after we model the data with an algorithm, we are able to predict the call blocking/congestion probabilities for the next designated interval using that specific algorithm.

3- Implementation
In this chapter, we describe each technique thoroughly and present the prediction results for each one. The techniques are Bayesian neural networks, kernel regression, ensemble learning, and linear discriminant classification. These techniques were implemented in MATLAB for this project. We also look at our data set first.
3-1- Data Set
Our data set contains the call blocking probability of one cell of an SON. The data covers one month (4 weeks). Each interval is half an hour, so the total number of intervals is 1344. This data set was provided by the Reverb Networks company (in an Excel file) [6]. Figure 3 shows part of our data set. Some characteristics of this data set are as follows:
Average: 0.1091
Standard deviation: 0.1025
Maximum: 0.647
Minimum: 5e-6
Threshold for considering congestion: 0.2
Figure 3
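As a quick illustration, the summary statistics above can be reproduced directly from the interval series. The following is a minimal sketch only; the file name, its layout (a single numeric column of 1344 values), and the variable names are our assumptions, not taken from the report.

    % Minimal sketch (assumed file name and layout): load the 1344 half-hour
    % intervals of call blocking probability and compute the summary statistics.
    y = xlsread('call_blocking.xlsx');      % 1344-by-1 vector, one month of data
    fprintf('Average: %.4f\n', mean(y));
    fprintf('Standard deviation: %.4f\n', std(y));
    fprintf('Maximum: %.4f\n', max(y));
    fprintf('Minimum: %.1e\n', min(y));
    nCongested = nnz(y > 0.2);              % intervals above the congestion threshold
    fprintf('Congestion intervals: %d\n', nCongested);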

Figures 4, 5, and 6 present one day, one week, and the whole data set, respectively. As can be seen, this data has the seasonal nature we described before. Figure 4 Figure 5

Figure 6
We utilize 3 weeks of this data set as the training set and use the fourth week for testing and prediction; all of our reports are based on this division. We must mention here that the testing part (week four) of our data set has 59 congestion spots. Therefore, our desire is that our learning techniques predict as many of these spots as possible while still having a minimum prediction error. In our experiments, all errors (training error, prediction error) are reported as MSE (mean squared error).
3-2- Bayesian Neural Network
Fortunately, MATLAB has a data analysis toolbox. One part of this toolbox is the Neural Net Time Series tool (ntstool), which solves nonlinear time series problems using dynamic neural networks. Figure 7 shows this toolbox.
Figure 7
A Bayesian neural network (BNN) is a neural network designed based on a Bayesian probabilistic formulation. As such, BNNs are related to the classical statistics concept of Bayesian parameter estimation, and are also related to the

concept of regularization, such as in ridge regression. BNNs have enjoyed wide applicability in many areas such as economics/finance and engineering. The idea of a BNN is to treat the network parameters, or weights, as random variables obeying some a priori distribution. This distribution is designed so as to favor low-complexity models, i.e., models producing smooth fits. Once the data are observed, the posterior distribution of the weights is evaluated and the network prediction can be computed. The predictions then reflect both the smoothness imposed through the prior and the fitting accuracy imposed by the observed data. A closely related concept is regularization, whereby the following objective function is constructed and minimized:

J = a E_D + (1 - a) E_W

where E_D is the sum of the squared errors in the network outputs, E_W is the sum of the squares of the network parameters (i.e., weights), and a is the regularization parameter. For the Bayesian approach, the typical choice of prior is the following normal density, which puts more weight on smaller network parameter values:

p(w) = ((1 - a)/pi)^(L/2) exp(-(1 - a) E_W)

where L denotes the number of parameters (weights). The posterior is then given by:

p(w | D, a) = p(D | w, a) p(w | a) / p(D | a)

where D represents the observed data. Assuming normally distributed errors, the probability density of the data given the parameters can be evaluated as:

p(D | w, a) = (a/pi)^(M/2) exp(-a E_D)

where M is the number of training data points. By substituting the expressions for the densities, we get:

p(w | D, a) = c exp(-J)

where c is some normalizing constant. The regularization constant is also determined using Bayesian concepts, from:

p(a | D) = p(D | a) p(a) / p(D)

The last two expressions should be maximized to obtain the optimal weights and the alpha (a) parameter, respectively. The term p(D | a) in the last equation is obtained by a quadratic approximation of J in terms of the weights, followed by integrating out the weights.

We used MATLAB's trainbr routine for the BNN (applied to a multilayer perceptron architecture). This routine is based on the algorithm proposed by Foresee and Hagan. For prediction, we used a nonlinear autoregressive (NAR) model; figure 8 shows this model.
Figure 8
For the training part of the network, we set the test and validation options as follows:
Figure 9
After setting these parameters, we have to define the configuration of the network (the number of delays and the number of neurons), which figure 10 presents.
Figure 10
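For reference, the ntstool workflow can also be reproduced in a script. The sketch below is a minimal, assumed version of that scripted workflow: the variable names are ours, and the split ratios are placeholders for the actual settings shown in figure 9.

    % Minimal sketch of a NAR network trained with Bayesian regularization.
    % y: column vector holding the training portion of the series (assumed name).
    T = num2cell(y');                     % target series as a cell array, as narnet expects
    net = narnet(1:96, 5);                % NAR network: 96 feedback delays, 5 hidden neurons
    net.trainFcn = 'trainbr';             % Bayesian regularization (Foresee-Hagan)
    net.divideParam.trainRatio = 0.70;    % placeholder split ratios (see figure 9)
    net.divideParam.valRatio   = 0.15;
    net.divideParam.testRatio  = 0.15;
    [Xs, Xi, Ai, Ts] = preparets(net, {}, {}, T);   % build delayed inputs and targets
    net  = train(net, Xs, Ts, Xi, Ai);    % fit the network
    Yhat = net(Xs, Xi, Ai);               % one-step-ahead predictions on the prepared data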

We ran many experiments, varying these two main parameters to find the best configuration for prediction. To find the best configuration we considered the following values for the delay and the neuron count:
Values for delay: 12, 24, 48, 96
Values for neurons: 5, 10, 15, 20
For the sake of training time, we did not grow the hidden layer further; the actual training time of the network with delay=96 and neurons=20 is about an hour. Table 1 shows all of our experiments and their results. In table 1, the number of correct predictions means how many congestion spots were predicted correctly by the Bayesian neural network (the total number of congestion spots in the testing set is 59). The number of wrong predictions indicates how many times the neural network predicted congestion at the next interval when there was actually no congestion at that interval.

Table 1
Hidden neurons | Delay | Training error | Prediction error | Correct predictions | Wrong predictions
5  | 12 | 9.00E-04 | 0.0023 | 46 | 11
5  | 24 | 7.60E-04 | 0.0022 | 45 | 10
5  | 48 | 4.76E-04 | 0.0018 | 47 | 9
5  | 96 | 2.67E-04 | 0.0015 | 49 | 7
10 | 12 | 8.46E-04 | 0.0022 | 46 | 11
10 | 24 | 5.60E-04 | 0.0019 | 46 | 10
10 | 48 | 2.27E-04 | 0.0016 | 48 | 9
10 | 96 | 2.23E-13 | 0.0021 | 49 | 9
15 | 12 | 6.73E-04 | 0.0021 | 46 | 10
15 | 24 | 4.55E-04 | 0.0018 | 46 | 9
15 | 48 | 7.00E-05 | 0.0019 | 48 | 7
15 | 96 | 1.27E-14 | 0.0024 | 47 | 14
20 | 12 | 5.00E-04 | 0.0022 | 45 | 10
20 | 24 | 3.63E-04 | 0.0019 | 46 | 10
20 | 48 | 4.26E-06 | 0.0028 | 46 | 12

Table 1 reveals that the best configuration for predicting the next call blocking probability and the congestion spots (where the probability is greater than 0.2) is a network with 5 hidden neurons and 96 delays. With this configuration, the prediction error

(MSE) is 0.0015; the network predicts 49 congestion spots correctly and has only 7 wrong predictions of congestion. The other information we can extract from this table is that when we increase the number of delays, the training error goes down. Figures 11, 12, and 13 show this relation (training error versus number of delays) for 5, 10, and 15 neurons.
Figure 11
Figure 12

Figure 13
Also, there is a similar relation between the training error and the number of neurons in the hidden layer. Figure 14 shows this relation for 48 delays (training error versus number of hidden neurons).
Figure 14
The other relation that can be extracted from this table is that when we increase the complexity of the neural network (increase the number of delays), the prediction error first improves, but from a certain point (where the training error is close to zero) this trend is inverted and the prediction error increases. So, we can conclude that by reducing the training error too far, we are actually decreasing the

generalization. Figures 15, 16, and 17 depict this conclusion for networks with 10, 15, and 20 neurons. The Y axis is the prediction error and the X axis represents the delay step.
Figure 15
Figure 16

Figure 17
Now, we present some pictures of the training diagrams of the networks. The following figures show the training diagrams when the neural network is trained on the training set. In these pictures, the actual data of the training set and the data learned by the network are plotted, with the error diagram below them. Here we show only some representative pictures. Figures 18, 19, 20, 21, and 22 show the diagrams for networks with the configurations (hidden neurons=5, delay=12), (hidden neurons=5, delay=96), (hidden neurons=10, delay=48), (hidden neurons=15, delay=48), and (hidden neurons=20, delay=24).
Figure 18

Figure 19 Figure 20

Figure 21
Figure 22
Here, we present the prediction images of the nominated networks. The blue bubbles are the actual data and the red line is the prediction of each point. Figures 23, 24, 25, and 26 are the prediction results for networks with the following configurations: (neurons=5, delay=96), (neurons=10, delay=48), (neurons=15, delay=48), (neurons=20, delay=24).

Figure 23 Figure 24

Figure 25 Figure 26

3-3- Kernel Regression
Nadaraya and Watson developed this model, which is commonly called the Nadaraya-Watson estimator or the kernel regression estimator. In the machine learning community, the term generalized regression (GR) is typically used. The GR model is a nonparametric model where the prediction for a given data point x is given by the average of the target outputs of the training data points in the vicinity of x. The local average is constructed by weighting the points according to their distance from x, using some kernel function. We used the typical Gaussian kernel

K(u) = (1/sqrt(2*pi)) exp(-u^2/2)

The parameter h, called the bandwidth, is an important parameter as it determines the smoothness of the fit: increasing or decreasing it controls the size of the smoothing region. Since kernel regression does not need to be trained, we only ran the regression for week 4, to allow a fair comparison between the Bayesian neural network and Gaussian kernel regression. Although kernel regression has no training phase, it needs some points to predict the next interval. In our experiment we regard these initial points as training points and calculate a training error over them. To show how the regression works, we present some pictures which reveal how the bandwidth affects the regression and how the next point is predicted. Figures 27, 28, 29, 30, and 31 correspond to bandwidths of 0.3, 0.5, 1, 2, and 5. In these pictures, the blue bubbles are the actual data and the red line is the regression. It is also visible that the last point of each picture (where the red line ends) is the predicted point. Therefore, here we have 6 actual points and 7 regression points; the last point (the 7th) is the prediction.
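To make the mechanism concrete, the prediction of the next interval as a kernel-weighted average of the past intervals can be sketched as follows. This is a minimal illustration under our reading of the method, with variable names of our own choosing; the weights depend on the time distance from the past samples to the point being predicted.

    % Minimal sketch of the Nadaraya-Watson prediction of the next interval.
    % y: column vector of past call blocking probabilities, h: bandwidth.
    function yhat = nw_predict(y, h)
        t = numel(y) + 1;                    % index of the interval to predict
        u = (t - (1:numel(y))') / h;         % scaled time distances to past samples
        w = exp(-u.^2 / 2) / sqrt(2*pi);     % Gaussian kernel weights
        yhat = sum(w .* y) / sum(w);         % weighted local average
    end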

Figure 27 Figure 28

Figure 29 Figure 30

Figure 31
We can extract some information about the behavior of kernel regression from these images. The first observation is that by increasing the bandwidth (h), the curve becomes smoother. When h is too small, the regression follows the actual data (so we can assume the training error is zero) and the predicted value is equal to the last interval. When we increase h, the predicted point starts to change its value. Figure 32 shows the relation between prediction error and bandwidth.
Figure 32

For our implementation, we consider different bandwidths as follows: 0.5, 0.8, 1, 1.2, 1.5, 2, and 5. Table 2 shows the results of the kernel regression experiments.

Table 2
Bandwidth | Training error | Prediction error | Correct predictions | Wrong predictions
0.5 | 1.63E-05 | 0.0012 | 51 | 8
0.8 | 1.00E-04 | 0.0013 | 50 | 7
1   | 1.62E-04 | 0.0014 | 50 | 9
1.2 | 2.40E-04 | 0.0016 | 48 | 9
1.5 | 3.73E-04 | 0.0019 | 48 | 11
2   | 6.30E-04 | 0.0025 | 46 | 14
5   | 2.30E-03 | 0.0051 | 39 | 23

Based on this table, we can say that the best prediction is obtained with a bandwidth of 0.5. An important result of this regression is that call blocking probabilities follow each other: there is a strong relation between the next call blocking probability and the previous intervals. This kernel regression makes 51 correct predictions of congestion (out of the 59 congestion spots in week 4) with a prediction error of 0.0012, and the number of wrong predictions of congestion is 8. Figures 33, 34, 35, 36, 37, 38, and 39 depict the regression prediction for week 4 for bandwidths 0.5, 0.8, 1, 1.2, 1.5, 2, and 5. It is important to note that this is not just a regression of the actual data: each point of the red line is calculated (predicted) based on the previous intervals, and finally all predicted points are drawn in one figure (red line). The green line is the threshold (0.2).
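To show how one row of table 2 could be scored, the sketch below is our reconstruction (not code from the report): it predicts every interval of week 4 one step ahead, then computes the MSE and counts congestion hits and false alarms against the 0.2 threshold. It assumes y is the full series as a column vector with weeks stored as contiguous blocks of 336 intervals.

    % Hedged reconstruction of the week-4 evaluation for one bandwidth value.
    h = 0.5;  thr = 0.2;  nTrain = 3*7*48;           % end of week 3 (assumed layout)
    yhat = zeros(numel(y) - nTrain, 1);
    for t = nTrain+1 : numel(y)
        u = (t - (1:t-1)') / h;                      % time distances to past samples
        w = exp(-u.^2/2) / sqrt(2*pi);               % Gaussian kernel weights
        yhat(t - nTrain) = sum(w .* y(1:t-1)) / sum(w);
    end
    yTest   = y(nTrain+1:end);
    mse     = mean((yTest - yhat).^2);               % prediction error
    correct = nnz(yTest > thr & yhat > thr);         % congestion spots predicted
    wrong   = nnz(yTest <= thr & yhat > thr);        % false congestion alarms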

Figure 33 Figure 34

Figure 35 Figure 36

Figure 37 Figure 38

Figure 39
3-4- Ensemble Learning
Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. Ensemble learning is primarily used to improve the performance of a model (classification, prediction, function approximation, etc.) or to reduce the likelihood of an unfortunate selection of a poor one. Other applications of ensemble learning include assigning a confidence to the decision made by the model, selecting optimal (or near-optimal) features, data fusion, incremental learning, nonstationary learning, and error correction. In our project, we used our two previous techniques to build the third one: the first algorithm is the Bayesian neural network and the second is kernel regression. In our ensemble, we first obtain the prediction points from each technique and then calculate a new prediction point by voting between them. A fair voting scheme between two points is averaging, so we take the average of the two algorithms' prediction points as the final prediction. In this way, we aim to improve the prediction of the next call blocking probability. The result shows that our expectation holds and the prediction is enhanced

significantly. Table 3 shows the values for ensemble learning, and figure 40 depicts the prediction line versus the actual data.

Table 3
Correct predictions | Wrong predictions | Prediction error
54 | 5 | 5.6E-04

Figure 40
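As an illustration of the combination step, the averaging ensemble can be written in a few lines. The sketch below is ours: y_bnn and y_kr are hypothetical vectors holding the week-4 predictions of the two models, and y_test holds the actual week-4 data.

    % Minimal sketch of the averaging ensemble over the week-4 predictions.
    y_ens   = (y_bnn + y_kr) / 2;                 % equal-weight vote between the two models
    mse     = mean((y_test - y_ens).^2);          % ensemble prediction error
    thr     = 0.2;                                % congestion threshold
    correct = nnz(y_test > thr & y_ens > thr);    % congestion spots predicted correctly
    wrong   = nnz(y_test <= thr & y_ens > thr);   % false congestion predictions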

4- Conclusion
In this project we evaluated three learning techniques, and we now compare them to select the best algorithm to model our time series data, which allows us to predict the next observation interval from the current time and find congestion spots. Our nominated algorithms were Bayesian neural networks, kernel regression, and ensemble learning. The selected algorithm is the one with the better prediction and the minimum average prediction error over some period of time; the criterion for selecting this algorithm is the prediction error. For the Bayesian neural network, the best configuration is 5 neurons with 96 delays. For kernel regression, the best configuration is a bandwidth of 0.5. For ensemble learning we combine the two techniques by taking the average of their responses as the predicted value. Table 4 shows the best result for these three techniques.

Table 4
Technique | Configuration | Training error | Prediction error | Correct predictions | Wrong predictions
Bayesian neural network | Neurons=5, delays=96 | 2.67E-04 | 0.0015 | 49 | 7
Kernel regression | h=0.5 | 1.63E-05 | 0.0012 | 51 | 8
Ensemble learning | NN+KR | -- | 5.6E-04 | 54 | 5

5- References
[1] Bontempi, Gianluca, Souhaib Ben Taieb, and Yann-Aël Le Borgne. "Machine learning strategies for time series forecasting." In Business Intelligence, pp. 62-77. Springer Berlin Heidelberg, 2013.
[2] Brockwell, Peter J., and Richard A. Davis. Time Series: Theory and Methods. Springer Science & Business Media, 2013.
[3] Bontempi, Gianluca. "Machine Learning Strategies for Time Series Prediction."
[4] Ahmed, Nesreen K., et al. "An empirical comparison of machine learning models for time series forecasting." Econometric Reviews 29.5-6 (2010): 594-621.
[5] http://www.mathworks.com/products/demos/machinelearning/boosted_regression/boostedRegression_01.png
[6] http://www.reverbnetworks.com/