The Comparison and Combination of Genetic and Gradient Descent Learning in Recurrent Neural Networks: An Application to Speech Phoneme Classification

Rohitash Chandra, School of Science and Technology, The University of Fiji, rohitashc@unifii.ac.f
Christian W. Omlin, The University of Western Cape

Abstract

We present a training approach for recurrent neural networks that combines evolutionary and gradient descent learning. We train the weights of the network using genetic algorithms. We then apply gradient descent learning to the knowledge acquired by genetic training in order to refine it further. For comparison, we also train the same network topology with genetic neural learning alone and with gradient descent learning alone. We apply these training methods to speech phoneme classification, using Mel frequency cepstral coefficients for feature extraction of phonemes read from the TIMIT speech database. Our results show that combined genetic and gradient descent learning can train recurrent neural networks for phoneme classification; however, its generalization performance does not differ significantly from that of genetic neural learning or gradient descent alone. Genetic neural learning has shown the best training performance in terms of training time.

1. Introduction

Recurrent neural networks have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [1]-[3]. Recurrent neural networks are capable of modeling complicated recognition tasks. They have shown better accuracy than hidden Markov models in speech recognition on low-quality, noisy data [4]. Recurrent neural networks are dynamical systems, and it has been shown that they can represent deterministic finite automata in their internal weight representations [5].
Backpropagation through time employs gradient descent learning and is popular for training recurrent neural networks. The goal of gradient descent learning is to minimize the network's output error by adjusting the weights of the network upon the presentation of training samples. Backpropagation through time is an extension of the backpropagation algorithm used for training feedforward networks. The algorithm unfolds a recurrent neural network in time and views it as a deep multilayer feedforward network. Gradient descent learning faces the problem of the network becoming trapped in local minima. A momentum term is usually used to alleviate this learning difficulty. In some cases, the network topology is pruned to improve the network's generalization by deleting some neurons from the network [6]. In this paper, we will use gradient descent learning to train recurrent neural networks for phoneme classification. Knowledge based neurocomputing is a paradigm which incorporates expert knowledge into neural networks prior to training for improved training and generalization performance [7]. Expert knowledge provides the network with hints during training. This paradigm is limited to applications where expert knowledge is available. Evolutionary optimization techniques such as genetic algorithms have been popular alternatives to gradient descent learning for training neural networks [8]. It has been observed that genetic algorithms overcome the problem of local minima, whereas in gradient descent search for the optimal solution it may be difficult to drive the network out of a local minimum, which in turn proves costly in terms of training time. In this paper, we will show how genetic algorithms can be applied to train recurrent neural networks and compare their performance with gradient descent learning.
We will combine genetic and gradient descent learning to train the network architecture for classification of two phonemes extracted from the TIMIT speech database. After successful training, we will use gradient descent learning to train further on the knowledge acquired in the genetic training process. In this way, we will combine
both training paradigms. Gradient descent learning will be used to further refine the knowledge acquired by genetic training.

2. Definition and Methods

2.1 Recurrent Neural Networks

Neural networks are loosely modeled on the brain. They learn by training on past experience and generalize well to unseen instances. Neural networks are divided into feedforward and recurrent neural networks. Feedforward networks are used in applications where the data does not contain time-variant information, while recurrent neural networks model time series sequences and possess dynamical characteristics. Recurrent neural networks contain feedback connections. They have the ability to maintain information from past states for the computation of future state outputs. Popular architectures include second-order recurrent networks [10], NARX networks [11] and LSTM recurrent networks [12]. A detailed study of the vast variety of recurrent neural networks is beyond the scope of this paper. We will use first-order recurrent neural networks to show the combination of evolutionary and gradient descent learning. Their dynamics are shown in equation (1):

S_i(t) = g( sum_{k=1}^{K} V_ik S_k(t-1) + sum_{j=1}^{J} W_ij I_j(t-1) )     (1)

where S_k(t) and I_j(t) represent the outputs of the state neurons and input neurons, respectively, V_ik and W_ij represent their corresponding weights, and g(.) is a sigmoidal discriminant function. Backpropagation employs gradient descent learning and is the most popular algorithm used for training neural networks. One limitation of training neural networks using gradient descent learning is their weakness of getting trapped in local minima, resulting in poor training and generalization performance. Evolutionary optimization methods such as genetic algorithms are also used for neural network training; they do not face the problems encountered in gradient descent learning.
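As a concrete illustration, the state update of equation (1) can be sketched in a few lines of NumPy; the network sizes and random weights below are purely illustrative and not the topology used in our experiments.

```python
import numpy as np

def rnn_step(S_prev, I_prev, V, W):
    """One step of the first-order recurrent network in equation (1):
    S_i(t) = g(sum_k V_ik S_k(t-1) + sum_j W_ij I_j(t-1)),
    where g is the sigmoidal discriminant function."""
    net = V @ S_prev + W @ I_prev       # weighted sums over state and input neurons
    return 1.0 / (1.0 + np.exp(-net))   # sigmoid g(.)

# Hypothetical sizes: K = 3 state neurons, J = 2 input neurons
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 3))   # state-to-state weights V_ik
W = rng.normal(size=(3, 2))   # input-to-state weights W_ij

S = np.zeros(3)               # initial state
for I in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    S = rnn_step(S, I, V, W)  # feed a short input sequence
```

Because the state is fed back at every step, the output after the loop depends on the whole input sequence, which is what gives the network its dynamical character.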
In the past, research has been done to improve the training performance of neural networks, which has significance for their generalization. Symbolic or expert knowledge is inserted into neural networks prior to training for better training and generalization performance. It has been shown that deterministic finite-state automata can be directly encoded into recurrent neural networks prior to training [6]. Until recently, neural networks were viewed as black boxes as they could not explain the knowledge learnt in the training process. The extraction of rules from neural networks shows how they arrived at a particular solution after training. The extraction of finite-state automata from trained recurrent neural networks shows that they have characteristics for modeling dynamical systems. Recurrent neural networks are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer, as shown in Figure 1. Each layer contains one or more processing units called neurons which propagate information from one layer to the next by computing a non-linear function of their weighted sum of inputs. Popular architectures of recurrent neural networks include first-order recurrent networks [9].

Figure 1: First-order recurrent neural network architecture. The recurrence from the hidden to the context layer is shown. Dashed lines indicate that more neurons can be used in each layer depending on the application.

2.2 Backpropagation Through Time

Backpropagation is the most widely applied learning algorithm for both feedforward and recurrent neural networks. It learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. Backpropagation employs gradient descent to minimize the squared error between the network's output values and the desired values for those outputs.
The learning problem faced by backpropagation is to search a large hypothesis space defined by weight values for all the units of the network.
Backpropagation through time (BPTT) is a gradient descent learning algorithm used for training first-order recurrent neural networks [13]. BPTT is an extension of the backpropagation algorithm. The general idea behind BPTT is to unfold the recurrent neural network in time so that it becomes a deep multilayer feedforward network. When unfolded in time, the network has the same behavior as a recurrent neural network for a finite number of time steps. The goal of gradient descent learning is to minimize the sum of squared errors by propagating error signals backward through the network architecture upon the presentation of training samples from the training set. These error signals are used to calculate the weight updates which represent the knowledge learnt in neural networks. In gradient descent search for a solution, the network searches through a weight space of errors; therefore, it may easily get trapped in a local minimum. This may prove costly in terms of network training and generalization performance. In time-varying sequences, longer patterns represent long time dependencies. Gradient descent has difficulties in learning long time dependencies, as the error gradient vanishes with increasing duration of the dependencies [14]. Given below are the training equations for the network unfolded in time; hence, time t becomes the layer L. For each training example d, every weight w_i is updated by adding Delta w_i to it:

Delta w_i = -alpha * dE_d / dw_i     (2)

where E_d is the error on training example d, summed over all m output units in the network:

E_d = (1/2) sum_{j=1}^{m} (d_j - S_j^L)^2     (3)

where d_j is the desired output for neuron j in the output layer L, which contains m neurons, and S_j^L is the network output of neuron j in the output layer L. After computing the derivative, the weight update is given by:

Delta w_ij^L = alpha * delta_i^L * S_j^(L-1)     (4)

where alpha is the learning rate constant. The learning rate determines how fast the weights are updated in the direction of the gradient.
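The weight update of equations (2)-(4) can be sketched for one layer of the unfolded network; the layer sizes, random weights and target vector below are hypothetical, and a full BPTT implementation would repeat this computation for every layer of the unfolding.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical slice of the unfolded network: layer L-1 (4 neurons) -> layer L (2 outputs)
rng = np.random.default_rng(1)
S_prev = sigmoid(rng.normal(size=4))   # S^{L-1}, outputs of the previous layer
w = rng.normal(size=(2, 4))            # weights into layer L
S_out = sigmoid(w @ S_prev)            # S^L, network outputs
d = np.array([1.0, 0.0])               # desired outputs for this training example

# Error gradient at the output layer: derivative of the squared error
# times the sigmoid derivative S(1 - S)
delta = (d - S_out) * S_out * (1.0 - S_out)

# Gradient for the layer below: output deltas propagated back through the weights
delta_prev = S_prev * (1.0 - S_prev) * (w.T @ delta)

# Weight update of equation (4) with learning rate alpha
alpha = 0.2
w += alpha * np.outer(delta, S_prev)
```

Each backward pass computes the deltas layer by layer and then applies the update of equation (4) with the chosen learning rate.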
The error gradient delta_i^L for neuron i in the output layer is given by:

delta_i^L = (d_i - S_i^L) S_i^L (1 - S_i^L)     (5)

The error gradient for the hidden layers is given by:

delta_i^L = S_i^L (1 - S_i^L) sum_{k=1}^{m} delta_k^(L+1) w_ki^(L+1)     (6)

Heuristics to improve the performance of backpropagation include adding a momentum term and training multiple networks on the same data with different small random initializations prior to training.

2.3 Evolutionary Training of Recurrent Neural Networks

2.3.1 Genetic Algorithms

Genetic algorithms provide a learning method motivated by biological evolution. They are search techniques that can be used both for solving problems and for modeling evolutionary systems [15]. The problem faced by genetic algorithms is to search a space of candidate hypotheses and find the best hypothesis. The hypothesis fitness is a numerical measure which identifies the hypothesis that best optimizes the problem. The algorithm operates by iteratively updating a pool of hypotheses, called the population. The population consists of many individuals called chromosomes. All members of the population are evaluated by the fitness function on each iteration. A new population is then generated by probabilistically selecting the most fit chromosomes from the current population. Some of the selected chromosomes are added to the new generation while others are selected as parent chromosomes. Parent chromosomes are used for creating new offspring by applying genetic operators such as crossover and mutation. Traditionally, chromosomes represent bit strings; however, real number representation is possible.

2.3.2 Genetic Algorithms for Neural Networks

In order to use genetic algorithms for training neural networks, we need to represent the problem as chromosomes. Real-numbered values of weights must be encoded in the chromosome rather than binary values. This is done by altering the crossover and mutation operators.
A crossover operator takes two parent chromosomes and creates a single child chromosome by randomly selecting corresponding genetic material from both parents. The mutation operator adds a small random real number between -1 and 1 to a randomly selected gene in the chromosome.
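These operators, together with the reciprocal-of-squared-error fitness evaluation of evolutionary neural learning, can be sketched as follows. The fitness function below uses the chromosome's genes directly as weights of a toy single-neuron network; in a real evaluation the genes would be assigned to the recurrent network's weight links, so the network here is only a simplified placeholder.

```python
import random
import numpy as np

def crossover(p1, p2):
    """Child takes each gene at random from the corresponding position of a parent."""
    return [random.choice(pair) for pair in zip(p1, p2)]

def mutate(chromosome):
    """Add a small random real number in [-1, 1] to one randomly chosen gene."""
    child = list(chromosome)
    i = random.randrange(len(child))
    child[i] += random.uniform(-1.0, 1.0)
    return child

def fitness(chromosome, examples):
    """Reciprocal of the sum of squared errors over the training examples.
    A toy single-sigmoid network stands in for the recurrent forward pass."""
    w = np.asarray(chromosome)
    sse = sum((d - 1.0 / (1.0 + np.exp(-(w @ x)))) ** 2 for x, d in examples)
    return 1.0 / (sse + 1e-12)   # small constant guards against division by zero
```

Selection then favors chromosomes with higher fitness, so minimizing the network error and maximizing fitness are the same search.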
In evolutionary neural learning, the task of the genetic algorithm is to find the optimal set of weights in a network topology which minimizes the error function. The fitness function must reflect the performance of the neural network. Thus, the fitness function is the reciprocal of the sum of squared errors of the neural network. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network. The training set of examples is then presented to the network, which propagates the information forward, and the sum of squared errors is calculated. In this way, the genetic algorithm attempts to find a set of weights which minimizes the error function of the network. Recurrent neural networks have been trained by evolutionary computation methods such as genetic algorithms, which optimize the weights in the network architecture for a particular problem. Compared to gradient descent learning, genetic algorithms can help the network escape from local minima.

2.4 Knowledge Based Neurocomputing

The general paradigm of knowledge based neurocomputing includes the incorporation of symbolic knowledge in neural networks for better training and generalization performance [7]. The fidelity of the mapping of the prior knowledge is very important, since the network may not take advantage of poorly encoded knowledge. Poorly encoded knowledge may hinder the learning process. Good prior knowledge encoding may provide the network with beneficial features: 1) the learning process may converge faster to a solution, meaning better training performance; 2) networks trained with prior knowledge may generalize better than networks trained without prior knowledge; and 3) the rules in prior knowledge may help to generate additional training data not present in the original data set.
Prior knowledge, usually represented in the form of explicit rules in symbolic form, is encoded in neural networks by programming some weights prior to training [7]. In feedforward neural networks, prior knowledge in the form of propositional logic expressions is encoded by programming a subset of weights. Prior knowledge also determines the topology of the network, i.e. the number of neurons and hidden layers appropriate for encoding the knowledge. The paradigm has been successfully applied to real world problems including bio-conservation [16] and molecular biology [17]. Prior or expert knowledge helps the network attain better generalization and training performance when compared with network architectures without prior knowledge encoding. For recurrent neural networks, finite-state automata are the basis for knowledge insertion. It has been shown that deterministic finite-state automata can be encoded in discrete-time second-order recurrent neural networks by directly programming a small subset of the available weights [6]. For first-order recurrent neural networks, a method for encoding finite-state automata has also been proposed [18].

2.5 Speech Phoneme Classification

A speech sequence contains a huge amount of irrelevant information; in order to model it, feature extraction is necessary. In feature extraction, useful information is extracted from speech sequences and then used for modeling. Recurrent neural networks and hidden Markov models have been successfully applied to modeling speech sequences [1,4]. They have been applied to recognize words and phonemes. The performance of a speech recognition system can be measured in terms of accuracy and speed. Recurrent neural networks are capable of modeling complicated recognition tasks. They have shown better recognition accuracy than hidden Markov models on low-quality, noisy data. However, hidden Markov models have been shown to perform better when it comes to large vocabularies.
Extensive research on speech recognition has been done for more than forty years; however, scientists have been unable to implement systems which show excellent performance in environments with background noise. Mel frequency cepstral coefficients (MFCC) are a useful feature extraction technique, as the Mel filter has characteristics similar to the human auditory system [19]; the human ear performs similar processing before presenting information to the brain. We will apply MFCC feature extraction to extract features from phonemes in the TIMIT speech database. MFCC feature extraction is done by the following procedure. A frame of the speech signal is obtained by windowing and is presented to the discrete Fourier transformation to change the signal from the time domain to the frequency domain. The discrete Fourier transformed spectrum is then mapped onto the Mel scale using triangular overlapping windows. Finally, we compute the log energy at the output of each filter and then take a discrete cosine transformation of the Mel amplitudes, from which we obtain a vector of MFCC features.
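The procedure above can be sketched for a single windowed frame. The filterbank construction here is a simplified assumption (triangular filters built with linear ramps between Mel-spaced FFT bins), and defaults such as the 16 kHz sample rate and 20 filters are illustrative rather than the exact front end used in our experiments.

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_filters=20, n_coeffs=12):
    """MFCC sketch for one frame: window -> DFT -> Mel filterbank -> log -> DCT."""
    # Hamming window, then DFT: time domain -> magnitude spectrum
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))

    # Triangular filters spaced on the Mel scale: mel = 2595 * log10(1 + f/700)
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2.0) / 700.0)
    hz_points = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, n_filters + 2) / 2595.0) - 1.0)
    bins = np.floor((len(frame) + 1) * hz_points / sample_rate).astype(int)

    log_energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        tri = np.zeros(len(spectrum))
        if mid > lo:
            tri[lo:mid] = np.linspace(0.0, 1.0, mid - lo)   # rising edge of the triangle
        if hi > mid:
            tri[mid:hi] = np.linspace(1.0, 0.0, hi - mid)   # falling edge
        log_energies[i] = np.log(spectrum @ tri + 1e-10)    # log energy at the filter output

    # Discrete cosine transformation of the log Mel energies -> MFCC vector
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return basis @ log_energies
```

With a frame of 512 samples this yields the 12-coefficient feature vector per frame used later in Section 4.1.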
3. Combining Evolutionary and Gradient Descent Learning

In the past, a vast amount of research has been done to improve the training performance of neural networks. Some of the proposed methods include adding a momentum term during training, pruning the network architecture, and combining expert knowledge with neural network training. In the previous sections, we discussed the major training methods for recurrent neural networks and outlined how their training performance can be improved. Knowledge based neurocomputing has proven successful, as the paradigm uses symbolic knowledge for programming a subset of weights in the network architecture prior to training. However, it cannot be applied in applications where expert knowledge is not available. We have discussed how recurrent neural networks are trained using genetic algorithms. Genetic algorithms search for the optimal set of weights for a given network architecture. We need to investigate which sets of weight initializations are best for training using genetic algorithms. We have addressed the problem that in gradient descent learning the network may become trapped in local minima, resulting in poor training and generalization performance; training using genetic algorithms does not suffer from this problem. In this paper, we combine both genetic and gradient descent learning. We will train a recurrent neural network architecture using genetic neural learning; after successful training, we will apply gradient descent learning to further refine the knowledge acquired by genetic neural learning. We will train recurrent neural networks using gradient descent learning and record their training performance for classification of speech phonemes obtained from the TIMIT database. We will then train recurrent neural networks for phoneme classification using genetic neural learning. We will use different sets of weight initialization.
We will define a genetic population size and train the network architecture until it can learn the training dataset.

4. Empirical Results and Discussion

4.1 MFCC Feature Extraction

We used Mel frequency cepstral coefficients for feature extraction from two phonemes read from the TIMIT speech database. We read the phonemes b and d from the training and testing sets in the TIMIT database. The training dataset contained 645 samples while the testing set consisted of 238 samples. For each phoneme read, we applied a window of size 512 every 256 sample points. We windowed the signal and presented it to the discrete Fourier transformation. We then mapped the spectrum onto the Mel scale using triangular filters and applied a discrete cosine transformation to the Mel amplitudes. In this way we obtained a vector of 12 MFCC features for each frame of the phoneme.

4.2 Recurrent Neural Networks using Gradient Descent for Phoneme Classification

We used the training and testing data sets of features from the two phonemes b and d as discussed in Section 4.1. We used the following recurrent neural network topology: 12 neurons in the input layer, representing the speech feature input, and 2 neurons in the output layer, one for each phoneme. We used a learning rate of 0.2 and ran experiments with 12, 14, 16 and 18 neurons in the hidden layer. We used the backpropagation through-time algorithm, which employs gradient descent, for training. We ran two major experiments; Table 1 shows illustrative results for experiment 1, which uses small random weights in the range of -1 to 1, while Table 2 shows results for larger initial weight values. We terminated training once the network could learn 88% of the training samples and tested the network's generalization performance on a data set not included in the training set. For both experiments, we set a maximum training time of 100 epochs.
The results show that gradient descent has been successful in training recurrent neural networks for phoneme classification. An excessive number of neurons in the hidden layer leads to training difficulty. The two different sets of weight initializations have no major effect on the generalization performance. Upon successful training, the best generalization performance recorded was 82.6% on the presentation of a data set not included in the training set.

Table 1: Gradient descent learning. Small random weights initialised in the range of -1 to 1.

Hidden neurons | Epochs | Training performance | Generalization performance
12 | 100 | 82.8% | 82.6%
14 | 100 | 87.5% | 82.6%
16 | 100 | 88% | 82.6%
18 | 100 | 0.2% | 0.4%
Table 2: Gradient descent learning. Large random weights initialised in the range of -5 to 5.

Hidden neurons | Epochs | Training performance | Generalization performance
12 | 100 | 88% | 82.6%
14 | 100 | 88% | 82.6%
16 | 100 | 88% | 82.6%
18 | 100 | 0.2% | 0.4%

4.3 Recurrent Neural Networks using Genetic Algorithms

We obtained the training and testing data sets of the phonemes b and d as discussed in Section 4.1. We used the following recurrent neural network topology: 12 neurons in the input layer, representing the speech frame input, and 2 neurons in the output layer, one for each phoneme. We experimented with different numbers of neurons in the hidden layer. We ran some sample experiments and found that a population size of 40, a crossover probability of 0.7 and a mutation probability of 0.1 gave good genetic training performance; we therefore used these values for all our experiments. We ran two major experiments with different weight initializations prior to training and trained the network until it could learn at least 88% of the samples in the training set. Illustrative results for each experiment are shown in Table 3 and Table 4, respectively. The results show 82.6 percent generalization performance on unseen samples which were not included in the training process. Genetic neural learning has shown better training performance than gradient descent learning; however, their generalization performance is the same.

Table 3: Genetic neural learning. Small random weights initialised in the range of -1 to 1.

Hidden neurons | Generations | Training performance | Generalization performance
12 | 2 | 88% | 82.6%
14 | 4 | 88% | 82.6%
16 | 7 | 88% | 82.6%
18 | 3 | 87.5% | 82.6%

Table 4: Genetic neural learning. Large random weights initialised in the range of -5 to 5.

Hidden neurons | Generations | Training performance | Generalization performance
12 | 4 | 88% | 82.6%
14 | 4 | 88% | 82.6%
16 | 3 | 88% | 82.6%
18 | 2 | 88% | 82.6%

4.4 Combined Genetic and Gradient Descent Learning

We apply the combined genetic and gradient descent learning for phoneme classification using recurrent neural networks. We classify two phonemes from the features extracted in Section 4.1. We used the network topology as discussed in Section 4.3.
We applied genetic algorithms for training; once the network had learnt 88% of the samples, we terminated genetic training and applied gradient descent to further refine the knowledge learnt by genetic training. We trained for 100 further training epochs. Tables 5 and 6 show illustrative results of the two major experiments, initialized with two different sets of weights, respectively. The results show that the generalization performance of combined genetic and gradient descent learning does not improve significantly when compared to the performance of genetic neural learning alone.

Table 5: Genetic and gradient descent learning. Small random weights initialised in the range of -1 to 1.

Hidden neurons | Epochs | Training performance | Generalization performance
12 | 100 | 81.8% | 82.6%
14 | 100 | 81.1% | 82.6%
16 | 100 | 81.3% | 82.6%
18 | 100 | 81.3% | 82.6%

Table 6: Genetic and gradient descent learning. Large random weights initialised in the range of -5 to 5.

Hidden neurons | Epochs | Training performance | Generalization performance
12 | 100 | 78.1% | 82.6%
14 | 100 | 85.6% | 82.6%
16 | 100 | 81.3% | 82.6%
18 | 100 | 81.5% | 82.6%
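The combined training procedure used in this section can be summarized as a short sketch. Here `accuracy`, `ga_step` and `gd_step` are hypothetical callbacks standing in for the training-set accuracy of a chromosome, one genetic generation (selection, crossover, mutation), and one BPTT epoch on the decoded weights; the 88% threshold and 100-epoch budget match the experimental settings above.

```python
def hybrid_train(population, accuracy, ga_step, gd_step, target=0.88, max_epochs=100):
    """Combined genetic and gradient descent learning.
    Phase 1: evolve the population until some chromosome reaches the
    target training accuracy. Phase 2: refine that chromosome's weights
    with further gradient descent (BPTT) epochs."""
    while max(accuracy(c) for c in population) < target:
        population = ga_step(population)   # selection, crossover, mutation
    best = max(population, key=accuracy)   # knowledge acquired by genetic training
    for _ in range(max_epochs):
        best = gd_step(best)               # gradient descent refinement
    return best
```

The design keeps the two phases independent, so either learning method can also be run alone for the comparison experiments.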
5. Conclusions

We have discussed the popular recurrent neural network training paradigms and outlined their strengths and limitations. We discussed the application of recurrent neural networks to speech phoneme classification using Mel frequency cepstral coefficient feature extraction. We have successfully trained recurrent neural networks to classify the phonemes b and d extracted from the speech database using both gradient descent and genetic training methods. We have compared and combined genetic and gradient descent learning in recurrent neural networks. Our results demonstrate that genetic neural learning has better training performance than gradient descent; however, their generalization performance is the same. We combined genetic and gradient descent learning for recurrent network training and found that its generalization performance does not improve when compared to genetic neural learning alone. The application of the combined training method to other real world application problems remains an open question.

6. References

[1] A. J. Robinson, An application of recurrent nets to phone probability estimation, IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 298-305.
[2] C. L. Giles, S. Lawrence and A. C. Tsoi, Rule inference for financial prediction using recurrent neural networks, Proc. of the IEEE/IAFE Computational Intelligence for Financial Engineering, New York City, USA, 1997, pp. 253-259.
[3] K. Marakami and H. Taguchi, Gesture recognition using recurrent neural networks, Proc. of the SIGCHI Conference on Human Factors in Computing Systems: Reaching Through Technology, Louisiana, USA, 1991, pp. 237-242.
[4] M. J. F. Gales, Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech and Language, vol. 12, 1998, pp. 75-98.
[5] C. L. Giles, C. W. Omlin and K. Thornber, Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical systems, Proc.
of the IEEE, vol. 87, no. 9, 1999, pp. 1623-1640.
[6] C. W. Omlin and C. L. Giles, Pruning recurrent neural networks for improved generalization performance, IEEE Transactions on Neural Networks, vol. 5, no. 5, 1994, pp. 848-851.
[7] G. Towell and J. W. Shavlik, Knowledge based artificial neural networks, Artificial Intelligence, vol. 70, no. 4, 1994, pp. 119-166.
[8] C. K. W. Ku, M. W. Mak, and W. C. Siu, Adding learning to cellular genetic algorithms for training recurrent neural networks, IEEE Transactions on Neural Networks, vol. 10, no. 2, 1999, pp. 239-252.
[9] P. Manolios and R. Fanelli, First order recurrent neural networks and deterministic finite state automata, Neural Computation, vol. 6, no. 6, 1994, pp. 1154-1172.
[10] R. L. Watrous and G. M. Kuhn, Induction of finite-state languages using second-order recurrent networks, Proc. of Advances in Neural Information Processing Systems, California, USA, 1992, pp. 309-316.
[11] T. Lin, B. G. Horne, P. Tino and C. L. Giles, Learning long-term dependencies in NARX recurrent neural networks, IEEE Transactions on Neural Networks, vol. 7, no. 6, 1996, pp. 1329-1338.
[12] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol. 9, no. 8, 1997, pp. 1735-1780.
[13] P. J. Werbos, Backpropagation through time: what it does and how to do it, Proc. of the IEEE, vol. 78, no. 10, 1990, pp. 1550-1560.
[14] Y. Bengio, P. Simard and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 157-166.
[15] T. M. Mitchell, Machine Learning, McGraw Hill, 1997.
[16] R. Chandra, R. Knight and C. W. Omlin, Combining expert knowledge and ground truth: a knowledge based neurocomputing paradigm for bio-conservation decision support systems, Proc. of the International Conference on Environmental Management, Hyderabad, India, 2005.
[17] C. W. Omlin and S.
Snyders, Inductive bias strength in knowledge-based neural networks: application to magnetic resonance spectroscopy of breast tissues, Artificial Intelligence in Medicine, vol. 28, no. 2, 2003.
[18] P. Frasconi, M. Gori, M. Maggini, and G. Soda, Unified integration of explicit rules and learning by example in recurrent networks, IEEE Transactions on Knowledge and Data Engineering, vol. 7, no. 2, 1995, pp. 340-346.
[19] I. Potamitis, N. Fakotakis and G. Kokkinakis, Improving the robustness of noisy MFCC features using minimal recurrent neural networks, Proc. of the IEEE International Joint Conference on Neural Networks, vol. 5, 2000, p. 5271.