The Comparison and Combination of Genetic and Gradient Descent Learning in Recurrent Neural Networks: An Application to Speech Phoneme Classification

Rohitash Chandra
School of Science and Technology, The University of Fiji
rohitashc@unifiji.ac.fj

Christian W. Omlin
The University of Western Cape

Abstract

We present a training approach for recurrent neural networks which combines evolutionary and gradient descent learning. We train the weights of the network using genetic algorithms and then apply gradient descent learning to further refine the knowledge acquired by genetic training. We also use genetic neural learning and gradient descent learning on the same network topology for comparison. We apply these training methods to speech phoneme classification, using Mel frequency cepstral coefficients for feature extraction of phonemes read from the TIMIT speech database. Our results show that the combined genetic and gradient descent learning can train recurrent neural networks for phoneme classification; however, its generalization performance does not differ significantly from that of genetic neural learning or gradient descent learning alone. Genetic neural learning showed the best training performance in terms of training time.

1. Introduction

Recurrent neural networks have been an important focus of research as they can be applied to difficult problems involving time-varying patterns. Their applications range from speech recognition and financial prediction to gesture recognition [1]-[3]. Recurrent neural networks are capable of modeling complicated recognition tasks and have shown higher accuracy than hidden Markov models in speech recognition on low-quality, noisy data [4]. Recurrent neural networks are dynamical systems, and it has been shown that they can represent deterministic finite automata in their internal weight representations [5].

Backpropagation through time employs gradient descent learning and is popular for training recurrent neural networks. The goal of gradient descent learning is to minimize the network's output error by adjusting the weights of the network upon the presentation of training samples. Backpropagation through time is an extension of the backpropagation algorithm used for training feedforward networks; the algorithm unfolds a recurrent neural network in time and views it as a deep multilayer feedforward network. Gradient descent learning faces the problem of the network becoming trapped in local minima, and a momentum term is usually used to alleviate this difficulty. In some cases the network topology is pruned by deleting neurons in order to improve the network's generalization [6]. In this paper, we will use gradient descent learning to train recurrent neural networks for phoneme classification.

Knowledge-based neurocomputing is a paradigm which incorporates expert knowledge into neural networks prior to training for improved training and generalization performance [7]. Expert knowledge provides the network with hints during training; the paradigm therefore cannot be applied where expert knowledge is not available. Evolutionary optimization techniques such as genetic algorithms are a popular alternative to gradient descent learning for training neural networks [8].
It has been observed that genetic algorithms overcome the problem of local minima, whereas in a gradient descent search for the optimal solution it may be difficult to drive the network out of a local minimum, which in turn proves costly in terms of training time. In this paper, we will show how genetic algorithms can be applied to train recurrent neural networks and compare their performance with gradient descent learning. We will combine genetic and gradient descent learning to train the network architecture for the classification of two phonemes extracted from the TIMIT speech database: after successful genetic training, we will use gradient descent learning to train further on the knowledge acquired in the genetic training process. In this way, we will combine both training paradigms, with gradient descent learning used to further refine the knowledge acquired by genetic training.

2. Definition and Methods

2.1 Recurrent Neural Networks

Neural networks are loosely modeled on the brain. They learn by training on past experience and generalize well to unseen instances. Neural networks are divided into feedforward and recurrent neural networks. Feedforward networks are used in applications where the data does not contain time-varying information, while recurrent neural networks model time series and possess dynamical characteristics. Recurrent neural networks contain feedback connections and have the ability to maintain information from past states for the computation of future state outputs.

Recurrent neural networks are composed of an input layer, a context layer which provides state information, a hidden layer and an output layer, as shown in Figure 1. Each layer contains one or more processing units called neurons which propagate information from one layer to the next by computing a non-linear function of their weighted sum of inputs. Popular architectures of recurrent neural networks include first-order recurrent networks [9], second-order recurrent networks [10], NARX networks [11] and LSTM recurrent networks [12]. A detailed study of the vast variety of recurrent neural networks is beyond the scope of this paper.

Figure 1: First-order recurrent neural network architecture. The recurrence from the hidden layer to the context layer is shown. Dashed lines indicate that more neurons can be used in each layer depending on the application.

Backpropagation employs gradient descent learning and is the most popular algorithm used for training neural networks. One limitation of training neural networks with gradient descent learning is the tendency to become trapped in local minima, resulting in poor training and generalization performance. Evolutionary optimization methods such as genetic algorithms are also used for neural network training; they do not face the problems encountered in gradient descent learning. In the past, research has been done to improve the training performance of neural networks, which has a significant effect on their generalization. Symbolic or expert knowledge can be inserted into neural networks prior to training for better training and generalization performance; it has been shown that deterministic finite-state automata can be directly encoded into recurrent neural networks prior to training [6]. Until recently, neural networks were viewed as black boxes, as they could not explain the knowledge learnt in the training process. The extraction of rules from neural networks shows how they arrive at a particular solution after training, and the extraction of finite-state automata from trained recurrent neural networks shows that they have characteristics suitable for modeling dynamical systems.

We will use first-order recurrent neural networks to show the combination of evolutionary and gradient descent learning. Their dynamics are given in equation (1):

$$S_i(t) = g\left(\sum_{k=1}^{K} V_{ik}\, S_k(t-1) + \sum_{j=1}^{J} W_{ij}\, I_j(t-1)\right) \qquad (1)$$

where $S_k(t)$ and $I_j(t)$ represent the outputs of the state neurons and input neurons, respectively, $V_{ik}$ and $W_{ij}$ are their corresponding weights, and $g(\cdot)$ is a sigmoidal discriminant function.
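As an illustration of equation (1), the following minimal sketch (in Python with NumPy; the dimensions and variable names are chosen here for illustration and are not taken from the paper) computes one state update of a first-order recurrent network.

```python
import numpy as np

def rnn_state_update(S_prev, I_prev, V, W):
    """One step of the first-order recurrent network in equation (1):
    S_i(t) = g( sum_k V_ik S_k(t-1) + sum_j W_ij I_j(t-1) ),
    where g is the sigmoidal discriminant function."""
    net = V @ S_prev + W @ I_prev          # weighted sums over state and input neurons
    return 1.0 / (1.0 + np.exp(-net))      # sigmoidal discriminant g(.)

# Hypothetical dimensions: 12 MFCC inputs, 14 hidden/state neurons
rng = np.random.default_rng(0)
K, J = 14, 12
V = rng.uniform(-1, 1, size=(K, K))        # state-to-state weights V_ik
W = rng.uniform(-1, 1, size=(K, J))        # input-to-state weights W_ij
S = np.zeros(K)                            # initial state S(0)
I = rng.standard_normal(J)                 # one input frame I(t-1)
S = rnn_state_update(S, I, V, W)           # S(t)
```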
2.2 Backpropagation Through Time

Backpropagation is the most widely applied learning algorithm for both feedforward and recurrent neural networks. It learns the weights of a multilayer network, given a network with a fixed set of units and interconnections, by employing gradient descent to minimize the squared error between the network's output values and the desired values for those outputs. The learning problem faced by backpropagation is to search the large hypothesis space defined by all possible weight values for the units of the network.

Backpropagation through time (BPTT) is a gradient descent learning algorithm used for training first-order recurrent neural networks [13]. BPTT is an extension of the backpropagation algorithm. The general idea behind BPTT is to unfold the recurrent neural network in time so that it becomes a deep multilayer feedforward network; when unfolded in time, the network has the same behavior as the recurrent neural network for a finite number of time steps. The goal of gradient descent learning is to minimize the sum of squared errors by propagating error signals backward through the network architecture upon the presentation of training samples from the training set. These error signals are used to calculate the weight updates which represent the knowledge learnt by the neural network. In a gradient descent search for a solution, the network searches through a weight space of errors and may therefore easily become trapped in a local minimum, which may prove costly in terms of network training and generalization performance. In time-varying sequences, longer patterns represent longer time dependencies. Gradient descent has difficulty learning long time dependencies because the error gradient vanishes with increasing duration of the dependencies [14].

The training equations for the network unfolded in time are given below; time $t$ thus becomes the layer index $L$. For each training example $d$, every weight $w_i$ is updated by adding $\Delta w_i$ to it:

$$\Delta w_i = -\alpha \frac{\partial E_d}{\partial w_i} \qquad (2)$$

where $E_d$ is the error on training example $d$, summed over all output units in the network:

$$E_d = \frac{1}{2} \sum_{j=1}^{m} \left(d_j - S_j^L\right)^2 \qquad (3)$$

Here $d_j$ is the desired output for neuron $j$ in the output layer $L$, which contains $m$ neurons, and $S_j^L$ is the network output of neuron $j$ in the output layer. After computing the derivatives, the weights are updated by

$$\Delta w_{ji}^{L} = \alpha\, \delta_j^{L} S_i^{L-1} \qquad (4)$$

where $\alpha$ is the learning rate constant, which determines how fast the weights are updated in the direction of the gradient. The error gradient $\delta_j^L$ for neuron $j$ in the output layer is given by

$$\delta_j^{L} = \left(d_j - S_j^{L}\right) S_j^{L} \left(1 - S_j^{L}\right) \qquad (5)$$

and the error gradient for a neuron $j$ in a hidden layer $L$ is given by

$$\delta_j^{L} = S_j^{L} \left(1 - S_j^{L}\right) \sum_{k=1}^{m} \delta_k^{L+1} w_{kj}^{L+1} \qquad (6)$$

Heuristics to improve the performance of backpropagation include adding a momentum term and training multiple networks on the same data with different small random weight initializations.
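To make equations (4)-(6) concrete, the following NumPy sketch performs the backward pass over a network that has already been unfolded in time, treating it as a plain stack of sigmoid layers. The layer sizes in the example, and the omission of the bookkeeping that sums updates over the copies of the shared recurrent weights, are simplifications chosen here and are not details from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_unfolded(x, w):
    """Forward pass through the unfolded network: S[0] is the input layer and
    S[l+1] = g(w[l] S[l]) with the sigmoid discriminant g."""
    S = [x]
    for w_l in w:
        S.append(sigmoid(w_l @ S[-1]))
    return S

def backprop_unfolded(S, w, d, alpha=0.2):
    """Delta computations of equations (4)-(6) on the unfolded network.
    Returns one weight-update matrix per layer; in true BPTT the updates for
    the copies of the shared recurrent weights would also be summed."""
    L = len(S) - 1
    delta = (d - S[L]) * S[L] * (1.0 - S[L])              # equation (5), output layer
    updates = [None] * L
    for l in range(L - 1, -1, -1):
        updates[l] = alpha * np.outer(delta, S[l])        # equation (4)
        delta = S[l] * (1.0 - S[l]) * (w[l].T @ delta)    # equation (6), hidden layers
    return updates

# Tiny example with hypothetical sizes: three unfolded layers of 5 units, 2 outputs
rng = np.random.default_rng(0)
w = [rng.uniform(-1, 1, size=(5, 5)),
     rng.uniform(-1, 1, size=(5, 5)),
     rng.uniform(-1, 1, size=(2, 5))]
S = forward_unfolded(rng.standard_normal(5), w)
updates = backprop_unfolded(S, w, d=np.array([1.0, 0.0]))
w = [w_l + dw for w_l, dw in zip(w, updates)]             # apply the updates of equation (2)
```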
2.3 Evolution of Recurrent Neural Networks

2.3.1 Genetic Algorithms

Genetic algorithms provide a learning method motivated by biological evolution. They are search techniques that can be used both for solving problems and for modeling evolutionary systems [15]. The problem faced by a genetic algorithm is to search a space of candidate hypotheses and find the best hypothesis, where the fitness of a hypothesis is a numerical measure of how well it solves the problem at hand. The algorithm operates by iteratively updating a pool of hypotheses called the population, whose individual members are called chromosomes. All members of the population are evaluated by the fitness function in each iteration. A new population is then generated by probabilistically selecting the most fit chromosomes from the current population; some of the selected chromosomes are added to the new generation unchanged, while others are selected as parent chromosomes. Parent chromosomes are used for creating new offspring by applying genetic operators such as crossover and mutation. Traditionally, chromosomes are represented as bit strings; however, real-number representations are also possible.

2.3.2 Genetic Algorithms for Neural Networks

In order to use genetic algorithms for training neural networks, we need to represent the problem as chromosomes. Real-valued weights must be encoded in the chromosome rather than binary values, which is done by altering the crossover and mutation operators. The crossover operator takes two parent chromosomes and creates a single child chromosome by randomly selecting corresponding genetic material from both parents. The mutation operator adds a small random real number between -1 and 1 to a randomly selected gene in the chromosome.
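The sketch below shows real-valued crossover and mutation operators of the kind just described (Python with NumPy). The chromosome length, the per-gene coin-flip used in crossover, and the way the mutation probability is applied are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def crossover(parent_a, parent_b):
    """Create one child by taking each gene (weight) at random from either parent."""
    mask = rng.random(parent_a.shape) < 0.5
    return np.where(mask, parent_a, parent_b)

def mutate(chromosome, mutation_prob=0.1):
    """With some probability, add a small random real number in [-1, 1]
    to one randomly selected gene of the chromosome."""
    child = chromosome.copy()
    if rng.random() < mutation_prob:
        gene = rng.integers(len(child))
        child[gene] += rng.uniform(-1.0, 1.0)
    return child

# Hypothetical chromosomes encoding all weights of a small recurrent network
parent_a = rng.uniform(-1, 1, size=100)
parent_b = rng.uniform(-1, 1, size=100)
child = mutate(crossover(parent_a, parent_b))
```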

In evolutionary neural learning, the task of the genetic algorithm is to find the optimal set of weights for a network topology which minimizes the error function. The fitness function must reflect the performance of the neural network; we therefore take the fitness to be the reciprocal of the sum of squared errors of the network. To evaluate the fitness function, each weight encoded in the chromosome is assigned to the respective weight link of the network; the training set of examples is then presented to the network, which propagates the information forward, and the sum of squared errors is calculated. In this way, the genetic algorithm attempts to find a set of weights which minimizes the error function of the network. Recurrent neural networks have been trained by evolutionary computation methods such as genetic algorithms, which optimize the weights in the network architecture for a particular problem. Compared to gradient descent learning, genetic algorithms can help the network escape from local minima.
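A minimal sketch of such a fitness evaluation is given below: the chromosome is decoded into the weight matrices of the recurrent network, the training samples are propagated forward, and the reciprocal of the sum of squared errors is returned. The chromosome layout and the extra state-to-output weight matrix U are assumptions made here for illustration; equation (1) in the paper only specifies the state dynamics.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fitness(chromosome, inputs, targets, n_hidden, n_in, n_out):
    """Fitness of a chromosome: the reciprocal of the network's sum of squared
    errors over the training set, as described in Section 2.3.2."""
    # Decode the flat chromosome into V (state-to-state), W (input-to-state)
    # and U (state-to-output) weight matrices; this layout is an assumption.
    i = 0
    V = chromosome[i:i + n_hidden * n_hidden].reshape(n_hidden, n_hidden)
    i += V.size
    W = chromosome[i:i + n_hidden * n_in].reshape(n_hidden, n_in)
    i += W.size
    U = chromosome[i:i + n_out * n_hidden].reshape(n_out, n_hidden)

    sse = 0.0
    for sequence, target in zip(inputs, targets):
        S = np.zeros(n_hidden)                     # reset the state for each sample
        for frame in sequence:                     # propagate the MFCC frames forward
            S = sigmoid(V @ S + W @ frame)
        output = sigmoid(U @ S)
        sse += np.sum((target - output) ** 2)      # sum of squared errors
    return 1.0 / (sse + 1e-12)                     # reciprocal of SSE (avoid division by zero)
```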
2.4 Knowledge Based Neurocomputing

The general paradigm of knowledge-based neurocomputing combines symbolic knowledge with neural networks for better training and generalization performance [7]. The fidelity of the mapping of the prior knowledge is very important, since the network may not take advantage of poorly encoded knowledge; poorly encoded knowledge may even hinder the learning process. Good prior knowledge encoding may provide the network with beneficial features: 1) the learning process may converge to a solution faster, meaning better training performance; 2) networks trained with prior knowledge may generalize better than networks trained without it; and 3) the rules in the prior knowledge may help to generate additional training data which are not present in the original data set.

Prior knowledge, usually represented as explicit rules in symbolic form, is encoded in neural networks by programming some of the weights prior to training [7]. In feedforward neural networks, prior knowledge expressed as propositional logic is encoded by programming a subset of weights. Prior knowledge also determines the topology of the network, i.e. the number of neurons and hidden layers appropriate for encoding the knowledge. The paradigm has been successfully applied to real-world problems including bio-conservation [16] and molecular biology [17]. The prior or expert knowledge helps the network achieve better generalization and training performance than a network architecture without prior knowledge encoding. For recurrent neural networks, finite-state automata are the basis for knowledge insertion. It has been shown that deterministic finite-state automata can be encoded in discrete-time second-order recurrent neural networks by directly programming a small subset of the available weights [6]. For first-order recurrent neural networks, a method for encoding finite-state automata has also been proposed [18].

2.5 Speech Phoneme Classification

A speech sequence contains a huge amount of irrelevant information; in order to model it, feature extraction is necessary. In feature extraction, useful information is extracted from the speech sequence and then used for modeling. Recurrent neural networks and hidden Markov models have been successfully applied to modeling speech sequences [1,4]; they have been used to recognize words and phonemes. The performance of a speech recognition system can be measured in terms of accuracy and speed. Recurrent neural networks are capable of modeling complicated recognition tasks and have shown higher recognition accuracy than hidden Markov models on low-quality, noisy data; however, hidden Markov models have been shown to perform better on large vocabularies. Extensive research on speech recognition has been carried out for more than forty years, yet systems that perform excellently in environments with background noise remain out of reach.

Mel frequency cepstral coefficients (MFCC) are a useful feature extraction technique, as the Mel filter has characteristics similar to the human auditory system [19]; the human ear performs similar processing before presenting information to the brain. We will apply MFCC feature extraction to phonemes taken from the TIMIT speech database. MFCC feature extraction proceeds as follows. A frame of the speech signal is obtained by windowing and presented to the discrete Fourier transform to change the signal from the time domain to the frequency domain. The resulting spectrum is then mapped onto the Mel scale using triangular overlapping windows. Finally, we compute the log energy at the output of each filter and apply a discrete cosine transform to the Mel amplitudes, from which we obtain a vector of MFCC features.
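The following sketch mirrors the MFCC procedure just described (windowing, discrete Fourier transform, Mel-scale triangular filters, log energies, discrete cosine transform). It uses the third-party librosa library only to build the Mel filterbank; librosa, the Hamming window, and the choice of 26 Mel filters are assumptions made for illustration and are not specified in the paper.

```python
import numpy as np
import librosa                      # used here only for the Mel filterbank (an assumption)
from scipy.fftpack import dct

def mfcc_frames(signal, sample_rate, frame_len=512, hop=256, n_mels=26, n_mfcc=12):
    """MFCC extraction following the steps described in Section 2.5:
    windowing -> DFT -> Mel-scale triangular filters -> log energy -> DCT."""
    window = np.hamming(frame_len)
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=frame_len, n_mels=n_mels)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):    # window of 512 every 256 samples
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2               # DFT power spectrum
        mel_energies = mel_fb @ spectrum                         # map onto the Mel scale
        log_mel = np.log(mel_energies + 1e-10)                   # log energy of each filter
        features.append(dct(log_mel, type=2, norm='ortho')[:n_mfcc])  # DCT -> 12 MFCCs
    return np.array(features)
```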

3. Combining Evolutionary and Gradient Descent Learning

In the past, a vast amount of research has been done to improve the training performance of neural networks. Some of the proposed methods include adding a momentum term during training, pruning the network architecture, and combining expert knowledge with neural network training. In the previous sections, we discussed the major training methods for recurrent neural networks and outlined how their training performance can be improved. Knowledge-based neurocomputing has proven successful, as the paradigm uses symbolic knowledge to program a subset of weights in the network architecture prior to training; however, it cannot be applied where expert knowledge is not available. We have also discussed how recurrent neural networks are trained using genetic algorithms. Genetic algorithms search for the optimal set of weights for a given network architecture, and we need to investigate which sets of weight initializations are best for genetic training. We have addressed the problem that, in gradient descent learning, the network may become trapped in a local minimum, resulting in poor training and generalization performance; training with genetic algorithms does not suffer from this problem.

In this paper, we combine both genetic and gradient descent learning. We will train a recurrent neural network architecture using genetic neural learning; after successful training, we will apply gradient descent learning to further refine the knowledge acquired by genetic neural learning. We will also train recurrent neural networks using gradient descent learning alone and record their training performance for the classification of speech phonemes obtained from the TIMIT database, and we will train recurrent neural networks for phoneme classification using genetic neural learning alone. We will use different sets of weight initializations, define a genetic population size, and train the network architecture until it can learn the training data set.
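A sketch of the combined scheme is shown below: the genetic algorithm evolves weight chromosomes until the best network reaches the target training accuracy, and the best chromosome is then refined by gradient descent. The helper callables (fitness, training accuracy, crossover, mutation and one BPTT epoch) are assumed to be supplied by the caller, and the truncation-based parent selection is a simplification of the probabilistic selection described in Section 2.3.1.

```python
import random
from typing import Callable, List, Sequence

Chromosome = Sequence[float]

def combined_training(
    population: List[Chromosome],
    fitness: Callable[[Chromosome], float],
    training_accuracy: Callable[[Chromosome], float],
    crossover: Callable[[Chromosome, Chromosome], Chromosome],
    mutate: Callable[[Chromosome], Chromosome],
    gradient_epoch: Callable[[Chromosome], Chromosome],
    target_accuracy: float = 0.88,
    max_generations: int = 1000,
    max_epochs: int = 100,
) -> Chromosome:
    """Combined scheme of Section 3: evolve weight chromosomes with a genetic
    algorithm until the best network learns the target fraction of the training
    set, then refine the best chromosome with gradient descent (BPTT)."""
    best = max(population, key=fitness)
    for _ in range(max_generations):                       # genetic neural learning
        if training_accuracy(best) >= target_accuracy:
            break
        # Simplified parent selection: keep the fitter half of the population.
        parents = sorted(population, key=fitness, reverse=True)[: len(population) // 2]
        population = [mutate(crossover(*random.sample(parents, 2))) for _ in population]
        best = max(population + [best], key=fitness)
    weights = best                                         # knowledge acquired by genetic training
    for _ in range(max_epochs):                            # gradient descent refinement
        weights = gradient_epoch(weights)
    return weights
```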
4. Empirical Results and Discussion

4.1 MFCC Feature Extraction

We used Mel frequency cepstral coefficients for feature extraction from two phonemes read from the TIMIT speech database. We read the phonemes b and d from the training and testing sets of the TIMIT database. The training data set contained 645 samples while the testing set consisted of 238 samples. For each phoneme read, we applied a window of size 512 every 256 sample points. We windowed the signal and presented it to the discrete Fourier transform, mapped the spectrum onto the Mel scale using triangular filters, and then applied a discrete cosine transform to the Mel amplitudes. In this way we obtained a vector of 12 MFCC features for each frame of the phoneme.

4.2 Recurrent Neural Networks using Gradient Descent for Phoneme Classification

We used the training and testing data sets of features from the two phonemes b and d as discussed in subsection 4.1. The recurrent neural network topology was as follows: 12 neurons in the input layer, representing the speech feature input, and 2 neurons in the output layer, one for each phoneme. We used a learning rate of 0.2 and ran experiments with 12, 14, 16 and 18 neurons in the hidden layer. We used the backpropagation through-time algorithm, which employs gradient descent, for training.

We ran two major experiments: Table 1 shows illustrative results for experiment 1, which uses small random weights in the range of -1 to 1, while Table 2 shows results for larger initial weight values. We terminated training once the network could learn 88% of the training samples and tested the network's generalization performance on a data set not included in the training set. For both experiments, we set a maximum training time of 100 epochs. The results show that gradient descent successfully trains recurrent neural networks for phoneme classification. An excessive number of neurons in the hidden layer causes training difficulty. The two different sets of weight initializations have no major effect on generalization performance. Upon successful training, the best generalization performance recorded was 82.6% on the presentation of the data set not included in training.

Table 1: Gradient descent learning (small random weights initialised in the range of -1 to 1)
Hidden neurons   Epochs   Training performance   Generalization performance
12               100      82.8%                  82.6%
14               100      87.5%                  82.6%
16               100      88%                    82.6%
18               100      0.2%                   0.4%

Table 2: Gradient descent learning (large random weights initialised in the range of -5 to 5)
Hidden neurons   Epochs   Training performance   Generalization performance
12               100      88%                    82.6%
14               100      88%                    82.6%
16               100      88%                    82.6%
18               100      0.2%                   0.4%

4.3 Recurrent Neural Networks using Genetic Algorithms

We obtained the training and testing data sets of the phonemes b and d as discussed in Section 4.1. We used the same recurrent neural network topology: 12 neurons in the input layer, representing the speech frame input, and 2 neurons in the output layer, one for each phoneme. We experimented with different numbers of neurons in the hidden layer. We ran some sample experiments and found that a population size of 40, a crossover probability of 0.7 and a mutation probability of 0.1 gave good genetic training performance; we therefore used these values for all our experiments. We ran two major experiments with different weight initializations prior to training and trained the network until it could learn at least 88% of the samples in the training set. Illustrative results for each experiment are shown in Table 3 and Table 4, respectively. The results show a generalization performance of 82.6 percent on unseen samples which were not included in the training process. Genetic neural learning showed better training performance than gradient descent learning; however, their generalization performance is the same.

Table 3: Genetic neural learning (small random weights initialised in the range of -1 to 1)
Hidden neurons   Training time   Training performance   Generalization performance
12               2               88%                    82.6%
14               4               88%                    82.6%
16               7               88%                    82.6%
18               3               87.5%                  82.6%

Table 4: Genetic neural learning (large random weights initialised in the range of -5 to 5)
Hidden neurons   Training time   Training performance   Generalization performance
12               4               88%                    82.6%
14               4               88%                    82.6%
16               3               88%                    82.6%
18               2               88%                    82.6%

4.4 Combined Genetic and Gradient Descent Learning

We applied the combined genetic and gradient descent learning to phoneme classification using recurrent neural networks. We classified the two phonemes from the features extracted in Section 4.1 and used the network topology discussed in Section 4.3. We applied genetic algorithms for training; once the network had learnt 88% of the samples, we terminated genetic training and applied gradient descent to further refine the knowledge learnt by genetic training. We then trained for 100 training epochs. Tables 5 and 6 show illustrative results of the two major experiments initialized with the two different sets of weights, respectively. The results show that the generalization performance of combined genetic and gradient descent learning does not improve significantly when compared to the performance of genetic neural learning alone.

Table 5: Genetic and gradient descent learning (small random weights initialised in the range of -1 to 1)
Hidden neurons   Epochs   Training performance   Generalization performance
12               100      81.8%                  82.6%
14               100      81.1%                  82.6%
16               100      81.3%                  82.6%
18               100      81.3%                  82.6%

Table 6: Genetic and gradient descent learning (large random weights initialised in the range of -5 to 5)
Hidden neurons   Epochs   Training performance   Generalization performance
12               100      78.1%                  82.6%
14               100      85.6%                  82.6%
16               100      81.3%                  82.6%
18               100      81.5%                  82.6%
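For reference, the experimental settings reported in this section (network topology, learning rate, training budget, and the genetic algorithm parameters) can be collected in a small configuration object. The field names below are chosen for illustration and do not appear in the paper.

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    """Settings reported in Section 4, gathered in one place for convenience."""
    n_input: int = 12                       # MFCC features per frame
    n_output: int = 2                       # phonemes 'b' and 'd'
    hidden_sizes: tuple = (12, 14, 16, 18)  # hidden layer sizes tried
    learning_rate: float = 0.2              # gradient descent (BPTT)
    max_epochs: int = 100                   # maximum training time
    population_size: int = 40               # genetic neural learning
    crossover_prob: float = 0.7
    mutation_prob: float = 0.1
    stop_training_accuracy: float = 0.88    # terminate once 88% of training samples are learnt
    weight_init_ranges: tuple = ((-1, 1), (-5, 5))
```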

5. Conclusions

We have discussed popular recurrent neural network training paradigms and outlined their strengths and limitations. We discussed the application of recurrent neural networks to speech phoneme classification using Mel frequency cepstral coefficient feature extraction. We successfully trained recurrent neural networks to classify the phonemes b and d extracted from the TIMIT speech database using both gradient descent and genetic training methods, and we compared and combined genetic and gradient descent learning in recurrent neural networks. Our results demonstrate that genetic neural learning has better training performance than gradient descent; however, their generalization performance is the same. We combined genetic and gradient descent learning for recurrent network training and found that its generalization performance does not improve over genetic neural learning alone. The application of the combined training method to other real-world problems remains an open question.

6. References

[1] A. J. Robinson, "An application of recurrent nets to phone probability estimation", IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 298-305.
[2] C. L. Giles, S. Lawrence and A. C. Tsoi, "Rule inference for financial prediction using recurrent neural networks", Proc. of the IEEE/IAFE Computational Intelligence for Financial Engineering, New York City, USA, 1997, pp. 253-259.
[3] K. Murakami and H. Taguchi, "Gesture recognition using recurrent neural networks", Proc. of the SIGCHI Conference on Human Factors in Computing Systems: Reaching Through Technology, Louisiana, USA, 1991, pp. 237-242.
[4] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, vol. 12, 1998, pp. 75-98.
[5] C. L. Giles, C. W. Omlin and K. Thornber, "Equivalence in knowledge representation: automata, recurrent neural networks, and dynamical systems", Proc. of the IEEE, vol. 87, no. 9, 1999, pp. 1623-1640.
[6] C. W. Omlin and C. L. Giles, "Pruning recurrent neural networks for improved generalization performance", IEEE Transactions on Neural Networks, vol. 5, no. 5, 1994, pp. 848-851.
[7] G. Towell and J. W. Shavlik, "Knowledge based artificial neural networks", Artificial Intelligence, vol. 70, no. 4, 1994, pp. 119-166.
[8] C. Kim Wing Ku, M. Wai Mak and W. Chi Siu, "Adding learning to cellular genetic algorithms for training recurrent neural networks", IEEE Transactions on Neural Networks, vol. 10, no. 2, 1999, pp. 239-252.
[9] P. Manolios and R. Fanelli, "First order recurrent neural networks and deterministic finite state automata", Neural Computation, vol. 6, no. 6, 1994, pp. 1154-1172.
[10] R. L. Watrous and G. M. Kuhn, "Induction of finite-state languages using second-order recurrent networks", Proc. of Advances in Neural Information Processing Systems, California, USA, 1992, pp. 309-316.
[11] T. Lin, B. G. Horne, P. Tino and C. L. Giles, "Learning long-term dependencies in NARX recurrent neural networks", IEEE Transactions on Neural Networks, vol. 7, no. 6, 1996, pp. 1329-1338.
[12] S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation, vol. 9, no. 8, 1997, pp. 1735-1780.
[13] P. J. Werbos, "Backpropagation through time: what it does and how to do it", Proc. of the IEEE, vol. 78, no. 10, 1990, pp. 1550-1560.
[14] Y. Bengio, P. Simard and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult", IEEE Transactions on Neural Networks, vol. 5, no. 2, 1994, pp. 157-166.
[15] T. M. Mitchell, Machine Learning, McGraw Hill, 1997.
[16] R. Chandra, R. Knight and C. W. Omlin, "Combining expert knowledge and ground truth: a knowledge based neurocomputing paradigm for bio-conservation decision support systems", Proc. of the International Conference on Environmental Management, Hyderabad, India, 2005.
[17] C. W. Omlin and S. Snyders, "Inductive bias strength in knowledge-based neural networks: application to magnetic resonance spectroscopy of breast tissues", Artificial Intelligence in Medicine, vol. 28, no. 2, 2003.
[18] P. Frasconi, M. Gori, M. Maggini and G. Soda, "Unified integration of explicit rules and learning by example in recurrent networks", IEEE Transactions on Knowledge and Data Engineering, vol. 7, no. 2, 1995, pp. 340-346.
[19] I. Potamitis, N. Fakotakis and G. Kokkinakis, "Improving the robustness of noisy MFCC features using minimal recurrent neural networks", Proc. of the IEEE International Joint Conference on Neural Networks, vol. 5, 2000, p. 5271.