

IDIAP Research Report 08-05 (May 2008, to appear in EUSIPCO 2008)

Spectro-Temporal Features for Automatic Speech Recognition using Linear Prediction in Spectral Domain

Samuel Thomas, Sriram Ganapathy, Hynek Hermansky
IDIAP Research Institute, Martigny, Switzerland
Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

Abstract. Frequency Domain Linear Prediction (FDLP) provides an efficient way to represent temporal envelopes of a signal using auto-regressive models. For the input speech signal, we use FDLP to estimate temporal trajectories of sub-band energy by applying linear prediction on the cosine transform of sub-band signals. The sub-band FDLP envelopes are used to extract spectral and temporal features for speech recognition: the spectral features are derived by integrating the temporal envelopes in short-term frames, and the temporal features are formed by converting these envelopes into modulation frequency components. These features are then combined at the phoneme posterior level and used as input features for a hybrid HMM-ANN based phoneme recognizer. The proposed spectro-temporal features provide a phoneme recognition accuracy of 69.1% on the TIMIT database, an improvement of 4.8% over the Perceptual Linear Prediction (PLP) baseline.

1 Introduction

Traditionally, acoustic features for Automatic Speech Recognition (ASR) systems are extracted by applying Bark- or Mel-scale integrators on power spectral estimates in short analysis windows (10-30 ms) of the speech signal. Typical examples of such features are the Mel Frequency Cepstral Coefficients (MFCC) [1] and Perceptual Linear Prediction (PLP) [2]. Most of the information contained in these acoustic features relates to formants, which provide important cues for recognizing basic speech units. The signal dynamics are represented by a sequence of short-term feature vectors, each vector forming a sample of the underlying process. Additional information about the dynamics of the underlying speech signal is incorporated into these feature vectors using derivative features. However, the problems of time-frequency resolution and of efficiently sampling the short-term representation are addressed in an ad-hoc manner.

It has been shown that important information for speech perception lies in the 1-16 Hz range of modulation frequencies [3]. In order to exploit the information at these modulation frequencies, relatively long temporal segments of the speech signal need to be analyzed. An explicit incorporation of information about speech dynamics has been proposed for feature extraction [4, 5, 6], where long temporal trajectories (typically 1000 ms) of spectral energy in critical bands are used. Recently, the technique of linear prediction (LP) in the spectral domain (originally proposed for temporal noise shaping in audio coding [7]) was used for ASR feature extraction [8]: a representation of the temporal envelope in different frequency sub-bands is obtained as the dual of conventional linear prediction in the time domain. In FDLP, the poles of the auto-regressive (AR) model represent temporal peaks rather than spectral peaks. By using analysis windows on the order of hundreds of milliseconds, the technique automatically decides the distribution of the poles to best model the temporal envelope. The model has two important advantages:

- Fine time-dependent resolution provides information about transient events in time, such as stop bursts.
- Long-term summarization of power in spectral bands captures a complete description of linguistic units lasting more than 10 ms.

In this paper, we propose to exploit these properties of FDLP by extracting spectro-temporal features for ASR. Specifically, the FDLP envelopes are used to obtain spectral and temporal features, which are combined to form a joint spectro-temporal feature set. These features are input to a hybrid Hidden Markov Model - Artificial Neural Network (HMM-ANN) phoneme recognition system [9]. The HMM-ANN system uses a Multi-Layer Perceptron (MLP) to estimate phoneme posterior probabilities and a Viterbi decoder to find the best phoneme sequence.

The rest of the paper is organized as follows. In Sec. 2, we describe the FDLP technique, which approximates temporal envelopes of a signal using linear prediction in the spectral domain. The extraction of spectro-temporal features from the temporal envelopes is given in Sec. 3. Experiments with the proposed features on a phoneme recognition task with the TIMIT database are reported in Sec. 4, along with a comparison to other feature extraction techniques from the literature. We conclude with a discussion of the proposed features in Sec. 5.
2 Modelling sub-band temporal envelopes using FDLP

The Hilbert envelope, which is the squared magnitude of the analytic signal, represents the instantaneous energy of a signal in the time domain. Hilbert envelopes are typically computed either by using the Hilbert transform operator in the time domain or by exploiting the causality of the Discrete Fourier Transform (DFT) [10].
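For concreteness, the non-parametric Hilbert envelope can be computed in a few lines; this is our own minimal Python sketch (SciPy constructs the analytic signal with an FFT-based method), not code from the paper:

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(x):
    """Squared magnitude of the analytic signal: the instantaneous energy."""
    analytic = hilbert(x)          # FFT-based analytic signal
    return np.abs(analytic) ** 2
```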

[Figure 1: Illustration of the all-pole modelling property of FDLP. (a) a portion of the speech signal, (b) its Hilbert envelope, (c) the all-pole model obtained using FDLP.]

[Figure 2: Deriving sub-band temporal envelopes from the speech signal using FDLP (DCT, critical-band windowing, FDLP).]

However, in order to use the Hilbert envelope as a dual of the power spectrum for speech recognition tasks, we require a parametric model. FDLP is an efficient technique for auto-regressive (AR) modelling of the temporal envelopes of a signal [7]. It is the dual of conventional Time Domain Linear Prediction (TDLP): whereas in TDLP the AR model approximates the power spectrum of the input signal, FDLP fits an all-pole model to the Hilbert envelope (the squared magnitude of the analytic signal). In our case, the FDLP technique is implemented in two steps: first, the discrete cosine transform (DCT) is applied to long segments of speech to obtain a real-valued spectral representation of the signal; then, linear prediction is performed on the DCT coefficients to obtain a parametric model of the temporal envelope. Fig. 1 illustrates the AR modelling property of FDLP: it shows (a) a portion of a speech signal of 500 ms duration, (b) its Hilbert envelope computed using the Fourier transform technique [10], and (c) an all-pole approximation (of order 50) to the Hilbert envelope using FDLP.

For ASR tasks, long segments (hundreds of milliseconds) of the speech signal are decomposed into frequency sub-bands by windowing the DCT. Using FDLP, an all-pole, minimum-phase estimate of the temporal dynamics of each sub-band signal is obtained. The block schematic for the extraction of sub-band temporal envelopes from the speech signal is shown in Fig. 2. The whole set of sub-band temporal envelopes forms a two-dimensional (time-frequency) representation of the input signal energy.
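The two-step procedure (DCT of a long segment, then linear prediction on a windowed span of DCT coefficients) can be sketched in Python with NumPy/SciPy. This is a simplified illustration under stated assumptions: rectangular sub-band windows, autocorrelation-method LP, and hypothetical function names; the band layout, model order, and gain handling of the actual system may differ.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """Autocorrelation-method linear prediction."""
    r = np.correlate(x, x, mode="full")[len(x) - 1: len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])   # normal equations
    g = r[0] - np.dot(a, r[1:])                    # prediction error power
    return np.concatenate(([1.0], -a)), g          # A(z) coeffs, gain

def fdlp_envelope(segment, lo, hi, order, n_points):
    """All-pole (FDLP) estimate of one sub-band's temporal envelope.

    segment  : long speech segment (hundreds of ms of samples)
    lo, hi   : DCT-coefficient indices delimiting the sub-band
    order    : AR model order (the paper uses ~100 poles per second)
    n_points : envelope samples to return (spanning the segment duration)
    """
    c = dct(segment, type=2, norm="ortho")   # real-valued spectral repr.
    sub = c[lo:hi]                           # sub-band by windowing the DCT
    a, g = lpc(sub, order)                   # LP on *spectral* samples
    # The AR model's "power spectrum", read along the time axis, is the
    # all-pole approximation to the sub-band Hilbert envelope.
    H = np.fft.rfft(a, n=2 * n_points)
    return g / (np.abs(H[:n_points]) ** 2 + 1e-12)
```

Evaluating the AR polynomial on the unit circle and inverting it yields an envelope read along time rather than frequency, which is the duality at the heart of FDLP.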

[Figure 3: Schematic of the joint spectro-temporal features for posterior-based ASR: separate probability estimators for even-band and odd-band temporal features and for spectral features, followed by a probability merger.]

3 Spectro-temporal features using FDLP

The sub-band temporal envelopes obtained from FDLP are used to derive spectral and temporal features, which are then combined into joint spectro-temporal features for a posterior-based speech recognition system. The joint spectro-temporal features adaptively capture fine temporal nuances with high temporal resolution while simultaneously summarizing the spectral evolution over time scales of hundreds of milliseconds.

3.1 Deriving short-term spectral features from sub-band temporal envelopes

Conventional feature extraction methods obtain short-term spectral features by integrating an estimate of the power spectrum of the signal in sub-bands (e.g., PLP [2]). Just as the distribution of energy in the spectral domain is expressed by the power spectrum, the distribution of energy in the time domain is expressed by the Hilbert envelope. Since the integral of signal energy is identical in the time and frequency domains, the Hilbert envelope can equivalently be used to obtain sub-band-energy-based short-term spectral features. The sub-band temporal envelopes are obtained by applying the FDLP technique to relatively long temporal segments (1000 ms) of the input signal. These sub-band envelopes are integrated in short-term frames (on the order of 25 ms, with a shift of 10 ms), and the resulting short-term sub-band energies are converted to short-term cepstral features, similar to the PLP feature extraction technique [2]. As the features are derived from short temporal segments, they capture spectral details on the order of 10 ms.
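A minimal sketch of this short-term integration step, assuming the sub-band envelopes are available as an array sampled at a known rate. The actual PLP-style cepstral conversion (equal-loudness weighting, cube-root compression, LP-based cepstra) is richer than the plain log-plus-DCT used here, and the function name is hypothetical:

```python
import numpy as np
from scipy.fft import dct

def short_term_spectral_features(envelopes, rate, frame=0.025, shift=0.010,
                                 n_ceps=13):
    """Integrate sub-band envelopes in short-term frames, then take cepstra.

    envelopes : (n_bands, n_samples) FDLP envelopes, with n_bands >= n_ceps
    rate      : envelope sampling rate in samples per second
    """
    flen, fshift = int(frame * rate), int(shift * rate)
    n_frames = 1 + (envelopes.shape[1] - flen) // fshift
    feats = []
    for t in range(n_frames):
        seg = envelopes[:, t * fshift: t * fshift + flen]
        band_energy = seg.sum(axis=1)            # integrate energy per band
        log_energy = np.log(band_energy + 1e-12)
        feats.append(dct(log_energy, type=2, norm="ortho")[:n_ceps])
    return np.asarray(feats)                     # (n_frames, n_ceps)
```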

[Figure 4: HMM-ANN speech recognizer: spectro-temporal feature extraction, probability estimator, HMM-ANN decoder.]

3.2 Deriving long-term temporal features from sub-band temporal envelopes

The long-term sub-band envelopes from FDLP form a compact representation of the temporal dynamics over long regions of the speech signal. We use cepstral recursion to convert the all-pole models of the temporal trajectories into modulation spectral components [8]. Since the speech recognizer requires features at a frame rate of 10 ms, the modulation frequency components for the current frame in each sub-band are obtained along with contextual information from neighboring frames (in a manner similar to the TRAP-TANDEM setup [5]). The recognition performance depends on the number of neighboring frames forming the context for the current frame (varied from 10 to 40). The temporal features for each sub-band are stacked together and fed to the posterior probability estimator.

3.3 Combining spectro-temporal features

The posterior probability estimator is an ANN trained on features from the training data to estimate phoneme posterior probabilities [5]. The number of parameters to be trained in the Multi-Layer Perceptron (MLP) depends on the input feature dimension and is limited by the amount of training data. Since the dimension of the temporal features is high, we use two posterior estimators for the temporal features, corresponding to the even and odd bands respectively. Spectral features with a context of 9 frames are used in another posterior probability estimator, and the outputs of these three neural-net classifiers are combined using the Dempster-Shafer (DS) theory of evidence [11]. Fig. 3 shows the schematic of the proposed combination of spectral and temporal features. The resulting phoneme posterior probabilities are used in a phoneme-posterior-based speech recognition system.

4 Experiments and Results

The phoneme recognition system is based on the Hidden Markov Model - Artificial Neural Network (HMM-ANN) paradigm [9], shown in Fig. 4. The MLP estimates the posterior probability of phonemes given the acoustic evidence, $P(q_t = i \mid x_t)$, where $q_t$ denotes the phoneme index at frame $t$ and $x_t$ denotes the feature vector taken with a window of several frames. The relation between the posterior probability $P(q_t = i \mid x_t)$ and the likelihood $p(x_t \mid q_t = i)$ is given by Bayes' rule:

\[ \frac{p(x_t \mid q_t = i)}{p(x_t)} = \frac{P(q_t = i \mid x_t)}{P(q_t = i)}. \tag{1} \]

It is shown in [9] that a neural network with sufficient capacity, trained on enough data, estimates the true Bayesian a-posteriori probability. The scaled likelihood in an HMM state is given by Eq. 1, where we assume an equal prior probability $P(q_t = i)$ for each phoneme $i = 1, 2, \ldots, 39$. The state transition matrix is fixed with equal probabilities for self- and next-state transitions, and the Viterbi algorithm is applied to decode the phoneme sequence.

Experiments were performed on the TIMIT database, excluding the 'sa' dialect sentences. All speech files are sampled at 16 kHz. The training set consists of 3000 utterances from 375 speakers, the cross-validation set of 696 utterances from 87 speakers, and the test set of 1344 utterances from 168 speakers. The TIMIT database, which is hand-labeled using 61 labels, is mapped to the standard set of 39 phonemes [12].
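The combination and decoding steps can be sketched compactly. The DS combination of [11] is richer than what follows: with all belief mass on singleton phoneme hypotheses, Dempster's rule reduces to an element-wise product with renormalization, which is what this illustrative Python sketch implements, together with the posterior-to-scaled-likelihood conversion of Eq. 1 and a bare-bones Viterbi pass (function names are ours):

```python
import numpy as np

def combine_posteriors(posterior_list):
    # Element-wise product + renormalisation: Dempster's rule restricted
    # to singleton hypotheses (the DS merger of [11] is more general).
    p = np.ones_like(posterior_list[0])
    for post in posterior_list:
        p = p * post
    return p / p.sum(axis=-1, keepdims=True)

def scaled_log_likelihoods(posteriors, priors=None):
    # Eq. (1): p(x|q=i)/p(x) = P(q=i|x)/P(q=i); equal priors assumed here.
    n = posteriors.shape[-1]
    priors = np.full(n, 1.0 / n) if priors is None else priors
    return np.log(posteriors + 1e-12) - np.log(priors)

def viterbi(log_lik, log_trans):
    # log_lik: (T, N) scaled log-likelihoods; log_trans: (N, N) fixed matrix.
    T, N = log_lik.shape
    delta = np.empty((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_lik[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]                                 # best state sequence
```

In the setup described above, log_trans would encode equal self- and next-state transition probabilities; a real decoder would additionally apply the phoneme insertion penalty tuned on the cross-validation data.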

As explained in Sec. 3, the speech signal is processed to extract spectro-temporal features for every frame. These features are mean/variance normalized (across the training data set) to obtain feature vectors for every 10 ms of speech. A three-layer MLP is used to estimate the phoneme posterior probabilities. The network is trained using the standard back-propagation algorithm with a cross-entropy error criterion; the learning rate and stopping criterion are controlled by the frame classification rate on the cross-validation data. In our system, the MLP has 1000 hidden neurons and 39 output neurons (with softmax nonlinearity) representing the phoneme classes. Phoneme recognition performance is measured in terms of phoneme accuracy. In the decoding step, all phonemes are considered equally probable (no language model), and the phoneme insertion penalty that gives maximum phoneme accuracy on the cross-validation data is used for the test data.

For deriving the temporal envelopes from the speech signal, the current frame of 10 ms is appended with a number of neighboring frames, in a manner similar to the TRAP-TANDEM setup [5]. The FDLP model order is fixed at an average rate of 100 poles per second for each sub-band. The sub-band decomposition for spectral feature extraction is done on a Mel scale, and 13 cepstral features are derived along with their first and second derivatives (similar to 39-dimensional PLP features). For the temporal features, we use a critical-band decomposition and obtain 21 modulation frequency components for each sub-band. These are appended with their first frequency derivatives (similar to M-RASTA features [6]) to obtain 42-dimensional temporal features for each sub-band. Owing to the limited amount of training data, we train separate MLPs for the even and odd bands.

The length of contextual information is varied, and phoneme recognition is performed with the spectro-temporal features. Table 1 summarizes the results of these experiments.

Table 1: Phoneme recognition accuracies (%) for FDLP-based spectro-temporal features.

  # Contextual Frames | Acc.
  10                  | 68.3
  20                  | 69.1
  30                  | 69.0
  40                  | 68.4

In the baseline experiments, PLP features with a 9-frame context [12], LP-TRAP features [8], and M-RASTA features [6] are used for the phoneme recognition task with the same hybrid HMM-ANN system. These results are shown in Table 2. The best baseline phoneme recognition accuracy, 67.6%, is obtained with the 9-frame-context PLP features.

Table 2: Phoneme recognition accuracies (%) for different feature extraction techniques.

  Feature | PLP (9 frames) | LP-TRAP | M-RASTA | FDLP
  Acc.    | 67.6           | 61.9    | 64.4    | 69.1

The best phoneme recognition accuracies are obtained for spectro-temporal features derived using a context of 20 frames (i.e., an FDLP frame length of 225 ms). The improvement over the PLP baseline is around 4.8%, which is statistically significant.
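For reference, the posterior estimator described above is easy to reconstruct with modern tooling. The PyTorch sketch below mirrors the stated configuration (three layers, 1000 hidden units, 39 softmax outputs, cross-entropy training); the input dimension and optimizer settings are hypothetical, and the original system predates PyTorch:

```python
import torch
from torch import nn

FEATURE_DIM = 351   # hypothetical; depends on context and band configuration

# Three-layer MLP: input -> 1000 hidden units -> 39 phoneme posteriors.
mlp = nn.Sequential(
    nn.Linear(FEATURE_DIM, 1000),
    nn.Sigmoid(),
    nn.Linear(1000, 39),   # CrossEntropyLoss applies log-softmax internally
)
loss_fn = nn.CrossEntropyLoss()                          # cross-entropy criterion
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01)   # hypothetical settings

def train_step(features, labels):
    """One back-propagation step on a mini-batch of frames."""
    optimizer.zero_grad()
    loss = loss_fn(mlp(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```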

5 Conclusions

We have proposed a novel method of extracting spectro-temporal features for ASR. Temporal envelopes of critical-band-sized sub-bands are modelled using Frequency Domain Linear Prediction; the spectral features are derived by integrating the FDLP envelopes in short-term frames, and the temporal features are obtained by converting the temporal envelopes into modulation frequency components. These features are combined at the phoneme posterior level and used as input features to a hybrid HMM-ANN recognizer. The proposed features provide noticeable improvements over the PLP baseline for phoneme recognition tasks. The results are promising and encourage us to experiment with other tasks under different test and noisy conditions.

6 Acknowledgements

This work is partially supported by the European IST Programme Project FP6-0027787 and the Swiss National Center of Competence in Research (NCCR) on Interactive Multi-modal Information Management (IM2), managed by the IDIAP Research Institute on behalf of the Swiss Federal Authorities. This paper only reflects the authors' views, and the funding agencies are not liable for any use that may be made of the information contained herein. The authors would like to thank Joel Pinto, Petr Motlicek, and Fabio Valente for helpful discussions and code fragments, and Marios Athineos and Dan Ellis for the PLP and FDLP feature extraction code.

References

[1] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 28, pp. 357-366, 1980.
[2] H. Hermansky, "Perceptual Linear Predictive (PLP) Analysis of Speech," J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, 1990.
[3] R. Drullman, J. M. Festen, and R. Plomp, "Effect of Reducing Slow Temporal Modulations on Speech Reception," J. Acoust. Soc. Am., vol. 95, no. 5, pp. 2670-2680, 1994.
[4] H. Hermansky and S. Sharma, "TRAPS - Classifiers of Temporal Patterns," in Proc. of ICSLP, Sydney, Australia, vol. 3, pp. 1003-1006, 1998.
[5] H. Hermansky, "TRAP-TANDEM: Data-driven Extraction of Temporal Features from Speech," in Proc. of IEEE ASRU, St. Thomas, US Virgin Islands, pp. 255-260, 2003.
[6] H. Hermansky and P. Fousek, "Multi-Resolution RASTA Filtering for TANDEM-Based ASR," in Proc. of Interspeech, Lisbon, Portugal, pp. 361-364, 2005.
[7] J. Herre and J. D. Johnston, "Enhancing the Performance of Perceptual Audio Coders by Using Temporal Noise Shaping (TNS)," in Proc. of the 101st AES Convention, Los Angeles, USA, pp. 1-24, 1996.
[8] M. Athineos, H. Hermansky, and D. P. W. Ellis, "LP-TRAPS: Linear Predictive Temporal Patterns," in Proc. of Interspeech, Jeju Island, Korea, pp. 1154-1157, 2004.
[9] H. Bourlard and N. Morgan, Connectionist Speech Recognition - A Hybrid Approach, Kluwer Academic Publishers, Boston, 1994.
[10] S. L. Marple, "Computing the Discrete-Time Analytic Signal via FFT," IEEE Trans. on Signal Processing, vol. 47, pp. 2600-2603, 1999.
[11] F. Valente and H. Hermansky, "Combination of Acoustic Classifiers Based on Dempster-Shafer Theory of Evidence," in Proc. of ICASSP, Honolulu, Hawaii, USA, pp. 1129-1132, 2007.
[12] J. Pinto, B. Yegnanarayana, H. Hermansky, and M. M. Doss, "Exploiting Contextual Information for Improved Phoneme Recognition," in Proc. of Interspeech, Antwerp, Belgium, pp. 1817-1820, 2007.