A Hybrid Neural Network/Hidden Markov Model Method for Automatic Speech Recognition
Hongbing Hu
Advisor: Stephen A. Zahorian
Department of Electrical and Computer Engineering, Binghamton University
03/18/2008

Introduction
- Automatic Speech Recognition (ASR)
  - Translates speech into text
  - Most investigated research topic in speech processing
- Applications
  - Speech user interface for computers (Microsoft speech recognition in Windows)
  - Telephone queries (operator/touch-tone replacement)
  - Voice dialing (for cell phones)
- Difficulties of Automatic Speech Recognition
  - Speaker variability (pronunciation, rate, overlaps)
  - Acoustic variability (noise, reverb, talker movement)
  - Style variability (reading vs. conversational speech)

Speech Recognition Architecture
[Figure: processing pipeline]
Speech waveform -> Feature extraction -> Speech features -> Classification (recognition) with an HMM/NN recognizer -> Phonemes (e.g. "i n i: d sil") -> Words (e.g. "I need a")

Hidden Markov Models (HMM)
- Hidden Markov Model (HMM)
  - A stochastic process used to determine the probability of an observable sequence
  - A finite-state machine: at each time $t$ at which a state $j$ is entered, an observation $o_t$ is emitted with probability density $b_j(o_t)$
  - The transition from state $i$ to state $j$ is modeled with probability $a_{ij}$
[Figure: five-state left-to-right HMM with states $S_1$ to $S_5$, forward transitions $a_{12}$, $a_{23}$, $a_{34}$, $a_{45}$, self-loops $a_{22}$, $a_{33}$, $a_{44}$, and emission densities $b_2$, $b_3$, $b_4$. Legend: $S_i$: state; $a_{ij}$: transition probability; $b_j(o_t)$: emission probability]

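The left-to-right structure in the diagram can be written down as a sparse transition table. The following is a minimal sketch, not the presenter's model: the probability values are invented for illustration, and treating states 1 and 5 as non-emitting entry and exit states is an assumption based on the diagram showing only b_2, b_3 and b_4.

```python
import random

# Hypothetical transition probabilities a_ij for a 5-state left-to-right HMM.
# Assumption: states 1 and 5 are non-emitting entry/exit states, because the
# diagram only shows emission densities b_2, b_3 and b_4.
A = {
    1: {2: 1.0},
    2: {2: 0.6, 3: 0.4},
    3: {3: 0.7, 4: 0.3},
    4: {4: 0.5, 5: 0.5},
}

def check_rows(transitions):
    """Each state's outgoing transition probabilities must sum to 1."""
    return all(abs(sum(row.values()) - 1.0) < 1e-9
               for row in transitions.values())

def sample_path(transitions, start=1, end=5, rng=None):
    """Draw one state sequence from start to end by following a_ij."""
    rng = rng or random.Random(0)
    path, state = [start], start
    while state != end:
        candidates = list(transitions[state])
        weights = [transitions[state][j] for j in candidates]
        state = rng.choices(candidates, weights=weights)[0]
        path.append(state)
    return path

path = sample_path(A)
print(check_rows(A), path)
```

Every sampled path either stays in the current state or moves one state to the right, which is what lets such a model absorb phonemes of different durations.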
HMMs in Speech Recognition
- Most popular approach for continuous recognition
- An HMM is used to model a phoneme or a word
- The observable sequence is associated with the speech feature vectors
- The probability of a particular feature sequence given an HMM is computed to determine the recognition decision:
  $$\alpha_j(t) = \left[ \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t)$$
  where $N$ is the number of states, $a_{ij}$ is the transition probability, and $b_j(o_t)$ is the emission probability
[Figure: a phoneme HMM aligned with the speech feature vectors $O_1 \dots O_t$]

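The forward recursion on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the HTK implementation: the initial distribution `pi`, the precomputed emission table `B`, and the toy numbers are all hypothetical, and the final probability is taken as the sum over all states rather than through a dedicated exit state.

```python
def forward(pi, A, B):
    """Return P(O | model) using the forward recursion.

    pi[j]   : initial probability of state j (assumption, not on the slide)
    A[i][j] : transition probability a_ij
    B[t][j] : emission likelihood b_j(o_t) of frame t in state j
    """
    n_states = len(pi)
    # Initialization: alpha_j(0) = pi_j * b_j(o_0)
    alpha = [pi[j] * B[0][j] for j in range(n_states)]
    # Recursion: alpha_j(t) = [sum_i alpha_i(t-1) * a_ij] * b_j(o_t)
    for t in range(1, len(B)):
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[t][j]
            for j in range(n_states)
        ]
    # Termination: sum over all states (assumption: no explicit exit state)
    return sum(alpha)

# Toy two-state example with made-up numbers.
pi = [1.0, 0.0]
A = [[0.5, 0.5],
     [0.0, 1.0]]
B = [[0.9, 0.1],   # b_j(o_0)
     [0.2, 0.8]]   # b_j(o_1)
likelihood = forward(pi, A, B)
print(likelihood)
```

In a recognizer, this likelihood would be computed for each candidate model and the highest-scoring one chosen. Practical systems usually work with log probabilities to avoid numerical underflow on long utterances.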
Neural Network (NN)
- Neural Network
  - Inspired by biological nervous systems (such as our brain)
- Node (artificial neuron)
  - Basic unit of a neural network
  - Output determined from the weighted sum of its inputs and an activation function:
    $$y = f\left( \sum_{i=1}^{d} x_i w_i + w_0 \right), \quad \text{where } f(net) = \begin{cases} 1 & \text{if } net \ge 0 \\ -1 & \text{if } net < 0 \end{cases}$$
[Figure: a node with inputs $x_i$, weights $w_i$, activation function $f$, and output $y$]

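A single node can be sketched directly from the weighted-sum description. The sketch below assumes a hard-threshold activation that returns 1 when the net input is at least 0 and -1 otherwise; the input and weight values are made up for illustration.

```python
def node_output(x, w, w0):
    """Output of one artificial neuron with a hard-threshold activation.

    x  : list of inputs x_1..x_d
    w  : list of weights w_1..w_d
    w0 : bias weight
    """
    net = sum(xi * wi for xi, wi in zip(x, w)) + w0
    # Assumed activation: 1 if net >= 0, otherwise -1
    return 1 if net >= 0 else -1

# Hypothetical inputs and weights: the weighted sum is 0.5 - 0.5 = 0.0,
# so the sign of the bias alone decides the output.
y_pos = node_output([1.0, -2.0], [0.5, 0.25], 0.1)
y_neg = node_output([1.0, -2.0], [0.5, 0.25], -0.1)
print(y_pos, y_neg)
```

Networks trained with back-propagation typically replace the hard threshold with a differentiable activation such as a sigmoid, since the threshold has no useful gradient.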
Neural Networks in Speech Recognition
- A neural network consists of multiple layers of nodes
- The input layer is sized to accept the speech feature vector
- The recognition decision is made from the output layer
- Node weights need to be trained to produce the desired output
[Figure: three-layer network with input layer nodes $n_{11} \dots n_{1i}$, hidden layer nodes $n_{21} \dots n_{2i}$, and output layer nodes $n_{31} \dots n_{3i}$, mapping a speech feature vector to a recognition decision]

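The path from a feature vector to a recognition decision can be sketched as a forward pass through two fully connected layers. This is an illustration only: the sigmoid activation, the layer sizes, and the hand-picked weights are assumptions, not values from the presentation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def layer(inputs, weights, biases):
    """One fully connected layer; weights[k] holds the input weights of node k."""
    return [
        sigmoid(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

def classify(features, hidden, output):
    """Propagate a feature vector and pick the output node with the top score."""
    hidden_out = layer(features, *hidden)
    scores = layer(hidden_out, *output)
    decision = max(range(len(scores)), key=lambda k: scores[k])
    return decision, scores

# Hypothetical hand-set weights: 3 inputs -> 2 hidden nodes -> 2 output classes.
HIDDEN = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0])
OUTPUT = ([[5.0, -5.0], [-5.0, 5.0]], [0.0, 0.0])

decision, scores = classify([2.0, -2.0, 0.0], HIDDEN, OUTPUT)
print(decision, scores)
```

In practice the weights would be learned from labeled data rather than set by hand, and each output node would correspond to one recognition class.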
NN for Feature Dimensionality Reduction
- Difficulties in practical speech recognition
  - Large dimensionality of acoustic feature spaces
  - Significant load in model training ("curse of dimensionality")
- Nonlinear Principal Components Analysis (NLPCA)
  - Neural-network-based feature dimensionality reduction
  - $\phi(x)$: transformed feature of the data point $x$ for machine learning
  - $\phi(\cdot): \mathbb{R}^D \to \mathbb{R}^M$, a neural network mapping to obtain more linear features
  - $\mathbb{R}^M$: $M$-dimensional feature space

Nonlinear Principal Components Analysis
[Figure: bottleneck neural network mapping the input data to dimensionality-reduced data at its middle layer]
- The dimensionality-reduced data provide a more effective representation

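Reading the reduced features off a bottleneck network amounts to stopping the forward pass at the narrow middle layer. The sketch below only shows that mechanism: the layer sizes, the tanh activation, and the random untrained weights are all hypothetical, so the resulting values carry no meaning.

```python
import math
import random

def dense(x, weights, biases):
    """Fully connected layer with tanh activation."""
    return [
        math.tanh(sum(xi * wi for xi, wi in zip(x, row)) + b)
        for row, b in zip(weights, biases)
    ]

def random_layer(n_in, n_out, rng):
    """Untrained layer with small random weights (placeholder for training)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
# Hypothetical sizes: input, hidden, bottleneck, hidden, output.
SIZES = [8, 6, 3, 6, 4]
LAYERS = [random_layer(a, b, rng) for a, b in zip(SIZES, SIZES[1:])]
BOTTLENECK = 2  # number of layers to apply to reach the bottleneck

def run(x, n_layers):
    """Apply the first n_layers layers of the network to x."""
    for weights, biases in LAYERS[:n_layers]:
        x = dense(x, weights, biases)
    return x

def reduce_features(x):
    """Return the bottleneck activations, i.e. the reduced feature phi(x)."""
    return run(x, BOTTLENECK)

x = [0.1 * k for k in range(SIZES[0])]
reduced = reduce_features(x)
full = run(x, len(LAYERS))
print(len(x), len(reduced), len(full))
```

Once the network is trained, only the layers up to the bottleneck are needed at recognition time, since the layers after it exist to shape the middle representation during training.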
Limitations of HMMs and NNs
- HMMs
  - Poor discriminative power because of the maximum likelihood training criterion
  - The first-order Markov assumption is only an approximation, which reduces performance
- Neural Networks
  - Lack the ability to account for temporal variations in speech
  - Lack a mathematical framework for combining phonetic models, and thus represent continuous speech poorly

Hybrid NN/HMM Method
- Neural Networks
  - Feature dimensionality reduction ability
  - Nonlinear transformation ability
- HMMs
  - Handle long-term dependencies in continuous speech recognition
  - Easily combined with a language model
[Figure: Neural Network + Hidden Markov Model -> Hybrid Recognition Method]
- Improved flexibility and recognition performance

Hybrid NN/HMM Method Architecture
- Neural network used for feature transformation
  - Obtains low-dimensional but efficient representations of the speech features
  - The middle layer of the bottleneck neural network outputs the dimensionality-reduced features
- HMM recognizer
  - Each HMM corresponds to a phoneme and, using phonetic feature detectors, recognizes the dimensionality-reduced features
[Figure: Speech feature -> Pre-process -> Dimensionality-reduced feature -> HMM recognizer]

Neural Network Training
- The NN is trained from labeled training data
- The training target data for the output layer, corresponding to each phoneme, are generated using phoneme-specific binary codes
- Back-propagation algorithm
[Figure: plots of the speech feature space, the training target, and the transformed feature space]

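Generating per-frame training targets from phoneme labels can be sketched as a lookup into a code table. The slide does not say which binary codes were used, so the one-hot coding below is purely an assumption for illustration, as are the phoneme subset and the frame labels.

```python
# Hypothetical subset of phoneme labels; the experiments use 39 phonemes.
PHONEMES = ["sil", "aa", "iy", "n", "d"]

def one_hot_targets(phonemes):
    """Assign each phoneme a binary code with a single 1 (assumed coding)."""
    return {
        p: [1 if k == idx else 0 for k in range(len(phonemes))]
        for idx, p in enumerate(phonemes)
    }

TARGETS = one_hot_targets(PHONEMES)

def frame_targets(frame_labels):
    """Map a sequence of per-frame phoneme labels to output-layer targets."""
    return [TARGETS[label] for label in frame_labels]

# Made-up frame labels, e.g. taken from a time-aligned transcription.
labels = ["sil", "n", "n", "iy", "d"]
targets = frame_targets(labels)
for label, target in zip(labels, targets):
    print(label, target)
```

Back-propagation would then adjust the weights so that the output layer approaches these targets, which indirectly shapes the middle-layer features.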
Experiments
- TIMIT database
  - A total of 6300 sentences, about 400 minutes
  - 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the U.S.
  - Time-aligned phonetic transcription is provided
  - 4680 sentences for training, 1620 for test
- HMM models (HTK toolkit ver. 3.4)
  - 39 phonemes (39 HMMs)
  - 5 states and 3 mixtures
  - Bigram language model

Feature Comparison
- MFCC (Mel-Frequency Cepstrum Coefficients)
  - Standard in speech recognition, but limited feature dimensionality
- DCTC (Discrete Cosine Transform Coefficients)
  - High-dimensional dynamic feature
- Results in accuracy (HMM recognition only)
  - Accuracy = (N - D - I - H)/N, where N: number of phonemes, D: deletion errors, I: insertion errors, H: substitution errors

Num. of features   DCTC     MFCC
13                 51.13%   50.96% (MFCC_E)
26                 59.15%   51.81%
39                 62.86%   64.77% (MFCC_E_D_A)
91                 62.16%   ---

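The accuracy formula above is simple enough to check by hand. The sketch below applies it to invented error counts; none of the numbers come from the experiments.

```python
def phoneme_accuracy(n, deletions, insertions, substitutions):
    """Accuracy = (N - D - I - H) / N, with H the substitution errors."""
    if n <= 0:
        raise ValueError("n must be a positive number of reference phonemes")
    return (n - deletions - insertions - substitutions) / n

# Hypothetical counts for illustration only.
acc = phoneme_accuracy(n=1000, deletions=50, insertions=30, substitutions=200)
print(f"{acc:.2%}")
```

Because insertion errors are subtracted as well, this measure can fall below zero when a recognizer inserts many spurious phonemes, unlike a plain percent-correct score.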
Experimental Results for Hybrid Method
[Figure: recognition accuracy [%] (axis range 50 to 70) versus number of dimensions in the reduced feature space (50, 30, 25, 20, 15, 13, 10, 6, 4), with curves for training data and test data and reference levels for training (91 dim.) and test (91 dim.)]
- Dimensionality-reduced features yield higher accuracy than the original 91 features

Conclusions
- A hybrid Neural Network/Hidden Markov Model method is proposed
- By using the nonlinear transformation ability of neural networks, the hybrid method yields better performance
- Future work
  - Exploring training target settings of the neural network for more effective feature dimensionality reduction
  - Global optimization of the neural network and the hidden Markov model