Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition

Yanzhang He, Eric Fosler-Lussier
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA
{hey, fosler}@cse.ohio-state.edu

Abstract

Discriminative segmental models, such as segmental conditional random fields (SCRFs), have recently been applied successfully to speech recognition in lattice rescoring, where they integrate detectors across different levels of units, such as phones and words. However, lattice generation is constrained by a baseline decoder, typically a frame-based hybrid HMM-DNN system, which still suffers from the well-known frame-independence assumption. In this paper, we propose to use SCRFs with DNNs directly as the acoustic model: a one-pass unified framework that can exploit local phone classifiers, phone transitions, and long-span features in direct word decoding, modeling phones or sub-phonetic segments of variable length. We describe a WFST-based approach that combines the proposed acoustic model efficiently with the language model in first-pass word recognition. Our evaluation on the WSJ corpus shows that our SCRF-DNN system outperforms a hybrid HMM-DNN system and a frame-level CRF-DNN system using the same label space.

Index Terms: word recognition, segmental conditional random fields, first-pass decoder

1. Introduction

Conventional hidden Markov models (HMMs) model acoustic observations frame by frame for speech recognition. They have long been known to suffer from the conditional-independence assumption between frames and from the inability to incorporate long-span features (e.g., phone duration and formant trajectories). A number of studies have attempted to address these limitations by extending frame-level HMMs to segmental models [1, 2, 3].

Discriminative segmental models have recently become a promising direction in speech recognition research [4, 5, 6]. In particular, segmental conditional random fields (SCRFs) have achieved success in the lattice-rescoring framework for improving state-of-the-art word recognition performance [7], thanks to their expressive log-linear feature integration and their sequence-level discriminative nature. However, they still rely on lattices generated by a baseline HMM system to constrain the candidate segmentations and label sequences for rescoring. Only recently have direct segment-based CRF approaches been explored for first-pass decoding, and only for phone recognition [8, 9, 10]. In this work, we propose a WFST-based decoding framework that uses SCRFs as acoustic models, together with language models, directly in first-pass word decoding. As far as we know, this is the first such attempt in the literature.

In recent years, deep neural networks (DNNs) have achieved remarkable success in acoustic modeling [11, 12]. In our previous work [9], we showed that a sequence-based SCRF can be combined very effectively with a frame-based shallow neural network in a one-pass phone recognizer. DNNs, as sophisticated non-linear feature learners, can further remove the traditional limitation of a linear feature space for CRFs. In this work, we therefore investigate the effectiveness of combining SCRFs with DNNs as acoustic models in an end-to-end system for word recognition.
We evaluate the proposed SCRF-DNN system on the WSJ0 corpus and find that, even in the presence of trigram word-level language models and strong DNN acoustic models, a monophone-based SCRF-DNN system still outperforms a frame-level CRF-DNN system and a hybrid HMM-DNN system using the same monophone label space, and approaches the performance of a senone-based hybrid system. As in the phone recognition case [9], the phone duration features and the acoustics-based phone transition features are useful for word recognition.

2. SCRF-based Acoustic Models

2.1. Frame-level Conditional Random Fields

Conditional random fields (CRFs) [13], shown in Figure 1(a), model sequence posteriors in structured prediction tasks and are widely used in speech and language processing [14]. Let $O = \{o_1, o_2, \dots, o_T\}$ be the observation sequence, where $T$ is the total number of frames in an utterance, and let $Q = \{q_1, q_2, \dots, q_T\}$ be the corresponding phone label sequence. Assuming a standard frame-level (linear-chain) CRF, the probability of a phone label sequence conditioned on the observations is given by:

$$P(Q \mid O) = \frac{\exp\left(\sum_{t=1}^{T} \sum_i \lambda_i f_i(q_{t-1}, q_t, o_t)\right)}{Z(O)} \quad (1)$$

where

$$Z(O) = \sum_{Q} \exp\left(\sum_{t=1}^{T} \sum_i \lambda_i f_i(q_{t-1}, q_t, o_t)\right) \quad (2)$$

Training is done under the conditional maximum likelihood (CML) criterion, i.e., by maximizing the posterior probabilities of the correct label sequences conditioned on the training observations. Decoding chooses the label sequence with the highest posterior probability at test time [14].
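To make Eqs. (1)-(2) concrete, here is a minimal numpy sketch (not the authors' code) of the CRF log-posterior for a known label sequence: the numerator is the score of that path, and $\log Z(O)$ comes from the standard forward recursion in log space. The array shapes, argument names, and the omission of an initial-transition term are illustrative assumptions.

```python
import numpy as np

def crf_log_posterior(unary, trans, labels):
    """Log P(Q|O) for a linear-chain CRF (Eq. 1), a minimal sketch.

    unary:  (T, K) per-frame state scores sum_i lam_i f_i(q_t, o_t)
    trans:  (K, K) transition scores, trans[u, v] for q_{t-1}=u, q_t=v
    labels: length-T list of gold state indices
    """
    T, K = unary.shape
    # Numerator: score of the gold path.
    score = unary[0, labels[0]]
    for t in range(1, T):
        score += trans[labels[t - 1], labels[t]] + unary[t, labels[t]]
    # Denominator: log Z(O) by the forward recursion in log space.
    alpha = unary[0].copy()                        # log-alpha at t = 0
    for t in range(1, T):
        m = alpha[:, None] + trans                 # (K, K): prev x curr
        alpha = unary[t] + np.logaddexp.reduce(m, axis=0)
    log_z = np.logaddexp.reduce(alpha)
    return score - log_z
```

CML training maximizes exactly this quantity over the training data; its gradient is the familiar difference between observed and expected feature counts.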

2.2. Segmental Conditional Random Fields

Segmental conditional random fields (SCRFs) [5], also known as semi-Markov CRFs [15], generalize frame-level CRFs by allowing each label state to span a variable number of observations.

[Figure 1: Frame-level CRFs and segmental CRFs, illustrated on the word "cat" /k ae t/: (a) a frame-level CRF assigns one phone label per frame; (b) a segmental CRF assigns one phone label per variable-length segment, with segment boundaries $e_0, e_1, e_2, e_3$.]

As shown in Figure 1(b), since each label state in the sequence $Q$ of a SCRF can correspond to a chunk of observations in $O$, the feature functions for a segment, or between two segments, can represent long-span dependencies, e.g., duration features. The Markov assumption holds only on the transitions between segment states, while the dependencies within segments can be non-Markovian.

Given an observation sequence $O$, suppose that for a specific hypothesis $O$ is segmented into $J$ chunks. Let the segment-level phone label sequence be denoted $Q = \{q_1, q_2, \dots, q_J\}$, where $1 \le J \le T$. Let $e_j$ be the index of the ending frame of the $j$-th segment; then $E = \{e_1, e_2, \dots, e_J\}$ defines one possible segmentation of $O$. Each label $q_j$ corresponds to a chunk of frames from $e_{j-1}$ (exclusive) to $e_j$ (inclusive), with the observations denoted $o_{e_{j-1}+1}^{e_j}$. The joint probability of the label sequence and its associated segmentation, conditioned on the observations, is modeled as:

$$P(Q, E \mid O) = \frac{\exp\left(\sum_{j=1}^{J} \sum_i \lambda_i f_i(q_{j-1}, q_j, o_{e_{j-1}+1}^{e_j})\right)}{Z(O)} \quad (3)$$

where

$$Z(O) = \sum_{Q, E \;\text{s.t.}\; |Q| = |E|} \exp\left(\sum_{j=1}^{|E|} \sum_i \lambda_i f_i(q_{j-1}, q_j, o_{e_{j-1}+1}^{e_j})\right) \quad (4)$$

Training and decoding need to consider all possible label sequences and segmentations for inference, which can be done efficiently by extending the standard forward and backward recursions of frame-level CRFs to the segment level [9].

2.3. DNNs as Features to CRFs / SCRFs

We use the linear output of a DNN before the softmax layer as the input to the CRF or SCRF acoustic models. Let $K$ be the dimension of the DNN output layer, and let the $k$-th output for the frame at time $t$ be denoted $\mathrm{DNN}_k(o_t)$.

2.3.1. Frame-level CRF features

We combine each dimension of the DNN output with each phone state label $u$ to construct a state feature:

$$f_{u,k}(q_t, o_t) = \mathrm{DNN}_k(o_t)\,\delta(q_t = u) \quad (5)$$

We use a transition bias as the typical CRF transition feature associated with a pair of phone states $(u, v)$:

$$f_{u,v}(q_{t-1}, q_t) = \delta(q_{t-1} = u)\,\delta(q_t = v) \quad (6)$$

In addition, we use observation-dependent transition features represented by the DNN output, which we have already shown to be important in our previous work [9]:

$$f_{u,v,k}(q_{t-1}, q_t, o_t) = \mathrm{DNN}_k(o_t)\,\delta(q_{t-1} = u)\,\delta(q_t = v) \quad (7)$$

In a CRF-DNN acoustic model, the CRF can be viewed as an extension of the DNN with transition features and sequence-level softmax normalization, while the DNN can be viewed as an extension of CRF feature functions from linear to non-linear. Note that while the DNN in an HMM hybrid system models frame posteriors, a CRF-DNN system models sequence posteriors directly. So when we combine the CRF-DNN acoustic model with the lexicon and the language model in the ASR cascade for word decoding, converting phoneme sequence posteriors into likelihoods requires dividing by sequence priors rather than frame-level priors. We explain how this is done in the WFST decoding framework in Section 3.
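As an illustration of Eqs. (5)-(7), the sketch below (hypothetical names and shapes, not from the paper) assembles the per-frame state and transition score tables that a frame-level CRF-DNN needs, combining the DNN linear outputs with label-indexed weights.

```python
import numpy as np

def crf_potentials_from_dnn(dnn_out, w_state, b_trans, w_trans):
    """Assemble frame-level CRF scores from DNN linear outputs (Eqs. 5-7).

    dnn_out: (T, K) pre-softmax DNN outputs, K = output dimension
    w_state: (U, K) weights for the state features of Eq. (5)
    b_trans: (U, U) transition-bias weights of Eq. (6)
    w_trans: (U, U, K) weights for the observation-dependent
             transition features of Eq. (7); U = number of phone states
    """
    # Eq. (5): per-frame state score for every label u.
    unary = dnn_out @ w_state.T                        # (T, U)
    # Eqs. (6)+(7): per-frame transition score for every pair (u, v).
    pair = np.einsum('tk,uvk->tuv', dnn_out, w_trans)  # (T, U, U)
    pair += b_trans[None, :, :]
    return unary, pair
```

These tables drop directly into the forward recursion sketched earlier, with the time-varying `pair[t]` taking the place of the constant transition matrix.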
2.3.2. SCRF features

For SCRFs, the transition bias is still used. In addition, we use segmental state features and boundary transition features, explained in detail below and illustrated in Figures 2(a) and 2(b).

A segmental state feature associates a hypothesized segment label with the corresponding chunk of frames. It can be constructed by simple transformations of the frame-level features across the hypothesized segment. Let $l = e_j - e_{j-1}$ denote the segment length of $o_{e_{j-1}+1}^{e_j}$. We construct the DNN-related segmental state features in the form:

$$f_{u,k}(q_j, o_{e_{j-1}+1}^{e_j}) = \phi_k(o_{e_{j-1}+1}^{e_j})\,\delta(q_j = u) \quad (8)$$

where $\phi$ can be any of the following:

Sub-sample feature: $\phi_{k,\Delta t}(o_{e_{j-1}+1}^{e_j}) = \mathrm{DNN}_k(o_{e_{j-1}+\Delta t})$, for $\Delta t \in \{0.1l, 0.3l, 0.5l, 0.7l, 0.9l\}$.

Avg feature: $\phi_k(o_{e_{j-1}+1}^{e_j}) = \frac{1}{l} \sum_{t \in (e_{j-1}, e_j]} \mathrm{DNN}_k(o_t)$.

Max feature: $\phi_k(o_{e_{j-1}+1}^{e_j}) = \max_{t \in (e_{j-1}, e_j]} \mathrm{DNN}_k(o_t)$.

Min feature: $\phi_k(o_{e_{j-1}+1}^{e_j}) = \min_{t \in (e_{j-1}, e_j]} \mathrm{DNN}_k(o_t)$.

In addition, we model duration as one-hot features associated with the hypothesized segment label:

$$f_{u,d}(q_j, o_{e_{j-1}+1}^{e_j}) = \delta(l = d)\,\delta(q_j = u), \quad \text{for all } 1 \le d \le D \quad (9)$$

where $D$ is the maximum allowable segment duration. For efficiency and better generalizability, these segmental features are used only for states, not for transitions; a code sketch of their computation follows.
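Here is a sketch of the segmental state features of Eqs. (8)-(9) for a single hypothesized segment, under illustrative assumptions (0-based frame indexing, fractional sub-sample positions rounded down):

```python
import numpy as np

def segment_state_features(dnn_out, start, end, max_dur=10):
    """Segmental state features phi for frames (start, end] (Eqs. 8-9).

    dnn_out: (T, K) DNN outputs; start is exclusive and end inclusive,
    matching the o_{e_{j-1}+1}^{e_j} convention in the text.
    Returns sub-sample, avg, max, min and one-hot duration features.
    """
    seg = dnn_out[start:end]                 # frames of this hypothesis
    l = end - start                          # segment length
    # Five equally spaced sub-samples at relative positions 0.1..0.9.
    idx = [min(int(r * l), l - 1) for r in (0.1, 0.3, 0.5, 0.7, 0.9)]
    sub = seg[idx].ravel()                   # (5*K,)
    avg, mx, mn = seg.mean(0), seg.max(0), seg.min(0)
    dur = np.zeros(max_dur)                  # one-hot duration (Eq. 9)
    dur[min(l, max_dur) - 1] = 1.0
    return np.concatenate([sub, avg, mx, mn, dur])
```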

For transitions, we use boundary transition features that associate two consecutive hypothesized phone segment labels with a fixed-size context window of observations around their boundary:

$$f_{u,v,k,\Delta t}(q_{j-1}, q_j, o_{e_{j-1}-c}^{e_{j-1}+c}) = \mathrm{DNN}_k(o_{e_{j-1}+\Delta t})\,\delta(q_{j-1} = u)\,\delta(q_j = v), \quad \text{for all } -c \le \Delta t \le c \quad (10)$$

where $c$ is the predefined half-window size. Since the boundary transition features are not segmental, i.e., independent of the segment length, we can apply the Boundary-Factored SCRF implementation proposed in our previous work [9] for efficient inference in training and decoding.

[Figure 2: Segmental state features and boundary transition features for SCRFs. (a) Segmental state features, including sub-sample, average, maximum, minimum, and duration features associated with a hypothesized phone segment $q_m$ of 10 frames; the DNN output dimension $K$ is simplified to 3 for illustration. Duration features are encoded as one-hot vectors: only the bit corresponding to the hypothesized segment length is set to 1, all others to 0 (assuming the maximum duration $D$ is 10). (b) Boundary transition features with a +/- 1 frame context window ($c = 1$) associated with a pair of hypothesized phone segment labels $(q_n, q_{n+1})$.]
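The boundary transition features of Eq. (10) depend only on the frames around the boundary, not on the segment length. The following sketch (zero-padding at utterance edges is our assumption) gathers that context window:

```python
import numpy as np

def boundary_window_features(dnn_out, boundary, c=6):
    """Boundary context window for the transition features of Eq. (10).

    Stacks DNN outputs for the +/- c frames around a hypothesized
    segment boundary e_{j-1}; c=6 matches the setting in Section 4.
    Frames outside the utterance are zero-padded (an assumption here).
    """
    T, K = dnn_out.shape
    window = np.zeros((2 * c + 1, K))
    for i, t in enumerate(range(boundary - c, boundary + c + 1)):
        if 0 <= t < T:
            window[i] = dnn_out[t]
    return window.ravel()  # combined with delta(q_{j-1}=u) delta(q_j=v)
```

Because the window is a function of the boundary position alone, it can be computed once per frame and shared across all segment hypotheses meeting at that frame, which is what makes Boundary-Factored inference [9] efficient.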
3. WFST-based Word Decoding

3.1. HMM-based Decoding

The conventional statistical model for ASR generates the most likely word sequence $W$ given the observations $O$ by maximizing the conditional probability $P(W \mid O)$, following the Bayes decision rule:

$$W^* = \arg\max_W P(W \mid O) \quad (11)$$
$$\approx \arg\max_W \max_Q P(O \mid Q)\,P(Q \mid W)\,P(W) \quad (12)$$

Here $Q$ denotes a sequence of sub-word units (typically phones or context-dependent phones) corresponding to $W$; $P(O \mid Q)$ is the acoustic model, $P(Q \mid W)$ is the pronunciation or dictionary model, and $P(W)$ is the language model. In HMM-based ASR systems, $P(O \mid Q)$ is modeled by a generative HMM-GMM, or by a hybrid HMM-DNN model, which converts the DNN state posteriors to likelihoods by dividing them by the state priors [16].

A traditional HMM-based decoder can be encoded as a static WFST decoding graph by composing transducers at different levels of probabilistic transduction [17]:

$$H \circ C \circ L \circ G \quad (13)$$

$H$ is a transducer with context-dependent phones on its output and symbols representing acoustic states on its input; the HMM state topology is represented in $H$, with weights given by the probabilities of state self-loops and transitions. $C$ transduces context-dependent phones into context-independent ones. $L$ is a lexicon WFST that maps phones into words with pronunciation probabilities, and $G$ maps one word sequence to another according to language model probabilities. The four components are composed into a static graph, which is expanded at test time to accept state-level acoustic model scores at each frame from a GMM or DNN component. All probabilities are encoded as negative log likelihoods on the arc weights, so that the most likely word sequence corresponds to the shortest path through the expanded graph with respect to the tropical semiring.

3.2. CRF-based Decoding

In a CRF-based system, as in Section 2, we model the posterior probability $P(Q \mid O)$ directly instead of $P(O \mid Q)$. In the spirit of hybrid HMM-DNN systems, we follow Morris [18] and incorporate a CRF acoustic model into word decoding by transforming the posterior probability with Bayes' rule once more:

$$P(O \mid Q) = \frac{P(Q \mid O)\,P(O)}{P(Q)} \quad (14)$$

Applying Eq. (14) to Eq. (12), and dropping $P(O)$, which is constant for a given utterance, we can formulate CRF-based ASR decoding as Eq. (16) for finding the most likely word sequence:

$$W^* \approx \arg\max_W \max_Q \frac{P(Q \mid O)\,P(O)}{P(Q)}\,P(Q \mid W)\,P(W) \quad (15)$$
$$= \arg\max_W \max_Q \frac{P(Q \mid O)}{P(Q)}\,P(Q \mid W)\,P(W) \quad (16)$$

The phoneme sequence posterior $P(Q \mid O)$ can be modeled by frame-level CRFs as in Eq. (1), or by segmental CRFs as in Eq. (3). $P(Q)$ is the phoneme sequence prior, modeled by an n-gram language model at the phone or state level. $P(Q \mid W)$ and $P(W)$ are the same as in an HMM-based system.

To decode with a CRF-based acoustic model, we apply Eq. (16) within the WFST decoding framework by extending the graph to:

$$H \circ C \circ P \circ L \circ G \quad (17)$$

The transducers $C$, $L$, and $G$ are unchanged. However, $H$ in this case represents the state-transition and cross-phone-transition topology of the CRFs or SCRFs, and its arcs are unweighted, since the transition scores may depend on the acoustic observations; they are provided by the CRF or SCRF acoustic model $P(Q \mid O)$ at test time, along with the acoustic state scores. The phone sequence prior is represented by a transducer $P$ whose arc weights are positive log likelihoods of the phone n-gram probabilities, since $P(Q)$ appears in the denominator of Eq. (16). $P$ is composed to the right of $C$ in the case of monophone n-gram LM priors, or to the left of $C$ in the case of context-dependent phone priors.

$P(Q \mid O)$ for CRFs or SCRFs cannot be fully computed until the end of the utterance. However, in time-synchronous Viterbi decoding [17], we only need to provide the accumulated negative log of the numerator (i.e., the weighted feature sum) of Eq. (1) or (3) up to the current time step along each path, since the log of the denominator $Z(O)$ is eventually the same for all paths. This enables beam pruning and online decoding.
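A toy sketch of the decoding view above: the best word sequence is the shortest path through the expanded graph in the tropical semiring, where each arc carries a negative accumulated feature sum, and $\log Z(O)$ is the same additive constant on every complete path, so the search can ignore it. The graph and costs below are made up for illustration.

```python
import math

def shortest_path(arcs, n_states, start=0, final=None):
    """Toy shortest path in the tropical semiring (min, +): arc costs
    stand in for negative accumulated feature sums (the numerator of
    Eq. 3); log Z(O) is path-independent and can be dropped.

    arcs: list of (src, dst, label, cost) tuples, a toy stand-in for
    the expanded H o C o P o L o G graph scored by the SCRF at test time.
    """
    best = {start: (0.0, [])}
    # Relax arcs repeatedly (Bellman-Ford; fine for a small toy graph).
    for _ in range(n_states):
        for src, dst, label, cost in arcs:
            if src in best:
                d, path = best[src]
                if dst not in best or d + cost < best[dst][0]:
                    best[dst] = (d + cost, path + [label])
    return best.get(final, (math.inf, []))

# Toy usage: two competing one-word paths; the cheaper one wins.
arcs = [(0, 1, "cat", 2.3), (0, 1, "cap", 3.1), (1, 2, "</s>", 0.5)]
print(shortest_path(arcs, n_states=3, final=2))  # (2.8, ['cat', '</s>'])
```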

4. Evaluation

We evaluate the proposed SCRF-DNN system on the WSJ0 corpus for continuous speech recognition, a corpus of read speech from the Wall Street Journal by native English speakers. A training set of 7138 utterances from 83 speakers (about 15 hours) is used to build the recognition models, and a development set of 368 utterances from 10 speakers is used to tune the models prior to evaluation. Specifically, this work evaluates system performance on the Eval-92 test set of the 5K-vocabulary task; the evaluation set comprises 330 utterances from 8 speakers. All systems are evaluated with the same standard 5K closed-vocabulary trigram language model provided by the task.

As initial experiments, we build our SCRF-DNN acoustic models with monophone labels. For direct comparison, a hybrid HMM-DNN system and a frame-level CRF-DNN system are built with 3-state monophone state labels. We use the Kaldi toolkit [19] to build the hybrid HMM-DNN system with 40-dimensional log Mel filterbank features, plus their deltas and double-deltas, from an 11-frame context window. We first pretrain the DNN generatively with stacked RBMs, which are used to initialize a DNN with 7 hidden layers of 2048 sigmoid units. We then train the DNN on the monophone state targets using alignments obtained from an HMM-GMM system trained with MFCC features. Since the initial monophone alignment is not very accurate, we realign the training data with the trained DNN and retrain the DNN on the new alignments, repeating this process three times until performance saturates. We further improve the system by applying sMBR-based sequence training to the DNN [20]; for faster convergence of the sMBR training, we regenerate the lattices after the first iteration and train for 4 more iterations.

Both the frame-level CRF-DNN system and the SCRF-DNN system use the same DNN trained in the hybrid system. The linear output of the DNN, normalized at the utterance level, is used as the feature input to the CRFs and SCRFs, since it worked slightly better than the DNN posteriors in our preliminary experiments. We use stochastic gradient descent to optimize both the CRFs and the SCRFs. For learning rate scheduling, CRF training uses adaptive gradient descent [21], which converges within about 20 iterations; SCRF training converges very quickly with a small fixed learning rate (0.0003) within about 5 iterations, with no further improvement from switching to adaptive gradient descent or learning-rate halving. We use a bigram phone LM for the phone sequence prior in the SCRF decoding graph, with a scaling factor analogous to the standard language model scaling factor; both factors are tuned on the development set.
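Systems are compared below by word error rate (WER). For reference, here is a minimal sketch of the standard WER computation as word-level edit distance normalized by the reference length (a textbook definition, not tied to any particular toolkit):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words, normalized by
    the reference length -- the metric reported in Table 1.
    """
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cap sat"))  # one substitution -> 0.333...
```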
Following our previous work [9], we use +/- 6 frames around each boundary as the context window for the SCRF boundary transition features, and 10 frames as the maximum duration of a segment (phones longer than 10 frames are split evenly into smaller segments).

The results are compared in Table 1. The baseline monophone hybrid HMM-DNN system achieves 3.9% WER on the test set, showing that it is already a strong monophone system. The frame-level CRF-DNN system achieves 4.2% WER, slightly (not significantly) worse than the hybrid system. The SCRF-DNN system obtains 3.3% WER, significantly outperforming both frame-level baselines with the same monophone label space. Note that both frame-level systems use 3-state monophones as targets, while the SCRFs use only 1-state monophone targets, because they model a phoneme as a whole unit; the sub-sample features and the 3-state DNN output help the SCRFs model internal phone states implicitly.

Table 1: WER on the Eval-92 5K closed-vocabulary task.

  System               | Phone label space   | Test WER
  ---------------------+---------------------+---------
  hybrid HMM-DNN       | 3-state monophones  | 3.9%
  frame-level CRF-DNN  | 3-state monophones  | 4.2%
  SCRF-DNN             | 1-state monophones  | 3.3%
  hybrid HMM-DNN       | senones             | 2.5%

We also trained a hybrid HMM-DNN system with tied triphone states (senones) as targets, which achieves 2.5% WER. In other words, with only 1-state monophone targets, our SCRF-DNN system already covers half of the gap between a hybrid monophone system and a senone system. In future work, we would like to incorporate context dependency into the SCRFs through either the feature space or the label space; for example, we could use DNN senone posteriors or linear outputs as transition features associated with a pair of phones, constraining the parameter space by considering only the top few senones for each transition pair.

The CML training criterion for CRFs and SCRFs is effectively equivalent to the maximum mutual information (MMI) criterion used in hybrid HMM-DNN sequence training [14, 22], except that we do not train the CRFs or SCRFs together with the language models as in [23, 20]. Nor do we jointly train the sequence models with the DNNs as in [24, 10]. Applying both techniques might further improve the results of our SCRF-DNN system. In addition, alternative sequence training criteria might be useful for SCRFs, such as sMBR [23, 20], large margin [25], or the other cost functions investigated in [26].

5. Conclusion

Segmental conditional random fields have mainly been used in the literature either for lattice rescoring in word recognition or for one-pass phone recognition. We apply SCRFs to one-pass word recognition for the first time by introducing a WFST-based decoding framework that efficiently combines SCRF and DNN acoustic models with language models. Experimental results on a 5K-vocabulary read speech corpus (WSJ0) show that the proposed SCRF-DNN system significantly outperforms both a hybrid HMM-DNN system and a frame-level CRF-DNN system with the same label space. We show that SCRFs can model variable-length phone segments directly, using duration features and aggregations of local DNN outputs within a segment. In future work, we would like to explore the integration of DNN senone posteriors as well as sequence training with language models.

6. Acknowledgements

This work was supported by NSF Grant IIS-1409431. We would like to thank Jeremy Morris for valuable discussions and for the use of his frame-level ASR-CRaFT toolkit.

7. References

[1] M. Ostendorf, V. V. Digalakis, and O. A. Kimball, "From HMMs to segment models: A unified view of stochastic modeling for speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 360-378, 1996.
[2] J. R. Glass, "A probabilistic framework for segment-based speech recognition," Computer Speech & Language, vol. 17, no. 2, pp. 137-152, 2003.
[3] M. De Wachter, M. Matton, K. Demuynck, P. Wambacq, R. Cools, and D. Van Compernolle, "Template-based continuous speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1377-1390, 2007.
[4] M. Layton and M. Gales, "Augmented statistical models for speech recognition," in Proc. IEEE ICASSP, vol. 1, 2006, pp. I-I.
[5] G. Zweig and P. Nguyen, "A segmental CRF approach to large vocabulary continuous speech recognition," in Proc. IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), Merano, Italy, Dec. 2009, pp. 152-157.
[6] S. Zhang, A. Ragni, and M. Gales, "Structured log linear models for noise robust speech recognition," IEEE Signal Processing Letters, vol. 17, no. 11, pp. 945-948, 2010.
[7] G. Zweig, P. Nguyen, D. Van Compernolle, K. Demuynck, L. Atlas, P. Clark, G. Sell, M. Wang, F. Sha, H. Hermansky, D. Karakos, A. Jansen, S. Thomas, G. S. V. S. Sivaram, S. Bowman, and J. Kao, "Speech recognition with segmental conditional random fields: A summary of the JHU CLSP 2010 summer workshop," in Proc. IEEE ICASSP, Prague, Czech Republic, May 2011, pp. 5044-5047.
[8] G. Zweig, "Classification and recognition with direct segment models," in Proc. IEEE ICASSP, Kyoto, Japan, Mar. 2012, pp. 4161-4164.
[9] Y. He and E. Fosler-Lussier, "Efficient segmental conditional random fields for phone recognition," in Proc. Interspeech, Portland, OR, USA, Sep. 2012.
[10] O. Abdel-Hamid, L. Deng, D. Yu, and H. Jiang, "Deep segmental neural networks for speech recognition," in Proc. Interspeech, 2013.
[11] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[12] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams et al., "Recent advances in deep learning for speech research at Microsoft," in Proc. IEEE ICASSP, 2013.
[13] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," 2001.
[14] E. Fosler-Lussier, Y. He, P. Jyothi, and R. Prabhavalkar, "Conditional random fields in speech, audio, and language processing," Proceedings of the IEEE, vol. 101, no. 5, pp. 1054-1075, 2013.
[15] S. Sarawagi and W. W. Cohen, "Semi-Markov conditional random fields for information extraction," in Advances in Neural Information Processing Systems (NIPS), Vancouver, British Columbia, Canada, Dec. 2004, pp. 1185-1192.
[16] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, 1993.
[17] M. Mohri, F. Pereira, and M. Riley, "Speech recognition with weighted finite-state transducers," in Springer Handbook of Speech Processing. Springer, 2008, pp. 559-584.
[18] J. J. Morris, "A study on the use of conditional random fields for automatic speech recognition," Ph.D. dissertation, The Ohio State University, 2010.
[19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011, pp. 1-4.
[20] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proc. IEEE ICASSP, 2009, pp. 3761-3764.
[21] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121-2159, 2011.
[22] X. He, L. Deng, and W. Chou, "Discriminative learning in sequential pattern recognition," IEEE Signal Processing Magazine, vol. 25, no. 5, pp. 14-36, Sep. 2008.
[23] D. Povey, "Discriminative training for large vocabulary speech recognition," Ph.D. dissertation, Cambridge University, Cambridge, UK, 2004.
[24] R. Prabhavalkar and E. Fosler-Lussier, "Backpropagation training for multilayer conditional random field based phone recognition," in Proc. IEEE ICASSP, 2010, pp. 5534-5537.
[25] S.-X. Zhang and M. J. Gales, "Structured SVMs for automatic speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 3, pp. 544-555, 2013.
[26] H. Tang, K. Gimpel, and K. Livescu, "A comparison of training approaches for discriminative segmental models," in Proc. Interspeech, 2014.