CS229 Final Project
Re-Alignment Improvements for Deep Neural Networks on Speech Recognition Systems
Firas Abuzaid


Abstract

The task of automatic speech recognition has traditionally been accomplished by using Hidden Markov Models (HMM), which are effective in modeling time-varying spectral vector sequences [1]. Gaussian Mixture Models (GMM) have been used to determine how well each state of the HMM fits a frame, or a short window of frames, of coefficients that represents the acoustic input. However, recent research has shown that speech recognition systems that use Deep Neural Networks (DNN) with many hidden layers can outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin [2], [3]. We investigate further possible improvements to the DNN-HMM hybrid by examining the role of forced alignments in the training algorithm.

Introduction

In the context of automatic speech recognition, the GMM-HMM hybrid approach has a serious shortcoming: GMMs are statistically inefficient at modeling data that lie on or near a non-linear manifold in the data space. Because speech is typically produced by modulating a relatively small number of parameters, it has been hypothesized that the true underlying structure is much lower-dimensional than is immediately apparent in a window that contains many coefficients. In contrast with GMMs, Deep Neural Networks have the potential to learn much better models of data that lie on or near a non-linear manifold. Many studies have confirmed this hypothesis, with DNN systems outperforming GMMs in ASR by approximately 6% in Individual Word Error Rate (IWER) [2].

These networks are trained to optimize a given training objective function using the standard error backpropagation procedure. In a DNN-HMM hybrid system, the DNN is trained to provide posterior probability estimates for the different HMM states. Typically, cross-entropy is used as the objective function, and the optimization is done through stochastic gradient descent (SGD) [4], [5]. For any given objective, the important quantity to calculate is its gradient with respect to the activations at the output layer; the gradients for all of the network's parameters can then be derived from this one quantity via backpropagation.

One significant hurdle in training speech recognition systems is determining the appropriate alignment between word sequences and acoustic observations. Typically, the acoustic data is divided into frames, with each frame approximately 15-25 ms in size. These frames must then be aligned with the sequence of words in the training set [1]. Usually, to determine these alignments, we start by obtaining an initial forced alignment, which gives us a starting point from which to improve our posterior probabilities by training the DNN over multiple epochs. Specifically, we start by using an initial GMM-HMM baseline to assign feature vectors in our training set to different HMM states using the Viterbi algorithm. The Viterbi algorithm chooses the most likely state sequence in our HMM model that corresponds to the observation sequence in our training data; thus, the assignment is based on the most likely path through our initial composite HMM model.
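For concreteness, a minimal sketch of this Viterbi assignment step is shown below, assuming the per-frame acoustic log-likelihoods and the HMM's initial and transition log-probabilities have already been computed; the function and array names are illustrative and are not part of Kaldi's API.

import numpy as np

def viterbi_align(log_init, log_trans, frame_loglik):
    """log_init: (S,) initial state log-probabilities
    log_trans: (S, S) transition log-probabilities, log_trans[i, j] = log P(j | i)
    frame_loglik: (T, S) per-frame acoustic log-likelihoods for each state
    Returns the most likely HMM state index for each of the T frames."""
    T, S = frame_loglik.shape
    score = np.full((T, S), -np.inf)    # best log-probability of any path ending in state s at frame t
    back = np.zeros((T, S), dtype=int)  # backpointers used to recover that path
    score[0] = log_init + frame_loglik[0]
    for t in range(1, T):
        # cand[i, j]: best path ending in state i at frame t-1, then transitioning to state j
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + frame_loglik[t]
    # Trace the best final state back through the backpointers
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(score[-1])
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

In the actual system, the state graph is the composite HMM built from the training transcript, so the transition structure is sparse and the frame likelihoods come from the GMM baseline (and later from the DNN).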
Once we have forced an alignment of the acoustics with the word transcripts in our training set by running the Viterbi algorithm through those specific state sequences, these initial alignments can be used to train the DNN, improve our composite HMM model further (via backpropagation), and obtain more accurate log-likelihood probabilities for our HMM. This forced-alignment approach is the accepted method for training HMM-based speech recognition systems. An interesting area of research, then, is whether the same procedure can be repeated throughout multiple epochs of DNN training. Specifically, we investigate whether recomputing these forced alignments after every epoch of DNN training leads to a further increase in accuracy for our speech recognition system. Recent research, such as Vesely et al.'s (2013) study [6], demonstrates that computing these re-alignments can indeed be effective in reducing the Word Error Rate (WER) of these systems.
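The two training regimes compared in this project can be summarized by the sketch below. It is a schematic rather than the actual Kaldi recipe: the alignment and per-epoch training steps are passed in as placeholder callables, since their real implementations live in the toolkit.

def train_fixed_alignment(dnn, feats, transcripts, align_fn, train_epoch_fn, num_epochs=4):
    # Baseline recipe: one initial forced alignment (GMM-HMM + Viterbi),
    # reused as the training targets for every epoch of cross-entropy/SGD.
    alignment = align_fn(feats, transcripts)
    for _ in range(num_epochs):
        train_epoch_fn(dnn, feats, alignment)
    return dnn

def train_with_realignment(dnn, feats, transcripts, align_fn, realign_fn, train_epoch_fn, num_epochs=4):
    # Variant under investigation: after each epoch, recompute the forced
    # alignment with the partially trained DNN-HMM and use the new state
    # labels as the targets for the next epoch.
    alignment = align_fn(feats, transcripts)
    for _ in range(num_epochs):
        train_epoch_fn(dnn, feats, alignment)
        alignment = realign_fn(dnn, feats, transcripts)
    return dnn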

Experiments

We ran two sets of experiments to test our hypothesis. For the first set, we started with a smaller training set, simply to get an early indication of whether our hypothesis was correct. We trained the DNN on 60 hours of telephone conversation recordings from the Switchboard corpus, and we configured our neural network to have 5 hidden layers with 1200 neurons each. We trained the DNN for 4 epochs. To test our hypothesis, we computed a re-alignment after each epoch, using the same Viterbi procedure used to compute the initial forced alignment before the first epoch. We simultaneously trained a separate DNN that did not compute these re-alignments after each epoch. We then compared the accuracy of both recognizers by evaluating them on the training set and on the test set provided in the Kaldi open-source toolkit [7].

Our second experiment was identical to the first, except that we trained on a much larger corpus of training data: 300 hours from Switchboard. We also modified the neural network to contain 2048 neurons per hidden layer instead of 1200. To implement our DNN speech recognizer, we used the Kaldi-Stanford codebase, a variant of the Kaldi open-source toolkit maintained by Professor Andrew Ng's Deep Learning research group. (Note: This project was done in conjunction with Professor Andrew Ng's research group, specifically with graduate students Andrew Maas and Christopher Lengerich.)

Results

We measured the WER and IWER [8] for our speech recognizer on both the small and large training sets. Each recognizer was evaluated against its training set, as well as against a corresponding evaluation set (either small or large) provided by the Kaldi toolkit. Our results for the two experiments are as follows:

[Figure: "WER for Small and Large Training Set Experiments": WER versus training epoch (1-4) for the re-aligned and non-re-aligned recognizers, evaluated on the small training set, the small eval set, and the large eval set.]

As the results above indicate, our hypothesis was correct for the smaller training set, but incorrect for the larger training set, a somewhat surprising result.
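For reference, the WER reported here is the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the number of reference words. A small generic implementation (not the Kaldi scoring script) looks like this:

def word_error_rate(reference, hypothesis):
    """Levenshtein distance between word sequences, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between the first i reference words and first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])          # substitution or match
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)  # deletion, insertion
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution out of four reference words gives a WER of 0.25.
print(word_error_rate("this year too uh", "this year to uh"))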

Analysis

Examining the results of our experiments, we see that there is a significant gap between the WER when evaluated on the small training set versus the small evaluation set; clearly, limited value can be gleaned from the small training set experiments. The results for the large training set, on the other hand, do not give us a strong indication of whether computing re-alignments improves the accuracy of our speech recognizer. The WER delta between the re-aligned DNN and the non-re-aligned DNN is within the noise threshold and is thus statistically insignificant: we cannot say with any certainty that re-alignment improved our WER. More importantly, the rate of improvement over epochs is nearly the same in both experiments, which means that computing the re-alignments does not necessarily accelerate the training improvement of the DNN as we had hoped.

One potential explanation for this result could be that the hyper-parameters chosen for our larger neural network, particularly the number of neurons, need to be tuned. Modifying the network topology in this fashion could correct for any over-fitting in the Acoustic Model, which is more likely in the case of the larger training set with a larger network size and, thus, higher variance. Another factor that needs to be examined is the number of epochs; it is unclear from the data whether we have reached our optimization objective by the end of training. The results for the small training set indicate that more epochs are needed, but the larger training set is again less clear in this regard.

Alignments

To better understand the performance of our DNN, we decided to closely examine the shift in alignments across epochs during the training stage. In the computed alignments, each frame gets assigned to a senone, which maps to a particular center phone. Analyzing the shift in phones between epochs is crucial in determining whether the re-alignments lead to a substantive reduction in WER. In particular, the silence phone (denoted sil) is quite significant, as these frames effectively become the demarcations between lexical tokens in our hypothesis. Our intuition is the following: if an alignment for a particular utterance undergoes a significant shift in its labeling of the silence phone, then this should lead to a more accurate prediction from the speech recognizer.

We generated alignment plots for the various utterances in our corpus to visually track how the alignment of the silence phone shifted between epochs. The following are three example plots, taken from the three utterances that exhibited the most variation in their alignments, along with the corresponding hypotheses from our speech recognizer for epochs 1 through 4 (note that this analysis was only done for the small training set):

[Figure: silence-phone alignment plots across epochs 1-4 for the three utterances with the most alignment variation, with the recognizer's hypothesis at each epoch.]

Based on this analysis, our theory is partially correct. The first example certainly corroborates it: as the alignments change, we get a much more accurate prediction of the utterance. But the other two examples indicate this is not always the case; in the second example, the prediction does not change at all, despite the drastic shift in alignment, and in the last example, the prediction becomes worse after a shift in the silence alignment.
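One way to quantify these shifts, assuming each epoch's alignment is available as a list of per-frame phone labels for an utterance, is sketched below; the variable and function names are illustrative, not part of our pipeline's actual interface.

def silence_shift_fraction(align_prev, align_next, phone="sil"):
    """Fraction of frames (same utterance, consecutive epochs) whose label
    changed in a way that involves the given phone, e.g. sil -> speech or
    speech -> sil. Both alignments must cover the same number of frames."""
    assert len(align_prev) == len(align_next)
    changed = sum(1 for a, b in zip(align_prev, align_next)
                  if a != b and phone in (a, b))
    return changed / len(align_prev)

# Example: the onset of speech moves one frame later between epochs.
epoch1 = ["sil", "sil", "hh", "ay", "sil"]
epoch2 = ["sil", "sil", "sil", "ay", "sil"]
print(silence_shift_fraction(epoch1, epoch2))   # 0.2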

To explain this behavior, there are a few possible theories to consider. One is that the length of the utterance must be factored into the analysis: utterances of shorter length (i.e., one or two words) tend to be dominated by long stretches of silence followed by long stretches of non-silence, whereas longer utterances have smaller but more frequent moments of silence. We may see more dramatic shifts in silence for shorter utterances, but this does not necessarily lead to a difference in the Acoustic Model; on the other hand, a slight shift in silence could have a tremendous effect for longer utterances.

We also collected more general statistics about the shifts across epochs for all phones, not just the silence phone. The plot below shows the normalized frequency of base-phone shifts for each phone and epoch transition (again, only for the small training set):

[Figure: normalized frequency of base-phone shifts per phone for each epoch-to-epoch transition on the small training set.]

Several important things stand out from this data. Firstly, for the vast majority of the phones, the frequency of change decreases over the course of the epochs. Also of significance is that the silence, noise (both spoken and non-spoken), and laughter phones dominate these shifts, especially when compared to vowel and consonant phones. Particularly noteworthy is the abnormally high frequency in the first-to-second epoch transition (colored green in the plot) for spn, the spoken-noise phone. Amongst vowel and consonant phones, the en and n phones also shifted frequently, especially in the first epoch. In general, phones that combine vowel and consonant sounds shifted more frequently, which could be attributed to the search for distinct syllables within words by our speech recognition system.

One possible experiment worth conducting in the future is to delay the re-alignment computation until the second, third, or even fourth epoch. The magnitude of change in the earlier epochs suggests that these alignments may be corrupted by the initial probabilities in the HMM model. Waiting until the second or third epoch, when the state probabilities in the HMM model are more robust after multiple epochs of training, could lead to more accurate alignment shifts and, thus, a better improvement in WER. This experiment is especially worthwhile because the re-alignment computation increases the runtime of the training stage of our algorithm.

Part-of-Speech Tagging

As an additional vector of error analysis, we decided to measure the grammatical accuracy of our speech recognizers: how often were the erroneous hypotheses made by our recognizer grammatically incorrect (e.g., confusing similar-sounding words such as "money" and "many" that play different grammatical roles)? Using the Stanford NLP group's Part-Of-Speech Tagger [9], we tagged each word in both the reference utterance and the hypothesis utterance, and then calculated the fraction of substitution errors that also differed in the corresponding part-of-speech tag. The results are as follows (again, only for the small training set):

Small Training Set Re-alignment Experiments
Epoch   POS Error Rate
1       84.04%
2       83.26%
3       82.46%
4       81.40%

The significance here is the magnitude of the error: the vast majority of the substitution errors made by our speech recognizer are grammar-related. (It is worth noting, of course, that human speech is often grammatically incorrect, which this analysis does not account for.)
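The computation behind the table above can be sketched as follows, assuming the substitution errors have already been extracted from the scoring alignment as (reference word, hypothesis word) pairs. The project used the Stanford POS Tagger; NLTK's tagger is substituted here purely for illustration, and single words are tagged out of context for brevity (the project tagged full utterances).

import nltk  # requires the tagger model: nltk.download("averaged_perceptron_tagger")

def pos_substitution_error_rate(substitution_pairs):
    """Fraction of substitution errors whose reference and hypothesis words
    receive different part-of-speech tags."""
    differing = 0
    for ref_word, hyp_word in substitution_pairs:
        ref_tag = nltk.pos_tag([ref_word])[0][1]
        hyp_tag = nltk.pos_tag([hyp_word])[0][1]
        differing += int(ref_tag != hyp_tag)
    return differing / max(len(substitution_pairs), 1)

# Example: "money"/"many" and "too"/"to" are acoustically close but differ grammatically.
print(pos_substitution_error_rate([("money", "many"), ("too", "to")]))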
This analysis demonstrates that, although our experiments focus on the Acoustic Model, perhaps our attention should shift to improving the Language Model.

In particular, this could easily address homonym confusion. The following error, made by our recognizer in the small training set experiments, is particularly instructive:

Reference:  we've been wanting to start camping again this year too uh my oldest
Hypothesis: we've been wanting to start camping again this year to uh my oldest

In these situations, the Acoustic Model simply cannot distinguish between the two words; if we improve the Language Model, however, we may be able to address these errors.

Future Work

In the future, we would like to further examine the effect of additional epochs of training for our DNN, as well as fine-tuning the hyper-parameters of the neural network, particularly the network size. We would also like to analyze the alignments in greater detail, and examine the relationship between shifts in silence, noise, and laughter alignments and utterance length. Also worth exploring are the effects of eschewing early-stage re-alignment; perhaps we could see an improvement in WER by delaying the re-alignment computation until the second, third, or even fourth epoch. Lastly, we would like to explore alternative language models: the work of Mikolov et al. [10] demonstrates that the language model can be improved through the use of Recurrent Neural Network (RNN) language models, so coupling those improvements with our re-alignments could yield a further increase in accuracy.

References

[1] Gales, M. J. F., & Young, S. J. (2008). The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing.

[2] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., et al. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine, 29(November), 82-97.

[3] Dahl, G., Yu, D., Deng, L., & Acero, A. (2011). Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing, 1-13.

[4] Povey, D., & Woodland, P. C. (2002). Minimum Phone Error and I-Smoothing for Improved Discriminative Training. ICASSP.

[5] Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., & Visweswariah, K. (2008). Boosted MMI for Model and Feature-Space Discriminative Training. ICASSP (pp. 4057-4060). IEEE.

[6] Vesely, K., Ghoshal, A., Burget, L., & Povey, D. (2013). Sequence-Discriminative Training of Deep Neural Networks. Interspeech 2013.

[7] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Vesely, K., Goel, N., et al. (2011). The Kaldi Speech Recognition Toolkit. ASRU.

[8] Goldwater, S., Jurafsky, D., & Manning, C. D. (2010). Which Words Are Hard to Recognize? Prosodic, Lexical, and Disfluency Factors That Increase Speech Recognition Error Rates. Speech Communication, 52(3), 181-200.

[9] Toutanova, K., Klein, D., Manning, C., & Singer, Y. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Proceedings of HLT-NAACL 2003, pp. 252-259.

[10] Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., & Khudanpur, S. (2010). Recurrent Neural Network Based Language Model. Proc. INTERSPEECH 2010.