Deep Neural Network for Automatic Speech Recognition: from the Industry's View


Jinyu Li, Microsoft. September 13, 2014, at Nanyang Technological University.

Speech Modeling in an SR System

[Block diagram: a training database feeds an acoustic model training process that produces the acoustic model. At runtime, input speech goes through feature extraction, then HMM-based sequential pattern recognition (decoding) using the acoustic model, language model, and word lexicon, followed by confidence scoring, e.g. "Hello" (0.9) "World" (0.8).]

Speech Recognition and Acoustic Modeling

SR = finding the most probable sequence of words $W = w_1, w_2, \ldots, w_n$ given the speech features $O = o_1, o_2, \ldots, o_T$:

$$\hat{W} = \arg\max_W p(W|O) = \arg\max_W \frac{p(O|W)\Pr(W)}{p(O)} = \arg\max_W p(O|W)\Pr(W)$$

where $\Pr(W)$ is the probability of $W$, computed by the language model, and $p(O|W)$ is the likelihood of $O$, computed by an acoustic model. In practice, $p(O|W)$ is produced by a model $M$: $p(O|W) \approx p_M(O|W)$.

Challenges in Computing $p_M(O|W)$

- Model area (M): computational model (GMM/DNN)
- Feature area (O): noise robustness
- Computing $p_M(O|W)$ at runtime: SVD-DNN
- Further topics: optimization and parameter estimation (training), model recipe, infrastructure and engineering, modeling and adapting to speakers, feature normalization algorithms, discriminative transformation, adaptation to short-term variability, confidence/score evaluation, adaptation/normalization, quantization

Acoustic Modeling of a Word

[Diagram: a word's phones, e.g. /ih/ and /t/, are modeled as context-dependent triphones such as /l-ih+t/ and /ih-t+r/.]

DNN for Automatic Speech Recognition

DNN: a feed-forward artificial neural network with more than one layer of hidden units between input and output, applying a nonlinear/linear function in each layer. DNN for automatic speech recognition (ASR): replace the Gaussian mixture model (GMM) in the traditional system with a DNN to evaluate state likelihoods.

Phoneme State Likelihood Modeling

[Diagram: senone states such as sil-b+ah [2], sil-p+ah [2], p-ah+t [2], ah-t+iy [3], t-iy+sil [3], d-iy+sil [4], whose likelihoods are evaluated per frame.]

DNN Fundamental Challenges to Industry

1. How to reduce the runtime without accuracy loss?
2. How to do speaker adaptation with a low footprint?
3. How to be robust to noise?
4. How to reduce the accuracy gap between large and small DNNs?
5. How to deal with a large variety of data?
6. How to enable languages with limited training data?

Reduce DNN Runtime without Accuracy Loss [Xue13]

Motivation: The runtime cost of a DNN is much larger than that of a GMM, which has been fully optimized in product deployments. We need to reduce the runtime cost of the DNN in order to ship it.

Solution: We propose a new DNN structure that exploits the low-rank property of DNN weight matrices to compress the model.

Singular Value Decomposition (SVD)

$$A_{m\times n} = U_{m\times n}\,\Sigma_{n\times n}\,V^T_{n\times n}$$

with $U = (u_{ij})$, $V = (v_{ij})$, and $\Sigma = \mathrm{diag}(\varepsilon_{11}, \ldots, \varepsilon_{nn})$; keeping only the $k$ largest singular values $\varepsilon_{11}, \ldots, \varepsilon_{kk}$ gives a low-rank approximation.

SVD Approximation: the number of parameters drops from $mn$ to $mk + nk$, and the runtime cost from $O(mn)$ to $O(mk + nk)$. E.g., with $m = 2048$, $n = 2048$, $k = 192$: about 80% runtime cost reduction.

SVD-Based Model Restructuring

Proposed Method

1. Train a standard DNN model with regular methods: pre-training + cross-entropy fine-tuning.
2. Use SVD to decompose each weight matrix in the standard DNN into two smaller matrices.
3. Put the new matrices back into the network.
4. Fine-tune the new DNN model if needed.
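A minimal numpy sketch of steps 2-3; the matrix here is a random stand-in and the helper name is illustrative, not from [Xue13]. In the real pipeline every weight matrix of the trained DNN would be decomposed this way and the restructured model fine-tuned:

```python
import numpy as np

def svd_restructure(W, k):
    """Decompose W (m x n) into two factors with mk + kn parameters total."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(s[:k])            # split Sigma's energy between factors
    A = U[:, :k] * sqrt_s              # (m, k): new linear bottleneck layer
    B = sqrt_s[:, None] * Vt[:k, :]    # (k, n): the layer that follows it
    return A, B

W = np.random.randn(2048, 2048)        # stand-in for a trained weight matrix
A, B = svd_restructure(W, k=192)
print(W.size, A.size + B.size)         # 4194304 -> 786432, ~81% reduction
```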

A Product Setup

- Original DNN model: WER 25.6%, 29M parameters
- SVD (k=512) applied to hidden layers: WER 25.7%, 21M parameters
- SVD (k=192) applied to all hidden and output layers: 5.6M parameters; WER 36.7% before fine-tuning, 25.5% after fine-tuning

Adapting DNN to Speakers with a Low Footprint [Xue 14]

Motivation: Speaker personalization with a DNN model creates a storage-size issue: it is not practical to store an entire DNN model for each individual speaker during deployment.

Solution: We propose a low-footprint DNN personalization method based on the SVD structure.

SVD Personalization

SVD restructuring: $A_{m\times n} \approx U_{m\times k} W_{k\times n}$. SVD personalization: $A_{m\times n} \approx U_{m\times k} S_{k\times k} W_{k\times n}$. Initialize $S_{k\times k} = I_{k\times k}$, and then adapt/store only the speaker-dependent $S_{k\times k}$.
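A minimal sketch of the adapted layer, with illustrative shapes and random stand-in matrices (the gradient updates to S during adaptation are omitted):

```python
import numpy as np

m, n, k = 2048, 2048, 192
U = np.random.randn(m, k)   # speaker-independent factor, kept frozen
W = np.random.randn(k, n)   # speaker-independent factor, kept frozen
S = np.eye(k)               # speaker-dependent matrix, initialized to I

def layer_forward(x, U, S, W):
    # A ~ U @ S @ W; adaptation updates (and stores) only the small S.
    return x @ U @ S @ W

x = np.random.randn(1, m)
y = layer_forward(x, U, S, W)
print(S.size)               # 36864 values per layer to store per speaker
```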

SVD Personalization Structure

Adapt with 100 Utterances

- Full-rank SI model: WER 25.21%, 30M parameters
- SVD model: WER 25.12%, 7.4M parameters
- Standard adaptation: WER 20.51%, 7.4M parameters
- SVD adaptation: WER 19.95%, 0.26M parameters

Noise Robustness

DNN Is More Robust to Distortion: multi-condition-trained DNN on training utterances.

Noise-Robustness Is Still Most Challenging: clean-trained DNN on test utterances.

Noise-Robustness Is Still Most Challenging: multi-condition-trained DNN on test utterances.

Some Observations

DNN works very well on utterances and environments observed in training. On unseen test cases, the DNN does not generalize as well, so noise-robustness technologies are still important. For more noise-robustness techniques, see our recent overview paper [Li14].

Variable-Component DNN

DNN components: weight matrices, outputs of a hidden layer. For any DNN component:

- Training: model it as a set of polynomial functions of a context variable $v$, e.g. SNR, duration, speaking rate:

$$C_l = \sum_{j=0}^{J} C_{j,l}\, v^j, \quad 0 < l \le L$$

where $J$ is the order of the polynomials.

- Recognition: compute the component on the fly from the variable and the associated polynomial functions.

Developed VP-DNN and VO-DNN.
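A minimal sketch of the variable-component idea for one weight matrix, assuming the context variable is the utterance SNR; the polynomial order, shapes, and coefficient matrices are illustrative stand-ins (in a trained system the $C_{j,l}$ are learned on multi-SNR data):

```python
import numpy as np

J, m, n = 2, 512, 512
# Stand-ins for the learned polynomial coefficient matrices C_{j,l} of layer l.
C = [np.random.randn(m, n) * 0.01 for _ in range(J + 1)]

def variable_weight(v):
    """Instantiate the layer's weights for context value v (e.g. SNR in dB)."""
    return sum(C[j] * (v ** j) for j in range(J + 1))

W_5db = variable_weight(5.0)    # weights computed on the fly at recognition
W_20db = variable_weight(20.0)  # a different matrix for a cleaner utterance
```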

VPDNN

VODNN

VPDNN Improves Robustness in Noisy Environments Unseen in Training (the training data has SNR > 10 dB).

Reduce Accuracy Gap between Large and Small DNN

To Deploy DNN on Server

- Low-rank matrices are used to reduce the number of DNN parameters and CPU cost.
- Quantization for SSE evaluation is used for single-instruction-multiple-data processing.
- Frame skipping or prediction is used to remove the evaluation of some frames (a minimal sketch follows below).
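A minimal sketch of frame skipping, assuming the simplest variant where posteriors are copied from the last evaluated frame (prediction-based variants would replace the copy); the frame dimension and the stand-in DNN are illustrative:

```python
import numpy as np

def posteriors_with_skipping(frames, dnn_forward, skip=1):
    """Run the DNN on every (skip+1)-th frame; reuse its output in between."""
    out, last = [], None
    for t, x in enumerate(frames):
        if t % (skip + 1) == 0:
            last = dnn_forward(x)   # full acoustic-model evaluation
        out.append(last)            # copied posterior for skipped frames
    return out

frames = [np.random.randn(440) for _ in range(10)]
fake_dnn = lambda x: np.random.rand(6000)     # stand-in for the real DNN
post = posteriors_with_skipping(frames, fake_dnn, skip=1)  # 5 evaluations
```

With skip=1 the acoustic model is evaluated on half the frames, roughly halving its runtime at some accuracy cost.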

To Deploy DNN on Device

The industry has a strong interest in running DNN systems on devices due to increasingly popular mobile scenarios. Even with the technologies mentioned above, the large computational cost is still very challenging given the limited processing power of devices. A common way to fit a CD-DNN-HMM on devices is to reduce the DNN model size by:

- reducing the number of nodes in the hidden layers
- reducing the number of senone targets in the output layer

However, these methods significantly increase the word error rate. In this talk, we explore a better way to reduce the DNN model size with less accuracy loss than the standard training method.

Standard DNN Training Process

1. Generate a set of senones as the DNN training targets: split the decision tree by maximizing the increase in likelihood evaluated on single Gaussians.
2. Get transcribed training data.
3. Train the DNN with a cross-entropy or sequence training criterion.

Significant Accuracy Loss when DNN Size Is Significantly Reduced

Better accuracy is obtained if we use the output of the large-size DNN for acoustic likelihood evaluation. The output of the small-size DNN deviates from that of the large-size DNN, resulting in worse recognition accuracy. The problem is solved if the small-size DNN can generate output similar to the large-size DNN.

Can We Make the Small-size DNN Generate Similar Output to the Large-size DNN? No, if we only have transcribed data. Yes: in industry we have almost unlimited un-transcribed data, and only a small portion is transcribed.

Small-Size DNN Training with Output Distribution Learning

1. Use the standard DNN training method to train a large-size teacher DNN on transcribed data.
2. Randomly initialize the small-size student DNN.
3. Minimize the KL divergence between the output distributions of the student DNN and the teacher DNN using a large amount of un-transcribed data.

Minimize the KL Divergence between the Output Distributions of the DNNs

$$\sum_t \sum_{i=1}^{N} P_L(s_i|x_t)\,\log\frac{P_L(s_i|x_t)}{P_S(s_i|x_t)}$$

where $s_i$ is the $i$-th senone and $x_t$ the observation at time $t$. Since the teacher term is fixed, minimizing this is equivalent to minimizing

$$-\sum_t \sum_{i=1}^{N} P_L(s_i|x_t)\,\log P_S(s_i|x_t)$$

where $P_L(s_i|x_t)$ and $P_S(s_i|x_t)$ are the posterior output distributions of the teacher and student DNNs, respectively. This is a general form of the standard DNN training criterion, where the target is a one-hot vector; here the target is generated by the output of the teacher DNN.
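A minimal sketch of this criterion as a loss function, with random stand-in frames and logits; only the forward computation is shown, not the gradient step:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_posteriors):
    """-sum_t sum_i P_L(s_i|x_t) log P_S(s_i|x_t), averaged over frames."""
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    return -np.mean(np.sum(teacher_posteriors * log_p_s, axis=-1))

# Un-transcribed audio only needs a teacher forward pass to produce targets:
teacher_post = softmax(np.random.randn(8, 6000))   # 8 frames, 6k senones
student_logits = np.random.randn(8, 6000)
print(distillation_loss(student_logits, teacher_post))
```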

Experiment Setup

- 375 hours of transcribed US-English data
- Large-size DNN: 5 x 2048
- Small-size DNN: 5 x 512
- 6k senones

EN-US Windows Phone Task

Model | Training Data | Training Criterion | WER
5 x 2048 (teacher) | 375 hours transcribed data | Standard cross entropy | 16.32
5 x 512 | 375 hours transcribed data | Standard cross entropy | 19.90
5 x 512 | 375 hours un-transcribed data | Output distribution learning | 19.55
5 x 512 | 750 hours un-transcribed data | Output distribution learning | 19.28
5 x 512 | 1500 hours un-transcribed data | Output distribution learning | 18.89
5 x 512 | 750 hours un-transcribed data, decoded to generate transcriptions | Standard cross entropy | 20.48

The 5 x 2048 model is used as the teacher for output distribution learning. Simply decoding the un-transcribed data to generate transcriptions and training with standard cross entropy (last row, 20.48) is worse than output distribution learning on the same data (19.28).

Can We Use German Data to Learn an EN-US DNN?

Model | Training Data | Training Criterion | WER
5 x 2048 (teacher) | 375 hours EN-US transcribed data | Standard cross entropy | 16.32
5 x 512 | 750 hours un-transcribed EN-US data | Output distribution learning | 19.28
5 x 512 | 600 hours un-transcribed German data | Output distribution learning | ?

Please guess the WER: 90? 70? 50? 30? 10? Answer: 21.71!

Better Teacher

If the teacher DNN is improved by some other technique, can the improvement be transferred to a better student DNN?

Model | Training Data | Training Criterion | WER
5 x 2048 (teacher) | 375 hours transcribed data | Standard sequence training | 13.93
5 x 512 | 375 hours transcribed data | Standard sequence training | 17.16
5 x 512 | 750 hours un-transcribed data | Output distribution learning | 16.66

Real Application Setup

2 million parameters for the small-size DNN, compared to 30 million parameters for the teacher DNN.

[Accuracy chart comparing: the teacher DNN trained with standard sequence training; the student DNN trained with the output distribution learning in this talk; and a small-size DNN trained with standard sequence training.]

Dealing with Large Variety of Data

Factorization of Speech Signals

[Architecture diagram: an input layer $v$, many hidden layers, and an output layer over senones; factor-specific feature extractors produce factors $f_1, \ldots, f_N$ from the training or testing samples, which connect into the network through matrices $Q_1, \ldots, Q_N$.]

$$R(x) = R(y) + \sum_{n=1}^{N} Q_n\, f_n(x)$$

Joint Factor Analysis (JFA)-Style Adaptation

JFA: $M = m + Aa + Bb + Cc$. Analogously: $R(x) \approx R(y) + Dn + Eh + Fs$.

Vector Taylor Series (VTS)-Style Adaptation

$$x = y + \log\big(1 + \exp(n - y)\big) \approx y_0 + \log\big(1 + \exp(n_0 - y_0)\big) + A(y - y_0) + B(n - n_0)$$

$$R(x) \approx R(y) + R'(y)\,(Ay + Bn + \text{const})$$

If we make the rather coarse assumption that $R'(y)$ is constant:

$$R(x) \approx R(y) + Cy + Dn + \text{const}$$

Fast Adaptation with Factorization

[Results chart: test set B (same microphone) and test set D (microphone mismatch).]

Factorization of Speech Signals, Another Solution

DNN SR for 8-kHz and 16-kHz Data

Performance on Wideband and Narrowband Test Sets

Training Data | WER (16 kHz) | WER (8 kHz)
16-kHz VS-1 (B1) | 29.96 | 71.23
8-kHz VS-1 + 8-kHz VS-2 (B2) | - | 28.98
16-kHz VS-1 + 8-kHz VS-2 (ZP) | 28.27 | 29.33
16-kHz VS-1 + 16-kHz VS-2 (UB) | 27.47 | 53.51
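A minimal sketch of the zero-padding (ZP) idea for mixed-bandwidth training, as described in [Li12]: narrowband features are given the wideband input layout, with the filterbank bins covering 4-8 kHz set to zero. The channel counts here are assumptions for illustration, not figures from the slide:

```python
import numpy as np

N_WB, N_NB = 29, 22   # assumed wideband / narrowband filterbank channels

def to_wideband_layout(feat, is_narrowband):
    """Give 8-kHz features the same dimensionality as 16-kHz features."""
    if is_narrowband:
        pad = np.zeros((feat.shape[0], N_WB - N_NB))
        return np.concatenate([feat, pad], axis=1)  # missing 4-8 kHz bins -> 0
    return feat

nb = np.random.randn(100, N_NB)          # 100 frames of narrowband features
wb_like = to_wideband_layout(nb, True)   # (100, 29), mixable with 16-kHz data
```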

Distance between the Output Vectors for 8-kHz and 16-kHz Input Features

[Plot: mean Euclidean distance (ED) at layers L1, L4, and L7 and KL divergence at the top layer, comparing the 16-kHz DNN (UB) and the data-mix DNN (ZP).]

Enable Languages with Limited Training Data [Huang 13]

Shared Hidden Layer Multilingual DNN

[Architecture diagram: hidden layers shared across all languages, with a language-specific output layer per language.]
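A minimal sketch of the shared-hidden-layer structure: one hidden stack trained by all languages, plus a softmax head per language. The shapes, senone counts, sigmoid nonlinearity, and random initialization are illustrative stand-ins, not settings from [Huang 13]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden = [rng.normal(0, 0.01, (2048, 2048)) for _ in range(5)]  # shared stack
heads = {lang: rng.normal(0, 0.01, (2048, 6000))                # per language
         for lang in ("FRA", "DEU", "ESP", "ITA")}

def forward(x, lang):
    h = x
    for W in hidden:            # hidden layers are trained by all languages
        h = sigmoid(h @ W)
    return h @ heads[lang]      # language-specific senone logits

logits = forward(rng.normal(size=(1, 2048)), "DEU")
# A new language (e.g. CHN) reuses the shared stack and trains a new head,
# optionally fine-tuning the shared layers with its limited data.
```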

Source Languages in Multilingual DNN Benefit Each Other

| FRA | DEU | ESP | ITA
Test Set Size (Words) | 40K | 37K | 18K | 31K
Monolingual DNN | 28.1 | 24.0 | 30.6 | 24.3
SHL-DNN | 27.1 | 22.7 | 29.4 | 23.5
Relative WER Reduction (%) | 3.6 | 5.4 | 3.9 | 3.3

Source languages: FRA: 138 hours, DEU: 195 hours, ESP: 63 hours, and ITA: 93 hours of speech.

Transferring from Western Languages to Mandarin Chinese Is Effective

CHN CER (%) | 3 hrs | 9 hrs | 36 hrs | 139 hrs
Baseline DNN (no transfer) | 45.1 | 40.3 | 31.9 | 29.0
SHL-MDNN Model Transfer | 35.6 | 33.9 | 28.4 | 26.6
Relative CER Reduction | 21.1 | 15.9 | 10.4 | 8.3

References

[Huang 13] Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, and Yifan Gong, "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," in ICASSP, 2013.

[Li12] Jinyu Li, Dong Yu, Jui-Ting Huang, and Yifan Gong, "Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM," in IEEE Workshop on Spoken Language Technology, 2012.

[Li14] Jinyu Li, Li Deng, Yifan Gong, and Reinhold Haeb-Umbach, "An overview of noise-robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745-777, 2014.

[Li14b] Jinyu Li, Jui-Ting Huang, and Yifan Gong, "Factorized adaptation for deep neural network," in ICASSP, 2014.

[Li14c] Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong, "Learning small-size DNN with output-distribution-based criteria," in Interspeech, 2014.

[Xue13] Jian Xue, Jinyu Li, and Yifan Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in Interspeech, 2013.

[Xue 14] Jian Xue, Jinyu Li, Dong Yu, Mike Seltzer, and Yifan Gong, "Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network," in ICASSP, 2014.

[Zhao14] Rui Zhao, Jinyu Li, and Yifan Gong, "Variable-component deep neural network for robust speech recognition," in Interspeech, 2014.

[Zhao14b] Rui Zhao, Jinyu Li, and Yifan Gong, "Variable-activation and variable-input deep neural network for robust speech recognition," in IEEE SLT, 2014.