Feature-based Robust Techniques for Speech Recognition. Presented by Nguyen Duc Hoang Ha. Supervisors: Assoc. Prof. Chng Eng Siong, Prof. Li Haizhou. 08-Mar-2017
Outline: An Overview of Robust ASR; The 1st proposed method (Ch5), the major contribution: Feature Adaptation Using Spectro-Temporal Information; The 2nd proposed method (Ch3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments; The 3rd proposed method (Ch4): A Particle Filter Compensation Approach to Robust LVCSR; Conclusions and Future Directions. 2
Automatic Speech Recognition (ASR) [Huang2001]. (Block diagram with acoustic model (AM), language model (LM), the word "hello" and its phones /h e l o/.) The aim is to decode the speech signal into text. 3
Applications of the ASR system Siri (http://www.apple.com/ios/siri/) Amazon Echo (https://en.wikipedia.org/wiki/amazon_echo) Google Speech Recognition API (https://cloud.google.com/speech/)... 4
Challenges of the ASR system [Chelba2010, Li2014] Non-native speakers Dialect variations Dis-fluencies Out-of-vocabulary words Language modeling Noise robustness 5
ASR in Noisy Environments [Xiao2009, Li2014]: mismatch between the noisy speech features and the clean speech model. 6
Feature/Model Compensation [Xiao2009, Li2014]. Two major approaches: (A) feature-based approach; (B) model-based approach. 7
Feature/Model Compensation. Feature-based approach (A), examples: spectral subtraction [Boll1979], MMSE [Ephraim1984], fmllr [Digalakis1995, Gales1998], ... Model-based approach (B), examples: MAP model adaptation [Gauvain1994], MLLR/CMLLR model adaptation [Leggetter1995, Gales1998], vector Taylor series model adaptation [Acero2000, Li2009]. 8
Multi-condition training approach [Ng2016] (approach (C)): noisy data collection / simulation. 9
Robust ASR.
(A) Feature-based approach: clean feature estimation (e.g. SS [Boll1979], MMSE [Ephraim1984], ...); filtering approach (e.g. RASTA [Hermansky1994], ...); feature transformation (e.g. fmllr [Digalakis1995, Gales1998]).
(B) Model-based approach: MAP model adaptation [Gauvain1994]; MLLR/CMLLR model adaptation [Leggetter1995, Gales1998]; VTS model compensation [Acero2000, Li2009].
(C) Data collection / simulation: deep learning approaches (e.g. DNN AM [Hinton2012]). 10
Contributions: Three Proposed Methods. (A1) ST-Transform (for background noise and reverberation); (A2) NN + (B2) VTS (for non-stationary noise); (A3) PFC for LVCSR (for background noise). 11
Contributions: Three Proposed Methods.
1) Spectro-Temporal Transformation (ST-Transform):
D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. Generalization of temporal filter and linear transformation for robust speech recognition. In ICASSP, Italy, 2014.
D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. Feature adaptation using linear spectro-temporal transform for robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, PP(99):1-1, 2016.
(Contributed to success at the REVERB 2014 Challenge, clean-condition scheme.)
2) Noise Normalization (NN) + Vector Taylor Series Model Compensation (VTS):
D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. An analysis of vector Taylor series model compensation for non-stationary noise in speech recognition. In ISCSLP, Hong Kong, 2012.
3) Particle Filter Compensation (PFC) for LVCSR:
D. H. H. Nguyen, A. Mushtaq, X. Xiao, E. S. Chng, H. Li, and C.-H. Lee. A particle filter compensation approach to robust LVCSR. In APSIPA ASC, Taiwan, 2013. 12
Contributions to the REVERB 2014 Challenge (http://reverb2014.dereverberation.com/introduction.html). 13
Outline: An Overview of Robust ASR; The 1st proposed method (Ch5): Feature Adaptation Using Spectro-Temporal Information; The 2nd proposed method (Ch3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments; The 3rd proposed method (Ch4): A Particle Filter Compensation Approach to Robust LVCSR; Conclusions and Future Directions. 14
Feature Adaptation Using Spectro-Temporal Information (A1) ST-Transform 15
Feature Adaptation Using Spectro-Temporal Information: the noisy features y_{1:T} are mapped by the ST transform W to the transformed features x̂_{1:T}. The ST transform W is estimated to minimize the KL divergence between the distribution of the transformed features and the reference distribution of the training features. 16
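As a point of reference for how this criterion can be optimized (standard identities in generic notation, not quoted from the thesis): writing q for the reference distribution of the training features and p for the distribution of the transformed features x̂ = f(y),

$$ \mathrm{KL}(p \,\|\, q) = -H(p) - \mathbb{E}_{p}\big[\log q(\hat{x})\big], $$

so minimizing the KL divergence amounts to maximizing the expected log-likelihood of the transformed features under the reference model plus the differential entropy H(p) of the transformed features. For an invertible linear map, the entropy term equals the (fixed) entropy of the input plus $\log|\det W|$, which yields an fMLLR-style auxiliary function that an EM procedure can optimize.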
Changing Notation for Generalization of the Feature Transformation: the input features x_{1:T} are mapped by a transformation y = f(x) to the transformed features y_{1:T}, and the KL divergence is again taken between the distribution of the transformed features and the distribution of the training features. From here on, x denotes the input feature and y denotes the output feature, so writing the transformation as y = f(x) is more natural. 17
ST-Transform: Generalized Linear Transform. Special cases: A) e.g. CMN [Atal1974], MVN [Viikki1998]; B) e.g. fmllr [Digalakis1995, Gales1998]; C) e.g. RASTA [Hermansky1994], TSN [Xiao2009]. 18
ST-Transform: Generalized Linear Transform. Input: a window of input feature vectors; output: one output feature vector. 19
ST-Transform: Generalized Linear Transform. 20
ST-Transform: Generalized Linear Transform (matrix form of W). 21
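To make the generalized linear transform concrete, here is a minimal sketch (illustrative shapes and names, not the thesis code) in which each output frame is a linear function of a window of 2L+1 input frames plus a bias:

```python
# A minimal sketch of applying a generalized linear spectro-temporal transform.
import numpy as np

def st_transform(X, W, b):
    """X: (T, D) input features; W: (2L+1, D, D) per-offset matrices; b: (D,) bias."""
    T, D = X.shape
    L = (W.shape[0] - 1) // 2
    # Pad by repeating edge frames so every output frame has a full context window.
    Xp = np.pad(X, ((L, L), (0, 0)), mode="edge")
    Y = np.empty_like(X)
    for t in range(T):
        window = Xp[t:t + 2 * L + 1]                 # (2L+1, D) frames around time t
        Y[t] = np.einsum("kij,kj->i", W, window) + b
    return Y

# Identity initialization: only the centre tap is the identity matrix,
# so the transform starts as a pass-through.
D, L = 39, 10
W0 = np.zeros((2 * L + 1, D, D)); W0[L] = np.eye(D)
Y = st_transform(np.random.randn(200, D), W0, np.zeros(D))
```

With only the centre tap non-zero this reduces to a frame-wise transform such as fmllr; with only diagonal taps it reduces to a per-dimension temporal filter, which is why the generalized form covers the special cases A), B) and C) above.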
EM Algorithm for Parameter Estimation: the transform parameters are estimated with an EM algorithm; the diagram relates the output features and their covariance matrix (from the L2-norm term) to the reference model (from the KL-divergence criterion). 22
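For illustration, the E-step of such an EM procedure typically computes GMM responsibilities of the transformed features under the reference model and accumulates sufficient statistics. The sketch below assumes a diagonal-covariance GMM reference model and generic accumulators; the exact statistics used in the thesis may differ.

```python
import numpy as np

def gmm_posteriors(Y, weights, means, variances):
    """Responsibilities gamma[t, m] of each diagonal-covariance Gaussian for frame t."""
    log_p = np.stack([
        np.log(w)
        - 0.5 * np.sum(np.log(2.0 * np.pi * var))
        - 0.5 * np.sum((Y - mu) ** 2 / var, axis=1)
        for w, mu, var in zip(weights, means, variances)
    ], axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)        # stabilize before exponentiating
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def accumulate_stats(X_window, gamma):
    """Zeroth/first/second-order statistics over stacked input windows X_window (T, P)."""
    occ = gamma.sum(axis=0)                                          # (M,)
    first = gamma.T @ X_window                                       # (M, P)
    second = np.einsum("tm,tp,tq->mpq", gamma, X_window, X_window)   # (M, P, P)
    return occ, first, second
```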
Insufficient Adaptation Data Issue. Issues: unreliable statistics; too many degrees of freedom in the ST transform. Solutions: statistics smoothing approach; sparse ST transform. 23
Statistics Smoothing Approach: the idea is to interpolate the statistics computed from the adaptation (test) data with statistics computed from training or other prior data. 24
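A tiny sketch of such an interpolation; the prior-count style weighting and the choice of which statistic to smooth are assumptions for illustration, not the exact formulation in the thesis:

```python
import numpy as np

def smooth_stats(test_stat, prior_stat, test_count, tau):
    """Interpolate a test-data statistic with a prior statistic.
    tau acts like a prior count: a large tau trusts the prior data more."""
    rho = test_count / (test_count + tau)
    return rho * test_stat + (1.0 - rho) * prior_stat

# e.g. smoothing a mean vector estimated from only a few adaptation frames
prior_mean = np.zeros(39)
test_mean = np.random.randn(39) * 0.1
smoothed = smooth_stats(test_mean, prior_mean, test_count=200, tau=500)
```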
Sparse ST Transformation: the Cross Transform. A) e.g. CMN, MVN, HEQ; B) e.g. fmllr; C) e.g. RASTA, ARMA, TSN. 25
Cross Transform: Generalized Linear Transform (matrix form of W). 26
Matrix form of W 27
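To make the sparsity concrete, here is one plausible cross-shaped pattern, assuming the ST transform is parameterized as one D x D matrix per temporal offset: a full matrix at the centre tap (frame-wise transform, as in fmllr) and diagonal matrices at the other taps (per-dimension temporal filter, as in RASTA/TSN). The exact pattern used in the thesis may differ; this is only an illustration of how the parameter count shrinks.

```python
import numpy as np

def cross_mask(num_taps, dim):
    """Boolean mask over W of shape (num_taps, dim, dim):
    full matrix at the centre tap, diagonal-only at the other taps."""
    mask = np.zeros((num_taps, dim, dim), dtype=bool)
    centre = num_taps // 2
    mask[centre] = True                      # full spectral transform at the current frame
    for k in range(num_taps):
        np.fill_diagonal(mask[k], True)      # temporal-filter taps
    return mask

mask = cross_mask(num_taps=21, dim=39)
free_params = mask.sum()   # 39*39 + 20*39 = 2301 non-zeros, vs. 21*39*39 = 31941 for the full transform
```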
Experimental Settings. REVERB Challenge 2014 benchmark task for noisy and reverberant speech recognition, clean-condition training scheme. Training data: 7861 clean utterances from the WSJCAM0 database (about 17.5 hours from 92 speakers). Speech features: 13 MFCCs + 13 deltas + 13 delta-deltas, with MVN post-processing. Acoustic model: 3115 tied states, 10 mixtures/state. Development (dev) and evaluation (eval) data sets: real meeting-room recordings from the MC-WSJ-AV corpus; near setting: 100 cm microphone-to-speaker distance; far setting: 250 cm microphone-to-speaker distance. 28
An Analysis of Window Length on the Dev Set. We fix the window length to 21 for the temporal filter, the cross transform, and the full ST transform in the experiments on the eval set. 29
Three different adaptation schemes. Full batch mode: one transform per subset (near and far); speaker mode: one transform per speaker; utterance mode: one transform per utterance. 30
Experiments for Cascaded Transforms: the input features pass through Transform 1 and then Transform 2, with cascades combining the cross transform, the temporal filter, and fmllr. (Bar chart: % average WER in speaker mode, full batch mode, and utterance mode.) Observations: cascading transforms in tandem is an effective way of using spectro-temporal information without a significant increase in the number of free parameters; the best result is obtained from the cascade of the cross transform and fmllr. 31
Hybrid Cascaded Transforms: Transform 1 is estimated in full batch mode over all utterances (utt1, utt2, ..., uttn), and Transform 2 is then estimated in utterance mode for each utterance. Full batch mode (fb): deals with session-wise reverberation and noise distortions. Utterance mode (utt): removes speaker variations and other sentence-wise variations, e.g. due to speaker movement and background noise changes. Statistics smoothing (smooth): reference statistics are provided by the batch mode. 32
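A schematic sketch of the hybrid scheme; estimate_transform and apply_transform are hypothetical placeholders standing in for the EM estimation and transform application described earlier:

```python
import numpy as np

def hybrid_cascade(utterances, estimate_transform, apply_transform):
    """utterances: list of (T_i, D) feature arrays from one recording session."""
    # Stage 1 (full batch mode): one transform for the whole session,
    # aimed at session-wise reverberation and noise distortions.
    W1 = estimate_transform(np.concatenate(utterances, axis=0))
    stage1 = [apply_transform(W1, utt) for utt in utterances]
    # Stage 2 (utterance mode): one transform per utterance, aimed at speaker-
    # and utterance-level variation; its statistics can be smoothed with
    # reference statistics taken from the batch-mode data.
    return [apply_transform(estimate_transform(utt), utt) for utt in stage1]
```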
Cascaded Transforms (1) vs. Hybrid Cascaded Transforms (2) vs. Hybrid Cascaded Transforms + Statistics Smoothing (3). Observations: the combination of batch-mode and utterance-mode transforms performs the best; (1) vs. (2): 3% absolute reduction in WER; (3) gives the best result. 33
Outline: An Overview of Robust ASR; The 1st proposed method: Feature Adaptation Using Spectro-Temporal Information; The 2nd proposed method (Chapter 3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments; The 3rd proposed method: A Particle Filter Compensation Approach to Robust LVCSR; Conclusions and Future Directions. 34
Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments (the NN+VTS method). (A) Noise Normalization: reduces the non-stationary characteristics of the additive noise. (B) VTS Model Compensation: handles the residual noise. 35
Step 1: Noise Normalization. A noise estimator applied to the noisy feature y_t provides an instantaneous noise estimate n_t and an average noise estimate μ_n. The NN feature ŷ_t is formed from the observed noisy feature by removing a fraction α of the instantaneous noise estimate and adding back the average noise estimate (the mapping involves the DCT matrix). The hyper-parameter α controls the degree of removal of the instantaneous noise, and adding the average noise estimate reduces musical noise. 36
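A rough sketch of one way to realize this block diagram, assuming the subtraction is done in the Mel power-spectral domain before the DCT; the exact domain, flooring, and noise estimator in the thesis may differ, so treat everything here as illustrative:

```python
import numpy as np
from scipy.fftpack import dct

def noise_normalize(noisy_power, inst_noise_power, avg_noise_power,
                    alpha=0.5, floor=1e-10, num_ceps=13):
    """noisy_power, inst_noise_power, avg_noise_power: (T, B) Mel power spectra."""
    # Remove a fraction alpha of the instantaneous noise and add back the same
    # fraction of the average noise, so the residual noise is more stationary.
    normalized = noisy_power - alpha * inst_noise_power + alpha * avg_noise_power
    normalized = np.maximum(normalized, floor)       # avoid log of non-positive values
    log_mel = np.log(normalized)
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :num_ceps]
```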
Step 2: Back-end Compensation. Noise estimation on the noisy features y_t provides λ_n = {μ_n, σ_n}. VTS model compensation [Li2009] uses the mismatch function y = g(x, n) to map the clean model λ_x to a noisy model λ_ŷ. The noisy acoustic models are approximated with a first-order expansion using the Jacobian matrix, taking into account the hyper-parameter α from the noise normalization. 37
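For context, the standard first-order VTS compensation of the Gaussian parameters in generic notation from the VTS literature (e.g. [Acero2000, Li2009]); the thesis additionally folds the noise-normalization hyper-parameter α into these expressions:

$$ y = g(x, n) = x + C \log\!\big(1 + \exp(C^{-1}(n - x))\big), $$

$$ \mu_y \approx g(\mu_x, \mu_n), \qquad G = \left.\frac{\partial g}{\partial x}\right|_{(\mu_x, \mu_n)}, \qquad \Sigma_y \approx G\,\Sigma_x\,G^{\top} + (I - G)\,\Sigma_n\,(I - G)^{\top}, $$

where C is the DCT matrix and G is the Jacobian of the mismatch function evaluated at the clean and noise means.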
Step 2: Back-end Compensation. The residual noise variance, plotted as a function of α, has a minimal point around α = 0.5, so we expect α = 0.5 to be the best setting. 38
Experimental Settings 39
Results: word accuracies evaluated on test sets A and B of the AURORA2 database. 40
Outline: An Overview of Robust ASR; The 1st proposed method (Ch5): Feature Adaptation Using Spectro-Temporal Information; The 2nd proposed method (Ch3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments; The 3rd proposed method (Ch4): A Particle Filter Compensation Approach to Robust LVCSR; Conclusions and Future Directions. 41
Particle Filter Compensation (PFC) Approach to Robust LVCSR (for background noise). 42
PFC Framework: the input speech features are first decoded by Decoder 1 to obtain a phone sequence aligned with the input features; PFC feature enhancement then produces enhanced speech features, which Decoder 2 decodes into text. 43
PFC for Clean Speech Feature Estimation (example: phone /a/): a) using the Single Pass Retraining (SPR) technique; b) using the particle filter algorithm to estimate the posterior density of the clean speech features. 44
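To make the particle-filter step concrete, here is a minimal bootstrap particle filter for tracking a clean-feature trajectory under a simple additive-noise observation model. The state-transition and observation models below are toy placeholders (the thesis drives the particles with the phone-aligned acoustic model statistics from the first decoding pass), so this is only an illustration of the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter(noisy, prior_mean, prior_var, trans_var, obs_var, num_particles=200):
    """Estimate clean features x_t from noisy observations y_t with a bootstrap filter.
    noisy: (T, D); prior_mean/prior_var: (D,) initial clean-feature statistics."""
    T, D = noisy.shape
    particles = prior_mean + np.sqrt(prior_var) * rng.standard_normal((num_particles, D))
    estimates = np.empty((T, D))
    for t in range(T):
        # Propagate: random-walk state transition (placeholder dynamics).
        particles = particles + np.sqrt(trans_var) * rng.standard_normal(particles.shape)
        # Weight: likelihood of the noisy observation given each clean-feature particle,
        # assuming y_t = x_t + Gaussian residual noise.
        log_w = -0.5 * np.sum((noisy[t] - particles) ** 2 / obs_var, axis=1)
        log_w -= log_w.max()
        w = np.exp(log_w); w /= w.sum()
        # Estimate: posterior mean of the clean feature at time t.
        estimates[t] = w @ particles
        # Resample to avoid weight degeneracy.
        idx = rng.choice(num_particles, size=num_particles, p=w)
        particles = particles[idx]
    return estimates

D = 13
clean_hat = particle_filter(rng.standard_normal((50, D)),
                            prior_mean=np.zeros(D), prior_var=np.ones(D),
                            trans_var=0.1 * np.ones(D), obs_var=0.5 * np.ones(D))
```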
Experiments: conducted on the Aurora 4 data; the decoder is from the Hidden Markov Model Toolkit (HTK). A relative error reduction of only 5.3% is obtained (compared to the multi-condition-trained GMM-HMM system). This work has been published as: D. H. H. Nguyen, A. Mushtaq, X. Xiao, E. S. Chng, H. Li, and C.-H. Lee. A particle filter compensation approach to robust LVCSR. In APSIPA ASC, Taiwan, 2013. 45
Outline: An Overview of Robust ASR; The 1st proposed method (Ch5), the major contribution: Feature Adaptation Using Spectro-Temporal Information; The 2nd proposed method (Ch3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments; The 3rd proposed method (Ch4): A Particle Filter Compensation Approach to Robust LVCSR; Conclusions and Future Directions. 46
Conclusions: Proposed a sparse ST transform, the cross transform; explored cascaded transforms and interpolation of statistics. Proposed an EM algorithm to estimate the generalized linear transform by minimizing a cost function based on the KL-divergence criterion for feature adaptation. Proposed the integration of noise normalization with VTS model compensation. Extended the PFC framework to work on an LVCSR system. 47
Future Directions: Discover a sparse transform automatically by using sparsity constraints (e.g. applying an L1 norm). Introduce nonlinear hidden nodes into the transform, similar to a multilayer perceptron or deep neural network. Investigate the proposed methods with existing state-of-the-art DNN acoustic models. 48
List of Publications 49
References 50
Thank you very much! 53
Supplementary Slides 54