Feature-based Robust Techniques For Speech Recognition

Feature-based Robust Techniques for Speech Recognition
Presented by Nguyen Duc Hoang Ha
Supervisors: Assoc. Prof. Chng Eng Siong, Prof. Li Haizhou
08-Mar-2017

Outline
- An Overview of Robust ASR
- The 1st proposed method (Ch5), the major contribution: Feature Adaptation Using Spectro-Temporal Information
- The 2nd proposed method (Ch3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments
- The 3rd proposed method (Ch4): A Particle Filter Compensation Approach to Robust LVCSR
- Conclusions and Future Directions

Automatic Speech Recognition (ASR) [Huang2001]
[Diagram: an acoustic model (AM) and a language model (LM) decode the waveform of "hello" /h e l o/ into text.]
The aim is to decode the speech signal into text.
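
For reference, the standard noisy-channel formulation behind this slide (not spelled out in the deck): the decoder searches for the word sequence that maximizes the product of the acoustic-model and language-model scores,

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
\;=\; \arg\max_{W}\; \underbrace{p(X \mid W)}_{\text{acoustic model (AM)}}\;
\underbrace{P(W)}_{\text{language model (LM)}}
```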

Applications of ASR systems: Siri (http://www.apple.com/ios/siri/), Amazon Echo (https://en.wikipedia.org/wiki/Amazon_Echo), Google Speech Recognition API (https://cloud.google.com/speech/), ...

Challenges for ASR systems [Chelba2010, Li2014]: non-native speakers, dialect variations, disfluencies, out-of-vocabulary words, language modeling, noise robustness.

ASR in Noisy Environments [Xiao2009, Li2014]
[Diagram: mismatch between the noisy speech features and the clean speech model.]

Feature/Model Compensation [Xiao2009, Li2014]
Two major approaches: (A) the feature-based approach and (B) the model-based approach.

Feature/Model Compensation
(A) Feature-based approach. Examples: spectral subtraction [Boll1979], MMSE [Ephraim1984], fMLLR [Digalakis1995, Gales1998], ...
(B) Model-based approach. Examples: MAP model adaptation [Gauvain1994], MLLR/CMLLR model adaptation [Leggetter1995, Gales1998], Vector Taylor series model adaptation [Acero2000, Li2009].
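
As a concrete illustration of the feature-based family, here is a minimal magnitude spectral subtraction sketch in the spirit of [Boll1979]; the noise-estimation window and flooring constant are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, beta=0.002):
    """Subtract a noise magnitude estimate from each noisy frame.

    noisy_mag: (frames, bins) magnitude spectrogram of noisy speech
    noise_mag: (bins,) average noise magnitude, e.g. from leading
               non-speech frames
    beta:      spectral floor to limit musical noise
    """
    clean_mag = noisy_mag - noise_mag[None, :]
    # Flooring: negative values are replaced by a small fraction
    # of the noisy magnitude.
    return np.maximum(clean_mag, beta * noisy_mag)

# Usage: estimate the noise from the first 10 frames of an utterance.
# noisy_mag = np.abs(stft_of_utterance)        # (frames, bins)
# noise_mag = noisy_mag[:10].mean(axis=0)      # (bins,)
# enhanced  = spectral_subtraction(noisy_mag, noise_mag)
```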

Multi-condition training approach [Ng2016]
[Diagram: alongside (A) feature-based and (B) model-based compensation, a third route (C) is noisy data collection / simulation.]

Robust ASR
(A) Feature-based approach: clean feature estimation (e.g. SS [Boll1979], MMSE [Ephraim1984], ...); filtering approaches (e.g. RASTA [Hermansky1994], ...); feature transformation (e.g. fMLLR [Digalakis1995, Gales1998]); ...
(B) Model-based approach: MAP model adaptation [Gauvain1994]; MLLR/CMLLR model adaptation [Leggetter1995, Gales1998]; VTS model compensation [Acero2000, Li2009]; ...
(C) Data collection / simulation: deep learning approaches (e.g. DNN AM [Hinton2012]); ...

Contributions: Three Proposed Methods
[Diagram mapping the contributions onto the taxonomy above:]
(A1) ST-Transform (for background noise and reverberation)
(A2) NN + (B2) VTS (for non-stationary noise)
(A3) PFC-LVCSR (for background noise)

Contributions: Three Proposed Methods
1) Spectro-Temporal Transformation (ST-Transform)
   D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. Generalization of temporal filter and linear transformation for robust speech recognition. In ICASSP, Italy, 2014.
   D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. Feature adaptation using linear spectro-temporal transform for robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, PP(99):1-1, 2016.
   (Contributed to success at the REVERB 2014 Challenge under the clean-condition training scheme.)
2) Noise Normalization (NN) + Vector Taylor Series Model Compensation (VTS)
   D. H. H. Nguyen, X. Xiao, E. S. Chng, and H. Li. An analysis of Vector Taylor Series model compensation for non-stationary noise in speech recognition. In ISCSLP, Hong Kong, 2012.
3) Particle Filter Compensation (PFC) for LVCSR
   D. H. H. Nguyen, A. Mushtaq, X. Xiao, E. S. Chng, H. Li, and C.-H. Lee. A particle filter compensation approach to robust LVCSR. In APSIPA ASC, Taiwan, 2013.

Contributions to the REVERB Challenge 2014 (http://reverb2014.dereverberation.com/introduction.html)

Outline
- An Overview of Robust ASR
- The 1st proposed method (Ch5): Feature Adaptation Using Spectro-Temporal Information
- The 2nd proposed method (Ch3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments
- The 3rd proposed method (Ch4): A Particle Filter Compensation Approach to Robust LVCSR
- Conclusions and Future Directions

Feature Adaptation Using Spectro-Temporal Information: (A1) ST-Transform

Feature Adaptation Using Spectro-Temporal Information
[Diagram: noisy features $y_{1:T}$ are mapped to transformed features $\hat{x}_{1:T}$; the distribution of the transformed features is compared against the distribution of the training features via the Kullback-Leibler divergence.]
The ST transform W is estimated to minimize the KL divergence between the distribution of the transformed features and the reference distribution of the training features.

Changing Notation for a Generalized Feature Transformation
[Diagram: input features $x_{1:T}$ pass through a feature transformation $y = f(x)$ to give transformed features $y_{1:T}$, whose distribution is compared against the distribution of the training features via the KL divergence.]
Here x denotes the input feature and y denotes the output feature; writing the transformation as y = f(x) is more natural.
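
A compact way to state the estimation criterion sketched on these two slides (my reconstruction; the exact notation in the thesis may differ): with $p_f$ the distribution of the transformed features and $p_{\mathrm{ref}}$ the reference distribution of the training features,

```latex
f^{*} \;=\; \arg\min_{f}\; D_{\mathrm{KL}}\big(p_{f}(y)\,\big\|\,p_{\mathrm{ref}}(y)\big)
\;=\; \arg\min_{f}\; \mathbb{E}_{p_{f}}\big[\log p_{f}(y)-\log p_{\mathrm{ref}}(y)\big]
```

For a linear transform this is closely related to maximizing the likelihood of the transformed features under the reference model plus a log-determinant (Jacobian) term, as in fMLLR.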

ST-Transform: Generalized Linear Transform
Special cases:
A) e.g. CMN [Atal1974], MVN [Viikki1998]
B) e.g. fMLLR [Digalakis1995, Gales1998]
C) e.g. RASTA [Hermansky1994], TSN [Xiao2009]
(A concrete sketch of these special cases follows below.)
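
To make the generalization concrete, here is a small sketch (my illustration, not thesis code) of a spectro-temporal linear transform y_t = sum_k W_k x_{t+k} + b, and how CMN, fMLLR, and a temporal filter fall out as special cases.

```python
import numpy as np

def st_transform(X, W, b):
    """Generalized spectro-temporal linear transform.

    X: (T, D) input feature sequence
    W: (2K+1, D, D) one matrix per context offset -K..K
    b: (D,) bias
    Returns Y with y_t = sum_k W[k] @ x_{t+k} + b (edges zero-padded).
    """
    T, D = X.shape
    K = (W.shape[0] - 1) // 2
    Xp = np.vstack([np.zeros((K, D)), X, np.zeros((K, D))])
    Y = np.empty_like(X)
    for t in range(T):
        ctx = Xp[t:t + 2 * K + 1]              # frames t-K .. t+K
        Y[t] = np.einsum('kij,kj->i', W, ctx) + b
    return Y

# Special cases (D = feature dim, K = context radius):
# CMN:   W[k] = I for k == 0 else 0, b = -mean(X)  -> y_t = x_t - mu
# fMLLR: W[k] = A for k == 0 else 0, b = b_fmllr   -> y_t = A x_t + b
# Temporal filter: W[k] = h[k] * I (one scalar tap per offset)
```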

ST-Transform: Generalized Linear Transform
[Diagram: each output feature vector is computed from a window of input feature vectors.]

ST-Transform: Generalized Linear Transform (cont.)

ST-Transform: Generalized Linear Transform (matrix form of W)

EM Algorithm for Parameter Estimation
[Diagram: statistics of the output features (e.g. the covariance matrix from an L2-norm term) and the reference model feed the KL-divergence criterion, which is optimized with the EM algorithm.]
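
The deck does not spell out the update equations; as a rough, hedged illustration of the EM structure (reference-GMM posteriors in the E-step, transform update in the M-step, here by a simple gradient step rather than the thesis's closed-form solution, and with the log-determinant term omitted for simplicity):

```python
import numpy as np

def em_estimate_transform(X, means, covs, weights, n_iters=10, lr=1e-3):
    """Toy EM-style estimation of an affine transform y = A x + b
    so that transformed features fit a diagonal-covariance GMM.

    X: (T, D) adaptation features; means/covs: (M, D); weights: (M,)
    """
    T, D = X.shape
    A, b = np.eye(D), np.zeros(D)
    for _ in range(n_iters):
        Y = X @ A.T + b
        # E-step: posterior of each mixture component for each frame.
        log_p = (np.log(weights)
                 - 0.5 * np.sum(np.log(2 * np.pi * covs), axis=1)
                 - 0.5 * np.sum((Y[:, None, :] - means) ** 2 / covs, axis=2))
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)           # (T, M)
        # M-step: gradient step on the expected log-likelihood.
        resid = (Y[:, None, :] - means) / covs              # (T, M, D)
        grad_y = -(gamma[:, :, None] * resid).sum(axis=1)   # (T, D)
        A += lr * (grad_y.T @ X) / T
        b += lr * grad_y.mean(axis=0)
    return A, b
```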

Insufficient Adaptation Data
Issues:
- Unreliable statistics
- Too many degrees of freedom in the ST transform
Solutions:
+ Statistics smoothing approach
+ Sparse ST transform

Statistics Smoothing Approach
[Diagram: statistics from training or prior data are interpolated with statistics from the test data.]
The idea of statistics smoothing is to interpolate the statistics computed from the adaptation data with the statistics computed from some prior data (see the sketch below).
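
A minimal sketch of such an interpolation; the count-based weighting and the choice of first/second-order statistics are illustrative assumptions, not the thesis's exact scheme.

```python
import numpy as np

def smooth_stats(mu_test, cov_test, n_test, mu_prior, cov_prior, tau=100.0):
    """Interpolate test-data statistics with prior statistics.

    tau acts like a prior count: small adaptation sets lean on the
    prior, large ones trust their own statistics.
    """
    w = n_test / (n_test + tau)
    mu  = w * mu_test + (1.0 - w) * mu_prior
    cov = w * cov_test + (1.0 - w) * cov_prior
    return mu, cov
```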

Sparse ST Transformation: the Cross Transform
Special cases covered:
A) e.g. CMN, MVN, HEQ
B) e.g. fMLLR
C) e.g. RASTA, ARMA, TSN

Generalized Linear Transform (matrix form of W)

Matrix form of W

Experimental Settings
REVERB Challenge 2014 benchmark task for noisy and reverberant speech recognition, clean-condition training scheme:
- Training data: 7861 clean utterances from the WSJCAM0 database (about 17.5 hours from 92 speakers)
- Speech features: 13 MFCCs + 13 deltas + 13 delta-deltas, with MVN post-processing
- Acoustic model: 3115 tied states, 10 mixtures per state
Development (dev) and evaluation (eval) sets: real meeting-room recordings from the MC-WSJ-AV corpus
- Near setting: 100 cm between the microphone and the speaker
- Far setting: 250 cm between the microphone and the speaker

An Analysis of Window Length on the Dev Set
Based on this analysis, the window length is fixed to 21 frames for the temporal filter, the cross transform, and the full ST transform in the Eval-set experiments.

Three Adaptation Schemes
- Full batch mode: one transform per subset (near and far)
- Speaker mode: one transform per speaker
- Utterance mode: one transform per utterance

Experiments with Cascaded Transforms
Configurations (Transform 1 followed by Transform 2): cross transform + fMLLR, temporal filter + fMLLR, cross transform + temporal filter, fMLLR + cross transform, and fMLLR alone.
[Bar chart: average WER (%), roughly 58 to 67, for each configuration under speaker, full batch, and utterance modes.]
+ Cascading transforms in tandem is an effective way of using spectro-temporal information without a significant increase in the number of free parameters.
+ The best result comes from the cascade of the cross transform and fMLLR.

Hybrid Cascaded Transforms
[Diagram: utterances utt1, utt2, ..., uttN first pass through Transform 1 in full batch mode, then each utterance passes through its own Transform 2 in utterance mode.]
+ Full batch mode (fb): deals with session-wise reverberation and noise distortions
+ Utterance mode (utt): removes speaker variations and other sentence-wise variations, e.g. due to speaker movement and background noise changes
+ Statistics smoothing (smooth): reference statistics provided by the batch mode
(A sketch of this two-stage application follows below.)
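
A schematic of the hybrid scheme (my illustration; the three helper callables are hypothetical stand-ins for the estimation and application routines above):

```python
def hybrid_cascade(utterances, prior_stats,
                   estimate_transform, apply_transform, compute_stats):
    """Two-stage hybrid adaptation (sketch).

    estimate_transform(utts, ref_stats) -> transform
    apply_transform(transform, utt)     -> transformed utterance
    compute_stats(utts)                 -> reference statistics
    """
    # Stage 1: one transform for the whole session (full batch mode);
    # it absorbs session-wise reverberation and noise distortions.
    T1 = estimate_transform(utterances, prior_stats)
    stage1 = [apply_transform(T1, u) for u in utterances]

    # Reference statistics for smoothing come from the batch output.
    batch_stats = compute_stats(stage1)

    # Stage 2: one transform per utterance (utterance mode), estimated
    # with statistics smoothing against the batch-mode statistics.
    return [apply_transform(estimate_transform([u], batch_stats), u)
            for u in stage1]
```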

Cascaded Transforms vs. Hybrid Cascaded Transforms vs. Hybrid Cascaded Transforms + Statistics Smoothing
Observations:
+ The combination of batch- and utterance-mode transforms performs best.
+ (1) vs. (2): 3% absolute reduction in WER.
+ (3), adding statistics smoothing, gives the best result.

Outline
- An Overview of Robust ASR
- The 1st proposed method: Feature Adaptation Using Spectro-Temporal Information
- The 2nd proposed method (Chapter 3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments
- The 3rd proposed method: A Particle Filter Compensation Approach to Robust LVCSR
- Conclusions and Future Directions

Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments
[Diagram:]
(A) Noise Normalization: reduces the non-stationary characteristics of the additive noise, producing the NN features Y^NN.
(B) VTS Model Compensation: handles the residual noise, producing the compensated model Λ_y^NN.

Step 1: Noise Normalization
[Diagram: a noise-estimation block produces an instantaneous noise estimate n_t from the noisy feature y_t; a fraction α of n_t is removed and the same fraction of the average noise estimate μ_n is added back.]
Reconstructed from the block diagram, the NN feature is ŷ_t = y_t − α n_t + α μ_n (with the DCT matrix mapping between the spectral and cepstral domains). The hyper-parameter α controls the degree of removal of the instantaneous noise; adding back the average noise estimate reduces musical noise. A minimal sketch of this step follows below.
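
A minimal sketch under the reconstruction above; the assumption of log-Mel-domain features and a precomputed instantaneous noise track is mine, not the thesis's exact setup.

```python
import numpy as np

def noise_normalize(Y, N, alpha=0.5):
    """Noise normalization of noisy features.

    Y:     (T, D) noisy log-Mel features
    N:     (T, D) instantaneous noise estimates per frame
    alpha: degree of instantaneous-noise removal
    """
    mu_n = N.mean(axis=0)                 # average noise estimate
    # Remove a fraction of the instantaneous noise and add back the
    # same fraction of the average noise (reduces musical noise);
    # the residual noise is then approximately stationary.
    return Y - alpha * N + alpha * mu_n
```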

Step 2: Back-end Compensation
[Diagram: noise estimation on the noisy features y_t gives the noise model λ_n = {μ_n, σ_n}; VTS model compensation [Li2009] combines it with the clean model λ_x through the mismatch function y = g(x, n) to produce the noisy model λ_ŷ.]
The noisy acoustic models are approximated to first order via the Jacobian matrix of g, taking into account the hyper-parameter α carried over from the noise normalization.
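
For context, the standard first-order VTS compensation from the literature (e.g. [Acero2000, Li2009]), with C the DCT matrix and the channel term omitted for brevity; the thesis's variant additionally accounts for the α-scaled residual noise:

```latex
y \,\approx\, g(x,n) \;=\; x + C\,\log\!\big(1 + e^{\,C^{-1}(n-x)}\big),\qquad
\mu_{y} \approx g(\mu_{x}, \mu_{n}),\qquad
\Sigma_{y} \approx G\,\Sigma_{x}G^{\top} + (I-G)\,\Sigma_{n}(I-G)^{\top},\qquad
G = \left.\frac{\partial g}{\partial x}\right|_{\mu_{x},\,\mu_{n}}
```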

Step 2: Back-end Compensation
[Plot: residual noise variance as a function of α, showing a minimal point.]
We expect α = 0.5 to be the best setting.

Experimental Settings

Results
Word accuracies evaluated on test sets A and B of the AURORA2 database.

Outline
- An Overview of Robust ASR
- The 1st proposed method (Ch5): Feature Adaptation Using Spectro-Temporal Information
- The 2nd proposed method (Ch3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments
- The 3rd proposed method (Ch4): A Particle Filter Compensation Approach to Robust LVCSR
- Conclusions and Future Directions

Particle Filter Compensation (PFC) Approach to Robust LVCSR (for background noise)

PFC Framework
[Diagram: input speech features go to Decoder 1, which produces a phone sequence aligned with the input features; feature enhancement then yields enhanced speech features, which Decoder 2 turns into text.]

PFC for Clean Speech Feature Estimation
[Diagram, illustrated for phone /a/:]
a) Using the Single Pass Retraining (SPR) technique ...
b) Using the particle filter algorithm to track the posterior density of the clean speech features.
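
A generic sketch of step (b): a textbook bootstrap particle filter, not the thesis implementation; the state-transition and observation models are placeholders supplied by the caller (e.g. sampling from the aligned phone's GMM).

```python
import numpy as np

def particle_filter_enhance(Y, propose, obs_loglik, n_particles=200, seed=0):
    """Bootstrap particle filter estimate of clean features.

    Y:          (T, D) noisy feature sequence
    propose:    propose(particles) -> new particles (state transition)
    obs_loglik: obs_loglik(particles, y_t) -> (N,) log-likelihoods of
                the noisy frame given each clean-feature particle
    Returns the (T, D) posterior-mean estimate of the clean features.
    """
    rng = np.random.default_rng(seed)
    T, D = Y.shape
    particles = propose(np.zeros((n_particles, D)))
    X_hat = np.empty_like(Y)
    for t in range(T):
        particles = propose(particles)                   # predict
        logw = obs_loglik(particles, Y[t])               # weight
        w = np.exp(logw - logw.max())
        w /= w.sum()
        X_hat[t] = w @ particles                         # posterior mean
        idx = rng.choice(n_particles, n_particles, p=w)  # resample
        particles = particles[idx]
    return X_hat
```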

Experiments
- Conducted on the Aurora 4 data set.
- The decoder is from the Hidden Markov Model Toolkit (HTK).
- A relative error reduction of only 5.3% is obtained compared with a multi-condition-trained GMM-HMM system.
This work was published as: D. H. H. Nguyen, A. Mushtaq, X. Xiao, E. S. Chng, H. Li, and C.-H. Lee. A particle filter compensation approach to robust LVCSR. In APSIPA ASC, Taiwan, 2013.

Outline
- An Overview of Robust ASR
- The 1st proposed method (Ch5), the major contribution: Feature Adaptation Using Spectro-Temporal Information
- The 2nd proposed method (Ch3): Combination of Feature Enhancement and VTS Model Compensation for Non-stationary Noisy Environments
- The 3rd proposed method (Ch4): A Particle Filter Compensation Approach to Robust LVCSR
- Conclusions and Future Directions

Conclusions
- Proposed a sparse ST transform, the cross transform; explored cascaded transforms and interpolation of statistics
- Proposed to use the EM algorithm to estimate the generalized linear transform by minimizing a cost function based on the KL-divergence criterion for feature adaptation
- Proposed the integration of noise normalization with VTS model compensation
- Extended the PFC framework to work on an LVCSR system

Future Directions
- Discover a sparse transform automatically by using sparsity constraints (e.g. an L1-norm penalty)
- Introduce nonlinear hidden nodes into the transform, similar to a multilayer perceptron or deep neural network
- Investigate the proposed methods with state-of-the-art DNN acoustic models

List of Publications

References

Thank you very much!

Supplementary Slides