Sequence Discriminative Training; Robust Speech Recognition

Steve Renals
Automatic Speech Recognition
16 March 2017

Recall: Maximum likelihood estimation of HMMs

Maximum likelihood estimation (MLE) sets the parameters so as to maximise an objective function $\mathcal{F}_{\mathrm{MLE}}$:

$$\mathcal{F}_{\mathrm{MLE}} = \sum_{u=1}^{U} \log P_\lambda\big(X_u \mid M(W_u)\big)$$

for training utterances $X_1 \ldots X_U$, where $W_u$ is the word sequence given by the transcription of the $u$-th utterance, $M(W_u)$ is the corresponding HMM, and $\lambda$ is the set of HMM parameters.
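As a concrete reading of this objective, the sketch below just sums per-utterance log-likelihoods; the helper `log_likelihood` is hypothetical, standing in for a forward-algorithm scorer of the acoustics against the HMM built from the transcription.

```python
def mle_objective(utterances, transcriptions, log_likelihood):
    """Sum the per-utterance scores log P_lambda(X_u | M(W_u)).

    `log_likelihood(X, W)` is a hypothetical helper (e.g. the forward
    algorithm over the HMM composed from word sequence W); it is not
    a real library call.
    """
    return sum(log_likelihood(X, W)
               for X, W in zip(utterances, transcriptions))
```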

Maximum mutual information estimation

Maximum mutual information estimation (MMIE) aims to directly maximise the posterior probability of the correct word sequence (sometimes called conditional maximum likelihood). Using the same notation as before, with $P(w)$ representing the language model probability of word sequence $w$:

$$\mathcal{F}_{\mathrm{MMIE}} = \sum_{u=1}^{U} \log P_\lambda\big(M(W_u) \mid X_u\big) = \sum_{u=1}^{U} \log \frac{P_\lambda\big(X_u \mid M(W_u)\big)\, P(W_u)}{\sum_{w'} P_\lambda\big(X_u \mid M(w')\big)\, P(w')}$$


Maximum mutual information estimation

$$\mathcal{F}_{\mathrm{MMIE}} = \sum_{u=1}^{U} \log \frac{P_\lambda\big(X_u \mid M(W_u)\big)\, P(W_u)}{\sum_{w'} P_\lambda\big(X_u \mid M(w')\big)\, P(w')}$$

- Numerator: likelihood of the data given the correct word sequence ("clamped" to the reference alignment)
- Denominator: total likelihood of the data given all possible word sequences ("free"); equivalent to summing over all possible word sequences, estimated by the full acoustic and language models used in recognition
- The objective function $\mathcal{F}_{\mathrm{MMIE}}$ is optimised by making the correct word sequence likely (maximising the numerator) and all other word sequences unlikely (minimising the denominator)
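A minimal sketch of this criterion for a single utterance, approximating the denominator's sum over all word sequences with an N-best list of hypotheses; the score arrays are assumptions for illustration (real systems use lattices, as described next):

```python
import numpy as np
from scipy.special import logsumexp

def mmi_objective(ref_loglik, ref_loglm, hyp_logliks, hyp_loglms):
    """MMI criterion for one utterance.

    ref_loglik, ref_loglm: acoustic and language-model log-scores of
        the reference word sequence (the numerator)
    hyp_logliks, hyp_loglms: arrays of scores for competing
        hypotheses (which should include the reference), standing in
        for the denominator's sum over all word sequences
    """
    numerator = ref_loglik + ref_loglm
    denominator = logsumexp(np.asarray(hyp_logliks) + np.asarray(hyp_loglms))
    return numerator - denominator
```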

Sequence training and lattices

- Computing the denominator involves summing over all possible word sequences: estimate it by generating lattices and summing over all words in the lattice
- In practice the numerator statistics are also computed using lattices (useful for summing over multiple pronunciations)
- Generate numerator and denominator lattices for every training utterance
- The denominator lattice uses the recognition setup (with a weaker language model)
- Each word in the lattice is decoded to give a phone segmentation, and forward-backward is then used to compute the state occupation probabilities
- Lattices are not usually re-computed during training
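The forward-backward computation over a lattice can be sketched on a toy DAG representation; the `(src, dst, log_score)` arc triples and topologically ordered node numbering below are illustrative assumptions standing in for a real Kaldi/HTK lattice:

```python
import numpy as np
from collections import defaultdict
from scipy.special import logsumexp

def arc_posteriors(arcs, start, end):
    """Forward-backward over a lattice: occupation probability per arc.

    arcs: list of (src, dst, log_score) triples forming a DAG whose
    node IDs are assumed to be in topological order.
    """
    nodes = sorted({n for src, dst, _ in arcs for n in (src, dst)})
    in_arcs, out_arcs = defaultdict(list), defaultdict(list)
    for src, dst, s in arcs:
        out_arcs[src].append((dst, s))
        in_arcs[dst].append((src, s))
    # Forward pass: total log-score of all paths reaching each node
    alpha = {n: -np.inf for n in nodes}
    alpha[start] = 0.0
    for n in nodes:
        if in_arcs[n]:
            alpha[n] = logsumexp([alpha[src] + s for src, s in in_arcs[n]])
    # Backward pass: total log-score of all paths from node to the end
    beta = {n: -np.inf for n in nodes}
    beta[end] = 0.0
    for n in reversed(nodes):
        if out_arcs[n]:
            beta[n] = logsumexp([beta[dst] + s for dst, s in out_arcs[n]])
    total = alpha[end]
    return [np.exp(alpha[src] + s + beta[dst] - total)
            for src, dst, s in arcs]
```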

MMIE is sequence discriminative training

- Sequence: like forward-backward (MLE) training, the overall objective function is at the sequence level: maximise the posterior probability of the word sequence given the acoustics, $P_\lambda(M(W_u) \mid X_u)$
- Discriminative: unlike forward-backward (MLE) training, the overall MMIE objective function is discriminative. To maximise MMI:
  - maximise the numerator by increasing the likelihood of the data given the correct word sequence
  - minimise the denominator by decreasing the total likelihood of the data given all possible word sequences
- This results in pushing up the correct word sequence, while pulling down the rest

MPE: Minimum phone error

- Basic idea: adjust the optimisation criterion so that it is directly related to word error rate
- Minimum phone error (MPE) criterion:

$$\mathcal{F}_{\mathrm{MPE}} = \sum_{u=1}^{U} \log \frac{\sum_{W} P_\lambda\big(X_u \mid M(W)\big)\, P(W)\, A(W, W_u)}{\sum_{W'} P_\lambda\big(X_u \mid M(W')\big)\, P(W')}$$

- $A(W, W_u)$ is the phone transcription accuracy of the sentence $W$ given the reference $W_u$
- $\mathcal{F}_{\mathrm{MPE}}$ is a weighted average of the raw phone accuracy over all possible sentences $W$
- Although MPE optimises a phone-level accuracy, it does so in the context of a word-level system: it is optimised by finding probable sentences with low phone error rates
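The inner quantity, the posterior-weighted phone accuracy for one utterance, can be sketched over an N-best list; the Levenshtein-based accuracy is a common approximation to $A(W, W_u)$, and the N-best approximation to the lattice sum is an assumption for illustration:

```python
import numpy as np
from scipy.special import logsumexp

def phone_accuracy(hyp, ref):
    """A(W, W_u) approximated as reference length minus the
    Levenshtein edit distance between phone sequences."""
    d = np.zeros((len(hyp) + 1, len(ref) + 1), dtype=int)
    d[:, 0] = np.arange(len(hyp) + 1)
    d[0, :] = np.arange(len(ref) + 1)
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            d[i, j] = min(d[i - 1, j] + 1,  # edit operations
                          d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (hyp[i - 1] != ref[j - 1]))
    return len(ref) - d[len(hyp), len(ref)]

def expected_phone_accuracy(log_scores, hyps, ref):
    """Posterior-weighted average phone accuracy over an N-best list,
    i.e. the ratio inside the per-utterance MPE term; log_scores are
    combined acoustic + LM log-scores of the hypotheses."""
    log_post = np.asarray(log_scores) - logsumexp(log_scores)
    accs = np.array([phone_accuracy(h, ref) for h in hyps])
    return float(np.sum(np.exp(log_post) * accs))
```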

HMM/DNN systems

- DNN-based systems are discriminative: the cross-entropy (CE) training criterion with a softmax output layer pushes up the correct label and pulls down competing labels
- But CE is a frame-based criterion; we would like a sequence-level training criterion for DNNs, operating at the word sequence level
- Can we train DNN systems with an MMI-type objective function?

Sequence training of hybrid HMM/DNN systems

Can we train DNN systems with an MMI-type objective function? Yes:
- Forward- and back-propagation equations are structurally similar to the forward and backward recursions in HMM training
- Initially train the DNN framewise using the cross-entropy (CE) error function
- Use the CE-trained model to generate alignments and lattices for sequence training
- Use the CE-trained weights to initialise the weights for sequence training
- Train using back-propagation with a sequence training objective function (e.g. MMI)
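In practice the only change to the back-propagation step is the error signal at the output layer: for MMI it is the difference between numerator and denominator state occupancies obtained by lattice forward-backward (the standard result used in Vesely et al, 2013). A minimal sketch:

```python
import numpy as np

def mmi_error_signal(gamma_num, gamma_den):
    """Output-layer error signal for MMI sequence training of a DNN.

    gamma_num: (T, S) state occupation probabilities from the
               numerator (reference) lattice, via forward-backward
    gamma_den: (T, S) occupancies from the denominator lattice

    The gradient of the MMI objective w.r.t. the pre-softmax
    activations is (gamma_num - gamma_den); this replaces the usual
    frame-level cross-entropy error signal in back-propagation.
    """
    return np.asarray(gamma_num) - np.asarray(gamma_den)
```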

Sequence training results on Switchboard (Kaldi)

Results on the Switchboard Hub5 '00 test set, trained on the 300h training set, comparing maximum likelihood (ML) and discriminatively (BMMI) trained GMMs with framewise cross-entropy (CE) and sequence-trained (MMI) DNNs. The GMM systems use speaker adaptive training (SAT). All systems had 8859 tied triphone states; the GMMs used 200k Gaussians, the DNNs 6 hidden layers each with 2048 hidden units.

WER (%)            SWB    CHE    Total
GMM ML (+SAT)      21.2   36.4   28.8
GMM BMMI (+SAT)    18.6   33.0   25.8
DNN CE             14.2   25.7   20.0
DNN MMI            12.9   24.6   18.8

(Vesely et al, 2013)

Robust Speech Recognition

Additive Noise

- Multiple acoustic sources are the norm rather than the exception; from the point of view of trying to recognise a single stream of speech, the other sources are additive noise
- Stationary noise: frequency spectrum does not change over time (e.g. air conditioning, car noise at constant speed)
- Non-stationary noise: time-dependent frequency spectrum (e.g. breaking glass, workshop noise, music, speech)
- Measure the noise level as the SNR (signal-to-noise ratio), in dB: 30 dB SNR sounds noise-free; 0 dB SNR has equal signal and noise energy (a sketch of the computation follows this list)
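SNR in dB is ten times the base-10 log of the ratio of signal power to noise power; a minimal sketch, assuming time-aligned waveforms as numpy arrays:

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from separate signal and noise
    waveforms. 0 dB means equal signal and noise energy; 30 dB
    sounds essentially noise-free."""
    p_signal = np.mean(np.square(signal))
    p_noise = np.mean(np.square(noise))
    return 10.0 * np.log10(p_signal / p_noise)
```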

Feature normalisation

- Basic idea: transform the features to reduce the mismatch between training and test
- Cepstral Mean Normalisation (CMN): subtract the mean of the feature vectors from each feature vector, so each feature vector element has a mean of 0
- CMN makes features robust to some linear filtering of the signal, adding robustness to varying microphones, telephone channels, etc.
- Cepstral Variance Normalisation (CVN): divide each feature vector element by its standard deviation over the utterance, so each feature vector element has a variance of 1
- Combined cepstral mean and variance normalisation (CMN/CVN):

$$\hat{x}_i = \frac{x_i - \mu(x)}{\sigma(x)}$$
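A minimal per-utterance CMN/CVN sketch in numpy; the epsilon guarding against constant feature dimensions is an implementation detail, not from the slide:

```python
import numpy as np

def cmvn(features, eps=1e-10):
    """Per-utterance cepstral mean and variance normalisation.

    features: (T, D) array of feature vectors. Each of the D
    dimensions is shifted to zero mean and scaled to unit variance
    over the utterance.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)
```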

Feature compensation: Spectral subtraction

- Basic idea: estimate the noise spectrum and subtract it from the observed spectra; any feature vector can then be computed from the noise-subtracted spectrum
- Problems:
  - the noise spectrum must be estimated from a period of non-speech, which requires good speech/non-speech detection
  - errors in the noise estimate (perhaps arising from speech/non-speech detection errors) result in over- or under-compensation of the spectrum
- Low computational cost, and widely used in practice: the ETSI advanced front end uses spectral subtraction and CMN
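A minimal sketch of the subtraction step; estimating the noise from the first few frames and the choice of spectral floor are assumptions for illustration (a real front end would use a speech/non-speech detector):

```python
import numpy as np

def spectral_subtraction(mag_spec, n_noise_frames=10, floor=0.01):
    """Subtract a noise magnitude-spectrum estimate from each frame.

    mag_spec: (T, F) magnitude spectrogram. The noise spectrum is
    estimated from the first n_noise_frames, assumed to be
    non-speech. Negative results are clipped to a small spectral
    floor, which limits "musical noise" artefacts from
    over-subtraction.
    """
    noise_est = mag_spec[:n_noise_frames].mean(axis=0)
    return np.maximum(mag_spec - noise_est, floor * noise_est)
```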

Multi-condition Training

- Basic idea: don't train on clean speech; train on speech with a similar noise level (and noise type) to the test conditions
- Matched-condition training, in the same noise conditions as testing, is rarely possible, since the test conditions are nearly always partly unknown
- Multi-condition training: train with speech data in a variety of noise conditions; recorded noise can be artificially mixed with clean speech at any desired SNR to create a multi-style training set (a sketch follows this list)
- Advantage: training data is much better matched to the test conditions
- Disadvantage: acoustic model components become less discriminative and less well matched to the training data
- Model adaptation can further reduce errors, using an adaptation technique such as MLLR (Seltzer, 2013)
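A minimal sketch of mixing noise into clean speech at a target SNR, as used to build a multi-condition training set; the waveform arrays and the tiling of short noise clips are illustrative assumptions:

```python
import numpy as np

def mix_at_snr(clean, noise, target_snr_db):
    """Scale a noise waveform and add it to clean speech so that the
    mixture has the requested SNR in dB."""
    # Tile or truncate the noise to match the clean speech length
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    p_clean = np.mean(np.square(clean))
    p_noise = np.mean(np.square(noise))
    # Choose scale so 10*log10(p_clean / (scale**2 * p_noise)) == target
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return clean + scale * noise
```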

Neural Networks raise all boats

[Figure: WERs of GMM and DNN systems on Microsoft voice search data at varying SNRs; WERs improve as SNR increases (Huang, 2014)]

Current approaches to robust speech recognition

Decoupled preprocessing: acoustic processing independent of downstream activity
- Pro: simple
- Con: removes variability
- Example: beamforming for multi-microphone distant speech recognition [Swietojanski, 2013]

(Slide from Mike Seltzer)

Current approaches to robust speech recognition

Integrated processing: treat acoustic processing as the initial layers of the network, and optimise its parameters with back-propagation
- Pro: should be optimal for the model
- Con: computationally expensive, hard to move the needle
- Examples: direct waveform systems; mask estimation [Narayanan, 2014]; mel filterbank optimisation [Sainath, 2013]

(Slide from Mike Seltzer)

Current approaches to robust speech recognition

Augmented information: add side information to the network (additional input nodes, a different objective function, ...)
- Pros: preserves variability, adds knowledge, maintains the representation
- Con: not a physical model
- Examples: noise-aware training, factorised noise codes (i-vectors)

(Slide from Mike Seltzer)
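As an illustration of the noise-aware training idea, a per-utterance noise estimate can be appended to every input frame so the network can condition on it; estimating the noise from leading frames is an assumption (a Seltzer et al, 2013-style setup):

```python
import numpy as np

def noise_aware_input(features, n_noise_frames=10):
    """Append a fixed per-utterance noise estimate (the mean of the
    first n_noise_frames, assumed non-speech) to every frame.

    features: (T, D) array; returns a (T, 2*D) network input.
    """
    noise_est = features[:n_noise_frames].mean(axis=0)
    return np.hstack([features, np.tile(noise_est, (len(features), 1))])
```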

Summary

- Sequence training: discriminatively optimise a GMM or DNN with a sentence (sequence) level criterion rather than a frame-level criterion
- Noise robustness:
  - important for practical applications of speech recognition
  - achieve robustness through feature invariance
  - achieve invariance through large training sets and deep networks
  - much active research in developing architectures for robust ASR

Reading

HMM discriminative training: Sec 27.3.1 of S. Young (2008), "HMMs and Related Speech Recognition Technologies", in Springer Handbook of Speech Processing, Benesty, Sondhi and Huang (eds), chapter 27, pp. 539-557. http://www.inf.ed.ac.uk/teaching/courses/asr/2010-11/restrict/young.pdf

NN sequence training: K. Vesely et al (2013), "Sequence-discriminative training of deep neural networks", Interspeech 2013. http://homepages.inf.ed.ac.uk/aghoshal/pubs/is13-dnn_seq.pdf

DNNs for robust ASR:
- M. Seltzer et al (2013), "An Investigation of Deep Neural Networks for Noise Robust Speech Recognition". https://www.microsoft.com/en-us/research/publication/an-investigation-of-deep-neural-networks-for-noise-robust-speech-recognition/
- Y. Huang et al (2014), "A comparative analytic study on the Gaussian mixture and context-dependent deep neural network hidden Markov models", Interspeech 2014. http://www.isca-speech.org/archive/interspeech_2014/i14_1895.html