Speaker Adaptation
Steve Renals
Automatic Speech Recognition (ASR Lectures 13&14)
10, 13 March 2014

Overview
- Introduction: speaker-specific variation, modes of adaptation
- Model-based adaptation: MAP
- Model-based adaptation: MLLR
- Model-based adaptation: speaker space models
- Speaker normalization: VTLN
- Adaptive training
- Adaptation for hybrid HMM/NN systems

Speaker independent / dependent / adaptive
- Speaker independent (SI) systems have long been the focus for research in transcription, dialogue systems, etc.
- Speaker dependent (SD) systems can result in word error rates 2-3 times lower than SI systems (given the same amount of training data)
- A speaker adaptive (SA) system is what we would like:
  - error rates similar to SD systems
  - building on an SI system
  - requiring only a small fraction of the speaker-specific training data used by an SD system

Speaker-specific variation
- Acoustic model:
  - speaking styles
  - accents
  - speech production anatomy (eg length of the vocal tract)
  - also non-speaker variation, such as channel conditions (telephone, reverberant room, close-talking mic) and application domain
  - speaker adaptation of acoustic models aims to reduce the mismatch between test data and the models
- Pronunciation model: speaker-specific, consistent changes in pronunciation
- Language model: user-specific documents (exploited in personal dictation systems)

Modes of adaptation
- Supervised or unsupervised
  - Supervised: the word-level transcription of the adaptation data is known (and HMMs may be constructed)
  - Unsupervised: the transcription must be estimated (eg using recognition output)
- Static or dynamic
  - Static: all adaptation data is presented to the system in a block before the final system is estimated (eg as used in enrollment in a dictation system)
  - Dynamic: adaptation data is incrementally available, and models must be adapted before all adaptation data is available (eg as used in a spoken dialogue system)

Approaches to adaptation
- Model-based: adapt the parameters of the acoustic models to better match the observed data
  - Maximum a posteriori (MAP) adaptation of HMM/GMM parameters
  - Maximum likelihood linear regression (MLLR) of Gaussian parameters
  - Linear input network (LIN) for neural networks
- Speaker normalization: normalize the acoustic data to reduce mismatch with the acoustic models
  - Vocal tract length normalization (VTLN)
  - Constrained MLLR (CMLLR): model-based normalization!
- Speaker space: estimate multiple sets of acoustic models, characterizing new speakers in terms of these model sets
  - Cluster-adaptive training
  - Eigenvoices
  - Speaker codes

Adaptation and normalization of acoustic models
[Diagrams contrasting feature-space and model-space operations:]
- Baseline: training conditions X_train -> training -> models M_train; test condition X_test -> recognition with M_train
- Model adaptation: M_train is adapted to the test condition, giving M_test, which is used for recognition of X_test
- Normalization: X_train is normalized to a reference condition before training canonical models; X_test is normalized the same way before recognition
- Adaptive training: adaptive training on X_train yields reference-condition (canonical) models; at test time, adapting these to the test condition gives M_test for recognition

Model-based adaptation: the MAP family
Basic idea: use the SI models as a prior probability distribution over model parameters when estimating using speaker-specific data
- Theoretically well-motivated approach to incorporating the knowledge inherent in the SI model parameters
- Maximum likelihood (ML) training sets the model parameters \lambda to maximize the likelihood p(X \mid \lambda)
- Maximum a posteriori (MAP) training maximizes the posterior of the parameters given the data:
    p(\lambda \mid X) \propto p(X \mid \lambda) \, p_0(\lambda)
  where p_0(\lambda) is the prior distribution of the parameters
- The use of a prior distribution, based on the SI models, means that less data is required to estimate the speaker-specific models: we are not starting from complete ignorance

Recall: ML estimation of GMM/HMM
The mean of the mth Gaussian component of the jth state is estimated using a weighted average:
    \mu_{jm} = \frac{\sum_n \gamma_{jm}(n) x_n}{\sum_n \gamma_{jm}(n)}
where \gamma_{jm}(n) is the component occupation probability at time n.
The covariance of the Gaussian component is given by:
    \Sigma_{jm} = \frac{\sum_n \gamma_{jm}(n) (x_n - \mu_{jm})(x_n - \mu_{jm})^T}{\sum_n \gamma_{jm}(n)}
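
As a concrete illustration, a minimal numpy sketch of these weighted-average updates; `X` and `gamma` are assumed to hold the observations and the component occupation probabilities produced by forward-backward:

```python
import numpy as np

def ml_gaussian_update(X, gamma):
    """ML re-estimation of one Gaussian component's mean and covariance.

    X:     (N, d) observation vectors x_n
    gamma: (N,)   occupation probabilities gamma_jm(n) for this component
    """
    occ = gamma.sum()                               # soft occupation count
    mu = (gamma[:, None] * X).sum(axis=0) / occ     # weighted average mean
    diff = X - mu
    sigma = (gamma[:, None] * diff).T @ diff / occ  # weighted outer products
    return mu, sigma
```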

MAP estimation
What is p_0(\lambda)?
- Conjugate prior: the prior distribution has the same form as the posterior. There is no simple conjugate prior for GMMs, but an intuitively understandable approach may be employed.
- If the prior mean is \mu_0, then the MAP estimate of the adapted mean \hat{\mu} of a Gaussian is given by:
    \hat{\mu} = \frac{\tau \mu_0 + \sum_n \gamma(n) x_n}{\tau + \sum_n \gamma(n)}
  where:
  - \tau is a hyperparameter that controls the balance between the ML estimate of the mean and its prior value; typically \tau is in the range 2-20
  - x_n is the adaptation vector at time n
  - \gamma(n) is the occupation probability of this Gaussian at time n
- As the amount of training data increases, the MAP estimate converges to the ML estimate
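
A minimal sketch of this interpolated update (the statistics layout mirrors the ML sketch above; `tau=10.0` is just an illustrative default from the 2-20 range):

```python
import numpy as np

def map_mean_update(mu_0, X, gamma, tau=10.0):
    """MAP adaptation of a Gaussian mean towards speaker-specific data.

    mu_0:  (d,)   prior (speaker-independent) mean
    X:     (N, d) adaptation vectors x_n
    gamma: (N,)   occupation probability of this Gaussian at each frame
    tau:   prior weight; typically in the range 2-20
    """
    # With little data the prior dominates; as the occupation count grows,
    # the estimate converges to the ML weighted average.
    occ = gamma.sum()
    return (tau * mu_0 + (gamma[:, None] * X).sum(axis=0)) / (tau + occ)
```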

Local estimation
The main drawback of MAP adaptation is that it is local:
- only the parameters belonging to Gaussians of observed states will be adapted
- large vocabulary speech recognition systems have about 10^5 Gaussians: most will not be adapted
- structural MAP (SMAP) approaches have been introduced to share Gaussians
- the MLLR family of adaptation approaches addresses this by assuming that transformations for a specific speaker are systematic across Gaussians, states and models
MAP adaptation is very useful for domain adaptation:
- example: adapting a conversational telephone speech system (100s of hours of data) to multiparty meetings (10s of hours of data) works well with MAP

SMAP: Structural MAP
Basic idea: share Gaussians by organising them in a tree, whose root contains all the Gaussians
- At each node in the tree, compute a mean offset and a diagonal variance scaling term
- For each node, its parent is used as a prior distribution
- This has been shown to speed adaptation compared with standard MAP, while converging to the same solution as standard MAP in the large-data limit

The Linear Transform family
Basic idea: rather than directly adapting the model parameters, estimate a transform which may be applied to the Gaussian means and covariances
- Linear transform applied to the parameters of a set of Gaussians: adaptation transform parameters are shared across Gaussians
- This addresses the locality problem arising in MAP adaptation, since each adaptation data point can affect many (or even all) of the Gaussians in the system
- There are relatively few adaptation parameters, so estimation is robust
- Maximum Likelihood Linear Regression (MLLR) is the best-known linear transform approach to speaker adaptation

MLLR: Maximum Likelihood Linear Regression
MLLR is the best-known linear transform approach to speaker adaptation
- Affine transform of the mean parameters:
    \hat{\mu} = A\mu + b
- If the observation vectors are d-dimensional, then A is a d x d matrix and b is a d-dimensional vector
- If we define W = [b \; A] and \eta = [1 \; \mu^T]^T, then we can write:
    \hat{\mu} = W\eta
- In MLLR, W is estimated so as to maximize the likelihood of the adaptation data
- A single transform W can be shared across a set of Gaussian components (even all of them!)
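
Applying an estimated transform is then a single matrix product per mean. A minimal numpy sketch (W and the means are assumed given; estimating W is sketched after the "Estimating the transforms" slide below):

```python
import numpy as np

def apply_mllr(W, means):
    """Apply an MLLR mean transform: mu_hat = W eta = A mu + b.

    W:     (d, d+1) transform [b A], shared across a regression class
    means: (M, d)   means of the Gaussians in that class
    """
    eta = np.hstack([np.ones((means.shape[0], 1)), means])  # rows are eta^T
    return eta @ W.T                                        # rows are mu_hat^T
```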

Regression classes
- The number of transforms may be obtained automatically
- A set of Gaussian components that share a transform is called a regression class
- Obtain the regression classes by constructing a regression class tree:
  - each node in the tree represents a regression class sharing a transform
  - for an adaptation set, work down the tree until arriving at the most specific set of nodes for which there is sufficient data (one simple scheme is sketched below)
  - the regression class tree is constructed in a similar way to the state clustering tree
- In practice the number of regression classes may be very small: one per context-independent phone class, one per broad class, or even just two (speech/non-speech)
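
As an illustration, a minimal sketch of one way such a tree walk could be implemented; the node structure, the counts mapping, and the `min_count` threshold are assumptions for this sketch, not any specific toolkit's API:

```python
def select_regression_classes(node, counts, min_count=1000.0):
    """Return the most specific tree nodes with sufficient adaptation data.

    node:      tree node with .children (list) and .gaussians (iterable of ids)
    counts:    dict mapping Gaussian id -> occupation count on adaptation data
    min_count: minimum occupation count needed to estimate a transform
    """
    def occ(n):
        return sum(counts.get(g, 0.0) for g in n.gaussians)

    # Descend only while every child can support its own transform, so the
    # selected nodes always partition the full set of Gaussians.
    if node.children and all(occ(c) >= min_count for c in node.children):
        classes = []
        for child in node.children:
            classes.extend(select_regression_classes(child, counts, min_count))
        return classes
    return [node]
```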

Estimating the transforms
The linear transformation matrix W is obtained by finding the setting that maximizes the log likelihood.
Mean adaptation: the log likelihood is
    L = \sum_r \sum_n \gamma_r(n) \log \left( K_r \exp \left( -\frac{1}{2} (x_n - W\eta_r)^T \Sigma_r^{-1} (x_n - W\eta_r) \right) \right)
where r ranges over the components belonging to the regression class and K_r is the Gaussian normalizing constant.
- Differentiating L and setting to 0 results in an equation for W: there is no closed-form solution if \Sigma is full covariance; it can be solved if \Sigma is diagonal (but requires a matrix inversion)
- Variance adaptation is also possible
- See Gales and Woodland (1996), Gales (1998) for details
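
For the diagonal-covariance case the maximization decouples row by row: each row w_i of W solves a linear system G_i w_i = k_i built from occupation-weighted statistics. A sketch, assuming the per-Gaussian sufficient statistics have already been accumulated (the dict layout is illustrative):

```python
import numpy as np

def estimate_mllr_mean_transform(gaussians):
    """Row-by-row MLLR mean transform for diagonal-covariance Gaussians.

    gaussians: list of per-Gaussian sufficient statistics, each a dict with
      'occ':   sum_n gamma_r(n)                      (scalar)
      'x_sum': sum_n gamma_r(n) x_n                  (d,)
      'eta':   extended mean [1, mu_r]               (d+1,)
      'var':   diagonal of Sigma_r                   (d,)
    Returns W of shape (d, d+1): each row i solves G_i w_i = k_i.
    """
    d = gaussians[0]['var'].shape[0]
    W = np.zeros((d, d + 1))
    for i in range(d):
        G = np.zeros((d + 1, d + 1))
        k = np.zeros(d + 1)
        for g in gaussians:
            prec = 1.0 / g['var'][i]    # inverse variance of dimension i
            G += prec * g['occ'] * np.outer(g['eta'], g['eta'])
            k += prec * g['x_sum'][i] * g['eta']
        W[i] = np.linalg.solve(G, k)    # the matrix inversion noted above
    return W
```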

MLLR in practice
- Mean-only MLLR results in a 10-15% relative reduction in WER
- Few regression classes and well-estimated transforms work best in practice
- Robust adaptation is available with about 1 minute of speech; performance similar to SD models is available with 30 minutes of adaptation data
- Such linear transforms can account for any systematic (linear) variation from the speaker-independent models, for example those caused by channel effects

Constrained MLLR (CMLLR)
Basic idea: use the same linear transform for both mean and covariance
    \hat{\mu} = A'\mu - b'
    \hat{\Sigma} = A'\Sigma A'^T
- No closed-form solution, but it can be solved iteratively
- Log likelihood for CMLLR:
    L = \log \mathcal{N}(A x_n + b; \mu, \Sigma) + \log |A|, \quad A = A'^{-1}, \; b = A b'
- Equivalent to applying the linear transform to the data! Also called fMLLR (feature-space MLLR)
- The iterative solution is amenable to online/dynamic adaptation, by using just one iteration for each increment
- Similar improvement in accuracy to standard MLLR
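
Because CMLLR is equivalent to a feature-space transform, applying it is trivial once A and b are estimated; a minimal sketch:

```python
import numpy as np

def apply_fmllr(features, A, b):
    """Apply a CMLLR/fMLLR transform in feature space: x_hat = A x + b.

    features: (N, d) acoustic feature vectors for one speaker
    A: (d, d), b: (d,) estimated to maximize likelihood under the SI models.
    During likelihood computation the log|det A| Jacobian term is added.
    """
    return features @ A.T + b
```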

Speaker-adaptive training (SAT)
Basic idea: rather than SI seed (canonical) models, construct models designed for adaptation
- Estimate the parameters of the canonical models by training MLLR mean transforms for each training speaker
- Train using the MLLR transform for each speaker; interleave Gaussian parameter estimation and MLLR transform estimation
- SAT results in much higher training likelihoods, and improved recognition results
- But: increased training complexity and storage requirements
- SAT using CMLLR corresponds to a type of speaker normalization at training time

Speaker space methods
- Gender-dependent models: sets of HMMs for male and for female speakers
- Speaker clustering: sets of HMMs for different speaker clusters
- Drawbacks:
  - hard division of speakers into groups
  - fragments the training data
- Weighted speaker cluster approaches use an interpolated model to represent the current speaker:
  - cluster-adaptive training
  - eigenvoices

Cluster-adaptive training
Basic idea: represent a speaker as a weighted sum of speaker cluster models
- Different cluster models have shared variances and mixture weights, but separate means
- For a new speaker, the mean is defined as:
    \mu = \sum_c \lambda_c \mu_c
- Given the canonical models, only the \lambda_c mixing parameters need to be estimated for each speaker
- Given sets of weights for individual speakers, the means of the clusters may be updated
- CAT can reduce WER in large vocabulary tasks by about 4-8% relative
For more, see Gales (2000), "Cluster adaptive training of hidden Markov models", IEEE Trans Speech and Audio Processing, 8:417-428.
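
Estimating the per-speaker weights is a small linear problem: substituting \mu = M\lambda into the Gaussian log likelihood and maximizing over \lambda gives a CxC system. A sketch assuming diagonal covariances (the statistics layout is illustrative, not a toolkit API):

```python
import numpy as np

def estimate_cat_weights(stats):
    """ML estimate of a speaker's cluster weights lambda for CAT.

    stats: list of (gamma, x, M, prec) tuples, one per (frame, Gaussian):
      gamma: occupation probability of the Gaussian at that frame
      x:     (d,)   observation vector
      M:     (d, C) cluster means [mu_1 ... mu_C] for this Gaussian
      prec:  (d,)   diagonal inverse covariance of this Gaussian
    Solves the CxC system arising from mu = M @ lam.
    """
    C = stats[0][2].shape[1]
    G = np.zeros((C, C))
    k = np.zeros(C)
    for gamma, x, M, prec in stats:
        Mp = prec[:, None] * M          # Sigma^{-1} M for diagonal Sigma
        G += gamma * (M.T @ Mp)         # accumulates M^T Sigma^{-1} M
        k += gamma * (Mp.T @ x)         # accumulates M^T Sigma^{-1} x
    return np.linalg.solve(G, k)
```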

Eigenvoices
Basic idea: construct a speaker space from a set of SD HMMs
- Could regard each canonical model as forming a dimension of the speaker space
- Generalize by computing PCA over a set of supervectors (concatenated mean vectors) to form the speaker space: each principal dimension is an eigenvoice
- Represent a new speaker as a combination of eigenvoices
- Close relation to CAT
- Computationally intensive; does not scale well to large vocabulary systems
For more, see Kuhn et al (2000), "Rapid speaker adaptation in eigenvoice space", IEEE Trans Speech and Audio Processing, 8:695-707.
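
A minimal numpy sketch of constructing the eigenvoice space by PCA over supervectors (the supervector matrix and the number of eigenvoices are assumed inputs):

```python
import numpy as np

def build_eigenvoice_space(supervectors, n_eigenvoices=10):
    """Build an eigenvoice space by PCA over SD-model supervectors.

    supervectors: (S, D) matrix, one row per training speaker, formed by
        concatenating all Gaussian mean vectors of that speaker's SD model.
    Returns the mean supervector and the top principal directions; a new
    speaker is then modelled as mean + weights @ eigenvoices, with the
    weights estimated from adaptation data.
    """
    mean = supervectors.mean(axis=0)
    _, _, Vt = np.linalg.svd(supervectors - mean, full_matrices=False)
    return mean, Vt[:n_eigenvoices]     # each row is one eigenvoice
```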

Feature normalization
Basic idea: transform the features to reduce mismatch between training and test
- Cepstral mean normalization (CMN): subtract the average feature value from each feature, so each feature has a mean value of 0; makes features robust to some linear filtering of the signal (channel variation)
- Cepstral variance normalization (CVN): divide each feature vector element by its standard deviation, so each feature vector element has a variance of 1
- Cepstral mean and variance normalization (CMN/CVN):
    \hat{x}_i = \frac{x_i - \mu(x)}{\sigma(x)}
- Compute the mean and variance statistics over the longest available segments with the same speaker/channel
- Real-time normalization: compute a moving average
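
A minimal numpy sketch of segment-level CMN/CVN (the small `eps` guarding against zero variance is an added safeguard, not part of the formula above):

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalization over one segment.

    features: (N, d) cepstral features from a single speaker/channel;
    each dimension is shifted to zero mean and scaled to unit variance.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)   # eps guards zero variance
```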

Vocal tract length normalization (VTLN)
Basic idea: normalize the acoustic data to take account of changes in vocal tract length
- Vocal tract length (VTL):
  - first larynx descent in the first 2-3 years of life
  - VTL grows according to body size, and is sex-dependent
  - puberty: second larynx descent for males
- VTL has a large effect on the spectrum:
  - tube acoustic model: formant positions are inversely proportional to VTL
  - observation: formant frequencies for women are 20% higher than for men (on average)
- VTLN: compensate for differences between speakers via a warping of the frequency axis

Approaches to VTLN
Warp the frequency axis: f \mapsto \hat{f} = g_\alpha(f)
- Classify by frequency warping function:
  - piecewise linear
  - power function
  - bilinear transform
- Classify by estimation of the warping factor \alpha:
  - signal-based: estimated directly from the acoustic signal, through explicit estimation of formant positions
  - model-based: maximize the likelihood of the observed data given acoustic models and a transcription; \alpha is another parameter set so as to maximize the likelihood

Warping functions: piecewise linear
[Figure: piecewise-linear warping functions, shown for \alpha = 0.8, 1.0, 1.2; the central segment has slope \alpha, i.e. \hat{f} = \alpha f]
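
A sketch of one common form of piecewise-linear warp; the 0.85 break point and the mapping of the Nyquist frequency onto itself are illustrative assumptions, and the central segment is \hat{f} = \alpha f as in the figure:

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_nyquist=8000.0, break_frac=0.85):
    """Piecewise-linear VTLN warp f_hat = g_alpha(f).

    Frequencies below the break point are scaled by alpha (f_hat = alpha*f);
    above it, a second linear segment maps the Nyquist frequency to itself.
    """
    f = np.asarray(f, dtype=float)
    fb = break_frac * f_nyquist                        # segment boundary
    slope = (f_nyquist - alpha * fb) / (f_nyquist - fb)
    return np.where(f <= fb, alpha * f, alpha * fb + slope * (f - fb))
```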

Model-based VTLN
Basic idea: warp the acoustic features (for a speaker) to better fit the models, rather than warping the models to fit the features!
- Estimate the warping factor \alpha so as to maximise the likelihood of the data given the acoustic models
- After estimating the warp factors, normalize the acoustic data and re-estimate the models; the process may be iterated
- Model-based VTLN does not directly estimate vocal tract size; rather, it estimates an optimal frequency warping, which may be affected by other factors (eg F0)
- Exhaustive search for the optimal warping factor would be expensive: approximate the log likelihood as a quadratic in \alpha, and find the maximum using a line search (Brent's method)
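
A sketch of the warp-factor search; `neg_log_likelihood` is a placeholder you would implement by warping the speaker's features by alpha and scoring them against the acoustic models with the (estimated or known) transcription, and the 0.8-1.2 range is an assumed plausible bracket:

```python
from scipy.optimize import minimize_scalar

def estimate_warp_factor(neg_log_likelihood, lo=0.8, hi=1.2):
    """One-dimensional search for the ML warp factor of a speaker.

    neg_log_likelihood: callable alpha -> -log p(warped data | models).
    """
    # Brent-style bounded line search over the plausible warp range.
    res = minimize_scalar(neg_log_likelihood, bounds=(lo, hi),
                          method='bounded')
    return res.x
```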

VTLN: Warp factor estimation
[Figures: histograms of estimated warp factors for female and for male speakers, before normalization and after passes 1-3 of iterative re-estimation]

Speaker adaptation in hybrid HMM/NN systems: CMLLR feature transformation
Basic idea: if an HMM/GMM system is used to estimate a single constrained MLLR adaptation transform, this can be viewed as a feature-space transform
- Use an HMM/GMM system with the same tied state space to estimate a single CMLLR transform for a given speaker, and use this to transform the input speech to the DNN for the target speaker
- Can operate unsupervised (since the GMM system estimates the transform)
- Limited to a single transform (regression class)

Speaker adaptation in hybrid HMM/NN systems: LIN
Basic idea: a single linear input layer trained to map speaker-dependent input speech to the speaker-independent network
- Training: the linear input network (LIN) can either be fixed as the identity or (adaptive training) be trained along with the other parameters
- Testing: freeze the main (speaker-independent) network and propagate gradients for speech from the target speaker back to the LIN, which is updated: a linear transform is learned for each speaker (see the sketch after the figure below)
- Requires supervised training data

LIN
[Figure: hybrid network with ~6000 CD phone outputs, 3-8 hidden layers of ~2000 hidden units each, and 9x39 MFCC inputs. A linear input network transforms the inputs before the main network; the main network is fixed while the LIN is adapted.]
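
A minimal PyTorch sketch of LIN adaptation under these assumptions: an identity-initialized linear layer, a frozen SI network, and cross-entropy against aligned CD phone targets; the function name, loader, and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

def adapt_lin(si_net, adapt_loader, input_dim, epochs=3, lr=1e-3):
    """Train a per-speaker linear input network in front of a frozen SI net.

    si_net:       speaker-independent network mapping features -> CD logits
    adapt_loader: (features, aligned CD phone target) batches for the speaker
    """
    lin = nn.Linear(input_dim, input_dim)
    with torch.no_grad():                        # initialize as the identity,
        lin.weight.copy_(torch.eye(input_dim))   # i.e. a no-op transform
        lin.bias.zero_()
    for p in si_net.parameters():                # main network stays fixed
        p.requires_grad = False
    opt = torch.optim.SGD(lin.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in adapt_loader:
            opt.zero_grad()
            loss = loss_fn(si_net(lin(x)), y)    # gradients reach only the LIN
            loss.backward()
            opt.step()
    return lin
```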

Speaker adaptation in hybrid HMM/NN systems: Speaker codes
Basic idea: learn a short speaker code vector for each talker, which can be freely adjusted to fit the adaptation data
- An adaptation NN transforms the input features, conditioned on the speaker code; stacking it under the original network gives a composite NN
- The NN inputs are concatenated feature vectors within a context window; the baseline NN is trained without speaker label information
- Standard backpropagation is used: for a new speaker, gradients are propagated through the composite network to update the speaker code
- Because the code is short, it is possible to reduce the word error rate using only a small amount of adaptation data
[Fig. 1: speaker adaptation of the hybrid NN-HMM model based on speaker codes, showing the original network, and the composite NN in which an adaptation NN produces transformed features given the features vector and the speaker code]
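
A sketch of the flavour of this approach in PyTorch; the dimensions, activation, and layer layout are assumptions based on the description above, not the exact architecture of the cited paper:

```python
import torch
import torch.nn as nn

class SpeakerCodeAdapter(nn.Module):
    """Adaptation NN that transforms features conditioned on a speaker code;
    its output feeds the original (frozen) speaker-independent network."""

    def __init__(self, feat_dim, code_dim=50, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + code_dim, hidden),
            nn.Sigmoid(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, x, code):
        # The same short code vector is appended to every frame in the batch.
        code = code.expand(x.shape[0], -1)
        return self.net(torch.cat([x, code], dim=1))

# For a new speaker, the SI network and the adapter weights stay frozen;
# only code = nn.Parameter(torch.zeros(1, code_dim)) is optimized by
# backpropagating the frame classification loss through the composite net.
```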

Summary
Speaker adaptation has been one of the most intensive areas of speech recognition research since the early 1990s.
HMM/GMM:
- substantial progress, resulting in significant, additive, consistent reductions in word error rate
- close mathematical links between different approaches
- linear transforms at the heart of many approaches
HMM/NN:
- open research topic
- GMM-based feature-space transforms somewhat effective
- direct weight adaptation less effective

Reading
HMM/GMM:
- Gales and Young (2007), "The Application of Hidden Markov Models in Speech Recognition", Foundations and Trends in Signal Processing, 1(3):195-304 (section 5).
- Woodland (2001), "Speaker adaptation for continuous density HMMs: A review", ISCA ITRW on Adaptation Methods for Speech Recognition.
- Gales (1998), "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, 12:75-98.
HMM/DNN:
- Liao (2013), "Speaker adaptation of context dependent deep neural networks", Proc IEEE ICASSP.
- Abdel-Hamid and Jiang (2013), "Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code", Proc IEEE ICASSP.