Towards Speaker Adaptive Training of Deep Neural Network Acoustic Models

Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Outline
1. Motivation
2. Speaker-normalized feature space with i-vectors
   - AdaptNN: bottom adaptation layers
   - ivecNN: linear feature shift
3. Speaker adaptive training of DNNs
4. Experiments
5. Summary & Future Work

Motivation
- Deep neural networks have become the state of the art for acoustic modeling
- For GMM models, speaker adaptive training (SAT) has been a standard technique for improving WERs
- Various methods [1,2,3,4,5] have been proposed to perform speaker adaptation of DNNs
- However, how to do SAT for DNNs remains unclear
- In this work, we aim to achieve complete speaker adaptive training of DNN acoustic models

SAT for HMM/GMM: Model Update in the fMLLR Feature Space
[Diagram: loop between fMLLR transform estimation and HMM/GMM model update]
- Start with an initial GMM model and estimate per-speaker fMLLR affine transforms
- Update the model parameters with fMLLR applied, then re-estimate the fMLLR transforms
- Repeat until convergence (a schematic sketch follows below)
We want to do the same thing for DNNs!
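To make the loop concrete, here is a schematic Python sketch of GMM-based SAT. The helpers estimate_fmllr, apply_fmllr, and update_gmm are hypothetical placeholders for the corresponding toolkit routines (e.g., in Kaldi), not real APIs.

```python
# Schematic sketch of SAT for HMM/GMM. estimate_fmllr, apply_fmllr, and
# update_gmm are hypothetical placeholders for toolkit routines; this is
# not runnable against Kaldi as-is.
def sat_gmm(gmm, feats_by_speaker, num_iters=4):
    for _ in range(num_iters):
        # Estimate one fMLLR affine transform per speaker under the current model
        transforms = {spk: estimate_fmllr(gmm, feats)
                      for spk, feats in feats_by_speaker.items()}
        # Re-estimate the model parameters in the fMLLR-normalized feature space
        normalized = {spk: apply_fmllr(transforms[spk], feats)
                      for spk, feats in feats_by_speaker.items()}
        gmm = update_gmm(gmm, normalized)
    return gmm, transforms
```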

Basic Idea of SAT-DNN
[Diagram: original features + i-vector -> new features -> initial DNN]
- Start with an initial DNN model, i.e., the regular DNN we train for hybrid systems
- Learn a function that takes advantage of i-vectors and projects the DNN inputs into a speaker-normalized feature space
- Update the DNN model in the new feature space

Bottom Adaptation Layers (AdaptNN)
[Diagram: AdaptNN inserted between the input features and the initial DNN]
- Insert a smaller adaptation network (AdaptNN) between the inputs and the initial DNN
- I-vectors are appended to the outputs of each hidden layer
- By using i-vectors, AdaptNN transforms the original DNN inputs into a speaker-normalized space
- The output layer of AdaptNN has the same dimension as the original input features
- The output layer adopts a linear activation function, while the other layers use sigmoids
- The parameters of AdaptNN can be estimated by standard error back-propagation while keeping the initial DNN fixed (see the sketch below)
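As an illustration, a minimal PyTorch sketch of such an adaptation network follows. PyTorch and the layer sizes are our assumptions (the presentation's implementation uses Kaldi+PDNN), and whether the i-vector is also appended to the raw input is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class AdaptNN(nn.Module):
    """Sketch of the bottom adaptation network: a small stack of sigmoid
    layers with the speaker i-vector appended at every layer, plus a linear
    output layer that maps back to the original input dimension."""
    def __init__(self, feat_dim, ivec_dim, hidden_dim=512, num_hidden=2):
        super().__init__()
        self.hidden = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(num_hidden):
            self.hidden.append(nn.Linear(in_dim + ivec_dim, hidden_dim))
            in_dim = hidden_dim
        self.out = nn.Linear(in_dim + ivec_dim, feat_dim)  # linear activation

    def forward(self, feats, ivec):
        h = feats
        for layer in self.hidden:
            # i-vector appended to the layer input / hidden-layer outputs
            h = torch.sigmoid(layer(torch.cat([h, ivec], dim=-1)))
        return self.out(torch.cat([h, ivec], dim=-1))
```

The speaker-normalized output of AdaptNN then feeds the initial DNN unchanged.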

Linear Feature Shift (ivecNN)
[Diagram: ivecNN generating a shift that is added to the inputs of the initial DNN]
a_t = o_t + f(i_s)
- ivecNN takes the speaker i-vector i_s as input and generates a linear feature shift f(i_s) for each speaker
- The shift is added to the original DNN input o_t at every frame t; the resulting features a_t become more speaker-normalized
- The output layer of ivecNN has the same dimension as the DNN inputs and takes a linear activation function
- The parameters of ivecNN can be estimated by standard error back-propagation
- More flexible: it can be applied both to DNNs and to convolutional neural networks (CNNs) [6] (see the sketch below)
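A matching sketch of ivecNN, continuing the PyTorch example above (imports as before; one hidden layer and its size are assumptions of this sketch):

```python
class IvecNN(nn.Module):
    """Sketch of ivecNN: maps the speaker i-vector i_s to a feature shift
    f(i_s) with the same dimension as the DNN input; the output layer is
    linear, the hidden layer sigmoid."""
    def __init__(self, ivec_dim, feat_dim, hidden_dim=512):
        super().__init__()
        self.hidden = nn.Linear(ivec_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, feat_dim)  # linear activation

    def forward(self, feats, ivec):
        # a_t = o_t + f(i_s): the same per-speaker shift is added to every frame
        return feats + self.out(torch.sigmoid(self.hidden(ivec)))
```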

Procedures of SAT-DNN Training
[Diagram: original features + i-vectors -> feature function (AdaptNN or ivecNN) -> initial DNN -> SAT-DNN]
- Step 1: Train the initial DNN model. This DNN can be trained on SI features (e.g., fbank) or SA features (e.g., fMLLR)
- Step 2: Learn the feature function (AdaptNN or ivecNN) while keeping the initial DNN fixed. This step requires speaker i-vectors as side information for the feature transformation
- Step 3: Re-finetune the DNN parameters in the new feature space while keeping the feature function fixed. This finally gives us the SAT-DNN (the three steps are sketched below)
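One possible staging of the three steps, continuing the PyTorch sketch above: initial_dnn is the acoustic model, adapt_nn the feature function from before, and train_ce is a hypothetical helper standing in for a cross-entropy back-propagation loop over the training set.

```python
# Compose the feature function with the acoustic model: the adaptation
# network runs first, so the DNN sees only speaker-normalized features.
def forward_sat(feats, ivec):
    return initial_dnn(adapt_nn(feats, ivec))

# Step 1: train the initial, speaker-independent DNN as usual
train_ce(initial_dnn, initial_dnn.parameters())

# Step 2: learn the feature function with the initial DNN frozen
for p in initial_dnn.parameters():
    p.requires_grad_(False)
train_ce(forward_sat, adapt_nn.parameters())

# Step 3: re-finetune the DNN in the new space, feature function fixed
for p in initial_dnn.parameters():
    p.requires_grad_(True)
for p in adapt_nn.parameters():
    p.requires_grad_(False)
train_ce(forward_sat, initial_dnn.parameters())  # -> SAT-DNN
```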

Procedures of SAT-DNN Decoding
[Diagram: original features + i-vectors -> feature function -> SAT-DNN]
- Step 1: Given a testing speaker, simply extract the speaker's i-vector for adaptation. I-vector extraction is fully unsupervised
- Step 2: Feed the speech features and the i-vector into this architecture for decoding. This projects the input features into the speaker-normalized space and adapts the SAT-DNN automatically to the testing speaker
- Since i-vector extraction is fully unsupervised, there is no initial decoding pass and no fine-tuning on the adaptation data
- Only a single decoding pass is needed, even though we are doing unsupervised adaptation: very efficient unsupervised adaptation (see the sketch below)
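Decoding then reduces to one forward pass per utterance. In this hypothetical sketch, extract_ivector wraps the i-vector extractor and decode wraps the HMM decoder; adapt_nn and sat_dnn are the trained modules from above.

```python
# Single-pass unsupervised adaptation at test time (sketch; extract_ivector
# and decode are hypothetical wrappers around the extractor and decoder).
ivec = extract_ivector(test_speaker_audio)  # no transcripts, no extra decoding pass
with torch.no_grad():
    normalized = adapt_nn(feats, ivec)      # project into the normalized space
    posteriors = sat_dnn(normalized)        # senone posteriors for the decoder
result = decode(posteriors)
```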

Comparison to Related Work
- G. Saon, H. Soltau, D. Nahamoo, and M. Picheny. Speaker adaptation of neural network acoustic models using i-vectors. ASRU 2013: concatenate i-vectors with the original features directly and train the whole network from scratch
- We failed to get obvious gains from this proposal, most likely due to the normalization of the i-vectors. The i-vectors have to be normalized very carefully, as also observed by: A. Senior and I. Lopez-Moreno. Improving DNN speaker independence with i-vector inputs. ICASSP 2014
- With our SAT-DNN, there is no need to worry about i-vector normalization: the feature function does this job!

Experiments: Switchboard
- A 110-hour training setup [7] (about 100k utterances)
- Kaldi for GMM training: mono -> delta -> lda+mllt -> sat
- Kaldi+PDNN: http://www.cs.cmu.edu/~ymiao/kaldipdnn.html
- Two types of DNN inputs: SI filterbanks and SA fMLLR features
- Tested on the SWBD part of Hub5'00
I-Vector Extractor Building
- Open-source ALIZE toolkit [8]
- A 100-dimensional i-vector is extracted for each training and testing speaker

Experiments: Switchboard (WER %; relative improvement over the baseline in parentheses)

Models                  | Filterbank  | fMLLR
Baseline (initial) DNN  | 21.4        | 19.9
SAT-DNN + AdaptNN       | 19.8 (7.5%) | 18.7 (6.0%)
SAT-DNN + ivecNN        | 19.9 (7.0%) | 19.0 (4.8%)
Initial DNN + AdaptNN   | 20.8 (2.8%) | 19.2 (3.5%)
Initial DNN + ivecNN    | 21.2 (0.9%) | 19.7 (1.0%)

Our recent work enlarges the improvements to 11.1% and 6.8% relative on the filterbank and fMLLR features, respectively.

Experiments: BABEL
- More challenging BABEL datasets: conversational telephone speech from low-resource languages
- 80 hours of training data for each language: Tagalog (IARPA-babel106-v0.2f) and Turkish (IARPA-babel105b-v0.4); only the SI filterbank features are used

Models                  | Tagalog     | Turkish
Baseline (initial) DNN  | 49.3        | 51.3
SAT-DNN + AdaptNN       | 47.1 (4.5%) | 48.6 (5.3%)
SAT-DNN + ivecNN        | 47.3 (4.1%) | 49.3 (3.9%)

Summary & Future Work
Summary
- We can do SAT for DNNs! To achieve this, we propose two feature-learning approaches that give us the speaker-normalized space
- We get nice improvements! Our experiments show that SAT-DNN outperforms DNNs regardless of the feature type of the DNN inputs
- Our code is open source! You can check out the code and run the experiments: http://www.cs.cmu.edu/~ymiao/satdnn.html
Future Work
- Comparison with speaker adaptation methods; perform sequence training [9] on the resulting SAT-DNN
- Extend the SAT framework to other architectures, e.g., to bottleneck feature extraction [10] and convolutional neural networks [6]

References
[1] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. ASRU, pp. 24-29, 2011.
[2] B. Li and K. C. Sim, "Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems," in Proc. Interspeech, pp. 526-529, 2010.
[3] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, "Adaptation of context-dependent deep neural networks for automatic speech recognition," in Proc. IEEE Spoken Language Technology Workshop, pp. 366-369, 2012.
[4] S. M. Siniscalchi, J. Li, and C.-H. Lee, "Hermitian based hidden activation functions for adaptation of hybrid HMM/ANN models," in Proc. Interspeech, pp. 526-529, 2012.
[5] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," in Proc. ASRU, pp. 55-59, 2013.
[6] T. N. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. ICASSP, pp. 8614-8618, 2013.
[7] S. P. Rath, D. Povey, K. Vesely, and J. Cernocky, "Improved feature processing for deep neural networks," in Proc. Interspeech, 2013.
[8] J.-F. Bonastre, N. Scheffer, D. Matrouf, C. Fredouille, A. Larcher, A. Preti, G. Pouchoulin, N. Evans, B. Fauve, and J. Mason, "ALIZE/SpkDet: a state-of-the-art open-source software for speaker recognition," in Proc. ISCA/IEEE Speaker Odyssey, 2008.
[9] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proc. ICASSP, pp. 3761-3764, 2009.
[10] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in Proc. ICASSP, 2013.

Thank You
Yajie Miao, Hao Zhang, Florian Metze
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Acknowledgements: This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government.