Combining Speech and Speaker Recognition - A Joint Modeling Approach

Hang Su
Supervised by: Prof. N. Morgan, Dr. S. Wegmann
EECS, University of California, Berkeley, CA, USA
International Computer Science Institute, Berkeley, CA, USA
August 16, 2018

Table of contents

1 Introduction and Motivation
2 Automatic Speech Recognition; Speaker Recognition
3 Speaker Recognition using ASR; Speaker Adaptation
4 TIK: An Open-source Tool; JointDNN for speech and speaker recognition
5 Summary and Future Work

Joint modeling of speech and speaker: the brief idea

- Automatic speech recognition (ASR): translate speech to text automatically
- Speaker recognition (or speaker identification): identify speakers from the characteristics of their voice
- Combining speech and speaker recognition: capture speech and speaker characteristics together

Why speech / speaker recognition

Applications of speech & speaker recognition in human-computer interfaces:
- Automatic speech recognition: in-car systems, smart homes, speech search...
- Speaker recognition: authentication, safety, personalization...

A problem

They are handled separately:
- Different datasets / evaluations
- Different models / methods
But they are closely related to each other:
- Both take speech as input
- Similar features / models
- (Same group of researchers :)

An ideal AI agent for speech

[Figure slides: an ideal AI agent for speech]

Automatic Speech Recognition (ASR)

- Transcribes speech into text
- Frame-by-frame approach (10-30 ms frames)
- Components of a traditional ASR system: feature extraction, acoustic modeling (GMM-HMM), lexicon, language modeling (LM)
- Alternatively, an end-to-end approach discards the HMM and optionally the lexicon or language model
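
As a concrete illustration of the frame-by-frame view, the sketch below extracts MFCC features over 25 ms windows every 10 ms. It uses librosa purely for illustration (the talk's systems use Kaldi feature extraction), and the file name is hypothetical:

    import librosa

    # Load audio at 16 kHz (telephone-style pipelines often use 8 kHz)
    y, sr = librosa.load("utterance.wav", sr=16000)

    # 25 ms analysis windows with a 10 ms frame shift
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr),       # 400 samples per window
        hop_length=int(0.010 * sr))  # 160 samples between frames

    print(mfcc.shape)  # (13, n_frames): one 13-dim vector per frame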

Traditional ASR pipeline

[Figure: traditional ASR pipeline]

Gaussian Mixture Model - HMM [9, 3]

[Figure]

Deep Neural Network - HMM [1, 11]

[Figure]

Long Short-Term Memory - HMM [8]

[Figure]

Speaker Recognition

- Identify speakers from speech
- Components: feature extraction, acoustic modeling, speaker modeling, scoring
- Makes utterance-level predictions

Text-independent speaker recognition

[Figure: text-independent speaker recognition]

Factor analysis approach [2]

    x_t ~ \sum_{k=1}^{K} \pi_k \mathcal{N}(\mu_k + A_k z_i, \Sigma_k),   z_i ~ \mathcal{N}(0, I),   \sum_{k=1}^{K} \pi_k = 1    (1)

- x_t is the p-dimensional speech feature for frame t
- \pi_k is the prior for mixture k
- z_i is a q-dimensional speaker-specific latent factor (the i-vector)
- A_k is a p-by-q projection matrix for mixture k
- \mu_k and \Sigma_k are the parameters of Gaussian k
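
A minimal numpy sketch of this generative model may make the notation concrete. All dimensions and parameter values below are invented for illustration; in practice the parameters are estimated from data:

    import numpy as np

    rng = np.random.default_rng(0)
    K, p, q = 4, 6, 2                      # mixtures, feature dim, i-vector dim

    pi = rng.dirichlet(np.ones(K))         # mixture priors, sum to 1
    mu = rng.normal(size=(K, p))           # per-mixture means
    A = rng.normal(size=(K, p, q))         # per-mixture projection matrices
    Sigma = np.stack([np.eye(p)] * K)      # per-mixture covariances (identity here)

    z_i = rng.normal(size=q)               # speaker-specific latent factor (i-vector)

    def sample_frame():
        """Draw one frame x_t: pick a mixture k, then sample the shifted Gaussian."""
        k = rng.choice(K, p=pi)
        mean = mu[k] + A[k] @ z_i          # mixture mean shifted by the speaker factor
        return rng.multivariate_normal(mean, Sigma[k])

    X = np.array([sample_frame() for _ in range(100)])  # 100 frames from one speaker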

Post-processing of i-vectors

The factor-analysis model is unsupervised; supervised methods can be used to improve i-vectors:
- Linear Discriminant Analysis [6]
- Probabilistic Linear Discriminant Analysis [6, 5]
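
A hedged sketch of this post-processing, using scikit-learn's LDA in place of a full recipe. The array contents and the simple cosine scorer are illustrative; PLDA scoring, used in the systems reported later, is more involved:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(1)
    ivecs = rng.normal(size=(200, 100))    # one 100-dim i-vector per utterance
    labels = np.repeat(np.arange(20), 10)  # 20 speakers, 10 utterances each

    # Supervised projection that separates speakers (at most n_speakers - 1 dims)
    lda = LinearDiscriminantAnalysis(n_components=19)
    ivecs_lda = lda.fit_transform(ivecs, labels)

    def cosine_score(enroll, test):
        """Cosine similarity between enrollment and test vectors."""
        return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))

    print(cosine_score(ivecs_lda[0], ivecs_lda[1]))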

Speaker recognition using ASR

[Figure: speaker recognition using an ASR acoustic model]

Speaker recognition using ASR (cont.)

Prior work:
- Substitute the UBM with a DNN model [7]
- Substitute the UBM with a time-delay DNN [13]
- Use a DNN-initialized GMM acoustic model [13]
Proposal: use better DNN models for ASR:
- Trained on raw MFCC features
- Trained on LDA-transformed features
- Trained on LDA + fMLLR-transformed features
- Trained with the Minimum Phone Error (MPE) criterion
(Factor Analysis Based Speaker Verification Using ASR. Hang Su and Steven Wegmann. Interspeech 2016.)

Data description

Speaker recognition evaluation (SRE) data set:
- Training data (SRE 2004-2008): 18,715 recordings from 3,009 speakers; 1,000+ hours of data, 360,000,000 frame samples
- Test data (SRE 2010): 387,112 trials (98% non-target); 11,983 enrollment speakers, 767 test speakers; 2-3 minutes per speaker
ASR data set:
- Training data: Switchboard
- Test data: Eval2000

Metrics: DET curve and EER

[Figure: detection error tradeoff (DET) curve and equal error rate (EER)]

Metric: Word Error Rate (WER)

    WER = (S + D + I) / R    (2)

- S: number of substitutions
- D: number of deletions
- I: number of insertions
- R: number of words in the reference
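
Equation (2) is computed from the word-level edit distance between hypothesis and reference. A small self-contained sketch, purely illustrative (scoring tools such as NIST sclite handle text normalization and alignment details):

    def wer(reference, hypothesis):
        """Word error rate: (substitutions + deletions + insertions) / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i                  # i deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j                  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ~ 0.167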

Experimental results

    System          Eval2000 WER   SRE2010 EER
    UBM             -              6.31
    DNN-MFCC        19.4           6.39
    + LDA + MLLT    16.3           4.84
    + fMLLR         14.9           4.55
    + MPE           13.5           4.38

Table 1: Eval2000 WER and SRE2010 EER for speaker recognition systems in different settings. Note: ASR decoding needed.

Experimental results

[Figure 1: DET curves for systems in different settings]

Speaker Adaptation

How can speaker-specific characteristics be handled during recognition?
- Adapt speaker-independent systems to individual speakers (model space)
- Normalize speech features to compensate for speaker characteristics (feature space)

Speaker adaptation for DNN systems

Existing methods:
- Feature-space transformations (fMLLR) [4]
- Model-space transformations [15]
- Adapting model parameters via regularization [16]
- Learning hidden unit contributions (LHUC) [14]

Speaker adaptation using i-vectors [10]

    h = W_a x + W_s z

The first hidden layer h combines the acoustic input x and the speaker i-vector z through separate weight matrices W_a and W_s.
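
A minimal numpy sketch of this augmented input layer. Dimensions follow the ranges quoted later in this section; the ReLU nonlinearity is an illustrative choice, since the slide's equation shows only the linear part:

    import numpy as np

    rng = np.random.default_rng(0)
    d_x, d_z, d_h = 440, 100, 512        # acoustic dim, i-vector dim, hidden dim

    W_a = rng.normal(scale=0.01, size=(d_h, d_x))  # weights for acoustic input
    W_s = rng.normal(scale=0.01, size=(d_h, d_z))  # weights for the i-vector

    x = rng.normal(size=d_x)             # one frame of acoustic features
    z = rng.normal(size=d_z)             # i-vector for this recording's speaker

    # Acoustic and speaker inputs enter the first hidden layer via separate weights
    h = np.maximum(0.0, W_a @ x + W_s @ z)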

Speaker adaptation using i-vectors

Benefits of using i-vectors:
- Requires no model re-training or ASR decoding
- A single DNN model serves all speakers
Potential drawback: tends to overfit

Problems with speaker adaptation using i-vectors

- i-vectors are extracted once per recording: ~100 million frames but only ~4,800 recordings
- Acoustic feature dimension ~440; i-vector dimension 100-400
- A better objective on the training data does not translate into WER improvement: overfitting occurs

Treatments for overfitting

Mitigate overfitting by:
- Reducing the i-vector dimension [10]
- Using utterance-based i-vectors [12]
- Extracting i-vectors with a sliding window (as in Kaldi)
- L2 regularization back to the baseline DNN [12] (formulated on the next slide)

Regularization of the i-vector sub-network

    L_{reg} = L_{CE} + \beta \|w_{ivec}\|^2

where L_{CE} is the cross-entropy loss and w_{ivec} are the weights of the i-vector sub-network.
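
A short sketch of this regularized objective, assuming the training framework exposes the i-vector sub-network's weights as a list of arrays (names and shapes are illustrative):

    import numpy as np

    def regularized_loss(log_probs, targets, ivec_weights, beta=1e-4):
        """L_reg = L_CE + beta * ||w_ivec||^2, penalizing only the i-vector branch.

        log_probs: (n_frames, n_states) log-posteriors from the DNN
        targets:   (n_frames,) integer HMM-state labels
        """
        ce = -np.mean(log_probs[np.arange(len(targets)), targets])
        l2 = sum(float(np.sum(w ** 2)) for w in ivec_weights)
        return ce + beta * l2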

Data description

Switchboard data set (training):
- Clean telephone speech, English
- ~300 hours of transcribed data (~108,000,000 samples)
- ~4,800 recordings
Eval2000 hub5 test set:
- Switchboard portion + CallHome portion (family members)
- 40 + 40 speakers
- 2 hours + 1.6 hours

Metric: Word Error Rate (WER), as defined in equation (2).

Experimental results

    feature             MFCC              +fMLLR
    data                Swbd   Callhome   Swbd   Callhome
    acoustic feature    16.0   28.5       14.9   25.6
    + i-vector          15.2   27.1       14.4   25.7
    + regularization    14.6   26.3       14.3   24.9

Table 2: WER for i-vector adaptation using regularization

A brief summary:
- Speech and speaker recognition are two closely related tasks
- Speaker information can be used to improve speech recognition performance
- Acoustic models trained for ASR can be used to assist speaker recognition

Existing tools for speech

Kaldi: a popular speech recognition toolkit
- Supports GMM, HMM, DNN, LSTM...
- State-of-the-art recipes
TensorFlow (TF): a flexible deep learning research framework
- TensorFlow Lite: easy to deploy on embedded devices
- Tensor Processing Units (TPUs)

TIK

TIK bridges the gap between TensorFlow and Kaldi:
- Supports acoustic modeling with TensorFlow
- Integrates with the Kaldi decoder through a pipe
- Covers both speech and speaker recognition tasks
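
The talk does not show the pipe mechanics; as a hedged illustration, one way to connect a TensorFlow acoustic model to Kaldi is to stream per-frame log-probabilities in Kaldi's text archive format into a decoding binary. The binary name and flags below reflect standard Kaldi usage, the model/graph files are assumed to come from an existing Kaldi recipe, and the random scores stand in for real network outputs; the actual TIK invocation may differ:

    import subprocess
    import numpy as np

    def write_kaldi_text_matrix(stream, key, mat):
        """Write one utterance as a Kaldi text-archive matrix: 'key [ rows ]'."""
        stream.write(f"{key} [\n".encode())
        for row in mat:
            stream.write((" ".join(f"{v:.4f}" for v in row) + "\n").encode())
        stream.write(b"]\n")

    # Fake (200 frames x 500 states) log-probabilities in place of DNN outputs
    loglikes = np.log(np.random.default_rng(0).dirichlet(np.ones(500), size=200))

    decoder = subprocess.Popen(
        ["latgen-faster-mapped", "--acoustic-scale=0.1",
         "--word-symbol-table=words.txt",
         "final.mdl", "HCLG.fst", "ark:-", "ark:lat.ark"],
        stdin=subprocess.PIPE)
    write_kaldi_text_matrix(decoder.stdin, "utt001", loglikes)
    decoder.stdin.close()
    decoder.wait()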

System design of TIK

[Figure: system design of TIK]

ASR performance using TIK

    System       Swbd   CallHome   All
    Kaldi GMM    21.4   34.8       28.2
    Kaldi DNN    14.9   25.6       20.3
    TIK DNN      14.5   25.5       20.0
    TIK BLSTM    13.6   24.3       19.0

Table 3: WER of ASR systems trained with Kaldi and TIK (Eval2000 test set)

Speaker recognition performance using TIK

    System      Cosine   LDA    PLDA
    Kaldi UBM   6.91     3.36   2.51
    Kaldi DNN   4.00     1.83   1.27
    TIK DNN     4.53     2.00   1.27

Table 4: EER of speaker recognition systems trained with Kaldi and TIK (SRE2010 test set)

X-vector approach

[Figure 2: structure of the x-vector approach for speaker recognition]
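
The slide shows the x-vector architecture only as a figure; its characteristic component is a statistics-pooling layer that turns a variable number of frame-level activations into one fixed-size utterance-level vector, from which the speaker embedding is read. A minimal numpy sketch (dimensions illustrative):

    import numpy as np

    def stats_pooling(frame_feats, eps=1e-10):
        """Map (n_frames, d) frame activations to a (2d,) utterance vector by
        concatenating the per-dimension mean and standard deviation."""
        mean = frame_feats.mean(axis=0)
        std = np.sqrt(frame_feats.var(axis=0) + eps)  # eps guards zero variance
        return np.concatenate([mean, std])

    frames = np.random.default_rng(0).normal(size=(300, 512))  # ~3 s of frames
    utt_vec = stats_pooling(frames)  # fed to segment-level layers; the embedding
                                     # (x-vector) is taken from one of them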

JointDNN model

[Figure 3: structure of the JointDNN model]

Loss function

    L(\theta) = -\sum_{s=1}^{S} \sum_{t=1}^{T_s} \log P(h_{s,t} | o_{s,t}) - \beta \sum_{s=1}^{S} \log P(x_s | o_s)    (4)

- An interpolation of two cross-entropy losses
- \beta is the interpolation weight
- h_{s,t} is the HMM state for frame t of segment s
- o_{s,t} is the observed feature vector
- x_s is the correct speaker
- o_s is the sequence of speech features for segment s
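
A hedged numpy sketch of this interpolated objective (container names and shapes are illustrative; a real implementation would compute it inside the training framework):

    import numpy as np

    def joint_loss(asr_logp, hmm_states, spk_logp, spk_ids, beta=0.01):
        """L = -sum_s sum_t log P(h_{s,t}|o_{s,t}) - beta * sum_s log P(x_s|o_s).

        asr_logp:   list over segments of (T_s, n_states) frame log-posteriors
        hmm_states: list over segments of (T_s,) HMM-state targets
        spk_logp:   (S, n_speakers) segment-level speaker log-posteriors
        spk_ids:    (S,) correct speaker index per segment
        """
        asr_term = -sum(lp[np.arange(len(h)), h].sum()
                        for lp, h in zip(asr_logp, hmm_states))
        spk_term = -spk_logp[np.arange(len(spk_ids)), spk_ids].sum()
        return asr_term + beta * spk_term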

Data description

Training data: Switchboard data set
- ~300 hours of transcribed data (~108,000,000 samples)
- ~520 speakers
Test data:
- Eval2000 hub5 test set for speech recognition
- SRE2010 test set for speaker recognition

Performance of speaker recognition

    System                      EER
    Baseline i-vector           4.85
    Kaldi x-vector              8.94
    TIK x-vector                8.81
    TIK jd-vector (β = 0.01)    4.75

Table 5: EER of the JointDNN model for speaker recognition (SRE2010 test set)

Performance of speaker recognition

[Figure 4: DET curve of the JointDNN model for speaker recognition (SRE2010 test set)]

Performance of speech recognition

    System                Swbd   Callhome   All
    Baseline DNN          16.1   28.4       22.3
    JointDNN (β = 0.01)   16.8   29.0       22.9

Table 6: WER of the JointDNN model for speech recognition

Adjusting the interpolation weight β

              Development (%)          Evaluation
    β         ASR acc   Speaker acc    SRE EER (%)   Swbd WER (%)
    0.1       39.07     97.22          5.10          16.7
    0.01      39.20     94.10          4.75          16.8
    0.001     38.60     85.36          9.19          17.2
    0.0001    38.59     41.95          13.25         17.0

Table 7: Performance of the JointDNN model with different β

Summary of the JointDNN model

- JointDNN can be used for ASR and SRE simultaneously
- The ASR part helps guide the speaker recognition sub-network
- It is effective with a limited amount of training data
- It uses less memory than the i-vector approach (better for embedded devices)

Summary of the talk

- Speech and speaker recognition are beneficial to each other
- A joint model helps exploit both speech and speaker information
- It is effective with a limited amount of training data

Future work

Future work on joint modeling:
- Use a larger data set or data augmentation techniques
- Introduce recurrent structures into the joint model
- Explore end-to-end approaches for joint modeling
- Work towards an all-around speech AI agent

References

[1] Hervé A. Bourlard and Nelson Morgan. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, Norwell, MA, USA, 1993.
[2] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788-798, 2011.
[3] Mark Gales and Steve Young. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing, 1(3):195-304, 2008.
[4] Mark J. F. Gales. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2):75-98, 1998.
[5] Daniel Garcia-Romero and Carol Y. Espy-Wilson. Analysis of i-vector length normalization in speaker recognition systems. In Interspeech, 2011.
[6] Patrick Kenny, Themos Stafylakis, Pierre Ouellet, Md Jahangir Alam, and Pierre Dumouchel. PLDA for speaker verification with utterances of arbitrary duration. In ICASSP, pages 7649-7653. IEEE, 2013.
[7] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren. A novel scheme for speaker recognition using a phonetically-aware deep neural network. In ICASSP. IEEE, 2014.
[8] Abdel-rahman Mohamed, Frank Seide, Dong Yu, Jasha Droppo, Andreas Stolcke, Geoffrey Zweig, and Gerald Penn. Deep bi-directional recurrent networks over spectral windows. In ASRU, pages 78-83. IEEE, 2015.
[9] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[10] George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny. Speaker adaptation of neural network acoustic models using i-vectors. In ASRU, pages 55-59, 2013.
[11] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using context-dependent deep neural networks. In Interspeech, 2011.
[12] Andrew Senior and Ignacio Lopez-Moreno. Improving DNN speaker independence with i-vector inputs. In ICASSP, pages 225-229. IEEE, 2014.
[13] David Snyder, Daniel Garcia-Romero, and Daniel Povey. Time delay deep neural network-based universal background models for speaker recognition. In ASRU. IEEE, 2015.
[14] Pawel Swietojanski and Steve Renals. Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models. In SLT, pages 171-176. IEEE, 2014.
[15] Kaisheng Yao, Dong Yu, Frank Seide, Hang Su, Li Deng, and Yifan Gong. Adaptation of context-dependent deep neural networks for automatic speech recognition. In SLT, pages 366-369. IEEE, 2012.
[16] Dong Yu, Kaisheng Yao, Hang Su, Gang Li, and Frank Seide. KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. In ICASSP, pages 7893-7897. IEEE, 2013.