Natural Speech Technology

Similar documents
A study of speaker adaptation for DNN-based speech synthesis

A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration

INVESTIGATION OF UNSUPERVISED ADAPTATION OF DNN ACOUSTIC MODELS WITH FILTER BANK INPUT

Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models

DIRECT ADAPTATION OF HYBRID DNN/HMM MODEL FOR FAST SPEAKER ADAPTATION IN LVCSR BASED ON SPEAKER CODE

SEMI-SUPERVISED ENSEMBLE DNN ACOUSTIC MODEL TRAINING

Robust Speech Recognition using DNN-HMM Acoustic Model Combining Noise-aware training with Spectral Subtraction

BUILDING CONTEXT-DEPENDENT DNN ACOUSTIC MODELS USING KULLBACK-LEIBLER DIVERGENCE-BASED STATE TYING

ADVANCES IN DEEP NEURAL NETWORK APPROACHES TO SPEAKER RECOGNITION

Deep Neural Network Language Models

Distributed Learning of Multilingual DNN Feature Extractors using GPUs

Learning Methods in Multilingual Speech Recognition

Speech Recognition at ICSI: Broadcast News and beyond

Improvements to the Pruning Behavior of DNN Acoustic Models

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL XXX, NO. XXX,

Segmental Conditional Random Fields with Deep Neural Networks as Acoustic Models for First-Pass Word Recognition

DNN ACOUSTIC MODELING WITH MODULAR MULTI-LINGUAL FEATURE EXTRACTION NETWORKS

LOW-RANK AND SPARSE SOFT TARGETS TO LEARN BETTER DNN ACOUSTIC MODELS

SPEECH RECOGNITION CHALLENGE IN THE WILD: ARABIC MGB-3

DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS. Elliot Singer and Douglas Reynolds

Edinburgh Research Explorer

arxiv: v1 [cs.cl] 27 Apr 2016

Vowel mispronunciation detection using DNN acoustic models with cross-lingual training

Phonetic- and Speaker-Discriminant Features for Speaker Recognition. Research Project

UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS. Heiga Zen, Haşim Sak

arxiv: v1 [cs.lg] 7 Apr 2015

A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK. Yun Lei Nicolas Scheffer Luciana Ferrer Mitchell McLaren

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES

Modeling function word errors in DNN-HMM based LVCSR systems

Probabilistic Latent Semantic Analysis

Speech Emotion Recognition Using Support Vector Machine

UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation

AUTOMATIC DETECTION OF PROLONGED FRICATIVE PHONEMES WITH THE HIDDEN MARKOV MODELS APPROACH 1. INTRODUCTION

The NICT/ATR speech synthesis system for the Blizzard Challenge 2008

Using Articulatory Features and Inferred Phonological Segments in Zero Resource Speech Processing

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence

Calibration of Confidence Measures in Speech Recognition

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition

The 2014 KIT IWSLT Speech-to-Text Systems for English, German and Italian

Lecture 1: Machine Learning Basics

Lecture 1: Basic Concepts of Machine Learning

Speech Translation for Triage of Emergency Phonecalls in Minority Languages

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

Python Machine Learning

Investigation on Mandarin Broadcast News Speech Recognition

Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano. Graduate School of Information Science, Nara Institute of Science & Technology

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Statistical Parametric Speech Synthesis

On the Formation of Phoneme Categories in DNN Acoustic Models

Spoofing and countermeasures for automatic speaker verification

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH

(Sub)Gradient Descent

The IRISA Text-To-Speech System for the Blizzard Challenge 2017

Rule Learning With Negation: Issues Regarding Effectiveness

International Journal of Computational Intelligence and Informatics, Vol. 1 : No. 4, January - March 2012

Speaker Identification by Comparison of Smart Methods. Abstract

Letter-based speech synthesis

Human Emotion Recognition From Speech

Unsupervised Acoustic Model Training for Simultaneous Lecture Translation in Incremental and Batch Mode

Knowledge Transfer in Deep Convolutional Neural Nets

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH

Analysis of Emotion Recognition System through Speech Signal Using KNN & GMM Classifier

Comment-based Multi-View Clustering of Web 2.0 Items

An Online Handwriting Recognition System For Turkish

WHEN THERE IS A mismatch between the acoustic

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines

Soft Computing based Learning for Cognitive Radio

International Journal of Advanced Networking Applications (IJANA) ISSN No. :

Voice conversion through vector quantization

Word Segmentation of Off-line Handwritten Documents

Noise-Adaptive Perceptual Weighting in the AMR-WB Encoder for Increased Speech Loudness in Adverse Far-End Noise Conditions

Digital Signal Processing: Speaker Recognition Final Report (Complete Version)

Course Outline. Course Grading. Where to go for help. Academic Integrity. EE-589 Introduction to Neural Networks NN 1 EE

Assignment 1: Predicting Amazon Review Ratings

arxiv: v2 [cs.cv] 30 Mar 2017

Cultivating DNN Diversity for Large Scale Video Labelling

BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY

Evaluating Interactive Visualization of Multidimensional Data Projection with Feature Transformation

Automatic Pronunciation Checker

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Using dialogue context to improve parsing performance in dialogue systems

CSL465/603 - Machine Learning

Vimala.C Project Fellow, Department of Computer Science Avinashilingam Institute for Home Science and Higher Education and Women Coimbatore, India

Artificial Neural Networks written examination

The Good Judgment Project: A large scale test of different methods of combining expert predictions

Proceedings of Meetings on Acoustics

Probability and Statistics Curriculum Pacing Guide

Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation

Dropout improves Recurrent Neural Networks for Handwriting Recognition

1. REFLEXES: Ask questions about coughing, swallowing, of water as fast as possible (note! Not suitable for all

Speaker recognition using universal background model on YOHO database

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks

Transcription:

Natural Speech Technology
Steve Renals, University of Edinburgh

Natural Speech Technology
- 5-year UK programme in core speech technology research, 2011-2016
- Focus on: speech recognition, speech synthesis, learning & adaptation

Motivations
- Weakly-factored models: factor the underlying causes of observed variability in speech
- Domain fragility: rapid transfer to new domains, with minimal supervision
- Synthesis and recognition developed independently
- Lack of reaction to the environment or context: respond and adapt to changes in the acoustic or linguistic environment
- Relatively little speech knowledge incorporated
- Cannot rely on gold-standard transcription: work somewhere on the supervised-unsupervised spectrum

Natural Speech Technology
[Project overview diagram: theory (learning, adaptation) feeding technology (speech recognition, speech synthesis), feeding applications: voice reconstruction, donation & banking; homeService; media archives; English Heritage; technology showcase]

Natural Speech Technology: exemplar applications
[The same overview diagram annotated with research themes: deep generative models, domain transfer, multi-task learning on the speech synthesis side; adaptation and canonical models, distant speech recognition, multi-genre transcription (MGB Challenge) on the speech recognition side]

Adaptation: multiple speakers, acoustic environments, different channels

Adapting NN acoustic models
Neural network adaptation is challenging:
- models with large numbers of parameters potentially need a lot of adaptation data
- relatively little structure in the weights
- unsupervised adaptation is preferable to supervised
- compact adaptation is preferable
- joint optimisation of core acoustic model parameters and adaptation parameters
Baseline: feature-space MLLR, using a CD-GMM-HMM system to adapt the input features for the NN acoustic model (a rough sketch follows).
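Not from the slides: a rough numpy sketch of what the fMLLR baseline does to the features. The dimensions and the identity transform are placeholders; in practice A and b are estimated per speaker with the CD-GMM-HMM system.

```python
import numpy as np

# Hypothetical dimensions: 40-d acoustic features, one utterance of 300 frames.
A = np.eye(40)            # per-speaker transform matrix (identity = no adaptation)
b = np.zeros(40)          # per-speaker bias
frames = np.random.randn(300, 40)

# fMLLR applies an affine transform x' = A x + b to every input frame;
# the transformed features are then fed to the NN acoustic model.
adapted = frames @ A.T + b
```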

Auxiliary features
Append (and optimise) additional speaker-based features to the input.
- i-vectors: a low-dimensional speaker representation that can be estimated from small amounts of data
- For ASR: Karafiat et al. (ASRU-2011); Saon et al. (ASRU-2013)
A minimal sketch of the idea follows.
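A minimal sketch of the auxiliary-feature idea; the dimensions are hypothetical (40-d frames, a 100-d i-vector). The speaker's i-vector is appended to every acoustic frame before the network sees it.

```python
import numpy as np

frames = np.random.randn(300, 40)   # (num_frames, feat_dim): one utterance
ivector = np.random.randn(100)      # hypothetical 100-d i-vector for this speaker

# Append the same speaker i-vector to every acoustic frame, so the
# network input becomes (num_frames, 140).
augmented = np.concatenate([frames, np.tile(ivector, (len(frames), 1))], axis=1)
```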

Factorised i-vectors (Karanasou et al., Interspeech-2014)
- Extract two sets of i-vectors: one for speaker information, one for acoustic environment information
- Estimate the i-vectors as the weights of a cluster adaptive training GMM system
- Orthogonal factor representations allow adaptation to account for a wide range of speaker/environment conditions
- On WSJ with added noise, factorised i-vectors give a 5-10% relative reduction in WER

i-vector priors (Karanasou et al., Interspeech-2015)
- With limited data (one utterance), use a prior to improve the robustness of the i-vector estimate
- The default Gaussian prior is sensitive to the amount of data per speaker, and to mismatches between training and test durations
- A count-smoothing prior interpolates between prior and observed statistics (cf. MAP), using:
  - speaker-independent prior statistics (estimated over all speakers)
  - gender-dependent prior statistics (two clusters)
- On YouTube data, WER improves by ~1-3% relative without a prior, and 3-5% relative with the SI prior
A schematic sketch of the count-smoothing idea follows.
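A schematic sketch of count smoothing, with invented statistics and a hypothetical pseudo-count tau; a real system interpolates the zero- and first-order i-vector statistics rather than a bare vector.

```python
import numpy as np

def smooth(obs_stats, obs_count, prior_stats, tau=10.0):
    # MAP-style interpolation: tau acts as a pseudo-count, so with very
    # little observed data the estimate falls back on the prior statistics.
    return (obs_count * obs_stats + tau * prior_stats) / (obs_count + tau)

prior = np.zeros(100)         # e.g. statistics pooled over all training speakers
obs = np.random.randn(100)    # statistics from a single short utterance
smoothed = smooth(obs, obs_count=2.0, prior_stats=prior)  # pulled toward prior
```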

Unsupervised domain discovery (Doulaty et al., Interspeech-2015)
- Discovery of hidden acoustic domains using latent Dirichlet allocation (LDA)
- Experiments on highly diverse data: radio, television, conversational telephone speech, meetings, read speech, lectures

LDA-DNN (Doulaty et al., ASRU-2015)
- 8% relative reduction in WER on the MGB Challenge, compared with a speaker-adapted DNN

Model-based adaptation
- Speaker codes (Bridle & Cox, 1990; Abdel-Hamid & Jiang, 2013): model-based adaptation using auxiliary features
- Adaptation of different weight subsets (Liao, ICASSP-2013): 5% relative decrease in WER when all 60M weights are adapted
- Automatic adaptation of specific parameter subsets: output biases (Yao et al., SLT-2012), slope and bias of hidden units (Siniscalchi et al., TASLP-2013)
- Adaptation cost based on the KL divergence between SI and speaker-adapted output distributions (Yu et al., ICASSP-2013): 3% relative decrease in WER on Switchboard (sketched below)
- Increased compactness through SVD factorisation of the weight matrix (Xue et al., ICASSP-2014)
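A PyTorch illustration in the spirit of Yu et al.'s KL-regularised adaptation cost (a sketch, not their exact recipe): the one-hot targets are interpolated with the SI model's posteriors, which is equivalent, up to a constant, to cross-entropy plus a weighted KL(SI || SA) penalty.

```python
import torch
import torch.nn.functional as F

def kl_regularised_loss(sa_logits, si_logits, targets, rho=0.5):
    # Interpolate the one-hot targets with the speaker-independent (SI)
    # model's posteriors; training the speaker-adapted (SA) model towards
    # this soft target keeps its outputs close to the SI distribution.
    log_sa = F.log_softmax(sa_logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=sa_logits.size(-1)).float()
    soft_targets = (1.0 - rho) * one_hot + rho * F.softmax(si_logits, dim=-1)
    return -(soft_targets * log_sa).sum(dim=-1).mean()
```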

LHUC: Learning Hidden Unit Contributions (Swietojanski & Renals, SLT-2014; Zhang & Woodland, Interspeech-2015)
- Key idea: add a learnable speaker-dependent amplitude to each hidden unit:
  h_m^l = a(r_m^l) ⊙ φ_l(W_l^T h_m^{l-1}), for speaker m and layer l
- Architecture: 3-8 hidden layers of ~2000 units each, ~6000 CD phone outputs
- SI model: set the amplitudes to 1
- SD model: learn the amplitudes from data, per speaker
A minimal sketch follows.
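A minimal PyTorch sketch of an LHUC layer. The speaker-embedding-table formulation here is illustrative; in test-only LHUC the SI weights are frozen and only the r vector of the test speaker is updated.

```python
import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    """One hidden layer with a learnable per-speaker amplitude on each unit:
    h = a(r) * sigmoid(W x), with a(r) = 2 * sigmoid(r) constrained to (0, 2)."""

    def __init__(self, in_dim, hid_dim, num_speakers):
        super().__init__()
        self.linear = nn.Linear(in_dim, hid_dim)
        # r = 0 gives a(r) = 1, i.e. the unadapted speaker-independent model.
        self.r = nn.Parameter(torch.zeros(num_speakers, hid_dim))

    def forward(self, x, spk_ids):                         # spk_ids: (batch,) long
        amplitude = 2.0 * torch.sigmoid(self.r[spk_ids])   # (batch, hid_dim)
        return amplitude * torch.sigmoid(self.linear(x))
```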

LHUC: adaptation data
[Plot: WER (%) (y-axis, 10-16) vs amount of adaptation data in seconds (10, 30, 60, 120, 300, all) on TED (IWSLT tst2010); systems: DNN SI baseline, DNN+LHUC, DNN+SAT-LHUC, DNN+LHUC (oracle), DNN+SAT-LHUC (oracle)]

LHUC: improvement per speaker
[Per-speaker results combined from TED, AMI, and Switchboard]

Multi-basis adaptive NN (C. Wu & Gales, Interspeech-2015): 2-4% relative WER reduction (YouTube)

Adaptation by speaker selection for dysarthric speech (Christensen et al., SLT-2014)
- Dysarthric speech is highly talker-dependent
- Select an SI speaker pool based on WER: pooled SI model + MAP gives 40% WER
- On UA-Speech: SD 45% WER, SI+MAP 49% WER

Multiple average voice model (Lanchantin et al., Interspeech-2014)
- Personalised speech synthesis for people with speech disorders
- Combines cluster adaptive training and the average voice model
- Adaptation by interpolating in a speaker eigenspace spanned by the mean vectors of speaker-adapted AVMs
- Improvements in intelligibility and naturalness over a tailored synthetic voice

Adaptation in DNN speech synthesis (Z. Wu et al., Interspeech-2015)
- Network: linguistic feature inputs, 6 tanh hidden layers (1536 units each), linear output layer
- Outputs: vocoder parameters, 259-D in total: 60 mel-cepstra + Δ + ΔΔ, 25 band aperiodicities (BAP) + Δ + ΔΔ, F0 + Δ + ΔΔ, voicing
- Adaptation at three levels: i-vector and gender code appended to the input x, LHUC in the hidden layers, and a feature mapping y → y' on the output
- Speaker-dependent normalisation of the vocoder parameters
A sketch of the core network follows.
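A sketch of the network described above; the linguistic input dimension (ling_dim) is a placeholder, since only the hidden and output layer sizes appear on the slide.

```python
import torch
import torch.nn as nn

class TTSAcousticNet(nn.Module):
    def __init__(self, ling_dim=300, hidden_dim=1536, out_dim=259):
        super().__init__()
        layers, d = [], ling_dim
        for _ in range(6):                         # 6 tanh hidden layers
            layers += [nn.Linear(d, hidden_dim), nn.Tanh()]
            d = hidden_dim
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(hidden_dim, out_dim)  # linear output: vocoder params

    def forward(self, linguistic_features):
        return self.out(self.hidden(linguistic_features))
```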

Naturalness evaluation
[Bar chart: MUSHRA test with 30 listeners, scores 0-100 for seven adaptation conditions: i-vector, LHUC, FT, i-vector+LHUC, i-vector+FT, LHUC+FT, i-vector+LHUC+FT]

Similarity evaluation
[Bar chart: MUSHRA test with 30 listeners, same seven adaptation conditions]

DNN vs HMM
[Preference test with 30 listeners, preference scores (%): panels Naturalness:10, Naturalness:100, Similarity:10, Similarity:100, each comparing DNN vs HMM. DNN adapted using i-vector+LHUC+fMLLR; HMM adapted using CSMAPLR]

Multi-task learning

Multi-task DNNs in speech synthesis (Z. Wu et al., ICASSP-2015)
- Main task: vocoder parameters
- Secondary task: a glimpse-based perceptual measure (STEP)

Multi-task learning for ASR (Bell & Renals, ICASSP-2015)
- Hidden layers over the acoustic features are shared across tasks, with separate output layers: 6000 CD targets (main task) and 41 monophone targets (secondary task); out-of-domain inputs train the shared layers through their own OOD CD targets
- 3-5% relative WER reduction (TED)
A sketch of the shared-layer architecture follows.
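A PyTorch sketch of the shared-hidden-layer architecture (in-domain branch only; the input and hidden sizes are placeholders, the target counts are from the slide).

```python
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, in_dim=440, hidden_dim=2048,
                 cd_targets=6000, mono_targets=41):
        super().__init__()
        self.shared = nn.Sequential(       # hidden layers shared by both tasks
            nn.Linear(in_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
        )
        self.cd_head = nn.Linear(hidden_dim, cd_targets)      # main task
        self.mono_head = nn.Linear(hidden_dim, mono_targets)  # secondary task

    def forward(self, x):
        h = self.shared(x)
        return self.cd_head(h), self.mono_head(h)
```

Training would combine the two cross-entropy losses (e.g. L = L_CD + λ·L_mono, with λ a tuning weight); only the CD output layer is used at decoding time.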

Deep generative modelling

Trajectory RNADE (Uria et al., ICASSP-2015)

RNADE synthesis

Exemplar Applications

Voice banking and personalised TTS (Veaux et al.)

Multi-domain ASR (Saz et al.)

Browsing oral histories (Green et al.)

GlobalVox (Bell et al., Interspeech-2015)

Concluding remarks
Some recent advances from the NST project. Other work includes:
- distant speech recognition
- disordered speech recognition
- end-to-end RNN speech recognition
- disfluent speech synthesis
- speaker verification (spoofing challenge)
- multilingual / cross-lingual recognition & synthesis
- software: HTK v3.5, NN LM estimation, ...

NST people