Learning Latent Representations for Speech Generation and Transformation


Learning Latent Representations for Speech Generation and Transformation
Wei-Ning Hsu, Yu Zhang, James Glass
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
Interspeech 2017

What to Expect in This Talk
1. A convolutional variational autoencoder framework to model a generative process of speech
2. A method to associate learned latent representations with physical attributes, such as speaker identity and linguistic content
3. Simple latent space arithmetic operations to modify speech attributes
(Diagram: x → Encoder q(z|x) → (μ_z, σ_z) → z → Decoder p(x|z) → (μ_x, σ_x))

Outline 1. Motivation 2. Background and Models 3. Latent Attribute Representations and Operations 4. Experiments 5. Conclusion

Motivation
We want to learn a generative process of speech:
1. What are the factors that affect speech generation?
2. How do these factors play a role in speech generation?
3. How can we infer these factors from observed speech?
Why do we want to learn a generative process?
- Synthesis (1, 2)
- Recognition and verification (3)
- Voice conversion and denoising (1, 2, 3)
(Slide figure: an example utterance, "Welcome!")


Generative Model Backgrounds
Shallow generative models:
- Hidden Markov model–Gaussian mixture models (HMM-GMMs)
Deep generative models:
- Generative adversarial networks (GANs) model p(x|z) and bypass the inference model (generator / discriminator)
- Autoregressive models (e.g., WaveNets) model p(x_t | x_1, ..., x_{t-1}) and abstain from using latent variables
- Variational autoencoders (VAEs) learn an inference model and a generative model jointly

Variational Autoencoders (VAEs)
Define a probabilistic generative process between observation x and latent variable z:
- p(z), p(x|z), and q(z|x) are defined to be in some parametric family
- We define p(x|z) (decoder) and q(z|x) (encoder) to be diagonal Gaussians, whose parameters (mean and variance) are computed by neural networks
- p(z) is defined to be an isotropic Gaussian with unit variance
(Diagram: x → Encoder q(z|x) → (μ_z, σ_z) → z → Decoder p(x|z) → (μ_x, σ_x))
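For reference, the training objective referred to later as the variational lower bound takes the standard VAE form (textbook material, not transcribed from the slides):

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\big(q_\phi(z|x) \,\|\, p(z)\big)$$

With $q_\phi(z|x) = \mathcal{N}(\mu_z, \mathrm{diag}(\sigma_z^2))$ and $p(z) = \mathcal{N}(0, I)$, the KL term has the closed form $\frac{1}{2}\sum_d \left(\mu_{z,d}^2 + \sigma_{z,d}^2 - \log \sigma_{z,d}^2 - 1\right)$, and sampling uses the reparameterization $z = \mu_z + \sigma_z \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.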

Convolutional Neural Network Architecture
Encoder q(z|x): x → Conv1 → Conv2 → Conv3 → FC1 → (μ_z, σ_z) → Gaussian sampling → z
Decoder p(x|z): z → FC2 → FC3 (+ reshape) → T-conv1 → T-conv2 → T-conv3 → (μ_x, σ_x)
*T-conv stands for transposed convolution
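A minimal PyTorch sketch of this shape of model follows. The layer counts mirror the slide (three convolutions plus fully connected layers on each side); the channel sizes, kernel sizes, latent dimension, and the omission of the decoder's σ_x output are illustrative assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    def __init__(self, n_freq=80, n_frames=20, z_dim=64):
        super().__init__()
        # Encoder: Conv1-3 + FC1 predicting the mean and log-variance of q(z|x)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        h_freq, h_time = (n_freq + 7) // 8, (n_frames + 7) // 8  # 3x stride-2
        feat = 128 * h_freq * h_time
        self.fc_mu = nn.Linear(feat, z_dim)
        self.fc_logvar = nn.Linear(feat, z_dim)
        # Decoder: FC (+ reshape) + T-conv1-3 predicting mu_x; the slides'
        # decoder also predicts sigma_x, omitted here for brevity
        self.fc_dec = nn.Linear(z_dim, feat)
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (128, h_freq, h_time)),
            nn.ConvTranspose2d(128, 64, 3, 2, 1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 3, 2, 1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, 2, 1, output_padding=1),
        )

    def forward(self, x):  # x: (batch, 1, n_freq, n_frames)
        h = self.encoder(x)
        mu_z, logvar_z = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
        mu_x = self.decoder(self.fc_dec(z))
        return mu_x[..., :x.size(-1)], mu_z, logvar_z  # crop time overshoot
```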

Experiment Setup
- Dataset: TIMIT (5.4 hr; standard 462-speaker sx/si training set)
- Unsupervised training (i.e., no use of phonetic transcription)
- Speech segment dimension: T = 20 frames (with shift of 8 frames); F = 80 (FBank) or 200 (log magnitude spectrogram)
- Training objective: variational lower bound
- Optimizer: Adam
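A minimal training sketch under these settings, assuming the ConvVAE class above; the random tensors stand in for a loader of fixed-length FBank segments, and a unit output variance is assumed so the reconstruction term reduces to a squared error.

```python
import torch

model = ConvVAE(n_freq=80, n_frames=20)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
batches = [torch.randn(16, 1, 80, 20) for _ in range(100)]  # stand-in data

for x in batches:
    mu_x, mu_z, logvar_z = model(x)
    # Negative variational lower bound (up to constants):
    # squared-error reconstruction + closed-form Gaussian KL
    recon = 0.5 * ((x - mu_x) ** 2).sum(dim=(1, 2, 3)).mean()
    kl = -0.5 * (1 + logvar_z - mu_z ** 2 - logvar_z.exp()).sum(dim=1).mean()
    loss = recon + kl
    opt.zero_grad()
    loss.backward()
    opt.step()
```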


Speech Reconstruction Illustration
The trained VAE is able to reconstruct speech segments.
(Figure: examples from 10 instances of /aa/, /sh/, and /p/, sampled at the center of the segment)

Latent Attribute Representations
The VAE is encouraged to model independent factors using different dimensions, because the prior is a diagonal Gaussian.
We want to associate particular dimensions with different physical attributes:
- Latent speaker representation: dimensions encoding speaker identity
- Latent phone representation: dimensions encoding phonetic content
(Diagram: a latent vector such as [-0.7, 0.3, -0.2, 1.5, 0.4] between encoder and decoder, with the leading dimensions grouped as speaker identity and the remaining dimensions as phone)

Latent Attribute Representations
Factors have normal distributions along their associated dimensions.
For example, to estimate the latent phone representation for /aa/, we can take the mean of the latent representations of /aa/ segments across speakers:
- Speaker A /aa/: [-0.7, 0.3, -0.2, 1.5, 0.4]
- Speaker B /aa/: [1.1, -0.4, -0.2, 1.5, 0.4]
- Speaker C /aa/: [0.4, 0.1, -0.2, 1.5, 0.4]
Averaging yields a latent phone representation for /aa/ whose phone dimensions are [-0.2, 1.5, 0.4], while the speaker dimensions average out.
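A NumPy sketch of this estimate; `z_means` and `labels` are assumed inputs holding per-segment encoder means and the corresponding attribute labels.

```python
# Estimate a latent attribute representation by averaging the encoder
# means of all segments that share the attribute value (e.g. all /aa/).
import numpy as np

def attribute_representation(z_means, labels, value):
    """Mean latent vector over segments whose label equals `value`."""
    mask = np.asarray(labels) == value
    return np.asarray(z_means)[mask].mean(axis=0)

# e.g. mu_aa = attribute_representation(z_means, phone_labels, "aa")
```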

Empirical Study of the Assumptions
We compute latent attribute representations for two kinds of attributes (latent speaker attributes and latent phone attributes), then compute the absolute cosine similarity between every pair of latent attribute representations. Similarities between speaker representations and phone representations are low, suggesting the two attributes occupy roughly orthogonal latent subspaces.
(Figure: example latent speaker and phone attribute vectors and their pairwise similarities)
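A sketch of that similarity check, assuming the attribute representations have already been estimated as above.

```python
# Absolute cosine similarity between every pair of latent attribute
# representations (e.g. speaker vectors vs. phone vectors).
import numpy as np

def abs_cosine(u, v):
    return abs(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_matrix(reps_a, reps_b):
    """|cos| between each representation in reps_a and each in reps_b."""
    return np.array([[abs_cosine(u, v) for v in reps_b] for u in reps_a])

# Low values between speaker and phone representations would indicate the
# two attributes occupy roughly orthogonal latent subspaces.
```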

Arithmetic Operations to Modify Attributes
The result suggests that we can modify a specific attribute without altering the others.
Suppose we want to convert the voice of a segment from speaker A to speaker B. We subtract speaker A's latent speaker representation and add speaker B's:
z' = z − μ_speaker A + μ_speaker B
For example, encoding a segment from speaker A gives z = [-0.7, 0.3, -0.2, 1.5, 0.4]; after the operation the speaker dimensions become [1.1, -0.4] while the phone dimensions [-0.2, 1.5, 0.4] are unchanged, and z' is passed to the decoder.
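A sketch of this arithmetic, using the slide's illustrative numbers; the zeros on the phone dimensions of the speaker representations reflect the assumption that phone content averages out when estimating a speaker's representation.

```python
# Attribute modification: subtract the source speaker's latent
# representation and add the target speaker's, leaving the phone
# dimensions (ideally) untouched.
import numpy as np

def convert_speaker(z, mu_src_speaker, mu_tgt_speaker):
    return z - mu_src_speaker + mu_tgt_speaker

z = np.array([-0.7, 0.3, -0.2, 1.5, 0.4])     # segment from speaker A
mu_a = np.array([-0.7, 0.3, 0.0, 0.0, 0.0])   # speaker A representation
mu_b = np.array([1.1, -0.4, 0.0, 0.0, 0.0])   # speaker B representation
print(convert_speaker(z, mu_a, mu_b))         # [ 1.1 -0.4 -0.2  1.5  0.4]
```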


Magnitude Spectrogram Reconstruction
The Griffin-Lim algorithm is used for waveform reconstruction, iteratively estimating the phase from the magnitude spectrogram.
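A minimal sketch of this step using librosa's Griffin-Lim implementation; the STFT parameters here are generic placeholders, not the paper's exact settings.

```python
# Reconstruct a waveform from a log magnitude spectrogram via Griffin-Lim.
import numpy as np
import librosa

def spectrogram_to_audio(log_mag, n_fft=400, hop_length=160, n_iter=60):
    mag = np.exp(log_mag)  # undo the log compression
    return librosa.griffinlim(mag, n_iter=n_iter,
                              hop_length=hop_length, n_fft=n_fft)
```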

Modify the Phoneme
Modifying /aa/ to /ae/, F2 goes up (back vowel → front vowel).
(Figure: three example /aa/ → /ae/ spectrogram pairs)

Modify the Phoneme
Modifying /s/ to /sh/, the cutoff frequency goes down (alveolar → palatal strident).
(Figure: three example /s/ → /sh/ spectrogram pairs)

Modify the Speaker
Modifying a female speaker to a male speaker, the pitch decreases.

Modify the Speaker
Modifying a male speaker to a female speaker, the pitch increases.

Modify the Speaker for an Entire Utterance
- We choose an utterance from a male speaker (madc)
- Modify it to another male speaker (mabc) and to a female speaker (fajw)
- Each speaker has only 8 utterances in the set, about 4 s per utterance
- We estimate the latent speaker representation using only 3 s of speech

Modify the Speaker for An Entire Utterance Original Speaker (top) original spectrogram, (bottom) reconstructed spectrogram

Modify the Speaker for An Entire Utterance Convert to Speaker mabc (top) original spectrogram, (bottom) modified spectrogram

Modify the Speaker for An Entire Utterance Convert to Speaker fajw (top) original spectrogram, (bottom) modified spectrogram

Quantitative Evaluation
We train discriminators for phone classification and speaker classification and use their posteriors as the quantitative metric: the discriminators provide a machine "mean opinion score" on the two attributes. After modification:
- The posterior of the target attribute increases, and the posterior of the source attribute decreases
- The posteriors of irrelevant attributes remain unchanged
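A sketch of this evaluation protocol; `speaker_clf` and `phone_clf` stand for the trained discriminators (assumed to return class logits), and the index arguments name the source speaker, target speaker, and the phone that should stay fixed.

```python
# Compare classifier posteriors on original vs. modified segments.
import torch

@torch.no_grad()
def posterior(clf, x, label_idx):
    """Mean softmax posterior the classifier assigns to a given class."""
    return torch.softmax(clf(x), dim=-1)[:, label_idx].mean().item()

def evaluate_conversion(x_orig, x_mod, speaker_clf, phone_clf, src, tgt, phn):
    return {
        "src_speaker": (posterior(speaker_clf, x_orig, src),
                        posterior(speaker_clf, x_mod, src)),   # should drop
        "tgt_speaker": (posterior(speaker_clf, x_orig, tgt),
                        posterior(speaker_clf, x_mod, tgt)),   # should rise
        "phone":       (posterior(phone_clf, x_orig, phn),
                        posterior(phone_clf, x_mod, phn)),     # ~unchanged
    }
```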


Conclusion and Future Work
- We present a CNN-VAE to model the generation process of speech segments.
- The framework leverages vast quantities of unannotated data to learn a general speech analyzer and a general speech synthesizer.
- We demonstrate qualitatively and quantitatively the ability to modify speech attributes.
- We have applied the modification operation to data augmentation for ASR and achieved significant improvement for domain adaptation (submitted to ASRU).
- For future work, we plan to investigate the use of VAEs for voice conversion and speech denoising without parallel training data.

Thanks for Listening. Q&A?
Paper, slides, samples, and follow-up works can be found at http://people.csail.mit.edu/wnhsu/