Speaker Transformation Algorithm using Segmental Codebooks (STASC) Presented by A. Brian Davis

Speaker Transformation. Goal: map the acoustic properties of one speaker onto another. Uses: personification of text-to-speech systems, multimedia, and as a preprocessing step for speech recognition (reducing speaker variability). Practical?

Steps Involved. Training phase: given speech input from the source and target, form a spectral transformation. Inputs/outputs to the transformation: speech segmented into small chunks (frames), formants, LPC cepstrum coefficients, others (excitation)? Can we generalize the behavior of the transform? Codebooks/codewords, vector quantization.

Vector quantization. Assign vectors to a discrete set of values (K-means). For STASC we also want the average of all vectors assigned to a class; K-means gives us this for free. (Illustration shamelessly borrowed from Dr. Gutierrez's pattern recognition slides.)
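As a rough illustration of the codebook idea, the sketch below builds a codebook with K-means over feature frames. It assumes scikit-learn and a `frames` array of shape (n_frames, dim); the codeword count of 64 is an arbitrary assumption, not a value from the paper.

```python
from sklearn.cluster import KMeans

def build_codebook(frames, n_codewords=64):
    """Vector-quantize feature frames; the K-means centroids double as the per-class averages STASC needs."""
    km = KMeans(n_clusters=n_codewords, n_init=10, random_state=0).fit(frames)
    # codebook: one average vector per class; labels: codeword index assigned to each frame
    return km.cluster_centers_, km.labels_
```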

LSFs. Line spectral frequencies are derived (losslessly) from LPCs; since we can convert back and forth, we can create speech from LSFs. They relate to formant frequencies, are numerically stable, and are used in STASC to represent the speakers' vocal tracts. Why use them instead of MFCCs?
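A minimal numpy sketch of the textbook LPC-to-LSF conversion (roots of the symmetric/antisymmetric polynomials); this is a generic construction, not code from the paper.

```python
import numpy as np

def lpc_to_lsf(a):
    """LPC coefficients a = [1, a1, ..., ap] -> line spectral frequencies in radians, sorted in (0, pi)."""
    a = np.asarray(a, dtype=float)
    ext = np.concatenate([a, [0.0]])
    P = ext + ext[::-1]          # symmetric (sum) polynomial
    Q = ext - ext[::-1]          # antisymmetric (difference) polynomial
    roots = np.concatenate([np.roots(P), np.roots(Q)])
    angles = np.angle(roots)
    return np.sort(angles[(angles > 1e-9) & (angles < np.pi - 1e-9)])  # drop the trivial roots at z = +/-1
```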

STASC (first method). Assumes an orthographic transcription (what is said, in writing). From the transcription, phonemes are retrieved and speech segments are assigned phonemes based on the transcription. MFCCs and delta-MFCCs for each segment (frame) are passed into an HMM, and the most likely path is found with the Viterbi algorithm. LSFs are calculated per frame and labeled with the phoneme from the HMM. Phoneme centroids are then calculated (the average LSF values over all vectors labeled with a particular phoneme), giving a one-to-one mapping.
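The centroid step can be sketched as below, assuming we already have per-frame LSF vectors and their phoneme labels from the HMM/Viterbi alignment (variable names are illustrative).

```python
import numpy as np
from collections import defaultdict

def phoneme_centroids(lsf_frames, phoneme_labels):
    """Average LSF vector over all frames labeled with the same phoneme."""
    buckets = defaultdict(list)
    for lsf, ph in zip(lsf_frames, phoneme_labels):
        buckets[ph].append(lsf)
    return {ph: np.mean(vecs, axis=0) for ph, vecs in buckets.items()}
```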

Second method (better). No orthographic transcription. Intuitively, the HMM states in the first method did not need to correspond to phonemes. Requires the speakers to speak the same (hopefully phonetically balanced) sentences, i.e., sentences whose phones are distributed approximately as in normal speech. Because there are fewer restrictions, some extra processing of each speaker's speech is needed: normalize the root-mean-squared energy and remove silence before/after the speech.
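A rough sketch of that preprocessing: a fixed RMS target and a simple frame-energy silence gate. The threshold, target level, and frame sizes are arbitrary assumptions, not values from the paper.

```python
import numpy as np

def normalize_and_trim(signal, target_rms=0.1, silence_db=-40.0, frame=400, hop=160):
    """Normalize a waveform to a fixed RMS energy and strip leading/trailing silence."""
    signal = signal * (target_rms / (np.sqrt(np.mean(signal ** 2)) + 1e-12))
    # frame-level energies in dB relative to the target RMS
    energies = np.array([
        20 * np.log10(np.sqrt(np.mean(signal[i:i + frame] ** 2)) / target_rms + 1e-12)
        for i in range(0, len(signal) - frame, hop)
    ])
    voiced = np.where(energies > silence_db)[0]
    if len(voiced) == 0:
        return signal
    start, end = voiced[0] * hop, voiced[-1] * hop + frame
    return signal[start:end]
```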

Second method transformation. An HMM is trained on each sentence; the training data are LSF vectors from the source speaker's speech segments, and the number of states corresponds to the sentence length. Segmental k-means separates the speech segments into clusters, and the Baum-Welch algorithm trains the HMM on the cluster averages (with a uniform covariance matrix). For the source and target speech segments, the Viterbi algorithm assigns segments to states; the transformation maps segments from a state in the source to the corresponding state in the target via the state centroids.
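A compressed sketch of the idea using the hmmlearn package (an assumption on my part): the paper's segmental k-means initialization and uniform covariances are glossed over, and states that receive no target frames are not handled.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # assumed third-party package

def sentence_hmm_centroids(source_lsf, target_lsf, n_states):
    """Fit a sentence HMM on the source LSF frames, align both speakers to its states,
    and return per-state LSF centroids for source and target."""
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    hmm.fit(source_lsf)                     # Baum-Welch training on the source frames
    src_states = hmm.predict(source_lsf)    # Viterbi state assignment
    tgt_states = hmm.predict(target_lsf)
    src_cent = np.array([source_lsf[src_states == s].mean(axis=0) for s in range(n_states)])
    tgt_cent = np.array([target_lsf[tgt_states == s].mean(axis=0) for s in range(n_states)])
    return src_cent, tgt_cent
```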

Excitation characteristic. From previous papers we know the excitation greatly influences the perception of a speaker, and it is not trivial to transfer: it is very different for voiced and unvoiced sounds. Use the current codebooks to transfer excitation: calculate the short-time average magnitude spectrum of the excitation signal for each speech unit.
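One way to compute that average spectrum, assuming we have the per-frame LPC coefficients (inverse filtering each frame yields the excitation/residual); scipy and the 512-point FFT size are assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def average_excitation_spectrum(frames, lpc_per_frame, n_fft=512):
    """Short-time average magnitude spectrum of the LPC residual (excitation) over one speech unit."""
    spectra = []
    for frame, a in zip(frames, lpc_per_frame):
        residual = lfilter(a, [1.0], frame)              # inverse filter A(z) extracts the excitation
        spectra.append(np.abs(np.fft.rfft(residual, n_fft)))
    return np.mean(spectra, axis=0)
```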

Codebook weight estimation. Assume we have an LSF vector w labeled with an HMM state, and the centroids S_i of each HMM state. Algorithm: calculate the distances d_i from w to each S_i. The distance is perceptual: closely spaced LSFs correspond to formant locations and are given higher weight. From the distances, calculate weights v_i and represent w as a linear combination of the S_i. Minimize the error?
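The distance-to-weight step could look like the sketch below. The normalized exponential of the negative distances is one common choice and only an assumption here, as is treating the perceptual weighting as already folded into d.

```python
import numpy as np

def codebook_weights(d, gamma=1.0):
    """Turn distances d_i from the LSF vector w to each state centroid S_i into normalized weights v_i."""
    d = np.asarray(d, dtype=float)
    v = np.exp(-gamma * d)      # smaller distance -> larger weight
    return v / v.sum()
```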

Gradient Descent. Find locally optimal weights that minimize the error between the reconstructed LSFs and the actual LSFs. Algorithm: find the gradient of the (perceptually weighted) difference between the reconstruction and the actual vector, scale the gradient by a small value (controls speed of convergence), and add it to the old weights; repeat until the change in weights between iterations is sufficiently small. It was found that only a few weights are given large values, so only the 5 most likely weights are used: a 15% additional reduction in Itakura-Saito distance, 0.4 dB error.
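A bare-bones version of that loop, using a plain least-squares gradient; the paper's perceptual weighting and the keep-only-the-top-5-weights step are omitted from this sketch.

```python
import numpy as np

def refine_weights(w, S, v0, lr=0.01, tol=1e-6, max_iter=1000):
    """Gradient descent on 0.5 * || S.T @ v - w ||^2.

    w  : observed LSF vector, shape (p,)
    S  : centroid matrix, one centroid per row, shape (n_states, p)
    v0 : initial weights, shape (n_states,)
    """
    v = np.array(v0, dtype=float)
    for _ in range(max_iter):
        err = S.T @ v - w           # reconstruction error
        grad = S @ err              # gradient with respect to v
        v_new = v - lr * grad
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```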

Use of weights. Reconstruct an LSF vector (a segment of speech from the source speaker) as a linear combination of the source centroids, then use those same weights with the target's centroids and use the resulting LSFs to reconstruct speech. Other transformations? Excitation spectral characteristics, prosody. New weights could be estimated for each, but why? (Artist's impression.)

Excitation and Vocal Tract. Use the weights to construct an excitation filter: the weighted combination of the codewords' average target excitation magnitude spectra over the corresponding combination of the source excitation magnitude spectra. Use the weights to construct the vocal tract spectrum: convert the transformed LSF vectors to LPCs, giving V_t(ω) = 1 / (1 + Σ_{k=1}^{P} a_k^t e^{-jωk}). Expansion of the formant bandwidths gives unnatural speech.
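A sketch of both constructions, assuming the per-codeword average excitation magnitude spectra are stored as rows of a matrix and scipy is available.

```python
import numpy as np
from scipy.signal import freqz

def excitation_filter(v, tgt_exc_spectra, src_exc_spectra, eps=1e-12):
    """Weighted target-over-source average excitation magnitude spectra (one spectrum per codeword row)."""
    num = (v[:, None] * tgt_exc_spectra).sum(axis=0)
    den = (v[:, None] * src_exc_spectra).sum(axis=0)
    return num / (den + eps)

def vocal_tract_spectrum(lpc, n_freqs=512):
    """All-pole spectrum V(w) = 1 / (1 + sum_{k=1..P} a_k e^{-jwk}) from LPC coefficients [1, a1, ..., aP]."""
    w, h = freqz([1.0], lpc, worN=n_freqs)
    return w, np.abs(h)
```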

Bandwidth modification. Assume the average formant bandwidth values of the target speaker are similar to those of the most likely target codeword (LSF centroid). Since LSFs correspond to formant locations/bandwidths, change the bandwidths by changing the distances between adjacent LSFs. Algorithm: find the LSF entries directly before/after each formant location in the most likely target codeword; calculate the average formant bandwidth; do the same for the corresponding speech-segment LSF vectors; form the ratio of average codeword bandwidth over segment bandwidth; apply the estimated bandwidth ratio to adjust the LSFs of the speech-segment vectors; enforce reasonable bandwidths (average bandwidth of the most likely centroid from target speech over 20...).
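A simplified sketch of the LSF-spacing adjustment: adjacent LSFs are paired and each gap is scaled by a precomputed bandwidth ratio. Locating the formants via the most likely target codeword and deriving the ratio from average bandwidths, as the slide describes, is left out here.

```python
import numpy as np

def scale_lsf_bandwidths(lsf, ratio):
    """Scale the gaps between adjacent LSF pairs (which track formant bandwidths) by `ratio`."""
    lsf = np.asarray(lsf, dtype=float)
    out = lsf.copy()
    for i in range(0, len(lsf) - 1, 2):
        center = 0.5 * (lsf[i] + lsf[i + 1])
        half_gap = 0.5 * (lsf[i + 1] - lsf[i]) * ratio
        out[i], out[i + 1] = center - half_gap, center + half_gap
    return out
```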

Bandwidth modification result

Prosodic Transformation. Pitch, duration, and energy are modified to mimic the target. Dynamic segment lengths: constant for unvoiced segments, 2-3 pitch periods for voiced. Pitch: no weights involved; f0 is modified linearly, matching the target's f0 variance and average.
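The linear f0 mapping amounts to matching the first and second moments; a minimal sketch, assuming unvoiced frames are marked with f0 = 0:

```python
import numpy as np

def transform_f0(f0_source, src_mean, src_std, tgt_mean, tgt_std):
    """Linear f0 mapping that matches the target's mean and variance (voiced frames only)."""
    f0 = np.asarray(f0_source, dtype=float)
    out = f0.copy()
    voiced = f0 > 0
    out[voiced] = (f0[voiced] - src_mean) * (tgt_std / src_std) + tgt_mean
    return out
```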

Duration. Uniform duration matching? Different people pronounce different phonemes differently, so finer control of the duration modification is needed.

Duration modification. Duration is phoneme dependent and context dependent (coarticulation), so triphones are used as the speech units. Find speech-unit centroids (durations) and weights per segment, and form the target duration as a linear combination. Uses? Human transcription.

Energy scale modification. Energy is another characteristic of the speaker. Algorithm (finding an energy scaling factor per time frame): calculate the RMS energy for each codeword; derive weights for representing the scaling factor as a linear combination of (target's RMS energy) over (source's RMS energy); after applying the other modifications, scale the energy.

Evaluations. We want to test the effectiveness of the transformation, on speaker recognition and speech recognition, both objectively and subjectively: an automatic speech recognizer and tests with human subjects.

Objective. Idea: confuse a speaker recognition machine (stacking the deck). Confidence measure of the machine: s = log( P(X | target model) / P(X | source model) ), using 256-mixture Gaussian mixture models over a 24-dimensional feature vector (MFCCs and deltas). Binary-split vector quantization: start with one vector for all data and split it into two in arbitrary directions. HMMs trained on 3 speakers, speaking 1 hour each; 45 minutes for training with different sentences (first method), 15 minutes set aside for testing.
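Roughly what that confidence measure computes, sketched with scikit-learn's GaussianMixture; its EM initialization stands in for the binary-split VQ mentioned on the slide, and the diagonal covariances are an assumption.

```python
from sklearn.mixture import GaussianMixture

def speaker_confidence(train_target, train_source, test_frames, n_mix=256):
    """Utterance confidence s = log P(X | target GMM) - log P(X | source GMM) over MFCC+delta frames."""
    gmm_t = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(train_target)
    gmm_s = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(train_source)
    return gmm_t.score_samples(test_frames).sum() - gmm_s.score_samples(test_frames).sum()
```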

Testing. Multiple speakers, each transformed to another. (Context-dependent results table: target vs. source.)

Objective (2). Sentence HMM: source and target speak the same sentences; 15 minutes of speech from 2 males and 1 female; one male is transformed into the other male and into the female. Phonetic codebooks are also used, to compare the two approaches. Fidelity is measured in terms of: cepstrum, excitation spectrum, RMS energy, F0, and duration. Results show the sentence HMM is better and improves with increased training.

Objective (2)

Subjective. Listening experiments (no cheating): an ABX test with 20 stimuli. A and B are listened to, then X is presented (2-3 word phrases): is X perceptually closer to A or to B in terms of speaker identity? With the HMM-based transformation: 100% for male-to-female, 78% for male-to-male. But is it a garbled mess?

Intelligibility. 150 short nonsense sentences (to prevent inference from context), e.g. "Shipping gray paint hands even." Phone accuracy of natural and transformed speech is compared, with phones retrieved from a dictionary: 93.8% accuracy for transformed speech vs. 93.4% for natural. Is the target speaker more intelligible?