An Analysis-by-Synthesis Approach to Vocal Tract Modeling for Robust Speech Recognition


Ziad Al Bawab (ziada@cs.cmu.edu)
Electrical and Computer Engineering, Carnegie Mellon University
Work in collaboration with: Bhiksha Raj, Lorenzo Turicchia (MIT), and Richard M. Stern
IBM Research, October 9, 2009

Talk Outline
I. Introduction
II. Deriving vocal tract shapes from EMA data using a physical model
III. Analysis-by-synthesis framework
IV. Dynamic articulatory model
V. Conclusion

Conventional Generative Model

The phone sequence SPEECH: /S/-/P/-/IY/-/CH/ is modeled by maximum likelihood as a sequence of HMM states (S1, S2, ..., Sn), each emitting acoustic feature vectors (F1, F2, ..., F13).

[Figure: waveform and spectrogram of the word "speech" (amplitude and frequency vs. time). Image: Wikipedia.]

The Ultimate Generative Model

Speech is actually generated by the vocal tract! In this physical generative model, the phone sequence SPEECH: /S/-/P/-/IY/-/CH/ drives articulatory trajectories (e.g., lip separation, tongue tip) through articulatory targets (S11, S21, ..., S1n, S2n), and a physical model of sound generation maps them to acoustic features (F1, F2, ..., F13).

[Figure: waveform and spectrogram of the word "speech" (amplitude and frequency vs. time), annotated with articulatory trajectories and targets.]

The Missing Science

- We need a framework that explicitly models the articulatory space (configurations and dynamics) and can thereby alleviate problems such as coarticulation, articulatory target undershoot, asynchrony of the articulators, and pronunciation variation.
- Current approaches to articulatory modeling (Livescu, Deng, Erler, and others) attempt to learn and apply constraints inferred from surface-level acoustic observations or from linguistic sources.
- We need to learn from real articulatory data.
- We need a mapping from the articulatory space to the acoustic domain based on the physical generative process; such a mapping is more natural (i.e., accurate) and can generalize better than one learned statistically (i.e., from parallel articulatory and acoustic data).

MOCHA Database

[Figure: the MOCHA recording apparatus and examples of the raw articulatory measurements.]

MOCHA EMA Data

[Figure: EMA coil positions in the midsagittal plane (x and y in cm): upper lip (UL), lower lip (LL), upper incisor (UI), lower incisor (LI), tongue tip (TT), tongue body (TB), tongue dorsum (TD), and velum (VL).]

Maeda Parameters

[Figure: Maeda's geometric model of the vocal tract (upper palate, lips, glottis), controlled by the 7 Maeda parameters P1-P7, which map to area functions: a chain of acoustic tubes with areas A1-A36 and lengths L1-L36.]

Articulatory Speech Synthesis

The area functions (acoustic tubes with areas A1-A36 and lengths L1-L36) are passed through the Sondhi and Schroeter model, which converts the area of each section to a transfer function and composes them into the vocal tract (VT) transfer function.
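To make the tube-to-transfer-function step concrete, here is a minimal numerical sketch that chains lossless cylindrical tube sections with 2x2 chain (ABCD) matrices. It omits the losses, yielding walls, and radiation load that the Sondhi and Schroeter model includes, so it illustrates the idea rather than reproducing their model; all names and constants are illustrative.

```python
import numpy as np

RHO = 1.204e-3   # density of air in g/cm^3 (cgs units, to match areas in cm^2)
C = 35000.0      # speed of sound in cm/s

def tube_transfer_function(areas, lengths, freqs_hz):
    """Volume-velocity transfer function U_lips / U_glottis of a chain of
    lossless cylindrical tube sections, via 2x2 chain (ABCD) matrices."""
    H = np.empty(len(freqs_hz), dtype=complex)
    for i, f in enumerate(freqs_hz):
        beta = 2.0 * np.pi * f / C            # wavenumber at this frequency
        K = np.eye(2, dtype=complex)
        for A, L in zip(areas, lengths):      # compose sections glottis -> lips
            Zc = RHO * C / A                  # characteristic impedance of section
            M = np.array([[np.cos(beta * L), 1j * Zc * np.sin(beta * L)],
                          [1j * np.sin(beta * L) / Zc, np.cos(beta * L)]])
            K = K @ M
        H[i] = 1.0 / K[1, 1]                  # ideal open end: pressure at lips = 0
    return H

# Example: a uniform 17.5 cm tract split into 36 sections (schwa-like)
# shows resonance peaks near c/4L = 500, 1500, 2500 Hz, ...
freqs = np.linspace(50.0, 4000.0, 400)
H = tube_transfer_function(np.full(36, 5.0), np.full(36, 17.5 / 36), freqs)
```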

Deriving Realistic Vocal Tract Shapes from ElectroMagnetic Articulograph Data via Geometric Adaptation and Profile Fitting

Problem overview: synthesize speech solely from EMA data, using knowledge of the geometry of the vocal tract and knowledge of the physics of the speech generation process.

Approach followed: compute realistic vocal tract shapes from EMA data by
1. adapting Maeda's geometric vocal tract model to the EMA data, and
2. searching for the best fit of the tongue and lip profile contours to the EMA data;
then synthesize speech from the vocal tract shapes by
3. articulatory synthesis using the Sondhi and Schroeter model.

1. Vocal Tract Adaptation

[Figure: adapting Maeda's grid to the speaker (x and y in cm): upper-wall origin and shift, angle θ, lip separation d, upper incisor, tongue parameters along the numbered grid lines, inner wall, and larynx edges.]

Adaptation Result [1]

[Figure: estimated EMA upper wall overlaid on the Maeda upper wall (x and y in cm), with the coil positions (UL, UI, VL, TT, TB, TD, LL, LI), the inner wall, and the larynx.]

[1] Z. Al Bawab, L. Turicchia, R. M. Stern, and B. Raj, "Deriving Vocal Tract Shapes from ElectroMagnetic Articulograph Data via Geometric Adaptation and Matching," Interspeech, Brighton, UK, September 2009.

2. Search Results

[Figures: fitted tongue and lip profile contours, with the EMA points in purple, for the phone /II/ as in "seesaw" (/S-II-S-OO/) and the phone /@@/ as in "working" (/W-@@-K-I-NG/).]

3. Synthesis Results

[Figures: acoustic-tube models for the phone /II/ as in "seesaw" (/S-II-S-OO/) and the phone /@@/ as in "working" (/W-@@-K-I-NG/).]

Creating a Realistic Codebook and Adapted Articulatory Transfer Functions

Each codeword consists of the seven Maeda parameters plus the velum area: (p1, p2, p3, p4, p5, p6, p7, VA).

Projecting the Means of the 44 Phones' Codewords Using Multi-Dimensional Scaling (MDS)

[Figure: 2-D MDS projection of the per-phone codeword means for the 44 MOCHA phones; phones with similar articulations (e.g., the labials /P/, /B/, /M/ and the rounded vowels /OO/, /OU/, /OI/) fall close together.]
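A minimal sketch of this projection step, assuming metric MDS on Euclidean distances (the talk does not specify the MDS variant) and using scikit-learn's implementation:

```python
import numpy as np
from sklearn.manifold import MDS

def project_codeword_means(phone_means):
    """2-D MDS projection of the per-phone codeword means, as in the figure.
    phone_means: (44, 7) array of mean Maeda parameter vectors, one per phone."""
    return MDS(n_components=2, random_state=0).fit_transform(phone_means)
```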

Deriving Analysis-by-Synthesis Features [2]

Compare signals generated from a codebook of valid vocal tract configurations to the incoming signal to produce a distortion feature vector: each codeword (Maeda parameters P1-P7), together with the energy and pitch of the incoming speech, drives the synthesizer; the MFCCs of the incoming and synthesized signals are then compared via the Mel-cepstral distortion, yielding the distortion feature vector (d1, ..., dN).

[2] Z. Al Bawab, B. Raj, and R. M. Stern, "Analysis-by-Synthesis Features for Speech Recognition," IEEE International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, Nevada, April 2008.
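As a sketch of this step, assuming the MFCCs synthesized from the codewords are precomputed for the current frame, the distortion vector is just the Mel-cepstral distortion (defined on a later slide) from the incoming frame to each codeword's synthesis:

```python
import numpy as np

def mcd(c_in, c_syn):
    """Mel-cepstral distortion over coefficients 1..12 (C0 excluded)."""
    return (10.0 / np.log(10.0)) * np.sqrt(
        2.0 * np.sum((c_in[1:13] - c_syn[1:13]) ** 2))

def distortion_features(incoming_mfcc, codeword_mfccs):
    """Distortion feature vector (d1, ..., dN) for one frame.
    incoming_mfcc:  (13,) MFCC of the observed frame;
    codeword_mfccs: (N, 13) MFCCs of the signals synthesized from the
    N codewords (assumed precomputed for this frame)."""
    return np.array([mcd(incoming_mfcc, c) for c in codeword_mfccs])
```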

Mixture Probability Density Function

For a given frame u, the output probability of each state s in the HMM is a mixture density over a set of M codewords:

$P(x_u \mid s) = \sum_{j=1}^{M} w_{j,s}\, P(x_u \mid cd_j, s)$

where $w_{j,s}$ is the weight of codeword j in state s, and $P(x_u \mid cd_j, s)$ is the likelihood of the input given the codeword and the state.
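A log-domain sketch of this mixture, assuming the exponential codeword density introduced on the update-equations slide below (the per-state conditioning of the density is elided here):

```python
import numpy as np
from scipy.special import logsumexp

def log_output_prob(log_w, lam, d):
    """log P(x_u | state) as a mixture over M codewords, assuming the
    exponential codeword density lam_j * exp(-lam_j * d_uj).
    log_w: (M,) log mixture weights w_{j,s} for this state;
    lam:   (M,) rate parameters;
    d:     (M,) distortions d_uj of frame u to each codeword's synthesis."""
    return logsumexp(log_w + np.log(lam) - lam * d)
```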

HMM Framework

[Figure: diagram of the HMM framework.]

Priors from EMA

[Figure: EMA measurement trajectories (TT, TB, TD) over time, with each frame assigned to a codeword sequence (cd1, cd2, cd1, cd3, cd2); these codeword occupancies supply the priors.]
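A sketch of how such priors can be extracted, assuming each EMA frame has already been converted to Maeda parameters by the fitting step above (function and variable names are illustrative):

```python
import numpy as np

def priors_from_ema(artic_frames, codebook):
    """Prior codeword weights from articulatory data: assign each
    Maeda-parameter frame derived from EMA to its nearest codeword and
    use the normalized occupancy counts as initial mixture weights.
    artic_frames: (T, 7) Maeda parameters fit to the EMA frames of a phone;
    codebook:     (N, 7) Maeda parameter codewords."""
    dists = np.linalg.norm(
        artic_frames[:, None, :] - codebook[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)                # codeword index per frame
    counts = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return counts / counts.sum()                  # prior weight per codeword
```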

Update Equations

For each phone, we estimate the mixture weights $w_j$ and the rate parameters $\lambda_j$ for each state, with the codeword likelihood modeled as an exponential density over the distortion:

$P(x_u \mid cd_j) = \lambda_j \exp(-\lambda_j\, d_{uj})$
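A minimal sketch of the corresponding EM re-estimation for a single state, assuming the exponential density above (the per-state conditioning and the EMA priors are omitted):

```python
import numpy as np

def em_reestimate(w, lam, D, n_iter=10, eps=1e-12):
    """Re-estimate mixture weights and exponential rates for one state.
    w:   (M,) mixture weights; lam: (M,) rates;
    D:   (U, M) distortion matrix d_uj over the state's frames."""
    for _ in range(n_iter):
        # E-step: responsibility of codeword j for frame u
        log_p = np.log(w + eps) + np.log(lam) - lam * D      # (U, M)
        log_p -= log_p.max(axis=1, keepdims=True)            # stabilize
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: weight = average responsibility; rate = 1 / weighted mean
        n_j = gamma.sum(axis=0)
        w = n_j / D.shape[0]
        lam = n_j / np.clip((gamma * D).sum(axis=0), eps, None)
    return w, lam
```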

Weights for Phone /OU/ Projected on the Codewords' MDS Space

[Figure, four panels over the 2-D MDS space: priors from EMA; weights initialized from EMA; weights with flat initialization; weights initialized from EMA plus adaptation.]

Experimental Setup

- Segmented phone recognition on the MOCHA database (9 speakers, 460 TIMIT British English utterances per speaker, 44 phones)
- Articulatory codebook composed of 1024 distinct Maeda configurations derived from MOCHA EMA data
- LDA dimensionality reduction of the distortion vector to 20 features per frame, with the phones as the classes of the transformation (see the sketch below)
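The LDA step, as a sketch using scikit-learn (array names are illustrative):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def reduce_distortion_features(dist_feats, phone_labels, n_out=20):
    """Project 1024-dim distortion vectors down to n_out discriminant
    features, with the 44 phone classes driving the transformation.
    dist_feats: (n_frames, 1024); phone_labels: (n_frames,) in 0..43."""
    lda = LinearDiscriminantAnalysis(n_components=n_out)
    return lda.fit_transform(dist_feats, phone_labels)
```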

Experimental Setup (Cont'd)

The distortion measure used is the Mel-cepstral distortion:

$\mathrm{MCD}(C_{\mathrm{incoming}}, C_{\mathrm{synth}}) = \frac{10}{\ln 10} \sqrt{2 \sum_{k=1}^{12} \left( C_{\mathrm{incoming}}(k) - C_{\mathrm{synth}}(k) \right)^2}$

Each phone c is classified according to:

$\hat{c} = \arg\max_{c}\; P(c)\, P(\mathrm{MFCC} \mid c)^{\alpha}\, P(\mathrm{DF} \mid c)^{(1-\alpha)}$
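The stream-combination rule in log form, as a short sketch (input names are illustrative):

```python
import numpy as np

def classify_phone(log_prior, log_lik_mfcc, log_lik_df, alpha):
    """c_hat = argmax_c  P(c) * P(MFCC|c)^alpha * P(DF|c)^(1-alpha),
    computed in the log domain for numerical stability.
    Inputs are length-44 arrays, one entry per MOCHA phone."""
    score = log_prior + alpha * log_lik_mfcc + (1.0 - alpha) * log_lik_df
    return int(np.argmax(score))
```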

Summary of Phone Error Rate Results [3]

Number of test tokens: 14,352 (fsew0), 14,302 (msak0), 28,654 (both).

Features (dimension) | Topology / Observation prob. / Init | fsew0 | msak0 | Both | Improvement (relative)
MFCC + CMN (13) | 3S-128M-HMM, Gaussian/VQ | 61.6% | 55.9% | 58.8% | --
Dist Feat (1024), prob. combination α = 0.2 | 3S-1024M-HMM, Exponential/Flat, sparsity = 21% | 57.6% | 53.7% | 55.7% | 5.3%
Dist Feat (1024), prob. combination α = 0.2 | 3S-1024M-HMM, Exponential/EMA, sparsity = 51% | 58.3% | 53.9% | 56.1% | 4.6%
Adapted Dist Feat (1024), prob. combination α = 0.25 | 3S-1024M-HMM, Exponential/EMA, sparsity = 51% | 58.4% | 53.1% | 55.7% | 5.3%
Dist Feat + LDA + CMN (20), prob. combination α = 0.6 | 3S-128M-HMM, Gaussian/VQ, sparsity = 0% | 54.9% | 49.8% | 52.4% | 10.9%

[3] Z. Al Bawab, B. Raj, and R. M. Stern, "A Hybrid Physical and Statistical Dynamic Articulatory Framework Incorporating Analysis-by-Synthesis for Improved Phone Classification," submitted to ICASSP 2010, Dallas, Texas.

Summary of Our Contribution

Aspect | Conventional HMM | Production-based HMM
States | Abstract, no physical meaning | Real articulatory configurations
Output observation probability | Gaussian probability using acoustic features | Exponential probability based on the analysis-by-synthesis distortion features
Adaptation | VTLN, MLLR, MAP | Vocal tract geometric model adaptation
Transition probability | Based on acoustic observation | Can be learned from articulatory dynamics

Conclusion

- A model that mimics the actual physics of the vocal tract yields better classification performance.
- We developed a hybrid physical and statistical dynamic articulatory framework that incorporates analysis-by-synthesis for improved phone classification.
- Recent databases open new horizons for better understanding articulatory phenomena.
- Current advances in computation and machine learning algorithms facilitate the integration of physical models into large-scale systems.

THANK YOU