Comparison of Speech Normalization Techniques


Comparison of Speech Normalization Techniques

1. Goals of the project
2. Reasons for speech normalization
3. Speech normalization techniques
4. Spectral warping
5. Test setup with the SPHINX-4 speech recognition system
6. Initial test results

Goals of the project

1. Survey speech normalization techniques.
2. Convey an understanding of how these techniques work.
3. Determine the importance of normalization to speech recognition improvement.
4. Test speech recognition improvement with speakers of varying dialects, especially ESL speakers.
5. Compare the performance of existing techniques.
6. Recommend future work toward two goals:
   A. Accurate conversational speech recognition.
   B. Automatic meeting-minutes transcription generation.

REFERENCES
[1] Li Lee and Richard C. Rose (AT&T Bell Labs, Murray Hill, NJ, USA), "Speaker Normalization Using Efficient Frequency Warping Procedures," Proc. ICASSP-96, May 7-10, 1996.
[2] Donald Bailey, Warwick Allen, and Serge Demidenko, "Spectral Warping Revisited," Proc. Second IEEE International Workshop on Electronic Design, Test and Applications (DELTA '04), 2004.
[3] Sadaoki Furui (Department of Computer Science, Tokyo Institute of Technology), "Steps Toward Natural Human-Machine Communication in the 21st Century," Proc. COST249 Workshop on Voice Operated Telecom Services, 2000.
[4] Steven Wegmann, Don McAllaster, Joel Orloff, and Barbara Peskin (Dragon Systems, Inc.), "Speaker Normalization on Conversational Telephone Speech," Proc. ICASSP-96, 1996.
[5] Alejandro Acero and Richard M. Stern (Dept. of Electrical and Computer Engineering and School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA), "Robust Speech Recognition by Normalization of the Acoustic Space," Proc. ICASSP-91, April 14-17, 1991.
[6] Puming Zhan and Alex Waibel, "Vocal Tract Length Normalization for Large Vocabulary Continuous Speech Recognition," May 1997.
[7] John McDonough, William Byrne, and Xiaoqiang Luo (Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD, USA), "Speaker Normalization with All-Pass Transforms," 1998.
[8] Charles R. Jankowski Jr. and Richard P. Lippmann (MIT Lincoln Laboratory, Lexington, MA, USA), "Comparison of Auditory Models for Robust Speech Recognition," 1998.

Reasons for speech normalization: Automatic Adaptation to Voice Variation

1. Acoustic variation in speech
   A. Physical and psychological condition of the speaker
   B. Telephone, microphones, network conditions
   C. Noise
      1. Background noise (additive)
      2. Reverberation
   D. Other speakers
   E. Speaking styles
   F. Distortion, echoes, dropouts
   G. Speaker characteristics
      1. Pitch
      2. Gender
      3. Dialect
   H. Task/context
      1. Dialogue
      2. Dictation
      3. Interview
   I. Microphone
      1. Distortion
      2. Electrical noise
      3. Directional characteristics

Reasons for speech normalization: Automatic Adaptation to Voice Variation (continued)

1. Acoustic variation in speech (continued)
   J. Spontaneous speech recognition must deal with variations:
      1. Extraneous words
      2. Out-of-vocabulary words
      3. Ungrammatical sentences
      4. Disfluency
      5. Partial words
      6. Repairs
      7. Hesitations
      8. Repetitions
   K. The error rate is 3-4 times greater in speaker-independent (SI) systems than in speaker-dependent (SD) systems.

Speech normalization techniques: Speaker adaptation or normalization

1. Types
   A. Supervised
   B. Unsupervised
      1. Unsupervised, instantaneous, and incremental speaker adaptation combined with automatic detection of speaker changes is ideal.
2. Many phonemes must be adapted using a limited number of utterances.
   A. Ergo, adequate modeling of speaker-to-speaker variability is required.
3. Main methods of dealing with voice variations
   A. Microphone
      1. Close-talking microphone
      2. Microphone array
   B. Analysis and feature extraction
      1. Auditory models
         a. EIH: Ensemble Interval Histogram
         b. SMC: Speech and Multimedia Communication
         c. PLP: Perceptual Linear Predictive speech analysis

Speech normalization techniques: Speaker adaptation or normalization (continued)

3. Main methods of dealing with voice variations (continued)
   C. Feature-level normalization/adaptation
      1. Adaptive filtering
      2. Noise subtraction
      3. Comb filtering
      4. Spectral mapping
      5. Cepstral mean normalization (sketched after this list)
      6. Delta cepstra
      7. RASTA
   D. Model-level normalization/adaptation (input is reference templates/models)
      1. Noise addition
      2. HMM (de)composition (PMC)
      3. Model transformation (MLLR)
      4. Bayesian adaptive learning
   E. Distance/similarity measures
   F. Frequency weighting measure
   G. Weighted cepstral distance
   H. Cepstrum projection measure
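Of the feature-level techniques above, cepstral mean normalization is simple enough to sketch directly. The following is a minimal illustration (not SPHINX-4's actual implementation): it subtracts the per-utterance mean of each cepstral coefficient, removing the constant bias that a stationary convolutional channel (microphone, telephone line) adds in the cepstral domain.

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-utterance mean from each cepstral coefficient.

    cepstra: array of shape (num_frames, num_coeffs), e.g. MFCC frames.
    A stationary convolutional channel appears as an additive constant
    in the cepstrum, so removing the mean removes much of its effect.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

Because the mean is taken over the whole utterance, this variant runs in batch mode; an unsupervised, incremental system would track a running mean instead.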

Speech normalization techniques: Speaker adaptation or normalization (continued)

4. Robust matching
   A. Word spotting
   B. Utterance verification
5. Linguistic processing
   A. Language model adaptation
The output is the recognition result.

Spectral warping for an HMM speech recognition system, per [1]

1. Estimate the optimal warping factor (α_i)
   A. During HMM training.
   B. Again during HMM recognition.
2. Training
   A. One α_i is determined for each speaker.
   B. One HMM is built from all warped utterances.
3. Recognition
   [Diagram: the unwarped utterance S_i is scored under each warp factor, Pr(S_i | M_0.88), Pr(S_i | M_1.00), Pr(S_i | M_1.12), and the MAX over these likelihoods selects α_i.]
   A. Estimate α_i based on the input utterance.
   B. Decode the utterance using warped feature vectors, e.g. MFCC.
4. Estimation of α_i
   A. Maximum likelihood with respect to a specific HMM.
   B. Optimal warping factor: α_i = argmax_α Pr(X_i^α | λ, W_i) (see the sketch after this slide), where
      1. λ: the set of HMM models.
      2. X_i^α: the α-warped cepstrum-domain observation vectors for speaker i.
      3. W_i: the transcriptions.
   C. Practical range of α_i: 0.88 to 1.12, accounting for the 25% range of vocal tract lengths.
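A minimal sketch of the maximum-likelihood grid search in 4.B, assuming two hypothetical helpers: warp_features (an α-warped MFCC front end) and log_likelihood (a forced-alignment score of the features against the trained HMM set λ given the transcription). Neither is a real SPHINX-4 API; this only shows the shape of the search.

```python
import numpy as np

# Candidate warping factors spanning the practical 0.88-1.12 range.
ALPHAS = np.arange(0.88, 1.12 + 1e-9, 0.02)

def estimate_warp_factor(wave, models, transcript,
                         warp_features, log_likelihood):
    """Return alpha_i = argmax_alpha Pr(X_i^alpha | lambda, W_i).

    warp_features(wave, alpha) -> alpha-warped cepstral observations X
    log_likelihood(X, models, transcript) -> alignment log-likelihood
    Both callables are hypothetical stand-ins for recognizer internals.
    """
    scores = [log_likelihood(warp_features(wave, a), models, transcript)
              for a in ALPHAS]
    return float(ALPHAS[int(np.argmax(scores))])
```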

Spectral Normalization

1. Vocal tract length varies from person to person.
   A. Formant frequency peaks are inversely proportional to vocal tract length.
   B. Formant center frequencies vary by as much as 25% between speakers.
   C. Frequency warping is a technique to re-map speech around the z-domain unit circle, such that all utterances seem to be generated by the same vocal tract (a sketch of one such warping function follows).
[Figure: warping of the z-domain unit circle; the frequencies DC, π/4, π/2, 3π/4, and π are re-mapped onto warped counterparts, with DC and π held fixed.]
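To illustrate such a re-mapping, here is a piecewise-linear warping function of the kind used for vocal tract length normalization in [1]; the breakpoint ω₀ = 0.875π is an assumed value, and the endpoints DC and π are held fixed so the warped axis still covers the full unit circle.

```python
import numpy as np

def piecewise_linear_warp(omega, alpha, omega0=0.875 * np.pi):
    """Warp a normalized frequency omega in [0, pi].

    Below the (assumed) breakpoint omega0 the axis is scaled by alpha
    (alpha < 1 compresses formants, alpha > 1 stretches them); above
    it, a linear segment pins pi to pi, keeping DC and pi fixed.
    """
    omega = np.asarray(omega, dtype=float)
    low = alpha * omega
    high = alpha * omega0 + (np.pi - alpha * omega0) * (
        (omega - omega0) / (np.pi - omega0))
    return np.where(omega <= omega0, low, high)
```

Applying this map to the mel filterbank center frequencies before feature extraction yields the α-warped cepstra used above.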

Spectral warping using the SPHINX-4 HMM speech recognition system

Frequency warping is performed in two stages, per [5], resulting in a 10% error rate reduction.
1. A bilinear transformation is performed with α = 0.6.
   A. The LPC-cepstrum is mapped onto a pseudo mel scale (see the sketch below).
2. Another transform is applied with a variable warping parameter α.
   A. α is chosen to minimize VQ error.
   B. An average of zero is maintained between male and female speakers.
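The first-stage bilinear transform is a first-order all-pass warp with a closed-form frequency mapping; the sketch below uses the standard all-pass phase formula (treating it as exactly the computation in [5] is an assumption).

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Frequency mapping of a first-order all-pass (bilinear) warp.

    omega_hat = omega + 2*atan(alpha*sin(omega) / (1 - alpha*cos(omega)))
    DC and pi stay fixed; with alpha near 0.6 the low frequencies are
    stretched, approximating a mel-like scale for the LPC-cepstrum.
    """
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))
```

For example, bilinear_warp(np.pi / 2, 0.6) maps π/2 to roughly 0.84π, so the lower half of the original spectrum occupies most of the warped axis.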

Test Configuration

Input wave file → Matlab spectral warping program → warped wave file →
SPHINX-4 recognizer: front end (cepstral mean normalization, etc.) →
features matrix → decoder → result (transcription)
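The warping stage in this chain was implemented in Matlab; the following Python sketch plays the same role under assumed parameters (mono 16-bit input, STFT frame size 512, magnitude-only warping with the original phase carried over): it reads a wave file, warps each short-time spectrum, and writes the warped file that the unmodified SPHINX-4 recognizer then decodes.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft, istft

def warp_wave_file(path_in, path_out, alpha, nperseg=512):
    """Write an alpha-warped copy of a mono 16-bit wave file (illustrative).

    Output bin k of each frame's magnitude spectrum takes its value from
    input bin alpha*k (np.interp holds edge values past the last bin);
    each bin keeps its original phase.
    """
    rate, samples = wavfile.read(path_in)
    _, _, Z = stft(samples.astype(float), fs=rate, nperseg=nperseg)
    bins = np.arange(Z.shape[0], dtype=float)
    warped = np.empty_like(Z)
    for j in range(Z.shape[1]):
        mag = np.interp(alpha * bins, bins, np.abs(Z[:, j]))
        warped[:, j] = mag * np.exp(1j * np.angle(Z[:, j]))
    _, y = istft(warped, fs=rate, nperseg=nperseg)
    wavfile.write(path_out, rate, np.int16(np.clip(y, -32768, 32767)))
```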