AN EXPLORATION ON INFLUENCE FACTORS OF VAD'S PERFORMANCE IN SPEAKER RECOGNITION. Cheng Gong, CSLT 2013/04/15

Outline
- Introduction
- Analysis of the influence factors of VAD's performance
- Experimental results and analysis
- Conclusions

Introduction
- Voice activity detection (VAD): a method for detecting periods of speech in observed signals.
- VAD is particularly important and widely used in both automatic speech recognition and speaker recognition.
- Two parts of the VAD process: acoustic feature extraction and the decision mechanism.
- Currently used VAD methods:
  - Short-term signal energy, zero-crossing rate
  - Methods based on speech/noise spectral characteristics: MFCCs, LTSE, LSF, MMSE, etc.
  - Methods based on periodic features: ACF, F0, etc.
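As a concrete illustration of the simplest class of methods listed above, the following is a minimal short-term-energy VAD sketch in Python; the frame length, shift and threshold are illustrative assumptions rather than values taken from these slides.

```python
import numpy as np

def energy_vad(signal, sample_rate=8000, frame_ms=25, shift_ms=10, threshold_db=-35.0):
    """Flag each frame as speech (True) or non-speech (False) by comparing its
    short-term log-energy with a threshold relative to the utterance peak."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    log_energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * shift : i * shift + frame_len].astype(float)
        log_energy[i] = 10.0 * np.log10(np.sum(frame ** 2) + 1e-10)
    # Frames within threshold_db of the loudest frame are labelled speech.
    return log_energy > (log_energy.max() + threshold_db)
```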

Introduction
- Difficulties of VAD:
  - Determining end-points accurately
  - Being robust to noise, especially non-stationary noise
- Basic principle for choosing end-points:
  - Speech recognition: integrity of the speech content
  - Speaker recognition: typicality of the speaker characteristics
- To get a better result, the VAD method used in speaker recognition may differ from the one used in speech recognition.

Phonation Types
- Voiced sound
  - Glottal excitation + vocal tract response
  - Quasi-periodic signal
  - All simple/compound vowels and 4 initial consonants (m, n, l, r) in Mandarin are voiced sounds
- Unvoiced sound
  - No vocal cord vibration
  - Non-periodic signal
  - Plosive/affricate/fricative, aspirated/unaspirated
  - The other initial consonants in Mandarin are unvoiced sounds

Phonation Types' Influence
- Research assumption: the distinction between phonation types may lead to different contributions to the speaker verification results.
- Research procedure:
  - Segment the speech signals based on phonation types (using the HVite tool)
  - Splice the speech according to the classification rules, per person:
    - Silence segments
    - Voiced sound segments
    - Unvoiced sound segments
  - Extract features (MFCC), train models and test on the speaker verification system; compare and analyse the results.
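A minimal sketch of the splicing step, assuming the forced-alignment output (e.g. from HVite) has already been parsed into (start_sample, end_sample, phone) tuples; the label sets and helper names are illustrative, with the voiced-initial set taken from the previous slide.

```python
import numpy as np

# Per the slides: Mandarin initials m, n, l, r are voiced; finals (vowels) are voiced as well.
VOICED_INITIALS = {"m", "n", "l", "r"}
SILENCE_LABELS = {"sil", "sp"}

def classify_phone(phone, vowel_set):
    """Map an aligned phone label to 'silence', 'voiced' or 'unvoiced'."""
    if phone in SILENCE_LABELS:
        return "silence"
    if phone in VOICED_INITIALS or phone in vowel_set:
        return "voiced"
    return "unvoiced"

def splice_by_phonation(signal, alignment, vowel_set):
    """Concatenate the samples of each phonation class into separate streams.

    alignment: list of (start_sample, end_sample, phone) tuples obtained from
    forced alignment (e.g. HVite output converted from HTK time units).
    """
    streams = {"silence": [], "voiced": [], "unvoiced": []}
    for start, end, phone in alignment:
        streams[classify_phone(phone, vowel_set)].append(signal[start:end])
    return {k: np.concatenate(v) if v else np.array([]) for k, v in streams.items()}
```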

SNR's Influence
- Research assumption: noise in the speech doesn't reflect the speaker's characteristics, so the parts which have a low SNR may lead to a high EER.
- Research procedure:
  - Estimate the noise power spectrum of each speech signal
  - Calculate the SNR of each frame
  - Splice the speech based on the SNR level: clean~20dB, 20dB~15dB, 15dB~10dB, 10dB~5dB, 5dB~
  - Extract features (MFCC), train models and test on the speaker verification system; compare and analyse the results.
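A minimal sketch of the per-frame SNR computation and band assignment, assuming the noise power spectrum has already been estimated (see the algorithm on the following slides); the array shapes and band labels are illustrative.

```python
import numpy as np

def frame_snr_db(noisy_power, noise_power, eps=1e-10):
    """Per-frame SNR in dB from the noisy-speech and estimated noise power
    spectra (each of shape [n_frames, n_bins])."""
    signal_power = np.maximum(noisy_power - noise_power, eps)
    return 10.0 * np.log10(signal_power.sum(axis=1) / (noise_power.sum(axis=1) + eps))

def bin_frames_by_snr(snr_db):
    """Assign each frame to one of the SNR bands used in the experiments."""
    edges = [20.0, 15.0, 10.0, 5.0]                       # band boundaries in dB
    labels = [">20dB", "20-15dB", "15-10dB", "10-5dB", "<5dB"]
    bands = np.full(len(snr_db), labels[-1], dtype=object)
    for i, value in enumerate(snr_db):
        for edge, label in zip(edges, labels):
            if value >= edge:
                bands[i] = label
                break
    return bands
```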

Noise Estimation Algorithm
- Analysis object: additive noise in the speech (stationary or non-stationary)
- Goal: obtain a noise power spectrum estimate from the noisy speech
- Implementation: a combination of minimum statistics, continuous spectral minimum tracking and minima-controlled recursive averaging

Noise Estimation Algorithm
- Compute the smoothed speech power spectrum P(λ,k)
- Calculate the speech presence probability p(λ,k) using a first-order recursion
- Find the local minimum of the noisy speech, P_min(λ,k)
- Compute the time-frequency dependent smoothing factors α_s(λ,k)
- Compute the ratio S_r(λ,k) of the smoothed speech power spectrum to its local minimum
- Update the noise estimate D(λ,k) using the time-frequency dependent smoothing factors α_s(λ,k)
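A minimal NumPy sketch of a noise tracker in this style (continuous minimum tracking plus minima-controlled recursive averaging, cf. refs. 11-14) is given below. The smoothing constants and the threshold δ are illustrative assumptions, not the values used in the experiments, and the steps are applied in data-flow order (the minimum and the ratio S_r feed the presence probability p).

```python
import numpy as np

def estimate_noise_psd(noisy_power, eta=0.7, alpha_p=0.2, alpha_d=0.85,
                       beta=0.8, gamma=0.998, delta=5.0):
    """Sketch of a minima-controlled recursive-averaging noise tracker.

    noisy_power: noisy-speech power spectrogram |Y(lambda,k)|^2, shape [n_frames, n_bins].
    Returns a noise power estimate D(lambda,k) of the same shape.
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    n_frames, n_bins = noisy_power.shape
    P = noisy_power[0].copy()        # smoothed noisy power P(lambda,k)
    P_min = noisy_power[0].copy()    # continuously tracked spectral minimum
    p = np.zeros(n_bins)             # speech presence probability p(lambda,k)
    D = np.empty_like(noisy_power)
    D[0] = noisy_power[0]
    for t in range(1, n_frames):
        P_prev = P.copy()
        # Smooth the noisy power spectrum with a first-order recursion.
        P = eta * P + (1.0 - eta) * noisy_power[t]
        # Continuous spectral minimum tracking.
        rising = P_min < P
        P_min = np.where(rising,
                         gamma * P_min + (1.0 - gamma) / (1.0 - beta) * (P - beta * P_prev),
                         P)
        # Ratio of smoothed power to its local minimum -> rough speech indicator.
        S_r = P / np.maximum(P_min, 1e-10)
        I = (S_r > delta).astype(float)
        # Speech presence probability via first-order recursion.
        p = alpha_p * p + (1.0 - alpha_p) * I
        # Time-frequency dependent smoothing factor alpha_s(lambda,k).
        alpha_s = alpha_d + (1.0 - alpha_d) * p
        # Update the noise estimate D(lambda,k).
        D[t] = alpha_s * D[t - 1] + (1.0 - alpha_s) * noisy_power[t]
    return D
```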

Database
- CCB database
- Recorded in clean environments over the telephone channel
- Sampling rate: 8 kHz
- Training utterance length: 39s~75s
- Testing utterance length: 11s~44s

  Channel     Training (M / F / total)   True speaker (M / F / total)   Impostor (M / F / total)
  Telephone   50 / 50 / 100              150 / 150 / 300                1000 / 1000 / 2000

Experimental Conditions
- Feature: Mel-frequency cepstral coefficients (MFCC)
  - 16 orders with energy, without deltas
  - 32 Mel filter banks
- Model: GMM-UBM, 1024 mixtures
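A minimal sketch of this setup, using librosa for the MFCCs and scikit-learn's GaussianMixture as the UBM; the frame length, the mean-only MAP adaptation and the relevance factor are standard choices assumed here, not details stated on the slide.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(signal, sr=8000, n_mfcc=16, n_mels=32):
    """16 MFCCs (including the energy-like c0), 32 mel filter banks, no deltas;
    25 ms frames with a 10 ms shift are assumed."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc, n_mels=n_mels,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return mfcc.T  # [n_frames, n_mfcc]

def train_ubm(features, n_mixtures=1024):
    """Train a diagonal-covariance GMM-UBM on pooled background features."""
    ubm = GaussianMixture(n_components=n_mixtures, covariance_type="diag", max_iter=50)
    ubm.fit(features)
    return ubm

def map_adapt_means(ubm, features, relevance=16.0):
    """Mean-only MAP adaptation of the UBM to one speaker's features."""
    post = ubm.predict_proba(features)                      # [n_frames, n_mix]
    n_k = post.sum(axis=0)                                  # soft counts per mixture
    ex_k = post.T @ features / np.maximum(n_k[:, None], 1e-10)
    alpha = n_k / (n_k + relevance)
    speaker = GaussianMixture(n_components=ubm.n_components, covariance_type="diag")
    speaker.weights_, speaker.covariances_ = ubm.weights_, ubm.covariances_
    speaker.precisions_cholesky_ = ubm.precisions_cholesky_
    speaker.means_ = alpha[:, None] * ex_k + (1.0 - alpha)[:, None] * ubm.means_
    return speaker

def verification_score(speaker_gmm, ubm, features):
    """Average log-likelihood ratio used as the verification score."""
    return speaker_gmm.score(features) - ubm.score(features)
```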

Results and Analysis: Phonation Types' Influence (EER, %)

  Gender   Voiced Sound   Unvoiced Sound   Silence   Baseline
  M        7.65           42.49            48.74     8.17
  F        8.13           42.17            49.22     8.53
  M+F      5.89           41.87            49.12     7.44

Results and Analysis: SNR's Influence
[Figure: EER(%) for speech spliced by SNR band (clean~20dB, 20dB~15dB, 15dB~10dB, 10dB~5dB, 5dB~) and for silence]
- If the segments with SNR < 5dB are removed, the EER = 5.09% (baseline EER = 7.16%).

Results and Analysis
- White noise was added to the speech at different SNRs; the table below shows the EERs when the segments whose SNR < 5dB are removed:

  SNR     Improved EER(%)   Baseline EER(%)
  clean   5.09              7.16
  20dB    6.46              8.76
  15dB    8.31              10.89
  10dB    11.85             15.24
  5dB     16.65             21.46

Conclusion
- Unvoiced sounds don't contribute much to speaker verification results; using only the voiced speech gives better results.
- The EER is directly related to the SNR of the speech: if the segments with very low SNR are removed, the results improve considerably. This approach has a remarkable effect on noisy speech.

References
1. Ishizuka, K., Nakatani, T., Fujimoto, M., Miyazaki, N., 2010. Noise robust voice activity detection based on periodic to aperiodic component ratio. Speech Commun. 52, 41-60.
2. Le Bouquin-Jeannes, R., Faucon, G., 1995. Study of a voice activity detector and its influence on a noise reduction system. Speech Commun. 16, 245-254.
3. Karray, L., Martin, A., 2003. Towards improving speech detection robustness for speech recognition in adverse conditions. Speech Commun. 40, 261-276.
4. Rabiner, L.R., Sambur, M.R., 1975. An algorithm for determining the endpoints of isolated utterances. Bell Syst. Tech. J. 54, 297-315.
5. Kristjansson, T., Deligne, S., Olsen, P., 2005. Voicing features for robust speech detection. In: Proc. Interspeech, pp. 369-372.
6. Ramirez, J., Segura, J.C., Benitez, C., De la Torre, A., Rubio, A., 2004. Efficient voice activity detection algorithms using long-term speech information. Speech Commun. 42, 271-287.
7. ITU-T Recommendation G.729 Annex B, 1996. A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70.
8. Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. ASSP-32, 1109-1121.
9. Rabiner, L.R., Sambur, M.R., 1975. An algorithm for determining the endpoints of isolated utterances. Bell Syst. Tech. J. 54, 297-315.
10. Ahmadi, S., Spanias, A.S., 1999. Cepstrum-based pitch detection using a new statistical V/UV classification algorithm. IEEE Trans. Speech Audio Process. 7, 333-338.

References
11. Rangachari, S., Loizou, P.C., 2006. A noise-estimation algorithm for highly non-stationary environments. Speech Commun. 48, 220-231.
12. Martin, R., 2001. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process. 9 (5), 504-512.
13. Cohen, I., 2002. Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Process. Lett. 9 (1), 12-15.
14. Doblinger, G., 1995. Computationally efficient speech enhancement by spectral minima tracking in subbands. In: Proc. Eurospeech, vol. 2, pp. 1513-1516.