Robust Spectral Representation Using Group Delay Function and Stabilized Weighted Linear Prediction for Additive Noise Degradations

Robust Spectral Representation Using Group Delay Function and Stabilized Weighted Linear Prediction for Additive Noise Degradations

Dhananjaya Gowda, Jouni Pohjalainen, Paavo Alku and Mikko Kurimo
Dept. of Signal Processing and Acoustics, School of Electrical Eng., Aalto University, Finland

Gowda et al., SWLP-GD for Robust Speaker Recogn., SpeD-2013: Cluj-Napoca, Romania, Oct 17, 2013

Outline
- Weighted linear prediction (WLP)
- Stabilized weighted linear prediction (SWLP)
- Group delay (GD) of an all-pole model
- SWLP-GD spectrum
- Robustness of SWLP-GD spectrum
- Speaker recognition experiments
- Conclusions

Weighted linear prediction
- Idea: give more weight to reducing the prediction error in the closed phase region of the glottal cycle
- Provides better estimates of the vocal tract
- Noise robust, as the focus is now on the high SNR region
- Short time energy (STE) is one such weight function [Ma et al., Speech Comm. 1993]
- Stabilized WLP (SWLP) ensures stability of the estimated poles [Magi et al., Speech Comm. 2009]
[Figure: more weight to the high SNR closed phase region. Electroglottograph: measures vocal fold contact (via electrical impedance across the larynx) as we speak]
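The weighting idea above can be sketched in a few lines of plain Python. This is a minimal illustration of ordinary WLP with an STE weight, not the stabilized variant (SWLP) and not the authors' implementation. The function names, the `eps` floor on the weights, the window length `M = 8`, and the choice to sum the error over the frame samples only are all assumptions made for this sketch. The convention here is that the predictor is x[n] ≈ a[0]·x[n-1] + ... + a[p-1]·x[n-p].

```python
def ste_weights(x, M, eps=1e-9):
    """Short-time energy weights: w[n] = eps + sum_{i=1..M} x[n-i]^2.
    Samples before the frame start count as zero; eps (an assumption of
    this sketch) keeps every weight strictly positive."""
    return [eps + sum(x[n - i] ** 2 for i in range(1, M + 1) if n - i >= 0)
            for n in range(len(x))]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def wlp(x, p, w):
    """Minimize sum_n w[n] * (x[n] - sum_k a[k-1] * x[n-k])^2 over a."""
    def past(n, k):
        return x[n - k] if n - k >= 0 else 0.0

    N = len(x)
    R = [[sum(w[n] * past(n, i) * past(n, k) for n in range(N))
          for k in range(1, p + 1)] for i in range(1, p + 1)]
    r = [sum(w[n] * x[n] * past(n, i) for n in range(N))
         for i in range(1, p + 1)]
    return solve(R, r)


def all_pole_response(a, pulses, N):
    """Toy 'voiced' signal: unit pulses at the given sample indices,
    filtered by the all-pole model x[n] = e[n] + sum_k a[k] * x[n-1-k]."""
    x = []
    for n in range(N):
        v = 1.0 if n in pulses else 0.0
        v += sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        x.append(v)
    return x


if __name__ == "__main__":
    true_a = [1.3, -0.6]                       # hypothetical 2-pole "vocal tract"
    x = all_pole_response(true_a, {0, 16}, 32)  # two excitation pulses
    print("STE-weighted:", wlp(x, 2, ste_weights(x, 8)))
    print("unweighted:  ", wlp(x, 2, [1.0] * len(x)))
```

In this toy signal the only sample that violates the all-pole model is the second excitation pulse. The STE weight at that instant is small because the preceding samples have decayed, so the weighted estimate should land closer to the true coefficients than the uniformly weighted one.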

Group delay of an all-pole system
- Group delay (GD) function: negative derivative of the phase spectrum
- GD function is additive in nature (w.r.t. individual resonances), as against the multiplicative magnitude spectrum
- Formant peaks are better resolved
- Formant peaks are better highlighted even under degradations
- Can be computed from the inverse filter impulse response
- Avoids phase unwrapping
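The last two points can be made concrete with a short sketch. For an inverse filter A(z) = 1 + a1·z^-1 + ... + ap·z^-p, let A(ω) be the Fourier transform of the coefficients a[n] and B(ω) the transform of n·a[n]. The group delay of A is Re{B/A}, so the group delay of the all-pole model H = 1/A is -Re{B/A}, and no phase unwrapping is needed. The function names, the frequency grid, and the two example resonators below are assumptions for illustration only.

```python
import cmath
import math


def allpole_group_delay(a, n_freqs=256):
    """Group delay (in samples) of H(z) = 1/A(z) at w_k = pi*k/n_freqs.
    `a` holds the inverse-filter coefficients [1, a1, ..., ap]."""
    gd = []
    for k in range(n_freqs):
        w = math.pi * k / n_freqs
        A = sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(a))
        B = sum(n * c * cmath.exp(-1j * w * n) for n, c in enumerate(a))
        gd.append(-(B / A).real)   # tau_H = -tau_A = -Re{B/A}
    return gd


def resonator(r, theta):
    """Inverse filter of one resonance: conjugate pole pair at r*exp(+-j*theta)."""
    return [1.0, -2.0 * r * math.cos(theta), r * r]


def convolve(a, b):
    """Polynomial product, i.e. the inverse filter of two cascaded models."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, u in enumerate(a):
        for j, v in enumerate(b):
            out[i + j] += u * v
    return out


if __name__ == "__main__":
    a1 = resonator(0.95, math.pi / 4)   # hypothetical "formant" 1
    a2 = resonator(0.90, math.pi / 2)   # hypothetical "formant" 2
    g1 = allpole_group_delay(a1)
    g2 = allpole_group_delay(a2)
    g12 = allpole_group_delay(convolve(a1, a2))
    # Additivity: the cascade's group delay is the sum of the individual ones.
    print(max(abs(s - (u + v)) for s, u, v in zip(g12, g1, g2)))
```

Cascading two resonators multiplies their magnitude spectra but adds their group delays, which is the additive property listed above.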

SWLP-GD spectrum
- Computed as the group delay function of the SWLP spectrum
- SWLP tends to smooth the spectrum due to weighting
- SWLP-GD brings back the formant resolution
- Weak formants better highlighted in SWLP-GD

Robustness of SWLP-GD
Objective measure #1: average log spectral distortion (LSD)
- LSD between normalized spectra from clean and degraded speech
- Spectra normalized to unit energy
Data
- VTR database (192 utterances, 24 speakers, 8 female & 16 male)
- Degradations from NOISEX database
Observations
- STRAIGHT marginally better than LP
- SWLP better than STRAIGHT
- SWLP-GD improves upon SWLP and performs the best
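The slides do not give the exact LSD formula, so the sketch below uses one common definition: the root-mean-square difference, in dB, between two spectra after each is scaled to unit total energy, averaged over frames. The function names, the `floor` guard, and the assumption that the inputs are non-negative power spectra are choices made for this illustration.

```python
import math


def normalize_unit_energy(spec):
    """Scale a non-negative power spectrum so that its bins sum to 1."""
    total = sum(spec)
    return [v / total for v in spec]


def log_spectral_distortion(p_clean, p_noisy, floor=1e-12):
    """RMS difference (dB) between two unit-energy-normalized spectra."""
    p = normalize_unit_energy(p_clean)
    q = normalize_unit_energy(p_noisy)
    sq = [(10.0 * math.log10(max(u, floor) / max(v, floor))) ** 2
          for u, v in zip(p, q)]
    return math.sqrt(sum(sq) / len(sq))


def average_lsd(clean_frames, noisy_frames):
    """Mean LSD over corresponding frames of clean and degraded speech."""
    d = [log_spectral_distortion(c, n)
         for c, n in zip(clean_frames, noisy_frames)]
    return sum(d) / len(d)


if __name__ == "__main__":
    clean = [1.0, 4.0, 2.0, 0.5]       # toy 4-bin spectrum
    flat = [1.0, 1.0, 1.0, 1.0]        # toy "noise-flattened" spectrum
    print(log_spectral_distortion(clean, flat))
```

Because of the unit-energy normalization, a pure gain change between the clean and degraded spectra contributes nothing to the distortion; only changes in spectral shape are counted.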

Robustness of SWLP-GD (contd.)
Objective measure #2: frequency weighted segmental SNR
- Gives more weight to spectral peaks as against valleys
- Correlates well with the industry standard PESQ (a measure of speech quality)
- SWLP-GD performs better than other representations
- Most affected: white noise, followed by factory noise
[Figure: frequency weighted segmental SNR]
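The slides do not spell out the formula, so the following is a sketch of one common formulation of frequency weighted segmental SNR: a per-band SNR in dB, clamped to a fixed range, then averaged within each frame using weights that grow with the clean spectral magnitude. The exponent `gamma = 0.2`, the clamp range of -10 to 35 dB, the `eps` guard, and the use of whatever bands the caller passes in (rather than a specific critical-band filterbank) are assumptions, not values taken from the talk.

```python
import math


def fw_seg_snr(clean_frames, test_frames, gamma=0.2, lo=-10.0, hi=35.0,
               eps=1e-12):
    """Frequency weighted segmental SNR in dB.
    Each frame is a list of per-band magnitudes; each frame is assumed to
    contain at least one non-zero clean band."""
    scores = []
    for x, y in zip(clean_frames, test_frames):
        num = 0.0
        den = 0.0
        for xm, ym in zip(x, y):
            w = xm ** gamma                     # peaks weigh more than valleys
            snr = 10.0 * math.log10((xm * xm + eps) / ((xm - ym) ** 2 + eps))
            snr = min(max(snr, lo), hi)         # clamp the per-band SNR
            num += w * snr
            den += w
        scores.append(num / den)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    clean = [[10.0, 1.0]]                       # one frame: a peak and a valley
    print(fw_seg_snr(clean, [[5.0, 1.0]]))      # 50% error on the peak
    print(fw_seg_snr(clean, [[10.0, 0.5]]))     # 50% error on the valley
```

With these weights, the same relative error costs more when it falls on a spectral peak than on a valley, which matches the first bullet above.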

Robustness of SWLP-GD (contd.)
[Figure: SWLP-GD spectra for different noise degradations at 0 dB SNR]
- Good performance in most strongly voiced regions
- Most affected region (esp. white & factory)

Speaker recognition experiments
- Small-scale closed-set speaker recognition experiments
- C: clean case; B10, F10, V10 and W10: noisy speech at 10 dB SNR (babble, factory, vehicle and white noise, respectively)
- Matched and mismatched conditions
Data: VTR database
- 24 speakers; 8 female, 16 male
- Train: 6 utts; Test: 2 utts
- Degradations: NOISEX database
Models and features
- 32 mixture GMMs
- 12 MFCCs (c1-c12)
Results
- Overall: 48.8% (DFT), 62.7% (SWLP-GD)
- Mismatched: 36.5% (DFT), 54.2% (SWLP-GD)
[Figure: results for matched and mismatched conditions; large improvements in mismatched conditions]

Conclusions
SWLP-GD key features
- Provides a robust spectral representation for feature extraction
- Temporal weighting provides robustness in the time domain
- Group delay function provides robustness in the frequency domain
SWLP-GD vs traditional spectral representations
- Lower log-spectral distortion and higher frequency weighted SNR compared to the traditional DFT, LP or STRAIGHT spectra
- Performs better than the traditional MFCCs in small-scale closed-set speaker recognition experiments under mismatched conditions of degradation

References
[1] C. Magi, J. Pohjalainen, T. Bäckström, and P. Alku, "Stabilized weighted linear prediction," Speech Communication, vol. 51, no. 5, pp. 401–411, 2009.
[2] B. Yegnanarayana, "Formant extraction from linear prediction phase spectra," J. Acoust. Soc. Am., vol. 63, no. 5, pp. 1638–1640, May 1978.
[3] H. Murthy and B. Yegnanarayana, "Group delay functions and its applications in speech technology," Sadhana, vol. 36, pp. 745–782, 2011.
[4] C. Magi, T. Bäckström, and P. Alku, "Objective and subjective evaluation of seven selected all-pole modeling methods in processing of noise corrupted speech," in Proc. 7th Nordic Signal Processing Symposium (NORSIG 2006), Reykjavik, Iceland, June 2006.
[5] C. Ma, Y. Kamp, and L. F. Willems, "Robust signal selection for linear prediction analysis of voiced speech," Speech Communication, vol. 12, no. 1, pp. 69–81, 1993.
[6] L. Deng, X. Cui, R. Pruvenok, J. Huang, and S. Momen, "A database of vocal tract resonance trajectories for research in speech processing," in Proc. Int. Conf. Acoustics Speech and Signal Processing, Toulouse, France, 2006, pp. I-369–I-372.
[7] D. Gowda, J. Pohjalainen, M. Kurimo, and P. Alku, "Robust formant detection using group delay function and stabilized weighted linear prediction," in Proc. Interspeech, Lyon, France, August 2013.

Questions?

Thank You!