Preference for 20-40 ms window duration in speech analysis


Griffith Research Online
https://research-repository.griffith.edu.au

Preference for 20-40 ms window duration in speech analysis

Authors: Paliwal, Kuldip; Lyons, James; Wojcicki, Kamil
Published: 2010
Conference title: 4th International Conference on Signal Processing and Communication Systems, ICSPCS'2010, Proceedings
DOI: https://doi.org/0.09/icspcs.00.597
Copyright statement: Copyright 2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes, or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.
Downloaded from http://hdl.handle.net/007/758

Preference for 20-40 ms window duration in speech analysis

Kuldip K. Paliwal, James G. Lyons and Kamil K. Wójcicki
Signal Processing Laboratory, Griffith University, Nathan, QLD, Australia
{k.paliwal, j.lyons, k.wojcicki}@griffith.edu.au

ABSTRACT

In speech processing the short-time magnitude spectrum is believed to contain most of the information about speech intelligibility, and it is normally computed using the short-time Fourier transform over a 20-40 ms window duration. In this paper, we investigate the effect of the analysis window duration on speech intelligibility in a systematic way. For this purpose, both subjective and objective experiments are conducted. The subjective experiment takes the form of a consonant recognition task performed by human listeners, whereas the objective experiment takes the form of an automatic speech recognition (ASR) task. In our experiments various analysis window durations are investigated. For the subjective experiment we construct speech stimuli based purely on the short-time magnitude information. The results of the subjective experiment show that an analysis window duration of 15-35 ms is the optimum choice when speech is reconstructed from the short-time magnitude spectrum. Similar conclusions were drawn from the results of the objective (ASR) experiment. The ASR results were found to have a statistically significant correlation with the subjective intelligibility results.

Index Terms: Analysis window duration, magnitude spectrum, automatic speech recognition, speech intelligibility.

1. INTRODUCTION

Although speech is non-stationary, it can be assumed quasi-stationary and can therefore be processed through short-time Fourier analysis. The short-time Fourier transform (STFT) of a speech signal s(t) is given by

S(t, f) = \int_{-\infty}^{\infty} s(\tau)\, w(t - \tau)\, e^{-j 2\pi f \tau}\, d\tau,   (1)

where w(t) is an analysis window function of duration T_w. In speech processing, the Hamming window function is typically used and its width is normally 20-40 ms. The short-time Fourier spectrum, S(t, f), is a complex quantity and can be expressed in polar form as

S(t, f) = |S(t, f)|\, e^{j \psi(t, f)},   (2)

where |S(t, f)| is the short-time magnitude spectrum and \psi(t, f) = \angle S(t, f) is the short-time phase spectrum. The signal s(t) is completely characterized by its magnitude and phase spectra. (In our discussions, when referring to the magnitude or phase spectra, the short-time modifier is implied unless otherwise stated.)
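To make Eqs. (1)-(2) concrete, here is a minimal numpy sketch of the short-time analysis, not the authors' implementation; the 32 ms window, T_w/8 frame shift and 2N-point FFT mirror the settings reported later in Section 2.3, and the random test signal is only a stand-in for real speech.

```python
# Illustrative sketch of Eqs. (1)-(2): short-time Fourier analysis with a
# Hamming window, then polar decomposition into magnitude and phase.
import numpy as np

def stft_frames(s, fs, win_ms=32.0, shift_div=8):
    """Complex STFT, one row per frame (zero-padded to FFT length 2N)."""
    N = int(round(win_ms * 1e-3 * fs))           # N = Tw * Fs samples per frame
    shift = max(1, N // shift_div)               # frame shift of Tw/8
    w = np.hamming(N)                            # analysis window w(t)
    starts = range(0, len(s) - N + 1, shift)
    frames = np.stack([s[i:i + N] * w for i in starts])
    return np.fft.rfft(frames, n=2 * N, axis=1)  # zero-padded to 2N points

fs = 16000
s = np.random.randn(2 * fs)                      # stand-in for a speech signal
S = stft_frames(s, fs)
magnitude = np.abs(S)                            # |S(t, f)| of Eq. (2)
phase = np.angle(S)                              # psi(t, f) of Eq. (2)
```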
The rationale for making the window duration 20-40 ms comes from the following qualitative arguments. When making the quasi-stationarity assumption, we want the speech analysis segment to be stationary. As a result we cannot make the analysis window too large, otherwise the signal within the window will become non-stationary; from this consideration the window duration should be as small as possible. However, making the window duration small also has its disadvantages. One disadvantage is that if we make the analysis duration smaller, then the frame shift decreases and thus the frame rate increases. This means we will be processing much more information than necessary, increasing the computational complexity. The second disadvantage of a small window duration is that the spectral estimates tend to become less reliable due to the stochastic nature of the speech signal. The third reason why we cannot make the analysis window too small is that in speech processing the typical range of pitch frequency is between 100 and 400 Hz, so a typical pitch pulse occurs every 2.5 to 10 ms. If the duration of the analysis window is smaller than the pitch period, then a pitch pulse will sometimes be present within the window and at other times absent. When the speech signal is voiced, the location of pitch pulses will change from frame to frame under pitch-asynchronous analysis. To make this analysis independent of the location of pitch pulses within the analysis segment, we need a segment length of at least two to three times the pitch period. The above arguments are normally used to justify an analysis window duration of around 20-40 ms. However, they are all qualitative arguments, which do not tell us exactly what the analysis segment duration should be.
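For concreteness, the back-of-envelope numbers behind the pitch-period argument can be sketched as follows; the 100-400 Hz pitch range is the reconstruction used in the text above, and the "two to three periods" rule is applied to the lowest pitch, since that gives the longest period.

```python
# Worked numbers for the qualitative pitch-period argument (illustrative).
f0_min, f0_max = 100.0, 400.0                  # assumed pitch range in Hz
t_max, t_min = 1000.0 / f0_min, 1000.0 / f0_max
print(f"pitch period: {t_min:.1f}-{t_max:.1f} ms")                    # 2.5-10.0 ms
print(f"window of 2-3 periods: {2 * t_max:.0f}-{3 * t_max:.0f} ms")   # 20-30 ms
```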

In this paper we propose a systematic way of arriving at an optimal duration of the analysis window, in the context of typical speech processing applications. The majority of these applications utilize only the short-time magnitude spectrum. For example, speech and speaker recognition tasks use cepstral coefficients as features, which are based solely on the short-time magnitude spectrum. Similarly, typical speech enhancement algorithms modify only the magnitude spectrum and leave the noisy phase spectrum unchanged. For this reason, in our investigations we employ the analysis-modification-synthesis (AMS) framework where, during the modification stage, only the short-time magnitude spectrum is kept, while the short-time phase spectrum is discarded by randomizing its values. In our experiments we vary the duration of the analysis segment used in the short-time Fourier analysis to find out what window duration gives the best speech intelligibility under this framework. For this purpose, both subjective and objective experiments are conducted. For the subjective evaluation we conduct listening tests with human listeners in a consonant recognition task. For the objective evaluation, we carry out an automatic speech recognition (ASR) experiment on the TIMIT speech corpus.

The remainder of this paper is organized as follows. Section 2 provides details of the subjective listening tests. Section 3 outlines the objective experiment. The results and discussion are presented in Section 4.

2. SUBJECTIVE EXPERIMENT

This section describes the subjective measurement of speech intelligibility as a function of analysis window duration. For this purpose human listening tests are conducted, in which consonant recognition performance is measured.

2.1. Analysis-modification-synthesis

The aim of the present study is to determine the effect that the duration of an analysis segment has on speech intelligibility, using a systematic, quantitative approach. Since the majority of speech processing applications utilize only the short-time magnitude spectrum, we construct stimuli that retain only the magnitude information. For this purpose, the analysis-modification-synthesis (AMS) procedure, shown in Fig. 1, is used. In the AMS framework the speech signal is divided into overlapped frames. The frames are windowed using an analysis window, w(t), followed by Fourier analysis and spectral modification. In the spectral modification stage only the magnitude information is retained; the phase spectrum information is removed by randomizing the phase spectrum values. The resulting modified STFT is given by

\hat{S}(t, f) = |S(t, f)|\, e^{j \phi(t, f)},   (3)

where \phi(t, f) is a random variable uniformly distributed between 0 and 2\pi. Note that when constructing the random phase spectrum, the antisymmetry property of the phase spectrum should be preserved. The stimulus, \hat{s}(t), is then constructed by taking the inverse STFT of \hat{S}(t, f), followed by synthesis windowing and overlap-add (OLA) reconstruction [1, 2, 3, 4]. We refer to the resulting stimulus as a magnitude-only stimulus, since it is reconstructed using only the short-time magnitude spectrum. (Although we remove the short-time phase spectrum information by randomizing its values while keeping the magnitude spectrum, the phase spectrum component in the reconstructed speech cannot be removed to 100% perfection [5].)

Fig. 1. Procedure used for stimulus construction: the input speech is split into overlapped frames, analysis-windowed and Fourier transformed; the modified spectrum \hat{S}(t, f) = |S(t, f)| e^{j \phi(t, f)} combines the magnitude spectrum with a random phase spectrum, and is then inverse Fourier transformed, synthesis-windowed and overlap-added to give the processed speech.
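As an illustration only, a sketch under stated assumptions rather than the authors' implementation, the modification-synthesis step of Eq. (3) can be written with numpy, reusing the frames produced by the stft_frames() sketch above. A plain Hanning synthesis window with squared-window OLA normalization stands in for the modified Hanning window of [4].

```python
# Minimal sketch of Eq. (3) plus overlap-add resynthesis (illustrative).
import numpy as np

def magnitude_only_ola(S, N, shift, seed=0):
    """Resynthesise from |S(t, f)| with uniformly random phase, Eq. (3)."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(0.0, 2.0 * np.pi, size=S.shape)  # phi(t, f) in [0, 2pi)
    phi[:, 0] = 0.0                                    # zero-phase DC and
    phi[:, -1] = 0.0                                   # Nyquist keep frames real
    S_hat = np.abs(S) * np.exp(1j * phi)               # Eq. (3)
    frames = np.fft.irfft(S_hat, axis=1)[:, :N]        # back to N-sample frames
    w = np.hanning(N)                                  # synthesis window
    out = np.zeros((len(frames) - 1) * shift + N)
    norm = np.zeros_like(out)
    for k, frame in enumerate(frames):
        out[k * shift:k * shift + N] += frame * w      # overlap-add
        norm[k * shift:k * shift + N] += w * w
    return out / np.maximum(norm, 1e-8)

# Continuing from the stft_frames() sketch above:
#   S = stft_frames(s, fs); N = int(round(32e-3 * fs)); shift = N // 8
#   s_hat = magnitude_only_ola(S, N, shift)   # magnitude-only stimulus
```

Note that np.fft.irfft assumes a conjugate-symmetric spectrum, so the negative-frequency phases are implicitly -phi(t, f); this is exactly the antisymmetry property mentioned above, and zeroing the DC and Nyquist phases keeps the reconstructed frames real.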
2.2. Recordings

Six stop consonants, [b, d, g, p, t, k], were selected for the human consonant recognition task. Each consonant was placed in a vowel-consonant-vowel (VCV) context within the "Hear aCa now" carrier sentence; for example, for the consonant [g], the utterance is "Hear aga now". The recordings were carried out in a silent room using a SONY ECM-MS7 microphone. Four speakers were used, two male and two female. Six recordings per speaker were made, giving a total of 24 recordings. Each recording lasted approximately three seconds, including leading and trailing silence portions. All recordings were sampled at F_s = 16 kHz with 16-bit precision.

2.3. Stimuli

The recordings were processed using the AMS procedure detailed in Section 2.1, with the Hamming window as the analysis window function. Ten analysis window durations were investigated (T_w = 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512 ms). The frame shift was set to T_w/8 and the FFT analysis length was set to 2N, where N (= T_w F_s) is the number of samples in each frame. These settings were chosen to minimize aliasing effects. For a detailed look at how the choice of the above parameters affects subjective intelligibility, we refer the reader to [6, 7]. The modified Hanning window [4] was used as the synthesis window. The original recordings (reconstructed without spectral modification) were also included. Overall, eleven different treatments (the ten window durations plus the unmodified originals) were applied to the 24 recordings, resulting in a total of 264 stimulus files, as tallied in the sketch at the end of this section. Example spectrograms of original as well as processed stimuli are shown in Fig. 4.

2.4. Subjects

Twelve English-speaking volunteers with normal hearing served as listeners. None of the listeners participated in the recording of the stimuli.

2.5. Procedure

The listening tests were conducted in a quiet room, with each listener tested individually over a single session. The task was to identify each carrier utterance as one of the six stop consonants. The listeners were presented with seven labeled options on a digital computer, the first six corresponding to the six stop consonants and the seventh being a null response. The subjects were instructed to choose the null response only if they had no idea what the embedded consonant might have been. The stimulus audio files were played in randomized order and presented over closed circumaural headphones (SONY MDR-V0) at a comfortable listening level. Prior to the actual test, the listeners were familiarized with the task in a short practice session. The entire sitting lasted approximately half an hour. Responses were collected via a keyboard. No feedback was given.
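The stimulus tally above can be made explicit in a few lines; the duration list is the reconstruction used in Section 2.3, and this is bookkeeping rather than processing code.

```python
# Bookkeeping sketch for Section 2.3: ten window durations plus the
# unmodified originals, applied to all 24 recordings, give 264 stimuli.
durations_ms = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
n_recordings = 4 * 6                       # 4 speakers x 6 VCV recordings
n_treatments = len(durations_ms) + 1       # + originals passed through AMS
assert n_treatments * n_recordings == 264  # total stimulus files
```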

3. OBJECTIVE EXPERIMENT

This section provides the details of our investigation of the importance of the analysis window duration in the context of a popular speech processing application, namely automatic speech recognition (ASR). For this purpose, an ASR experiment was conducted on the TIMIT speech corpus [8]. The TIMIT corpus is sampled at 16 kHz and consists of 6300 utterances spoken by 630 speakers. The corpus is separated into training and testing sets. For our experiments the sa* utterances, which are the same across all speakers, were removed from both the training and testing sets to prevent biasing the results. The full training set, consisting of 3696 utterances from 462 speakers, was used for training, while the core test set, consisting of 192 utterances from 24 speakers, was used for testing. Both training and testing were performed on clean speech. For our experiment we employed the hidden Markov model toolkit (HTK) [9]. An HTK-based triphone recognizer, with 3 states per HMM and 8 Gaussian mixtures per state, was used. The features were mel-frequency cepstral coefficients [10] with energy as well as the first and second derivatives (39 coefficients in total). Various analysis window durations were investigated. Cepstral mean subtraction was applied, and a bigram language model was used. The training phoneme set, which consisted of 48 phonemes, was reduced to 39 for testing purposes (as in [11]). Phoneme recognition results are quoted in terms of correctness percentage [9].
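The paper's features were produced with HTK; the following librosa-based sketch is only an analogous reconstruction for sweeping the analysis window duration, where 13 cepstra plus deltas and delta-deltas give the 39-dimensional vectors. Any mel filterbank or DCT settings beyond those stated above are assumptions, and librosa's defaults will not match HTK's MFCC implementation exactly.

```python
# Illustrative MFCC window-duration sweep (not the paper's HTK front end).
import numpy as np
import librosa

def mfcc_39(y, sr, win_ms):
    win = int(round(win_ms * 1e-3 * sr))             # window length in samples
    hop = max(1, win // 8)                           # Tw/8 shift, as in Sec. 2.3
    n_fft = int(2 ** np.ceil(np.log2(max(win, 2))))  # next power of two >= win
    c = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                             n_fft=n_fft, win_length=win, hop_length=hop)
    return np.vstack([c, librosa.feature.delta(c),
                      librosa.feature.delta(c, order=2)])  # 39 x n_frames

y = np.random.randn(16000).astype(np.float32)  # stand-in for a TIMIT utterance
for win_ms in (8, 16, 32, 64):                 # subset of the duration sweep
    print(win_ms, mfcc_39(y, 16000, win_ms).shape)
```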
4. RESULTS AND DISCUSSION

In the subjective experiment, described in Section 2, we measured consonant recognition performance through human listening tests. We refer to the results of these measurements as subjective intelligibility scores. The subjective intelligibility scores (along with their standard error bars) are shown in Fig. 2(a) as a function of analysis window duration. The following observations can be made. For short analysis window durations the subjective intelligibility scores are low. The scores increase with increasing analysis window length, but at long window durations they start to decrease. It is important to note that Fig. 2(a) shows a peak for analysis window durations between 15 and 35 ms.

Fig. 2. Experimental results for: (a) subjective intelligibility tests in terms of consonant recognition accuracy (%); and (b) automatic speech recognition in terms of correctness (%), each as a function of analysis window duration (ms).

The results of the ASR experiment, detailed in Section 3, are shown in Fig. 2(b). We refer to these results as objective scores. The objective results show a trend similar to that of the subjective results, although in the objective case the peak is wider and lies between roughly 15 and 35 ms.

The objective scores as a function of the subjective intelligibility scores, together with the least-squares line of best fit and the correlation coefficient, are shown in Fig. 3. The objective scores were found to have a statistically significant correlation with the subjective intelligibility scores at the 0.0001 level of significance using correlation analysis [12]. This indicates that ASR can be used to predict subjective intelligibility.
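The significance test behind this claim can be sketched as follows; the score arrays below are synthetic placeholders, not the paper's measurements, and scipy's pearsonr is one standard way to obtain r together with its p-value.

```python
# Sketch of the correlation analysis behind Fig. 3 (placeholder data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
subjective = rng.uniform(40.0, 95.0, size=10)                 # consonant accuracy (%)
objective = 0.7 * subjective + rng.normal(0.0, 3.0, size=10)  # ASR correctness (%)

r, p = stats.pearsonr(subjective, objective)                  # r and significance
slope, intercept = np.polyfit(subjective, objective, 1)       # line of best fit
print(f"r = {r:.2f}, p = {p:.4g}, fit: y = {slope:.2f}x + {intercept:.1f}")
```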

Fig. 3. Automatic speech recognition results in terms of ASR correctness (%) versus subjective intelligibility scores in terms of consonant recognition accuracy (%). The least-squares line of best fit is also shown. Correlation coefficient: r = 0.9.

Based on the subjective as well as the objective results, it can be seen that the optimum window duration for speech analysis is around 15-35 ms. For speech applications based solely on the short-time magnitude spectrum this window duration is expected to be the right choice. This duration has been recommended in the past on the basis of qualitative arguments. In the present work, however, a similar optimal segment length was obtained through a systematic study of the subjective and objective intelligibility of speech stimuli reconstructed using only the short-time magnitude spectrum.

5. CONCLUSION

In this paper, the effect of the analysis window duration on speech intelligibility was investigated in a systematic way. Two evaluation methods were employed, subjective and objective. The subjective evaluation was based on human listening tests comprising a consonant recognition task, while for the objective evaluation an ASR experiment was conducted. The experimental results show that an analysis window duration of 15-35 ms is the optimum choice when a speech signal is reconstructed from its short-time magnitude spectrum only.

6. REFERENCES

[1] J.B. Allen and L.R. Rabiner, "A unified approach to short-time Fourier analysis and synthesis," Proc. IEEE, vol. 65, no. 11, pp. 1558-1564, 1977.

[2] R.E. Crochiere, "A weighted overlap-add method of short-time Fourier analysis/synthesis," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-28, no. 1, pp. 99-102, 1980.

[3] M.R. Portnoff, "Short-time Fourier analysis of sampled speech," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-29, no. 3, pp. 364-373, 1981.

[4] D.W. Griffin and J.S. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 2, pp. 236-243, 1984.

[5] O. Ghitza, "On the upper cutoff frequency of the auditory critical-band envelope detectors in the context of speech perception," J. Acoust. Soc. Am., vol. 110, no. 3, pp. 1628-1640, 2001.

[6] K.K. Paliwal and L.D. Alsteris, "On the usefulness of STFT phase spectrum in human listening tests," Speech Communication, vol. 45, no. 2, pp. 153-170, 2005.

[7] L.D. Alsteris and K.K. Paliwal, "Short-time phase spectrum in speech processing: A review and some experimental results," Digital Signal Processing, vol. 17, no. 3, pp. 578-616, May 2007.

[8] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, and D.S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, Feb. 1993.

[9] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Cambridge University Engineering Department, 3.4 edition, 2006.

[10] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357-366, 1980.

[11] K.-F. Lee and H.-W. Hon, "Speaker-independent phone recognition using hidden Markov models," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 11, pp. 1641-1648, Nov. 1989.

[12] E. Kreyszig, Advanced Engineering Mathematics, Wiley, 9th edition, 2006.

Fig. 4. Spectrograms of the utterance "Hear aga now" by a male speaker: (a) original speech (passed through the AMS procedure with no spectral modification); (b)-(f) processed speech (magnitude-only stimuli) for different analysis window durations: (b) 2 ms; (c) 8 ms; (d) 32 ms; (e) 128 ms; and (f) 512 ms.