Effect of Analysis Window Duration on Speech Intelligibility

Authors: Kuldip Paliwal and Kamil Wójcicki
Published: 2008
Journal: IEEE Signal Processing Letters
DOI: https://doi.org/10.1109/LSP.2008.2005755
Copyright Statement: © 2008 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Downloaded from: http://hdl.handle.net/10072/23589 (Griffith Research Online, https://research-repository.griffith.edu.au)

IEEE SIGNAL PROCESSING LETTERS, VOL. 15, 2008, p. 785

Effect of Analysis Window Duration on Speech Intelligibility

Kuldip Paliwal, Member, IEEE, and Kamil Wójcicki

Abstract: In this letter, we investigate the effect of the analysis window duration on speech intelligibility in a systematic way. In speech processing, the short-time magnitude spectrum is believed to contain the majority of the intelligible information. Consequently, in our experiments, we construct speech stimuli based purely on the short-time magnitude spectrum. We conduct subjective listening tests in the form of a consonant recognition task to assess intelligibility as a function of analysis window duration. In our investigations, we also employ three objective speech intelligibility measures based on the speech transmission index (STI). The experimental results show that an analysis window duration of 15-35 ms is the optimum choice when speech is reconstructed from the short-time magnitude spectrum.

Index Terms: Analysis window duration, magnitude spectrum, speech intelligibility, speech transmission index (STI).

I. INTRODUCTION

Although speech is nonstationary, it can be assumed quasi-stationary and can therefore be processed through short-time Fourier analysis. The short-time Fourier transform (STFT) of a speech signal x(k) is given by

    X(n, \omega) = \sum_{k=-\infty}^{\infty} x(k) \, w(n - k) \, e^{-j \omega k},    (1)

where w(n) is an analysis window function of finite duration. In speech processing, the Hamming window function is typically used, and its duration is normally 20-40 ms. The short-time Fourier spectrum X(n, \omega) is a complex quantity and can be expressed in polar form as

    X(n, \omega) = |X(n, \omega)| \, e^{j \angle X(n, \omega)},    (2)

where |X(n, \omega)| is the short-time magnitude spectrum and \angle X(n, \omega) is the short-time phase spectrum. The signal is completely characterized by its magnitude and phase spectra.¹

The rationale for making the window duration 20-40 ms comes from the following qualitative arguments. When making the quasi-stationarity assumption, we want the speech analysis segment to be stationary. As a result, we cannot make the analysis window too large; otherwise, the signal within the window will become nonstationary. From this consideration, the window duration should be as small as possible. However, making the window duration small also has its disadvantages. The first disadvantage is that if we make the analysis duration smaller, then the frame shift decreases and thus the frame rate increases. This means we will be processing much more information than necessary, increasing the computational complexity. The second disadvantage is that the spectral estimates will tend to become less reliable due to the stochastic nature of the speech signal. The third reason why we cannot make the analysis window too small is that, in speech processing, the typical range of pitch frequency is between 80 and 500 Hz. This means that a typical pitch pulse occurs every 2 to 12 ms. If the duration of the analysis window is smaller than the pitch period, then the pitch pulse will sometimes be present and at other times absent. When the speech signal is voiced, the location of pitch pulses will change from frame to frame under pitch-asynchronous analysis. To make the analysis independent of the location of pitch pulses within the analysis segment, we need a segment length of at least two to three times the pitch period.

The above arguments are normally used to justify an analysis window duration of around 20-40 ms. However, they are all qualitative arguments and do not tell us exactly what the analysis segment duration should be. In this letter, we propose a systematic way of arriving at an optimal duration of an analysis window. We want to do so in the context of typical speech processing applications. The majority of these applications utilize only the short-time magnitude spectrum information. For example, speech and speaker recognition tasks use cepstral coefficients as features, which are based solely on the short-time magnitude spectrum. Similarly, typical speech enhancement algorithms modify only the magnitude spectrum and leave the noisy phase spectrum unchanged. For this reason, in our investigations, we employ the analysis-modification-synthesis (AMS) framework where, during the modification stage, only the short-time magnitude spectrum is kept, while the short-time phase spectrum is discarded by randomizing its values. In our experiments, we investigate the effect of the duration of the analysis segment used in short-time Fourier analysis to find out what window duration gives the best speech intelligibility under this framework. For this purpose, both subjective and objective speech intelligibility measures are employed. For subjective evaluation, we conduct listening tests using human listeners in a consonant recognition task. For objective evaluation, we employ three speech-based derivatives of a popular objective speech intelligibility measure, the speech transmission index (STI).

The remainder of this letter is organized as follows. Section II describes the AMS procedure used to construct stimuli files for our experiments. Section III provides details of the subjective listening tests. Section IV outlines the objective evaluation procedure. Results and discussion are presented in Section V.

Manuscript received May 03, 2008; revised August 12, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Brian Kan-Wing Mak. The authors are with the Signal Processing Laboratory, Griffith School of Engineering, Griffith University, Nathan QLD 4111, Australia (e-mail: k.paliwal@griffith.edu.au; k.wojcicki@griffith.edu.au). Digital Object Identifier 10.1109/LSP.2008.2005755

¹ In our discussions, when referring to the magnitude or phase spectra, the "short-time" modifier is implied unless otherwise stated.
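The short-time Fourier analysis described above is straightforward to prototype. The sketch below is an illustrative implementation, not the authors' code; the 32 ms window, 8 ms frame shift, and 16 kHz test tone are assumed values chosen only for demonstration. It windows a signal with a Hamming function and splits each frame's spectrum into magnitude and phase:

```python
import numpy as np

def stft(x, fs, win_dur=0.032, shift_dur=0.008):
    """Short-time Fourier analysis with a Hamming window.

    win_dur and shift_dur are illustrative values (32 ms window,
    8 ms shift); the letter's point is precisely that win_dur matters.
    """
    N = int(round(win_dur * fs))      # samples per analysis frame
    S = int(round(shift_dur * fs))    # frame shift in samples
    w = np.hamming(N)                 # analysis window w(n)
    frames = [x[i:i + N] * w for i in range(0, len(x) - N + 1, S)]
    X = np.fft.rfft(np.array(frames), axis=1)   # X(n, omega), one-sided
    return np.abs(X), np.angle(X)     # magnitude and phase spectra

# synthetic test signal: 1 s of a 200 Hz tone at an assumed fs = 16 kHz
fs = 16000
t = np.arange(fs) / fs
mag, phase = stft(np.sin(2 * np.pi * 200 * t), fs)
print(mag.shape, phase.shape)
```

For magnitude-only processing of the kind studied in this letter, only the first return value would be retained.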

II. ANALYSIS-MODIFICATION-SYNTHESIS

The aim of the present study is to determine the effect that the duration of an analysis segment has on speech intelligibility, using a systematic, quantitative approach. Since the majority of speech processing applications utilize only the short-time magnitude spectrum, we construct stimuli that retain only the magnitude information. For this purpose, the AMS procedure shown in Fig. 1 is used. In the AMS framework, the speech signal is divided into overlapped frames. The frames are then windowed using an analysis window w(n), followed by Fourier analysis and spectral modification. The spectral modification stage is where only the magnitude information is retained. The phase spectrum information is removed by randomizing the phase spectrum values. The resulting modified STFT is given by

    Y(n, \omega) = |X(n, \omega)| \, e^{j \theta(n, \omega)},    (3)

where \theta(n, \omega) is a random variable uniformly distributed between 0 and 2\pi. Note that, when constructing the random phase spectrum, the antisymmetry property of the phase spectrum should be preserved. The stimulus is then constructed by taking the inverse STFT of Y(n, \omega), followed by synthesis windowing and overlap-add (OLA) reconstruction [1]-[4]. We refer to the resulting stimulus as a magnitude-only stimulus, since it is reconstructed using only the short-time magnitude spectrum.²

Fig. 1. Procedure used for stimulus construction.

² Although we remove the information about the short-time phase spectrum by randomizing its values and keep the magnitude spectrum, the phase spectrum component in the reconstructed speech cannot be removed to 100% perfection [5].

III. SUBJECTIVE EXPERIMENT

This section describes the subjective measurement of speech intelligibility as a function of analysis window duration. For this purpose, human listening tests are conducted, in which consonant recognition performance is measured.

A. Recordings

Six stop consonants were selected for the human consonant recognition task. Each consonant was placed in a vowel-consonant-vowel (VCV) context within the "Hear aca now" carrier sentence.³ The recordings were carried out in a silent room using a SONY ECM-MS907 microphone. Four speakers were used: two males and two females. Six recordings per speaker were made, giving a total of 24 recordings. Each recording lasted approximately 3 s, including leading and trailing silence portions. All recordings were sampled with 16-bit precision.

³ For example, for the consonant [g], the utterance is "Hear aga now."

B. Stimuli

The recordings were processed using the AMS procedure detailed in Section II. The Hamming window was employed as the analysis window function. Twelve analysis window durations were investigated, ranging up to 2048 ms (including 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 ms). The frame shift and the FFT analysis length were set relative to the number of samples in each frame; these settings were chosen to minimize aliasing effects. For a detailed look at how the choice of the above parameters affects subjective intelligibility, we refer the reader to [6] and [7]. The modified Hanning window [4] was used as the synthesis window. The original recordings (reconstructed without spectral modification) were also included. Overall, 13 different treatments were applied to the 24 recordings, resulting in a total of 312 stimuli files. Example spectrograms of original as well as processed stimuli are shown in Fig. 4.

C. Subjects

Twelve English-speaking volunteers with normal hearing served as listeners. None of the listeners participated in the recording of the stimuli.

D. Procedure

The listening tests were conducted in isolation, over a single session, in a quiet room. The task was to identify each carrier utterance as containing one of the six stop consonants. The listeners were presented with seven labeled options on a digital computer, the first six corresponding to the six stop consonants and the seventh being a null response. The subjects were instructed to choose the null response only if they had no idea what the embedded consonant might have been. The stimuli audio files were played in a randomized order and presented over closed circumaural headphones (SONY MDR-V500) at a comfortable listening level. Prior to the actual test, the listeners were familiarized with the task in a short practice session. The entire sitting lasted approximately half an hour. The responses were collected via a keyboard. No feedback was given.

IV. OBJECTIVE EVALUATION

In this section, our aim is to investigate the effect of the analysis window duration on speech intelligibility using objective measures. For this purpose, we employ the STI as the performance metric [8].
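The magnitude-only AMS procedure of Section II can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the letter uses a modified Hanning synthesis window, whereas this sketch reuses a Hamming window for both analysis and synthesis, and the 32 ms window with 75% overlap is an assumed configuration. Working with the one-sided spectrum (rfft/irfft) preserves the phase antisymmetry property automatically, since the reconstructed frames are forced to be real:

```python
import numpy as np

def magnitude_only_ams(x, fs, win_dur=0.032):
    """AMS reconstruction keeping only |X(n, omega)|.

    The phase spectrum is replaced by random values uniform on
    [0, 2*pi). Window choice and overlap are assumed values.
    """
    rng = np.random.default_rng(0)
    N = int(round(win_dur * fs))
    S = N // 4                              # 75% overlap (assumed)
    w = np.hamming(N)
    y = np.zeros(len(x))
    norm = np.zeros(len(x))
    for i in range(0, len(x) - N + 1, S):
        X = np.fft.rfft(x[i:i + N] * w)     # analysis
        theta = rng.uniform(0.0, 2 * np.pi, X.shape)
        theta[0] = 0.0                      # keep DC bin real
        theta[-1] = 0.0                     # keep Nyquist bin real
        Y = np.abs(X) * np.exp(1j * theta)  # modification: random phase
        frame = np.fft.irfft(Y, n=N)        # inverse STFT
        y[i:i + N] += frame * w             # synthesis window + OLA
        norm[i:i + N] += w ** 2
    return y / np.maximum(norm, 1e-12)      # OLA normalization

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)             # assumed test tone
y = magnitude_only_ams(x, fs)
print(len(y), np.isrealobj(y))
```

The resulting y is a magnitude-only stimulus in the sense of Section II: its short-time magnitude spectrum matches that of x, while its phase is random.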

STI measures the extent to which slow temporal intensity envelope modulations are preserved in degraded listening environments [9]. It is these slow intensity variations that are important for speech intelligibility. In the present work, we employ the speech-based STI computation procedure, where a speech signal is used as the probe. Under this framework, the original and processed speech signals are passed separately through a bank of seven octave band filters. Each filtered signal is squared and lowpass filtered (with a cutoff frequency of 32 Hz) to derive the temporal intensity envelope. The power spectrum of the temporal intensity envelope is subjected to one-third octave band analysis. The components over each of the 14 one-third octave band intervals (with centers ranging from 0.63 to 12.7 Hz) are summed, producing 98 modulation indices. The resulting modulation spectrum of the original speech, along with the modulation spectrum of the processed speech, can then be used to compute the modulation transfer function (MTF), which in turn is used to compute the STI. In this work, three different approaches are employed for the computation of the MTF: the first by Houtgast and Steeneken [10], the second by Drullman et al. [11], and the third by Payton et al. [12]. The details of the MTF and STI computations are given in [13]. The objective evaluation is performed on the stimuli files used in the subjective experiment (see Section III-B).

V. RESULTS AND DISCUSSION

In the subjective experiment described in Section III, we measured consonant recognition performance through human listening tests. We refer to the results of these measurements as subjective intelligibility scores. The subjective intelligibility scores (along with their standard error bars) are shown in Fig. 2(a) as a function of analysis window duration. The following observations can be made based on these results. For short analysis window durations, the subjective intelligibility scores are low. The scores increase with an increase in analysis window length, but at long window durations the subjective intelligibility scores start to decrease. It is important to note that Fig. 2(a) shows a peak for analysis window durations between 15 and 35 ms.

Section IV outlined an objective evaluation of speech intelligibility. We refer to the results of this evaluation as objective intelligibility scores. The objective intelligibility scores as a function of analysis window length are shown in Fig. 2(b). The objective results show a trend similar to that of the subjective results. Although, in the objective case, the peak is not as pronounced, it can be seen to lie between 8 and 40 ms. Note that all three speech-based STI measures display a similar trend.

Mean speech-based STI scores as a function of subjective intelligibility scores, along with least-squares lines of best fit and correlation coefficients, are shown in Fig. 3. All three STI derivatives were found to have a statistically significant correlation with subjective intelligibility scores at the 0.0001 level of significance using correlation analysis [14]. This indicates that the three STI measures can be used to predict subjective intelligibility.

Based on subjective as well as objective intelligibility scores, it can be seen that the optimum window duration for speech analysis is around 15-35 ms. For speech applications based solely on the short-time magnitude spectrum, this window duration is expected to be the right choice. This duration has been recommended in the past on the basis of qualitative arguments.

Fig. 2. Experimental results. (a) Subjective intelligibility scores in terms of consonant recognition accuracy (%). (b) Objective intelligibility scores in terms of mean speech-based STI, shown for the following methods: Houtgast and Steeneken [10] (broken line), Drullman et al. [11] (dotted line), and Payton et al. [12] (solid line).

Fig. 3. Objective intelligibility scores in terms of mean speech-based STI versus subjective intelligibility scores in terms of consonant recognition accuracy (%). Correlation coefficients, r, as well as least-squares lines of best fit, are also shown for each of the STI-based methods.
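As a rough illustration of the speech-based STI idea outlined in Section IV, the sketch below reduces the full procedure to a single band and a single modulation frequency. The seven-octave-band filterbank, the 14 one-third-octave modulation intervals, and the exact envelope filter of the cited methods are all omitted; the moving-average envelope smoother, the 4 Hz modulation frequency, and the apparent-SNR mapping with its +/-15 dB limits follow common STI convention but are assumptions for illustration, not any one of the three cited variants:

```python
import numpy as np

def modulation_transfer(clean, processed, fs, f_mod=4.0):
    """Single-band, single-modulation-frequency sketch of the MTF/STI.

    Real speech-based STI methods use 7 octave bands and 14
    one-third-octave modulation intervals; here one intensity
    envelope and one modulation frequency illustrate the idea.
    """
    def envelope(s):
        # squared signal smoothed by a short moving average
        # (a crude stand-in for the 32 Hz lowpass envelope filter)
        win = int(fs / 64)
        return np.convolve(s ** 2, np.ones(win) / win, mode='same')

    def mod_index(env):
        # strength of the f_mod component relative to mean intensity
        t = np.arange(len(env)) / fs
        c = np.abs(np.mean(env * np.exp(-2j * np.pi * f_mod * t)))
        return 2 * c / np.mean(env)

    # MTF value: ratio of processed to clean modulation indices
    m = mod_index(envelope(processed)) / mod_index(envelope(clean))
    # map to apparent SNR, clip to +/-15 dB, rescale to [0, 1]
    snr = np.clip(10 * np.log10(m / max(1 - m, 1e-12)), -15, 15)
    return (snr + 15) / 30

fs = 8000
t = np.arange(2 * fs) / fs
carrier = np.random.default_rng(1).standard_normal(len(t))
clean = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * carrier  # 4 Hz envelope
sti = modulation_transfer(clean, clean, fs)               # no degradation
print(round(float(sti), 2))
```

When the processed signal equals the clean probe, the modulations are fully preserved, so the index saturates at 1; any processing that flattens the slow envelope modulations drives it toward 0.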

However, in the present work, a similar optimal segment length was obtained through a systematic study of subjective and objective intelligibility of speech stimuli reconstructed using only the short-time magnitude spectrum. To the best of our knowledge, this is the first attempt to quantify the window duration on the basis of subjective intelligibility scores.

VI. CONCLUSION

In this letter, the effect of the analysis window duration on speech intelligibility was investigated in a systematic way. Subjective evaluation in the form of human listening tests comprising a consonant recognition task was conducted. In addition to the subjective evaluation, three speech-based variants of the STI objective speech intelligibility measure were employed. The experimental results show that an analysis window duration of 15-35 ms is the optimum choice when a speech signal is reconstructed from its short-time magnitude spectrum only.

Fig. 4. Spectrograms of the utterance "Hear aga now," by a male speaker. (a) Original speech (passed through the AMS procedure with no spectral modification). (b)-(g) Processed speech (magnitude-only stimuli) for different analysis window durations: (b) 2 ms, (c) 8 ms, (d) 32 ms, (e) 128 ms, (f) 512 ms, and (g) 2048 ms.

REFERENCES

[1] J. Allen and L. Rabiner, "A unified approach to short-time Fourier analysis and synthesis," Proc. IEEE, vol. 65, no. 11, pp. 1558-1564, Nov. 1977.
[2] R. Crochiere, "A weighted overlap-add method of short-time Fourier analysis/synthesis," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-28, no. 1, pp. 99-102, Feb. 1980.
[3] M. Portnoff, "Short-time Fourier analysis of sampled speech," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-29, no. 3, pp. 364-373, Jun. 1981.
[4] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 2, pp. 236-243, Apr. 1984.
[5] O. Ghitza, "On the upper cutoff frequency of the auditory critical-band envelope detectors in the context of speech perception," J. Acoust. Soc. Amer., vol. 110, no. 3, pp. 1628-1640, Sep. 2001.
[6] K. Paliwal and L. Alsteris, "On the usefulness of STFT phase spectrum in human listening tests," Speech Commun., vol. 45, no. 2, pp. 153-170, Feb. 2005.
[7] L. Alsteris and K. Paliwal, "Short-time phase spectrum in speech processing: A review and some experimental results," Digit. Signal Process., vol. 17, pp. 578-616, May 2007.
[8] H. Steeneken and T. Houtgast, "A physical method for measuring speech-transmission quality," J. Acoust. Soc. Amer., vol. 67, no. 1, pp. 318-326, Jan. 1980.
[9] K. Payton and L. Braida, "A method to determine the speech transmission index from speech waveforms," J. Acoust. Soc. Amer., vol. 106, pp. 3637-3648, Dec. 1999.
[10] T. Houtgast and H. Steeneken, "A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria," J. Acoust. Soc. Amer., vol. 77, no. 3, pp. 1069-1077, Mar. 1985.
[11] R. Drullman, J. Festen, and R. Plomp, "Effect of reducing slow temporal modulations on speech reception," J. Acoust. Soc. Amer., vol. 95, pp. 2670-2680, May 1994.
[12] K. L. Payton, L. D. Braida, S. Chen, P. Rosengard, and R. Goldsworthy, "Computing the STI using speech as a probe stimulus," in Past, Present and Future of the Speech Transmission Index. Soesterberg, The Netherlands: TNO Human Factors, 2002, pp. 125-138.
[13] R. Goldsworthy and J. Greenberg, "Analysis of speech-based speech transmission index methods with implications for nonlinear operations," J. Acoust. Soc. Amer., vol. 116, no. 6, pp. 3679-3689, Dec. 2004.
[14] E. Kreyszig, Advanced Engineering Mathematics, 9th ed. New York: Wiley, 2006.