SMOOTHED TIME/FREQUENCY FEATURES FOR VOWEL CLASSIFICATION

Zaki B. Nossair and Stephen A. Zahorian
Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23529

ABSTRACT

A novel signal modeling technique is described to compute smoothed time-frequency features for encoding speech information. These time-frequency features compactly and accurately model phonetic information, while accounting for the main effects of contextual variations. The segment-level features are computed such that more emphasis is given to the center of the segment and less to the end regions. For phonetic classification, the features are relatively insensitive to both time and frequency resolution, at least insofar as changes in window length and frame spacing are concerned. A 60-dimensional feature space based on this modeling technique resulted in 70.9% accuracy for classification of 16 vowels extracted from the TIMIT database in speaker-independent experiments. These results are higher than any other results reported in the literature for the same task.

Introduction

One of the fundamental issues in feature selection for automatic speech recognition is the representation of the spectral/temporal information which best captures the phonetic content of the speech signal. Since time and frequency resolution are inversely related, in practice a tradeoff must be made between the two. Ideally these resolutions should also depend on frequency. Another important consideration, at least for any statistically based recognizer, is the desirability of using as few features as possible. In this paper, signal processing strategies are described for computing smoothed time-frequency features that encode the information in speech segments in a compact form.

There are at least two primary techniques in the literature for modeling phonemes extracted from continuous speech. In the first, each phoneme is represented by a sequence of feature vectors extracted from equally spaced frames of speech, and an HMM is typically used to model this sequence [1]. In the second, each phoneme is represented by three feature vectors extracted at the beginning, middle, and end of the labeled acoustic signal [2,3,4]. These feature vectors are concatenated to form one longer fixed-length vector, which is then used as the input to a classifier such as a neural network. Neither method is particularly effective at capturing the temporal history of the underlying features. Although the basic features are often augmented with some type of delta terms, the temporal modeling deficiency is only partially overcome.

In the present technique, each acoustic segment is initially represented with multiple feature vectors extracted from equally spaced frames of speech. Each feature trajectory across these frames is then represented by a low-order, time-warped cosine basis vector expansion, and the coefficients of these cosine expansions are used as the representation of the underlying phonetic information. This low-order cosine expansion over time models the temporal (dynamic) information as well as the contextual information in a compact and integrated form. Using a weighted basis vector expansion over a sufficiently long acoustic segment, we are able to emphasize the center of the segment, which contains most of the information about the underlying phone, while also including contextual information from the neighboring phonemes. The use of only a few low-order terms in the expansion over time also smooths out variability due to signal processing choices such as window length and frame spacing. In the remaining sections of this paper we describe the feature computation procedure, the classification technique, the experimental procedures, and the experimental results.

COMPUTATION OF TIME/FREQUENCY FEATURES

The features are computed in a multi-stage process as follows. The first step is to high-frequency pre-emphasize the speech signal using a second-order FIR filter given by

    y[n] = 0.3426 x[n] + 0.4945 x[n-1] - 0.64 x[n-2],

where the coefficients are for a 16 kHz sampling rate. This pre-emphasis, which has a broad peak near 3 kHz and approximates the inverse of the equal-loudness contour, results in slightly better performance (about 1%) than does a first-order pre-emphasis (y[n] = x[n] - 0.95 x[n-1]).

The next step of processing is to compute a 1024-point FFT from each Kaiser-windowed (coefficient of 5.33) frame of speech data. The magnitude spectrum is then determined, logarithmically amplitude scaled, and frequency warped with a bilinear transformation using a warping coefficient of 0.45. The scaled FFT spectrum is then reduced (or smoothed) using a cosine transform computed over the frequency range 75 Hz to 6000 Hz. The resulting coefficients, which we call discrete cosine transform coefficients (DCTCs), are essentially cepstral coefficients.

Each DCTC trajectory is then represented by the coefficients of a modified cosine expansion over the segment interval:

    DCTC_i(n) = Σ_{k=0}^{M-1} DCS_ik BV_k(n),

where DCTC_i(n) is the ith cosine coefficient of the magnitude spectrum of the nth frame, BV_k(n) is the nth value of the kth basis vector, M is the number of basis vectors used, and DCS_ik is the kth coefficient of the modified cosine expansion of DCTC_i. The basis vectors are computed as "time-warped" cosine basis vectors, using a Kaiser-window weighting function, such that the data are represented more accurately in the center of the interval than near the endpoints:

    BV_k(n) = KW(n) cos(k W(n)),

where KW(n) is the Kaiser-window weighting function, L is the length of each trajectory (number of frames),

    W(n) = 0.5 π / L + Σ_{j=1}^{n-1} DW(j),

and

    DW(j) = π (L-1) [KW(j) + KW(j+1)] / ( L Σ_{m=1}^{L-1} [KW(m) + KW(m+1)] ).

Figure 1 depicts the first three basis vectors, using a coefficient of 5 for the Kaiser warping function.

The methodology described above allows considerable flexibility for examining tradeoffs between time and frequency resolution. For example, to increase frequency resolution, the frame length and the number of DCTC terms should be increased, while the number of DCS terms can be reduced. To increase time resolution, the frame length, frame spacing, and number of DCTC terms should be reduced, while the number of DCS terms should be increased. The tradeoff between the resolution of the representation at the center of the segment relative to the endpoints can be examined by varying the coefficient of the time-warping function. Very importantly, the procedure also results in considerable data reduction relative to the original features. For example, 15 features sampled at 30 frames (450 total features) can be reduced to 75 features if 5 basis vectors are used for each expansion. We conducted several experiments designed to evaluate the usefulness of these features for automatic vowel classification and to investigate the tradeoffs mentioned above.
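To make the segment-level feature computation concrete, the following Python sketch builds the time-warped cosine basis vectors and fits a set of per-frame DCTC trajectories with them to obtain DCS coefficients. It is a minimal illustration rather than the authors' implementation: the function and variable names are ours, the per-frame DCTCs are assumed to have been computed already (pre-emphasis, Kaiser-windowed FFT, log magnitude, frequency warping, and cosine transform as described above), and a least-squares fit is used as one reasonable way to obtain the expansion coefficients, which the paper does not specify.

```python
import numpy as np

def time_warped_basis(num_frames, num_terms, beta):
    """Time-warped cosine basis vectors BV_k(n), k = 0..num_terms-1.

    beta is the Kaiser warping factor from the paper; larger values
    weight the center of the segment more heavily.
    """
    L = num_frames
    kw = np.kaiser(L, beta)                     # KW(n), n = 0..L-1
    # DW(j), reconstructed from the paper (0-based j here)
    pair = kw[:-1] + kw[1:]
    dw = np.pi * (L - 1) * pair / (L * pair.sum())
    # W(n) = 0.5*pi/L plus the cumulative sum of DW over preceding frames
    W = 0.5 * np.pi / L + np.concatenate(([0.0], np.cumsum(dw)))
    k = np.arange(num_terms)
    # BV_k(n) = KW(n) * cos(k * W(n)); shape (num_terms, L)
    return kw[None, :] * np.cos(np.outer(k, W))

def dcs_features(dctc, num_terms=5, beta=10.0):
    """Project per-frame DCTC trajectories onto the warped cosine basis.

    dctc: array of shape (num_dctc, num_frames), one trajectory per row.
    Returns DCS coefficients of shape (num_dctc, num_terms); these are
    concatenated to form the segment-level feature vector.
    """
    bv = time_warped_basis(dctc.shape[1], num_terms, beta)  # (num_terms, L)
    # Fit DCTC_i(n) ~ sum_k DCS_ik BV_k(n) in the least-squares sense
    dcs, *_ = np.linalg.lstsq(bv.T, dctc.T, rcond=None)
    return dcs.T

# Example: 10 DCTCs over a 30-frame segment -> 10 x 5 = 50 features
rng = np.random.default_rng(0)
dctc = rng.standard_normal((10, 30))            # placeholder trajectories
features = dcs_features(dctc, num_terms=5, beta=10.0).ravel()
print(features.shape)                           # (50,)
```

With a warping factor of 0 the weighting is rectangular and W(n) reduces to the standard DCT-II sample points π(n + 0.5)/L; larger warping factors weight the center of the segment more heavily.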

Figure 1. First three basis vectors used to encode trajectories of spectral features.

CLASSIFIER

The pattern classification approach used in this study is called a binary-pair partitioning (BPP) neural network [5,6]. This approach partitions an N-way classification task into N(N-1)/2 two-way classification tasks. Each two-way task is performed by a neural network classifier trained to discriminate one pair of categories, and the two-way decisions are then combined to form the N-way decision. For all experiments reported in this paper, each pairwise network was a memoryless, feed-forward, multi-layer perceptron configured with one hidden layer of 5 nodes (unless otherwise stated) and one output node. Back-propagation was used for training, with 160,000 network updates, an initial learning rate of 0.45, and a momentum term of 0.6. The learning rate was reduced by a factor of 0.96 every 5000 network updates.
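The sketch below shows the binary-pair partitioning idea in outline. It is not the authors' implementation: scikit-learn's MLPClassifier stands in for the paper's back-propagation networks (the 160,000-update schedule, learning rate, and momentum above are not reproduced), and simple vote counting stands in for the combination rule, which the paper does not spell out.

```python
from itertools import combinations
import numpy as np
from sklearn.neural_network import MLPClassifier

class PairwisePartitionClassifier:
    """N-way classification via N(N-1)/2 pairwise MLPs, one per class pair."""

    def __init__(self, hidden_nodes=5):
        self.hidden_nodes = hidden_nodes
        self.pair_models = {}

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        for a, b in combinations(self.classes_, 2):
            mask = (y == a) | (y == b)          # train on this pair only
            net = MLPClassifier(hidden_layer_sizes=(self.hidden_nodes,),
                                max_iter=2000, random_state=0)
            net.fit(X[mask], y[mask])
            self.pair_models[(a, b)] = net
        return self

    def predict(self, X):
        # Combine the two-way decisions by vote counting (one plausible rule)
        votes = np.zeros((len(X), len(self.classes_)), dtype=int)
        index = {c: i for i, c in enumerate(self.classes_)}
        for (a, b), net in self.pair_models.items():
            for row, winner in enumerate(net.predict(X)):
                votes[row, index[winner]] += 1
        return self.classes_[votes.argmax(axis=1)]
```

For the 16 TIMIT vowels used here, this scheme trains 16 x 15 / 2 = 120 pairwise networks.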

Experiments

The experiments were performed with vowel data from the DARPA/TIMIT database (October 1990 version). The vowels used were /iy, ih, eh, ey, ae, aa, aw, ay, ah, ao, oy, ow, uh, ux, er, uw/. In this paper we give only test results based on the SX sentences, using 499 speakers (356 male and 143 female) for training and 50 speakers (33 male and 17 female) for testing. The vowels and speakers used in these tests were the same as those used in previously reported tests with the TIMIT vowels ([2], [3]), in order to facilitate comparison of our experimental results with previously published results.

Experiment 1 -- Control

This experiment was conducted to determine the extent to which temporal information for vowels can be captured using only 1, 3, or 5 frames extracted at uniformly spaced time points within the labeled section of each vowel. This method is very similar to that used in previous studies of vowel classification with TIMIT data. Ten DCTCs were computed for each frame (since, as described in the following experiment, this number was found to be sufficient). The vowel classification results for these three cases were 53.9%, 63.8%, and 65.4%, respectively. These results are similar to those obtained in previous studies under similar conditions.

Experiment 2 -- Time/Frequency Resolution

This experiment was designed to examine tradeoffs in time/frequency resolution, that is, classification performance as a function of the number of DCTCs used and the number of DCSs used to encode each DCTC. For this experiment we used a fixed time interval of 300 ms centered at the labeled midpoint of each vowel and a time warping of 10. Figure 2 shows the evaluation results in bar-graph form. The results show that to some extent time and frequency resolution can be traded off, in the sense that as more DCTCs are used (better frequency resolution), fewer DCSs are needed (poorer time resolution). For optimum results, 10 or more DCTCs are required. The overall best result of 70.9% was obtained with 12 DCTCs and 5 DCSs (60 features), corresponding to a relatively smooth resolution in both time and frequency.

Figure 2. Vowel classification rate as a function of the number of DCTCs used and the number of DCSs used to encode each DCTC.

Experiment 3 -- A More Detailed Examination of Temporal Resolution

In this experiment we examined the effects of segment length and of temporal resolution within the segment (midpoint versus endpoints) on vowel classification. We varied the segment duration from 50 ms up to 300 ms and the amount of time warping from 0 to 12. Except for the 50 ms segment, we used a fixed number of features (10 DCTCs x 5 DCSs, or 50 features). For the 50 ms window we used 30 features (10 DCTCs x 3 DCSs), since there is very little temporal variation over this short interval. Table 1 shows performance as a function of the amount of time warping for different window lengths. Note that for each duration, as the warping factor increases, the basis vectors increasingly emphasize the center of the interval, thus reducing the effective time duration of the window (a short numerical illustration of this effect follows Table 1).

Table 1. Classification rates (%) for 16 vowels as a function of the amount of time warping, for different time window lengths.

    Warping     Time duration (ms)
    factor      50      100     200     300
       0        63.7    67.1    69.6    66.2
       2        64.4    65.8    70.3    69.0
       4        62.9    66.9    69.3    69.3
       6        61.5    65.5    68.7    69.8
       8        61.0    64.4    67.9    70.5
      10        61.1    63.6    68.5    70.4
      12        61.4    63.8    67.8    69.3
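As a rough, purely illustrative way to quantify this shrinking effective duration (our own measure, not one defined in the paper), the equivalent-rectangular length of the Kaiser weighting can be computed for a 300 ms, 30-frame segment at several warping factors:

```python
import numpy as np

def effective_duration_ms(duration_ms, num_frames, beta):
    """Equivalent-rectangular duration of the Kaiser weighting KW(n):
    (sum KW)^2 / (L * sum KW^2), scaled by the nominal segment length.
    Illustrative only; not a quantity used in the paper."""
    kw = np.kaiser(num_frames, beta)
    return duration_ms * kw.sum() ** 2 / (num_frames * (kw ** 2).sum())

for beta in (0, 4, 8, 12):
    print(beta, round(effective_duration_ms(300, 30, beta), 1))
```

At a warping factor of 0 the full 300 ms is weighted uniformly; the effective length decreases monotonically as the factor grows, consistent with the trend in Table 1 that longer segments favor larger warping factors.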

The results in Table 1 clearly show that performance improves as the segment interval increases from 50 ms to 100 ms to 200 ms. There is only a slight further improvement when the interval is increased from 200 ms to 300 ms (70.3% versus 70.5%). However, as expected, as the segment becomes longer, the best results are obtained with a larger warping factor. The absolute best results were obtained using a warping factor of 8 and a segment length of 300 ms.

Summary

A spectral/temporal feature set has been described for speech analysis and evaluated with vowel classification experiments. The spectral/temporal features result in substantially higher classification rates for vowels than can be obtained by simply concatenating multiple frames of static features. This new feature set has been used to obtain a vowel classification rate of 70.9% for 16 vowels of the DARPA/TIMIT database, higher than any previously reported results ([1], [4], [5]).

References

[1] K.-F. Lee and H.-W. Hon (1989), "Speaker-Independent Phone Recognition Using Hidden Markov Models," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, pp. 1641-1648.
[2] H. Leung and V. Zue (1988), "Some Phonetic Recognition Experiments Using Artificial Neural Nets," ICASSP-88, pp. I: 422-425.
[3] H. Leung and V. Zue (1990), "Phonetic Classification Using Multi-Layer Perceptrons," ICASSP-90, pp. I: 525-528.
[4] Z. B. Nossair and S. A. Zahorian (1991), "Dynamic Spectral Shape Features as Acoustic Correlates for Initial Stop Consonants," J. Acoust. Soc. Am., vol. 89, no. 6, pp. 2978-2991.
[5] L. Rudasi and S. A. Zahorian (1991), "Text-Independent Talker Identification with Neural Networks," ICASSP-91, pp. 389-392.
[6] L. Rudasi and S. A. Zahorian (1992), "Text-Independent Speaker Identification Using Binary-Pair Partitioned Neural Networks," IJCNN-92, pp. IV: 679-684.