

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 5, NO. 4, JULY 1997, p. 319

Speaker-Independent Phonetic Classification Using Hidden Markov Models with Mixtures of Trend Functions

Li Deng, Senior Member, IEEE, and Michael Aksmanovic

Abstract: In this study, we make a major extension of the nonstationary-state or trended hidden Markov model (HMM) from the previous single-trend formulation [2], [3] to the current mixture-trended one. This extension is motivated by the observation of wide variations in the trajectories of the acoustic data in fluent, speaker-independent speech associated with a fixed underlying linguistic unit. It is also motivated by potential use of mixtures of trend functions to characterize heterogeneous time-varying data generated from distinctive sources such as the speech signals collected from different microphones or from different telephone channels. We show how HMMs with mixtures of trend functions can be implemented simply in the already well-established single-trend HMM framework via the device of expanding each state into a set of parallel states. Details of a maximum-likelihood-based (ML-based) algorithm are given for estimating state-dependent mixture trajectory parameters in the model. Experimental results on the task of classifying speaker-independent vowels excised from the TIMIT database demonstrate consistent performance improvement using phonemic mixture-trended HMMs over their single-trend counterpart.

I. INTRODUCTION

The formulation of the hidden Markov model (HMM) has been successfully used in automatic speech recognition for about two decades [16]. In the standard formulation, the individual states in the HMM are each associated with a stationary stochastic process [1], [12].
This makes the standard HMM inadequate for representing the nonstationary (or smoothly time-varying) property of the many types of vocalic segments of speech, including vowels in consonantal contexts as well as diphthongs, glides, and liquids, that are intended to be described by the HMM-state statistics. A generalized or nonstationary-state HMM has been developed recently to overcome this inadequacy by introducing state-dependent polynomial regression functions over time (trend functions) that serve as a parametric-form expression of the time-varying means in the HMM's Gaussian output distributions [2], [3]. The trended HMM as described in [2] and [3] has been limited to only a single trend function associated with each HMM state. Just as extension of the unimodal Gaussian HMM [12] to the mixture HMM [9] is a significant step toward superior modeling of speech acoustics (footnote 1), we expect that the same superiority can be achieved in our nonstationary-state HMM framework by extending the single-trend HMM to the mixture-trended HMM. The rationale behind this expectation is straightforward: both contextual and speaker variations necessarily induce changes in the trajectories of the (preprocessed) speech data for a fixed underlying phonemic-like linguistic unit, the vocalic unit in particular. Such changes are not merely a vertical shift in the trajectory (footnote 2), but are more likely an alteration of the overall shape of the trajectory.

Footnote (manuscript information): Manuscript received January 29, 1994; revised November 7, 1996. This work was supported by the Natural Sciences and Engineering Research Council of Canada. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. John H. L. Hansen. The authors are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ont., Canada N2L 3G1 (e-mail: deng@crg5.waterloo.edu). Publisher Item Identifier S 1063-6676(97)04853-0.
Given this physical reality, if only a single trend function is forced in the model formulation, a wide range of the acoustic trajectory variations would be artificially averaged out, giving rise to an averaged trend function (i.e., a trajectory) in the model that would bear little resemblance to real speech data (after preprocessing in the spectral domain). The same problem would occur when the speech signals to be modeled come from separate recording conditions. In the absence of parametric techniques capable of capturing systematic variations of the acoustic trajectories of speech, one must find an expedient way to accommodate the trajectory variations caused by environmental, contextual, and speaker factors. We have adopted the mixture (nonparametric) technique for this purpose. In this paper, we explore the use of the mixture-trended HMM as a new stochastic generative model of speech acoustics aimed at speech recognition.

Footnote 1: Experimental evidence for such superiority has been reported in all major speech recognition laboratories, e.g., [5], [11], [14], [15].
Footnote 2: This type of trajectory variation can be trivially represented by a single trend function containing one free shift parameter within the framework of [2].

1063-6676/97$10.00 © 1997 IEEE

II. THE MIXTURE-TRENDED HMM

The mixture-trended HMM developed in this study has the same underlying left-to-right Markov chain as the conventional HMM [16] and the single-trend HMM [2]. Simply put, the parameters that characterize a mixture-trended HMM are: i) the state-transition matrix $A = [a_{ij}]$, $i, j = 1, 2, \ldots, N$, of the Markov chain (a total of $N$ states); and ii) state-dependent parameters in a set (i.e., mixture) of multivariate Gaussian processes for the output vector-valued sequences with time-varying means and time-invariant covariance matrices. To be specific, in the currently implemented model, the time-varying means are expressed explicitly as polynomials of the state-occupation time. Viewing each state-dependent output Gaussian process as a data-generation device, we can write the output sequence $O_t$ as

$$O_t = \sum_{r=0}^{R} B_{i,m}(r)\,(t - \tau_i)^r + \epsilon_t(\Sigma_i) \qquad (1)$$

where the first term is the state ($i$)-dependent polynomial regression function (order $R$) indexed by mixture component $m$, with $\tau_i$ registering the time when state $i$ in the HMM is just entered before regression on time takes place (footnote 3). The second term in (1) is the residual noise, assumed to be the output of an independent, identically distributed (i.i.d.) zero-mean Gaussian source with state-dependent, time-invariant covariance matrix $\Sigma_i$ (footnote 4). Note that in (1) only the polynomial coefficients $B_{i,m}(r)$ (for state $i$ and mixture component $m$) are considered true model parameters; $\tau_i$ is merely an auxiliary parameter for the purpose of obtaining maximal accuracy in estimating $B_{i,m}(r)$ (over all possible $\tau_i$ values). The single-trend HMM described in [3] is a special case of the above mixture-trended model when the size of the mixture is set to one. In that case, (1) becomes

$$O_t = \sum_{r=0}^{R} B_{i}(r)\,(t - \tau_i)^r + \epsilon_t(\Sigma_i) \qquad (2)$$

Footnote 3: Therefore, $(t - \tau_i)$ in (1) represents the occupation time in state $i$.
Footnote 4: Throughout our experience, covariance matrices play a much smaller role than the mean vectors in the mixture Gaussian distributions. Tying and untying covariances (across mixture components) make little difference in the evaluation results, and for the sake of implementation simplicity and to reduce the number of model parameters, we choose to report only the case of tying covariances across mixture components; hence, $\Sigma_i$ is not indexed by $m$.

The mixture-trended model described in this paper is a somewhat simplified version of the general mixture-trended HMM in that each of the regression functions in (1) is assumed to be equally likely a priori; i.e., mixture weights are assumed equal. This simplification is reasonable because the likelihood associated with matching the entire speech data sequence with the trajectory model from each mixture component has a much greater dynamic range than that of the mixture weight. (In all the experiments we have conducted, we found little difference in the experimental results between use of this assumption and use of general, nonequal mixture weights.) We note that a degenerate case of the model described above (polynomial order zero) becomes a stationary-state HMM. However, this is somewhat different from the conventional stationary-state mixture HMM of [5] and [9] because of the constraint imposed on the data trajectory that it has to remain within the same mixture component throughout its occupancy of an HMM state. No such constraint is imposed on the conventional mixture HMM. Since the existence of speech data trajectories is well known, use of this constraint in our model, even in the degenerate case, can be easily justified.

Fig. 1. Example of algorithmic equivalence: two super-states, each with five mixtures.

III. ESTIMATION OF POLYNOMIAL COEFFICIENTS IN THE MIXTURE MODEL

A. Algorithmic Equivalence Between Mixture and Single-Trend HMMs

One major contribution of this study is that it takes a novel view of the mixture-trended HMM, making it algorithmically equivalent to the already well-established single-trend HMM (footnote 5). By algorithmic equivalence, we mean that the two models have identical generative properties for the model output sequences and that the same algorithms can be used for scoring an arbitrary observation sequence and for estimating optimal state sequences and model parameters. One practical advantage of this new view is that in implementing training and recognition modules in the speech recognizer using the mixture-trended HMM, only minor modifications are needed to the already available software implementing the same modules for the single-trend HMM. Fig. 1 serves to illustrate the algorithmic equivalence between the single-trend HMM and the mixture-trended HMM.
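The generative view of Section II can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_state_segment`, the array layout, and the toy coefficients are all hypothetical. It shows the two properties the text relies on: the mean follows a polynomial in the state-occupation time, and one mixture component (chosen with equal weights) is held for the entire occupancy of the state.

```python
import numpy as np

def sample_state_segment(coeffs, cov, duration, rng):
    """Draw one segment from a single mixture-trended state (illustrative only).

    coeffs   : array (M, R+1, D), polynomial coefficients per mixture component
    cov      : array (D, D), covariance tied across the components of the state
    duration : number of frames spent in the state
    """
    n_mix, n_terms, dim = coeffs.shape
    m = rng.integers(n_mix)                    # equal a priori mixture weights
    d = np.arange(duration)                    # occupation time, 0 at state entry
    powers = d[:, None] ** np.arange(n_terms)[None, :]   # (duration, R+1)
    means = powers @ coeffs[m]                 # trend of the chosen component
    noise = rng.multivariate_normal(np.zeros(dim), cov, size=duration)
    return m, means + noise                    # same component for every frame

# Hypothetical toy state: two linear trends in one dimension.
rng = np.random.default_rng(0)
coeffs = np.array([[[0.0], [1.0]],             # component 0: rising line
                   [[5.0], [-1.0]]])           # component 1: falling line
cov = np.array([[0.01]])
m, seg = sample_state_segment(coeffs, cov, duration=6, rng=rng)
```

Averaging the two toy trends would give a flat line that resembles neither trajectory, which is the averaging problem the introduction attributes to a single forced trend function.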
The model in Fig. 1 is a two-state, five-mixture trended HMM; each state is identified by a dashed circle and is called a super-state (to distinguish it from the state of the conventional single-trend HMM). The algorithmically equivalent single-trend HMM has ten states (as denoted by the ten solid circles), with no allowance for state transitions within each super-state. This restricted HMM topology is essential to achieve the equivalence, as it ensures temporal continuity of each single trend function associated with the corresponding one of the ten states. Once the algorithmic equivalence between the mixture and single-trend HMMs is established, the likelihood-based estimation method for the mixture-trended HMM parameters becomes essentially the same as that for the conventional single-trend HMM, with only relatively minor technical differences that we describe below.

Footnote 5: A similar view has been expressed in [5] for the stationary-state HMM.

B. Parameter Estimation: Segmentation Step

The segmentation step of the parameter estimation algorithm developed in this study is an application of the dynamic programming principle to two optimization variables: indices for states in the HMM and the state-occupation time for each HMM state (footnote 6). To describe the segmentation step, we first introduce notation for a state sequence and for a sequence of training data frames. Note that here each item in the state sequence is the state associated with one single trend function; i.e., it is not a super-state associated with a mixture of trend functions. Also, denote a duration sequence whose entries are the state-occupation times within the respective states. Further, define the state-conditioned probability density function of an observation

[equation]

whose definition involves the dimensionality of the input data vector and a matrix transpose, and whose state subscript indicates that the state index in the algorithmically equivalent single-trend HMM is uniquely determined by the super-state index of the mixture-trended HMM and by the mixture-component index. Finally, define the partial likelihood

[equation]

as the likelihood for the optimal state sequence evaluated at a given time, with a given state-occupation time within the current state (conditioned on the parameter set of the mixture-trended HMM).

Footnote 6: The Viterbi algorithm developed for stationary-state HMMs is an application of the dynamic programming principle to only one optimization variable: the indices of HMM states.

Given the above notation and definitions, the following four operations are a complete description of the segmentation step, where the partial likelihood is efficiently computed via recursion, and a companion array is used to store the most likely state information (state identity and state duration) at each time.

1) Initialization: [equation], with the initial values given by the initial probability distribution of Markov states.
2) Recursion: See (5) and (6).
3) Termination: [equation]
4) Backtracking: [equation]

We point out one technical difference between the above segmentation step for the mixture-trended HMM and that for the conventional single-trend HMM: in (5), the maximization over the state index for the single-trend HMM, which is algorithmically equivalent to the mixture-trended HMM of concern, is constrained to be outside the super-state where the current state resides; this is indicated by the maximization range in (5), which is the complement of the super-state encompassing that state.

C. Parameter Estimation: Maximization Step

After the above segmentation step, estimation of the model parameters becomes a problem of polynomial regression. For the mixture-trended HMM, this in general would be a complex multilevel regression problem.
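To make the regression view concrete, here is a minimal sketch of the maximization step for one state of the algorithmically equivalent single-trend HMM. It assumes the segmentation has already assigned frames to that state and that the occupation time restarts at zero each time the state is entered. The names `design_matrix` and `fit_state` are hypothetical, and ordinary least squares via `numpy.linalg.lstsq` stands in for the "standard regression equations"; the tying of covariances across mixture components described in the paper is not shown.

```python
import numpy as np

def design_matrix(length, order):
    """Rows are [1, d, d^2, ...] for occupation times d = 0 .. length-1."""
    d = np.arange(length)
    return (d[:, None] ** np.arange(order + 1)[None, :]).astype(float)

def fit_state(segments, order):
    """Least-squares trend coefficients for one expanded (single-trend) state.

    segments : list of arrays (L_k, D), all frames aligned to this state,
               one array per visit to the state
    returns  : coeffs (order+1, D) and the residual covariance (D, D)
    """
    X = np.vstack([design_matrix(len(s), order) for s in segments])
    Y = np.vstack(segments)
    coeffs, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coeffs
    cov = resid.T @ resid / len(Y)
    return coeffs, cov

# Hypothetical toy data: two noiseless visits to a state with a quadratic trend.
d1, d2 = np.arange(5), np.arange(8)
seg1 = (1.0 + 2.0 * d1 + 0.5 * d1 ** 2)[:, None]
seg2 = (1.0 + 2.0 * d2 + 0.5 * d2 ** 2)[:, None]
coeffs, cov = fit_state([seg1, seg2], order=2)
```

Because each expanded state carries exactly one trend function, the fit is a single-level regression per state; no assignment of frames to mixture components has to be solved inside the regression itself.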
However, taking our view of the mixture-trended HMM as its algorithmically equivalent version of the single-trend HMM, we effectively reduce the problem to the standard (single-level) regression problem. The solution of a set of standard regression equations, which can be found in any rudimentary statistics textbook, gives estimates of the polynomial coefficients for each HMM state and for each mixture component.

IV. EXPERIMENTAL EVALUATION

The speech data employed to evaluate the mixture-trended HMM in our experiments are ten vowels/diphthongs (/aa/, /ae/, /ah/, /ao/, /eh/, /ey/, /ih/, /iy/, /ay/, /aw/) extracted from the speaker-independent TIMIT corpus. Although the model described in this paper is directly applicable to continuous speech recognition, the scope of this study is limited to context-independent vowel classification, a simple task that nevertheless involves speech data containing prominent variations in the observed trajectories for each speech class. As we mentioned in the introduction, the vocalic segments (including diphthongs) are smoothly time varying and hence have the strongest need for a trajectory model to describe them (footnote 7). All tokens of these vowels/diphthongs from 120 speakers (a total of 5110 vowel/diphthong tokens) in our database were used for training, and those from 40 disjoint speakers (a total of 1767 vowel/diphthong tokens) for classifier evaluation. A conventional speech preprocessor was used to produce mel-frequency cepstral coefficients. Briefly, a Hamming window of duration 25.6 ms was applied every 10 ms (the frame length) to the raw speech data in the form of a digitally sampled signal. Within each window, mel-frequency cepstral coefficients (MFCCs) up to the 12th order were computed (using the HTK toolkit).

Footnote 7: Most consonantal segments (e.g., stops, nasal murmurs, etc.) are short in duration, and their acoustic properties (including the transition to their adjacent segments) are better handled by the Markov chain's state transitions than by the state-conditioned trajectory model.

TABLE I. SPEAKER-INDEPENDENT VOWEL CLASSIFICATION RATE AS A FUNCTION OF POLYNOMIAL ORDER (R) AND OF THE NUMBER OF MIXTURES (M) IN EACH STATE. ONLY STATIC MFCCs (C1 TO C12 PLUS NORMALIZED C0) ARE USED AS PREPROCESSED DATA FOR THE HMMs.

TABLE II. SPEAKER-INDEPENDENT VOWEL CLASSIFICATION RATE AS A FUNCTION OF POLYNOMIAL ORDER (R) AND OF THE NUMBER OF MIXTURES (M) IN EACH STATE. BOTH STATIC MFCCs AND DELTA MFCCs ARE USED AS PREPROCESSED DATA FOR THE HMMs.

Besides the main interest of this work, which is comparing mixture-trended HMMs with single-trend HMMs and with stationary-state HMMs, in our evaluation experiments we also compare the performance of these recognizers with and without use of delta MFCCs. Although the trended HMM as a trajectory model already captures the dynamics of the speech data sequence, it is of interest to examine to what extent the trajectory modeling approach and the signal processing approach (i.e., use of delta parameters), as well as a combination of them, contribute to the recognition performance. The vowel classification results, organized by the classification rate as a function of the order of the polynomial trend function in (1) and of the number of mixture components in each of the trended HMM states, are summarized in Tables I and II. Table I gives the results with use of only static MFCCs (C1 to C12 plus normalized C0), and Table II gives those with both static MFCCs and delta MFCCs. Fixed left-to-right three-state HMMs are used (footnote 8). Note that the results for two rather orthogonal benchmark HMM classifiers are included as special cases in Tables I and II: the rows associated with polynomial order zero are the same (except for the additional trajectory-path constraint) as the stationary-state mixture HMM [5], [9], and the columns associated with a mixture number of one correspond to the single-trend HMM [2], [3].
Footnote 8: As with the conventional stationary-state HMM, the choice of the number of states in the HMM is also made empirically for the current model. Use of three states in our experiments gives satisfactory performance, which is either comparable or superior to the use of other state numbers.

The results of Tables I and II demonstrate superiority of the mixture-trended HMM over both of the benchmark HMMs. In particular, as the number of mixtures and the polynomial order increase (the latter up to two; see footnote 9), the classification rate continues to improve, except that the rates become comparable for linear and quadratic trends after the mixture number reaches ten. In general, we observe that moving from order zero to order one in the HMM trend function gives greater overall performance improvements than moving from order one to order two. The better performance of single-trend HMMs over the unimodal Gaussian stationary-state HMM (column one of Tables I and II) confirms our earlier results using a different evaluation task [3]. The better performance of mixture-trended HMMs over the single-trend HMMs (columns two to five of Tables I and II) justifies the motivation of this study introduced in Section I of this paper. By comparing the results of Table I and those of Table II, we note that use of delta MFCCs improves all types of recognizers, but for unimodal (nonmixture) HMMs, use of delta MFCCs improves the stationary-state HMM (row one, column one in Tables I and II) to a much greater degree than the trended HMMs (rows two and three, column one in Tables I and II).

Footnote 9: Our experience showed that trend functions higher than the second order do not result in superior performance. The preprocessed speech data are reasonably smooth, and hence use of low-order trend functions appears to suffice. Some occasional fast jumps in the preprocessed speech data are naturally handled by the Markov chain's state transitions.
Quantitatively, for the unimodal stationary-state HMM, the improvement from 54.3% to 60.4% corresponds to an error rate reduction of 15.4% (the 6.1-point drop in error rate taken relative to the 39.6% error rate obtained with delta MFCCs); while, for the single-trend HMMs, improvements from 58.3% to 61.7% (linear trend) and from 59.0% to 61.8% (quadratic trend) correspond, on the same basis, to significantly smaller error rate reductions of 8.9% and 7.3%, respectively. This observation, together with the general observation that trended HMMs perform better than stationary-state HMMs with and without use of delta parameters, suggests that the trajectory modeling captures at least some dynamic properties of speech data that the delta parameters themselves are unable to capture. In interpreting the classification results shown in Tables I and II, we also note a complication arising from the varying total number of model parameters associated with different polynomial orders and different mixture sizes. Nevertheless, certain pairs of cases, each pair combining a different polynomial order with a different mixture size, can be compared directly, since these model pairs contain an identical number of model parameters. In analyzing the classification results, we have also observed rather nonuniform distributions of the classification errors over different vowel/diphthong categories. To illustrate, we show in Table III the confusion matrix of the classification result associated with the entry of rate 68.8% in Table II. We observe in general that the tense or long vowels/diphthongs have significantly greater classification accuracy than short ones. The long vowels/diphthongs are /iy/, /ey/, /ae/, /ao/, and /ay/, and their classification accuracy is above 70% in every case. In contrast, the remaining five relatively short vowel classes, including the diphthong /aw/, achieve classification accuracy only on the order of 60% and below. Such clear disparity in classification accuracy may be attributed to two factors. First, the polynomial trend functions used in the HMM are more suited to describing the smooth data trajectories that are exhibited in the long vowel/diphthong sounds. Second, long vowels/diphthongs are less subject to the context-dependent reduction effects in the fluent TIMIT utterances than the short vowels, and hence tend to cause fewer confusions in our context-independent classifier.

TABLE III. CONFUSION MATRIX SHOWING CLASSIFICATION ERROR DISTRIBUTION.

V. SUMMARY AND DISCUSSION

We propose, implement, and evaluate a new version of the nonstationary-state HMM with each state characterized by a mixture of trend functions (time-varying Gaussian means) embedded in stationary white noise. This new version of the model can be viewed as a generalization of either the single-trend nonstationary-state HMM [2] or the stationary-state HMM with mixture characterization of the states [5], [9] (with the exception that in our model there is an additional constraint that each constant-line trajectory does not jump across different mixture components within each state). The generalization from the single-trend model can be viewed as providing discrete-mode distributions on the segment-bound polynomial parameters (footnote 10). Development of this new model is motivated mainly by the observation that contextual and speaker variations bring about widely varying trajectory shapes of the acoustic data in fluent, speaker-independent speech examined in the TIMIT database. The speech recognition evaluation results we have obtained so far show consistent performance improvement in the recognizer based on the new model.
Footnote 10: We note that discrete-mode distributions have also been provided for other types of stochastic segment models [10], [8], [7], and that continuous-mode distributions on parameters, as a special case of our arbitrary-order polynomial model, have appeared in [17] and [6].

Although the experiments reported in this paper are limited to only the vowel classification task, the model is, in theory, well suited for use in continuous speech recognition tasks. The main difficulty in extending the experiments to continuous speech recognition lies in the computational complexity. We discuss here several aspects of the computational complexity associated with the implementation of the mixture-trended HMM developed in this study. The major computation for the model training lies in the segmentation algorithm described in Section III-B, with the maximization step occupying only a very small fraction of the total computation. (In fact, the decoding process requires exactly the same computation as the segmentation algorithm.) First, the computational complexity grows linearly with the size of the mixture, much like the conventional stationary-state mixture HMM. Second, increasing the polynomial order from one to more than one (all nonstationary-state HMMs) has very little effect on the total computation. Only a small overhead is incurred in computing more terms of the polynomial as Gaussian means and in regression (the maximization step in the EM algorithm). Finally, the segmentation algorithm has computational complexity that is quadratic in the observation length for nonstationary-state models (polynomial order one or greater), significantly greater than that for the stationary-state HMM (polynomial order zero), which grows only linearly with the observation length. In practice, as we have implemented in our vowel classification experiments, state duration constraints can be effectively utilized to reduce the computation with only minimal effects on the segmentation accuracy.
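The segmentation step and the effect of a duration cap can be sketched as follows. This is a hedged illustration written from the prose description in Section III-B, not a transcription of the paper's equations: the function `segment`, its arguments, the `max_dur` cap, and the toy model are all assumptions. The dynamic program runs over two variables (state and occupation time), entering a state is only allowed from outside its super-state, and capping the occupation time at `max_dur` bounds the inner loop that otherwise grows with the observation length.

```python
import numpy as np

def log_gauss(o, mean, inv_cov, log_det):
    diff = o - mean
    return -0.5 * (len(o) * np.log(2 * np.pi) + log_det + diff @ inv_cov @ diff)

def segment(obs, coeffs, inv_covs, log_dets, log_trans, log_init, super_of, max_dur):
    """Best path over expanded single-trend states; delta[t, j, d] is the best
    log-likelihood ending at frame t in state j with occupation time d."""
    T, N = len(obs), len(coeffs)
    d_max = min(T, max_dur)
    delta = np.full((T, N, d_max), -np.inf)
    prev = np.full((T, N, 2), -1, dtype=int)   # (state, duration) left on entry

    def emit(j, t, d):
        mean = (float(d) ** np.arange(coeffs[j].shape[0])) @ coeffs[j]
        return log_gauss(obs[t], mean, inv_covs[j], log_dets[j])

    for j in range(N):
        if log_init[j] > -np.inf:
            delta[0, j, 0] = log_init[j] + emit(j, 0, 0)
    for t in range(1, T):
        for j in range(N):
            # Stay in state j: the occupation time advances by one frame.
            for d in range(1, min(t, d_max - 1) + 1):
                if delta[t - 1, j, d - 1] > -np.inf:
                    delta[t, j, d] = (delta[t - 1, j, d - 1]
                                      + log_trans[j, j] + emit(j, t, d))
            # Enter state j: predecessors must lie outside j's super-state.
            best, arg = -np.inf, (-1, -1)
            for i in range(N):
                if super_of[i] == super_of[j]:
                    continue
                d_i = int(np.argmax(delta[t - 1, i]))
                score = delta[t - 1, i, d_i] + log_trans[i, j]
                if score > best:
                    best, arg = score, (i, d_i)
            if best > -np.inf:
                delta[t, j, 0] = best + emit(j, t, 0)
                prev[t, j] = arg
    j, d = np.unravel_index(int(np.argmax(delta[T - 1])), delta[T - 1].shape)
    best_ll = delta[T - 1, j, d]
    if not np.isfinite(best_ll):
        raise ValueError("no admissible path under the given constraints")
    path = np.empty(T, dtype=int)
    t = T - 1
    while True:                                 # backtrack segment by segment
        path[t - d:t + 1] = j
        entry = t - d
        if entry == 0:
            break
        j, d = prev[entry, j]
        t = entry - 1
    return best_ll, path

# Hypothetical toy model: 2 super-states x 2 components, linear trends, 1-D data.
coeffs = [np.array([[0.0], [1.0]]), np.array([[0.0], [-1.0]]),
          np.array([[10.0], [0.0]]), np.array([[-10.0], [0.0]])]
super_of = [0, 0, 1, 1]
trans = np.array([[0.5, 0, 0.25, 0.25], [0, 0.5, 0.25, 0.25],
                  [0, 0, 1.0, 0], [0, 0, 0, 1.0]])
with np.errstate(divide="ignore"):
    log_trans, log_init = np.log(trans), np.log(np.array([0.5, 0.5, 0.0, 0.0]))
obs = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [10.0], [10.0]])
best_ll, path = segment(obs, coeffs, [np.eye(1)] * 4, [0.0] * 4,
                        log_trans, log_init, super_of, max_dur=4)
```

In the toy run the rising data are assigned to the rising component of the first super-state and never switch to its sibling component, since transitions inside a super-state are excluded. Without the cap, the table holds up to one occupation time per frame, which is where the quadratic growth in the observation length comes from.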
The state duration constraints would be significantly more difficult to provide for continuous speech recognition, which has limited our current evaluation of the trended HMM to discrete utterance classification. In the mixture-trended HMM, the duration distribution for any HMM state is still exponential; that is, no changes from the conventional HMM in the durational aspect have been made. However, due to the use of frame-dependent output distributions, the segmentation algorithm has a complexity similar to that of the semi-HMM described in [13]. We emphasize that the similarly high computational complexity of model parameter estimation in the current trended HMM and in the semi-HMM of [13] results from completely different causes. For the former, the computational overhead is due to use of the frame-dependent output distributions within each HMM state; in the latter, the overhead is due to use of the nonexponential state duration distribution. One major focus of our recent work on speech recognition has been to develop a parsimonious phonological representation for fluent speech based on concepts borrowed from articulatory phonology [4]. Central to a phonological representation of this type is the process of temporal overlap of multidimensional articulatory features (or gestures). The transitional Markov states constructed by overlapping one or more of the primary articulatory features (lips, tongue blade, or tongue body) are the ideal site where the mixtures of nonstationary trend functions should be in use. Since the contextual factors have been largely removed within this new gesture-based phonological framework, the mixtures in the trended HMM can be used more effectively to capture the acoustic trajectory variations due to speaker-related factors only. Finally, we note that the mixture model described in this paper can be effectively used to characterize speech signals mixed from a fixed number of distinct generating sources. This situation arises, for example, when the training data for a speech recognizer are collected from different telephone channels.
The mixture-trended HMM is particularly suited to characterizing such heterogeneous speech data sources because of its inherent constraint [see (5)], which ensures that each separate data sequence follows a distinct model trajectory (rather than jumping across a set of trajectories within an HMM state). Therefore, our new model is effective not only for handling speaker and phonetic variability in speech, but also for handling environmental (microphone or telephone channel) variability.

ACKNOWLEDGMENT

The authors thank the anonymous reviewers, who provided constructive comments that improved the quality of the paper.

REFERENCES

[1] L. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," Inequalities, vol. 3, pp. 1–8, 1972.
[2] L. Deng, "A generalized hidden Markov model with state-conditioned trend functions of time for the speech signal," Signal Processing, vol. 27, pp. 65–78, Apr. 1992.
[3] L. Deng, M. Aksmanovic, D. Sun, and C. F. J. Wu, "Speech recognition using hidden Markov models with polynomial regression functions as nonstationary states," IEEE Trans. Speech Audio Processing, vol. 2, pp. 507–520, Oct. 1994.
[4] L. Deng and D. Sun, "A statistical approach to automatic speech recognition using the atomic speech units constructed from overlapping articulatory features," J. Acoust. Soc. Amer., vol. 95, pp. 2702–2719, May 1994.
[5] L. Deng et al., "Phonemic hidden Markov models with continuous mixture output densities for large vocabulary word recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 39, pp. 1677–1681, July 1991.
[6] M. Gales and S. Young, "The theory of segmental hidden Markov models," Tech. Rep. CUED/F-INFENG/TR.133, Dept. Eng., Cambridge Univ., Cambridge, U.K., 1993.
[7] W. Goldenthal and J. Glass, "Modeling spectral dynamics for vowel classification," in Proc. Eurospeech, 1993, pp. 289–292.
[8] Y. Gong and J. P. Haton, "Stochastic trajectory modeling for speech recognition," in Proc. ICASSP, 1994, vol. 1, pp. 57–60.
[9] B.-H. Juang, S. Levinson, and M. Sondhi, "Maximum likelihood estimation for multivariate mixture observations of Markov chain," IEEE Trans. Inform. Theory, vol. IT-32, pp. 307–309, 1986.
[10] A. Kannan and M. Ostendorf, "A comparison of trajectory and mixture modeling in segment-based word recognition," in Proc. ICASSP, 1993, vol. 2, pp. 327–330.
[11] C. Lee, L. Rabiner, R. Pieraccini, and J. Wilpon, "Acoustic modeling for large vocabulary speech recognition," Comput. Speech Language, vol. 4, pp. 127–165, 1990.
[12] L. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Trans. Inform. Theory, vol. 28, pp. 729–734, 1982.
[13] S. Levinson, "Continuously variable duration hidden Markov models for automatic speech recognition," Comput. Speech Language, vol. 1, pp. 29–45, 1986.
[14] A. Nadas and D. Nahamoo, "Automatic speech recognition via pseudoindependent marginal mixtures," in Proc. ICASSP, 1987, pp. 1285–1288.
[15] H. Ney and A. Noll, "Phoneme modeling using continuous mixture densities," in Proc. ICASSP, 1988, pp. 437–440.
[16] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257–285, Feb. 1989.
[17] M. Russel, "A segmental HMM for speech pattern matching," in Proc. ICASSP, 1993, vol. 2, pp. 499–502.

Li Deng (S'83–M'86–SM'91) received the B.S. degree in biophysics from the University of Science and Technology of China in 1982, and the M.S. and Ph.D. degrees in electrical engineering from the University of Wisconsin, Madison, in 1984 and 1986, respectively. He worked on large vocabulary automatic speech recognition at INRS-Telecommunications, Montreal, P.Q., Canada, from 1986 to 1989. Since 1989, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ont., Canada, where he is currently Full Professor.
From 1992 to 1993, he conducted sabbatical research at the Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, working on statistical models of speech production and the related speech recognition algorithms. His research interests include acoustic-phonetic modeling of speech, speech recognition, synthesis, and enhancement, speech production and perception, statistical methods for signal analysis and modeling, nonlinear signal processing, neural network algorithms, computational phonetics and phonology for the world's languages, and auditory speech processing.

Michael Aksmanovic received the B.A.Sc. in computer engineering and the M.A.Sc. in electrical engineering in 1991 and 1993, respectively, both from the University of Waterloo, Waterloo, Ont., Canada. He is currently working toward the Ph.D. at the University of Victoria, Victoria, BC, Canada. His research interests include digital signal processing, speech recognition, and parallel programming.