On Compensating the Mel-Frequency Cepstral Coefficients for Noisy Speech Recognition

Eric H. C. Choi
Interfaces, Machines and Graphic Environments (IMAGEN)
National ICT Australia
Locked Bag 9013, Alexandria, NSW 1435, Sydney, Australia
Eric.Choi@nicta.com.au

Abstract

This paper describes a novel noise-robust automatic speech recognition (ASR) front-end that combines Mel-filterbank output compensation with cumulative distribution mapping of cepstral coefficients to a truncated Gaussian distribution. Recognition experiments on the Aurora II connected digits database show that the proposed front-end achieves an average digit recognition accuracy of 84.92% for a model set trained on clean speech data. Compared with the ETSI standard Mel-cepstral front-end, the proposed front-end obtains a relative error rate reduction of around 61%. Moreover, the proposed front-end provides recognition accuracy comparable to that of the ETSI advanced front-end, at less than half the computation load.

Keywords: Speech recognition, noise robustness, front-end processing, Mel-frequency cepstral coefficient.

1 Introduction

The proliferation of handheld computing devices has been the driving force behind the growing need for more usable and natural user interfaces for ubiquitous computing. Traditional user interfaces based on keyboard and mouse will not fulfil the needs of these mobile users. Automatic speech recognition (ASR) plays a critical role in providing more user-friendly interfaces for these handheld devices. However, since a handheld device can be used anywhere and in different environments, the design of a speech recognition system must take potentially noisy acoustic environments into consideration.

Automatic speech recognition basically consists of two stages (Rabiner and Juang 1993).
The first stage, known as front-end processing or feature extraction, extracts a time sequence of feature vectors that represents the temporal evolution of the spectral characteristics of a speech signal. The second stage is a pattern matching process in which the actual search is carried out to decode the spoken words by matching the sequence of feature vectors against the acoustic and language models stored in the recogniser. In state-of-the-art ASR systems, the features extracted in front-end processing are typically Mel-frequency cepstral coefficients (MFCCs), and the pattern matching is mostly based on hidden Markov modelling (HMM), which requires relevant speech samples to train the acoustic models beforehand.

Copyright 2006, Australian Computer Society, Inc. This paper appeared at the Twenty-Ninth Australasian Computer Science Conference (ACSC2006), Hobart, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 48. Vladimir Estivill-Castro and Gill Dobbie, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

State-of-the-art ASR systems work well if the training and usage conditions are similar and reasonably benign. However, under the influence of noise, these systems begin to degrade and their accuracies may become unacceptably low in severe environments (Deng and Huang 2004). To remedy this noise robustness issue, which stems from the static nature of the HMM parameters once trained, various adaptive techniques have been proposed. A common theme of these techniques is the use of some form of compensation to account for the effects of noise on the speech characteristics. In general, a compensation technique can be applied in the signal, feature or model space to reduce the mismatch between training and usage conditions (Huang et al. 2001). Signal-space methods, e.g. (Ephraim 1992), typically try to enhance a noisy speech signal by improving its signal-to-noise ratio (SNR).
However, an increase in SNR does not always improve recognition accuracy. Feature-space methods, e.g. (Hermansky 1990), try to derive a feature representation that is potentially invariant to changes in environmental noise conditions. This is often achieved by incorporating some aspects of human auditory modelling. Alternatively, other feature-space methods (Sankar and Lee 1996; Choi 2004) try to understand and compensate for the effects of noise on a speech representation, and thereby reduce the mismatch. Model-space methods, e.g. (Yao et al. 2001; Zhang and Furui 2004), try to adjust the recognition model parameters to incorporate the effects of noise into the acoustic models.

A few standards for ASR feature extraction are available from the European Telecommunications Standards Institute (ETSI). The standard WI007 Mel-cepstral front-end (ETSI 2000) covers the processing of a speech signal into MFCCs. As this basic front-end is not robust enough for noisy speech recognition, another standard that is more appropriate for noisy speech has been released. The WI008 advanced front-end (ETSI 2002) uses two-stage Mel-warped Wiener filtering to improve the signal-to-noise ratio of a speech utterance, and then applies SNR-dependent waveform processing to the noise-reduced signal. The resultant speech signal is further processed into MFCCs, after which blind equalisation is applied to the cepstral coefficients. While the WI008 advanced front-end represents the state of the art in recognition accuracy for noisy speech, its main drawback is a high computation load due to the double Wiener filtering. There is a need for an alternative front-end that is as robust as the WI008 advanced front-end but has a computation load comparable to that of the WI007 standard front-end.

In this work, the main focus is on feature-space compensation for a cepstral-based front-end. It is demonstrated that a general framework of Mel-filterbank output compensation can be used together with cumulative distribution mapping to compensate for the effects of noise. Here we extend our previous work (Choi 2004) by adding log Mel-filterbank output weighting and frame skipping to the proposed front-end processing. We also benchmark the proposed front-end against the ETSI front-ends for evaluation purposes.

The organisation of this paper is as follows. Section 2 describes the details of the proposed front-end. Section 3 presents recognition experiments on the Aurora II digits database and the corresponding discussion. Finally, Section 4 summarises the conclusions.

2 Front-end Processing

Typical ASR front-ends lack the ability to compensate for the effects of noise on feature extraction, and if a speech signal is noisy, they tend to extract more information about the noise than about the speech itself. A noise-robust front-end therefore needs to have knowledge about the noise and adjust its processing accordingly to extract only relevant information about the speech.
To this end, we have experimented with novel noise compensation techniques that not only account for the effects of noise but also emphasise speech information that is less susceptible to noise corruption. A high-level block diagram of the proposed front-end is shown in Figure 1.

The development of the proposed front-end processing is based on the ETSI standard Mel-frequency cepstral coefficient front-end (ETSI 2000). Typically, the MFCCs ($C_i$) of a frame of speech data are given by:

$$C_i = \sum_{j=1}^{M} m_j \cos\left[\frac{i\pi}{M}(j - 0.5)\right]; \quad m_j = \log_e(Y_j); \quad i = 0, 1, 2, \ldots, N; \quad N < M \tag{1}$$

where $Y_j$ is the output magnitude of the $j$-th Mel-filterbank and $M$ is the total number of Mel-filters in the filterbank analysis. In processing an utterance, each frame of speech data is 25 ms wide and there is a 10 ms time shift between the current frame and the next frame (i.e. a 15 ms overlap between two consecutive frames).

[Figure 1: Block diagram of the proposed front-end (two novel processing blocks related to noise compensation are highlighted). Processing chain: speech signal $y(t)$ → pre-processing and FFT → Mel-frequency filtering ($Y_j$) → Mel-filterbank output compensation ($m_j$) → discrete cosine transform (DCT), 13 MFCCs ($C_0$ to $C_{12}$) → cumulative distribution mapping and frame skipping → output features.]

In this work, two more processing blocks related to noise compensation have been added to the ETSI standard MFCC front-end. The Mel-filterbank output compensation block incorporates noise spectral subtraction, spectral flooring and log Mel-filterbank output weighting into a single framework. Moreover, noise robustness is further enhanced by applying cumulative distribution mapping (CDM) with frame skipping to the resultant cepstral coefficients. This noise compensation framework is described in detail in the following sub-sections.

2.1 Mel-filterbank Output Compensation

The noise robustness of the proposed front-end is enhanced by compensating the Mel-filterbank outputs according to the noise spectral characteristics.
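As a point of reference before any compensation is applied, the uncompensated cepstral computation of equation (1) can be sketched as follows. This is a minimal illustration only, not the ETSI reference code: the function name `mfcc_from_filterbank`, the use of NumPy, and the value of 23 filters in the usage lines are assumptions made for the example, and the filterbank magnitudes are assumed to be strictly positive.

```python
import numpy as np


def mfcc_from_filterbank(Y, num_ceps=13):
    """Equation (1): C_i = sum_j log(Y_j) * cos(i*pi/M * (j - 0.5)).

    Y holds Mel-filterbank output magnitudes, shape (M,) or (frames, M).
    Assumption: every magnitude is > 0, so the logarithm is defined.
    """
    Y = np.asarray(Y, dtype=float)
    M = Y.shape[-1]                      # total number of Mel filters
    m = np.log(Y)                        # log filterbank outputs m_j
    i = np.arange(num_ceps)[:, None]     # cepstral index i = 0..N (N < M)
    j = np.arange(1, M + 1)[None, :]     # filter index j = 1..M
    basis = np.cos(i * np.pi / M * (j - 0.5))
    return m @ basis.T                   # shape (num_ceps,) or (frames, num_ceps)


# Usage with made-up magnitudes for a single frame (23 filters is an assumed value).
frame = np.linspace(1.0, 50.0, 23)
cepstra = mfcc_from_filterbank(frame)
```

The cosine basis is built once per call for clarity; a practical front-end would precompute it.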
In this work, an enhanced log Mel-filterbank output is given by:

$$m_j = \alpha_j \log_e\{1 + \beta \, \mathrm{MAX}[(Y_j - \hat{N}_j), \ \gamma Y_j]\} \tag{2}$$

where $\alpha_j, \beta, \gamma \in (0, 1)$ are parameters to adjust the noise compensation, $\hat{N}_j$ is the noise magnitude estimate of the $j$-th Mel-filterbank output and MAX[.] is a function which returns the maximum value of its arguments. Note that $\gamma$ is used to control the degree of noise spectral subtraction (Vaseghi 2000) and $\beta$ is used to adjust the degree of spectral flooring (Choi 2004). Here, both $\gamma$ and $\beta$ are assumed to be independent of the Mel-filterbank index $j$, as we are more interested in the log Mel-filterbank output weighting and this assumption simplifies the formulation. These two parameters are also applied globally, in that they have the same values for all speech utterances.

The motivation for incorporating log Mel-filterbank output weighting is to emphasise those filterbank outputs which are found to be more reliable and less affected by the actual noise spectral characteristics. One possible measure of the reliability of a filterbank output is its signal-to-noise ratio (SNR). From the viewpoint of psychoacoustics (Stevens 1957), these weighting factors ($\alpha_j$) are related to the spectral compression process by which humans convert sound intensity into perceived loudness. So far in the literature, each weighting factor has been assumed to depend only on its individual output SNR. In our case, however, a weighting factor also depends on the SNRs of the other filterbank outputs, and it is given by:

$$\alpha_j = \frac{\log_e\left(1 + \dfrac{Y_j}{\hat{N}_j}\right)}{\sum_{k=1}^{M} \log_e\left(1 + \dfrac{Y_k}{\hat{N}_k}\right)}; \quad \sum_{j=1}^{M} \alpha_j = 1 \tag{3}$$

The constant 1 is added inside the log function to prevent it from taking negative values, since there may be errors in the noise estimates. In essence, $\alpha_j$ is calculated as the ratio of the (log) SNR of a particular filterbank output to the sum of the (log) SNRs of all the filterbank outputs. Moreover, all the weighting factors are calculated dynamically frame by frame, based on the noise estimates from the first 10 frames of each speech utterance.

While equation (2) provides a general framework for the noise compensation, it is anticipated that some kind of normalisation of the dynamic ranges of the compensated cepstral coefficients would be beneficial. For this purpose, we choose to apply cumulative distribution mapping to the cepstral coefficients after noise compensation.

2.2 Cumulative Distribution Mapping

The cumulative distribution mapping (CDM) method described here is based on histogram equalisation (HE), originally developed for image processing (Russ 1995).
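Before the mapping itself is detailed, the compensation of Section 2.1 that produces its input can be sketched as follows. This is a hedged illustration of equations (2) and (3), not the author's implementation. The function names are hypothetical, taking the noise estimate as the plain mean of the first 10 frames is an assumption (the text only says the estimate comes from those frames), and the small `eps` guard against division by zero is an addition for numerical safety. The defaults for `beta` and `gamma` are the values reported in Section 3.2.

```python
import numpy as np


def estimate_noise(Y, num_noise_frames=10):
    """Noise magnitude estimate per filter from the leading frames.

    Assumption: a plain mean over the first frames of the utterance.
    """
    return np.asarray(Y, dtype=float)[:num_noise_frames].mean(axis=0)


def snr_weights(Y, N_hat, eps=1e-10):
    """Equation (3): per-frame weights alpha_j that sum to 1 over the filters."""
    N_hat = np.maximum(N_hat, eps)              # eps guard is an assumption
    log_snr = np.log1p(Y / N_hat)               # log_e(1 + Y_j / N_hat_j)
    return log_snr / log_snr.sum(axis=-1, keepdims=True)


def compensate(Y, N_hat, beta=0.001, gamma=0.4):
    """Equation (2): m_j = alpha_j * log_e(1 + beta * MAX[Y_j - N_hat_j, gamma * Y_j])."""
    Y = np.asarray(Y, dtype=float)
    alpha = snr_weights(Y, N_hat)
    subtracted = np.maximum(Y - N_hat, gamma * Y)  # subtraction limited by gamma * Y_j
    return alpha * np.log1p(beta * subtracted)


# Usage on synthetic magnitudes: 50 frames, 23 filters (both values are made up).
rng = np.random.default_rng(0)
Y = rng.uniform(1.0, 100.0, size=(50, 23))
m = compensate(Y, estimate_noise(Y))
```

Because the weights are normalised per frame, a filter with a high SNR relative to the other filters in the same frame contributes more to the cepstral coefficients computed from `m`.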
The use of the HE method for noise compensation in front-end processing of speech can also be found in (Dharanipragada and Padmanabhan 2000). The main idea of this method is to map the distribution of a time sequence of noisy speech features into a target distribution with a pre-defined probability density function (PDF). In our case, it is assumed that for a given feature value $v_o$, the mapping relationship is:

$$\int_{v=-\infty}^{v_o} f(v)\,dv = \int_{z=-\infty}^{z_o} h(z)\,dz; \quad \text{or} \quad F_v(v_o) = F_z(z_o) \tag{4}$$

where $F_v(v)$ is the cumulative distribution function (CDF) of a given set of noisy speech features, $F_z(z)$ is the target CDF, and $f(v)$ and $h(z)$ are the respective PDFs. From equation (4), we have

$$z_o = F_z^{-1}[F_v(v_o)] \tag{5}$$

Therefore the required mapping from a given speech feature $v_o$ into the corresponding target feature $z_o$ is represented by equation (5). Typically $h(z)$ is assumed to be Gaussian, as in the histogram equalisation literature. On the other hand, there is no particularly strong reason, other than ease of implementation, to assume that $h(z)$ is Gaussian. In fact, we have observed that the left tail region of the distribution of a normalised feature may not be that useful, as it mainly represents the range of noisier features. Based on this observation, we have developed the novel use of a truncated Gaussian as the target distribution. Mathematically, the additional constraint is given by:

$$z_o = \begin{cases} F_z^{-1}[F_v(v_o)], & \text{if } F_v(v_o) \geq \theta_{th} \\ \text{SKIP}, & \text{otherwise} \end{cases} \qquad 0 \leq \theta_{th} < 1 \tag{6}$$

where $\theta_{th}$ is a constant that determines the fraction of features to be discarded, and SKIP denotes a function that skips the current frame of speech data and does not output any feature value. In the current implementation, the skipping of a whole feature vector is based only on $C_0$ (the zeroth-order cepstral coefficient), as it indicates the energy level of a frame of speech data. Moreover, $h(z)$ is assumed to be a Gaussian with zero mean and unit variance.
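Equations (5) and (6) can be sketched as below. This is an illustrative sketch under stated assumptions rather than the paper's implementation: the source CDF $F_v$ is estimated per utterance from ranks as (rank - 0.5)/T, which is an assumed estimator chosen to keep the values strictly inside (0, 1); the CDFs are computed over all frames before any frame is dropped, which is also an assumption; and the function names are hypothetical. The inverse target CDF uses `statistics.NormalDist` from the Python standard library.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)  # target h(z): zero mean, unit variance


def empirical_cdf(x):
    """Rank-based estimate of F_v for each value of a 1-D sequence (assumed estimator)."""
    ranks = np.argsort(np.argsort(x, kind="stable"), kind="stable") + 1  # ranks 1..T
    return (ranks - 0.5) / len(x)


def cdm_with_frame_skipping(ceps, theta_th=0.08):
    """Map each cepstral coefficient to the target Gaussian and skip low-C0 frames.

    ceps has shape (frames, coefficients) with C0 in column 0.
    Returns the mapped features of the kept frames and the boolean keep mask.
    """
    ceps = np.asarray(ceps, dtype=float)
    # Equation (6): the skip decision for a whole frame uses C0 only.
    keep = empirical_cdf(ceps[:, 0]) >= theta_th
    mapped = np.empty_like(ceps)
    for i in range(ceps.shape[1]):
        F_v = empirical_cdf(ceps[:, i])
        # Equation (5): z_o = F_z^{-1}[F_v(v_o)], one coefficient at a time.
        mapped[:, i] = [_STD_NORMAL.inv_cdf(p) for p in F_v]
    return mapped[keep], keep


# Usage on synthetic cepstra: 100 frames of 13 coefficients (made-up data).
rng = np.random.default_rng(0)
features, keep_mask = cdm_with_frame_skipping(rng.normal(size=(100, 13)))
```

Setting `theta_th=0.0` keeps every frame, which corresponds to plain histogram equalisation to a Gaussian target without frame skipping.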
In the experiments, CDM is applied only to the static feature vector, which consists of 13 MFCCs ($C_0$ to $C_{12}$), and each cepstral coefficient is normalised individually.

3 Experimental Results

The proposed front-end has been evaluated on the Aurora II database (Hirsch and Pearce 2000). This database contains noisy connected digits, which were created by adding various types of noise at different SNRs to the original clean utterances (i.e. utterances with high SNRs). There are three test sets in the database and they contain 8 types of additive noise. Each of the test sets A and B contains about 28K utterances and test set C is about half that size. Test set C includes channel distortion as well. The SNRs of the test data range from -5 dB to more than 20 dB. The training data consist of another 8440 clean utterances.

3.1 Experimental Setup

All the pre-processing and Mel filtering of a speech signal in the proposed front-end followed the ETSI standard MFCC front-end. The static feature vector of our front-end consisted of 13 MFCCs ($C_0$ to $C_{12}$). This static feature vector was appended with its corresponding 1st-order and 2nd-order time derivatives to form a resultant vector with 39 coefficients for speech recognition at the back-end, as per the Aurora evaluation framework.

Hidden Markov modelling (HMM) (Rabiner and Juang 1993) was used for the speech recognition experiments. Each model was represented by a continuous density HMM with a left-to-right configuration. Digit models had 16 states with 3 Gaussians per state, while the silence model had 3 states with 6 Gaussians per state. An inter-digit silence model with 1 state was also used, and it was tied with the middle state of the silence model.

3.2 Comparison of Accuracy and Robustness

We followed the official Aurora evaluation framework, in that the average recognition accuracy for each test set is calculated from the recognition results for test data with SNRs from 0 dB to 20 dB only. In all the experiments reported here, the spectral flooring parameter $\beta$ and the spectral subtraction parameter $\gamma$ for the proposed front-end were set to 0.001 and 0.4 respectively, as determined empirically in preliminary experiments. Note that the 1st-order and 2nd-order time derivatives of a static feature vector were generated after the static features had been compensated and normalised.

The first set of experiments investigated the effect of the frame skipping threshold ($\theta_{th}$) on the recognition accuracy of the proposed front-end. The results obtained with various values of the threshold are summarised in Table 1.

Table 1: Average digit accuracies (%) for Aurora test sets, proposed front-end with various thresholds ($\theta_{th}$) for skipping frames

θ_th     Test A   Test B   Test C   Avg.
0.00 #   83.65    84.00    82.74    83.46
0.03     84.47    84.90    83.58    84.32
0.05     84.57    84.93    83.76    84.42
0.06     84.71    85.22    83.91    84.61
0.07     84.98    85.41    84.08    84.82
0.08     85.06    85.49    84.21    84.92
0.09     85.10    85.50    84.10    84.90
0.10     85.08    85.61    83.95    84.88

# No frame skipping in this case

As observed from Table 1, incorporating frame skipping in CDM does improve the accuracy of the proposed front-end, and the threshold giving the best average accuracy is 0.08. Since frame skipping is applied to feature vectors with smaller values of $C_0$, it is equivalent to removing speech segments which have lower frame energy. These segments are less reliable for discriminating between different speech sounds, as they can potentially contain more information about the noise than about the speech signal itself. Furthermore, it may be observed from the table that the optimal frame skipping threshold differs between test sets (0.09 for test set A, 0.10 for test set B and 0.08 for test set C). It seems that some kind of adaptive threshold that follows the noise condition and characteristics would be beneficial. Nevertheless, for the Aurora digit strings, skipping roughly 8% to 10% of the feature vectors in an utterance appears to be reasonable.

Table 2: Average digit accuracies (%) for Aurora test sets, comparing proposed front-end ($\theta_{th}$ = 0.08) with ETSI MFCC front-ends

Front-end   Test A   Test B   Test C   Avg.    % Improv*
ETSI std.   61.34    55.75    66.14    61.08    0.0
ETSI adv.   86.20    85.24    84.72    85.39   62.5
Proposed    85.06    85.49    84.21    84.92   61.3

* % Improvement is measured as relative error rate reduction with reference to the ETSI standard front-end

The performance of the proposed front-end with $\theta_{th}$ = 0.08 was compared with that of the ETSI MFCC front-ends, and the results are shown in Table 2. From the table, it can be observed that the proposed front-end performs much better than the ETSI standard MFCC front-end in terms of average recognition accuracy, while it achieves recognition accuracy comparable to that of the ETSI advanced front-end.
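The "% Improv" column of Table 2 can be reproduced from the average accuracies. The short sketch below shows the arithmetic, taking the error rate as 100 minus the accuracy; the function name is hypothetical and introduced only for this illustration.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative error rate reduction (%) of new_acc with respect to baseline_acc.

    Accuracies are percentages; the error rate is taken as 100 - accuracy.
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


# Average accuracies from Table 2, with the ETSI standard front-end as baseline.
print(round(relative_error_reduction(61.08, 84.92), 1))  # proposed: 61.3
print(round(relative_error_reduction(61.08, 85.39), 1))  # ETSI advanced: 62.5
```

For the proposed front-end, the error rate drops from 38.92% to 15.08%, and (38.92 - 15.08) / 38.92 is approximately 0.613.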
Although the proposed front-end seems to perform marginally better than the advanced front-end on test set B (85.49% vs. 85.24%), the difference in accuracy is not statistically significant. The fact that the proposed front-end and the advanced front-end have similar accuracy is noteworthy, since the proposed front-end requires only about half the computation load of the advanced front-end, as shown in Section 3.3.

[Figure 2: Average recognition results for Aurora test sets, proposed front-end ($\theta_{th}$ = 0.08) compared with ETSI MFCC front-ends by SNR. Vertical axis: Digit Accuracy (%). Horizontal axis: SNR (dB), with conditions Clean, 20, 15, 10, 5, 0 and -5. Series: ETSI_std, ETSI_adv, Proposed.]

To give an insight into how the proposed front-end performs at different noise levels, a breakdown of the recognition results by individual SNR, averaged across all three test sets, is shown in Figure 2. The corresponding results for the ETSI MFCC front-ends are also shown in the figure. As observed from Figure 2, the proposed front-end and the advanced front-end perform similarly at the different SNRs. Both perform much better than the ETSI standard MFCC front-end, particularly in the noisier conditions. In some cases (e.g. at 5 dB SNR), the proposed front-end more than doubles the recognition accuracy.

To illustrate the performance of the front-ends for different noise types, the average recognition accuracy over 0 to 20 dB SNRs obtained by each front-end for each type of noisy speech data in test set A is plotted in Figure 3. From the figure, it can be observed that the proposed front-end achieves higher digit accuracy than the ETSI advanced front-end for the babble-type noisy speech (noise caused by other people talking in the background). The difference in accuracy (84.74% vs. 82.21%) is statistically significant (z=6.195, p<0.001, two-tailed). For the other types of noisy speech, the advanced front-end achieves marginally better accuracy than the proposed front-end.

[Figure 3: Recognition results for Aurora test set A, proposed front-end ($\theta_{th}$ = 0.08) compared with ETSI MFCC front-ends by noise type. Vertical axis: Digit Accuracy (%). Horizontal axis: Noise Type (Subway, Babble, Car, Exhibition). Series: ETSI_std, Proposed, ETSI_adv.]

Similarly, the recognition results by noise type for test sets B and C are shown in Figure 4. Note that the (C) following the name of a noise type in the figure denotes speech data from test set C, which also contains additional channel distortion.
Again it can be observed from Figure 4 that the proposed front-end performs as well as the advanced front-end for all the noise types, and the proposed front-end achieves better accuracy for the restaurant-type noisy speech. This better accuracy (82.52% vs. 81.11%) is statistically significant (z=3.324, p<0.001, two-tailed). It seems that the proposed front-end is particularly effective in handling background noise due to other people talking at the same time. Overall, the previous two figures demonstrate that the proposed front-end is much more consistent and robust than the ETSI standard MFCC front-end in recognising different types of noisy speech, and that it is as noise robust as the ETSI advanced front-end in most cases.

[Figure 4: Recognition results for Aurora test sets B and C, proposed front-end ($\theta_{th}$ = 0.08) compared with ETSI MFCC front-ends by noise type. Vertical axis: Digit Accuracy (%). Horizontal axis: Noise Type (Restaurant, Street, Airport, Train Station, Subway(C), Street(C)). Series: ETSI_std, Proposed, ETSI_adv.]

3.3 Comparison of Computation Load

In order to estimate the computational complexity of the proposed front-end processing, the ETSI standard, the ETSI advanced and the proposed front-ends were run on the Aurora II multi-condition training data, and the running times were recorded as shown in Table 3. No other processes were running on the processor at the time. The multi-condition training set contains utterances with 4 different noise types (subway, babble, car and exhibition) and 5 SNR conditions (5 to 20 dB, and clean). In total, there are 8440 utterances in the training set (422 utterances per condition).

Table 3: Comparison of running times on a 2.66 GHz processor with 2 GB RAM for front-end processing of Aurora multi-condition training set (8440 utterances)

Front-end   ETSI std.   Proposed   ETSI adv.
Time (s)    132         158        325

On average, the computation load of the proposed front-end was about 20% more than that of the ETSI standard MFCC front-end (158 s vs. 132 s), but only about 49% of that of the ETSI advanced front-end (158 s vs. 325 s). The proposed front-end took an average of about 19 ms to process an utterance (158 s for 8440 utterances). The higher computation load of the ETSI advanced front-end is expected, as the advanced front-end applies Wiener filtering twice to a speech signal, based on time-domain convolution.

Compared with the ETSI advanced front-end, the much lighter computation requirement of the proposed front-end can be a distinct advantage for applications running on handheld devices. Moreover, the proposed front-end is easier to implement on the fixed-point processors used by most handheld devices.

4 Conclusions

A new noise-robust front-end based on the combined application of Mel-filterbank output compensation and cumulative distribution mapping with frame skipping has been proposed. Experimental results on the Aurora II speech database have shown the effectiveness of this novel combination of noise compensation methods. The proposed front-end achieves an average digit accuracy of 84.92% for the three test sets with clean HMM training. Compared with the ETSI standard Mel-cepstral front-end, the proposed front-end provides a relative error rate reduction of more than 61%. Moreover, the proposed front-end provides recognition accuracy comparable to that of the ETSI advanced front-end, at less than half the computation load.

Possible future work includes the use of dynamic noise estimates to handle non-stationary noise, the replacement of the simple spectral flooring with a more advanced temporal masking algorithm, and the use of an adaptive threshold for frame skipping.

5 References

Choi, E. (2004): Noise Robust Front-end for ASR using Spectral Subtraction, Spectral Flooring and Cumulative Distribution Mapping. Proc. 10th Australian Int. Conf. on Speech Science and Technology, pp. 451-456.

Deng, L. and Huang, X. (2004): Challenges in Adopting Speech Recognition. Communications of the ACM, Vol. 47, No. 1, pp. 69-75.

Dharanipragada, S. and Padmanabhan, M. (2000): A Nonlinear Unsupervised Adaptation Technique for Speech Recognition. Proc. Int. Conf. on Spoken Language Processing, Vol. 4, pp. 556-559.

Ephraim, Y. (1992): A Bayesian Estimation Approach for Speech Enhancement Using Hidden Markov Models. IEEE Trans. Signal Processing, Vol. 40, No. 4, pp. 725-735.

ETSI (2000): Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithms. ETSI standard document ES 201 108.

ETSI (2002): Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-end Feature Extraction Algorithm; Compression Algorithm. ETSI standard document ES 202 050.

Hermansky, H. (1990): Perceptual Linear Predictive (PLP) Analysis of Speech. Journal Acoustical Society of America (JASA), Vol. 87 (4), pp. 1738-1752.

Hirsch, H.G. and Pearce, D. (2000): The AURORA Experimental Framework for the Performance Evaluation of Speech Recognition Systems Under Noise Conditions. Proc. ISCA ITRW ASR2000, pp. 181-188.

Huang, C., Wang, H. and Lee, C. (2001): An SNR-Incremental Stochastic Matching Algorithm for Noisy Speech Recognition. IEEE Trans. Speech and Audio Processing, Vol. 9, No. 8, pp. 866-873.

Rabiner, L.R. and Juang, B.H. (1993): Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, New Jersey.

Russ, J.C. (1995): The Image Processing Handbook. 2nd ed., CRC Press.

Sankar, A. and Lee, C.H. (1996): A Maximum Likelihood Approach to Stochastic Matching for Robust Speech Recognition. IEEE Trans. Speech and Audio Processing, Vol. 4, pp. 190-202.

Stevens, S.S. (1957): On the Psychophysical Law. Psychological Review, Vol. 64, pp. 153-181.

Vaseghi, S.V. (2000): Advanced Digital Signal Processing and Noise Reduction. Wiley Press.

Yao, K., Paliwal, K.K. and Nakamura, S. (2001): Sequential Noise Compensation by a Sequential Kullback Proximal Algorithm. Proc. European Conf. on Speech Communication and Technology, pp. 1139-1142.

Zhang, Z. and Furui, S. (2004): Piecewise-linear Transformation-based HMM Adaptation for Noisy Speech. Speech Communication, Vol. 42, Issue 1, pp. 43-58.