Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition


146 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 3, MARCH 2002

Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition

Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine Tsai, and Qiru Zhou, Member, IEEE

Abstract: When automatic speech recognition (ASR) and speaker verification (SV) are applied in adverse acoustic environments, endpoint detection and energy normalization can be crucial to the functioning of both systems. In low signal-to-noise ratio (SNR) and nonstationary environments, conventional approaches to endpoint detection and energy normalization often fail and ASR performance usually degrades dramatically. The purpose of this paper is to address the endpoint problem. For ASR, we propose a real-time approach. It uses an optimal filter plus a three-state transition diagram for endpoint detection. The filter is designed utilizing several criteria to ensure accuracy and robustness. It has an almost invariant response at various background noise levels. The detected endpoints are then applied to energy normalization sequentially. Evaluation results show that the proposed algorithm significantly reduces the string error rates in low SNR situations. The reduction rates even exceed 50% in several evaluated databases. For SV, we propose a batch-mode approach. It uses the optimal filter plus a two-mixture energy model for endpoint detection. The experiments show that the batch-mode algorithm can detect endpoints as accurately as using HMM forced alignment, while the proposed one has much less computational complexity.

Index Terms: Change-point detection, edge detection, endpoint detection, optimal filter, robust speech recognition, speaker verification, speech activity detection, speech detection.

I. INTRODUCTION

IN SPEECH and speaker recognition, we need to process the signal in utterances consisting of speech, silence, and other background noise.
The detection of the presence of speech embedded in various types of nonspeech events and background noise is called endpoint detection, speech detection, or speech activity detection. In this paper, we address endpoint detection by sequential and batch-mode processes to support real-time recognition (in which the recognition response is the same as or faster than recording an utterance). The sequential process is often used in automatic speech recognition (ASR) [1], while the batch-mode process is often allowed in speaker recognition [2], name dialing [3], command control and embedded systems, where utterances are usually as short as a few seconds and the delay in response is usually small. Endpoint detection has been studied for several decades. The first application was in a telephone transmission and switching system developed in Bell Labs, for time assignment of communication channels [4]. The principle was to use the free channel time to interpolate additional speakers by speech activity detection. Since then, various speech detection algorithms have been developed for ASR, speaker verification, echo cancellation, speech coding and other applications. In general, different applications need different algorithms to meet their specific requirements in terms of computational accuracy, complexity, robustness, sensitivity, response time, etc. The approaches include those based on energy threshold (e.g., [5]), pitch detection (e.g., [6]), spectrum analysis, cepstral analysis [7], zero-crossing rate [8], [9], periodicity measure, hybrid detection [10], fusion [11] and many other methods. (Manuscript received June 7, 2001; revised February 13. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Juergen Schroeter. The authors are with the Multimedia Communications Research Lab, Bell Labs, Lucent Technologies, Murray Hill, NJ, USA; e-mail: qli@research.bell-labs.com. Publisher Item Identifier S (02)03972-X.)
Furthermore, similar issues have also been studied in other research areas, such as edge detection in image processing [12], [13] and change-point detection in theoretical statistics [14]-[18]. As is well known, endpoint detection is crucial to both ASR and speaker recognition because it often affects a system's performance in terms of accuracy and speed for several reasons. First, cepstral mean subtraction (CMS) [19]-[21], a popular algorithm for robust speaker and speech recognition, needs accurate endpoints to compute the mean of speech frames precisely in order to improve recognition accuracy. Second, if silence frames can be removed prior to recognition, the accumulated utterance likelihood scores will focus more on the speech portion of an utterance instead of on both noise and speech. Therefore, it has the potential to increase recognition accuracy. Third, it is hard to model noise and silence accurately in changing environments. This effect can be limited by removing background noise frames in advance. Fourth, removing nonspeech frames when the number of nonspeech frames is large can significantly reduce the computation time. Finally, for open speech recognition systems, such as open-microphone desktop applications and audio transcription of broadcast news, it is necessary to segment utterances from continuous audio input. In applications of speech and speaker recognition, nonspeech events and background noise complicate the endpoint detection problem considerably. For example, the endpoints of speech are often obscured by speaker-generated artifacts such as clicks, pops, heavy breathing, or by dial tones. Long-distance telephone transmission channels also introduce similar types of artifacts and background noise.
In recent years, as wireless, hands-free and Internet Protocol (IP) phones get more and more popular, the endpoint detection problem becomes even more difficult, since the signal-to-noise ratios (SNR) of these kinds of communication devices are usually lower, and the noise more nonstationary, than those of traditional telephone lines and handsets. The noise may come from the background, such as car noise, room reflection, street noise, background talking, etc., or from communication systems, such as coding, transmission, packet loss, etc. In these cases, the ASR or speaker recognition performance often degrades dramatically due to unreliable endpoint detection.

Another problem related to endpoint detection is real-time energy normalization. In both ASR and speaker recognition, we usually normalize the energy feature such that the largest energy level in a given utterance is close to or slightly below a constant of zero or one. This is not a problem in batch-mode processing, but it can be a crucial problem in real-time processing, since it is difficult to estimate the maximal energy in an utterance with just a short-time data buffer while the acoustic environment is changing. It becomes especially hard in adverse acoustic environments. A look-ahead approach to energy normalization can be found in [6]. Actually, as we will point out later in this study, real-time energy normalization and endpoint detection are two related problems. The more accurately we can detect endpoints, the better we can do on real-time energy normalization.

In this paper, we propose two endpoint-detection algorithms for real-time ASR and speaker recognition. Generally speaking, both algorithms must meet the following requirements: accurate location of detected endpoints; robust detection at various noise levels; low computational complexity; fast response time; and simple implementation. The real-time energy normalization problem is addressed together with endpoint detection.

The rest of the paper is organized as follows. In Section II, we will introduce a filter for endpoint detection. In Section III, we will propose a sequential algorithm of combined endpoint detection and energy normalization for ASR in adverse environments and provide experimental results in large database evaluations.
In Section IV, we will propose an accurate endpoint-detection algorithm for batch-mode applications and compare the detected endpoints with manually detected as well as HMM forced-alignment detected endpoints. Finally, we will summarize our findings in Section V.

II. A FILTER FOR ENDPOINT DETECTION

To ensure the low-complexity requirement, we borrow the one-dimensional (1-D) short-term energy in the cepstral feature to be the feature for endpoint detection:

g(m) = 10 log10 [ sum_{n = n_m}^{n_m + W - 1} s^2(n) ]    (1)

where s(n) is the data sample, m is the frame number, g(m) is the frame energy in decibels, W is the window length, and n_m is the number of the first data sample in the window. Thus, the detected endpoints can be aligned to the ASR feature vector automatically and the computation can be reduced from the speech-sampling rate to the frame rate. For accurate and robust endpoint detection, we need a detector that can detect all possible endpoints from the energy feature. Since the output of the detector contains false acceptances, a decision module is then needed to make final decisions based on the detector's output. Here, we assume that one utterance may have several speech segments separated by possible pauses. Each of the segments can be determined by detecting a pair of endpoints, named segment beginning and ending points. On the energy contours of utterances, there is always a rising edge following a beginning point and a descending edge preceding an ending point. We call them beginning and ending edges, respectively, as shown in Fig. 4(a). Since endpoints always come with the edges, our approach is first to detect the edges and then to find the corresponding endpoints. The foundation of the theory of the optimal edge detector was first established by Canny [12]. He derived an optimal step-edge detector. Spacek [22], on the other hand, formed a performance measure combining all three quantities derived by Canny and provided the solution of the optimal filter for a step edge. Petrou and Kittler then extended the work to ramp-edge detection [13].
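As an illustration of the short-term energy feature, here is a minimal Python sketch; the 30 ms window and 10 ms shift at 8 kHz (240 and 80 samples) are assumed values for illustration, not necessarily the paper's exact configuration:

```python
import math

def frame_log_energy(samples, window=240, shift=80):
    # Frame energy in decibels, in the spirit of (1):
    # g(m) = 10 * log10( sum of squared samples in the m-th window ).
    energies = []
    for start in range(0, len(samples) - window + 1, shift):
        e = sum(s * s for s in samples[start:start + window])
        energies.append(10.0 * math.log10(e + 1e-10))  # floor avoids log10(0)
    return energies
```

Working at the frame rate rather than the sampling rate is what keeps the detector's cost low, and the frame indexing matches the ASR feature stream.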
Since the edges corresponding to endpoints in the energy feature are closer to the ramp edge than the ideal step edge, Li and Tsai applied Petrou and Kittler's filter to endpoint detection for speaker verification in [2]. In summary, we need a detector that meets the following general requirements: 1) invariant outputs at various background energy levels; 2) capability of detecting both beginning and ending points; 3) short time delay or look-ahead; 4) limited response level; 5) maximum output signal-to-noise ratio (SNR) at endpoints; 6) accurate location of detected endpoints; 7) maximum suppression of false detection. We then need to convert the above criteria to a mathematical representation. As we have discussed, it is reasonable to assume that the beginning edge in the energy contour is a ramp edge that can be modeled by the following function:

u(t) = 1 - e^{-st} for t >= 0; u(t) = 0 for t < 0    (2)

where t represents the frame number of the feature and s is some positive constant which can be adjusted for different kinds of edges, such as beginning or ending edges, and for different sampling rates. The detector is a 1-D filter which can be operated as a moving-average filter on the energy feature. From the above requirements, the filter should have the following properties, which are similar to those in [13].

P1) It must be antisymmetrical, i.e., h(-t) = -h(t), and thus h(0) = 0. This follows from the fact that we want it to detect antisymmetrical features [12], i.e., be sensitive to both beginning and ending edges according to the request in 2), and have near-zero response to background noise at any level, i.e., be invariant to background noise according to the request in 1).

P2) According to the requirement in 3), it must be of finite extent, going smoothly to zero at its ends:

h(t) = 0 for |t| > w, where w is the half width of the filter.

P3) According to the requirement in 4), it must have a given maximum amplitude, attained at a point defined by a zero of the filter's derivative in the interval (-w, 0).

If we further represent requirements 5), 6), and 7) by corresponding performance measures, the combined objective function (3) is formed from them. It aims at finding the filter function h(t) such that the value of the objective function is maximal subject to properties P1)-P3). Fortunately, the objective function is very similar to that of optimal edge detection in image processing, and its details have been derived by Petrou and Kittler [13] following Canny [12], as well as in Appendix A. After applying the method of Lagrange multipliers, the solution for the filter function h(t) is given in closed form in [13] (4), where the constants in (4) are filter parameters. Since h(t) is only half of the filter, the actual filter coefficients (5) are obtained by antisymmetric extension, where t is an integer. The filter can then be operated as a moving-average filter on the energy feature (6), where g is the energy feature and m is the current frame number.

Fig. 1. Shape of the designed optimal filter.

An example of the designed optimal filter is shown in Fig. 1. Intuitively, the shape of the filter indicates that the filter must have a positive response to a beginning edge, a negative response to an ending edge, and a near-zero response to silence. Its response is basically invariant to different background noise levels, since they all have near-zero responses.

III. REAL-TIME ENDPOINT DETECTION AND ENERGY NORMALIZATION FOR ASR

The approach of using endpoint detection for real-time ASR is illustrated in Fig. 2 [23]. We use an optimal filter, as discussed in the last section, to detect all possible endpoints, followed by a three-state logic as a decision module to decide real endpoints. The information of detected endpoints is also utilized for real-time energy normalization.
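The closed-form Petrou-Kittler coefficients are not reproduced in this text, so the sketch below substitutes a derivative-of-Gaussian kernel as a stand-in; any antisymmetric, zero-sum kernel exhibits the key behavior described above: positive response at a beginning edge, negative response at an ending edge, and a response invariant to the absolute background level. The half-width and sigma values are illustrative assumptions.

```python
import math

def antisym_edge_filter(half_width=12, sigma=4.0):
    # Derivative-of-a-Gaussian kernel: h(-t) = -h(t), so sum(h) == 0
    # and the response to any constant background level is zero.
    h = [t * math.exp(-t * t / (2.0 * sigma * sigma))
         for t in range(-half_width, half_width + 1)]
    scale = sum(abs(v) for v in h)          # simple normalization
    return [v / scale for v in h]

def filter_output(energy, h):
    # Moving-average filtering of the frame-energy sequence: the output
    # at frame m correlates the kernel with energy[m-half .. m+half].
    half = len(h) // 2
    return [sum(h[i] * energy[m + i - half] for i in range(len(h)))
            for m in range(half, len(energy) - half)]
```

Because the kernel sums to zero, adding a constant offset to the whole energy contour (a higher noise floor) leaves the filter output unchanged, which is exactly the level-invariance property P1) is after.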
Finally, all silence frames are removed and only the speech frames, including cepstrum and the normalized energy, are sent to the recognizer.

Fig. 2. Endpoint detection and energy normalization for real-time ASR.

A. Filter for Both Beginning- and Ending-Edge Detection

After evaluating the shapes of both beginning and ending edges, we choose the filter size to meet requirements 2) and 3). The filter parameters have been provided in [13]; for our application, we just need to rescale them. The shape of the designed filter is shown in Fig. 1 with a simple normalization. For real-time detection, the filter has 25 points in total with a 24-frame look-ahead, since both end coefficients are zeros. The filter operates as a moving-average filter (7), where g is the energy feature and m is the current frame number. The output is then evaluated in a three-state transition diagram for final endpoint decisions.

B. State Transition Diagram

The endpoint decision needs to be made by comparing the value of the filter output with some pre-determined thresholds. Due to the sequential nature of the detector and the complexity of the decision procedure, we use a three-state transition diagram to make final decisions. As shown in Fig. 3, the three states are: silence, in-speech, and leaving-speech. Either the silence or the in-speech state can be a starting state and any state can be a final state. In the following discussion, we assume that the silence state is the starting state. The input is the filter output and the output is the detected frame numbers of beginning and ending points. The transition conditions are labeled on the edges between states and the actions are listed in parentheses. Count is a frame counter; a lower and an upper threshold are used; and Gap is an integer indicating the required number of frames from a detected endpoint to the actual end of speech. We use Fig. 4 as an example to illustrate the state transition. The energy for a spoken digit 4 is plotted in Fig. 4(a) and the

filter output is shown in Fig. 4(b). The state diagram stays in the silence state until the filter output reaches the point in Fig. 4(b) where it exceeds the upper threshold, which means that a beginning point is detected. The actions are to output a beginning point [corresponding to the left vertical solid line in Fig. 4(a)] and to move to the in-speech state. It stays in the in-speech state until reaching the point in Fig. 4(b) where the output falls below the lower threshold. The diagram then moves to the leaving-speech state and starts Count. The counter resets several times until reaching the point at which Count exceeds Gap. An actual ending point is detected, as the left vertical dashed line in Fig. 4(b), and the diagram then moves back to the silence state. During the stay in the leaving-speech state, if the output rises above the upper threshold again, this means that a beginning edge is coming and we should move back to the in-speech state. The 30-frame gap corresponds to the period of descending energy before reaching a real ending point. We note that the thresholds are set on the filter outputs instead of on absolute energy. Since the filter output is stable to the noise levels, the detected endpoints are more reliable. Those constants, Gap and the two thresholds, can be determined empirically by plotting several utterances and corresponding filter outputs. As we will show in the database evaluation, the algorithm is not very sensitive to the threshold values, since the same values were used in different databases. Also, in some applications, two separate filters can be designed for beginning- and ending-point detection. The size of the beginning filter can be smaller than 25 points while the ending filter can be larger than 25 points. This approach may further improve accuracy; however, it will have a longer delay and use more computation. The 25-point filter used in this section was designed for both beginning- and ending-point detection at an 8 kHz sampling rate. Also, in the case that an utterance starts with an unvoiced phoneme, it is practical to step back about ten frames from the detected beginning points.

Fig. 3. State transition diagram for endpoint decision.

Fig. 4. Example: (a) energy contour of digit 4 and (b) filter outputs and state transitions.

C. Real-Time Energy Normalization

Suppose the maximal energy value in an utterance is known. The purpose of energy normalization is to normalize the utterance energy such that the largest value of energy is close to zero, by subtracting the maximal energy. In a real-time mode, we have to estimate the maximal energy sequentially while the data are being collected; the estimated maximum energy then becomes a variable. Nevertheless, we can use the detected endpoints to obtain a better estimate. We first initialize the maximal energy to a constant, which is selected empirically, and use it for normalization until we detect the first beginning point, as in Fig. 4. If the average energy in the look-ahead window exceeds a pre-selected threshold, which ensures that the new estimate is not from a single click, we then estimate the maximal energy from the look-ahead window (8), whose extent is determined by the length of the filter and the length of the look-ahead. At the detected beginning point, the look-ahead window is as shown in Fig. 4. From then on, we update the estimate sequentially as in (9) and (10). The initial constant may need to be adjusted for different systems; for example, its value could be different between telephone and desktop systems. The averaging threshold is relatively easy to determine. For the example in Fig. 5, the energy features of two utterances with 20 dB SNR (bottom) and 5 dB SNR (top) are plotted in Fig. 5(a). The 5-dB utterance is generated by artificially adding car noise to the 20-dB one. The filter outputs are shown in Fig. 5(b) for the 20 dB (solid line) and 5 dB (dashed line) SNRs, respectively. The detected endpoints and normalized energy for the 20 and 5 dB SNRs are plotted in Fig. 5(c) and 5(d), respectively. We note that the filter outputs for the 20 and 5 dB cases

are almost invariant, although their background energy levels have a difference of 15 dB. This ensures the robustness of endpoint detection. We also note that the normalized energy profiles are almost the same as the original one, although the normalization is done in a real-time mode.

Fig. 5. (a) Energy contours of Z214 from the original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR). (b) Filter outputs for the 5 dB (dashed line) and 20 dB (solid line) SNR cases. (c) Detected endpoints and normalized energy for the 20 dB SNR case and (d) for the 5 dB SNR case.

Fig. 6. Comparisons on real-time connected digit recognition with various SNRs. From 5- to 20-dB SNRs, the proposed algorithm provided word error rate reductions of 90.2%, 93.4%, 57.1%, and 57.1%, respectively.

D. Database Evaluation

The proposed algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases.

1) Baseline Endpoint Detection: The baseline system is a real-time, energy-contour-based adaptive detector developed based on the algorithm introduced in [1], [5]. It has been used for years in research and commercial speech recognizers. In the baseline system, a six-state transition diagram is used to detect endpoints. Those states are named the initializing, silence, rising, energy, fell-rising, and fell states. In total, eight counters and 24 hard-limit thresholds are used for the decisions of state transition. Two adaptive threshold values were used in most of the thresholds. We note that all the thresholds are compared with raw energy values directly. Energy normalization in the baseline system is done separately by estimating the maximal and minimal energy values, then comparing their difference to a fixed threshold for decision.
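In contrast to the baseline's six states, the proposed decision module needs only three. A minimal sketch of that three-state logic follows; the threshold and Gap values here are illustrative, not the paper's tuned constants, and the counter-reset detail of the leaving-speech state is simplified.

```python
def detect_endpoints(F, t_upper=3.0, t_lower=-3.0, gap=30):
    # Sketch of the three-state endpoint decision of Section III-B.
    # F is the edge-filter output, indexed by frame.
    # Returns a list of (beginning_frame, ending_frame) pairs.
    state, count, begin, segments = "silence", 0, None, []
    for m, f in enumerate(F):
        if state == "silence":
            if f >= t_upper:            # beginning edge detected
                begin, state = m, "in-speech"
        elif state == "in-speech":
            if f <= t_lower:            # ending edge: start counting
                state, count = "leaving-speech", 0
        else:  # leaving-speech
            if f >= t_upper:            # new beginning edge: resume speech
                state = "in-speech"
            else:
                count += 1
                if count >= gap:        # Gap frames past the ending edge
                    segments.append((begin, m))
                    state, begin = "silence", None
    if begin is not None:               # utterance ended while in speech
        segments.append((begin, len(F) - 1))
    return segments
```

Because the decisions compare the filter output rather than raw energy, the same thresholds can serve across background levels.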
Since the energy values change with acoustic environments, the baseline approach leads to unreliable endpoint detection and energy normalization, especially in low SNR and nonstationary environments.

2) Noisy Database Evaluation: In this experiment, a database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to an 8 kHz sampling rate. Later, car and other background noises were artificially added to the original database at the SNR levels of 5, 10, 15, and 20 dB. The original database has 39 utterances and 1738 digits in total. Each utterance has 3, 7, or 11 digits. The LPC feature and the short-term energy were used, and hidden Markov models (HMMs) in a head-body-tail (HBT) structure were employed to model each of the digits [24], [25]. The HBT structure assumes that context-dependent digit models can be built by concatenating a left-context-dependent unit (head) with a context-independent unit (body) followed by a right-context-dependent unit (tail). We used three HMM states to represent each head and tail and four states to represent each body. Sixteen mixtures were used for each body state and four mixtures were used for each head or tail state. The real-time recognition performances at various SNRs are shown in Fig. 6. Compared to the baseline algorithm, the proposed one significantly reduced word error rates. The baseline algorithm failed to work in low SNR cases because it uses raw energy values directly to detect endpoints and to perform energy normalization. The proposed algorithm makes decisions on the filter output instead of raw energy values; therefore, it provided more robust results. An example of error analysis is shown in Fig. 7.

Fig. 7. (a) Energy contour of the 523rd utterance in DB5: 1 Z 4 O 5 8 2. (b) Endpoints and normalized energy from the baseline system. The utterance was recognized as 1 Z 4 O 5 8. (c) Endpoints and normalized energy from the proposed system. The utterance was recognized correctly as 1 Z 4 O 5 8 2. (d) The filter output.
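The sequential energy normalization of Section III-C, as used in the evaluations above, can be sketched as follows; the initial constant, averaging threshold, and look-ahead length are illustrative values, not the paper's constants.

```python
def normalize_energy_stream(g, lookahead=24, e0=60.0, min_avg=30.0):
    # Sequential normalization: the maximal-energy estimate starts at an
    # empirical constant e0 and, whenever the look-ahead window looks like
    # speech (average energy above min_avg, so a single click is ignored),
    # it is raised to the running maximum over that window. Each frame is
    # normalized by subtracting the current estimate, so the true peak
    # energy maps approximately to zero.
    normalized, e_max = [], e0
    for m, value in enumerate(g):
        window = g[m:m + lookahead + 1]
        if sum(window) / len(window) > min_avg:   # likely speech, not a click
            e_max = max(e_max, max(window))
        normalized.append(value - e_max)
    return normalized
```

The look-ahead is what makes this workable in real time: by the time speech frames are emitted, the window has already seen the local peak, so the normalization tracks the batch-mode result closely.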

TABLE I. DATABASE EVALUATION RESULTS (%)

Fig. 8. Shape of the optimal filter for beginning-edge detection, plotted as h(t), with W = 7 and s = 1.

Fig. 9. Shape of the optimal filter for ending-edge detection, plotted as h(t), with W = 35 and s = 0.2.

3) Telephone Database Evaluation: The proposed algorithm was further evaluated on 11 databases collected from the telephone networks with 8 kHz sampling rates in various acoustic environments. LPC parameters and short-term energy were used. The acoustic model consists of one silence model, 41 mono-phone models and 275 head-body-tail units for digit recognition. It has a total of 79 phoneme symbols, 33 of which are for digit units. Eleven databases, DB1 to DB11, were used for the evaluation. DB1 to DB5 contain digit, alphabet and word strings. Finite-state grammars were used to specify the valid forms of recognized strings. DB6 to DB11 contain pure digit strings. In all the evaluations, both endpoint detection and energy normalization were performed in real-time mode and only the detected speech portions of an utterance were sent to the recognition back-end. In the proposed system, the parameter values, including the thresholds and Gap, were unchanged throughout the evaluation in all 11 databases to show the robustness of the algorithm, although the parameters can be adjusted according to signal conditions in different applications. The evaluation results are listed in Table I. It shows that the proposed algorithm works very well on regular telephone data as well. It provided word error reduction in most of the databases. The word error reductions even exceed 30% in DB2, DB6, and DB9. To analyze the improvement, the original energy feature of an utterance, 1 Z 4 O 5 8 2, in DB6 is plotted in Fig. 7(a). The detected endpoints and normalized energy using the conventional approach are shown in Fig. 7(b), while the results of the proposed algorithm are shown in Fig. 7(c). The filter output is plotted in Fig. 7(d). From Fig. 7(b), we can observe that the normalized maximal energy of the conventional approach is about 10 dB below zero, which causes a wrong recognition result: 1 Z 4 O 5 8. On the other hand, the proposed algorithm normalized the maximal energy to approximately zero and the utterance was recognized correctly as 1 Z 4 O 5 8 2.

IV. ACCURATE BATCH-MODE ENDPOINT DETECTION FOR SPEAKER VERIFICATION

So far, we have focused on real-time endpoint detection, which is mainly for ASR applications where silence or garbage models are usually used to further determine accurate endpoints in decoding. In another category of applications, real-time processing is not so crucial. Speech data can be processed in batch mode, i.e., after data recording is finished. The applications include speaker verification, name dialing, speech control, etc., where the utterances are usually short (e.g., less than 2 s) and the verification or recognition can be done within 1 s. Since many of these kinds of applications are offered in embedded systems, such as wireless phones or portable devices, or in multi-user systems, such as a speaker verification server for millions of users [3], they normally require low computational complexity for low cost or for a fast response. For these cases, one solution is to use an accurate endpoint detector to remove all silence; therefore, we not only can reduce the number of decoding frames significantly, but also eliminate the silence model in decoding, which usually takes a lot of space and computation. Batch-mode processing enables this class of operations.

A. Batch-Mode Algorithm

To obtain accurate endpoints, we designed two filters, one for beginning-edge and another for ending-edge detection, using the algorithm in Section II. The first filter is shown in Fig. 8 with seven points, while the second one is shown in Fig. 9 with 35 points. This is because the ending edge is usually longer than the beginning edge.
We note that the ending filter gives a positive response at a detected ending edge. To help in accurately determining energy thresholds, we use a Gaussian mixture model to model the energy distribution. The final endpoints are detected by combining the information from the filter outputs and the estimated thresholds.

1) Energy Distribution Model: We assume that a Gaussian mixture model can approximately represent the distribution of energy in an utterance, with two mixtures representing speech and background energy, respectively:

p(x) = w N(x; mu_1, sigma_1) + (1 - w) N(x; mu_2, sigma_2)    (11)

where w is a weighting parameter and N(x; mu, sigma) is a normal distribution given by

N(x; mu, sigma) = (1 / (sqrt(2 pi) sigma)) exp( -(x - mu)^2 / (2 sigma^2) )    (12)

and mu and sigma are the mean and standard deviation, respectively. The means for speech and background noise have corresponding standard deviations, and a threshold is defined for each of speech and background noise. When the energy value is above the speech threshold, we consider it as speech; when the energy value is below the background threshold, we consider it as background noise. To obtain fast and explicit parameter estimation, we applied a moment algorithm instead of the popular EM algorithm, which needs iterations. The fast estimation algorithm is listed in Appendix B.

2) Summary of the Algorithm: We now summarize the proposed algorithm for batch-mode endpoint detection [2]. The parameters in the following algorithm are for the energy feature computed from 30 ms energy windows shifting every 10 ms. The data sampling rate is 8 kHz.

1) Compute the log energy of the given utterance and normalize it by subtracting its largest value, so that the maximum is zero. We assume that the speech is surrounded by silence and various kinds of noise.

2) Remove the dial tone. A dial tone can be detected by two parameters, determined based on the minimal length and minimal energy level of dial tones.

3) Estimate the means and standard deviations of the two mixtures using (22) to (29), then determine the two thresholds for speech and background energy, respectively. Speech energy should be above the speech threshold and silence/background noise energy should be below the background threshold. This is based on the assumption that noise and speech can be represented as two separate Gaussian mixtures.

4) Compute the output of the beginning-edge filter (13), then search for the locations of all peaks in the filter output. A peak associated with a beginning point should meet the required threshold properties. The actual beginning point is then obtained by shifting the location of each beginning-edge peak; the shift is due to the offset between the center of a beginning edge and the actual beginning point.
Lines A, B, C, an D inicate the estimate values of,, an, respectively. 5) From a etecte beginning point, search for the corresponing ening point, which shoul satisfy the following conitions: i) an ; ii) ; iii) when, 60% frames of, shoul have the values above ; an iv). Here, ii) an iii) are to ensure that the segmentation is speech instea of a click or breath noise. The segment that cannot meet the above conitions is not a speech segment. 6) Determine the actual last ening point. Compute the response of the ening-ege filter in the last segment,,by (14) Search for the last peak of, where an. Then, shift the peak point locate at the center of the ening ege to the last ening point. The offset shoul be about half the filter size. We choose 16 frames. Thus, if at frame, the energy level is still higher than threshol, the ening point is at frame ; otherwise, the ening point is the last point before the energy crosses the threshol if an. (15) 3) Illustrative Examples: We use the example in Fig. 10 to illustrate the concept of the propose algorithm. The utterance call office is first converte to log energy an normalize to have the largest value be zero. For this example, the speech signal is about 2 s concatenate by another 2 s of heavy breath. We estimate the means an stanar eviations of the speech an the backgroun energy using the equations in Appenix B. The results are shown in Fig. 10, where lines A, B, C, an D inicate the estimate values of,, an, respectively. Then we computer using (13). We note that the operation iffers slightly from the real-time one. The result is shown in Fig. 11 as a soli line. After evaluating the values of the peaks that are above threshol, for this case, the locations of beginning points are first locate at the centers of the highest peaks.

Fig. 11. Output of the beginning-edge filter (solid line) and ending-edge filter (dashed line).

Fig. 12. Lines E, F, G, and H indicate the locations of two pairs of beginning and ending points.

Fig. 13. Last ending point was adjusted from line H to I by applying the ending-edge filter.

Fig. 14. Normalized log energy of call Candice at her home with breath in the beginning. The detected endpoints are the vertical solid line and dashed line.

Actual beginning points can then be located by shifting the corresponding locations of the peaks to the left by half the filter size. From the first beginning point, we search for the location of the corresponding ending point, where the energy level is lower than the background threshold. For this example, we get two pairs of endpoints corresponding to two speech segments, as shown in Fig. 12, from line E to F and from line G to H, respectively. The clicks in the beginning of the utterance were not detected as speech because the filter responses at those locations were lower than the threshold value. As we can see from Fig. 12, the last segment between lines G and H includes the heavy breath. The energy data in the segments are then fed into the ending-edge filter to compute (14). The filter output is shown in Fig. 11 as the dashed line. The ending point of the last segment is located by shifting the frame index of the largest peak to the right by about half the size of the ending-edge filter. If the energy value is lower than the background threshold at the shifted location, the ending point should be the last point where the energy level is greater than the threshold, as described in (15). The final speech segments are from line E to line F and from line G to line I, as shown in Fig. 13. More examples are shown in Figs. 14 and 15. Fig. 14 is the energy contour of the utterance Call Candice at her home phone with breath in the beginning. The horizontal solid and dashed lines represent the means and thresholds for speech and noise.
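The peak search and half-filter-width shift used for batch-mode beginning points can be sketched as follows; the threshold and half-width values are illustrative:

```python
def peaks_above(F, threshold):
    # Indices of local maxima of a filter output F that exceed threshold.
    return [i for i in range(1, len(F) - 1)
            if F[i] > threshold and F[i] >= F[i - 1] and F[i] >= F[i + 1]]

def beginning_points(F_begin, threshold, half_width):
    # Each qualifying peak of the beginning-edge filter output sits at the
    # center of a rising edge, so the actual beginning point is about half
    # a filter width to the left of the peak.
    return [max(0, p - half_width) for p in peaks_above(F_begin, threshold)]
```

Ending points are handled symmetrically: the last qualifying peak of the ending-edge filter output is shifted to the right by about half that filter's size, with the threshold crossing as a fallback.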
The detected beginning and ending points are shown as the vertical solid line and dashed lines, respectively. The breath signal is successfully excluded from the speech segmentation. Fig. 15 shows an example of an utterance with a dial tone at the end: the first and last vertical lines are the beginning and ending points, respectively, and the other vertical lines indicate the detected silence between words. The dial tone at the end of the utterance was detected and excluded from the speech segment.

Fig. 15. Normalized log energy of "I pledge allegiance to the flag," with a dial tone at the end. The vertical lines indicate the beginning point and the ending points.

B. Comparisons With the HMM Forced-Alignment Approach

The database used for the comparison was collected for speaker verification with a common phrase, "I pledge allegiance to the flag." It has 100 speakers and 4741 utterances in total. The utterances were collected over long-distance telephone networks. The speakers were instructed to make the phone calls at different locations and using different telephone handsets. The collected utterances contain various kinds of noise. A pair of beginning and ending points was detected manually for every utterance. We use the manually detected endpoints

as references to compare with the endpoints detected by the proposed approach and by the HMM approach. Here, the HMM approach means endpoint detection by forced alignment, assuming that both the models and the lexicon are available. The HMM approach uses 41 speaker-independent phoneme models. Each phone model has three states, and each state has 32 Gaussian mixtures. The feature vector is composed of 12 cepstral and 12 delta-cepstral coefficients. The cepstrum is derived from a tenth-order LPC analysis over a 30-ms window, and the feature vectors are updated at 10-ms intervals.

TABLE II. Statistics of the differences on detected beginning points.

Fig. 16. The dashed line is the histogram of the differences between manually and HMM-detected beginning points. The solid line is between manually and batch-mode detected beginning points.

The histogram of the differences between the manually detected beginning points and the HMM-detected beginning points is shown in Fig. 16 as a dashed line. The histogram of the differences between the manually detected beginning points and the beginning points detected by the proposed approach is shown in Fig. 16 as a solid line. The statistics are listed in Table II. The accuracy of the proposed approach is very close to that of the HMM approach. The shift between the two histograms could be removed by adjusting the thresholds for determining beginning points; however, this is not necessary, since the overall difference between the two approaches is about the same. The histogram of the differences between the manually detected ending points and the HMM-detected ending points is shown in Fig. 17 as a dashed line. The histogram of the differences between the manually detected ending points and the ending points detected by the proposed approach is shown in Fig. 17 as a solid line. These two histograms are very close. We note that both histograms are shifted from the manually detected ending points. This is due to the different interpretations of ending points between humans and algorithms.
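The comparison summarized in Table II reduces, for each utterance, to the difference in frames between a reference endpoint and a detected one, and then to summary statistics over the whole database. A minimal sketch (the dictionary keys are ours, not the paper's):

```python
import statistics

def endpoint_difference_stats(manual, detected):
    """Summarize per-utterance differences (in frames) between manually
    labeled endpoints and automatically detected ones.

    A nonzero mean corresponds to the systematic shift visible between the
    histograms of Figs. 16 and 17; the standard deviation measures their
    spread.
    """
    diffs = [d - m for m, d in zip(manual, detected)]
    return {
        "mean": statistics.fmean(diffs),        # systematic shift (bias)
        "std": statistics.stdev(diffs),         # spread of the histogram
        "max_abs": max(abs(d) for d in diffs),  # worst-case disagreement
    }
```

Running this once with manually detected endpoints against the batch-mode detector and once against HMM forced alignment gives the kind of side-by-side accuracy figures reported in Table II.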
The histograms and the table indicate that the endpoints detected by the proposed algorithm have the same accuracy as the HMM-detected endpoints. Compared with the HMM approach, the proposed one does not need any language-dependent models or lexicon information; therefore, it can support language-independent applications. The proposed algorithm is also much faster: it needs only about 130 kflops (floating-point operations) for endpoint detection, while the HMM approach needs over 200 Mflops for forced alignment using a set of speaker-independent phoneme models. Furthermore, the proposed algorithm can easily detect the silence between words, whereas doing so with the HMM approach involves much more computation.

Fig. 17. The dashed line is the histogram of the differences between manually and HMM-detected ending points. The solid line is between manually and batch-mode detected ending points.

C. Application to Language-Independent Speaker Verification

Since the proposed algorithm can detect endpoints with accuracy similar to that of the HMM approach, we apply it to the front end of a speaker verification system [2]. After LPC cepstral extraction, the proposed algorithm detects endpoints on the energy. Silence, breath, dial tone, and other nonspeech signals are then removed from the feature set. Given the original feature observation, after silence removal the feature set becomes a subset of the original. Cepstral mean subtraction (CMS) is then performed on the retained features. This approach was evaluated on a database consisting of 38 speakers (18 male and 20 female) for speaker verification (see [26] for the database description). The common pass-phrase for all speakers is "call Janice at her office phone." Each true speaker was tested with the same pass-phrase from all impostors. In the language-independent configuration, the equal error rates (EERs) are 3.6% and 4.4% for the male and female groups, respectively.
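The front-end processing described above (keep only the frames inside the detected speech segments, then apply CMS over what remains) can be sketched as follows; the frame and segment representations are hypothetical.

```python
def remove_silence_and_cms(frames, speech_segments):
    """Keep only cepstral frames inside detected speech segments, then apply
    cepstral mean subtraction (CMS) over the retained frames.

    `frames` is a list of cepstral vectors (lists of floats) for the whole
    utterance; `speech_segments` is a list of (begin, end) frame indices,
    inclusive, from the endpoint detector.
    """
    # Silence removal: the retained frames are a subset of the original set.
    keep = [f for b, e in speech_segments for f in frames[b:e + 1]]
    if not keep:
        return []
    # CMS: subtract the per-dimension mean computed over speech frames only,
    # so silence and other nonspeech frames do not bias the mean.
    dim = len(keep[0])
    mean = [sum(f[d] for f in keep) / len(keep) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in keep]
```

Computing the cepstral mean only over speech frames is the point of doing endpoint detection before CMS: frames of silence, breath, or dial tone would otherwise pull the mean away from the speech statistics.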
In the language-dependent configuration, where the background model is applied, the EERs are 2% and 3.5% for the male and female groups, respectively. The average individual EER is 2.8%. The accuracy is at the same level as that of the speaker

verification system in which HMMs were applied to endpoint detection [27], [28]. The proposed algorithm has also been implemented in a real speech controller with embedded speaker verification; readers are referred to [3] for details.

V. CONCLUSIONS

In this paper, we proposed two algorithms, for real-time and batch-mode endpoint detection. Both algorithms apply filters to detect possible endpoints and then make final decisions based on the filter outputs. Since the filter is designed to be invariant to various levels of background noise, the proposed algorithms are reliable and robust, even in very low SNR situations. In the real-time algorithm, a filter with a 24-frame look-ahead detects all possible endpoints. A three-state transition diagram then evaluates the output from the filter for final decisions. The detected endpoints are then applied to real-time energy normalization. Since the entire algorithm uses only a one-dimensional energy feature, it has low complexity and is computationally very fast. The evaluation on a noisy database showed significant string error reduction, over 50% in all 5- to 20-dB SNR situations. The evaluations on telephone databases showed over 30% reductions in four out of 12 databases. The proposed algorithm has been implemented in real-time ASR systems. Its contributions improve not only the recognition accuracy but also the robustness of the entire system in low SNR environments. In the batch-mode algorithm, the peaks of the filter output are used to detect endpoints, with thresholds estimated from a two-mixture energy distribution model whose parameters can be solved through closed-form equations. Using manually detected endpoints as references, we compared the proposed algorithm with the HMM forced-alignment approach. The experiments showed that the proposed algorithm has accuracy similar to the HMM approach while requiring much less computation.
The algorithm has also been implemented in a real recognition system for language-independent speech control, including embedded speaker verification [3].

APPENDIX A
OBJECTIVE FUNCTION FOR THE OPTIMAL FILTER DESIGN

Assume that the beginning or ending edge in log energy is a ramp edge as defined in (2), and assume that the edges are immersed in white Gaussian noise. Following Canny's criteria, Petrou and Kittler [13] derived the SNR for this filter as being proportional to (16), where is the half-width of the actual filter. They consider a good locality measure to be inversely proportional to the standard deviation of the distribution of the endpoint where the edge is supposed to be; it was defined as (17). Finally, the measure for the suppression of false edges is proportional to the mean distance between neighboring maxima of the response of the filter to white Gaussian noise, (18). Therefore, the combined objective function of the filter is (19).

APPENDIX B
ENERGY MODEL ESTIMATION

Instead of the popular EM algorithm, we applied the moment method [29] for faster parameter estimation of the model in (11). Let denote the sample values. By equating the observed moments, given by (20), where is the sample mean, to the theoretical moments, given by (21), we obtain five nonlinear simultaneous equations whose solution has been summarized in [29]. To estimate the five parameters, we first find the real negative root of (22), where and are the fourth and fifth sample cumulants, respectively. Let be a real negative root of (22); then parameters and are obtained as roots of (23),

where (24). Now, the estimates of the five model parameters may be derived in the following explicit forms: (25), (26), (27), (28), (29). We note that in this application. In the case that the solution of (22) does not exist, a histogram can be constructed to estimate the mixture-model parameters approximately.

ACKNOWLEDGMENT

The authors wish to thank Dr. D. W. Tufts for a very helpful lecture, and Drs. W.-G. Kim, C.-H. Lee, O. Siohan, F. K. Soong, and F. Korkmazskiy for useful discussions.

REFERENCES

[1] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall.
[2] Q. Li and A. Tsai, "A matched filter approach to endpoint detection for robust speaker verification," in Proc. IEEE Workshop on Automatic Identification, Summit, NJ, Oct.
[3] Q. Li and A. Tsai, "A language-independent personal voice controller with embedded speaker verification," in Proc. Eurospeech '99, Budapest, Hungary, Sept.
[4] K. Bullington and J. M. Fraser, "Engineering aspects of TASI," Bell Syst. Tech. J., Mar.
[5] J. G. Wilpon, L. R. Rabiner, and T. Martin, "An improved word-detection algorithm for telephone-quality speech incorporating both syntactic and semantic constraints," AT&T Bell Labs Tech. J., vol. 63, Mar.
[6] R. Chengalvarayan, "Robust energy normalization using speech/nonspeech discriminator for German connected digit recognition," in Proc. Eurospeech '99, Budapest, Hungary, Sept. 1999.
[7] J. A. Haigh and J. S. Mason, "Robust voice activity detection using cepstral features," in Proc. IEEE TENCON, 1993.
[8] L. R. Rabiner and M. R. Sambur, "An algorithm for determining the endpoints of isolated utterances," Bell Syst. Tech. J., vol. 54, Feb.
[9] J. C. Junqua, B. Reaves, and B. Mak, "A study of endpoint detection algorithms in adverse conditions: Incidence on a DTW and HMM recognizer," in Proc. Eurospeech, 1991.
[10] L. F. Lamel, L. R. Rabiner, A. E. Rosenberg, and J. G.
Wilpon, "An improved endpoint detector for isolated word recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, Aug.
[11] S. G. Tanyer and H. Özer, "Voice activity detection in nonstationary noise," IEEE Trans. Speech Audio Processing, vol. 8, July.
[12] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-8, Nov.
[13] M. Petrou and J. Kittler, "Optimal edge detectors for ramp edges," IEEE Trans. Pattern Anal. Machine Intell., vol. 13, May.
[14] E. Carlstein, M. Muller, and D. Siegmund, Change-Point Problems. Hayward, CA: Inst. Math. Statist.
[15] R. K. Bansal and P. Papantoni-Kazakos, "An algorithm for detecting a change in stochastic process," IEEE Trans. Inform. Theory, vol. IT-32, Mar.
[16] A. Wald, Sequential Analysis. London, U.K.: Chapman & Hall.
[17] Q. Li, "A detection approach to search-space reduction for HMM state alignment in speaker verification," IEEE Trans. Speech Audio Processing, vol. 9, July.
[18] B. Brodsky and B. S. Darkhovsky, Nonparametric Methods in Change-Point Problems. Norwell, MA: Kluwer.
[19] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Amer., vol. 55.
[20] B. S. Atal, "Automatic recognition of speakers from their voices," Proc. IEEE, vol. 64.
[21] S. Furui, "Cepstral analysis techniques for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, Apr.
[22] L. A. Spacek, "Edge detection and motion detection," Image Vision Comput., vol. 4.
[23] Q. Li, J. Zheng, Q. Zhou, and C.-H. Lee, "A robust, real-time endpoint detector with energy normalization for ASR in adverse environments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Salt Lake City, UT, May.
[24] W. Chou, C.-H. Lee, and B.-H. Juang, "Minimum error rate training of inter-word context dependent acoustic model units in speech recognition," in Proc. Int. Conf.
on Spoken Language Processing, 1994.
[25] C.-H. Lee, E. Giachin, L. R. Rabiner, R. Pieraccini, and A. E. Rosenberg, "Improved acoustic modeling for large vocabulary speech recognition," Comput. Speech Lang., vol. 6.
[26] R. A. Sukkar and C.-H. Lee, "Vocabulary independent discriminative utterance verification for nonkeyword rejection in subword based speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, Nov.
[27] S. Parthasarathy and A. E. Rosenberg, "General phrase speaker verification using sub-word background models and likelihood-ratio scoring," in Proc. ICSLP-96, Philadelphia, PA, Oct.
[28] Q. Li, S. Parthasarathy, and A. E. Rosenberg, "A fast algorithm for stochastic matching with application to robust speaker verification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Munich, Germany, Apr. 1997.
[29] B. S. Everitt and D. J. Hand, Finite Mixture Distributions. London, U.K.: Chapman & Hall.

Qi (Peter) Li (S'87-M'88-SM'01) received the Ph.D. degree in electrical engineering from the University of Rhode Island, Kingston. In 1995, he joined Bell Laboratories, Murray Hill, NJ, where he is currently a Member of Technical Staff in the Dialogue Systems Research Department. From 1988 to 1994, he was with F. M. Engineering and Research, Norwood, MA, where he worked on pattern recognition algorithms and real-time systems. His research interests include robust speaker and speech recognition, robust feature extraction, fast search algorithms, stochastic modeling, fast discriminative learning, and neural networks. His research results have been implemented in Lucent products. He has published extensively and has filed and been awarded many patents in his research areas. Dr. Li has been active as a reviewer for several journals, including the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, as a Local Chair for the IEEE 1999 Workshop on Automatic Identification, and as a committee member for several IEEE workshops.
He has received two awards and is listed in Who's Who in America (Millennium and 2001 Editions).

Jinsong Zheng received the B.S. and M.S. degrees in computer science from Fudan University, Shanghai, China, and Utah State University, Logan, respectively. Between 1994 and 1998, he was a Software Engineer with WebSci Technologies, South Brunswick, NJ. Since 1998, he has been a Consultant in the Dialogue Systems Research Department of Bell Labs, Lucent Technologies, Murray Hill, NJ, where he has been involved in various research projects in speech recognition. He is also a member of the Lucent Automatic Speech Recognition (LASR) software development team.

Augustine Tsai received the M.S. degree in systems engineering from Case Western Reserve University, Cleveland, OH, in 1989, and a second M.S. degree and the Ph.D. degree in electrical engineering from Rutgers University, New Brunswick, NJ, in 1991 and 1996, respectively. He was a Lead Engineer with the U.S. Army Face Recognition Project while he was with CAIP, Rutgers University. He was with SpeakEZ, developing speaker verification products. In 1997, he worked on ATM LAN emulation for MPEG II video broadcast at AT&T. Since 1998, he has been with the Multimedia Communication Research Laboratory, Bell Labs, Murray Hill, NJ. He has been involved with various activities in speaker verification, dialogue systems, and language modeling, and has contributed to the design of the dialogue session manager for the VoiceXML platform. He is currently working on QoS policy management in multiprotocol label switching (MPLS) based media networks. He has publications in machine vision, face recognition, speech/3-D audio processing, and video networks.

Qiru Zhou (S'86-M'92) received the B.S. and M.S. degrees in electrical and computer engineering from Northern Jiao-Tong University, China, and Beijing University of Posts and Telecommunications, China, respectively. He joined Bell Labs, AT&T. Currently, he is a Member of Technical Staff at Bell Labs, Lucent Technologies, Murray Hill, NJ, in the Dialogue Systems Research Department. His research interests include speech and speaker recognition algorithms, multimodal dialogue infrastructure, real-time distributed object-oriented software methodology for multimedia communication systems, and standards for speech and multimedia applications. Since 1992, he has been involved in and has led various projects at AT&T and Lucent to apply speech and dialogue technologies in products. He is now a Technical Leader in Lucent speech software product development.


More information

An Online Handwriting Recognition System For Turkish

An Online Handwriting Recognition System For Turkish An Online Handwriting Recognition System For Turkish Esra Vural, Hakan Erdogan, Kemal Oflazer, Berrin Yanikoglu Sabanci University, Tuzla, Istanbul, Turkey 34956 ABSTRACT Despite recent developments in

More information

Using Proportions to Solve Percentage Problems I

Using Proportions to Solve Percentage Problems I RP7-1 Using Proportions to Solve Percentage Problems I Pages 46 48 Standards: 7.RP.A. Goals: Students will write equivalent statements for proportions by keeping track of the part and the whole, and by

More information

Circuit Simulators: A Revolutionary E-Learning Platform

Circuit Simulators: A Revolutionary E-Learning Platform Circuit Simulators: A Revolutionary E-Learning Platform Mahi Itagi Padre Conceicao College of Engineering, Verna, Goa, India. itagimahi@gmail.com Akhil Deshpande Gogte Institute of Technology, Udyambag,

More information

Phonetic- and Speaker-Discriminant Features for Speaker Recognition. Research Project

Phonetic- and Speaker-Discriminant Features for Speaker Recognition. Research Project Phonetic- and Speaker-Discriminant Features for Speaker Recognition by Lara Stoll Research Project Submitted to the Department of Electrical Engineering and Computer Sciences, University of California

More information

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration INTERSPEECH 2013 Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration Yan Huang, Dong Yu, Yifan Gong, and Chaojun Liu Microsoft Corporation, One

More information

Speaker recognition using universal background model on YOHO database

Speaker recognition using universal background model on YOHO database Aalborg University Master Thesis project Speaker recognition using universal background model on YOHO database Author: Alexandre Majetniak Supervisor: Zheng-Hua Tan May 31, 2011 The Faculties of Engineering,

More information

On the Combined Behavior of Autonomous Resource Management Agents

On the Combined Behavior of Autonomous Resource Management Agents On the Combined Behavior of Autonomous Resource Management Agents Siri Fagernes 1 and Alva L. Couch 2 1 Faculty of Engineering Oslo University College Oslo, Norway siri.fagernes@iu.hio.no 2 Computer Science

More information

Software Maintenance

Software Maintenance 1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories

More information

Automatic Pronunciation Checker

Automatic Pronunciation Checker Institut für Technische Informatik und Kommunikationsnetze Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich Ecole polytechnique fédérale de Zurich Politecnico federale

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence INTERSPEECH September,, San Francisco, USA Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence Bidisha Sharma and S. R. Mahadeva Prasanna Department of Electronics

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

Quantitative Evaluation of an Intuitive Teaching Method for Industrial Robot Using a Force / Moment Direction Sensor

Quantitative Evaluation of an Intuitive Teaching Method for Industrial Robot Using a Force / Moment Direction Sensor International Journal of Control, Automation, and Systems Vol. 1, No. 3, September 2003 395 Quantitative Evaluation of an Intuitive Teaching Method for Industrial Robot Using a Force / Moment Direction

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Voice conversion through vector quantization

Voice conversion through vector quantization J. Acoust. Soc. Jpn.(E)11, 2 (1990) Voice conversion through vector quantization Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara A TR Interpreting Telephony Research Laboratories,

More information

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES Po-Sen Huang, Kshitiz Kumar, Chaojun Liu, Yifan Gong, Li Deng Department of Electrical and Computer Engineering,

More information

On the Formation of Phoneme Categories in DNN Acoustic Models

On the Formation of Phoneme Categories in DNN Acoustic Models On the Formation of Phoneme Categories in DNN Acoustic Models Tasha Nagamine Department of Electrical Engineering, Columbia University T. Nagamine Motivation Large performance gap between humans and state-

More information

Extending Place Value with Whole Numbers to 1,000,000

Extending Place Value with Whole Numbers to 1,000,000 Grade 4 Mathematics, Quarter 1, Unit 1.1 Extending Place Value with Whole Numbers to 1,000,000 Overview Number of Instructional Days: 10 (1 day = 45 minutes) Content to Be Learned Recognize that a digit

More information

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working

More information

Transfer Learning Action Models by Measuring the Similarity of Different Domains

Transfer Learning Action Models by Measuring the Similarity of Different Domains Transfer Learning Action Models by Measuring the Similarity of Different Domains Hankui Zhuo 1, Qiang Yang 2, and Lei Li 1 1 Software Research Institute, Sun Yat-sen University, Guangzhou, China. zhuohank@gmail.com,lnslilei@mail.sysu.edu.cn

More information

GACE Computer Science Assessment Test at a Glance

GACE Computer Science Assessment Test at a Glance GACE Computer Science Assessment Test at a Glance Updated May 2017 See the GACE Computer Science Assessment Study Companion for practice questions and preparation resources. Assessment Name Computer Science

More information

Edexcel GCSE. Statistics 1389 Paper 1H. June Mark Scheme. Statistics Edexcel GCSE

Edexcel GCSE. Statistics 1389 Paper 1H. June Mark Scheme. Statistics Edexcel GCSE Edexcel GCSE Statistics 1389 Paper 1H June 2007 Mark Scheme Edexcel GCSE Statistics 1389 NOTES ON MARKING PRINCIPLES 1 Types of mark M marks: method marks A marks: accuracy marks B marks: unconditional

More information

Detailed course syllabus

Detailed course syllabus Detailed course syllabus 1. Linear regression model. Ordinary least squares method. This introductory class covers basic definitions of econometrics, econometric model, and economic data. Classification

More information

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,

More information

A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING

A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING Yong Sun, a * Colin Fidge b and Lin Ma a a CRC for Integrated Engineering Asset Management, School of Engineering Systems, Queensland

More information

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Wilny Wilson.P M.Tech Computer Science Student Thejus Engineering College Thrissur, India. Sindhu.S Computer

More information

Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers

Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers October 31, 2003 Amit Juneja Department of Electrical and Computer Engineering University of Maryland, College Park,

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Koshi Odagiri 1, and Yoichi Muraoka 1 1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University,

More information

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Tyler Perrachione LING 451-0 Proseminar in Sound Structure Prof. A. Bradlow 17 March 2006 Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Abstract Although the acoustic and

More information

Bluetooth mlearning Applications for the Classroom of the Future

Bluetooth mlearning Applications for the Classroom of the Future Bluetooth mlearning Applications for the Classroom of the Future Tracey J. Mehigan, Daniel C. Doolan, Sabin Tabirca Department of Computer Science, University College Cork, College Road, Cork, Ireland

More information

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria FUZZY EXPERT SYSTEMS 16-18 18 February 2002 University of Damascus-Syria Dr. Kasim M. Al-Aubidy Computer Eng. Dept. Philadelphia University What is Expert Systems? ES are computer programs that emulate

More information

Automatic segmentation of continuous speech using minimum phase group delay functions

Automatic segmentation of continuous speech using minimum phase group delay functions Speech Communication 42 (24) 429 446 www.elsevier.com/locate/specom Automatic segmentation of continuous speech using minimum phase group delay functions V. Kamakshi Prasad, T. Nagarajan *, Hema A. Murthy

More information

Evolutive Neural Net Fuzzy Filtering: Basic Description

Evolutive Neural Net Fuzzy Filtering: Basic Description Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:

More information

New Features & Functionality in Q Release Version 3.2 June 2016

New Features & Functionality in Q Release Version 3.2 June 2016 in Q Release Version 3.2 June 2016 Contents New Features & Functionality 3 Multiple Applications 3 Class, Student and Staff Banner Applications 3 Attendance 4 Class Attendance 4 Mass Attendance 4 Truancy

More information

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH Don McAllaster, Larry Gillick, Francesco Scattone, Mike Newman Dragon Systems, Inc. 320 Nevada Street Newton, MA 02160

More information

Radius STEM Readiness TM

Radius STEM Readiness TM Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and

More information

Session H1B Teaching Introductory Electrical Engineering: Project-Based Learning Experience

Session H1B Teaching Introductory Electrical Engineering: Project-Based Learning Experience Teaching Introductory Electrical Engineering: Project-Based Learning Experience Chi-Un Lei, Hayden Kwok-Hay So, Edmund Y. Lam, Kenneth Kin-Yip Wong, Ricky Yu-Kwong Kwok Department of Electrical and Electronic

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS. Heiga Zen, Haşim Sak

UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS. Heiga Zen, Haşim Sak UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS Heiga Zen, Haşim Sak Google fheigazen,hasimg@google.com ABSTRACT Long short-term

More information

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

AGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016

AGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016 AGENDA Advanced Learning Theories Alejandra J. Magana, Ph.D. admagana@purdue.edu Introduction to Learning Theories Role of Learning Theories and Frameworks Learning Design Research Design Dual Coding Theory

More information

A Reinforcement Learning Variant for Control Scheduling

A Reinforcement Learning Variant for Control Scheduling A Reinforcement Learning Variant for Control Scheduling Aloke Guha Honeywell Sensor and System Development Center 3660 Technology Drive Minneapolis MN 55417 Abstract We present an algorithm based on reinforcement

More information

UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation

UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation Taufiq Hasan Gang Liu Seyed Omid Sadjadi Navid Shokouhi The CRSS SRE Team John H.L. Hansen Keith W. Godin Abhinav Misra Ali Ziaei Hynek Bořil

More information

Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

More information

EECS 571 PRINCIPLES OF REAL-TIME COMPUTING Fall 10. Instructor: Kang G. Shin, 4605 CSE, ;

EECS 571 PRINCIPLES OF REAL-TIME COMPUTING Fall 10. Instructor: Kang G. Shin, 4605 CSE, ; EECS 571 PRINCIPLES OF REAL-TIME COMPUTING Fall 10 Instructor: Kang G. Shin, 4605 CSE, 763-0391; kgshin@umich.edu Number of credit hours: 4 Class meeting time and room: Regular classes: MW 10:30am noon

More information

Support Vector Machines for Speaker and Language Recognition

Support Vector Machines for Speaker and Language Recognition Support Vector Machines for Speaker and Language Recognition W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, P. A. Torres-Carrasquillo MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information