Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition


146 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 3, MARCH 2002

Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition

Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine Tsai, and Qiru Zhou, Member, IEEE

Abstract: When automatic speech recognition (ASR) and speaker verification (SV) are applied in adverse acoustic environments, endpoint detection and energy normalization can be crucial to the functioning of both systems. In low signal-to-noise ratio (SNR) and nonstationary environments, conventional approaches to endpoint detection and energy normalization often fail and ASR performance usually degrades dramatically. The purpose of this paper is to address the endpoint problem. For ASR, we propose a real-time approach. It uses an optimal filter plus a three-state transition diagram for endpoint detection. The filter is designed utilizing several criteria to ensure accuracy and robustness. It has an almost invariant response at various background noise levels. The detected endpoints are then applied to energy normalization sequentially. Evaluation results show that the proposed algorithm significantly reduces the string error rates in low SNR situations. The reduction rates even exceed 50% in several evaluated databases. For SV, we propose a batch-mode approach. It uses the optimal filter plus a two-mixture energy model for endpoint detection. The experiments show that the batch-mode algorithm can detect endpoints as accurately as using HMM forced alignment, while the proposed one has much less computational complexity.

Index Terms: Change-point detection, edge detection, endpoint detection, optimal filter, robust speech recognition, speaker verification, speech activity detection, speech detection.

I. INTRODUCTION

IN SPEECH and speaker recognition, we need to process the signal in utterances consisting of speech, silence, and other background noise.
The detection of the presence of speech embedded in various types of nonspeech events and background noise is called endpoint detection, speech detection, or speech activity detection. In this paper, we address endpoint detection by sequential and batch-mode processes to support real-time recognition (in which the recognition response is the same as or faster than recording an utterance). The sequential process is often used in automatic speech recognition (ASR) [1], while the batch-mode process is often allowed in speaker recognition [2], name dialing [3], command control and embedded systems, where utterances are usually as short as a few seconds and the delay in response is usually small. Endpoint detection has been studied for several decades. The first application was in a telephone transmission and switching system developed in Bell Labs, for time assignment of communication channels [4]. The principle was to use the free channel time to interpolate additional speakers by speech activity detection. Since then, various speech detection algorithms have been developed for ASR, speaker verification, echo cancellation, speech coding and other applications. In general, different applications need different algorithms to meet their specific requirements in terms of computational accuracy, complexity, robustness, sensitivity, response time, etc. The approaches include those based on energy threshold (e.g., [5]), pitch detection (e.g., [6]), spectrum analysis, cepstral analysis [7], zero-crossing rate [8], [9], periodicity measure, hybrid detection [10], fusion [11] and many other methods. (Manuscript received June 7, 2001; revised February 13. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Juergen Schroeter. The authors are with the Multimedia Communications Research Lab, Bell Labs, Lucent Technologies, Murray Hill, NJ, USA; e-mail: qli@research.bell-labs.com. Publisher Item Identifier S (02)03972-X.)
Furthermore, similar issues have also been studied in other research areas, such as edge detection in image processing [12], [13] and change-point detection in theoretical statistics [14]-[18]. As is well known, endpoint detection is crucial to both ASR and speaker recognition because it often affects a system's performance in terms of accuracy and speed for several reasons. First, cepstral mean subtraction (CMS) [19]-[21], a popular algorithm for robust speaker and speech recognition, needs accurate endpoints to compute the mean of speech frames precisely in order to improve recognition accuracy. Second, if silence frames can be removed prior to recognition, the accumulated utterance likelihood scores will focus more on the speech portion of an utterance instead of on both noise and speech. Therefore, it has the potential to increase recognition accuracy. Third, it is hard to model noise and silence accurately in changing environments. This effect can be limited by removing background noise frames in advance. Fourth, removing nonspeech frames when the number of nonspeech frames is large can significantly reduce the computation time. Finally, for open speech recognition systems, such as open-microphone desktop applications and audio transcription of broadcast news, it is necessary to segment utterances from continuous audio input. In applications of speech and speaker recognition, nonspeech events and background noise complicate the endpoint detection problem considerably. For example, the endpoints of speech are often obscured by speaker-generated artifacts such as clicks, pops, heavy breathing, or by dial tones. Long-distance telephone transmission channels also introduce similar types of artifacts and background noise.
In recent years, as wireless, hands-free and Internet Protocol (IP) phones get more and more popular, the endpoint detection problem becomes even more difficult, since the signal-to-noise ratios (SNR) of these kinds of communication devices are usually lower, and the noise more nonstationary, than those of traditional telephone lines and handsets. The noise may come from the background, such as car noise, room reflection, street noise, background talking, etc., or from communication systems, such as coding, transmission, packet loss, etc. In these cases, the ASR or speaker recognition performance often degrades dramatically due to unreliable endpoint detection.

Another problem related to endpoint detection is real-time energy normalization. In both ASR and speaker recognition, we usually normalize the energy feature such that the largest energy level in a given utterance is close to or slightly below a constant of zero or one. This is not a problem in batch-mode processing, but it can be a crucial problem in real-time processing, since it is difficult to estimate the maximal energy in an utterance with just a short-time data buffer while the acoustic environment is changing. It becomes especially hard in adverse acoustic environments. A look-ahead approach to energy normalization can be found in [6]. Actually, as we will point out later in this study, real-time energy normalization and endpoint detection are two related problems. The more accurately we can detect endpoints, the better we can do on real-time energy normalization.

In this paper, we propose two endpoint-detection algorithms for real-time ASR and speaker recognition. Generally speaking, both algorithms must meet the following requirements: accurate location of detected endpoints; robust detection at various noise levels; low computational complexity; fast response time; and simple implementation. The real-time energy normalization problem is addressed together with endpoint detection.

The rest of the paper is organized as follows. In Section II, we will introduce a filter for endpoint detection. In Section III, we will propose a sequential algorithm of combined endpoint detection and energy normalization for ASR in adverse environments and provide experimental results in large database evaluations.
In Section IV, we will propose an accurate endpoint-detection algorithm for batch-mode applications and compare the detected endpoints with manually detected as well as HMM forced-alignment detected endpoints. Finally, we will summarize our findings in Section V.

II. A FILTER FOR ENDPOINT DETECTION

To ensure the low-complexity requirement, we borrow the one-dimensional (1-D) short-term energy in the cepstral feature to be the feature for endpoint detection:

g(m) = 10 log10 [ sum_{n = n_m}^{n_m + W - 1} s^2(n) ]    (1)

where s(n) is the data sample, m is the frame number, g(m) is the frame energy in decibels, W is the window length, and n_m is the number of the first data sample in the window. Thus, the detected endpoints can be aligned to the ASR feature vector automatically and the computation can be reduced from the speech-sampling rate to the frame rate. For accurate and robust endpoint detection, we need a detector that can detect all possible endpoints from the energy feature. Since the output of the detector contains false acceptances, a decision module is then needed to make final decisions based on the detector's output. Here, we assume that one utterance may have several speech segments separated by possible pauses. Each of the segments can be determined by detecting a pair of endpoints, named segment beginning and ending points. On the energy contours of utterances, there is always a rising edge following a beginning point and a descending edge preceding an ending point. We call them beginning and ending edges, respectively, as shown in Fig. 4(a). Since endpoints always come with the edges, our approach is first to detect the edges and then to find the corresponding endpoints. The foundation of the theory of the optimal edge detector was first established by Canny [12]. He derived an optimal step-edge detector. Spacek [22], on the other hand, formed a performance measure combining all three quantities derived by Canny and provided the solution of the optimal filter for a step edge. Petrou and Kittler then extended the work to ramp-edge detection [13].
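As an illustration of the short-term energy feature, here is a minimal Python sketch; the 30 ms window and 10 ms shift at 8 kHz (240 and 80 samples) are assumed values for illustration, not necessarily the paper's exact configuration:

```python
import math

def frame_log_energy(samples, window=240, shift=80):
    # Frame energy in decibels, in the spirit of (1):
    # g(m) = 10 * log10( sum of squared samples in the m-th window ).
    energies = []
    for start in range(0, len(samples) - window + 1, shift):
        e = sum(s * s for s in samples[start:start + window])
        energies.append(10.0 * math.log10(e + 1e-10))  # floor avoids log10(0)
    return energies
```

Working at the frame rate rather than the sampling rate is what keeps the detector's cost low, and the frame indexing matches the ASR feature stream.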
Since the edges corresponding to endpoints in the energy feature are closer to the ramp edge than the ideal step edge, Li and Tsai applied Petrou and Kittler's filter to endpoint detection for speaker verification in [2]. In summary, we need a detector that meets the following general requirements: 1) invariant outputs at various background energy levels; 2) capability of detecting both beginning and ending points; 3) short time delay or look-ahead; 4) limited response level; 5) maximum output signal-to-noise ratio (SNR) at endpoints; 6) accurate location of detected endpoints; 7) maximum suppression of false detection. We then need to convert the above criteria to a mathematical representation. As we have discussed, it is reasonable to assume that the beginning edge in the energy contour is a ramp edge that can be modeled by the following function:

u(t) = 1 - e^{-st} for t >= 0; u(t) = 0 for t < 0    (2)

where t represents the frame number of the feature and s is some positive constant which can be adjusted for different kinds of edges, such as beginning or ending edges, and for different sampling rates. The detector is a 1-D filter which can be operated as a moving-average filter on the energy feature. From the above requirements, the filter should have the following properties, which are similar to those in [13].

P1) It must be antisymmetrical, i.e., h(-t) = -h(t), and thus h(0) = 0. This follows from the fact that we want it to detect antisymmetrical features [12], i.e., be sensitive to both beginning and ending edges according to the request in 2), and have near-zero response to background noise at any level, i.e., be invariant to background noise according to the request in 1).

P2) According to the requirement in 3), it must be of finite extent, going smoothly to zero at its ends:

h(t) = 0 for |t| > w, where w is the half width of the filter.

P3) According to the requirement in 4), it must have a given maximum amplitude, attained at a point defined by a zero of the filter's derivative in the interval (-w, 0).

If we further represent requirements 5), 6), and 7) by corresponding performance measures, the combined objective function (3) is formed from them. It aims at finding the filter function h(t) such that the value of the objective function is maximal subject to properties P1)-P3). Fortunately, the objective function is very similar to that of optimal edge detection in image processing, and its details have been derived by Petrou and Kittler [13] following Canny [12], as well as in Appendix A. After applying the method of Lagrange multipliers, the solution for the filter function h(t) is given in closed form in [13] (4), where the constants in (4) are filter parameters. Since h(t) is only half of the filter, the actual filter coefficients (5) are obtained by antisymmetric extension, where t is an integer. The filter can then be operated as a moving-average filter on the energy feature (6), where g is the energy feature and m is the current frame number.

Fig. 1. Shape of the designed optimal filter.

An example of the designed optimal filter is shown in Fig. 1. Intuitively, the shape of the filter indicates that the filter must have a positive response to a beginning edge, a negative response to an ending edge, and a near-zero response to silence. Its response is basically invariant to different background noise levels, since they all have near-zero responses.

III. REAL-TIME ENDPOINT DETECTION AND ENERGY NORMALIZATION FOR ASR

The approach of using endpoint detection for real-time ASR is illustrated in Fig. 2 [23]. We use an optimal filter, as discussed in the last section, to detect all possible endpoints, followed by a three-state logic as a decision module to decide real endpoints. The information of detected endpoints is also utilized for real-time energy normalization.
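The closed-form Petrou-Kittler coefficients are not reproduced in this text, so the sketch below substitutes a derivative-of-Gaussian kernel as a stand-in; any antisymmetric, zero-sum kernel exhibits the key behavior described above: positive response at a beginning edge, negative response at an ending edge, and a response invariant to the absolute background level. The half-width and sigma values are illustrative assumptions.

```python
import math

def antisym_edge_filter(half_width=12, sigma=4.0):
    # Derivative-of-a-Gaussian kernel: h(-t) = -h(t), so sum(h) == 0
    # and the response to any constant background level is zero.
    h = [t * math.exp(-t * t / (2.0 * sigma * sigma))
         for t in range(-half_width, half_width + 1)]
    scale = sum(abs(v) for v in h)          # simple normalization
    return [v / scale for v in h]

def filter_output(energy, h):
    # Moving-average filtering of the frame-energy sequence: the output
    # at frame m correlates the kernel with energy[m-half .. m+half].
    half = len(h) // 2
    return [sum(h[i] * energy[m + i - half] for i in range(len(h)))
            for m in range(half, len(energy) - half)]
```

Because the kernel sums to zero, adding a constant offset to the whole energy contour (a higher noise floor) leaves the filter output unchanged, which is exactly the level-invariance property P1) is after.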
Finally, all silence frames are removed and only the speech frames, including cepstrum and the normalized energy, are sent to the recognizer.

Fig. 2. Endpoint detection and energy normalization for real-time ASR.

A. Filter for Both Beginning- and Ending-Edge Detection

After evaluating the shapes of both beginning and ending edges, we choose the filter size to meet requirements 2) and 3). The filter parameters have been provided in [13]; for our application, we just need to rescale them. The shape of the designed filter is shown in Fig. 1 with a simple normalization. For real-time detection, the filter has 25 points in total with a 24-frame look-ahead, since both end coefficients are zeros. The filter operates as a moving-average filter (7), where g is the energy feature and m is the current frame number. The output is then evaluated in a three-state transition diagram for final endpoint decisions.

B. State Transition Diagram

The endpoint decision needs to be made by comparing the value of the filter output with some pre-determined thresholds. Due to the sequential nature of the detector and the complexity of the decision procedure, we use a three-state transition diagram to make final decisions. As shown in Fig. 3, the three states are: silence, in-speech, and leaving-speech. Either the silence or the in-speech state can be a starting state and any state can be a final state. In the following discussion, we assume that the silence state is the starting state. The input is the filter output and the output is the detected frame numbers of beginning and ending points. The transition conditions are labeled on the edges between states and the actions are listed in parentheses. Count is a frame counter; a lower and an upper threshold are used; and Gap is an integer indicating the required number of frames from a detected endpoint to the actual end of speech. We use Fig. 4 as an example to illustrate the state transition. The energy for a spoken digit 4 is plotted in Fig. 4(a) and the

filter output is shown in Fig. 4(b). The state diagram stays in the silence state until the filter output reaches the point in Fig. 4(b) where it exceeds the upper threshold, which means that a beginning point is detected. The actions are to output a beginning point [corresponding to the left vertical solid line in Fig. 4(a)] and to move to the in-speech state. It stays in the in-speech state until reaching the point in Fig. 4(b) where the output falls below the lower threshold. The diagram then moves to the leaving-speech state and starts Count. The counter resets several times until reaching the point at which Count exceeds Gap. An actual ending point is detected, as the left vertical dashed line in Fig. 4(b), and the diagram then moves back to the silence state. During the stay in the leaving-speech state, if the output rises above the upper threshold again, this means that a beginning edge is coming and we should move back to the in-speech state. The 30-frame gap corresponds to the period of descending energy before reaching a real ending point. We note that the thresholds are set on the filter outputs instead of on absolute energy. Since the filter output is stable to the noise levels, the detected endpoints are more reliable. Those constants, Gap and the two thresholds, can be determined empirically by plotting several utterances and corresponding filter outputs. As we will show in the database evaluation, the algorithm is not very sensitive to the threshold values, since the same values were used in different databases. Also, in some applications, two separate filters can be designed for beginning- and ending-point detection. The size of the beginning filter can be smaller than 25 points while the ending filter can be larger than 25 points. This approach may further improve accuracy; however, it will have a longer delay and use more computation. The 25-point filter used in this section was designed for both beginning- and ending-point detection at an 8 kHz sampling rate. Also, in the case that an utterance starts with an unvoiced phoneme, it is practical to step back about ten frames from the detected beginning points.

Fig. 3. State transition diagram for endpoint decision.

Fig. 4. Example: (a) energy contour of digit 4 and (b) filter outputs and state transitions.

C. Real-Time Energy Normalization

Suppose the maximal energy value in an utterance is known. The purpose of energy normalization is to normalize the utterance energy such that the largest value of energy is close to zero, by subtracting the maximal energy. In a real-time mode, we have to estimate the maximal energy sequentially while the data are being collected; the estimated maximum energy then becomes a variable. Nevertheless, we can use the detected endpoints to obtain a better estimate. We first initialize the maximal energy to a constant, which is selected empirically, and use it for normalization until we detect the first beginning point, as in Fig. 4. If the average energy in the look-ahead window exceeds a pre-selected threshold, which ensures that the new estimate is not from a single click, we then estimate the maximal energy from the look-ahead window (8), whose extent is determined by the length of the filter and the length of the look-ahead. At the detected beginning point, the look-ahead window is as shown in Fig. 4. From then on, we update the estimate sequentially as in (9) and (10). The initial constant may need to be adjusted for different systems; for example, its value could be different between telephone and desktop systems. The averaging threshold is relatively easy to determine. For the example in Fig. 5, the energy features of two utterances with 20 dB SNR (bottom) and 5 dB SNR (top) are plotted in Fig. 5(a). The 5-dB utterance is generated by artificially adding car noise to the 20-dB one. The filter outputs are shown in Fig. 5(b) for the 20 dB (solid line) and 5 dB (dashed line) SNRs, respectively. The detected endpoints and normalized energy for the 20 and 5 dB SNRs are plotted in Fig. 5(c) and 5(d), respectively. We note that the filter outputs for the 20 and 5 dB cases

are almost invariant, although their background energy levels have a difference of 15 dB. This ensures the robustness of endpoint detection. We also note that the normalized energy profiles are almost the same as the original one, although the normalization is done in a real-time mode.

Fig. 5. (a) Energy contours of Z214 from the original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR). (b) Filter outputs for the 5 dB (dashed line) and 20 dB (solid line) SNR cases. (c) Detected endpoints and normalized energy for the 20 dB SNR case and (d) for the 5 dB SNR case.

Fig. 6. Comparisons on real-time connected digit recognition with various SNRs. From 5- to 20-dB SNRs, the proposed algorithm provided word error rate reductions of 90.2%, 93.4%, 57.1%, and 57.1%, respectively.

D. Database Evaluation

The proposed algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases.

1) Baseline Endpoint Detection: The baseline system is a real-time, energy-contour-based adaptive detector developed based on the algorithm introduced in [1], [5]. It has been used for years in research and commercial speech recognizers. In the baseline system, a six-state transition diagram is used to detect endpoints. Those states are named the initializing, silence, rising, energy, fell-rising, and fell states. In total, eight counters and 24 hard-limit thresholds are used for the decisions of state transition. Two adaptive threshold values were used in most of the thresholds. We note that all the thresholds are compared with raw energy values directly. Energy normalization in the baseline system is done separately by estimating the maximal and minimal energy values, then comparing their difference to a fixed threshold for decision.
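In contrast to the baseline's six states, the proposed decision module needs only three. A minimal sketch of that three-state logic follows; the threshold and Gap values here are illustrative, not the paper's tuned constants, and the counter-reset detail of the leaving-speech state is simplified.

```python
def detect_endpoints(F, t_upper=3.0, t_lower=-3.0, gap=30):
    # Sketch of the three-state endpoint decision of Section III-B.
    # F is the edge-filter output, indexed by frame.
    # Returns a list of (beginning_frame, ending_frame) pairs.
    state, count, begin, segments = "silence", 0, None, []
    for m, f in enumerate(F):
        if state == "silence":
            if f >= t_upper:            # beginning edge detected
                begin, state = m, "in-speech"
        elif state == "in-speech":
            if f <= t_lower:            # ending edge: start counting
                state, count = "leaving-speech", 0
        else:  # leaving-speech
            if f >= t_upper:            # new beginning edge: resume speech
                state = "in-speech"
            else:
                count += 1
                if count >= gap:        # Gap frames past the ending edge
                    segments.append((begin, m))
                    state, begin = "silence", None
    if begin is not None:               # utterance ended while in speech
        segments.append((begin, len(F) - 1))
    return segments
```

Because the decisions compare the filter output rather than raw energy, the same thresholds can serve across background levels.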
Since the energy values change with acoustic environments, the baseline approach leads to unreliable endpoint detection and energy normalization, especially in low SNR and nonstationary environments.

2) Noisy Database Evaluation: In this experiment, a database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to an 8 kHz sampling rate. Later, car and other background noises were artificially added to the original database at the SNR levels of 5, 10, 15, and 20 dB. The original database has 39 utterances and 1738 digits in total. Each utterance has 3, 7, or 11 digits. The LPC feature and the short-term energy were used, and hidden Markov models (HMMs) in a head-body-tail (HBT) structure were employed to model each of the digits [24], [25]. The HBT structure assumes that context-dependent digit models can be built by concatenating a left-context-dependent unit (head) with a context-independent unit (body) followed by a right-context-dependent unit (tail). We used three HMM states to represent each head and tail and four states to represent each body. Sixteen mixtures were used for each body state and four mixtures were used for each head or tail state. The real-time recognition performances at various SNRs are shown in Fig. 6. Compared to the baseline algorithm, the proposed one significantly reduced word error rates. The baseline algorithm failed to work in low SNR cases because it uses raw energy values directly to detect endpoints and to perform energy normalization. The proposed algorithm makes decisions on the filter output instead of raw energy values; therefore, it provided more robust results. An example of error analysis is shown in Fig. 7.

Fig. 7. (a) Energy contour of the 523rd utterance in DB5: 1 Z 4 O 5 8 2. (b) Endpoints and normalized energy from the baseline system. The utterance was recognized as 1 Z 4 O 5 8. (c) Endpoints and normalized energy from the proposed system. The utterance was recognized correctly as 1 Z 4 O 5 8 2. (d) The filter output.
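The sequential energy normalization of Section III-C, as used in the evaluations above, can be sketched as follows; the initial constant, averaging threshold, and look-ahead length are illustrative values, not the paper's constants.

```python
def normalize_energy_stream(g, lookahead=24, e0=60.0, min_avg=30.0):
    # Sequential normalization: the maximal-energy estimate starts at an
    # empirical constant e0 and, whenever the look-ahead window looks like
    # speech (average energy above min_avg, so a single click is ignored),
    # it is raised to the running maximum over that window. Each frame is
    # normalized by subtracting the current estimate, so the true peak
    # energy maps approximately to zero.
    normalized, e_max = [], e0
    for m, value in enumerate(g):
        window = g[m:m + lookahead + 1]
        if sum(window) / len(window) > min_avg:   # likely speech, not a click
            e_max = max(e_max, max(window))
        normalized.append(value - e_max)
    return normalized
```

The look-ahead is what makes this workable in real time: by the time speech frames are emitted, the window has already seen the local peak, so the normalization tracks the batch-mode result closely.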

TABLE I. DATABASE EVALUATION RESULTS (%)

Fig. 8. Shape of the optimal filter for beginning-edge detection, plotted as h(t), with W = 7 and s = 1.

Fig. 9. Shape of the optimal filter for ending-edge detection, plotted as h(t), with W = 35 and s = 0.2.

3) Telephone Database Evaluation: The proposed algorithm was further evaluated on 11 databases collected from the telephone networks with 8 kHz sampling rates in various acoustic environments. LPC parameters and short-term energy were used. The acoustic model consists of one silence model, 41 mono-phone models and 275 head-body-tail units for digit recognition. It has a total of 79 phoneme symbols, 33 of which are for digit units. Eleven databases, DB1 to DB11, were used for the evaluation. DB1 to DB5 contain digit, alphabet and word strings. Finite-state grammars were used to specify the valid forms of recognized strings. DB6 to DB11 contain pure digit strings. In all the evaluations, both endpoint detection and energy normalization were performed in real-time mode and only the detected speech portions of an utterance were sent to the recognition back-end. In the proposed system, the parameter values, including the thresholds and Gap, were unchanged throughout the evaluation in all 11 databases to show the robustness of the algorithm, although the parameters can be adjusted according to signal conditions in different applications. The evaluation results are listed in Table I. It shows that the proposed algorithm works very well on regular telephone data as well. It provided word error reduction in most of the databases. The word error reductions even exceed 30% in DB2, DB6, and DB9. To analyze the improvement, the original energy feature of an utterance, 1 Z 4 O 5 8 2, in DB6 is plotted in Fig. 7(a). The detected endpoints and normalized energy using the conventional approach are shown in Fig. 7(b), while the results of the proposed algorithm are shown in Fig. 7(c). The filter output is plotted in Fig. 7(d). From Fig. 7(b), we can observe that the normalized maximal energy of the conventional approach is about 10 dB below zero, which causes a wrong recognition result: 1 Z 4 O 5 8. On the other hand, the proposed algorithm normalized the maximal energy to approximately zero and the utterance was recognized correctly as 1 Z 4 O 5 8 2.

IV. ACCURATE BATCH-MODE ENDPOINT DETECTION FOR SPEAKER VERIFICATION

So far, we have focused on real-time endpoint detection, which is mainly for ASR applications where silence or garbage models are usually used to further determine accurate endpoints in decoding. In another category of applications, real-time processing is not so crucial. Speech data can be processed in batch mode, i.e., after data recording is finished. The applications include speaker verification, name dialing, speech control, etc., where the utterances are usually short (e.g., less than 2 s) and the verification or recognition can be done within 1 s. Since many of these kinds of applications are offered in embedded systems, such as wireless phones or portable devices, or in multi-user systems, such as a speaker verification server for millions of users [3], they normally require low computational complexity for low cost or for a fast response. For these cases, one solution is to use an accurate endpoint detector to remove all silence; therefore, we not only can reduce the number of decoding frames significantly, but also eliminate the silence model in decoding, which usually takes a lot of space and computation. Batch-mode processing enables this class of operations.

A. Batch-Mode Algorithm

To obtain accurate endpoints, we designed two filters, one for beginning-edge and another for ending-edge detection, using the algorithm in Section II. The first filter is shown in Fig. 8 with seven points, while the second one is shown in Fig. 9 with 35 points. This is because the ending edge is usually longer than the beginning edge.
We note that the ending filter gives a positive response at a detected ending edge. To help in accurately determining energy thresholds, we use a Gaussian mixture model to model the energy distribution. The final endpoints are detected by combining the information from the filter outputs and the estimated thresholds.

1) Energy Distribution Model: We assume that a Gaussian mixture model can approximately represent the distribution of energy in an utterance, with two mixtures representing speech and background energy, respectively:

p(x) = w N(x; mu_1, sigma_1) + (1 - w) N(x; mu_2, sigma_2)    (11)

where w is a weighting parameter and N(x; mu, sigma) is a normal distribution given by

N(x; mu, sigma) = (1 / (sqrt(2 pi) sigma)) exp( -(x - mu)^2 / (2 sigma^2) )    (12)

and mu and sigma are the mean and standard deviation, respectively. The means for speech and background noise have corresponding standard deviations, and a threshold is defined for each of speech and background noise. When the energy value is above the speech threshold, we consider it as speech; when the energy value is below the background threshold, we consider it as background noise. To obtain fast and explicit parameter estimation, we applied a moment algorithm instead of the popular EM algorithm, which needs iterations. The fast estimation algorithm is listed in Appendix B.

2) Summary of the Algorithm: We now summarize the proposed algorithm for batch-mode endpoint detection [2]. The parameters in the following algorithm are for the energy feature computed from 30 ms energy windows shifting every 10 ms. The data sampling rate is 8 kHz.

1) Compute the log energy of the given utterance and normalize it by subtracting its largest value, so that the maximum is zero. We assume that the speech is surrounded by silence and various kinds of noise.

2) Remove the dial tone. A dial tone can be detected by two parameters, determined based on the minimal length and minimal energy level of dial tones.

3) Estimate the means and standard deviations of the two mixtures using (22) to (29), then determine the two thresholds for speech and background energy, respectively. Speech energy should be above the speech threshold and silence/background noise energy should be below the background threshold. This is based on the assumption that noise and speech can be represented as two separate Gaussian mixtures.

4) Compute the output of the beginning-edge filter (13), then search for the locations of all peaks in the filter output. A peak associated with a beginning point should meet the required threshold properties. The actual beginning point is then obtained by shifting the location of each beginning-edge peak; the shift is due to the offset between the center of a beginning edge and the actual beginning point.
Lines A, B, C, an D inicate the estimate values of,, an, respectively. 5) From a etecte beginning point, search for the corresponing ening point, which shoul satisfy the following conitions: i) an ; ii) ; iii) when, 60% frames of, shoul have the values above ; an iv). Here, ii) an iii) are to ensure that the segmentation is speech instea of a click or breath noise. The segment that cannot meet the above conitions is not a speech segment. 6) Determine the actual last ening point. Compute the response of the ening-ege filter in the last segment,,by (14) Search for the last peak of, where an. Then, shift the peak point locate at the center of the ening ege to the last ening point. The offset shoul be about half the filter size. We choose 16 frames. Thus, if at frame, the energy level is still higher than threshol, the ening point is at frame ; otherwise, the ening point is the last point before the energy crosses the threshol if an. (15) 3) Illustrative Examples: We use the example in Fig. 10 to illustrate the concept of the propose algorithm. The utterance call office is first converte to log energy an normalize to have the largest value be zero. For this example, the speech signal is about 2 s concatenate by another 2 s of heavy breath. We estimate the means an stanar eviations of the speech an the backgroun energy using the equations in Appenix B. The results are shown in Fig. 10, where lines A, B, C, an D inicate the estimate values of,, an, respectively. Then we computer using (13). We note that the operation iffers slightly from the real-time one. The result is shown in Fig. 11 as a soli line. After evaluating the values of the peaks that are above threshol, for this case, the locations of beginning points are first locate at the centers of the highest peaks.

Fig. 11. Output of the beginning-edge filter (solid line) and ending-edge filter (dashed line).

Fig. 12. Lines E, F, G, and H indicate the locations of two pairs of beginning and ending points.

Fig. 13. Last ending point was adjusted from line H to I by applying the ending-edge filter.

Fig. 14. Normalized log energy of call Candice at her home with breath in the beginning. The detected endpoints are the vertical solid line and dashed line.

Actual beginning points can then be located by shifting the corresponding locations of the peaks to the left by half the filter size. From the first beginning point, we search for the location of the corresponding ending point, where the energy level is lower than the background threshold. For this example, we get two pairs of endpoints corresponding to two speech segments, as shown in Fig. 12, from line E to F and from line G to H, respectively. The clicks in the beginning of the utterance were not detected as speech because the filter responses at those locations were lower than the threshold value. As we can see from Fig. 12, the last segment between lines G and H includes the heavy breath. The energy data in the segments are then fed into the ending-edge filter to compute (14). The filter output is shown in Fig. 11 as the dashed line. The ending point of the last segment is located by shifting the frame index of the largest peak to the right by about half the size of the ending-edge filter. If the energy value is lower than the background threshold at the shifted location, the ending point should be the last point where the energy level is greater than the threshold, as described in (15). The final speech segments are from line E to line F and from line G to line I, as shown in Fig. 13. More examples are shown in Figs. 14 and 15. Fig. 14 is the energy contour of the utterance Call Candice at her home phone with breath in the beginning. The horizontal solid and dashed lines represent the means and thresholds for speech and noise.
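The peak search and half-filter-width shift used for batch-mode beginning points can be sketched as follows; the threshold and half-width values are illustrative:

```python
def peaks_above(F, threshold):
    # Indices of local maxima of a filter output F that exceed threshold.
    return [i for i in range(1, len(F) - 1)
            if F[i] > threshold and F[i] >= F[i - 1] and F[i] >= F[i + 1]]

def beginning_points(F_begin, threshold, half_width):
    # Each qualifying peak of the beginning-edge filter output sits at the
    # center of a rising edge, so the actual beginning point is about half
    # a filter width to the left of the peak.
    return [max(0, p - half_width) for p in peaks_above(F_begin, threshold)]
```

Ending points are handled symmetrically: the last qualifying peak of the ending-edge filter output is shifted to the right by about half that filter's size, with the threshold crossing as a fallback.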
The detected beginning and ending points are shown as the vertical solid line and dashed lines, respectively. The breath signal is successfully excluded from the speech segmentation. Fig. 15 shows an example of an utterance with a dial tone at the end: the first and last vertical lines are the beginning and ending points, respectively, and the other vertical lines indicate the detected silence between words. The dial tone at the end of the utterance was detected and excluded from the speech segment.

Fig. 15. Normalized log energy of "I pledge allegiance to the flag," with a dial tone at the end. The vertical lines indicate the beginning point and the ending points.

B. Comparisons With the HMM Forced-Alignment Approach

The database used for the comparison was collected for speaker verification with a common phrase, "I pledge allegiance to the flag." It has 100 speakers and 4741 utterances in total. The utterances were collected over long-distance telephone networks. The speakers were instructed to make the phone calls at different locations and using different telephone handsets. The collected utterances contain various kinds of noise. A pair of beginning and ending points was detected manually for every utterance. We use the manually detected endpoints

as references to compare with the endpoints detected by the proposed approach and by the HMM approach. Here, the HMM approach means endpoint detection by forced alignment, assuming that both the models and the lexicon are available. The HMM approach uses 41 speaker-independent phoneme models. Each phone model has three states, and each state has 32 Gaussian mixtures. The feature vector is composed of 12 cepstral and 12 delta-cepstral coefficients. The cepstrum is derived from a tenth-order LPC analysis over a 30-ms window, and the feature vectors are updated at 10-ms intervals.

TABLE II. Statistics of the differences on detected beginning points.

Fig. 16. The dashed line is the histogram of the differences between manually and HMM-detected beginning points. The solid line is between manually and batch-mode detected beginning points.

The histogram of the differences between the manually detected beginning points and the HMM-detected beginning points is shown in Fig. 16 as a dashed line. The histogram of the differences between the manually detected beginning points and the beginning points detected by the proposed approach is shown in Fig. 16 as a solid line. The statistics are listed in Table II. The accuracy of the proposed approach is very close to that of the HMM approach. The shift between the two histograms could be removed by adjusting the thresholds for determining beginning points; however, this is not necessary, since the overall difference between the two approaches is about the same. The histogram of the differences between the manually detected ending points and the HMM-detected ending points is shown in Fig. 17 as a dashed line. The histogram of the differences between the manually detected ending points and the ending points detected by the proposed approach is shown in Fig. 17 as a solid line. These two histograms are very close. We note that both histograms are shifted from the manually detected ending points. This is due to the different interpretations of ending points between humans and algorithms.
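The comparison summarized in Table II reduces, for each utterance, to the difference in frames between a reference endpoint and a detected one, and then to summary statistics over the whole database. A minimal sketch (the dictionary keys are ours, not the paper's):

```python
import statistics

def endpoint_difference_stats(manual, detected):
    """Summarize per-utterance differences (in frames) between manually
    labeled endpoints and automatically detected ones.

    A nonzero mean corresponds to the systematic shift visible between the
    histograms of Figs. 16 and 17; the standard deviation measures their
    spread.
    """
    diffs = [d - m for m, d in zip(manual, detected)]
    return {
        "mean": statistics.fmean(diffs),        # systematic shift (bias)
        "std": statistics.stdev(diffs),         # spread of the histogram
        "max_abs": max(abs(d) for d in diffs),  # worst-case disagreement
    }
```

Running this once with manually detected endpoints against the batch-mode detector and once against HMM forced alignment gives the kind of side-by-side accuracy figures reported in Table II.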
The histograms and the table indicate that the endpoints detected by the proposed algorithm have the same accuracy as the HMM-detected endpoints. Compared with the HMM approach, the proposed one does not need any language-dependent models or lexicon information; therefore, it can support language-independent applications. The proposed algorithm is also much faster: it needs only about 130 kflops (floating-point operations) for endpoint detection, while the HMM approach needs over 200 Mflops for forced alignment using a set of speaker-independent phoneme models. Furthermore, the proposed algorithm can easily detect the silence between words, whereas doing so with the HMM approach involves much more computation.

Fig. 17. The dashed line is the histogram of the differences between manually and HMM-detected ending points. The solid line is between manually and batch-mode detected ending points.

C. Application to Language-Independent Speaker Verification

Since the proposed algorithm can detect endpoints with accuracy similar to that of the HMM approach, we apply it to the front end of a speaker verification system [2]. After LPC cepstral extraction, the proposed algorithm detects endpoints on the energy. Silence, breath, dial tone, and other nonspeech signals are then removed from the feature set. Given the original feature observation, after silence removal the feature set becomes a subset of the original. Cepstral mean subtraction (CMS) is then performed on the retained features. This approach was evaluated on a database consisting of 38 speakers (18 male and 20 female) for speaker verification (see [26] for the database description). The common pass-phrase for all speakers is "call Janice at her office phone." Each true speaker was tested with the same pass-phrase from all impostors. In the language-independent configuration, the equal error rates (EERs) are 3.6% and 4.4% for the male and female groups, respectively.
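The front-end processing described above (keep only the frames inside the detected speech segments, then apply CMS over what remains) can be sketched as follows; the frame and segment representations are hypothetical.

```python
def remove_silence_and_cms(frames, speech_segments):
    """Keep only cepstral frames inside detected speech segments, then apply
    cepstral mean subtraction (CMS) over the retained frames.

    `frames` is a list of cepstral vectors (lists of floats) for the whole
    utterance; `speech_segments` is a list of (begin, end) frame indices,
    inclusive, from the endpoint detector.
    """
    # Silence removal: the retained frames are a subset of the original set.
    keep = [f for b, e in speech_segments for f in frames[b:e + 1]]
    if not keep:
        return []
    # CMS: subtract the per-dimension mean computed over speech frames only,
    # so silence and other nonspeech frames do not bias the mean.
    dim = len(keep[0])
    mean = [sum(f[d] for f in keep) / len(keep) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in keep]
```

Computing the cepstral mean only over speech frames is the point of doing endpoint detection before CMS: frames of silence, breath, or dial tone would otherwise pull the mean away from the speech statistics.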
In the language-dependent configuration, where the background model is applied, the EERs are 2% and 3.5% for the male and female groups, respectively. The average individual EER is 2.8%. The accuracy is at the same level as that of the speaker

verification system in which HMMs were applied to endpoint detection [27], [28]. The proposed algorithm has also been implemented in a real speech controller with embedded speaker verification; readers are referred to [3] for details.

V. CONCLUSIONS

In this paper, we proposed two algorithms, for real-time and batch-mode endpoint detection. Both algorithms apply filters to detect possible endpoints and then make final decisions based on the filter outputs. Since the filter is designed to be invariant to various levels of background noise, the proposed algorithms are reliable and robust, even in very low SNR situations. In the real-time algorithm, a filter with a 24-frame look-ahead detects all possible endpoints. A three-state transition diagram then evaluates the output from the filter for final decisions. The detected endpoints are then applied to real-time energy normalization. Since the entire algorithm uses only a one-dimensional energy feature, it has low complexity and is computationally very fast. The evaluation on a noisy database showed significant string error reduction, over 50% in all 5- to 20-dB SNR situations. The evaluations on telephone databases showed over 30% reductions in four out of 12 databases. The proposed algorithm has been implemented in real-time ASR systems. Its contributions improve not only the recognition accuracy but also the robustness of the entire system in low SNR environments. In the batch-mode algorithm, the peaks of the filter output are used to detect endpoints, with thresholds estimated from a two-mixture energy distribution model whose parameters can be solved through closed-form equations. Using manually detected endpoints as references, we compared the proposed algorithm with the HMM forced-alignment approach. The experiments showed that the proposed algorithm has accuracy similar to the HMM approach while requiring much less computation.
The algorithm has also been implemented in a real recognition system for language-independent speech control, including embedded speaker verification [3].

APPENDIX A
OBJECTIVE FUNCTION FOR THE OPTIMAL FILTER DESIGN

Assume that the beginning or ending edge in log energy is a ramp edge as defined in (2), and assume that the edges are immersed in white Gaussian noise. Following Canny's criteria, Petrou and Kittler [13] derived the SNR for this filter as being proportional to (16), where is the half-width of the actual filter. They consider a good locality measure to be inversely proportional to the standard deviation of the distribution of the endpoint where the edge is supposed to be; it was defined as (17). Finally, the measure for the suppression of false edges is proportional to the mean distance between neighboring maxima of the response of the filter to white Gaussian noise, (18). Therefore, the combined objective function of the filter is (19).

APPENDIX B
ENERGY MODEL ESTIMATION

Instead of the popular EM algorithm, we applied the moment method [29] for faster parameter estimation of the model in (11). Let denote the sample values. By equating the observed moments, given by (20), where is the sample mean, to the theoretical moments, given by (21), we obtain five nonlinear simultaneous equations whose solution has been summarized in [29]. To estimate the five parameters, we first find the real negative root of (22), where and are the fourth and fifth sample cumulants, respectively. Let be a real negative root of (22); then parameters and are obtained as roots of (23),

where (24). Now, the estimates of the five model parameters may be derived in the following explicit forms: (25), (26), (27), (28), (29). We note that in this application. In the case that the solution of (22) does not exist, a histogram can be constructed to estimate the mixture-model parameters approximately.

ACKNOWLEDGMENT

The authors wish to thank Dr. D. W. Tufts for a very helpful lecture, and Drs. W.-G. Kim, C.-H. Lee, O. Siohan, F. K. Soong, and F. Korkmazskiy for useful discussions.

REFERENCES

[1] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall.
[2] Q. Li and A. Tsai, "A matched filter approach to endpoint detection for robust speaker verification," in Proc. IEEE Workshop on Automatic Identification, Summit, NJ, Oct.
[3] Q. Li and A. Tsai, "A language-independent personal voice controller with embedded speaker verification," in Proc. Eurospeech '99, Budapest, Hungary, Sept.
[4] K. Bullington and J. M. Fraser, "Engineering aspects of TASI," Bell Syst. Tech. J., Mar.
[5] J. G. Wilpon, L. R. Rabiner, and T. Martin, "An improved word-detection algorithm for telephone-quality speech incorporating both syntactic and semantic constraints," AT&T Bell Labs Tech. J., vol. 63, Mar.
[6] R. Chengalvarayan, "Robust energy normalization using speech/nonspeech discriminator for German connected digit recognition," in Proc. Eurospeech '99, Budapest, Hungary, Sept. 1999.
[7] J. A. Haigh and J. S. Mason, "Robust voice activity detection using cepstral features," in Proc. IEEE TENCON, 1993.
[8] L. R. Rabiner and M. R. Sambur, "An algorithm for determining the endpoints of isolated utterances," Bell Syst. Tech. J., vol. 54, Feb.
[9] J. C. Junqua, B. Reaves, and B. Mak, "A study of endpoint detection algorithms in adverse conditions: Incidence on a DTW and HMM recognizer," in Proc. Eurospeech, 1991.
[10] L. F. Lamel, L. R. Rabiner, A. E. Rosenberg, and J. G.
Wilpon, "An improved endpoint detector for isolated word recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, Aug.
[11] S. G. Tanyer and H. Özer, "Voice activity detection in nonstationary noise," IEEE Trans. Speech Audio Processing, vol. 8, July.
[12] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-8, Nov.
[13] M. Petrou and J. Kittler, "Optimal edge detectors for ramp edges," IEEE Trans. Pattern Anal. Machine Intell., vol. 13, May.
[14] E. Carlstein, M. Muller, and D. Siegmund, Change-Point Problems. Hayward, CA: Inst. Math. Statist.
[15] R. K. Bansal and P. Papantoni-Kazakos, "An algorithm for detecting a change in stochastic process," IEEE Trans. Inform. Theory, vol. IT-32, Mar.
[16] A. Wald, Sequential Analysis. London, U.K.: Chapman & Hall.
[17] Q. Li, "A detection approach to search-space reduction for HMM state alignment in speaker verification," IEEE Trans. Speech Audio Processing, vol. 9, July.
[18] B. Brodsky and B. S. Darkhovsky, Nonparametric Methods in Change-Point Problems. Norwell, MA: Kluwer.
[19] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Amer., vol. 55.
[20] B. S. Atal, "Automatic recognition of speakers from their voices," Proc. IEEE, vol. 64.
[21] S. Furui, "Cepstral analysis techniques for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, Apr.
[22] L. A. Spacek, "Edge detection and motion detection," Image Vision Comput., vol. 4.
[23] Q. Li, J. Zheng, Q. Zhou, and C.-H. Lee, "A robust, real-time endpoint detector with energy normalization for ASR in adverse environments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Salt Lake City, UT, May.
[24] W. Chou, C.-H. Lee, and B.-H. Juang, "Minimum error rate training of inter-word context dependent acoustic model units in speech recognition," in Proc. Int. Conf.
on Spoken Language Processing, 1994.
[25] C.-H. Lee, E. Giachin, L. R. Rabiner, R. Pieraccini, and A. E. Rosenberg, "Improved acoustic modeling for large vocabulary speech recognition," Comput. Speech Lang., vol. 6.
[26] R. A. Sukkar and C.-H. Lee, "Vocabulary independent discriminative utterance verification for nonkeyword rejection in subword based speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, Nov.
[27] S. Parthasarathy and A. E. Rosenberg, "General phrase speaker verification using sub-word background models and likelihood-ratio scoring," in Proc. ICSLP-96, Philadelphia, PA, Oct.
[28] Q. Li, S. Parthasarathy, and A. E. Rosenberg, "A fast algorithm for stochastic matching with application to robust speaker verification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Munich, Germany, Apr. 1997.
[29] B. S. Everitt and D. J. Hand, Finite Mixture Distributions. London, U.K.: Chapman & Hall.

Qi (Peter) Li (S'87-M'88-SM'01) received the Ph.D. degree in electrical engineering from the University of Rhode Island, Kingston. In 1995, he joined Bell Laboratories, Murray Hill, NJ, where he is currently a Member of Technical Staff in the Dialogue Systems Research Department. From 1988 to 1994, he was with F. M. Engineering and Research, Norwood, MA, where he worked on pattern recognition algorithms and real-time systems. His research interests include robust speaker and speech recognition, robust feature extraction, fast search algorithms, stochastic modeling, fast discriminative learning, and neural networks. His research results have been implemented in Lucent products. He has published extensively and has filed and been awarded many patents in his research areas. Dr. Li has been active as a reviewer for several journals, including the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, as a Local Chair for the IEEE 1999 Workshop on Automatic Identification, and as a committee member for several IEEE workshops.
He has received two awards and is listed in Who's Who in America (Millennium and 2001 Editions).

Jinsong Zheng received the B.S. and M.S. degrees in computer science from Fudan University, Shanghai, China, and Utah State University, Logan, respectively. Between 1994 and 1998, he was a Software Engineer with WebSci Technologies, South Brunswick, NJ. Since 1998, he has been a Consultant in the Dialogue Systems Research Department of Bell Labs, Lucent Technologies, Murray Hill, NJ, where he has been involved in various research projects in speech recognition. He is also a member of the Lucent Automatic Speech Recognition (LASR) software development team.

Augustine Tsai received the M.S. degree in systems engineering from Case Western Reserve University, Cleveland, OH, in 1989, and a second M.S. degree and the Ph.D. degree in electrical engineering from Rutgers University, New Brunswick, NJ, in 1991 and 1996, respectively. He was a Lead Engineer with the U.S. Army Face Recognition Project while he was with CAIP, Rutgers University. He was with SpeakEZ, developing speaker verification products. In 1997, he worked on ATM LAN emulation for MPEG II video broadcast at AT&T. Since 1998, he has been with the Multimedia Communication Research Laboratory, Bell Labs, Murray Hill, NJ. He has been involved with various activities in speaker verification, dialogue systems, and language modeling, and has contributed to the design of the dialogue session manager for the VoiceXML platform. He is currently working on QoS policy management in multiprotocol label switching (MPLS) based media networks. He has publications in machine vision, face recognition, speech/3-D audio processing, and video networks.

Qiru Zhou (S'86-M'92) received the B.S. and M.S. degrees in electrical and computer engineering from Northern Jiao-Tong University, China, and Beijing University of Posts and Telecommunications, China, respectively. He joined Bell Labs, AT&T. Currently, he is a Member of Technical Staff at Bell Labs, Lucent Technologies, Murray Hill, NJ, in the Dialogue Systems Research Department. His research interests include speech and speaker recognition algorithms, multimodal dialogue infrastructure, real-time distributed object-oriented software methodology for multimedia communication systems, and standards for speech and multimedia applications. Since 1992, he has been involved in and has led various projects at AT&T and Lucent to apply speech and dialogue technologies in products. He is now a Technical Leader in Lucent speech software product development.


More information

An Online Handwriting Recognition System For Turkish

An Online Handwriting Recognition System For Turkish An Online Handwriting Recognition System For Turkish Esra Vural, Hakan Erdogan, Kemal Oflazer, Berrin Yanikoglu Sabanci University, Tuzla, Istanbul, Turkey 34956 ABSTRACT Despite recent developments in

More information

Using Proportions to Solve Percentage Problems I

Using Proportions to Solve Percentage Problems I RP7-1 Using Proportions to Solve Percentage Problems I Pages 46 48 Standards: 7.RP.A. Goals: Students will write equivalent statements for proportions by keeping track of the part and the whole, and by

More information

Circuit Simulators: A Revolutionary E-Learning Platform

Circuit Simulators: A Revolutionary E-Learning Platform Circuit Simulators: A Revolutionary E-Learning Platform Mahi Itagi Padre Conceicao College of Engineering, Verna, Goa, India. itagimahi@gmail.com Akhil Deshpande Gogte Institute of Technology, Udyambag,

More information

Phonetic- and Speaker-Discriminant Features for Speaker Recognition. Research Project

Phonetic- and Speaker-Discriminant Features for Speaker Recognition. Research Project Phonetic- and Speaker-Discriminant Features for Speaker Recognition by Lara Stoll Research Project Submitted to the Department of Electrical Engineering and Computer Sciences, University of California

More information

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration

Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration INTERSPEECH 2013 Semi-Supervised GMM and DNN Acoustic Model Training with Multi-system Combination and Confidence Re-calibration Yan Huang, Dong Yu, Yifan Gong, and Chaojun Liu Microsoft Corporation, One

More information

Speaker recognition using universal background model on YOHO database

Speaker recognition using universal background model on YOHO database Aalborg University Master Thesis project Speaker recognition using universal background model on YOHO database Author: Alexandre Majetniak Supervisor: Zheng-Hua Tan May 31, 2011 The Faculties of Engineering,

More information

On the Combined Behavior of Autonomous Resource Management Agents

On the Combined Behavior of Autonomous Resource Management Agents On the Combined Behavior of Autonomous Resource Management Agents Siri Fagernes 1 and Alva L. Couch 2 1 Faculty of Engineering Oslo University College Oslo, Norway siri.fagernes@iu.hio.no 2 Computer Science

More information

Software Maintenance

Software Maintenance 1 What is Software Maintenance? Software Maintenance is a very broad activity that includes error corrections, enhancements of capabilities, deletion of obsolete capabilities, and optimization. 2 Categories

More information

Automatic Pronunciation Checker

Automatic Pronunciation Checker Institut für Technische Informatik und Kommunikationsnetze Eidgenössische Technische Hochschule Zürich Swiss Federal Institute of Technology Zurich Ecole polytechnique fédérale de Zurich Politecnico federale

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence

Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence INTERSPEECH September,, San Francisco, USA Speech Synthesis in Noisy Environment by Enhancing Strength of Excitation and Formant Prominence Bidisha Sharma and S. R. Mahadeva Prasanna Department of Electronics

More information

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview

More information

Quantitative Evaluation of an Intuitive Teaching Method for Industrial Robot Using a Force / Moment Direction Sensor

Quantitative Evaluation of an Intuitive Teaching Method for Industrial Robot Using a Force / Moment Direction Sensor International Journal of Control, Automation, and Systems Vol. 1, No. 3, September 2003 395 Quantitative Evaluation of an Intuitive Teaching Method for Industrial Robot Using a Force / Moment Direction

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

Voice conversion through vector quantization

Voice conversion through vector quantization J. Acoust. Soc. Jpn.(E)11, 2 (1990) Voice conversion through vector quantization Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara A TR Interpreting Telephony Research Laboratories,

More information

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES

PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES PREDICTING SPEECH RECOGNITION CONFIDENCE USING DEEP LEARNING WITH WORD IDENTITY AND SCORE FEATURES Po-Sen Huang, Kshitiz Kumar, Chaojun Liu, Yifan Gong, Li Deng Department of Electrical and Computer Engineering,

More information

On the Formation of Phoneme Categories in DNN Acoustic Models

On the Formation of Phoneme Categories in DNN Acoustic Models On the Formation of Phoneme Categories in DNN Acoustic Models Tasha Nagamine Department of Electrical Engineering, Columbia University T. Nagamine Motivation Large performance gap between humans and state-

More information

Extending Place Value with Whole Numbers to 1,000,000

Extending Place Value with Whole Numbers to 1,000,000 Grade 4 Mathematics, Quarter 1, Unit 1.1 Extending Place Value with Whole Numbers to 1,000,000 Overview Number of Instructional Days: 10 (1 day = 45 minutes) Content to Be Learned Recognize that a digit

More information

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT PRACTICAL APPLICATIONS OF RANDOM SAMPLING IN ediscovery By Matthew Verga, J.D. INTRODUCTION Anyone who spends ample time working

More information

Transfer Learning Action Models by Measuring the Similarity of Different Domains

Transfer Learning Action Models by Measuring the Similarity of Different Domains Transfer Learning Action Models by Measuring the Similarity of Different Domains Hankui Zhuo 1, Qiang Yang 2, and Lei Li 1 1 Software Research Institute, Sun Yat-sen University, Guangzhou, China. zhuohank@gmail.com,lnslilei@mail.sysu.edu.cn

More information

GACE Computer Science Assessment Test at a Glance

GACE Computer Science Assessment Test at a Glance GACE Computer Science Assessment Test at a Glance Updated May 2017 See the GACE Computer Science Assessment Study Companion for practice questions and preparation resources. Assessment Name Computer Science

More information

Edexcel GCSE. Statistics 1389 Paper 1H. June Mark Scheme. Statistics Edexcel GCSE

Edexcel GCSE. Statistics 1389 Paper 1H. June Mark Scheme. Statistics Edexcel GCSE Edexcel GCSE Statistics 1389 Paper 1H June 2007 Mark Scheme Edexcel GCSE Statistics 1389 NOTES ON MARKING PRINCIPLES 1 Types of mark M marks: method marks A marks: accuracy marks B marks: unconditional

More information

Detailed course syllabus

Detailed course syllabus Detailed course syllabus 1. Linear regression model. Ordinary least squares method. This introductory class covers basic definitions of econometrics, econometric model, and economic data. Classification

More information

Chinese Language Parsing with Maximum-Entropy-Inspired Parser

Chinese Language Parsing with Maximum-Entropy-Inspired Parser Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,

More information

A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING

A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING A GENERIC SPLIT PROCESS MODEL FOR ASSET MANAGEMENT DECISION-MAKING Yong Sun, a * Colin Fidge b and Lin Ma a a CRC for Integrated Engineering Asset Management, School of Engineering Systems, Queensland

More information

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Wilny Wilson.P M.Tech Computer Science Student Thejus Engineering College Thrissur, India. Sindhu.S Computer

More information

Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers

Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers October 31, 2003 Amit Juneja Department of Electrical and Computer Engineering University of Maryland, College Park,

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining

Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Evaluation of Usage Patterns for Web-based Educational Systems using Web Mining Dave Donnellan, School of Computer Applications Dublin City University Dublin 9 Ireland daviddonnellan@eircom.net Claus Pahl

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape

Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Lip reading: Japanese vowel recognition by tracking temporal changes of lip shape Koshi Odagiri 1, and Yoichi Muraoka 1 1 Graduate School of Fundamental/Computer Science and Engineering, Waseda University,

More information

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections

Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Tyler Perrachione LING 451-0 Proseminar in Sound Structure Prof. A. Bradlow 17 March 2006 Intra-talker Variation: Audience Design Factors Affecting Lexical Selections Abstract Although the acoustic and

More information

Bluetooth mlearning Applications for the Classroom of the Future

Bluetooth mlearning Applications for the Classroom of the Future Bluetooth mlearning Applications for the Classroom of the Future Tracey J. Mehigan, Daniel C. Doolan, Sabin Tabirca Department of Computer Science, University College Cork, College Road, Cork, Ireland

More information

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria FUZZY EXPERT SYSTEMS 16-18 18 February 2002 University of Damascus-Syria Dr. Kasim M. Al-Aubidy Computer Eng. Dept. Philadelphia University What is Expert Systems? ES are computer programs that emulate

More information

Automatic segmentation of continuous speech using minimum phase group delay functions

Automatic segmentation of continuous speech using minimum phase group delay functions Speech Communication 42 (24) 429 446 www.elsevier.com/locate/specom Automatic segmentation of continuous speech using minimum phase group delay functions V. Kamakshi Prasad, T. Nagarajan *, Hema A. Murthy

More information

Evolutive Neural Net Fuzzy Filtering: Basic Description

Evolutive Neural Net Fuzzy Filtering: Basic Description Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:

More information

New Features & Functionality in Q Release Version 3.2 June 2016

New Features & Functionality in Q Release Version 3.2 June 2016 in Q Release Version 3.2 June 2016 Contents New Features & Functionality 3 Multiple Applications 3 Class, Student and Staff Banner Applications 3 Attendance 4 Class Attendance 4 Mass Attendance 4 Truancy

More information

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH Don McAllaster, Larry Gillick, Francesco Scattone, Mike Newman Dragon Systems, Inc. 320 Nevada Street Newton, MA 02160

More information

Radius STEM Readiness TM

Radius STEM Readiness TM Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and

More information

Session H1B Teaching Introductory Electrical Engineering: Project-Based Learning Experience

Session H1B Teaching Introductory Electrical Engineering: Project-Based Learning Experience Teaching Introductory Electrical Engineering: Project-Based Learning Experience Chi-Un Lei, Hayden Kwok-Hay So, Edmund Y. Lam, Kenneth Kin-Yip Wong, Ricky Yu-Kwong Kwok Department of Electrical and Electronic

More information

A Case Study: News Classification Based on Term Frequency

A Case Study: News Classification Based on Term Frequency A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center

More information

UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS. Heiga Zen, Haşim Sak

UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS. Heiga Zen, Haşim Sak UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS Heiga Zen, Haşim Sak Google fheigazen,hasimg@google.com ABSTRACT Long short-term

More information

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

AGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016

AGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016 AGENDA Advanced Learning Theories Alejandra J. Magana, Ph.D. admagana@purdue.edu Introduction to Learning Theories Role of Learning Theories and Frameworks Learning Design Research Design Dual Coding Theory

More information

A Reinforcement Learning Variant for Control Scheduling

A Reinforcement Learning Variant for Control Scheduling A Reinforcement Learning Variant for Control Scheduling Aloke Guha Honeywell Sensor and System Development Center 3660 Technology Drive Minneapolis MN 55417 Abstract We present an algorithm based on reinforcement

More information

UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation

UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation UTD-CRSS Systems for 2012 NIST Speaker Recognition Evaluation Taufiq Hasan Gang Liu Seyed Omid Sadjadi Navid Shokouhi The CRSS SRE Team John H.L. Hansen Keith W. Godin Abhinav Misra Ali Ziaei Hynek Bořil

More information

Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

More information

EECS 571 PRINCIPLES OF REAL-TIME COMPUTING Fall 10. Instructor: Kang G. Shin, 4605 CSE, ;

EECS 571 PRINCIPLES OF REAL-TIME COMPUTING Fall 10. Instructor: Kang G. Shin, 4605 CSE, ; EECS 571 PRINCIPLES OF REAL-TIME COMPUTING Fall 10 Instructor: Kang G. Shin, 4605 CSE, 763-0391; kgshin@umich.edu Number of credit hours: 4 Class meeting time and room: Regular classes: MW 10:30am noon

More information

Support Vector Machines for Speaker and Language Recognition

Support Vector Machines for Speaker and Language Recognition Support Vector Machines for Speaker and Language Recognition W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer, P. A. Torres-Carrasquillo MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information