BENEFIT OF MUMBLE MODEL TO THE CZECH TELEPHONE DIALOGUE SYSTEM

Luděk Müller, Luboš Šmídl, Filip Jurčíček, and Josef V. Psutka
University of West Bohemia, Department of Cybernetics, Univerzitní 22, 306 14 Pilsen, Czech Republic
muller@kky.zcu.cz, smidl5@kky.zcu.cz, filip@kky.zcu.cz, psutka_j@kky.zcu.cz

Abstract

This paper discusses the use of a mumble model in a Czech telephone dialogue system designed and constructed at the Department of Cybernetics, University of West Bohemia, and describes the benefits of the mumble model for speech recognition, namely for a rejection method. First, an overview of the Czech telephone dialogue system and its recognition engine is given. The recognition is based on a statistical approach. Triphones are used and modeled by three-state left-to-right HMMs with an output probability density function expressed as a multivariate Gaussian mixture. Stochastic regular grammars are used as a language model to reduce the task perplexity. Second, the mumble model is introduced as a recursive network of Czech phone HMMs connected in parallel, and an implementation of a rejection method and a key word spotting method, both based on the mumble model, is explained. Finally, the experimental results, yielding a 19.4% equal error rate (EER) for the rejection method and a 16.7% EER for the key word spotting method, are discussed.

1. Introduction

Current speech recognition systems usually work with a vocabulary of limited size. In addition, a finite state grammar is often used as a language model, which efficiently restricts the number of acceptable utterances. A problem therefore arises when an incoming utterance does not respect the recognition grammar rules. The simplest example is when the speaker says an out-of-vocabulary word.
In this case the speech recognition engine must not select any sentence from the sentence set defined by the grammar; instead, it should inform the application that no sentence matches the input utterance and that the recognition result is rejected. In this paper a new rejection method based on a time-local distance between the mumble score and the word score is presented. The next part of the article describes an implementation of a key word spotting method that uses a mumble model together with a finite state grammar. The mumble model is used here to capture and absorb the non-key-word parts of an utterance.

2. Dialogue System

Our dialogue system consists of three main parts: a speech engine, a dialogue manager, and a dialogue application. The dialogue application is a task-oriented module maintaining knowledge of the lexicon, the dialogue structure, etc., while the dialogue manager controls the communication between a user and the system. At present the speech engine contains only a speech recognition module; a speech synthesis module will be added to the engine in the near future. Figure 1 illustrates the dialogue system architecture.
Figure 1: Telephone dialogue system

3. Speech Recognition Engine

The core of the speech engine is implemented in C++ and designed to be platform independent. A platform-specific implementation layer was built for MS Windows NT/95/98/2000. Our goal was to design a fast recognition module without decreasing recognition accuracy. Several instances of the speech engine can operate on one PC in real time. Furthermore, each engine module can be implemented as a set of several tasks, each of them generally running as an individual process. Figure 2 shows how the tasks of the speech recognition engine cooperate.

3.1. Recognition Module

The recognition module incorporates a front end, an acoustic model, a language model (represented by a stochastic regular grammar), and a decoding block that searches for the best word sequence matching the incoming acoustic signal with respect to the grammar. As mentioned above, the recognition module is split into three tasks: the front end, the labeler, and the decoder. The front end is responsible for converting the continuous acoustic speech signal into a sequence of feature vectors. The digitization of the input analog telephone signal and/or the generation of a synthesized speech signal are provided by a telephone interface board. The front end is currently equipped with a DIALOGIC D/21D board that supports two telephone lines. This enables us to run two speech recognition engines at the same time on one computer. The speech signal is first digitized (by the DIALOGIC board) at a sampling rate of 8 kHz. A 25 ms Hamming window shifted in 10 ms steps and a pre-emphasis factor of 0.97 are then used to calculate 13 mel frequency cepstral coefficients (MFCCs) (including the c(0) coefficient) and their first-order and second-order derivatives. To make the telephone speech recognition more robust we use a RASTA-like band-pass filter [7] that suppresses slowly varying channel distortions.
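The behavior of such a filter can be illustrated with a short sketch. This is an illustrative Python fragment, not the engine's C++ code; it applies the recursion y(n) = G · Σ_{i=0}^{4} (i − 2) x(n + i) + μ y(n − 1) with G = 0.1 and μ = 0.94 to a single cepstral trajectory, and the end-of-signal padding is an assumption of this sketch (the paper does not state how the final frames are handled):

```python
import numpy as np

def rasta_filter(x, G=0.1, mu=0.94):
    """RASTA-like band-pass filter for one cepstral coefficient trajectory.

    Implements y(n) = G * sum_{i=0}^{4} (i - 2) * x(n + i) + mu * y(n - 1).
    The FIR part needs a 4-frame look-ahead, so the trajectory is padded
    at the end by repeating the last value (an assumption of this sketch).
    """
    T = len(x)
    xp = np.concatenate([np.asarray(x, dtype=float),
                         np.repeat(float(x[-1]), 4)])
    y = np.zeros(T)
    prev = 0.0
    for n in range(T):
        # FIR differentiator with coefficients (-2, -1, 0, 1, 2)
        fir = sum((i - 2) * xp[n + i] for i in range(5))
        # leaky integrator suppresses slowly varying (near-DC) components
        prev = G * fir + mu * prev
        y[n] = prev
    return y
```

A constant trajectory (a pure channel offset) is mapped to zero, which is exactly the suppression of slowly varying channel distortions described above; in the real front end the same recursion would be run over each of the first 13 MFCC trajectories.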
The RASTA filter is applied to the first 13 MFCCs and can be described by the equation

y(n) = G · Σ_{i=0}^{4} (i − 2) x(n + i) + μ y(n − 1),

where x(n) and y(n) are the input and the output signal respectively, n = 0, 1, ..., T − 1, and T is the number of frames. The parameters μ and G are set to μ = 0.94 and G = 0.1, respectively. The front end also contains a silence/speech detector.

Figure 2: Speech recognition engine

If an observation feature vector is
marked as silence, the labeler task and the decoder task wait for speech data. The labeler is responsible for computing a large number of log likelihood scores (LLSs) of the observation feature vector. There are 2510 distinct tied states. Each state is represented in a 39-dimensional space (13 MFCCs + 13 first-order derivatives + 13 second-order derivatives). To implement a real-time recognizer it is important to reduce the number of LLS calculations, which leads to an approximate computation of LLSs. We solve this problem by applying a new method that establishes relatively exactly (in the original space of dimension 39) the first 50 or 150 best (most probable) LLSs. The proposed method efficiently exploits relevant statistical properties of the Gaussian mixture densities, combining them with an a priori hit technique and the k-NN method. This approach allows more than a 90% reduction of the computational cost without a substantial decrease of recognition accuracy. The decoder is responsible for finding the best word sequence that matches the incoming acoustic signal. The decoder uses a cross-word context-dependent HMM state network. The whole network consists of one or more connected grammars (the connections are generally made at run time). A considerable part of the net is generated before the decoder starts, but every part of the net can be generated on demand at run time. The decoder uses a Viterbi search technique with efficient beam pruning.

3.2. The Mumble Model

The mumble model is constructed as a set of HMMs connected in parallel. Each HMM is a three-state left-to-right model that represents one context-independent phone. The structure of the mumble model is depicted in Figure 3. The probability of emitting an observation vector in a given state is evaluated as the maximal emission probability over all corresponding states of the context-dependent triphones. Thus neither additional HMM models nor additional training is required.
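The state tying described above can be sketched as follows. This is an illustrative Python fragment; the function names and the dictionary-based labeler interface are assumptions of this sketch, not the engine's actual API:

```python
def mumble_state_llk(frame_llks, triphone_states):
    """Log likelihood of one mumble-model (monophone) state for one frame.

    frame_llks      -- mapping from tied-state id to the log likelihood of
                       the current feature vector (the labeler's output)
    triphone_states -- ids of all context-dependent triphone states that
                       correspond to this context-independent phone state
    The score is simply the maximum over the tied triphone states, so no
    extra acoustic models and no additional training are needed.
    """
    return max(frame_llks[s] for s in triphone_states)


def mumble_frame_score(frame_llks, phone_state_map):
    """Overall mumble-model score for one frame: the maximum over all
    monophone states (this per-frame maximum is what the rejection
    method later compares against the recognition-network score).

    phone_state_map -- mapping from (phone, state index) to the list of
                       corresponding triphone tied-state ids
    """
    return max(mumble_state_llk(frame_llks, states)
               for states in phone_state_map.values())
```

Because the mumble score reuses the triphone state likelihoods already computed by the labeler, evaluating it adds only max operations per frame on top of the normal decoding pass.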
The value of the backward loop probability BPr controls the length of the phone sequence recognized by the network in Figure 3: a higher value of BPr produces more insertions, while a smaller value induces more deletions in the resulting phone sequence.

Figure 3: Mumble model

3.3. The Rejection Method

At each time frame, the log likelihood score of the mumble model is evaluated as the maximum log likelihood score over all mumble model states. Similarly, the log likelihood score of the recognition network is taken as the maximum log likelihood score over all network model states. Then the difference between these two maximal values is computed and saved into a buffer keeping the score differences of the last N time frames. Let M be a suitably chosen value, 0 < M < N. At each time of recognition the difference between the last buffer element B[t] and the buffer
element B[t − M] is evaluated and compared to a predefined threshold. If the threshold is exceeded, the recognition result is rejected.

3.4. The Key Word Spotting Method

The decoder can also run in a key word spotting mode. The recognition network contains a set of key words connected in parallel and a mumble model, also connected to the key word network in parallel. An example of a key word recognition network with a mumble model is depicted in Figure 4.

Figure 4: Key word recognition network

During recognition a Viterbi search finds the best path through the recognition network. For any non-key-word part of an utterance the mumble model should have a better acoustic match score than any key word model; thus a mumble word will be assigned to the non-key-word part of the utterance. In this way the mumble model catches the non-key-word parts. Finally, the mumble words are omitted from the resulting recognized word sequence and only the key words remain in the output.

4. Experimental Results

The proposed methods have been tested using two Czech telephone databases [5]. A finite state grammar was used as the language model in all experiments. The rejection method was tested on a telephone yellow pages database. The speech corpus comprised 357 speakers and 357 utterances (each utterance was spoken by a different speaker). We used a vocabulary of 577 words and a grammar that accepted 716 different two-word sentences (persons' names). In order to test the case when an out-of-vocabulary phrase is spoken, a test was also performed with a grammar that does not accept any utterance from the test speech corpus. Both grammars were tested with the test speech corpus, and the false acceptance and false rejection error rates were obtained for different values of the rejection threshold. The results are shown in Figure 5. The word error rate is 5.4%; this rather high value is caused by the relatively small width of the beam pruning during recognition. The intersection point between the false acceptance curve and the false rejection curve denotes the EER (equal error rate).

Figure 5: Rejection result (716 sentences) — false rejection and false acceptance error rates versus the rejection threshold

The EER is 19.4%. Furthermore, we performed tests with the same test corpus on a persons'-names grammar with a larger vocabulary and found that the EER increases with the number of sentences accepted by the recognizer and with the vocabulary size. An EER of 25.1% was achieved for a vocabulary containing 2864 sentences. The key word spotting method was tested using the Czech telephone database from the economic domain. 50 words were chosen as key words and 97 utterances from different speakers were used as the test speech corpus. The average utterance length is 14 words and the utterances together contain 450 different words. The results for different values of the backward loop probability are shown in Figure 6 (EER = 16.7%).

Figure 6: Key word spotting result — key-phrase deletion and key-phrase insertion error rates versus the backward loop transition cost (−log likelihood of BPr)

5. Conclusions

This paper describes a mumble model method incorporated into the Czech telephone dialogue system and discusses its benefits to speech recognition. Test results for both the rejection of out-of-grammar utterances and the key word spotting method are given. In our experiments with the rejection technique, the EERs for a small vocabulary (716 sentences) and for a large vocabulary (2864 sentences) were 19.4% and 25.1%, respectively. The results for the key word spotting method showed an EER of 16.7%.

6. Acknowledgments

This work was supported by the Ministry of Education of the Czech Republic, projects no. LN00B096 and MSM235200004.

7. References

[1] Lin, Q., Das, S., Lubensky, D., Picheny, M.: A new confidence measure based on rank ordering subphone scores, In: ICSLP 1998, Sydney.
[2] Neti, C., Roukos, S., Eide, E.: Confidence measure as a guide for stack search in speech recognition, In: ICASSP96, pp. 883–887, Germany.
[3] Lin, Q., Lubensky, D., Roukos, S.: Use of recursive mumble models for confidence measuring, In: Eurospeech99, pp. 53–56, Budapest.
[4] Young, S.J., Russell, N.H., Thornton, J.H.S.: Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems, Cambridge University Engineering Department, July 31, 1989.
[5] Radová, V., Psutka, J., Šmídl, L., Vopálka, P., Jurčíček, F.: Czech Speech Corpus for Development of Speech Recognition Systems, In: Proceedings of the Workshop on Developing Language Resources for Minority Languages, Athens, 2000.
[6] Müller, L., Psutka, J., Šmídl, L.: Design of Speech Recognition Engine, In: Text, Speech and Dialogue 2000 (TSD 2000), 3rd International Workshop, Brno, Czech Republic, 2000.
[7] Han, J., Han, M., Park, G.B., Park, J., Gao, W.: Relative Mel Frequency Cepstral Coefficients Compensation for Robust Telephone Speech Recognition, In: Eurospeech97, pp. 1531–1534.