Data-Driven Approach to Designing Compound Words for Continuous Speech Recognition


IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 4, MAY 2001, p. 327

Data-Driven Approach to Designing Compound Words for Continuous Speech Recognition
George Saon and Mukund Padmanabhan, Senior Member, IEEE

Abstract: In this paper, we present a new approach to deriving compound words from a training corpus. The motivation for creating compound words is that, under some assumptions, speech recognition errors occur less frequently in longer words. Furthermore, compound words also enable more accurate modeling of pronunciation variability at the boundary between adjacent words in a continuously spoken utterance. We introduce a measure based on the product of the direct and the reverse bigram probability of a pair of words for finding candidate pairs from which to create compound words. Our experimental results show that by augmenting both the acoustic vocabulary and the language model with these new tokens, the word recognition accuracy can be improved by an absolute 2.8% (7% relative) on a voicemail continuous speech recognition task. We also compare the proposed measure for selecting compound words with other measures that have been described in the literature.

I. INTRODUCTION

One of the observations that can be made in speech recognition systems is that short words are more frequently misrecognized. This is indicated in Fig. 1, which plots the number of errors made on all words of a specified length (length being defined as the average number of phones in the baseforms of the words). The results for this figure were obtained by decoding the training data of the voicemail corpus (representing 40 h of spontaneous telephone speech) in the following way. Two language models were trained, one from the transcriptions of the first 20 h (LMa) and the second from the transcriptions of the last 20 h (LMb). The first 20 h of the training data were then decoded using LMb and the last 20 h with LMa.
These results are intuitively understandable: in a longer phone sequence, more errors must be made in order to get the word wrong. Consider the different words in the vocabulary as sequences of phones, and make the following assumptions: 1) no phone sequence in the vocabulary is a subset of any other phone sequence in the vocabulary; 2) the probability of error is the same, p, for all phones; 3) a majority of the phones in a baseform must be erroneously decoded for the word to be wrong. Then the probability of making an error in a word with a baseform of length n is given by

P_e(n) = Σ_{k > n/2} (n choose k) p^k (1 − p)^{n−k}.

For values of p around 0.3 (which is consistent with what we observed in the training data), P_e(n) can be seen to decrease as n increases, implying that longer words are less frequently misrecognized (with the exception of phone lengths between six and nine, where the tendency seems to be reversed).

[Fig. 1. Word error rate versus word length (expressed as number of phones in a word).]

The second observation is that the pronunciation variability of words is greater in spontaneous, conversational speech than in carefully read speech, where the uttered words are closer to their canonical representations (baseforms). One can argue that, by increasing the vocabulary of alternate pronunciations of words (the acoustic vocabulary), most of the speech variability can be captured in the spontaneous case.

Manuscript received October 19, 1999; revised November 29, 2000. This work was supported in part by DARPA under Grant MDA972-97-C-0012. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jerome R. Bellegarda. The authors are with the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598 USA (e-mail: saon@watson.ibm.com). Publisher Item Identifier S 1063-6676(01)02736-5.
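Under the three assumptions above, the misrecognition probability is the tail of a binomial distribution. A minimal sketch (the function name and the toy check are ours, not the paper's):

```python
from math import comb

def word_error_prob(p: float, n: int) -> float:
    """Probability that a word with an n-phone baseform is misrecognized,
    assuming each phone is independently wrong with probability p and a
    strict majority of phones must be wrong for the word to be wrong."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With p = 0.3, the error probability shrinks as words get longer
# (comparing odd lengths; the majority cutoff makes parity matter):
odd = [word_error_prob(0.3, n) for n in (1, 3, 5, 7, 9)]
assert all(a > b for a, b in zip(odd, odd[1:]))
```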
However, an increase in the number of alternate pronunciations is usually accompanied by an increase in the confusability between words, since different words can end up having close or even identical pronunciation variants. Most coarticulation effects arise at the boundary between adjacent words and result in alterations of the last phones of the first word and the first few phones of the second word. One method of modeling these changes is the use of crossword phonological rewriting rules, as proposed in [5]; this provides a systematic way of taking into account coarticulation phenomena such as geminate or plosive deletion (e.g., WENT TO → W EH N T UW), palatalization (e.g., GOT YOU → G AO CH AX), etc. An alternative way of dealing with coarticulation effects at word boundaries is to merge specific pairs of words into single compound words (also called multi-words [3], phrases [6], [8], [10], [11], or sticky pairs [2]) and to provide special coarticulated pronunciation variants for these new tokens. For instance, frequently occurring pairs such as KIND OF, LET ME, LET YOU can be viewed as single words (KIND-OF, LET-ME, LET-YOU) which are often pronounced K AY N D AX, L EH M IY, or L EH CH AX, respectively.
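The compound-word idea can be pictured as a lexicon in which every token, including a merged token, carries one or more pronunciation variants. A hypothetical sketch using the examples above (the `variants` helper and the phone strings for the concatenated baseforms are our illustration, not the paper's lexicon):

```python
# Hypothetical mini-lexicon: each token maps to its pronunciation
# variants (phone strings). Compound tokens carry both a concatenated
# baseform and a coarticulated variant, as in the examples above.
lexicon = {
    "KIND":    ["K AY N D"],
    "OF":      ["AX V"],
    "KIND-OF": ["K AY N D AX V",   # concatenation of the constituents
                "K AY N D AX"],    # coarticulated variant from the text
    "LET":     ["L EH T"],
    "YOU":     ["Y UW"],
    "LET-YOU": ["L EH T Y UW",
                "L EH CH AX"],     # palatalized variant from the text
}

def variants(token: str) -> list[str]:
    """All pronunciation variants known for a token (empty if unknown)."""
    return lexicon.get(token, [])
```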

In this paper, we present a new approach to deriving compound words from a training corpus. Compound words have a fortiori longer phone sequences than their constituents; consequently, one would expect them to be misrecognized less frequently. Furthermore, they also enable more accurate modeling of pronunciation variability at the boundary between adjacent words in a continuously spoken utterance. We suggest and experiment with a number of acoustic and linguistic measures to select these compound words, and present results indicating that up to a 7% relative improvement can be obtained by adding a small number of compound words to the vocabulary. The rest of the paper is organized as follows. In Section II, we investigate the effect of adding compound words to the language model and describe the various measures that we used for deriving compound words. In Section III, we discuss the experiments and results. Concluding remarks are presented at the end of the paper.

II. MEASURES FOR DERIVING COMPOUND WORDS

Though the motivation for adding compound words to the vocabulary is clear, as mentioned previously, adding more tokens or pronunciation variants to the acoustic vocabulary and/or the language model could increase the confusability between words. Hence, the candidate pairs for compound words have to be chosen carefully in order to avoid this increase. Intuitively, such a pair has to meet several requirements [9].

1) The pair of words has to occur frequently in the training corpus. There is no gain in adding a pair with a low count to the vocabulary, since the chances of encountering that pair during the decoding of unseen data will be low. Moreover, the compound word issued from this pair will contribute to the acoustic confusability with other words which are more likely according to the language model.
2) The words within the pair have to occur frequently together, and only rarely in the context of other words. This requirement is necessary since one very frequent word, say w, can be part of several different frequent pairs, say (w, u1), (w, u2), etc. If all these pairs were added to the vocabulary, then the confusability between w and the pair (w, u1) or (w, u2) would be increased, especially if w has a short phone sequence. This will result in insertions or deletions of the word w when incorrectly decoding one of the pairs, or vice versa. A concrete example is given by the function word THE, which can occur in numerous different contexts (such as IN-THE, OF-THE, ON-THE, AT-THE, etc.), all of which are frequent.

3) The words should ideally present coarticulation effects at the juncture, i.e., their continuous pronunciation should differ from their pronunciation when uttered in isolation. Unfortunately, this requirement is not always compatible with the previous ones; in other words, the word pairs which have strong coarticulation effects do not necessarily occur very often, nor do the individual words occur only together. Consider, for instance, the sequence BYE-BYE, often pronounced B AX B AY, which is relatively rare in our database, whereas the individual word BYE appears in most voicemail messages.

The use of compound words has been suggested by several researchers and has been shown to improve speech recognition performance for various tasks [1], [3], [6], [8], [10]-[12]. We will make further references to the different approaches throughout this paper as we examine some possible metrics for selecting compound words. These measures can be broadly classified into language-model-oriented and acoustic-oriented measures, depending on whether the information being used is entirely textual or includes acoustic confusability such as phone recognition rate or coarticulated versus non-coarticulated baseform (or word pronunciation variant, or lexeme) recognition rate.
A. Effect of Compound Words on the Language Model

Before describing the methods for selecting the compound words, it is instructive to see what effect the addition of these words has on the language model. Let us assume that the lexicon has been constructed, with the compound words selected according to some measure, and examine the effect on the language model. Language models are generally characterized by the log likelihood of the training data and the perplexity. The log likelihood of a sequence of words w_1, ..., w_N (representing the training data) can be obtained simply as

log P(w_1, ..., w_N) = Σ_{i=1}^{N} log P(w_i | w_1, ..., w_{i-1}).   (1)

The usual n-gram assumption limits the number of terms in the conditioning in (1) to n − 1. Hence, the log likelihood of the training data assuming a unigram or a bigram model would be, respectively,

L_1 = Σ_{i=1}^{N} log P(w_i),   L_2 = Σ_{i=1}^{N} log P(w_i | w_{i-1}).   (2)

We can also define an average log likelihood per word as L̄ = L/N; for the bigram model this may also be written as

L̄ = (1/N) Σ_{i=1}^{N} log P(w_i | w_{i-1}).   (3)

Hence, the average log likelihood per word is related to the conditional entropy of w_i given w_{i-1}. The perplexity of the language model is defined in terms of the inverse of the average log likelihood per word [7]. It is an indication of the average number of words that can follow a given word (a measure of the predictive power of the language model). Hence,

Perplexity = e^{−L̄}.   (4)

1) Unigram Model, Difference in Log Likelihood: Consider the probability of a sequence of two words w_1 and w_2. The probability of this word sequence assuming a unigram language model is given by

P(w_1 w_2) = P(w_1) P(w_2).   (5)

Now consider replacing the pair of words w_1 and w_2 in the original lexicon with the compound word w_12. The likelihood of the word sequence becomes

P(w_12).   (6)

Comparing (5) and (6), the difference in log probability is given by

log P(w_12) − log P(w_1) − log P(w_2) = log [ P(w_1, w_2) / (P(w_1) P(w_2)) ].   (7)

This can be seen to represent the mutual information between the words w_1 and w_2, and it forms the basis of the first linguistic measure. A similar discussion of the link between the likelihood and the average mutual information between adjacent classes is provided by Brown et al. in [2].

2) Bigram Model, Difference in Log Likelihood: An analogous reasoning can be applied in the case of a bigram language model by considering the probability of a sequence of three words w_1, w_2, w_3 conditioned on a preceding word w_0, when the pair (w_1, w_2) is turned into the compound w_12. The probability of this word sequence assuming a bigram language model is

P(w_1 w_2 w_3 | w_0) = P(w_1 | w_0) P(w_2 | w_1) P(w_3 | w_2).   (8)

As before, replacing the pair of words w_1 and w_2 in the original lexicon with the compound word w_12 changes the likelihood of the word sequence as follows:

P(w_12 w_3 | w_0) = P(w_12 | w_0) P(w_3 | w_12).   (9)

Comparing (8) and (9), the difference in log likelihood is

log P(w_12 | w_0) + log P(w_3 | w_12) − log P(w_1 | w_0) − log P(w_2 | w_1) − log P(w_3 | w_2).   (10)

Substituting

P(w_12 | w_0) = P(w_1 | w_0) P(w_2 | w_0, w_1),   P(w_3 | w_12) = P(w_3 | w_1, w_2)   (11)

we get

log [ P(w_2 | w_0, w_1) P(w_1 | w_2, w_3) / ( P(w_2 | w_1) P(w_1 | w_2) ) ]   (12)

i.e., the compound word has the effect of incorporating a trigram dependency into a bigram language model. The denominator in (12) is the product of the forward and the reverse bigram probability of w_1 and w_2, and the numerator is the product of the forward and the reverse trigram probability of w_1 and w_2.

B. Language Model Measures

The first measure that we consider is the mutual information between two consecutive words [3], [6], [11], [12], which is defined as

MI(w_1, w_2) = log [ P(w_1, w_2) / (P(w_1) P(w_2)) ].   (13)

From (7), this choice of compound words may be seen to be motivated by the desire to maximize the difference in log likelihood of the training data between the two lexicons when a unigram model is used. A weighted variant of the mutual information was proposed in [2] as a criterion for finding sticky pairs. Most authors, however, use MI in its unweighted form; that is, they choose the pairs so as to maximize the mutual information between the words regardless of the frequency of the pairs (see, for example, [6] and [12]). In [11], MI is used only to select candidate pairs; the final decision of turning pairs into compound words is made based on bigram perplexity reduction.

The second measure that we propose is based on defining a direct bigram probability between the words w_1 and w_2 as P(w_2 | w_1), and a reverse bigram probability as P(w_1 | w_2). The reverse bigram probability as a standalone measure has been mentioned in [10] (called backward bigram) and in [1] (called left probability). Both the direct and the reverse bigrams can be simply estimated from the training corpus as

P(w_2 | w_1) = C(w_1, w_2) / C(w_1),   P(w_1 | w_2) = C(w_1, w_2) / C(w_2)   (14)

where C(·) denotes a count in the training corpus. The measure that we used is the geometric average of the direct and the reverse bigram:

B(w_1, w_2) = sqrt( P(w_2 | w_1) P(w_1 | w_2) ).

This measure has also been independently introduced in [1] (called mutual probability) and is similar to the correlation coefficient proposed recently by Kuo [8], which can be written as

ρ(w_1, w_2) = P(w_1, w_2) / [ (P(w_1) + P(w_2)) / 2 ].

The similarity between the two arises from the fact that both divide the joint probability by a mean of the marginals, with the main difference lying in the choice of an arithmetic versus a geometric mean of the marginals. Note that 0 ≤ B(w_1, w_2) ≤ 1 for every pair of words. A high value for B(w_1, w_2) means that both the

direct and the reverse bigrams are high for (w_1, w_2); in other words, the probability that w_1 is followed by w_2 and the probability that w_2 is preceded by w_1 are both high, which makes the pair a good candidate for a compound word according to our second requirement. In our implementation, we selected all pairs of words for which this measure is greater than a fixed threshold and for which the raw count of the word pair exceeds another predefined threshold.

It may be seen that the mutual information measure has much in common with the bigram product measure. Intuitively, a high mutual information between two words means that they occur often together in the training corpus (the pair count is comparable with the individual counts), and in this sense it is similar to the bigram product measure. However, the bigram product measure imposes an additional constraint: it not only requires w_1 and w_2 to occur together, but also prevents them from occurring in conjunction with other words. Further, from (12), it is not apparent that the log likelihood improves with the use of compound words chosen by B, because this measure maximizes the denominator term of the likelihood difference. The log likelihood is generally directly related to the perplexity; however, perplexities cannot be compared for language models with different vocabularies. Some authors suggest the use of a normalized perplexity (where the average log likelihood of the training data is computed with respect to the original number of words [1], [11]) and even design the compound words so as to directly optimize this quantity [8], [10], [11]. This turns out to be equivalent to increasing the total likelihood of the training corpus.

C. Acoustic Measures

Neither the bigram product measure nor the mutual information takes into account coarticulation effects at word boundaries, since they are language-model-oriented measures.
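Both language model measures, the mutual information and the bigram product, reduce to simple corpus counts. A self-contained sketch under relative-frequency estimation (the function and the toy corpus are our illustration):

```python
from math import log, sqrt
from collections import Counter

def pair_measures(words):
    """For every adjacent word pair, estimate the mutual information
    MI = log(P(w1,w2) / (P(w1)P(w2))) and the bigram product measure
    B = sqrt(P(w2|w1) * P(w1|w2)), using relative-frequency counts."""
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    n = len(words)
    out = {}
    for (w1, w2), c in bi.items():
        p12 = c / (n - 1)
        mi = log(p12 / ((uni[w1] / n) * (uni[w2] / n)))
        b = sqrt((c / uni[w1]) * (c / uni[w2]))
        out[(w1, w2)] = (mi, b)
    return out

corpus = "let me know let me see of the in the let me".split()
m = pair_measures(corpus)
# "let" and "me" only ever occur together, so B("let","me") is maximal;
# "the" also follows "in", so B("of","the") is lower.
```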
These coarticulation effects have to be added explicitly for the pairs which become compound words according to these metrics, either by using phonological rewriting rules or by manually designing coarticulated baseforms where appropriate. The second part of our study is centered around the use of explicit acoustic information when designing compound words.

The first acoustic measure deals explicitly with coarticulation phenomena and can be summarized as follows. For the pairs of words in the training corpus which present such phenomena, according to the applicability of at least one phonological rewriting rule [5], one can compare the number of times that a coarticulated baseform for the pair is preferred over a concatenation of the non-coarticulated individual baseforms of the words forming that pair. This can be estimated by doing a Viterbi alignment of all instances of the word pair in the training data, once with the coarticulated pair baseform and once with the concatenation of individual baseforms, and selecting the baseform with the higher acoustic score. If C_coart denotes the number of times that the coarticulated baseform is preferred, and C_concat the number of times that the concatenated baseform is preferred, the measure is defined as the ratio between these two counts:

A1(w_1, w_2) = C_coart / C_concat.

If this ratio is bigger than a threshold (which is set in practice to 1), then the pair is turned into a compound word.

The second acoustic measure is more related to the acoustic confusability of a word. Let us assume that word w_1 has a low probability of correct classification. One would expect that, by tying w_1 to a word w_2 which has a higher phone classification accuracy, the compound word w_1-w_2 (or w_2-w_1) would have a higher classification accuracy. The second measure that we define computes a quantity related to the probability of correct classification for the different pronunciation variants of the compound word.
We first define a probability of correct classification for a word w, P_c(w), as

P_c(w) = (1 / |baseform(w)|) Σ_{phone ∈ baseform(w)} P_c(phone)

where P_c(phone) denotes the probability of correct classification for a phone (computed by decoding the training data and counting the number of times that the phone was correctly recognized), and |baseform(w)| denotes the number of phones in the baseform. The second acoustic measure, A2, scores a word pair by the correct-classification probability of the resulting compound word; the word pairs that maximize this measure are then selected as compound words.

III. EXPERIMENTS AND RESULTS

All the experiments were performed on a telephony voicemail database comprising about 40 h of speech [9]. The language model is a conventional linearly interpolated trigram model [7] and was trained on approximately 400 K words of text. The effect of adding compound words was to increase the span of the LM beyond trigrams. We have not attempted, however, to compare a weaker LM (say, a bigram LM) augmented with compound words against the corresponding trigram or n-gram LM, as was suggested by one reviewer. The size of the acoustic vocabulary for the application is 14 K words. The results are reported on a set of 43 voicemail messages (roughly 2000 words).

The experimental setup is as follows. We started with a vocabulary that had no compound words, and applied every measure iteratively to increase the number of compound words in the vocabulary. After one iteration, the word pairs that scored more than a threshold were transformed into compound words, and all instances of the pairs in the training corpus were replaced by these new words. Both the acoustic vocabulary and the language model vocabulary were augmented with these words after each step. In the following tables, underlined compound words indicate that coarticulated baseforms for those compound words were also added to the acoustic vocabulary, and a separate column indicates the number of compound words added to the vocabulary during the current iteration.
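One iteration of the setup just described, scoring pairs, merging those above the thresholds, and rewriting the corpus, can be sketched as follows (a toy implementation; the thresholds here are illustrative, not the paper's values on the 40 h corpus):

```python
from collections import Counter
from math import sqrt

def merge_pairs(words, b_threshold=0.2, count_threshold=2):
    """One selection iteration: score each adjacent pair with
    B = sqrt(P(w2|w1) * P(w1|w2)), turn every pair that clears both the
    score and the raw-count threshold into a compound token, and rewrite
    the corpus left to right."""
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    chosen = {p for p, c in bi.items()
              if c >= count_threshold
              and sqrt((c / uni[p[0]]) * (c / uni[p[1]])) > b_threshold}
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in chosen:
            out.append(words[i] + "-" + words[i + 1])  # merge the pair
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out, chosen

corpus = "kind of fun let me know let me see kind of nice".split()
rewritten, chosen = merge_pairs(corpus)
```

Repeating the call on the rewritten corpus reproduces the iterative growth of the compound-word vocabulary described above.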
We will first describe the results for the bigram product measure B, as this gave us the best performance.

TABLE I: RECOGNITION SCORES AND PERPLEXITIES FOR THE MEASURE B. TABLE II: RECOGNITION SCORES AND PERPLEXITIES FOR THE MEASURES MI, A1, AND A2.

For the second language model measure, which was based on the product of the direct and the reverse bigram, the threshold was chosen to be 0.2, i.e., if B(w_1, w_2) > 0.2, then (w_1, w_2) would be made a compound word. This threshold was chosen so as to end up with approximately the same number of compound words as in the case where they were designed by hand. Table I summarizes the number of new compound words obtained after each iteration, examples of such words, and the word error rate as well as the perplexity of the test set (the normalized perplexity is denoted by a *). The last line of Table I also indicates the beneficial effect of adding coarticulated baseforms to the vocabulary, even when the compound words are chosen strictly based on a linguistic measure. The only difference between Iterations 3 and 3b in Table I is that in the former case, baseforms were added to the vocabulary to account for the coarticulation in the selected compound words, whereas in the latter case (3b), the baseforms were simply a concatenation of the baseforms of the individual components. This seems to indicate that though a significant gain can be obtained by selecting compound words based only on a linguistic measure, the gain can be further enhanced by allowing for a coarticulated pronunciation of these selected compound words.

For the remaining measures (MI, A1, and A2), the thresholds were set so as to obtain the same number of words (or pairs) after each iteration as for the B case. We believe that this facilitates a fair comparison between the performances of the different measures. The threshold on the pair count was set to 100 for two of the measures and to 300 for the remaining one. The performances of these measures are illustrated in Table II.
It may be seen that there is virtually no improvement from using any of these other measures. The bigram product measure B outperforms the mutual information metric because MI seems to pick words which co-occur frequently (i.e., the first condition in Section II) without paying heed to whether the same constituent words also co-occur frequently with other words (the second condition in Section II). Another observation from Tables I and II is that, for the same number of pairs after the first iteration (42), the difference in perplexity is significant between the language models based on MI and B. Surprisingly, the better performance is obtained for the language model with the higher perplexity.(1)

The poor performance of the acoustic measures can be explained by the fact that neither A1 nor A2 takes into account word pair frequency information. Moreover, there is no measure of the degree of stickiness of a pair as in the case of the language-model-oriented measures (by stickiness, we mean frequency of co-occurrence of the word pair, i.e., word w_1 tends to stick to word w_2). This tends to increase the acoustic confusability between words in the vocabulary, since a frequent word can now be part of many pairs.

Finally, Table III shows the performance of a set of 58 manually designed compound words suited to the voicemail recognition task. It is generally the case that tuning the speech recognition system to a particular task (for instance, by manually selecting the compound words) does tend to improve performance on that task; however, this represents a tedious and time-consuming process. Consequently, it is encouraging to see that the statistically derived measure (which can be implemented relatively easily on a new task) is able to approach the same performance, even though it uses a few more compound words.

(1) As was pointed out by the reviewers, perplexity cannot really be compared across different vocabularies. The normalized perplexity (also shown in the tables) is supposedly a better indicator of task complexity in this case, but our results did not seem to indicate any great correlation between the word error rate and the normalized perplexity either.

TABLE III: PERPLEXITY AND RECOGNITION PERFORMANCE USING MANUALLY DESIGNED COMPOUND WORDS.

IV. DISCUSSION

In this paper, we experimented with a number of methods to design compound words to augment the vocabulary of a speech recognition system. The motivation for combining pairs of words to form compound words is twofold: 1) experimental observations indicate that longer phone sequences are less likely to be misrecognized, and 2) compound words enable crossword coarticulation effects to be easily modeled. We experimented with both linguistic and acoustic measures in selecting these compound words. The linguistic measures were related to the mutual information between word pairs and a new measure, the product of the forward and reverse bigram probability of the word pair. The acoustic measures were based on whether the word pair had a significant amount of crossword coarticulation. Our experimental results indicated that the second linguistic measure was particularly useful in selecting compound words. Even though we found that selecting compound words on the basis of acoustic measures was not useful, we found that when the compound words were selected based on the linguistic measure, it was beneficial to add coarticulated baseforms where necessary for the selected compound words. Experimental results show an overall improvement in word error rate of 7% (relative), achieving performance comparable to manual design of compound words. The main conclusion that can be drawn is that effective metrics for designing compound words should depend upon language model information such as the frequency of pairs and the degree of closeness of a pair (how often the words of a pair occur together). Once the pairs have been found, modeling the coarticulation effects at word boundaries within the pairs (where applicable) may further improve the overall performance.

REFERENCES

[1] C. Beaujard and M. Jardino, "Language modeling based on automatic word concatenations," in Proc. Eurospeech '99, Budapest, Hungary, 1999.
[2] P. F. Brown, V. J. Della Pietra, P. V. DeSouza, J. C. Lai, and R. L. Mercer, "Class-based n-gram models of natural language," Comput. Linguist., vol. 18, no. 4, pp. 467-477, 1992.
[3] M. Finke and A. Waibel, "Speaking mode dependent pronunciation modeling in large vocabulary conversational speech recognition," in Proc. Eurospeech '97, Rhodes, Greece, 1997.
[4] M. Finke, "Flexible transcription alignment," in Proc. 1997 IEEE Workshop on Speech Recognition and Understanding, Santa Barbara, CA, 1997.
[5] E. P. Giachin, A. E. Rosenberg, and C. H. Lee, "Word juncture modeling using phonological rules for HMM-based continuous speech recognition," Comput. Speech Lang., vol. 5, pp. 155-168, 1991.
[6] E. P. Giachin, "Phrase bigrams for continuous speech recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Detroit, MI, 1995, pp. 225-228.
[7] F. Jelinek, Statistical Methods for Speech Recognition, Language, Speech and Communication Series. Cambridge, MA: MIT Press, 1999.
[8] H. K. J. Kuo and W. Reichl, "Phrase-based language models for speech recognition," in Proc. Eurospeech '99, Budapest, Hungary, 1999.
[9] M. Padmanabhan, G. Saon, S. Basu, J. Huang, and G. Zweig, "Recent improvements in voicemail transcription," in Proc. Eurospeech '99, Budapest, Hungary, 1999.
[10] K. Ries, F. D. Buo, and A. Waibel, "Class phrase models for language modeling," in Proc. Int. Conf. Spoken Language Processing '96, Philadelphia, PA, 1996.
[11] B. Suhm and A. Waibel, "Toward better language models for spontaneous speech," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing '94, Yokohama, Japan, 1994.
[12] I. Zitouni, J. F. Mari, K. Smaili, and J. P. Haton, "Variable-length sequence language models for large vocabulary continuous dictation machine," in Proc. Eurospeech '99, Budapest, Hungary, 1999.

George Saon received the M.Sc. and Ph.D. degrees in computer science from the University Henri Poincare, Nancy, France, in 1994 and 1997, respectively. From 1994 to 1998, he worked on stochastic modeling for off-line handwriting recognition at the Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA). He is currently with the IBM T. J. Watson Research Center, Yorktown Heights, NY, conducting research on large vocabulary conversational telephone speech recognition. His research interests are in pattern recognition and stochastic modeling.

Mukund Padmanabhan (S'89-M'89-SM'99) received the M.S. and Ph.D. degrees from the University of California, Los Angeles, in 1989 and 1992, respectively. Since 1992, he has been with the Speech Recognition Group, IBM T. J. Watson Research Center, Yorktown Heights, NY, where he currently manages a group conducting research on aspects of telephone speech recognition.
His research interests are in speech recognition and language processing algorithms, signal processing algorithms, and analog integrated circuits. He is coauthor of the book Feedback-based Orthogonal Digital Filters: Theory, Applications, and Implementation.
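The two language-model-based measures compared in this paper, mutual information and the product of the forward and reverse bigram probabilities, can be illustrated with a short sketch. The fragment below is not the system described here; it is a simplified, hypothetical implementation that assumes a single flat token list and unsmoothed relative-frequency estimates. It shows why a high bigram product is more selective: the pair must dominate the contexts of both constituent words, not merely co-occur often.

```python
from collections import Counter
from math import log2

def pair_measures(tokens):
    """Score each adjacent word pair with (i) pointwise mutual information
    and (ii) the product of the forward and reverse bigram probabilities.
    Unsmoothed counts; illustrative only."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), c in bigrams.items():
        p_pair = c / n_bi                    # P(w1, w2)
        p1 = unigrams[w1] / n_uni            # P(w1)
        p2 = unigrams[w2] / n_uni            # P(w2)
        mi = log2(p_pair / (p1 * p2))        # pointwise mutual information
        forward = c / unigrams[w1]           # P(w2 | w1)
        reverse = c / unigrams[w2]           # P(w1 | w2)
        scores[(w1, w2)] = {"mi": mi, "product": forward * reverse}
    return scores
```

In this toy setting, a pair such as "please call" whose constituents never occur apart receives a product of 1.0, whereas a pair whose first word also heads many other bigrams is penalized through the forward probability, which is the second selection condition that mutual information alone does not enforce.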