PERFORMANCE IMPROVEMENT OF AUTOMATIC SPEECH RECOGNITION SYSTEMS VIA MULTIPLE LANGUAGE MODELS PRODUCED BY SENTENCE-BASED CLUSTERING

Sushil Kumar Podder, Khaled Shaban, Jiping Sun, Fakhri Karray, Otman Basir, and Mohamed Kamel
[spodder, kshaban, jsun, karray, obasir, mkamel]@uwaterloo.ca
PAMI Lab, University of Waterloo, Waterloo, CANADA

ABSTRACT

Grammar-based speech recognition systems exhibit performance degradation as their vocabulary sizes increase. Data clustering is expected to mitigate this problem. We introduce an approach to data clustering for automatic speech recognition systems using the Kohonen Self-Organizing Map. The clustering results are then used to build a language model for each of the clusters using the CMU-Cambridge toolkit. The approach was implemented as a prototype of a large-vocabulary, continuous speech recognition system, and about 8% performance improvement was achieved in comparison with the performance obtained using the language model and dictionary provided with Sphinx3. In this paper we present the experimental results along with discussion, analysis, and potential future directions.

Keywords: Automatic Speech Recognition, Self-Organizing Map, Language model

1. INTRODUCTION

With the continuing advances made in speech technology, more information and services will become readily available to users; a simple cell phone will be enough to hook into the information age. Automatic Speech Recognition (ASR) systems are already being deployed to help people find flight information, trade stocks, access email, and check weather conditions. Over the past few years, a number of significant engines and toolkits (commercial and research oriented) have been developed to support building ASR systems, e.g., Nuance [6], the Microsoft Speech Engine [7], and the CMU-Cambridge Toolkit [2].
Despite the success of grammar-based engines in command-and-control applications, they suffer acute performance degradation once their vocabularies exceed a certain size. It therefore seems very difficult to use this approach to implement unconstrained, domain-independent ASR applications, e.g., dictation systems. A practical solution to this problem is to break down the large language model (LM) and construct multiple LMs with acceptable vocabulary sizes. These models should group together words and phrases that are used coherently in the natural use of the language. In this paper we present an approach to improve speech recognition systems and to get one step closer to spontaneous, domain-independent dictation systems.
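The idea of replacing one large LM with several smaller ones can be illustrated with a toy sketch. The sub-models, data, and add-one-smoothed unigram scoring below are all hypothetical simplifications (the actual system uses the CMU-Cambridge toolkit and n-gram models); the point is only the routing idea: score the input under every sub-model and keep the best.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Build a unigram model with add-one smoothing from a list of token lists."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def best_model(models, utterance):
    """Index of the sub-model assigning the utterance the highest log-probability."""
    scores = [sum(math.log(p(w)) for w in utterance) for p in models]
    return max(range(len(scores)), key=scores.__getitem__)

# Two toy "clusters": one about weather, one about finance.
weather = [["rain", "is", "expected"], ["sunny", "skies", "today"]]
finance = [["stocks", "rose", "today"], ["the", "market", "fell"]]
models = [train_unigram(weather), train_unigram(finance)]

print(best_model(models, ["rain", "today"]))   # → 0 (the weather sub-model)
```

Each sub-model is small, so its probability estimates are sharper on its own domain, which is the intuition behind the clustering approach developed in the rest of the paper.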
The breakdown of the paper is as follows: Section one, the introduction, presents basic background material, demarcates the subject, and discusses some existing problems and how this work can tackle them. Section two gives a literature survey and a state-of-the-art review of the application of data clustering to the creation of language models. The next four sections constitute the main body of the work: Section three describes the proposed solution, including data modeling, the clustering technique, and language model building. Section four explains the experimental setup: its data set and the tuning and settings of the algorithms' parameters. Section five contains results collected from an implemented prototype of a dictation system that makes use of the proposed clustering technique. Section six analyzes and discusses the results and highlights the key issues covered. Section seven recommends further developments concerning the presented study.

2. OVERVIEW OF RELATED WORK

This section overviews the state of the art in three related areas of study: language modeling, data clustering, and the use of clustering to improve speech recognition systems.

2.1 Language Modeling

The probabilistic n-gram language model is the most widely used model in large-vocabulary ASR systems. The job of the n-gram language model is to define which words can follow at each point in the model and the transition probability from one word to the next [10] [13] [14]. For large-vocabulary recognizers, the word lattice is not constructed beforehand; it is built as the search progresses, and the language model is used to determine the overall probability of the search paths under consideration. However, as the size of the language model increases, the number of search paths grows proportionally and, consequently, the model perplexity increases. Reducing the size of the language model is therefore important.
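The transition probabilities and perplexity mentioned above can be made concrete with a minimal bigram sketch (toy corpus and add-one smoothing are my illustrative assumptions, not the CMU toolkit's actual estimator; perplexity is reported in the usual 2^H sense, matching the paper's "bits" framing):

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Bigram model with add-one smoothing over a small vocabulary."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])            # contexts (everything but </s>)
        bigrams.update(zip(toks, toks[1:]))   # transitions
    V = len(vocab)
    def prob(prev, w):
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
    return prob

def perplexity(prob, sentence):
    """Per-word perplexity 2^H of a sentence under the bigram model."""
    toks = ["<s>"] + sentence + ["</s>"]
    log2p = sum(math.log2(prob(p, w)) for p, w in zip(toks, toks[1:]))
    return 2 ** (-log2p / (len(toks) - 1))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
prob = train_bigram(corpus)
print(round(perplexity(prob, ["the", "cat", "sat"]), 2))   # low: sentence is in-domain
```

A smaller, more coherent training set concentrates the transition probabilities on fewer continuations, which is exactly why per-cluster sub-models can achieve lower perplexity than one big model.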
In an ASR system, the active vocabulary of a speaker in a given context is rather small; speakers usually like to confine themselves to some specific domain. The generation of a domain-specific (sub-language) model is therefore one of the most challenging and demanding tasks in ASR.

2.2 Data Clustering

Clustering algorithms partition a set of objects into groups or clusters [11] [15]. Data representation modeling is an essential part of the clustering process: it is the means by which objects are described using a set of features and values. Multiple objects may or may not have the same representation in this model, and the goal is to place similar objects in the same group and dissimilar objects in different groups. There are many different clustering algorithms, and they can be classified into a few basic types. Two types of structures are produced by clustering algorithms: hierarchical clustering and flat (non-hierarchical) clustering. A flat clustering simply consists of a certain number of clusters, and the relation between the clusters is often undetermined. Most algorithms that produce flat clusterings are iterative: they start with a set of initial clusters and improve them by iterating a reallocation operation that reassigns objects. A hierarchical clustering is a hierarchy with the usual interpretation that each node stands for a subclass of its mother node. The leaves of the tree are the single objects of the clustered set, and each node represents the cluster that contains all the objects of its descendants. Another important distinction between clustering algorithms is whether they perform soft or hard clustering. In a hard assignment, each object is assigned to one and only one cluster; soft assignments allow degrees of membership, and membership in multiple clusters.
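The "iterative reallocation" behind most flat, hard-clustering algorithms can be sketched with a small k-means-style loop (a generic illustration of the algorithm class, not the SOM method this paper actually uses; data points are made up):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Hard flat clustering by iterative reallocation (k-means style).
    Each point belongs to exactly one cluster; centroids are re-estimated each pass."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))
    for _ in range(iters):
        # Reallocation step: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # Re-estimation step: move each centroid to the mean of its cluster.
        for j, c in enumerate(clusters):
            if c:
                centroids[j] = tuple(sum(x) / len(c) for x in zip(*c))
    return clusters

pts = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9)]
clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))   # → [2, 2]: the two tight pairs separate
```

A soft-clustering variant would instead return, for each point, a degree of membership in every cluster rather than a single assignment.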
2.3 Clustering for Better Recognition

Brown et al. [8] took the approach of clustering, believing that it can play an important role in improving language modeling. Recently, the speech community has begun to address the use of clustering in building better language models, and thus improving recognition accuracy. Florian and Yarowsky [3] utilized hierarchical and dynamic topic-based clustering to build language models. Although this work, like other topic-based clustering approaches [4], could bring perplexity improvements and word error rate reductions, such approaches have difficulty handling topic-insensitive words. As Mangu [5] observed, closed-class function words, such as the, of, and with, have minimal probability variation across different topics, while most open-class content words exhibit substantial topic variation. Moreover, the recognizer would incur the additional overhead of identifying the topic/domain from the user's first utterances in order to select the appropriate language model.

3. THE PROPOSED SOLUTION

For the purpose of language modeling, clustering can be done based on documents or segments of documents. Clusters can be obtained either manually (by topic-tagging the documents) or automatically, by using an unsupervised algorithm to group similar documents into topic-like clusters. We utilized the latter approach for its generality and extensibility, and because there is no reason to believe that manually assigned topics are optimal for language modeling. The main steps taken towards solving the problem are described below; Figure 1 depicts all the steps in a flow diagram. In brief, a large corpus is processed by a sentence-based clustering technique to produce smaller collections of related sentences. The clusters are then used to build multiple language models. The user's utterance is passed to all language models and initial recognition results are collected.
These results are aggregated to create a final language model that converts the spoken text into written form.

[Figure 1: System flow diagram. The processed BNC is clustered (SOM); a language model and dictionary generator builds the sub-language models and dictionary; test speech passes through the CMU decoder, whose recognition results yield a final LM and dictionary for a second CMU decoding pass producing the final result.]

3.1 Data Modeling

Clustering the sentences of text corpora is believed to bring the closest and most similar words and common phrases together. Extracting sentences and representing them was the first step in dealing with this issue. Most data clustering algorithms expect the data set to be clustered in the form of a set of vectors X = {x_1, x_2, ..., x_n}, where the vector x_i, i = 1, ..., n, corresponds to a single object in the data set and is called the feature vector. Extracting the proper features to represent objects through the feature vector is a crucial factor in the running time of the algorithm and hence its scalability. When clustering sentences, we overcame the problem of high dimensionality as follows: each sentence is represented by a vector x = {f_1, f_2, ..., f_m}, where f_i, i = 1, ..., m, is the index of word i in a global dictionary of all existing words. The vector size m is the average number of words found in a sentence.

3.2 Data Clustering

There exists a multitude of clustering techniques in the literature, each adopting a certain strategy for detecting the grouping in the data. The method that
was chosen here is the Kohonen Self-Organizing Map (SOM) [1]. The SOM can be visualized as a sheet-like neural-network array, the cells (or nodes) of which become specifically tuned to various input signal patterns, or classes of patterns, in an orderly fashion. The learning process is competitive and unsupervised, meaning that no teacher is needed to define the correct output (or, actually, the cell into which the input is mapped) for an input. In the basic version, only one map node (the winner) at a time is activated for each input. The locations of the responses in the array tend to become ordered during the learning process, as if some meaningful nonlinear coordinate system for the different input features were being created over the network [1]. All sentence vectors were fed to the network as one epoch, and over a number of iterations the network was trained to map them to a specific number of clusters.

3.3 Language Model Building

The whole training data (sentence sequences) was clustered using the unsupervised SOM methodology. The sentences belonging to each cluster were collected together to generate an individual sub-language model. Language models were generated using the statistical language modeling toolkit provided by CMU [2]. Each language model provides 1-gram, 2-gram, and 3-gram word sequences as well as their probabilities. Using these probabilities, we computed the probability of each of the test sentences under each of the sub-models as well as under the CMU language model.

4. EXPERIMENT SETUP

For our experiments, we chose the BNC (British National Corpus). It comprises approximately 100 million words and about 6 million sentences. After some pre-processing, viz. filtering out all non-words, punctuation, abbreviations, and numerals, and removing stop words and unimportant words, we chose 5,845,344 sentences containing 49,233 unique words for clustering.
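The competitive, neighborhood-based SOM training described in Section 3.2 can be sketched in a much-simplified one-dimensional form (this toy code is illustrative only; the experiments use SOM_PAK [12] with a hexagonal 2-D map, and the map size, radius, and learning-rate values below are arbitrary assumptions):

```python
import random

def train_som(data, n_nodes=5, iters=2000, r0=2.0, lr0=0.5, seed=0):
    """Minimal 1-D SOM with a bubble neighborhood.
    The winning node and its neighbors within a linearly shrinking radius
    are pulled toward each input vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(iters):
        x = data[rng.randrange(len(data))]
        radius = max(1.0, r0 * (1 - t / iters))   # linear decay, as in SOM_PAK
        lr = lr0 * (1 - t / iters)
        # Winner: the node closest to the input (competitive step).
        w = min(range(n_nodes),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, nodes[j])))
        # Bubble neighborhood: all nodes within the radius move uniformly.
        for j in range(n_nodes):
            if abs(j - w) <= radius:
                nodes[j] = [a + lr * (b - a) for a, b in zip(nodes[j], x)]
    return nodes

def assign(nodes, x):
    """Cluster index of a vector = index of its winning node."""
    return min(range(len(nodes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, nodes[j])))

# Two well-separated toy groups of vectors should land on different nodes.
data = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0]]
nodes = train_som(data)
print(assign(nodes, [0.05, 0.05]), assign(nodes, [0.95, 0.95]))
```

After training, each sentence vector's winning node serves as its cluster label, and the sentences sharing a node are pooled to build one sub-language model.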
Each sentence was represented as a vector of fixed dimension 60, and each component of the vector was the (normalized) index of the word in the word list. If a sentence had fewer than 60 words, the remaining components were filled with 0s; sentences longer than 60 words were truncated to 60. Note that the dimension 60 was selected by analyzing a vast amount of BNC text data.

We used SOM_PAK [12] for sentence clustering. The SOM parameter values were as follows. The initial radius of the training area was 5, decreasing linearly to 1 during training. The number of clusters the SOM would produce was set to 5, 10, 20, 30, and 40. The map topology was hexagonal (hexa); the possible choices were a hexagonal lattice (hexa) and a rectangular lattice (rect). The neighborhood function type was bubble. The running length (number of iterations) in training was chosen to be 30,000 after a number of trials.

For testing, we randomly selected 100 VOA (Voice of America) broadcast utterances and ran the CMU engine with and without clustering. The results are presented in the following section.

5. RESULTS

We computed the language models' perplexity for the reference sentences using the CMU toolkit. For the best sub-language model the average perplexity is 8.6 bits, and for the big language model it is 8.8 bits. Figure 2 shows the perplexity of generating each of the sentences from the sub-language model and from the big language model.

[Figure 2: Language model perplexity (in bits) per test sentence, for the sub-language model and the big language model.]

The recognition results without clustering and with clustering, for different numbers of clusters, are presented in Table 1.

Table 1: Recognition performance

No. of clusters | Recognition (without clustering) | Recognition (with clustering) | Performance improvement
       5        |              70.8%               |             75.3%             |          4.5%
      10        |              70.8%               |             75.4%             |          4.6%
      20        |              70.8%               |             78.0%             |          7.2%
      30        |              70.8%               |             78.7%             |          7.9%
      40        |              70.8%               |             78.0%             |          7.2%

6. DISCUSSION

To improve the performance of ASR, and to move towards free and unlimited dictation systems, we proposed an approach that incorporates a clustering technique, the SOM. Across a set of experiments, we achieved consistently better recognition performance for most of the test utterances; in some cases, around 8% performance improvement was obtained compared with the single CMU language model. The improvement in recognition performance should be due to the SOM's ability to organize sentences according to textual similarity; this textual similarity reduces the perplexity of the model and consequently enhances the recognition results. The results show a gradual improvement in performance as the number of clusters increases, but once it exceeds 20 the performance does not change significantly, and eventually it starts degrading. Since the language model generator relies on uni-gram, bi-gram, and tri-gram counts, the reduced cluster size leaves the model with inadequate bi-gram and tri-gram tokens from the training data, and consequently the decoder is biased towards picking mostly uni-grams. Hence, there is always a trade-off between cluster size and performance. Although the sub-language models show a significant improvement in recognition performance as well as in perplexity, the problem of automatically identifying the suitable sub-language model has not yet been solved. The following section describes our ongoing effort.

7.
FUTURE WORK

Although the sub-language models generated from the clustered data improved recognition accuracy, there is still room for further improvement in language modeling by modifying the encoding scheme to use semantics. We have found that the conventional representation (encoding scheme) of the sentences used for clustering is not sufficient to capture the inherent semantics of a sentence. Because the conventional sentence representation treats a sentence as a bag of words, without any notion of the conceptual similarity of terms such as that defined in terminological resources like WordNet [9], clustering solutions only relate sentences that use identical terminology. In our ongoing effort, we are trying to find the appropriate sub-language model through semantic clustering.

Using the big language model and dictionary, the CMU decoder usually provides one single best hypothesis for each utterance. In our present experiment, we found that the ASR engine with the CMU language model is able to predict about 70% of the words correctly. Since the SOM algorithm clusters the data based on the inherent redundancy or similarity in the data, we postulate that the cluster(s) closest to the best hypothesis might contain more relevant words. Based on this assumption, we followed these steps (depicted in Figure 3):
- Pass the utterance to the CMU decoder using the CMU language model.
- Represent the best hypothesis as a feature vector.
- Select the cluster(s) most similar to it using some distance measure (usually Euclidean).
- Accumulate all the sentences in the selected cluster(s) and generate the final language model.
- Pass the utterance once again to the CMU decoder using the final language model.

[Figure 3: Two-stage ASR system. Test speech is first decoded with the CMU LM and dictionary; the best hypothesis, together with the WordNet-based encoding scheme and the SOM clusters of the processed BNC, drives a decision-making step that selects the final LM and dictionary for a second CMU decoding pass producing the final result.]

REFERENCES

[1] Kohonen, T. Self-Organizing Maps. Springer, Berlin, Heidelberg, 1995.
[2] Clarkson, P.R. and Rosenfeld, R. Statistical Language Modeling Using the CMU-Cambridge Toolkit. Proceedings of ESCA Eurospeech, 1997.
[3] Florian, R. and Yarowsky, D. Dynamic Nonlocal Language Modeling via Hierarchical Topic-Based Adaptation. 37th Annual Meeting of the Association for Computational Linguistics, 1999.
[4] Lowe, S. An Attempt at Improving Recognition Accuracy on Switchboard by Using Topic Identification. Johns Hopkins Speech Workshop, Language Modeling Group, Final Report, 1995.
[5] Mangu, L. Hierarchical Topic-Sensitive Language Models for Automatic Speech Recognition. Technical report, Computer Science Department, Johns Hopkins University, 1999.
[6] http://www.nuance.com.
[7] http://www.microsoft.com/speech.
[8] Brown, P.F., Della Pietra, V.J., deSouza, P.V., Lai, J.C., and Mercer, R.L. Class-Based n-gram Models of Natural Language. Computational Linguistics, 18(4):467-479, 1992.
[9] Miller, G.A. WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 1995.
[10] Manning, C.D. and Schütze, H. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
[11] Jain, A.K. and Dubes, R.C. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.
[12] Kohonen, T., Hynninen, J., Kangas, J., and Laaksonen, J. SOM_PAK: The Self-Organizing Map Program Package.
Technical Report A21, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland, 1996.
[13] Rosenfeld, R. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD Thesis, Computer Science Department, Carnegie Mellon University, 1994.
[14] Stolcke, A. et al. Structure and Performance of a Dependency Language Model. Proceedings of EuroSpeech, 1997.
[15] Kaufman, L. and Rousseeuw, P.J. Finding Groups in Data. Wiley, New York, 1990.
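As a final illustration, the cluster-selection step of the two-stage procedure in Section 7 can be sketched as follows. The centroids and hypothesis vector below are hypothetical toy values; the actual system would compare the first-pass best hypothesis, encoded as a feature vector, against the SOM's cluster representatives.

```python
import math

def nearest_cluster(hypothesis_vec, centroids):
    """Pick the cluster whose centroid is closest (Euclidean distance)
    to the first-pass best-hypothesis vector."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)), key=lambda j: dist(hypothesis_vec, centroids[j]))

# Toy example: three cluster centroids in the (normalized word-index) vector space.
centroids = [(0.1, 0.2, 0.0), (0.5, 0.5, 0.5), (0.9, 0.8, 0.7)]
print(nearest_cluster((0.45, 0.55, 0.4), centroids))   # → 1
```

The sentences of the selected cluster would then be pooled to generate the final language model for the second decoding pass.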