Switchboard Language Model Improvement with Conversational Data from Gigaword

Katholieke Universiteit Leuven, Faculty of Engineering
Master in Artificial Intelligence (MAI), Speech and Language Technology (SLT)
Internship at ESAT-PSI Speech Group
Under supervision of: Dr. Ir. Dong Hoon Van Uytsel, Dr. Ir. Jacques Duchateau
Promotor: Prof. Dr. Ir. Hugo Van hamme
Yanfen Hao
June, 2004

Table of Contents

ABSTRACT
INTRODUCTION
1 THE STATE OF THE ART OF TEXT CLASSIFICATION
2 BUILDING A UNIGRAM CLASSIFIER
  2.1 INTRODUCTION OF THE DATA SETS
  2.2 TEXT NORMALIZATION
  2.3 ORGANIZING THE CORPUS INTO TRAINING, DEVELOPMENT AND TEST PARTS
  2.4 CONSTRUCTION OF A SWITCHBOARD UNIGRAM LANGUAGE MODEL
  2.5 CLASSIFICATION WITH CROSS ENTROPY
    2.5.1 A brief review of entropy, cross-entropy and perplexity
    2.5.2 The role of cross-entropy in text classification
    2.5.3 Computing the cross-entropy of development data
    2.5.4 Finding the decision boundary for classification
  2.6 EVALUATING THE UNIGRAM CLASSIFIER WITH TEST DATA
3 IMPROVING THE SWITCHBOARD LANGUAGE MODEL
  3.1 INSPECTING NEWSWIRE ARTICLES WITH LOWER CROSS-ENTROPY
  3.2 PROCEDURE OF IMPROVING THE LANGUAGE MODEL
  3.3 ANALYSIS
  3.4 EVALUATION OF THE LANGUAGE MODEL IMPROVEMENT
4 CONCLUSION AND FUTURE WORK
ACKNOWLEDGEMENT
REFERENCES
APPENDIX

Abstract

This paper reports on my project at the ESAT-PSI speech group, carried out under the supervision of Dr. Dong Hoon Van Uytsel and Dr. Jacques Duchateau. The goal of the project is to extract conversational data from a newswire corpus (Gigaword) and to use it to improve the conversational speech (Switchboard) language model. The project consists of two stages. The first stage began with an overview of different approaches to text classification. Then a unigram classifier using cross-entropy for text categorization was built. The result of the first stage was a conversational unigram classifier that showed high accuracy in distinguishing newswire text from transcriptions of telephone conversations. In the second stage, the classifier was used to select additional conversational data from the newswire corpus in order to augment the limited spontaneous speech data. Experiments were conducted to see how much these additional data contribute to improving the spontaneous speech language model.

Introduction

The fast growth of the World Wide Web and the increase of digital information have made the organization of information a vitally important task. Automated categorization of text plays a crucial role in many applications such as Search Engines, Text Filtering, Part of Speech Tagging, E-mail Classification and Task Orientation in Question Answering. There are many kinds of text classification tasks, depending on the application. Text can be categorized in terms of linguistic style, authorship attribution, content and so on. In this paper, we are concerned with the difference between written and spoken style. The question we address is the following: given some labelled documents from a newswire corpus and conversational speech transcriptions, can we find an approach to accurately label previously unseen documents as conversational or newswire?
In this paper, a survey was conducted on text classification technologies such as the Naïve Bayes classifier, the maximum entropy classifier, Nearest-Neighbor text classification and Support Vector Machines. After that, a unigram language model was built, and cross-entropy was used to measure the difference between the test documents and the target class represented by its language model. The classifier obtained the classification boundary from 16,000 newswire documents and 4,000 conversational articles. The test results showed that the classifier achieved a high accuracy rate in distinguishing newswire articles from conversational speech.

We built the text classifier in order to find additional data that can be integrated into the Switchboard corpus and incorporated into the training process to improve the Switchboard language model. With the help of the unigram classifier, about 31,000 conversational articles were extracted from the newswire corpus. To investigate how much these data contribute to improving the spontaneous speech language model, we compared different kinds of additional training data: telephone speech transcriptions, randomly selected newswire articles and conversational documents extracted from the newswire corpus. The results showed that the perplexity of the language model clearly decreased with the additional conversational data from the newswire corpus.

1 The state of the art of text classification

Both the machine learning and the information retrieval communities have contributed to the development of text classification, and new methods are still being proposed. A non-exhaustive list includes the Naïve Bayes classifier, the maximum entropy classifier, Nearest-Neighbor text classification and Support Vector Machines.

Naïve Bayes text classifier [Lewis, 1998; McCallum and Nigam, 1998; Sahami, 1996]

The machine learning community has spent years developing classifier-learning methods. Among these methods, the Naïve Bayes classifier has been gaining popularity lately and has been found to perform surprisingly well.

It follows a probabilistic approach to classification based on the strong assumption that all attributes of the examples are independent of each other given the class. The core of the Naïve Bayes text classifier is a unigram language model. The unigram language model of a target class gives the likelihood of a test example under that class. The prior probabilities of the target classes are derived from their occurrence frequencies in the observed data. The final score is the product of the prior and the likelihood, which is proportional to the posterior probability of the class. The class with the maximum score is assigned to the test example.

Maximum entropy classifier [K. Nigam, J. Lafferty, A. McCallum, 1999]

Maximum entropy is a general technique for estimating probability distributions from data. It estimates the conditional distribution of the class label given a document, which is represented as a set of word-count features. The principle of maximum entropy is that a uniform distribution should be preferred in the absence of external knowledge. Labelled training data is used to derive a set of constraints on the expectations for the distribution. The solution to the maximum entropy formulation is found by the improved iterative scaling algorithm.

Nearest-Neighbor text classification [Yang, 1999]

The Nearest-Neighbor method classifies a text by finding the training examples nearest to each test example and having them vote for the label of the example. In Nearest-Neighbor text classification, a document is regarded as a bag of words. Every word stands for a dimension. If a word does not occur in the document, its corresponding vector value is zero; if it does occur, its value is non-zero. The Nearest-Neighbor method uses a weighting mechanism to set the non-zero value so that it is higher if the word occurs frequently in the document and lower if it occurs infrequently.

Similarity between two documents is measured using the cosine of the angle between the vectors representing the two documents.
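The Nearest-Neighbor description above can be illustrated with a minimal Python sketch. It assumes raw term counts as the weighting mechanism and a simple majority vote among the k nearest training examples; the function names (tf_vector, cosine, knn_label) are hypothetical and are not taken from [Yang, 1999].

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words vector: one dimension per word, weighted by its raw count.

    Raw counts are an assumption; other weighting schemes are possible.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_label(test_text, training, k=3):
    """Label a text by majority vote of its k most similar training examples.

    training: list of (text, label) pairs.
    """
    test_vec = tf_vector(test_text)
    ranked = sorted(
        training,
        key=lambda example: cosine(test_vec, tf_vector(example[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Words that do not occur in a document are simply absent from its Counter, which corresponds to a zero value in that dimension.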

Support Vector Machines [Joachims, 1998; Dumais et al, 1998]

Support Vector Machines (SVMs) are learning machines based on statistical learning theory. They perform binary classification and regression estimation tasks. SVMs non-linearly map their n-dimensional input space into a higher-dimensional feature space. In this high-dimensional feature space a linear classifier is then constructed using quadratic programming.

2 Building a unigram classifier

Based on the independence assumption of the Naïve Bayes classifier, we build a unigram language model and use it for text classification by measuring the cross-entropy.

2.1 Introduction of the data sets

The data we process come from two corpora: the Gigaword corpus, containing newswire text, and the Switchboard corpus, containing transcriptions of telephone conversations.

Gigaword English Corpus

The Gigaword English Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC). It includes four distinct international sources of English newswire:

AFE - Agence France Presse English Service
APW - Associated Press Worldstream English Service
NYT - The New York Times Newswire Service
XIE - The Xinhua News Agency English Service

Switchboard

Switchboard is a corpus of telephone speech. The corpus contains 2430 conversations averaging 6 minutes in length and about 3 million words of text, spoken by over 500 speakers of both sexes from every major dialect of American English.

2.2 Text normalization

Switchboard differs from Gigaword in structure and format. To remove the incompatibilities between the two corpora, it is necessary to normalize the Gigaword text. The text normalization procedure includes:

Splitting the text into sentences
Removing punctuation
Converting numbers into fully spelled-out words
Converting the text to uniform capitalization
Expanding abbreviations

2.3 Organizing the corpus into training, development and test parts

Typically, we specify 80% of the entire corpus for training and 10% each for development and test. The intuition behind this is that we need more training data to estimate the parameters of the language model and less development data to adjust them. However, we adapted this scheme for Switchboard. The Gigaword corpus was organized into 3 ID tables (document ID lists) for every subcorpus: 80% of the IDs were put into the training list, and 10% each into the test list and the development list. The Switchboard corpus was also split into 3 parts: 1.5 million words for training, 1 million words for development and the remaining 0.5 million words for test. The reason for dividing the corpus in this way is that we need more development and test data when we compare the Switchboard data with the conversational articles from Gigaword for language model improvement in the second part.

2.4 Construction of a Switchboard unigram language model

What is a language model

Here, the language model refers to a statistical language model. Given a vocabulary, a language model is a set of conditional probability mass functions. It returns the probability that a sentence prefix is followed by a certain word.

The role of the language model

The statistical language model plays an important role in automatic speech recognition. It helps a speech recognizer estimate how likely a word sequence is, independently of the acoustics. The language model not only helps to resolve acoustic ambiguity in Automatic Speech Recognition, but is also widely used in Natural Language Processing applications such as text classification. For example, the Naïve Bayes classifier uses a unigram language model to compute the posterior probability that a document belongs to a target class. In our task, the language model is used to give the probability distribution of a language in a given discourse context so that we can evaluate the cross-entropy of the test data.

The procedure of the construction

Acquiring a vocabulary from the training data

The vocabulary for building the language model contains 8000 words from the Switchboard training part and 12000 words from the Gigaword training corpus.

Decomposing sentences into word unigram events

According to the chain rule, we use

$$P(W) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$

to estimate the probability of sentences. The function of a language model is to compute $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$, or $P(w \mid h)$. The event space $(h, w)$ is large with respect to any available training corpus. Therefore, we only take into account the previous $n-1$ words when estimating the context-dependent probability of the current word. The N-gram assumption is to partition the history:

$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) = P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$

In this part, a unigram language model (n = 1) was generated for text classification.

Collecting unigram events and computing their frequency counts

Estimating the parameters according to the unigram frequency counts

2.5 Classification with Cross Entropy

2.5.1 A brief review of entropy, cross-entropy and perplexity

Entropy is a measure of the average uncertainty of a random variable and is defined as:

$$H(x) = -\sum_{x} P(x) \log_2 P(x)$$

Cross-entropy is used to compare the strengths of one model with those of another, given the same test data:

$$H_m[x] = -\sum_{x} P(x) \log_2 P_m(x)$$

It is estimated on a text $W$ as

$$H_m[x] \approx -\frac{1}{N} \log_2 P_m(W)$$

where $N$ is the length of the text $W$ measured in words, $P$ is the true (unknown) probability distribution, $m$ is the current model, and $P_m$ is the probability of $W$ as estimated by $m$. Perplexity is defined as:

$$PP_m(W) = 2^{H_m[x]}$$

It corresponds to an intuitive quantity: the average number of choices the model sees at each word.

2.5.2 The role of cross-entropy in text classification

We have derived a unigram language model from the Switchboard training data. Given a collection of newswire articles and telephone conversation transcriptions, we use their cross-entropies to indicate which of them are close to the spontaneous speech represented by the Switchboard unigram language model, and which are far from it. In other words, a newswire document has a higher cross-entropy, while a conversation transcription has a lower one.
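The construction steps of Section 2.4 and the measures of Section 2.5.1 can be sketched in Python as follows. This is a minimal illustration, not the tool chain used in the project: the smoothing method is not described in this paper, so add-one smoothing and a single out-of-vocabulary token are assumptions, and the names train_unigram, cross_entropy, perplexity and UNK are hypothetical.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical token standing for all out-of-vocabulary words


def train_unigram(sentences, vocab_size):
    """Estimate unigram probabilities from whitespace-tokenized sentences.

    Assumptions (not taken from the paper): the vocabulary is the
    vocab_size most frequent words, and add-one smoothing is applied.
    """
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    mapped = Counter()
    for word, count in counts.items():
        mapped[word if word in vocab else UNK] += count
    mapped.setdefault(UNK, 0)  # make sure unseen words get some probability
    total = sum(mapped.values()) + len(mapped)
    return {w: (c + 1) / total for w, c in mapped.items()}


def cross_entropy(model, text):
    """Per-word cross-entropy in bits: -(1/N) * log2 P_m(W)."""
    words = text.split()
    if not words:
        raise ValueError("cannot score an empty text")
    log_prob = sum(math.log2(model.get(w, model[UNK])) for w in words)
    return -log_prob / len(words)


def perplexity(model, text):
    """Perplexity: 2 raised to the cross-entropy."""
    return 2 ** cross_entropy(model, text)
```

A text whose words are frequent in the training data receives a lower cross-entropy than a text made of rare or unseen words, which is the property the classifier relies on.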

2.5.3 Computing the cross-entropy of development data

The Switchboard text is composed of a large number of sentences with no document boundaries between them, while the Gigaword data are divided into documents. In order to create similar conditions, we split the Switchboard data into simulated articles. The procedure of computing the cross-entropy is illustrated in the following figure:

[Figure: sentences of articles -> text2ngramevts -> unigram events -> log probability of each unigram event (from the unigram language model) -> cross-entropy $-\frac{1}{N} \log_2 P_m(W)$]

We measured the cross-entropies of 4,000 documents from 4 newswire subcorpora and those of the same number of simulated articles from the Switchboard development corpus. The results of the experiments are visualized in the following pictures:

[Figure: Cross-entropy comparison between Switchboard and Gigaword]

2.5.4 Finding the decision boundary for classification

From the above diagrams, we see that there is a clear separation between the Gigaword articles and the Switchboard transcriptions. We measured that only 0.02% of the Switchboard articles (from the development part) have cross-entropies greater than 10, while 94.5% of the New York Times articles have cross-entropies greater than 10. The cross-entropies of the Associated Press Worldstream, Xinhua News Agency and Agence France Presse articles are nearly all above 10.

Unigram cross-entropies of development data

Cross-entropy   SWB     NYT     APW      AFE      XIE
> 10            0.02%   94.5%   98.78%   98.15%   99.3%

Therefore, it is reasonable to define the value of 10 as the decision boundary of cross-entropy, as measured with the given Switchboard unigram language model.

2.6 Evaluating the unigram classifier with test data

We randomly selected 16,000 articles from the Gigaword test corpus and used the unigram classifier to measure their cross-entropies. Articles with a cross-entropy greater than 10 were put into the category of writing style; those with a cross-entropy less than 10 were identified as speech style. We obtained an accuracy rate of 96.23%. By measuring test examples, we also found a small number of Gigaword articles with a surprisingly low cross-entropy. Manual inspection revealed that they contained conversational data. This is the property we exploit next: we use this classifier to extract conversational data from the Gigaword corpus for Switchboard language model improvement.
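The decision rule of Sections 2.5.4 and 2.6 amounts to a single threshold on the unigram cross-entropy. The sketch below illustrates it in Python; the function names are hypothetical, and the treatment of a cross-entropy exactly equal to 10 (assigned to writing style here) is an assumption, since the text only specifies the cases above and below the boundary.

```python
BOUNDARY = 10.0  # decision boundary on the unigram cross-entropy (Section 2.5.4)


def classify(cross_entropy_value, boundary=BOUNDARY):
    """Return 'speech' below the boundary and 'writing' otherwise.

    Assumption: a value exactly on the boundary counts as 'writing'.
    """
    return "speech" if cross_entropy_value < boundary else "writing"


def fraction_above(cross_entropies, boundary=BOUNDARY):
    """Fraction of documents whose cross-entropy exceeds the boundary."""
    values = list(cross_entropies)
    return sum(1 for h in values if h > boundary) / len(values)


def accuracy(scored_docs, boundary=BOUNDARY):
    """scored_docs: list of (cross_entropy, true_label) pairs."""
    docs = list(scored_docs)
    correct = sum(1 for h, label in docs if classify(h, boundary) == label)
    return correct / len(docs)
```

Applied to one source at a time, fraction_above yields the kind of per-corpus percentages reported in the table of Section 2.5.4.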

3 Improving the Switchboard language model

3.1 Inspecting Newswire articles with lower cross-entropy

The following is an article from the New York Times Newswire Service that nevertheless has a cross-entropy of 8.58, measured with our unigram classifier.

I WORK FOR THE POSTAL SERVICE I'VE BEEN TO THE DOCTOR I WENT TO THE DOCTOR THURSDAY HE TOOK A CULTURE BUT HE NEVER GOT BACK TO ME WITH THE RESULTS I GUESS THERE WAS SOME HANG UP OVER THE WEEKEND I'M NOT SURE BUT IN THE MEANTIME I WENT THROUGH ACHINESS AND HEADACHINESS

This kind of conversational content is spontaneous enough to be used for training the Switchboard language model.

3.2 Procedure of improving the language model

Selecting conversational documents from the Gigaword corpus

The unigram classifier is used to capture the articles with a cross-entropy lower than 10 and to add them to a new corpus. Let us call this a quasi-conversation corpus, because we noticed that some articles are mixtures of conversation and comment.

Sorting the quasi-conversation corpus by cross-entropy

The quasi-conversation corpus contains about 31,000 articles and 20 million words. In order to inspect the validity of the quasi-conversation data for training the Switchboard language model, we sort these articles by their cross-entropies.

Improving the Switchboard language model

First, we introduce the n-gram frequency count file obtained from the Switchboard training data (1.5 million words) into the language model improvement procedure as the baseline. Secondly, we generate another count file from the first 0.1 million words of the quasi-conversation corpus. Then, we merge these two count files into a new count file and generate a new language model. Finally, we use this new language model to measure the perplexity of the Switchboard test data. We repeat the above steps with increments of 0.1 million words and measure the perplexities.

We first used a unigram model to evaluate the performance of the language model. However, the experiment showed that the 1.5 million conversational words had already trained the unigram language model very well, and the additional 0.9 million quasi-conversation words made no difference to the unigram language model. Therefore, we use bigram and trigram language models to evaluate the improvement. The results are shown in the following diagrams:

[Figure: bigram evaluation. Perplexity (y-axis, 95 to 105) against Million Words (x-axis, 1.5 to 4.5)]

[Figure: trigram evaluation. Perplexity (y-axis, 85 to 95) against Million Words (x-axis, 1.5 to 4.5)]

3.3 Analysis

The perplexities of the initial bigram and trigram models are 102.26 and 90.14 respectively. As training data from the quasi-conversation corpus are added, the perplexity descends. This means that if we use the improved language model for automatic speech recognition, the average number of words that can follow any word shrinks. For the bigram language model, 0.9 million words make the perplexity decrease from 102.26 to 99.67. For the trigram language model, the perplexity declines from 90.14 to 86.18. With the same amount of additional spontaneous data, the Switchboard language model therefore improves more for the trigram than for the bigram. Although both perplexity curves decrease at a similar speed over the first 0.4 million words, the drop in perplexity between 0.5 and 0.9 million words is much faster for the trigram than for the bigram. Moreover, the increase of perplexity beyond 0.9 million additional words is slower for the trigram language model than for the bigram.

Another interesting measure is the cross-entropy at the critical point from where the perplexity starts to go up. We measured the unigram cross-entropy at the critical point, and the value is 9.47. This means that the Gigaword articles with a unigram cross-entropy less than 9.47 can be used for improving the bigram and trigram language models. The following diagram illustrates the relationship between the critical cross-entropy of the Gigaword articles and the mean cross-entropy of the Switchboard development data.

3.4 Evaluation of the language model improvement

In order to see how effectively the quasi-conversation data improve the Switchboard language model, we use different data for training and compare their effects on the language model.

We use the Switchboard training data to generate the bigram language model. After training on 1.5 million Switchboard words, the perplexity has decreased steeply from 157.41 to 102.26. We then start the language model improvement from the perplexity of 102.26. At this point, if we use randomly selected Gigaword articles, we can see that they have a negative impact on the language model: the perplexity curve goes up dramatically. If we use Switchboard development data to train the language model, the perplexity curve continues to decline quickly. For the data from the quasi-conversation corpus, the first 0.3 million words improve the language model quickly. It is worth noticing that the unigram cross-entropies of these articles are less than 9.1. For data with a unigram cross-entropy more than 9.1 but less than 9.47, the rate of improvement slows down. Once data with a cross-entropy more than 9.47 are introduced into the training, the improvement stops and the perplexity starts to increase dramatically.

To show a clear comparison of the effects of different data on the bigram language model, the initial status of the language model (the perplexity value of 157.41) is not included in the following diagram. The diagram only illustrates the perplexity curve from 111.53 to 95.7.

[Figure: comparison of different data for improving language model. Perplexity (y-axis, 90 to 110) against Million Words (x-axis, 1 to 4.5); curves: Switchboard training data, Switchboard development data, random Gigaword articles, quasi dialog data]
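The incremental procedure of Section 3.2, whose results are analysed above, can be sketched in Python as follows. This is only an illustration under stated assumptions: the project merged count files produced by the tools listed in the Appendix, whereas this sketch keeps bigram counts in memory; the names bigram_counts and augment_in_steps and the sentence boundary markers are hypothetical. Generating a language model from the merged counts and measuring its perplexity are left out.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count bigram events in whitespace-tokenized sentences.

    The boundary markers <s> and </s> are an assumption of this sketch.
    """
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts


def augment_in_steps(baseline_counts, scored_articles, step_words):
    """Yield (words_added, merged_counts) each time step_words more words are in.

    scored_articles: list of (unigram_cross_entropy, text) pairs. Articles
    are added in order of increasing cross-entropy, so the most
    conversational material comes first.
    """
    merged = Counter(baseline_counts)
    added = 0
    next_mark = step_words
    for _, text in sorted(scored_articles, key=lambda article: article[0]):
        merged.update(bigram_counts([text]))
        added += len(text.split())
        if added >= next_mark:
            yield added, Counter(merged)  # snapshot of the merged counts
            next_mark = (added // step_words + 1) * step_words
```

Each yielded snapshot plays the role of a merged count file: a language model would be generated from it and evaluated on the Switchboard test data, giving one point on the perplexity curves.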

4 Conclusion and future work

In this paper, we have reviewed different approaches to text classification. Based on the independence assumption of the Naïve Bayes classifier, we built a unigram language model and used cross-entropy to select conversational data from the newswire corpus. Cross-entropy is used to measure the difference between the test documents and the language represented by a Switchboard unigram language model.

With the unigram classifier, we acquired about 31,000 articles with a cross-entropy lower than 10 from the Gigaword corpus to augment the limited Switchboard data for language model improvement. About 0.9 million words with a cross-entropy less than 9.47 improved the Switchboard trigram language model by decreasing the perplexity from 90.14 to 86.18 (or from 102.26 to 99.67 for the bigram language model). These articles are almost pure conversation and are seldom mixed with writing style data. Articles with unigram cross-entropies of more than 9.47 are of no use for the Switchboard language model improvement. Newswire articles with unigram cross-entropies between 9.47 and 10 are mixtures of spontaneous speech and comment. Newswire articles with a unigram cross-entropy of more than 10 are clearly writing style.

However, the amount of additional spontaneous data extracted from Gigaword is still not enough to boost the performance of the Switchboard language model dramatically. The text classification and conversation selection described in this paper operate on whole documents. This means that we only selected entire conversational articles from the Gigaword corpus, and these documents alone are insufficient. Indeed, we noticed that although some newswire articles have cross-entropies of more than 9.47, they still contain some local conversations, and the number of such articles is large.

As for future work, it is worth exploring how to refine the conversation selection procedure. We need a further method to select conversational sentences from within a document. If we can find an approach that selects spontaneous speech sentences instead of whole conversational documents from the Gigaword corpus, we will obtain much more conversational data than at present.

Acknowledgement

Special thanks to the following people:

Dr. Dong Hoon Van Uytsel, for guiding me throughout the internship and for the articles and books he sent me.
Dr. Jacques Duchateau, for his suggestions and for enlightening me when I was bewildered.
Prof. Hugo Van hamme, for introducing me to the internship, from which I have acquired much experience and many skills.

References

[1] T. Mitchell, Machine Learning, McGraw Hill, 1997
[2] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, 2001
[3] A. McCallum, K. Nigam, A Comparison of Event Models for Naive Bayes Text Classification, AAAI-98 Workshop on "Learning for Text Categorization", 1998
[4] Y. Yang, An evaluation of statistical approaches to text categorization, Journal of Information Retrieval, Vol 1, No. 1/2, pp. 67-88, 1999
[5] Y. Yang, C. Chute, An example-based mapping method for text classification and retrieval, ACM Transactions on Information Systems (TOIS), 12(3): 252-77, 1994
[6] K. Nigam, J. Lafferty, A. McCallum, Using maximum entropy for text classification, IJCAI-99, pp. 61-67, 1999
[7] V. N. Vapnik, The Nature of Statistical Learning Theory, 1995
[8] Dong Hoon Van Uytsel, Probabilistic Language Modeling with Left Corner Parsing, Chapter 1, September 2003
[9] Villaseñor-Pineda, M. Montes-y-Gómez, M. Pérez-Coutiño, and D. Vaufreydaz, A Corpus Balancing Method for Language Model Construction, Computational Linguistics and Intelligent Text Processing, 2003

Appendix

Modeling tools

Available tools for language modelling:

text2voc         Obtaining a vocabulary containing the n most frequent words from text
text2ngramevts   Segmenting the sentences of a text into n-gram events
collcnt          Collecting n-gram events and counting their frequencies
count2ngram      Generating an n-gram language model according to frequency counts
clean text       Normalizing the text
perpl            Measuring the perplexity of a language model

A tool for manipulating the Gigaword database

Turning a raw corpus into an indexed database

Input (raw corpus):
<DOC id="nyt19940701.0005" >
<HEADLINE> WORLD'S BEST FLOCK TO MANHATTAN BEACH </HEADLINE>
.
<DOC id="nyt19940701.0006" >
.
<DOC id="nyt19940701.0007" >

Tool: Cat_giga

Output (ID tables):
NYT19940701.0005
NYT19940701.0006
NYT19940701.0007
NYT19940701.0008
NYT19940701.0009

Displaying a random news article given its document ID

Input (document ID):
NYT19940701.0005

Tool: Cat_giga

Output (text of the article):
This holiday weekend the sands of Manhattan Beach will be shuffling with the top professional beach volleyball players in the world..