Experiments on the LIMSI Broadcast News Data
Interim Report for SNF Project 105211-112133: Rule-Based Language Model for Speech Recognition

Tobias Kaufmann
Institut für Technische Informatik und Kommunikationsnetze
February 2007
1 Lattice Preprocessing

The research reported here is based on an experiment performed with the LIMSI German broadcast news transcription system [1]. We selected the first 30 lattices of this experiment, corresponding roughly to the first 10 minutes of a German 8 o'clock news broadcast (Tagesschau der ARD) from the 14th of April, 2002. We first reconstructed the exact scoring scheme, which was not known to us.¹ We were thus able to reproduce the first-best scores noted in the comment section of each lattice. We also corrected the reference transcription, which contained several errors.

For our experiments we assume perfect sentence segmentation, as the benefit of grammar information deteriorates heavily if the sentence boundaries are incorrect. The 30 lattices were therefore manually split at sentence boundaries and merged where a sentence crossed a lattice boundary. As a result, we obtained 107 lattices, each spanning a single sentence. Due to this manual segmentation, the word error rate dropped from 13.79% to 13.24%.

2 Linguistic Resources

We used the Head-driven Phrase Structure Grammar (HPSG, [2]) formalism to develop a precise broad-coverage grammar for German. The main grammar consists of 17 general rules, 12 rules modeling the German sentence structure and 13 construction-specific rules (relative clauses, genitive attributes, optional determiners, nominalized adjectives, etc.). The various subgrammars (expressions of date and time, spoken numbers, compound nouns and acronyms) amount to a total of 43 rules. As split numbers and compounds (e.g. ein und zwanzig, kriegs pläne) are not counted as errors in the evaluation scheme, the grammar is able to analyze such expressions. The main grammar is largely based on existing linguistic work, e.g. [3], [4], [5] and [6]. We have added a number of linguistic phenomena which we considered important, but which are often neglected in formal syntactic theories.
Among them are prenominal and postnominal genitives, expressions of quantity, expressions of date and time, forms of address and written numbers. The coverage and precision of our grammar are documented by the set of test sentences which was developed in parallel to the grammar. The test results are available at http://www.tik.ee.ethz.ch/~kaufmann/grammar/test.html.

The lexicon was created manually. A sorted word list was created from each of the 30 original lattices, and each word was precisely annotated with its syntactic features, such as agreement features and valencies. The context in which a word appeared in the reference transcription was not known to the lexicon developer. Consequently, every possible usage of each lexeme had to be entered. Multi-word lexemes posed a particular challenge, as they do not appear as units in the word lists. Important examples of German multi-word lexemes are certain adverbials (e.g. nach wie vor, zu Fuss) and verbs with separable prefixes (e.g. die Sonne geht auf).

¹ A [silence] node receives a silence penalty, but no word penalty. <s> and </s> are ignored altogether and receive no penalty at all. {breath} and {fw} receive a word penalty.
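The token-handling rules of the reconstructed scoring scheme (footnote 1) can be made explicit in code. The sketch below is our own illustration: the per-token rules are taken from the footnote, while the numeric penalty values and the function name are placeholders, not the actual LIMSI parameters.

```python
# Sketch of the reconstructed per-token penalty rules (footnote 1).
# The penalty values below are hypothetical placeholders.
SILENCE_PENALTY = -1.0
WORD_PENALTY = -0.5

def path_penalty(tokens):
    """Accumulate the penalty terms for one lattice path."""
    total = 0.0
    for tok in tokens:
        if tok == "[silence]":
            total += SILENCE_PENALTY   # silence penalty, but no word penalty
        elif tok in ("<s>", "</s>"):
            continue                   # sentence markers are ignored entirely
        else:
            total += WORD_PENALTY      # ordinary words, {breath} and {fw}
    return total
```

For example, the path `<s> [silence] {breath} hallo </s>` would accumulate one silence penalty and two word penalties under this scheme.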
3 Feature Extraction

In the feature extraction step, each of the N best hypotheses of a lattice is parsed, i.e. the parser identifies every grammatical word sequence in a given hypothesis. In addition, it determines all possible syntactic structures of each such word sequence. To this end, the parse trees are transformed into grammar-independent dependency graph representations similar to those used in the German TIGER treebank [7]. Subsequently, the probability of each dependency graph is estimated by means of a statistical model. This statistical model was designed manually and trained on the TIGER treebank.

Formally, our linguistic postprocessing results in a set of features for each recognizer hypothesis W_k = w_1, w_2, ..., w_{n_k}. A feature is a tuple (i, j, p) stating that the word sequence w_i, w_{i+1}, ..., w_j is grammatical and that its most likely syntactic structure has probability p. Note that for every word w_i, there is a feature (i, i, 1).

4 N-Best Rescoring

The k-th best recognizer hypothesis W_k is assigned a new score which takes the linguistic knowledge into account, and the best hypothesis with respect to this new score is returned as the new recognition result Ŵ:

    Ŵ = argmax_{W_k}  s_rec(W_k) · ( max_{P ∈ partition(W_k)}  ∏_{(i,j,p) ∈ P}  α·p )^λ     (1)

In this expression, s_rec is the score on the basis of which the original N-best list was computed. It includes an acoustic score, an N-gram language model score and additional correction terms. The bracketed expression is the score computed from our rule-based language model. The two scores are balanced by means of a weight λ. The function partition(W_k) returns all sequences of features spanning the hypothesis W_k = w_1, w_2, ..., w_{n_k}. Formally, ((i_1, j_1, p_1), ..., (i_m, j_m, p_m)) ∈ partition(W_k) iff i_1 = 1, j_m = n_k, and j_s + 1 = i_{s+1} for 1 ≤ s < m. The parameter α influences the number of features in the optimal partitioning.
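Although partition(W_k) contains exponentially many partitionings, the inner maximization of (1) can be computed in O(n²) time by dynamic programming over the end positions of the features. The following sketch is our own illustrative code, not the actual implementation; it assumes features are given as (i, j, p) tuples with 1-based inclusive word indices.

```python
def best_partition_score(features, n, alpha):
    """Compute the bracketed term of Eq. (1): the maximum over all
    partitionings P of the product of alpha * p over the features in P.

    features: list of (i, j, p) tuples, 1-based inclusive word indices
    n: number of words in the hypothesis
    """
    # best[j] = best achievable score for covering words 1..j
    best = [float("-inf")] * (n + 1)
    best[0] = 1.0  # empty prefix contributes a neutral factor
    for j in range(1, n + 1):
        for (i, jj, p) in features:
            if jj == j and best[i - 1] > float("-inf"):
                best[j] = max(best[j], best[i - 1] * alpha * p)
    # best[n] is always defined, since every word has a feature (i, i, 1)
    return best[n]


# Toy two-word hypothesis: two single-word features plus one spanning
# feature whose most likely structure has probability 0.5.
feats = [(1, 1, 1.0), (2, 2, 1.0), (1, 2, 0.5)]
print(best_partition_score(feats, 2, 2.0))  # 4.0  -> single-word partition wins
print(best_partition_score(feats, 2, 0.1))  # 0.05 -> spanning feature wins
```

The toy example makes the role of α concrete: each feature in a partitioning contributes one factor of α, so a large α rewards partitionings with many (single-word) features, while a small α rewards few features covering many words.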
If α is very large, partitionings into single-word features are favoured. This means that syntactic information is ignored entirely. If α is very small, however, partitionings into a few features covering many words are favoured, even if these features have a small probability. In this case, the binary information on grammaticality dominates the more fine-grained probability provided by the statistical model.

5 Experiment

Our experiments were performed on the 107 single-sentence lattices manually created from the first 30 lattices of the LIMSI data. We computed the 20 best hypotheses of each lattice. The average hypothesis length is 13.4 words, with a maximum of 31 words. For given parameters α and λ, each hypothesis was parsed in order to extract the features. Subsequently, the hypotheses were rescored and a new set of first-best solutions was produced. The parameters α and λ were optimized by means of leave-one-out cross-validation. Due to the small number of parameters and the small training set, we could apply a simple grid search.

    experiment           word error rate
    baseline             13.24%
    grammar              12.31%  (-7.0% relative)
    grammar+cheating     11.89%  (-10.2% relative)
    oracle                7.97%  (-39.8% relative)

    Table 1: The impact of our rule-based language model on the word error rate.

Table 1 compares the word error rate of the LIMSI broadcast news transcription system (baseline) to that of the system extended with a rule-based rescoring component (grammar). For comparison, the table also shows the result of the extended system with α and λ optimized on the test data (grammar+cheating), as well as the 20-best oracle word error rate (oracle). By applying the rule-based linguistic knowledge, the word error rate could be reduced by 7.0% relative. Unfortunately, this result is not statistically significant: the significance level for the Matched-Pair Sentence Segment Test [8] is 7.2%, whereas a level of 5% or lower is generally considered significant. However, we expect to achieve statistically significant results for a larger training set.

The rule-based language model corrects 25 errors and produces 12 new errors. Surprisingly, only about half of the corrected errors are due to the parser picking the correct sentence from the 20-best hypotheses. The remaining errors were corrected by preferring a better, but still incorrect, sentence. This suggests that our approach also works in the presence of ungrammatical or out-of-grammar sentences.

6 Problems

Our decision to develop a domain-independent lexicon (i.e. to include virtually all usages of a given word) leads to a large amount of ambiguity. The tendency of correct word sequences to have many readings is, apart from processing issues, not problematic for our approach.
However, the fact that many bad word sequences have some readings as well suggests that the criterion of grammaticality alone is not sufficient to distinguish between good and bad word sequences. Instead, the probabilistic model should be extended and refined. For instance:

There are many personal names and geographic names which are homophones of nouns, adjectives or verbs.² Although most of these proper names are very rare and unlikely to appear in a broadcast news context, they were entered into the lexicon. Proper names contribute considerably to ambiguity, as they can appear without a determiner. To deal with this problem, we manually disabled those proper name entries which were not known to us. This is of course unsatisfactory. In the future, we intend to use corpus linguistics (i.e. named entity extraction) to estimate the probability of each proper name for a given domain.

² For instance, about 40% of all nouns in our lexicon have an inflected form that can also be used as a personal name.
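The corpus-based estimation suggested above could start from simple relative frequencies. The sketch below is a hypothetical illustration (function name and toy corpus are ours); real named-entity extraction would of course have to distinguish name uses from homophonous common-word uses rather than merely count tokens.

```python
from collections import Counter

def name_priors(tokenized_corpus, candidate_names):
    """Estimate a domain-specific prior for each candidate proper name
    from its relative frequency in a domain corpus. This is a crude
    stand-in for full named-entity extraction."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    total = sum(counts.values())
    return {name: counts[name] / total for name in candidate_names}


# Toy domain corpus: "Koch" occurs, "Weber" does not, so the lexicon
# entry for the proper name "Weber" could be penalized or disabled.
corpus = [["Koch", "kocht", "gern"], ["der", "Koch"]]
priors = name_priors(corpus, ["Koch", "Weber"])
```

A name with a prior of (or near) zero in the target domain would then no longer need to be disabled by hand.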
As the grammar generally allows for split compound nouns, two consecutive nouns can often be analyzed as a compound noun, which leads to massive ambiguity. The performance of our rule-based language model can be expected to improve if the probability of a given compound noun (as estimated on a corpus) is taken into account.

Of course, ambiguity also has a big impact on processing efficiency. In the reported experiment, we were able to derive all possible readings for 99.9% of the parsed recognizer hypotheses, even though some of the sentences were quite long. However, processing can take rather long for some highly ambiguous sentences. It therefore seems desirable to use a stochastic HPSG (e.g. [9]), such that the most probable readings are derived first and parsing can be stopped after a fixed number of processing steps.

7 Acknowledgements

We wish to thank Jean-Luc Gauvain of LIMSI for providing us with word lattices produced by their German broadcast news transcription system.

References

[1] Kevin McTait and Martine Adda-Decker, "The 300k LIMSI German Broadcast News Transcription System," in Proceedings of ISCA Eurospeech, Geneva, September 2003.
[2] C. J. Pollard and I. A. Sag, Head-Driven Phrase Structure Grammar, University of Chicago Press, Chicago, 1994.
[3] Stefan Müller, Head-Driven Phrase Structure Grammar: Eine Einführung, Stauffenburg Verlag, to appear, 2007.
[4] Stefan Müller, Deutsche Syntax deklarativ. Head-Driven Phrase Structure Grammar für das Deutsche, Number 394 in Linguistische Arbeiten, Max Niemeyer Verlag, Tübingen, 1999.
[5] Berthold Crysmann, "Relative clause extraposition in German: An efficient and portable implementation," Research on Language and Computation, vol. 3, no. 1, pp. 61-82, 2005.
[6] Berthold Crysmann, "On the efficient implementation of German verb placement in HPSG," in Proceedings of RANLP 2003, 2003.
[7] Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith, "The TIGER treebank," in Proceedings of the Workshop on Treebanks and Linguistic Theories, Sozopol, 2002.
[8] L. Gillick and S. Cox, "Some statistical issues in the comparison of speech recognition algorithms," in ICASSP, 1989, pp. 532-535.
[9] Steven P. Abney, "Stochastic attribute-value grammars," Computational Linguistics, vol. 23, no. 4, pp. 597-618, 1997.