Speech Recognition
Lecture 6: Language Modeling Software Library
Eugene Weinstein
Google, NYU Courant Institute
eugenew@cs.nyu.edu
Slide Credit: Mehryar Mohri
Software Library

GRM Library: Grammar Library. A general software collection for constructing and modifying weighted automata and transducers representing grammars and statistical language models (Allauzen, Mohri, and Roark, 2005).
http://www.research.att.com/projects/mohri/grm
Software Libraries

OpenGRM Libraries: open-source libraries for constructing and using formal grammars in FST form, using OpenFst as the underlying representation.
NGram Library: create and manipulate n-gram language models encoded as weighted FSTs (Roark et al., 2012).
Thrax: compile regular expressions and context-dependent rewrite grammars into weighted FSTs (Tai, Skut, and Sproat, 2011).
http://opengrm.org
Overview

Generality: to support the representation and use of various grammars in dynamic speech recognition.
Efficiency: to support competitive large-vocabulary dynamic recognition using automata of several hundred million states and transitions.
Reliability: to serve as a solid foundation for research in statistical language modeling.
Language Modeling Tools

Counts: automata (strings or lattices), merging.
Models: backoff or deleted interpolation smoothing; Katz or absolute discounting; Kneser-Ney models.
Shrinking: weighted difference or relative entropy.
Class-based modeling: straightforward.
Corpus

Input (corpus):
hello.
bye.
hello.
bye bye.

Labels (symbol table):
<s> 1
</s> 2
<unknown> 3
hello 4
bye 5

Program:
farcompilestrings --symbols=labels.txt corpus.txt > corpus.far
or
cat lattice1.fst ... latticen.fst > foo.far
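A minimal sketch of this step. The file layout and here-docs below are mine, not the slides': the slide's trailing periods are read as sentence delimiters rather than tokens (the symbol table has no "." entry), an <epsilon> entry at id 0 is added since OpenFst symbol tables conventionally reserve it, and newer OpenFst versions may additionally require --token_type=symbol.

# Write the corpus, one sentence per line.
cat > corpus.txt <<EOF
hello
bye
hello
bye bye
EOF

# Write the symbol table from the slide (plus the conventional epsilon).
cat > labels.txt <<EOF
<epsilon> 0
<s> 1
</s> 2
<unknown> 3
hello 4
bye 5
EOF

# Compile each sentence into an acceptor, collected in a FAR archive.
farcompilestrings --symbols=labels.txt corpus.txt > corpus.far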
This Lecture

Counting
Model creation, shrinking, and conversion
Class-based models
Counting

Weights: use fstpush to remove the initial weight and create a probabilistic automaton; counting is done from FAR files; counts are produced in the log semiring.
Algorithm: applies to all probabilistic automata; in particular, there must be no cycles with (log) weight zero or less, i.e., no cycles of probability one or greater.
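For the weight-pushing step, a sketch with a hypothetical input lattice.fst (fstpush is a standard OpenFst binary; only the two flags shown here are assumed):

# Push weights toward the initial state and strip the total weight, so the
# lattice becomes a probabilistic automaton suitable for expected counting.
fstpush --push_weights --remove_total_weight lattice.fst > lattice.prob.fst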
Counting Transducers

[Figure: a counting transducer with self-loops a:ε/1 and b:ε/1 on the initial and final states, and a path X:X/1 from state 0 to final state 1/1. X is an automaton representing a string or any other regular expression; alphabet Σ = {a, b}.]

Example: for x = ab, applying the transducer to the input bbabaabba yields one output per occurrence of ab: εεabεεεεε and εεεεεabεε.
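A sketch of this transducer for x = ab in OpenFst text format. The symbol ids, file names, and three-state layout are my reading of the figure; weights are omitted, which defaults each arc to the semiring's unity, matching the /1 arcs in the figure.

# Shared input/output alphabet.
cat > sym.txt <<EOF
<eps> 0
a 1
b 2
EOF

# Self-loops erase surrounding context; the a:a b:b path marks one
# occurrence of ab. Fields: src dst input output.
cat > count_ab.txt <<EOF
0 0 a <eps>
0 0 b <eps>
0 1 a a
1 2 b b
2 2 a <eps>
2 2 b <eps>
2
EOF
fstcompile --isymbols=sym.txt --osymbols=sym.txt count_ab.txt > count_ab.fst

# Composing count_ab.fst with a probabilistic input automaton and summing
# path weights (shortest distance in the log semiring) gives the expected
# count of ab, per the counting algorithm described above.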
Counting

Program:
ngramcount --order=2 corpus.far > corpus.2.counts.fst
ngrammerge foo.counts.fst bar.counts.fst > foobar.counts.fst

[Figure: the bigram count automaton for the example corpus, with arcs <s>/4, hello/2, bye/2, bye/3, bye/1, and </s> exit arcs (</s>/4, </s>/2, </s>/2) leaving each history state.]
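The resulting counts can be inspected with ngramprint; per the OpenGrm NGram documentation, --integers prints raw counts rather than negative logs:

# Print the n-grams with integer counts; on the example corpus the bigram
# "bye bye" should appear with count 1 and "<s> hello" with count 2.
ngramprint --integers corpus.2.counts.fst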
This Lecture

Counting
Model creation, shrinking, and conversion
Class-based models
Creating Back-off Model

Program:
ngrammake corpus.2.counts.fst > corpus.2.lm.fst

[Figure: the resulting bigram back-off model as a weighted automaton, with word arcs (e.g. hello/0.698, hello/1.504, bye/0.698, bye/1.098, bye/1.108), </s> exit arcs (e.g. </s>/0.410, </s>/0.810, </s>/0.005), and ε back-off arcs of weight 3.500, 4.481, and 4.704 into the unigram state.]
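ngrammake's default is Katz back-off; the other smoothing methods listed on the tools slide can be selected with --method (method names here follow the OpenGrm NGram documentation):

# Katz back-off (the default), Kneser-Ney, or absolute discounting:
ngrammake --method=katz corpus.2.counts.fst > corpus.2.katz.lm.fst
ngrammake --method=kneser_ney corpus.2.counts.fst > corpus.2.kn.lm.fst
ngrammake --method=absolute corpus.2.counts.fst > corpus.2.abs.lm.fst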
Shrinking Back-off Model

Program:
ngramshrink --method=relative_entropy --theta=0.2 foo.2.lm.fst > foo.2.s.lm.fst

[Figure: the shrunken model; pruning at theta = 0.2 has removed one bigram state relative to the model above, leaving arcs hello/0.698, hello/1.504, bye/0.698, bye/1.098, </s> exits, and ε back-off arcs of weight 4.481 and 4.704.]
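Relative-entropy pruning is Stolcke's criterion; the weighted-difference criterion of Seymore and Rosenfeld (both cited in the references) is also available, with the threshold below chosen arbitrarily:

# Weighted-difference (Seymore-Rosenfeld) pruning at an arbitrary threshold.
ngramshrink --method=seymore --theta=4 foo.2.lm.fst > foo.2.sey.lm.fst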
Back-off Smoothing

Definition: for a bigram model,

$$\Pr[w_i \mid w_{i-1}] =
\begin{cases}
d_{c(w_{i-1} w_i)} \, \dfrac{c(w_{i-1} w_i)}{c(w_{i-1})} & \text{if } k = c(w_{i-1} w_i) > 0; \\[4pt]
\alpha \, \Pr[w_i] & \text{otherwise;}
\end{cases}$$

where

$$d_k =
\begin{cases}
1 & \text{if } k > 5; \\[2pt]
\dfrac{(k+1)\, n_{k+1}}{k\, n_k} & \text{otherwise,}
\end{cases}$$

and $n_k$ is the number of bigrams occurring exactly $k$ times.
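To make the discount concrete, a worked instance with invented counts: suppose $n_1 = 100$ bigrams occur exactly once and $n_2 = 40$ occur exactly twice. Then a bigram seen once is discounted by

$$d_1 = \frac{(1+1)\, n_2}{1 \cdot n_1} = \frac{2 \cdot 40}{100} = 0.8,$$

so its maximum-likelihood estimate $c(w_{i-1} w_i)/c(w_{i-1})$ is scaled by 0.8, and the probability mass freed this way funds the back-off term $\alpha \Pr[w_i]$.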
Merging/Interpolation

Program:
ngrammerge --normalize --alpha=2 --beta=3 a.lm.fst b.lm.fst > merged.fst

The two input models are mixed with relative weights given by --alpha and --beta; --normalize renormalizes the result so that the output LM is a probability distribution.
This Lecture

Counting
Model creation, shrinking, and conversion
Class-based models
Class-Based Models

Simple class-based models: $\Pr[w_i \mid h] = \Pr[w_i \mid C_i]\,\Pr[C_i \mid h]$.

Methods in GRM: no special utility is needed (see the sketch below):
create a transducer mapping strings to classes;
use fstcompose to map from the word corpus to classes;
build and make the model over classes;
use fstcompose to map from classes back to words.

Generality: classes can be defined by weighted automata.
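A minimal sketch of that recipe with stock OpenFst/OpenGrm tools. The file names (utt*.fst, word2class.fst) and the per-utterance loop are assumptions, not the lecture's exact procedure; fstproject's flag spelling varies across OpenFst versions (older ones use the boolean --project_output), and composition may require fstarcsort-ed inputs in practice.

# Map each word-level utterance acceptor to class sequences and collect
# them into a FAR archive (concatenation, as on the corpus slide).
for f in utt*.fst; do
  fstcompose "$f" word2class.fst | \
    fstproject --project_type=output > "class_$f"
done
cat class_utt*.fst > classes.far

# Count and smooth over class sequences.
ngramcount --order=2 classes.far > classes.counts.fst
ngrammake classes.counts.fst > classes.lm.fst

# Map back to words: compose the word-to-class transducer with the class
# LM and keep the input (word) side; fstproject defaults to the input side.
fstcompose word2class.fst classes.lm.fst | fstproject > words.lm.fst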
Class-Based Model - Example

Example: BYE = {bye, bye bye}.

[Figure: a two-state transducer (both states final) mapping strings to classes: hello is passed through unchanged (hello:hello/0), while bye and bye bye are mapped to the class label BYE, with weight 0.693 = -log(1/2) splitting mass between the two members of the class.]
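That mapping transducer can be written down in OpenFst text format. The layout below is one plausible reading of the figure; the symbol files and the second bye arc are assumptions.

cat > wsyms.txt <<EOF
<eps> 0
hello 1
bye 2
EOF
cat > csyms.txt <<EOF
<eps> 0
hello 1
BYE 2
EOF

# Fields: src dst input output [weight]; a lone state number marks a final
# state. The first bye emits the class label BYE with weight 0.693; a
# following bye is absorbed silently, so both "bye" and "bye bye" map to BYE.
cat > word2class.txt <<EOF
0 0 hello hello
0 1 bye BYE 0.693
1 1 bye <eps>
1 0 hello hello
0
1
EOF
fstcompile --isymbols=wsyms.txt --osymbols=csyms.txt word2class.txt > word2class.fst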
Class-Based Model - Counts

[Figure, left: the original bigram counts, with arcs <s>/4, hello/2, bye/2, bye/3, bye/1, and </s> exits. Right: the class-based counts, identical except that the bye arcs are replaced by BYE/2 arcs. Captions: Original counts. Class-based counts.]
Models

[Figure, left: the original word model from the earlier slide. Right: the class-based model, with class arcs BYE/0.698 and BYE/1.386, word arcs hello/0.698 and hello/1.386, </s> exit arcs, and ε back-off arcs of weight 4.605. Captions: original model; class-based model.]
Final Class-Based Model

[Figure: the class-based model mapped back to words: a six-state automaton with bye arcs (bye/0, bye/1.391, bye/2.079), hello arcs (hello/0.698, hello/1.386), </s> exit arcs, and ε back-off arcs of weight 4.605.]
References

Cyril Allauzen, Mehryar Mohri, and Brian Roark. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan, July 2003.

Cyril Allauzen, Mehryar Mohri, and Brian Roark. The design principles and algorithms of a weighted grammar library. International Journal of Foundations of Computer Science, 16(3):403-421, 2005.

Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.

Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.

William Gale and Kenneth W. Church. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language. Rodopi, Amsterdam.

Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40:237-264, 1953.

Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381-397, 1980.

Slava Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35:400-401, 1987.

Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 181-184, 1995.

David A. McAllester and Robert E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Conference on Learning Theory (COLT), pages 1-6, 2000.

Mehryar Mohri. Weighted grammar tools: the GRM library. In Robustness in Language and Speech Technology, pages 165-186. Kluwer Academic Publishers, The Netherlands, 2001.

Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer Speech and Language, 8:1-38, 1994.

Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61-66, 2012.

Terry Tai, Wojciech Skut, and Richard Sproat. Thrax: an open source grammar compiler built on OpenFst. In ASRU, 2011.

Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), 1996.

Andreas Stolcke. Entropy-based pruning of back-off language models. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 270-274, 1998.

Ian H. Witten and Timothy C. Bell. The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085-1094, 1991.