Speech Recognition Lecture 6: Language Modeling Software Library

Speech Recognition Lecture 6: Language Modeling Software Library
Cyril Allauzen, Google / NYU Courant Institute, allauzen@cs.nyu.edu
Slide credit: Mehryar Mohri / Eugene Weinstein

Software Library
GRM Library (Grammar Library): a general software collection for constructing and modifying weighted automata and transducers representing grammars and statistical language models (Allauzen, Mohri, and Roark, 2005).
http://www.research.att.com/projects/mohri/grm

Software Libraries
OpenGRM Libraries: open-source libraries for constructing and using formal grammars in FST form, with OpenFst as the underlying representation.
NGram Library: create and manipulate n-gram language models encoded as weighted FSTs (Roark et al., 2012).
Thrax: compile regular expressions and context-dependent rewrite grammars into weighted FSTs (Tai, Skut, and Sproat, 2011).
http://opengrm.org

Overview
Generality: to support the representation and use of the various grammars in dynamic speech recognition.
Efficiency: to support competitive large-vocabulary dynamic recognition using automata of several hundred million states and transitions.
Reliability: to serve as a solid foundation for research in statistical language modeling.

Language Modeling Tools
Counts: automata (strings or lattices), merging.
Models: backoff or deleted-interpolation smoothing; Katz or absolute discounting; Kneser-Ney models.
Shrinking: weighted difference or relative entropy.
Class-based modeling: straightforward.

Corpus
Input: hello bye hello bye bye
Labels:
<eps> 0
hello 1
bye 2
<unknown> 3
Program:
farcompilestrings --symbols=labels.txt --keep_symbols corpus.txt > corpus.far
or, for lattices:
cat lattice1.fst ... latticen.fst > foo.far
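Before compiling, each corpus word must be mapped to its integer label from the symbol table, with out-of-vocabulary words sent to <unknown>. A minimal Python sketch of that mapping (this is an illustration of the symbol-table lookup, not farcompilestrings itself):

```python
# Symbol table from the slide above.
symbols = {"<eps>": 0, "hello": 1, "bye": 2, "<unknown>": 3}

def encode(line):
    """Map one corpus line to its integer label sequence,
    sending out-of-vocabulary words to <unknown>."""
    return [symbols.get(w, symbols["<unknown>"]) for w in line.split()]

print(encode("hello bye hello bye bye"))  # [1, 2, 1, 2, 2]
```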

This Lecture
Counting
Model creation, shrinking, and conversion
Class-based models

Counting
Weights: use fstpush to remove the initial weight and create a probabilistic automaton; counting is done from FAR files, and counts are produced in the log semiring.
Algorithm: applies to all probabilistic automata; in particular, there must be no cycles with weight zero or less.

Counting Transducers
[Figure: counting transducer for x = ab over the alphabet Σ = {a, b}; surrounding symbols map to ε and each occurrence of x maps to itself, so the input bbabaabba yields the two alignments εεabεεεεε and εεεεεabεε.]
X is an automaton representing a string or any other regular expression.

Counting
Program:
ngramcount --order=2 corpus.far > corpus.2.counts.fst
ngrammerge foo.counts.fst bar.counts.fst > foobar.counts.fst
[Figure: bigram count automaton for the corpus, with counts stored as negative-log weights, e.g. hello/-0.69315 (count 2) and bye/-1.0986 (count 3).]
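The counts in the figure are negative logs of raw occurrence counts (log semiring). A sketch of the quantities ngramcount --order=2 produces, without the FST representation (the sentence-boundary handling here is an assumption of this sketch):

```python
import math
from collections import Counter

def ngram_counts(sentences, order=2):
    """Count all n-grams up to `order`, padding each sentence with
    boundary markers, and return log-semiring weights -log(count)."""
    counts = Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {g: -math.log(c) for g, c in counts.items()}

w = ngram_counts(["hello bye hello bye bye"])
print(round(w[("hello",)], 5))  # -0.69315  (count 2)
print(round(w[("bye",)], 5))    # -1.09861  (count 3)
```

These match the hello/-0.69315 and bye/-1.0986 weights shown in the count automaton above.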

This Lecture
Counting
Model creation, shrinking, and conversion
Class-based models

Creating Back-off Model
Program:
ngrammake corpus.2.counts.fst > corpus.2.lm.fst
[Figure: back-off bigram model automaton built from the counts, with negative-log probabilities on word arcs and ε arcs carrying the back-off weights.]
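The idea behind the model ngrammake builds: seen bigrams get a discounted relative frequency, and the freed mass is redistributed over unseen continuations via the unigram distribution. A simplified backoff estimate with absolute discounting (a stand-in for illustration, not ngrammake's exact Katz computation):

```python
from collections import Counter

def make_backoff_lm(sentences, discount=0.5):
    """Build a bigram backoff model with absolute discounting.
    Returns prob(w, h) = P(w | h)."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        uni.update(words)
        bi.update(zip(words, words[1:]))
    total = sum(uni.values())
    p_uni = {w: c / total for w, c in uni.items()}

    def prob(w, h):
        if (h, w) in bi:
            # discounted relative frequency for seen bigrams
            return (bi[(h, w)] - discount) / uni[h]
        # back-off: mass freed by discounting, renormalized over
        # words not seen after h
        seen = [v for (u, v) in bi if u == h]
        alpha = discount * len(seen) / uni[h]
        denom = 1.0 - sum(p_uni[v] for v in seen)
        return alpha * p_uni[w] / denom

    return prob

prob = make_backoff_lm(["hello bye hello bye bye"])
# probabilities over the vocabulary sum to one for any history
vocab = ["<s>", "hello", "bye", "</s>"]
print(round(sum(prob(w, "bye") for w in vocab), 6))
```

The key invariant, which the real tool also maintains, is that P(· | h) remains a probability distribution after discounting and backing off.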

Shrinking Back-off Model
Program:
ngramshrink --method=relative_entropy --theta=0.02 corpus.2.lm.fst > corpus.2.s.lm.fst
[Figure: the pruned model automaton; one bigram arc has been removed and the back-off weight renormalized.]
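Relative-entropy shrinking keeps an explicit n-gram only when dropping it would change the model by more than theta in KL divergence. A sketch of the scoring criterion in the spirit of Stolcke (1998); the exact method also re-scores the renormalized back-off weight, which this simplification omits:

```python
import math

def shrink(bigram_probs, hist_probs, backoff_probs, theta=0.02):
    """Keep bigram (h, w) only if its contribution to the KL
    divergence from the backed-off model is at least theta.
    Contribution: P(h) * P(w|h) * log(P(w|h) / P_backoff(w))."""
    kept = {}
    for (h, w), p in bigram_probs.items():
        gain = hist_probs[h] * p * math.log(p / backoff_probs[w])
        if gain >= theta:
            kept[(h, w)] = p
    return kept

bigrams = {("hello", "bye"): 0.9, ("bye", "hello"): 0.3}
hist = {"hello": 0.3, "bye": 0.45}
unigrams = {"hello": 0.3, "bye": 0.45}
print(sorted(shrink(bigrams, hist, unigrams)))  # [('hello', 'bye')]
```

The bigram (bye, hello) is pruned because its probability already matches the back-off estimate, so removing it costs nothing.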

Merging/Interpolation
Program:
ngrammerge --normalize --alpha=2 --beta=3 a.lm.fst b.lm.fst > merged.fst
The resulting language models are mixed with relative importance given by --alpha and --beta, and the output LM is normalized to be a probability distribution.
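The mixture described above can be sketched as a linear interpolation with weights proportional to alpha and beta (a simplified view of the normalized merge, not ngrammerge's FST-level computation):

```python
def merge(p_a, p_b, alpha=2.0, beta=3.0):
    """Interpolate two word distributions with mixture weight
    lambda = alpha / (alpha + beta) on the first model."""
    lam = alpha / (alpha + beta)
    vocab = set(p_a) | set(p_b)
    return {w: lam * p_a.get(w, 0.0) + (1 - lam) * p_b.get(w, 0.0)
            for w in vocab}

a = {"hello": 0.5, "bye": 0.5}
b = {"hello": 0.2, "bye": 0.8}
m = merge(a, b)
print(round(m["hello"], 2), round(m["bye"], 2))  # 0.32 0.68
```

Since both inputs are distributions and the mixture weights sum to one, the output is automatically a probability distribution.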

This Lecture
Counting
Model creation, shrinking, and conversion
Class-based models

Class-Based Models
Simple class-based models: Pr[w_i | h] = Pr[w_i | C_i] Pr[C_i | h].
Methods in GRM: no special utility needed.
create a transducer mapping strings to classes.
use fstcompose to map from the word corpus to classes.
build and make the model over classes.
use fstcompose to map from classes back to words.
Generality: classes defined by weighted automata.
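The factorization Pr[w_i | h] = Pr[w_i | C_i] Pr[C_i | h] can be sketched directly in Python (the class assignments and probability values below are illustrative, not taken from the lecture's example model):

```python
def class_lm_prob(w, h, word_given_class, class_given_hist, word2class):
    """Class-based probability: P(w | h) = P(w | C(w)) * P(C(w) | h)."""
    c = word2class[w]
    return word_given_class[(w, c)] * class_given_hist[(c, h)]

# Hypothetical parameters: each word belongs to one class.
word2class = {"hello": "HELLO", "bye": "BYE"}
word_given_class = {("hello", "HELLO"): 1.0, ("bye", "BYE"): 1.0}
class_given_hist = {("BYE", "hello"): 0.75, ("HELLO", "hello"): 0.25}

print(class_lm_prob("bye", "hello", word_given_class,
                    class_given_hist, word2class))  # 0.75
```

In the GRM pipeline these two factors live in separate FSTs, and fstcompose performs exactly this multiplication (addition in the log semiring).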

Class-Based Model - Example
Example: BYE = {bye, bye bye}.
[Figure: weighted transducer mapping word strings to class sequences for BYE = {bye, bye bye}; hello maps to itself with weight 0, bye maps into the class with weight 0.693.]

Class-Based Model - Counts
[Figures: original count automaton (left) and class-based count automaton (right), where arcs over bye are replaced by arcs over the class BYE.]

Models
[Figures: the original back-off bigram model and the class-based model built over {hello, BYE}.]

Final Class-Based Model
[Figure: the final model obtained by composing the class-based model with the class-to-word transducer, expanding BYE back into bye and bye bye.]

References
Cyril Allauzen, Mehryar Mohri, and Brian Roark. Generalized Algorithms for Constructing Statistical Language Models. In Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan, July 2003.
Cyril Allauzen, Mehryar Mohri, and Brian Roark. The Design Principles and Algorithms of a Weighted Grammar Library. International Journal of Foundations of Computer Science, 16(3):403-421, 2005.
Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.
Stanley Chen and Joshua Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Harvard University, 1998.
William Gale and Kenneth W. Church. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language. Rodopi, Amsterdam.
Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40:237-264, 1953.

Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381-397, 1980.
Slava Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35:400-401, 1987.
Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 181-184, 1995.
David A. McAllester and Robert E. Schapire. On the Convergence Rate of Good-Turing Estimators. In Proceedings of the Conference on Learning Theory (COLT), pages 1-6, 2000.
Mehryar Mohri. Weighted Grammar Tools: the GRM Library. In Robustness in Language and Speech Technology, pages 165-186. Kluwer Academic Publishers, The Netherlands, 2001.
Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer Speech and Language, 8:1-38, 1994.

Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. The OpenGrm Open-Source Finite-State Grammar Software Libraries. In ACL (System Demonstrations), pages 61-66, 2012.
Terry Tai, Wojciech Skut, and Richard Sproat. Thrax: An Open Source Grammar Compiler Built on OpenFst. ASRU 2011.
Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), 1996.
Andreas Stolcke. Entropy-based pruning of back-off language models. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 270-274, 1998.
Ian H. Witten and Timothy C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085-1094, 1991.