MINIMIZING SEARCH ERRORS DUE TO DELAYED BIGRAMS IN REAL-TIME SPEECH RECOGNITION SYSTEMS

M. Woszczyna, M. Finke

INTERACTIVE SYSTEMS LABORATORIES
at Carnegie Mellon University, USA, and University of Karlsruhe, Germany

ABSTRACT

When building applications from large vocabulary speech recognition systems, a certain amount of search error due to pruning often has to be accepted in order to obtain the required speed. In this paper we tackle the problems resulting from the aggressive pruning strategies typically applied in large vocabulary systems to achieve close to real-time performance. We consider a typical scenario of a two-pass Viterbi search with the first pass organized as a phoneme (allophone) tree. For such a tree-organized lexicon, there are two possibilities for using a bigram language model: building tree copies, or using so-called delayed bigrams. Since copying trees turns out to be too expensive for real-time applications, we focus on delayed bigrams, discuss their drastic influence on the word accuracy, and show how to alleviate the disastrous effect of delayed bigrams under aggressive pruning.

1. INTRODUCTION

Many approaches to large vocabulary speech recognition require a time-synchronous Viterbi search as a first pass, which is used either as a lookahead for an A* search or to restrict the search space for a more detailed Viterbi search. Since a large number of words in the vocabulary begin with the same initial sequence of phonemes or allophones, it is advantageous to arrange the pronunciation lexicon as a tree. Each node in the tree stands for an allophone, such that a path from the tree root to a tree leaf represents a legal allophone sequence and thus a legal word in the vocabulary.

Compared to a linear (flat) organization of the vocabulary, the tree structure causes a problem when including language models at word transitions: expanding from the end of a word w1 to the beginning of the next word is done by expanding into the tree root. But when a tree is started, all words are hypothesized, and the word identities are only known at the ends of the tree. Therefore, the transition probability p(w2|w1), which is typically a bigram language model score, cannot be computed immediately upon transition. There are two solutions to this problem: either tree copies are generated for each active word end at a given frame [2], or the bigram score is not added before a leaf of the tree is reached and thus the word identity is known (the delayed bigram approach). Since creating tree copies is often too expensive for a fast first pass of a multipass search, we focus on the benefits and problems of using delayed bigrams instead.

In this paper we investigate the effects of using delayed bigrams in combination with the real-time oriented, and thus rather aggressive, pruning conditions of our JANUS speech recognition demo system [1]. Simulations demonstrate the often disastrous effect of the delayed language model approach under these special circumstances. We also study different strategies for recovering from the additional search errors caused by using delayed bigrams.

The experiments presented in this paper are performed on two different tasks. The first set of test data is composed of 12 German sentences chosen randomly from utterances recorded with our demo system; for testing, a 35 word vocabulary and a bigram language model are used. In the demonstration the subjects speak to other people via a computer. The resulting sentences are inherently shorter and easier to recognize than sentences collected in the fully human-to-human dialog setup usually used for collecting data for the German Spontaneous Scheduling Task (GSST). However, it seemed of more practical relevance to examine the effects of pruning in a typical on-line demo situation than in a typical off-line evaluation system, where word accuracy losses are often not acceptable. The second set consists of the first 10 minutes of speech from the 1994 WSJ evaluation, with a 20,000 word vocabulary and a trigram language model. These experiments verify that the conclusions derived from the experiments on spontaneous speech with a medium vocabulary size and bigrams still hold for this completely different application.
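To make the tree-organized lexicon concrete, here is a minimal sketch that builds a phoneme prefix tree from a toy pronunciation dictionary. The TreeNode layout and the example entries are illustrative assumptions, not the JANUS data structures; a real system would use context-dependent allophones and a far larger lexicon.

```python
class TreeNode:
    def __init__(self, phoneme):
        self.phoneme = phoneme
        self.children = {}   # phoneme -> TreeNode
        self.word = None     # set where a full pronunciation ends

def build_lexicon_tree(lexicon):
    """Share common phoneme prefixes; word identities live at path ends."""
    root = TreeNode(phoneme=None)
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node.children.setdefault(ph, TreeNode(ph))
        node.word = word
    return root

lexicon = {
    "meeting": ["M", "IY", "T", "IH", "NG"],
    "meet":    ["M", "IY", "T"],
    "monday":  ["M", "AH", "N", "D", "EY"],
}
root = build_lexicon_tree(lexicon)
# "meeting" and "meet" share the path M -> IY -> T; the word identity is
# attached only where a full pronunciation ends, so a search entering the
# root cannot yet know which word it is hypothesizing. This is why the
# bigram score must be delayed (or the tree copied per predecessor).
```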

2. DELAYED BIGRAMS

In a linear as well as in a tree-organized vocabulary, delayed bigrams have two main advantages over standard (immediate) bigram language models:

- Since they are added at the transition into the last phoneme of a word (which, for a tree-organized vocabulary, is a tree leaf), they can be used even when the vocabulary is organized as a tree, without the necessity of creating tree copies.
- Most word hypotheses are pruned away before they reach the end of the word, and delayed bigrams only have to be computed for the remaining word ends. Thus, the total number of language model queries can be reduced by a factor of 10 to 20.

These benefits are paid for with two kinds of search errors: those which are inherent in the algorithm and independent of the beam size, and those which get worse when the beams are reduced to build real-time systems.

2.1. Beam independent search errors

When a path is expanded into a tree root, the best matching acoustic word end w1 is stored as the predecessor in the new path (backtrace). Later, when this path is expanded to a tree leaf, from the penultimate into the last phoneme, the bigram score is computed and the backpointer adjusted as follows: at this point the identity of the current word w2 is known. All words ending at the frame where w2 started are considered possible predecessor candidates of w2, and the candidate with the lowest total score (the accumulated score up to the end of the candidate plus the bigram penalty into the current word) becomes the predecessor of w2. However, the information about where w2 started is not modified. This assumes that the ideal starting point of a word is independent of the identity of the predecessor word. The problem is that a predecessor word which is expanded into the tree root at a different point in time might lose against the locally best path, even though its total score after adding the language model would be better. Obviously, there is no way to recover from this kind of search error by choosing a larger beamwidth; a second, linearly organized pass has to be added to the algorithm instead. Because of these beam independent search errors, the JANUS recognition engine uses the tree pass only to select likely starting points for words, and then runs a second, flat pass using standard bigram models.
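The re-selection step just described can be sketched as follows. This is a minimal sketch under assumed data structures: Path, word_ends_at, and bigram_penalty are hypothetical stand-ins for the decoder's internal bookkeeping, and all scores are penalties (negative log probabilities) to be minimized.

```python
from dataclasses import dataclass

@dataclass
class Path:
    word: str               # w2: identity becomes known only at the tree leaf
    start_frame: int        # frame at which the tree for w2 was entered
    predecessor: str = None

def apply_delayed_bigram(path, word_ends_at, bigram_penalty):
    """Re-select the predecessor of path.word once its leaf is reached.

    word_ends_at   -- frame -> list of (word, accumulated_score) word ends
    bigram_penalty -- (w1, w2) -> penalty of the transition w1 -> w2
    """
    best_score, best_pred = float("inf"), None
    # Every word that ended at the frame where w2 started is a candidate;
    # pick the one minimizing accumulated score plus bigram penalty into w2.
    for w1, acc_score in word_ends_at[path.start_frame]:
        total = acc_score + bigram_penalty(w1, path.word)
        if total < best_score:
            best_score, best_pred = total, w1
    path.predecessor = best_pred
    # path.start_frame is deliberately left unchanged; this is the source of
    # the beam-independent errors that the second, flat pass corrects.
    return best_score
```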
2.2. Beam dependent search errors

Figure 1 demonstrates how, for reasonably large beam sizes, nearly the whole search error due to using delayed bigrams in a tree can be recovered by the second pass. The four curves represent four different settings of the main beam used to prune the nodes within the tree. The data points on each of these curves represent different settings of the secondary beam that is used to prune the competing tree leaves only. (This second, leaf-related beam was introduced to control individually the number of language model requests, issued when entering a leaf node, and the number of word transitions, i.e. expansions of a leaf node into the tree root(s).) The word accuracy of the recognizer is plotted over the number of calls to the score routine, which can be used as a machine independent measure of the volume of the search space remaining after pruning.

[Figure 1 ("Pruning Errors using Delayed Bigrams"): word accuracy over the number of score computations, one curve per setting of the main beam (beam1). Search errors due to tight pruning in the tree pass.]

Figure 1 also reveals that for smaller beams the recognition performance is far from degrading gracefully. On the one hand, even if a 5% word accuracy loss due to pruning were acceptable, the number of required score computations could only be reduced by about 25%. On the other hand, to get a faster recognition engine (e.g. to achieve real-time performance), the beams have to be reduced to such an extent that virtually no recognition performance is left. The reason for this behavior is that the bigram information is added later for a delayed bigram than for a standard bigram. Therefore, words that do not match well acoustically but would later receive a good bigram score are likely to be pruned away before they reach their last phoneme.

For small beams, the pruning errors due to delayed bigrams in the first pass cannot be recovered by the second, linear pass. But for a beam1 > 5 it is possible to achieve the original evaluation-beam performance again with the output corrected by the flat pass (see Section 2.1).

3. MINIMUM UNIGRAM LOOKAHEAD

To compensate for the effect described above, the idea is to obtain an estimate of how well a branch of the tree will do, including language model information, as early as possible. We use the following minimum unigram approximation: for each node in the tree, the minimum unigram penalty over all words in the subtree is computed. This approximation is more accurate for nodes close to the tree leaves and less accurate for nodes close to the root. At each phoneme transition, the coarser estimate of the previous node is subtracted from the total score and replaced by the more accurate estimate of the next node.
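A minimal sketch of this precomputation, reusing the toy TreeNode from the sketch in Section 1; unigram_penalty is an assumed mapping from word to unigram penalty (negative log probability), not the actual JANUS tables.

```python
def attach_unigram_lookahead(node, unigram_penalty):
    """Annotate each node with the minimum unigram penalty in its subtree."""
    # A node where a pronunciation ends contributes that word's own penalty.
    best = unigram_penalty[node.word] if node.word is not None else float("inf")
    for child in node.children.values():
        best = min(best, attach_unigram_lookahead(child, unigram_penalty))
    node.lookahead = best   # exact at the leaves, loose near the root
    return best
```

At a phoneme transition from node to child, the decoder then updates the path score by child.lookahead - node.lookahead, so the accumulated score always carries exactly one, progressively tighter, lookahead term.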

[Figure 2 ("Pruning Errors on GSST with Minimum Unigram Lookahead"): word accuracy over the number of score computations for normal vs. unigram-lookahead runs at four main beam settings. Pruning errors are reduced by the minimum unigram lookahead on GSST; the error reduction also helps the second pass.]

[Figure 3 ("Pruning Errors on WSJ Task with Minimum Unigram Lookahead"): word accuracy over the number of score computations for normal vs. unigram-lookahead runs at four beam settings, with the evaluation-beam word accuracy marked as a reference ("WA for Eval"). Pruning error reduction with minimum unigram lookahead on WSJ; result after the second pass.]

Figure 2 shows that with the proposed language model lookahead in the tree pass, the word accuracy remains very stable over a large range of beams. With a word accuracy loss of about 5%, a speedup of 65% can be achieved; only at very small beams does the word accuracy drop drastically, to 20%. Figure 3 shows that the same algorithm also helps to avoid pruning errors in a demonstration system for the 20,000 word Wall Street Journal dictation task. The WSJ tests were run on the first 10 minutes of the official 1994 evaluation set.

4. MINIMUM BIGRAM LOOKAHEAD

For the plots in Figure 4 we use a slightly modified lookahead technique. Instead of the minimal unigram penalty, the lookahead score is based on minimal bigram penalties: for each word wi we select the minimal bigram penalty over all possible predecessors wj, i.e. the penalty of the best-case bigram p(wi|wj). It turns out that this kind of lookahead performs better than using no lookahead at all, but slightly worse than the minimum unigram lookahead. Part of the problem with this approach is that, close to the root of the tree, the lookahead score is always close to zero, which is comparable to the situation of having no lookahead at all.
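The per-word table of minimal bigram penalties can be precomputed as sketched below, again with a hypothetical bigram_penalty function; the resulting table simply replaces unigram_penalty in attach_unigram_lookahead() above.

```python
def min_bigram_penalties(vocabulary, bigram_penalty):
    """For each word wi, the best-case penalty over all predecessors wj."""
    return {
        wi: min(bigram_penalty(wj, wi) for wj in vocabulary)
        for wi in vocabulary
    }
```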

[Figure 4 ("Pruning Errors using Delayed Bigrams with Minimum Bigram Lookahead"): word accuracy over the number of score computations for normal vs. bigram-lookahead runs at four main beam settings. Pruning errors are reduced by the bigram lookahead; the error reduction also helps the second pass.]

5. CONCLUSIONS

In this paper we demonstrated that a speech recognition engine whose first pass is tree organized and based on delayed bigrams as the language model degrades very poorly under tight pruning. We observed a drastic influence of the delayed bigram approach on the word accuracy in a setting where aggressive pruning has to be used to achieve close to real-time performance. To alleviate the disastrous effect of delayed bigrams under these circumstances, we proposed and evaluated a new kind of language model lookahead technique which makes a speech recognition engine much more robust against search errors due to pruning.

6. ACKNOWLEDGEMENTS

Many thanks to Fil Alleva for helpful discussions and valuable insights on using delayed bigrams. This work was funded in part by grant 413-4001-01IV101S3 from the German Federal Ministry of Education, Science, Research and Technology (BMBF) as part of the VERBMOBIL project.

REFERENCES

[1] A. Waibel, M. Finke, D. Gates, M. Gavalda, T. Kemp, A. Lavie, L. Levin, M. Maier, L. Mayfield, A. McNair, I. Rogina, K. Shima, T. Sloboda, M. Woszczyna, T. Zeppenfeld, P. Zhan, "JANUS-II: Advances in Spontaneous Speech Translation," ICASSP '96.
[2] V. Steinbiss, B. H. Tran, H. Ney, "Improvements in Beam Search," ICSLP '94, Vol. 4, pp. 2143-2147.
[3] X. Aubert, H. Ney, "Large Vocabulary Continuous Speech Recognition Using Word Graphs," ICASSP '95, Vol. 1, pp. 49-52.
