Automatic Speech Recognition: From Theory to Practice

Automatic Speech Recognition: From Theory to Practice
http://www.cis.hut.fi/opinnot//
October 25, 2004
Prof. Bryan Pellom
Department of Computer Science
Center for Spoken Language Research
University of Colorado
pellom@cslr.colorado.edu

References for Today's Material

- S.J. Young, N.H. Russell, J.H.S. Thornton, "Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems," Technical Report TR-38, Cambridge University Engineering Dept., July 1989.
- X. Huang, A. Acero, H. Hon, Spoken Language Processing, Prentice Hall, 2001 (chapters 12 and 13).
- L.R. Rabiner & B.W. Juang, Fundamentals of Speech Recognition, Prentice-Hall, ISBN 0-13-015157-2, 1993 (see chapters 7 and 8).

Search

The goal of ASR search is to find the most likely string of symbols (e.g., words) to account for the observed speech waveform:

    Ŵ = argmax_W P(O | W) P(W)

Types of input:
- Isolated words
- Connected words

Designing an Isolated-Word HMM

Whole-word model:
- Collect many examples of the word spoken in isolation
- Assign the number of HMM states based on word duration
- Estimate HMM model parameters using the iterative Forward-Backward algorithm

Subword-unit model:
- Collect a large corpus of speech and estimate phonetic-unit HMMs (e.g., decision-tree state-clustered triphones)
- Construct word-level HMMs from phoneme-level HMMs
- More general than the whole-word approach

Whole-Word HMM

[Figure: M example utterances of the word "one" (O_1 of length T_1, O_2 of length T_2, ..., O_M of length T_M) are used to train a single whole-word HMM for "one".]

Computing the Log-Probability of a Model (Viterbi Algorithm)

Log-domain Viterbi recursion (the tilde "~" marks log-scale quantities, as in the slides):

    δ~_t(j) = max_{1 ≤ i ≤ N} [ δ~_{t-1}(i) + a~_ij ] + b~_j(o_t)

At the final frame, the best final-state score gives the log joint probability of the observations and the most likely state sequence, e.g. δ~_T(4) = P~(O, q | λ) when state 4 is the final state.

[Figure: trellis over t = 0, 1, 2, ..., T from the initial to the final state; cells outside the reachable region are marked invalid.]
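
The following is a minimal Python sketch of the log-domain Viterbi recursion above, assuming the log initial probabilities, log transition matrix, and per-frame log observation scores are already available as NumPy arrays; the names log_pi, log_A, and log_B are illustrative, not from the lecture.

```python
import numpy as np

def log_viterbi(log_pi, log_A, log_B):
    """Log-domain Viterbi.
    log_pi[i]  : log initial probability of state i
    log_A[i,j] : log transition probability a_ij
    log_B[t,j] : log observation score b_j(o_t)
    Returns (best log path score, most likely state sequence)."""
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)   # delta[t, j] = best partial log score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)  # back-pointers for recovering the state sequence
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] + log_B[t, j]
    # Back-trace the most likely state sequence from the best final state
    path = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return float(delta[T - 1].max()), path[::-1]
```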

Isolated Word Recognition

[Figure: block diagram — speech → Speech Detection → Feature Extraction → observation sequence O → each word model W_1 ... W_N scores P(O, q | W_i) P(W_i) → Pick Max.]

P(O | W) is computed using the Viterbi algorithm rather than the Forward algorithm. Viterbi provides the probability of the path represented by the most likely state sequence, which simplifies our recognizer.
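
A hedged sketch of the "Pick Max" block above: score the utterance against every word HMM with the log-Viterbi routine sketched earlier and keep the word maximizing log P(O, q | W) + log P(W). The dictionaries word_models, word_priors, and log_B_for are illustrative names, not from the lecture.

```python
import math

def recognize_isolated_word(word_models, word_priors, log_B_for):
    """word_models: {word: (log_pi, log_A)}; word_priors: {word: P(W)};
    log_B_for: {word: log observation matrix for this utterance against that word's HMM}."""
    best_word, best_score = None, -math.inf
    for word, (log_pi, log_A) in word_models.items():
        score, _ = log_viterbi(log_pi, log_A, log_B_for[word])  # log P(O, q | W)
        score += math.log(word_priors[word])                    # add log P(W)
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score
```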

Connected-Word (Continuous) Speech Recognition

- Utterance boundaries are unknown
- The number of words spoken in the audio is unknown
- The exact positions of word boundaries are often unclear and difficult to determine
- We cannot exhaustively search all possibilities (with M vocabulary words and an utterance V words long, there are M^V possible word sequences)

Simple Connected-Word Example

Consider this hypothetical network consisting of two words:

[Figure: a looped network of two word models, W_1 entered with language-model score P(W_1) and W_2 entered with P(W_2).]

Connected-Word Log-Viterbi Search

Remember that at each node we must compute:

    δ~_t(j) = max_{1 ≤ i ≤ N} [ δ~_{t-1}(i) + a~_ij + β~_ij ] + b~_j(o_t)

where β~_ij is the (log) language-model score:

    β~_ij = s·P~(W_k) + p   if i is the last state of any word and j is the initial state of the k-th word
    β~_ij = 0               otherwise

Recall that s is the grammar-scale factor and p is a log-scale word-transition penalty.

Connected-Word Log-Viterbi Search

Remember that at each node we must also compute:

    ψ_t(j) = argmax_{1 ≤ i ≤ N} [ δ~_{t-1}(i) + a~_ij + β~_ij ]

This allows us to back-trace to discover the most probable state sequence. Words and word boundaries are found during the back-trace: going backwards, we look for state transitions from state 0 of a word into the last state of another word.

Connected-Word Viterbi Search

[Figure: trellis over t = 0 ... 5 for the two-word network; word-entry transitions carry the language-model score P(W_k); initial, final, and invalid regions are marked.]

Viterbi with Beam-Pruning

Idea: prune away low-scoring paths. At each time t, determine the log-probability of the absolute best Viterbi path:

    δ~max_t = max_{1 ≤ i ≤ N} [ δ~_t(i) ]

Prune away paths that fall below a pre-determined beam width (BW) from the most probable path, i.e. deactivate state j if:

    δ~_t(j) < δ~max_t − BW
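
A small sketch of the beam test above, assuming the active Viterbi scores for the current frame are held in a dictionary mapping state to log score; the names are illustrative.

```python
def beam_prune(delta_t, beam_width):
    """Deactivate states whose log score falls more than beam_width below the frame maximum."""
    best = max(delta_t.values())
    return {state: score for state, score in delta_t.items()
            if score >= best - beam_width}

# Example: beam_prune({0: -10.2, 1: -35.7, 2: -11.0}, beam_width=20.0) keeps states 0 and 2
```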

Hypothetical Beam Search

[Figure: the same two-word trellis over t = 0 ... 5 with pruned states marked; word-entry transitions carry P(W_k); initial, final, and invalid regions indicated.]

Issues with the Trellis Search

- Important note: the language model is applied at the point where we transition into a word
- As the number of words increases, so do the number of states and interconnections
- Beam search improves efficiency, but it is still difficult to evaluate the entire search space
- It is not easy to incorporate word histories (e.g., n-gram models) into such a framework
- It is not easy to account for between-word acoustics

The Token Passing Model

- Proposed by Young et al. (1989)
- Provides a conceptually appealing framework for connected-word speech recognition search
- Allows arbitrarily complex networks to be constructed and searched
- Efficiently allows n-gram language models to be applied during search

Token Passing Approach

- Assume each HMM state can hold one or more movable tokens
- Think of a token as an object that can move from state to state in our network
- For now, assume each token carries with it the (log-scale) Viterbi path cost s~

Token Passing Idea

- At each time t, we examine the tokens assigned to nodes in the network
- Tokens are propagated to reachable network positions at time t+1: make a copy of the token and adjust its path score to account for the HMM transition and observation probabilities
- Tokens are merged according to the Viterbi algorithm: select the token with the best path (maximum score) and discard all other competing tokens

Token Passing Algorithm

Initialization (t = 0):
- Initialize each initial state to hold a token with score s~ = 0
- All other states are initialized with a token of score s~ = −∞

Algorithm (t > 0):
- Propagate tokens to all possible next states
- Prune tokens whose path scores fall below a search beam

Termination (t = T):
- Examine the tokens in all possible final states
- Find the token with the largest Viterbi path score; this is the probability of the most likely state alignment

Token Propagation (Without Language Model)

for t := 1 to T
    foreach state i do
        Pass a copy of the token in state i to all connecting states j, incrementing its score:
            s~ = s~ + a~_ij + b~_j(o_t)
    end
    foreach state i do
        Find the token in state i with the largest s~ and discard the rest of the tokens in state i  (Viterbi search)
    end
end
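
A hedged Python rendering of the propagation loop above. It assumes tokens are kept in a per-state dictionary, the network is given as a list of (i, j, log_a_ij) arcs, and log_b(j, o_t) returns the log observation score; all of these names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Token:
    score: float   # log-scale Viterbi path score s~

def propagate_and_merge(tokens, arcs, log_b, o_t):
    """tokens: {state: Token} at time t-1; arcs: list of (i, j, log_a_ij).
    Returns the merged {state: Token} for time t (one best token per state)."""
    new_tokens = {}
    for i, j, log_a_ij in arcs:
        if i not in tokens:
            continue
        # Copy the token and adjust its score for the transition and observation
        s = tokens[i].score + log_a_ij + log_b(j, o_t)
        # Viterbi merge: keep only the highest-scoring token arriving in state j
        if j not in new_tokens or s > new_tokens[j].score:
            new_tokens[j] = Token(score=s)
    return new_tokens
```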

Token Propagation Example

[Figure: two states i and j with self-loops a_ii, a_jj and forward transition a_ij, shown at time t−1 and time t, with tokens holding scores s~_{t−1}(i), s~_{t−1}(j), s~_t(i), s~_t(j).]

    s~_t(j) = max [ s~_{t−1}(i) + a~_ij + b~_j(o_t),    (forward-transition token)
                    s~_{t−1}(j) + a~_jj + b~_j(o_t) ]    (self-loop transition token)

Token Passing Model for Connected Word Recognition

- Individual word models are connected together into a looped composite model: we can transition from the final state of word i to the initial state of word j
- Path scores are maintained by tokens; the language-model score is added to the path when transitioning between words
- The path through the network is also maintained by tokens, allowing us to recover the best word sequence

Connected Word Example (with Token Passing)

[Figure: looped network of word models W_1 and W_2, with tokens propagating around the loop.]

- Tokens emitted from the last state of each word propagate to the initial state of each word
- The language-model score is added to the path score upon word entry:  s~ = s~ + g·P~(W_k) + p

Maintaining Path Information

- The previous example assumes a unigram language model: knowledge of the previous word is not maintained by the tokens
- For connected word recognition, we don't care much about the underlying state sequence within each word model; we care about transitions between words and when they occur
- We must therefore augment the token structure with a path identifier and a path score

Word-Link Record

The path identifier points to a record (data structure) containing word-boundary information.

A Word-Link Record (WLR) is a data structure created each time a token exits a word. It contains:
- Word identifier (e.g., "hello")
- Word end frame (e.g., time = t)
- Viterbi path score at time t
- Pointer to the previous WLR

Fields: word_id, end_frame, path_score_s, previous_wlr
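
The WLR described above maps naturally onto a small record type; here is a sketch as a Python dataclass, with field names taken from the slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WordLinkRecord:
    word_id: str                                # e.g., "hello"
    end_frame: int                              # frame at which the token exited the word
    path_score_s: float                         # Viterbi path score s~ at that frame
    previous_wlr: Optional["WordLinkRecord"]    # pointer to the preceding record (None at utterance start)
```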

Word-Link Record

WLRs link together to provide the search outcome:

    word_id:    this    is      it's    a       test
    end_frame:  50      76      76      126     181
    score_s:    -1500   -2200   -2410   -2200   -2200
    prev_wlr:   (NULL)  (remaining records point back to earlier WLRs)

"is" begins at frame 50 (0.50 sec) and ends at frame 76 (0.76 sec); the total path cost for the word is -700 (= -2200 − (-1500)). "this" begins at frame 0 and ends at frame 50.

Illustration of WLR Generation

[Figure from Young et al., 1989: illustration of word-link record generation during token passing.]

WLRs as a Word-History Provider

- Each propagating token contains a pointer to a word-link record
- Tracing back provides the word history

[Figure: a token points to the WLR for word w_n, which points back to the WLR for w_{n-1}, which points back to the WLR for w_{n-2}; each record holds word_id, end_frame, path_score_s, prev_wlr.]

Incorporating N-gram Language Models During Token Passing Search

- When a token exits a word and is about to propagate into a new word, we can augment the token's path cost with the LM score
- Upon exit, each token contains a pointer to a word-link record, so we can obtain the previous word(s) from the WLR chain
- Therefore, update the path with:

    s~ = s~ + g·P~(W_n | W_{n-1}, W_{n-2}) + p
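
A sketch of the word-exit update above. It assumes the token carries a pointer to its most recent WLR (as described on the previous slides), that lm_logprob(w, w1, w2) returns log P(w | w1, w2), and that g and p are the grammar-scale factor and word-insertion penalty; these names are illustrative.

```python
def apply_trigram_on_word_exit(token, next_word, lm_logprob, g, p, start_symbol="<s>"):
    """Update s~ = s~ + g * log P(W_n | W_{n-1}, W_{n-2}) + p using the history read off the WLR chain."""
    w_prev1 = token.wlr.word_id if token.wlr else start_symbol
    w_prev2 = (token.wlr.previous_wlr.word_id
               if token.wlr and token.wlr.previous_wlr else start_symbol)
    token.score += g * lm_logprob(next_word, w_prev1, w_prev2) + p
    return token
```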

Word-Link Records & Lattices

- Word-link records encode the possible word sequences seen during search
- Words can overlap in time
- Words can have different path scores
- We can generate a lattice of word confusions from the WLRs

Lattice Representation

[Figure: word lattice for the utterance "take fidelity's case as an example".]

Recovering the Best Word String

- Scan through the word-link records created at the final time T and find the WLR corresponding to the word with the best path score s
- Follow the link from the current WLR to the previous WLR and extract the word identity
- Repeat until the current WLR does not point to any previous WLR (null)
- Reverse the decoded word sequence
- Word begin/end times are determined from the WLR sequence
- Word scores are determined by taking the difference between successive path scores
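
A sketch of this backtrace, reusing the WordLinkRecord sketch from earlier: start from the best-scoring WLR at the final frame, follow previous_wlr pointers, then reverse.

```python
def recover_best_word_string(final_wlrs):
    """final_wlrs: the WLRs created at the final frame T.
    Returns (words, end_frames) for the best-scoring hypothesis."""
    best = max(final_wlrs, key=lambda r: r.path_score_s)
    words, end_frames = [], []
    wlr = best
    while wlr is not None:
        words.append(wlr.word_id)
        end_frames.append(wlr.end_frame)
        wlr = wlr.previous_wlr
    # The chain is walked end-to-start, so reverse the decoded sequence
    return words[::-1], end_frames[::-1]
```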

Token Passing Search Issues

- How do we correctly apply a language model that may depend on multiple previous words?
- How do we prune away tokens that represent unpromising paths?
- How can we implement cross-word acoustic models in the token passing search?

Language Modeling & Token Passing

- Tokens entering a particular state are merged by keeping the token with the maximum partial path score s~ (Viterbi path assumption)
- When N-gram language models are used, tokens should be merged only if they have the same word histories
- Trigram LM: given a token in a state of word n, pick the maximum over all competing tokens that share the same two previous words

Implications

- Tokens represent partial paths with unique word histories, so tokens must be propagated and merged carefully
- Each HMM state may have multiple tokens assigned to it at any given time
- Each assigned token should represent a unique word history

(Practically Speaking)

For a trigram language model:
- Unpruned tokens are merged only when they share the same two-word history (one surviving token per unique two-word history)
- This results in many tokens assigned to each network state
- It makes propagation of tokens very costly (slow decoding)

Bigram approximation:
- Merge tokens that share the same one-word (previous-word) history
- Negligible loss in accuracy for English
- Implemented in CSLR SONIC, CMU Sphinx-II, and other recognizers as well
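
A sketch of history-based merging within one state, under the assumption that each token carries a score and a WLR pointer as above: history_length=1 gives the bigram approximation, history_length=2 the full trigram-style merge. The function name and structure are illustrative.

```python
def merge_by_word_history(tokens_in_state, history_length=1):
    """Keep the best-scoring token per distinct word history of the given length."""
    survivors = {}
    for tok in tokens_in_state:
        history, wlr = [], tok.wlr
        while wlr is not None and len(history) < history_length:
            history.append(wlr.word_id)
            wlr = wlr.previous_wlr
        key = tuple(history)
        if key not in survivors or tok.score > survivors[key].score:
            survivors[key] = tok
    return list(survivors.values())
```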

Pruning & Efficiency

- The number of tokens in the network increases as the frame count t increases
- Maintaining tokens with unique word histories makes the problem worse
- Beam pruning is a useful mechanism for controlling the number of tokens (partial paths) being explored at any given time

Beam Pruning for Token Passing

- Find the token with the maximum partial path log-score s~max at time t
- Prune away tokens whose score falls below a threshold, i.e., prune if

    s~ < (s~max − BW)

  where BW is the preset beam width, BW > 0

Example Types of Beams

- Global beam: overall best token − BW_g
- Word beam: best token in the word − BW_w
- Phone beam: best token in the phone − BW_p
- State beam: best token within the state − BW_s

Histogram Pruning

- For each frame, keep the top N tokens (based on path score) propagated through the search network
- N = 10k to 40k tokens (depends on vocabulary size)
- A smaller N means fewer tokens and faster search, but possibly more word errors due to accidental pruning of the correct path
- Reduces the peak memory required by the decoder to store tokens
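
A sketch of histogram pruning as described above: rank all active tokens by path score and keep at most N per frame (the default value here is only illustrative). In practice this is applied together with beam pruning.

```python
def histogram_prune(active_tokens, max_tokens=20000):
    """Keep at most max_tokens tokens for this frame, ranked by path score."""
    if len(active_tokens) <= max_tokens:
        return active_tokens
    return sorted(active_tokens, key=lambda t: t.score, reverse=True)[:max_tokens]
```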

Active Tokens Per Frame (WSJ 5k Vocabulary)

[Figure: histogram of the number of active tokens per frame (x-axis: thousands of active tokens, y-axis: frequency), with the pruning region marked. Histogram from Julian Odell's PhD thesis, Cambridge University.]

Typical Token Passing Search Loop

[Figure: per-frame loop from t−1 to t — propagate & merge tokens, then prune tokens, emitting WLRs (a raw lattice) along the way.]

Cross-Word Modeling

How do we incorporate between-word context dependency within the search?

    BRYAN PELLOM:  ?-B+R  B-R+AY  R-AY+AX  AY-AX+N  AX-N+P   N-P+EH   P-EH+L   EH-L+AX  L-AX+M  AX-M+?
    BRYAN GEORGE:  ?-B+R  B-R+AY  R-AY+AX  AY-AX+N  AX-N+JH  N-JH+AO  JH-AO+R  AO-R+JH  R-JH+?

Linear (Flat) Lexicon Search

[Figure: flat network containing the words BRYAN, PELLOM, and GEORGE. Green = variable left-context (word-entry); red = variable right-context (word-exit).]

Right-Context Fan-out

- The right context of the last base phone of each word is the first base phone of the next word
- It is impossible to know the next word in advance of the search; there can be several possible next words
- Solution: model the last phone of each word using a parallel set of triphone models, one for each possible phonetic right-context

Illustration of Right-Context Fan-out

[Figure: right-context fan-out illustration from the CMU Sphinx-II recognizer.]

Left-Context Fan-Out

- The phonetic left context for the first phone position in a word is the last base phone of the previous word
- During search there is no unique predecessor word
- We could fan out at the initial phone just as in the right-context fan-out; however, word-initial states are evaluated quite often, so some recognizers do suboptimal things
- CMU Sphinx-II performs left-context inheritance: dynamically inherit the left context from the competing word with the highest partial path score

Lexical Prefix Tree Search

As vocabulary size increases:
- The number of states needed to represent the flat search network increases linearly
- The number of cross-word transitions increases rapidly
- The number of language-model calculations (required at word boundaries) increases rapidly

Solution: convert the linear search network into a prefix tree.

Lexical Prefix Tree

[Figure, adapted from Huang et al., Spoken Language Processing, Prentice Hall: a prefix tree for BAKE, BAKED, BAKING, BAKER, and BAKERY built from context-dependent phone nodes, all sharing the root B(?,EY). BAKE ends via EY(B,KD) → KD(EY,?); BAKED via KD(EY,TD) → TD(KD,?); BAKING via EY(B,K) → K(EY,IX) → IX(K,NG) → NG(IX,?); BAKER via K(EY,AXR) → AXR(K,?); BAKERY via AXR(K,IY) → IY(AXR,?).]

Leaf Node Construction

- Leaf nodes should ideally have a unique word identity, which allows efficient application of the language model
- This handles cases such as a word being the prefix of another word (e.g., "stop", "stops") and homophones such as "two" and "to"

Leaf Node Construction

[Figure: prefix tree fragments. TO and TOO share the path T(?,UW) → UW(T,?) but end in separate leaves; STOP and STOPS share S(?,T) → T(S,AA) → AA(T,P), with STOP ending in P(AA,?) and STOPS continuing through P(AA,S) → S(P,?).]
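
A minimal sketch of building such a lexical prefix tree from a pronunciation lexicon: words sharing a phone prefix share nodes, and word identities are attached at the end of their phone sequence so the language model can be applied once the word is unique. The toy lexicon and the nested-dictionary representation are illustrative; in this simplified form, homophones such as "to"/"too" end at the same node and are kept apart only by the word list stored there, whereas the slide's tree keeps a distinct leaf per word.

```python
def build_prefix_tree(lexicon):
    """lexicon: {word: [phones]}. Returns a nested dict {phone: subtree};
    word identities are stored at their end node under the reserved key '<words>'."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})   # share nodes for common phone prefixes
        node.setdefault("<words>", []).append(word)
    return root

# Toy lexicon (phone sequences simplified for illustration)
tree = build_prefix_tree({
    "to":    ["T", "UW"],
    "too":   ["T", "UW"],                    # homophone of "to"
    "stop":  ["S", "T", "AA", "P"],
    "stops": ["S", "T", "AA", "P", "S"],     # "stop" is a prefix of "stops"
})
```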

Advantages of Lexical Tree Search

- A high degree of sharing at the root nodes reduces the number of word-initial HMMs that must be evaluated in each frame
- Reduces the number of cross-word transitions
- The number of active HMM states and cross-word transitions grows more slowly with increasing vocabulary size

Advantages of Lexical Tree Search

- Savings in the number of nodes in the search space (e.g., for a 12k vocabulary, about 2.5x fewer nodes)
- Memory savings; fewer paths searched
- Search effort reduced by a factor of 5-7 over a linear lexicon, since most effort is spent searching the first or second phone of each word due to ambiguities at word boundaries

Comparing Flat Network and Tree Network

[Figure/Table: comparison of flat and tree networks in terms of the number of HMM states.]

Speed Comparison between Flat and Tree Search

- CMU Sphinx-II: tree search is about 4-5x faster than flat search for 20k and 58k word vocabularies
- Accuracy is about 20% relative worse for tree search

Disadvantages of Lexical Tree

- Root nodes model the beginnings of several words that share similar phonetic sequences, so the identity of the word is not known at the root of the tree
- The language model cannot be applied until the tree path represents a unique word identity: delayed language modeling
- Delayed language modeling implies that pruning early on is based on acoustics alone, which generally leads to increased pruning errors and loss in accuracy

Next Week

- More search issues: N-best lists, lattices / word-graphs
- Pronunciation lexicon development & prediction of word pronunciations from orthography: a review of approaches
- Practical aspects of training, testing, and tuning speech recognition systems