HMM Speech Recognition: Words, Pronunciations and Language Models


Steve Renals
ASR Lecture 8, 11 February 2013

[System overview figure: Recorded Speech -> Signal Analysis -> Search Space (Acoustic Model, Lexicon, Language Model, estimated from Training Data) -> Decoded Text (Transcription).]

Pronunciation dictionary

Words and their pronunciations provide the link between sub-word HMMs and language models. Dictionaries are written by human experts and are typically based on phones. Constructing a dictionary involves:
1 Selection of the words in the dictionary: we want to ensure high coverage of the words in the test data
2 Representation of the pronunciation(s) of each word, including explicit modelling of pronunciation variation

Out-of-vocabulary (OOV) rate

OOV rate: the percentage of word tokens in the test data that are not contained in the ASR system dictionary.
The training vocabulary requires pronunciations for all words in the training data, since training requires an HMM to be constructed for each training utterance.
Select the recognition vocabulary to minimize the OOV rate (by testing on development data); the recognition vocabulary may be different to the training vocabulary.
Empirical result: each OOV word results in 1.5-2 extra errors (more than 1 because of the loss of contextual information).
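As a concrete illustration of the OOV rate definition above, here is a minimal sketch (not from the lecture; the file names and whitespace tokenisation are assumptions) that counts the test-set word tokens missing from a recognition vocabulary:

    # Minimal sketch: OOV rate = fraction of test word tokens absent from the dictionary.
    # "lexicon.txt" and "test.txt" are hypothetical file names; a real system would also
    # need consistent text normalisation (case, punctuation) on both sides.

    def load_vocabulary(lexicon_path):
        vocab = set()
        with open(lexicon_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    vocab.add(line.split()[0].lower())   # first field = word, rest = phones
        return vocab

    def oov_rate(vocab, test_path):
        total, oov = 0, 0
        with open(test_path, encoding="utf-8") as f:
            for line in f:
                for token in line.lower().split():
                    total += 1
                    if token not in vocab:
                        oov += 1
        return oov / total if total else 0.0

    if __name__ == "__main__":
        vocab = load_vocabulary("lexicon.txt")
        print(f"OOV rate: {100 * oov_rate(vocab, 'test.txt'):.2f}%")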

Multilingual aspects

Many languages are morphologically richer than English; this has a major effect on vocabulary construction and language modelling.
Compounding (eg German): decompose compound words into their constituent parts, and carry out pronunciation and language modelling on the decomposed parts (a small decomposition sketch follows the figures below).
Highly inflected languages (eg Arabic, Slavic languages): specific components for modelling inflection (eg factored language models).
Inflecting and compounding languages (eg Finnish).
All approaches aim to reduce ASR errors by reducing the OOV rate through modelling at the morph level; this also addresses data sparsity.

Vocabulary size for different languages

One could attempt to reduce the OOV rate by increasing the recognition lexicon to cover the training-set vocabulary, but for these languages this is difficult to achieve.
[Figure: vocabulary growth curves for Finnish, Estonian, Turkish and English; for growing amounts of text (word tokens), the number of unique word forms (word types) is plotted. x-axis: corpus size (million words), y-axis: unique words (millions).]
While there are fewer than 200,000 different word forms in a 40-million word English corpus (New York Times text), the corresponding values for Finnish and Estonian corpora of the same size exceed 1.8 million and 1.5 million words respectively, and the entire 150-million word Finnish LM training corpus contains more than 4 million unique word forms, about ten times the size of the (rather large) word lexicon used in the Finnish experiments.

OOV rate for different languages

[Figure: for growing amounts of training data, the proportion of test-set words not covered by the training set, for English, Estonian, Turkish and Finnish. x-axis: training corpus size (million words), y-axis: new words in test set (%).]
Assuming the entire training-set vocabulary is used as the recognition lexicon, OOV rates remain fairly high for Finnish and Estonian: at 25 million words of training data they are 3.6% and 4.4% respectively, compared with 1.7% for Turkish and only 0.74% for English. Even using the entire 150-million word Finnish corpus (a lexicon of more than 4 million words), the test-set OOV rate would still be 1.5%.

M. Creutz et al, Morph-based speech recognition and modeling OOV words across languages, ACM Trans. Speech and Language Processing, 5(1), article 3, 2007. http://doi.acm.org/10.1145/1322391.1322394
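To make the compounding idea concrete, here is a minimal sketch (the greedy splitting heuristic and the toy vocabulary are illustrative assumptions, not the method used in the cited work) of decomposing an out-of-vocabulary compound into in-vocabulary parts:

    # Minimal sketch: greedily split an OOV compound into known constituent parts,
    # so that pronunciation and language modelling can operate on the parts.
    # The vocabulary and the German example are illustrative only.

    def decompose(word, vocab, min_part_len=3):
        """Return a list of in-vocabulary parts, or [word] if no full split is found."""
        if word in vocab:
            return [word]
        for split in range(len(word) - min_part_len, min_part_len - 1, -1):
            head, tail = word[:split], word[split:]
            if head in vocab:
                rest = decompose(tail, vocab, min_part_len)
                if all(part in vocab for part in rest):
                    return [head] + rest
        return [word]   # give up: keep the compound whole (it stays OOV)

    vocab = {"donau", "dampf", "schiff", "fahrt"}
    print(decompose("donaudampfschifffahrt", vocab))
    # -> ['donau', 'dampf', 'schiff', 'fahrt'] (illustrative)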
Multiple pronunciations

Words may have multiple pronunciations:
1 Accent, dialect: tomato, zebra (global changes to the dictionary based on consistent pronunciation variations)
2 Phonological phenomena: handbag / [h ae m b ae g]; I can't stay / [ah k ae n s t ay]
3 Part of speech: project, excuse
This seems to imply many pronunciations per word, handled for example by:
1 A global transform based on speaker characteristics
2 Context-dependent pronunciation models, encoding phonological phenomena

Single and multiple pronunciations

BUT state-of-the-art large vocabulary systems average about 1.1 pronunciations per word: most words have a single pronunciation.
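As an illustration of what a dictionary with occasional alternates looks like, here is a minimal sketch (the entries and phone symbols are made-up examples, not taken from any real lexicon) that stores alternate pronunciations and reports the average number of pronunciations per word:

    # Minimal sketch: a pronunciation dictionary with occasional alternates.
    # Entries and phone symbols are illustrative; real LVCSR dictionaries
    # contain tens or hundreds of thousands of words.

    from collections import defaultdict

    lexicon = defaultdict(list)

    def add_pronunciation(word, phones):
        lexicon[word].append(phones.split())

    add_pronunciation("tomato",  "t ax m aa t ow")
    add_pronunciation("tomato",  "t ax m ey t ow")   # accent/dialect variant
    add_pronunciation("handbag", "h ae n d b ae g")
    add_pronunciation("handbag", "h ae m b ae g")    # phonological variant (assimilation)
    add_pronunciation("ticket",  "t ih k ih t")

    avg = sum(len(prons) for prons in lexicon.values()) / len(lexicon)
    print(f"Average pronunciations per word: {avg:.2f}")   # ~1.67 for this toy lexicon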

Modelling pronunciation variability

Empirical finding: adding pronunciation variants can result in reduced accuracy. Adding pronunciations gives more flexibility to word models and increases the number of potential ambiguities: there are more possible state sequences to match the observed acoustics.
State-of-the-art systems absorb variations in pronunciation in the acoustic models. Context-dependent acoustic models may be thought of as giving a broad-class representation of word context, and cross-word context-dependent models can implicitly represent cross-word phonological phenomena.
Hain (2002): a carefully constructed single-pronunciation dictionary (using the most common alignments) can result in a more accurate system than a multiple-pronunciation dictionary.

Consistency vs fidelity

Speech recognition uses a consistent rather than a faithful representation of pronunciations.
A consistent representation requires only that the same word has the same phonemic representation (possibly with alternates): the training data need only be transcribed at the word level.
A faithful phonemic representation requires a detailed phonetic transcription of the training speech, which is much too expensive for large training data sets.

Mathematical framework

HMM framework for speech recognition: let W range over possible word sequences and X be the observed acoustics. We want to find

    W* = arg max_W P(W | X) = arg max_W P(X | W) P(W) / P(X) = arg max_W P(X | W) P(W)

Words are composed of a sequence of HMM states Q:

    W* = arg max_W Σ_Q P(X | Q, W) P(Q, W)
       ≈ arg max_W Σ_Q P(X | Q) P(Q | W) P(W)
       ≈ arg max_W max_Q P(X | Q) P(Q | W) P(W)

Three levels of model

Acoustic model P(X | Q): probability of the acoustics given the phone states; context-dependent HMMs using state clustering, phonetic decision trees, etc.
Pronunciation model P(Q | W): probability of the phone states given the words; may be as simple as a dictionary of pronunciations, or a more complex model.
Language model P(W): probability of a sequence of words; typically an n-gram.
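To show how the three model levels combine, here is a minimal sketch (all probability values are invented for illustration; a real decoder searches over state sequences with Viterbi rather than scoring a single fixed hypothesis) of scoring one hypothesised word sequence in the log domain:

    # Minimal sketch: score one hypothesis by combining the three model levels,
    # log P(X|Q) + log P(Q|W) + log P(W), all in the log domain.
    # Every probability value below is invented for illustration.

    import math

    def hypothesis_score(acoustic_logprob, pron_probs, lm_probs):
        """acoustic_logprob: log P(X|Q) for the best state sequence (assumed given);
        pron_probs: per-word pronunciation probabilities P(q_w | w);
        lm_probs: per-word n-gram probabilities P(w | history)."""
        pron_logprob = sum(math.log(p) for p in pron_probs)   # log P(Q|W)
        lm_logprob = sum(math.log(p) for p in lm_probs)       # log P(W)
        return acoustic_logprob + pron_logprob + lm_logprob

    # Two hypotheses with nearly identical acoustics: the language model term decides.
    score_a = hypothesis_score(-1502.3, [1.0] * 5, [0.1, 0.3, 0.4, 0.01, 0.05])
    score_b = hypothesis_score(-1501.8, [1.0] * 5, [0.1, 0.3, 0.4, 1e-6, 1e-4])
    print(score_a > score_b)   # True: the better LM score outweighs the small acoustic gap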

Language modelling

Basic idea: the language model is the prior probability of the word sequence, P(W).
Use a language model to disambiguate between similar acoustics when combining linguistic and acoustic evidence:
    never mind the nudist play / never mind the new display
In limited domains, hand-constructed networks can be used.

Finite-state network

[Figure: a hand-written word network, eg (one | two | three) (ticket | tickets) to (Edinburgh | London | Leeds).]
Such networks are typically hand-written and do not have wide coverage or robustness.
Statistical language models, by contrast, cover ungrammatical utterances, are computationally efficient, are trainable from huge amounts of data, and can assign a probability to a sentence fragment as well as to a whole sentence.

Statistical language models

For use in speech recognition a language model must be statistical, have wide coverage, and be compatible with left-to-right search algorithms. Only a few grammar-based models have met these requirements (eg Chelba and Jelinek, 2000), and they do not yet scale as well as simple statistical models.
n-grams are (still) the state-of-the-art language model for ASR. They are unsophisticated and linguistically implausible: a short, finite context, modelling solely at the shallow word level. But they have wide coverage, can deal with ungrammatical strings, and are statistical and scalable.
The probability of a word depends only on the identity of that word and of the preceding n-1 words; these short sequences of n words are called n-grams.
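As a toy illustration of a hand-constructed finite-state network of the kind sketched above, here is a minimal sketch (the word lists follow the slide's example; representing the network as a sequence of alternative word sets is my own simplification) that enumerates every sentence the network accepts:

    # Minimal sketch: a hand-written word network as a sequence of alternative
    # word sets, and enumeration of the sentences it accepts. Such networks work
    # in limited domains but have no coverage or robustness beyond them.

    from itertools import product

    network = [
        ("one", "two", "three"),
        ("ticket", "tickets"),
        ("to",),
        ("Edinburgh", "London", "Leeds"),
    ]

    sentences = [" ".join(words) for words in product(*network)]
    print(len(sentences))   # 18 sentences in total
    print(sentences[0])     # "one ticket to Edinburgh"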

Bigram language model

For a word sequence W = w_1, w_2, ..., w_M:
    P(W) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_M | w_1, w_2, ..., w_{M-1})
The bigram approximation considers only one word of context:
    P(W) ≈ P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_M | w_{M-1})
The parameters of a bigram model are the conditional probabilities P(w_i | w_j). Maximum likelihood estimates are obtained by counting:
    P(w_i | w_j) ≈ c(w_j, w_i) / c(w_j)
where c(w_j, w_i) is the number of observations of w_j followed by w_i, and c(w_j) is the number of observations of w_j (irrespective of what follows).

Bigram network

[Figure: bigram probabilities as a finite-state network, with arcs such as P(one | start of sentence), P(ticket | one), P(Edinburgh | one), P(end of sentence | Edinburgh).]
n-grams can be represented as probabilistic finite-state networks. Only some arcs (and nodes) are shown for clarity: in a full model there is an arc from every word to every word. Note the special start-of-sentence and end-of-sentence probabilities.

The zero probability problem

Maximum likelihood estimation is based on counts of words in the training data. If an n-gram is not observed, it has a count of 0 and its maximum likelihood probability estimate is 0. This is the zero probability problem: just because something does not occur in the training data does not mean that it will not occur. As n grows larger, the data grow sparser, and there are more zero counts.
Solution: smooth the probability estimates so that unobserved events do not have zero probability. Since probabilities sum to 1, some probability mass is redistributed from observed to unobserved n-grams.

Smoothing language models

What is the probability of an unseen n-gram?
Add-one smoothing: add one to all counts and renormalize. This discounts non-zero counts and redistributes the mass to zero counts. Since most n-grams are unseen (for large n, more types than tokens!), it gives too much probability to unseen n-grams (discussed in Manning and Schütze).
Absolute discounting: subtract a constant from the observed (non-zero count) n-grams, and redistribute the subtracted probability over the unseen n-grams (zero counts).
Kneser-Ney smoothing: a family of smoothing methods based on absolute discounting that are at the state of the art (Goodman, 2001).
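To make the counting and add-one smoothing formulas concrete, here is a minimal sketch (the toy corpus is invented; real toolkits such as SRILM or KenLM do this at scale) that estimates bigram probabilities by maximum likelihood and with add-one smoothing:

    # Minimal sketch: maximum likelihood and add-one (Laplace) bigram estimates
    # from a toy corpus. <s> and </s> mark sentence start and end.

    from collections import Counter

    corpus = [
        "<s> one ticket to Edinburgh </s>",
        "<s> two tickets to London </s>",
        "<s> one ticket to Leeds </s>",
    ]

    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = sentence.split()
        unigrams.update(words[:-1])                     # count contexts c(w_j)
        bigrams.update(zip(words[:-1], words[1:]))      # count pairs c(w_j, w_i)

    V = len(set(w for s in corpus for w in s.split()))  # vocabulary size

    def p_ml(wi, wj):
        return bigrams[(wj, wi)] / unigrams[wj] if unigrams[wj] else 0.0

    def p_add1(wi, wj):
        return (bigrams[(wj, wi)] + 1) / (unigrams[wj] + V)

    print(p_ml("ticket", "one"))     # 1.0: both observed "one" contexts are followed by "ticket"
    print(p_ml("tickets", "one"))    # 0.0: unseen bigram, the zero probability problem
    print(p_add1("tickets", "one"))  # small but non-zero after smoothing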

Backing off

How is the probability distributed over unseen events? Basic idea: estimate the probability of an unseen n-gram using the (n-1)-gram estimate, using successively less context: trigram, then bigram, then unigram. Back-off models redistribute the probability freed by discounting the n-gram counts. For a bigram:
    P(w_i | w_j) = (c(w_j, w_i) - D) / c(w_j)   if c(w_j, w_i) > c
    P(w_i | w_j) = P(w_i) b_{w_j}               otherwise
where c is the count threshold, D is the discount, and b_{w_j} is the back-off weight required for normalization.

Interpolation

Basic idea: mix the probability estimates from all the estimators, ie estimate the trigram probability by mixing together trigram, bigram and unigram estimates.
Simple interpolation:
    P^(w_n | w_{n-2}, w_{n-1}) = λ_3 P(w_n | w_{n-2}, w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_1 P(w_n)
with Σ_i λ_i = 1.
Interpolation with coefficients conditioned on the context:
    P^(w_n | w_{n-2}, w_{n-1}) = λ_3(w_{n-2}, w_{n-1}) P(w_n | w_{n-2}, w_{n-1}) + λ_2(w_{n-2}, w_{n-1}) P(w_n | w_{n-1}) + λ_1(w_{n-2}, w_{n-1}) P(w_n)
Set the λ values to maximise the likelihood of the interpolated language model generating a held-out corpus (it is possible to use EM to do this). A minimal interpolation sketch follows the references below.

Practical language modelling

Work in log probabilities. The ARPA language model format is commonly used to store n-gram language models (unless they are very big). Many toolkits are available: SRILM, IRSTLM, KenLM, the CMU-Cambridge toolkit, ...
Some research issues: advanced smoothing; adaptation to new domains; incorporating topic information; long-distance dependencies; distributed representations.

References

Jurafsky and Martin, chapter 4
Fosler-Lussier (2003): pronunciation modelling tutorial
Hain (2002): implicit pronunciation modelling by context-dependent acoustic models
Gotoh and Renals (2003): language modelling tutorial (and see references within)
Manning and Schütze (1999): good coverage of n-gram models
Jelinek (1991): review of early attempts to go beyond n-grams
Chelba and Jelinek (2000): example of a probabilistic grammar-based language model
Goodman (2001): state-of-the-art smoothing for n-grams
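To make the simple interpolation formula concrete, here is a minimal sketch (the component probabilities and weights are invented; a real system would estimate the components from counts and tune the λ values on held-out data, eg with EM):

    # Minimal sketch: simple linear interpolation of trigram, bigram and unigram
    # estimates with fixed weights lambda_3, lambda_2, lambda_1 summing to 1.
    # The component probability values below are invented for illustration.

    def interpolate(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        l3, l2, l1 = lambdas
        assert abs(l3 + l2 + l1 - 1.0) < 1e-9, "interpolation weights must sum to 1"
        return l3 * p_tri + l2 * p_bi + l1 * p_uni

    # An unseen trigram still receives a non-zero interpolated probability:
    p = interpolate(p_tri=0.0,     # trigram unseen in training data
                    p_bi=0.25,     # made-up bigram estimate
                    p_uni=0.001)   # made-up unigram estimate
    print(p)   # 0.0751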