Lexicon and Language Model


Steve Renals
Automatic Speech Recognition (ASR), Lecture 10
15 February 2018

Three levels of model
- Acoustic model P(X | Q): probability of the acoustics given the phone states; context-dependent HMMs using state clustering, phonetic decision trees, etc.
- Pronunciation model P(Q | W): probability of the phone states given the words; may be as simple as a dictionary of pronunciations, or a more complex model
- Language model P(W): probability of a sequence of words; typically an n-gram
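For reference, a hedged sketch of how these three models combine during decoding. The decomposition below is the standard formulation implied by the slide but not written out there; in practice the sum over state sequences Q is usually approximated by the single best (Viterbi) sequence.

\[
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \sum_{Q} P(X \mid Q)\, P(Q \mid W)\, P(W)
\]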

[Figure: HMM speech recognition pipeline. Recorded speech is converted to acoustic features; the search space combines the acoustic model (estimated from training data), the lexicon and the language model; the output is the decoded text (transcription).]

Pronunciation dictionary
- Words and their pronunciations provide the link between sub-word HMMs and language models
- Written by human experts
- Typically based on phones
- Constructing a dictionary involves:
  1. Selection of the words in the dictionary: want to ensure high coverage of the words in the test data
  2. Representation of the pronunciation(s) of each word, including explicit modelling of pronunciation variation (see the sketch below)
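As a concrete illustration of what a phone-based lexicon looks like in practice, here is a minimal Python sketch that loads a CMUdict-style file, where each line holds a word followed by its phone sequence. The file name and format are assumptions for illustration only; the lecture does not prescribe a lexicon format.

```python
# Minimal sketch: loading a CMUdict-style pronunciation lexicon.
# Assumed format: one entry per line, "WORD  phone phone ...".
from collections import defaultdict

def load_lexicon(path):
    """Map each word to a list of pronunciations (each a list of phones)."""
    lexicon = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0].startswith(";;;"):   # skip comments / blank lines
                continue
            word, phones = fields[0], fields[1:]
            lexicon[word].append(phones)
    return lexicon

# lex = load_lexicon("lexicon.txt")   # hypothetical file name
# e.g. lex["TOMATO"] might then hold two variants:
#   ['T', 'AH', 'M', 'EY', 'T', 'OW'] and ['T', 'AH', 'M', 'AA', 'T', 'OW']
```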

Out-of-vocabulary (OOV) rate
- OOV rate: the percentage of word tokens in the test data that are not contained in the ASR system dictionary (see the sketch below)
- The training vocabulary requires pronunciations for all words in the training data (since training requires an HMM to be constructed for each training utterance)
- Select the recognition vocabulary to minimise the OOV rate (by testing on development data)
- The recognition vocabulary may be different from the training vocabulary
- Empirical result: each OOV word results in 1.5-2 extra errors (more than 1, due to the loss of contextual information)
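A minimal sketch of the OOV-rate computation described above; the function name and toy data are illustrative, not from the lecture.

```python
# Minimal sketch: OOV rate of a test set against a recognition vocabulary.
def oov_rate(test_tokens, vocabulary):
    """Percentage of test word tokens not covered by the vocabulary."""
    vocab = set(vocabulary)
    oov = sum(1 for w in test_tokens if w not in vocab)
    return 100.0 * oov / len(test_tokens)

test_tokens = "the cat sat on the zyzzyva".split()
print(oov_rate(test_tokens, {"the", "cat", "sat", "on"}))  # 16.7: 1 of the 6 tokens is OOV
```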

Multilingual aspects
- Many languages are morphologically richer than English: this has a major effect on vocabulary construction and language modelling
- Compounding (e.g. German): decompose compound words into their constituent parts, and carry out pronunciation and language modelling on the decomposed parts
- Highly inflected languages (e.g. Arabic, Slavic languages): specific components for modelling inflection (e.g. factored language models)
- Inflecting and compounding languages (e.g. Finnish)
- All approaches aim to reduce ASR errors by reducing the OOV rate through modelling at the morph level; this also addresses data sparsity

Single and multiple pronunciations
Words may have multiple pronunciations:
1. Accent, dialect: tomato, zebra; global changes to the dictionary based on consistent pronunciation variations
2. Phonological phenomena: handbag → [h ae m b ae g]; "I can't stay" → [ah k ae n s t ay]
3. Part of speech: project, excuse
This seems to imply many pronunciations per word, including:
1. A global transform based on speaker characteristics
2. Context-dependent pronunciation models, encoding phonological phenomena
BUT state-of-the-art large vocabulary systems average about 1.1 pronunciations per word: most words have a single pronunciation

Consistency vs Fidelity
- Empirical finding: adding pronunciation variants can result in reduced accuracy
- Adding pronunciations gives more flexibility to word models and increases the number of potential ambiguities: more possible state sequences to match the observed acoustics
- Speech recognition uses a consistent rather than a faithful representation of pronunciations
- A consistent representation requires only that the same word has the same phonemic representation (possibly with alternates): the training data need only be transcribed at the word level
- A faithful phonemic representation requires a detailed phonetic transcription of the training speech (much too expensive for large training data sets)

Current topics in pronunciation modelling
- Automatic learning of pronunciation variations or alternative pronunciations for some words, e.g. learning a probability distribution over possible pronunciations generated by grapheme-to-phoneme models
- Automatic learning of pronunciations of new words based on an initial seed lexicon
- Joint learning of the inventory of subword units and the pronunciation lexicon
- Sub-phonetic / articulatory feature models
- Grapheme-based modelling: model at the character level and remove the problem of pronunciation modelling entirely

[Figure: HMM speech recognition pipeline, repeated from earlier; the remainder of the lecture focuses on the language model component.]

Statistical language models
- Basic idea: the language model is the prior probability of the word sequence, P(W)
- Statistical language models cover ungrammatical utterances, are computationally efficient, are trainable from huge amounts of data, and can assign a probability to a sentence fragment as well as to a whole sentence
- Until very recently, n-grams were the state-of-the-art language model for ASR
- Unsophisticated, linguistically implausible: short, finite context; model solely at the shallow word level
- But: wide coverage, able to deal with ungrammatical strings, statistical and scalable
- In an n-gram, the probability of a word depends only on the identity of that word and of the preceding n-1 words; these short sequences of n words are called n-grams

Bigram language model
Word sequence W = w_1, w_2, ..., w_M
\[
P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_M \mid w_1, w_2, \ldots, w_{M-1})
\]
Bigram approximation: consider only one word of context:
\[
P(W) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_M \mid w_{M-1})
\]
The parameters of a bigram model are the conditional probabilities P(w_j | w_i). Maximum likelihood estimates are obtained by counting:
\[
P(w_j \mid w_i) \approx \frac{c(w_i, w_j)}{c(w_i)}
\]
where c(w_i, w_j) is the number of observations of w_i followed by w_j, and c(w_i) is the number of observations of w_i (irrespective of what follows).
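A minimal sketch of the counting estimate above; the toy corpus and names are illustrative, not from the lecture.

```python
# Minimal sketch: maximum-likelihood bigram estimation by counting,
# implementing P(w_j | w_i) = c(w_i, w_j) / c(w_i) on a toy corpus.
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "cat", "slept"]]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    for i, w in enumerate(sentence):
        unigram_counts[w] += 1
        if i > 0:
            bigram_counts[(sentence[i - 1], w)] += 1

def p_bigram(w_j, w_i):
    """Maximum-likelihood estimate of P(w_j | w_i); zero if the bigram is unseen."""
    if unigram_counts[w_i] == 0:
        return 0.0
    return bigram_counts[(w_i, w_j)] / unigram_counts[w_i]

print(p_bigram("cat", "the"))   # 1.0  (both occurrences of "the" are followed by "cat")
print(p_bigram("dog", "the"))   # 0.0  -> the zero probability problem (next slide)
```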

The zero probability problem
- Maximum likelihood estimation is based on counts of words in the training data
- If an n-gram is not observed, it will have a count of 0 and the maximum likelihood probability estimate will be 0
- The zero probability problem: just because something does not occur in the training data does not mean that it will not occur
- As n grows larger, the data grow sparser and there will be more zero counts
- Solution: smooth the probability estimates so that unobserved events do not have zero probability
- Since probabilities sum to 1, this means that some probability is redistributed from observed to unobserved n-grams

Smoothing language models
What is the probability of an unseen n-gram?
- Add-one smoothing: add one to all counts and renormalise. This discounts non-zero counts and redistributes probability to zero counts. Since most n-grams are unseen (for large n there are more types than tokens!) this gives too much probability to unseen n-grams (discussed in Manning and Schütze; see the sketch below)
- Absolute discounting: subtract a constant from the observed (non-zero count) n-grams, and redistribute the subtracted probability over the unseen n-grams (zero counts)
- Kneser-Ney smoothing: a family of smoothing methods based on absolute discounting that are at the state of the art (Goodman, 2001)
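A minimal sketch of add-one (Laplace) smoothing for the bigram estimate; the toy counts are illustrative, not from the lecture.

```python
# Minimal sketch: add-one (Laplace) smoothing of a bigram estimate,
# (c(w_i, w_j) + 1) / (c(w_i) + V), where V is the vocabulary size.
from collections import Counter

unigram_counts = Counter({"the": 2, "cat": 2, "sat": 1, "slept": 1})
bigram_counts = Counter({("the", "cat"): 2, ("cat", "sat"): 1, ("cat", "slept"): 1})
V = len(unigram_counts)                      # vocabulary size

def p_add_one(w_j, w_i):
    return (bigram_counts[(w_i, w_j)] + 1) / (unigram_counts[w_i] + V)

print(p_add_one("cat", "the"))    # 0.5   (seen bigram, discounted from its ML value of 1.0)
print(p_add_one("slept", "the"))  # 0.167 (unseen bigram now gets non-zero probability)
```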

Backing off
How is the probability distributed over unseen events?
- Basic idea: estimate the probability of an unseen n-gram using the (n-1)-gram estimate
- Use successively less context: trigram → bigram → unigram
- Back-off models redistribute the probability freed by discounting the n-gram counts
For a bigram:
\[
P(w_j \mid w_i) =
\begin{cases}
\dfrac{c(w_i, w_j) - D}{c(w_i)} & \text{if } c(w_i, w_j) > c \\
P(w_j)\, b_{w_i} & \text{otherwise}
\end{cases}
\]
where c is the count threshold, D is the discount, and b_{w_i} is the back-off weight required for normalisation.
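A minimal sketch of the back-off formula above with absolute discounting; the toy counts, the discount D = 0.5 and the count threshold c = 0 are illustrative choices, not values from the lecture.

```python
# Minimal sketch: absolute-discounting back-off for a bigram model.
from collections import Counter

unigram_counts = Counter({"the": 2, "cat": 2, "sat": 1, "slept": 1})
bigram_counts = Counter({("the", "cat"): 2, ("cat", "sat"): 1, ("cat", "slept"): 1})
total_tokens = sum(unigram_counts.values())
D, COUNT_THRESHOLD = 0.5, 0

def p_unigram(w):
    return unigram_counts[w] / total_tokens

def backoff_weight(w_i):
    """b_{w_i}: discounted mass freed from seen bigrams, spread over unseen successors."""
    seen = [w_j for (v, w_j), c in bigram_counts.items() if v == w_i and c > COUNT_THRESHOLD]
    freed = D * len(seen) / unigram_counts[w_i]                 # total probability removed
    unseen_mass = 1.0 - sum(p_unigram(w_j) for w_j in seen)     # unigram mass of unseen words
    return freed / unseen_mass if unseen_mass > 0 else 0.0

def p_backoff(w_j, w_i):
    c = bigram_counts[(w_i, w_j)]
    if c > COUNT_THRESHOLD:
        return (c - D) / unigram_counts[w_i]                    # discounted bigram estimate
    return backoff_weight(w_i) * p_unigram(w_j)                 # back off to the unigram

print(p_backoff("cat", "the"))     # 0.75
print(p_backoff("slept", "the"))   # 0.0625 (backed off via the unigram)
```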

Interpolation
Basic idea: mix the probability estimates from all the estimators, e.g. estimate the trigram probability by mixing together trigram, bigram and unigram estimates.
Simple interpolation:
\[
\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_3 P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_1 P(w_n), \quad \sum_i \lambda_i = 1
\]
Interpolation with coefficients conditioned on the context:
\[
\hat{P}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_3(w_{n-2}, w_{n-1}) P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2(w_{n-2}, w_{n-1}) P(w_n \mid w_{n-1}) + \lambda_1(w_{n-2}, w_{n-1}) P(w_n)
\]
Set the λ values to maximise the likelihood of the interpolated language model generating a held-out corpus (it is possible to use EM to do this).
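A minimal sketch of simple (context-independent) interpolation; the λ values and the placeholder estimators are illustrative, and in practice the λs would be tuned (e.g. by EM) on held-out data.

```python
# Minimal sketch: linear interpolation of unigram, bigram and trigram estimates.
LAMBDA_UNI, LAMBDA_BI, LAMBDA_TRI = 0.1, 0.3, 0.6   # must sum to 1

def p_interpolated(w_n, w_nm1, w_nm2, p_uni, p_bi, p_tri):
    """lambda_3 P(w_n | w_{n-2}, w_{n-1}) + lambda_2 P(w_n | w_{n-1}) + lambda_1 P(w_n)."""
    return (LAMBDA_TRI * p_tri(w_n, w_nm2, w_nm1)
            + LAMBDA_BI * p_bi(w_n, w_nm1)
            + LAMBDA_UNI * p_uni(w_n))

# Toy estimators standing in for the real (smoothed) n-gram models:
p_uni = lambda w: 0.01
p_bi = lambda w, prev: 0.05
p_tri = lambda w, prev2, prev1: 0.0              # e.g. an unseen trigram
print(p_interpolated("sat", "cat", "the", p_uni, p_bi, p_tri))   # 0.016
```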

Perplexity
Measure the quality of a language model by how well it predicts a test set W (i.e. the estimated probability of the word sequence).
Perplexity PP(W): the inverse probability of the test set W, normalised by the number of words N:
\[
PP(W) = P(W)^{-\frac{1}{N}} = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
\]
Perplexity of a bigram LM:
\[
PP(W) = \bigl( P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1}) \bigr)^{-\frac{1}{N}}
\]
Example perplexities for different n-gram LMs trained on Wall Street Journal (38M words):
  Unigram   962
  Bigram    170
  Trigram   109
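A minimal sketch of the bigram perplexity computation above, done in log space to avoid numerical underflow; the estimators shown are placeholders, not the lecture's models.

```python
# Minimal sketch: perplexity of a test sequence under a bigram model.
import math

def bigram_perplexity(words, p_first, p_next):
    """PP(W) = P(W)^(-1/N) with P(W) decomposed into bigram factors."""
    log_prob = math.log(p_first(words[0]))
    for prev, w in zip(words, words[1:]):
        log_prob += math.log(p_next(w, prev))
    return math.exp(-log_prob / len(words))

# Toy estimators for illustration:
p_first = lambda w: 0.1
p_next = lambda w, prev: 0.25
print(bigram_perplexity("the cat sat on the mat".split(), p_first, p_next))  # about 4.7
```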

Distributed representation for language modelling
- Each word is associated with a learned distributed representation (feature vector)
- Use a neural network to estimate the conditional probability of the next word given the distributed representations of the context words
- Learn the distributed representations and the weights of the conditional probability estimate jointly by maximising the log likelihood of the training data
- Distributionally similar words will have similar feature vectors: a small change in the feature vector results in a small change in the probability estimate (since the NN is a smooth function)
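The feed-forward model described above can be written down compactly. The following is a minimal sketch using PyTorch, which is an assumption of this example (the lecture does not prescribe a toolkit); the hyperparameters are illustrative.

```python
# Minimal sketch of a Bengio-style feed-forward neural probabilistic language model.
import torch
import torch.nn as nn

class NeuralProbabilisticLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=60, context_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # learned word feature vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)           # one score per vocabulary word

    def forward(self, context_ids):
        # context_ids: (batch, context_size) indices of the preceding words
        e = self.embed(context_ids).flatten(start_dim=1)          # concatenate context embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.output(h), dim=-1)          # log P(next word | context)

model = NeuralProbabilisticLM(vocab_size=10000)
context = torch.randint(0, 10000, (8, 3))                         # a batch of 8 three-word contexts
log_probs = model(context)                                        # shape (8, 10000)
# Training would minimise nn.NLLLoss()(log_probs, next_word_ids), i.e. maximise log likelihood.
```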

[Figure: architecture of the neural probabilistic language model, Bengio et al (2006).]

Neural Probabilistic Language Model
- Train using stochastic gradient ascent to maximise the log likelihood
- The number of free parameters (weights) scales linearly with the vocabulary size and linearly with the context size
- Can be (linearly) interpolated with an n-gram model
Perplexity results on AP News (14M words training), |V| = 18k:
  model           n   perplexity
  NPLM (100,60)   6   109
  n-gram (KN)     3   127
  n-gram (KN)     4   119
  n-gram (KN)     5   117

Shortlists
- Reduce computation by including only the s most frequent words at the output: the shortlist S (the full vocabulary is still used for the context)
- Use an n-gram model to estimate the probabilities of words not in the shortlist
- The neural network thus redistributes probability over the words in the shortlist:
\[
P_S(h_t) = \sum_{w \in S} P(w \mid h_t)
\]
\[
P(w_t \mid h_t) =
\begin{cases}
P_{NN}(w_t \mid h_t)\, P_S(h_t) & \text{if } w_t \in S \\
P_{KN}(w_t \mid h_t) & \text{otherwise}
\end{cases}
\]
- In a |V| = 50k task, a 1024-word shortlist covers 89% of 4-grams, and 4096 words cover 97%
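A minimal sketch of the shortlist combination above. One common reading (Schwenk-style shortlist NNLMs) computes the shortlist mass P_S(h) with the back-off n-gram model; the slide leaves the estimator unspecified, so treat that as an assumption. The functions p_nn and p_kn are placeholders for the neural and Kneser-Ney models.

```python
# Minimal sketch: combining a shortlist neural LM with an n-gram back-off model.
def shortlist_probability(w, history, shortlist, p_nn, p_kn):
    """P(w | h): neural network for shortlist words, KN n-gram otherwise."""
    if w in shortlist:
        p_s = sum(p_kn(v, history) for v in shortlist)   # n-gram mass of the shortlist (assumption)
        return p_nn(w, history) * p_s
    return p_kn(w, history)
```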

NPLM ASR results
Speech recognition results on Switchboard: 7M / 12M / 27M words of in-domain data, plus 500M words of background data (broadcast news). Vocabulary size |V| = 51k, shortlist size S = 12k.
  WER (%)              7M     12M    27M   (in-domain words)
  KN (in-domain)       25.3   23.0   20.0
  NN (in-domain)       24.5   22.2   19.1
  KN (+ background)    24.1   22.3   19.3
  NN (+ background)    23.7   21.8   18.9

Summary
- Pronunciation dictionaries
- n-gram language models
- Neural network language models

Reading
- Jurafsky and Martin, chapter 4
- Y Bengio et al (2006), "Neural probabilistic language models" (sections 6.1, 6.2, 6.3, 6.6, 6.7, 6.8), Studies in Fuzziness and Soft Computing, Volume 194, Springer, chapter 6. http://link.springer.com/chapter/10.1007/3-540-33486-6_6