The Big Picture OR The Components of Automatic Speech Recognition (ASR)

Reference: Steve Young's paper - highly recommended! (online at http://csl.anthropomatik.kit.edu > Studium und Lehre > SS2013 > Multilinguale Mensch-Maschine Kommunikation)

Thursday, 18 April 2013

Overview ASR (I)
- Representation of Speech
- Speech Coding
- Statistical Pattern-based Speech Recognition
- Sampling & Quantization
- Quantization of Signals
- Quantization of Speech Signals
- Sampling Continuous-time Signals
- How Frequently Should we Sample? - The Aliasing Effect
- Feature Extraction

Overview ASR (II)
- Automatic Speech Recognition
- Fundamental Equation of Speech Recognition
- Acoustic Model
- Purpose of the Acoustic Model (Pronunciation Dictionary)
- Why break the words down into phones?
- Speech Production seen as a Stochastic Process
- Generating an Observation of Speech Feature Vectors x_1, x_2, ..., x_T
- Hidden Markov Models
- Formal Definition of Hidden Markov Models
- Three Main Problems of Hidden Markov Models
- Hidden Markov Models in ASR
- From the Sentence to the Sentence-HMM
- Context Dependent Acoustic Modeling
- From Sentence to Context Dependent HMM

Overview ASR (III)
- Automatic Speech Recognition
- Language Model
- Motivation
- What do we expect from Language Models in ASR?
- Stochastic Language Models
- Probabilities of Word Sequences
- Classification of Word Sequence Histories
- Estimation of N-grams
- Search
- Simplified Training
- Simplified Decoding
- Comparing Complete Utterances
- Alignment of Vector Sequences
- Dynamic Time Warping

Overview Signal Processing
- Representation of Speech
- Speech Coding
- Statistical Pattern-based Speech Recognition
- Sampling & Quantization
- Quantization of Signals
- Quantization of Speech Signals
- Sampling Continuous-time Signals
- How Frequently Should we Sample? - The Aliasing Effect
- Feature Extraction

Automatic Speech Recognition

[Diagram: Input Speech → ??? → Output Text "Hello world"]

ASR Signal Processing

[Diagram: Input Speech → Signal Pre-Processing → ??? → Output Text "Hello world"]

Automatic Speech Recognition

The purpose of Signal Preprocessing is:
1) Signal Digitization (Quantization and Sampling): represent an analog signal in an appropriate form to be processed by a computer
2) Digital Signal Preprocessing (Feature Extraction): extract features that are suitable for the recognition process

[Diagram: Input Speech → ??? → Output Text "Hello world"]

Representation of Speech

Definition: digital representation of speech means representing speech as a sequence of numbers (a prerequisite for automatic processing by computers).

1) Direct representation of the speech waveform: represent the waveform as accurately as possible so that the acoustic signal can be reconstructed
2) Parametric representation: represent a set of properties/parameters with regard to a certain model

Decide on the targeted application first: speech coding, speech synthesis, or speech recognition.

Classical paper: Schafer/Rabiner in Waibel/Lee (paper online)

Speech Coding

Objectives of speech coding:
- Quality versus bit rate (quantization noise)
- High measured intelligibility
- Low bit rate (bits per second of speech)
- Low computational requirements
- Robustness to transmission errors
- Robustness to successive encode/decode cycles

Objectives for real-time operation:
- Low coding/decoding delay
- Works with non-speech signals (e.g. touch tones)

Statistical Pattern-based Speech Recognition

Goals for the digital representation of speech:
- Capture important phonetic information in speech
- Computational efficiency
- Efficiency in storage requirements
- Optimize generalization

Overview Signal Processing
- Representation of Speech
- Speech Coding
- Statistical Pattern-based Speech Recognition
- Sampling & Quantization
- Quantization of Signals
- Quantization of Speech Signals
- Sampling Continuous-time Signals
- How Frequently Should we Sample? - The Aliasing Effect
- Feature Extraction

Sampling & Quantization

Goal: given a signal that is continuous in time and amplitude, find a discrete representation. Two steps are necessary for this: sampling and quantization.
- Quantization corresponds to a discretization of the y-axis
- Sampling corresponds to a discretization of the x-axis

Quantization of Signals
- Given a discrete signal f[i] to be quantized into q[i]
- Assume that f lies between f_min and f_max
- Partition the y-axis into a fixed number n of (equally sized) intervals
- Usually n = 2^b; in ASR typically b = 16, i.e. n = 65536 (16-bit quantization)
- q[i] can only take values that are the centers of the intervals
- Quantization: assign to q[i] the center of the interval in which f[i] lies
- Quantization makes errors, i.e. adds noise to the signal: f[i] = q[i] + e[i]
- The quantization error is bounded: |e[i]| ≤ (f_max - f_min) / (2n)
- Define the signal-to-noise ratio SNR = power(f[i]) / power(e[i]); in dB, SNR[dB] = 10·log10(power(f[i]) / power(e[i]))
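The following is a minimal sketch of this scheme (assuming numpy; the function name `quantize` and the toy sine signal are illustrative, not from the slides). It quantizes a signal into 2^b interval centers and measures the resulting SNR:

```python
import numpy as np

def quantize(f, f_min, f_max, b=16):
    """Uniformly quantize signal f into n = 2**b interval centers."""
    n = 2 ** b
    width = (f_max - f_min) / n                   # interval size
    idx = np.clip(((f - f_min) / width).astype(int), 0, n - 1)
    return f_min + (idx + 0.5) * width            # center of the interval

# Toy signal: a 1 kHz sine sampled at 16 kHz
t = np.arange(0, 0.1, 1 / 16000)
f = 0.9 * np.sin(2 * np.pi * 1000 * t)

for b in (8, 12, 16):
    q = quantize(f, -1.0, 1.0, b)
    e = f - q                                     # quantization noise
    snr_db = 10 * np.log10(np.mean(f ** 2) / np.mean(e ** 2))
    print(f"b={b:2d}: SNR = {snr_db:5.1f} dB")    # roughly 6 dB per bit
```

The printed values illustrate the rule of thumb on the next slide: each additional bit buys about 6 dB of SNR.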

Quantization of Speech Signals

Choice of sampling depth:
- Speech signals usually lie in the range between 50 dB and 60 dB
- The lower the SNR, the lower the speech recognition performance
- To get a reasonable SNR, b should be at least 10 to 12
- Each bit contributes about 6 dB of SNR (see e.g. http://cnx.org/content/m0051/latest/)
- Typically in ASR the samples are quantized with 16 bits

Sampling Continuous-time Signals

[Figure: an original speech waveform (top) and its sampled version (bottom)]

How Frequently Should we Sample?

[Figure: undersampling at 10 kHz — an 8 kHz input frequency shows up as a 2 kHz alias]

The Aliasing Effect

Nyquist (sampling) theorem: when an f_L-band-limited signal is sampled at a rate of at least 2·f_L, the signal can be exactly reproduced from its samples.

When the sampling rate is too low, the samples can contain "incorrect" frequencies.

Prevention:
- increase the sampling rate
- use an anti-aliasing filter (restrict the signal bandwidth)
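A small sketch of the aliasing effect (plain numpy; the numbers match the undersampling figure on the previous slide): an 8 kHz tone sampled at 10 kHz produces exactly the same samples, up to sign, as a 2 kHz tone.

```python
import numpy as np

fs = 10_000                                 # sampling rate (Hz)
n = np.arange(50)                           # sample indices
x_8k = np.sin(2 * np.pi * 8000 * n / fs)    # undersampled 8 kHz tone
x_2k = np.sin(2 * np.pi * 2000 * n / fs)    # its 2 kHz alias

# sin(2*pi*8000*n/fs) = sin(2*pi*n - 2*pi*2000*n/fs) = -sin(2*pi*2000*n/fs),
# so the sampled 8 kHz tone is indistinguishable from a sign-flipped 2 kHz tone:
print(np.allclose(x_8k, -x_2k))             # True
```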

Feature Extraction

WHY
- Capture important phonetic information in speech
- Computational efficiency, efficiency in storage requirements
- Optimize generalization

WHAT
- Features in the frequency domain
- Reason: it is hard to infer much from the time-domain waveform
- Human hearing is based on frequency analysis
- Frequency analysis simplifies signal processing
- Frequency analysis facilitates understanding
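A minimal sketch of the time-to-frequency step (assuming numpy; the frame and hop sizes are typical values, not prescribed by the slides): split the signal into overlapping windowed frames and take the log-magnitude spectrum of each. Real ASR front ends go on to mel filter banks or MFCCs; this only illustrates the basic idea.

```python
import numpy as np

def frames_to_features(signal, frame_len=400, hop=160):
    """At 16 kHz: 400 samples = 25 ms frames, 160 samples = 10 ms hop."""
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))       # magnitude spectrum
        feats.append(np.log(spectrum + 1e-10))      # log compression
    return np.array(feats)                          # one row per frame

x = np.random.randn(16000)     # 1 second of stand-in "speech" at 16 kHz
X = frames_to_features(x)
print(X.shape)                 # (98, 201): roughly 100 feature vectors/second
```

The 10 ms hop is why the DTW slide later speaks of 100 vectors per second of speech.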

Automatic Speech Recognition

Two sessions on Digital Signal Processing.

[Diagram: Input Speech → Signal Pre-Processing → ??? → Output Text "Hello world"]

Overview
- Automatic Speech Recognition
- Fundamental Equation of Speech Recognition
- Acoustic Model
- Purpose of the Acoustic Model (Pronunciation Dictionary)
- Why break the words down into phones?
- Speech Production seen as a Stochastic Process
- Generating an Observation of Speech Feature Vectors x_1, x_2, ..., x_T
- Hidden Markov Models
- Formal Definition of Hidden Markov Models
- Three Main Problems of Hidden Markov Models
- Hidden Markov Models in ASR
- From the Sentence to the Sentence-HMM
- Context Dependent Acoustic Modeling
- From Sentence to Context Dependent HMM

Automatic Speech Recognition

Fundamental Equation of Speech Recognition:
- Observe a sequence of feature vectors X
- Find the most likely word sequence W:

argmax_W P(W|X) = argmax_W P(W) · p(X|W) / p(X)

[Diagram: Input Speech → Signal Pre-Processing → Output Text "Hello world"]
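A toy sketch of this decision rule (all scores are made up for illustration; no real acoustic or language model is involved): since p(X) is the same for every candidate W, it can be dropped from the argmax, and log-probabilities are summed instead of multiplying probabilities to avoid underflow.

```python
# Hypothetical log-scores for one fixed observation X:
acoustic_logp = {"hello world": -120.0, "yellow whirled": -118.5}  # log p(X|W)
language_logp = {"hello world": -4.2,   "yellow whirled": -11.7}   # log P(W)

def decode(candidates):
    # argmax_W [ log P(W) + log p(X|W) ]; p(X) is constant and omitted
    return max(candidates, key=lambda w: language_logp[w] + acoustic_logp[w])

print(decode(acoustic_logp))  # "hello world": the LM outweighs the small AM edge
```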

Automatic Speech Recognition

argmax_W P(W|X) = argmax_W P(W) · p(X|W) / p(X)

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) → Output Text "Hello world"]

Automatic Speech Recognition

argmax_W P(W|X) = argmax_W P(W) · p(X|W) / p(X)

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) + Language Model P(W) → Output Text "Hello world"]

Automatic Speech Recognition

Search: how to efficiently try all W

argmax_W P(W|X) = argmax_W P(W) · p(X|W) / p(X)

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) + Language Model P(W) → Output Text "Hello world"]

Overview
- Automatic Speech Recognition
- Fundamental Equation of Speech Recognition
- Acoustic Model
- Purpose of the Acoustic Model (Pronunciation Dictionary)
- Why break the words down into phones?
- Speech Production seen as a Stochastic Process
- Generating an Observation of Speech Feature Vectors x_1, x_2, ..., x_T
- Hidden Markov Models
- Formal Definition of Hidden Markov Models
- Three Main Problems of Hidden Markov Models
- Hidden Markov Models in ASR
- From the Sentence to the Sentence-HMM
- Context Dependent Acoustic Modeling
- From Sentence to Context Dependent HMM

Automatic Speech Recognition

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) + Language Model P(W) → Output Text "Hello world"]

Automatic Speech Recognition

Purpose of the Acoustic Model:
- Given W, what is the likelihood p(X|W) of seeing the feature vector(s) X?
- We need a representation of W in terms of feature vectors
- Usually a two-part representation/modeling:
  - a pronunciation dictionary: describes W as a concatenation of phones (e.g. I → /i/, you → /j/ /u/, we → /v/ /e/)
  - phone models: explain the phones in terms of feature vectors

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) + Pronunciation Dictionary → Output Text "Hello world"]

Why break the words down into phones?

Problems with whole-word reference patterns:
- Need a collection of reference patterns for each word
- High computational effort (especially for large vocabularies), proportional to the vocabulary size
- A large vocabulary also means: a huge amount of training data is needed
- Difficult to train suitable references (or sets of references)
- Impossible to recognize untrained words
→ Replace whole words by suitable subunits

- Poor performance when the environment changes
- Works well only for speaker-dependent recognition (variations)
- Unsuitable where the speaker is unknown and no training is feasible
- Unsuitable for continuous speech (combinatorial explosion)
- Difficult to train/recognize subword units
→ Replace the pattern approach by a better modeling process

Automatic Speech Recognition

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) + Language Model P(W) → Output Text "Hello world"]

Speech Production seen as a Stochastic Process
- The same word/phoneme sounds different every time it is uttered
- Regard words/phonemes as states of a speech production process
- In a given state we can observe different acoustic sounds
- Not all sounds are possible/likely in every state
- We say: in a given state the speech process "emits" sounds according to some probability distribution
- The production process makes transitions from one state to another
- Not all transitions are possible, and they have different probabilities
- When we specify the probabilities for sound emissions (emission probabilities) and for the state transitions, we call this a model

Generating an Observation of Speech Feature Vectors x_1, x_2, ..., x_T

The term "hidden" comes from seeing only the observations and drawing conclusions without knowing the hidden sequence of states.

Formal Definition of Hidden Markov Models

A Hidden Markov Model is a five-tuple (S, π, A, B, V) consisting of:
- S: the set of states S = {s_1, s_2, ..., s_n}
- π: the initial probability distribution, where π(s_i) is the probability of s_i being the first state of a state sequence
- A: the matrix of state transition probabilities, A = (a_ij), where a_ij is the probability of state s_j following s_i
- B: the set of emission probability distributions/densities, B = {b_1, b_2, ..., b_n}, where b_i(x) is the probability of observing x when the system is in state s_i
- V: the observable feature space, which can be discrete, V = {x_1, x_2, ..., x_v}, or continuous, V = R^d
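The five-tuple translates directly into code. Below is a minimal sketch with a toy two-state discrete HMM (all probabilities invented for illustration) that also generates an observation sequence while keeping the state sequence hidden, as on the previous slide:

```python
import random

S  = ["s1", "s2"]                            # states
pi = {"s1": 0.7, "s2": 0.3}                  # initial distribution pi(s_i)
A  = {"s1": {"s1": 0.6, "s2": 0.4},          # transition probabilities a_ij
      "s2": {"s1": 0.1, "s2": 0.9}}
V  = ["x1", "x2"]                            # discrete observable feature space
B  = {"s1": {"x1": 0.9, "x2": 0.1},          # emission distributions b_i(x)
      "s2": {"x1": 0.2, "x2": 0.8}}

def sample(T):
    """Generate an observation x_1..x_T; the state sequence stays hidden."""
    draw = lambda d: random.choices(list(d), weights=list(d.values()))[0]
    obs, state = [], draw(pi)
    for _ in range(T):
        obs.append(draw(B[state]))           # emit a symbol in the current state
        state = draw(A[state])               # then make a state transition
    return obs

print(sample(5))                             # e.g. ['x1', 'x1', 'x2', 'x2', 'x2']
```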

Three Main Problems of Hidden Markov Models
- The evaluation problem: given an HMM λ and an observation x_1, x_2, ..., x_T, compute the probability of the observation, p(x_1, x_2, ..., x_T | λ)
- The decoding problem: given an HMM λ and an observation x_1, x_2, ..., x_T, compute the most likely state sequence s_q1, s_q2, ..., s_qT, i.e. argmax_{q_1,...,q_T} p(q_1, ..., q_T | x_1, x_2, ..., x_T, λ)
- The learning/optimization problem: given an HMM λ and an observation x_1, x_2, ..., x_T, find an HMM λ' such that p(x_1, x_2, ..., x_T | λ') > p(x_1, x_2, ..., x_T | λ)
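The evaluation problem is solved by the forward algorithm. Here is a sketch reusing the toy S, pi, A, B tables from the snippet above: alpha[s] holds the probability of having seen x_1..x_t and being in state s, so the sum over the final alpha values is p(x_1, ..., x_T | λ), computed in O(T·n²) instead of enumerating all n^T state sequences.

```python
def forward(obs, S, pi, A, B):
    alpha = {s: pi[s] * B[s][obs[0]] for s in S}          # initialization
    for x in obs[1:]:
        alpha = {s: B[s][x] * sum(alpha[r] * A[r][s] for r in S)
                 for s in S}                              # induction step
    return sum(alpha.values())                            # termination

print(forward(["x1", "x1", "x2"], S, pi, A, B))
```

The decoding problem is solved analogously by the Viterbi algorithm: replace the sum over predecessor states by a max and keep backpointers.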

Hidden Markov Models in ASR
- States that correspond to the same acoustic phenomenon share the same "acoustic model"
- Training data is used more effectively (in the example HMM on this slide, two states share one emission distribution: b_1 = b_7)
- Emission probability parameters are estimated more robustly
- Saves computation time (don't evaluate b(..) for every s_i)

From the Sentence to the Sentence-HMM
- Generate a word lattice of possible word sequences
- Generate a phoneme lattice of possible pronunciations
- Generate a state lattice (HMM) of possible state sequences

[Figure: the word, phoneme, and state lattices for an example sentence]

Context Dependent Acoustic Modeling

Consider the pronunciations of TRUE, TRAIN, TABLE, and TELL. The most common lexicon entries are:
- TRUE: T R UW
- TRAIN: T R EY N
- TABLE: T EY B L
- TELL: T EH L

Notice that the actual pronunciations sound a bit like:
- TRUE: CH R UW
- TRAIN: CH R EY N
- TABLE: T HH EY B L
- TELL: T HH EH L

Statement: the phoneme T sounds different depending on whether the following phoneme is an R or a vowel.

Context Dependent Acoustic Modeling

First idea: use the actual pronunciations in the lexicon, i.e. CH R UW instead of T R UW.
Problem: the CH in TRUE does sound different from the CH in CHURCH.

Second idea: introduce new acoustic units such that the lexicon looks like:
- TRUE: T(R) R UW
- TRAIN: T(R) R EY N
- TABLE: T(vowel) EY B L
- TELL: T(vowel) EH L

i.e. use context dependent models of the phoneme T.

From Sentence to Context Dependent HMM

[Figure: a context independent HMM for the sentence "HELLO WORLD", and the context dependent HMM obtained from it]

Making the phoneme H dependent on its successor (context dependent) turns the context independent HMM into the context dependent one shown in the figure.

Typical improvement of speech recognizers when introducing context dependence: 30%-50% fewer errors.

Automatic Speech Recognition
- Two lectures on Hidden Markov Modeling
- Two lectures on Acoustic Modeling (CI, CD)
- One lecture on Pronunciation Modeling, Variants, Adaptation

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) + Pronunciation Dictionary (I → /i/, you → /j/ /u/, we → /v/ /e/) + Language Model P(W) → Output Text "Hello world"]

Automatic Speech Recognition

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) + Pronunciation Dictionary (I → /i/, you → /j/ /u/, we → /v/ /e/) + Language Model P(W) (example word sequences: "eu sou", "você é", "ela é") → Output Text "Hello world"]

Overview
- Automatic Speech Recognition
- Language Model
- Motivation
- What do we expect from Language Models in ASR?
- Stochastic Language Models
- Probabilities of Word Sequences
- Classification of Word Sequence Histories
- Estimation of N-grams
- Search
- Simplified Training
- Simplified Decoding
- Comparing Complete Utterances
- Alignment of Vector Sequences
- Dynamic Time Warping

Motivation: Language Model

For recognizing and understanding natural speech, knowledge about the language is just as important as acoustic pattern matching.

Language knowledge, and how it is covered in speech recognition:
- Lexical knowledge: vocabulary definition (→ vocabulary) and word pronunciation (→ dictionary)
- Syntax and semantics, i.e. rules that determine whether a word sequence is grammatically well-formed and whether it is meaningful (→ language model / grammar)
- Pragmatics: the structure of extended discourse, and what is likely to be said in a particular context (→ language model / grammar / discourse)

These different levels of knowledge are tightly integrated!

What do we expect from Language Models in ASR?
- Improve the speech recognizer: add another information source
- Disambiguate homophones: find out that "I OWE YOU TOO" is more likely than "EYE O U TWO"
- Search space reduction: when the vocabulary has n words, don't consider all n^k possible k-word sequences
- Analysis: analyze the utterance to understand what has been said; disambiguate homonyms (bank: money vs. river)

Stochastic Language Models

In formal language theory, P(W) is regarded as either:
- 1.0 if the word sequence W is accepted
- 0.0 if the word sequence W is rejected

This is inappropriate for spoken language since:
- a grammar never has complete coverage
- (conversational) spoken language is often ungrammatical

Instead, describe P(W) from the probabilistic viewpoint:
- the occurrence of a word sequence W is described by a probability P(W)
- find a good way to accurately estimate P(W)
- training problem: reliably estimate the probabilities of W
- recognition problem: compute the probability of generating W

Probabilities of Word Sequences

The probability of a word sequence can be decomposed as:

P(W) = P(w_1 w_2 ... w_n) = P(w_1) · P(w_2|w_1) · P(w_3|w_1 w_2) · ... · P(w_n|w_1 w_2 ... w_{n-1})

The choice of w_n thus depends on the entire history of the input, so when computing P(w|history) we have a problem: for a vocabulary of 64,000 words and an average sentence length of 25 words (typical for the Wall Street Journal), we end up with a huge number of possible histories (64,000^25 > 10^120). So it is impossible to precompute a separate P(w|history) for every history.

Two possible solutions:
- compute P(w|history) "on the fly" (rarely used, very expensive)
- replace the history by one out of a limited, feasible number of equivalence classes C, such that P'(w|history) = P(w|C(history))

Question: how do we find good equivalence classes C?

Classification of Word Sequence Histories

We can build equivalence classes using information about:
- Grammatical content (phrases like noun phrase, etc.)
- POS = part of speech of the previous word(s) (e.g. subject, object, ...)
- Semantic meaning of the previous word(s)
- Context similarity (words that are observed in similar contexts are treated equally, e.g. weekdays, people's names, etc.)
- Some kind of automatic clustering (top-down, bottom-up)

The simplest classes are based on the previous words (see the sketch below):
- unigram: P'(w_k | w_1 w_2 ... w_{k-1}) = P(w_k)
- bigram: P'(w_k | w_1 w_2 ... w_{k-1}) = P(w_k | w_{k-1})
- trigram: P'(w_k | w_1 w_2 ... w_{k-1}) = P(w_k | w_{k-2} w_{k-1})
- n-gram: P'(w_k | w_1 w_2 ... w_{k-1}) = P(w_k | w_{k-(n-1)} w_{k-(n-2)} ... w_{k-1})
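A tiny sketch of this equivalence classing (the function name `ngram_class` is illustrative): the history w_1 ... w_{k-1} is mapped to its last n-1 words, so all histories sharing that suffix fall into the same class C(history).

```python
def ngram_class(history, n):
    """Return the equivalence class of a history under an n-gram model."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

h = ["john", "read", "a", "different"]
print(ngram_class(h, 1))   # ()                  -> unigram: empty history
print(ngram_class(h, 2))   # ('different',)      -> bigram: last word
print(ngram_class(h, 3))   # ('a', 'different')  -> trigram: last two words
```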

Estimation of N-grams

The standard approach to estimating P(w|history):
- use a large training corpus ("There's no data like more data")
- determine the frequency with which the word w occurs given the history: simply count how often the word sequence "history w" occurs in the text, and normalize by the number of times the history occurs:

P(w|history) = Count(history, w) / Count(history)

Example: let our training corpus consist of 3 sentences, and use a bigram model:
- John read her book.
- I read a different book.
- John read a book by Mulan.

P(John|<s>) = C(<s>, John) / C(<s>) = 2/3
P(read|John) = C(John, read) / C(John) = 2/2
P(a|read) = C(read, a) / C(read) = 2/3
P(book|a) = C(a, book) / C(a) = 1/2
P(</s>|book) = C(book, </s>) / C(book) = 2/3

Now calculate the probability of the sentence "John read a book":

P(John read a book) = P(John|<s>) · P(read|John) · P(a|read) · P(book|a) · P(</s>|book) ≈ 0.148

But what about the sentence "Mulan read her book"? We don't have P(read|Mulan).
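A minimal sketch of this counting on the toy corpus above (assuming <s> and </s> sentence-boundary markers); it reproduces the value 0.148 and shows the zero-count problem that motivates smoothing:

```python
from collections import Counter

corpus = ["john read her book",
          "i read a different book",
          "john read a book by mulan"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(words[:-1])              # count each word as a history
    bigrams.update(zip(words, words[1:]))    # count "history w" pairs

def p(w, h):
    """Maximum-likelihood bigram estimate P(w|h) = C(h, w) / C(h)."""
    return bigrams[(h, w)] / unigrams[h]

words = ["<s>", "john", "read", "a", "book", "</s>"]
prob = 1.0
for h, w in zip(words, words[1:]):
    prob *= p(w, h)
print(round(prob, 3))                        # 0.148

print(p("read", "mulan"))                    # 0.0: unseen bigram -> needs smoothing
```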

Automatic Speech Recognition
- Two lectures on Language Modeling

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) + Pronunciation Dictionary + Language Model P(W) → Output Text "Hello world"]

Overview
- Automatic Speech Recognition
- Language Model
- Motivation
- What do we expect from Language Models in ASR?
- Stochastic Language Models
- Probabilities of Word Sequences
- Classification of Word Sequence Histories
- Estimation of N-grams
- Search
- Simplified Training
- Simplified Decoding
- Comparing Complete Utterances
- Alignment of Vector Sequences
- Dynamic Time Warping

Automatic Speech Recognition

Search: how to efficiently try all W

argmax_W P(W|X) = argmax_W P(W) · p(X|W) / p(X)

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) + Language Model P(W) → Output Text "Hello world"]

Search
- The entire set of possible pattern sequences is called the search space
- Typical search spaces have 1,000 time frames (10 s of speech) and 500,000 possible pattern sequences
- With an average of 25 words per sentence (e.g. WSJ) and a vocabulary of 64,000 words, there are more possible word sequences than the universe has atoms!
- It is not feasible to compute the most likely sequence of words by evaluating the scores of all possible sequences
- We need an intelligent algorithm that scans the search space and finds the best (or at least a very good) hypothesis
- This problem is referred to as search or decoding

Simplified Training

[Diagram: aligned speech (/h/ /e/ /l/ /o/) → feature extraction → speech features → train classifier → improved classifiers]

Use all aligned speech features (e.g. of phoneme /e/) to train the reference vectors of /e/ (= codebook), e.g. with k-means or LVQ (see the sketch below).

One lecture on Classification.
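A small sketch of codebook training with plain k-means (assuming numpy; the fake 2-D features stand in for real speech feature vectors aligned to /e/): the k centers that come out are the reference vectors of the slide.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init: random points
    for _ in range(iters):
        # assign every feature vector to its nearest reference vector
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each reference vector to the mean of its assigned features
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

feats_e = np.random.default_rng(1).normal(size=(500, 2))    # fake /e/ features
codebook = kmeans(feats_e, k=4)
print(codebook.shape)                                       # (4, 2)
```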

Simplified Decoding

[Diagram: speech → feature extraction → speech features → decision (apply trained classifiers) → phoneme hypotheses /h/ /e/ /l/ /o/ /w/ /o/ /r/ /l/ /d/]

Comparing Complete Utterances

What we had so far:
- record a sound signal
- compute a frequency representation
- quantize/classify the vectors

We now have: a sequence of pattern vectors.
What we want: the similarity between two such sequences.
Obviously: the order of the vectors is important!

Comparing Complete Utterances

Comparing speech vector sequences has to overcome three problems:
1) Speaking rate characterizes speakers (speaker dependent!): if the speaker speaks faster, we get fewer vectors
2) Deliberate changes in speaking rate: e.g. when talking to a foreigner
3) Unintentional changes in speaking rate: speech disfluencies

So we have to find a way to decide which vectors to compare with one another. Impose some constraints! (Comparing every vector to every other is too costly.)

Alignment of Vector Sequences

A first idea to overcome the varying length of utterances (problem 2):
1. Normalize their lengths
2. Make a linear alignment

Linear alignment can handle the problem of different speaking rates. But it cannot handle the problem of varying speaking rates within the same utterance.

Dynamic Time Warping (DTW)

Goal: identify the example pattern that is most similar to the unknown input, i.e. compare patterns of different lengths. Note: all patterns are preprocessed into 100 vectors per second of speech.

DTW: find the alignment between the unknown input (frames t_1, t_2, ..., t_N) and the example pattern (frames t_1, t_2, ..., t_M) that minimizes the overall distance. Find the average vector distance, but between which frame pairs? Individual frame pairs are compared with the Euclidean distance.
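A compact sketch of DTW with dynamic programming (assuming numpy; the random patterns stand in for real feature sequences): cost[i, j] is the minimal summed Euclidean distance over all monotonic alignments matching the first i frames of the example pattern with the first j frames of the input.

```python
import numpy as np

def dtw_distance(ref, inp):
    """ref: (M, d) example pattern, inp: (N, d) unknown input pattern."""
    M, N = len(ref), len(inp)
    cost = np.full((M + 1, N + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            d = np.linalg.norm(ref[i - 1] - inp[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],          # stretch the input
                                 cost[i, j - 1],          # stretch the reference
                                 cost[i - 1, j - 1])      # advance both
    return cost[M, N] / (M + N)    # one common length normalization

rng = np.random.default_rng(0)
ref = rng.normal(size=(30, 12))    # example pattern: 30 frames, 12-dim features
inp = rng.normal(size=(40, 12))    # unknown input: 40 frames
print(dtw_distance(ref, inp))
```

For isolated-word recognition, the unknown input is compared against the stored example pattern of every vocabulary word, and the word with the smallest DTW distance wins.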

Automatic Speech Recognition
- Search: how to efficiently try all W
- Two lectures on Search

argmax_W P(W|X) = argmax_W P(W) · p(X|W) / p(X)

[Diagram: Input Speech → Signal Pre-Processing → Acoustic Model p(X|W) + Language Model P(W) → Output Text "Hello world"]

P(e) -- a priori probability: the chance that e happens. For example, if e is the English string "I like snakes", then P(e) is the chance that a certain person at a certain time will say "I like snakes" as opposed to saying something else.

P(f|e) -- conditional probability: the chance of f given e. For example, if e is the English string "I like snakes", and if f is the French string "maison bleue", then P(f|e) is the chance that upon seeing e, a translator will produce f. Not bloody likely, in this case.

P(e,f) -- joint probability: the chance of e and f both happening. If e and f don't influence each other, then we can write P(e,f) = P(e) * P(f). If e and f do influence each other, then we had better write P(e,f) = P(e) * P(f|e). That means: the chance that e happens, times the chance that f happens given that e happened.

Thanks for your interest!