Speech Recognition Lecture 1: Introduction. Mehryar Mohri Courant Institute and Google Research


Speech Recognition, Lecture 1: Introduction. Mehryar Mohri, Courant Institute and Google Research. mohri@cims.nyu.edu

Logistics. Prerequisites: basics in analysis of algorithms and probability; no specific knowledge of signal processing is required. Workload: 2-3 homework assignments, 1 project (your choice). Textbooks: there is no single textbook covering the material presented in this course; lecture slides are available electronically.

Objectives. Computer science view of automatic speech recognition (ASR), with no signal processing. Essential algorithms for large-vocabulary speech recognition, but with emphasis on general algorithms: automata and transducer algorithms, and statistical learning algorithms.

Topics: introduction, formulation, components, features; weighted transducer software library; weighted automata algorithms; statistical language modeling software library; n-gram models; maximum entropy models; pronunciation models, decision trees, context-dependent models.

Topics (continued): search algorithms, transducer optimizations, Viterbi decoder; search algorithms, N-best algorithms, lattice generation, rescoring; structured prediction algorithms; adaptation; active learning; semi-supervised learning.

This Lecture: speech recognition problem; statistical formulation; acoustic features.

Speech Recognition Problem. Definition: find an accurate written transcription of spoken utterances; transcriptions may be in words, phonemes, syllables, or other units. Accuracy: typically measured in terms of the edit distance between the reference transcription and the sequence output by the model.
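As an illustration (not part of the slides), here is a minimal Python sketch of this accuracy measure: the word error rate, i.e., the edit distance between the reference and hypothesis word sequences divided by the reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```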

Other Related Problems. Speaker verification. Speaker identification. Spoken-dialog systems. Detection of voice features, e.g., gender, age, dialect, emotion, height, weight. Speech synthesis.

Speech Spectrogram (figure).

Speech Recognition Is Difficult. Highly variable: the same words pronounced by the same person in the same conditions typically lead to different waveforms. Source variation: speaking rate, volume, accent, dialect, pitch, coarticulation. Channel variation: microphone (type, position), noise (background, distortion). Key problem: robustness to such variations.

ASR Characteristics. Vocabulary size: small (digit recognition, 10), medium (Resource Management, 1,000), large (Broadcast News, 100,000), very large (1M+). Speaker-dependent or speaker-independent. Domain-specific or unconstrained, e.g., travel reservation vs. modern spoken-dialog systems. Isolated (pause between units) or continuous. Read or spontaneous, e.g., dictation, news broadcast, conversational speech.

Example: Broadcast News.

History. See (Juang and Rabiner, 2005). 1922: Radio Rex, a toy single-word recognizer ("rex"). 1939: voder and vocoder (mechanical synthesizers), Dudley (Bell Labs). 1952: isolated digit recognition, single speaker (Bell Labs). 1950s: 10 syllables of a single speaker, Olson and Belar (RCA Labs). 1950s: speaker-independent 10-vowel recognizer (MIT).

History. 1960s: Linear Predictive Coding (LPC), Atal and Itakura. 1969: John Pierce's negative comments about ASR (Bell Labs). 1970s: Advanced Research Projects Agency (ARPA) funds a speech understanding program; CMU's Harpy system, based on automata, had reasonable accuracy for 1,000 words.

History. 1980s: n-gram models; ARPA Resource Management, Wall Street Journal, and ATIS tasks; delta/delta-delta cepstra, mel cepstra. Mid-1980s: hidden Markov models (HMMs) become the preferred technique for speech recognition. 1990s: discriminative training, vocal tract normalization, speaker adaptation; very large-vocabulary speech recognition, e.g., a 1M-name recognizer (Bell Labs) and a 500,000-word North American Business News (NAB) recognizer.

History. Mid-1990s: FSM library; weighted transducers become a major component of almost all modern speech recognition and understanding systems; SVMs, kernel methods; dictation systems (Dragon, IBM speaker-dependent systems). 2000s: Broadcast News; conversational speech, e.g., Switchboard, Call Home; real-time large-vocabulary systems; unconstrained spoken-dialog systems, e.g., HMIHY.

History. Figure from (Juang and Rabiner, 2005): "Milestones in Speech and Multimodal Technology Research", a timeline (roughly 1962-2002) from small-vocabulary, acoustic-phonetics-based isolated-word systems, through connected-word and continuous-speech systems based first on pattern recognition and then on statistical (HMM) methods, to very-large-vocabulary, multimodal spoken-dialog systems.

Unconstrained Spoken-Dialog Systems.

This Lecture: speech recognition problem; statistical formulation; acoustic features.

This Lecture: speech recognition problem; statistical formulation (maximum likelihood and maximum a posteriori; statistical formulation of speech recognition; components of a speech recognizer); acoustic features.

Problem. Data: a sample $x_1, \ldots, x_m \in X$ drawn i.i.d. from a set $X$ according to some distribution $D$. Problem: find the distribution $p$ out of a set $P$ that best estimates $D$.

Maximum Likelihood. Likelihood: probability of observing the sample under distribution $p \in P$, which, given the independence assumption, is $\Pr[x_1, \ldots, x_m] = \prod_{i=1}^m p(x_i)$. Principle: select the distribution maximizing the sample probability, $\hat{p} = \operatorname{argmax}_{p \in P} \prod_{i=1}^m p(x_i)$, or equivalently $\hat{p} = \operatorname{argmax}_{p \in P} \sum_{i=1}^m \log p(x_i)$.

Example: Bernoulli Trials. Problem: find the most likely Bernoulli distribution, given a sequence of coin flips H, T, T, H, T, H, T, H, H, H, T, T, ..., H. Bernoulli distribution: $p(H) = \theta$, $p(T) = 1 - \theta$. Likelihood: $l(p) = \log \theta^{N(H)} (1-\theta)^{N(T)} = N(H) \log \theta + N(T) \log(1 - \theta)$. Solution: $l$ is differentiable and concave; $\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1-\theta} = 0 \Rightarrow \theta = \frac{N(H)}{N(H) + N(T)}$.

Example: Gaussian Distribution. Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations 3.18, 2.35, .95, 1.175, .... Normal distribution: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. Likelihood: $l(p) = -\frac{1}{2} \sum_{i=1}^m \frac{(x_i - \mu)^2}{\sigma^2} - \frac{m}{2} \log(2\pi\sigma^2)$. Solution: $l$ is differentiable and concave; $\frac{\partial l}{\partial \mu} = 0 \Rightarrow \mu = \frac{1}{m} \sum_{i=1}^m x_i$; $\frac{\partial l}{\partial \sigma^2} = 0 \Rightarrow \sigma^2 = \frac{1}{m} \sum_{i=1}^m x_i^2 - \mu^2$.
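Both closed-form solutions above can be checked numerically; a small sketch (the Gaussian data are the values from the example, the coin-flip sequence is hypothetical):

```python
import numpy as np

# Bernoulli: theta = N(H) / (N(H) + N(T)).
flips = list("HTTHTHTHHHTTH")              # hypothetical outcome sequence
theta = flips.count("H") / len(flips)

# Gaussian: mu is the sample mean, sigma^2 = (1/m) sum x_i^2 - mu^2 (the biased ML estimate).
x = np.array([3.18, 2.35, 0.95, 1.175])
mu = x.mean()
sigma2 = (x ** 2).mean() - mu ** 2          # equivalently x.var() with ddof=0

print(theta, mu, sigma2)
```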

Properties. Problems: the underlying distribution may not be among those searched; overfitting: the number of examples may be too small with respect to the number of parameters.

Maximum A Posteriori (MAP). Principle: select the most likely hypothesis $h \in H$ given the sample $S$, with some prior distribution $\Pr[h]$ over the hypotheses: $\hat{h} = \operatorname{argmax}_{h \in H} \Pr[h \mid S] = \operatorname{argmax}_{h \in H} \frac{\Pr[S \mid h]\,\Pr[h]}{\Pr[S]} = \operatorname{argmax}_{h \in H} \Pr[S \mid h]\,\Pr[h]$. Note: for a uniform prior, MAP coincides with maximum likelihood.

This Lecture: speech recognition problem; statistical formulation (maximum likelihood and maximum a posteriori; statistical formulation of speech recognition; components of a speech recognizer); acoustic features.

General Ideas. Probabilistic formulation: given a spoken utterance, find the most likely transcription. Decomposition: the mapping from spoken utterances to word sequences is decomposed into intermediate units (figure: an observation sequence $o_1 \ldots o_{16}$ aligned with a context-dependent phone sequence $c_1 \ldots c_{14}$, a phoneme sequence $p_1 \ldots p_{10}$, and a word sequence $w_1 \ldots w_4$).

Statistical Formulation (Bahl, Jelinek, and Mercer, 1983). Observation sequence produced by the signal processing system: $o = o_1 \ldots o_m$. Sequence of words over the alphabet $\Sigma$: $w = w_1 \ldots w_k$. Formulation (maximum a posteriori decoding): $\hat{w} = \operatorname{argmax}_{w \in \Sigma^*} \Pr[w \mid o] = \operatorname{argmax}_{w \in \Sigma^*} \frac{\Pr[o \mid w]\,\Pr[w]}{\Pr[o]} = \operatorname{argmax}_{w \in \Sigma^*} \Pr[o \mid w]\,\Pr[w]$, where $\Pr[o \mid w]$ is the acoustic and pronunciation model and $\Pr[w]$ is the language model.

Fred Jelinek, 18 November 1932 - 14 September 2010.

Components. Acoustic and pronunciation model: $\Pr(o \mid w) = \sum_{d,c,p} \Pr(o \mid d)\,\Pr(d \mid c)\,\Pr(c \mid p)\,\Pr(p \mid w)$, where $\Pr(o \mid d)$ (the acoustic model) relates the observation sequence to the distribution sequence, $\Pr(d \mid c)$ the distribution sequence to the CD phone sequence, $\Pr(c \mid p)$ the CD phone sequence to the phoneme sequence, and $\Pr(p \mid w)$ the phoneme sequence to the word sequence. Language model: $\Pr(w)$, a distribution over word sequences.

Notes. The formulation does not match the way speech recognition errors are typically measured: the edit distance between the hypothesis and the reference transcription.

This Lecture: speech recognition problem; statistical formulation (maximum likelihood and maximum a posteriori; statistical formulation of speech recognition; components of a speech recognizer); acoustic features.

Acoustic Observations. Discretization in time: local spectral analysis of the speech waveform at regular intervals $t = t_1, \ldots, t_m$, with $t_{i+1} - t_i = 10$ ms (typically). Parameter vectors: $o = o_1 \ldots o_m$, with $o_i \in \mathbb{R}^N$ and $N = 39$ (typically). Note: other perceptual information, e.g., visual information, is ignored.
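A sketch of this discretization under typical assumptions (16 kHz sampling, 25 ms analysis windows every 10 ms; the waveform here is a placeholder):

```python
import numpy as np

sample_rate = 16000
frame_len = int(0.025 * sample_rate)    # 25 ms analysis window (400 samples)
frame_shift = int(0.010 * sample_rate)  # 10 ms between successive frames

waveform = np.random.randn(sample_rate)  # 1 second of placeholder audio
n_frames = 1 + (len(waveform) - frame_len) // frame_shift
frames = np.stack([waveform[i * frame_shift : i * frame_shift + frame_len]
                   for i in range(n_frames)])
print(frames.shape)  # (n_frames, 400): one row per parameter vector to be computed
```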

Acoustic Model (Rabiner and Juang, 1993). Three-state hidden Markov models (HMMs); figure: a left-to-right transducer with states 0, 1, 2, 3, self-loops $d_0{:}\varepsilon$, $d_1{:}\varepsilon$, $d_2{:}\varepsilon$, and transitions $d_0{:}\varepsilon$, $d_1{:}\varepsilon$, $d_2{:}\mathrm{ae}_{b,d}$. Distributions: full-covariance multivariate Gaussians, $\Pr[\omega] = \frac{1}{(2\pi)^{N/2} |\sigma|^{1/2}}\, e^{-\frac{1}{2}(\omega - \mu)^T \sigma^{-1} (\omega - \mu)}$; diagonal-covariance Gaussian mixtures; semi-continuous, tied mixtures.
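As one concrete instance of the emission distributions listed above, a hedged sketch of a diagonal-covariance Gaussian mixture log-density (all parameter values are hypothetical):

```python
import numpy as np

def log_gauss_diag(x, mu, var):
    """log N(x; mu, diag(var)) for one N-dimensional observation."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def log_gmm(x, weights, mus, variances):
    """log sum_k w_k N(x; mu_k, diag(var_k)), computed in a numerically stable way."""
    comps = [np.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, mus, variances)]
    top = max(comps)
    return top + np.log(sum(np.exp(c - top) for c in comps))

x = np.zeros(39)                            # one 39-dimensional feature vector
mus = [np.zeros(39), np.ones(39)]
variances = [np.ones(39), 2.0 * np.ones(39)]
print(log_gmm(x, [0.6, 0.4], mus, variances))
```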

Context-Dependent Model (Lee, 1990; Young et al., 1994). Idea: phoneme pronunciation depends on its environment (allophones, co-articulation); modeling phones in context gives better accuracy. Context-dependent rules: $\mathrm{ae}/b\_d$ (ae in the context of b and d). Context-dependent units: $\mathrm{ae}_{b,d}$. Allophonic rules: e.g., $t/V\_V \to dx$ (a t between vowels realized as a flap). Complex contexts: regular expressions.
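A small sketch of context-dependent ("triphone") units: each phoneme is relabeled with its left and right neighbors. The b-ae+d notation used here is one common convention, not the one in the slides:

```python
def to_triphones(phones):
    """Relabel each phone with its left and right context; '#' marks a boundary."""
    padded = ["#"] + phones + ["#"]
    return [f"{padded[i]}-{padded[i + 1]}+{padded[i + 2]}"
            for i in range(len(phones))]

print(to_triphones(["b", "ae", "d"]))
# ['#-b+ae', 'b-ae+d', 'ae-d+#']  -- 'b-ae+d' is the unit written ae/b_d above
```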

Pronunciation Dictionary. Phonemic transcription. Example: the word "data" in American English: D ey dx ax (0.32); D ey t ax (0.08); D ae dx ax (0.48); D ae t ax (0.12). Representation as a weighted transducer: state 0 to 1 on d:$\varepsilon$/1.0; state 1 to 2 on ey:$\varepsilon$/0.4 or ae:$\varepsilon$/0.6; state 2 to 3 on dx:$\varepsilon$/0.8 or t:$\varepsilon$/0.2; state 3 to final state 4 on ax:data/1.0.
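A sketch of the same weighted entries as a plain lookup table; an actual recognizer compiles such a dictionary into the weighted transducer described above (later lectures):

```python
pronunciations = {
    "data": [
        (["D", "ey", "dx", "ax"], 0.32),
        (["D", "ey", "t",  "ax"], 0.08),
        (["D", "ae", "dx", "ax"], 0.48),
        (["D", "ae", "t",  "ax"], 0.12),
    ],
}

def most_likely_pronunciation(word):
    """Return the highest-probability phonemic transcription of a word."""
    return max(pronunciations[word], key=lambda entry: entry[1])

print(most_likely_pronunciation("data"))  # (['D', 'ae', 'dx', 'ax'], 0.48)
```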

Language Model. Definition: a probabilistic model for sequences of words $w = w_1 \ldots w_k$. By the chain rule, $\Pr[w] = \prod_{i=1}^k \Pr[w_i \mid w_1 \ldots w_{i-1}]$. Modeling simplification: clustering of histories, $(w_1, \ldots, w_{i-1}) \mapsto c(w_1, \ldots, w_{i-1})$. Example: $n$th-order Markov assumption, for all $i$, $\Pr[w_i \mid w_1 \ldots w_{i-1}] = \Pr[w_i \mid h_i]$ with $|h_i| \le n - 1$.
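A minimal sketch of this Markov assumption with a history of a single word (a bigram model) and maximum-likelihood estimates from counts; the corpus is hypothetical and smoothing is omitted:

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]   # hypothetical data
bigrams, unigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def prob(word, history):
    """Pr[w_i | h_i] with h_i the single previous word (no smoothing)."""
    return bigrams[(history, word)] / unigrams[history]

print(prob("cat", "the"))  # 0.5
```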

Recognition Cascade. Combination of components: the observation sequence is mapped by the HMM to a CD phone sequence, by the CD model to a phoneme sequence, by the pronunciation model to a word sequence, and rescored by the language model. Viterbi approximation: $\hat{w} = \operatorname{argmax}_{w} \sum_{d,c,p} \Pr[o \mid d]\,\Pr[d \mid c]\,\Pr[c \mid p]\,\Pr[p \mid w]\,\Pr[w] \approx \operatorname{argmax}_{w} \max_{d,c,p} \Pr[o \mid d]\,\Pr[d \mid c]\,\Pr[c \mid p]\,\Pr[p \mid w]\,\Pr[w]$.
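A minimal sketch of the Viterbi approximation over a single toy HMM, working in the log domain and keeping only the best state sequence instead of summing over all of them (the model parameters are hypothetical):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """log_init: (S,), log_trans: (S, S), log_emit: (T, S); returns best path and score."""
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans        # cand[i, j]: arrive in state j from i
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())

log_init = np.log([0.9, 0.1])
log_trans = np.log([[0.7, 0.3], [0.2, 0.8]])
log_emit = np.log([[0.8, 0.2], [0.4, 0.6], [0.1, 0.9]])
print(viterbi(log_init, log_trans, log_emit))
```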

Speech Recognition Problems. Learning: how to create accurate models for each component? Search: how to efficiently combine the models and determine the best transcription? Representation: compact data structures for the computational representation of the models, with a common representation and algorithmic framework based on weighted transducers (next lectures).

This Lecture: speech recognition problem; statistical formulation; acoustic features.

Feature Selection. Short-time Fourier analysis: $\log \left| \int x(t)\, w(t - \tau)\, e^{-i\omega t}\, dt \right|$. Figure: short-time (25 ms Hamming window) spectrum of /ae/, power (dB) versus frequency (Hz). Idea: find a smooth approximation eliminating large variations over short frequency intervals.

Cepstral Coefficients. Let $x(\omega)$ denote the Fourier transform of the signal. Definition: the 13 cepstral coefficients are the energy and the first 12 coefficients of the expansion $\log |x(\omega)| = \sum_{n=-\infty}^{+\infty} c_n e^{in\omega}$. Other coefficients: 13 first-order (delta-cepstra) and 13 second-order (delta-delta cepstra) differentials.
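A hedged numpy sketch of the pipeline described in the last two slides: a Hamming-windowed frame, its log magnitude spectrum, the low-order coefficients of that expansion (the cepstra), and simple frame differences as delta features. The constants are typical values, not prescribed here, and the delta computation is one simple finite-difference variant:

```python
import numpy as np

def cepstra(frame, n_ceps=13):
    """Low-order real cepstrum of one frame: window, log spectrum, inverse FFT."""
    windowed = frame * np.hamming(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    return np.fft.irfft(log_spectrum)[:n_ceps]

def deltas(features):
    """First-order differentials between consecutive frames."""
    return np.diff(features, axis=0, prepend=features[:1])

frames = np.random.randn(100, 400)                 # placeholder 25 ms frames
c = np.stack([cepstra(f) for f in frames])         # (100, 13)
features = np.hstack([c, deltas(c), deltas(deltas(c))])
print(features.shape)                              # (100, 39): the typical N = 39
```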

Mel Frequency Cepstral Coefficients (Stevens and Volkman, 1940). Refinement: a non-linear frequency scale approximating human perception of the distance between frequencies, e.g., the mel frequency scale $f_{\mathrm{mel}} = 2595 \log_{10}(1 + f/700)$. MFCCs: the signal is first transformed using the mel frequency bands, then cepstral coefficients are extracted.
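The mel warping above as code (a sketch; a full MFCC front end would also apply a mel filter bank and a discrete cosine transform, omitted here):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))             # ~1000 mel: the scale is near-linear below 1 kHz
print(mel_to_hz(hz_to_mel(4000.0)))  # round trip back to 4000 Hz
```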

Other Refinements. Speaker/channel adaptation: mean cepstral subtraction; vocal tract normalization; linear transformations.
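A one-line sketch of mean cepstral subtraction: subtracting the per-utterance average of each coefficient, which cancels a fixed (convolutional) channel effect; the feature matrix here is a placeholder:

```python
import numpy as np

features = np.random.randn(100, 39)   # hypothetical utterance features (frames x coefficients)
normalized = features - features.mean(axis=0, keepdims=True)
```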

References.
Bahl, L. R., Jelinek, F., and Mercer, R. (1983). A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(2):179-190.
Biing-Hwang Juang and Lawrence R. Rabiner. Automatic Speech Recognition: A Brief History of the Technology. Elsevier Encyclopedia of Language and Linguistics, Second Edition, 2005.
Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1998.
Kai-Fu Lee. Context-Dependent Phonetic Hidden Markov Models for Continuous Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):599-609, 1990.
Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.

References (continued).
S. S. Stevens and J. Volkman. The Relation of Pitch to Frequency. American Journal of Psychology, 53:329, 1940.
Steve Young, J. Odell, and Phil Woodland. Tree-Based State-Tying for High Accuracy Acoustic Modelling. In Proceedings of the ARPA Human Language Technology Workshop, Morgan Kaufmann, San Francisco, 1994.