CS 545 Lecture XI: Speech
(some slides courtesy Jurafsky & Martin)
Benjamin Snyder

Announcements
  Office hours change for today and next week: 1pm - 1:45pm, or by appointment (but please schedule ahead)
  HW 4 / 5? will be out soon

Speech in a Slide
  Frequency gives pitch; amplitude gives volume
  [Figure: waveform, amplitude over time, with phone labels "p ee ch l a b"]
  Frequencies at each time slice are processed into observation vectors
  [Figure: spectrogram, frequency over time, with observation sequence ... a12 a13 a12 a14 a14 ...]

The Noisy-Channel Model
  P(w): language model over word sequences w
  P(o | w): acoustic model over observations o

ASR System Components
  Source (language model): produces w with probability P(w)
  Channel (acoustic model): turns w into observations o with probability P(o | w)
  Decoder: given the observed o, finds the best w
    $$\hat{w} = \arg\max_{w} P(w \mid o) = \arg\max_{w} P(o \mid w)\, P(w)$$
  (The second step is Bayes' rule; P(o) is dropped because it does not depend on w.)

Phoneme Inventories
  Phoneme: a sound used as a building block in words
  Some phonemes occur in most languages (b, p, m, n, s)
  But phoneme inventories vary substantially in size and scope across languages

Consonants
  Characterized by:
  1. place of articulation
  2. manner of articulation
  3. voicing

Vowels

Phonotactics
  Languages exhibit phonotactics: some phoneme sequences are favored, others are forbidden
  Phonotactics are largely language-specific...
  ...but are often shared within language families
  And some sound sequences are anatomically difficult for everyone: kgvrsatr

Speech Recognition

Applications of Speech Recognition (ASR)
  Dictation
  Telephone-based information (directions, air travel, banking, etc.)
  Hands-free (in car)
  Speaker identification
  Language identification
  Second language ('L2') (accent reduction)
  Audio archive searching

LVCSR
  Large Vocabulary Continuous Speech Recognition
  ~20,000-64,000 words
  Speaker independent (vs. speaker-dependent)
  Continuous speech (vs. isolated-word)

Current error rates
  Ballpark numbers; exact numbers depend very much on the specific corpus

  Task                      Vocabulary   Error rate (%)
  Digits                    11           0.5
  WSJ read speech           5K           3
  WSJ read speech           20K          3
  Broadcast news            64,000+      10
  Conversational telephone  64,000+      20
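
The slides do not say how the error rates in the table are computed. A minimal sketch, assuming they are standard word error rates (word-level edit distance divided by the number of reference words, as a percentage); the function name `word_error_rate` is hypothetical:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance over reference length, as a percentage."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("two" -> "too") and one deletion ("four") out of 4 words.
print(word_error_rate("one two three four", "one too three"))  # prints 50.0
```

Under this definition the rate can exceed 100% when the recognizer inserts many extra words.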

HSR versus ASR

  Task               Vocab   ASR error (%)   Human error (%)
  Continuous digits  11      0.5             0.009
  WSJ 1995 clean     5K      3               0.9
  WSJ 1995 w/noise   5K      9               1.1
  SWBD 2004          65K     20              4

  Conclusions:
  Machines are about 5 times worse than humans
  The gap increases with noisy speech
  These numbers are rough; take them with a grain of salt

7/30/08 Speech and Language Processing Jurafsky and Martin

Issues
  Pronunciation: error 3-4 times higher for native Spanish and Japanese speakers
  Car noise: error 2-4 times higher
  Multiple speakers

LVCSR Design Intuition
  Build a statistical model of the speech-to-words process
  Collect lots and lots of speech, and transcribe all the words
  Train the model on the labeled speech
  Paradigm: Supervised Machine Learning + Search

Speech Recognition Architecture

Architecture: Five easy pieces (only 3-4 for today)
  HMMs, Lexicons, and Pronunciation
  Feature extraction
  Acoustic Modeling
  Decoding
  Language Modeling (seen this already)

Noisy Channel Part 1: Words to Phonemes (transitions in HMM)

Lexicon
  A list of words
  Each one with a pronunciation in terms of phones
  We get these from an on-line pronunciation dictionary
  CMU dictionary: 127K words
    http://www.speech.cs.cmu.edu/cgi-bin/cmudict
  We'll represent the lexicon as an HMM

HMMs for speech: the word six

Phones are not homogeneous!
  [Figure: spectrogram of the phones "ay k", 0-5000 Hz, time 0.48152-0.937203 s]
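
A minimal sketch of "representing the lexicon as an HMM", assuming a simple left-to-right topology in which every state either repeats or moves on to the next state. The two-entry `LEXICON`, the function name `word_hmm`, the state naming scheme, and the 0.5 self-loop probability are all hypothetical choices made for illustration; a trained system would estimate the transition probabilities from data.

```python
# Hypothetical mini-lexicon with CMUdict-style phones (stress marks dropped).
LEXICON = {
    "six": ["s", "ih", "k", "s"],
    "one": ["w", "ah", "n"],
}


def word_hmm(word, states_per_phone=1, self_loop=0.5):
    """Build a left-to-right HMM (states and transitions) for one word."""
    states = []
    for pos, phone in enumerate(LEXICON[word]):
        for k in range(states_per_phone):
            # Include the position so the two "s" phones in "six" stay distinct.
            states.append(f"{phone}{pos}.{k}")
    transitions = {}
    for i, state in enumerate(states):
        nxt = states[i + 1] if i + 1 < len(states) else "END"
        transitions[(state, state)] = self_loop       # stay in this state
        transitions[(state, nxt)] = 1.0 - self_loop   # advance
    return states, transitions


states, transitions = word_hmm("six")
print(states)  # prints ['s0.0', 'ih1.0', 'k2.0', 's3.0']
```

Calling `word_hmm("six", states_per_phone=3)` gives 12 states, one chain of three per phone.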

Each phone has 3 subphones

Resulting HMM word model for "six", with subphones

Noisy Channel Part 2: Phonemes to Sounds (emissions in HMM)

George Miller figure

And also, human acoustic perception...

We care about the filter, not the source
  Most characteristics of the source (F0, details of the glottal pulse) don't matter for phone detection
  What we care about is the filter: the exact position of the articulators in the oral tract
  So we want a way to separate these and use only the filter function

Mel-scale
  Human hearing is not equally sensitive to all frequency bands
  Less sensitive at higher frequencies, roughly > 1000 Hz
  I.e., human perception of frequency is non-linear
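
The slides give no formula for the mel scale. A minimal sketch, assuming one commonly used approximation, mel(f) = 1127 ln(1 + f / 700); the function names are hypothetical. It shows the non-linearity: the mapping is roughly linear below 1000 Hz and compresses higher frequencies.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (assumed formula: 1127 * ln(1 + f / 700))."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


# Equal 1000 Hz steps cover fewer and fewer mels as frequency rises.
for lo in (0, 1000, 4000, 7000):
    step = hz_to_mel(lo + 1000) - hz_to_mel(lo)
    print(f"{lo:5d}-{lo + 1000:5d} Hz spans {step:7.1f} mels")
```

With this formula 1000 Hz lands at almost exactly 1000 mels, and filters spaced evenly in mels become progressively wider in Hz.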

MFCC: Mel-Frequency Cepstral Coefficients

Final Feature Vector
  39 features per 10 ms frame:
    12 MFCC features
    12 delta MFCC features
    12 delta-delta MFCC features
    1 (log) frame energy
    1 delta (log) frame energy
    1 delta-delta (log) frame energy
  So each frame is represented by a 39-dimensional vector

Acoustic Modeling (= phone detection)
  Given a 39-dimensional vector corresponding to the observation of one frame, $o_i$
  And given a phone $q$ we want to detect
  Compute $p(o_i \mid q)$
  Most popular method: GMMs (Gaussian mixture models)
  Other methods: neural nets, CRFs, SVMs, etc.
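
To make the 39-dimensional feature vector concrete, here is a minimal sketch that stacks 13 static features per frame (12 MFCCs plus log energy) with their deltas and delta-deltas. It assumes the simplest possible delta, half the difference between the next and previous frames with edges clamped; real front ends typically use a regression over a wider window. The function names are hypothetical, and the 13 static values are assumed to be computed elsewhere.

```python
def deltas(frames):
    """Per-frame delta: (next - previous) / 2, clamping at the sequence edges."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def add_dynamic_features(static):
    """Concatenate static, delta, and delta-delta features for every frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Five toy frames of 13 static features each (values are made up).
static = [[float(t)] * 13 for t in range(5)]
features = add_dynamic_features(static)
print(len(features), len(features[0]))  # prints 5 39
```

The ordering here (all static, then all delta, then all delta-delta) is one choice; any fixed ordering works as long as training and decoding agree.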

Gaussian Mixture Models
  Also called fully-continuous HMMs
  $P(o \mid q)$ computed by a Gaussian:
    $$p(o \mid q) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(o-\mu)^2}{2\sigma^2}\right)$$

Gaussians for Acoustic Modeling
  A Gaussian is parameterized by a mean and a variance
  [Figure: Gaussians with different means; P(o | q) is highest at the mean and low very far from the mean]

Training Gaussians
  A (single) Gaussian is characterized by a mean and a variance
  Imagine that we had some training data in which each phone was labeled
  And imagine that we were just computing 1 single spectral value (a real-valued number) as our acoustic observation
  We could just compute the mean and variance from the data:
    $$\mu_i = \frac{1}{T}\sum_{t=1}^{T} o_t \quad \text{s.t. } o_t \text{ is phone } i$$
    $$\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } o_t \text{ is phone } i$$
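
The labeled-data case above can be sketched as follows, taking T to be the number of frames labeled with phone i (an assumption about how the sums are restricted). The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def train_gaussians(observations, labels):
    """Maximum-likelihood mean and variance per phone from labeled scalars."""
    by_phone = defaultdict(list)
    for o, q in zip(observations, labels):
        by_phone[q].append(o)
    params = {}
    for q, xs in by_phone.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        params[q] = (mu, var)
    return params


def gaussian_pdf(o, mu, var):
    """Univariate Gaussian density p(o | q) with mean mu and variance var."""
    return math.exp(-((o - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Toy data: one spectral value per frame, each frame labeled with its phone.
params = train_gaussians([1.0, 3.0, 10.0, 14.0], ["a", "a", "b", "b"])
print(params["a"])  # prints (2.0, 1.0)
print(params["b"])  # prints (12.0, 4.0)
```

A phone with a single frame, or with identical frames, gets variance 0 here, which makes `gaussian_pdf` divide by zero; practical systems floor the variance.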

But we need 39 Gaussians, not 1!
  The observation o is really a vector of length 39
  So we need a vector of Gaussians:
    $$p(o \mid q) = \frac{1}{(2\pi)^{D/2}\sqrt{\prod_{d=1}^{D}\sigma^2[d]}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o[d]-\mu[d])^2}{\sigma^2[d]}\right)$$

Gaussian Intuitions: Size of Σ
  [Figure: three Gaussians, each with µ = [0 0], and Σ = I, Σ = 0.6I, Σ = 2I]
  As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed
  Text and figures from Andrew Ng's lecture notes for CS229

Actually, mixture of Gaussians
  [Figure: Phone A, Phone B]
  Each phone is modeled by a sum of different Gaussians
  Hence able to model complex facts about the data
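
A minimal sketch of the vector-of-Gaussians density and of a mixture built from it, working in the log domain to avoid underflow (a choice made here, not something the slides specify). It assumes one variance per dimension and strictly positive mixture weights that sum to 1. The function names are hypothetical.

```python
import math


def log_diag_gaussian(o, mu, var):
    """Log of the D-dimensional Gaussian with one variance per dimension."""
    dim = len(o)
    log_norm = -0.5 * (dim * math.log(2.0 * math.pi) + sum(math.log(v) for v in var))
    quad = sum((x - m) ** 2 / v for x, m, v in zip(o, mu, var))
    return log_norm - 0.5 * quad


def log_gmm(o, weights, means, variances):
    """Log p(o | q) for a mixture: log of sum_m weight_m * Gaussian_m(o)."""
    terms = [
        math.log(w) + log_diag_gaussian(o, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)  # log-sum-exp: factor out the largest term
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Toy 2-D phone model with two components (all numbers are made up).
weights = [0.7, 0.3]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [2.0, 0.5]]
print(log_gmm([0.1, -0.2], weights, means, variances))
print(log_gmm([3.9, 4.1], weights, means, variances))
```

For the 39-dimensional case only the list lengths change.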

Gaussian acoustic modeling
  Summary: each phone is represented by a GMM parameterized by
    M mixture weights
    M mean vectors
    M covariance matrices
  Usually assume the covariance matrix is diagonal
  I.e., just keep a separate variance for each cepstral feature

HMMs for speech

HMM for digit recognition task

Training and Decoding
  Training
    Would be easy if phones were observed (maximum likelihood)
    But they are not... and neither are mixture weights
    Use the EM algorithm (Expectation Maximization)
  Decoding
    Basic idea: Viterbi algorithm from last time
    But many little details...

Summary: ASR Architecture
  Five easy pieces: ASR noisy-channel architecture
  1) Feature extraction: 39 MFCC features
  2) Acoustic model: Gaussians for computing p(o | q)
  3) Lexicon/pronunciation model: HMM, what phones can follow each other
  4) Language model: N-grams for computing p(w_i | w_{i-1})
  5) Decoder: Viterbi algorithm, dynamic programming for combining all these to get a word sequence from speech!
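
A minimal sketch of the decoder step: Viterbi over a tiny HMM with Gaussian emissions, in the log domain. The two states, their transition probabilities, the scalar observations, and all function names are hypothetical toy choices; a real decoder searches over subphone states of many words with 39-dimensional GMM emissions and a language model, plus the "many little details" the slides mention.

```python
import math

NEG_INF = float("-inf")


def log_gauss(o, mu, var):
    """Log of a univariate Gaussian density."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (o - mu) ** 2 / var)


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    best = [{s: log_start[s] + log_emit(s, obs[0]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at time t.
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit(s, obs[t])
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy left-to-right model: start in A, may move to B, never move back.
states = ["A", "B"]
log_start = {"A": 0.0, "B": NEG_INF}
log_trans = {
    "A": {"A": math.log(0.5), "B": math.log(0.5)},
    "B": {"A": NEG_INF, "B": 0.0},
}
emit_params = {"A": (0.0, 1.0), "B": (5.0, 1.0)}  # (mean, variance) per state


def log_emit(state, o):
    mu, var = emit_params[state]
    return log_gauss(o, mu, var)


path, score = viterbi([0.1, -0.2, 4.8, 5.1], states, log_start, log_trans, log_emit)
print(path)  # prints ['A', 'A', 'B', 'B']
```

Working with log-probabilities turns the products of transition and emission probabilities into sums, which keeps long utterances from underflowing.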