Speech Processing. Steve Renals. Centre for Speech Technology Research, University of Edinburgh

Motivation. How can machines make sense of and participate in human communication? Recognizing, interpreting, understanding, generating. Underpins richer, human-centred approaches to computing: perceptual computers that can interpret their environment; technological enhancements to human-human communication.

Outline. Topics: speech recognition; speech synthesis. Approach: main concepts; a flavour of the details; current challenges.

Speech technology history

Speech Recognition

Capturing the speech

Acoustic features. Process the speech waveform to obtain a representation that emphasizes those aspects of the speech signal most relevant to ASR. Represent speech as a sequence of centisecond frames: 100 acoustic feature vectors per second. Most frequently used representations: mel frequency cepstral coefficients (MFCCs) and perceptual linear prediction (PLP) cepstral coefficients. Use first and second derivatives to model the local temporal dynamics.
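A minimal sketch of such a front end, assuming the librosa library and a hypothetical file utterance.wav; the 10 ms hop at 16 kHz gives the centisecond frame rate described above:

```python
import numpy as np
import librosa

# Load audio at 16 kHz (a common ASR sampling rate).
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame; 25 ms window, 10 ms hop -> 100 frames per second.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# First and second derivatives capture the local temporal dynamics.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])  # 39 x num_frames
```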

Variability in speech recognition. Speech recognition is difficult due to several sources of variation. Size: number of words in the vocabulary, perplexity. Style: continuous speech or isolated; planned or spontaneous. Speaker characteristics and accent: tuned for a single speaker, or speaker-independent? Acoustic environment: noise, competing speakers, channel conditions (microphone, phone line, ...).

Linguistic Knowledge. One could construct a speech recognizer using linguistic knowledge: acoustic phonetic rules to relate spectrogram representations of sounds to phonemes; base pronunciations of words stored in a dictionary; morphological rules to construct inflected forms; grammatical rules to model syntax; semantic and pragmatic constraints. Very difficult to take account of the variability of spoken language with such approaches.

Machine Learning. Intense effort is needed to derive and encode linguistic rules that cover all the language. Speech has a high degree of variability (speaker, pronunciation, spontaneity, ...). Difficult to write a grammar for spoken language: many people rarely speak grammatically. Data-driven approach: construct simple models of speech which can be learned from large amounts of data (thousands of hours of speech recordings).

Statistical speech recognition. The Fundamental Equation of Speech Recognition, where X is the observed acoustics and W is the word sequence: \( W^* = \arg\max_W P(W \mid X) \). Apply Bayes' theorem, and note that P(X) is identical for all word sequences: \( P(W \mid X) = \frac{P(X \mid W)\,P(W)}{P(X)} \propto P(X \mid W)\,P(W) \), so \( W^* = \arg\max_W P(X \mid W)\,P(W) \).

Statistical speech recognition only offers a statistical guarantee; from the licence conditions of the best known automatic dictation system: LICENSEE UNDERSTANDS THAT SPEECH RECOGNITION IS A STATISTICAL PROCESS AND THAT RECOGNITION ERRORS ARE INHERENT IN THE PROCESS. LICENSEE ACKNOWLEDGES THAT IT IS LICENSEE'S RESPONSIBILITY TO CORRECT RECOGNITION ERRORS BEFORE USING THE RESULTS OF THE RECOGNITION.

Acoustic and language models. Acoustic model: P(X|W), estimated from a corpus of transcribed speech. Language model: P(W), estimated from text. Generative model of acoustics: P(X|W) provides a probability distribution over the space of acoustic feature vectors. What is the generative model?

Hidden Markov models. A probabilistic finite state automaton. [Figure: an HMM with entry state q_s, emitting states q_1, q_2, q_3 and exit state q_e; transition probabilities P(q_1|q_s), P(q_2|q_1), P(q_3|q_2), P(q_e|q_3) and self-loops P(q_1|q_1), P(q_2|q_2), P(q_3|q_3); each emitting state generates acoustic vectors x from an output distribution p(x|q_i). Also shown as a graphical model of dependences between variables: states q(t-1), q(t), q(t+1) and observations x(t-1), x(t), x(t+1); and a surface plot of an example output density p(x_1, x_2).]

Hierarchical model: "Don't Ask". Utterance: DON'T ASK. Word: d oh n t ah s k. Subword (phone). Acoustic model (HMM). [Figure: spectrogram of the utterance, frequency 0-8000 Hz against time 0-1400 ms.]

Hidden Markov models. Generative modelling: a model for each word sequence W generates acoustics X; choose the word sequence that generates X with the highest probability. Assumptions: the state sequence is a (first-order) Markov process; given the current state, the observed acoustic feature vector is conditionally independent of all past and future observations.

HMM assumptions. A state depends only on the previous state. How to encode long-term dependencies between the observations (acoustic feature vectors)? Hidden states integrate information from the past. The current observation depends only on the current hidden state. Thus an HMM has two sets of parameters: state transition probabilities and output probability distributions.

HMM Algorithms. [Figure: a trellis of states i, j, k unrolled over times t-1, t, t+1.] Efficient recursive algorithms: Alignment: the most likely state sequence to have generated the observation sequence. Decoding: the most likely model sequence to have generated the observation sequence. Training: estimate the model parameters using quantities such as the probability of generating an observation sequence up to time t and of being in state i at time t.
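A concrete sketch of the forward recursion, assuming log-domain parameters (transition matrix log_A, initial distribution log_pi) and per-frame output log-likelihoods log_B; the names are this sketch's conventions:

```python
import numpy as np
from scipy.special import logsumexp

def forward(log_A, log_pi, log_B):
    """Total log-likelihood log p(x_1..x_T) of an observation sequence.

    log_A:  (S, S) log transition probabilities, log_A[i, j] = log P(q_j | q_i)
    log_pi: (S,)   log initial state probabilities
    log_B:  (T, S) log output likelihoods, log_B[t, j] = log p(x_t | q_j)
    """
    T, S = log_B.shape
    alpha = log_pi + log_B[0]          # log p(x_1, q_1 = j)
    for t in range(1, T):
        # Sum over predecessor states, in the log domain for stability.
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return logsumexp(alpha)            # sum over final states
```

Replacing the sum over predecessors with a max (plus a backtrace) gives the alignment algorithm; the same per-state, per-time quantities drive forward-backward training.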

The training process. [Diagram: recorded speech is converted to acoustic features, which, together with transcriptions and a lexicon, train the acoustic model; language resources train the language model.]

HMM training. HMMs with millions of parameters are trainable from large amounts of speech data (with no need for time-aligned or phonetic transcriptions). Self-organizing training algorithm: forward-backward (aka Baum-Welch), maximum likelihood estimation (although Bayesian estimation is possible). Estimate the state-time alignment probabilistically and weight parameter updates by these probabilities: the states are hidden variables. Iterative algorithm that is guaranteed to increase the likelihood.

The recognition process. [Diagram: recorded speech is converted to acoustic features; the acoustic model (estimated from training data), lexicon and language model define the search space, from which the decoded text (transcription) is produced.]

Acoustic modelling

Advances in acoustic modelling 1. Gaussian mixture models 2. Context-dependent modelling 3. Discriminative training 4. Speaker adaptation 5. Robustness to challenging acoustic environments

Gaussian mixture models. [Figure: a unimodal Gaussian density and a multimodal Gaussian mixture density.] Gaussians are mathematically convenient, but do not model multiple modes or heavy tails well. A Gaussian mixture model distribution is a weighted combination of Gaussians. Trainable using a straightforward extension of Baum-Welch: mixture components are also hidden variables.
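A brief sketch using scikit-learn (my choice of toolkit): fit a GMM to a set of feature vectors and evaluate per-frame log-likelihoods, as an HMM state output distribution would:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: 1000 acoustic feature vectors of dimension 39.
X = np.random.randn(1000, 39)

# A weighted combination of Gaussians; diagonal covariances are the
# usual choice in ASR, keeping the parameter count manageable.
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(X)

log_lik = gmm.score_samples(X)   # per-frame log p(x) under the mixture
```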

Context-dependent modelling. Model phones dependent on their context: a divide and conquer approach. [Figure: a phonetic decision tree growing from an initial context-independent model, with questions such as L-nasal?, R-liquid?, R-l?, R-m?, L-fricative?] Increase the size of the HMM state space. Share states between models to avoid overfitting. Decision trees infer fine- and broad-class phonetic contexts from data.

Discriminative training. Generative modelling: train the models to reproduce the training data (improve the correct models). Discriminative training: as well as improving the correct models, penalize the incorrect models. Maximize the mutual information between the observations and the word sequence. 1983: outline for discriminative training of HMMs. 1986: MMI training for HMMs using gradient descent. 1996: Extended Baum-Welch algorithm for MMI training. 2000: first successfully applied to large vocabulary ASR.

Other discriminative approaches. Hybrid connectionist/HMM approaches: use a multilayer perceptron or recurrent network to discriminatively estimate HMM output probabilities (scaled likelihoods framework). Conditional random fields, support vector machines, etc.: computationally expensive for large tasks. Discriminative features: framewise posterior probability estimates from a connectionist network; use features derived from the set of Gaussians.

Speaker adaptation. Tune a speaker-independent system to a target speaker. Speaker normalization: adapt the acoustic features of the target to be more like an average speaker (eg vocal tract length normalization). Model-based approaches: adapt the parameters of the speaker-independent model (eg MAP training, maximum likelihood linear regression). Speaker space approaches: estimate multiple sets of acoustic models and interpolate new speakers between these models (eg Eigenvoices, cluster-adaptive training). Speaker adaptation may be supervised or unsupervised.
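As one concrete model-based example, a sketch of MAP adaptation of a single Gaussian mean; the function and its parameter values are this sketch's assumptions. The adapted mean interpolates between the speaker-independent mean and the target speaker's data, weighted by how much adaptation data the state has seen:

```python
import numpy as np

def map_adapt_mean(mu_si, frames, gamma, tau=10.0):
    """MAP update of one Gaussian mean.

    mu_si:  (D,)   speaker-independent (prior) mean
    frames: (T, D) adaptation frames from the target speaker
    gamma:  (T,)   occupancy probabilities of this state per frame
    tau:    prior weight (a hypothetical value)
    """
    occ = gamma.sum()                              # expected frame count
    weighted_sum = (gamma[:, None] * frames).sum(axis=0)
    # Little data: stay near mu_si. Lots of data: approach the
    # maximum likelihood estimate from the target speaker.
    return (tau * mu_si + weighted_sum) / (tau + occ)
```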

Robust speech recognition. Recognize speech in a challenging acoustic environment: background noise, competing speakers, reverberation. Parallel model combination: use models in parallel to account for different parts of the signal. Missing feature theory: identify the reliable parts of the signal. Microphone array approaches: use multiple microphones to construct directional listening in software.

Parallel model combination. [Diagram: a clean speech HMM and a noise HMM are combined into a noisy speech HMM.] Combine a noise model and a speech model to make a noisy speech model. The combined model is the product of the noise and speech models. More than a single-state noise model results in a complex compound model (2D Viterbi search).

Missing feature theory. Assume each location in the time-frequency map is dominated by one of the sources, and attempt to identify reliable regions for the required source.

Microphone arrays. Sound from a source takes different times to reach different mics in an array. Can use delay-and-sum (or more complicated) methods to enhance sound from a particular direction. Tracking and localization of speakers.
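A sketch of delay-and-sum with integer-sample delays, assuming the microphone geometry and source direction are known; np.roll wraps at the signal edges, which a real implementation would avoid:

```python
import numpy as np

def delay_and_sum(signals, mic_pos, direction, sr, c=343.0):
    """Delay-and-sum beamformer.

    signals:   (M, T) array, one row per microphone
    mic_pos:   (M, 3) microphone positions in metres
    direction: (3,)   unit vector from the array towards the source
    """
    # A plane wave from `direction` reaches mics with a larger
    # projection onto that direction first.
    arrival = -(mic_pos @ direction) / c                 # seconds
    lag = np.round((arrival - arrival.min()) * sr).astype(int)
    out = np.zeros(signals.shape[1])
    for sig, s in zip(signals, lag):
        out += np.roll(sig, -s)     # advance late channels to align
    return out / len(signals)       # source adds coherently, noise does not
```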

Linguistic modelling

Modelling pronunciation. The pronunciation model is used to map from a word sequence to a phone sequence (and hence an utterance level HMM). Pronunciation dictionary: a listing of words and their pronunciations. Multiple pronunciations increase the richness of the dictionary, but at a cost of increased flexibility; most current systems average about 1.1 pronunciations/word. The acoustic model itself is also able to absorb pronunciation variation. Embeds a 'beads on a string' view of speech; results in a consistent (not faithful) representation.

Language modelling. The language model is the prior probability of the word sequence, P(W). Use a language model to disambiguate between similar acoustics ('never mind the new display' vs 'never mind the nudist play') when combining linguistic and acoustic evidence. Use hand constructed networks in limited domains. Statistical language models: cover ungrammatical utterances, are computationally efficient, are trainable from huge amounts of data, and can assign a probability to a sentence fragment as well as to a whole sentence.

Finite state network. [Diagram: a word network accepting phrases such as 'one ticket to Edinburgh', 'two tickets to London', 'three ... to Leeds', with 'and' linking conjunctions.]

n-grams. Re-express the word sequence probability with the chain rule: \( P(W) = P(W_1, W_2, \ldots, W_M) = P(W_1) P(W_2 \mid W_1) P(W_3 \mid W_1, W_2) \cdots P(W_M \mid W_1, \ldots, W_{M-1}) \). Assume that the probability of a word depends only on the previous n-1 words (the n-gram assumption); if n=2 this is a bigram: \( P(W) \approx P(W_1) P(W_2 \mid W_1) P(W_3 \mid W_2) \cdots P(W_M \mid W_{M-1}) \). Estimate the probabilities by counting: \( P(W_B \mid W_A) = C(W_A, W_B) / C(W_A) \), the maximum likelihood estimate.
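The counting estimate is a few lines of Python; the <s> and </s> sentence boundary markers are this sketch's convention:

```python
from collections import Counter

def bigram_mle(sentences):
    """P(w_B | w_A) = C(w_A, w_B) / C(w_A), estimated by counting."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens[:-1])                  # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

probs = bigram_mle([["one", "ticket", "to", "Edinburgh"],
                    ["two", "tickets", "to", "London"]])
print(probs[("ticket", "to")])   # 1.0 in this tiny corpus
```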

Bigram network. [Diagram: the word network with arcs weighted by bigram probabilities, eg P(one | start of sentence), P(ticket | one), P(Edinburgh | one), P(end of sentence | Edinburgh).]

The zero probability problem. Estimating n-gram probabilities by counting will fail when n-grams are unseen in the training data, and will be unreliable for rarely encountered n-grams. The zero probability problem: just because something is not observed in training doesn't mean it will never occur. Smoothing: reserve some probability mass for unseen n-grams by discounting counts. Allocate the reserved probability by using simpler models (eg lower order n-grams) by interpolation or backoff.
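A sketch of one such scheme, simple linear interpolation with the unigram model; the weight lam would normally be estimated on held-out data, and is a hypothetical constant here:

```python
def interpolated_bigram(a, b, bigram_p, unigram_p, lam=0.8):
    """P(b | a) ~ lam * P_ML(b | a) + (1 - lam) * P_ML(b).

    bigram_p and unigram_p are dictionaries of maximum likelihood
    estimates (eg from the counting sketch above). Unseen bigrams
    fall back on the unigram, so they no longer get zero probability.
    """
    return lam * bigram_p.get((a, b), 0.0) + (1 - lam) * unigram_p.get(b, 0.0)
```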

Search. Find the most likely model sequence for the observed acoustics. [Diagram: a recognition network in which words such as 'one ticket', 'two tickets', 'three' are expanded into their phone HMMs: w ah n, t uw, th r iy.]

Search algorithms. Viterbi is efficient and exact, but infeasible for large vocabularies and long-span language models (which result in large recognition networks). Search techniques: pruning (do not consider unlikely hypotheses); dynamically compile the network as needed; multipass search (start with simple models, produce word graphs, then progressively refine with more complex models); heuristic search (eg A*).
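A sketch of Viterbi with beam pruning, in the same notation as the forward-algorithm sketch above; the beam width is a hypothetical value:

```python
import numpy as np

def viterbi_beam(log_A, log_pi, log_B, beam=20.0):
    """Most likely state sequence, pruning hypotheses more than
    `beam` below the current best partial score."""
    T, S = log_B.shape
    score = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_A                   # (from, to) scores
        cand[score < score.max() - beam, :] = -np.inf   # prune hypotheses
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_B[t]
    path = [int(score.argmax())]                        # best final state
    for t in range(T - 1, 0, -1):                       # trace back
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())
```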

Discussion

Evaluation. Align the recognizer output to a human transcription and compute a string edit distance in terms of substitutions (S), insertions (I) and deletions (D). Word error rate is obtained by summing the errors over the N reference words: \( \mathrm{WER} = 100 \, \frac{S + D + I}{N} \, \% \). Standardized corpora and experimental protocols (training, development, test sets) have enabled precise comparisons and driven the field forwards. Regular international benchmark evaluations.
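The alignment is a standard dynamic program; a sketch in pure Python, counting S, D and I implicitly through the minimum-cost alignment:

```python
def wer(reference, hypothesis):
    """Word error rate in percent, via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("never mind the new display",
          "never mind the nudist play"))   # two substitutions -> 40.0
```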

State-of-the-art Error rates for speaker-independent systems Dictated business news about 5-10% WER Conversational telephone speech about 15-20% WER Broadcast news about 10-15% WER, much higher for general broadcast speech (drama, etc.) Meeting transcription Close-talking mics 25-30% WER Distant mics (array) - 35-40% WER

Multiparty speech recognition Yeah I know we re talking a voice recognition also because they re not be an order just a shuffle how to locate the remote control if it s lost Mm Uh-huh So i m looking at what you think Yeah i was just a resistor cost is she without that is that good idea we just need to check on the cost of uh Or maybe like a banana suggesting the last thing some devices input and teachings Oh yeah you have the whistle ones yeah Well yeah the results so we can define in chile voice recognition is not feasible we could go for a visit Um incorporating the company logo

Beyond transcription. Rich transcription: automatic extraction of semantic content from speech (named entities, segmentation into dialogue acts or sentences, automatic capitalization and punctuation, summarization). Spoken dialogue systems. Prosodic modelling. Multimodal processing: audio-video speech recognition (lip tracking); person tracking and localization; focus of attention detection.

ASR vs HSR. The performance gap between human and automatic speech recognition is substantial, both in core recognition of clean speech and in dealing with cluttered acoustic environments. Current systems incorporate very shallow linguistic knowledge: non-linear scaling of the frequency axis; spectral warping to take account of vocal tract size; use of phonemes as the basic units of speech!

Speech synthesis

Approaches to speech generation. Articulatory: rules to obtain the articulatory dynamics for a given sequence of phonemes. Formant based: acoustic phonetic rules to obtain the spectrogram for a given sequence of phonemes. Concatenative synthesis: string together a sequence of speech sounds corresponding to the sequence of phonemes, extracted from a large database of speech (eg Festival). Parametric statistical models: use automatically learned models to generate the speech sounds (eg HTS).

Concatenative speech synthesis: "Don't Ask". Utterance: DON'T ASK. Word: d oh n t ah s k. Subword (phone). Units drawn from a speech database containing many variants of each sound (d oh n t ah s k..., k ah s k..., k aa n..., k aa t..., d oh m...). [Figure: spectrogram of the synthesized utterance, frequency 0-8000 Hz against time 0-1400 ms.]

Unit selection. Database of naturally spoken speech, with many variants of each sound (several hours total). For a given sentence to be synthesised, select the unit sequence that fits best: target cost (how close a possible unit is to the ideal unit for that location); join cost (how well it fits with the surrounding units). Solve by dynamic programming search. Can be close to studio quality; further processing (pitch, timing) tends to degrade quality.
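A sketch of that search; target_cost and join_cost are hypothetical placeholder functions standing in for the real acoustic and prosodic measures:

```python
import numpy as np

def select_units(candidates, target_cost, join_cost):
    """Choose one unit per target position, minimising the sum of
    target costs and join costs by dynamic programming.

    candidates: list over target positions; each entry is a list of
                candidate units from the speech database.
    """
    best = [target_cost(0, u) for u in candidates[0]]
    back = []
    for t in range(1, len(candidates)):
        prev, best_t, back_t = best, [], []
        for u in candidates[t]:
            costs = [p + join_cost(v, u)
                     for p, v in zip(prev, candidates[t - 1])]
            k = int(np.argmin(costs))
            back_t.append(k)
            best_t.append(costs[k] + target_cost(t, u))
        best, back = best_t, back + [back_t]
    idx = [int(np.argmin(best))]        # cheapest final unit
    for back_t in reversed(back):       # trace back the sequence
        idx.append(back_t[idx[-1]])
    return list(reversed(idx))
```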

HMM speech synthesis: "Don't Ask". Utterance: DON'T ASK. Word: d oh n t ah s k. Subword (phone). Acoustic model (HMM). [Figure: spectrogram of the synthesized utterance, frequency 0-8000 Hz against time 0-1400 ms.]

Trajectory HMMs. Speech synthesis using HMMs: generate acoustic features from a statistical model. Transforming the HMM parameters enables the synthetic speech to be precisely controlled: speaker adaptation from an average voice; control of intonation and timing. A unified model for recognition and synthesis.

Text-to-speech. Speech synthesis is not just a process of generating speech sounds from a sequence of phonemes: also intonation; timing; speaker specific aspects (accent, voice quality, ...). Linguistic knowledge is required to control the intonation and timing: syllabification; part-of-speech tags (object, content, discount); grammatical information.

Speech synthesis examples: Formant synthesis (OVE 1953); Synthesis by Rule (Holmes, Mattingly, Shearme, 1964); Concatenative synthesis (Bell Labs 1977); Formant synthesis (DECtalk 1983); Diphone synthesis (Festival 1997); Unit selection (Rhetorical 2001); Unit selection (Cereproc 2007); HMM synthesis (HTS 2007); Speaker adapted HMM synthesis (HTS 2007).

Research challenges

Beyond HMMs. HMMs are a weak model of speech that succeed by dividing the space into small regions. Speech is not a simple sequence of discrete units. A flat hidden structure has limited expressiveness. Richer models: increased temporal dependencies; multiple asynchronous streams; hierarchical hidden structure; feature representations with a closer link to audition and articulation.

Dynamic Bayesian network. [Figure: a DBN unrolled over times t-1 and t, with multiple hidden streams (m, v, p, f, s, r) and observation variables y at each time.]

Communication Scene Analysis

Communication scenes. An interdisciplinary problem: signal processing and machine learning (making sense of communication scenes starting from the signals); linguistic and discourse modelling (understanding the content of the recognized signals); moving from qualitative to quantitative models of social dynamics; applications that correspond to the needs and requirements of people.

Current state. Automatic processing of communication scenes in constrained environments: speech recognition from distant microphones; multimodal tracking of people in meeting rooms; automatic segmentation by speaker, dialogue acts, topic, meeting phase; automatic summarization. Integration into systems: indexing; search and browsing of archives; limited online processing.

AMI Meeting Browsers

In conclusion

Final remarks. Several basic models and algorithms underpin speech processing: dynamic programming; finite state models of time; inference of a (simple) hidden state from huge amounts of data. Current systems are rather inflexible regarding domain, and rely on benign acoustic environments. But: given these constraints we have high performing approaches to speech recognition and synthesis.

The end.

Further reading B Gold and N Morgan (2000). Speech and Audio Signal Processing, Wiley. X D Huang, A Acero and H W Hon (2001). Spoken Language Processing: A Guide to Theory, Algorithms and System Development, Prentice Hall. D Jurafsky and J H Martin (2008). Speech and Language Processing, Prentice Hall. F Jelinek (1998). Statistical Methods for Speech Recognition, MIT Press. P Taylor (20??). Text-to-speech synthesis,???.

Software HTK, hidden Markov model toolkit - http://htk.eng.cam.ac.uk SRILM, language modelling toolkit - http://www.speech.sri.com/projects/srilm Festival, text-to-speech synthesis - http://www.cstr.ed.ac.uk/projects/festival HTS, HMM-based speech synthesis system - http://hts.sp.nitech.ac.jp