Speech Processing. Steve Renals. Centre for Speech Technology Research, University of Edinburgh



5 Motivation
How can machines make sense of and participate in human communication?
- recognizing, interpreting, understanding, generating
Underpins richer, human-centred approaches to computing:
- perceptual computers that can interpret their environment
- technological enhancements to human-human communication

8 Outline
Topics:
- Speech recognition
- Speech synthesis
Approach:
- Main concepts
- A flavour of the details
- Current challenges

9 Speech technology history


15 Speech Recognition

16 Capturing the speech


19 Acoustic features
- Process the speech waveform to obtain a representation that emphasizes those aspects of the speech signal most relevant to ASR
- Represent speech as a sequence of centisecond frames, i.e. about 100 acoustic feature vectors per second
- Most frequently used representations: mel frequency cepstral coefficients (MFCCs) and perceptual linear prediction (PLP) cepstral coefficients
- Use first and second derivatives to model the local temporal dynamics
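The derivative step can be sketched as follows. This is a minimal illustration, assuming a simple central difference with the edge frames repeated; practical front ends usually fit a regression over a longer window. The function names `deltas` and `add_dynamics` are hypothetical.

```python
def deltas(frames):
    """First-order differences of a sequence of feature vectors.

    Assumption: a central difference (x[t+1] - x[t-1]) / 2, repeating the
    first and last frames at the edges. This is an illustrative choice, not
    the regression formula of any particular toolkit.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamics(frames):
    """Append first and second derivatives to each static feature vector."""
    d1 = deltas(frames)
    d2 = deltas(d1)  # second derivative = derivative of the first derivative
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

With 13 static coefficients per frame, `add_dynamics` would return 39-dimensional vectors, one per centisecond frame.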

25 Variability in speech recognition
Speech recognition is difficult due to several sources of variation:
- Size: number of words in the vocabulary, perplexity
- Style: continuous or isolated speech; planned or spontaneous
- Speaker characteristics and accent: tuned for a single speaker, or speaker-independent?
- Acoustic environment: noise, competing speakers, channel conditions (microphone, phone line, ...)

26 Linguistic Knowledge
One could construct a speech recognizer using linguistic knowledge:
- Acoustic phonetic rules to relate spectrogram representations of sounds to phonemes
- Base pronunciations of words stored in a dictionary
- Morphological rules to construct inflected forms
- Grammatical rules to model syntax
- Semantic and pragmatic constraints
Very difficult to take account of the variability of spoken language with such approaches

27 Machine Learning
- Intense effort needed to derive and encode linguistic rules that cover all the language
- Speech has a high degree of variability (speaker, pronunciation, spontaneity, ...)
- Difficult to write a grammar for spoken language: many people rarely speak grammatically
- Data-driven approach: construct simple models of speech which can be learned from large amounts of data (thousands of hours of speech recordings)

31 Statistical speech recognition
The Fundamental Equation of Speech Recognition, where X is the observed acoustics and W is the word sequence:
$$W^* = \arg\max_W P(W \mid X)$$
Apply Bayes' theorem; since X (and hence P(X)) is identical for all word sequences:
$$P(W \mid X) = \frac{P(X \mid W)\,P(W)}{P(X)} \propto P(X \mid W)\,P(W)$$
$$W^* = \arg\max_W P(X \mid W)\,P(W)$$
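As a toy illustration of the decision rule, the sketch below picks the word sequence with the largest product of acoustic and language model probabilities, working in the log domain. The two candidate sentences and all four probability values are invented for the example, and `decode` is a hypothetical helper, not part of any toolkit.

```python
import math


def decode(hypotheses):
    """Return the word sequence W maximising log P(X|W) + log P(W).

    hypotheses maps each candidate word sequence to a pair
    (log P(X|W), log P(W)).
    """
    return max(hypotheses, key=lambda w: hypotheses[w][0] + hypotheses[w][1])


# Made-up scores: the acoustics slightly prefer the second candidate,
# but the language model prior strongly prefers the first.
candidates = {
    "never mind the new display": (math.log(0.010), math.log(0.0020)),
    "never mind the nudist play": (math.log(0.012), math.log(0.0001)),
}
best = decode(candidates)
```

P(X) never appears because it is the same for every candidate and so cannot change the arg max.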

34 Statistical speech recognition
Statistical speech recognition only offers a statistical guarantee. From the licence conditions of the best known automatic dictation system:
"LICENSEE UNDERSTANDS THAT SPEECH RECOGNITION IS A STATISTICAL PROCESS AND THAT RECOGNITION ERRORS ARE INHERENT IN THE PROCESS. LICENSEE ACKNOWLEDGES THAT IT IS LICENSEE'S RESPONSIBILITY TO CORRECT RECOGNITION ERRORS BEFORE USING THE RESULTS OF THE RECOGNITION."

39 Acoustic and language models
- Acoustic model: P(X | W), estimated from a corpus of transcribed speech
- Language model: P(W), estimated from text
- Generative model of acoustics: P(X | W) provides a probability distribution over the space of acoustic feature vectors
- What is the generative model?

44 Hidden Markov models
[Figure: probabilistic finite state automaton. A left-to-right HMM with entry state q_s, emitting states q_1, q_2, q_3 and exit state q_e; self-transitions P(q_1 | q_1), P(q_2 | q_2), P(q_3 | q_3); forward transitions P(q_1 | q_s), P(q_2 | q_1), P(q_3 | q_2), P(q_e | q_3); output densities p(x | q_1), p(x | q_2), p(x | q_3).]
[Figure: graphical model showing the dependences between the variables q(t-1), q(t), q(t+1) and x(t-1), x(t), x(t+1).]
[Figure: surface plot of p(x_1, x_2).]

47 Hierarchical model
[Figure: hierarchy from utterance to acoustics]
- Utterance: "Don't Ask"
- Word: DON'T ASK
- Subword (phone): d oh n t ah s k
- Acoustic model (HMM)
- Speech acoustics: spectrogram, frequency (Hz) against time (ms)

50 Hidden Markov models
Generative modelling:
- a model for each word sequence W that generates acoustics X
- choose the word sequence that generates X with the highest probability
Assumptions:
- state sequence is a (first-order) Markov process
- given the current state, the observed acoustic feature vector is conditionally independent of all past and future observations

51 HMM assumptions
- A state depends only on the previous state
- How to encode long-term dependences between the observations (acoustic feature vectors)? Hidden states integrate information from the past
- The current observation depends only on the current hidden state
- Thus an HMM has two sets of parameters:
  - state transition probabilities
  - output probability distribution

54 HMM Algorithms
[Figure: trellis of states i, j, k against times t-1, t, t+1]
Efficient recursive algorithms:
- Alignment: most likely state sequence to have generated the observation sequence
- Decoding: most likely model sequence to have generated the observation sequence
- Training: estimate the model parameters using quantities such as the probability of generating an observation sequence to time t and of being in state i at time t
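The alignment recursion can be sketched as a Viterbi pass over the trellis. This is a minimal version under stated assumptions: all scores are log probabilities, the per-frame output log likelihoods are given as a table rather than computed from a density, and the function name and argument layout are hypothetical.

```python
import math

NEG_INF = float("-inf")


def viterbi(obs_loglik, log_trans, log_init):
    """Most likely state sequence for one observation sequence.

    obs_loglik[t][j]: log p(x_t | state j)
    log_trans[i][j]:  log P(state j at t | state i at t-1)
    log_init[j]:      log P(state j at the first frame)
    Returns (state path, log probability of that path).
    """
    n_states = len(log_init)
    score = [log_init[j] + obs_loglik[0][j] for j in range(n_states)]
    backptr = []
    for t in range(1, len(obs_loglik)):
        new_score, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at time t.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + obs_loglik[t][j])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, max(score)
```

Decoding uses the same recursion over a larger network in which word and phone models are chained together.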

55 The training process
[Figure: block diagram of training, with components: recorded speech, acoustic features, acoustic model; language resources (transcriptions, lexicon); language model]

56 HMM training
- HMMs with millions of parameters are trainable from large amounts of speech data (with no need for time-aligned or phonetic transcriptions)
- Self-organizing training algorithm: forward-backward (aka Baum-Welch), giving maximum likelihood estimation (although Bayesian estimation is possible)
- Estimate the state-time alignment probabilistically and weight parameter updates by these probabilities: the states are hidden variables
- Iterative algorithm that is guaranteed to increase the likelihood (or leave it unchanged at convergence)
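The forward half of forward-backward can be sketched as below. It computes the probability of generating the observations up to time t and being in state i at time t, which is one of the quantities the parameter updates are built from. This is a sketch only: it works with raw probabilities for readability, whereas practical implementations scale or use logs to avoid underflow, and the argument layout is an assumption.

```python
def forward(obs_lik, trans, init):
    """Forward probabilities alpha[t][i] = p(x_1..x_t, state at t = i).

    obs_lik[t][i]: p(x_t | state i)
    trans[i][j]:   P(state j at t | state i at t-1)
    init[i]:       P(state i at the first frame)
    """
    n = len(init)
    alpha = [[init[i] * obs_lik[0][i] for i in range(n)]]
    for t in range(1, len(obs_lik)):
        prev = alpha[-1]
        alpha.append([
            # Sum over all predecessors, then emit the current observation.
            sum(prev[i] * trans[i][j] for i in range(n)) * obs_lik[t][j]
            for j in range(n)
        ])
    return alpha


def likelihood(obs_lik, trans, init):
    """Total probability of the observation sequence under the model."""
    return sum(forward(obs_lik, trans, init)[-1])
```

Replacing the sum over predecessors with a max gives the Viterbi recursion used for alignment.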

57 The recognition process
[Figure: block diagram of recognition, with components: recorded speech, acoustic features, search space, decoded text (transcription); acoustic model, lexicon and language model derived from training data]

58 Acoustic modelling

59 Advances in acoustic modelling
1. Gaussian mixture models
2. Context-dependent modelling
3. Discriminative training
4. Speaker adaptation
5. Robustness to challenging acoustic environments

60 Gaussian mixture models
- Gaussians are mathematically convenient, but do not model multiple modes or heavy tails well
- Gaussian mixture model: distribution is a weighted combination of Gaussians
- Trainable using a straightforward extension of Baum-Welch: mixture components are also hidden variables
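A one-dimensional sketch of the mixture density shows how a weighted combination of Gaussians can represent more than one mode. Real acoustic models use multivariate Gaussians over the feature vector; the scalar case and the function names here are simplifying assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Mixture density: weighted sum of component Gaussians.

    The weights are assumed to be non-negative and to sum to one.
    """
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

For example, equal weights on components centred at -2 and 2 give a density with two peaks and a dip at 0, which a single Gaussian cannot represent.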

61 Context-dependent modelling
- Model phones dependent on their context: divide and conquer approach
- Increase size of the HMM state space
- Share states between models to avoid overfitting
- Decision trees to infer fine- and broad-class phonetic contexts from data
[Figure: decision tree grown from an initial context-independent model, with yes/no questions such as L-nasal?, R-l?, R-liquid?, R-m?, L-fricative?]

62 Discriminative training
- Generative modelling: train the models to reproduce the training data (improve the correct models)
- Discriminative training: as well as improving the correct models, penalize the incorrect models
- Maximize the mutual information (MMI) between the observations and the word sequence:
  - outline for discriminative training of HMMs
  - MMI training for HMMs using gradient descent
  - Extended Baum-Welch algorithm for MMI training
  - First successfully applied to large vocab ASR

63 Other discriminative approaches
- Hybrid connectionist/HMM approaches: use a multilayer perceptron or recurrent network to discriminatively estimate HMM output probabilities (scaled likelihoods framework)
- Conditional random fields, support vector machines, etc.: computationally expensive for large tasks
- Discriminative features:
  - framewise posterior probability estimates from a connectionist network
  - use features derived from the set of Gaussians

64 Speaker adaptation
Tune a speaker-independent system to a target speaker:
- Speaker normalization: adapt the acoustic features of the target to be more like an average speaker (eg: vocal tract length normalization)
- Model-based approaches: adapt the parameters of the speaker-independent model (eg: MAP training, maximum likelihood linear regression)
- Speaker space approaches: estimate multiple sets of acoustic models and interpolate new speakers between these models (eg: Eigenvoices, cluster-adaptive training)
Speaker adaptation may be supervised or unsupervised

65 Robust speech recognition
Recognize speech in a challenging acoustic environment: background noise, competing speakers, reverberation
- Parallel model combination: use models in parallel to account for different parts of the signal
- Missing feature theory: identify the reliable parts of the signal
- Microphone array approaches: use multiple microphones to construct directional listening in software

66 Parallel model combination
- Combine a noise model and a speech model to make a noisy speech model
- Combined model is product of noise and speech models
- A noise model with more than a single state results in a complex compound model (2D Viterbi search)
[Figure: clean speech HMM and noise HMM passed through model combination to give a noisy speech HMM]

67 Missing feature theory Assume each location in time-frequency map is dominated by one of the sources, and attempt to identify reliable regions for the required source

72 Microphone arrays
- Sound from a source takes different times to reach different mics in an array
- Can use delay-and-sum (or more complicated) methods to enhance sound from a particular direction
- Tracking and localization of speakers
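Delay-and-sum can be sketched in a few lines. The version below assumes the per-microphone delays are already known and are whole numbers of samples; estimating the delays, and handling fractional delays, is the harder part in practice and is not shown. The function name is hypothetical.

```python
def delay_and_sum(channels, delays):
    """Time-align each microphone channel and average them.

    channels: list of equal-length sample lists, one per microphone
    delays:   delays[m] is how many samples later the wanted source arrives
              at microphone m than at the reference microphone
              (assumed known and integer-valued for this sketch)
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        total = 0.0
        for signal, d in zip(channels, delays):
            idx = t + d  # look ahead by the delay to undo it
            if 0 <= idx < n:
                total += signal[idx]
        out.append(total / len(channels))
    return out
```

Sound from the steered direction adds coherently after alignment, while sound from other directions is misaligned and partly averages out.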

73 Linguistic modelling

74 Modelling pronunciation
- Pronunciation model is used to map from a word sequence to a phone sequence (and hence an utterance-level HMM)
- Pronunciation dictionary: listing of words and their pronunciations
- Multiple pronunciations increase the richness of the dictionary but at a cost of increased flexibility; most current systems average about 1.1 prons/word
- The acoustic model itself is also able to absorb pronunciation variation
- Embeds a "beads on a string" view of speech: results in a consistent (not faithful) representation

79 Language modelling
- The language model is the prior probability of the word sequence, P(W)
- Use a language model to disambiguate between similar acoustics when combining linguistic and acoustic evidence: "never mind the new display" vs "never mind the nudist play"
- Use hand-constructed networks in limited domains
- Statistical language models: cover ungrammatical utterances, computationally efficient, trainable from huge amounts of data, can assign a probability to a sentence fragment as well as a whole sentence

81 Finite state network
[Figure: hand-constructed word network for ticket requests, built from the words: one, two, three; ticket, tickets; to; Edinburgh, London, Leeds; and]

82 n-grams
The probability of a word sequence:
$$P(W) = P(W_1, W_2, \ldots, W_{M-1}, W_M)$$
Re-express using the chain rule:
$$P(W) = P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1, W_2) \cdots P(W_M \mid W_1, W_2, \ldots, W_{M-1})$$
Assume that the probability of a word depends only on the previous n-1 words (n-gram assumption); if n=2 this is a bigram:
$$P(W) \approx P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_2) \cdots P(W_M \mid W_{M-1})$$
Estimate the probabilities by counting (maximum likelihood estimate):
$$P(W_B \mid W_A) = \frac{C(W_A, W_B)}{C(W_A)}$$
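The counting estimate can be sketched directly. The sentence boundary markers `<s>` and `</s>` and the function names are assumptions made for the example, and the three training sentences are invented.

```python
from collections import Counter


def train_bigram(sentences):
    """Collect the counts C(W_A) and C(W_A, W_B) from tokenised sentences."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent pairs
    return contexts, bigrams


def bigram_prob(contexts, bigrams, prev, word):
    """Maximum likelihood estimate P(word | prev) = C(prev, word) / C(prev)."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


corpus = [
    ["one", "ticket", "to", "Edinburgh"],
    ["two", "tickets", "to", "London"],
    ["one", "ticket", "to", "Leeds"],
]
contexts, bigrams = train_bigram(corpus)
```

Any pair that never occurs in the training text, such as ("one", "London") here, receives probability zero under this estimate.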

83 Bigram network
[Figure: bigram network over the words one, ticket, Edinburgh, with arcs labelled P(one | start of sentence), P(ticket | one), P(Edinburgh | one), P(end of sentence | Edinburgh)]

84 The zero probability problem
- Estimating n-gram probabilities by counting will fail when n-grams are unseen in the training data, and will be unreliable for rarely encountered n-grams
- The zero probability problem: just because something is not observed in training doesn't mean it will never occur
- Smoothing: reserve some probability mass for unseen n-grams by discounting counts
- Allocate the reserved probability by using simpler models (eg lower order n-grams), by interpolation or backoff
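One simple way to give unseen bigrams non-zero probability is linear interpolation with the unigram estimate, sketched below. This is a deliberately basic scheme with a fixed, hand-picked weight `lam`; it is not a discounting method such as those used in production language modelling toolkits, and the names and the toy corpus are assumptions.

```python
from collections import Counter


def interpolated_bigram(sentences, lam=0.7):
    """Return prob(prev, word) = lam * P_ML(word | prev) + (1 - lam) * P_ML(word).

    lam is a fixed interpolation weight chosen by hand for this sketch.
    """
    unigrams, contexts, bigrams = Counter(), Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens[1:])                   # predicted tokens
        contexts.update(tokens[:-1])                  # conditioning tokens
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    total = sum(unigrams.values())

    def prob(prev, word):
        p_uni = unigrams[word] / total
        if contexts[prev] == 0:
            return p_uni  # unseen context: fall back to the unigram alone
        p_bi = bigrams[(prev, word)] / contexts[prev]
        return lam * p_bi + (1.0 - lam) * p_uni

    return prob


corpus = [
    ["one", "ticket", "to", "Edinburgh"],
    ["two", "tickets", "to", "London"],
    ["one", "ticket", "to", "Leeds"],
]
prob = interpolated_bigram(corpus)
```

The unseen pair ("one", "London") now gets a small positive probability, while words that never appear in training at all still get zero; handling those needs an explicit unknown-word model.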

85 Search
Find the most likely model sequence for the observed acoustics
[Figure: recognition network in which the words one, two, three, ticket, tickets are expanded into phone sequences such as w ah n, t uw, th r iy]

86 Search algorithms
- Viterbi is efficient and exact, but infeasible for large vocabularies and long-span language models (which result in large recognition networks)
- Search techniques:
  - pruning: do not consider unlikely hypotheses
  - dynamically compile the network as needed
  - multipass search: start with simple models, produce word graphs, then progressively refine with more complex models
  - heuristic search (eg A*)

87 Discussion

88 Evaluation
- Align the recognizer output to a human transcription and compute a string edit distance in terms of substitutions (S), insertions (I) and deletions (D)
- Word error rate is obtained by summing the errors and normalizing by N, the number of words in the reference transcription:
$$\mathrm{WER} = 100 \times \frac{S + D + I}{N}\,\%$$
- Standardized corpora and experimental protocols (training, development, test sets) have enabled precise comparisons and driven the field forwards
- Regular international benchmark evaluations
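The alignment and error count can be sketched with a standard edit distance computed by dynamic programming. This minimal version splits on whitespace, gives every error type a cost of one, and assumes a non-empty reference; scoring tools used in benchmark evaluations also normalise the text first, which is not shown.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: 100 * (S + D + I) / N for whitespace-tokenised strings.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[-1][-1] / len(ref)
```

Because insertions are counted, the result can exceed 100% when the hypothesis is much longer than the reference.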

89 State-of-the-art
Error rates for speaker-independent systems:
- Dictated business news: about 5-10% WER
- Conversational telephone speech: about 15-20% WER
- Broadcast news: about 10-15% WER, much higher for general broadcast speech (drama, etc.)
- Meeting transcription:
  - Close-talking mics: 25-30% WER
  - Distant mics (array): % WER

90 Multiparty speech recognition Yeah I know we re talking a voice recognition also because they re not be an order just a shuffle how to locate the remote control if it s lost Mm Uh-huh So i m looking at what you think Yeah i was just a resistor cost is she without that is that good idea we just need to check on the cost of uh Or maybe like a banana suggesting the last thing some devices input and teachings Oh yeah you have the whistle ones yeah Well yeah the results so we can define in chile voice recognition is not feasible we could go for a visit Um incorporating the company logo

91 Beyond transcription
- Rich transcription: automatic extraction of semantic content from speech (named entities, segmentation into dialogue acts or sentences, automatic capitalization and punctuation, summarization)
- Spoken dialogue systems
- Prosodic modelling
- Multimodal processing:
  - audio-video speech recognition (lip tracking)
  - person tracking and localization
  - focus of attention detection

92 ASR vs HSR
- Performance gap between human and automatic speech recognition is substantial, both in core recognition of clean speech and in dealing with cluttered acoustic environments
- Current systems incorporate very shallow linguistic knowledge:
  - non-linear scaling of the frequency axis
  - spectral warping to take account of vocal tract size
  - use of phonemes as basic units of speech!

93 Speech synthesis

98 Approaches to speech generation
- Articulatory: rules to obtain the articulatory dynamics for a given sequence of phonemes
- Formant based: acoustic phonetic rules to obtain the spectrogram for a given sequence of phonemes
- Concatenative synthesis: string together a sequence of speech sounds corresponding to the sequence of phonemes, extracted from a large database of speech (eg Festival)
- Parametric statistical models: use automatically learned models to generate the speech sounds (eg HTS)

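99 Concatenative speech synthesis
[Figure: hierarchy from utterance to acoustics, with units drawn from a speech database]
- Utterance: "Don't Ask"
- Word: DON'T ASK
- Subword (phone): d oh n t ah s k
- Speech database: stored phone sequences (d oh n t ah s k ... k ah s k, k aa n, k aa t, d oh m)
- Speech acoustics: spectrogram, frequency (Hz) against time (ms)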

100 Unit selection
- Database of naturally spoken speech
- Many variants of each sound (several hours total)
- For a given sentence to be synthesised, select the unit sequence that fits best:
  - target cost: how close a possible unit is to the ideal unit for that location
  - join cost: how well does it fit with surrounding units
- Solve by dynamic programming search
- Can be close to studio quality; further processing (pitch, timing) tends to degrade quality
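The dynamic programming search over target and join costs can be sketched as follows. The cost functions, the pitch-based toy units and all names are assumptions for illustration; real systems use much richer acoustic and linguistic costs.

```python
def select_units(candidates, target_cost, join_cost):
    """Cheapest unit sequence under summed target and join costs.

    candidates[p] lists the database units available for position p.
    Units must be hashable and unique within a position.
    Returns (total cost, list of chosen units).
    """
    # best[u] = (cost of the cheapest sequence ending in unit u, that sequence)
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for pos in range(1, len(candidates)):
        new_best = {}
        for u in candidates[pos]:
            prev = min(best, key=lambda v: best[v][0] + join_cost(v, u))
            cost = best[prev][0] + join_cost(prev, u) + target_cost(pos, u)
            new_best[u] = (cost, best[prev][1] + [u])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


# Toy example: each unit is (phone, pitch in Hz); all numbers are invented.
targets = [120, 120, 120]
candidates = [
    [("d", 100), ("d", 130)],
    [("oh", 105), ("oh", 140)],
    [("n", 110), ("n", 150)],
]
cost, units = select_units(
    candidates,
    target_cost=lambda pos, u: abs(u[1] - targets[pos]),  # distance from ideal pitch
    join_cost=lambda a, b: abs(a[1] - b[1]),              # pitch jump at the join
)
```

In the toy example the first unit chosen is not the one closest to its target, because a smoother sequence of joins outweighs the extra target cost.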

101 HMM speech synthesis
[Figure: hierarchy from utterance to acoustics]
- Utterance: "Don't Ask"
- Word: DON'T ASK
- Subword (phone): d oh n t ah s k
- Acoustic model (HMM)
- Speech acoustics: spectrogram, frequency (Hz) against time (ms)

102 Trajectory HMMs
- Speech synthesis using HMMs: generate acoustic features from a statistical model
- Transforming the HMM parameters enables the synthetic speech to be precisely controlled:
  - speaker adaptation from an average voice
  - control of intonation and timing
- Unified model for recognition and synthesis

103 Text-to-speech
Speech synthesis is not just a process of generating speech sounds from a sequence of phonemes:
- Intonation
- Timing
- Speaker-specific aspects: accent, voice quality, ...
Linguistic knowledge is required to control the intonation and timing:
- syllabification
- part-of-speech tags: object, content, discount
- grammatical information

126 Speech synthesis examples
- Formant synthesis (OVE 1953)
- Synthesis by Rule (Holmes, Mattingly, Shearme, 1964)
- Concatenative synthesis (Bell Labs 1977)
- Formant synthesis (DECtalk 1983)
- Diphone synthesis (Festival 1997)
- Unit selection (Rhetorical 2001)
- Unit selection (CereProc 2007)
- HMM synthesis (HTS 2007)
- Speaker adapted HMM synthesis (HTS 2007)

127 Research challenges

128 Beyond HMMs
- HMMs are a weak model of speech that succeed by dividing the space into small regions
- Speech is not a simple sequence of discrete units
- A flat hidden structure has limited expressiveness
- Richer models:
  - increased temporal dependencies
  - multiple asynchronous streams
  - hierarchical hidden structure
  - feature representations with a closer link to audition and articulation

129 Dynamic Bayesian network
[Figure: dynamic Bayesian network over time slices t-1 and t, with hidden variables m, v, p, f, s, r and observations y in each slice]

130 Communication Scene Analysis


132 Communication scenes
Interdisciplinary problem:
- signal processing and machine learning: making sense of communication scenes starting from the signals
- linguistic and discourse modelling: understanding the content of the recognized signals
- moving from qualitative to quantitative models of social dynamics
- applications that correspond to the needs and requirements of people

133 Current state
Automatic processing of communication scenes in constrained environments:
- speech recognition from distant microphones
- multimodal tracking of people in meeting rooms
- automatic segmentation by speaker, dialogue acts, topic, meeting phase
- automatic summarization
Integration into systems:
- indexing
- search, browsing of archives
- limited online processing

134 AMI Meeting Browsers


142 In conclusion

143 Final remarks
- Several basic models and algorithms underpin speech processing:
  - dynamic programming
  - finite state models of time
  - inference of a (simple) hidden state from huge amounts of data
- Current systems are rather inflexible regarding domain and rely on benign acoustic environments
- But: given these constraints we have high performing approaches to speech recognition and synthesis

144 The end.

145 Further reading
- B Gold and N Morgan (2000). Speech and Audio Signal Processing, Wiley.
- X D Huang, A Acero and H W Hon (2001). Spoken Language Processing: A Guide to Theory, Algorithms and System Development, Prentice Hall.
- D Jurafsky and J H Martin (2008). Speech and Language Processing, Prentice Hall.
- F Jelinek (1998). Statistical Methods for Speech Recognition, MIT Press.
- P Taylor (20??). Text-to-speech synthesis, ???.

146 Software
- HTK, hidden Markov model toolkit
- SRILM, language modelling toolkit
- Festival, text-to-speech synthesis
- HTS, HMM-based speech synthesis system
