Speech Processing. Steve Renals. Centre for Speech Technology Research, University of Edinburgh

Motivation. How can machines make sense of and participate in human communication? Recognizing, interpreting, understanding, generating. Underpins richer, human-centred approaches to computing: perceptual computers that can interpret their environment; technological enhancements to human-human communication.

Outline. Topics: speech recognition; speech synthesis. Approach: main concepts; a flavour of the details; current challenges.

Speech technology history

Speech Recognition

Capturing the speech

Acoustic features. Process the speech waveform to obtain a representation that emphasizes those aspects of the speech signal most relevant to ASR. Represent speech as a sequence of centisecond frames: 100 acoustic feature vectors per second. Most frequently used representations: mel frequency cepstral coefficients (MFCCs) and perceptual linear prediction (PLP) cepstral coefficients. Use first and second derivatives to model the local temporal dynamics.
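A minimal sketch of such a front end, assuming the librosa library and a hypothetical file utterance.wav; the 10 ms hop at 16 kHz gives the centisecond frame rate described above:

```python
import numpy as np
import librosa

# Load audio at 16 kHz (a common ASR sampling rate).
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame; 25 ms window, 10 ms hop -> 100 frames per second.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# First and second derivatives capture the local temporal dynamics.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])  # 39 x num_frames
```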

Variability in speech recognition. Speech recognition is difficult due to several sources of variation. Size: number of words in the vocabulary, perplexity. Style: continuous speech or isolated; planned or spontaneous. Speaker characteristics and accent: tuned for a single speaker, or speaker-independent? Acoustic environment: noise, competing speakers, channel conditions (microphone, phone line, ...).

Linguistic Knowledge. One could construct a speech recognizer using linguistic knowledge: acoustic phonetic rules to relate spectrogram representations of sounds to phonemes; base pronunciations of words stored in a dictionary; morphological rules to construct inflected forms; grammatical rules to model syntax; semantic and pragmatic constraints. Very difficult to take account of the variability of spoken language with such approaches.

Machine Learning. Intense effort is needed to derive and encode linguistic rules that cover all the language. Speech has a high degree of variability (speaker, pronunciation, spontaneity, ...). Difficult to write a grammar for spoken language: many people rarely speak grammatically. Data-driven approach: construct simple models of speech which can be learned from large amounts of data (thousands of hours of speech recordings).

Statistical speech recognition. The Fundamental Equation of Speech Recognition, where X is the observed acoustics and W is the word sequence: \( W^* = \arg\max_W P(W \mid X) \). Apply Bayes' theorem, and note that P(X) is identical for all word sequences: \( P(W \mid X) = \frac{P(X \mid W)\,P(W)}{P(X)} \propto P(X \mid W)\,P(W) \), so \( W^* = \arg\max_W P(X \mid W)\,P(W) \).

Statistical speech recognition only offers a statistical guarantee; from the licence conditions of the best known automatic dictation system: LICENSEE UNDERSTANDS THAT SPEECH RECOGNITION IS A STATISTICAL PROCESS AND THAT RECOGNITION ERRORS ARE INHERENT IN THE PROCESS. LICENSEE ACKNOWLEDGES THAT IT IS LICENSEE'S RESPONSIBILITY TO CORRECT RECOGNITION ERRORS BEFORE USING THE RESULTS OF THE RECOGNITION.

Acoustic and language models. Acoustic model: P(X|W), estimated from a corpus of transcribed speech. Language model: P(W), estimated from text. Generative model of acoustics: P(X|W) provides a probability distribution over the space of acoustic feature vectors. What is the generative model?

Hidden Markov models. A probabilistic finite state automaton. [Figure: an HMM with entry state q_s, emitting states q_1, q_2, q_3 and exit state q_e; transition probabilities P(q_1|q_s), P(q_2|q_1), P(q_3|q_2), P(q_e|q_3) and self-loops P(q_1|q_1), P(q_2|q_2), P(q_3|q_3); each emitting state generates acoustic vectors x from an output distribution p(x|q_i). Also shown as a graphical model of dependences between variables: states q(t-1), q(t), q(t+1) and observations x(t-1), x(t), x(t+1); and a surface plot of an example output density p(x_1, x_2).]

Hierarchical model: "Don't Ask". Utterance: DON'T ASK. Word: d oh n t ah s k. Subword (phone). Acoustic model (HMM). [Figure: spectrogram of the utterance, frequency 0-8000 Hz against time 0-1400 ms.]

Hidden Markov models. Generative modelling: a model for each word sequence W generates acoustics X; choose the word sequence that generates X with the highest probability. Assumptions: the state sequence is a (first-order) Markov process; given the current state, the observed acoustic feature vector is conditionally independent of all past and future observations.

HMM assumptions. A state depends only on the previous state. How to encode long-term dependencies between the observations (acoustic feature vectors)? Hidden states integrate information from the past. The current observation depends only on the current hidden state. Thus an HMM has two sets of parameters: state transition probabilities and output probability distributions.

HMM Algorithms. [Figure: a trellis of states i, j, k unrolled over times t-1, t, t+1.] Efficient recursive algorithms: Alignment: the most likely state sequence to have generated the observation sequence. Decoding: the most likely model sequence to have generated the observation sequence. Training: estimate the model parameters using quantities such as the probability of generating an observation sequence up to time t and of being in state i at time t.
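A concrete sketch of the forward recursion, assuming log-domain parameters (transition matrix log_A, initial distribution log_pi) and per-frame output log-likelihoods log_B; the names are this sketch's conventions:

```python
import numpy as np
from scipy.special import logsumexp

def forward(log_A, log_pi, log_B):
    """Total log-likelihood log p(x_1..x_T) of an observation sequence.

    log_A:  (S, S) log transition probabilities, log_A[i, j] = log P(q_j | q_i)
    log_pi: (S,)   log initial state probabilities
    log_B:  (T, S) log output likelihoods, log_B[t, j] = log p(x_t | q_j)
    """
    T, S = log_B.shape
    alpha = log_pi + log_B[0]          # log p(x_1, q_1 = j)
    for t in range(1, T):
        # Sum over predecessor states, in the log domain for stability.
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[t]
    return logsumexp(alpha)            # sum over final states
```

Replacing the sum over predecessors with a max (plus a backtrace) gives the alignment algorithm; the same per-state, per-time quantities drive forward-backward training.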

The training process. [Diagram: recorded speech is converted to acoustic features, which, together with transcriptions and a lexicon, train the acoustic model; language resources train the language model.]

HMM training. HMMs with millions of parameters are trainable from large amounts of speech data (with no need for time-aligned or phonetic transcriptions). Self-organizing training algorithm: forward-backward (aka Baum-Welch), maximum likelihood estimation (although Bayesian estimation is possible). Estimate the state-time alignment probabilistically and weight parameter updates by these probabilities: the states are hidden variables. Iterative algorithm that is guaranteed to increase the likelihood.

The recognition process. [Diagram: recorded speech is converted to acoustic features; the acoustic model (estimated from training data), lexicon and language model define the search space, from which the decoded text (transcription) is produced.]

Acoustic modelling

Advances in acoustic modelling 1. Gaussian mixture models 2. Context-dependent modelling 3. Discriminative training 4. Speaker adaptation 5. Robustness to challenging acoustic environments

Gaussian mixture models. [Figure: a unimodal Gaussian density and a multimodal Gaussian mixture density.] Gaussians are mathematically convenient, but do not model multiple modes or heavy tails well. A Gaussian mixture model distribution is a weighted combination of Gaussians. Trainable using a straightforward extension of Baum-Welch: mixture components are also hidden variables.
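A brief sketch using scikit-learn (my choice of toolkit): fit a GMM to a set of feature vectors and evaluate per-frame log-likelihoods, as an HMM state output distribution would:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: 1000 acoustic feature vectors of dimension 39.
X = np.random.randn(1000, 39)

# A weighted combination of Gaussians; diagonal covariances are the
# usual choice in ASR, keeping the parameter count manageable.
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(X)

log_lik = gmm.score_samples(X)   # per-frame log p(x) under the mixture
```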

Context-dependent modelling. Model phones dependent on their context: a divide and conquer approach. [Figure: a phonetic decision tree growing from an initial context-independent model, with questions such as L-nasal?, R-liquid?, R-l?, R-m?, L-fricative?] Increase the size of the HMM state space. Share states between models to avoid overfitting. Decision trees infer fine- and broad-class phonetic contexts from data.

Discriminative training. Generative modelling: train the models to reproduce the training data (improve the correct models). Discriminative training: as well as improving the correct models, penalize the incorrect models. Maximize the mutual information between the observations and the word sequence. 1983: outline for discriminative training of HMMs. 1986: MMI training for HMMs using gradient descent. 1996: Extended Baum-Welch algorithm for MMI training. 2000: first successfully applied to large vocabulary ASR.

Other discriminative approaches. Hybrid connectionist/HMM approaches: use a multilayer perceptron or recurrent network to discriminatively estimate HMM output probabilities (scaled likelihoods framework). Conditional random fields, support vector machines, etc.: computationally expensive for large tasks. Discriminative features: framewise posterior probability estimates from a connectionist network; use features derived from the set of Gaussians.

Speaker adaptation. Tune a speaker-independent system to a target speaker. Speaker normalization: adapt the acoustic features of the target to be more like an average speaker (eg vocal tract length normalization). Model-based approaches: adapt the parameters of the speaker-independent model (eg MAP training, maximum likelihood linear regression). Speaker space approaches: estimate multiple sets of acoustic models and interpolate new speakers between these models (eg Eigenvoices, cluster-adaptive training). Speaker adaptation may be supervised or unsupervised.
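As one concrete model-based example, a sketch of MAP adaptation of a single Gaussian mean; the function and its parameter values are this sketch's assumptions. The adapted mean interpolates between the speaker-independent mean and the target speaker's data, weighted by how much adaptation data the state has seen:

```python
import numpy as np

def map_adapt_mean(mu_si, frames, gamma, tau=10.0):
    """MAP update of one Gaussian mean.

    mu_si:  (D,)   speaker-independent (prior) mean
    frames: (T, D) adaptation frames from the target speaker
    gamma:  (T,)   occupancy probabilities of this state per frame
    tau:    prior weight (a hypothetical value)
    """
    occ = gamma.sum()                              # expected frame count
    weighted_sum = (gamma[:, None] * frames).sum(axis=0)
    # Little data: stay near mu_si. Lots of data: approach the
    # maximum likelihood estimate from the target speaker.
    return (tau * mu_si + weighted_sum) / (tau + occ)
```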

Robust speech recognition. Recognize speech in a challenging acoustic environment: background noise, competing speakers, reverberation. Parallel model combination: use models in parallel to account for different parts of the signal. Missing feature theory: identify the reliable parts of the signal. Microphone array approaches: use multiple microphones to construct directional listening in software.

Parallel model combination. [Diagram: a clean speech HMM and a noise HMM are combined into a noisy speech HMM.] Combine a noise model and a speech model to make a noisy speech model. The combined model is the product of the noise and speech models. More than a single-state noise model results in a complex compound model (2D Viterbi search).

Missing feature theory. Assume each location in the time-frequency map is dominated by one of the sources, and attempt to identify reliable regions for the required source.

Microphone arrays. Sound from a source takes different times to reach different mics in an array. Can use delay-and-sum (or more complicated) methods to enhance sound from a particular direction. Tracking and localization of speakers.
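A sketch of delay-and-sum with integer-sample delays, assuming the microphone geometry and source direction are known; np.roll wraps at the signal edges, which a real implementation would avoid:

```python
import numpy as np

def delay_and_sum(signals, mic_pos, direction, sr, c=343.0):
    """Delay-and-sum beamformer.

    signals:   (M, T) array, one row per microphone
    mic_pos:   (M, 3) microphone positions in metres
    direction: (3,)   unit vector from the array towards the source
    """
    # A plane wave from `direction` reaches mics with a larger
    # projection onto that direction first.
    arrival = -(mic_pos @ direction) / c                 # seconds
    lag = np.round((arrival - arrival.min()) * sr).astype(int)
    out = np.zeros(signals.shape[1])
    for sig, s in zip(signals, lag):
        out += np.roll(sig, -s)     # advance late channels to align
    return out / len(signals)       # source adds coherently, noise does not
```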

Linguistic modelling

Modelling pronunciation. The pronunciation model is used to map from a word sequence to a phone sequence (and hence an utterance level HMM). Pronunciation dictionary: a listing of words and their pronunciations. Multiple pronunciations increase the richness of the dictionary, but at a cost of increased flexibility; most current systems average about 1.1 pronunciations/word. The acoustic model itself is also able to absorb pronunciation variation. Embeds a 'beads on a string' view of speech; results in a consistent (not faithful) representation.

Language modelling. The language model is the prior probability of the word sequence, P(W). Use a language model to disambiguate between similar acoustics ('never mind the new display' vs 'never mind the nudist play') when combining linguistic and acoustic evidence. Use hand constructed networks in limited domains. Statistical language models: cover ungrammatical utterances, are computationally efficient, are trainable from huge amounts of data, and can assign a probability to a sentence fragment as well as to a whole sentence.

Finite state network. [Diagram: a word network accepting phrases such as 'one ticket to Edinburgh', 'two tickets to London', 'three ... to Leeds', with 'and' linking conjunctions.]

n-grams. Re-express the word sequence probability with the chain rule: \( P(W) = P(W_1, W_2, \ldots, W_M) = P(W_1) P(W_2 \mid W_1) P(W_3 \mid W_1, W_2) \cdots P(W_M \mid W_1, \ldots, W_{M-1}) \). Assume that the probability of a word depends only on the previous n-1 words (the n-gram assumption); if n=2 this is a bigram: \( P(W) \approx P(W_1) P(W_2 \mid W_1) P(W_3 \mid W_2) \cdots P(W_M \mid W_{M-1}) \). Estimate the probabilities by counting: \( P(W_B \mid W_A) = C(W_A, W_B) / C(W_A) \), the maximum likelihood estimate.
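The counting estimate is a few lines of Python; the <s> and </s> sentence boundary markers are this sketch's convention:

```python
from collections import Counter

def bigram_mle(sentences):
    """P(w_B | w_A) = C(w_A, w_B) / C(w_A), estimated by counting."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens[:-1])                  # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

probs = bigram_mle([["one", "ticket", "to", "Edinburgh"],
                    ["two", "tickets", "to", "London"]])
print(probs[("ticket", "to")])   # 1.0 in this tiny corpus
```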

Bigram network. [Diagram: the word network with arcs weighted by bigram probabilities, eg P(one | start of sentence), P(ticket | one), P(Edinburgh | one), P(end of sentence | Edinburgh).]

The zero probability problem. Estimating n-gram probabilities by counting will fail when n-grams are unseen in the training data, and will be unreliable for rarely encountered n-grams. The zero probability problem: just because something is not observed in training doesn't mean it will never occur. Smoothing: reserve some probability mass for unseen n-grams by discounting counts. Allocate the reserved probability by using simpler models (eg lower order n-grams) by interpolation or backoff.
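A sketch of one such scheme, simple linear interpolation with the unigram model; the weight lam would normally be estimated on held-out data, and is a hypothetical constant here:

```python
def interpolated_bigram(a, b, bigram_p, unigram_p, lam=0.8):
    """P(b | a) ~ lam * P_ML(b | a) + (1 - lam) * P_ML(b).

    bigram_p and unigram_p are dictionaries of maximum likelihood
    estimates (eg from the counting sketch above). Unseen bigrams
    fall back on the unigram, so they no longer get zero probability.
    """
    return lam * bigram_p.get((a, b), 0.0) + (1 - lam) * unigram_p.get(b, 0.0)
```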

Search. Find the most likely model sequence for the observed acoustics. [Diagram: a recognition network in which words such as 'one ticket', 'two tickets', 'three' are expanded into their phone HMMs: w ah n, t uw, th r iy.]

Search algorithms. Viterbi is efficient and exact, but infeasible for large vocabularies and long-span language models (which result in large recognition networks). Search techniques: pruning (do not consider unlikely hypotheses); dynamically compile the network as needed; multipass search (start with simple models, produce word graphs, then progressively refine with more complex models); heuristic search (eg A*).
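A sketch of Viterbi with beam pruning, in the same notation as the forward-algorithm sketch above; the beam width is a hypothetical value:

```python
import numpy as np

def viterbi_beam(log_A, log_pi, log_B, beam=20.0):
    """Most likely state sequence, pruning hypotheses more than
    `beam` below the current best partial score."""
    T, S = log_B.shape
    score = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_A                   # (from, to) scores
        cand[score < score.max() - beam, :] = -np.inf   # prune hypotheses
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_B[t]
    path = [int(score.argmax())]                        # best final state
    for t in range(T - 1, 0, -1):                       # trace back
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())
```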

Discussion

Evaluation. Align the recognizer output to a human transcription and compute a string edit distance in terms of substitutions (S), insertions (I) and deletions (D). Word error rate is obtained by summing the errors over the N reference words: \( \mathrm{WER} = 100 \, \frac{S + D + I}{N} \, \% \). Standardized corpora and experimental protocols (training, development, test sets) have enabled precise comparisons and driven the field forwards. Regular international benchmark evaluations.
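The alignment is a standard dynamic program; a sketch in pure Python, counting S, D and I implicitly through the minimum-cost alignment:

```python
def wer(reference, hypothesis):
    """Word error rate in percent, via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("never mind the new display",
          "never mind the nudist play"))   # two substitutions -> 40.0
```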

State-of-the-art Error rates for speaker-independent systems Dictated business news about 5-10% WER Conversational telephone speech about 15-20% WER Broadcast news about 10-15% WER, much higher for general broadcast speech (drama, etc.) Meeting transcription Close-talking mics 25-30% WER Distant mics (array) - 35-40% WER

Multiparty speech recognition Yeah I know we re talking a voice recognition also because they re not be an order just a shuffle how to locate the remote control if it s lost Mm Uh-huh So i m looking at what you think Yeah i was just a resistor cost is she without that is that good idea we just need to check on the cost of uh Or maybe like a banana suggesting the last thing some devices input and teachings Oh yeah you have the whistle ones yeah Well yeah the results so we can define in chile voice recognition is not feasible we could go for a visit Um incorporating the company logo

Beyond transcription. Rich transcription: automatic extraction of semantic content from speech (named entities, segmentation into dialogue acts or sentences, automatic capitalization and punctuation, summarization). Spoken dialogue systems. Prosodic modelling. Multimodal processing: audio-video speech recognition (lip tracking); person tracking and localization; focus of attention detection.

ASR vs HSR. The performance gap between human and automatic speech recognition is substantial, both in core recognition of clean speech and in dealing with cluttered acoustic environments. Current systems incorporate very shallow linguistic knowledge: non-linear scaling of the frequency axis; spectral warping to take account of vocal tract size; use of phonemes as the basic units of speech!

Speech synthesis

Approaches to speech generation. Articulatory: rules to obtain the articulatory dynamics for a given sequence of phonemes. Formant based: acoustic phonetic rules to obtain the spectrogram for a given sequence of phonemes. Concatenative synthesis: string together a sequence of speech sounds corresponding to the sequence of phonemes, extracted from a large database of speech (eg Festival). Parametric statistical models: use automatically learned models to generate the speech sounds (eg HTS).

Concatenative speech synthesis: "Don't Ask". Utterance: DON'T ASK. Word: d oh n t ah s k. Subword (phone). Units drawn from a speech database containing many variants of each sound (d oh n t ah s k..., k ah s k..., k aa n..., k aa t..., d oh m...). [Figure: spectrogram of the synthesized utterance, frequency 0-8000 Hz against time 0-1400 ms.]

Unit selection. Database of naturally spoken speech, with many variants of each sound (several hours total). For a given sentence to be synthesised, select the unit sequence that fits best: target cost (how close a possible unit is to the ideal unit for that location); join cost (how well it fits with the surrounding units). Solve by dynamic programming search. Can be close to studio quality; further processing (pitch, timing) tends to degrade quality.
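A sketch of that search; target_cost and join_cost are hypothetical placeholder functions standing in for the real acoustic and prosodic measures:

```python
import numpy as np

def select_units(candidates, target_cost, join_cost):
    """Choose one unit per target position, minimising the sum of
    target costs and join costs by dynamic programming.

    candidates: list over target positions; each entry is a list of
                candidate units from the speech database.
    """
    best = [target_cost(0, u) for u in candidates[0]]
    back = []
    for t in range(1, len(candidates)):
        prev, best_t, back_t = best, [], []
        for u in candidates[t]:
            costs = [p + join_cost(v, u)
                     for p, v in zip(prev, candidates[t - 1])]
            k = int(np.argmin(costs))
            back_t.append(k)
            best_t.append(costs[k] + target_cost(t, u))
        best, back = best_t, back + [back_t]
    idx = [int(np.argmin(best))]        # cheapest final unit
    for back_t in reversed(back):       # trace back the sequence
        idx.append(back_t[idx[-1]])
    return list(reversed(idx))
```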

HMM speech synthesis: "Don't Ask". Utterance: DON'T ASK. Word: d oh n t ah s k. Subword (phone). Acoustic model (HMM). [Figure: spectrogram of the synthesized utterance, frequency 0-8000 Hz against time 0-1400 ms.]

Trajectory HMMs. Speech synthesis using HMMs: generate acoustic features from a statistical model. Transforming the HMM parameters enables the synthetic speech to be precisely controlled: speaker adaptation from an average voice; control of intonation and timing. A unified model for recognition and synthesis.

Text-to-speech. Speech synthesis is not just a process of generating speech sounds from a sequence of phonemes: also intonation; timing; speaker specific aspects (accent, voice quality, ...). Linguistic knowledge is required to control the intonation and timing: syllabification; part-of-speech tags (object, content, discount); grammatical information.

Speech synthesis examples: Formant synthesis (OVE 1953); Synthesis by Rule (Holmes, Mattingly, Shearme, 1964); Concatenative synthesis (Bell Labs 1977); Formant synthesis (DECtalk 1983); Diphone synthesis (Festival 1997); Unit selection (Rhetorical 2001); Unit selection (Cereproc 2007); HMM synthesis (HTS 2007); Speaker adapted HMM synthesis (HTS 2007).

Research challenges

Beyond HMMs. HMMs are a weak model of speech that succeed by dividing the space into small regions. Speech is not a simple sequence of discrete units. A flat hidden structure has limited expressiveness. Richer models: increased temporal dependencies; multiple asynchronous streams; hierarchical hidden structure; feature representations with a closer link to audition and articulation.

Dynamic Bayesian network. [Figure: a DBN unrolled over times t-1 and t, with multiple hidden streams (m, v, p, f, s, r) and observation variables y at each time.]

Communication Scene Analysis

Communication scenes. An interdisciplinary problem: signal processing and machine learning (making sense of communication scenes starting from the signals); linguistic and discourse modelling (understanding the content of the recognized signals); moving from qualitative to quantitative models of social dynamics; applications that correspond to the needs and requirements of people.

Current state. Automatic processing of communication scenes in constrained environments: speech recognition from distant microphones; multimodal tracking of people in meeting rooms; automatic segmentation by speaker, dialogue acts, topic, meeting phase; automatic summarization. Integration into systems: indexing; search and browsing of archives; limited online processing.

AMI Meeting Browsers

In conclusion

Final remarks. Several basic models and algorithms underpin speech processing: dynamic programming; finite state models of time; inference of a (simple) hidden state from huge amounts of data. Current systems are rather inflexible regarding domain, and rely on benign acoustic environments. But: given these constraints we have high performing approaches to speech recognition and synthesis.

The end.

Further reading B Gold and N Morgan (2000). Speech and Audio Signal Processing, Wiley. X D Huang, A Acero and H W Hon (2001). Spoken Language Processing: A Guide to Theory, Algorithms and System Development, Prentice Hall. D Jurafsky and J H Martin (2008). Speech and Language Processing, Prentice Hall. F Jelinek (1998). Statistical Methods for Speech Recognition, MIT Press. P Taylor (20??). Text-to-speech synthesis,???.

Software HTK, hidden Markov model toolkit - http://htk.eng.cam.ac.uk SRILM, language modelling toolkit - http://www.speech.sri.com/projects/srilm Festival, text-to-speech synthesis - http://www.cstr.ed.ac.uk/projects/festival HTS, HMM-based speech synthesis system - http://hts.sp.nitech.ac.jp