SAIL Speech Recognition or Speech-to-Text conversion: The first block of a virtual character system.


Speech Recognition or Speech-to-Text conversion: The first block of a virtual character system.
Panos Georgiou, Research Assistant Professor (Electrical Engineering)
Signal and Image Processing Institute, http://sail.usc.edu

State of the Art in Speech Recognition

LVCSR = Large Vocabulary Continuous Speech Recognition
ASR = Automatic Speech Recognition
WER = Word Error Rate (can be above 100%)

Current state-of-the-art error rates range dramatically by task (not all are real-time systems):

Task                      Vocabulary   WER (%)
Digits                    11           0.5
Read speech (WSJ)         5K           3
Read speech (WSJ)         20K          3
Broadcast news            64K          10
Conversational telephone  64K          20

Virtual character? Data starved: roughly 1K words seen, but 15K-word models.

Speech Recognition: Training Process

TRAINING PROCESS:
- Audio (e.g. 300 h) -> feature extraction
- Transcript of that audio
- Dictionary: word -> phoneme phonetic transcription (mostly human-made, especially in non-phonetic languages like English)
- Training (e.g. Baum-Welch) -> Gaussian mixture acoustic model
- Millions of words of representative transcripts for the domain -> feature extraction -> language model (e.g. n-gram)

Speech Recognition: Recognition (Decoding) Process

Decoding: find the word sequence that is most probable given the observations:

    Ŵ = argmax_{W ∈ D} P(W | O)

By Bayes' rule this is mathematically the same as

    Ŵ = argmax_{W ∈ D} P(O | W) P(W) / P(O)

and we can drop the common denominator:

    Ŵ = argmax_{W ∈ D} P(O | W) P(W)

where P(O | W) is the Acoustic Model and P(W) the Language Model. In real life a language-model weight N is applied:

    Ŵ = argmax_{W ∈ D} P(O | W) P(W)^N
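The decoding rule can be illustrated with a toy rescoring example, done in log space as real decoders do. The candidate hypotheses, their scores, and the language-model weight below are all made up for illustration:

```python
# Toy illustration of Ŵ = argmax P(O|W) P(W)^N, computed in log space.

def decode(hypotheses, lm_weight=10.0):
    """hypotheses: list of (words, acoustic_logprob, lm_logprob)."""
    return max(hypotheses,
               key=lambda h: h[1] + lm_weight * h[2])[0]  # log P(O|W) + N log P(W)

candidates = [
    ("recognize speech",   -120.0, -4.0),
    ("wreck a nice beach", -118.0, -9.0),  # slightly better acoustically...
]
# ...but the weighted language model favors the first hypothesis
print(decode(candidates))  # recognize speech
```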

Speech Recognition: Training Process (training-pipeline slide repeated as a roadmap)

Feature Characterization

Acoustic representation: in short, take advantage of spectral characteristics. Think of voiced sounds as harmonics of the vocal cord vibrations which, due to the shape of the vocal tract, create resonances. Different sounds, different resonances. Early work approximates the vocal tract with a tube.

Features

Acoustic representation: the speech signal is complex, with fricatives, voiced, unvoiced, plosives, etc. The spectrum is good for visualizing voiced sounds. LPC (last slide) is one option.

[Figure: waveform and spectrogram (0-7 kHz) of a speech segment]

Features: Mel Frequency Cepstral Coefficients

More common than LPC: MFCC = Mel Frequency Cepstral Coefficients.

Frame extraction (25 ms, 10 ms shift) -> Windowing -> DFT -> Mel filterbank -> Log -> IDFT (or DCT) -> 12 cepstral features; + energy and deltas ("derivatives") -> 39 features
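The pipeline above can be sketched with numpy. The parameter choices (25 ms frames, 10 ms shift, 26 filters, 12 coefficients) follow the slide; the triangular filterbank construction is a common textbook variant, not necessarily the exact front end used in the lecture:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_filters=26, n_ceps=12):
    frame_len, shift = int(0.025 * sr), int(0.010 * sr)
    n_fft = 512
    # Frame extraction and windowing
    n_frames = 1 + (len(signal) - frame_len) // shift
    idx = np.arange(frame_len) + shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel filterbank, equally spaced on the mel scale
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II of the log mel energies; keep coefficients 1..n_ceps
    n = logmel.shape[1]
    dct = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n))
    return (logmel @ dct)[:, 1:n_ceps + 1]

feats = mfcc(np.random.randn(16000))  # 1 s of noise -> 98 frames x 12 coeffs
print(feats.shape)
```

In a real front end, energy plus delta and delta-delta coefficients would be appended to reach the 39-dimensional vectors described above.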

Speech Recognition: Training Process (training-pipeline slide repeated as a roadmap)

Components: Lexicon or Dictionary

In a simple representation:

ABOUT          AH B AW T
ABSORPTION     AH B S AO R P SH AH N
ABSORPTION(2)  AH B Z AO R P SH AH N

But in reality each of these is a Hidden Markov Model: a word like ABOUT is a left-to-right chain of states (IN, AH, B, AW, T, OUT), each with a self-loop probability and a forward transition probability (the α's), and each phoneme is itself a chain of states (e.g. Bs, Bm, Be for the beginning, middle, and end of B).

[Diagram: left-to-right HMM chains with self-loop and forward transition probabilities α]
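The word-as-HMM picture can be sketched as follows; the self-loop value and the dwell counts are made up for illustration:

```python
# A word as a left-to-right HMM: each phoneme contributes a state with a
# self-loop probability and a forward transition, as in the ABOUT diagram.

def word_hmm(phones, self_loop=0.6):
    """Return (state, self_loop_prob, forward_prob) triples for each phone."""
    return [(p, self_loop, 1.0 - self_loop) for p in phones]

def path_prob(hmm, dwell):
    """Probability of staying dwell[i] extra frames in state i, then
    moving forward through the whole chain."""
    prob = 1.0
    for (state, stay, go), d in zip(hmm, dwell):
        prob *= (stay ** d) * go
    return prob

about = word_hmm(["AH", "B", "AW", "T"])
print(path_prob(about, [1, 0, 2, 0]))  # one possible alignment of 8 frames
```

The self-loops are what let the same word model absorb fast and slow pronunciations: longer dwell times just multiply in more self-loop factors.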

Components: Lexicon or Dictionary

In reality it is more complicated: we use triphone models, i.e. each phoneme in its left and right context (with _ marking a word boundary):

ABOUT       _AHB  AHBAW  BAWT  AWT_
ABSORPTION  _AHB  AHBS  BSAO  SAOR  AORP  RPSH  PSHAH  SHAHN  AHN_

For a phoneme set of 50 phonemes (~English) there are potentially 50^3 triphones, with 3 states each. The space is reduced by tying states (say, down to 10K states). Every word in the dictionary is represented by a Hidden Markov Model built from these states.
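The expansion from a pronunciation into word-internal triphones can be sketched directly; the center-with-context notation below follows the slide's examples:

```python
# Expand a dictionary pronunciation into triphones, _ marking word boundaries.

def to_triphones(phones):
    padded = ["_"] + phones + ["_"]
    return [f"{padded[i - 1]}{p}{padded[i + 1]}"  # left + center + right
            for i, p in enumerate(phones, start=1)]

print(to_triphones(["AH", "B", "AW", "T"]))
# ['_AHB', 'AHBAW', 'BAWT', 'AWT_'], matching the ABOUT example above
```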

Speech Recognition: Training Process (training-pipeline slide repeated as a roadmap)

Acoustic Model

Acoustic model: represent the variability of each of these 39 numbers for each state, due to multiple sound instantiations/conditions/speakers/... A single Gaussian is not a good model. A histogram? The preferred method is a Gaussian mixture model.

In summary:
- Each phoneme is represented by 3 states
- Each state is represented by 39 dimensions
- Each dimension is represented by a Gaussian mixture model (N means, N variances, and N mixture weights, assuming a diagonal covariance matrix)

Complexity of the acoustic model in real numbers (real systems):
- Say 50 phonemes (English); for better accuracy use a triphone representation (potentially 50^3, but usually >5K triphones)
- Each of these has 3 states
- Each state has 39 representation dimensions
- Each dimension has about 32 mixture Gaussians
- 5,000 × 3 × 39 × (32 + 32 + 32) ≈ 50,000,000 parameters! (Current SAIL models: 297,000,000 parameters)
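Scoring one frame against one state's mixture, and the slide's parameter count, can be sketched as follows. All numeric values in the example are illustrative, not from a trained model:

```python
import math

# Log-likelihood of a frame under a diagonal-covariance Gaussian mixture.

def gmm_loglik(x, weights, means, variances):
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        # log of a diagonal Gaussian is a sum over dimensions
        log_g = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                    for xi, m, v in zip(x, mu, var))
        total += w * math.exp(log_g)
    return math.log(total)

# Two-component, two-dimensional toy example
ll = gmm_loglik([0.0, 0.0],
                weights=[0.5, 0.5],
                means=[[0.0, 0.0], [1.0, 1.0]],
                variances=[[1.0, 1.0], [1.0, 1.0]])
print(ll)

# The slide's back-of-the-envelope count: ~5K triphones x 3 states x 39 dims
# x (32 means + 32 variances + 32 weights per dimension, as counted there)
params = 5000 * 3 * 39 * (32 + 32 + 32)
print(params)  # 56160000, the slide's "~50,000,000 parameters"
```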

Speech Recognition: Training Process (training-pipeline slide repeated as a roadmap)

Language Models

Second term in Ŵ = argmax_{W ∈ D} P(O | W) P(W): the Acoustic Model and the Language Model.

P(W) can be extracted from existing text:

    P(W) = P(W1, W2, ..., Wn) = P(W1) P(W2 | W1) P(W3 | W1 W2) ... P(Wn | W1 W2 ... Wn-1)

For simplicity and feasibility, approximate with trigrams:

    P(W) ≈ P(W1) ... P(Wn-1 | Wn-3 Wn-2) P(Wn | Wn-2 Wn-1)

When we don't have enough data, the next best thing is backoff:

    P(w3 | w1, w2) =
        P3(w1, w2, w3)               if the trigram exists
        BOW(w1, w2) P(w3 | w2)       else if the bigram (w1, w2) exists
        P(w3 | w2)                   otherwise
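The backoff rule can be sketched with a tiny hand-made model; real models are ARPA files storing log10 probabilities, and all the entries below are invented for illustration:

```python
# Backoff trigram lookup: use the trigram if seen, otherwise back off,
# applying a backoff weight BOW when the history was seen.

trigrams = {("his", "university", "work"): 0.02}
bigrams  = {("university", "work"): 0.05}
unigrams = {"work": 0.01}
bow      = {("his", "university"): 0.4, ("university",): 0.3}

def p_bigram(w2, w3):
    if (w2, w3) in bigrams:
        return bigrams[(w2, w3)]
    return bow.get((w2,), 1.0) * unigrams.get(w3, 1e-7)

def p_trigram(w1, w2, w3):
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]       # trigram exists
    if (w1, w2) in bow:                     # bigram history exists
        return bow[(w1, w2)] * p_bigram(w2, w3)
    return p_bigram(w2, w3)                 # otherwise plain bigram estimate

print(p_trigram("his", "university", "work"))   # seen trigram
print(p_trigram("her", "university", "work"))   # backs off to the bigram
```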

Virtual Character Complications

Learn from large amounts of existing text. Dealing with data sparsity: smoothing, background models, mining, etc.

Example: one UNIVERSITY unigram

    -3.86769 UNIVERSITY -0.5197889

results in 1056 bigrams, e.g.

    -3.120121 UNIVERSITY WORK -0.07356837

and 1650 trigrams, e.g.

    -1.634784 HIS UNIVERSITY WORK

Virtual character data: really data starved. Very few potential n-grams are seen, especially 2+grams:

    \data\
    ngram 1=1422
    ngram 2=6613
    ngram 3=9943

Background LM on the same data: much better coverage, but not of this domain:

    \data\
    ngram 1=5353
    ngram 2=2650680
    ngram 3=6881435

Smoothing with the background covers the language possibilities better, but the probabilities are flat:

    \data\
    ngram 1=1422
    ngram 2=370422
    ngram 3=2231793

Speech Recognition: Recognition (Decoding) Process

Decoding Ŵ = argmax_{W ∈ D} P(O | W) P(W)^N, evaluated every frame:
- Birth of new words: this is probabilistic, so hundreds of words are potentially starting every 10 ms.
- A lexical-tree-like search makes this faster (i.e. if we have seen phonemes X Y, then only the words starting with X Y will be searched, not the remaining words).
- As we move forward we can prune paths based on:
  - the maximum number of total live words at any time instant
  - the maximum number of new words at any time instant
  - pruning low-probability paths by deeming them unviable
  - constraining the total search space (dangerous), etc.
- Pruning reduces performance, so a good LM and AM reduce the probability of pruning good paths (this matters especially for real-time systems, bad LMs, and large/mismatched domains).
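The per-frame pruning described above can be sketched as a beam plus a cap on live hypotheses; the scores and beam width below are arbitrary illustration values:

```python
# Beam pruning: keep only hypotheses within `beam` log-probability of the
# best one, and cap the total number of live hypotheses.

def prune(hypotheses, beam=10.0, max_alive=3):
    """hypotheses: list of (score, words) partial paths; higher is better."""
    best = max(score for score, _ in hypotheses)
    survivors = [(s, w) for s, w in hypotheses if s >= best - beam]
    survivors.sort(reverse=True)      # best first
    return survivors[:max_alive]      # cap live paths

paths = [(-100.0, "a"), (-103.0, "b"), (-115.0, "c"),
         (-104.0, "d"), (-105.0, "e")]
print(prune(paths))  # "c" falls outside the beam; the cap keeps 3 of the rest
```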

Summary: ASR Aspects

Needed:
- Representative audio
- Transcriptions of the audio
- Good HMM models (word -> phoneme dictionaries) for all transcripts
- Large amounts of representative text (in the millions of words)

Other real-system complications:
- Click-to-talk: needed to reduce search space and ambiguity. Without it we need:
  - VAD (Voice Activity Detection) for the coarse speech/non-speech segmentation
  - Utterance segmentation for breaking up continuous streams of audio (e.g. this presentation)
  - If both are absent, ASR is near useless.
- Speed
- Audio quality

References

Acoustic models:
- J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," International Computer Science Institute, Vol. 4, 1998. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.4498
- L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, Vol. 77, No. 2, 1989, pp. 257-286. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&arnumber=18626&isnumber=698
- Abhinav Sethy, Panayiotis Georgiou, Bhuvana Ramabhadran, and Shrikanth Narayanan, "An iterative relative entropy minimization based data selection approach for n-gram model adaptation," IEEE Transactions on Audio, Speech, and Language Processing, in press, 2008.