Hidden Markov Models: use for speech recognition

Phoneme HMM

- Each phoneme is represented by a left-to-right HMM with 3 states.
- Word and sentence HMMs are constructed by concatenating the phoneme-level HMMs (figure: a word HMM built from the phone sequence W AX N).

Hidden Markov Models: use for speech recognition

Contents:
- Viterbi training
- Acoustic modeling aspects
- Isolated-word recognition
- Connected-word recognition
- Token passing algorithm
- Language models

Viterbi training

- The forward-backward algorithm assigns a probability that a feature vector was emitted from an HMM state. In Viterbi training, we instead construct the composite HMM from the phoneme units and use the Viterbi algorithm to find the single best state-level alignment.
- For each training example, use the current HMM models to assign feature vectors to HMM states: using the Viterbi algorithm, find the most likely path through the composite HMM model. This is called Viterbi forced alignment.
- Group the feature vectors assigned to each HMM state and estimate new parameters for each HMM (for example, using the GMM update equations).
- Repeat alignment and parameter re-estimation.
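A minimal sketch of one Viterbi-training iteration as described above, assuming per-frame state log-likelihoods can be computed and that the composite HMM is a plain left-to-right chain with fixed self-loop/forward transition probabilities. The helper names (viterbi_align, viterbi_training_step, loglik_fn) and the mean-only re-estimation are illustrative assumptions, not the slides' exact recipe.

```python
import numpy as np

def viterbi_align(loglik, log_self=np.log(0.6), log_next=np.log(0.4)):
    """Forced alignment of T frames to S left-to-right states.

    loglik: (T, S) array of per-frame state log-likelihoods log p(o_t | state s).
    Returns the most likely state index for every frame (the Viterbi path).
    """
    T, S = loglik.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = loglik[0, 0]                # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + log_self
            move = score[t - 1, s - 1] + log_next if s > 0 else -np.inf
            back[t, s] = s if stay >= move else s - 1
            score[t, s] = max(stay, move) + loglik[t, s]
    path = [S - 1]                            # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

def viterbi_training_step(features, loglik_fn, means):
    """One iteration: align every utterance, then re-estimate per-state means."""
    buckets = {s: [] for s in range(len(means))}
    for feats in features:                    # feats: (T, D) feature vectors
        path = viterbi_align(loglik_fn(feats, means))
        for t, s in enumerate(path):
            buckets[s].append(feats[t])       # group frames by aligned state
    return [np.mean(buckets[s], axis=0) if buckets[s] else means[s]
            for s in range(len(means))]
```

In a full system the re-estimation step would update all GMM parameters (weights, means, covariances) from the grouped frames, as the slide's reference to the GMM update equations suggests.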

Acoustic models

An ideal acoustic model is:
- Accurate: it accounts for context dependency (phonetic context).
- Compact: it provides a compact representation, trainable from finite amounts of data.
- General: it is a general representation that allows new words to be modeled, even if they were not seen in the training data.

Whole-word HMMs

Each word is modeled as a whole: each word is assigned an HMM with a number of states. Is it a good acoustic model?
- Accurate: yes, if there is enough data and the system has a small vocabulary; no, if trying to model context changes between words.
- Compact: no. It needs many states as the vocabulary increases, and there might not be enough training data to model every word.
- General: no. It cannot be used to build new words.

Phoneme HMMs

Each phoneme is modeled using an HMM with M states. Is it a good acoustic model?
- Accurate: no. It does not model coarticulation well.
- Compact: yes. With N phonemes and M states each, the complete system has M x N states in total, not so many parameters to estimate.
- General: yes. Any new word can be formed by concatenating the units.

Modeling phonetic context

- Monophone: a single model is used to represent a phoneme in all contexts.
- Biphone: one model represents a particular left or right context. Notation: left-context biphone (a-b), right-context biphone (b+c).
- Triphone: one model represents a particular left and right context. Notation: (a-b+c).

Context-dependent model examples

For the word SPEECH (phones S P IY CH):
- Monophone: S P IY CH
- Left-context biphones: SIL-S S-P P-IY IY-CH
- Right-context biphones: S+P P+IY IY+CH CH+SIL
- Triphones: SIL-S+P S-P+IY P-IY+CH IY-CH+SIL

For the phrase SPEECH RECOGNITION:
- Word-internal context-dependent triphones back off to left and right biphone models at the word boundary: SIL S-P S-P+IY P-IY+CH IY+CH R-EH R-EH+K EH-K+AH K-AH+G ..
- Cross-word context-dependent triphones: SIL-S+P S-P+IY P-IY+CH IY-CH+R CH-R+EH R-EH+K EH-K

Context-dependent triphone HMMs

Each phoneme unit within its immediate left and right context is modeled using an HMM with M states. Is it a good acoustic model?
- Accurate: yes. It takes coarticulation into account.
- Compact: yes.
- Trainable: no. For N phonemes there are N x N x N triphone models, too many parameters to estimate!
- General: yes. New words can be formed by concatenating units.

Training issues: many triphones occur infrequently, so there is not enough training data. Solution: cluster HMM states that have similar statistical distributions, so that HMM parameters are estimated using pooled data.

Isolated word recognition

- Whole-word model: collect many examples of each word spoken in isolation, assign a number of states to each word model based on word duration, and estimate the HMM model parameters.
- Subword-unit model: collect a large corpus of speech and estimate phonetic-unit HMMs, then construct word-level HMMs from the phoneme-level HMMs. This is more general than the whole-word approach.
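A small sketch of the cross-word label expansion shown above, using the (a-b+c) notation; the function name, the use of SIL as the utterance-boundary context, and the phone set chosen for RECOGNITION are illustrative assumptions.

```python
def to_cross_word_triphones(words):
    """Expand a list of words (each a list of phones) into cross-word
    context-dependent triphone labels, e.g. SPEECH -> SIL-S+P, S-P+IY, ...
    """
    phones = ["SIL"] + [p for w in words for p in w] + ["SIL"]
    labels = []
    for i in range(1, len(phones) - 1):
        left, mid, right = phones[i - 1], phones[i], phones[i + 1]
        labels.append(f"{left}-{mid}+{right}")
    return labels

# Example: "SPEECH RECOGNITION" as on the slide (phone sets are illustrative)
speech = ["S", "P", "IY", "CH"]
recognition = ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AX", "N"]
print(to_cross_word_triphones([speech, recognition])[:6])
# ['SIL-S+P', 'S-P+IY', 'P-IY+CH', 'IY-CH+R', 'CH-R+EH', 'R-EH+K']
```

A word-internal variant would reset the context at each word boundary and back off to biphone labels there, as in the first sequence on the slide.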

Whole-word HMM
(figure)

Viterbi algorithm through a model

P(O | W) is calculated using the Viterbi algorithm rather than the forward algorithm: Viterbi provides the probability of the path represented by the most likely state sequence.

Isolated word recognition system
(figure)

Connected-word recognition

- The boundaries of the utterance are unknown.
- The number of words spoken is unknown.
- The position of word boundaries is often unclear and difficult to determine.
- Example: a two-word network (figure).
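A sketch, under simple assumptions, of computing the Viterbi path score used as P(O | W) and of an isolated-word recognizer that picks the highest-scoring word model; the array-based interface (log_trans, log_emit) and the entry/exit-state convention are assumptions for illustration.

```python
import numpy as np

def viterbi_log_score(log_trans, log_emit):
    """Log-probability of the best state sequence through one HMM.

    log_trans: (S, S) log transition probabilities a_ij.
    log_emit:  (T, S) log emission probabilities b_j(o_t).
    Assumes entry in state 0 and exit from state S-1.
    """
    T, S = log_emit.shape
    score = np.full(S, -np.inf)
    score[0] = log_emit[0, 0]
    for t in range(1, T):
        # best predecessor for every state, then add the emission score
        score = np.max(score[:, None] + log_trans, axis=0) + log_emit[t]
    return score[-1]

def recognize_isolated_word(word_models):
    """Pick the word whose HMM gives the highest Viterbi score.

    word_models: dict mapping word -> (log_trans, log_emit) for this utterance
    (the emission scores already depend on the observed feature vectors).
    """
    scores = {w: viterbi_log_score(lt, le) for w, (lt, le) in word_models.items()}
    return max(scores, key=scores.get)
```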

Connected-word Viterbi search

- At each node we must compute the probability of the best state sequence up to that point, and keep the information about where it came from; this allows back-tracing to find the best state sequence.
- During back-tracing we also find the word boundaries.

Beam pruning

At each time step, determine the log-probability of the absolute best Viterbi path, and prune node j if its score falls below the best by more than the beam width B, i.e. if φ_j(t) < max_i φ_i(t) − B.

Beam pruning illustration
(figure)

Token passing approach

- Assume each HMM state can hold multiple tokens.
- A token is an object that can move from state to state in the HMM network.
- Each token carries with it the log-scale Viterbi path score s.
- At each time t we examine the tokens assigned to the nodes and propagate them to the reachable positions at time t+1:
  - make a copy of the token;
  - adjust its path score to account for the transition within the HMM network and the observation probability;
  - merge tokens according to the Viterbi algorithm: select the token with the maximum score and discard all other competing tokens.
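A minimal sketch of the beam-pruning rule above: after each frame, keep only the nodes whose Viterbi log-score is within a fixed beam of the current best. Representing the active nodes as a dict of log-scores is an assumption made for illustration.

```python
def beam_prune(active, beam_width):
    """Drop hypotheses that fall too far behind the best one.

    active: dict mapping node (e.g. an HMM state id) -> Viterbi log-score.
    Keeps node j only if score[j] >= best_score - beam_width.
    """
    if not active:
        return active
    best = max(active.values())
    return {node: s for node, s in active.items() if s >= best - beam_width}

# Example: with a beam of 10, the node at -25.0 is pruned.
print(beam_prune({"s1": -12.0, "s2": -14.5, "s3": -25.0}, beam_width=10.0))
# {'s1': -12.0, 's2': -14.5}
```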

Token passing algorithm

- Initialization (t = 0): initialize each initial state to hold a token with score s = 0; all other states are initialized with a token with score s = −∞.
- Algorithm (t > 0): propagate tokens to all possible next states (all connecting states) and increment their scores; in each state, find the token with the largest s and discard the rest of the tokens in that state (Viterbi).
- Termination (t = T): examine the tokens in all possible final states and find the one with the largest Viterbi path score; this is the probability of the most likely state sequence.

Token propagation illustration
(figure)

Token passing for connected-word recognition

- Individual word models are connected into a composite model: a token can transition from the final state of word m to the initial state of word n.
- Path scores are maintained by the tokens; the path sequence is also maintained by the tokens, allowing recovery of the best word sequence.
- Tokens emitted from the last state of each word propagate to the initial state of each word. The probability of entering the initial state of each word, P(w1), is the probability of that word as given by the language model. In practice we use log-probabilities, so the path score is updated as s := s + log P(w1).

Bayes formulation revisited

Recall the Bayes rule applied to speech recognition: the recognizer searches for the word sequence W that maximizes P(W | O), equivalently P(O | W) P(W), where the probabilities of word sequences P(W) are given by the language model.
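A compact sketch of one token-propagation step as listed above, over a generic state network; the network representation (per-state successor lists with log transition scores) and the log_emit callback are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    score: float                                   # log-scale Viterbi path score s
    history: list = field(default_factory=list)    # for back-tracing words/states

def token_passing_step(tokens, successors, log_emit, t):
    """Propagate tokens, then merge per state according to the Viterbi criterion.

    tokens:     dict state -> best Token currently held by that state.
    successors: dict state -> list of (next_state, log_transition_prob).
    log_emit:   function (state, t) -> log observation probability at time t.
    """
    new_tokens = {}
    for state, tok in tokens.items():
        for nxt, log_a in successors.get(state, []):
            # copy the token and adjust its path score for transition + observation
            cand = Token(score=tok.score + log_a + log_emit(nxt, t),
                         history=tok.history + [nxt])
            # Viterbi merge: keep only the best-scoring token in each state
            if nxt not in new_tokens or cand.score > new_tokens[nxt].score:
                new_tokens[nxt] = cand
    return new_tokens
```

Initialization would place a Token(score=0.0) in each initial state and a token with score −∞ elsewhere; termination picks the best token among the final states, as in the algorithm above.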

Language models

- Language models assign probabilities P(W) to word sequences.
- Usually the language model is also scaled by a grammar scale factor s and a word transition penalty p.
- The additional information helps to reduce the search space.
- Language models resolve homonyms: "Write a letter to Mr. Wright right away."
- There is a tradeoff between constraint and flexibility.

Statistical language models

We want to estimate P(W). We can decompose this probability left-to-right:

P(W) = P(analysis of audio, speech and music signals)
     = P(analysis) P(of | analysis) P(audio | analysis, of) ...

How does this work?

How can we model the entire word sequence? There is never enough training data! Consider restricting the word history.
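In symbols, the left-to-right decomposition above and the restriction of the word history that leads to the n-gram models on the following slides can be written in the standard way as:

```latex
P(w_1,\dots,w_n) \;=\; \prod_{k=1}^{n} P(w_k \mid w_1,\dots,w_{k-1})
\;\approx\; \prod_{k=1}^{n} P(w_k \mid w_{k-N+1},\dots,w_{k-1})
```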

Practical training

n-gram language models

- Consider word histories that end in the same last N-1 words, and treat the sequence as a Markov model (N = 1, N = 2, N = 3).
- The probability of a word is based on the previous N-1 words: N = 1 unigram, N = 2 bigram, N = 3 trigram.
- Training: the probabilities are estimated from a corpus of training data (a large amount of text).
- Once the model is trained, it can be used to generate new sentences randomly. Syntax is roughly encoded by the obtained model, but the generated sentences are often ungrammatical and semantically strange.

Trigram example

P(states | the united) = ...
P(America | states of) = ...

Estimating the n-gram probabilities

Given a text corpus, define:
- C(w_n): the count of occurrences of word w_n
- C(w_{n-1}, w_n): the count of occurrences of word w_{n-1} followed by word w_n
- C(w_{n-2}, w_{n-1}, w_n): the count of occurrences of word w_{n-2} followed by words w_{n-1} and w_n
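A small sketch of collecting the counts defined above from a text corpus, using Python's collections.Counter; the toy corpus and whitespace tokenization are illustrative.

```python
from collections import Counter

def ngram_counts(sentences):
    """Collect C(w_n), C(w_{n-1}, w_n) and C(w_{n-2}, w_{n-1}, w_n)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        uni.update(words)
        bi.update(zip(words, words[1:]))
        tri.update(zip(words, words[1:], words[2:]))
    return uni, bi, tri

uni, bi, tri = ngram_counts(["the united states of America",
                             "the united nations"])
print(tri[("the", "united", "states")])   # 1
print(bi[("the", "united")])              # 2
```

From these counts, the maximum-likelihood estimates on the next slide are simple ratios, e.g. tri[(u, v, w)] / bi[(u, v)].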

Estimating the n-gram probabilities (continued)

Based on the counts of occurrence of the word sequences, the maximum likelihood estimates of the word probabilities are calculated, e.g. for the trigram:

P(w_n | w_{n-2}, w_{n-1}) = C(w_{n-2}, w_{n-1}, w_n) / C(w_{n-2}, w_{n-1})

n-grams in the decoding process

The goal of the search is to find the most likely string of symbols (phonemes, words, etc.) to account for the observed speech waveform. Connected-word example: (figure).

Connected-word log-Viterbi search / Beam search revisited

At each node we must compute the log-Viterbi score, now including the language model term on word-entry transitions:

φ_j(t) = max_i [ φ_i(t-1) + log a_ij + s · l_ij + p ] + log b_j(o_t)

where l_ij is the log language model score for the transition (applied where the transition enters a new word), s is the grammar scale factor, and p is the (log) word transition penalty.

Language model in the search

- The language model scores are applied at the point where there is a transition INTO a word.
- As the number of words increases, the number of states and interconnections increases too.
- N-grams are easier to incorporate into the token passing algorithm: s := s + g · log P(w1) + p. The language model score is added to the path score upon word entry, so the token keeps the combined acoustic and language model information. (Note: here g is the grammar scale factor, since s was used to denote the path score.)

Lyrics recognition from singing

Example phone-string comparisons:
- Y EH S T ER D EY vs. Y EH S. T AH D EY
- M AY. M AY vs. M AA M AH
- AO L. DH AH. W EY vs. AO L. AH W EY
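A tiny sketch of the word-entry update s := s + g · log P(w1) + p described above; the numeric values chosen for the grammar scale factor g and the word transition penalty p are illustrative assumptions.

```python
import math

def enter_word(path_score, word_lm_prob, g=10.0, p=-5.0):
    """Update a token's path score when it crosses into a new word:
    s := s + g * log P(word) + p  (grammar scale g, log word penalty p)."""
    return path_score + g * math.log(word_lm_prob) + p

# Example: entering a word with language model probability 0.01
print(enter_word(-120.0, 0.01))   # -120 + 10*log(0.01) - 5  ≈ -171.05
```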