Specialization Module. Speech Technology. Timo Baumann


Specialization Module: Speech Technology. Timo Baumann, baumann@informatik.uni-hamburg.de. Universität Hamburg, Department of Informatics, Natural Language Systems Group

Speech Recognition

The Chain Model of Communication. [Diagram: the speaker's linguistic representation is encoded as a speech sound, reaches the listener as a sensory impression, and is decoded back into a linguistic representation.] Derived from: Pétursson/Neppert: Elementarbuch der Phonetik, 1996.

Noisy-Channel Model. The speaker produces words (Word, Word, Word, ...); the listener receives the speech sound as a distorted sensory impression, because there is noise on the channel. The listener's question: what words were spoken, given the sensory impression? Ŵ = arg max_W P(W|O). C. Shannon, W. Weaver: The Mathematical Theory of Communication, 1949.

The Speech Recognition Task. Given a language L and a sensory impression (observation) O, a sequence of (MFCC) parameter vectors over sliding windows, we search for Ŵ in L such that Ŵ = arg max_W P(W|O): the most likely word sequence given the observation (maximum-likelihood principle). How do we determine P(W|O)? How do we organize the search?

Bayes' Rule. Given two events A and B: P(A|B) = P(B|A) · P(A) / P(B). Our formula Ŵ = arg max_W P(W|O) uses arg max, so the denominator P(B) does not matter and we can ignore it: P(A|B) ∝ P(B|A) · P(A).

The Speech Recognition Task (II). Ŵ = arg max_W P(W|O). Applying Bayes' rule, P(W|O) = P(O|W) · P(W) / P(O), so Ŵ = arg max_W P(O|W) · P(W). P(O|W) is the acoustic model, the observation likelihood given a word sequence: what do words sound like? P(W) is the language model, the a priori probability of word sequences: what word sequences are likely?
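The arg max argument can be checked numerically. A minimal sketch with invented probabilities for two candidate words (not from the lecture): dropping the constant denominator P(O) leaves the winning hypothesis unchanged.

```python
# Toy check that dropping the denominator P(O) does not change the arg max.
# All probabilities below are invented for illustration.
candidates = {
    "one": {"p_o_given_w": 0.02, "p_w": 0.6},  # P(O|W), P(W)
    "two": {"p_o_given_w": 0.05, "p_w": 0.3},
}

# P(O) = sum over W of P(O|W) * P(W)  (law of total probability)
p_o = sum(c["p_o_given_w"] * c["p_w"] for c in candidates.values())

# Full posterior P(W|O) via Bayes' rule, and the numerator on its own:
posterior = {w: c["p_o_given_w"] * c["p_w"] / p_o for w, c in candidates.items()}
numerator = {w: c["p_o_given_w"] * c["p_w"] for w, c in candidates.items()}

# Both score functions pick the same best hypothesis.
assert max(posterior, key=posterior.get) == max(numerator, key=numerator.get)
print(max(numerator, key=numerator.get))  # "two": 0.05 * 0.3 > 0.02 * 0.6
```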

Words or Phonemes? Acoustics primarily depend on phonemes, not on words. Words have an internal structure (cf. last week); this was disregarded in early approaches, e.g. for single-word recognition, and hence is almost always ignored in descriptions. Thus we should rather estimate P(O|Ph) instead of P(O|W), and we need an additional conversion step that relates words to phoneme sequences: P(Ph|W).

The Lexicon links acoustic and language models; thus we get: Ŵ = arg max_W P(O|Ph) · P(Ph|W) · P(W). Simple lexicons map each word to a phone sequence. Extensions: pronunciation variants for words; adapting the lexicon at runtime to the speaker's pronunciation (tempo, context, dialect, ...); rule-based grapheme-to-phoneme conversion (modelling phonological rules; may include weighted variants).

The Speech Recognition Task (III). Ŵ = arg max_W P(O|Ph) · P(Ph|W) · P(W). We'll discuss P(W) next week; the simplest form could be a list of possible sentences or a simple context-free grammar. We skip P(Ph|W) (it will be dealt with in one of the labs). The acoustic model P(O|Ph) assesses the observed speech signal with respect to a phoneme hypothesis. It describes the signal as a sequence of acoustic features O = (o_1, o_2, o_3, ..., o_tmax), with the o_i being feature vectors (e.g. MFCCs) based on short stretches of audio (previous lecture).

From Observations to Probabilities. Each phone model is associated with an acceptance function that maps an observation o_i to a probability, often based on Gaussian distributions with just two parameters, μ and σ. The probability can be computed from the observed value; since o_i could belong to any phone, we compute the distribution for all phones (e.g. [p], [t], [k]) and compare the resulting phone probabilities.
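A minimal sketch of such an acceptance function, with invented one-dimensional phone models (real models are multi-dimensional and trained from data):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """The 'acceptance function': density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Invented (mu, sigma) pairs for three phones; not real MFCC statistics.
phones = {"[p]": (1.0, 0.5), "[t]": (2.0, 0.4), "[k]": (3.0, 0.6)}

o_i = 1.8  # one observed feature value
scores = {ph: gaussian_pdf(o_i, mu, sigma) for ph, (mu, sigma) in phones.items()}
best = max(scores, key=scores.get)
print(best)  # "[t]": its mean is closest to the observation
```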

Phone Models. Usually a speech sound lasts longer than one observation, but how long exactly? We model this using transition probabilities; phone (states) differ in their likely duration. Transition probabilities + observation probabilities, plus the lexicon, plus the language model: Hidden Markov Models to the rescue!

Hidden Markov Models: a unifying model for the speech recognition process. Markov assumption: we can model the future without looking too far into the past; there is no need for the full history to predict the next observation, the present state is sufficient. We can construct a state graph where each state contains the full (relevant) history for determining the next state in the graph.

The Search Graph: built from the language model (here: S → "one" | "two"), the lexicon (one → /W AX N/, two → /T OO/), and the phone models. From: Walker et al., Sphinx-4: A Flexible Open Source Framework for SR, 2004.

The Search Graph transition probabilities from language model

The Search Graph expansion to sounds from the lexicon

The Search Graph acoustic model: transition probabilities (A) and emission/observation probabilities (B)

all we need to do is find the most likely path through the graph

Decoding: Searching the Graph. We are looking for the path in the graph that distributes the observations to (emitting) phone states while keeping costs at a minimum (equivalent to the highest probability).

Token-Pass Algorithm: Basic Idea. A time-synchronous search over the observations: at every point in time, keep a number of hypotheses, each represented by a token, and generate new tokens from old tokens in every step. The winner is the best token that reaches the final state in the end.

Token-Pass Algorithm: Basic Idea (cont.). Every token stores: the current state in the graph; the sum of costs incurred so far (possibly split into LM and AM costs); a link to the preceding token (necessary to recover the path).

Token-Pass Algorithm en détail. Start with an empty token in the initial state. For each observation: for all tokens, generate all successor tokens from the current state and add the costs (transition, observation); of all tokens that end up in the same state, keep only the best one. Principle of dynamic programming: the best path leading to a state is the only path that can be part of the globally best path.
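The steps above can be sketched as follows. The graph, probabilities, and observations are invented toy values; costs are negative log probabilities, so minimizing cost maximizes probability:

```python
import math

# Toy token-pass decoder over a linear 3-state graph; all probabilities
# and observations are invented.
def cost(p):
    return -math.log(p)

# trans[s] = [(next_state, transition_prob)]; self-loops model duration.
trans = {0: [(0, 0.6), (1, 0.4)], 1: [(1, 0.5), (2, 0.5)], 2: [(2, 1.0)]}
# emit[s][o] = observation probability of symbol o in state s.
emit = {
    0: {"a": 0.7, "b": 0.2, "c": 0.1},
    1: {"a": 0.1, "b": 0.8, "c": 0.1},
    2: {"a": 0.1, "b": 0.1, "c": 0.8},
}
observations = ["a", "a", "b", "c"]

# A token: (total_cost, state, preceding_token or None).
tokens = [(0.0, 0, None)]  # start with an empty token in the initial state

for o in observations:
    best = {}  # best new token per state (dynamic programming)
    for tok in tokens:
        c, state, _ = tok
        for nxt, p in trans[state]:
            nc = c + cost(p) + cost(emit[nxt][o])
            if nxt not in best or nc < best[nxt][0]:
                best[nxt] = (nc, nxt, tok)  # backpointer to the old token
    tokens = list(best.values())

# The winner: the best token that reaches the final state (state 2).
winner = min((t for t in tokens if t[1] == 2), key=lambda t: t[0])

# Recover the state path from the backpointer chain.
path, t = [], winner
while t is not None:
    path.append(t[1])
    t = t[2]
path.reverse()
print(path)  # [0, 0, 0, 1, 2]: start state, then one state per observation
```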

Token-Pass Algorithm. Initialization: put a token into the initial state. Then: find the next tokens (forward to the next emitting state), add the transition costs for the edges, and add the emission/acceptance cost of the observation.

Token-Pass Algorithm: Multiple Tokens in the Same State. Different alignments of observations may lead to the same state; only the best path needs to be kept, since all others cannot be on the best final path.


Limiting the Search. The search graph may become very large. Remedies: dynamically expand the search graph during recognition (only expand where hypotheses are likely); purge unlikely hypotheses; make the graph more compact by sharing common prefixes. [Figure: a pronunciation prefix tree sharing common phone prefixes, e.g. for the German words Ferse, Verse, fern, fertig.]

Token-Pass Algorithm: Extensions. Sort the tokens by cost in every step and prune the list to a maximum of N tokens; at every time step keep only tokens that are 'good' relative to the best token. This reduces the search space but may result in a non-optimal path. It is not necessary to operate time-synchronously; one could e.g. also use A* search, at the price of more administrative complexity when using a dynamic search graph, LexTree, triphones, ...
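The two pruning strategies just described might be sketched like this, using invented token costs (lower cost, i.e. higher probability, is better):

```python
# Sketch of absolute (histogram) and relative (beam) pruning;
# token costs and state names are invented.
tokens = [(3.2, "s1"), (3.4, "s2"), (9.0, "s3"), (4.1, "s4"), (12.5, "s5")]

MAX_TOKENS = 3       # absolute limit: keep at most N tokens per step
RELATIVE_BEAM = 0.5  # keep tokens within this cost of the best token

tokens.sort(key=lambda t: t[0])
tokens = tokens[:MAX_TOKENS]  # histogram pruning to N tokens
best_cost = tokens[0][0]
tokens = [t for t in tokens if t[0] - best_cost <= RELATIVE_BEAM]  # beam pruning

print([name for _, name in tokens])  # ['s1', 's2']
```

Both prunings are heuristics: a token discarded here might still have been on the globally best path, which is why the result can be non-optimal.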

Training the HMM Parameters: Baum-Welch Algorithm. Computing the Gaussian μ and σ is straightforward from training data... if we know the phoneme/state boundaries beforehand. In practice we only have texts and the corresponding audio. 1) Turn the text into a phoneme/state sequence. 2) Split the audio into as many parts as there are states in the sequence. 3) Estimate the parameters based on these state boundaries. 4) Use the parameters to re-align the state boundaries. 5) Go to 3) until convergence.
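A much-simplified sketch of this training loop on invented one-dimensional data, using hard Viterbi-style realignment of state boundaries rather than full Baum-Welch soft counts (the monotone dynamic-programming split stands in for step 4):

```python
import math

# Hard-realignment training sketch; all data are invented toy values.
obs = [1.1, 0.9, 1.0, 4.8, 5.2, 5.0, 5.1, 9.0, 8.8]  # stand-in for audio features
n_states = 3  # length of the phoneme/state sequence from step 1

def estimate(segments):
    """Step 3: per-state mean/stddev from the current state boundaries."""
    params = []
    for seg in segments:  # note: empty segments are not handled in this sketch
        mu = sum(seg) / len(seg)
        var = sum((x - mu) ** 2 for x in seg) / len(seg)
        params.append((mu, max(var, 1e-3) ** 0.5))
    return params

def realign(params):
    """Step 4: best monotone split of obs into n_states segments (DP)."""
    def nll(x, mu, sigma):  # negative log-likelihood, constants dropped
        return 0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma)
    INF = float("inf")
    # best[t][s] = (cost of aligning obs[:t+1] ending in state s, state at t-1)
    best = [[(INF, None)] * n_states for _ in obs]
    best[0][0] = (nll(obs[0], *params[0]), None)
    for t in range(1, len(obs)):
        for s in range(n_states):
            stay = best[t - 1][s][0]
            enter = best[t - 1][s - 1][0] if s > 0 else INF
            prev = s if stay <= enter else s - 1
            best[t][s] = (min(stay, enter) + nll(obs[t], *params[s]), prev)
    states, s = [], n_states - 1  # trace back from the final state
    for t in range(len(obs) - 1, -1, -1):
        states.append(s)
        if best[t][s][1] is not None:
            s = best[t][s][1]
    states.reverse()
    return [[o for o, st in zip(obs, states) if st == s] for s in range(n_states)]

# Step 2: uniform initial split; steps 3-5: iterate estimate/realign.
k = len(obs) // n_states
segments = [obs[i * k:(i + 1) * k] for i in range(n_states - 1)]
segments.append(obs[(n_states - 1) * k:])
for _ in range(5):  # fixed iteration count instead of a convergence test
    segments = realign(estimate(segments))

print([len(seg) for seg in segments])  # segment lengths after realignment
```

On this toy data the uniform split [3, 3, 3] is corrected to [3, 4, 2], matching the three value clusters; real Baum-Welch additionally replaces the hard split by expected (fractional) state occupancies.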

Phone Models (II). Reality is slightly more complex: the observation vector is multi-dimensional, so we use multi-dimensional Gaussians. There are usually three states per phone (transition / stable phase / next transition): more states. Phone context shapes the acoustics, so we use triphone contexts: even more states. The probability distribution is not necessarily Gaussian in practice; complex distributions can be modelled by mixing multiple Gaussians: more parameters per state. Drawback: many parameters need to be estimated during training. Remedy: share mixtures between some phonemes (the sharing strategy is determined from the training data).
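A Gaussian mixture is just a weighted sum of Gaussians; the (weight, μ, σ) components below are invented for illustration:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# A mixture: weighted sum of Gaussians, with weights summing to 1.
# The (weight, mu, sigma) components are made up for illustration.
mixture = [(0.5, 0.0, 1.0), (0.3, 3.0, 0.5), (0.2, 6.0, 2.0)]

def gmm_pdf(x):
    return sum(w * gaussian_pdf(x, mu, sigma) for w, mu, sigma in mixture)

# Unlike a single Gaussian, the mixture can have several modes:
print(round(gmm_pdf(0.0), 3), round(gmm_pdf(3.0), 3))  # 0.2 0.255
```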

Sphinx-4: A Flexible Open Source Framework for Speech Recognition. The speech signal is parameterized into an observation vector every 10 ms. The SearchGraph, built from P(O|Ph), the word-to-phoneme mapping (W → Ph), and P(W), is an interface and allows all sorts of graph layouts. Decoding computes Ŵ = arg max_W P(W|O) with the token-pass algorithm. Walker et al., Sphinx-4: A Flexible Open Source Framework for SR, 2004.

Summary. Noisy-channel model. Problem: Ŵ = arg max_W P(W|O). Solution: Ŵ = arg max_W P(O|Ph) · P(Ph|W) · P(W). P(W): word sequence model (n-gram, (weighted) grammar). P(Ph|W): pronunciation model (e.g. table lookup, rules, ...). P(O|Ph): allophone model (Hidden Markov Models). Search problem: time-synchronous search, dynamic programming, token-pass algorithm; the idea of Baum-Welch training.

Thank you. baumann@informatik.uni-hamburg.de https://nats-www.informatik.uni-hamburg.de/slp16 Universität Hamburg, Department of Informatics Natural Language Systems Group

Further Reading. Speech recognition in general: D. Jurafsky & J. Martin (2009): Speech and Language Processing. Pearson International. InfBib: A JUR 4204x. Token-pass algorithm: Young, Russell, Thornton (1989): Token Passing: A Simple Conceptual Model for Connected Speech Recognition Systems, Tech. Rep. CUED/F-INFENG/TR, Cambridge University. The Sphinx-4 speech recognizer: Walker et al. (2004): Sphinx-4: A Flexible Open Source Framework for Speech Recognition, Tech. Rep. SMLI TR2004-0811, Sun Microsystems.

Notes

Desired Learning Outcomes: understand the optimization target of speech recognition and see its implications from a whole-system perspective; know and understand the details of the basic speech decoding algorithm based on token passing, and be able to discuss its properties.