Fundamentals of Automatic Speech Recognition

Fundamentals of Automatic Speech Recognition
Britta Wrede, Gernot A. Fink
Applied Computer Science Group, Bielefeld University
July 2005

Overview
- Introduction: Why use speech recognition?
- Statistical Speech Recognition: General Framework
- Feature Extraction: Short-Time Analysis
- Acoustic Modeling: Hidden Markov Models
- Language Modeling: n-gram Models
- Summary

Motivation

Application areas for Automatic Speech Recognition (ASR):
- telephone-based information systems
- dictating machines
- control of machines, e.g. for medical applications (operating room)
- long-term vision: interaction with robots

Related application areas: speech therapy, (second) language acquisition

Introduction

Why automatic speech recognition? Spoken speech is:
- a natural method of interaction for humans
- an important modality in human-human communication
- efficient and easy to use
- ...and requires little or no additional training

Why is speech recognition difficult?

Complexity:
- high data rate (16,000+ samples/second, 100+ words/minute)
- large inventory of units (~50 phones, 100,000+ words)

Variability:
- production of sounds is influenced by context (coarticulation/assimilation)
- between different speakers; however: even for a single speaker! (speaker-dependent vs. independent)
- due to speaking style (controlled, formal, spontaneous)
- with respect to recording environment/equipment (close-talking microphone, quiet office room, driving car, ...)

Continuity:
- no segment boundaries are present between phones or words (isolated word recognition vs. continuous speech recognition)

Application Areas

Voice command systems / number recognition — error rate < 5%
  e.g. in cars, for telephony-based services (small vocabulary of 2-100 words, speaker-independent, isolated words / short, well-defined phrases, robust to noise)

Dictation systems — error rate 5-10%
  e.g. for physicians or lawyers, also private users (large vocabulary of 10,000-100,000 words, speaker-dependent, controlled speech, sensitive to noise)

Research systems — error rate 15-50%
  (average to large vocabulary of 3,000-20,000 words, speaker-independent, spontaneous speech, adaptive)

Model of Speech Production & Recognition

Theory: Channel Model

LINGUISTIC SOURCE (text production: word sequence w) -> ACOUSTIC CHANNEL (articulation, feature extraction: feature sequence X) -> SPEECH RECOGNITION (model decoding: ŵ = argmax_w P(w) P(X|w))

Two components: acoustic model P(X|w) & language model P(w)
Assumption: strong relation between articulation and acoustics
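The channel-model decision rule ŵ = argmax_w P(w) P(X|w) can be sketched directly; this is a minimal illustration, and all probability values below are invented:

```python
# Noisy-channel decoding sketch: pick the word sequence w that
# maximizes P(X | w) * P(w). All probabilities are made up.

import math

acoustic_model = {"I eat dinner": 0.010, "I heat dinner": 0.012}   # P(X | w)
language_model = {"I eat dinner": 0.600, "I heat dinner": 0.050}   # P(w)

def decode(hypotheses):
    # Score in log space to avoid underflow with long sentences.
    def score(w):
        return math.log(acoustic_model[w]) + math.log(language_model[w])
    return max(hypotheses, key=score)

best = decode(["I eat dinner", "I heat dinner"])
print(best)  # -> "I eat dinner": the language model outweighs the acoustics
```

Here the acoustic model slightly prefers the wrong hypothesis, but the language model prior turns the decision around — exactly the motivation for language modeling later in the talk.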

Modeling for Speech Recognition

Feature Extraction: description of relevant characteristics of the signal
  -> short-time analysis (Mel-Cepstrum)

Acoustic Modeling: description of acoustic units, e.g. speech sounds, words
  -> Hidden Markov Models (statistical pattern matching)

Language Modeling: restriction of potential word sequences using e.g.
  - formal grammars: valid vs. invalid
  - stochastic grammars: likely ... unlikely vs. invalid
  - purely statistical: calculation of P(w) -> n-gram models

Feature Extraction: Short-Time Analysis

Parametric representation of short speech segments (approx. 10-30 ms)
Assumption: characteristic (= spectral?) features are stationary within segments

Most widely used method: spectral analysis -> Mel-Cepstrum
- warping of the frequency axis similar to human hearing (filter bank)
- separation of coarse and fine structure of the log-power spectrum

Pipeline: signal -> DFT -> Mel filter bank -> log -> DCT -> Mel-Cepstrum

Dynamic features: capture spectral variations by calculating time derivatives
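The signal -> DFT -> Mel -> log -> DCT pipeline can be sketched as follows. Frame length, filter count, and number of coefficients are illustrative assumptions, not the settings of any particular system:

```python
# Minimal MFCC-style pipeline sketch (DFT -> mel filter bank -> log -> DCT)
# following the diagram above. Parameters are illustrative choices.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the mel scale (human-hearing warping).
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sample_rate, n_filters=26, n_ceps=12):
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2   # DFT
    mel_energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_mel = np.log(mel_energies + 1e-10)                        # log
    # DCT-II keeps the coarse spectral structure in the low coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1)) / (2 * n_filters))
    return dct @ log_mel                                          # Mel-Cepstrum

# One 25 ms frame of a 440 Hz tone sampled at 16 kHz
t = np.arange(400) / 16000.0
coeffs = mfcc_frame(np.sin(2 * np.pi * 440 * t), 16000)
print(coeffs.shape)  # (12,)
```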

Feature Extraction: Static Features

[Figure: speech signal -> spectrum -> cepstrum, with coarse structure (Grobstruktur) and fine structure (Feinstruktur) shown separately]

- windowing of the signal (10-30 ms)
- computation of the cepstrum, containing:
  - coarse spectral structure (slope, formants)
  - spectral fine structure (jitter, shimmer, harmonics)
- removal of the spectral fine structure

Feature Extraction: Dynamic Features

Dynamic features:
- contain acoustic changes (e.g. of formants) and thus articulatory movements over time
- are computed as 1st and 2nd order derivatives over time

Summary: Feature Extraction

Every 10 ms a 39-dimensional feature vector is computed:
- 12 static MFCCs + 1 energy
- 13 first-order derivatives
- 13 second-order derivatives
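A minimal sketch of assembling the 39-dimensional vector, assuming simple frame-to-frame differences for the derivatives (real systems typically compute the derivatives by regression over a window of frames):

```python
# 13 static features (12 MFCCs + energy) per 10 ms frame, extended with
# first- and second-order time derivatives -> 39 dimensions per frame.

import numpy as np

def add_dynamic_features(static):              # static: (n_frames, 13)
    delta = np.gradient(static, axis=0)        # 1st-order derivative
    delta2 = np.gradient(delta, axis=0)        # 2nd-order derivative
    return np.hstack([static, delta, delta2])  # (n_frames, 39)

frames = np.random.randn(100, 13)              # one frame every 10 ms
features = add_dynamic_features(frames)
print(features.shape)  # (100, 39)
```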

Hidden Markov Models (HMM)

What units should be modelled? Phonemes, syllables, words, ...
- Phonemes are too variable due to coarticulation
- Triphones = phonemes in context: capture coarticulation while keeping the non-variable information of the phoneme

Example:
  Grapheme   Phonemes   Triphones
  Fisch      /f I S/    #/f/I  f/I/S  I/S/#
  Kit        /k I t/    #/k/I  k/I/t  I/t/#

Acoustic Modeling: Sub-Word Units

- models for complete words (i.e. inflected forms) can generally not be used -> smaller sub-word units
- models for speech sounds ("phoneme models"): usually linear models, 3-6 states for the phases
- models for groups of sounds (e.g. for syllables or words)
- context-dependent (phoneme) models: usually tri-phones, e.g. p/i/t in /spits/; very flexible, can easily be combined
- trainability: generalization necessary!
- speech pauses also need to be modeled!

[Figure: model topologies — linear, Bakis, ergodic]

Acoustic Modeling: Model Structure

Goal: segmentation
- segmentation units = words, represented as sequences of phoneme models (i.e. states)
- lexicon = set of words to recognize (also: phonetic prefix-tree)
- utterance = arbitrary sequence of words from the lexicon
- decoding the model produces the segmentation (i.e. determining the optimal state/model sequence)

Hidden Markov Models (HMM)

How should units be modelled? -> HMMs
- an HMM consists of states and transitions
- each state describes a (hopefully) stationary phase of a phoneme
- emission probabilities describe the acoustic features of this phase
- transition probabilities describe the temporal structure of the phoneme

[Figure: three-state phoneme model — persevering coarticulation, stationary phase, anticipatory coarticulation]

Hidden Markov Models

How can emission and transition probabilities be estimated?

[Figure: feature vectors of an utterance aligned to the triphone models #/j/a, j/a/u, a/u/#]

- an initial segmentation of the training data into phonemes is needed
- assignment of speech samples (= feature vectors) to triphone states
- computation of statistical parameters from the feature vectors (e.g. mean, variance)

Hidden Markov Models: Formal Description

A 1st-order Hidden Markov Model λ is defined by:
- a finite set of states {s_1, ..., s_N}
- a matrix of state transition probabilities A = {a_ij | a_ij = P(s_t = j | s_{t-1} = i)}
- a vector of initial state probabilities π = {π_i | π_i = P(s_1 = i)}
- state-specific emission probability distributions {b_j(O_t) | b_j(O_t) = p(O_t | s_t = j)}

[Figure: observations for the triphone j/a/u — three states i, j, k with transitions a_ij, a_jk and emission densities b_i, b_j, b_k]
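The definition above maps directly onto a small discrete HMM; the forward algorithm below computes the production probability P(O | λ) used on the following slides. All parameter values are invented for illustration:

```python
# Tiny discrete HMM matching the formal definition: transition matrix A,
# initial vector pi, and emission distributions B. Values are made up.

A  = [[0.6, 0.4],        # a_ij = P(s_t = j | s_{t-1} = i)
      [0.0, 1.0]]
pi = [1.0, 0.0]          # pi_i = P(s_1 = i)
B  = [[0.8, 0.2],        # b_j(o): rows = states, columns = symbols
      [0.3, 0.7]]

def forward(obs):
    """Production probability P(O | lambda) via the forward algorithm."""
    alpha = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(pi))) * B[j][o]
                 for j in range(len(pi))]
    return sum(alpha)

print(forward([0, 0, 1]))  # 0.2208
```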

Hidden Markov Models

How can HMMs be applied for pattern recognition?

Assumption: patterns (e.g. speech signals) are generated by a stochastic model with principally equivalent behavior!

- Evaluation: determining the quality of the modeling -> calculate the production probability P(O | λ)
- Decoding: uncovering the internal structure of the model (= recognition) -> determine the optimal state sequence s* = argmax_s P(O, s | λ)
- Training: creating the optimal model -> improve a given model λ so that P(O | λ̂) ≥ P(O | λ)

Hidden Markov Models: Other Applications

- recognition of phoneme quality, e.g. for language acquisition: how well does the spoken utterance match the target utterance?
- visualisation of articulatory features in a spoken utterance
- could also be used for intonation recognition and emotion recognition

Hidden Markov Models: Summary

+ parameters can be estimated automatically from training samples (e.g. pre-recorded utterances)
+ models capture a substantial amount of variation in realization and duration
- for robust, large-vocabulary, speaker-independent systems, considerable amounts of training data are necessary (several hours of speech data)
- model configurations have to be specified by experts (i.e. number of mixture densities and model states, type and structure of sub-word units)

Overview

LINGUISTIC SOURCE (text production: w) -> ACOUSTIC CHANNEL (articulation, feature extraction: X) -> SPEECH RECOGNITION (model decoding: ŵ = argmax_w P(w) P(X|w))

Why Language Modeling?

Typical speech recognition problems (each sentence contains a recognition error):
- They are leaving in about fifteen minuets to go to her house.
- The study was conducted mainly be John Black.
- The design an construction of the system will take more than a year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave me a message?
- I need to notified the bank of this problem.
- He is trying to fine out.

Why Language Modeling?

- acoustic cues alone do not convey enough information
- human performance on speech recognition for an unknown language is also not good

Use other information sources: knowledge about which words are likely to occur together
Statistical solution: N-gram models

What are N-gram models?

Example bi-grams for "I want to eat dinner":

  <S> I    .25 | I want  .32 | want to   .65 | to eat   .26 | eat dinner .60
  <S> I'd  .06 | I would .29 | want a    .05 | to have  .14 | eat lunch
  <S> Tell .04 | I don't .08 | want some .04 | to spend .09 | eat some   .01
  <S> I'm  .02 | I have  .04 | want thai .01 | to be    .02 | eat a

N-grams (predicting "dinner"):
  Uni-gram:           dinner  -> "dinner"
  Bi-gram:   W1       dinner  -> "eat dinner"
  Tri-gram:  W2 W1    dinner  -> "to eat dinner"
  4-gram:    W3 W2 W1 dinner  -> "want to eat dinner"

Statistical Language Models

How to estimate N-grams:
- select a corpus that represents your application area
- for every word in the lexicon, count its occurrences in a bi-gram context, e.g. for "eat":

    Bi-gram      count   p(* | eat)
    eat on         16      .49
    eat some        6      .18
    eat lunch       6      .18
    eat dinner      5      .15

- compute the probabilities p(W2 | W1)
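The counting procedure above can be sketched in a few lines; the toy corpus below is invented for illustration:

```python
# Bi-gram estimation sketch: count word pairs in a corpus and compute
# p(w2 | w1) by relative frequency. The corpus is made up.

from collections import Counter

corpus = [
    "<S> I want to eat dinner".split(),
    "<S> I want to eat lunch".split(),
    "<S> I would like to eat dinner".split(),
]

bigrams = Counter()
contexts = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        bigrams[(w1, w2)] += 1
        contexts[w1] += 1          # occurrences of w1 as left context

def p(w2, w1):
    """Relative-frequency estimate of p(w2 | w1)."""
    return bigrams[(w1, w2)] / contexts[w1]

print(p("dinner", "eat"))  # 2 of the 3 "eat" bi-grams -> 0.666...
```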

Overview

LINGUISTIC SOURCE (text production: w) -> ACOUSTIC CHANNEL (articulation, feature extraction: X) -> SPEECH RECOGNITION (model decoding: ŵ = argmax_w P(w) P(X|w))

ESMERALDA: System Architecture

Processing chain: feature extraction -> codebook evaluation -> integrated path search -> best word chain

Knowledge sources:
- psycho-acoustic knowledge -> vector quantisation
- HMM training -> acoustic models
- language model design (statistical, e.g. P(z | x y), or grammar rules, e.g. S -> NP VP, NP -> N) -> linguistic knowledge
- heuristic methods -> path search

Integrated Parsing and Recognition

Goal:
- use a declarative grammar as a language model (especially useful for artificial domains with limited or no training data)
- apply grammatical restrictions robustly

Problems:
- grammar decisions are binary: valid vs. invalid utterance
- grammars decide about complete sentences

Solutions:
- use penalty scores for ungrammatical input
- allow for partial parses, i.e. phrases or constituents

Integration of Speech Recognition & Understanding

Speech understanding: grammar, linguistic and pragmatic knowledge
Speech recognition: statistical language model P(w), acoustic model
Open question: how to couple the two?

Open Challenges for ASR

- open vocabulary (understanding of unknown words)
- ASR in noisy environments
- closer coupling with speech understanding and dialog context
- gathering more information from the speech signal that may be important:
  - prosodic information (F0, speech rate, articulation style, ...)
  - emotional state

References

Phonetics:
- Clark, John & Yallop, Colin (1995). An Introduction to Phonetics and Phonology. Oxford: Blackwell (Blackwell Textbooks in Linguistics, 9).

ASR and Language Modeling:
- Huang, Xuedong, Acero, Alex & Hon, Hsiao-Wuen (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.
- Jurafsky, Dan & Martin, James (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall.