Acoustic Modelling I + II - 1 Acoustic Modeling Part 2 June 18, 2013

Acoustic Modelling I + II - 2 Outline
Part 1:
- Discrete versus Continuous HMMs
- Parameter Tying
- Pronunciation Variants
- Speech Units
- Context Dependent Acoustic Modeling
- Practical Issues with HMMs
Part 2:
- Bottom-Up vs. Top-Down Clustering
- Distances Between Model Clusters
- Clustering of Contexts
- Problems with Vocabulary Dependence
- Clustering with Decision Trees
- Current Issues in Acoustic Modeling

Acoustic Modelling I + II - 3 Bottom-Up vs. Top-Down Clustering
There are two different approaches to clustering:
- Bottom-up clustering (agglomerative): look for a good combination of two classes into one
- Top-down clustering (divisive): look for a good separation of a class into two subclasses
Both approaches result in a clustering tree.

Acoustic Modelling I + II - 4 Clustering of Contexts (1)
First idea for context tying: unsupervised clustering (bottom-up):
1. Start with classes C_i = { phone_i }
2. Compare all class pairs C_i and C_j (j > i)
3. If C_i and C_j are "similar enough", replace C_i with C_i + C_j and remove C_j
4. Continue until satisfied.
A minimal sketch of this merge loop follows.
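The sketch below is illustrative only: the distance and merge functions are pluggable placeholders (the weighted entropy distance introduced later in this lecture is one natural choice), and the toy demo uses 1-D "models" just to show the control flow.

```python
# Minimal sketch of bottom-up (agglomerative) clustering of context
# classes. `distance` and `merge` are pluggable placeholders.

def bottom_up_cluster(classes, distance, merge, threshold):
    """Greedily merge the closest pair of classes until no pair is
    "similar enough" (distance below `threshold`)."""
    classes = dict(classes)                     # class id -> model
    while len(classes) > 1:
        pairs = [(i, j) for i in classes for j in classes if i < j]
        i, j = min(pairs, key=lambda p: distance(classes[p[0]], classes[p[1]]))
        if distance(classes[i], classes[j]) > threshold:
            break                               # no similar pair left
        classes[i] = merge(classes[i], classes[j])   # C_i := C_i + C_j
        del classes[j]                               # remove C_j
    return classes

# Toy demo: classes are 1-D "models" (means); merge by averaging.
print(bottom_up_cluster({0: 1.0, 1: 1.1, 2: 5.0},
                        distance=lambda a, b: abs(a - b),
                        merge=lambda a, b: (a + b) / 2,
                        threshold=0.5))
# -> {0: 1.05, 2: 5.0}
```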

Acoustic Modelling I + II - 5 Clustering of Contexts (2)
Second idea for context tying: unsupervised clustering (top-down):
1. Start with class C_0 = { context_1, context_2, ..., context_n }
2. Consider all possible splits of every class C_i into two subclasses
3. If splitting C_i turns out to be a good idea, replace C_i with its two subclasses
4. Continue with step 2 until satisfied
Big problem: if we start with n different contexts of the same phoneme, there are on the order of 2^n possible separations! Most real-world cases have hundreds of contexts, which makes exhaustive top-down splitting infeasible.

Acoustic Modelling I + II - 6 What Makes a Good Split / Merge?
Use a distance measure between model clusters to:
- Decide whether a class separation or combination is "good"
- Find out which separation/combination is best

Acoustic Modelling I + II - 7 Distances Between Model Clusters
Continuous parametric models: e.g. the overlap d(f,g) = ∫ min(f(x), g(x)) dx, or Kullback-Leibler distances, among others:
KL(f,g) = Σ_i f(x_i) log( f(x_i) / g(x_i) )
In general (but typically for nonparametric models), the entropy distance:
d(f,g) = H(f+g) - 1/2 H(f) - 1/2 H(g),
where H(f) is the entropy of the distribution f, and f+g is the combined model.
Combining two models means losing parameters and thus losing information; the entropy distance measures this loss of information.
Goal: lose as little information as possible, i.e. minimize d.

Acoustic Modelling I + II - 8 Discrete Entropy Distance
Remember: semi-continuous and discrete HMMs are represented by discrete distributions. For a discrete distribution f[i], the entropy is:
H(f) = -Σ_i f[i] log2 f[i] = Σ_i f[i] log2(1/f[i])
d(f,g) = H(f+g) - 1/2 H(f) - 1/2 H(g)
Obvious:
- If f = g then H(f) = H(g) = H(f+g), thus d(f,g) = 0.0
- If f = { 1, 0 } and g = { 0, 1 } then H(f) = H(g) = 0, H(f+g) = 1, so d(f,g) = 1.0
Example: f = { 1/2, 1/2 }, g = { 3/4, 1/4 }, f+g = { 5/8, 3/8 }
H(f) = 1/2 log2(2) + 1/2 log2(2) = 1.0
H(g) = 3/4 log2(4/3) + 1/4 log2(4) = 0.811
H(f+g) = 5/8 log2(8/5) + 3/8 log2(8/3) = 0.954
d(f,g) = 0.954 - 0.5 · 1.0 - 0.5 · 0.811 = 0.049
A small sketch reproducing this example follows.
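A minimal sketch of this distance, assuming f and g are plain probability vectors and that the combined model f+g is their unweighted average, as in the slide's example:

```python
import math

def entropy(f):
    """H(f) = -sum_i f[i] log2 f[i], with 0 * log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in f if p > 0)

def entropy_distance(f, g):
    """d(f,g) = H(f+g) - 1/2 H(f) - 1/2 H(g)."""
    combined = [(pf + pg) / 2 for pf, pg in zip(f, g)]  # f+g = {5/8, 3/8} here
    return entropy(combined) - 0.5 * entropy(f) - 0.5 * entropy(g)

print(round(entropy_distance([0.5, 0.5], [0.75, 0.25]), 3))  # 0.049, as on the slide
```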

Acoustic Modelling I + II - 9 Weighted Discrete Entropy Distance
Problem: speech examples are not equally distributed among models (some (poly-)phones are more frequent than others):
- M_1: model trained with many examples (= robust)
- M_f: model trained with few examples (= unreliable)
- M_f+: model trained with few, but more examples than M_f
Combining M_1 with M_f should have a smaller impact on the distance than combining M_1 with M_f+.
Solution: weight the model entropies by the number of training samples, so the commonly used entropy distance is:
d(f,g) = (n_f + n_g) H(f+g) - n_f H(f) - n_g H(g)
A sketch of this weighted variant follows.
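The same idea with sample-count weights, as a sketch that reuses entropy() from the previous block; it assumes n_f and n_g are the training sample counts and that f+g is the count-weighted mixture of the two distributions:

```python
def weighted_entropy_distance(f, n_f, g, n_g):
    """d(f,g) = (n_f + n_g) H(f+g) - n_f H(f) - n_g H(g)."""
    total = n_f + n_g
    # f+g: count-weighted mixture of the two discrete distributions
    combined = [(n_f * pf + n_g * pg) / total for pf, pg in zip(f, g)]
    return total * entropy(combined) - n_f * entropy(f) - n_g * entropy(g)
```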

Acoustic Modelling I + II - 10 Context Clustering after Kai-Fu Lee
1. Train semi-continuous models for all three states of each triphone, e.g. triphone T(AE,K) = T(AE,K)-b T(AE,K)-m T(AE,K)-e
2. Initialize a context class for every triphone (a class is defined by three distributions, e.g. T_17-b, T_17-m, T_17-e)
3. Compute all distances between different context classes of the same phone: d(C_i, C_j) = E(C_i-b, C_j-b) + E(C_i-m, C_j-m) + E(C_i-e, C_j-e), where E is the weighted entropy distance
4. Replace the two classes with the smallest distance by their combination
5. Try to improve the distance by moving any element from any class into any other class
6. Continue with step 3 while the end criterion is not met
Note: this algorithm is completely data-driven. Step 5 is expensive but important. A sketch of the per-class distance in step 3 follows.
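A sketch of step 3, reusing weighted_entropy_distance() from above; the class layout is an assumption: each class is a list of (distribution, sample_count) pairs, one per HMM state (b, m, e):

```python
def class_distance(class_i, class_j):
    """d(C_i, C_j) = E(Ci-b, Cj-b) + E(Ci-m, Cj-m) + E(Ci-e, Cj-e),
    summing the weighted entropy distance over the b/m/e states."""
    return sum(weighted_entropy_distance(f, n_f, g, n_g)
               for (f, n_f), (g, n_g) in zip(class_i, class_j))
```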

Acoustic Modelling I + II - 11 Generalized Triphones vs. Senones
Kai-Fu Lee's algorithm produces generalized triphones. A better approach (M. Hwang) clusters the individual triphone states and produces generalized sub-triphone states (senones).

Acoustic Modelling I + II - 12 Problems with Vocabulary Dependencies
Example scenario: during training we have seen the phoneme P_1 in the contexts P_1(P_2,P_3), P_1(P_4,P_5), P_1(P_6,P_7), P_1(P_8,P_9). After clustering we have found the classes C_1 = { P_1(P_2,P_3), P_1(P_4,P_5) } and C_2 = { P_1(P_6,P_7), P_1(P_8,P_9) }.
During testing we would like to recognize a word with the phoneme sequence P_3 P_1 P_7.
Problem: do we use C_1 or C_2 to model the unseen context P_1(P_3,P_7)?

Acoustic Modelling I + II - 13 Clustering with Decision Trees
Approaches to achieve vocabulary independence:
1) If the test vocabulary contains an untrained context m(l,r), use the context-independent model m that was trained on all contexts
2) Use the model of a context class that is somehow "similar" to the unseen context
In general: if a context has not been seen during training, use some class further up in the hierarchy that was trained. To make a system independent of the vocabulary, we have to be able to find out into which context class an unseen context would have been clustered. This discourages purely data-driven clustering: at test time there is hardly any data for the unseen context, and we cannot tell where it would have landed.
Solution: build a decision tree that asks phonetic questions about the context.

Acoustic Modelling I + II - 14 Clustering with Decision Trees
(Slide shows an example decision tree.)
Clustering algorithm:
1. Initialize one cluster containing all contexts
2. For all clusters and all questions: compute the distance between the resulting subclusters
3. Perform the split that gets the largest distance (information gain)
4. Continue with step 2 until satisfied (e.g. desired number of clusters)
A sketch of this greedy loop follows.
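A minimal sketch of divisive decision-tree clustering under stated assumptions: `questions` is a list of predicates on a context (e.g. "is the left phone a vowel?"), and `split_gain` is a placeholder for the cluster distance (information gain) between the two subclusters a question induces:

```python
def grow_tree(contexts, questions, split_gain, n_leaves):
    leaves = [contexts]                       # start: one cluster with all contexts
    while len(leaves) < n_leaves:
        best = None                           # (gain, leaf index, question)
        for idx, leaf in enumerate(leaves):
            for q in questions:
                yes = [c for c in leaf if q(c)]
                no = [c for c in leaf if not q(c)]
                if yes and no:                # only proper splits
                    gain = split_gain(yes, no)
                    if best is None or gain > best[0]:
                        best = (gain, idx, q)
        if best is None:                      # no further split possible
            break
        _, idx, q = best                      # perform the best split
        leaf = leaves.pop(idx)
        leaves += [[c for c in leaf if q(c)], [c for c in leaf if not q(c)]]
    return leaves
```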

Acoustic Modelling I + II - 15 Stopping Criterion for Building Decision Trees
(Slide plot: typical optimal entropy distance d over the course of clustering.)
How many clusters (models) do we want?
- Standard answer to many questions about the number of parameters: "as many as our CPU / memory can stand"
- Some kind of intelligent guess (based on experience)
- Stop when the number of samples per cluster falls below a certain threshold
- Use a cross-validation set: separate the training data into 2 (or more) subsets A and B, and train models from A. When computing the distance between clusters C_1 and C_2: compute the likelihood P_1 of all data from B that belong to C_1, the likelihood P_2 of all data from B that belong to C_2, and P_1+2 of all data from B that belong to C_1 or C_2 using the combined class C_1+C_2; define the distance as (P_1 · P_2) / P_1+2. The likelihood gain of a split will not always be positive.
A log-domain sketch of this criterion follows.
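A sketch of the cross-validation criterion, assuming a placeholder loglik(model, data) that returns the log-likelihood of held-out data under a model. In the log domain, the ratio (P_1 · P_2) / P_1+2 becomes log P_1 + log P_2 - log P_1+2, and a split only helps if this gain is positive:

```python
def cv_split_gain(model_1, data_1, model_2, data_2, combined_model, loglik):
    log_p1 = loglik(model_1, data_1)                   # B-data assigned to C_1
    log_p2 = loglik(model_2, data_2)                   # B-data assigned to C_2
    log_p12 = loglik(combined_model, data_1 + data_2)  # same data, combined class
    return log_p1 + log_p2 - log_p12                   # > 0: split is worthwhile
```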

Acoustic Modelling I + II - 16 Growing the Decision Tree

Acoustic Modelling I + II - 17 Clustering with Decision Trees
Which model do we use for the unseen context A(G,S)? During training five contexts have been seen, and these were clustered into three clusters. To model the context A(G,S) we walk the tree:
- Is the left context of A a vowel? (-1 = vowel?) NO, G is not a vowel
- Is the right context of A a fricative? (+1 = fricative?) YES, S is a fricative
=> use model A-b(4).
A toy version of this walk is sketched below.
Problem: where do the questions come from?
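A toy walk through a question tree for the unseen context A(G,S), mirroring the slide; the exact node layout, the phone classes, and the other leaf names (A-b(2), A-b(7)) are assumptions for illustration:

```python
VOWELS = {"A", "E", "I", "O", "U"}
FRICATIVES = {"S", "F", "V", "Z"}

# Each inner node: (question, yes-subtree, no-subtree); leaves are model names.
tree = ("-1=vowel?",
        "A-b(2)",                                   # left context is a vowel
        ("+1=fricative?", "A-b(4)", "A-b(7)"))      # else ask about right context

def walk(node, left, right):
    while isinstance(node, tuple):
        question, yes, no = node
        if question == "-1=vowel?":
            node = yes if left in VOWELS else no
        else:  # "+1=fricative?"
            node = yes if right in FRICATIVES else no
    return node

print(walk(tree, "G", "S"))  # -> A-b(4), as on the slide
```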

Acoustic Modelling I + II - 18 Where Do the Questions Come From?
- Knowledge-based: an expert defines natural classes based on the IPA classification. Example list (Table 9.3 of [XH]): nasal: m n ng; velar: k g ng; labial: w m b p v; and 39 other classes
- Automatically learned classes (e.g. Rita Singh, CMU): provide phone names and feature properties, use an acoustic distance to cluster the features; these clusters become the questions for context clustering
- Random selection of questions (IBM): showed that the selection of questions is not critical
A toy question set in the knowledge-based style is sketched below.
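An illustrative question set in the knowledge-based style; the three classes are taken from the slide's example list, everything else is a small assumed excerpt rather than the full Table 9.3:

```python
QUESTIONS = {
    "nasal?":  {"m", "n", "ng"},
    "velar?":  {"k", "g", "ng"},
    "labial?": {"w", "m", "b", "p", "v"},
    # ... further phonetic classes
}

def ask(question, phone):
    """Answer a phonetic question about a context phone."""
    return phone in QUESTIONS[question]

print(ask("velar?", "g"))  # True
```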

Acoustic Modelling I + II - 19 Question Sets for Decision Trees
Problem: how to find good questions for the decision tree. Answer: does the question set really matter?
Study 1: impact of different question sets. IBM study: as long as the question set allows variable enough separation, there are no significant differences.
Study 2: impact of different trees based on the same question set. Siohan et al. (IBM): randomly choose among the top-n best splits instead of always selecting the best split, and construct an ensemble of ASR systems based on the different trees. The single ASR systems do not differ significantly in terms of WER, but combining the ASR results using ROVER gives a big gain: the systems make different errors!

Acoustic Modelling I + II - 20 Rapid Portability: Acoustic Models
(Slide diagram: phone set & speech data feed the acoustic model (AM); together with the lexicon (Lex, e.g. hi /h//ai/, you /j/u/, we /w//i/), the language model (LM), and NLP / MT and TTS components, the system turns input speech into output speech & text.)

Acoustic Modelling I + II - 21 Rapid Portability: Data (phone set & speech data)
Step 1: build a uniform multilingual database (GlobalPhone) and train monolingual acoustic models in many languages.

Acoustic Modelling I + II - 22 Multilingual Acoustic Modeling
Step 2: combine the monolingual acoustic models into a set of multilingual (ML), language-independent acoustic models.

Acoustic Modelling I + II - 23 Rapid Portability: Acoustic Models
Step 3: define a mapping between the ML set and the new language, and bootstrap the acoustic model of the unseen language.
(Same system diagram as on slide 20: input speech → AM, Lex, LM, NLP / MT, TTS → output speech & text.)

Acoustic Modelling I + II - 24 Universal Sound Inventory
Speech production is independent of the language (IPA):
1) IPA-based universal sound inventory
2) Each sound class is trained by data sharing
Reduction from 485 to 162 sound classes: m, n, s, l appear in all 12 languages; p, b, t, d, k, g, f and i, u, e, a, o in almost all.
Sharing strategies: ML-Sep, ML-Mix, ML-Tag

Acoustic Modelling I + II - 25 Polyphone Decision Tree Adaptation
(Slide figure: a German decision tree for the contexts of /k/ in Blaukraut, Brautkleid, Brotkorb, Weinkarte; questions such as "-1 = Plosiv?" and "+2 = Vokal?" split the contexts into the models k(0), k(1), k(2).)
Problem: the contexts of sounds are language-specific. How to train context-dependent models for new languages?
Solution:
1) Multilingual context decision trees
2) Specialize the decision tree by adaptation

Acoustic Modelling I + II - 26 Context Decision Trees
(Same /k/ context tree figure as on the previous slide.)
- Context-dependent phones (context width ±n: polyphone)
- Trainability vs. granularity
- Divisive clustering based on linguistically motivated questions
- One model is assigned to each context cluster
Multilingual case: should we ignore the language information? It depends on the application: yes, in the case of adaptation to new target languages.

Acoustic Modelling I + II - 27 Polyphone Coverage
(Figure: see Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 101.)

Acoustic Modelling I + II - 28 Rapid Language Adaptation
Model mapping to the target language:
1) Map the multilingual phonemes to Portuguese ones based on the IPA scheme
2) Copy the corresponding acoustic models in order to initialize the Portuguese models
Problem: contexts are highly language-specific. How to apply context-dependent models to a new target language?
Solution:
1) Train a multilingual polyphone decision tree
2) Specialize this tree to the target language using limited data (polyphone decision tree specialization, PDTS)

Acoustic Modelling I + II - 29 Polyphone Decision Tree Specialization (1) English Polyphone Tree

Acoustic Modelling I + II - 30 Polyphone Decision Tree Specialization (2) English Other languages

Acoustic Modelling I + II - 31 Polyphone Decision Tree Specialization (3) Multilingual Polyphone Tree

Acoustic Modelling I + II - 32 Polyphone Decision Tree Specialization (4) Polyphones found in Portuguese

Acoustic Modelling I + II - 33 Polyphone Decision Tree Specialization (5) 1. Tree Pruning: Select from all polyphones only the ones which are relevant for the particular language

Acoustic Modelling I + II - 34 Polyphone Decision Tree Specialization (6) 2. Tree regrowing: further specialize the tree according to the adaptation data

Acoustic Modelling I + II - 35 Rapid Portability: Acoustic Model
(Slide chart: word error rate [%] for a context-independent baseline (Ø Tree), the multilingual tree (ML-Tree), a Portuguese tree (Po-Tree), and PDTS, over amounts of adaptation data from 15 minutes to 16.5 hours; WER drops from 69.1% down to about 19%.)

Acoustic Modelling I + II - 36 Traverse and Analyze the Decision Tree

Acoustic Modelling I + II - 37 Current Research Issues in Acoustic Modeling
- Data collection, lack of transcripts ("There's no data like more data."). Example: training with 20 h of speech gives 13% WER, with 80 h of speech 9% WER; today up to 5,000 hours of audio material (with partly semi-automatically generated transcripts)
- Signal preprocessing (remove the unimportant, enhance the important)
- Training techniques (ML, MAP, discriminative training, ...)
- Parameter tying (what are the acoustic atoms?)
- Usage of memory and CPU resources
- Robustness (reduce the effect of disturbances)
- Adaptation (keep learning and improving while in use)
- Multilinguality (recognizers in many languages)

Acoustic Modelling I + II - 38 Current Research Issues in AM: Multilinguality
Language-independent recognizers (in analogy to speaker-independent ones).
Benefits:
- More training data for only slightly more parameters
- The same acoustic model can be trained with different languages: more robust?
- Language identification is included, no extra module necessary
- Rapid deployment of acoustic models to a new target language
- Allows code-switching (= language switch within a sentence)
Problems:
- What is a good common set of phonemes (speech units)?
- How to decide which speech units are similar across languages?
- How to fight the "smearing" effect (different appearances of the same model)?

Acoustic Modelling I + II - 39 Polyphone Types over Context Width for 9 Languages

Acoustic Modelling I + II - 40 Number of Polyphones
The number of polyphones depends on:
- The language
- The length of the context (triphones = ±1, quinphones = ±2, ...)
- The number of mono-phone types (may vary between 30 and 150)
- Phonotactics (consonant clusters, morae, ...)
- Morphology
- Word segmentation

Acoustic Modelling I + II - 41 Universal Sound Inventory (recap)
Speech production is independent of the language (IPA):
1) IPA-based universal sound inventory
2) Each sound class is trained by data sharing
Reduction from 485 to 162 sound classes: m, n, s, l appear in all 12 languages; p, b, t, d, k, g, f and i, u, e, a, o in almost all.
Sharing strategies: ML-Sep, ML-Mix, ML-Tag

Acoustic Modelling I + II - 42 Acoustic Model Combination
Word error rate [%] per language and model (slide bar chart):
Language  | Mono | ML-Tag7500 | ML-Tag3000 | ML-Mix3000
Croatian  |  27  |     30     |     32     |     35
Japanese  |  13  |     14     |     15     |     20
Spanish   |  28  |     30     |     32     |     37
Turkish   |  20  |     21     |     21     |     29

Acoustic Modelling I + II - 43 Lack of Transcripts
Projects such as EARS and GALE initiated the collection of vast amounts of audio data (up to 5,000 hours by now); this will likely grow even faster (e.g. Jim Baker's 1-million-hour plan).
Problem: this can no longer be transcribed by human beings.
Solution 1: Quick transcription
- Use a pre-existing recognizer to decode the audio recordings
- Ask humans to cross-check the output (in about 6 times real-time): if the hypothesis is correct, PASS; if it is close enough, CORRECT it; if it is off, THROW it AWAY
- Still too expensive, and all bad hypotheses are lost
Solution 2: Lightly supervised training
- Some kind of references are given (similar to closed captions)
- Step 1: create a biased language model on these closed captions
- Step 2: decode all audio recordings using this biased language model
- Find speech portions of high confidence and train on those
- Leads to significant improvements (e.g. GALE, 1500 hrs: ~5-7% relative)
Solution 3: Unsupervised training, see the lecture by Thang/Tim

Acoustic Modelling I + II - 44 Current Research Issues in AM: Signal Preprocessing
Minor issues: which features to use? (spectrum, cepstrum, LPC, bottleneck features, ...)
Major issues:
- Normalization techniques for speakers, e.g. vocal tract length normalization (VTLN), speaker adaptive training, articulatory features
- Preprocessing "afterburners" (RASTA, LDA, HLDA, ...)
- Dynamic features (higher-order HMMs, formant shapes, ...)
- Decomposition (multidimensional HMMs, ...)
- Noise reduction (echo canceling, car-noise reduction, ...)
- Speaker segmentation and clustering

Acoustic Modelling I + II - 45 Current Research Issues in AM: Robustness
In general, robustness is stability against variations. Variations that affect the recognition accuracy are:
- Speech itself (styles, speeds, dialects, spontaneity, ...)
- Background noise (car, cocktail party, street noise, music, ...)
- Channel effects (microphones, telephone, room characteristics, ...)
Current efforts:
- Enhance the parts that humans use to recognize speech
- Suppress those parts that are irrelevant (e.g. noise subtraction)
- Normalization (map different appearances of the same thing to one appearance, e.g. VTLN)

Acoustic Modelling I + II - 46 Current Research Issues in AM: Adaptation
In general, adaptation means modifying the parameters such that they fit the current signal better (= model adaptation), or modifying the signal such that it fits the system's parameters better (= feature space adaptation).
Most common adaptations:
- Adapt to the speaker (move a speaker-independent recognizer towards a speaker-dependent one)
- Adapt to the environment (make the recognizer a bit more environment-dependent)
Reasons for using adaptation:
- Speaker- or environment-dependent recognizers are more precise
- Data sparseness: when there is not enough data to train a speaker-dependent recognizer, adaptation can work with less data

Acoustic Modelling I + II - 47 Summary Acoustic Modeling (Part 1+2)
- Pronunciation variants
- Context-dependent acoustic modeling
- From sentence to context-dependent HMM
- Speech units
- Crossword context modeling and its problems
- Tying of contexts
- Clustering of contexts
- Bottom-up vs. top-down clustering
- Distances between model clusters
- Problems with vocabulary dependencies
- Clustering with decision trees
- Some open questions in AM
Upcoming: adaptation, special problems

Acoustic Modelling I + II - 48 Thanks for your interest!