ASR for Spoken-Dialogue Systems


Lecture # 18, Session 2003
ASR for Spoken-Dialogue Systems

- Introduction
- Speech recognition issues
- Example using SUMMIT system for weather information
- Reducing computation
  - Model aggregation
  - Committee-based classifiers

Acoustic Modelling 1

Example Dialogue-based Systems

[Figure: average words per utterance and average utterances per call for CSELT, SW/F, Philips, CMU/M, CMU/F, LIMSI, MIT/W, MIT/F, AT&T, and human-human dialogue]

- Vocabularies typically have 1000s of words
- Widely deployed systems tend to be more conservative
- Directed dialogues have fewer words per utterance
- Word averages are lowered by more confirmations
- Human-human conversations use more words

Telephone-based, Conversational ASR

- Telephone bandwidths with variable handsets
- Noisy background conditions
- Novice users with a small number of interactions
  - Men, women, children
  - Native and non-native speakers
  - Genuine queries, browsers, hackers
- Spontaneous speech effects, e.g., filled pauses, partial words, non-speech artifacts
- Out-of-vocabulary words and out-of-domain queries
- Full vocabulary needed for complete understanding
  - Word and phrase spotting are not primary strategies
- Mixed-initiative dialogue provides little constraint to the recognizer
- Real-time decoding

Data Collection Issues

- System development is a chicken-and-egg problem
- Data collection has evolved considerably
  - Wizard-based to system-based data collection
  - Laboratory deployment to public deployment
  - 100s of users, to thousands, to millions
- Data from real users solving real problems accelerates technology development
  - Significantly different from the laboratory environment
  - Highlights weaknesses, allows continuous evaluation
  - But requires systems providing real information!
- Expanding corpora require unsupervised training or adaptation to unlabelled data

Data Collection (Weather Domain)

- Initial collection of 3,500 read utterances and 1,000 wizard utterances

[Figure: calls and utterances collected per month since May 1997]

- Over 756K utterances from 112K calls since May 1997

Weather Corpus Characteristics

- Corpus dominated by American male speakers
  [Pie charts: Male 70%, Female 21%, Child 9%; Native 86%, Non-native 14%]
- Approximately 11% of data contained significant noises
- Over 6% of data contained spontaneous speech effects
- At least 5% of data from speakerphones

Vocabulary Selection

[Figure: % coverage vs. vocabulary size, 0 to 3000 words]

- Constrained domains naturally limit vocabulary sizes
- A 2000-word vocabulary gives good coverage for weather
  - ~2% out-of-vocabulary rate on test sets

Vocabulary

- Current vocabulary consists of nearly 2000 words
- Based on system capabilities and user queries

  Type        Size  Examples
  Geography    933  boston, alberta, france, africa
  Weather      217  temperature, snow, sunny, smog
  Basic        815  i, what, january, tomorrow

- Incorporation of common reduced words & word pairs

  Type       Examples
  Reduction  give_me, going_to, want_to, what_is, i_would
  Compound   clear_up, heat_wave, pollen_count

- Lexicon based on syllabified LDC PRONLEX dictionary

Example Vocabulary File

  <>*
  <pause1>
  <pause2>
  <uh>
  <um>
  <unknown>*
  a
  a_m
  am
  don+t
  new_york_city
  sixty
  today
  today+s

- Sorted alphabetically
- <> is the utterance start & end marker
- <pause1> and <pause2> model pauses at utterance start & end
- <uh> and <um> are filled pause models
- *'d items have no acoustic realization
- <unknown> is the out-of-vocabulary word model; <>'d words don't count as errors
- Underbars distinguish letter sequences (a_m) from actual words (am)
- The + symbol is conventionally used for apostrophes (don+t, today+s)
- Lower case is a common convention
- Numbers tend to be spelled out
- Each word form has a separate entry

Example Baseform File

  <pause1>  : - +
  <pause2>  : - +
  <uh>      : ah_fp
  <um>      : ah_fp m
  a_m       : ey & eh m
  either    : ( iy , ay ) th er
  laptop    : l ae pd t aa pd
  new_york  : n uw & y ao r kd
  northwest : n ao r th w eh s td
  trenton   : tr r eh n tq en
  winter    : w ih nt er

Notation:
- + means the previous symbol can repeat
- ah_fp is a special filled-pause vowel
- ( , ) encloses alternate pronunciations
- & marks a word break allowing a pause
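The alternate-pronunciation notation in the baseform file above can be expanded mechanically into a list of phoneme sequences. A minimal sketch, assuming the flat (non-nested) "( a , b )" syntax shown in the example file; the function name is illustrative:

```python
# Expand a baseform string with parenthesized alternates into all the
# phoneme sequences it licenses. Assumes no nested parentheses, matching
# the example file on this slide.
def expand_baseform(pron):
    tokens = pron.replace("(", " ( ").replace(")", " ) ").replace(",", " , ").split()
    seqs = [[]]
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            j = tokens.index(")", i)
            # alternates are comma-separated phoneme groups inside the parens
            alts, cur = [], []
            for t in tokens[i + 1:j]:
                if t == ",":
                    alts.append(cur)
                    cur = []
                else:
                    cur.append(t)
            alts.append(cur)
            seqs = [s + a for s in seqs for a in alts]
            i = j + 1
        else:
            seqs = [s + [tokens[i]] for s in seqs]
            i += 1
    return seqs

# "either" from the slide: ( iy , ay ) th er -> two pronunciations
print(expand_baseform("( iy , ay ) th er"))
```

Running this on "either" yields one sequence starting with iy and one with ay, matching the two pronunciations the notation encodes.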

Editing Generated Baseforms

- An automatically generated baseform file should be manually checked for the following problems:
  - Missing pronunciation variants that are needed
  - Unwanted pronunciation variants that are present
  - Vocabulary words missing in PRONLEX

Before editing:
  going_to : g ow ix ng & t uw
  reading  : ( r iy df ix ng , r eh df ix ng )
  woburn   : <???>

After editing:
  going_to : g ( ow ix ng & t uw , ah n ax )
  reading  : r eh df ix ng
  woburn   : w ( ow , uw ) b er n

Applying Phonological Rules

- Phonemic baseforms are the canonical representation
- Baseforms may have multiple acoustic realizations
- Acoustic realizations are phones or phonetic units
- Example: batter : b ae tf er
  This can be realized phonetically as:
    bcl b ae tcl t er   (standard /t/)
  or as:
    bcl b ae dx er      (flapped /t/)

Example Phonological Rules

- Example rule for /t/ deletion ("destination"):
    {s} t {ax ix} => [tcl t];
  i.e., (left context) (phoneme) (right context) => (phonetic realization)
- Example rule for palatalization of /s/ ("miss you"):
    {} s {y} => s sh;
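Such context-dependent rules can be applied as a toy sketch. The rule representation below is my own, not SUMMIT's rule engine: each rule holds a left-context set, a phoneme, a right-context set, and a list of allowed realizations; I assume the square brackets in "[tcl t]" mark an optional (deletable) realization, and an empty context set matches anything.

```python
# Toy context-dependent phonological rule expansion (assumed representation).
# A rule (left, phoneme, right, realizations) rewrites `phoneme` when its
# neighbours match the given context sets; empty set matches anything.
RULES = [
    # /t/ deletion after /s/ before reduced vowels, as in "destination":
    # the closure+release [tcl t] is optional, so one realization is empty.
    ({"s"}, "t", {"ax", "ix"}, [["tcl", "t"], []]),
    # palatalization of /s/ before /y/, as in "miss you"
    (set(), "s", {"y"}, [["s"], ["sh"]]),
]

def expand(phonemes):
    """Return all phone sequences licensed by the rules (toy expansion)."""
    results = [[]]
    for i, p in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else None
        right = phonemes[i + 1] if i + 1 < len(phonemes) else None
        options = [[p]]  # default: realize the phoneme as itself
        for lctx, ph, rctx, realizations in RULES:
            if ph != p:
                continue
            if lctx and left not in lctx:
                continue
            if rctx and right not in rctx:
                continue
            options = realizations
            break
        results = [r + opt for r in results for opt in options]
    return results

# "miss you" fragment: /s/ before /y/ may surface as [s] or [sh]
print(expand(["m", "ih", "s", "y", "uw"]))
```

The same machinery gives both the retained and the deleted /t/ variants for a /s t ax/ sequence.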

Language Modelling

- Class bi- and trigrams used to produce 10-best outputs
- Training data augmented with city and state constraints
- Relative entropy measure used to help select classes, e.g.:
  - raining, snowing
  - cold, hot, warm
  - extended, general
  - humidity, temperature
  - advisories, warnings
  - conditions, forecast, report
- 200 word classes reduced perplexities and error rates:

  Type             Perplexity  % Word Error Rate
  word bigram      18.4        16.0
  + word trigram   17.8        15.5
  class bigram     17.6        15.6
  + class trigram  16.1        14.9

Defining N-gram Word Classes

  CITY ==> boston
  CITY ==> chicago
  CITY ==> seattle
  <U>_DIGIT ==> one
  <U>_DIGIT ==> two
  <U>_DIGIT ==> three
  DAY ==> today tomorrow

- Class definitions have the class name on the left and a word on the right
- Class names prefixed with <U>_ force all words in the class to be equally likely
- Alternate words in a class can be placed on the same line with a separator
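A class n-gram factors a word bigram as P(w'|w) = P(class(w')|class(w)) * P(w'|class(w')). A minimal sketch that parses the definition format above and evaluates that product; the probability tables below are invented for illustration:

```python
# Toy class-based bigram. The "CLASS ==> word ..." file format follows the
# slide; the probability values are invented for illustration only.
def parse_classes(lines):
    """Map each word to its class; '==>' separates class from member words."""
    word2class = {}
    for line in lines:
        cls, _, words = line.partition("==>")
        for w in words.split():
            word2class[w.strip()] = cls.strip()
    return word2class

word2class = parse_classes([
    "CITY ==> boston",
    "CITY ==> chicago",
    "CITY ==> seattle",
    "DAY ==> today tomorrow",
])

# Invented distributions for the sketch.
p_class_bigram = {("DAY", "CITY"): 0.2}
p_word_given_class = {"boston": 0.5, "chicago": 0.3, "seattle": 0.2,
                      "today": 0.6, "tomorrow": 0.4}

def p(w2, w1):
    """P(w2 | w1) = P(class2 | class1) * P(w2 | class2)."""
    c1, c2 = word2class[w1], word2class[w2]
    return p_class_bigram.get((c1, c2), 0.0) * p_word_given_class[w2]

print(p("boston", "today"))
```

Sharing statistics at the class level is what lets rarely seen city names inherit the behaviour of frequent ones, which is the point of the 200-class model on the previous slide.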

The Training Sentence File

- An n-gram model is estimated from training data
- The training file contains one utterance per line
- Words in the training file must have the same case and form as words in the vocabulary file
- The training file uses the following conventions:
  - Each clean utterance begins with <pause1> and ends with <pause2>
  - Compound word underbars are typically removed before training
  - Underbars are automatically re-inserted during training based on compound words present in the vocabulary file
  - Special artifact units may be used for noises and other significant non-speech events: <clipped1>, <clipped2>, <hangup>, <cough>, <laugh>

Example Training Sentence File

  <pause1> when is the next flight to chicago <pause2>
  <pause1> to san <partial> san francisco <pause2>
  <pause1> <um> boston <pause2>
  <clipped1> it be in time <pause2>
  <pause1> good bye <hangup>
  <pause1> united flight two oh four <pause2>
  <pause1> <cough> excuse me <laugh> <pause2>

- <partial> marks a partial word, e.g., san die(go)
- <clipped1> marks a clipped word, e.g., ~(w)ill it
- All significant sounds are transcribed
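The underbar re-insertion convention described above can be sketched as a greedy longest-match pass over each training utterance. The compound list below is a toy subset; a real system would read it from the vocabulary file:

```python
# Re-join compound words in training text with underbars, preferring the
# longest match (so "new york city" beats "new york"). Toy vocabulary.
compounds = {("new", "york", "city"), ("new", "york"), ("going", "to")}
max_len = max(len(c) for c in compounds)

def reinsert_underbars(words):
    out, i = [], 0
    while i < len(words):
        # try the longest window first, down to 2-word compounds
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in compounds:
                out.append("_".join(words[i:i + n]))
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out

print(reinsert_underbars("i am going to new york city".split()))
```

Greedy longest-match is an assumption on my part; it is the simplest policy consistent with having both new_york and new_york_city in the vocabulary.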

Composing FST Lexical Networks

- Four basic FST networks are composed to form the full search network:
  - G : language model (over words)
  - L : lexical model (words to phonemic units)
  - P : pronunciation model (phonemic units to phonetic units)
  - C : context-dependent acoustic model mapping (phonetic units to CD acoustic model labels)
- Mathematically, the networks are composed using the expression C o P o L o G

FST Example

[Figure: fragment of a lexical FST; each arc carries an input label, an output label, and a score, with paths spelling out BOSTON, BOSNIA, and IN (scores such as 0, -3.57, -2.35, -0.35, -1.23)]

- Alternate pronunciations appear as alternate paths
- Words share arcs in the network

Acoustic Models

- Models can be built for segments and boundaries
  - Best accuracy is achieved when both are used
  - Current real-time recognition uses only boundary models
- Boundary labels are combined into classes
  - Classes determined using decision-tree clustering
  - One Gaussian mixture model trained per class
- 112-dimension feature vector reduced to 50 dimensions via PCA
- 1 Gaussian component for every 50 training tokens (based on # dims)
- Models trained on over 100 hours of spontaneous telephone speech collected from several domains
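The PCA step mentioned above (112 dimensions reduced to 50) can be sketched directly from the covariance eigendecomposition; toy dimensions are used here to keep the example small, and numpy is assumed available:

```python
# Sketch of PCA dimensionality reduction: project features onto the top
# principal components of the training data. Toy sizes (8 -> 3) stand in
# for the slide's 112 -> 50.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))           # toy "training tokens", 8 dims
X[:, 0] *= 5.0                          # give one direction most variance

mean = X.mean(axis=0)
cov = np.cov(X - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
top = eigvecs[:, ::-1][:, :3]           # keep the top 3 components

Z = (X - mean) @ top                    # reduced feature vectors
print(Z.shape)
```

The first reduced dimension captures the inflated direction, so its variance dominates the others, which is exactly the property the reduction relies on.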

Search Details

- Search uses forward and backward passes:
  - Forward Viterbi search using a bigram
  - Backwards A* search using the bigram to create a word graph
  - Rescore the word graph with a trigram (i.e., subtract bigram scores)
  - Backwards A* search using the trigram to create N-best outputs
- Search relies on two types of pruning:
  - Pruning based on relative likelihood score
  - Pruning based on a maximum number of hypotheses
  - Pruning provides a tradeoff between speed and accuracy
- Search can control the tradeoff between insertions and deletions
  - The language model is biased towards short sentences
  - A word transition weight (wtw) heuristic is adjusted to remove the bias
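The two pruning strategies above can be sketched as a single survivor-selection step applied to the active hypotheses at each frame. The scores and the beam/count settings below are invented for illustration:

```python
# Sketch of score-beam plus hypothesis-count pruning: keep hypotheses
# within `beam` of the best log-likelihood, then cap the survivor count.
import heapq

def prune(hyps, beam=10.0, max_hyps=3):
    """hyps: list of (score, state); higher score is better."""
    best = max(s for s, _ in hyps)
    survivors = [(s, st) for s, st in hyps if s >= best - beam]  # score beam
    return heapq.nlargest(max_hyps, survivors)                   # count cap

hyps = [(-5.0, "a"), (-6.5, "b"), (-30.0, "c"), (-7.0, "d"), (-9.0, "e")]
print(prune(hyps))
```

Tightening `beam` or `max_hyps` trades accuracy for speed, which is the tradeoff the slide describes.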

Recognition Experiments

[Figure: sentence and word error rates (%) vs. amount of training data (x1000 utterances), over collection dates from April onward]

- Collecting real data improves performance:
  - Enables increased complexity and improved robustness for acoustic and language models
  - Better match than laboratory recording conditions

Error Analysis (2,506-Utterance Test Set)

[Bar chart: word error rate (%) for the Entire Set, In Domain (ID), Male (ID), Female (ID), Child (ID), Non-native (ID), Out of Domain, and Expert (ID) subsets, ranging from roughly 2% to 60%]

- 70% of the test set is in domain
- 70% of the speakers are male
- Female speakers fare 50% worse than males
- Children fare 3 times worse than males
- Out-of-domain results come from a different test set
- Experienced users adapt to the system!

A* Search Latency

[Histogram: latency (s), 0 to 4 seconds]

- Average latency: 0.62 seconds
- 85% < 1 second; 99% < 2 seconds
- Latency is not dependent on utterance length

Gaussian Selection

- ~50% of total computation is the evaluation of Gaussian densities
- Can use binary VQ to select mixture components to evaluate
- Component selection criteria for each VQ codeword:
  - Those within a distance threshold
  - Those within the codeword (i.e., every component used at least once)
  - At least one component/model per codeword (i.e., only if necessary)
- Can significantly reduce computation with a small loss in accuracy

[Figure: components within the codeword or distance threshold are evaluated; those outside the threshold are skipped]
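The selection idea can be sketched with scalar features: shortlist mixture components per VQ codeword offline, then at decode time quantize the frame and evaluate only its codeword's shortlist. Codeword positions, means, and the threshold below are all invented:

```python
# Sketch of VQ-based Gaussian selection. Offline, each mixture mean is
# assigned to nearby VQ codewords; online, a frame is quantized and only
# the components shortlisted for its codeword are evaluated.
codewords = [0.0, 5.0, 10.0]
component_means = [0.2, 0.4, 5.1, 9.8, 4.9]
THRESHOLD = 1.0

# Offline: components within the distance threshold of each codeword,
# with a fallback so every component appears in at least one shortlist.
shortlist = {i: [] for i in range(len(codewords))}
for m, mean in enumerate(component_means):
    near = [i for i, c in enumerate(codewords) if abs(mean - c) <= THRESHOLD]
    if not near:
        near = [min(range(len(codewords)), key=lambda i: abs(mean - codewords[i]))]
    for i in near:
        shortlist[i].append(m)

def components_to_evaluate(x):
    """Online: quantize the frame, return its codeword's component list."""
    cw = min(range(len(codewords)), key=lambda i: abs(x - codewords[i]))
    return shortlist[cw]

print(components_to_evaluate(4.7))
```

A frame near codeword 5.0 triggers evaluation of only the two components near that codeword, skipping the rest, which is where the computation savings come from.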

Model Aggregation

- K-means and EM algorithms converge to different local minima from different initialization points
- Performance on development data is not necessarily a strong indicator of performance on test data

[Scatter plot: TIMIT phonetic recognition error for 24 training trials, dev set error rate (%) vs. test set error rate (%); correlation coefficient = 0.16; the trial that is best on dev is worst on test!]

Aggregation Experiments

- Combining different training runs can improve performance
- Three experimental systems: phonetic classification, phonetic recognition (TIMIT), and word recognition (RM)
- Acoustic models:
  - Mixture Gaussian densities, randomly initialized K-means
  - 24 different training trials
- Measure average performance of M unique N-fold aggregated models (starting from 24 separate models):

  % Error      Phone Classification  Phone Recognition  Word Rec.
  M=24, N=1    22.1                  29.3               4.5
  M=6,  N=4    20.7                  28.4               4.2
  M=1,  N=24   20.2                  28.1               4.0
  % Reduction  8.3                   4.0                12.0

Model Aggregation

- Aggregation combines N classifiers, with equal weighting, to form one aggregate classifier:

    φ_A(x) = (1/N) Σ_{n=1}^{N} φ_n(x)

- The expected error of an aggregate classifier is less than the expected error of any randomly chosen constituent
- An N-fold aggregate classifier has N times more computation
- Gaussian kernels of the aggregate model can be hierarchically clustered and selectively pruned
  - Experiment: prune the 24-fold model back to the size of smaller N-fold models

[Figure: hierarchical clustering of mixtures with 2, 4, and 6 Gaussians]
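The equal-weight aggregation rule φ_A(x) = (1/N) Σ φ_n(x) is just an average of the constituent outputs. A direct sketch, with three toy "classifiers" returning invented class posteriors:

```python
# Equal-weight aggregation of N classifiers: average their per-class
# outputs. The constituent classifiers here are stand-ins that return
# fixed (invented) posteriors over two classes.
def aggregate(classifiers, x):
    outputs = [clf(x) for clf in classifiers]
    n_classes = len(outputs[0])
    return [sum(o[k] for o in outputs) / len(outputs) for k in range(n_classes)]

clfs = [lambda x: [0.6, 0.4], lambda x: [0.9, 0.1], lambda x: [0.3, 0.7]]
print(aggregate(clfs, None))
```

Each constituent must be evaluated, which is why the slide notes that an N-fold aggregate costs N times the computation unless the combined Gaussians are pruned afterwards.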

Aggregation Experiments

[Figure: error rate (%) on TIMIT dev set (27 to 28.5) vs. number of aggregated training trials N (1, 2, 4, 6, 8, 12, 24), showing the average N-fold trial, the best and worst trials, and the pruned 24-fold model]

Phonetic Classification Confusions

- Most confusions occur within manner class

Committee-based Classification

- A change of temporal basis affects within-class error
  - A smoothly varying cosine basis is better for vowels and nasals
  - A piecewise-constant basis is better for fricatives and stops

[Figure: % error (20 to 30) for S1 (5 averages) and S3 (5 cosines) on Overall, Vowel, Nasal, Weak Fricative, and Stop classes]

- Combining information sources can reduce error

Committee-based Classifiers (Halberstadt, 1998)

- Uses multiple acoustic feature vectors and classifiers to incorporate different sources of information
- Explored 3 combination methods (e.g., voting, linear, indep.)
- Obtains state-of-the-art phonetic classification and recognition results (TIMIT)
- Combining 3 boundary models in the Jupiter weather domain:
  - Word error rate: 10-16% relative reduction over baseline
  - Substitution error rate: 14-20% relative reduction over baseline

  Acoustic Measurements                          % Error  % Sub
  B1 (30 ms, 12 MFCC, telescoping avg)           11.3     6.4
  B2 (30 ms, 12 MFCC+ZC+E+LFE, 4 cos, ±50 ms)    12.0     6.7
  B3 (10 ms, 12 MFCC, 5 cos, ±75 ms)             12.1     6.9
  B1 + B2 + B3                                   10.1     5.5
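The simplest of the combination methods mentioned above, voting, can be sketched in a few lines; the classifier predictions below are invented:

```python
# Majority voting over committee members: each classifier emits a label,
# and the most common label wins.
from collections import Counter

def vote(predictions):
    return Counter(predictions).most_common(1)[0][0]

# Two of three toy members label a segment "b", one says "p".
print(vote(["b", "p", "b"]))
```

Voting needs only hard labels from each member, whereas the linear combination averages scores, so voting is the cheapest method to bolt onto existing classifiers.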

Related Work

- ROVER system developed at NIST [Fiscus, 1997]
  - Recognizer Output Voting Error Reduction
  - 1997 LVCSR Hub-5E benchmark test
  - Combines confidence-tagged word recognition output from multiple recognizers
  - Produced a 12.5% relative reduction in WER
- Notion of combining multiple information sources:
  - Syllable-based and word-based [Wu, Morgan et al., 1998]
  - Different phonetic inventories [AT&T]
  - 80, 100, or 125 frames per second [BBN]
  - Triphone and quinphone [HTK]
  - Subband-based speech recognition [Bourlard, Dupont, 1997]

References

- E. Bocchieri, "Vector quantization for the efficient computation of continuous density likelihoods," Proc. ICASSP, 1993.
- T. Hazen and A. Halberstadt, "Using aggregation to improve the performance of mixture Gaussian acoustic models," Proc. ICASSP, 1998.
- J. Glass, T. Hazen, and L. Hetherington, "Real-time telephone-based speech recognition in the Jupiter domain," Proc. ICASSP, 1999.
- A. Halberstadt, "Heterogeneous acoustic measurements and multiple classifiers for speech recognition," Ph.D. thesis, MIT, 1998.
- T. Watanabe et al., "Speech recognition using tree-structured probability density function," Proc. ICSLP, 1994.