Statistical Pronunciation Modeling for Non-native Speech

Dissertation: Statistical Pronunciation Modeling for Non-native Speech
Rainer Gruhn
Nov. 14th, 2008
Institute of Information Technology, University of Ulm, Germany
In cooperation with Advanced Telecommunication Research Labs, Kyoto

Page 2: Outline
- Introduction
  - Motivation and background
  - Thesis objectives
- Hidden Markov Models as statistical lexicon
  - Initialization and training
  - Application
- Experiments
  - ATR non-native speech database
  - Evaluation
- Closing
  - Thesis contributions
  - Publications

Page 3: Non-native English speech
- Relevant in many applications of speech recognition:
  - Automatic tourist information system
  - Car navigation with user going abroad
  - Speech recognition in the media domain
- Mispronunciations include phoneme insertions, deletions and substitutions (e.g. in German English: /th/)
- Different patterns for each language (accent)
- Example (Chinese, Indonesian, Japanese, French): "Certainly. What time do you anticipate checking in?"

Page 4: Schematic Outline of a Speech Recognition System
[Diagram labels: speech, feature extraction, features, n-best recognition, n-best word hypotheses, rescoring, result; acoustic model, language model, dictionary; additional knowledge]

Page 5: Schematic Outline of a Speech Recognition System
[Same diagram as Page 4]
- Improve performance for individual speakers: acoustic model adaptation (e.g. Maximum A Posteriori)

Page 6: Schematic Outline of a Speech Recognition System
[Same diagram as Page 4]
- Common approach for non-native speakers: rule-based dictionary enhancement (Goronzy 2002, Mayfield-Tomokiyo 2001)

Page 7: Schematic Outline of a Speech Recognition System
[Same diagram as Page 4, with "additional knowledge" replaced by "Proposed stat. HMM lexicon"]
- Proposed method: rescoring with HMMs as statistical lexicon

Page 8: Common Approach: Rules
- Common approach: phoneme confusion rules (data driven / knowledge based)
    recognition result:  s  ae ng k - i  uw
    comparison:          $S ae ng k - $S uw
    transcription:       th ae ng k - y  uw
    generated rules:     th -> s, y -> i
- Apply rules on pronunciation dictionary
    Rule set: th -> s, y -> i
    thank: /th ae ng k/, /s ae ng k/
    you: /y uw/, /i uw/
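The rule-based baseline on this slide can be illustrated with a short Python sketch. It assumes the recognition result and the transcription are already aligned and of equal length, so only substitutions are handled. The function names `derive_substitution_rules` and `expand_pronunciation` are hypothetical and do not come from the thesis.

```python
from itertools import product


def derive_substitution_rules(transcription, recognized):
    """Collect substitution rules from two phoneme sequences.

    Assumption: the sequences are already aligned and of equal length,
    so insertions and deletions are not handled here.
    """
    rules = {}
    for ref, hyp in zip(transcription, recognized):
        if ref != hyp and hyp not in rules.setdefault(ref, []):
            rules[ref].append(hyp)
    return rules


def expand_pronunciation(phonemes, rules):
    """Return every variant obtained by optionally applying each rule."""
    options = [[p] + rules.get(p, []) for p in phonemes]
    return [list(variant) for variant in product(*options)]


# Example data taken from the slide.
transcription = "th ae ng k y uw".split()
recognized = "s ae ng k i uw".split()
rules = derive_substitution_rules(transcription, recognized)

lexicon = {"thank": ["th", "ae", "ng", "k"], "you": ["y", "uw"]}
expanded = {word: expand_pronunciation(p, rules) for word, p in lexicon.items()}

for word, variants in expanded.items():
    print(word, [" ".join(v) for v in variants])
```

Each rule that matches a phoneme doubles the number of variants for that word, which is why a rule-expanded dictionary can grow quickly.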

Page 9: Problems with Rules
- Pronunciation variations also depend on context
- Variations unseen in training data cannot be modeled
- Knowledge-based: manual rule generation
- When rules are applied to the pronunciation dictionary, there is a tradeoff between:
  - Large dictionary (including all possible variations as entries)
  - Losing information (choosing to apply only some rules)

Page 10: Thesis Objective
- Non-native speech: many pronunciation variations, automatic speech recognition difficult
- Improve automatic speech recognition of non-natives
- Target:
  - Model those variations automatically and statistically
  - Cover all pronunciation variations
- Approach: train discrete Hidden Markov Models (HMM) for each word as pronunciation model

Page 12: Statistical Lexicon
- HMMs to represent pronunciations (not explicitly representing the confusions)
- One discrete HMM model for each word
- Initialization on baseline lexicon
- Training on phoneme sequences generated by phoneme recognition

Page 13: Initialization, Training and Application
- Initialization: word model built from the baseline lexicon
    AND: Enter, [ae 0.495, ax 0.495], [n 0.99], [d 0.99], Exit
- Training: phoneme recognition to generate phoneme sequences from speech data; train word pronunciation model on all instances of that word
    Phonemes: ax n d w ih th ah sh ow
    AND: ax n d, eh n d, ah n t
- Application of models: pronunciation score for each N-best hypothesis
    Phoneme sequence: ae n l eh n w ih ch ih l eh k t ix s t ey
    N-best hypotheses and pronunciation scores:
      anywhere you'd like to stay        -82.5
      and when would you like to stay    -69.0
      and what I will to like to stay    -75.0

Page 14: Word Model Example: AND
AND: /ae n d/, /ax n d/
States and their probability distributions, between the Enter and Exit states:
    State 1: ae 0.495, ax 0.495, ah 0.0002, b 0.0002, d 0.0002, n 0.0002
    State 2: n 0.99, ae 0.0002, ax 0.0002, ah 0.0002, b 0.0002, d 0.0002
    State 3: d 0.99, ae 0.0002, ax 0.0002, ah 0.0002, b 0.0002, n 0.0002

Page 15: Model Initialization
- Given: standard pronunciation dictionary
- One discrete HMM for each word
- Number of states equals number of baseline phonemes (+ enter, exit states)
- Several pronunciation variants in dictionary are integrated into word model
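A minimal Python sketch of such an initialization follows. It assumes that all dictionary variants of a word have the same length, as in the AND example, and that a probability mass of 0.99 is shared among the dictionary phonemes at each position while the remainder is spread evenly over all other phonemes. The slides do not say how variants of different lengths are merged. The function name `init_word_model` and the six-phoneme toy inventory are illustrative only.

```python
def init_word_model(variants, phoneme_set, known_mass=0.99):
    """Build one emission distribution per state from dictionary variants.

    Assumption: all variants have equal length, one state per phoneme position.
    Enter and exit states carry no emissions and are left implicit here.
    """
    n_states = len(variants[0])
    if any(len(v) != n_states for v in variants):
        raise ValueError("this sketch only handles equal-length variants")

    model = []
    for pos in range(n_states):
        expected = sorted({v[pos] for v in variants})
        others = [p for p in phoneme_set if p not in expected]
        # Dictionary phonemes share known_mass; the rest is a small floor.
        dist = {p: known_mass / len(expected) for p in expected}
        dist.update({p: (1.0 - known_mass) / len(others) for p in others})
        model.append(dist)
    return model


# Toy inventory; a real phoneme set is larger, which gives smaller floors.
phoneme_set = ["ae", "ax", "ah", "b", "d", "n"]
and_model = init_word_model([["ae", "n", "d"], ["ax", "n", "d"]], phoneme_set)

for state, dist in enumerate(and_model, start=1):
    print(state, {p: round(v, 4) for p, v in dist.items()})
```

With this toy inventory the floor values are larger than the 0.0002 shown on the slide, because the remaining mass is divided among fewer phonemes.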

Page 16: Model Training
- Segmentation of training data into words
- Phoneme recognition
- Train discrete HMM for each word on phoneme sequence
- Default unseen words to baseline lexicon phoneme sequence(s)

Page 17: Training of Discrete HMMs
- Speech data: phoneme recognition to generate phoneme sequences
    Phonemes: ax n d w ih th ah sh ow
- AND: train word pronunciation model on all instances of that word
    ax n d
    eh n d
    ah n t
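The idea of training one model per word on all recognized instances of that word can be sketched in Python as below. This is a strong simplification: it replaces proper HMM training with position-wise counting, assumes one observed phoneme per state, and skips instances whose length differs from the number of states. The slides do not name the training algorithm. The smoothing constant and the function names are assumptions.

```python
from collections import Counter, defaultdict


def group_by_word(segmented):
    """Group recognized phoneme sequences by the word they were segmented from."""
    instances = defaultdict(list)
    for word, phonemes in segmented:
        instances[word].append(phonemes)
    return dict(instances)


def reestimate_emissions(instances, n_states, phoneme_set, smoothing=1e-3):
    """Count-based emission estimate, one phoneme per state (simplification)."""
    counts = [Counter() for _ in range(n_states)]
    for seq in instances:
        if len(seq) != n_states:
            continue  # insertions and deletions are not handled in this sketch
        for state, phoneme in enumerate(seq):
            counts[state][phoneme] += 1

    model = []
    for c in counts:
        total = sum(c.values()) + smoothing * len(phoneme_set)
        model.append({p: (c[p] + smoothing) / total for p in phoneme_set})
    return model


# The three instances of AND shown on the slide.
segmented = [
    ("and", ["ax", "n", "d"]),
    ("and", ["eh", "n", "d"]),
    ("and", ["ah", "n", "t"]),
]
phoneme_set = ["ae", "ah", "ax", "d", "eh", "n", "t"]

by_word = group_by_word(segmented)
and_model = reestimate_emissions(by_word["and"], 3, phoneme_set)
print({p: round(v, 3) for p, v in and_model[2].items()})
```

The smoothing term keeps unseen phonemes at a small non-zero probability, in the spirit of the goal that unseen variations remain possible.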

Page 18: Word Model After Training
AND: /ae n d/, /ax n d/
States and their probability distributions, between the Enter and Exit states:
    State 1: ae 0.5, ax 0.3, ah 0.15, ih 0.05, d 0.0001, n 1.0e-6
    State 2: n 0.7, m 0.2, ng 0.005, hh 0.002, b 0.0001, d 1.0e-6
    State 3: d 0.7, t 0.2, b 0.05, ae 0.0001, ax 0.0001, ah 1.0e-6

Page 19: Model Application
[Diagram labels: test utterance, n-best recognition, n-best string]
- Standard n-best decoding of test set

Page 20: Model Application
[Diagram labels: test utterance, n-best recognition, n-best string; phoneme recognition, phoneme sequence]
- Standard n-best decoding of test set
- 1-best phoneme recognition of whole utterance

Page 21: Model Application
[Diagram labels: test utterance, n-best recognition, n-best string; phoneme recognition, phoneme sequence; proposed stat. HMM lexicon, Viterbi alignment, pron. score]
- Standard n-best decoding of test set
- 1-best phoneme recognition of whole utterance
- Calculate pronunciation score of each n-best hypothesis
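One way the pronunciation score could be computed is sketched below in Python: a Viterbi alignment of the recognized phoneme sequence against the states of a word model. The topology (self-loop or advance to the next state), the transition probabilities `p_stay` and `p_next`, and the emission `floor` are assumptions of this sketch. The slides do not specify them. The emission values are those of the trained AND model on Page 18.

```python
import math

NEG_INF = float("-inf")


def viterbi_log_score(observed, emissions, p_stay=0.1, p_next=0.9, floor=1e-6):
    """Log probability of the best left-to-right path through the states.

    Assumptions: the path starts in the first state, ends in the last,
    and each step either stays in a state or advances by exactly one.
    """
    if not observed:
        return NEG_INF
    n = len(emissions)

    def emit(state, phoneme):
        return math.log(emissions[state].get(phoneme, floor))

    score = [NEG_INF] * n
    score[0] = emit(0, observed[0])
    for phoneme in observed[1:]:
        new = [NEG_INF] * n
        for s in range(n):
            stay = score[s] + math.log(p_stay)
            advance = score[s - 1] + math.log(p_next) if s > 0 else NEG_INF
            best = max(stay, advance)
            if best > NEG_INF:
                new[s] = best + emit(s, phoneme)
        score = new
    return score[-1]


# Emission distributions of the trained AND model (values from Page 18).
and_model = [
    {"ae": 0.5, "ax": 0.3, "ah": 0.15, "ih": 0.05, "d": 0.0001, "n": 1.0e-6},
    {"n": 0.7, "m": 0.2, "ng": 0.005, "hh": 0.002, "b": 0.0001, "d": 1.0e-6},
    {"d": 0.7, "t": 0.2, "b": 0.05, "ae": 0.0001, "ax": 0.0001, "ah": 1.0e-6},
]

canonical = viterbi_log_score(["ae", "n", "d"], and_model)
accented = viterbi_log_score(["ah", "m", "t"], and_model)
unlikely = viterbi_log_score(["b", "b", "b"], and_model)
print(canonical, accented, unlikely)
```

For a whole hypothesis, the state lists of its words could be concatenated before alignment. Without skip transitions, a sequence shorter than the number of states cannot reach the last state and scores negative infinity. This is a limitation of the sketch, not a property claimed for the thesis system.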

Page 22: Model Application
[Diagram labels: test utterance, n-best recognition, n-best string; phoneme recognition, phoneme sequence; proposed stat. HMM lexicon, Viterbi alignment, pron. score; LM, language model score; max. score selector, best from n-best]
- Standard n-best decoding of test set
- 1-best phoneme recognition of whole utterance
- Calculate pronunciation score of each n-best hypothesis
- Select best hypothesis based on pronunciation score with weighted language model score

Page 23: Rescoring of N-best
Phoneme sequence: ae n l eh n w ih ch ih l eh k t ix s t ey
N-best hypotheses and pronunciation scores:
    anywhere you'd like to stay        -82.5
    and when would you like to stay    -69.0
    and what I will to like to stay    -75.0
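The selection step, pronunciation score combined with a weighted language model score, can be sketched in Python as follows. The pronunciation scores are the ones shown on this slide, and their pairing with the hypotheses is a reading of the slide layout. The language model scores and the weight values are invented for illustration only.

```python
def select_best(hypotheses, lm_weight=1.0):
    """Pick the hypothesis with the highest combined log score."""
    return max(hypotheses, key=lambda h: h["pron"] + lm_weight * h["lm"])


# "pron" values are from the slide; "lm" values are hypothetical.
hypotheses = [
    {"text": "anywhere you'd like to stay", "pron": -82.5, "lm": -20.0},
    {"text": "and when would you like to stay", "pron": -69.0, "lm": -24.0},
    {"text": "and what I will to like to stay", "pron": -75.0, "lm": -31.0},
]

print(select_best(hypotheses, lm_weight=1.0)["text"])
print(select_best(hypotheses, lm_weight=5.0)["text"])
```

A larger weight lets the language model dominate, so in practice the weight would be tuned on held-out data.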

Page 25: ATR Non-native Speech Database
- Existing comparable databases (large, multi-accent):
  - M-ATC, Hiwire: noisy, special military vocabulary
  - Crosstowns: unavailable to public
- Collected in this work
- One of the largest non-native English speech databases
- Data available at ATR
- Total 22h of speech

    country     China  France  Germany  Indonesia  Japan  all
    #speakers   17     15      15       15         28     96

Page 26: ATR Non-native Speech Database
- Per speaker: 12 minutes training, 2 minutes test data (2 hotel reservation dialogs)
- Read speech
- Content: uniform set of
  - hotel reservation dialogs
  - phonetically balanced sentences
  - digit sequences
- Speaker skill: various, rated

Page 27: Database Collection
- Non-nativeness vs. anxiousness:
  - Instructor in same room, nodding
  - Non-intimidating environment
- Words where speaker was not sure how to pronounce: speaker had to try
- Speakers could repeat sentence until satisfied

Page 28: Experimental Setup
- Baseline dictionary: 7311 words, 8875 entries
- 7311 pronunciation HMMs
- 10-best word recognition
- Generate pronunciation HMMs separately for each accent group
- Acoustic model: trained on Wall Street Journal database
- Word bigram LM, trained on travel arrangement task text data
- Phoneme/word error rate: $\mathrm{ERR} = \frac{\mathrm{INS} + \mathrm{DEL} + \mathrm{SUB}}{N_{\mathrm{total}}}$
- Relative error rate improvement: $\frac{\mathrm{ERR}_{\mathrm{before}} - \mathrm{ERR}_{\mathrm{after}}}{\mathrm{ERR}_{\mathrm{before}}}$
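The two evaluation measures can be illustrated with a small Python sketch. It computes the minimum number of insertions, deletions and substitutions with a standard edit-distance recursion and divides by the reference length. The function names are hypothetical, and the numbers in the usage lines are illustrative rather than results from the thesis.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(ref, hyp):
    """(INS + DEL + SUB) / N_total, with N_total the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_improvement(err_before, err_after):
    """(ERR_before - ERR_after) / ERR_before."""
    return (err_before - err_after) / err_before


per = error_rate("th ae ng k y uw".split(), "s ae ng k i uw".split())
print(per)
print(relative_improvement(50.0, 45.0))
```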

Page 29: Phoneme Recognition
- Both pronunciation model training and application steps require phoneme recognition
- Error rate calculated relative to canonical transcription
- Recognition of whole utterance
- Phoneme bigram as phonotactical constraint
[Figure: phoneme error rate (axis range 40 to 70) for monophone and triphone models, by accent group (CH, FR, GER, IN, JAP) and average]

Page 30: Pronunciation Scoring: Results
Word error rates for non-native speech recognition, with and without pronunciation rescoring
[Figure: word error rate (axis range 25 to 60) for baseline and rescoring, by accent group (CH, FR, GER, IN, JAP) and average]

    Accent type           CH     FR    GER   IN    JP    Avg
    rel. WER impr. (%)    11.9   8.3   5.9   5.4   8.0   8.2

Page 31: Comparing to Standard Technology
- Standard approach to adjust for non-native speech: rule-based dictionary modification
[Figure: Comparison of relative improvements. Relative word error rate improvement (%) for Rules vs. Statist. Lexicon]
[Figure: Improvement vs. pronunciation alternatives added to dictionary. Rel. impr. (%) against pronunciations in dictionary: 8875, 9994, 12142, 14218, 23506, 41151]
- Evaluated for the Japanese speaker set

Page 33: Thesis Contributions
- Theoretical
  - Integrated framework for statistical pronunciation modeling
  - Both learned and unseen variations are considered
  - Data-driven: no expert knowledge about accent is required
- Practical
  - Collected a large non-native English speech database
  - 22h of speech uttered by 96 speakers, among the largest such databases existing
- Experimental
  - Consistently improved performance for any type of accent
  - Largest improvement achieved: 11.9% relative WER reduction

Page 34: Publications (Excerpt)
1. A Statistical Lexicon for Non-Native Speech Recognition. Rainer Gruhn, Konstantin Markov, Satoshi Nakamura, ICSLP 2004
2. Discrete HMMs for statistical pronunciation modeling. Rainer Gruhn, Konstantin Markov, Satoshi Nakamura, SLP 2004
3. A multi-accent non-native English database. Rainer Gruhn, Tobias Cincarek, Satoshi Nakamura, ASJ 2004
4. A Statistical Lexicon Based on HMMs. Rainer Gruhn, Satoshi Nakamura, IPSJ 2004
5. Probability Sustaining Phoneme Substitution for Non-Native Speech Recognition. Rainer Gruhn, Konstantin Markov, Satoshi Nakamura, ASJ 2002
6. CORBA-based Speech-to-Speech Translation System. Rainer Gruhn, Koji Takashima, Atsushi Nishino, Satoshi Nakamura, ASRU 2001
7. A CORBA based Speech-to-Speech Translation System. Rainer Gruhn, Koji Takashima, Atsushi Nishino, Satoshi Nakamura, ASJ 2001
8. Multilingual Speech Recognition with the CALLHOME Corpus. Rainer Gruhn, Satoshi Nakamura, ASJ 2001
9. Cellular Phone Based Speech-To-Speech Translation System ATR-MATRIX. Rainer Gruhn, Harald Singer, Hajime Tsukada, Atsushi Nakamura, Masaki Naito, Atsushi Nishino, Yoshinori Sagisaka, Satoshi Nakamura, ICSLP 2000
10. Towards a Cellular Phone Based Speech-To-Speech Translation Service. Rainer Gruhn, Satoshi Nakamura, Yoshinori Sagisaka, MSC 2000
11. Scalar Quantization of Cepstral Parameters for Low Bandwidth Client-Server Speech Recognition Systems. Rainer Gruhn, Harald Singer, Yoshinori Sagisaka, ASJ 1999
Total: 46 Publications

Page 35: Patents
- 2001-222292: A computer with a speech processing system and program in memory
- 2001-222531: A computer with a program in memory that provides speech translation and feedback
- 2002-135642: A speech to speech translation system
- 2002-304392: A speech to speech translation system
- 2002-311983: A speech to speech translation system
- 2002-320037: A speech to speech translation system
- 2005-234504: A method for training HMM pronunciation models for speech recognition
- 2005-292770: A method for acoustic model generation and speech recognition
- 2006-84965: A system and program for speech data collection
- 2006-84966: A method and program for automatic rating of spoken speech
Total: 10 Patents, all granted by Japanese Patent Office

Page 36: Future Directions
- Applicability on native speech
- Baseline dictionary with no pronunciation variants
- Speech controlled services on mobile devices
- Experiments on word level: smaller units?
  - Syllables
  - N-phones
- Special states to model insertion errors
- Accent recognition

Page 37: Thank you!