Learning from Mistakes: Expanding Pronunciation Lexicons using Word Recognition Errors


Learning from Mistakes: Expanding Pronunciation Lexicons using Word Recognition Errors. Sravana Reddy (The University of Chicago). Joint work with Evandro Gouvêa.

Sang Bissenette → [SPEECH RECOGNITION] → "Sane visitor"

Mariano DiFabio → [SPEECH RECOGNITION] → "Mary and the fable"

This Work
Out-of-Vocabulary (OOV) words (Mariano DiFabio) → [black-box SPEECH RECOGNITION] → known words ("Mary and the fable").
Goal: recover pronunciations of the OOV words (Mariano and DiFabio) through the latent phonetic similarity channel.

Previous Work
Mariano DiFabio → [SPEECH RECOGNITION] → phoneme hypotheses (M EH R IY AA N AE L IH AE, <s> L AH EY AH N D EY AH AH AH) alongside "Mary and the fable" → pronunciations of OOV words (Mariano and DiFabio).

Previous Work: Wooters and Stolcke (ICASSP 1994); Sloboda and Waibel (ICSLP 1996); Fosler-Lussier (Ph.D. thesis, 1999); Maison (Eurospeech 2003); Tan and Besacier (Interspeech 2008); Bansal et al. (ICASSP 2009); Badr et al. (Interspeech 2010); etc.

Why assume black-box access?
Practical: the ASR engine may be a black box (proprietary speech recognition tools, etc.). Example use of our approach: a third-party app analyzes the results of a black-box recognition engine and returns OOV pronunciations.
Scientific: how much pronunciation information can we get from word recognition errors alone?

Our Generative Model (for input word w and output recognition hypothesis e)
1. Generate word w with Pr(w)
2. Generate pronunciation baseform b with Pr(b | w)
3. Generate phoneme sequence p with Pr(p | b, w) by passing b through a phonetic confusion channel
4. Generate hypothesis word or phrase e with Pr(e | p, b, w)
Pr(w, e) = Σ_{b,p} Pr(w) Pr(b | w) Pr(p | b, w) Pr(e | p, b, w)
Example: DiFabio → D IY F AA B IH OW → [black-box ASR] → DH AH F EY B AH L → "the fable"

Our Generative Model (continued)
5. Repeat steps 2-4 to generate more hypotheses e.
Example: DiFabio → D IY F AA B IH OW → DH AH F EY B AH L → "the fable"; D IY F EY B IH OW → D IH F ER B AH T → "differ but"
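The four generative steps above can be sketched as a joint-probability computation. This is a minimal toy sketch, not the paper's implementation: the distributions are plain dictionaries with illustrative names and values, and the phonetic channel is assumed to depend only on the baseform.

```python
# Toy sketch of the generative model's joint probability:
# Pr(w, e) = sum over b, p of Pr(w) Pr(b|w) Pr(p|b) Pr(e|p).
# All names and numbers below are illustrative assumptions.

def joint_prob(w, e, pr_w, pr_b_given_w, pr_p_given_b, pr_e_given_p):
    """Marginalize over baseforms b and channel-transformed phoneme
    sequences p to score a (word, hypothesis) pair."""
    total = 0.0
    for b, pb in pr_b_given_w.get(w, {}).items():
        for p, pp in pr_p_given_b.get(b, {}).items():
            total += pr_w[w] * pb * pp * pr_e_given_p.get(p, {}).get(e, 0.0)
    return total

pr_w = {"DiFabio": 1.0}
pr_b_given_w = {"DiFabio": {"D IY F AA B IH OW": 1.0}}
# The confusion channel corrupts the baseform with some probability.
pr_p_given_b = {"D IY F AA B IH OW": {"DH AH F EY B AH L": 0.3,
                                      "D IY F AA B IH OW": 0.7}}
# The lexicon/decoder maps phoneme strings to in-vocabulary output.
pr_e_given_p = {"DH AH F EY B AH L": {"the fable": 1.0}}

print(joint_prob("DiFabio", "the fable", pr_w, pr_b_given_w,
                 pr_p_given_b, pr_e_given_p))  # 0.3
```

Only the corrupted phoneme path reaches "the fable", so the joint probability is 1.0 × 1.0 × 0.3 × 1.0 = 0.3.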

Learning Algorithm
GOAL: find the best pronunciation for input word w, i.e. argmax_b Pr(b | w).
Given:
- current guess about Pr(baseform b | w)
- Pr(transformed phonemes p | b, w): phonetic confusions (explained later)
- Pr(word recognition output e | p, b, w) = Pr(e | p): current lexicon

Learning Algorithm (Expectation Maximization)
E-step: compute the posterior probability of baseform b given w and e:
Pr(b | e, w) = Σ_p Pr(b | w) Pr(p | b, w) Pr(e | p, b, w) / Σ_c Σ_p Pr(c | w) Pr(p | c, w) Pr(e | p, c, w)
(numerator: current guess × phonetic confusions × current lexicon; denominator: the same, summed over candidate baseforms c)
M-step: re-estimate Pr(b | w) ∝ Σ_{e ∈ E_w} Pr(b | e, w) Pr(e), summing over all e in the n-best word recognition lists over all utterances of w.
Iterate; the initial guess is uniform (from above).
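One EM iteration can be sketched as follows. This is a simplified toy version under assumptions not in the slides: the phonetic channel is pre-marginalized into Pr(e | b) = Σ_p Pr(p | b) Pr(e | p), and all distributions and values are illustrative dictionaries.

```python
def em_iteration(E_w, pr_b, pr_e_given_b, pr_e):
    """One EM iteration for a single word w.
    E-step: Pr(b | e, w) proportional to Pr(b | w) * Pr(e | b).
    M-step: Pr(b | w) proportional to sum over e in E_w of
            Pr(b | e, w) * Pr(e)."""
    new_pr_b = dict.fromkeys(pr_b, 0.0)
    for e in E_w:
        # E-step: unnormalized posterior over candidate baseforms
        post = {b: pr_b[b] * pr_e_given_b[b].get(e, 0.0) for b in pr_b}
        z = sum(post.values())
        if z == 0.0:
            continue  # hypothesis unreachable from all candidates
        # M-step accumulation, weighted by Pr(e)
        for b in pr_b:
            new_pr_b[b] += (post[b] / z) * pr_e.get(e, 1.0)
    total = sum(new_pr_b.values())
    return {b: v / total for b, v in new_pr_b.items()}

# Two candidate baseforms; the first explains the observed errors better.
pr_b = {"D IY F AA B IH OW": 0.5, "D AY F AE B IY OW": 0.5}
pr_e_given_b = {"D IY F AA B IH OW": {"the fable": 0.6, "differ but": 0.4},
                "D AY F AE B IY OW": {"the fable": 0.1}}
updated = em_iteration(["the fable", "differ but"], pr_b,
                       pr_e_given_b, {"the fable": 1.0, "differ but": 1.0})
```

After one iteration, probability mass shifts sharply toward the baseform that assigns higher likelihood to the observed recognition errors.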

Initial Guess for Pr(b | w)
Limit to reasonable candidates:
- Existing lexicon
- Joint-sequence g2p algorithm (Sequitur; Bisani and Ney, 2008), with broad coverage: order-2 multigrams (low accuracy, high recall)
Initialize B_w = {all sequences b with probability > 0.00001}, and
Pr(b | w) = 1 / |B_w| if b ∈ B_w, 0 otherwise.
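The thresholded uniform initialization above is simple enough to sketch directly. The g2p scores here are a stand-in for Sequitur's candidate probabilities, with made-up values:

```python
def init_pr_b(g2p_scores, threshold=0.00001):
    """Uniform initialization of Pr(b | w) over the candidate set
    B_w = {b : g2p probability of b exceeds the threshold}.
    g2p_scores is an illustrative stand-in for Sequitur output."""
    B_w = [b for b, score in g2p_scores.items() if score > threshold]
    return {b: 1.0 / len(B_w) for b in B_w}

candidates = init_pr_b({"D IY F AA B IY OW": 0.4,
                        "D IH F AE B IY OW": 0.01,
                        "D UW F UW B UW": 1e-7})  # last one pruned
```

The low-probability candidate falls below the threshold, and the two survivors each receive probability 0.5.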

Modeling Phonetic Confusions
Run phone recognition on TIMIT (train) to obtain phoneme hypotheses alongside the phoneme references, and estimate a phoneme confusion finite-state transducer. Assume p is conditionally independent of w:
Pr(p | b, w) = Pr(p | b) = sum of FST paths with input b and output p.
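The substitution part of such a confusion model reduces to counting aligned reference/hypothesis phone pairs. A minimal sketch, assuming one-to-one aligned pairs as input (a full model would also encode insertions and deletions in the FST):

```python
from collections import defaultdict

def confusion_probs(aligned_pairs):
    """Estimate Pr(hypothesis phone | reference phone) from aligned
    (reference, hypothesis) phone pairs, e.g. from running phone
    recognition on TIMIT train and aligning against the references.
    Substitutions/matches only; insertions and deletions omitted."""
    counts = defaultdict(lambda: defaultdict(int))
    for ref, hyp in aligned_pairs:
        counts[ref][hyp] += 1
    return {ref: {hyp: c / sum(hyps.values()) for hyp, c in hyps.items()}
            for ref, hyps in counts.items()}

# Illustrative alignment pairs, not real TIMIT counts.
pairs = [("IY", "IY"), ("IY", "IH"), ("IY", "IY"), ("AA", "AH")]
probs = confusion_probs(pairs)
```

Each row of the resulting table normalizes to 1 and can be loaded as arc weights of a confusion transducer.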

Data
CSLU Names Corpus; only single-word names are used (isolated-word experiments). 20,423 utterances, 7,771 unique names.
Train (learn OOV pronunciations): random 50% of utterances for each name.
Test (evaluate the new lexicon): remaining utterances.

Setup
Sphinx 3, with MFCCs extracted using Sphinx's default parameters. Acoustic models trained on TIMIT. Original lexicon: CMU Dictionary with the CSLU names removed. Language model: unigrams over names, with add-one smoothing to include all CMU Dictionary words.

Evaluation
- Word Error Rate of ASR recognition with the learned lexicon
- Baseform Error Rate: proportion of learned baseforms that differ from the corpus transcriptions
- Phoneme Error Rate: proportion of insertions, deletions, and substitutions of learned baseforms against the corpus transcriptions
Baselines:
1. State-of-the-art g2p: Sequitur with multigrams of order 6 (SEQUITUR)
2. CMU Dictionary pronunciations for names in the dictionary (CMUGOLD)
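The phoneme error rate above is the standard normalized edit distance between phoneme strings. A minimal sketch (not the authors' exact scoring script):

```python
def phoneme_error_rate(ref, hyp):
    """Levenshtein distance between phoneme lists (insertions +
    deletions + substitutions), normalized by reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution/match
    return d[m][n] / m

# One substitution against a 4-phoneme reference -> PER of 0.25.
per = phoneme_error_rate("M EH R IY".split(), "M ER R IY".split())
```

Baseform error rate is the stricter exact-match version of the same comparison: 1 if the learned baseform differs at all, 0 otherwise.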

Results: comparison against SEQUITUR
E_w (set of hypotheses) = results from 10-best recognition; also shown for E_w = results from 5-best recognition.
Can we get better pronunciations than a grapheme-to-phoneme system? (Result tables from the original slides are not reproduced in this transcription.)

Results: comparison against CMUGOLD
E_w (set of hypotheses) = results from 10-best recognition and from 5-best recognition, on only those utterances whose names are in the CMU Dictionary.
How does ASR recognition with gold-standard pronunciations compare? (Result tables from the original slides are not reproduced in this transcription.)

What Works?
Dense phonetic neighborhood (successful pronunciation recovery): Mary, with hypotheses such as Merry in, Marilyn, Marian, Mary and, Perelman, Maryland, Maritime.
Sparse phonetic neighborhood (not so successful): Rutherford, with hypotheses such as Luther of, Rumor for, Ruder for.

Conclusion
Can we learn pronunciations from word recognition errors? Yes! Learned pronunciations are better than grapheme-to-phoneme results.
This is preliminary work; lots more to be done:
- Extend EM to also learn (or augment) the phonetic confusions
- Learn pronunciation variants of words already in the lexicon
- Adapt to continuous speech (not just isolated words)
- Seed Pr(b | w) independently of Sequitur or other g2p
- Combine phone-lattice information and word recognition output as cues for pronunciation

Dank Yu!