Pronunciation Modeling. Te Rutherford


Bottom Line: Fixing pronunciation is much easier and cheaper than fixing the language model (LM) or the acoustic model (AM), and the improvement from the pronunciation model alone can be sizeable.

Overview of Speech Recognition

Audio to Feature Vector
- Chop up the audio into frames.
- Take the Fourier transform to get the power at each frequency.
- Featurize: MFCC, cepstrum, PLP, etc.
- Clip out the silence.
- Stacking (windowing): use frames t-1, t-2, t-3 and t+1, t+2, t+3 as additional features.
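
A minimal sketch of this front end, assuming librosa is available for loading, silence trimming, and MFCC extraction; the ±3-frame stacking follows the slide, while the 16 kHz sample rate and 13 coefficients are illustrative defaults:

```python
import numpy as np
import librosa  # assumed available for audio loading and MFCCs

def featurize(wav_path, context=3, n_mfcc=13):
    """Turn raw audio into stacked MFCC feature vectors, one per frame."""
    y, sr = librosa.load(wav_path, sr=16000)
    # Trim leading/trailing silence ("clip out the silence").
    y, _ = librosa.effects.trim(y)
    # Short-time Fourier analysis + mel filterbank + DCT = MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (T, n_mfcc)
    # Stack neighboring frames t-3..t+3 so the classifier sees context.
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode="edge")
    stacked = np.hstack([padded[i : i + len(mfcc)]
                         for i in range(2 * context + 1)])
    return stacked  # (T, n_mfcc * (2*context + 1))
```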

Frame to State
The acoustic model is a classifier (e.g. a neural network):
- input: a feature vector
- output: a triphone state
What is a triphone state? A phone in context, for example:
- type 1 /t/: {vowel} + /t/ + {vowel}
- type 2 /t/: {nasal} + /t/ + {vowel}
Decision trees are used to cluster /t/ sounds from different environments.
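
A sketch of such a classifier in PyTorch; the layer sizes and the count of 2,000 tied triphone states are made-up illustrative numbers, not values from the talk:

```python
import torch
import torch.nn as nn

N_FEATS = 13 * 7   # stacked MFCC dimension from the front end above
N_STATES = 2000    # clustered (tied) triphone states; illustrative count

# Acoustic model: a plain classifier from one stacked feature vector to a
# log-posterior over tied triphone states (the decision-tree clusters).
acoustic_model = nn.Sequential(
    nn.Linear(N_FEATS, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, N_STATES),
    nn.LogSoftmax(dim=-1),  # one column of the phone lattice per frame
)

frames = torch.randn(100, N_FEATS)       # 100 frames of stacked features
log_posteriors = acoustic_model(frames)  # (100, N_STATES)
```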

Phone Lattice

State                     time 1   time 2   time 3
type 1 /t/ - beginning    0.2      0.11     0.001
type 1 /t/ - middle       0.1      0.23     0.001
type 1 /t/ - end          0.1      0.06     0.2
type 2 /t/ - beginning    0.05     0.02     0.2
type 2 /t/ - middle       0.05     0.05     0.005
type 2 /t/ - end          0.1      0.001    0.05
type 1 /k/ - beginning    0.03     0.06     0.12
type 2 /k/ - middle       0.02     0.02     0.42
type 3 /k/ - end          0.03     0.001    0.001
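
As a data structure the lattice is just a matrix of per-frame acoustic scores. A small numpy sketch with the scores from the table shows why a greedy per-frame argmax is not enough: it happily jumps between states that no legal HMM path connects:

```python
import numpy as np

states = ["t1-beg", "t1-mid", "t1-end", "t2-beg", "t2-mid", "t2-end",
          "k1-beg", "k2-mid", "k3-end"]
# Lattice scores from the table above: rows = states, columns = frames.
lattice = np.array([
    [0.2,  0.11,  0.001],
    [0.1,  0.23,  0.001],
    [0.1,  0.06,  0.2  ],
    [0.05, 0.02,  0.2  ],
    [0.05, 0.05,  0.005],
    [0.1,  0.001, 0.05 ],
    [0.03, 0.06,  0.12 ],
    [0.02, 0.02,  0.42 ],
    [0.03, 0.001, 0.001],
])
# A per-frame argmax ignores which state sequences are legal, which is
# exactly why the FST machinery below is needed.
print([states[i] for i in lattice.argmax(axis=0)])
# ['t1-beg', 't1-mid', 'k2-mid']  -- jumps from /t/-middle to /k/-middle
```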

Weighted Finite State Transducer (WFST)
Essentially a search graph. In speech, FST composition is used to prune the search graph and to score the results.
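
To make "composition" concrete, here is a toy, epsilon-free WFST composition in plain Python (not OpenFst); arcs are tuples and weights are negative log probabilities, so composition adds them:

```python
from collections import defaultdict

# A toy WFST is (start_state, final_states, arcs); each arc is
# (src, input_label, output_label, weight, dst). Epsilons omitted for brevity.

def compose(a, b):
    """Match a's output labels against b's input labels."""
    a_start, a_final, a_arcs = a
    b_start, b_final, b_arcs = b
    b_by_state_in = defaultdict(list)
    for src, i, o, w, dst in b_arcs:
        b_by_state_in[(src, i)].append((o, w, dst))
    start = (a_start, b_start)
    arcs, stack, seen = [], [start], {start}
    while stack:
        qa, qb = stack.pop()
        for src, i, o, w, dst in a_arcs:
            if src != qa:
                continue
            for o2, w2, dst2 in b_by_state_in[(qb, o)]:
                nxt = (dst, dst2)
                arcs.append(((qa, qb), i, o2, w + w2, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    finals = {(fa, fb) for fa in a_final for fb in b_final}
    return start, finals, arcs

# Tiny usage: a C-like transducer (triphone states -> phonemes) composed
# with an L-like transducer (phonemes -> words).
C = (0, {1}, [(0, "t1-mid", "t", 0.1, 1)])
L = (0, {1}, [(0, "t", "tea", 0.0, 1)])
print(compose(C, L))  # one arc mapping "t1-mid" directly to "tea"
```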

HMM Conversion
Limits which state sequences are valid, e.g. for the word "call" (k ao l):
- k1 k1 k3 k3 ao1 ao2 ao2 ao3 l1 l2 l3 l3: not okay (skips k2)
- k1 k1 k2 k2 l1 l1 l2 l3 l3: not okay (skips k3 and the ao states)
- k1 k2 k2 k2 k3 k3 ao1 ao2 ao3 ao3 l1 l1 l2 l3: okay
Use FST composition to prune the phone lattice accordingly.
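
A sketch of the underlying constraint for a single phone's left-to-right HMM (self-loops allowed, no skips, all states visited); the integer states stand in for k1/k2/k3:

```python
def valid_hmm_path(state_seq, n_states=3):
    """Check a within-phone state sequence against a left-to-right,
    self-loop, no-skip HMM topology."""
    expected = 1
    for s in state_seq:
        if s == expected:          # self-loop: stay in the same state
            continue
        if s == expected + 1:      # forward transition to the next state
            expected = s
            continue
        return False               # skip or backward jump: invalid
    return expected == n_states    # must finish in the last state

print(valid_hmm_path([1, 2, 2, 2, 3, 3]))  # True  (like k1 k2 k2 k2 k3 k3)
print(valid_hmm_path([1, 1, 3, 3]))        # False (like k1 k1 k3 k3: skips k2)
```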

State to Phoneme
Convert context-dependent triphone states back into phonemes:
- type 1 /t/: {vowel} + /t/ + {vowel} → /t/
- type 2 /t/: {nasal} + /t/ + {vowel} → /t/
Use the FST composition operation to convert (transduce).

Phoneme to Word
A list of valid phoneme sequences. Example pronunciation dictionary:
- call: k ao l
- dad: d ae d
- mom: m aa m
This is the backbone of the recognizer: the best phoneme sequence might not make a word.
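
A minimal sketch of lexicon lookup with the example dictionary above, including the failure case where the best phoneme sequence does not make a word:

```python
# The example lexicon from the slide, keyed by phoneme sequence.
LEXICON = {
    ("k", "ao", "l"): "call",
    ("d", "ae", "d"): "dad",
    ("m", "aa", "m"): "mom",
}

def phonemes_to_word(phones):
    """Exact lexicon lookup; returns None when the phoneme sequence
    does not make a word."""
    return LEXICON.get(tuple(phones))

print(phonemes_to_word(["k", "ao", "l"]))  # call
print(phonemes_to_word(["k", "ao", "d"]))  # None: not a word
```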

d ey t ax" d ey dx ax" d ae t ax" d ae dx ax Pronunciation FSTs

Recognizing Sequences of Words
The language model and the grammar are finite state transducers, scoring hypotheses such as P(is intuition) vs. P(is data).
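
A sketch of the idea with a toy bigram model; the probabilities are made up, and in the real system each bigram would be a weighted arc in the grammar/LM transducer:

```python
import math

# Toy bigram probabilities; the numbers are made up for illustration.
BIGRAM_P = {
    ("is", "data"): 0.010,
    ("is", "intuition"): 0.001,
}

def lm_cost(words, p_unseen=1e-6):
    """Negative log probability of a word sequence under the bigram model;
    as an FST this is just one weighted arc per bigram."""
    cost = 0.0
    for w1, w2 in zip(words, words[1:]):
        cost += -math.log(BIGRAM_P.get((w1, w2), p_unseen))
    return cost

# Lower cost = more probable: "is data" beats "is intuition".
print(lm_cost(["this", "is", "data"]) < lm_cost(["this", "is", "intuition"]))  # True
```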


Search Meets Machine Learning
Compose a well-pruned search graph (HCLG):
- HMM FSTs (H) → pruned phone graph
- Context FSTs (C) → context-independent phoneme graph
- Lexicon FSTs (L) → word graph (a heavily pruned phoneme graph)
- Grammar and LM FSTs (G) → pruned and scored word graph
Then classify the sequence of feature vectors with the acoustic model → phone lattice, compose the phone lattice with HCLG, and search.
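
A toy stand-in for the last step: segment a recognized phoneme sequence into lexicon words by dynamic programming. Real systems compose the full lattice with HCLG rather than a single phone string; this only illustrates the lexicon's role in the search:

```python
from functools import lru_cache

LEXICON = {("k", "ao", "l"): "call",
           ("d", "ae", "d"): "dad",
           ("m", "aa", "m"): "mom"}

def decode_words(phones):
    """Segment a phoneme sequence into lexicon words (toy L/G search)."""
    phones = tuple(phones)

    @lru_cache(maxsize=None)
    def seg(i):
        if i == len(phones):
            return []
        for j in range(i + 1, len(phones) + 1):
            word = LEXICON.get(phones[i:j])
            if word is not None:
                rest = seg(j)
                if rest is not None:
                    return [word] + rest
        return None  # no segmentation: the phones don't make words

    return seg(0)

print(decode_words(["k", "ao", "l", "m", "aa", "m"]))  # ['call', 'mom']
```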

Overview of Speech Recognition

The Lexicon Is the Key Component
The lexicon makes training and decoding possible by limiting the size of the search graph, and it determines which words can possibly be recognized.


Motivations for Pronunciation Modeling
Suppose you are making a speech recognizer for a new language. You need a base dictionary and a specialized dictionary. How is a dictionary made? Can we automate it, or at least semi-automate it? The errors from pronunciation modeling can be isolated and fixed relatively easily.

Pronunciation Can Be Hard
For a dictation task, we face new words all the time:
- Play a song by Ke$ha
- Install Spotify
- Who sings Gangnam Style?
- Tell me about Zooey Deschanel


Ask Linguists for Help
These new words are usually hard and might require several pronunciations: Gotye, Rihanna, Gangnam Style, Ellie Goulding.
Disadvantages: expensive, too accurate, slow(er) turn-around time.


Refresh the Pronunciation Dictionary
Apply grapheme-to-phoneme (G2P) conversion to a list of new words.
- Rule-based approach.
- Statistical approach: a Hidden Markov Model trained with Baum-Welch on the base dictionary, where the states are the set of phonemes and the observations are letters or groups of letters.
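
A sketch of decoding under this formulation: Viterbi search for the most likely phoneme (state) sequence given a letter (observation) sequence. The probability tables here are tiny hand-set toys; in practice they come from Baum-Welch training on the base dictionary:

```python
import math

def viterbi(letters, phonemes, init_p, trans_p, emit_p):
    """Most likely phoneme (state) sequence for a letter (observation)
    sequence under an HMM."""
    V = [{s: math.log(init_p[s]) + math.log(emit_p[s][letters[0]])
          for s in phonemes}]
    back = [{}]
    for t in range(1, len(letters)):
        V.append({}); back.append({})
        for s in phonemes:
            prev = max(phonemes,
                       key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][letters[t]]))
            back[t][s] = prev
    # Trace back the best path.
    state = max(phonemes, key=lambda s: V[-1][s])
    path = [state]
    for t in range(len(letters) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))

# Toy model: two phonemes, letters "c" and "s".
phonemes = ["k", "s"]
init_p = {"k": 0.6, "s": 0.4}
trans_p = {"k": {"k": 0.3, "s": 0.7}, "s": {"k": 0.5, "s": 0.5}}
emit_p = {"k": {"c": 0.8, "s": 0.2}, "s": {"c": 0.2, "s": 0.8}}
print(viterbi("cs", phonemes, init_p, trans_p, emit_p))  # ['k', 's']
```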


Performance of the Statistical Approach
HMM-based models perform quite well. Of English, Dutch, German, and French, which is the hardest? (Jiampojamarn et al., 2007)

Semi-Automatic Approach
Learn the pronunciation of a word from a native speaker who is not a linguist. Why is this possible?

Pronunciation Learning from Crowdsourcing (Rutherford et al., 2014)

Learn Automatically from Humans
For each word or phrase (transcription), get people to pronounce it, then use the speech recognizer to extract the pronunciation; we already have the acoustic model and the transcription!
- Fast: <1 day turn-around
- Cheap: ~5 cents per transcription
Sounds great. But would it work?

Overview of the algorithm

Data Acquisition
Picked the top 1,000 downloaded entries from the Google Play Store for each of four categories (artist names, song titles, TV show names, movie titles), yielding 4 data sets. Sent them to Amazon Mechanical Turk: 10 different Turkers pronounce each transcription, with 7 utterances used for pronunciation learning and 3 for testing.

Pronunciation Candidate Generation
Use grapheme-to-phoneme conversion to generate 20 pronunciations per word. If a word is already in the dictionary, use that pronunciation too.

Extract Pronunciation
Force-aligned G2P: find the triphone state sequence that best aligns with the utterance.

Pronunciation Selection
Evaluated 3 possibilities (see the sketch below):
- use all of the learned pronunciations
- use the pronunciations that occur more than once
- majority vote
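
A sketch of the three selection strategies, assuming we already have a forced-alignment log-likelihood for each candidate pronunciation on each utterance; the scores and the select_pronunciations helper are illustrative, not from the paper:

```python
from collections import Counter

def select_pronunciations(utterance_scores, strategy="majority"):
    """utterance_scores: one dict per utterance mapping each candidate
    pronunciation to its forced-alignment log-likelihood, i.e. the result
    of force-aligning each Turker utterance against the G2P candidates.
    Returns the set of pronunciations to add to the lexicon."""
    winners = [max(scores, key=scores.get) for scores in utterance_scores]
    counts = Counter(winners)
    if strategy == "all":
        return set(winners)                              # all learned
    if strategy == "more_than_once":
        return {p for p, c in counts.items() if c > 1}   # occur > once
    return {counts.most_common(1)[0][0]}                 # majority vote

# Three utterances of "Tyga", each scored against two candidates
# (made-up log-likelihoods).
scores = [{"t ay g ax": -310.0, "t ay g er": -295.5},
          {"t ay g ax": -402.1, "t ay g er": -390.0},
          {"t ay g ax": -288.0, "t ay g er": -301.2}]
print(select_pronunciations(scores))  # {'t ay g er'}
```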

Experiment Results: Using All Learned Pronunciations

Conclusions So Far
- Improved SACC across all data sets
- Highest impact on artist names

Question: How should we select pronunciations? How many should we keep?

How many pronunciations should we use?

Question
So far we have tested on matching data only. Why can't we test on the standard test set? Would it help to use the new pronunciations in a real setting?

Beyond WER
Use majority vote to select the pronunciations to add to the base pronunciation dictionary, then test on the voice search task: MTurkers rate the quality of the voice search results.

Results: Voice Search

Testing Out the Pipeline
Extract 14,000 typed Google Maps search queries that occur rarely in the log, and pick 5,000 words that are not in the lexicon. Can we do better than fully automatic G2P?

Results: Google Maps Voice Search

Question: Where do the learned pronunciations come from with respect to the G2P rankings? Our baseline system uses the top G2P candidate.

Distribution of G2P Ranks

Learned Pronunciations from the Artist Data

Artist name       Learned pronunciation   G2P rank
Dan Omelio        AX M EH L IY OW         20
Tyga              T AY G ER               20
Mat Kearney       K EH EH R N IY          20
Jadakiss          JH EY D AX K IH S       20
Nasri Atweh       N AA Z R IY             19
Sonny Uwaezuoke   Y UW W EY Z UW OW K     15
Flo Rida          R AY D AX               2
Amadeus Mozart    AX M AX D EY AX S       4
David Guetta      G W EH T AX             6

Limitations: Word Boundaries
Phones can attach to the wrong side of a word boundary:
- Call Me Fitz → F IH T
- Gwen Stefani → G W EH N Z + S T EY F AX N IY
We need to enforce boundary constraints.

Conclusion
Crowdsourcing pronunciations proved a viable, quick path to refreshing the pronunciation dictionary. The force-aligned G2P algorithm can be used to learn pronunciations from audio data. Pronunciation in English is hard because it is such an international language.