
Automatic Speech Recognition (CS753)
Lecture 20: Pronunciation Modeling
Instructor: Preethi Jyothi
Oct 16, 2017

Pronunciation Dictionary/Lexicon

A pronunciation model/dictionary/lexicon lists one or more pronunciations for each word. It is typically derived from language experts: a sequence of phones is written down for each word.

Dictionary construction involves:
1. Selecting which words to include in the dictionary
2. Writing down the pronunciation of each word (and checking for multiple pronunciations)

(A toy sketch of a lexicon follows.)
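As a toy illustration (hand-written entries, not from the lecture), a lexicon is just a mapping from words to one or more phone sequences:

```python
# Minimal sketch of a pronunciation lexicon: a map from words to one or
# more phone sequences (ARPAbet-style phones, hand-written here; a real
# system would load a dictionary file such as CMUdict).
lexicon = {
    "sense": [["s", "eh", "n", "s"]],
    "the":   [["dh", "ah"], ["dh", "iy"]],  # multiple pronunciations
}

def pronunciations(word):
    """Return all listed pronunciations of a word, or [] if out-of-vocabulary."""
    return lexicon.get(word.lower(), [])

print(pronunciations("the"))  # [['dh', 'ah'], ['dh', 'iy']]
```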

Grapheme-based models

Graphemes vs. Phonemes

Instead of using a pronunciation dictionary, one could represent a pronunciation as a sequence of graphemes (letters), i.e., model at the grapheme level. This is a useful technique for low-resourced/under-resourced languages.

Main advantages:
1. Avoids the need for phone-based pronunciations (see the sketch below)
2. Avoids the need for a phone alphabet
3. Works well for languages with a direct link between graphemes (letters) and phonemes (sounds)
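A minimal sketch of advantage 1 (the helper name is hypothetical): in a purely graphemic system, a word's "lexicon entry" can be generated on the fly from its spelling.

```python
# Sketch: in a graphemic system the "pronunciation" of a word is just
# its letter sequence, so no hand-built dictionary is needed.
def graphemic_pronunciation(word):
    return list(word.lower())

print(graphemic_pronunciation("sense"))  # ['s', 'e', 'n', 's', 'e']
```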

Grapheme-based ASR

WER (%) of phonetic vs. graphemic systems (Vit: Viterbi decoding; CN: confusion network decoding; CNC: confusion network combination of the two systems):

Language (ID)            System     Vit    CN     CNC
Kurmanji Kurdish (205)   Phonetic   67.6   65.8   64.1
                         Graphemic  67.0   65.3
Tok Pisin (207)          Phonetic   41.8   40.6   39.4
                         Graphemic  42.1   41.1
Cebuano (301)            Phonetic   55.5   54.0   52.6
                         Graphemic  55.5   54.2
Kazakh (302)             Phonetic   54.9   53.5   51.5
                         Graphemic  54.0   52.7
Telugu (303)             Phonetic   70.6   69.1   67.5
                         Graphemic  70.9   69.5
Lithuanian (304)         Phonetic   51.5   50.2   48.3
                         Graphemic  50.9   49.5

From: Gales et al., Unicode-based graphemic systems for limited resource languages, ICASSP 2015


Grapheme to phoneme (G2P) conversion

Grapheme to phoneme (G2P) conversion

G2P conversion produces a pronunciation (phoneme sequence) given a written word (grapheme sequence). G2P mappings are learned from a pronunciation dictionary.

Useful for:
- ASR systems in languages with no pre-built lexicons
- Speech synthesis systems
- Deriving pronunciations for out-of-vocabulary (OOV) words

G2P conversion (I)

One popular paradigm: joint sequence models [BN08]
- Grapheme and phoneme sequences are first aligned using an EM-based algorithm
- This results in a sequence of graphones (joint grapheme-phoneme tokens)
- N-gram models are trained on these graphone sequences (see the toy sketch below)
- A WFST-based implementation of such a joint graphone model is available [Phonetisaurus]

[BN08] Bisani & Ney, Joint-sequence models for grapheme-to-phoneme conversion, Speech Communication, 2008
[Phonetisaurus] J. Novak, Phonetisaurus Toolkit
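To make graphones concrete, here is a hand-made sketch; the alignment and counts below are illustrative stand-ins for what the EM aligner and a trained, smoothed N-gram model would produce:

```python
# Sketch: graphones are joint grapheme-phoneme tokens. This alignment
# for "sense" -> s eh n s is hand-made for illustration; the EM-based
# aligner described above would produce it automatically ("_" marks a
# letter aligned to no phoneme).
from collections import defaultdict

graphones = [("s", "s"), ("e", "eh"), ("n", "n"), ("s", "s"), ("e", "_")]

# An N-gram model over graphone tokens scores spelling and pronunciation
# jointly; a toy bigram count table stands in for a trained LM here.
bigrams = defaultdict(int)
prev = ("<s>", "<s>")
for g in graphones:
    bigrams[(prev, g)] += 1
    prev = g
```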

G2P conversion (II)

Neural network based methods are the new state of the art for G2P:
- Bidirectional LSTM-based networks using a CTC output layer [Rao15]. Comparable to N-gram models.
- Incorporating alignment information [Yao15]. Beats N-gram models.
- No alignment: encoder-decoder with attention. Beats the above systems [Toshniwal16].

LSTM + CTC for G2P conversion [Rao15]

Model                                  Word Error Rate (%)
Galescu and Allen [4]                  28.5
Chen [7]                               24.7
Bisani and Ney [2]                     24.5
Novak et al. [6]                       24.4
Wu et al. [12]                         23.4
5-gram FST                             27.2
8-gram FST                             26.5
Unidirectional LSTM with Full-delay    30.1
DBLSTM-CTC 128 Units                   27.9
DBLSTM-CTC 512 Units                   25.8
DBLSTM-CTC 512 + 5-gram FST            21.3

[Rao15] Grapheme-to-phoneme conversion using LSTM RNNs, ICASSP 2015
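As a reminder of what the CTC output layer produces, here is a minimal sketch of greedy CTC decoding (the blank symbol and labels are illustrative):

```python
# Sketch of greedy CTC decoding: take the per-frame argmax labels,
# collapse consecutive repeats, then drop blanks.
def ctc_collapse(frame_labels, blank="<b>"):
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

print(ctc_collapse(["s", "s", "<b>", "eh", "eh", "<b>", "n", "<b>", "s"]))
# -> ['s', 'eh', 'n', 's']
```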


Seq2seq models (with alignment information [Yao15])

Method                                  PER (%)   WER (%)
encoder-decoder LSTM                    7.53      29.21
encoder-decoder LSTM (2 layers)         7.63      28.61
uni-directional LSTM                    8.22      32.64
uni-directional LSTM (window size 6)    6.58      28.56
bi-directional LSTM                     5.98      25.72
bi-directional LSTM (2 layers)          5.84      25.02
bi-directional LSTM (3 layers)          5.45      23.55

Data      Method                 PER (%)   WER (%)
CMUDict   past results [20]      5.88      24.53
          bi-directional LSTM    5.45      23.55
NetTalk   past results [20]      8.26      33.67
          bi-directional LSTM    7.38      30.77
Pronlex   past results [20, 21]  6.78      27.33
          bi-directional LSTM    6.51      26.69

[Yao15] Sequence-to-sequence neural net models for G2P conversion, Interspeech 2015


Encoder-decoder + attention for G2P [Toshniwal16]

[Figure: encoder-decoder with attention. The encoder reads grapheme inputs x_1 ... x_Tg into hidden states h_1 ... h_Tg; at each output step t, the attention layer combines the encoder states with the decoder state d_t into a context vector c_t, from which the decoder emits the phoneme y_t.]

[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.
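A minimal numpy sketch of the global (soft) attention step in the figure; dot-product scoring is assumed for simplicity, and the paper's exact scoring function may differ:

```python
# Sketch of global soft attention: score each encoder state against the
# decoder state d_t, softmax the scores into weights alpha, and form
# the context vector c_t as the weighted sum of encoder states.
import numpy as np

def global_attention(d_t, H):
    """d_t: (n,) decoder state; H: (Tg, n) encoder states h_1..h_Tg."""
    scores = H @ d_t                      # (Tg,) dot-product scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax over encoder positions
    c_t = alpha @ H                       # (n,) context vector
    return c_t, alpha

H = np.random.randn(7, 16)                # toy sizes: 7 graphemes, 16 dims
c_t, alpha = global_attention(np.random.randn(16), H)
```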

Encoder-decoder + attention for G2P [Toshniwal16]

Data      Method                                                PER (%)
CMUDict   BiDir LSTM + Alignment [6]                            5.45
          DBLSTM-CTC [5]                                        -
          DBLSTM-CTC + 5-gram model [5]                         -
          Encoder-decoder + global attn                         5.04 ± 0.03
          Encoder-decoder + local-m attn                        5.11 ± 0.03
          Encoder-decoder + local-p attn                        5.39 ± 0.04
          Ensemble of 5 [Encoder-decoder + global attn] models  4.69
Pronlex   BiDir LSTM + Alignment [6]                            6.51
          Encoder-decoder + global attn                         6.24 ± 0.1
          Encoder-decoder + local-m attn                        5.99 ± 0.11
          Encoder-decoder + local-p attn                        6.49 ± 0.06
NetTalk   BiDir LSTM + Alignment [6]                            7.38
          Encoder-decoder + global attn                         7.14 ± 0.72
          Encoder-decoder + local-m attn                        7.13 ± 0.11
          Encoder-decoder + local-p attn                        8.41 ± 0.19

[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.

Sub-phonetic feature-based models

Pronunciation Model

Phone-based:
- Each word is a sequence of phones (PHONE: s eh n s)
- Tends to be highly language dependent

Articulatory features:
- Parallel streams of articulator movements
- Based on the theory of articulatory phonology [C. P. Browman and L. Goldstein, Phonology 86]

Pronunciation Model: Articulatory Features

Feature streams: LIP-LOC, LIP-OPEN, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, VELUM, GLOTTIS.

Feature values over the phone sequence of "sense" (alignment of values to phones follows the canonical pronunciation on the next slide):

PHONE     s                  eh            n                s
LIP       open/labial        open/labial   open/labial      open/labial
TON.TIP   critical/alveolar  mid/alveolar  closed/alveolar  critical/alveolar
TON.BODY  mid/uvular         mid/palatal   mid/palatal      mid/uvular
GLOTTIS   open               critical      critical         open
VELUM     closed             closed        open             closed

Example: Pronunciations for the word "sense"

CANONICAL
PHONE     s                  eh            n                s
LIP       open/labial        open/labial   open/labial      open/labial
TON.BODY  mid/uvular         mid/palatal   mid/palatal      mid/uvular
TON.TIP   critical/alveolar  mid/alveolar  closed/alveolar  critical/alveolar
GLOTTIS   open               critical      critical         open
VELUM     closed             closed        open             closed

OBSERVED (e.g.)
The same feature values, but with the VELUM stream desynchronized from the others: the velum opens while the vowel is still being produced (nasalizing it) and closes again before the tongue tip releases its alveolar closure (inserting a stop), giving the surface sequence:
PHONE: s eh_n n t s

Simple asynchrony across feature streams can surface as many different phone alterations.

[Adapted from Livescu 05]

Dynamic Bayesian Networks (DBNs)

DBNs provide a natural framework to efficiently encode multiple streams of articulatory features.

[Figure: a simple DBN with three random variables per time frame, A_t, B_t, C_t, shown for frames t-1, t, t+1, with dependencies within and across frames.]
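The efficiency comes from the joint distribution factorizing into small per-frame conditionals. The factorization below is one plausible reading, assuming within-frame edges A → B → C plus a temporal edge from each variable to its copy in the next frame (the exact edges in the slide's figure are not recoverable from the transcript):

```latex
P(A_{1:T}, B_{1:T}, C_{1:T})
  = \prod_{t=1}^{T} P(A_t \mid A_{t-1}) \, P(B_t \mid B_{t-1}, A_t) \, P(C_t \mid C_{t-1}, B_t)
```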

DBN model of pronunciation [P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech 12]

[Figure: DBN for the word "solve" (phones s ao l v). Per-frame variables: the Word; the position (Posn: 0 1 2 3) in its canonical phone sequence; the canonical phone (Phn) and a transition indicator (Trans); per-articulator lags (L-Lag, T-Lag, G-Lag), so that, e.g., a stream at lag 0 follows s ao l v while at lag 1 it follows - s ao l; per-articulator phones (L-Phn, T-Phn, G-Phn) with Prev-Phone; articulatory feature values (Lip-Op, TT-Op, Glot); and observed surface feature values (sur Lip-Op, sur TT-Op, sur Glot).]

Factorized DBN model [P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech 12]

[Figure: the DBN variables grouped into five sets: Set1 = {Word, Trans, Posn, Phn}; Set2 = {L-Lag, T-Lag, G-Lag}; Set3 = {L-Phn, Prev-Phone, T-Phn, G-Phn}; Set4 = {Lip-Op, TT-Op, Glot}; Set5 = {sur Lip-Op, sur TT-Op, sur Glot}.]

Cascade of Finite State Machines [P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech 12]

[Figure: the factorized DBN compiled into a cascade of finite state machines:
Word → F1 → (Phn, Trans) → F2 → (Phn, Trans, L-Lag, T-Lag, G-Lag) → F3 → (L-Phn, T-Phn, G-Phn) → F4 → (Lip-op, TT-op, Glot) → F5 → (sur Lip-op, sur TT-op, sur Glot).]

Weighted Finite State Machine

[Figure: a small weighted FST with arcs labeled input:output/weight, e.g., x1:y1/1.5, x2:y2/1.3, x3:y3/2.0, x4:y4/0.6.]

Weighted Finite State Machine

w_α(X, a): weight of path a on input X, where α are learned parameters.

Linear model: w_α(X, a) = α · φ(X, a), where φ is a feature function.

Decoding: for input X, find the path with minimum cost:

a* = argmin_{path a} w_α(X, a)
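A minimal sketch of min-cost decoding with linear arc weights, assuming an acyclic machine whose arcs are listed in topological order (the graph, feature function, and names are illustrative):

```python
# Sketch: min-cost path decoding with linear arc weights
# w_alpha(X, a) = alpha . phi(x, a), via dynamic programming over an
# acyclic WFST whose arcs are given in topological order.
import numpy as np

def decode(arcs, n_states, alpha, phi, start=0, final=None):
    """arcs: list of (src, dst, label); returns labels on the min-cost path."""
    final = n_states - 1 if final is None else final
    cost = np.full(n_states, np.inf)
    back = {}
    cost[start] = 0.0
    for src, dst, x in arcs:                   # topological arc order assumed
        c = cost[src] + float(alpha @ phi(x))  # accumulate path weight
        if c < cost[dst]:
            cost[dst], back[dst] = c, (src, x)
    labels, s = [], final
    while s != start:                          # follow backpointers
        s, x = back[s]
        labels.append(x)
    return labels[::-1]

phi = lambda x: np.array([1.0, len(x)])        # toy feature function
arcs = [(0, 1, "a"), (0, 1, "bb"), (1, 2, "c")]
print(decode(arcs, 3, np.array([0.5, 0.1]), phi))  # ['a', 'c']
```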

Discriminative Training

An online discriminative training algorithm is used to learn α, similar to the structured perceptron [Collins 02]:
- Each training sample yields a decoded path and a correct path
- Update α to bias the model towards the correct path (see the sketch below)
- A large-margin training algorithm is adapted to work with a cascade of finite state machines [P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech 13]
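A sketch of the basic perceptron-style update under the min-cost convention above (decoding is an argmin, so the sign is flipped relative to the usual argmax perceptron); Phi and all names here are illustrative, not the authors' actual code:

```python
# Sketch: structured-perceptron-style update for min-cost decoding.
# If the decoded path differs from the reference, raise the cost of the
# decoded path's features and lower the correct path's, so the correct
# path wins the argmin next time.
import numpy as np

def perceptron_update(alpha, Phi, decoded_path, correct_path, lr=1.0):
    """Phi(path) -> summed feature vector over the path's arcs."""
    if decoded_path != correct_path:
        alpha = alpha + lr * (Phi(decoded_path) - Phi(correct_path))
    return alpha
```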