Automatic Speech Recognition (CS753)

1 Automatic Speech Recognition (CS753) Lecture 20: Pronunciation Modeling Instructor: Preethi Jyothi Oct 16, 2017

2 Pronunciation Dictionary/Lexicon
- Pronunciation model/dictionary/lexicon: lists one or more pronunciations for a word
- Typically derived from language experts: sequence of phones written down for each word
- Dictionary construction involves:
  1. Selecting what words to include in the dictionary
  2. Pronunciation of each word (also, check for multiple pronunciations)

3 Grapheme-based models

4 Graphemes vs. Phonemes
- Instead of a pronunciation dictionary, one could represent a pronunciation as a sequence of graphemes (or letters). That is, model at the grapheme level.
- Useful technique for low-resourced/under-resourced languages
- Main advantages:
  1. Avoid the need for phone-based pronunciations
  2. Avoid the need for a phone alphabet
  3. Works pretty well for languages with a direct link between graphemes (letters) and phonemes (sounds)
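
A minimal sketch of what modeling at the grapheme level could look like in practice: each word's "pronunciation" is simply its letter sequence. The function name `graphemic_lexicon` and the normalization choices (lowercasing, dropping non-letters) are assumptions made for illustration, not something prescribed by the slides.

```python
def graphemic_lexicon(words):
    """Map each word to its letter sequence, used in place of a phone sequence."""
    lexicon = {}
    for word in words:
        # Assumption: lowercase and keep alphabetic characters only,
        # so that "Solve" and "solve" share the same units.
        graphemes = [ch for ch in word.lower() if ch.isalpha()]
        if graphemes:
            lexicon[word] = graphemes
    return lexicon


# Print the lexicon in a "word unit unit ..." layout.
for word, units in graphemic_lexicon(["sense", "solve"]).items():
    print(word, " ".join(units))
```

No phone alphabet or expert-written pronunciation is needed here, which is the advantage listed above; the cost is that the acoustic model has to absorb any mismatch between spelling and sound.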

5 Grapheme-based ASR
[Table: WER (%) of phonetic vs. graphemic systems, columns Vit, CN and CNC, for the languages Kurmanji Kurdish, Tok Pisin (ID 207), Cebuano (ID 301), Kazakh (ID 302), Telugu (ID 303) and Lithuanian (ID 304); numeric values not preserved]
Image from: Gales et al., Unicode-based graphemic systems for limited resource languages, ICASSP 2015

7 Grapheme to phoneme (G2P) conversion

8 Grapheme to phoneme (G2P) conversion
- Produce a pronunciation (phoneme sequence) given a written word (grapheme sequence)
- Learn G2P mappings from a pronunciation dictionary
- Useful for:
  - ASR systems in languages with no pre-built lexicons
  - Speech synthesis systems
  - Deriving pronunciations for out-of-vocabulary (OOV) words

9 G2P conversion (I)
- One popular paradigm: joint sequence models [BN12]
  - Grapheme and phoneme sequences are first aligned using an EM-based algorithm
  - This results in a sequence of graphones (joint G-P tokens)
  - N-gram models are trained on these graphone sequences
- WFST-based implementation of such a joint graphone model [Phonetisaurus]
[BN12] Bisani & Ney, Joint sequence models for grapheme-to-phoneme conversion, Speech Communication 2012
[Phonetisaurus] J. Novak, Phonetisaurus Toolkit
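
The graphone idea can be illustrated with a small sketch. The alignments below are written by hand and are hypothetical: in the joint-sequence approach they would come from the EM alignment step, which is not implemented here. The sketch only shows how aligned pairs become graphone tokens and how a maximum-likelihood bigram model over those tokens could be estimated; `_` stands for an empty phoneme side, and all names are illustrative.

```python
from collections import Counter

# Hand-written, hypothetical alignments (the EM alignment step is skipped).
ALIGNED = {
    "sense": [("s", "s"), ("e", "eh"), ("n", "n"), ("s", "s"), ("e", "_")],
    "solve": [("s", "s"), ("o", "ao"), ("l", "l"), ("v", "v"), ("e", "_")],
}


def to_graphones(pairs):
    """Turn aligned (grapheme, phoneme) pairs into joint tokens with boundary markers."""
    return ["<s>"] + [f"{g}:{p}" for g, p in pairs] + ["</s>"]


def train_bigram(sequences):
    """Maximum-likelihood bigram probabilities P(cur | prev) over graphone tokens."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return {bg: count / contexts[bg[0]] for bg, count in bigrams.items()}


graphone_seqs = [to_graphones(pairs) for pairs in ALIGNED.values()]
bigram_prob = train_bigram(graphone_seqs)
print(bigram_prob[("<s>", "s:s")])   # both words start with the graphone s:s
print(bigram_prob[("s:s", "e:eh")])  # one of the three continuations of s:s
```

A real system would add smoothing and higher-order N-grams, and would search over graphone sequences whose grapheme side spells the input word.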

10 G2P conversion (II)
- Neural network based methods are the new state-of-the-art for G2P
  - Bidirectional LSTM-based networks using a CTC output layer [Rao15]. Comparable to N-gram models.
  - Incorporate alignment information [Yao15]. Beats N-gram models.
  - No alignment. Encoder-decoder with attention. Beats the above systems [Toshniwal16].

11 LSTM + CTC for G2P conversion [Rao15]
Model                                  Word Error Rate (%)
Galescu and Allen [4]                  28.5
Chen [7]                               24.7
Bisani and Ney [2]                     24.5
Novak et al. [6]                       24.4
Wu et al. [12]                         23.4
5-gram FST                             27.[...]
[...] gram FST                         26.5
Unidirectional LSTM with Full-delay    30.1
DBLSTM-CTC 128 Units                   27.9
DBLSTM-CTC 512 Units                   25.8
DBLSTM-CTC [...] gram FST              21.3
[Rao15] Grapheme-to-phoneme conversion using LSTM RNNs, ICASSP 2015

13 Seq2seq models (with alignment information [Yao15])
[Table: PER (%) and WER (%) for encoder-decoder LSTM, encoder-decoder LSTM (2 layers), uni-directional LSTM, uni-directional LSTM (window size 6), bi-directional LSTM, bi-directional LSTM (2 layers) and bi-directional LSTM (3 layers); numeric values not preserved]
[Table: PER (%) and WER (%) of past results vs. bi-directional LSTM on CMUDict (past results [20]), NetTalk (past results [20]) and Pronlex (past results [20, 21]); numeric values not preserved]
[Yao15] Sequence-to-sequence neural net models for G2P conversion, Interspeech 2015
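
For reference, the two metrics reported for G2P systems, PER and WER, can be computed roughly as follows. This is a sketch under simplifying assumptions: one reference pronunciation per word, PER as total phoneme edit distance over total reference phonemes, and WER as the share of words whose predicted pronunciation is not identical to the reference. The example pronunciations are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def per_and_wer(references, hypotheses):
    """Return (PER %, WER %) for parallel lists of phoneme sequences."""
    phone_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_phones = sum(len(r) for r in references)
    wrong_words = sum(r != h for r, h in zip(references, hypotheses))
    return 100.0 * phone_errors / total_phones, 100.0 * wrong_words / len(references)


# Made-up example: the first hypothesis inserts an extra "t".
refs = [["s", "eh", "n", "s"], ["s", "ao", "l", "v"]]
hyps = [["s", "eh", "n", "t", "s"], ["s", "ao", "l", "v"]]
per, wer = per_and_wer(refs, hyps)
print(per, wer)
```

Note that WER here counts whole-word pronunciation errors, so a single wrong phoneme makes the entire word count as an error; this is why WER is much higher than PER in the G2P results.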

14 G2P conversion (II)
[Rao15] Grapheme-to-phoneme conversion using LSTM RNNs, ICASSP 2015
[Yao15] Sequence-to-sequence neural net models for G2P conversion, Interspeech 2015
[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016

15 Encoder-decoder + attention for G2P [Toshniwal16]
[Figure: encoder over inputs x_1, x_2, x_3, ..., x_Tg with states h_1, h_2, h_3, ..., h_Tg; attention layer producing the context c_t; decoder with state d_t and output y_t]
[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.
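
A sketch of how a global attention layer could form the context c_t from the encoder states h_1 ... h_Tg and the decoder state d_t. The dot-product scoring function is an assumption chosen for brevity and is not necessarily the scoring used in [Toshniwal16]; the function names and the toy vectors are illustrative as well.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(encoder_states, decoder_state):
    """Return (attention weights, context vector) for one decoder step."""
    # Assumption: score each encoder state by its dot product with d_t.
    scores = [sum(h_k * d_k for h_k, d_k in zip(h, decoder_state))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context is the attention-weighted sum of the encoder states.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder states h_1..h_3
d_t = [0.0, 2.0]                          # toy decoder state
weights, c_t = attention_context(h, d_t)
print(weights, c_t)
```

Because the weights are recomputed at every decoder step, the model learns a soft alignment between graphemes and phonemes instead of requiring one in advance.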

16 Encoder-decoder + attention for G2P [Toshniwal16]
Data      Method                                                  PER (%)
CMUDict   BiDir LSTM + Alignment [6]                              5.45
          DBLSTM-CTC [5]                                          -
          DBLSTM-CTC + 5-gram model [5]                           -
          Encoder-decoder + global attn                           5.04 ± 0.03
          Encoder-decoder + local-m attn                          5.11 ± 0.03
          Encoder-decoder + local-p attn                          5.39 ± 0.04
          Ensemble of 5 [Encoder-decoder + global attn] models    4.69
Pronlex   BiDir LSTM + Alignment [6]                              6.51
          Encoder-decoder + global attn                           6.24 ± 0.1
          Encoder-decoder + local-m attn                          5.99 ± 0.11
          Encoder-decoder + local-p attn                          6.49 ± 0.06
NetTalk   BiDir LSTM + Alignment [6]                              7.38
          Encoder-decoder + global attn                           7.14 ± 0.72
          Encoder-decoder + local-m attn                          7.13 ± 0.11
          Encoder-decoder + local-p attn                          8.41 ± 0.19
[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.

17 Sub-phonetic feature-based models

18 Pronunciation Model
Phone-Based:
- Each word is a sequence of phones
- Tends to be highly language dependent
- Example: PHONE s eh n s
Articulatory Features:
- Parallel streams of articulator movements
- Based on theory of articulatory phonology [1]
[1] C. P. Browman and L. Goldstein, Phonology 86

19 Pronunciation Model
Articulatory Features:
- Parallel streams of articulator movements
- Based on theory of articulatory phonology [1]
Feature streams: LIP-LOC, LIP-OPEN, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, VELUM, GLOTTIS
Example (values listed in temporal order within each stream):
PHONE      s, eh, n, s
LIP        open/labial
TON.TIP    critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar
TON.BODY   mid/uvular, mid/palatal, mid/uvular
GLOTTIS    open, critical, open
VELUM      closed, open, closed

20 Example: Pronunciations for word sense
CANONICAL
PHONE   s, eh, n, s
LIP     open/labial
TB      mid/uvular, mid/palatal, mid/uvular
TT      critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar
GLOT    open, critical, open
VEL     closed, open, closed
OBSERVED (e.g.)
PHONE   s, eh_n, n, t, s
LIP     open/labial
TB      mid/uvular, mid/palatal, mid/uvular
TT      critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar
GLOT    open, critical, open
VEL     closed, open, closed
Each stream takes the same values in both cases; only the timing of the changes differs across streams.
Simple asynchrony across feature streams can appear as many phone alterations [Adapted from Livescu 05]

21 Dynamic Bayesian Networks (DBNs)
- Provide a natural framework to efficiently encode multiple streams of articulatory features
- Simple DBN with three random variables in each time frame
[Figure: variables A, B and C repeated in frames t-1, t and t+1, i.e. A_{t-1}, A_t, A_{t+1}, B_{t-1}, B_t, B_{t+1}, C_{t-1}, C_t, C_{t+1}]

22 DBN model of pronunciation
[Figure: DBN shown for the word "solve" (phones s ao l v). Variables: Word, Trans, Posn, Phn, Prev-Phone; lags L-Lag, T-Lag, G-Lag; per-stream phones L-Phn, T-Phn, G-Phn; feature values Lip-Op, TT-Op, Glot; surface feature values sur Lip-Op, sur TT-Op, sur Glot, with the label "Observed feature values"]
P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech 12

23 Factorized DBN model [1]
[Figure: the DBN variables (Word, Trans, Posn, Phn, L-Lag, T-Lag, G-Lag, L-Phn, Prev-Phone, T-Phn, G-Phn, Lip-Op, TT-Op, Glot, sur Lip-Op, sur TT-Op, sur Glot) partitioned into five groups labeled Set1 to Set5]

24 Cascade of Finite State Machines
[Figure: the factorized DBN next to a cascade of five finite state machines F1 to F5. Labels along the cascade, in order: Word; Phn, Trans; L-Lag, T-Lag, G-Lag; L-Phn, T-Phn, G-Phn; Lip-op, TT-op, Glot; surLip-op, surTT-op, surGlot]
[1] P. Jyothi, E. Fosler-Lussier, K. Livescu, Interspeech 12

25 Weighted Finite State Machine
[Figure: machine with arcs labeled input:output/weight: x1:y1/1.5, x2:y2/1.3, x3:y3/2.0, x4:y4/0.6]

26 Weighted Finite State Machine
[Figure: machine with arcs labeled input:output/weight: x1:y1/1.5, x2:y2/1.3, x3:y3/2.0, x4:y4/0.6]
- $w_\alpha(X, a)$: weight of path $a$ on input $X$, where $\alpha$ are learned parameters
- Linear model: $w_\alpha(X, a) = \alpha \cdot \phi(X, a)$, where $\phi$ is a feature function
- Decoding: for input $X$, find the path with minimum cost: $a^* = \arg\min_{a} w_\alpha(X, a)$
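
Decoding under the linear model can be sketched as a minimum-cost path search in which each arc's weight is the dot product of α with that arc's features. Everything concrete below is an assumption for illustration: the two-step topology, the indicator features f1 to f4 (chosen so that the arc weights match the numbers in the figure), and the requirement that arcs always go from a lower to a higher state id so that a single pass in source order suffices.

```python
def dot(alpha, phi):
    """Inner product of the parameter vector and a sparse feature vector."""
    return sum(alpha.get(name, 0.0) * value for name, value in phi.items())


def decode(arcs, alpha, start, final):
    """Minimum-cost path in an acyclic machine.

    Assumption: every arc goes from a lower to a higher state id, so
    visiting arcs in order of source state is a valid topological order.
    """
    best = {start: (0.0, [])}  # state -> (cost so far, arc labels so far)
    for src, dst, label, phi in sorted(arcs, key=lambda arc: arc[0]):
        if src not in best:
            continue  # source state is unreachable from the start state
        cost = best[src][0] + dot(alpha, phi)
        if dst not in best or cost < best[dst][0]:
            best[dst] = (cost, best[src][1] + [label])
    return best[final]


# Hypothetical topology: two alternatives from state 0 to 1, two from 1 to 2.
ARCS = [
    (0, 1, "x1:y1", {"f1": 1.0}),
    (0, 1, "x2:y2", {"f2": 1.0}),
    (1, 2, "x3:y3", {"f3": 1.0}),
    (1, 2, "x4:y4", {"f4": 1.0}),
]
ALPHA = {"f1": 1.5, "f2": 1.3, "f3": 2.0, "f4": 0.6}

cost, path = decode(ARCS, ALPHA, start=0, final=2)
print(path, cost)
```

With these parameters the cheapest path takes the x2:y2 arc followed by the x4:y4 arc. Changing α changes the arc weights and therefore possibly the decoded path, which is what training exploits.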

27 Discriminative Training
- Online discriminative training algorithm to learn α
- Similar to structured perceptron [Collins 02]: each training sample gives a decoded path and a correct path. Update α to bias towards the correct path.
- Use a large-margin training algorithm adapted to work with a cascade of finite state machines [1]
[1] P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech-13
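
The perceptron-style idea of biasing α towards the correct path can be sketched as below. This is a plain structured-perceptron update, not the large-margin algorithm of the cited work, and it assumes that decoding minimizes cost, so the update lowers the cost of the correct path and raises the cost of the wrongly decoded one. Paths are represented as lists of sparse per-arc feature dictionaries; the names and the toy features are hypothetical.

```python
def features_of(path):
    """Sum the sparse feature vectors of the arcs along a path."""
    total = {}
    for phi in path:
        for name, value in phi.items():
            total[name] = total.get(name, 0.0) + value
    return total


def perceptron_update(alpha, decoded_path, correct_path, learning_rate=1.0):
    """Return updated parameters; unchanged if the decoded path is already correct."""
    new_alpha = dict(alpha)
    if decoded_path == correct_path:
        return new_alpha
    # Decoding minimizes cost, so make the decoded (wrong) path more expensive...
    for name, value in features_of(decoded_path).items():
        new_alpha[name] = new_alpha.get(name, 0.0) + learning_rate * value
    # ...and the correct path cheaper.
    for name, value in features_of(correct_path).items():
        new_alpha[name] = new_alpha.get(name, 0.0) - learning_rate * value
    return new_alpha


alpha = {"f1": 1.5, "f2": 1.3}
decoded = [{"f2": 1.0}]  # path currently preferred by the decoder
correct = [{"f1": 1.0}]  # path that should have been chosen
updated = perceptron_update(alpha, decoded, correct)
print(updated)
```

After this single update the correct path is cheaper than the previously decoded one. A large-margin variant would additionally require the correct path to win by some margin before leaving α unchanged.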
