Resource Optimized Speech Recognition using Kullback-Leibler Divergence based HMM


1 Resource Optimized Speech Recognition using Kullback-Leibler Divergence based HMM
Ramya Rasipuram, David Imseng, Marzieh Razavi, Mathew Magimai Doss, Herve Bourlard
24 October

2 Automatic Speech Recognition (ASR)
Pipeline: speech signal -> feature extraction -> acoustic features -> acoustic model (likelihoods/probabilities of sounds) -> decoder -> "Good Morning"
Pronunciation lexicon: good /g/ /uh/ /d/; god /g/ /aa/ /d/; cat /k/ /aa/ /t/
Acoustic model output: /g/ 0.9, /uh/ 0.7, /d/ 0.8, /k/ 0.01, /aa/ 0.1, /t/ 0.15
Language model: morning; good 0.7, god 0.01, cat

3 Hidden Markov Models (HMMs) for ASR
Language model: hypothesis word sequence IS IT ("Is it?")
Pronunciation lexicon: subwords /ih/ /z/ /ih/ /t/
Deterministic mapping to context-dependent subwords: /sil ih+z/ /ih z+ih/ /z ih+t/ /ih t+sil/
HMM states: s_ih_2 s_ih_2
Acoustic model: relates the HMM states to the acoustic features

4 Standard HMM-based ASR
Layers: word sequence W -> pronunciation lexicon -> subwords -> lexical model -> HMM states -> acoustic model -> acoustic features
Acoustic Model: 1. GMM (HMM/GMM); 2. ANN (Hybrid HMM/ANN)
Lexical Model: 1. Deterministic (decision trees)

5 Resources for ASR
Linguist -> pronunciation lexicon: good /g/ /uh/ /d/; god /g/ /aa/ /d/; cat /k/ /aa/ /t/
Transcribed speech data -> acoustic model (likelihoods/probabilities of sounds): /g/ 0.9, /uh/ 0.7, /d/ 0.8, /k/ 0.01, /aa/ 0.1, /t/ 0.15
Text data -> language model: morning; good 0.7, god 0.01, cat
Pipeline: speech signal -> feature extraction -> acoustic features -> decoder -> "Good Morning"

6 ASR for Under-Resourced Languages
Limited or no transcribed speech
Linguistic expertise may not be available
Limited or no text resources

7 Limited Transcribed Speech Data
Borrow resources
Sounds of languages, or phonemes, can be shared across languages
Pronunciation lexicon is important
Table (language; word, pronunciation; word, pronunciation): English a ei b b i: Italian a German a, a: b e: French A, a, E b & Greek α a β v

8 Conventional Approaches
MAP adaptation: Trees: LI; GMM: Train: LI, Adapt: LD
Tandem: Trees: LD; GMM: LD; ANN: LI; ANN output -> Log -> PCA -> features v_t
LI: language-independent data from resource-rich language(s)
LD: language-dependent data from the under-resourced language
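The MAP adaptation branch above can be illustrated with the commonly used relevance-factor update for a single Gaussian mean. This is only a minimal sketch and not the configuration behind the slides: the function name `map_adapt_mean`, the relevance factor `tau`, and the toy numbers are all assumptions made for illustration.

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP update of one Gaussian mean.

    prior_mean : mean estimated on language-independent (LI) data
    frames     : feature vectors from language-dependent (LD) data
    posteriors : occupancy of this Gaussian for each LD frame
    tau        : relevance factor (hypothetical value); larger keeps the prior longer
    """
    dim = len(prior_mean)
    occupancy = sum(posteriors)
    # Occupancy-weighted sum of the LD frames, per dimension.
    weighted_sum = [
        sum(g * x[d] for g, x in zip(posteriors, frames)) for d in range(dim)
    ]
    # Interpolate between the LI prior and the LD statistics.
    return [
        (tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
        for d in range(dim)
    ]


# Toy example: LI mean 0.0, two LD frames at 2.0 and 4.0, tau = 2.
adapted = map_adapt_mean([0.0], [[2.0], [4.0]], [1.0, 1.0], tau=2.0)
print(adapted)  # [1.5]
```

With little LD data the adapted mean stays close to the LI prior, and it moves towards the LD sample mean as the occupancy grows.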

9 No Pronunciation Lexicon
1. Pay a linguist: expensive, time consuming
2. Graphemes as subword units: easy, not optimal
Example: word "Read"; phone pronunciations r eh d and r iy d; grapheme pronunciation R E A D
Grapheme system: context-dependent graphemes -> decision trees -> clustered context-dependent graphemes -> GMMs
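The grapheme option can be sketched in a few lines: the lexicon is derived from the spelling alone, and each grapheme is then expanded with its left and right neighbours. The function names, the `left-centre+right` label format and the `sil` boundary symbol are assumptions for illustration, not the authors' tooling.

```python
def grapheme_lexicon(words):
    """Map each word to its letter sequence, used directly as subword units."""
    lexicon = {}
    for word in words:
        # Keep alphabetic characters only; upper-case to match the slide notation.
        lexicon[word] = [ch.upper() for ch in word if ch.isalpha()]
    return lexicon


def context_dependent(units, boundary="sil"):
    """Expand a unit sequence into left-centre+right context-dependent units."""
    padded = [boundary] + units + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


lex = grapheme_lexicon(["read", "thing", "that"])
print(lex["read"])                     # ['R', 'E', 'A', 'D']
print(context_dependent(lex["read"]))  # ['sil-R+E', 'R-E+A', 'E-A+D', 'A-D+sil']
```

No linguist is needed to build this lexicon, but a single grapheme sequence such as R E A D has to cover both pronunciations of "read", which is why the slide calls the approach easy but not optimal.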

10 Limited Transcribed Speech and No Pronunciations
Multilingual graphemes? Worse than monolingual grapheme-based ASR
Table (language; word, pronunciation; word, pronunciation): English a a b b; Spanish a, b; Italian a, b; German a, b; French a; Greek α ?, β ?

11 Probabilistic Lexical Modeling
Each lexical unit $l^i$ is related probabilistically to the acoustic units $a^1, \dots, a^D$: $0 < P(a^d \mid l^i) < 1$, $\sum_{d=1}^{D} P(a^d \mid l^i) = 1$
Lexical model parameters: $\theta_l = \{y_i\}_{i=1}^{I}$, $y_i = [y_i^1, \dots, y_i^D]^T$, $y_i^d = P(a^d \mid l^i)$
$\theta_l$ is estimated by training an HMM: the Kullback-Leibler divergence based HMM (KL-HMM) [1]
[1] Aradilla G., Acoustic Models for Posterior Features in Speech Recognition, EPFL PhD Thesis
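To make the contrast with a deterministic lexical model concrete, the sketch below builds one vector of each kind. The deterministic case is a one-hot vector, while the probabilistic case spreads probability mass over several acoustic units. The helper names and the example counts are hypothetical.

```python
def is_valid_lexical_vector(y, tol=1e-9):
    """True if y is a categorical distribution over the D acoustic units."""
    return all(0.0 <= p <= 1.0 for p in y) and abs(sum(y) - 1.0) <= tol


def deterministic_vector(unit_index, num_units):
    """One-hot vector: the lexical unit maps to exactly one acoustic unit."""
    return [1.0 if d == unit_index else 0.0 for d in range(num_units)]


def normalise(evidence):
    """Turn non-negative evidence into probabilities that sum to one."""
    total = sum(evidence)
    return [e / total for e in evidence]


# Hypothetical lexical unit with D = 3 acoustic units.
hard = deterministic_vector(0, 3)    # deterministic lexical model
soft = normalise([6.0, 3.0, 1.0])    # probabilistic lexical model

print(hard, is_valid_lexical_vector(hard))
print(soft, is_valid_lexical_vector(soft))
```

Both vectors are valid categorical distributions; only the second satisfies the strict inequality $0 < P(a^d \mid l^i) < 1$ for every acoustic unit.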

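12 KL-HMM System
[Figure: KL-HMM system. The acoustic observation sequence (PLP) $x_1, \dots, x_T$, with a context window of frames -4 to +4, is fed to an ANN with outputs $p(a^1), \dots, p(a^D)$, where D is the number of acoustic units. The ANN yields the acoustic unit probability vector sequence $z_1, \dots, z_T$ with $z_t = [z_t^1, \dots, z_t^D]^T$. The HMM state sequence $l^1, l^2, l^3$ has transitions $a_{01}, a_{11}, a_{12}, a_{22}, a_{23}, a_{33}, a_{34}$, and each state carries lexical model parameters $y_i = [y_i^1, \dots, y_i^D]^T$.]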

13 KL-HMM
Features: posterior probability estimates of acoustic units, $z_t = [z_t^1, \dots, z_t^d, \dots, z_t^D]^T$, $z_t^d = p(a^d \mid x_t)$
State distribution: categorical distribution, $y_i = [y_i^1, \dots, y_i^d, \dots, y_i^D]^T$, $y_i^d = P(a^d \mid l^i)$
Local score: Kullback-Leibler (KL) divergence, $S(z_t, y_i) = \sum_{d=1}^{D} z_t^d \log\left(\frac{z_t^d}{y_i^d}\right)$
Parameter estimation: Viterbi Expectation Maximization algorithm with a cost function based on KL-divergence
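The local score and one re-estimation step can be sketched as follows. For the form of the score written on this slide, with $z_t$ in the leading position, the state vector that minimises the summed score over the frames aligned to a state is the arithmetic mean of those frames. The function names, the `eps` floor and the toy posteriors are assumptions for illustration.

```python
import math


def kl_local_score(z_t, y_i, eps=1e-10):
    """S(z_t, y_i) = sum_d z_t^d * log(z_t^d / y_i^d)."""
    score = 0.0
    for z, y in zip(z_t, y_i):
        if z > 0.0:  # terms with z = 0 contribute nothing
            # eps (hypothetical floor) avoids division by zero for y = 0
            score += z * math.log(z / max(y, eps))
    return score


def update_state(aligned_frames):
    """Re-estimate y_i as the arithmetic mean of the frames aligned to state i."""
    n = len(aligned_frames)
    dim = len(aligned_frames[0])
    return [sum(z[d] for z in aligned_frames) / n for d in range(dim)]


# Toy posteriors over D = 3 acoustic units, all aligned to the same state.
frames = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
y = update_state(frames)
total = sum(kl_local_score(z, y) for z in frames)
print(y, total)
```

In Viterbi training the alignment step and this update step alternate until the total cost stops decreasing.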

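14 Decoding
[Figure: decoding. The word sequence W is hypothesised through the language model and the pronunciation lexicon. The lexical model gives a distribution over the acoustic units $a^1, a^2, \dots, a^D$ (lexical evidence) and the acoustic model gives a distribution over the same units (acoustic evidence). The match between acoustic and lexical evidence is measured by the KL-divergence.]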
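The matching step can be illustrated with a minimum-cost alignment of posterior frames to a left-to-right state sequence, using the KL-divergence as the local cost. This is a deliberately reduced sketch: it ignores transition costs, the language model and the search over competing word hypotheses, and the function names and toy numbers are assumptions.

```python
import math


def kl(z, y, eps=1e-10):
    """KL-divergence local score between a posterior frame z and a state vector y."""
    return sum(zd * math.log(zd / max(yd, eps)) for zd, yd in zip(z, y) if zd > 0.0)


def viterbi_align(frames, states):
    """Minimum-cost left-to-right alignment; each frame stays or moves one state on.

    Assumes len(frames) >= len(states) so that a complete path exists.
    """
    T, N = len(frames), len(states)
    inf = float("inf")
    cost = [[inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    cost[0][0] = kl(frames[0], states[0])  # the path must start in the first state
    for t in range(1, T):
        for i in range(N):
            stay = cost[t - 1][i]
            move = cost[t - 1][i - 1] if i > 0 else inf
            if move < stay:
                best, back[t][i] = move, i - 1
            else:
                best, back[t][i] = stay, i
            if best < inf:
                cost[t][i] = best + kl(frames[t], states[i])
    # Trace back from the last state at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return cost[T - 1][N - 1], path


states = [[0.9, 0.1], [0.1, 0.9]]
frames = [[0.8, 0.2], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
total_cost, path = viterbi_align(frames, states)
print(path)  # [0, 0, 1, 1]
```

A full decoder would run the same recursion over a network of word hypotheses and add language model scores to the accumulated cost.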

15 Advantage 1: Resource Optimization
[Figure: ASR pipeline (speech signal -> feature extraction -> acoustic features -> acoustic model -> lexical model -> decoder -> "I Read that book"), annotated with language dependent data and language independent data as training resources. Pronunciation lexicon: read R E A D; thing T H I N G; that T H A T. Subword unit probabilities shown: /r/ 0.8, /eh/ 0.5, /d/ 0.9, /er/ 0.1, /iy/ 0.4, /t/ 0.09 and /r/ 0.9, /eh/ 0.8, /d/ 0.8, /er/ 0.09, /iy/ 0.1, /t/ 0.15. A language model feeds the decoder.]

16 Advantage 2: Grapheme Subword Units
[Figure: the same ASR pipeline (speech signal -> feature extraction -> acoustic features -> acoustic model -> lexical model -> decoder -> "I Read that book"), with the linguist and a grapheme-based pronunciation lexicon: read R E A D; thing T H I N G; that T H A T. Subword unit probabilities shown: /r/ 0.9, /eh/ 0.8, /d/ 0.8, /er/ 0.09, /iy/ 0.1, /t/ 0.15 and /r/ 0.8, /eh/ 0.5, /d/ 0.9, /er/ 0.1, /iy/ 0.4, /t/ 0.09. A language model feeds the decoder.]

17 Task
Build a speech recognition system for Greek, but with:
Limited transcribed speech data
No pronunciation lexicon
Borrow resources from French, German, Italian, Spanish and English: language independent (LI) data

18 Systems
KL-HMM: KL-HMM: Greek; ANN: LI
Tandem: Trees: Greek; GMM: Greek; ANN: LI; ANN output -> Log -> PCA -> features v_t
MAP adaptation: Trees: LI; GMM: Train: LI, Adapt: Greek
HMM/GMM: Trees: Greek; GMM: Greek

19 Results
[Figure: four panels (KL-HMM, Tandem, MAP adaptation, HMM/GMM), each plotting word accuracy in % against amount of training data in minutes, with one curve for phone and one for grapheme subword units.]

20 Advantage 3: Pronunciation Variability Modeling
Train data: native speech
Test data: native and non-native speech
Systems compared: HMM/GMM (Trees, GMMs); Hybrid HMM/ANN (Trees, ANN); KL-HMM (KL-HMM over acoustic units $a^1, \dots, a^D$, ANN)

21 Results
[Figure: word accuracy against number of clustered CD units, including 57 (mono), for HMM/GMM-Overall, Hybrid HMM/ANN-Overall and KL-HMM-Overall.]

22 Conclusions
KL-HMM approach for speech recognition:
1. Efficient resource sharing
2. Suitable for both grapheme and phone based pronunciation lexicons
3. Suitable when the task is challenged by both transcribed speech and pronunciation resource constraints
4. Performs better or comparably in well-resourced conditions

23 Thank you for your attention. Questions?
