Resource Optimized Speech Recognition using Kullback-Leibler Divergence based HMM


Ramya Rasipuram, David Imseng, Marzieh Razavi, Mathew Magimai Doss, Hervé Bourlard
24 October 2014

Automatic Speech Recognition (ASR)

Speech signal → feature extraction → acoustic features → acoustic model (likelihood/probabilities of sounds) → decoder → "Good Morning"

Pronunciation lexicon: good /g/ /uh/ /d/; god /g/ /aa/ /d/; cat /k/ /aa/ /t/
Acoustic probabilities (example): /g/ 0.9, /uh/ 0.7, /d/ 0.8, /k/ 0.01, /aa/ 0.1, /t/ 0.15
Language model (example): good 0.7, god 0.01, cat 0.001, plus entries such as "morning"
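The decoding idea on this slide can be sketched in a few lines: score each lexicon word by combining the acoustic probabilities of its phones with its language model prior, then pick the best word. The probabilities are the slide's illustrative numbers; `word_score` is a helper name of ours, not something from the talk.

```python
# Toy sketch of the ASR pipeline's final step: combine the pronunciation
# lexicon, acoustic probabilities and language model to pick a word.
# All numbers are the slide's illustrative examples.
import math

lexicon = {
    "good": ["/g/", "/uh/", "/d/"],
    "god":  ["/g/", "/aa/", "/d/"],
    "cat":  ["/k/", "/aa/", "/t/"],
}
acoustic = {"/g/": 0.9, "/uh/": 0.7, "/d/": 0.8,
            "/k/": 0.01, "/aa/": 0.1, "/t/": 0.15}
language = {"good": 0.7, "god": 0.01, "cat": 0.001}

def word_score(word):
    # Log-domain sum of the per-phone acoustic scores plus the LM prior.
    return sum(math.log(acoustic[p]) for p in lexicon[word]) + math.log(language[word])

best = max(lexicon, key=word_score)
print(best)  # "good": its phones match the acoustic evidence best
```

A real decoder searches over whole word sequences and their time alignments; this toy collapses the search to a single isolated word.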

Hidden Markov Models (HMMs) for ASR

Hypothesis word sequence ("Is it?"): IS IT, from the language model
Pronunciation lexicon (deterministic): subwords /ih/ /z/ /ih/ /t/
Context-dependent subwords: /sil ih+z/ /ih z+ih/ /z ih+t/ /ih t+sil/
HMM states (e.g. s_ih_2), connected to the acoustic features through the acoustic model

Standard HMM-based ASR

Hierarchy: word sequence W → pronunciation lexicon → subwords → (lexical model) → HMM states → (acoustic model) → acoustic features
Acoustic model: 1 GMM, giving HMM/GMM; 2 ANN, giving hybrid HMM/ANN
Lexical model: 1 deterministic, based on decision trees

Resources for ASR

The same pipeline, annotated with the resource each component needs:
- Pronunciation lexicon (good /g/ /uh/ /d/, god /g/ /aa/ /d/, cat /k/ /aa/ /t/): written by a linguist
- Acoustic model (likelihood/probabilities of sounds): trained on transcribed speech data
- Language model (good 0.7, god 0.01, cat 0.001): trained on text data

ASR for Under-Resourced Languages

- Limited or no transcribed speech
- Linguistic expertise may not be available
- Limited or no text resources

Limited Transcribed Speech Data

- Borrow resources: the sounds of languages, i.e. phonemes, can be shared across languages
- The pronunciation lexicon is important

Example from the slide: the grapheme "a" maps to different phonemes across languages (English ei; German a, a:; French A, a, E; Greek α → a), and similarly for "b" (English b i:; German b e:; Greek β → v).

Conventional Approaches

- MAP adaptation: an HMM/GMM with language-independent (LI) decision trees; the GMMs are trained on LI data and adapted on language-dependent (LD) data
- Tandem: an ANN trained on LI data; its log posteriors, after PCA, become the features v_t of an HMM/GMM with LD trees and LD GMMs

LI: language-independent data from resource-rich language(s)
LD: language-dependent data from the under-resourced language

No Pronunciation Lexicon

1 Pay a linguist: expensive, time consuming
2 Graphemes as subword units: easy, but not optimal

Example word "read": phone pronunciations r eh d / r iy d; grapheme pronunciation R E A D
Context-dependent graphemes are clustered with decision trees, and the clustered context-dependent graphemes are modeled with GMMs.
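The "easy" half of this trade-off fits in two lines: a grapheme lexicon is derived directly from spelling. `grapheme_lexicon` is our illustrative helper, not code from the talk.

```python
# Sketch of why graphemes are "easy": the lexicon comes straight from
# spelling, with no linguist involved.
def grapheme_lexicon(words):
    # Each word's "pronunciation" is simply its letter sequence.
    return {w: list(w.upper()) for w in words}

lex = grapheme_lexicon(["read", "thing", "that"])
print(lex["read"])  # ['R', 'E', 'A', 'D']
```

The "not optimal" half is visible too: "read" gets a single grapheme entry R E A D even though it has two phone pronunciations (r eh d, r iy d), so the grapheme units have to absorb that ambiguity.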

Limited Transcribed Speech and No Pronunciations

Multilingual graphemes? Worse than monolingual grapheme-based ASR.
The Latin-script languages on the slide (English, Spanish, Italian, German, French) share the graphemes a and b, but Greek writes α and β, so its graphemes cannot simply be borrowed.

Probabilistic Lexical Modeling

Instead of a deterministic mapping, each HMM state l_i holds a probability distribution over acoustic units:
0 < P(a^d | l_i) < 1, with sum_{d=1}^{D} P(a^d | l_i) = 1
Lexical model parameters: theta_l = {y_i}_{i=1}^{I}, where y_i = [y_i^1, ..., y_i^D]^T and y_i^d = P(a^d | l_i)
theta_l is estimated by training an HMM: the Kullback-Leibler divergence based HMM (KL-HMM) [1]

[1] Aradilla G., "Acoustic Models for Posterior Features in Speech Recognition", EPFL PhD Thesis, 2008

KL-HMM System

[Slide figure: a left-to-right HMM with states l_1, l_2, l_3 and transition probabilities a_01, a_11, a_12, a_22, a_23, a_33, a_34; each state l_i carries lexical parameters y_i = [y_i^1, ..., y_i^D]^T. An ANN takes the acoustic observation sequence x_1, ..., x_T (PLP features, a window of ±4 frames) and outputs, per frame, an acoustic unit probability vector z_t = [z_t^1, ..., z_t^D]^T with z_t^d = p(a^d | x_t); D is the number of acoustic units a^1, ..., a^D.]

KL-HMM

Features: posterior probability estimates of acoustic units
z_t = [z_t^1, ..., z_t^d, ..., z_t^D]^T, z_t^d = p(a^d | x_t)
State distribution: categorical distribution
y_i = [y_i^1, ..., y_i^d, ..., y_i^D]^T, y_i^d = P(a^d | l_i)
Local score: Kullback-Leibler (KL) divergence
S(z_t, y_i) = sum_{d=1}^{D} z_t^d log(z_t^d / y_i^d)
Parameter estimation: Viterbi Expectation Maximization algorithm with a cost function based on the KL divergence
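As a small sanity check, the local score above can be computed directly; the posterior vectors below are made-up values, not numbers from the talk, and the state distributions are assumed to have no zero entries.

```python
# KL-HMM local score: KL divergence between a frame's ANN posterior
# vector z_t and a state's categorical distribution y_i.
import math

def kl_local_score(z_t, y_i):
    # S(z_t, y_i) = sum_d z_t[d] * log(z_t[d] / y_i[d]),
    # with the convention 0 * log(0/y) = 0 (handled by the z > 0 filter);
    # assumes y_i has no zero entries.
    return sum(z * math.log(z / y) for z, y in zip(z_t, y_i) if z > 0)

z_t = [0.7, 0.2, 0.1]      # acoustic-unit posteriors for one frame
y_good = [0.6, 0.3, 0.1]   # state whose distribution matches the frame
y_bad = [0.1, 0.1, 0.8]    # mismatched state
print(kl_local_score(z_t, y_good) < kl_local_score(z_t, y_bad))  # True
```

A lower score means the frame's posteriors match the state's distribution better, which is why both training and decoding minimize the summed KL divergence along the state sequence.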

Decoding

The hypothesis word sequence W passes through the language model and pronunciation lexicon to the lexical model, which gives each state a distribution over the acoustic units a^1, ..., a^D; the decoder scores the match between this lexical evidence and the acoustic evidence (ANN posteriors over a^1, ..., a^D) with the KL divergence.
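A hedged sketch of how such matching can be carried out for a single word model: Viterbi dynamic programming over a left-to-right state sequence, with the KL divergence as the local cost. The two-state model and three frames are toy values; the slide shows only the matching idea, not this exact algorithm.

```python
# Viterbi alignment of posterior frames to a left-to-right KL-HMM:
# at each frame, stay in the current state or advance to the next,
# minimizing the summed KL local cost. Toy values, assumed y > 0.
import math

def kl(z, y):
    # KL divergence with the 0 * log(0/y) = 0 convention.
    return sum(zd * math.log(zd / yd) for zd, yd in zip(z, y) if zd > 0)

def viterbi_align(frames, states):
    INF = float("inf")
    cost = [[INF] * len(states) for _ in frames]
    cost[0][0] = kl(frames[0], states[0])
    for t in range(1, len(frames)):
        for i in range(len(states)):
            # Best predecessor: same state (self-loop) or previous state.
            prev = min(cost[t - 1][i], cost[t - 1][i - 1] if i > 0 else INF)
            if prev < INF:
                cost[t][i] = prev + kl(frames[t], states[i])
    return cost[-1][-1]  # best total cost ending in the final state

states = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]                     # 2-state word model
frames = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]    # 3 posterior frames
print(round(viterbi_align(frames, states), 3))
```

The frames here drift from the first state's distribution to the second's, so the best alignment assigns the first two frames to state 1 and the last to state 2; a full decoder would run this competition over all word models.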

Advantage 1: Resource Optimization

The architecture splits cleanly by resource: the acoustic model (probabilities of sounds, e.g. /r/ 0.9, /eh/ 0.8, /d/ 0.8, /er/ 0.09, /iy/ 0.1, /t/ 0.15) can be trained on language-independent data, while the lexical model, pronunciation lexicon (read R E A D, thing T H I N G, that T H A T) and language model are trained on language-dependent data. Example decoder output: "I read that book".

Advantage 2: Grapheme Subword Units

No linguist is needed: the pronunciation lexicon is built from spelling (read R E A D, thing T H I N G, that T H A T), and the probabilistic lexical model learns the grapheme-to-sound relationship from data.

Task

Build a speech recognition system for Greek, but with:
- Limited transcribed speech data
- No pronunciation lexicon
Borrow resources from French, German, Italian, Spanish and English: the language-independent (LI) data.

Systems

- KL-HMM: Greek lexical model (over acoustic units a^1, ..., a^D) on top of an ANN trained on LI data
- Tandem: the LI ANN's log posteriors, after PCA, become features v_t for an HMM/GMM with Greek trees and Greek GMMs
- MAP adaptation: HMM/GMM with LI trees, trained on LI data and adapted on Greek
- HMM/GMM: Greek trees and Greek GMMs (monolingual baseline)

Results

[Slide figure: four panels (KL-HMM, Tandem, MAP adaptation, HMM/GMM) plotting word accuracy in % (50-85) against the amount of Greek training data in minutes (5, 9, 18, 37, 75, 150, 300, 800), each with a phone-based and a grapheme-based curve.]

Advantage 3: Pronunciation Variability Modeling

Train data: native speech. Test data: native and non-native speech.
Systems compared: HMM/GMM (decision trees + GMMs), hybrid HMM/ANN (decision trees + ANN), and KL-HMM (probabilistic lexical model over a^1, ..., a^D + ANN).

Results

[Slide figure: overall word accuracy in % (60-78) against the number of clustered CD (context-dependent) units (57 monophone, 195, 385, 549, 759, 1101, 3000) for HMM/GMM, hybrid HMM/ANN and KL-HMM.]

Conclusions

The KL-HMM approach to speech recognition:
1 Shares resources efficiently
2 Suits both grapheme- and phone-based pronunciation lexicons
3 Suits tasks constrained in both transcribed speech and pronunciation resources
4 Performs better than, or comparably to, conventional systems in well-resourced conditions

Thank you for your attention. Questions?