A Transfer Learning Approach for Under-Resourced Arabic Dialects Speech Recognition

Mohamed Elmahdy *, Mark Hasegawa-Johnson, Eiman Mustafawi *
* Qatar University, Doha, Qatar
University of Illinois at Urbana-Champaign, USA
melmahdy@ieee.org, jhasegaw@illinois.edu, eimanmus@qu.edu.qa

Abstract

A major problem with dialectal Arabic speech recognition is due to the sparsity of speech resources. In this paper, we propose a transfer learning framework to jointly use a large amount of Modern Standard Arabic (MSA) data and a small amount of dialectal Arabic data to improve acoustic and language modeling. We have chosen the Qatari Arabic (QA) dialect as a typical example of an under-resourced Arabic dialect. A wide-band speech corpus has been collected and transcribed from several Qatari TV series and talk-show programs. A large vocabulary speech recognition baseline system was built using the QA corpus. The proposed MSA-based transfer learning technique was performed by applying orthographic normalization, phone mapping, data pooling, acoustic model adaptation, and system combination. The proposed approach can achieve more than 28% relative reduction in WER.

Keywords: dialectal Arabic, acoustic modeling, language modeling, adaptation, cross-lingual

1. Introduction

Arabic is the largest still-living Semitic language in terms of the number of speakers. More than 300 million people use Arabic as their first native language, and it is the 6th most widely used language based on the number of first-language speakers. Modern Standard Arabic (MSA) is currently considered the formal Arabic variety across all Arabic speakers. MSA is used in news broadcasts, newspapers, formal speech, books, movie subtitling, and whenever the target audience or readers come from different nationalities. Practically, MSA is not the natural spoken language for native Arabic speakers. MSA is always a second language for all Arabic speakers. In fact, dialectal (or colloquial) Arabic is the natural spoken variety of Arabic in everyday life communications.
A significant problem in Arabic automatic speech recognition (ASR) is the existence of many different Arabic dialects (Egyptian, Levantine, Iraqi, Gulf, etc.). Every country has its own dialect, and usually there exist different dialects within the same country. Moreover, the different Arabic dialects are only spoken and not formally written, and significant phonological, morphological, syntactic, and lexical differences exist between the dialects and the standard form. This situation is called diglossia (Ferguson, 1959). Because of the diglossic nature of dialectal Arabic, little research has been done in dialectal Arabic ASR, or in the use of dialect in any natural language processing task. For MSA, on the other hand, a lot of research has been conducted. The limited research done for dialectal Arabic ASR is also due to the sparsity of dialectal speech resources for training different ASR models.

To tackle the problem of data sparsity, Kirchhoff and Vergyri (2005) proposed a cross-lingual approach where they used pooled MSA and dialectal speech data in training the acoustic model and achieved around 3% relative reduction in WER. Similarly, in (Huang and Hasegawa-Johnson, 2012), a joint cross-lingual training method based on the similarity between phonemes in MSA and dialectal speech data also showed improvements in phone classification tasks. Elmahdy et al. (2010) proposed another cross-lingual approach based on acoustic model adaptation, which resulted in about 12% relative reduction in WER. Acoustic model adaptation can perform better than data pooling when dialectal speech data are very limited compared to existing MSA data, and adaptation may avoid the masking of dialectal acoustic features by large MSA data that occurs in the data pooling approach.

In the DARPA GALE project (Mangu et al., 2011), the acoustic model was trained using a large amount of speech data collected from various news channels. Evaluation was performed on news speech and conversational speech. Conversational speech is mostly spontaneous and includes a significant percentage of dialectal Arabic as well as MSA.
However, the system was not evaluated or adapted with a specific under-resourced Arabic dialect. Moreover, most of the conversational data in the GALE project come from news broadcasts, and we have noticed that the majority of speakers tend to speak in MSA rather than in their own Arabic dialect.

In this paper, we have chosen Qatari Arabic (QA) [1] as a typical example of an under-resourced Arabic dialect. Despite the huge differences between QA and MSA, we show how to benefit from large existing MSA speech and text resources. In the proposed framework, MSA data and QA data are jointly used in training improved acoustic and language models for QA. Since transcription conventions may be different between MSA and dialectal Arabic, we show how to apply phone mapping across MSA and dialectal Arabic. In addition, we propose to apply data pooling followed by acoustic model adaptation for cross-lingual acoustic modeling and interpolation for cross-lingual language modeling. Our assumption is that the contribution of limited dialectal speech data in a pooled acoustic model depends on the ratio between MSA data and dialectal data. Usually, there are far more data available in MSA than in the dialect, so we expect little contribution of dialectal data to the final pooled acoustic model. In order to boost the weight of dialectal features, acoustic model adaptation techniques are applied on the pooled acoustic model using dialectal speech data. All our experiments have been conducted with QA in the domain of TV broadcasts.

[1] QA is the Arabic dialect spoken in Qatar and it is a subvariety of the Gulf dialect.

The remainder of this paper is organized as follows: Section 2 introduces the MSA and QA speech corpora. Sections 3 and 4 present our speech recognition system and the baseline approach, respectively. Our proposed cross-lingual language modeling and acoustic modeling are discussed in Sections 5 and 6, respectively. Section 7 discusses the experimental results. Section 8 concludes this study.

2. Speech Corpora

2.1. Modern Standard Arabic

The MSA corpus has been collected from the domain of news broadcasts. The corpus consists of two speech resources from the European Language Resources Association (ELRA). All resources are recorded in linear PCM format, 16 kHz, and 16 bit. The ELRA speech resources are:
The NEMLAR Broadcast News Speech Corpus, which consists of about 40 hours from different radio stations: Medi1, Radio Orient, Radio Monte Carlo, and Radio Television Maroc.
The NetDC Arabic Broadcast News Speech Corpus, which contains about 22.5 hours recorded from Radio Orient.
Detailed composition of the resources is shown in Table 1.

Source               Duration (hrs)
Radio Orient         34.6
Medi1                9.5
Radio Monte Carlo    9.0
Radio Tele. Maroc    9.3
Total                62.4

Table 1. Composition of the MSA speech corpus.

2.2. Qatari Arabic Corpus

We have collected the QA corpus from different TV series and talk show programs.
Data are selected from programs in which the majority of speech segments is in QA; segments from each program are selected after audition confirms the quality of the speech signal. The programs are: Tesaneef (popular Qatari series with almost 100% in QA), Sabah El-Doha (talk show with almost 80% in QA), and some episodes from Al-Jazeerah, which are selected if guest speakers are speaking Qatari dialect. The corpus is recorded in linear PCM, 16 kHz, and 16 bits. The overall length is 15 hours. Detailed composition is shown in Table 2.

Transcription is performed manually in traditional Arabic orthography. Five more Persian letters are used to indicate non-standard Arabic consonants. The letter چ denotes the /ʧ/ consonant, گ denotes /ɡ/, ڤ denotes /v/, ژ denotes /ʒ/, and پ denotes /p/. Some diacritic marks are added for ambiguous words. The following non-speech filler tags are transcribed: pause, breath, laugh, ah, noise, and music. Speech segmentation is done with a 10 second maximum for each segment delimited by filler tags.

The QA corpus is divided into a training set of 13 hours, a development set of 1 hour, and an evaluation set of 2 hours. The training set is used either to train the QA baseline acoustic model or to adapt the existing MSA acoustic model.

Source                     Duration (hrs)
Tesaneef series            9.3
Sabah El-Doha talk show    2.0
Al-Jazeerah programs       3.7
Total                      15.0

Table 2. Composition of the QA corpus.

3. System Description

Our system is a GMM-HMM architecture based on the Kaldi speech recognition engine (Povey et al., 2011). Acoustic models are all fully continuous density context-dependent tri-phones with 3 states per HMM trained with Maximum Mutual Information Estimation (MMIE). The feature vector consists of the standard 39-dimensional MFCC coefficients. During acoustic model training, linear discriminant analysis (LDA) and maximum likelihood linear transform (MLLT) are applied to reduce dimensionality, which improves accuracy as well as recognition speed. Feature-space MLLR (fMLLR) was used for Speaker Adaptive Training (SAT) of the acoustic models.
The first decoding pass uses a relatively smaller language model of around 800K n-grams. Then in the second pass, the generated trigram lattices are rescored against a larger trigram model of around 10M n-grams.

4. Baseline System

4.1. Acoustic Modeling

We have adopted grapheme-based acoustic modeling (also known as graphemic modeling). Graphemic modeling is an acoustic modeling approach where the phonetic transcription is approximated to be the word graphemes rather than the exact phoneme sequence. Short vowels and geminations are assumed to be implicitly modeled in the acoustic model (Vergyri et al., 2005; Billa et al., 2002). The baseline acoustic model is trained with the QA training set. The optimized number of tied-states and Gaussian mixtures per state are found to be 1000 and 8, respectively. Each grapheme letter is mapped to a unique model, resulting in a total number of 41 base units (36 letters in the standard Arabic alphabet and 5 Persian letters).

4.2. Language Modeling

The language model is a backoff tri-gram model with Modified Kneser-Ney smoothing. The baseline language model has been trained with the transcriptions of the QA training set (65K words). The vocabulary size is about 15.5K unique words. LM training parameters have been optimized to minimize the perplexity of the QA development set. The evaluation of the language model against the transcriptions of the evaluation set results in an OOV rate of 22.2% and a perplexity of [...], whilst on the development set, it results in an OOV rate of 18.4% and a perplexity of [...], as shown in Table 4. We could not observe any improvement in speech recognition accuracy by increasing the order to 4-grams, apparently because of the limited amount of QA training text, which can result in more sparsity in higher order n-grams.

4.3. Evaluation Settings

For the QA baseline system, batch decoding resulted in a WER of 61.7% on the QA development set and 80.8% on the evaluation set as shown in Table 3. By examining results, we find that about 1.0% of the errors are caused by either: the different forms of Alef (e.g. ا instead of أ), final Teh Marbuta (ة instead of ه or vice versa), or final Alef Maksura (ى instead of ي or vice versa). Since there is no standard orthographic form for dialectal Arabic and these kinds of errors are already common orthographic variants in dialectal Arabic, we decide to ignore these types of errors by normalizing both hypothesis and reference, before alignment, as follows:
Normalizing all forms of Hamzated Alef (أ إ آ) to ا.
Normalizing final Yeh ي to Alef Maksura ى.
Normalizing Teh Marbuta ة to Heh ه.

After applying orthographic normalization, absolute WER decreases to 60.9% on the dev. set with 1.3% relative reduction and 79.9% on the eval. set with 1.1% relative reduction as shown in Table 3.

          QA Baseline    + Orthographic norm.
dev.      61.7%          60.9%
eval.     80.8%          79.9%

Table 3. Word Error Rate (WER) (%) evaluation of the QA baseline system with and without orthographic normalization on the development set and the evaluation set.

5. Cross-Lingual Language Modeling

In the baseline system, a significant percentage of errors is mainly due to the high OOV rate that exceeds 18%. In an attempt to improve the LM, we trained an MSA trigram LM using the LDC Gigaword corpus (Parker et al., 2009) that consists of more than 800M words. The MSA vocabulary consists of the top 256K words in the corpus. The evaluation of the MSA LM resulted in a perplexity of [...] and [...] on the dev. and eval. sets respectively as shown in Table 4. The OOV rate was found to be 22.3% and 22.1% on the dev. and eval. sets respectively as shown in Table 4.

In order to decrease OOV, we have linearly interpolated both the QA LM and the MSA LM. Interpolation weights were optimized on the dev. set. The cross-lingual interpolation resulted in a vocabulary size of 265.7K words. The OOV rate is significantly decreased to 8.9% and 9.2% on the dev. and eval. sets respectively as shown in Table 4. The perplexity test resulted in [...] and [...] on the dev. and eval. sets respectively. Using the cross-lingual MSA/QA LM, batch decoding resulted in absolute WER of 56.0% and 64.4% on the dev. and eval. sets respectively with significant relative reduction of 3.6% and 16.3% compared to the baseline as shown in Table 5.

LM        Vocab.    Perp. dev.    Perp. eval.    OOV (%) dev.    OOV (%) eval.
QA        15.5K     [...]         [...]          18.4            22.2
MSA       256K      [...]         [...]          22.3            22.1
QA/MSA    265.7K    [...]         [...]          8.9             9.2

Table 4. Language models evaluation with development set and evaluation set.

6. Cross-Lingual Acoustic Modeling

6.1. MSA Acoustic Model

In this section, we describe how to use an MSA acoustic model to decode QA speech. Initially, that is not possible because of the mismatch between the phone sets of MSA and QA.
This mismatch is solved by applying phone mapping. Consonants that do not exist in MSA have been mapped to the closest ones in MSA as follows:
/ɡ/ and /ʒ/ are mapped to /ʤ/.
/ʧ/ is mapped to /t/ followed by /ʃ/.
/v/ is mapped to /f/.
/p/ is mapped to /b/.
After applying QA phone mapping, an MSA graphemic acoustic model is trained using the 62.4-hour MSA corpus. Decoding results are an absolute WER of 61.9% and 81.3% on the dev. and eval. sets respectively, with 1.6% and 1.8% relative increase compared to the QA baseline as shown in Table 5. This relative increase is expected as the MSA acoustic model does not yet cover all QA dialect-specific features.
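
The phone mapping described in this section can be sketched as a simple lookup table. This is a minimal illustration, assuming phones are represented as IPA strings in a Python list; the names `QA_TO_MSA` and `map_phones` are hypothetical and are not part of the authors' Kaldi setup.

```python
# Hypothetical lookup table built from the mapping rules listed in Section 6.1.
# Each QA-specific consonant maps to a list of one or more MSA phones.
QA_TO_MSA = {
    "ɡ": ["ʤ"],
    "ʒ": ["ʤ"],
    "ʧ": ["t", "ʃ"],  # one QA phone becomes a two-phone MSA sequence
    "v": ["f"],
    "p": ["b"],
}


def map_phones(phones):
    """Rewrite a QA phone sequence using only phones that exist in MSA."""
    mapped = []
    for phone in phones:
        # Phones without an entry already exist in MSA and pass through unchanged.
        mapped.extend(QA_TO_MSA.get(phone, [phone]))
    return mapped


example = map_phones(["ʧ", "a", "p"])
print(example)
```

Because /ʧ/ expands to two phones, the mapped sequence can be longer than the input, so any alignment that depends on phone counts would need to be recomputed after mapping.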

6.2. Data Pooling

In data pooling acoustic modeling, we have jointly trained the acoustic model using both QA and MSA data. Decoding results are an absolute WER of 56.6% and 64.4% on the dev. and eval. sets respectively, outperforming the baseline by a relative decrease of 7.1% and 19.4% as shown in Table 5.

6.3. Acoustic Model Adaptation

In this section, we apply acoustic model adaptation techniques on the MSA model using QA speech data. Maximum Likelihood Linear Regression (MLLR) (Leggetter and Woodland, 1995) followed by Maximum A-Posteriori (MAP) re-estimation (Lee and Gauvain, 1993) is applied. Decoding results are an absolute WER of 57.3% and 65.9% on the dev. and eval. sets respectively, outperforming the baseline by a relative decrease of 5.9% and 17.5% as shown in Table 5.

6.4. Combined Data Pooling and Acoustic Model Adaptation

Data pooling and acoustic model adaptation have been combined in this section. Acoustic model adaptation is applied on the MSA/QA pooled model rather than the MSA model. Decoding results are an absolute WER of 55.6% and 62.5% on the dev. and eval. sets respectively, outperforming the baseline by a significant relative decrease of 8.7% and 21.8% as shown in Table 5.

6.5. System Combination

In this section, we combine different systems to further improve accuracy using Minimum Bayes-Risk (MBR) decoding (Goel and Byrne, 2000). MBR is applied on the generated lattices from the two systems:
1. QA AM (sys. 1 in Table 5).
2. QA/MSA pool/adapt AM (sys. 5 in Table 5).
In both systems, the QA/MSA interpolated LM is used. System combination using lattice MBR resulted in an absolute WER of 47.9% and 56.8% on the dev. and eval. sets respectively, outperforming the baseline system by a relative decrease of 21.3% and 28.9% as shown in Table 5.

sys.    AM                  dev.     eval.
1       QA                  56.0%    64.4%
2       MSA                 61.9%    81.3%
3       QA/MSA pool         56.6%    64.4%
4       QA/MSA adapt        57.3%    65.9%
5       QA/MSA pool/adapt   55.6%    62.5%
1+5     MBR                 47.9%    56.8%

Table 5. WER on QA dev. and eval. sets using QA/MSA LM and various acoustic model configurations.

The strategy of data pooling, followed by MLLR+MAP adaptation, is equivalent to a type of iterative transformation and adaptive re-weighting of the QA data relative to the MSA data.
For example, the mean vector of the kth Gaussian, computed by the final stage of MAP adaptation, is given by

$$\hat{\mu}_k = \frac{\sum_{t=1}^{T} \gamma_k(t)\, x_t + \tau A \mu_k}{\sum_{t=1}^{T} \gamma_k(t) + \tau}, \qquad (1)$$

where $x_t$, $1 \le t \le T$, is a dialectal feature vector, $\gamma_k(t)$ is the posterior probability of the kth Gaussian given $x_t$, $\tau$ is the weight of the prior, $\mu_k$ is the mean prior to adaptation, and $A$ is the corresponding MLLR transformation. But notice that, in turn, $\mu_k$ is given by

$$\mu_k = \frac{1}{N_k} \sum_{t=1}^{T+S} \xi_k(t)\, x_t, \qquad N_k = \sum_{t=1}^{T+S} \xi_k(t), \qquad (2)$$

where $x_t$, for $T+1 \le t \le T+S$, is an MSA feature vector, and $\xi_k(t)$ is the weighting coefficient computed during the last round of maximum-likelihood EM training applied to the pooled MSA and QA datasets. By combining Eq. (1) and (2), we discover that MAP adaptation is similar to an adaptive re-weighting scheme, such that QA feature vectors are weighted comparably to MSA feature vectors during the initial EM training, then transformed by $A$, and then re-weighted to an increased final weight of $\frac{\tau}{N_k}\xi_k(t) + \gamma_k(t)$. The effective weight of each MSA datum is similarly decreased, during MAP adaptation, to only $\frac{\tau}{N_k}\xi_k(t)$. The effect of this iterative strategy is to give greater weight to MSA data during the initial training of the model, when the MSA data may be useful to help the learning algorithm avoid spurious local optima in the likelihood function; after the model parameters have converged to a solution that is optimal for the pooled MSA+QA data, MLLR improves the representation of QA data, and, finally, MAP is used to increase the relative importance of QA data in the final training criterion.

7. Discussion

Even though the differences between MSA and Arabic dialects are large, to the extent that we can consider Arabic dialects as totally different languages (Ferguson, 1959), we can still benefit from MSA speech resources to improve dialectal Arabic speech recognition. The performance of the data pooling approach may be affected by the ratio of dialectal data amount to MSA data amount. In our case, the data pooling approach results in an absolute WER of 56.0% on dev. set and 64.4% on eval. set. MSA data amount is about five times the amount of dialectal data.
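
The interaction between the data ratio and the MAP re-weighting of Eq. (1) and (2) can be illustrated with a toy calculation. The sketch below is a strong simplification and not the authors' Kaldi recipe: features are scalars, there is a single Gaussian, all occupation weights and posteriors are set to 1, the MLLR transform A is replaced by a scalar `a`, and the names `pooled_mean`, `map_mean`, and `tau` are hypothetical.

```python
def pooled_mean(qa, msa):
    """Eq. (2) with every weighting coefficient set to 1: the plain pooled average."""
    data = qa + msa
    return sum(data) / len(data)


def map_mean(qa, prior_mean, tau, a=1.0, gamma=None):
    """Eq. (1) for one scalar Gaussian mean.

    `a` is a scalar stand-in for the MLLR transform A, `tau` is the prior
    weight, and `gamma` holds the per-sample posteriors (all 1.0 by default).
    """
    if gamma is None:
        gamma = [1.0] * len(qa)
    numerator = sum(g * x for g, x in zip(gamma, qa)) + tau * a * prior_mean
    denominator = sum(gamma) + tau
    return numerator / denominator


# Toy data: MSA samples outnumber QA samples five to one.
qa = [2.0] * 10
msa = [0.0] * 50

mu = pooled_mean(qa, msa)            # 20 / 60, dominated by the MSA data
mu_hat = map_mean(qa, mu, tau=10.0)  # (20 + 10/3) / 20, pulled toward the QA data
print(round(mu, 4), round(mu_hat, 4))  # 0.3333 1.1667
```

In this toy setting the pooled mean lies much closer to the MSA samples, and the MAP step moves it most of the way back toward the QA samples; a smaller `tau` would move it further.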
In order to boost the contribution of dialectal data, MLLR and MAP adaptations are then applied on the pooled acoustic model, effectively increasing the weight of dialectal acoustic features in the final cross-lingual model. The combination of data pooling followed by acoustic model adaptation results in a lower absolute WER of 55.6% on dev. set and 62.5% on eval. set. Lattice MBR decoding contributes a further reduction in WER, achieving 47.9% on dev. set and 56.8% on eval. set.

8. Conclusions and Future Work

In this paper, we propose a speech recognition system for Qatari Colloquial Arabic (QA). To compensate for the limited dialectal resources, the proposed method utilizes MSA data through cross-dialectal phone mapping, data pooling, acoustic model adaptation, and system combination, and has achieved 21.3% and 28.9% relative WER reduction on the QA development set and evaluation set respectively. For future work, it is possible to extend the current framework to other dialect speech recognition systems. Moreover, some future directions are to incorporate recent achievements in transfer learning and domain adaptation to further improve the system performance (Pan and Yang, 2010). In addition, the cross-lingual training and adaptation can be bidirectional; a multi-task framework of Arabic speech recognition can be formulated so that both MSA and dialectal recognition performance can be enhanced simultaneously (Caruana, 1997).

9. Acknowledgment

This publication was made possible by a grant from the Qatar National Research Fund under its National Priorities Research Program (NPRP) award number NPRP [...]. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Qatar National Research Fund. We would also like to acknowledge the European Language Resources Association (ELRA) and the Linguistic Data Consortium (LDC) for providing us with data resources.

References

Billa, J., Noamany, M., Srivastava, A., Liu, D., Stone, R., Xu, J., Makhoul, J. and Kubala, F. (2002). Audio indexing of Arabic broadcast news. Proceedings of ICASSP, vol. 1.
Caruana, R. (1997). Multitask learning. Machine Learning, vol. 28, no. 1.
Elmahdy, M., Gruhn, R., Minker, W. and Abdennadher, S. (2010). Cross-lingual acoustic modeling for dialectal Arabic speech recognition. Proceedings of INTERSPEECH.
Ferguson, C.A. (1959). Diglossia. Word, vol. 15.
Goel, V. and Byrne, W. (2000). Minimum Bayes-Risk Automatic Speech Recognition. Computer Speech and Language, 14(2).
Huang, P.-S. and Hasegawa-Johnson, M. (2012). Cross-dialectal data transferring for Gaussian mixture model training in Arabic speech recognition. International Conference on Arabic Language Processing.
Kirchhoff, K. and Vergyri, D. (2005). Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition. Speech Communication, vol. 46(1).
Lee, C.-H. and Gauvain, J.-L. (1993). Speaker adaptation based on MAP estimation of HMM parameters. Proceedings of ICASSP, vol. II.
Leggetter, C.J. and Woodland, P.C. (1995). Maximum likelihood linear regression for speaker adaptation of the parameters of continuous density hidden Markov models. Computer Speech and Language, vol. 9.
Mangu, L., Kuo, H.-K., Chu, S., Kingsbury, B., Saon, G., Soltau, H. and Biadsy, F. (2011). The IBM 2011 GALE Arabic Speech Transcription System. Proceedings of ASRU.
NEMLAR Broadcast News Speech Corpus, ELRA catalog reference: ELRA-S0219, http://catalog.elra.info/product_info.php?products_id=874
NetDC Arabic BNSC (Broadcast News Speech Corpus), ELRA catalog reference: ELRA-S0157, http://catalog.elra.info/product_info.php?products_id=13
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10.
Parker, R., Graff, D., Chen, K., Kong, J. and Maeda, K. (2009). Arabic Gigaword Fourth Edition. Linguistic Data Consortium, Pennsylvania, LDC Catalog No.: LDC2009T30.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G. and Vesely, K. (2011). The Kaldi Speech Recognition Toolkit. Proceedings of IEEE ASRU.
Vergyri, D., Kirchhoff, K., Gadde, R., Stolcke, A. and Zheng, J. (2005). Development of a conversational telephone speech recognizer for Levantine Arabic. Proceedings of INTERSPEECH.
