A Transfer Learning Approach for Under-Resourced Arabic Dialects Speech Recognition

Mohamed Elmahdy*, Mark Hasegawa-Johnson, Eiman Mustafawi*
* Qatar University, Doha, Qatar
University of Illinois at Urbana-Champaign, USA
melmahdy@ieee.org, jhasegaw@illinois.edu, eimanmus@qu.edu.qa

Abstract

A major problem with dialectal Arabic speech recognition is the sparsity of speech resources. In this paper, we propose a transfer learning framework that jointly uses a large amount of Modern Standard Arabic (MSA) data and a small amount of dialectal Arabic data to improve acoustic and language modeling. We have chosen the Qatari Arabic (QA) dialect as a typical example of an under-resourced Arabic dialect. A wide-band speech corpus has been collected and transcribed from several Qatari TV series and talk-show programs. A large vocabulary speech recognition baseline system was built using the QA corpus. The proposed MSA-based transfer learning technique was performed by applying orthographic normalization, phone mapping, data pooling, acoustic model adaptation, and system combination. The proposed approach achieves more than 28% relative reduction in WER.

Keywords: dialectal Arabic, acoustic modeling, language modeling, adaptation, cross-lingual

1. Introduction

Arabic is the largest still-living Semitic language in terms of the number of speakers. More than 300 million people use Arabic as their first native language, and it is the 6th most widely used language based on the number of first-language speakers. Modern Standard Arabic (MSA) is currently considered the formal Arabic variety across all Arabic speakers. MSA is used in news broadcasts, newspapers, formal speech, books, movie subtitling, and whenever the target audience or readers come from different nationalities. In practice, MSA is not the natural spoken language for native Arabic speakers; it is always a second language. Dialectal (or colloquial) Arabic is the natural spoken variety of Arabic in everyday communication.
A significant problem in Arabic automatic speech recognition (ASR) is the existence of many different Arabic dialects (Egyptian, Levantine, Iraqi, Gulf, etc.). Every country has its own dialect, and usually different dialects exist within the same country. Moreover, the Arabic dialects are only spoken and not formally written, and significant phonological, morphological, syntactic, and lexical differences exist between the dialects and the standard form. This situation is called diglossia (Ferguson, 1959). Because of the diglossic nature of dialectal Arabic, little research has been done on dialectal Arabic ASR, or on the use of dialect in any natural language processing task. For MSA, on the other hand, a lot of research has been conducted. The limited research on dialectal Arabic ASR is also due to the sparsity of dialectal speech resources for training different ASR models.

To tackle the problem of data sparsity, Kirchhoff and Vergyri (2005) proposed a cross-lingual approach in which they used pooled MSA and dialectal speech data to train the acoustic model and achieved around 3% relative reduction in WER. Similarly, Huang and Hasegawa-Johnson (2012) proposed a joint cross-lingual training method based on the similarity between phonemes in MSA and dialectal speech data, which also showed improvements in phone classification tasks. Elmahdy et al. (2010) proposed another cross-lingual approach based on acoustic model adaptation, which resulted in about 12% relative reduction in WER. Acoustic model adaptation can perform better than data pooling when dialectal speech data are very limited compared to existing MSA data, and adaptation may avoid the masking of dialectal acoustic features by the large MSA data that occurs in the data pooling approach.

In the DARPA GALE project (Mangu et al., 2011), the acoustic model was trained using a large amount of speech data collected from various news channels. Evaluation was performed on news speech and conversational speech. Conversational speech is mostly spontaneous and includes a significant percentage of dialectal Arabic as well as MSA.
However, the system was not evaluated on or adapted to a specific under-resourced Arabic dialect. Moreover, most of the conversational data in the GALE project come from news broadcasts, and we have noticed that the majority of speakers tend to speak in MSA rather than in their own Arabic dialect.

In this paper, we have chosen Qatari Arabic (QA)¹ as a typical example of an under-resourced Arabic dialect. Despite the huge differences between QA and MSA, we show how to benefit from large existing MSA speech and text resources. In the proposed framework, MSA data and QA data are jointly used in training improved acoustic and language models for QA. Since transcription conventions may differ between MSA and dialectal Arabic, we show how to apply phone mapping across MSA and dialectal Arabic. In addition, we propose to apply data pooling followed by acoustic model adaptation for cross-lingual acoustic modeling, and interpolation for cross-lingual language modeling.

¹ QA is the Arabic dialect spoken in Qatar and is a subvariety of the Gulf dialect.

Our assumption is that the contribution of limited dialectal speech data in a pooled acoustic model depends on the ratio between MSA data and dialectal data. Usually, there are far more data available in MSA than in the dialect, so we expect little contribution of dialectal data to the final pooled acoustic model. In order to boost the weight of dialectal features, acoustic model adaptation techniques are applied to the pooled acoustic model using dialectal speech data. All our experiments have been conducted with QA in the domain of TV broadcasts.

The remainder of this paper is organized as follows: Section 2 introduces the MSA and QA speech corpora. Sections 3 and 4 present our speech recognition system and the baseline approach, respectively. Our proposed cross-lingual language modeling and acoustic modeling are discussed in Sections 5 and 6, respectively. Section 7 discusses the experimental results. Section 8 concludes this study.

2. Speech Corpora

2.1. Modern Standard Arabic

The MSA corpus has been collected from the domain of news broadcast. The corpus consists of two speech resources from the European Language Resources Association (ELRA). All resources are recorded in linear PCM format, 16 kHz, and 16 bit. The ELRA speech resources are:

- The NEMLAR Broadcast News Speech Corpus, which consists of about 40 hours from different radio stations: Medi1, Radio Orient, Radio Monte Carlo, and Radio Television Maroc.
- The NetDC Arabic Broadcast News Speech Corpus, which contains about 22.5 hours recorded from Radio Orient.

Detailed composition of the resources is shown in Table 1.

Source               Duration (hrs)
Radio Orient         34.6
Medi1                9.5
Radio Monte Carlo    9.0
Radio Tele. Maroc    9.3
Total                62.4

Table 1. Composition of the MSA speech corpus.

2.2. Qatari Arabic Corpus

We have collected the QA corpus from different TV series and talk show programs.
Data are selected from programs in which the majority of speech segments are in QA; segments from each program are selected after audition confirms the quality of the speech signal. The programs are: Tesaneef (a popular Qatari series with almost 100% in QA), Sabah El-Doha (a talk show with almost 80% in QA), and some episodes from Al-Jazeerah, selected if guest speakers are speaking the Qatari dialect. The corpus is recorded in linear PCM, 16 kHz, and 16 bits. The overall length is 15 hours. Detailed composition is shown in Table 2.

Transcription is performed manually in traditional Arabic orthography. Five additional Persian letters are used to indicate non-standard Arabic consonants: چ denotes /ʧ/, گ denotes /ɡ/, ڤ denotes /v/, ژ denotes /ʒ/, and پ denotes /p/. Some diacritic marks are added for ambiguous words. The following non-speech filler tags are transcribed: pause, breath, laugh, ah, noise, and music. Speech segmentation is done with a 10-second maximum for each segment, delimited by filler tags.

The QA corpus is divided into a training set of 13 hours, a development set of 1 hour, and an evaluation set of 2 hours. The training set is used either to train the QA baseline acoustic model or to adapt an existing MSA acoustic model.

Source                     Duration (hrs)
Tesaneef series            9.3
Sabah El-Doha talk show    2.0
Al-Jazeerah programs       3.7
Total                      15.0

Table 2. Composition of the QA corpus.

3. System Description

Our system is a GMM-HMM architecture based on the Kaldi speech recognition engine (Povey et al., 2011). Acoustic models are all fully continuous density context-dependent tri-phones with 3 states per HMM, trained with Maximum Mutual Information Estimation (MMIE). The feature vector consists of the standard 39-dimensional MFCC coefficients. During acoustic model training, linear discriminant analysis (LDA) and maximum likelihood linear transform (MLLT) are applied to reduce dimensionality, which improves accuracy as well as recognition speed. Feature-space MLLR (fMLLR) was used for Speaker Adaptive Training (SAT) of the acoustic models.
The first decoding pass uses a relatively small language model of around 800K n-grams. In the second pass, the generated trigram lattices are rescored against a larger trigram model of around 10M n-grams.

4. Baseline System

4.1. Acoustic Modeling

We have adopted grapheme-based acoustic modeling (also known as graphemic modeling), an approach in which the phonetic transcription is approximated by the word graphemes rather than the exact phoneme sequence. Short vowels and geminations are assumed to be implicitly modeled in the acoustic model (Vergyri et al., 2005; Billa et al., 2002). The baseline acoustic model is trained with the QA training set. The optimized numbers of tied states and Gaussian mixtures per state are found to be 1000 and 8, respectively. Each grapheme letter is mapped to a unique model, resulting in a total of 41 base units (36 letters in the standard Arabic alphabet and 5 Persian letters).

4.2. Language Modeling

The language model is a backoff trigram model with modified Kneser-Ney smoothing. The baseline language model has been trained with the transcriptions of the QA training set (65K words). The vocabulary size is about 15.5K unique words. LM training parameters have been optimized to minimize the perplexity of the QA development set. Evaluating the language model against the transcriptions of the evaluation set results in an OOV rate of 22.2% and a perplexity of 315.5, whilst on the development set it results in an OOV rate of 18.4% and a perplexity of 399.4, as shown in Table 4. We could not observe any improvement in speech recognition accuracy by increasing the order to 4-grams, apparently because the limited amount of QA training text leads to more sparsity in higher-order n-grams.

4.3. Evaluation Settings

For the QA baseline system, batch decoding resulted in a WER of 61.7% on the QA development set and 80.8% on the evaluation set, as shown in Table 3. By examining the results, we find that about 1.0% of the errors are caused by one of the following: the different forms of Alef (e.g. ا instead of أ), final Teh Marbuta (ة instead of ه or vice versa), or final Alef Maksura (ى instead of ي or vice versa). Since there is no standard orthographic form for dialectal Arabic and these kinds of errors are already common orthographic variants in dialectal Arabic, we decide to ignore these types of errors by normalizing both hypothesis and reference, before alignment, as follows:

- Normalizing all forms of Hamzated Alef (أ إ آ) to ا.
- Normalizing final Yeh ي to Alef Maksura ى.
- Normalizing Teh Marbuta ة to Heh ه.

After applying orthographic normalization, absolute WER decreases to 60.9% on the dev. set (1.3% relative reduction) and 79.9% on the eval. set (1.1% relative reduction), as shown in Table 3.

Set      QA Baseline    + Orthographic norm.
dev.     61.7%          60.9%
eval.    80.8%          79.9%

Table 3. Word Error Rate (WER) (%) evaluation of the QA baseline system with and without orthographic normalization on the development set and the evaluation set.

5. Cross-Lingual Language Modeling

In the baseline system, a significant percentage of errors is due to the high OOV rate, which exceeds 18%. In an attempt to improve the LM, we trained an MSA trigram LM using the LDC Gigaword corpus (Parker et al., 2009), which consists of more than 800M words. The MSA vocabulary consists of the top 256K words in the corpus. The evaluation of the MSA LM resulted in a perplexity of 1366.7 and 1199.2 on the dev. and eval. sets respectively, and the OOV rate was found to be 22.3% and 22.1% on the dev. and eval. sets respectively, as shown in Table 4.

In order to decrease the OOV rate, we have linearly interpolated the QA LM and the MSA LM. Interpolation weights were optimized on the dev. set. The cross-lingual interpolation resulted in a vocabulary size of 265.7K words. The OOV rate is significantly decreased to 8.9% and 9.2% on the dev. and eval. sets respectively, as shown in Table 4. The perplexity test resulted in 1147.0 and 1262.7 on the dev. and eval. sets respectively. Using the cross-lingual MSA/QA LM, batch decoding resulted in an absolute WER of 58.7% and 66.9% on the dev. and eval. sets respectively (sys. 1 in Table 5), a significant relative reduction of 3.6% and 16.3% compared to the baseline.

LM        Vocab.    Perp. (dev.)   Perp. (eval.)   OOV % (dev.)   OOV % (eval.)
QA        15.5K     399.4          315.5           18.4           22.2
MSA       256K      1366.7         1199.2          22.3           22.1
QA/MSA    265.7K    1147.0         1262.7          8.9            9.2

Table 4. Language model evaluation with the development set and the evaluation set.

6. Cross-Lingual Acoustic Modeling

6.1. MSA Acoustic Model

In this section, we describe how to use an MSA acoustic model to decode QA speech.
Initially, that is not possible because of the mismatch between the phone sets of MSA and QA. This mismatch is solved by applying phone mapping. Consonants that do not exist in MSA have been mapped to the closest ones in MSA as follows:

- /ɡ/ and /ʒ/ are mapped to /ʤ/.
- /ʧ/ is mapped to /t/ followed by /ʃ/.
- /v/ is mapped to /f/.
- /p/ is mapped to /b/.

After applying QA phone mapping, an MSA graphemic acoustic model is trained using the 62.4-hour MSA corpus. Decoding results are an absolute WER of 61.9% and 81.3% on the dev. and eval. sets respectively, a 1.6% and 1.8% relative increase compared to the QA baseline, as shown in Table 5. This relative increase is expected, as the MSA acoustic model does not yet cover all QA dialect-specific features.
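The phone mapping rules in this section can be sketched as a simple lookup table. This is a minimal illustration, assuming units are written as IPA strings; the names `QA_TO_MSA` and `map_to_msa` are hypothetical and are not taken from the system described here. In a graphemic setup the same table would presumably be expressed over letter units rather than IPA symbols.

```python
# Hypothetical sketch of the QA-to-MSA consonant mapping listed in Section 6.1.
# Keys are QA-only consonants (IPA); values are the closest MSA unit sequences.
QA_TO_MSA = {
    "ɡ": ["ʤ"],
    "ʒ": ["ʤ"],
    "ʧ": ["t", "ʃ"],  # one QA unit becomes a two-unit MSA sequence
    "v": ["f"],
    "p": ["b"],
}


def map_to_msa(units):
    """Return the MSA unit sequence for a sequence of QA units."""
    mapped = []
    for unit in units:
        # Units shared by QA and MSA pass through unchanged.
        mapped.extend(QA_TO_MSA.get(unit, [unit]))
    return mapped
```

Note that the /ʧ/ rule changes the sequence length, so any alignment computed before mapping would need to be recomputed afterwards.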

6.2. Data Pooling

In data pooling acoustic modeling, we have jointly trained the acoustic model using both QA and MSA data. Decoding results are an absolute WER of 56.6% and 64.4% on the dev. and eval. sets respectively, outperforming the baseline by a relative decrease of 7.1% and 19.4%, as shown in Table 5.

6.3. Acoustic Model Adaptation

In this section, we apply acoustic model adaptation techniques to the MSA model using QA speech data. Maximum Likelihood Linear Regression (MLLR) (Leggetter and Woodland, 1995) followed by Maximum A Posteriori (MAP) re-estimation (Lee and Gauvain, 1993) is applied. Decoding results are an absolute WER of 57.3% and 65.9% on the dev. and eval. sets respectively, outperforming the baseline by a relative decrease of 5.9% and 17.5%, as shown in Table 5.

6.4. Combined Data Pooling and Acoustic Model Adaptation

Data pooling and acoustic model adaptation are combined in this section. Acoustic model adaptation is applied to the MSA/QA pooled model rather than the MSA model. Decoding results are an absolute WER of 55.6% and 62.5% on the dev. and eval. sets respectively, outperforming the baseline by a significant relative decrease of 8.7% and 21.8%, as shown in Table 5.

6.5. System Combination

In this section, we combine different systems to further improve accuracy using Minimum Bayes-Risk (MBR) decoding (Goel and Byrne, 2000). MBR is applied to the lattices generated by two systems:

1. QA AM (sys. 1 in Table 5).
2. QA/MSA pool/adapt AM (sys. 5 in Table 5).

In both systems, the QA/MSA interpolated LM is used. System combination using lattice MBR resulted in an absolute WER of 47.9% and 56.8% on the dev. and eval. sets respectively, outperforming the baseline system by a relative decrease of 21.3% and 28.9%, as shown in Table 5.

sys.   AM                   dev.    eval.
1      QA                   58.7    66.9
2      MSA                  61.9    81.3
3      QA/MSA pool          56.0    64.4
4      QA/MSA adapt         57.3    65.9
5      QA/MSA pool/adapt    55.6    62.5
6      1+5 MBR              47.9    56.8

Table 5. WER (%) on the QA dev. and eval. sets using the QA/MSA LM and various acoustic model configurations.
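The relative figures quoted throughout Section 6 follow from the Table 5 values and the normalized baseline of Table 3 (60.9% dev., 79.9% eval.). The sketch below simply redoes that arithmetic; the names `relative_reduction`, `BASELINE`, `TABLE_5`, and `reductions` are illustrative choices, not part of the original system.

```python
# Baseline WER after orthographic normalization (Table 3).
BASELINE = {"dev": 60.9, "eval": 79.9}

# WER values copied from Table 5.
TABLE_5 = {
    "QA": {"dev": 58.7, "eval": 66.9},
    "MSA": {"dev": 61.9, "eval": 81.3},
    "QA/MSA pool": {"dev": 56.0, "eval": 64.4},
    "QA/MSA adapt": {"dev": 57.3, "eval": 65.9},
    "QA/MSA pool/adapt": {"dev": 55.6, "eval": 62.5},
    "1+5 MBR": {"dev": 47.9, "eval": 56.8},
}


def relative_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent; negative means the WER increased."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


def reductions(system):
    """Relative reduction on dev. and eval. for one Table 5 row, to one decimal."""
    return {
        split: round(relative_reduction(BASELINE[split], TABLE_5[system][split]), 1)
        for split in BASELINE
    }
```

Under these inputs, `reductions("1+5 MBR")` gives 21.3 and 28.9, and `reductions("QA/MSA pool/adapt")` gives 8.7 and 21.8, matching the text. For system 3, the Table 5 dev. value of 56.0 gives 8.0%, whereas the 7.1% quoted in Section 6.2 corresponds to the 56.6% stated there.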
The strategy of data pooling followed by MLLR+MAP adaptation is equivalent to a type of iterative transformation and adaptive re-weighting of the QA data relative to the MSA data. For example, the mean vector of the k-th Gaussian, computed by the final stage of MAP adaptation, is given by

$$\hat{\mu}_k = \frac{\sum_{t=1}^{T} \gamma_k(t)\, x_t + \tau A \mu_k}{\sum_{t=1}^{T} \gamma_k(t) + \tau}, \qquad (1)$$

where $x_t$, $1 \le t \le T$, is a dialectal feature vector, $\gamma_k(t)$ is the posterior probability of the k-th Gaussian given $x_t$, $\tau$ is the weight of the prior, $\mu_k$ is the mean prior to adaptation, and $A$ is the corresponding MLLR transformation. But notice that $\mu_k$, in turn, is given by

$$\mu_k = \frac{1}{N_k} \sum_{t=1}^{T+S} \xi_k(t)\, x_t, \qquad N_k = \sum_{t=1}^{T+S} \xi_k(t), \qquad (2)$$

where $x_t$, for $T+1 \le t \le T+S$, is an MSA feature vector, and $\xi_k(t)$ is the weighting coefficient computed during the last round of maximum-likelihood EM training applied to the pooled MSA and QA datasets. By combining Eq. (1) and (2), we discover that MAP adaptation is similar to an adaptive re-weighting scheme, such that QA feature vectors are weighted comparably to MSA feature vectors during the initial EM training, then transformed by $A$, and then re-weighted to an increased final weight of $\gamma_k(t) + \tau\,\xi_k(t)/N_k$. The effective weight of each MSA datum is similarly decreased, during MAP adaptation, to only $\tau\,\xi_k(t)/N_k$.

The effect of this iterative strategy is to give greater weight to MSA data during the initial training of the model, when the MSA data may help the learning algorithm avoid spurious local optima in the likelihood function. After the model parameters have converged to a solution that is optimal for the pooled MSA+QA data, MLLR improves the representation of QA data, and, finally, MAP is used to increase the relative importance of QA data in the final training criterion.

7. Discussion

Even though the differences between MSA and Arabic dialects are large, to the extent that we can consider Arabic dialects as totally different languages (Ferguson, 1959), we can still benefit from MSA speech resources to improve dialectal Arabic speech recognition. The performance of the data pooling approach may be affected by the ratio of the amount of dialectal data to the amount of MSA data. In our case, the data pooling approach results in an absolute WER of 56.0% on the dev.
set and 64.4% on the eval. set. The amount of MSA data is about five times the amount of dialectal data. In order to boost the contribution of dialectal data, MLLR and MAP adaptations are then applied to the pooled acoustic model, effectively increasing the weight of dialectal acoustic features in the final cross-lingual model. The combination of data pooling followed by acoustic model adaptation results in a lower absolute WER of 55.6% on the dev. set and 62.5% on the eval. set. Lattice MBR decoding contributes a further reduction in WER, achieving 47.9% on the dev. set and 56.8% on the eval. set.

8. Conclusions and Future Work

In this paper, we propose a speech recognition system for Qatari Colloquial Arabic (QA). Given the limited dialectal resources, our proposed method utilizes MSA data through cross-dialectal phone mapping, data pooling, acoustic model adaptation, and system combination, and has achieved 21.3% and 28.9% relative WER reduction on the QA development set and evaluation set respectively. For future work, it is possible to extend the current framework to speech recognition systems for other dialects. Moreover, some future directions are to incorporate recent achievements in transfer learning and domain adaptation to further improve system performance (Pan and Yang, 2010). In addition, the cross-lingual training and adaptation can be bidirectional; a multi-task framework of Arabic speech recognition can be formulated so that both MSA and dialectal recognition performance can be enhanced simultaneously (Caruana, 1997).

9. Acknowledgment

This publication was made possible by a grant from the Qatar National Research Fund under its National Priorities Research Program (NPRP) award number NPRP 09-410-1-069. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Qatar National Research Fund. We would also like to acknowledge the European Language Resources Association (ELRA) and the Linguistic Data Consortium (LDC) for providing us with data resources.

References

Billa, J., Noamany, M., Srivastava, A., Liu, D., Stone, R., Xu, J., Makhoul, J. and Kubala, F. (2002). Audio indexing of Arabic broadcast news. Proceedings of ICASSP, vol. 1, pp. 5-8.
Caruana, R. (1997). Multitask learning. Machine Learning, vol. 28, no. 1, pp. 41-75.
Elmahdy, M., Gruhn, R., Minker, W. and Abdennadher, S. (2010). Cross-lingual acoustic modeling for dialectal Arabic speech recognition. Proceedings of INTERSPEECH, pp. 873-876.
Ferguson, C.A. (1959). Diglossia. Word, vol. 15, pp. 325-340.
Goel, V. and Byrne, W. (2000). Minimum Bayes-Risk Automatic Speech Recognition. Computer Speech and Language, 14(2), pp. 115-135.
Huang, P.-S. and Hasegawa-Johnson, M. (2012). Cross-dialectal data transferring for Gaussian mixture model training in Arabic speech recognition. International Conference on Arabic Language Processing.
Kirchhoff, K. and Vergyri, D. (2005). Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition. Speech Communication, vol. 46(1), pp. 37-51.
Lee, C.-H. and Gauvain, J.-L. (1993). Speaker adaptation based on MAP estimation of HMM parameters. Proceedings of ICASSP, vol. II, pp. 558-561.
Leggetter, C.J. and Woodland, P.C. (1995). Maximum likelihood linear regression for speaker adaptation of the parameters of continuous density hidden Markov models. Computer Speech and Language, vol. 9, pp. 171-185.
Mangu, L., Kuo, H.-K., Chu, S., Kingsbury, B., Saon, G., Soltau, H. and Biadsy, F. (2011). The IBM 2011 GALE Arabic Speech Transcription System. Proceedings of ASRU, pp. 272-277.
NEMLAR Broadcast News Speech Corpus, ELRA catalog reference: ELRA-S0219, http://catalog.elra.info/product_info.php?products_id=874
NetDC Arabic BNSC (Broadcast News Speech Corpus), ELRA catalog reference: ELRA-S0157, http://catalog.elra.info/product_info.php?products_id=13
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359.
Parker, R., Graff, D., Chen, K., Kong, J. and Maeda, K. (2009). Arabic Gigaword Fourth Edition. Linguistic Data Consortium, Pennsylvania, LDC Catalog No.: LDC2009T30, ISBN: 1-58563-532-4.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G. and Vesely, K. (2011). The Kaldi Speech Recognition Toolkit. Proceedings of IEEE ASRU.
Vergyri, D., Kirchhoff, K., Gadde, R., Stolcke, A. and Zheng, J. (2005). Development of a conversational telephone speech recognizer for Levantine Arabic. Proceedings of INTERSPEECH, pp. 1613-1616.