A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition

Abir Masmoudi (1,2), Mariem Ellouze Khemakhem (1), Yannick Estève (2), Lamia Hadrich Belguith (1) and Nizar Habash (3)

(1) ANLP Research group, MIRACL Lab., University of Sfax, Tunisia
(2) LIUM, University of Maine, France
(3) Center for Computational Learning Systems, Columbia University, USA

masmoudiabir@gmail.com, mariem.ellouze@planet.tn, yannick.esteve@lium.univ-lemans.fr, l.belguith@fsegs.rnu.tn, habash@ccls.columbia.edu

Abstract

In this paper we describe an effort to create a corpus and phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus), is a collection of audio recordings and transcriptions of dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models in ASR systems. The method proposed in this paper to automatically generate a phonetic dictionary is rule based. For that reason, we define a set of pronunciation rules and a lexicon of exceptions. To determine the performance of our phonetic rules, we chose to evaluate our pronunciation dictionary on two types of corpora. The word error rate of word grapheme-to-phoneme mapping is around 9%.

Keywords: Tunisian Arabic, speech recognition, phonetic dictionary, grapheme-to-phoneme

1. Introduction

Automatic Speech Recognition (ASR) is playing an increasingly important role in a variety of applications such as automatic query answering, telephone communication with information systems, speech-to-text transcription, etc. In this paper we describe an effort to create a corpus and phonetic dictionary for Tunisian Arabic ASR. The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus), is a collection of audio recordings and transcriptions of dialogues in the Tunisian Railway Transport Network.
The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models in ASR systems. It contains a subset of the words available in the language and the pronunciation variants of each word, expressed as sequences of the phonemes available in the acoustic models.

In the next section, we give a historical overview of Tunisian Arabic. Then, in Section 3, we present the steps of creating the corpus for our study, and we provide an analysis of this corpus in Section 4. Section 5 details the phonological variations of Tunisian Arabic. Sections 6 and 7 present the method we propose to build the Tunisian Arabic phonetic dictionary and its evaluation, respectively.

2. Historical Overview of Tunisian Arabic

Modern Standard Arabic (MSA) has a special status as an official standard language in the Arab world. It is in particular the language of the written press and official venues. In addition, there is a large variety of dialects that constitute the mother tongues of Arabic speakers. Arabic dialects are divided into two major groups: the Western (or North African) group and the Eastern group. North African Arabic is the variety of Arabic spoken in the Maghreb countries (Tunisia, Algeria, Morocco, Libya and Mauritania), while the Eastern group includes the varieties spoken in Egypt, the Levant, Iraq, the Gulf states, Yemen, Oman, etc.

Tunisian Arabic is the main variety used in the daily life of Tunisian people for spoken communication. It is becoming more widely used in interviews, news, debate programs, and public service announcements, and it has a strong online presence today in blogs, forums, and user/reader commentaries.

Historically, Berber was the original mother tongue of the inhabitants of North Africa. The spread of Islam in North Africa brought Arabic, the language of Islam's Holy Book.
Other historical events also influenced the language spoken in Tunisia, such as the Ottoman Empire, European colonialism and peaceful trade-based interactions between civilizations. Tunisian Arabic is thus an outcome of the interactions between Berber, Classical Arabic and many other languages. The trace of this interaction is manifested in the introduction of words borrowed from French, Italian, Turkish and Spanish. Some of these borrowings are used in the daily life of Tunisians with phonological changes, while many others are used without being adapted to Tunisian phonology. Table 1 below shows some examples of foreign words commonly used in Tunisian Arabic with or without phonological modification.

Word | Transliteration | Origin | Sense
شكبّة | škub~aħ | Italian | card game
كاغث | kaaγiθ | Turkish | paper

Table 1: Some examples of foreign words used in Tunisian Arabic (see footnote 1).

3. The Tunisian Arabic Railway Interaction Corpus

Building an ASR system requires at least two types of corpora: audio recordings and the corresponding written text. Since we aim to build an ASR system, and given the lack of such resources for Tunisian Arabic, we decided to create our own corpus, which we named TARIC: Tunisian Arabic Railway Interaction Corpus. The corpus was created in three steps: first, the production of audio recordings; second, the transcription of these recordings; and third, the normalization of these transcriptions. The following three sub-sections detail the creation of TARIC.

3.1 The Recordings

The first step consisted in making audio recordings. We did this in the ticket offices of the Tunis railway station. We recorded conversations in which clients requested information about train schedules, fares, bookings, etc. The equipment we used includes two portable PCs running the Audacity software and two microphones, one for the ticket office clerk and one for the client. We chose to record in different periods, particularly holidays, weekends, festival days, and sometimes during the week. We obtained 20 hours of audio recordings.

3.2 The Transcription

Once our recordings were ready, we transcribed them manually because we did not have tools for automatic transcription of Tunisian Arabic. This transcription was done by three university students. Our corpus consists of several dialogues; each dialogue is a complete interaction between a clerk and a client. All the words are written using the Arabic alphabet with diacritics. The diacritics indicate how the word is pronounced. The same word can have more than one pronunciation. Table 2 presents some statistics of the TARIC corpus.
Number of hours | Number of dialogues | Number of statements | Number of words
20h | 4,662 | 18,657 | 71,684

Table 2: Statistics of the TARIC corpus

Footnote 1: Transliteration of Arabic is presented in the Habash-Soudi-Buckwalter scheme (Habash et al., 2007).

3.3 Normalization

To obtain coherent data and consistent corpora, we had to use a standard orthography. However, Tunisian Arabic has no standard orthography, since there are no Arabic dialect academies. In our laboratory, we developed our own orthographic guidelines to transcribe spoken Tunisian Arabic, following previous work by Habash et al. (2012) on developing a conventional orthography for dialectal Arabic, or CODA. Our guidelines are described in (Zribi et al., 2014).

4. Analysis of TARIC

In this section, we present an analysis of the collected corpus. The analysis consists of determining dialogue acts, foreign words, lexical variations and speech disfluencies.

4.1 Dialogue Acts

Dialogue acts are the actions performed by the speaker. The corpus contains a variety of dialogue acts that pertain to requests and answers about scheduling and reservations. Table 3 shows an example of the segmentation into dialogue acts of a conversation between a client and an agent.

Dialogue Act | Dialect Lexicon | Translation
Departure time request | وقتاش التران للتونس | When is the train to Tunis?
Answer | ثمة في العشرة و في الماضي ساعة | There is at 10 hours and at 13 hours
Reservation request | ريززڥيلي في التران متاع العشرة | Reserve me for the train at 10.
Confirmation | أوكاي | OK

Table 3: Analysis in dialogue acts of a conversation between an agent and a client

4.2 Lexical Variation

As indicated in Section 2, the use of foreign words is a common feature of Tunisian Arabic for historical reasons. In TARIC, foreign words represent 20% of the corpus. Table 4 gives some examples of these words.
Dialect word | Transliteration | Origin | Sense
تران | trian | French | Train
كلاس | klaas | French | Class
بلاصة | blaasaħ | French | Space

Table 4: Examples of foreign words

We also noticed the presence of several words of different origins that share the same meaning. For example, the word "ticket" can be expressed in three different ways: تكاي tikaay, تسكرة tiskraħ or تذكرة tiðkraħ. Table 5 illustrates other frequently used examples.

Lexicon | Translation
تران trian, ترينو triynuw, أوتوراي ÂuwtuwraAy | Train
بقايع bqaayie, بلايص bliayis, پلاس plaas | Places

Table 5: Example of lexical variation in Tunisian Arabic

4.3 Speech Disfluencies

Disfluency is a frequently occurring phenomenon in spontaneous oral production, resulting in new lexical classes that need to be properly handled. The principal disfluency phenomena are repetitions, self-corrections, hesitations and incomplete words. Next, we present an analysis of TARIC in terms of disfluencies in order to extract these new lexical classes.

Repetitions: these consist of repeating a word or a series of words. The majority of repetitions in TARIC are used by a speaker to affirm or to reformulate a request. Below are two examples of repetitions.

(a) زوز للتونس ألاي رتور زوز ألاي رتور
    two to Tunis go back two go back

(b) تكاي بليصة للصفاقس
    Ticket place to Sfax

Example (a) shows a repetition in the speaker's utterance that affirms the request. In example (b), the repetition is used by the speaker to press his claim: he used two different words that have the same meaning.

Self-corrections: the speaker can make one or more mistakes and correct them in the same utterance. This phenomenon is similar to a repetition, but the repeated portion is a reconstruction of a faulty portion of the utterance. Below are two examples of self-corrections.

(a) تونس لا سوسة
    Tunis no Sousse

(b) تكاي ألاي لا سامحني ألاي رتور
    Ticket go no sorry go back

Hesitations: these are phenomena which appear in spontaneous oral production. They can be manifested in various ways: either by a specific morpheme (e.g., uh, um, etc.) or by the elongation of a syllable. These lexical classes belong only to spontaneous oral production. Some of them resemble those of foreign languages such as French, while others are specific to Tunisian Arabic. The following example shows hesitation markers present in our corpus.
(a) تران للتونس آه دراكت
    Train to Tunis ah direct

Incomplete words: these are cases where the production of a word stops before its normal end. Under this definition, an incomplete word is always a word fragment that can be identified through knowledge of the phraseology.

(a) باللاهي ترا تران دوزيام كلاس
    Please tra train second class

In this example, the speaker begins to pronounce the word "train" but stops before the normal end of the word and then says the full word again.

5. Phonological Variations in Tunisian Arabic

Before creating a phonetic dictionary for Tunisian Arabic, it is necessary to study the phonological variations of this language. Tunisian Arabic has several specific phonological variations, in particular in the pronunciation of some consonants. We cite below a few of these phonetic features:

- The presence of foreign words in Tunisian Arabic resulted in the introduction of three new phonemes: ڥ /V/, ڨ /G/ and پ /P/.
- In Tunisian Arabic, the consonant ق "q" has a double pronunciation. In the rural dialects, it is pronounced ڨ /G/. In the urban dialects, the consonant ق is pronounced /Q/, but there are some exceptions.
- The consonant ض /DD/ can have several possible pronunciations, such as ض /DD/, ذ "ð" /DH/ or د "d" /D/. For example, the word ماضي /M AE: DD IY/ in the expression ماضي ساعة /M AE: DD IY S AE: AI AE/ "13 hours" is pronounced ماضي /M AE: DD IY/, ماذي /M AE: DH IY/ or مادي /M AE: D IY/.
- The consonant س "s" /S/ can be pronounced as /S/ or ص "S" /SS/. For example, the word رسول /R AE S UW L/ "Prophet" is pronounced رسول /R AE S UW L/ or رصول /R AE SS UW L/.
- The consonant ظ /DH2/ is realized as /DH2/ or ض /DD/.
- In a few words such as ثمة /TH AE M M AE/ "exist", the consonant ث "v" /TH/ can be pronounced in two ways: ث /TH/ or ف "f" /F/.
- The consonant ط "T" /TT/ is sometimes pronounced /TT/ and at other times ت "t" /T/. For example, أعطيني "give me" is pronounced أعطيني /E AE AI TT IY N IY/ or أعتيني /E AE AI T IY N IY/.
- The Tunisian Arabic Hamza (or glottal stop) at the beginning of a word is pronounced in different ways. If the word is at the beginning of the statement, the glottal stop is pronounced. If the word is in the middle of the statement, the glottal stop is omitted.
- The consonant ع "E" /AI/ is sometimes pronounced /AI/ and at other times ح "H" /HH/. For example, متاعها /M T AE: AI H AE:/ "hers" is pronounced متاعها /M T AE: AI H AE:/ or متاحها /M T AE: HH H AE:/.
- A consonant is eliminated in some words. For example, قلتلك /Q UH L T L IH K/ "I told you" can be pronounced قتلك /Q UH T L IH K/, where the consonant ل "l" /L/ is eliminated.
- In Tunisian Arabic, starting from eleven, the phoneme /N/ is added to numbers followed by a noun, for example حداشن ألف /HH D AE: SH N E AE L F/ "eleven thousand".

6. The Tunisian Arabic Phonetic Dictionary

Pronunciation dictionaries map words to one or more pronunciation variants and take pronunciation variability into account. Our approach consists in using a set of phonetic rules and a lexicon of exceptions to automatically generate a pronunciation dictionary.

6.1 The Lexicon of Exceptions

Some words cannot follow our set of phonetic rules, so it is necessary to define a lexicon of exceptions. This lexicon is consulted before the rules are used. If the word is among the exceptions, it is encoded directly in its phonetic form. Otherwise, we apply the rules to the word to generate its phonetic form. Our lexicon contains more than 30 exceptions and was evaluated by three judges (native speakers). Table 6 shows some examples of lexical exceptions.

Exception | Transliteration | Gloss | Phonetization
هذا | haðaa | this [masc. sg.] | H AE: DH AE:
هذي | haðiy | this [fem. sg.] | H AE: DH IY
اله | AilaAh | god | E IH L AE: H

Table 6: Lexicon of exceptions

This operation is called transcription by phonetic lexicon: for each word, it directly generates a lexical entity that represents the matching pronunciation.

6.2 Phonetic Rules

We developed a set of phonetic rules to map written Tunisian Arabic to phonemes. Rules are provided for each letter in Tunisian Arabic. Each rule tries to match certain conditions relative to the context of the letter and to provide a replacement.
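The lookup order described in Section 6.1 (exceptions lexicon first, context rules otherwise) can be sketched as follows. This is a minimal illustration in Python, not the authors' tool: the `Rule` tuple, the `phonetize` function, the encoding of context conditions as regular expressions and the tiny rule base are all assumptions made for the example. Only the lexicon entry for هذي, the shadda and final-Alef behaviours given as example rules in this section, and the phoneme symbols come from the paper. Treating a word-final ة as silent is likewise an assumption, chosen so that ثمّة yields /TH AE M M AE/.

```python
import re
from typing import List, NamedTuple, Tuple

# Letters and diacritics are spelled as escapes so their order is unambiguous.
ALEF, WAW, THA, MEEM = "\u0627", "\u0648", "\u062B", "\u0645"
HA, DHAL, YA, TA_MARBUTA = "\u0647", "\u0630", "\u064A", "\u0629"
FATHA, SHADDA = "\u064E", "\u0651"


class Rule(NamedTuple):
    """Hypothetical rule record: one letter, its context, its replacement."""
    graph: str                # current letter
    before: str               # regex on the text before the letter ("" = any)
    after: str                # regex on the text after the letter ("" = any)
    phones: Tuple[str, ...]   # replacement; () means the letter is silent


# Exceptions lexicon, consulted before any rule (entry taken from Table 6).
EXCEPTIONS = {HA + DHAL + YA: ("H", "AE:", "DH", "IY")}

# Toy rule base. Order matters: the first matching rule for a letter wins.
RULES = [
    Rule(ALEF, WAW + "$", "^$", ()),   # word-final Alef after waw is silent
    Rule(ALEF, "", "", ("AE:",)),
    Rule(WAW, "", "", ("UW",)),
    Rule(THA, "", "", ("TH",)),
    Rule(MEEM, "", "", ("M",)),
    Rule(FATHA, "", "", ("AE",)),
    Rule(TA_MARBUTA, "", "^$", ()),    # assumption: final ta marbuta is silent
]


def phonetize(word: str) -> List[str]:
    """Return the phoneme sequence for one diacritized word."""
    if word in EXCEPTIONS:             # step 1: lexicon of exceptions
        return list(EXCEPTIONS[word])
    phones: List[str] = []
    for i, ch in enumerate(word):      # step 2: rules, in reading order
        if ch == SHADDA:
            if phones:                 # shadda doubles the consonant it sits on
                phones.append(phones[-1])
            continue
        rule = next(
            (r for r in RULES
             if r.graph == ch
             and re.search(r.before, word[:i])
             and re.search(r.after, word[i + 1:])),
            None,
        )
        if rule is None:
            raise ValueError(f"no rule for letter U+{ord(ch):04X}")
        phones.extend(rule.phones)
    return phones


print(phonetize(THA + FATHA + MEEM + SHADDA + FATHA + TA_MARBUTA))
```

With this toy rule base the call above prints `['TH', 'AE', 'M', 'M', 'AE']`. The context-specific Alef rule is listed before the default one because the first match wins; a real rule base of about 80 rules would need the same care with ordering.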
Our rules were evaluated by three judges (native speakers). These rules are stored in a rule base. The total number of rules is 80. Each rule is read from right to left and follows this format:

Replacement <= {Left-Cond} + {Graph} + {Right-Cond}

Graph: the current letter in the word.

Right-Condition: has one of the following formats:
<? <= Pattern>: the context before the current position "Graph" must match Pattern.
<? <! Pattern>: the context before the current position "Graph" must not match Pattern.

Left-Condition: can take one of these two formats:
<? = Pattern>: the context after the current position "Graph" must match Pattern.
<! Pattern>: the context after the current position "Graph" must not match Pattern.

Replacement: either one or more phonemes, or the empty symbol (*) if the graph is omitted in pronunciation.

The phonetic rules are applied in the reading direction of the word: application starts with the first letter of the word and respects the order of the letters. The following are three examples of rules for Tunisian Arabic:

1. Shadda rule: the shadda diacritic is written on a consonant and never on a vowel. Its effect is to double the consonant on which it is placed.
2. The rules of the ا (Alef): at the end of a word and preceded by w, the combination signifies a plural word. In this situation, the final "Alef" does not have any pronunciation. For example, in the plural word خلصوا (they have paid) the final ا is deleted.
3. Sun letter rule: when a word starts with the definite article ال Al+ followed by a so-called Sun consonant letter, the /l/ of the definite article is assimilated to the consonant (Habash, 2010; Biadsy et al., 2009). For example, the word السما Al+samiA "the sky" is pronounced /E IH S S M AE:/.

7. Evaluation

We evaluate the performance of our phonetic rules on two corpora: TARIC and another corpus downloaded from the websites of Tunisian bloggers. The latter corpus was selected to cover several themes: political, sporting, cultural, social, etc.
Since the web corpus does not follow our writing standard, we standardized it according to the Tunisian Arabic CODA (Zribi et al., 2014) and manually diacritized it. The evaluation set contained around 3K unique words from TARIC and 3K unique words from the web. Our pronunciation dictionary was evaluated by three experts (native speakers). Table 8 shows the results for each type of corpus.

TARIC corpus | Web corpus
8% | 10%

Table 8: Results of the evaluation (word error rate)

As presented in Table 8, our phonetization system for Tunisian Arabic has an 8% word error rate on diacritized words from TARIC and a 10% word error rate on diacritized words from the web corpus. Some of these errors are due to the order of the rules; for example, the rules for long vowels must be applied before the rules for short vowels. Other errors are due to contradictions between two rules.

8. Conclusion and Future Work

To deal with the lack of linguistic resources for ASR in Tunisian Arabic, we created our own corpus, TARIC. We described the creation of TARIC and highlighted some of its features. We also presented a tool for rule-based grapheme-to-phoneme mapping that converts graphemes of Tunisian Arabic into their corresponding phonemes. The implementation is based on the list of graphemes, the list of phonemes, the lexicon of exceptions and the phonetic rules. Each rule attempts to match certain conditions relating to the context of the letter and provides a replacement. The total number of rules is about 80. The resulting software was tested on word lists in Tunisian Arabic from two independent test sets and reached an error rate of ~9%. The prepared data (TARIC, the phonetic dictionary and the tool) will be used to build ASR systems for the Tunisian Railway Transport Network.

In future work, we plan to extend our research to improving the phonetization of diacritized and undiacritized words in Tunisian Arabic. We will consider methods for data-driven grapheme-to-phoneme mapping.

9. References

Algamdi, M. (2003). KACST Arabic Phonetics Database. Fifteenth International Congress of Phonetics Science, Barcelona, pages 3109-3112.

Algamdi, M., Elshafei, M., & Almuhtasib, H. (2002). Speech Units for Arabic Text-to-speech. Fourth Workshop on Computer and Information Sciences, pages 199-212.

Biadsy, F., Habash, N., & Hirschberg, J. (2009). Improving the Arabic Pronunciation Dictionary for Phone and Word Recognition with Linguistically-Based Pronunciation Rules. The 2009 Annual Conference of the North American Chapter of the ACL, pages 397-405, Boulder, Colorado.

Bisani, M., & Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication 50, pages 434-451.

Diehl, F., Gales, M. J. F., Tomalin, M., & Woodland, P. C. (2008). Phonetic pronunciations for Arabic speech-to-text systems. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1573-1576.

El-Imam, Y. (2004). Phonetization of Arabic: rules and algorithms. Computer Speech and Language 18, pages 339-373.

Gales, M. J. F., Diehl, F., Raut, C. K., Tomalin, M., Woodland, P. C., & Yu, K. (2007). Development of a phonetic system for large vocabulary Arabic speech recognition. IEEE Workshop on Automatic Speech Recognition & Understanding, pages 24-29.

Habash, N. (2010). Introduction to Arabic Natural Language Processing. Synthesis Lectures on Human Language Technologies, Graeme Hirst, editor. Morgan & Claypool Publishers.

Habash, N., Soudi, A., & Buckwalter, T. (2007). On Arabic Transliteration. Book chapter in Arabic Computational Morphology: Knowledge-based and Empirical Methods, Antal van den Bosch and Abdelhadi Soudi, editors.

Habash, N., Diab, M., & Rambow, O. (2012). Conventional Orthography for Dialectal Arabic. Proceedings of the Language Resources and Evaluation Conference (LREC), Istanbul.

Hiyassat, H. A. R. (2007). Automatic Pronunciation Dictionary Toolkit for Arabic Speech Recognition Using SPHINX Engine. Ph.D. thesis, Arab Academy for Banking and Financial Sciences, Amman, Jordan.

Maamouri, M., Buckwalter, T., & Cieri, C. (2004). Dialectal Arabic Telephone Speech Corpus: Principles, Tool Design, and Transcription Conventions. NEMLAR International Conference on Arabic Language Resources and Tools, Cairo, September, pages 22-23. Paris-sud, Centre d'orsay.

Masmoudi, A., Estève, Y., Ellouze Khmekhem, M., & Hadrich Belguith, L. (2014). Phonetic tools for the Tunisian Dialect. The 4th International Workshop on Spoken Language Technologies for Under-resourced Languages, Russia.

Zribi, I., Boujelban, R., Masmoudi, A., Ellouze Khmekhem, M., Hadrich Belguith, L., & Habash, N. (2014). A Conventional Orthography for Tunisian Arabic. 9th edition of the Language Resources and Evaluation Conference, Reykjavik, Iceland.