The Burgeoning Challenge of Deciphering Arabic Chat

March 26, 2012

Arabizi, an informal dialect of Arabic typed on mobile phones and computer keyboards using the Latin alphabet, has spread widely via text messages and social networks. Analyzing messages written in this dialect is a challenge for analysts in government and industry because of wide variations in spelling, grammar, and diction.

ABOUT BASIS TECHNOLOGY

Basis Technology provides software solutions for text analytics, information retrieval, digital forensics, and identity resolution in over forty languages. Our Rosette linguistics platform is a widely used suite of interoperable components that power search, business intelligence, e-discovery, social media monitoring, financial compliance, and other enterprise applications. Our linguistics team is at the forefront of applied natural language processing using a combination of statistical modeling, expert rules, and corpus-derived data. Our forensics team pioneers better, faster, and cheaper techniques to extract forensic evidence, keeping government and law enforcement ahead of exponential growth of data storage volumes.

Software vendors, content providers, financial institutions, and government agencies worldwide rely on Basis Technology's solutions for Unicode compliance, language identification, multilingual search, entity extraction, name indexing, and name translation. Our products and services are used by over 250 major firms, including Cisco, EMC, Exalead/Dassault Systems, Hewlett-Packard, Microsoft, Oracle, and Symantec. Our text analysis products are widely used in the U.S. defense and intelligence industry by such firms as CACI, Lockheed Martin, Northrop Grumman, SAIC, and SRI. We are the top provider of multilingual technology to web and e-commerce search engines, including Amazon.com, Bing, Google, and Yahoo!.

Company headquarters are in Cambridge, Massachusetts, with branch offices in San Francisco, Washington, London, and Tokyo. For more information, visit www.basistech.com.

© 2012 Basis Technology Corporation. "Basis Technology", "Geoscope", "Odyssey Digital Forensics", "Rosette", and "We put the World in the World Wide Web" are registered trademarks of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2012-08-15)

For the last several hundred years, technology has made it easier for dominant languages to drive out the smaller ones. Faster transportation, ubiquitous telephony, and efficient printing are helping the major languages dominate our communications. Studies by the Linguistic Society of America and the National Geographic Society estimate that more than half of the world's approximately 7,000 languages will be extinct by 2100. [1]

But technology also nurtures creativity, and new forms of writing are appearing in unforeseen places driven by unexpected confluences. One of the newest is Arabic chat alphabet, also called Arabizi: a casual version of written Arabic that appeared when Arabic speakers began using Western keyboards on mobile phones and computers to spell out their native language with the Roman alphabet. Thus, instead of typing مرحبا they would type mar7aba (translation: "hello").

The practice is a growing challenge for government intelligence agencies because the writing system is proliferating as portable phones, social media, and other digital channels become more common. More and more conversations flow through the handsets, and an increasing percentage travel as text messages. Western social networks like Facebook or Twitter are also growing in popularity, and the users often choose Arabizi for their messages. In many cases, protests of the Arab Spring were planned, nurtured, and executed through messages passed in these channels, often as Arabic chat.

The format poses a unique problem for government analysts working with open source intelligence, because it is still evolving and the writers do not follow any standard rules for spelling, grammar, or diction. Speakers from different regions not only use different spellings but also write in their local dialect and even code-switch (insert other languages such as English or French), or mix dialectal Arabic with Modern Standard Arabic (MSA).
The phonology, morphology, syntax, and lexicon of the dialects that these native speakers of Arabic are using are different from those of MSA. (See the sidebar on the growth of Arabizi, "The Flourishing Garden of Arabizi.")

While the collection phase of the intelligence cycle can gather text messages easily, the processing and exploitation phase is slowed or blocked entirely by messages that cannot be easily understood by automated tools trained only on MSA in Arabic script. Humans must read through the messages and choose the important ones. Automation is essential because there are simply not enough analysts to scan the large volume of text. Arabizi confuses conventional natural language processing tools for Arabic by mixing in non-Arabic letters, regional dialects, and foreign words, all of which are unexpected by tools trained on MSA.

UNDERSTANDING THE ATTRACTION OF ARABIZI

While the early Arabizi users were probably motivated by the need to send Arabic words with a Western keyboard, Arabizi also attracts current users for aesthetic, political, and personal reasons. Convenience is only part of the allure.

[1] See "What is an Endangered Language" by AC Woodbury, Linguistic Society of America, 2006. Also "The Last Speakers: The Quest to Save the World's Most Endangered Languages," K. David Harrison, National Geographic Press, Sept 2010. Also "Languages Die, But Not Their Last Words," NY Times, Sept 19, 2007.

In one study conducted by the American University in Cairo [2], a group of 70 Arabic users on Facebook were asked why they chose to write with Roman letters. More than 80% of the respondents said that they used Arabizi, and 40% used it most of the time. Those who did not use it largely said it was either out of respect for the Quran or as part of an effort to maintain a separate Arabic identity outside of Western influence. Indeed, many agreed it should not be used in a religious context even if it is acceptable for casual communication. About 20% said it made them "feel closer to each other," a phrase that implies a role as a secret language that is not understood by outsiders of different ages, backgrounds, or education.

It is cool. Just as many English speakers may choose a more elegant-sounding French word, and Japanese companies add English words to the packaging of products, Arabic speakers who use Arabizi display a depth of knowledge and sophistication. The early users were, by necessity, well educated in European languages, so they knew instinctively how Arabic sounds mapped to Roman letter combinations. English and French classes are common in schools in the Arabic world and this nurtures the understanding of the Roman alphabet. This user profile makes Arabizi a mark of success that suggests that the writer is well educated and often well traveled.

Today the writing technique is also popular among young and agile minds because they are frequently the first to adopt new technology. Just as Western youths have improvised many acronyms and textual shorthand to simplify typing on mobile devices, the younger Arab users are also creating Roman approximations to meet their needs. Some suggest that they use Arabizi because it is often easier to type than Standard Arabic, especially when they have no training in using an Arabic keyboard.

The combination of political activity and the proliferation of technology fueled a greater focus on Arabic chat format.
While many of the original adopters may have turned to Arabizi out of a practical need to express Arabic words using Western technology, some suggest that the language is now a popular and sophisticated choice even when the technology supports traditional Arabic script. People are choosing Arabizi even when other tools to spell MSA are available. The once casual slang is growing in importance and becoming a significant format of its own.

The form quickly came to the attention of non-Arabic speakers during the Arab Spring of 2011 when many political activists discovered that mobile phones and Facebook were ideal vectors for organizing protests and structuring the evolution of political dissent. The writing born of casual chat turned into a tool for revolution.

THE CHALLENGE OF UNLOCKING ARABIC CHAT

Analyzing text written in Arabizi is a difficult problem because it lacks much of the structure that current technologies rely upon. Many algorithms and data-mining tools depend on a stable, predictable spelling and structure for words: rules that are standardized through dictionaries and schools. Arabizi's improvisational origins produce something more chaotic. (See sidebar.) The wide range of words and influences can confound algorithms that assume stability and a fixed interpretation. All of this mathematical apparatus assumes that the structure of a language will not change, but Arabizi is transforming as each person chooses the closest approximation of a word that comes to mind.

Many of the earliest approaches to analyzing Arabizi with these tools depended upon people translating the words directly into MSA and then using traditional algorithms on MSA to work with the results. The mechanism was tuned for the MSA output, not for the Arabizi entering the system. Any success that these rote algorithms offered was often limited and short-lived because a direct conversion to Arabic script will often produce sentences that do not follow the rules of MSA. The linguistic shorthand structures may follow the rules of proper MSA in some sentences and ignore them in others. Foreign words that are often mixed into Arabizi break these algorithms immediately.

The brittleness is compounded by the fact that there has been no formal process by any central body to provide guidance for how this language should be used, nor will there be: it is a language of the digital streets. There is no government-run committee on the language like the Académie Française, and no one has written a style guide like the AP Style Manual. There are also no large publications written and edited in Arabizi, so there is no source of a large, curated collection of text, like a newspaper, that people can imitate when structuring their sentences. This lack of common rules can be compounded when people use words from a smaller, local dialect that is not widely understood.

The lack of central steering means that the language evolves differently as each user chooses which orthography or grammatical trope to adopt or ignore without any guidance or suggestion. Many adopt new constructions as they write their texts, manufacturing spellings as they need them. If people are influenced, they are influenced by their friends.

[2] From "Summary of Arabizi or Romanization: The dilemma of writing Arabic texts" by Randa Muhammed, Mona Farrag, Nariman Elshamly, and Nady Abdel-Ghaffar. Presented at Jīl Jadīd Conference, University of Texas at Austin, February 18-19, 2011.
Regional Dialects Compound Difficulties

There are many regional differences in pronunciation that multiply the complexities because the words are often transliterated using sound. In Iraqi Arabic, the "q" as in the word قلب qalb (i.e., "heart") is pronounced as a "g"; Palestinian speakers in the West Bank would pronounce "q" as "k"; in Tunisia, it is pronounced as either "q" or "g"; and Egyptians pronounce it as a glottal stop (as in the sound in "uh-oh"). Hence, Egyptians pronounce the first letter of the former Libyan ruler's name with a glottal stop, whereas some other dialects pronounce it with a "g" sound. Consequently, possible spellings in Arabic chat would be: Gadaffi, 2addafi, Kathafi, Qathafi, Qadhafy, etc. The different spellings reflect the different pronunciations, the orthographic variations that exist in the English language itself, and the orthographic variations of Arabic chat (such as spelling the glottal stop or hamza with the digit 2).

On top of that, residents of countries where French is spoken, for instance, often learn different rules for mapping sounds to Roman letters than people who live in countries where English is the dominant second language. Even people who may know little English may still adopt English orthography when their friends choose to use it.

The decentralized structure does offer new opportunities for understanding and analysis. The spellings and linguistic constructions travel like viruses or memes without any central control, and people from the same social networks often share the same locutions. It is possible to identify the regional origin of speakers through the dialectal words they use and the spellings they select for the words. Thus a simple chat can contain more information than what the words are communicating.

AUTOMATING ANALYSIS

The complexity of the language and the explosion of Arabizi guarantee that automation is essential, and any computerized assistance with understanding the chat messages will be an analytical multiplier. It is impossible to find enough people to read all of the messages, so it is impossible to locate the salient texts without leveraging advanced algorithms.

Computerization also ensures that human resources can be deployed more efficiently. Without effective algorithmic triage, human analysts become mired in routine translations that become harder with the odd or chaotic transliteration. If most of the text snippets are of little interest to anyone but the recipients, automated tools are essential for both searching the corpus and ranking the importance of messages. An automated pre-processor lets the human resources focus on the most important message streams.

An Enterprise-Ready Solution for Deciphering Arabic Chat Alphabet

Currently, there are few tools that can deal effectively with Arabic chat. Basis Technology's Rosette linguistics platform may be the only production-quality, enterprise tool currently available on the market for converting and analyzing it. Rosette Chat Translator, one module in the full Rosette pipeline, is capable of disassembling Arabizi because it begins with a statistical approach to transliteration that breaks the text message into phonemes and then ranks the possible conversions. The most probable mappings are used to convert the text into Arabic script. This statistical approach allows the software to adapt as the common usage changes.

Figure: a sample message at each stage of processing (Romanized Arabic, Standard Arabic, Entity Extraction, English Translation).

Romanized Arabic: mar7aban Abu Mas3uud. Wallahi mudda taweela lmma shuftak ya shiekh. Esma3 insha' allah netgabal 3end abu musle7 ghadan wa la tensa el mawaad al matluba lzar3 al shajara fee shaari3 karbala'. 7awali 5:30 nitqabal ma3 ahmad abdallah salih.

English Translation: Greetings, Abu Masood! By God, it has been a long time since I've seen you, oh Sheikh. Listen, God willing, we meet at Abu Musleh tomorrow and do not forget the materials required to plant the tree in Karbala Street. Around 5:30 we meet with Ahmed Abdullah Saleh.

Basis Technology built the statistical model for the chat translator from more than 300 million Arabizi messages gathered from throughout the world. The database is updated regularly through an automatic algorithm that builds a new statistical model from the latest corpus. New releases include the latest version of the model trained with the most recent collection of chat messages.
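To make the idea of ranking possible conversions concrete, here is a minimal Python sketch of a noisy-channel style ranker. It is not Rosette's algorithm, which this paper does not describe in detail. The `CHANNEL` table, the `LEXICON` prior, every probability, and the function name `rank_conversions` are invented for illustration: each Roman character maps to candidate Arabic letters with a probability, and a word-frequency prior favors candidates that are known words.

```python
from itertools import product
from math import prod

# Hypothetical channel model: Roman character -> (Arabic letter, probability).
# An empty string means the character (a short vowel) is left unwritten.
CHANNEL = {
    "m": [("م", 1.0)],
    "r": [("ر", 1.0)],
    "b": [("ب", 1.0)],
    "a": [("", 0.6), ("ا", 0.4)],
    "7": [("ح", 0.9), ("خ", 0.1)],
}

# Hypothetical word prior; a real system would estimate it from a large corpus.
LEXICON = {"مرحبا": 1e-2}
UNSEEN = 1e-6  # small prior for words missing from the lexicon


def rank_conversions(token, top_n=3):
    """Return the top_n (arabic_word, score) pairs for one Arabizi token."""
    # Raises KeyError for characters missing from the toy table.
    options = [CHANNEL[ch] for ch in token.lower()]
    channel_scores = {}
    for combo in product(*options):
        word = "".join(letter for letter, _ in combo)
        p = prod(prob for _, prob in combo)
        # Several letter choices can yield the same word, so sum their mass.
        channel_scores[word] = channel_scores.get(word, 0.0) + p
    ranked = sorted(
        ((word, p * LEXICON.get(word, UNSEEN)) for word, p in channel_scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_n]


if __name__ == "__main__":
    for word, score in rank_conversions("mar7aba"):
        print(word, score)
```

With these toy numbers, `rank_conversions("mar7aba")` places مرحبا first, because the lexicon prior outweighs the candidates that drop every vowel. In this framing, adapting to changing usage would amount to re-estimating both tables from fresh messages.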

The results also carry metadata about the regional dialect used in the text message and this can identify the country of origin of the writer. The translations of chat alphabet to Arabic script from Rosette Chat Translator amplify the knowledge of the analyst by suggesting possible sources of the message that may lie outside the core of the analyst's expertise. The encyclopedic nature of Rosette offers a deep set of options for the analysts to grade, saving them time in identifying the source. This information is kept alongside the translation for analysts to study at all subsequent stages of processing.

Integrating with the Enterprise

Rosette Chat Translator delivers its results to other software packages in an industry-standard format for further analysis by the user's custom algorithms or other modules from Basis Technology. The tool delivers the converted message along with any hints about the origins that were unlocked during the conversion.

When Rosette Chat Translator is combined with the full Rosette linguistics platform pipeline, it offers a full set of tools for collecting and analyzing a corpus of messages. Rosette Entity Extractor takes the text converted to Arabic script by the chat translator and identifies names and locations. These entities then can be fed into Rosette Name Indexer (which matches different spellings of the same name) and Rosette Name Translator (which transliterates Arabic names to English), making it simpler for the non-Arabic-speaking analyst to understand them.

The name index built by the Rosette pipeline helps analysts work with a large collection of messages by providing a quick way to identify documents containing the same entity. The complexities caused by the different spellings and linguistic structures of Arabizi are dramatically reduced with this standardized index. The large index built by Rosette is a key part of identifying relevant messages.
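As a rough illustration of how an index can group different spellings of one name, the Python sketch below reduces each spelling to a crude consonant key and files message IDs under that key. The rules in `name_key` and the helper `build_index` are hypothetical and far simpler than a production name matcher; they are tuned only to the spelling variants quoted in this paper and would wrongly merge many unrelated names.

```python
import re
from collections import defaultdict


def name_key(name):
    """Reduce a Romanized name to a crude consonant skeleton (illustrative only)."""
    s = name.lower()
    s = re.sub(r"dh|th", "d", s)         # treat the digraphs dh/th like d
    s = re.sub(r"[qgk2]", "q", s)        # q, g, k and the digit 2 vary by dialect
    s = re.sub(r"[aeiouy'\s-]", "", s)   # drop vowels, apostrophes, spaces, hyphens
    s = re.sub(r"(.)\1+", r"\1", s)      # collapse doubled letters
    return s


def build_index(messages):
    """Map each name key to the set of message IDs that mention it.

    `messages` is a dict of message ID -> list of extracted names.
    """
    index = defaultdict(set)
    for msg_id, names in messages.items():
        for name in names:
            index[name_key(name)].add(msg_id)
    return index


if __name__ == "__main__":
    index = build_index({
        "m1": ["Gadaffi"],
        "m2": ["Qadhafy"],
        "m3": ["Karbala"],
    })
    print(sorted(index[name_key("2addafi")]))
```

Looking up any one spelling then retrieves the messages that use the others. The cost is over-merging, which is why real systems rank matches by similarity instead of relying on a single hard key.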
Analysts can move faster through incoming documents and find cross-references that can unlock connections that are obscured by all of the different name spellings. The index can quickly identify all other messages with similar entities (names of people, places, and more) even when they use different spellings.

NEW TECHNOLOGY IS DRIVING FUTURE ARABIZI GROWTH

The future of Arabizi will follow changes in technology. Today, mobile handsets are just beginning to dominate the Arab countries. One study from the Egyptian government noted that handset subscriptions jumped from 55 million to 71 million during 2010 [3]. In January 2011, the penetration rate of the market was estimated to be 91%. Other Arab countries are experiencing similar explosions of interest as the technology becomes available to all income levels.

Social networks are also growing more popular and the users of these systems often choose Arabizi because it seems a natural choice. Even non-Western websites often have comments written in Arabizi instead of standard Arabic. As Facebook, Twitter, and other Western social media websites continue to grow more popular, Arabizi will follow.

Each new user adds complexity to the proliferating dialects, subdialects, and formats because each user has their own preferred set of words and orthography. These choices are imitated by friends, and the social networks amplify the structures as they pass, like trends and fads, from one to another. Just as English speakers using Twitter are building a new dialect, Arabic speakers are also echoing similar structures.

Rosette Chat Translator is the only enterprise-grade, production-quality software choice available for working with the increasing stream of Arabizi. Its hybrid collection of algorithms breaks down the language into phonetic components, enabling it to effectively match the Western spellings with Arabic words. This module then feeds the rest of the Rosette pipeline to provide a complete solution for understanding and cross-correlating the messages.

Automated tools like Rosette are essential for effectively managing the interpretation of the vast collection of data flowing through the intelligence cycle, especially when most of the information is not of interest. Flagging the most salient and potentially important messages ensures that the translators and analysts can focus on the messages with the highest potential value, saving time and money. Automation unlocks information that would be otherwise lost in a sea of data.

[3] See "ICT Indicators in Brief," Feb 2011, Arab Republic of Egypt Ministry of Communications and Information Technology (www.mcit.gov.eg). Also "The Adoption of Mobile Phones in Emerging Markets: Global Diffusion and the Rural Challenge" by Kas Kalba, International Journal of Communication 2 (2008), 631-661.

THE FLOURISHING GARDEN OF ARABIZI

We can understand why Arabizi is complex by examining an Arabic chat text message written in the Levantine dialect used in Lebanon:

akid elli 3emel hai theory wa7ad 6el3ello wala wala wala luck

Most of the words, like "wala," are Arabic spelled with English versions of the sounds [4]. The numerical digits are used because they look like Arabic letters, and sometimes they stand for sounds that are not easily approximated by English spellings. Two English words, "theory" and "luck," are here because they probably were easier for the writer to include. Perhaps they were a more accurate reflection of what the writer wanted to say, or perhaps they just came to mind before the Arabic versions. In some cases, the character just looked like the Arabic letter.
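The mixture in the sample message can be sketched in a few lines of Python. The digit table below lists commonly cited Arabizi conventions; only the digits 2 and 7 are discussed in this paper, so the other entries are assumptions, as are the two-word `ENGLISH_WORDS` set (a stand-in for a real English lexicon) and the function name `tag_tokens`.

```python
# Commonly cited digit conventions (assumed beyond 2 and 7, which the text mentions).
DIGIT_LETTERS = {
    "2": "ء",  # hamza (glottal stop)
    "3": "ع",
    "5": "خ",
    "6": "ط",
    "7": "ح",
    "9": "ص",
}

# Hypothetical stand-in for a real English word list.
ENGLISH_WORDS = {"theory", "luck"}


def tag_tokens(message):
    """Tag each token as 'english' or 'arabizi' and list the digit letters it uses."""
    tagged = []
    for token in message.split():
        if token.lower() in ENGLISH_WORDS:
            tagged.append((token, "english", []))
        else:
            letters = [DIGIT_LETTERS[ch] for ch in token if ch in DIGIT_LETTERS]
            tagged.append((token, "arabizi", letters))
    return tagged


if __name__ == "__main__":
    sample = "akid elli 3emel hai theory wa7ad 6el3ello wala wala wala luck"
    for token, kind, letters in tag_tokens(sample):
        print(token, kind, letters)
```

A lookup like this only separates the obvious cases. Tokens that are valid in both languages, or digits that are real numbers such as times, need context that a word list cannot supply.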
Research shows that the writers have many reasons for choosing particular combinations, and different people make different choices. Understanding Arabic chat is not simple because there are many different ways to translate phonemes into a Roman script. Here is an example of five different transliterations of the same word along with the group affiliation of a person who drafted the text message:

Spelling     Region
talateh      Jordanian
thalatheh    Bedouin
talata       Cairene
tlete        Lebanese
salasa       Egyptian

The spellings can depend upon local pronunciation and the writer's exposure to European languages. This example is far from complete because many words can be spelled in several dozen ways or more. Studies have identified at least 32 different ways that Western publications spell the name of the former head of Libya, Mu'ammar Qadhafi. His first name alone is commonly represented in at least five different ways in Western literature. Arabizi often includes even more variety in spelling because there are more users who are not following any standards.

[4] The smallest unit of sound in a language.

The Influence of Non-Arabic Languages

The job of transliterating Arabizi is more difficult when English is involved because the language is, like Arabizi, a polyglot tongue with orthography drawn from multiple linguistic traditions. These eight words all have the same vowel phoneme /i/ but are spelled differently: flea, free, niece, perceive, turkey, Phoebe, she, and ski. Arabizi users could choose any of them, and so a polyglot descendant of a polyglot language grows even more complex.

To make matters more complicated, each Arabic-speaking region often pronounces the same word differently. Two French-speaking writers could choose different spellings because the local versions of the word are voiced differently.

The source of the spelling is not always strictly auditory. The digit 7 is often used because it looks like a common Arabic letter, ح. Some imitate the dot above the related letter خ by putting a quote mark before the digit and some place it after. Some use an asterisk for the dot instead and also place it either before or after. So there are four common transliterations of just this one sound ('7, 7', *7, 7*), and some users will switch between them in the same message.

Understanding can grow more complex when the users add in words from other languages. The mixture of French and Arabic is often called "Franko-Arab"; including English words produces what some call "Arablish." This evolution guarantees that there will be many forms of Arabizi and the spellings will change from region to region and social group to social group.

EXPLORE FURTHER

For more information or to request an evaluation, please call us at 617-386-2090 or 800-697-2062, or write to info@basistech.com.
We will be happy to assist you in evaluating the performance of our products on your data.