Arabic Orthography vs. Arabic OCR


Rich Heritage Challenging A Much Needed Technology

Mohamed Attia

Having been spoken continuously for more than 2000 years, Arabic is doubtlessly the oldest among the major contemporary languages. Over that remarkably long history, the language has managed to meet the civil needs of successive ages and to keep pace with the geographical and ethnic expansion of its speakers, from the few inhabitants of the Arabian Peninsula to about 300 million native speakers today (collectively called Arabs), plus tens of millions of non-native speakers among the almost 1 billion non-Arab Muslims. Although the basics of its phonology, morphology, grammar, and other distinguishing components have remained essentially the same, some aspects of Arabic have naturally and continuously evolved with the passing of time and the expansion of the speaker base. One such aspect is Arabic orthography, which is also used nowadays to write other widely spoken languages such as Persian (the official language of Iran) and Urdu (the official language of Pakistan), and which was formerly used for others such as Turkish (until the late 1920s).

A. Historical Background

In the very early stages, writing was not common among the peoples of the world in general, and among Arabs in particular, who communicated mainly through speech. Being at that time mainly Bedouin tribes isolated in the harsh deserts of the Arabian Peninsula, with a superior talent for composing and memorizing poetry and little need for official documentation, the minority of Arabs who could read and write were satisfied with a relatively simple orthographic scheme. This scheme, detailed next, is based on 28 alphabetical characters {Alif, Baa, Taa, ..., Ha, Waw, Yaa}, each represented mainly by a basic shape (and its variants) called a grapheme.
The (complicating) simplification was that sets of different characters were represented by the same grapheme! While such a scheme, with only 15 (or 16) graphemes, is obviously quite ambiguous, the educated minority of talented early Arabs had no problem communicating with it among themselves. If one of them were to write the sentence "Translation is a basic means of the mutual exchange of civilizations among the peoples over the ages", the result would look like figure 1.

Fig. 1 (BareArabicOrthography.bmp); A sample sentence written in the bare Arabic orthography used in the early ages of the language.

With the emergence of Islam in the Arabian Peninsula in the early seventh century AD, the Qur'aan, the holy book of Muslims and the basic source of their jurisprudence, was revealed in Arabic. The early Muslims, who were mainly Arabs, carefully documented the holy Qur'aan in the aforementioned bare style, which was good enough for them. A few decades later, when Islam had become the religion of many non-Arab peoples, mistakes in reading the holy Qur'aan began to appear. Serious as it was, the threat of misreading the holy Qur'aan sharply alerted the Arab linguists of the time to the ambiguity of the bare orthographic style. It was then logical for the (less ambiguous) dotted orthographic scheme to replace the bare one, following the rule that each character is represented by a basic grapheme with a unique shape. To remain compatible with its predecessor, the dotted scheme cleverly added discriminating dots over and under the ambiguous graphemes. Using this scheme, our sample sentence looks like figure 2.

Fig. 2 (DottedArabicOrthography.bmp); The same sample sentence written with dotting to remove the ambiguity in character identification.

At this point Arabic orthography could perfectly describe the spelling of text, but the phonetic transcription still had to be inferred by a knowledgeable reader who had, among other tasks, to supply the short vowels and to decide whether a character belonging to {Alif, Waw, Yaa} is a consonant or a long vowel. Again, the newcomers to Islam from an ever-expanding area outside the Arabian Peninsula were troubled by all these tasks when reading the holy Qur'aan written in the dotted orthographic scheme. In response, Arabic linguists later devised an elaborate orthographic scheme containing many diacritical marks (or simply diacritics) and punctuation marks, together with a wide set of reading rules, that completely and unambiguously determine the exact phonetic transcription of the holy Qur'aan in particular and of any written Arabic text in general. This new scheme was called the Uthmani orthography (the name is sometimes rendered "Ottoman") and remains to this day the exclusively approved style for writing the holy Qur'aan, a sample page of which is shown in figure 3.

Fig. 3 (OttomanOrthography.bmp); One page of the holy Qur'aan, describing the phases of the creation of the human being, written in the Uthmani ("Ottoman") orthography.

With the maturation of the successive Muslim states (actually empires) from the Abbasids to the Ottomans, the extensive need for all kinds of official, intellectual, technical, and other documentation turned Arabic orthography into a rigorous discipline, which very early produced the concept of the font in a wide variety ranging from the practical (see figure 4) to the artistic (see figure 5). Arabic orthography was thus doubtlessly ready for the age of printing, and later for the age of digital computers.

Fig. 4 (PracticalFont.bmp); An example of a practical Arabic font (now called Traditional Arabic) from the Naskh family.

Fig. 5 (DiwaniFont.bmp); An example font from the Diwani family, with artistic effects.

B. Challenges Of Modern Arabic Orthography To OCR Technology

Equipped with the background presented above, we are in a good position to identify the features of Arabic orthography that are most challenging to OCR technology.

1- The connectivity challenge: Whether handwritten or typewritten, Arabic text can only be written in connected (cursive) mode; i.e. graphemes are connected to one another within the same word, with the connection interrupted only after a few particular characters or at the end of the word. This requires any Arabic OCR system to perform not only the traditional grapheme recognition task but also a difficult grapheme segmentation task (see figure 6). To make things even harder, the two tasks are mutually dependent and must therefore be done simultaneously.

Fig. 6 (GraphemeSegmentation.bmp); The grapheme segmentation process, illustrated by manually inserting vertical gray lines at the appropriate grapheme connection points.

2- The dotting challenge: As stated before, dotting is used extensively to differentiate characters that share similar graphemes. As figure 7 shows for some example sets of dotting-differentiated graphemes, the differences between the members of the same set, once digitized, are small. Whether the dots are removed before the recognition process or recognition features are extracted from the dotted script, dotting is a significant source of confusion, and hence of recognition errors, in typewritten Arabic OCR systems, especially when they are run on noisy documents, e.g. those produced by photocopiers. By contrast, dotting may be helpful for handwritten Arabic OCR systems, as dots are usually sensed as separate short strokes.

Fig. 7 (DottingOnSimilarCharacters.bmp); Example sets of dotting-differentiated graphemes.
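The effect of dotting on character identity can be illustrated with a small Python sketch. The grouping table is an illustrative assumption based on the standard modern Arabic alphabet, not a reproduction of figure 7, and `skeleton` is a hypothetical helper name. It maps each dotted letter to one representative of its group, so words that differ only in their dots collapse to the same form, much as they would in the bare orthography.

```python
# Illustrative (not exhaustive) groups of Arabic letters that share a basic
# shape and are told apart only by their dots. The table is an assumption made
# for this sketch; position-dependent mergers are ignored.
DOT_GROUPS = [
    "بتث",  # Baa, Taa, Thaa
    "حجخ",  # Haa, Jeem, Khaa
    "دذ",   # Dal, Thal
    "رز",   # Raa, Zay
    "سش",   # Seen, Sheen
    "صض",   # Sad, Dad
    "طظ",   # Tah, Zah
    "عغ",   # Ein, Ghein
]

# Map every letter to the first member of its group (the representative).
TO_SKELETON = {ch: group[0] for group in DOT_GROUPS for ch in group}


def skeleton(word: str) -> str:
    """Replace each letter by its group representative; other letters pass through."""
    return "".join(TO_SKELETON.get(ch, ch) for ch in word)


# Three different words ("news", "ink", "algebra") that differ only in dotting.
words = ["خبر", "حبر", "جبر"]
print({w: skeleton(w) for w in words})
print(len({skeleton(w) for w in words}))  # 1: all three collapse to one form
```

A recognizer that strips dots before classification works on these collapsed forms and must recover the dots in a later step, while one that keeps them has to separate classes whose images differ by only a few pixels.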
3- The multiple grapheme cases challenge: Because of the mandatory connectivity in Arabic orthography, the same grapheme representing the same character can have multiple variants according to its relative position within the Arabic word segment {Starting, Middle, Ending, Separate}, as exemplified by the 4 variants of the Arabic character Ein highlighted in red in figure 8.

Fig. 8 (MultipleCasesOfGrapheme.bmp); The 4 cases (Starting, Middle, Ending, and Separate) of the grapheme representing the character Ein, highlighted in red.
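The four positional cases are also visible in the Unicode standard, which encodes them as separate compatibility code points. The sketch below uses only Python's standard `unicodedata` module. Treating the Unicode presentation forms as a stand-in for a font's grapheme inventory is an assumption of this example, and `positional_forms` is a hypothetical helper name.

```python
import unicodedata


def positional_forms(letter: str) -> dict:
    """Collect the Unicode presentation forms that decompose to a single letter.

    Scans the Arabic Presentation Forms-B block (U+FE70..U+FEFF) and keeps the
    code points whose compatibility decomposition is '<tag> XXXX' for `letter`.
    """
    target = f"{ord(letter):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # parts looks like ['<initial>', '0639'] for a positional variant
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms


ein = "\u0639"  # ARABIC LETTER AIN (the character called Ein in the text)
for case, glyph in positional_forms(ein).items():
    print(case, f"U+{ord(glyph):04X}", unicodedata.name(glyph))
```

Unicode's isolated, final, initial, and medial forms correspond to the Separate, Ending, Starting, and Middle cases above. A character that interrupts the connection, such as Dal (`"\u062F"`), yields only the isolated and final forms.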

4- The ligatures challenge: To make things even more complex, certain combinations of characters at certain positions within Arabic word segments are represented by single atomic graphemes called ligatures. Ligatures are found in almost all Arabic fonts, but their number depends on how elaborate the specific font is. Figure 9 illustrates some ligatures, highlighted in red, in the well-known font Traditional Arabic.

Fig. 9 (Ligatures.bmp); Some ligatures in the Traditional Arabic font, highlighted in red.

5- Broad graphemes set: Multiple grapheme cases and the occurrence of ligatures lead directly to broad grapheme sets; e.g. a common, highly elaborate font like Traditional Arabic contains around 190 graphemes, and another common but less elaborate one (with fewer ligatures) like Simplified Arabic contains around 95 graphemes. Compare this to English, where 40 or 50 graphemes are enough! Again, a broader grapheme set means higher ambiguity, and hence more confusion.

6- The diacritics challenge: Unless the reader is knowledgeable enough, each character in Arabic strictly needs one or more diacritical marks drawn over or under the corresponding grapheme to ensure the intended phonetic transcription and hence the correct pronunciation. Apart from teaching purposes, Arabic diacritics are used in practice only when they help resolve linguistic ambiguity in the text. The problem diacritics pose for typewritten Arabic OCR is that they stack vertically, whereas the main writing direction of the body of Arabic text is horizontal, from right to left (see figure 10). Like dots, diacritics, when present, are a source of confusion for typewritten OCR systems, especially when run on noisy documents, but because of their relatively larger size they are usually handled in a preprocessing step.

Fig. 10 (DiacritizedText.bmp); Diacritics added to Arabic text.

C. Current state of the art
Among the numerous applications of typewritten OCR systems, Document Management Systems (DMS) are the largest industrial consumer. In such systems, scanned images of documents that are unavailable electronically are archived by the DMS, while OCR is run on each scanned image. The images are used for viewing the documents, whereas the text resulting from OCR is used for all kinds of Information Retrieval (IR) and Knowledge Management (KM) purposes, which are insensitive to the inevitable errors of the OCR process as long as the error rate is kept small enough (a word error rate below 4% is a reasonable criterion).
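The error-rate criterion can be made concrete with a small sketch. It assumes that the word error rate is measured as the word-level edit distance between the OCR output and a reference transcription, divided by the number of reference words; the article does not specify the measure, and the function names here are hypothetical.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, ocr_output: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, ocr_output.split()) / len(ref_words)


def usable_for_retrieval(reference: str, ocr_output: str, threshold: float = 0.04) -> bool:
    """Apply the 'below 4%' criterion (the threshold value is taken from the text)."""
    return word_error_rate(reference, ocr_output) < threshold
```

For a 50-word reference page, one misrecognized word gives a rate of 0.02 and passes the criterion, while three misrecognized words give 0.06 and fail it.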

As the Arabic market, especially in the Gulf countries, is currently a hot one, there is a considerable need for reliable typewritten Arabic OCR engines that can be integrated into such DMS. Perhaps the most ready and best-equipped system by far is Automatic Reader 7.0 from Sakhr. Offering document retrieval, omni Arabic OCR, a learning mode, a 95% word-level accuracy rate, an SDK for integration, and the ability to deal with bilingual Arabic-Latin documents, this system is currently the best choice for serious heavy-duty applications. For more details on this system, the reader can visit: http://www.sakhr.com/sakhr_e/products/ocr_off.htm?index=2&main=products&sub=ocr

Fig. 11 (AutomaticReader_Sakhr.bmp); A screen capture of Automatic Reader 7.0 (Platinum Edition) from Sakhr.

Automatic Reader 7.0 is essentially based on a huge set of ad hoc orthographic rules and heuristics placed in a framework of AI search techniques that decides on each of the Arabic OCR phases: preprocessing of tiny blocks like dots and diacritics, segmentation, and grapheme recognition. The last phase, synthesizing the recognized text, is cleverly guided by Sakhr's Arabic NLP tools, which filter out nonsensical results.

On the other hand, online handwritten OCR has turned into a real business with the boom in keyboardless handheld devices. Beyond academic pilots, practically functional Arabic handwritten OCR systems are rare, and Arabic Writer from ImagiNet can be taken as a representative product. The underlying methodology of this system is to train and deploy artificial neural networks that decide on the most likely character sequences corresponding to the dynamically sensed sequences of curvature features, with a preprocessing of the short strokes that correspond to dots and diacritics. For more details on this system, the reader can visit: http://www.imaginet-software.com/index.aspx

Fig. 12 (ArabicWriter_ImagiNet.bmp); A snapshot of Arabic Writer from ImagiNet. Note the caution box!

For a language as much in need as Arabic, there is ample room for improvement, whether to lower the error rate of typewritten systems or to allow a completely free handwriting style in online ones. The research group of Prof. Mohsen A. A. Rashwan and his postgraduate students at the Faculty of Engineering, Cairo University, Egypt, may be regarded as a representative one. They are trying a fully mathematical approach based on an analogy with ASR (Automatic Speech Recognition), in which segmentation and recognition (of graphemes, in place of phonemes) are done simultaneously using HMM techniques applied to feature-vector sequences extracted via a window sliding in the writing direction. Besides being a cleaner architecture, this promising approach to typewritten and online Arabic OCR has the virtue that accuracy and noise immunity improve as the training data grows, as is the case with ASR.
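The sliding-window front end of such an HMM-based approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the group's implementation: the window width, the step, and the two features (ink density and normalized vertical center of the ink) are chosen here for brevity, and the input is assumed to be a binarized text line given as rows of 0/1 pixels.

```python
def sliding_window_features(line_image, width=3, step=1):
    """Turn a binarized text-line image into a right-to-left sequence of feature vectors.

    line_image: list of rows, each a list of 0/1 pixels (1 = ink).
    Returns one (ink_density, vertical_center) pair per window position,
    ordered in the Arabic writing direction (rightmost window first).
    """
    height = len(line_image)
    n_cols = len(line_image[0])
    frames = []
    # Start at the right edge and move left, following the writing direction.
    for right in range(n_cols, width - 1, -step):
        left = right - width
        ink = 0
        weighted_rows = 0
        for y, row in enumerate(line_image):
            count = sum(row[left:right])
            ink += count
            weighted_rows += y * count
        density = ink / (width * height)
        # Vertical center of the ink, normalized to [0, 1]; 0.5 for an empty window.
        center = (weighted_rows / ink) / (height - 1) if ink and height > 1 else 0.5
        frames.append((density, center))
    return frames


# A tiny 3x5 "image" with ink only in the top two pixels of the rightmost column.
tiny = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
print(sliding_window_features(tiny))
```

The resulting frame sequence plays the role that acoustic feature frames play in ASR. An HMM decoder (not shown) would consume it and output the grapheme sequence together with the implied segmentation points.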

Mohamed Attia is the Arabic NLP team leader at The Engineering Company for the Development of Computer Systems (RDI, www.rdi-eg.com) and is also a PhD student in the Faculty of Engineering, Cairo University, Egypt. He can be contacted at m_atteya@rdieg.com or m_atteya2004@yahoo.com