An OCR System for Printed Nasta'liq Script: A Segmentation Based Approach

Saeeda Naz, Arif Iqbal Umar, Saad Bin Ahmed, Syed Hamad Shirazi, M. Imran Razzak, Imran Siddiqi

Department of Information Technology, Hazara University, Mansehra, Pakistan
Higher Education Department, KPK, Pakistan
King Saud Bin Abdul Aziz University for Health Sciences, Riyadh, Saudi Arabia
Department of Computer Science, Bahria University Islamabad, Pakistan

{saeedanaz292, isaadahmed, mirpak, syedhamad, imransidiqqi}@gmail.com, arifiqbalumar@gmail.com

Abstract

Machine simulation of human reading has been a subject of intensive research for almost four decades. Although recent recognition methods and systems for Latin script are very promising, automatic Urdu character recognition remains a challenging task because of the cursive nature of the script. This work introduces a robust approach based on statistical models that provides a solution for the recognition of Urdu text in the Nasta'liq style. Contrary to classical approaches, which segment text into words, ligatures or characters, we intend to employ implicit segmentation, where text lines are recognized during segmentation. The developed system will be evaluated on standard Urdu text databases and compared with the state-of-the-art recognition techniques proposed to date.

I. INTRODUCTION

Most people learn to read and write during their first few years of education. By the time they have grown out of childhood, they have already acquired very good reading and writing skills, including the ability to read most texts, whether handwritten or printed, in different fonts and styles. Most people even have no difficulty reading light or heavy prints, upside-down prints, advertisements in fancy font styles, calligraphic text, and characters with flowery ornaments or missing parts.
In contrast, despite more than four decades of intensive research, the reading skill of the computer is still far behind that of humans. In recent years, there has been a sustained demand for cursive and non-cursive Optical Character Recognition (OCR) systems, not only so that native speakers can readily use them on mobile phones and tablets, but also for the digitization of a large amount of legacy documents, such as holy books, magazines, newspapers, poetry books, and handwritten documents. Although it is a computationally intensive field, OCR has witnessed a significant improvement over the years. This is mainly due to the tremendous advances in computational intelligence algorithms. The objective of character recognition is to imitate the human reading ability, with human accuracy but at a far higher speed. The target performance is at least five characters per second with a 99.9% recognition rate [1].

OCR is the most important component of various applications, such as document automation, verification of cheques, data entry, reading machines for the visually impaired, and a large variety of other banking and business applications. OCR is an active area of research and its importance is well established with respect to the disciplines of digital image processing, pattern recognition, artificial intelligence, database systems, natural language processing, human-machine interaction, and communications. These applications can perform well only if the characters in text images are classified and recognized accurately. Most of the commercial OCR applications are concerned with machine printed Latin scripts having well-separated characters. The OCR systems for printed Japanese and Chinese are also quite mature.
Languages such as Arabic, Persian, Urdu, and Pashto are written in the Arabic script or its derivatives and are read, written and spoken by a considerable proportion of the world's population. There are many font styles of this cursive script: Nasta'liq, Kufi, Thuluth, Diwani, Riq'a, and Naskh, to name a few. Among these, Naskh and Nasta'liq are the most important; the former is preferred for Arabic, Persian, and Pashto, and the latter is adopted for Urdu typesetting. Some commercial OCRs are available for printed Arabic characters, but they have many technical problems, especially in the segmentation stage, where the results are unsatisfactory. For all practical purposes, the Urdu script is a superset of its Arabic and Persian counterparts.

Recognition of printed Arabic text has received considerable research attention, whereas surveys on recognition of Urdu text [2-5] reveal that very limited research effort has been devoted to the development of an Urdu OCR. This may be due to the complexities involved in the Nasta'liq writing style [6]. Although Urdu and Arabic share many common attributes, the techniques developed for recognition of Arabic text cannot be directly applied to Urdu text, because the Nasta'liq writing style is more complex than the Naskh writing style used for Arabic. Challenges in the recognition of Nasta'liq Urdu text include diagonality, multiple baselines, high cursiveness and context sensitivity.

Most of the studies on recognition of Urdu text use ligatures as the basic unit of recognition [31-32, 40-43]. The total number of unique Urdu ligatures is approximately 22,000 [39], and training classifiers to discriminate among such a large number of classes is a challenging task. Many studies therefore use only a small subset consisting of the most frequently used ligatures.
Among the well-known ligature based approaches, the offline Urdu OCR presented by Javed and Hussain [10] is evaluated on 1,500 ligatures from a set of 5,000 frequently occurring ligatures comprising one to eight characters. The HMM toolkit HTK is trained on these ligatures using DCT features, and a recognition rate of 93% is realized. The number of recognized ligatures (approximately 1,500) is very small compared to the total number of unique Urdu ligatures (approximately 22,000). Moreover, ligature based approaches are limited in the sense that new ligatures on which the system has not been trained cannot be recognized.

ISBN: 978-1-4799-5754-5/14/$26.00 2014 IEEE

To overcome the problem of a large number of classes in ligature based approaches, one solution is to segment the ligatures into characters and train classifiers to recognize the characters. This reduces the number of classes from the total number of unique ligatures to the total number of characters and their different shapes. The segmentation of ligatures into characters, however, is itself a challenging and error prone task [16]. To overcome these issues with ligature based and segmentation based recognition, a new trend is to employ implicit segmentation techniques, where the text is recognized during the segmentation phase itself.

Moreover, the limited work on Urdu OCR reported in the literature has mostly been evaluated on non-standard datasets, with researchers generating their own datasets to evaluate the proposed techniques. This makes an objective comparison of different methods difficult. The main goal of the proposed research is to develop an implicit segmentation based Optical Character Recognition system for printed Urdu text written in the Nasta'liq font.

The paper is organized as follows. Section II presents the related work in the field of OCR, its motivation and the research problem. Section III discusses the general steps involved in the development of an OCR, along with the notable contributions to recognition of Urdu text, followed by a discussion of our intended methodology. This section also presents the dataset and the evaluation metric we plan to work with. Finally, we conclude the paper with some remarks and our future plan of study.

II. RELATED WORK

Character recognition techniques associate a Unicode code point with the image of a character. Based on the mode of input, OCR is classified as offline or online, as illustrated in Fig. 1 [7, 8]. Offline OCR deals with digitized images of handwritten or machine printed text. The digital image of the text can be obtained from an optical scanner or a camera. In contrast, in online OCR, the input text is written directly using a tablet, a PDA, or a stylus. Online character recognition is probably easier than its offline counterpart because more information is available, such as time information, stroke coordinates, and the handwriting style of the user. A typical OCR system mainly comprises a combination of the following modules.

• Image acquisition
• Preprocessing
• Segmentation
  o Segmentation free/holistic approach
  o Segmentation based/analytical approach
    - Explicit Segmentation
    - Implicit Segmentation
• Feature extraction
  o Structural Features
  o Statistical Features
• Recognition/classification
• Post-processing

Fig. 1. Types of OCR

The images of printed or handwritten documents are acquired using a scanner, camera or digitizing tablet and are pre-processed before they are fed to the subsequent modules. Pre-processing typically involves binarization, skew and slant detection and correction, noise removal, and segmentation of text and non-text objects [9-14]. Depending upon the type of approach, the segmentation step involves splitting the text into lines, words, ligatures, characters or strokes. This step is more crucial in cursive scripts like Arabic, Urdu, Persian, Pashto, Sindhi, Malay (Jawi), Uyghur, etc. As discussed earlier, recognition techniques rely on either segmentation-based or segmentation-free approaches [15-16]. In segmentation-free or holistic approaches, the system seeks to recognize the ligature or word as a whole without segmenting it further into characters or sub-images.
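The line-splitting part of the segmentation step is commonly illustrated with a horizontal projection profile: count the ink pixels in every row of a binarized page and cut wherever the count drops to zero. The following is a minimal sketch under stated assumptions, not the implementation of any system cited here. The function name `split_lines` and the thresholds `min_ink` and `min_height` are hypothetical, and the input is assumed to be already binarized and deskewed.

```python
import numpy as np


def split_lines(binary_page, min_ink=1, min_height=2):
    """Return (top, bottom) row ranges of text lines in a binarized page.

    binary_page: 2-D array in which nonzero pixels are ink.
    min_ink, min_height: illustrative thresholds (assumptions, not
    values taken from the paper).
    """
    page = np.asarray(binary_page)
    # Horizontal projection profile: number of ink pixels in each row.
    profile = (page > 0).sum(axis=1)
    is_text = profile >= min_ink

    lines = []
    start = None
    for row, has_ink in enumerate(is_text):
        if has_ink and start is None:
            start = row  # a run of text rows begins
        elif not has_ink and start is not None:
            if row - start >= min_height:
                lines.append((start, row))  # bottom index is exclusive
            start = None
    # Close a run that touches the bottom edge of the page.
    if start is not None and len(is_text) - start >= min_height:
        lines.append((start, len(is_text)))
    return lines


# Toy page: two text bands and a one-row speck between them.
page = np.zeros((12, 20), dtype=np.uint8)
page[1:4, 2:18] = 1
page[7:10, 5:15] = 1
page[5, 10] = 1
lines = split_lines(page)
print(lines)  # [(1, 4), (7, 10)]
```

Discarding runs shorter than `min_height` is only one possible heuristic. In Nasta'liq text, dots and diacritics can sit well above or below the main body of a line, so a practical system would probably merge such thin runs into the nearest line rather than drop them.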
Generally, paragraphs in text are split into lines using horizontal projection or heuristics based methods. Text lines are then split into words or sub-words (ligatures) using vertical projections, connected component labeling, etc. [17]. In segmentation-based or analytical approaches, ligatures are segmented into characters or strokes, either explicitly or implicitly. In explicit segmentation, the words or ligatures are divided into characters or strokes [16]. Incorrect segmentation leads to misclassifications and results in reduced recognition rates. Correct segmentation of ligatures is in fact the major challenge in explicit segmentation based approaches [18-20]. In implicit segmentation, words or ligatures are segmented into smaller units while being recognized, without any accurate splitting path. Implicit segmentation is also termed straight or recognition based segmentation. These methods scan the text images line by line from right to left and segment words into characters during or after recognition using codebook entries, predefined classes or a set of features [20-24]. These approaches have been effective on highly cursive scripts and can also be employed in the development of a multilingual OCR.

Segmentation is followed by the feature extraction step, and a wide variety of statistical as well as structural features have been investigated in the literature. Structural features are typically computed by finding the extreme points and joining points [25] or by considering the number of dots, the position of the dots, the presence of branches, loops or secondary strokes, and the slope between the initial point and the final point [26, 32]. Statistical features, for which rich classifiers are available, are mostly preferred over structural features, and a large number of techniques rely on statistical features including shape descriptors, contour based statistics, edge based features and other statistical measurements computed at word, ligature or character level [26-30].

For recognition, a number of classifiers have been extensively used, including hidden Markov models (HMM) [16, 24, 33, 34], artificial neural networks (ANN) [20, 25, 31, 36], support vector machines (SVM), nearest neighbor classifiers (NN) or template matching [32-33], and decision tree classifiers [37]. In some cases, the classification step is followed by post-processing [9, 10] to improve the overall recognition accuracy of the system. Having discussed the general steps involved in an OCR system, we present the proposed solution in the next section.

III. PROPOSED SOLUTION

Our study is aimed at developing a robust optical character recognition system based on implicit segmentation. The main steps involved in our work are likely to include the following.

• Acquisition of printed Urdu text from the UPTI database employed in our study.
• Extraction and selection of features which provide the best recognition rates for implicit segmentation based Urdu OCR for printed text.
• Recognition using state-of-the-art classifiers such as recurrent neural networks, hidden Markov models, classifiers based on fuzzy logic, or conditional random fields (CRF).

A. Overview of Proposed System

We intend to work on scanned images of text from the UPTI dataset. The pre-processing in our case will comprise the traditional steps of de-noising, skew detection and correction, and binarization. The text page will be segmented into lines using horizontal projection profiles complemented with some heuristics. We intend to employ implicit segmentation and use a set of statistical features. Features such as projections and profiles, chain codes, and zone based statistical measures can be explored. Classification can be carried out using neural networks or hidden Markov models, while a language model can also be integrated with the system to improve the overall recognition rates through dictionary validation. An overview of the intended methodology is presented in Fig. 2. The system is planned to be developed in MATLAB/Python on the Windows platform. Some of our efforts are reported in [44, 45].

Fig. 2. Proposed System

B. Dataset

Most of the existing Urdu OCR systems have been evaluated on custom developed databases. This makes a quantitative comparison of different methods difficult. The Image Understanding and Pattern Recognition Group (IUPR) at the Technical University of Kaiserslautern, Germany, generated synthetic data of Urdu Nasta'liq text from leading Urdu newspapers of Pakistan and termed it the UPTI dataset. We plan to evaluate our system on this standard dataset.

C. Measurement Metric

The developed recognition system is planned to be evaluated using graph edit distance. The character level accuracy will be computed using:

\[
\text{accuracy} = 100 \times \left( 1 - \frac{\text{insertions} + \text{substitutions} + \text{deletions}}{\text{total length of test set transcription}} \right)
\]

IV. CONCLUSION

This paper proposed an OCR system for printed Urdu text in Nasta'liq script based on implicit segmentation.
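As a concrete illustration of the evaluation planned in Section III-C, the character level accuracy can be sketched as follows. This is a hedged example, not the authors' evaluation code. It assumes that insertions, substitutions and deletions are counted with an ordinary Levenshtein alignment between each ground-truth transcription and the recognizer output, which is one plausible reading of the edit distance named in that section. The function names `edit_counts` and `character_accuracy` are hypothetical.

```python
def edit_counts(reference, hypothesis):
    """Return (insertions, substitutions, deletions) of one minimum-cost
    Levenshtein alignment that turns `reference` into `hypothesis`."""
    # prev[j] = (cost, ins, sub, dele) for reference[:i-1] vs hypothesis[:j]
    prev = [(j, j, 0, 0) for j in range(len(hypothesis) + 1)]
    for i in range(1, len(reference) + 1):
        curr = [(i, 0, 0, i)]  # empty hypothesis: delete i reference chars
        for j in range(1, len(hypothesis) + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                best = prev[j - 1]  # match, no extra cost
            else:
                c, a, s, d = prev[j - 1]
                best = (c + 1, a, s + 1, d)  # substitution
            c, a, s, d = prev[j]
            candidate = (c + 1, a, s, d + 1)  # delete reference[i-1]
            if candidate[0] < best[0]:
                best = candidate
            c, a, s, d = curr[j - 1]
            candidate = (c + 1, a + 1, s, d)  # insert hypothesis[j-1]
            if candidate[0] < best[0]:
                best = candidate
            curr.append(best)
        prev = curr
    _, ins, sub, dele = prev[-1]
    return ins, sub, dele


def character_accuracy(references, hypotheses):
    """accuracy = 100 * (1 - (ins + sub + del) / total reference length)."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        errors += sum(edit_counts(ref, hyp))
        total += len(ref)
    if total == 0:
        raise ValueError("the test set transcription is empty")
    return 100.0 * (1.0 - errors / total)


print(character_accuracy(["kitten"], ["sitting"]))  # 50.0
```

Python strings are compared here one Unicode code point at a time. Whether Urdu transcriptions should be normalized before comparison is a design choice that this sketch leaves open.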
A set of statistical features will be extracted and fed to the classifier for recognition. The developed technique will also be evaluated on the standard UPTI database and will be compared with existing state-of-the-art Urdu OCRs.

REFERENCES

[1] V. Govindan and A. Shivaprasad, "Character Recognition - A Review," Pattern Recognition, vol. 23, no. 7, pp. 671-683, 1990.
[2] S. Naz, K. Hayat, M.I. Razzak, M.W. Anwar, S.A. Madani, and S.U. Khan, "The optical character recognition of Urdu-like cursive scripts," Pattern Recognition, vol. 47, no. 3, pp. 1229-1248, 2014.
[3] S. Naz, K. Hayat, M.I. Razzak, M.W. Anwar, and H. Akbar, "Arabic script based character segmentation: A review," In Computer and Information Technology (WCCIT), World Congress on, pp. 1-6, 2013.
[4] S. Naz, K. Hayat, M.I. Razzak, M.W. Anwar, and S.Z. Khan, "Challenges in Baseline Detection of Arabic Script Based Languages," Springer International Publishing in Intelligent Systems for Science and Information, pp. 181-196, 2014.
[5] S. Naz, K. Hayat, M.I. Razzak, M.W. Anwar, and H. Akbar, "Challenges in baseline detection of cursive script languages," In Science and Information Conference (SAI), pp. 551-556, 2013.
[6] S. Naz, K. Hayat, M.I. Razzak, M.W. Anwar, and H. Akbar, "Arabic script based language character recognition: Nasta'liq vs Naskh analysis," In Computer and Information Technology (WCCIT), World Congress on, pp. 1-7, 2013.
[7] B. Al-Badr and S. A. Mahmoud, "Survey and Bibliography of Arabic Optical Text Recognition," Signal Processing, vol. 41, no. 1, pp. 49-77, 1995.
[8] L.M. Lorigo and V. Govindaraju, "Online Arabic Handwriting Recognition: A Survey," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 5, pp. 712-724, 2006.
[9] M. Naz, Q. U. A. Akram, and S. Hussain, "Binarization and its Evaluation for Urdu Nastalique Document Images," Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, Pakistan, 2014.
[10] F. Shafait, D. Keysers, and T. M. Breuel, "Layout analysis of Urdu document images," In Multitopic Conference, pp. 293-298, 2006.
[11] R.J. Ramteke and I. K. Pathan, "Noise Reduction in Urdu Document Image - Spatial and Frequency Domain Approaches," In Proc. 4th International Conference on Signal and Image Processing 2012 (ICSIP'12), volume 222 of Lecture Notes in Electrical Engineering, pp. 443-452, Springer India, 2013.
[12] D.S. Le, G. R. Thoma, and H. Wechsler, "Automated Page Orientation and Skew Angle Detection for Binary Document Images," Pattern Recognition, vol. 27, no. 10, pp. 1325-1344, 1994.
[13] R.J. Ramteke, K. P. Imran, and S. C. Mehrotra, "Skew Angle Estimation of Urdu Document Images: A Moments based Approach," International Journal of Machine Learning and Computing, vol. 1, no. 1, pp. 7-12, 2011.
[14] S. F. Rashid, S. S. Bukhari, F. Shafait, and T. M. Breuel, "A Discriminative Learning Approach for Orientation Detection of Urdu Document Images," In Proc. 13th International Multitopic IEEE Conference (INMIC'09), pp. 1-5, 2009.
[15] S.T. Javed and S. Hussain, "Improving Nastalique Specific Pre-Recognition Process for Urdu OCR," In Proc. 13th International Multitopic IEEE Conference (INMIC'09), pp. 1-6, 2009.
[16] S. T. Javed, S. Hussain, A. Maqbool, S. Asloob, S. Jamil, and H. Moin, "Segmentation Free Nastalique Urdu OCR," World Academy of Science, Engineering and Technology, vol. 46, pp. 456-461, 2010.
[17] B. Al-Badr and S. A. Mahmoud, "Survey and Bibliography of Arabic Optical Text Recognition," Signal Processing, vol. 41, no. 1, pp. 49-77, 1995.
[18] A.M. Zeki, "The Segmentation Problem in Arabic Character Recognition - The State of the Art," In Proc. 1st International Conference on Information and Communication Technologies (ICICT'05), pp. 11-26, 2005.
[19] Y.M. Alginahi, "A survey on Arabic character segmentation," International Journal on Document Analysis and Recognition, pp. 1-22, 2012.
[20] Z. Ahmad, J. K. Orakzai, and I. Shamsher, "Urdu Compound Character Recognition using Feed Forward Neural Networks," In Proc. 2nd International Conference on Computer Science and Information Technology (ICCSIT'09), pp. 457-462, 2009.
[21] A. Ul-Hasan, S. B. Ahmed, F. Rashid, F. Shafait, and T. M. Breuel, "Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks," 12th International Conference on Document Analysis and Recognition (ICDAR'13), pp. 1061-1065, 2013.
[22] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, "A Novel Connectionist System for Unconstrained Handwriting Recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, 2009.
[23] M.S. Khorsheed, "Recognising handwritten Arabic manuscripts using a single hidden Markov model," Pattern Recognition Letters, vol. 24, pp. 2235-2242, 2003.
[24] M.S. Khorsheed, "Offline recognition of omnifont Arabic text using the HMM ToolKit (HTK)," Pattern Recognition Letters, vol. 28, pp. 2235-2242, 2007.
[25] I. Shamsher, Z. Ahmad, J. K. Orakzai, and A. Adnan, "OCR for Printed Urdu Script using Feed Forward Neural Network," Proc. World Academy of Science, Engineering and Technology, vol. 23, pp. 172-175, 2007.
[26] R.G. Casey and G. Nagy, "Recursive Segmentation and Classification of Composite Character Patterns," In Proc. 6th International Conference on Pattern Recognition, vol. 2, pp. 1023-1026, 1982.
[27] F. Hussain and J. Cowell, "Extracting Features from Arabic Characters," In Proc. 2nd International Conference on Computer Graphics and Imaging (CGIM'01), pp. 201-206, 2001.
[28] J. Cowell and F. Hussain, "A Fast Recognition System for Isolated Arabic Characters," In Proc. 6th International Conference on Information Visualisation, pp. 650-654, London, UK, 2002.
[29] A. Muaz, "Urdu Optical Character Recognition System," Master's thesis, National University of Computer & Emerging Sciences, Lahore, Pakistan, 2010.
[30] S.A. Hussain, S. Zaman, and M. Ayub, "A Self Organizing Map based Urdu Nasakh Character Recognition," In Proc. International Conference on Emerging Technologies (ICET'09), pp. 267-273, 2009.
[31] D.B. Megherbi, S. M. Lodhi, and A. J. Boulenouar, "Fuzzy-Logic-Model-based Technique with Application to Urdu Character Recognition," Proc. SPIE Applications of Artificial Neural Networks in Image Processing, vol. 3962, pp. 13-24, 2000.
[32] Z.A. Shah, "Ligature based Optical Character Recognition of Urdu-Nastaleeq Font," In Proc. 6th International Multitopic IEEE Conference (INMIC'02), 2002.
[33] S.A. Husain, "A Multi-Tier Holistic Approach for Urdu Nastaliq Recognition," In Proc. 6th International Multitopic IEEE Conference (INMIC'02), pp. 528-532, 2002.
[34] M. Decerbo, E. MacRostie, and P. Natarajan, "The BBN Byblos Pashto OCR system," In Proc. 1st ACM Workshop on Hardcopy Document Processing (HDP '04), pp. 29-32, 2004.
[35] R. Safabakhsh and P. Adibi, "Nastaaligh Handwritten Word Recognition Using a Continuous-Density Variable-Duration HMM," The Arabian Journal for Science and Engineering, vol. 30, no. 1B, pp. 95-118, 2005.
[36] S.N. Nawaz, M. Sarfraz, A. Zidouri, and W. G. Al-Khatib, "An Approach to Online Arabic Character Recognition using Neural Networks," In Proc. 10th International Conference on Electronics, Circuits and Systems (ICECS'03), vol. 3, pp. 1328-1331, 2003.
[37] U. Pal and A. Sarkar, "Recognition of Printed Urdu Script," In Proc. Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), pp. 1183-1187, 2003.
[38] D.S. Guru, S. K. Ahmed, and K. Irfan, "An Attempt towards Recognition of Handwritten Urdu Characters: A Decision Tree Approach," In Proc. National Conference on Computers and Information Technology (NCCIT'01), pp. 75-83, 2001.
[39] A.M. Jamil, "Noori Nastaliq - Revolution in Urdu composing," book, Elite Publishers Ltd for Noori Nastaliq Foundation, 2008.
[40] S.A. Sattar, "A Technique for the Design and Implementation of an OCR for Printed Nastalique Text," PhD thesis, NED University of Engineering & Technology, Karachi, Pakistan, 2009.
[41] U. Iftikhar, "Recognition of Urdu Ligatures," Master's thesis, VIBOT Consortium and German Research Center for Artificial Intelligence (DFKI), 2011.
[42] N. Sabbour and F. Shafait, "A segmentation-free approach to Arabic and Urdu OCR," In IS&T/SPIE Electronic Imaging, pp. 86580N-86580, International Society for Optics and Photonics, 2013.
[43] G.S. Lehal and A. Rana, "Recognition of Nastalique Urdu Ligatures," In Proceedings of the 4th International Workshop on Multilingual OCR, ACM, 2013.
[44] S.B. Ahmed, S. Naz, Salahuddin, M.I. Razzak, A.A. Khan, and A.I. Umar, "UCOM Offline Dataset - An Urdu Handwritten Dataset Generation," accepted in IAJIT, unpublished.
[45] S.B. Ahmed, S. Naz, Salahuddin, M.I. Razzak, and A.I. Umar, "Handwritten Urdu Character Recognition using Recurrent Neural Networks," accepted in Neural Computing and Applications, unpublished.