Arabic Script Web Document Language Identifications Using Neural Network


Ali Selamat, Ng Choon Ching, Siti Nurkhadijah Aishah Ibrahim 1)

Abstract

This paper presents experiments in identifying the language of Arabic-script web documents using neural networks. Identifying the languages written in Arabic script, such as Persian, Turkish, Urdu, and Jawi, involves particular difficulties. Since a vast amount of information is presented to internet users, it is crucial to find an appropriate language identification method for a variety of textual information. Several current approaches rely on compiled dictionaries and conventional statistical techniques for language identification. We analyzed the feasibility of a windowing algorithm for selecting the best features for a neural network. From the experiments, we found that the non-sliding-windows neural network achieved better accuracy than the sliding-windows neural network.

1. Introduction

Language identification is the process of recognizing the natural language of human communication in given content, for instance English, Malay, Chinese, Japanese, or Arabic. Nowadays there is a great deal of information across the internet, in both text and image form, encoded in different languages. In addition, many natural processes are coded as strings of characters or letters, such as DNA representations (CGTA) and web page syntax (<html>, <title>, <body>). Consequently, there are difficulties in identifying the languages used in web documents. Furthermore, textual documents collected from the internet present several problems, such as irrelevant information within a web document, unstructured useful textual information, and spelling and syntax errors. Another issue is the character encoding used in web documents. Although Unicode is a standard encoding for internet publication, many web documents still face encoding problems.
For example, it is common for internet users to come across unknown symbols within a web document because of incompatible encoding in the computer system. In order to recognize multilingual documents in the Arabic scripts used for official languages in countries such as Iran, Iraq, Libya, and Pakistan, a machine learning method is needed, especially for the classification process amid the explosive growth of web documents.

1) Faculty of Computer Science and Information System (FSKSM), Universiti Teknologi Malaysia (UTM), aselamat@utm.my, simon5u@yahoo.com, echoas1306@yahoo.com

Many methods have been developed for language identification of textual documents, including neural networks [1], [2], [3], n-grams [9], [6], statistical approaches [11], and support vector machines [12]. Neural networks have a good ability to learn and adapt to abstract information derived from the real world. This ability to learn and adapt makes the neural network a useful implementation tool in many application areas, and it has been used successfully in both text-based and speech-based language identification. Neural network methods may differ in their input vectors and network topology depending on the application. For example, a hybrid approach employing a multi-layer neural network and a decision-rule block as a priori information for bilingual language identification has been developed [1]. This approach is a further improvement over earlier work using a hybrid neural network and rule-based system for text-to-phoneme mapping [2]. A scalable neural network has been successfully applied in a multilingual automatic speech recognition (ASR) system, where the memory of the model can be scaled to meet the requirements of the target platform [3]. Multilingual recognition systems are growing fast in various applications, such as speech synthesis or text-to-speech (TTS) in the field of speech processing [13], and optical character recognition [7], [8]. Automatic language identification (LID) is an integral part of multilingual speech recognition systems that use dynamic vocabularies. Many language identification tools for web documents have been developed, although most of the research has focused on European and Asian languages. In this paper, we focus on the identification of Arabic-script languages, namely Persian and Arabic, using a neural network, since there is a need to identify and classify the similar words belonging to each of the two natural languages.
We analyzed the feasibility of a windowing algorithm as an alternative method for selecting the best features for neural network classification.

2. Preprocessing Web Documents

Before the preprocessing phase, we developed a crawler agent to crawl the World Wide Web (WWW) and gather the web documents related to this research. It starts the crawling process from several seeds (Uniform Resource Locators, URLs) and crawls the web servers that contain pages in Arabic script. In this paper, we focus on the Arabic and Persian languages. We used 120 Arabic-language documents and 120 Persian-language documents obtained from the crawling process. Figure 1 shows an example of an Arabic document before and after the stopping and stemming processes. Finally, these documents were encoded into decimal form using online tools [16].

2.1. Stopping and Stemming

In the preprocessing part, we applied stopping and stemming in order to achieve more accurate results in the later stages. Stopping, or stop-word removal, filters out certain words prior to, or after, processing of natural language text. Stemming is a process for reducing inflected (or sometimes derived) words to their stem, base, or root form, which is generally a written word form [4], [15].
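As a rough illustration of the stopping, stemming, and decimal-encoding steps above, the following sketch uses a hypothetical stop-word list and affix tables (the paper does not publish its lists) and encodes characters as Unicode code points in place of the online tool [16]:

```python
import re

# Hypothetical, abbreviated stop-word list and affix tables -- illustrative only.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}
PREFIXES = ("ال", "و", "ب", "ك", "ف")
SUFFIXES = ("ها", "ات", "ون", "ين", "ان", "ة", "ه")

def stem(word):
    # Strip at most one prefix and one suffix, keeping a stem of >= 2 letters.
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[: -len(s)]
            break
    return word

def preprocess(text):
    # Tokenize, drop stop words, and stem the remaining tokens.
    tokens = re.findall(r"\w+", text)
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def encode_decimal(words):
    # Represent each character by its decimal Unicode code point.
    return [[ord(c) for c in w] for w in words]
```

For example, `preprocess` applied to a short Arabic phrase drops the particle "في" and strips the definite article "ال" before the stems are encoded as lists of decimals.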

Figure 1: (a) Arabic script before preprocessing; (b) Arabic script after preprocessing

2.2. Input Pattern Selection

In order to train the synaptic weights of the neural network and evaluate its performance, training and testing datasets with corresponding language tags must be available. We selected 80% of the documents of each language as the training dataset and used the remaining documents as the testing dataset. For example, the 120 Arabic documents were divided into a training dataset derived from the first 100 words of each of 96 documents, while the remaining documents, from which the first 100 words were likewise taken, served as the testing dataset. We applied the same procedure to the 120 Persian documents, so that a single training dataset and a single testing dataset contained both the Arabic and Persian texts. Figure 2 shows the overview of preparing input patterns for the neural network. Each preprocessed document was encoded into a decimal document; then the selected preprocessed documents were combined into one training document and one testing document. Next, the input patterns were captured by the windowing network according to the window size used. Finally, the selected input patterns were used to train the neural network, and the testing dataset was used to evaluate the effectiveness of the trained networks. The window size used for capturing characters was varied over the range of 2-50 characters to further investigate the performance of the neural network. The actual output of the language identified by the neural network model was then compared with the desired output to calculate the accuracy of the network.

3. Neural Network Topology

In this research, we use a multi-layer perceptron (MLP) back-propagation neural network, as shown in Figure 3. The architecture consists of an input layer, two hidden layers, and an output layer. The number of neurons in the input layer is equal to the window size (i).
We have eight nodes in the first hidden layer and nine nodes in the second. The output layer consists of two nodes, corresponding to the two languages, as shown in Table 1.
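A minimal sketch of the i-8-9-2 topology described above, assuming sigmoid activations and small random initial weights (the paper does not specify the activation function or initialization scheme):

```python
import numpy as np

def build_mlp(window_size, seed=0):
    # Weight matrices for an i-8-9-2 network (Figure 3), randomly initialized.
    rng = np.random.default_rng(seed)
    sizes = [window_size, 8, 9, 2]  # input, two hidden layers, output
    return [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes, sizes[1:])]

def forward(weights, x):
    # Apply a sigmoid activation layer by layer; output is one node per language.
    a = np.asarray(x, dtype=float)
    for w in weights:
        a = 1.0 / (1.0 + np.exp(-(a @ w)))
    return a

weights = build_mlp(window_size=25)
output = forward(weights, np.zeros(25))  # two activations: one per candidate language
```

A forward pass on any 25-dimensional input produces one activation per output node, i.e. one score per candidate language.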

Figure 2: Feature selection of the preprocessed data

Figure 3: Multi-layer perceptron (MLP) back-propagation neural network

We have identified that a static artificial neural network architecture brings some problems that must be addressed during modeling, before applying the network: once the architecture of the neural network is set, there is no flexibility to feed it data of different dimensions. We therefore constructed the training set from words captured from Persian and Arabic language documents. The training dataset is also used to test various window sizes in order to handle the dynamic length of words.

3.1. Windowing Algorithm

Figure 4: Sliding windows neural network

The idea of a windowing algorithm was successfully implemented in NETtalk [10], a parallel network that learns to read English words aloud. Two approaches are applied in this research: first, a sliding-windows neural network and, second, a non-sliding-windows neural network. The windowing concept refers to the size of the input layer of the neural network: if the window size is i units, then the input layer of the neural network also has i nodes. Various window sizes are used to evaluate their influence on the network. The sliding-windows network captures each input pattern according to the window size used by moving the window ahead one letter at a time until the end of the document, as shown in Figure 4. The non-sliding-windows network captures the input patterns by moving the whole window to the next input (Figure 5). The total number of characters in each document and the total number of tokens captured by the window are denoted j and a, respectively. For the non-sliding-windows neural network, a is given by

a = fix(j / i),

where fix rounds the quotient toward zero. For example, if j divided by i equals 2005.65, applying fix gives a = 2005, the number of input patterns. For the sliding-windows neural network, a is given by

a = j - i + 1.

Figure 5: Non-sliding windows neural network

3.2. Neural Network Training

During the training process of the neural network, the connection weights between the nodes are initialized with random values. The training dataset is presented to the network, and the connection weights are adjusted according to the error back-propagation learning rule. This process is repeated until the target mean squared error (MSE) or the maximum number of iterations is reached. The parameters of the error back-propagation neural network were set as shown in Table 2. The learning rate and momentum rate were set to 0.5 for all experiments. Consequently, the learning process of the neural network was faster, and the momentum rate was used to avoid local minima;

otherwise, training would be time consuming when the input layer dimension changed from two to fifty neurons.

4. Experimental Setting

We conducted experiments for the sliding-windows and non-sliding-windows neural networks. The 120 Arabic documents and 120 Persian documents retrieved from the internet were used for validating the design architecture of the neural network. k-fold cross validation, an accuracy estimation method for classification models [14], was used to validate the experiments. The dataset D is randomly split into k mutually exclusive subsets (the folds), D_1, D_2, ..., D_k, of approximately equal size. The cross-validation accuracy estimate E_t for fold t is the number of correct identifications (co) divided by the number of instances (a) in that fold,

E_t = co_t / a_t,

and the overall cross-validation accuracy estimate is given by

Acc_overall = (1/k) * (E_1 + E_2 + ... + E_k).

In this paper, we used 5-fold cross validation as a baseline for validating the neural network model.

5. Results and Discussion

Figure 6 shows the results obtained from the experiments. Four experiments were performed, where S-500 is a sliding-windows network trained for 500 epochs, NS-500 is a non-sliding-windows network trained for 500 epochs, and so on. Overall, the results produced by the sliding-windows network are almost comparable with those of the non-sliding-windows network. The NS-1000 experiment shows that most of its results are higher than the others. Figure 7 shows the average for each experiment, where NS-1000 achieved the highest accuracy of all the experiments. We observed that when the dataset used is smaller than in the sliding-windows experiments and 1000 epochs are used, the network weights are improved, but more training time is needed. Furthermore, if we increase the time taken for learning, the accuracy of the identification process improves.
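The 5-fold accuracy estimate used in these experiments can be sketched as follows; the fold assignment and the constant toy classifier are illustrative assumptions, not the paper's setup:

```python
import random

def kfold_accuracy(instances, labels, predict, k=5, seed=0):
    # Shuffle indices and split into k roughly equal folds, then
    # average the per-fold accuracies E_t = co_t / a_t.
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    folds = [idx[t::k] for t in range(k)]
    fold_accs = []
    for fold in folds:
        correct = sum(1 for i in fold if predict(instances[i]) == labels[i])
        fold_accs.append(correct / len(fold))
    return sum(fold_accs) / k

# Toy check: a constant classifier on a balanced two-language dataset
# should score 50% regardless of how the folds are drawn.
data = list(range(10))
labels = ["ar", "fa"] * 5
acc = kfold_accuracy(data, labels, predict=lambda x: "ar")
```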
The highest accuracy we achieved is 78.2%, with a non-sliding window of size 25 trained for 500 epochs. We assume that if significant letters exist in the language used for training, they will increase the accuracy of the network. Therefore, some unique characters will be excluded from the preprocessing steps in future investigations in order to increase classification accuracy. In addition, the majority language identified across the input patterns of a document will be investigated as a way to improve prediction of the language used in the given document.
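For reference, the two window-capture schemes of Section 3.1 can be sketched as below, with token counts matching a = fix(j / i) for the non-sliding window and a = j - i + 1 for the sliding window:

```python
def sliding_tokens(text, i):
    # Move the window ahead one character at a time: a = j - i + 1 tokens.
    return [text[p:p + i] for p in range(len(text) - i + 1)]

def non_sliding_tokens(text, i):
    # Move the whole window each step: a = fix(j / i) tokens,
    # with any trailing partial window discarded.
    return [text[p:p + i] for p in range(0, len(text) - i + 1, i)]

# For a document of j = 10 characters and a window size of i = 3:
doc = "abcdefghij"
a_sliding = len(sliding_tokens(doc, 3))          # 10 - 3 + 1 = 8
a_non_sliding = len(non_sliding_tokens(doc, 3))  # fix(10 / 3) = 3
```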

Figure 6: Classification accuracy of the different network topologies

Figure 7: Results analysis of the sliding-window neural network

In contrast, window sizes of 2-6 units cause low network performance. This may be because duplication of input patterns occurs mostly in this range. Training and testing processes are also disturbed by noise in the dataset, namely irrelevant decimal numbers in the input patterns. From the figure, we observe that the plots reach their peak in the middle, which may correspond to the optimum window size for the network. Neural networks with different topologies will be investigated to further validate the windowing concept in language identification. In addition, extended datasets covering languages such as Urdu, Turkish, and Jawi will be used to clarify the feasibility of the windowing concept in neural networks. Furthermore, other classification networks, such as hybrid neural networks or unsupervised networks, will also be implemented as baselines for comparing different techniques.

6. Conclusions

We have presented experiments on language identification of Arabic script using a sliding-windows neural network and a non-sliding-windows neural network. The analysis of the neural network

accuracy for both approaches has also been presented in this paper. The experiments showed that both approaches can be further improved for language identification. A preprocessing stage with a more sophisticated stop-word determination criterion will be implemented to filter out noisy words. The dataset will also be extended to other Arabic scripts, such as Urdu and Pashto, for further investigation. In addition, different classification techniques, such as adaptive resonance theory (ART) or other neural networks, will be tried in order to improve the performance of language identification.

Acknowledgment

This work is supported by the Ministry of Science, Technology and Innovation (MOSTI), Malaysia and the Research Management Center, Universiti Teknologi Malaysia (UTM) under Vot 78099. The authors wish to thank the reviewers for their valuable suggestions.

References

[1] Bilcu, E.B. and Astola, J.: "A Hybrid Neural Network for Language Identification from Text", Proceedings of the 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, pp. 253-258, 2006.

[2] Bilcu, E.B., Astola, J. and Saarinen, J.: "A Hybrid Neural Network/Rule-Based System for Bilingual Text-To-Phoneme Mapping", Proceedings of the 2004 14th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, pp. 345-354, 2004.

[3] Tian, J. and Suontausta, J.: "Scalable Neural Network Based Language Identification from Written Text", ICASSP '03, Vol. 1, pp. 48-51, April 2003.

[4] Lee, Y., Papineni, K., Roukos, S., Emam, O. and Hassan, H.: "Language Model Based Arabic Word Segmentation", Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 399-406, 2003.

[5] Selamat, A. and Omatu, S.: "Web page feature selection and classification using neural networks", Information Sciences, Vol. 158, No. 1, pp. 69-88, January 2004.

[6] Li, H., Ma, B.
and Lee, C.H.: "A Vector Space Modeling Approach to Spoken Language Identification", IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, No. 1, 2007.

[7] Tan, T.N.: "Written Language Recognition Based on Texture Analysis", Proc. IEEE ICIP96, Vol. 3, pp. 185-188, 1996.

[8] Tan, C.L., Leong, T.Y. and He, S.: "Language identification in multilingual documents", International Symposium on Intelligent Multimedia and Distance Education (ISIMADE'99), Baden-Baden, Germany, pp. 59-64, 2-7 August 1999.

[9] Prager, J.M.: "Linguini: Language Identification for Multilingual Documents", Proceedings of the 32nd Hawaii International Conference on System Sciences, 1999.

[10] Sejnowski, T.J. and Rosenberg, C.R.: "Parallel Networks That Learn to Pronounce English Text", Complex Systems, Vol. 1, pp. 145-168, 1987.

[11] Yang, Y.M.: "An Evaluation of Statistical Approaches to Text Categorization", Information Retrieval, Vol. 1, No. 1/2, pp. 69-90, 1999.

[12] Zhai, L.F., Siu, M.H., Yang, X. and Gish, H.: "Discriminatively Trained Language Models Using Support Vector Machines for Language Identification", IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, pp. 1-6, June 2006.

[13] Embrechts, M.J. and Arciniegas, F.: "Neural Networks for Text-to-Speech Phoneme Recognition", Proceedings of the IEEE Systems, Man and Cybernetics Conference, pp. 3582-3587, 2000.

[14] Kohavi, R.: "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection", International Joint Conference on Artificial Intelligence (IJCAI), 1995.

[15] Wikimedia Foundation, http://en.wikipedia.org/wiki/wiki, accessed May 2007.

[16] Schou, P., www.paulschou.com/xlate/tools, accessed June 2007.