Arabic Script Web Document Language Identifications Using Neural Network


Ali Selamat, Ng Choon Ching, Siti Nurkhadijah Aishah Ibrahim 1)

Abstract

This paper presents experiments in identifying the language of Arabic-script web documents using neural networks. Identifying the languages written in Arabic script, such as Persian, Turkish, Urdu, and Jawi, involves particular difficulties. Since a vast amount of information is presented to internet users, it is crucial to find an appropriate language identification method for a variety of textual information. Several current approaches rely on compiled dictionaries and conventional statistical techniques for language identification. We analyzed the feasibility of a windowing algorithm for selecting the best features for a neural network. From the experiments, we found that the non-sliding-windows neural network achieved better accuracy than the sliding-windows neural network.

1. Introduction

Language identification is the process of recognizing the natural language of human communication in given content, for instance English, Malay, Chinese, Japanese, or Arabic. Nowadays there is a great deal of information across the internet, in both text and image form, encoded in different languages. In addition, many natural processes are coded as strings of characters or letters, such as DNA representations (CGTA) and web page syntax (<html>, <title>, <body>). Consequently, there are difficulties in identifying the languages used in web documents. Furthermore, textual documents collected from the internet present several problems, such as irrelevant information within a web document, unstructured useful textual information, and spelling and syntax errors. Another issue is the character encoding used in web documents. Although Unicode is a standard encoding for internet publication, many web documents still face encoding problems.
For example, it is common for internet users to come across unknown symbols within a web document because of incompatible encoding in the computer system. In order to recognize multilingual documents in the Arabic scripts used for official languages in countries such as Iran, Iraq, Libya, and Pakistan, a machine learning method is needed, especially for the classification process amid the explosive growth of web documents.

1) Faculty of Computer Science and Information System (FSKSM), Universiti Teknologi Malaysia (UTM), aselamat@utm.my, simon5u@yahoo.com, echoas1306@yahoo.com

Many methods have been developed for language identification of textual documents, including neural networks [1], [2], [3], n-grams [9], [6], statistical approaches [11], and support vector machines [12]. Neural networks have a good ability to learn and adapt to abstract information derived from the real world. This ability to learn and adapt makes the neural network a useful implementation tool in many application areas, and it has been used successfully in both text-based and speech-based language identification. Neural network methods may differ in their input vectors and network topology depending on the application. For example, a hybrid approach employing a multi-layer neural network and a decision-rule block as a priori information for bilingual language identification has been developed [1]. This approach is a further improvement over earlier work using a hybrid neural network and rule-based system for text-to-phoneme mapping [2]. A scalable neural network has been successfully applied in a multilingual automatic speech recognition (ASR) system, where the memory of the model can be scaled to meet the requirements of the target platform [3]. Multilingual recognition systems are growing fast in various applications, such as speech synthesis or text-to-speech (TTS) in the field of speech processing [13], and optical character recognition [7], [8]. Automatic language identification (LID) is an integral part of multilingual speech recognition systems that use dynamic vocabularies. Many language identification tools for web documents have been developed, although most of the research has focused on European and Asian languages. In this paper, we focus on the identification of Arabic-script languages, namely Persian and Arabic, using a neural network, since there is a need to identify and classify the similar words belonging to each of the two natural languages.
We analyzed the feasibility of a windowing algorithm as an alternative method for selecting the best features for neural network classification.

2. Preprocessing Web Documents

Before the preprocessing phase, we developed a crawler agent to crawl the World Wide Web (WWW) and gather the web documents related to this research. It starts the crawling process from several seeds (Uniform Resource Locators, URLs) and crawls the web servers that contain pages in Arabic script. In this paper, we focus on the Arabic and Persian languages. We used 120 Arabic-language documents and 120 Persian-language documents obtained from the crawling process. Figure 1 shows an example of an Arabic document before and after the stopping and stemming processes. Finally, these documents were encoded into decimal form using online tools [16].

2.1. Stopping and Stemming

In the preprocessing part, we applied stopping and stemming in order to achieve more accurate results in the later stages. Stopping, or stop-word removal, filters out certain words prior to, or after, processing of natural language text. Stemming is a process for reducing inflected (or sometimes derived) words to their stem, base, or root form, which is generally a written word form [4], [15].
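As a rough illustration of the stopping, stemming, and decimal-encoding steps above, the following sketch uses a hypothetical stop-word list and affix tables (the paper does not publish its lists) and encodes characters as Unicode code points in place of the online tool [16]:

```python
import re

# Hypothetical, abbreviated stop-word list and affix tables -- illustrative only.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}
PREFIXES = ("ال", "و", "ب", "ك", "ف")
SUFFIXES = ("ها", "ات", "ون", "ين", "ان", "ة", "ه")

def stem(word):
    # Strip at most one prefix and one suffix, keeping a stem of >= 2 letters.
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[: -len(s)]
            break
    return word

def preprocess(text):
    # Tokenize, drop stop words, and stem the remaining tokens.
    tokens = re.findall(r"\w+", text)
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def encode_decimal(words):
    # Represent each character by its decimal Unicode code point.
    return [[ord(c) for c in w] for w in words]
```

For example, `preprocess` applied to a short Arabic phrase drops the particle "في" and strips the definite article "ال" before the stems are encoded as lists of decimals.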

Figure 1: (a) Arabic script before preprocessing; (b) Arabic script after preprocessing

2.2. Input Pattern Selection

In order to train the synaptic weights of the neural network and evaluate its performance, training and testing datasets with corresponding language tags must be available. We selected 80% of the documents of each language as the training dataset and used the remaining documents as the testing dataset. For example, the 120 Arabic documents were divided into a training dataset derived from the first 100 words of each of 96 documents, while the remaining documents, from which the first 100 words were likewise taken, served as the testing dataset. We applied the same procedure to the 120 Persian documents, so that a single training dataset and a single testing dataset contained both the Arabic and Persian texts. Figure 2 shows the overview of preparing input patterns for the neural network. Each preprocessed document was encoded into a decimal document; then the selected preprocessed documents were combined into one training document and one testing document. Next, the input patterns were captured by the windowing network according to the window size used. Finally, the selected input patterns were used to train the neural network, and the testing dataset was used to evaluate the effectiveness of the trained networks. The window size used for capturing characters was varied over the range of 2-50 characters to further investigate the performance of the neural network. The actual output of the language identified by the neural network model was then compared with the desired output to calculate the accuracy of the network.

3. Neural Network Topology

In this research, we use a multi-layer perceptron (MLP) back-propagation neural network, as shown in Figure 3. The architecture consists of an input layer, two hidden layers, and an output layer. The number of neurons in the input layer is equal to the window size (i).
We have eight nodes in the first hidden layer and nine nodes in the second. The output layer consists of two nodes, corresponding to the two languages, as shown in Table 1.
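A minimal sketch of the i-8-9-2 topology described above, assuming sigmoid activations and small random initial weights (the paper does not specify the activation function or initialization scheme):

```python
import numpy as np

def build_mlp(window_size, seed=0):
    # Weight matrices for an i-8-9-2 network (Figure 3), randomly initialized.
    rng = np.random.default_rng(seed)
    sizes = [window_size, 8, 9, 2]  # input, two hidden layers, output
    return [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes, sizes[1:])]

def forward(weights, x):
    # Apply a sigmoid activation layer by layer; output is one node per language.
    a = np.asarray(x, dtype=float)
    for w in weights:
        a = 1.0 / (1.0 + np.exp(-(a @ w)))
    return a

weights = build_mlp(window_size=25)
output = forward(weights, np.zeros(25))  # two activations: one per candidate language
```

A forward pass on any 25-dimensional input produces one activation per output node, i.e. one score per candidate language.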

Figure 2: Feature selection of the preprocessed data

Figure 3: Multi-layer perceptron (MLP) back-propagation neural network

We have identified that a static artificial neural network architecture brings some problems that must be addressed during modeling, before applying the network: once the architecture of the neural network is set, there is no flexibility to feed it data of different dimensions. We therefore constructed the training set from words captured from Persian and Arabic language documents. The training dataset is also used to test various window sizes in order to handle the dynamic length of words.

3.1. Windowing Algorithm

Figure 4: Sliding windows neural network

The idea of a windowing algorithm was successfully implemented in NETtalk [10], a parallel network that learns to read English words aloud. Two approaches are applied in this research: first, a sliding-windows neural network and, second, a non-sliding-windows neural network. The windowing concept refers to the size of the input layer of the neural network: if the window size is i units, then the input layer of the neural network also has i nodes. Various window sizes are used to evaluate their influence on the network. The sliding-windows network captures each input pattern according to the window size used by moving the window ahead one letter at a time until the end of the document, as shown in Figure 4. The non-sliding-windows network captures the input patterns by moving the whole window to the next input (Figure 5). The total number of characters in each document and the total number of tokens captured by the window are denoted j and a, respectively. For the non-sliding-windows neural network, a is given by

a = fix(j / i),

where fix rounds the quotient toward zero. For example, if j divided by i equals 2005.65, applying fix gives a = 2005, the number of input patterns. For the sliding-windows neural network, a is given by

a = j - i + 1.

Figure 5: Non-sliding windows neural network

3.2. Neural Network Training

During the training process of the neural network, the connection weights between the nodes are initialized with random values. The training dataset is presented to the network, and the connection weights are adjusted according to the error back-propagation learning rule. This process is repeated until the target mean squared error (MSE) or the maximum number of iterations is reached. The parameters of the error back-propagation neural network were set as shown in Table 2. The learning rate and momentum rate were set to 0.5 for all experiments. Consequently, the learning process of the neural network was faster, and the momentum rate was used to avoid local minima;

otherwise, training would be time consuming when the input layer dimension changed from two to fifty neurons.

4. Experimental Setting

We conducted experiments for the sliding-windows and non-sliding-windows neural networks. The 120 Arabic documents and 120 Persian documents retrieved from the internet were used for validating the design architecture of the neural network. k-fold cross validation, an accuracy estimation method for classification models [14], was used to validate the experiments. The dataset D is randomly split into k mutually exclusive subsets (the folds), D_1, D_2, ..., D_k, of approximately equal size. The cross-validation accuracy estimate E_t for fold t is the number of correct identifications (co) divided by the number of instances (a) in that fold,

E_t = co_t / a_t,

and the overall cross-validation accuracy estimate is given by

Acc_overall = (1/k) * (E_1 + E_2 + ... + E_k).

In this paper, we used 5-fold cross validation as a baseline for validating the neural network model.

5. Results and Discussion

Figure 6 shows the results obtained from the experiments. Four experiments were performed, where S-500 is a sliding-windows network trained for 500 epochs, NS-500 is a non-sliding-windows network trained for 500 epochs, and so on. Overall, the results produced by the sliding-windows network are almost comparable with those of the non-sliding-windows network. The NS-1000 experiment shows that most of its results are higher than the others. Figure 7 shows the average for each experiment, where NS-1000 achieved the highest accuracy of all the experiments. We observed that when the dataset used is smaller than in the sliding-windows experiments and 1000 epochs are used, the network weights are improved, but more training time is needed. Furthermore, if we increase the time taken for learning, the accuracy of the identification process improves.
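The 5-fold accuracy estimate used in these experiments can be sketched as follows; the fold assignment and the constant toy classifier are illustrative assumptions, not the paper's setup:

```python
import random

def kfold_accuracy(instances, labels, predict, k=5, seed=0):
    # Shuffle indices and split into k roughly equal folds, then
    # average the per-fold accuracies E_t = co_t / a_t.
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    folds = [idx[t::k] for t in range(k)]
    fold_accs = []
    for fold in folds:
        correct = sum(1 for i in fold if predict(instances[i]) == labels[i])
        fold_accs.append(correct / len(fold))
    return sum(fold_accs) / k

# Toy check: a constant classifier on a balanced two-language dataset
# should score 50% regardless of how the folds are drawn.
data = list(range(10))
labels = ["ar", "fa"] * 5
acc = kfold_accuracy(data, labels, predict=lambda x: "ar")
```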
The highest accuracy we achieved is 78.2%, with a non-sliding window of size 25 trained for 500 epochs. We assume that if significant letters exist in the language used for training, they will increase the accuracy of the network. Therefore, some unique characters will be excluded from the preprocessing steps in future investigations in order to increase classification accuracy. In addition, the majority language identified across the input patterns of a document will be investigated as a way to improve prediction of the language used in the given document.
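For reference, the two window-capture schemes of Section 3.1 can be sketched as below, with token counts matching a = fix(j / i) for the non-sliding window and a = j - i + 1 for the sliding window:

```python
def sliding_tokens(text, i):
    # Move the window ahead one character at a time: a = j - i + 1 tokens.
    return [text[p:p + i] for p in range(len(text) - i + 1)]

def non_sliding_tokens(text, i):
    # Move the whole window each step: a = fix(j / i) tokens,
    # with any trailing partial window discarded.
    return [text[p:p + i] for p in range(0, len(text) - i + 1, i)]

# For a document of j = 10 characters and a window size of i = 3:
doc = "abcdefghij"
a_sliding = len(sliding_tokens(doc, 3))          # 10 - 3 + 1 = 8
a_non_sliding = len(non_sliding_tokens(doc, 3))  # fix(10 / 3) = 3
```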

Figure 6: Classification accuracy of the different network topologies

Figure 7: Results analysis of the sliding-window neural network

In contrast, window sizes of 2-6 units cause low network performance. This may be because duplication of input patterns occurs mostly in this range. Training and testing processes are also disturbed by noise in the dataset, namely irrelevant decimal numbers in the input patterns. From the figure, we observe that the plots reach their peak in the middle, which may correspond to the optimum window size for the network. Neural networks with different topologies will be investigated to further validate the windowing concept in language identification. In addition, extended datasets covering languages such as Urdu, Turkish, and Jawi will be used to clarify the feasibility of the windowing concept in neural networks. Furthermore, other classification networks, such as hybrid neural networks or unsupervised networks, will also be implemented as baselines for comparing different techniques.

6. Conclusions

We have presented experiments on language identification of Arabic script using a sliding-windows neural network and a non-sliding-windows neural network. The analysis of the neural network

accuracy for both approaches has also been presented in this paper. The experiments showed that both approaches can be further improved for language identification. A preprocessing stage with a more sophisticated stop-word determination criterion will be implemented to filter out noisy words. The dataset will also be extended to other Arabic scripts, such as Urdu and Pashto, for further investigation. In addition, different classification techniques, such as adaptive resonance theory (ART) or other neural networks, will be tried in order to improve the performance of language identification.

Acknowledgment

This work is supported by the Ministry of Science, Technology and Innovation (MOSTI), Malaysia and the Research Management Center, Universiti Teknologi Malaysia (UTM) under Vot 78099. The authors wish to thank the reviewers for their valuable suggestions.

References

[1] Bilcu, E.B. and Astola, J.: "A Hybrid Neural Network for Language Identification from Text", Proceedings of the 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, pp. 253-258, 2006.

[2] Bilcu, E.B., Astola, J. and Saarinen, J.: "A Hybrid Neural Network/Rule-Based System for Bilingual Text-To-Phoneme Mapping", Proceedings of the 2004 14th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing, pp. 345-354, 2004.

[3] Tian, J. and Suontausta, J.: "Scalable Neural Network Based Language Identification from Written Text", ICASSP '03, Vol. 1, pp. 48-51, April 2003.

[4] Lee, Y., Papineni, K., Roukos, S., Emam, O. and Hassan, H.: "Language Model Based Arabic Word Segmentation", Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 399-406, 2003.

[5] Selamat, A. and Omatu, S.: "Web page feature selection and classification using neural networks", Information Sciences, Vol. 158, No. 1, pp. 69-88, January 2004.

[6] Li, H., Ma, B.
and Lee, C.H.: "A Vector Space Modeling Approach to Spoken Language Identification", IEEE Transactions on Audio, Speech and Language Processing, Vol. 15, No. 1, 2007.

[7] Tan, T.N.: "Written Language Recognition Based on Texture Analysis", Proc. IEEE ICIP96, Vol. 3, pp. 185-188, 1996.

[8] Tan, C.L., Leong, T.Y. and He, S.: "Language identification in multilingual documents", International Symposium on Intelligent Multimedia and Distance Education (ISIMADE'99), Baden-Baden, Germany, pp. 59-64, 2-7 August 1999.

[9] Prager, J.M.: "Linguini: Language Identification for Multilingual Documents", Proceedings of the 32nd Hawaii International Conference on System Sciences, 1999.

[10] Sejnowski, T.J. and Rosenberg, C.R.: "Parallel Networks That Learn to Pronounce English Text", Complex Systems, Vol. 1, pp. 145-168, 1987.

[11] Yang, Y.M.: "An Evaluation of Statistical Approaches to Text Categorization", Information Retrieval, Vol. 1, No. 1/2, pp. 69-90, 1999.

[12] Zhai, L.F., Siu, M.H., Yang, X. and Gish, H.: "Discriminatively Trained Language Models Using Support Vector Machines for Language Identification", IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, pp. 1-6, June 2006.

[13] Embrechts, M.J. and Arciniegas, F.: "Neural Networks for Text-to-Speech Phoneme Recognition", Proceedings of the IEEE Systems, Man and Cybernetics Conference, pp. 3582-3587, 2000.

[14] Kohavi, R.: "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection", International Joint Conference on Artificial Intelligence (IJCAI), 1995.

[15] Wikimedia Foundation, http://en.wikipedia.org/wiki/wiki, accessed May 2007.

[16] Schou, P., www.paulschou.com/xlate/tools, accessed June 2007.