Statistical Patterns of Diacritized and Undiacritized Yorùbá Texts


Asubiaro, Toluwase
E. Latunde Odeku Medical Library, College of Medicine, University of Ibadan, Nigeria
toluwaase@gmail.com

International Journal of Computational Linguistics Research, Volume 6, Number 3, September 2015

ABSTRACT: Standard Yorùbá orthography makes heavy use of diacritics for tone marking and for representing characters that are beyond the ANSI scope. The diacritics are not applied in many Yorùbá documents because specialized, language-dependent input devices for the language are rarely available. This study therefore examines the statistical implications of this inconsistency in diacritic use in electronic Yorùbá documents for the distribution of words in the two versions of its texts. Yorùbá texts were modeled with Zipf's and Heaps' laws over n-grams (n = 1, 2 and 3), using a corpus of 1,089,318 words in a diacritically marked version and a diacritically unmarked version. The Zipf graphs of the two corpora exhibited no significant difference. The Heaps graphs of the diacritized and undiacritized texts, on the other hand, deviated significantly from each other. This shows that the use of diacritics significantly affects the distribution of single words in the language, but that the effect diminishes in the distribution of co-occurrences of two or more words.

Keywords: Zipf's Law, Heaps' Law, Yorùbá Language, Diacritics, Statistical Language Model, Word Distribution

Received: 22 May 2015; Revised: 19 June 2015; Accepted: 28 June 2015

© 2015 DLINE. All Rights Reserved

1. Introduction

Diacritics include the sub-dots and tone marks which are appended to base American National Standards Institute (ANSI) characters. Diacritics are appended to base characters to represent speech sounds that are beyond the scope of the conventional ANSI codes for writing, which are based on the Latin encoding system. Diacritics thus extend the functionality of the base characters: new characters are formed by appending diacritic mark(s) to a base character. For instance, when a sub-dot is appended to the character s, the new character ṣ is formed. In languages like Yorùbá, tonality is represented with tone marks, the high tone (´) and the low tone (`), which are applied to the vowels and the syllabic nasal consonant. Yorùbá also caters for speech sounds that are not represented among the 26 letters of the Latin alphabet from which it inherited its writing system; these characters are ẹ, ọ and ṣ. Like Yorùbá, some other African and European languages such as Hausa, Igbo, French, German, Italian and Finnish use diacritics on some base characters. In some of these languages the diacritics carry morphological information; in others they do not.
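The mapping from diacritized characters back to their base characters can be illustrated with Unicode canonical decomposition: NFD splits each character into a base letter plus combining marks, so dropping the combining marks yields the undiacritized text. A minimal Python sketch using the standard unicodedata module (an illustration, not the tool used in this study):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Map diacritized characters to their undiacritized base characters.

    NFD decomposition separates each character into a base character plus
    combining marks (tone accents, sub-dots); dropping the combining marks
    and recomposing leaves only the base characters.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

# Tone-marked and sub-dotted characters all collapse to their plain bases.
print(strip_diacritics("ọjọ́ òjò"))  # -> ojo ojo
```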

In Yorùbá, German and Finnish, for instance, the use of diacritics provides morphological and lexical information. The Yorùbá words derived by appending diacritical marks to the base characters ojo, for example, are distinct words, each with a meaning that differs from the others derived from the same base characters. Italian and French also use diacritics, but there the diacritics bear little morphological or lexical information. When texts of languages that use diacritics heavily are normalized, that is, when the diacritics are removed or are not appended where required, information in such texts is lost or distorted. The statistical properties of the texts may also be affected. The four variants of ojo, which appear and are counted as four distinct words in properly diacritized Yorùbá text, all appear as the single word ojo if the text is left unmarked. It can therefore be hypothesized that the statistical properties of the two versions of a language's orthography could differ. Statistical properties of written texts have been observed to follow some universal regularities. These regularities are studied in statistical language modeling (SLM), which attempts to understand human languages through observation of the regularities. SLM is an attempt to capture and compute a probability distribution over word or character sequences in natural languages, such that well-formed sequences are given a higher likelihood than ill-formed ones [1], [2]. SLM studies have informed the development of language technologies. Among the universal regularities observed in SLM, statistical properties of written texts such as the distribution of word frequencies and the growth of the vocabulary of distinct words have been modeled by Zipf's and Heaps' laws.
Heaps' law is a power law stating that the number of distinct words, or vocabulary, of a given language grows slowly with the size of its documents. For a collection n of written or spoken texts of a language, let T be the number of tokens in the collection and V(n) the estimated number of distinct words; then the relation V(n) = K·T^α holds, where K is a constant and 0 < α < 1. Heaps' law thus predicts the vocabulary size of the texts of a given language from the size of a text [3]. Another law that models human languages, and has co-existed with Heaps' law in such studies, is Zipf's law. It states that in a sample of written text or spoken speech of a given language, a few very high-frequency words account for the larger proportion of the text. There is an approximate mathematical relation between the frequency of occurrence f of each distinct word and its rank r when the distinct words are ordered by decreasing frequency: f ∝ 1/r^α. Empirically, when all the words used in a text are ordered by decreasing frequency, the relationship between each distinct word's frequency and its rank is an inverse power law with an exponent close to 1 [4], [5]. Many human languages, mostly of Indo-European origin, have been found to conform to Zipf's and Heaps' laws [6], [7], [8], [9], [10], [11], although the coefficients of the laws depend on the language [10]. Moreover, [12], [13] found that randomly generated texts and index terms also obey these laws. On the other hand, Zipf's law has been found not to hold for raw (unsegmented) text of Asian languages such as Chinese, Korean and Japanese [14], though it holds for a word-segmented corpus of Chinese [15].
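Both laws describe quantities that are easy to compute from a token stream: the ranked frequency list behind Zipf's law and the vocabulary growth curve V(T) behind Heaps' law. A minimal sketch with a hypothetical toy token list:

```python
from collections import Counter

def zipf_rank_frequency(tokens):
    """Frequencies of the distinct words, ordered by decreasing frequency.

    Under Zipf's law the frequency at rank r is roughly proportional to 1/r^alpha.
    """
    return sorted(Counter(tokens).values(), reverse=True)

def heaps_curve(tokens):
    """Vocabulary size V(T) after each of the first T tokens.

    Under Heaps' law V(T) ~ K * T^alpha with 0 < alpha < 1.
    """
    seen, curve = set(), []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve

tokens = "ni ti ni won ti ni o ni ti won".split()
print(zipf_rank_frequency(tokens))  # [4, 3, 2, 1]
print(heaps_curve(tokens))          # [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
```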
Zipf's and Heaps' laws are both power laws and have been found to be theoretically and empirically related [16], [17], [18], [19], [20].

2. Yorùbá Language and Its Orthography

Yorùbá is spoken by over 30 million people in different parts of the world; its native name is ede Yorùbá. The native speakers of Yorùbá occupy the southwestern part of Nigeria, part of southern Benin Republic and southern Togo. There are traces of the use of the language as a language of worship in the Santería religion, where it is called Lucumí or Nago, in Argentina, Cuba, Puerto Rico and the Dominican Republic. There are also reported traces of the use of the language by some natives of Sierra Leone, where it is called Oku [21], [22]. Standard Yorùbá orthography demands heavy use of diacritically marked characters (sub-dots and tone marks). Diacritics are used for marking tone and for representing speech sounds that are beyond the range of the basic American National Standards Institute (ANSI) characters, or standard Latin encoding system. It should be noted that conventional computer keyboards are based on the ANSI convention; the characters used in Yorùbá orthography that are beyond the ANSI scope therefore do not appear on conventional keyboards.

However, due to the dearth of specialized, Yorùbá language-dependent input devices that could adequately cater for these diacritically marked characters, they are mostly represented electronically by their equivalent unmarked ANSI characters; the base characters of the diacritically marked characters are also their unmarked equivalents. For instance, the diacritized variants of e (such as é, è, ẹ, ẹ́ and ẹ̀) are all represented by their unmarked equivalent e. These practices are either partial, where the diacritics are correctly applied only on chosen words, or total. A previous study [10] showed that SLM laws such as Heaps' law are language dependent. In essence, this study proposes the hypothesis that the Heaps behavior of a language is also orthography dependent. There are two versions of the orthography of the Yorùbá language: the standard and the sub-standard. The standard orthography requires heavy use of diacritics for tone marking and for representation of characters beyond the ANSI set, while the sub-standard version does not append the diacritics (in other words, characters with diacritics are normalized). Most computer-encoded Yorùbá texts fall into the sub-standard orthography category.

3. Methodology

The word lists of n-grams (unigrams, bigrams and trigrams) were obtained for the two corpora and ranked in decreasing order of frequency of occurrence. For the Zipf graphs, the logarithm of frequency (f) was plotted against the logarithm of rank (r). For the Heaps graphs, V(n) was estimated as the number of distinct or unique words in each collection and T as the number of tokens in the collection, and the values of V(n) were plotted against the values of T. A text corpus that is representative, orthographically accurate and sufficiently large is essential in linguistics and language processing studies, and Yorùbá lacks such a corpus for linguistic experiments.
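The ranked n-gram word lists used in the methodology can be obtained with a sliding window over the token stream; a short illustrative sketch (the example tokens are hypothetical):

```python
from collections import Counter

def ngram_frequencies(tokens, n):
    """Ranked frequency list of word n-grams (n=1 unigrams, n=2 bigrams, ...)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams).most_common()

tokens = "awon eniyan ni awon eniyan".split()
print(ngram_frequencies(tokens, 2)[0])  # (('awon', 'eniyan'), 2)
```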
The first step was gathering a data set acceptable in quantity and quality for the study. Texts were collected online and offline; the sources of the data collected are displayed in Table 1. 12% of the texts used for this study were news articles collected online, which is consistent with TREC's methodology of using news articles for corpus development. A corpus of 1,089,318 words was used for the study. Texts that were originally not appended with diacritics were automatically diacritized to obtain a diacritically marked version, and the diacritics were removed from the originally diacritized texts to obtain a diacritically unmarked version. In this paper, the diacritically marked and unmarked texts are referred to as the diacritized and undiacritized texts respectively.

Table 1. Sources of data collected

Source                                                 | No. of Articles | Corpus Size | State
Alaroye (Yorùbá weekly newspaper published online)     | 782             | 676,634     | Originally undiacritized
Published Yorùbá novels (collected offline)            | 4               | 165,553     | Originally diacritized
Academic projects written in Yorùbá (collected online) | 10              | 203,416     | Originally undiacritized
Yorùbá Online (Yorùbá online news, collected online)   | 49              | 43,715      | Originally undiacritized
Total                                                  | 845             | 1,089,318   |

4. Results and Discussion

Table 2 shows the rank-frequency distribution of the ten most frequent words in the diacritized and undiacritized Yorùbá texts. The table shows how the word frequencies of Yorùbá texts are affected by the use or non-use of diacritics.

Table 2. Word frequency of the most frequent unigrams in the diacritized and undiacritized Yorùbá texts

Rank | Diacritized Term | Frequency | Undiacritized Term | Frequency
1    | tí               | 35818     | ti                 | 45063
2    | ni               | 34794     | ni                 | 42659
3    | …                | 27353     | won                | 33258
4    | …                | 24913     | o                  | 25043
5    | …                | 21904     | awon               | 24953
6    | ó                | 21028     | si                 | 23903
7    | pé               | 20349     | n                  | 22439
8    | tó               | 19748     | pe                 | 21310
9    | kò               | 19167     | ko                 | 21222
10   | náà              | 16736     | to                 | 19947

Zipf's Law

The Zipf graphs of the unigrams, bigrams and trigrams of the diacritized and undiacritized Yorùbá texts are presented in Figures 1a, 1b and 1c respectively. The three graphs show that the diacritized and undiacritized texts converge over most regions of the Zipf graphs; on the Zipf graph, the diacritized and undiacritized Yorùbá texts are therefore not significantly different. This is further supported by the R² values of the straight lines fitted to the Zipf curves. The R² values for the unigrams of the diacritized and undiacritized texts are 0.98 and 0.97 respectively; for the bigrams, 0.95 and 0.94 respectively; and for the trigrams, 0.84 and 0.85 respectively.

Figure 1a. Zipf graph for unigrams

Figure 1b. Zipf graph for bigrams
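The R² values above come from fitting a straight line to the Zipf curve in log-log space. A stdlib-only sketch of such a least-squares fit, shown on an exactly Zipfian sample; the same regression applied to (log T, log V(n)) points would estimate a Heaps exponent:

```python
import math

def loglog_fit(ranks, freqs):
    """Least-squares line through (log r, log f); returns (slope, intercept, R^2)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# An exactly Zipfian sample (f = 1000 / r) lies on the log-log line perfectly,
# so the slope is -1 and R^2 is 1.
freqs = [1000 / r for r in range(1, 51)]
slope, _, r2 = loglog_fit(range(1, 51), freqs)
print(round(slope, 3), round(r2, 3))  # -1.0 1.0
```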

Figure 1c. Zipf graph for trigrams

Heaps' Law

The Heaps graphs for the unigrams, bigrams and trigrams of the diacritized and undiacritized texts are presented in Figures 2a, 2b and 2c. The Heaps curves of the diacritized and undiacritized texts on all three graphs drift apart from the origin. However, the differences exhibited by the curves reduce as n increases, and the graphs become more linear. The Heaps exponent also increases with n, with the undiacritized texts having the higher exponents. Heaps exponents are expected to be close to 1; the trigrams have the highest exponent, 0.88, for both the diacritized and undiacritized texts, while the unigrams have the lowest exponents, 0.72 and 0.77 for the diacritized and undiacritized texts respectively. The trigrams therefore exhibit the Heaps properties more strongly than the bigrams and unigrams. This study shows that diacritics significantly affect word distribution in Yorùbá texts, and that the difference reduces as the number of co-occurring words (n) under consideration increases.

Figure 2a. Heaps graph for unigrams

5. Conclusion and Recommendation

Zipf's and Heaps' laws are popular laws used in natural language processing for modeling languages. Heaps' law explains the characteristics of a language in relation to the growth of its vocabulary as the size of its texts increases, and both laws expose hidden natural regularities in statistical models. The Heaps exponent of a language is a unique, language-dependent value and a distinguishing factor between languages; hence the model of a language based on Heaps' law should portray the uniqueness of that language. The behavior of the two versions of the Yorùbá texts examined here is a statistical account which suggests that diacritics are a special feature that can affect the model or behavior of the language in language modeling.
The Zipf model presented in this study, however, offers a dissenting view, as it does not explicate any difference between the diacritized and undiacritized texts.

Figure 2b. Heaps graph for bigrams

Figure 2c. Heaps graph for trigrams

As suggested future research work, the explanations for the different behaviors exhibited by the diacritized and undiacritized texts could be explored. This study further suggests that Yorùbá corpora for NLP studies must be consistent in diacritic usage if the language is to be modeled accurately. A body of Yorùbá texts that is only partly diacritized will yield an invalid statistical model and a miscued behavior of the language, and ultimately wrong results for any NLP study. Furthermore, results of NLP studies on the undiacritized version of the language's texts cannot be extended to its diacritized version; for instance, [23] created stopword lists for both the diacritized and undiacritized versions of the same corpus. This research work shows that Heaps' law depends on the consistency of use of a language's orthography, and that the dependency reduces as n increases, while the effect is not exhibited on the Zipf graph.

References

[1] Rosenfeld, R. (2000). Two decades of statistical language modeling: Where do we go from here? School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
[2] Xu, P., Karakos, D., Khudanpur, S. (2009). Self-supervised discriminative training of statistical language models.
[3] Heaps, H. S. (1978). Information Retrieval: Computational and Theoretical Aspects. Orlando, FL, USA: Academic Press.
[4] Zipf, G. (1936). The Psychobiology of Language. London: Routledge.
[5] Zipf, G. (1949). Human Behavior and the Principle of Least Effort. Oxford, England: Addison-Wesley Press.
[6] Shamilov, Yolacan. (2006). Statistical structure of printed Turkish, English, German, French, Russian and Spanish. In: Proceedings of the 9th WSEAS International Conference on Applied Mathematics, Istanbul, Turkey, 638-644.
[7] Géza, N., Csaba, Z. (2007). Multilingual statistical text analysis, Zipf's law and Hungarian speech generation. Department of Telecommunications & Telematics, Budapest University of Technology and Economics, Hungary.
[8] Manaris, B., Pellicoro, L., Pothering, G., Hodges, H. (2006). Investigating Esperanto's statistical proportions relative to other languages using neural networks and Zipf's law. In: Proceedings of the 2006 IASTED International Conference on Artificial Intelligence and Applications (AIA 2006), February 13-16, 2006, Innsbruck, Austria.
[9] Damian, H., Marcelo, A. (2008). Dynamics of text generation with realistic Zipf distribution. Consejo Nacional de Investigaciones Científicas y Técnicas, Centro Atómico Bariloche and Instituto Balseiro, 8400 San Carlos de Bariloche, Río Negro, Argentina.
[10] Alexander, G., Grigori, S. (2001). Zipf and Heaps laws' coefficients depend on language. In: Conference on Intelligent Text Processing and Computational Linguistics, February 18-24, 2001, Mexico City (Lecture Notes in Computer Science), 332-335.
[11] Bochkarev, V. V., Lerner, E. Y., Shevlyakova, A. V. (2014). Deviations in the Zipf and Heaps laws in natural languages. Journal of Physics: Conference Series, 490.
[12] Wentian, L. (1992). Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Trans. Inf. Theory, 38 (6), 1842-1845.
[13] Asubiaro, T. (2011). An Analysis of the Structure of Index Terms for Yorùbá Texts. Master's degree project, University of Ibadan, Africa Regional Centre for Information Science.
[14] Lu, L., Zhang, Z. K., Zhou, T. (2013). Deviation of Zipf's and Heaps' laws in human languages with limited dictionary sizes. Sci. Rep., 3, 1-9.
[15] Xiao, H. (2008). On the applicability of Zipf's law in Chinese word frequency distribution. J. Chin. Lang. Comput., 18 (1), 33-46.
[16] Font-Clos, F., Boleda, G., Corral, Á. (2013). A scaling law beyond Zipf's law and its relation to Heaps' law. J. Phys., 15.
[17] Van Leijenhorst, D. C., Van der Weide, T. P. (2005). A formal derivation of Heaps' law. Inf. Sci., 170, 263-272.
[18] Petersen, A. M., Tenenbaum, J. N., Havlin, S., Stanley, E., Perc, M. (2012). Languages cool as they expand: Allometric scaling and the decreasing need for new words. Sci. Rep., 2.
[19] Eliazar, I. I., Cohen, M. H. (2012). Power-law connections: From Zipf to Heaps and beyond. Ann. Phys., 332, 56-74.
[20] Eliazar, I. (2011). The growth statistics of Zipfian ensembles: Beyond Heaps' law. Phys. Stat. Mech. Its Appl., 390 (20), 3189-3203.
[21] Adesola, O. (2005). Yorùbá: A Grammar Sketch, Version 1.0. Rutgers University, USA.

[22] Akilimali, F. (2008). Keyboard to help save Yorùbá and other endangered African languages.
[23] Asubiaro, T. (2013). Entropy-based generic stopwords list for Yorùbá texts. Int. J. Comput. Inf. Technol., 2 (5), 1065-1068.

Biography

Toluwase Asubiaro works as an academic librarian in the Systems Unit of the E. Latunde Odeku Medical Library, College of Medicine, University of Ibadan, Nigeria. His research interests are information retrieval, statistical language modelling, informetrics, and information systems and technology use. He holds a B.Sc. in Mathematics and a Master's degree in Information Science.