Analysis of Primary School Arabic Language Textbooks


B. Belkhouche (1), H. Harmain (1), H. Al Taha (2), L. Al Najjar (2), S. Tibi (3)
(1) Faculty of Information Technology, (2) Faculty of Humanities, (3) Faculty of Education
U.A.E. University, Al-Ain, U.A.E.
{b.belkhouche, harmain, Huda.salem, lalnajja, tibi}@uaeu.ac.ae

Abstract

This paper reports on a preliminary analysis of a corpus consisting of Arabic language textbooks used in primary schools. The input to our process is raw text extracted from the Arabic textbooks used in the Emirates curriculum for grades 1 through 6. Various aspects, including word and root frequencies, parts-of-speech distribution, phonology, and themes, are investigated. A comparison of the parts-of-speech distributions of the UAE grade 1 and Libya grade 1 textbooks is performed. Our analysis raises several issues on the criteria for selecting a word list that is appropriate for enriching the vocabulary of school children.

1. Introduction

Language learning has been greatly facilitated by the availability of electronic resources. In particular, electronic dictionaries and concordances have been instrumental in promoting language learning and vocabulary building [5]. However, electronic resources present some methodological challenges, not only for Arabic but even for languages that have been investigated for decades [10]. For example, the Japanese Electronic Dictionary [9] took over eight years to build in its initial version and involved several organizations. Efforts to build electronic resources for the Arabic language have mostly amounted to the electronic duplication of hard copies. Even though a great deal of research and implementation work has been done for many Indo-European languages, most of the results cannot be easily applied to Arabic. This can be attributed to the differences in language structure and the complexity of Arabic morphology. However, some of the proposed work and knowledge representation schemes (e.g. FrameNet, WordNet, Treebank) provide an abstract conceptual basis that is language-independent [8] and can thus be used effectively in the Arabic language context.

Our primary concern in this paper is to shed some light on the issues of building an Arabic e-learning dictionary, with the main goal of sketching a methodology for building Arabic learning resources. Such a task presents numerous and unique challenges, mainly because of the complex morphology of the Arabic language and the question of which vocabulary is appropriate. Current printed Arabic dictionaries do not lend themselves to a straightforward categorization. Users of such dictionaries face great difficulties searching for words, because they are expected to master a substantial set of morphological derivation rules. For the young learner, this may be not just a challenging requirement, but a deterrent.

We can argue that the need for an Arabic electronic dictionary has become more urgent than ever before. The available Arabic electronic dictionaries are no more than electronic versions of the printed dictionaries, with the only advantage of providing an interface to search for words quickly. These dictionaries do not help in searching for inflected forms that do not need to be in the dictionary as separate entries (e.g. if you search for Taktub (تكتب) and the dictionary entry is Yaktub (يكتب), your search will yield no results). This is just one example of the shortcomings of machine-readable Arabic dictionaries.

The Arabic electronic learner's dictionary we propose should at least satisfy the following requirements:

1. Target primary grade school students.
2. Immerse the children in an interactive and exploratory mode.
3. Accommodate the development of word meanings to reflect current language usage (take, for example, the meaning of [...]).
4. Provide more grammatical information that will enrich the dictionary and help the learners develop their linguistic skills, by including information such as cases, dual, plural, gender, and broken plurals.
5. Display the word meaning in an age-appropriate way and include relevant examples with illustrations.
6. Capture the relations of the word with the root and its derivatives.
7. Contain a morphological module to help search for words in their plain format and return all possible forms of a word.
8. Support cross-referencing and navigation.

The rest of the paper is organized as follows: Section 2 discusses our concerns about methodological issues. Section 3 describes our text processing tools. Section 4 gives details of our corpus. Results are discussed in Section 5 and conclusions are given in Section 6.

2. Methodological Issues

The use of grade 1-6 language arts textbooks as a corpus to identify the vocabulary for an Arabic e-dictionary raises several methodological issues. The first difficulty, a practical one, is the availability of these resources in electronic form. A fundamental question is whether these textbooks can be considered canonical, that is, do they offer a rich and representative enough corpus from which to identify our vocabulary word list? This has implications for the methodology used to develop textbooks, if indeed a methodology exists within and among Arab countries. Moreover, as Tables 1 and 2 reveal, these textbooks cannot be analyzed as a homogeneous group, as the data they contain do not show similar patterns in terms of sizes, frequencies, and word similarities.

The traditional approach for building word lists relies on large corpora. The selection of words is then based on their frequencies. A major assumption of this approach is that these corpora are highly representative of word usage. For many reasons, diglossia among them, this assumption does not hold in the Arab world context. Moreover, recent research has shown that raw word frequencies may not be good measures for building word lists [1, 3]. Thus, rather than analyzing the corpora in a quantitative fashion, our approach is to concentrate on a comparative analysis of the distributions of some general linguistic features, such as parts-of-speech, themes, and sounds.

3. Text Processing Tool

Figure 1 shows a snapshot from AlKhaleel, a corpus processing tool specifically designed to analyze Arabic texts [6]. AlKhaleel accepts texts in different formats and can be used to perform simple text analysis, such as calculating word frequencies and concordances, or more sophisticated analysis using Natural Language Processing techniques for morphological analysis. AlKhaleel is a modular system designed to integrate several language analysis tools to support the study and analysis of the Arabic language. It is a menu-driven Windows application implemented in Visual C#. The main modules of the system are discussed in this section.

3.1 Corpus Loader

The Corpus Loader is used to load the text document(s) into the system. AlKhaleel accepts texts in different formats including HTML, XML, plain text, and MS Word documents. The corpus can be stored in one or multiple files. Two encodings are currently supported for Arabic texts, UTF-8 and MS Windows CP-1256, the two most widely used Arabic text encodings. The Corpus Loader extracts Arabic words from the loaded texts and builds an internal data structure. The data structure consists of a unique word index, the word in its surface form, the word without short vowels, and the file name and line number in which the word appeared. Punctuation marks and HTML and XML tags are ignored.

3.2 Word Frequency Analyzer

The Word Frequency Analyzer reads the data structure built by the Corpus Loader and builds a word frequency list. This list consists of the word, its frequency, and a set of pointers to all its occurrences in the text. The word frequency list is stored in a data store and displayed to the user in a grid format as shown in Fig. 1. The frequency list can be sorted alphabetically or by word frequency.

3.3 Concordancer

We use a simple Keyword in Context (KWIC) technique to build concordances. The concordances are displayed right-to-left and can be exported to an MS Excel file. The context of each keyword is given in its original format, with all diacritics and punctuation marks included. The user can display a larger context of the keyword, with several lines before and after it.

3.4 Dictionary Lookup

AlKhaleel includes an Arabic electronic dictionary. This is our own implementation of Al-Waseet Dictionary [9].
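Before the look-up procedure itself, the loading, counting, and concordancing steps of Sections 3.1 to 3.3 can be illustrated with a minimal Python sketch. It is not AlKhaleel's C# code: the function names, the record layout, the Unicode ranges, and the choice to count words in their unvowelled form are all assumptions made for illustration, and the context rows here keep diacritics but not punctuation.

```python
import re

# Assumed ranges: Arabic letters U+0621-U+064A plus short vowels and related
# marks U+064B-U+0652. Anything else (punctuation, digits, tags) is skipped.
ARABIC_WORD = re.compile(r"[\u0621-\u0652]+")
DIACRITICS = re.compile(r"[\u064B-\u0652]")


def load_corpus(lines, file_name="sample.txt"):
    """Return one record per token: index, surface form, unvowelled form, file, line."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        for surface in ARABIC_WORD.findall(line):
            records.append({
                "index": len(records),
                "surface": surface,
                "plain": DIACRITICS.sub("", surface),
                "file": file_name,
                "line": line_no,
            })
    return records


def word_frequencies(records):
    """Map each unvowelled word to (frequency, indexes of its occurrences)."""
    occurrences = {}
    for rec in records:
        occurrences.setdefault(rec["plain"], []).append(rec["index"])
    return {word: (len(idx), idx) for word, idx in occurrences.items()}


def kwic(records, keyword, window=2):
    """Keyword-in-context rows: (left context, keyword as written, right context)."""
    rows = []
    for i, rec in enumerate(records):
        if rec["plain"] == keyword:
            left = " ".join(r["surface"] for r in records[max(0, i - window):i])
            right = " ".join(r["surface"] for r in records[i + 1:i + 1 + window])
            rows.append((left, rec["surface"], right))
    return rows


# Invented two-line sample, not taken from the textbooks.
sample = ["كَتَبَ الولدُ الدرسَ.", "قرأ الولد الدرس"]
records = load_corpus(sample)
freqs = word_frequencies(records)
concordance = kwic(records, "الولد", window=1)
```

Keeping the occurrence indexes alongside each count mirrors the "set of pointers" described in Section 3.2, so a concordance can be built from the frequency list without rescanning the text.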
We have implemented a simple look-up algorithm to search for the word in the dictionary. First, the given word is searched for in the keyword entries of the dictionary. If it is not found, we search for it in the sub-entry forms. If it is still not found, we perform some morphological analysis by deleting any prefixes and/or suffixes, and search again.

Figure 1 AlKhaleel

3.5 Morphological Analyzer

The Morphological Analyzer is a re-implementation of Buckwalter's morphological analyzer [12]. We used the same data dictionaries given in Buckwalter Analyzer v.1.0 and re-implemented the algorithms in C#. The results of the analysis are presented in a table which includes the English translation of the word (see Fig. 4). This can be of great help to learners of Arabic as a second language.

3.6 Output Generator

To better support its users (teachers, learners, and researchers), AlKhaleel allows all of its results to be exported to MS Excel files. These Excel files can be used for printing handouts or for performing extra statistical analysis. For the purpose of this paper we used the Word Frequency Analyzer, the Morphological Analyzer, and the Output Generator.

4. The Corpus

When selecting a corpus for use in any linguistic task, two main issues must be taken into consideration: the corpus size and its representativeness. Unfortunately, for the Arabic language there are no available corpora like the British National Corpus or the Bank of English [2]; therefore we must rely on other sources to build a new corpus. For the purpose of our project we started with a set of grade school Arabic textbooks. This set consists of textbooks for grades 1 to 6 of the UAE and Libya. Tables 1 and 2 show the total number of tokens and words. Tables 3 and 4 show the words with the highest frequencies.

Table 1 Corpus Data (Libya)
Grade    Size (tokens)   Word Count
G1       1814            748
G2       4528            1769
G3       6959            2776
G4       7472            3137
G5       11609           4483
G6       18958           7450
G1-G6    51340           21363

Table 2 Corpus Data (UAE)
Grade    Size (tokens)   Word Count
G1       11281           3393
G2       9944            3564
G3       23943           6701
G4       25204           7341
G5       38113           9100
G6       39042           11384
G1-G6    147527          25505

Table 3 Top Word Frequencies (Libya) [Arabic word forms lost in extraction; frequencies only]
Freq     Freq
1274     246
1199     235
596      236
473      235
461      226
443      202
440      201
263      188
255      177

5. Analysis of Results

The first major issue to confront is finding an appropriate methodology for building an Arabic dictionary for grammar school children. Studies in other languages may give some hints, but their assumptions are not transferable to the Arabic context. Fundamental studies related to Arabic dictionaries for children are unknown to us. The second major issue is the notion of a fundamental vocabulary and its extent. Determining reliable sources and filters to extract such a vocabulary is not only complex; the question may not have a positive answer. Thus, defining a suitable word list becomes a challenging task.

The intent of the analysis is to quantify and categorize the various aspects of the use of the vocabulary in grammar school in order to build a dictionary that responds to the needs of the pupils. Based on the available material from public schools in the UAE, we performed various statistical analyses, of which we present an overview for the grade 1 data. Our analysis includes: (1) parts-of-speech distribution; (2) verb frequency; (3) theme frequency; and (4) sound frequency.

5.1 Parts-of-Speech Distribution

Major grammatical categories (parts of speech, POS) were identified. Figure 2 shows the distribution of the different parts of speech, categorized as noun, verb, particle, and number. It shows a substantial use of nouns, whereas verbs and particles constitute half of the usage of words. The use of adjectives seems to be very limited. The number category may not be relevant, since the textbooks deal with language arts. In the absence of sound studies on vocabulary requirements, or at least a suitable vocabulary, it is difficult to provide an interpretation of this distribution.
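A parts-of-speech distribution of this kind reduces to counting tags and converting the counts to percentage shares. The Python sketch below shows one way to do that; the paper does not describe how its figures were computed, so the function name, the tag labels, and the toy input are assumptions, and in practice the tags would come from a morphological analyzer.

```python
from collections import Counter


def pos_distribution(tagged_words):
    """Return the percentage share of each part-of-speech tag, largest first."""
    counts = Counter(tag for _word, tag in tagged_words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: round(100.0 * n / total, 1) for tag, n in counts.most_common()}


# Invented toy input; the words are placeholders, not corpus data.
tagged = [("w1", "noun"), ("w2", "noun"), ("w3", "verb"), ("w4", "particle")]
shares = pos_distribution(tagged)
```

Working with shares rather than raw counts is what makes corpora of different sizes, such as the UAE and Libya grade 1 textbooks, comparable.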

Figure 2 POS Distribution

5.2 Verb and Verb Root Frequencies

Since roots play a major role in Arabic, we present here frequencies of verbs in word and root forms. Figure 3 summarizes the frequency of verbs within the corpus. It indicates that instances of verbs do not appear repeatedly. Indeed, it shows that 77% of the verbs are used only once and 14% are used only twice.

Figure 3 Verb Frequency

Reducing the verbs to their roots (Figure 4) results in an increase of instance repetition. Still, the highest frequency (8 repetitions) constitutes only 22% of the roots. In any case, one would question the impact of such low frequencies on vocabulary development. A raw frequency of 100 is suggested in [11].

Figure 4 Root Frequency

Figure 5 shows the distribution of sounds within verbs. An issue that we were trying to elucidate is whether difficult sounds (e.g., letters such as [...]) would be used infrequently. The analysis according to the major categories does not support this assumption. A further analysis of the letter frequencies confirms the distribution in Figure 5. That is, there is no pattern differentiating easy and difficult sounds.

Figure 5 Sound Distribution

Figure 6 shows the distribution of verbs according to themes as defined in Roget's Thesaurus. Over 90% of the verbs are action verbs. Even though excessively imbalanced, the bias towards action verbs seems appropriate for the age group under consideration.

Figure 6 Theme Distribution

To further our analysis, we began an investigation of grades 1 to 6 textbooks from Libya. For comparison purposes, we followed the same pattern used for the UAE textbooks. Figure 7 shows the POS distribution.

Figure 7 POS Distribution (Libya)

We also compared the POS distribution in an English dictionary for children (1000 words), the UAE grade 1 language textbooks, and the Libyan grade 1 textbook. Figure 8 summarizes our comparison.

Figure 8 POS Comparison

6. Conclusion

Our results shed some light on the data and the tasks needed to develop a grammar school dictionary. The use of grade 1-6 language arts textbooks as a corpus to identify the vocabulary for an Arabic e-dictionary raised several methodological issues. A fundamental question is whether these textbooks can be considered canonical, that is, do they offer a rich and representative enough corpus from which to identify our vocabulary word list? This has implications for the methodology used to develop textbooks, if indeed a methodology exists. Noting that a grammar school dictionary is not just another general dictionary, an underlying question is how to elaborate the distributions of parts-of-speech, themes, and sounds. The methodology used in [7] provides some solid guidelines on how to extend our work.

References

[1] Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17, 814-823.
[2] British National Corpus. http://www.natcorp.ox.ac.uk/.
[3] Brysbaert, M. and New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41 (4), 977-990.
[4] Buckwalter, T. (2004). Issues in Arabic orthography and morphological analysis. Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages.
[5] Cobb, T. (2003). Do corpus-based electronic dictionaries replace concordancers? In B. Morrison, G. Green, & G. Motteram (Eds.), Directions in CALL: Experience, experiments, evaluation, pp. 179-206.
[6] Harmain, H. (2010). ALKHALEEL: A Corpus-Based Learning Tool for Arabic. EduLearn 10, Barcelona, Spain, 5-7 July 2010.
[7] Lété, B., Sprenger-Charolles, L., and Colé, P. (2004). MANULEX: A grade-level lexical database from French elementary school readers. Behavior Research Methods, Instruments, & Computers, 36 (1), pp. 156-166.
[8] Niles, I. and Pease, A. (2001). Towards a Standard Upper Ontology. Proceedings of the Second International Conference on Formal Ontology in Information Systems (FOIS-2001).
[9] Takebayashi, Y. (1993). EDR Electronic Dictionary. MT Summit IV, Kobe, Japan, pp. 117-126.
[10] Teubert, W. (2004). The Corpus Approach to Lexicography. Lexicographica 20, pp. 1-19.
[11] Verlinde, S. and Selva, T. (200?). Corpus-based versus intuition-based lexicography: defining a word list for a French learners dictionary. Modern Language Institute, K.U. Leuven (Belgium), pp. 594-598.
[12] Buckwalter, T. (2002). Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium, catalog number LDC2002L49, ISBN 1-58563-257-0. http://www.ldc.upenn.edu/catalog/catalogentry.jsp?catalogid=ldc2002l49