Intuitive Coding of the Arabic Lexicon

Ali Farghaly
SYSTRAN Software, Inc., 9333 Genesee Avenue, San Diego, CA 92121, USA
farghaly@systransoft.com

Jean Senellart
SYSTRAN S.A., 1 Rue du Cimetiere, 95230 Soisy-sous-Montmorency, France
senellart@systran.fr

Abstract

SYSTRAN started the design and development of Arabic, Farsi and Urdu to English machine translation systems in July 2002. This paper describes the methodology and implementation adopted for dictionary building and morphological analysis. SYSTRAN's IntuitiveCoding technology (ICT) facilitates the creation, update, and maintenance of Arabic, Farsi and Urdu lexical entries, and makes the process more modular and less costly. ICT for Arabic, Farsi, and Urdu requires the implementation of stem-based lexical entries, the authentic scripts for each language, a statistical Arabic stem-guesser, and separate declarative modules for internal and external morphology.

Keywords: machine translation, Arabic morphology, lexical entries, stem-based morphology, intuitive coding

1 Intuitive Coding Technology

An effective way to reduce ambiguity and improve the efficiency of NLP systems is to incorporate domain-specific dictionaries (Farghaly & Hedin, 2003). In a general machine translation system, this involves customizing the MT system to the particular corpora of a corporation. This customization is typically performed by the system developers (Senellart et al., 2003). Because domain-specific information is proprietary, the customization process is challenging: corporate customers are reluctant to share such information with MT developers, yet most customers do not have the linguistic expertise needed to perform customization in-house. SYSTRAN developed the innovative IntuitiveCoding technology (Senellart et al., 2003) to resolve this paradox. Although SYSTRAN's current development of the Arabic system was done in-house, Arabic lexicographers with linguistic expertise were not readily available.
We needed to make coding Arabic entries as intuitive as possible for increased productivity, in particular by starting with a stem-based Arabic lexicon.

2 Stem-based Arabic Lexicon

The decision to start with Arabic stems rather than roots eliminates the process of generating stems from roots. In other Arabic morphological analyzers (Beesley, 2001), the roots are entered manually, as are the morphological patterns. Such information is essential to generate Arabic stems and is complex to formalize. This decision is unique among the morphological descriptions of languages developed by SYSTRAN. The main drawback of this approach is an increased risk of typographical errors in the dictionary due to redundancy. SYSTRAN dealt with this by providing a strict coding frame for lexicographers, with a guesser and validation features. At the same time, derivation is not directly described as such: verbs and derived nominals are distinct entries in the lexicon, though a link binds them. In a complete root-based system, a complex formalization is set up and encounters a large number of lexical exceptions (for example, in the inheritance of semantic features between stems). In SYSTRAN's system, the validation process checks the consistency of the stems coded (for example, validating that the root is preserved in the different stems of a given verb). For a full discussion of the advantages of a stem-based dictionary over a root-based dictionary, see Dichy and Farghaly (2003).

Lexicographers were trained to enter stems, which are the actual words of the language. They do not need to consult grammar books to reach the underlying root and patterns. Arabic native speakers struggle at school with these patterns, known as الميزان الصرفي in Arabic grammar books. A further decision was made to eliminate the use of transliteration in the dictionary. As a result, lexicographers do not need to be trained in transliteration tables.

SYSTRAN uses an SQL database to maintain the dictionary, which automatically saves versions for future translation quality comparisons and reinforces the consistency procedure on the database. In our database an entry usually has six fields. The first three are for the lemma, the part-of-speech and the type of the part-of-speech.
Types represent subclassifications of the major parts-of-speech. For example, nouns have five types: common noun, proper noun, verbal noun, present participle and past participle. There are two types of adjectives: base and comparative. There are several types of verbs, such as plain, aux_modal, aux_neg, etc. There is only one type of preposition. The last three fields are for the morphological, syntactic and semantic information. There is also a field for notes in which the lexicographer may insert comments regarding the entry.

Coding the linguistic information in the monolingual dictionary is a two-step process. The first step is completed when the first three fields (see Figure 1) are entered. Then, a morphological guesser is run to fill in the morphological field. The second step begins when lexicographers review the suggestions made by the guesser. If they agree with the forms generated by the guesser, no additional work needs to be done. If they disagree, they make corrections and fill in the syntactic and semantic fields. In section 3, we show how the morphological guesser works.
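The entry layout and the two-step coding process described above can be sketched as a small data structure. This is a minimal illustration, not SYSTRAN's actual schema: the names LexicalEntry, code_entry, toy_guesser and all field names are assumptions made for the example, and the guesser is passed in as a plain function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class LexicalEntry:
    # First three fields, entered by the lexicographer in step one.
    lemma: str
    pos: str        # e.g. "verb", "noun", "adj", "prep"
    pos_type: str   # e.g. "plain", "common", "proper", "base"
    # Last three fields: morphological, syntactic and semantic information.
    morpho: Dict[str, str] = field(default_factory=dict)
    syntax: Dict[str, str] = field(default_factory=dict)
    semantics: Dict[str, str] = field(default_factory=dict)
    # Free-text comments from the lexicographer.
    notes: str = ""


def code_entry(
    lemma: str,
    pos: str,
    pos_type: str,
    guesser: Callable[[LexicalEntry], Dict[str, str]],
    corrections: Optional[Dict[str, str]] = None,
) -> LexicalEntry:
    """Step one: enter three fields and run the guesser.
    Step two: apply the lexicographer's corrections, if any."""
    entry = LexicalEntry(lemma, pos, pos_type)
    entry.morpho = dict(guesser(entry))
    if corrections:
        entry.morpho.update(corrections)
    return entry


def toy_guesser(entry: LexicalEntry) -> Dict[str, str]:
    # Hypothetical stand-in for the real guesser: it only prefixes the lemma.
    return {"perfect": entry.lemma, "imperfect": "ي" + entry.lemma}


# The guesser's suggestion for the hollow verb is wrong, so the
# lexicographer overrides it in step two.
qaala = code_entry("قال", "verb", "plain", toy_guesser,
                   corrections={"imperfect": "يقول"})
```

Keeping the corrections as an optional override mirrors the text: when the lexicographer agrees with the guesser, nothing more needs to be supplied.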

Lemma      Part-of-speech   Type
قال        verb             plain
ليس        verb             aux_neg
مهندس      noun             common
بغداد      noun             proper
من         prep             plain
قليل       adj              base

Figure (1) The first three fields of the dictionary

3 Statistical Arabic stem-guesser

Entering the morphological information proved to be very time-consuming. Several stems have to be entered for each entry, because we chose to enrich the lexicon rather than rely on morphological tables. Figure (2) shows the morphological information of the verb زرع "to plant".

[perfect=زرع]
[imperfect=يزرع]
[imperative=إزرع]
[passperf=زرع]
[passimperf=يزرع]

Figure (2) The Morpho field of زرع

The morphological field of nouns and adjectives contains forms for the singular masculine, singular feminine, plural masculine and plural feminine. In order to save time and reduce costs, a guesser was designed to automatically generate the different stems of each category. Only the rules that apply to the largest number of forms in a given category are used. Even though the lexicographers are aware that the guesser over-generalizes, they are 60% more productive. Figure (3) shows how the guesser over-generalizes and produces wrong forms that need to be corrected by the lexicographers.

[perfect=قال]
[imperfect=يقال]
[imperative=إقال]
[passperf=قال]
[passimperf=يقال]

Figure (3) The output of the guesser showing over-generalization

Generating forms with inflections for gender, number and person is performed by the internal morphology module presented in the following section.

4 Internal Morphology

SYSTRAN has two different modules for Arabic morphology: an internal module and an external module. The internal module generates all the different inflectional patterns of a given stem. The input to the internal morphology module includes the stems in the morphological field. The rules are very simple. They go through the monolingual dictionary and retrieve the lemma, part-of-speech and the type.
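Since the internal module takes its input from the morphological field, it helps to see how that field might first be populated. The following is a minimal sketch of a majority-rule guesser in the spirit of section 3, not the paper's implementation: it assumes each rule is simply "prefix + perfect stem", learns the most frequent prefix per slot from already-coded verbs, and works on undiacritized strings as in Figures (2) and (3). The names SLOTS, CODED_VERBS, learn_rules and guess are hypothetical, and the two-verb training set is a toy.

```python
from collections import Counter
from typing import Dict, List

SLOTS = ("imperfect", "imperative", "passperf", "passimperf")

# Toy set of already-validated entries: زرع as in Figure (2), plus the
# corrected forms of the hollow verb قال.
CODED_VERBS: List[Dict[str, str]] = [
    {"perfect": "زرع", "imperfect": "يزرع", "imperative": "إزرع",
     "passperf": "زرع", "passimperf": "يزرع"},
    {"perfect": "قال", "imperfect": "يقول", "imperative": "قل",
     "passperf": "قيل", "passimperf": "يقال"},
]


def learn_rules(coded_verbs: List[Dict[str, str]]) -> Dict[str, str]:
    """For each slot, keep only the prefix rule that covers the most verbs."""
    counts = {slot: Counter() for slot in SLOTS}
    for entry in coded_verbs:
        perfect = entry["perfect"]
        for slot in SLOTS:
            form = entry[slot]
            # Record the rule only if the form is "prefix + perfect stem".
            if form.endswith(perfect):
                counts[slot][form[: len(form) - len(perfect)]] += 1
    return {slot: c.most_common(1)[0][0] for slot, c in counts.items() if c}


def guess(perfect: str, rules: Dict[str, str]) -> Dict[str, str]:
    """Fill the morphological field by applying the majority rules blindly."""
    morpho = {"perfect": perfect}
    for slot, prefix in rules.items():
        morpho[slot] = prefix + perfect
    return morpho


rules = learn_rules(CODED_VERBS)
suggestion = guess("قال", rules)  # over-generalizes, as in Figure (3)
```

Because irregular forms such as يقول never match the "prefix + stem" shape, they contribute no rule, and the regular pattern wins. Applied to قال, the sketch reproduces the over-generalized forms of Figure (3), which the lexicographer would then correct.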
Next, the rules obtain the stem from the morphological field, identify the type of stem, and generate the correct inflected forms with tags that represent the morphological properties of each form. Figure (4) shows the output for the inflected form كتبن "they wrote".

كتبن    كتب    verb    plain    +past+fem+3p+plural

Figure (4) The output of the Internal Morphology Module

The internal morphology module in fact generates an inflected dictionary. As displayed in Figure (4), it generates the inflected forms exactly as they may appear in authentic Arabic texts. This module also provides the lemma, part-of-speech, type, gender, person and number tags, since this information must be made available to the other modules in the MT system. This inflected dictionary is compiled into a runtime dictionary using finite-state automata technology. The internal morphology module is thus simplified, because many of the complex processes used to generate the verb stems are handled in the dictionary. For example, there is no need to design rules to generate the imperative forms, which vary from one verb to another, because such forms are already accounted for in the dictionary. However, rules for hollow and weak verbs, dual noun forms, regular plurals, deletion, epenthesis, etc. have been implemented in the internal morphology module.

5 External Morphology

It is very common in Arabic for words, or more accurately tokens, to exhibit the structure of a whole sentence. For example, the Arabic token شاهدناهم is translated as "We saw them." This token must therefore be decomposed into a verb, a subject and a direct object. Moreover, the token may also take a conjunction attached as a prefix. The external morphology module is the component that decomposes a token into its different parts-of-speech. The crucial difference between the two modules is that the internal morphology works at the paradigmatic level, whereas the external morphology works at the syntagmatic level. Figure (5) illustrates the function of each.

External (conjunction): و, ف
Internal (verb): شاهدنا, يشاهد, أشاهد, سيشاهد
External (pronominal suffix): هم, ني, ها, هن

Figure (5) The function of the Internal and External Morphology Modules

The different inflections in the Internal column represent variations of the verb with respect to tense, number and person. The members of the set under Internal all belong to the class of verbs, whereas the set across which the external morphology decomposes represents members belonging to different word classes: the first is a conjunction, the second is a verb and the third is a pronominal suffix. In the implementation, the external morphology module precedes the internal morphology because it feeds the lookup procedure. If decomposition is not done correctly, the lookup procedure will not match words that actually exist in the dictionary. A complex word may decompose in more than one way, yielding multiple parses. This is where disambiguation rules play an important role. External morphology rules, written intuitively as combination patterns, are shown in Figure (6).
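How decomposition feeds the lookup can be sketched as follows. This is an illustrative approximation under stated assumptions, not the SYSTRAN rule engine: the clitic inventories are those listed in Figure (5), while the names PROCLITICS, ENCLITICS, INFLECTED and decompose, the category labels, and the tiny inflected dictionary are hypothetical. The entries وجد and جد are added here only to show an ambiguous token and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Clitics handled by the external morphology (inventory from Figure 5).
PROCLITICS: Dict[str, str] = {"و": "CONJ", "ف": "CONJ"}
ENCLITICS: Dict[str, str] = {"هم": "PRON", "ني": "PRON", "ها": "PRON", "هن": "PRON"}

# Toy stand-in for the inflected dictionary produced by the internal module.
INFLECTED: Dict[str, str] = {
    "شاهدنا": "verb",
    "وجد": "verb",   # illustrative entry, not from the paper
    "جد": "noun",    # illustrative entry, not from the paper
}

Parse = List[Tuple[str, str]]


def decompose(token: str) -> List[Parse]:
    """Return every split (proclitic? + core + enclitic?) whose core
    is found by the dictionary lookup."""
    parses: List[Parse] = []
    prefixes = [""] + [p for p in PROCLITICS if token.startswith(p)]
    suffixes = [""] + [s for s in ENCLITICS if token.endswith(s)]
    for pre in prefixes:
        for suf in suffixes:
            core = token[len(pre): len(token) - len(suf)]
            if core in INFLECTED:
                parse: Parse = []
                if pre:
                    parse.append((pre, PROCLITICS[pre]))
                parse.append((core, INFLECTED[core]))
                if suf:
                    parse.append((suf, ENCLITICS[suf]))
                parses.append(parse)
    return parses


single = decompose("وشاهدناهم")   # conjunction + verb + pronominal suffix
ambiguous = decompose("وجد")      # whole word, or conjunction + noun
```

A token with more than one surviving parse, such as the second example, is exactly the case that the disambiguation rules mentioned above would have to resolve.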

WAFA:= < CONJ.و CONJ.ف >
KABILI:= < PREP.ك PREP.ب PREP.ل >
LI:= < PREP.ل >
# al+noun/det/adj/numeric
{WAFA}?_{AL}_<NOUN:-PROPERNOUN ADJ DET:QUANTIFIER NUMERIC:CARDINAL>
# noun/adj-suffix
{WAFA}?_{NOUNADJ}_<PRON:PERSPOSS>
{WAFA}?_{KABILI}_{NOUNADJ}_<PRON:PERSPOSS>

Figure (6) Sample of external morphology rules

Conclusion

SYSTRAN's Arabic-English machine translation system contains a dictionary of over 30,000 single-word entries. Terminology coverage of Arabic newspapers and Internet materials reaches over 90%. The system currently provides adequate gisting-level translation quality. We are developing a compound dictionary with a third level of morphology, a compound morphology module. Analysis and synthesis rules are being added to improve the quality of the translation beyond the gisting level. Our approach has greatly accelerated the development of this system. Development of the lexical database, the syntactic module and the quality assurance process is ongoing.

References

Beesley, Kenneth, 2001. Finite-State Morphological Analysis and Generation of Arabic at Xerox Research: Status and Plans in 2001. ACL, Arabic NLP Workshop, Toulouse.

Dichy, Joseph and Ali Farghaly, 2003. Root & Pattern vs. Stem: On what grounds should a multilingual database centered on Arabic be built? To be presented at the IX Machine Translation Summit, New Orleans.

Farghaly, Ali and Bruce Hedin, 2003. Domain Analysis and Representation. In Handbook for Language Engineers, ed. Ali Farghaly, CSLI Publications, Stanford, California.

Senellart et al., 2003. SYSTRAN Intuitive Coding Technology. To be presented at the IX Machine Translation Summit, New Orleans.