Modeling full form lexica for Arabic

Susanne Alt, Amine Akrout, Atilf-CNRS; Laurent Romary, Loria-CNRS

Objectives
  Presentation of the current standardization activity in the domain of lexical data modeling
  Validation of the proposed standard on Arabic
  Contribution to the establishment of a reference resource for Arabic

Overview
  Background
    Why do we need full form lexica?
  Standards
    Lexical resources & dictionaries
  Instantiation
    Specificities of an Arabic full form lexicon
  Overall goal: making current work interoperable

Two views on lexical data
  Extensional representation
    exhaustive list of observables
    set of inflected word forms
    set of syntactic constructions
  Intensional representation
    factorization of regular behaviour ("grammar")
    lemma + inflection rules
    deep syntactic representation + transformation rules
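The relation between the two views can be sketched in a few lines of Python: an intensional description (a stem plus inflection rules) is expanded into the extensional list of full forms. The stem `chant`, the rule table and the function name `expand` are illustrative assumptions, not part of the presentation; the toy paradigm is the regular French present tense.

```python
from typing import Dict, List, Tuple

# Hypothetical intensional description: suffix rules for a regular paradigm.
# Each rule pairs a suffix with the inflectional features it realizes.
Rule = Tuple[str, Dict[str, str]]

PRESENT_ER_RULES: List[Rule] = [
    ("e", {"person": "first", "number": "singular"}),
    ("es", {"person": "second", "number": "singular"}),
    ("e", {"person": "third", "number": "singular"}),
    ("ons", {"person": "first", "number": "plural"}),
    ("ez", {"person": "second", "number": "plural"}),
    ("ent", {"person": "third", "number": "plural"}),
]


def expand(stem: str, rules: List[Rule]) -> List[Dict[str, str]]:
    """Turn stem + rules (intensional) into a list of full forms (extensional)."""
    return [{"form": stem + suffix, **features} for suffix, features in rules]


full_forms = expand("chant", PRESENT_ER_RULES)
for entry in full_forms:
    print(entry["form"], entry["person"], entry["number"])
```

A full form lexicon stores the output of such an expansion directly, which is what allows it to record exceptions (variants, defective paradigms) that a rule table would not predict.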

Full form lexica: advantages
  Local linguistic information
    local inflectional variants (courbattu vs. courbaturé)
    defective paradigms (*nous pleuvons)
    phonological variants (les [le]/[lez])
  Testimony of inflected forms
    token frequency with respect to a reference corpus
  Exchange of lexical resources
    no consensus on encoding format for grammar rules
    pivot format for merging and comparing lexicons
  Extensional data for data recognition purposes

Standards for NLP lexica
  Forefathers
    a wide range of international projects: MULTEXT, EAGLES, ISLE/MILE, Parole...
  XML encoding of print dictionaries
    "Print dictionary" chapter of the TEI
    http://www.tei-c.org
  Terminology
    Sense-to-word oriented
    Terminology Markup Framework (ISO 16642)

Lexical Markup Framework
  Future ISO standard 24613
    ISO technical committee TC 37/SC 4, Language Resource Management
    http://www.tc37sc4.org
    http://lirics.loria.fr
  Project leaders
    Monte George (USA) & Gil Francopoulo (FR)
  First applications
    Morphalou (Salmon-Alt et alii, 2004)

LMF: Basic principles
  An open platform for specifying lexical data
    implemented prototypes: Lexus, Syntax
  Main modeling principles
    metamodel
      basic building blocks and basic structural constraints
      e.g. "A lexical database is made of lexical entries."
    data categories
      basic linguistic descriptors
      e.g. "grammatical gender", "synonymof",...
      stored in a shared data category registry

LMF core metamodel
  Lexical Database
    Global Information
    Lexical Entry
      Form
      Sense

Data categories
  Independent from the hierarchical structure of the data model
    /partofspeech/, /grammaticalnumber/, /grammaticalcase/
  Characteristics
    complex vs. simple
      /grammaticalnumber/ => /singular/, /plural/
    relational data categories
      /synonymof/, /toinflectionalparadigm/
    generic vs. language specific
      /grammaticalnumber/ => {/singular/, /plural/, /dual/}

Documentation and localization
  Entry Identifier: /grammaticalgender/
  Profile: Morpho-syntax
  Definition: Grammatical genders are classes of nouns reflected in the behavior of associated words
  Explanation: Grammatical gender is distinguished from natural gender by the fact that grammatical gender requires agreement between nouns and the forms of modifiers...
  Source: Charles F. Hockett, A Course in Modern Linguistics, Macmillan, 1958.
  Range: {/masculine/, /feminine/, /neuter/, /common/}
  Object Language: fr
    Name: genre
    Range: {/masculine/, /feminine/}
  Object Language: en
    Name: gender, grammatical gender
    Range: {}
  Object Language: de
    Name: Genus, Geschlecht
    Range: {/masculine/, /feminine/, /neuter/}
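The registry entry for /grammaticalgender/ can be read as a generic value range narrowed per object language. The sketch below models that idea; the class name `DataCategory`, its methods, and in particular the choice to fall back to the generic range when a language lists an empty range (as for `en` on the slide) are assumptions made for illustration, not behaviour defined by the presentation.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass
class DataCategory:
    """Hypothetical model of one data category registry entry."""
    identifier: str
    generic_range: FrozenSet[str]
    localized_ranges: Dict[str, FrozenSet[str]] = field(default_factory=dict)

    def allowed_values(self, language: str) -> FrozenSet[str]:
        # Assumption: an empty or missing language-specific range
        # falls back to the generic range.
        local = self.localized_ranges.get(language)
        return local if local else self.generic_range

    def validate(self, value: str, language: str) -> bool:
        return value in self.allowed_values(language)


# Values copied from the /grammaticalgender/ entry.
gender = DataCategory(
    identifier="grammaticalgender",
    generic_range=frozenset({"masculine", "feminine", "neuter", "common"}),
    localized_ranges={
        "fr": frozenset({"masculine", "feminine"}),
        "en": frozenset(),
        "de": frozenset({"masculine", "feminine", "neuter"}),
    },
)

print(gender.validate("neuter", "fr"))  # False: French range has no neuter
print(gender.validate("neuter", "de"))  # True
```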

Lexicon specification
  Lexical Database
    Global Information
    Lexical Entry: /grammaticalcategory/
      Form
      Sense

GMT (Generic Mapping Tool)

  <struct type="lexicaldatabase">
    <struct type="globalinformation">...</struct>
    <struct type="lexicalentry">
      <feat type="grammaticalcategory">...</feat>
      <struct type="form">...</struct>
      <struct type="sense">...</struct>
      <struct type="sense">...</struct>
      ...
    </struct>
    <struct type="lexicalentry">...</struct>
    ...
  </struct>

User-specific XML format

  <lexicaldatabase>
    <globalinformation>...</globalinformation>
    <lexicalentry POS=... >
      <form>...</form>
      <sense>...</sense>
    </lexicalentry>
    <lexicalentry POS=... >
      <form>...</form>
      <sense>...</sense>
    </lexicalentry>
    ...
  </lexicaldatabase>
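The step from the generic GMT serialization to a user-specific format amounts to renaming: each `struct` becomes an element named after its `type`, and each `feat` becomes an attribute. A minimal sketch with Python's standard `xml.etree.ElementTree` follows. The function `to_user_format`, the `ATTRIBUTE_NAMES` table and the sample value `verb` are assumptions for illustration; the slides elide the actual values and do not describe how the mapping is implemented.

```python
import xml.etree.ElementTree as ET

# Small GMT sample; the value "verb" is an assumed placeholder.
GMT_SAMPLE = """
<struct type="lexicaldatabase">
  <struct type="globalinformation"/>
  <struct type="lexicalentry">
    <feat type="grammaticalcategory">verb</feat>
    <struct type="form"/>
    <struct type="sense"/>
  </struct>
</struct>
"""

# Hypothetical renaming table: data category name -> user attribute name.
ATTRIBUTE_NAMES = {"grammaticalcategory": "POS"}


def to_user_format(node: ET.Element) -> ET.Element:
    """Map a GMT <struct> to an element named after its type attribute."""
    element = ET.Element(node.get("type"))
    for child in node:
        if child.tag == "feat":
            category = child.get("type")
            name = ATTRIBUTE_NAMES.get(category, category)
            element.set(name, (child.text or "").strip())
        elif child.tag == "struct":
            element.append(to_user_format(child))
    return element


user_root = to_user_format(ET.fromstring(GMT_SAMPLE.strip()))
print(ET.tostring(user_root, encoding="unicode"))
```

Because the generic form keeps structure (`struct`) and data categories (`feat`) apart, two lexica with different surface formats can in principle be compared after mapping both back to it.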

Applying LMF to Arabic
  Little representation of Arabic-speaking countries in ISO/TC 37/SC 4
  NLP of Arabic morphology
    Beesley K., 2001; Buckwalter, 2002; Cavalli-Sforza et alii, 2000; Maamouri & Bies, 2004; Tahir et alii, 2004
  Yet no widely and freely accessible, cumulative lexicon is available to boost research on the Arabic language
  Strategy: combining efforts through standardization

French vs. Arabic full form lexica
  French lexicography
    semasiological + alphabetical perspective
  (Traditional) Arabic perspective
    mixed + root based
    grouping of all derivatives of a consonantal pattern
    ktb (notion of writing) => kataba (to write), kattaba (cause to write), maktabun (desk), maktabatun (library), kitâbun (book)
  Therefore
    distinction between human readability and machine processing
    essential to keep reference to the root

Adapting LMF to Arabic (I)
  Specifying the notion of "lexical entry"
    alphabetically ordered
    characterized by POS, keyform, reference to the root
  Lexical Database
    Global Information
    Lexical Entry: /grammaticalcategory/, /keyform/, /root/

Adapting LMF to Arabic (II)
  Specifying the notion of "inflected form"
    a word form and inflectional features
    form related & inflection related data categories
  Inflected Form
    /orthography/, /pronunciation/
    /grammaticalgender/, /grammaticalnumber/, /grammaticalcase/, /grammaticaldefiniteness/
    /grammaticalaspect/, /grammaticalvoice/, /grammaticalmood/, /grammaticalperson/

Adapting LMF to Arabic (III)
  Form related data categories
    orthography and pronunciation
    both are subject to refinements ("local metadata")
    transliteration: fully reversible one-to-one mapping to original orthography
      Buckwalter transliteration
    transcription: devised to render (morpho)phonology
      IPA
  Inflected Form
    /orthography/ => /transliteration/
    /pronunciation/ => /transcription/
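What "fully reversible one-to-one mapping" means in practice can be checked mechanically: the mapping table must be invertible, and a round trip must return the original string. The sketch below uses only a small subset of the Buckwalter scheme (eight characters), chosen to cover the fully vowelled word kataba; the function names are assumptions, and a real implementation would need the complete table.

```python
# Subset of the Buckwalter transliteration, for illustration only.
BUCKWALTER_SUBSET = {
    "\u0643": "k",  # kaf
    "\u062A": "t",  # ta
    "\u0628": "b",  # ba
    "\u0627": "A",  # alif
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
    "\u0652": "o",  # sukun
}

# Inverting the table only works if no two Arabic characters share a Latin
# symbol; this check is what "one-to-one" requires.
REVERSE = {latin: arabic for arabic, latin in BUCKWALTER_SUBSET.items()}
assert len(REVERSE) == len(BUCKWALTER_SUBSET)


def transliterate(text: str) -> str:
    """Arabic script -> Latin symbols; raises KeyError outside the subset."""
    return "".join(BUCKWALTER_SUBSET[ch] for ch in text)


def detransliterate(text: str) -> str:
    """Latin symbols -> Arabic script; the exact inverse of transliterate."""
    return "".join(REVERSE[ch] for ch in text)


word = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, fully vowelled
latin = transliterate(word)
print(latin)  # kataba
print(detransliterate(latin) == word)  # True
```

A transcription such as IPA carries no such guarantee, which is why the slide keeps /transliteration/ under /orthography/ and /transcription/ under /pronunciation/.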

Adapting LMF to Arabic (IV)
  Some questions on inflection related data categories
  Nouns
    /grammaticalgender/ => /masculine/, /feminine/
      not lexicalized (because of gender change in plural forms)
      choice of no "underspecified" gender
    /grammaticalnumber/ => /singular/, /plural/, /dual/
      (enter) and/or pick up /dual/ from the data category registry (DCR)
    /grammaticalcase/ => /nominative/, /accusative/, /prepositional/
      terminology (prepositional, indirect, possessive or genitive)?
    /definiteness/ => /definite/, /indefinite/
      one or two categories of definiteness (def. alkitâbu, pos. kitâbî)?
      inflection vs composition (e.g. prepositional affixes)?

The fully specified model
  Lexical Database
    Global Information
    Lexical Entry: /grammaticalcategory/, /keyform/, /gloss/, /root/
      Word Form Set
        Inflected Form
          Form: /orthography/, /pronunciation/
          Inflection: /grammaticalgender/, /grammaticalnumber/, /grammaticalcase/, /grammaticaldefiniteness/, /grammaticalaspect/, /grammaticalvoice/, /grammaticalmood/, /grammaticalperson/

XML example

  <lexicalentry keyform="kataba" grammaticalcategory="verb" root="ktb" gloss="écrire">
    <wordformset>
      <inflectedform>
        <form>
          <orthography code="akrout_2005">katabtu</orthography>
        </form>
        <inflection>
          <grammaticalaspect>perfect</grammaticalaspect>
          <grammaticalgender>masculine</grammaticalgender>
          <grammaticalperson>firstperson</grammaticalperson>
          <grammaticalnumber>singular</grammaticalnumber>
          <grammaticalvoice>active</grammaticalvoice>
        </inflection>
      </inflectedform>
      ...
      <inflectedform>
        <form>
          <orthography code="akrout_2005">taktubâ</orthography>
        </form>
        <inflection>
          <grammaticalaspect>imperfect</grammaticalaspect>
          <grammaticalgender>masculine</grammaticalgender>
          <grammaticalperson>secondperson</grammaticalperson>
          <grammaticalnumber>dual</grammaticalnumber>
          <grammaticalmood>subjunctive</grammaticalmood>
        </inflection>
      </inflectedform>
      ...
    </wordformset>
  </lexicalentry>
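One way to see what such a full form entry offers is to query it by inflectional features. The sketch below parses a well-formed variant of the kataba entry (assuming each orthography element is closed by a matching tag, and omitting the elided forms) and returns the orthographies whose features match. The helper `find_forms` is hypothetical and not part of any tool named in the slides.

```python
import xml.etree.ElementTree as ET
from typing import List

ENTRY_XML = """
<lexicalentry keyform="kataba" grammaticalcategory="verb" root="ktb" gloss="écrire">
  <wordformset>
    <inflectedform>
      <form><orthography code="akrout_2005">katabtu</orthography></form>
      <inflection>
        <grammaticalaspect>perfect</grammaticalaspect>
        <grammaticalgender>masculine</grammaticalgender>
        <grammaticalperson>firstperson</grammaticalperson>
        <grammaticalnumber>singular</grammaticalnumber>
        <grammaticalvoice>active</grammaticalvoice>
      </inflection>
    </inflectedform>
    <inflectedform>
      <form><orthography code="akrout_2005">taktubâ</orthography></form>
      <inflection>
        <grammaticalaspect>imperfect</grammaticalaspect>
        <grammaticalgender>masculine</grammaticalgender>
        <grammaticalperson>secondperson</grammaticalperson>
        <grammaticalnumber>dual</grammaticalnumber>
        <grammaticalmood>subjunctive</grammaticalmood>
      </inflection>
    </inflectedform>
  </wordformset>
</lexicalentry>
"""


def find_forms(entry: ET.Element, **features: str) -> List[str]:
    """Return orthographies of inflected forms matching all given features."""
    matches = []
    for inflected in entry.iter("inflectedform"):
        inflection = inflected.find("inflection")
        if inflection is None:
            continue
        if all(inflection.findtext(name) == value
               for name, value in features.items()):
            matches.append(inflected.findtext("form/orthography"))
    return matches


entry = ET.fromstring(ENTRY_XML.strip())
print(entry.get("root"), find_forms(entry, grammaticalnumber="dual"))
```

A feature that is absent from a form's inflection block (for example /grammaticalmood/ on the perfect form) simply fails to match, so underspecified forms are never returned by accident.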

Towards a reference lexicon for Arabic: issues
  Interoperability
    Comparison of proprietary specifications
  Coverage
    Completion of specific advances (dialectal, terminological, phonology)
  Accessibility
    Common query interface, wide (free?) distribution
  Maintenance
    Common rules to ensure editorial evenness
    Documentation & user manuals
  A step towards an intensional representation