Implementing Large-Scale LFG Grammar for Wolof


Cheikh Bamba Dione
Department of Linguistic
November 27, 2012
Wolof Morphology using Finite-State Techniques

Project work
Generalities on Wolof Morphology

1. Build a morphological analyzer for Wolof (spoken in Senegal, with 10 million speakers)
2. Implement a large-scale grammar using the Lexical Functional Grammar (LFG) formalism

Motivation: no NLP resources are available for Wolof.

Parallel Grammar (ParGram) project
- Aim: produce wide-coverage grammars for a variety of languages (English, German, French, Norwegian, Arabic, Urdu, Tigrinya, etc.)
- Collaboratively written grammars within the LFG framework
- Use of a commonly agreed-upon set of grammatical features

NLP development platforms:
1. Morphological analysis: Xerox finite-state tool (FST)
2. Parsing: Xerox Linguistic Environment (XLE)

Wolof FST System

Morphological analysis using the Xerox tool (fst)
1. Two-level morphology: (1) a lower, surface level and (2) an upper, lexical level
2. Input: the surface form is transformed into a lexical form (stem + morphosyntactic features)
3. Use of an intermediate level
4. The tool handles the input in both directions: analysis and generation

Example
Task: apply up fecceekuwaatoon "untied again", from fas "to tie"

  Lexical:       fas+v+base+inv+e+mpsv+iter+pst
                   | lexicon + morphotactics
  Intermediate:  fas :i :e :u :aat :oon
                   | orthographic rules
  Surface:       fecceekuwaatoon
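The two directions of lookup can be illustrated with a minimal Python sketch. It is not the Xerox tool: a single hand-listed pair, taken from the example above, stands in for the compiled network, and the function names apply_up, apply_down and split_analysis are hypothetical, chosen only to echo the "apply up" task.

```python
# Toy stand-in for a lexical transducer: one (lexical, surface) pair from the
# slide's example. A real system encodes this relation as a finite-state network.
PAIRS = [
    ("fas+v+base+inv+e+mpsv+iter+pst", "fecceekuwaatoon"),
]


def apply_up(surface):
    """Analysis: return every lexical form paired with the surface form."""
    return [lexical for lexical, surf in PAIRS if surf == surface]


def apply_down(lexical):
    """Generation: return every surface form paired with the lexical form."""
    return [surf for lex, surf in PAIRS if lex == lexical]


def split_analysis(lexical):
    """Split a lexical form into its stem and its list of feature tags."""
    stem, *tags = lexical.split("+")
    return stem, tags


if __name__ == "__main__":
    for analysis in apply_up("fecceekuwaatoon"):
        print(split_analysis(analysis))
```

Because the same table is read in both directions, analysis and generation cannot drift apart, which is the property the bidirectional tool provides at scale.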

Morphological components

The components of the Wolof FST:
1. Lexicon: contains verbal and nominal stems, ideophones and closed classes
   Statistics: common nouns (3800), proper nouns (1000), verbs (3500)
2. Morphotactics: a finite-state network encoding the legal morpheme combinations
3. Phonotactics: finite-state transducers describing the alternation rules
4. Composition of lexicon + phonotactics into a single network: the lexical transducer
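Composition, the last step in the list above, can be sketched in Python by treating each component as a finite set of string pairs. This is an assumption made for illustration: real transducers encode possibly infinite relations as networks, and the compose helper below is hypothetical, not part of any Xerox API. The strings come from the deck's fecceekuwaatoon example.

```python
def compose(first, second):
    """Relational composition: (a, c) whenever (a, b) is in first and (b, c) in second."""
    return {(a, c) for a, b in first for b2, c in second if b == b2}


# Lexical level -> intermediate level (lexicon + morphotactics).
lexicon_morphotactics = {
    ("fas+v+base+inv+e+mpsv+iter+pst", "fas :i :e :u :aat :oon"),
}

# Intermediate level -> surface level (alternation / orthographic rules).
alternation_rules = {
    ("fas :i :e :u :aat :oon", "fecceekuwaatoon"),
}

# The composed relation maps lexical forms directly to surface forms,
# so the intermediate level disappears from the final network.
lexical_transducer = compose(lexicon_morphotactics, alternation_rules)

if __name__ == "__main__":
    print(lexical_transducer)
```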

The Wolof grammar has 95.78 LFG-style rules

- Tokenization using FST (handles MWEs, clitics, etc.)
- Guessing mechanisms for unknown lexical entries
  1. First guessing strategy: used for words that are recognized by the morphological analyzer but are not in the lexicons.
  2. Second guessing strategy: used for entries that are not recognized at all.
- For modularity, transparency and performance reasons, the lexicon is divided into three lexicons:
  - A main lexicon containing open classes, which records subcategorization information.
  - A second lexicon that includes mainly closed-class items (stems for determiners, pronouns, prepositions, etc.).
  - An additional lexicon for complex predicate entries (morphological applicative, causative, medio-passive, etc.).
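The lookup order implied by the three lexicons and the two guessing strategies might look roughly like the Python sketch below. Everything in it is a placeholder: the stems (stem_a, stem_b, stem_c), the forms, the subcategorization frame and the toy analyzer are invented for illustration and are not Wolof data or XLE configuration.

```python
# Placeholder lexicons mirroring the three-way split described above.
MAIN_LEXICON = {"stem_a": {"pos": "v", "subcat": "<SUBJ,OBJ>"}}  # open classes
CLOSED_LEXICON = {"stem_b": {"pos": "det"}}                      # closed classes
COMPLEX_PRED_LEXICON = {}                                        # complex predicates


def toy_morph_analyze(token):
    """Stand-in for the morphological analyzer: returns (stem, pos) or None."""
    table = {"form_a": ("stem_a", "v"), "form_c": ("stem_c", "n")}
    return table.get(token)


def lookup(token):
    """Try the lexicons first, then fall back to the two guessing strategies."""
    analysis = toy_morph_analyze(token)
    if analysis is not None:
        stem, pos = analysis
        for lexicon in (MAIN_LEXICON, CLOSED_LEXICON, COMPLEX_PRED_LEXICON):
            if stem in lexicon:
                return {"source": "lexicon", "stem": stem, **lexicon[stem]}
        # First guess: the analyzer knows the word, the lexicons do not.
        return {"source": "guess-1", "stem": stem, "pos": pos}
    # Second guess: the word is not recognized at all.
    return {"source": "guess-2", "stem": token, "pos": "unknown"}


if __name__ == "__main__":
    for tok in ("form_a", "form_c", "zzz"):
        print(tok, lookup(tok))
```

The first guesser can still exploit the analyzer's part-of-speech tag, while the second has nothing but the raw token, which is why the two cases are kept apart.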

Robustness Techniques

Special techniques for disambiguation and for increasing robustness and coverage

- FRAGMENT: the standard grammar collects enough information in cases where an input sentence does not get a full parse. Return value:
  - well-formed chunks specified as rules in the standard grammar (e.g. NPs, PPs, Ss, etc.), or
  - the individual input tokens parsed as TOKEN chunks if no chunks are available.
- SKIMMING: makes it possible to overcome timeouts and memory problems (has been used to tackle performance problems for the English and German grammars).
- Disambiguation:
  - Optimality marks for preferences
  - Discriminant-based methods
  - Constraint Grammar (CG) rules
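The FRAGMENT fallback order can be sketched in Python as follows. This only mirrors the description above and is not how XLE implements it: full_parse and chunk_parse are hypothetical callables supplied by the caller, and the "FULL" label is an assumption.

```python
def parse_with_fragments(tokens, full_parse, chunk_parse):
    """Return a full parse if possible, else chunks, else one TOKEN chunk per token.

    full_parse(tokens)  -> a tree, or None when no full parse exists (hypothetical)
    chunk_parse(tokens) -> list of (label, material) chunks, possibly empty (hypothetical)
    """
    tree = full_parse(tokens)
    if tree is not None:
        return [("FULL", tree)]
    chunks = chunk_parse(tokens)
    if chunks:
        return chunks
    # Last resort: every input token becomes its own TOKEN chunk.
    return [("TOKEN", token) for token in tokens]


if __name__ == "__main__":
    no_parse = lambda toks: None
    no_chunks = lambda toks: []
    print(parse_with_fragments(["w1", "w2"], no_parse, no_chunks))
```

With this ordering the parser always returns something for every input, which is what raises coverage on unrestricted text.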

Data description

- Problem for automatic evaluation: no gold standard is available for Wolof.
- Possibility: manual evaluation
- The corpus is collected from stories.
- The data are randomly split into a development set and a test set.

Table: Development Corpus

  Total number of sentences                 380
  Total number of words                    3875
  Average number of words per sentence     10.0
  Sentences with fewer than 10 words        205
  Sentences with 10 to 15 words             109
  Sentences with 16 to 20 words              44
  Sentences with more than 20 words          22

Table: Test Corpus

  Total number of sentences                 150
  Total number of words                    1439
  Average number of words per sentence      9.0
  Sentences with fewer than 10 words         87
  Sentences with 10 to 15 words              41
  Sentences with 16 to 20 words              16
  Sentences with more than 20 words           6
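The rows of the two corpus tables, and the random split into development and test sets, could be computed with a short Python sketch like the one below. It assumes one sentence per string and whitespace tokenization, which may not match the tokenizer used for the deck's counts; the function names and the seed parameter are hypothetical.

```python
import random


def corpus_stats(sentences):
    """Compute the table rows: totals, average length and length bins."""
    lengths = [len(sentence.split()) for sentence in sentences]
    total_words = sum(lengths)
    return {
        "sentences": len(lengths),
        "words": total_words,
        "avg_words_per_sentence": total_words / len(lengths) if lengths else 0.0,
        "fewer_than_10": sum(n < 10 for n in lengths),
        "10_to_15": sum(10 <= n <= 15 for n in lengths),
        "16_to_20": sum(16 <= n <= 20 for n in lengths),
        "more_than_20": sum(n > 20 for n in lengths),
    }


def split_corpus(sentences, n_test, seed=0):
    """Shuffle a copy of the corpus and return (development, test)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    toy = ["a b c", "d e f g", " ".join(["w"] * 12), " ".join(["w"] * 25)]
    dev, test = split_corpus(toy, n_test=1)
    print(corpus_stats(dev))
    print(corpus_stats(test))
```

The four bins are exhaustive and non-overlapping, so their counts should always add up to the total number of sentences.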

Possible evaluation scheme: classification of errors into minor errors and serious errors.

Minor errors would include, for instance:
- PP attachment
- Scope of coordination
- Best solution is not the first solution, but is among the first 10
- Pronominal reference, etc.

Serious errors:
- Wrong phrase structure in the main clause. This happens when the system builds the wrong tree because it assigns a POS or a subcategorization frame that is wrong in the context.
- Three or more minor errors
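The decision rule of this scheme is small enough to write down directly. The Python sketch below assumes a human annotator supplies the judgments; the function name, the argument names and the "correct" label for error-free parses are assumptions not stated in the slides.

```python
def classify_parse(wrong_main_clause_structure, minor_errors):
    """Grade one parsed sentence under the minor/serious scheme.

    wrong_main_clause_structure: True if the main clause got the wrong tree
    minor_errors: list of minor error labels noted by the annotator
    """
    if wrong_main_clause_structure or len(minor_errors) >= 3:
        return "serious"
    if minor_errors:
        return "minor"
    return "correct"  # label assumed for parses with no errors


if __name__ == "__main__":
    print(classify_parse(False, ["PP attachment"]))
    print(classify_parse(False, ["PP attachment", "coordination scope", "pronominal reference"]))
```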