Computational Morphology: Introduction

Aarne Ranta
European Masters Course, Malta, March 2011

Objective
Implement a morphology module for some language, comprising
  - an inflection engine
  - a morphological lexicon
Make this into a reusable resource, i.e.
  - usable for various linguistic processing tasks
  - available free and open-source

What is needed
Theoretical knowledge of morphology
  - speaker's intuition
  - grammar book
Programming skills
  - mastery of appropriate tools
  - design and problem solving

What languages will be addressed
Any languages of your choice; you can work in groups, too.
Addressed in the lectures (in more detail): English, Italian, Finnish, Arabic.

What tools will be used
Principal tool: GF, Grammatical Framework.
Also introduced: XFST, the Xerox Finite-State Tool.
These tools can co-operate!

The GF Resource Grammar Project
Morphology and syntax for natural languages. Currently covering
  Afrikaans, Amharic, Arabic, Bulgarian+, Catalan, Danish, Dutch, English+, Finnish+, French, German, Hindi, Italian, Latin, Norwegian, Polish, Punjabi, Romanian, Russian, Spanish, Swedish+, Turkish+, Urdu
where + = with large lexicon.
We mainly expect lexica for the other languages, and inflection engines for languages outside the list.

How much work it is
  - Basic inflection engine: 1 week
  - Complete inflection engine: up to 8 weeks
  - Lexicon: 1 to 8 weeks
All this depends on the language and on the available resources.

Contents of these lectures
  - Overview of concepts and tools
  - Getting started with GF
  - Designing a simple inflection engine: English
  - Morphology-syntax interface
  - Richer inflection engine with traditional paradigms: Latin
  - Complex morphology with phonological processes: Finnish
  - Nonconcatenative morphology: Arabic
  - Building a morphological lexicon
  - Algorithms and tools: analysis vs. synthesis, GF vs. XFST

Overview of concepts and tools

Plan
  - What morphology is
  - Morphological processing tasks
  - Finite state transducers and other formats
  - Hockett's three models
  - Not morphology: POS tagging, tokenization, stemming

Morphology
Theory of forms (Gr. morphe)
  - of plants and animals (biology)
  - of words (linguistics)
In linguistics, morphology lies between phonology and syntax.
Examples of morphological questions:
  - What is the past tense of English drink?
  - What word form in Latin is amavissent?
  - How are past tenses of verbs formed in Swedish?
  - Do Greek nouns have dual forms?
  - In what ways can causative verbs be formed in Finnish?

Morphological processing
Analysis: given a word (string), find its form description.
Synthesis: given a form description, find the resulting string.
Example of words and form descriptions in English:
    play  - play +N +Sg +Nom
            play +V +Inf
    plays - play +N +Pl +Nom
            play +V +IndPres3sg
Description = lemma followed by tags.
Both analysis and synthesis can give many results.

Morphology, mathematically
Between words $W$ and their form descriptions $D$ in a language, the morphology is defined by a relation $M$,
    $M \in \mathcal{P}(W \times D)$, i.e. $M \subseteq W \times D$.
A morphological analyser is a function $f : W \to \mathcal{P}(D)$ such that $d \in f(w)$ iff $(w,d) \in M$.
A morphological synthesizer is a function $g : D \to \mathcal{P}(W)$ such that $w \in g(d)$ iff $(w,d) \in M$.
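The definitions above can be made concrete with a small Python sketch (Python is used here only for illustration; it is not the course tool). The relation M is a finite set of pairs taken from the play/plays example, and the helper names `make_analyser` and `make_synthesizer` are hypothetical.

```python
from collections import defaultdict

# Toy relation M: (word, description) pairs from the play/plays example.
M = {
    ("play", "play +N +Sg +Nom"),
    ("play", "play +V +Inf"),
    ("plays", "play +N +Pl +Nom"),
    ("plays", "play +V +IndPres3sg"),
}

def make_analyser(relation):
    """Build f : W -> P(D), with d in f(w) iff (w, d) in the relation."""
    index = defaultdict(set)
    for word, desc in relation:
        index[word].add(desc)
    return lambda word: set(index.get(word, set()))

def make_synthesizer(relation):
    """Build g : D -> P(W), with w in g(d) iff (w, d) in the relation."""
    index = defaultdict(set)
    for word, desc in relation:
        index[desc].add(word)
    return lambda desc: set(index.get(desc, set()))

f = make_analyser(M)
g = make_synthesizer(M)
```

Both functions return sets, which reflects the remark that analysis and synthesis can each give several results (or none, for an unknown input).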

Finite-state morphology
A common assumption in computational morphology: M is a regular relation. This implies:
  - M can be defined using a regular expression
  - word-description pairs in M can be recognized by a finite-state automaton, a transducer
In most systems of computational morphology, M is moreover finite:
  - the language has a finite number of words
  - each word has a finite number of forms
A finite morphology M is trivially a regular relation.
We'll return to finite-state descriptions later.

Other formats for a finite morphology
Full-form lexicon: list of all words with their descriptions
    play   - play +N +Sg +Nom
             play +V +Inf
    plays  - play +N +Pl +Nom
             play +V +IndPres3sg
    player - player +N +Sg +Nom
Morphological lexicon: list of all lemmas and all their forms
    play N: play, plays, play's, plays'
    play V: play, plays, played, played, playing
    player N: player, players, player's, players'
The forms come in a canonical order, so that it is easy to restore the full description attached to each form.
It is easy to transform a morphological lexicon to a full-form lexicon.

Analysing with a full-form lexicon
It is easy to compile a full-form lexicon into a trie - a prefix tree.
A trie has transitions for each symbol, and it can return a value (or several values) at any point:

                       ' - s(3)          ' - s(12)
                      /                 /
    p - l - a - y(1,5) ---- e - r(10) - s(11) - '(13)
                      \
                       s(2,6) - '(4)

N.B. a trie is also a special case of a finite automaton - an acyclic deterministic finite automaton.
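A trie-based analyser can be sketched in a few lines of Python. This is an illustrative sketch under simple assumptions: the class `TrieNode` and the functions `build_trie` and `analyse` are hypothetical names, and the values stored at a node are the full description strings.

```python
class TrieNode:
    """One state of the trie: outgoing transitions plus attached values."""
    def __init__(self):
        self.children = {}   # symbol -> TrieNode
        self.values = []     # form descriptions for the word ending here

def build_trie(full_form):
    """Compile a full-form lexicon (word -> descriptions) into a trie."""
    root = TrieNode()
    for word, descriptions in full_form.items():
        node = root
        for symbol in word:
            node = node.children.setdefault(symbol, TrieNode())
        node.values.extend(descriptions)
    return root

def analyse(root, word):
    """Follow one transition per symbol; return the values at the end."""
    node = root
    for symbol in word:
        node = node.children.get(symbol)
        if node is None:
            return []        # no such path: unknown word
    return list(node.values)

FULL_FORM = {
    "play": ["play +N +Sg +Nom", "play +V +Inf"],
    "plays": ["play +N +Pl +Nom", "play +V +IndPres3sg"],
    "player": ["player +N +Sg +Nom"],
}
TRIE = build_trie(FULL_FORM)
```

Lookup time depends only on the length of the word, not on the size of the lexicon, and a prefix such as "pla" yields no analysis because no value is attached to that node.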

Three models of morphological description
From Hockett, "Two models of grammatical description" (Word, 1954):
  - item and arrangement: inflection is concatenation of morphemes (stem + affixes).
        dog +Pl --> dog s --> dogs
  - item and process: inflection is application of rules to the stem (one rule per feature).
        baby +Pl --> baby(y -> ie / _ s) s --> babie s --> babies
  - word and paradigm: inflection is association of a model inflection table to a stem.
        {Sg:fly, Pl:flies}(fly := baby) --> {Sg:baby, Pl:babies}
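The three models can be contrasted in a short Python sketch of English plural formation. All function names are hypothetical, and the word-and-paradigm function uses one possible reading of "association of a model table to a stem": the part shared by all model forms is replaced, and the endings are kept.

```python
import os

def item_and_arrangement(stem, affix="s"):
    """Concatenate morphemes: dog + s --> dogs."""
    return stem + affix

def item_and_process(stem):
    """Apply the rule y -> ie before s, then attach s: baby --> babie s --> babies.
    The rule fires for every final y, so it over-generates for words like boy."""
    rewritten = stem[:-1] + "ie" if stem.endswith("y") else stem
    return rewritten + "s"

def word_and_paradigm(model_table, new_word, citation="Sg"):
    """Instantiate a model inflection table, e.g. {Sg:fly, Pl:flies}, for a new word."""
    # The part shared by all forms of the model word ("fl" for fly/flies).
    prefix = os.path.commonprefix(list(model_table.values()))
    # Ending of the citation form in the model ("y").
    ending = model_table[citation][len(prefix):]
    if not new_word.endswith(ending):
        raise ValueError(f"{new_word!r} does not fit the model table")
    base = new_word[:len(new_word) - len(ending)]
    # Replace the shared part and keep each form's ending.
    return {feat: base + form[len(prefix):] for feat, form in model_table.items()}

FLY = {"Sg": "fly", "Pl": "flies"}
```

For example, `word_and_paradigm(FLY, "baby")` builds the table for baby from the model fly, whereas calling it with "dog" raises an error because dog does not end in y.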

The word and paradigm model
The traditional model (Greek and Latin grammar).
The most general and powerful: anything goes.
The other models can be used as auxiliaries when defining a paradigm.
But: there is no precise definition of a paradigm and its application.

Paradigms, mathematically
For each part of speech C ("word class"), associate a finite set F(C) of inflectional features.
An inflection table for C is a function of type F(C) -> Str.
Type Str: lists of strings (the list may be empty).
A paradigm for C is a function of type String -> F(C) -> Str.
Thus there are different paradigms for nouns, adjectives, verbs, ...

Example: English nouns
F(N) = Number x Case, where Number = {Sg,Pl}, Case = {Nom,Gen}
The word dog has the inflection table (using GF notation)

    table {
      <Sg,Nom> => "dog" ;
      <Sg,Gen> => "dog's" ;
      <Pl,Nom> => "dogs" ;
      <Pl,Gen> => "dogs'"
    }

regn, the regular noun paradigm, is the function (of variable x)

    \x -> table {
      <Sg,Nom> => x ;
      <Sg,Gen> => x + "'s" ;
      <Pl,Nom> => x + "s" ;
      <Pl,Gen> => x + "s'"
    }
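As a rough Python counterpart of the GF definition, the feature set F(N) can be modelled as pairs of number and case, and an inflection table as a dictionary from features to forms. This sketch simplifies the type Str to a single string per cell; the names `F_N` and `NUMBER`/`CASE` are hypothetical.

```python
from itertools import product

NUMBER = ("Sg", "Pl")
CASE = ("Nom", "Gen")
F_N = tuple(product(NUMBER, CASE))   # F(N) = Number x Case

def regn(x):
    """Regular noun paradigm: String -> (F(N) -> Str), as a dict per stem."""
    return {
        ("Sg", "Nom"): x,
        ("Sg", "Gen"): x + "'s",
        ("Pl", "Nom"): x + "s",
        ("Pl", "Gen"): x + "s'",
    }

dog = regn("dog")
```

Applying `regn` to a stem gives a total function on F(N): every feature combination has exactly one entry.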

Two more paradigms for English nouns
esn, nouns with plural ending es

    \x -> table {
      <Sg,Nom> => x ;
      <Sg,Gen> => x + "'s" ;
      <Pl,Nom> => x + "es" ;
      <Pl,Gen> => x + "es'"
    }

iesn, nouns with plural ending ies, dropping last character

    \x -> table {
      <Sg,Nom> => x ;
      <Sg,Gen> => x + "'s" ;
      <Pl,Nom> => init x + "ies" ;
      <Pl,Gen> => init x + "ies'"   -- init drops the last char
    }

Building a lexicon with paradigms
For a new entry: just give a stem and a paradigm,
    dog regn
    baby iesn
    coach esn
    boy sn
    hero esn
This can be compiled into a morphological lexicon by applying the paradigms.
Analysis can be performed by compiling the lexicon into a trie.
But how do we select the right paradigm for each word?
And what to do with irregular words (such as man - men)?
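Compiling stem-plus-paradigm entries into a morphological lexicon can be sketched in Python as below. The paradigms follow the GF definitions of regn, esn and iesn; the helper `noun_table` and the inverted index `ANALYSES` are hypothetical names, and only entries whose paradigm is defined in the slides are used.

```python
def noun_table(sg, pl):
    """Shared shape of the noun tables: genitives are built from sg and pl."""
    return {
        ("Sg", "Nom"): sg,
        ("Sg", "Gen"): sg + "'s",
        ("Pl", "Nom"): pl,
        ("Pl", "Gen"): pl + "'",
    }

def regn(x):
    return noun_table(x, x + "s")

def esn(x):
    return noun_table(x, x + "es")

def iesn(x):
    return noun_table(x, x[:-1] + "ies")   # x[:-1] plays the role of init

# Lexicon source: one stem and one paradigm per entry.
ENTRIES = [("dog", regn), ("baby", iesn), ("coach", esn), ("hero", esn)]

# Morphological lexicon: apply each paradigm to its stem.
LEXICON = {stem: paradigm(stem) for stem, paradigm in ENTRIES}

# Inverted index for analysis: form -> list of (lemma, features).
ANALYSES = {}
for lemma, table in LEXICON.items():
    for features, form in table.items():
        ANALYSES.setdefault(form, []).append((lemma, features))
```

The inverted dictionary stands in for the trie here; both give the same analyses, the trie being the more compact representation.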

Multiargument paradigms
To inflect highly irregular words, one can just as well use several arguments:

    irregn = \x,y -> table {
      <Sg,Nom> => x ;
      <Sg,Gen> => x + "'s" ;
      <Pl,Nom> => y ;
      <Pl,Gen> => y + "'s"
    }

Similarly: irregular verb paradigms taking three forms.

    man men irregn
    mouse mice irregn
    house regn
    drink drank drunk irregv
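A lexicon that mixes one-argument and two-argument paradigms can be sketched in Python by storing the paradigm together with its argument list. Only the noun paradigms are shown, since the table for irregv is not spelled out in the slides; the entry format `(paradigm, arguments)` is an assumption of this sketch.

```python
def regn(x):
    """Regular noun paradigm (one argument)."""
    return {
        ("Sg", "Nom"): x,
        ("Sg", "Gen"): x + "'s",
        ("Pl", "Nom"): x + "s",
        ("Pl", "Gen"): x + "s'",
    }

def irregn(x, y):
    """Irregular noun paradigm: singular x and plural y are both given."""
    return {
        ("Sg", "Nom"): x,
        ("Sg", "Gen"): x + "'s",
        ("Pl", "Nom"): y,
        ("Pl", "Gen"): y + "'s",
    }

# Each entry: a paradigm and the forms it needs.
ENTRIES = [
    (irregn, ("man", "men")),
    (irregn, ("mouse", "mice")),
    (regn, ("house",)),
]

# The first argument serves as the lemma.
LEXICON = {args[0]: paradigm(*args) for paradigm, args in ENTRIES}
```

Note that the plural genitive of an irregular noun takes 's (men's), unlike the regular s' (houses').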

Arabic verb inflection: the problem

    form          perfect        imperfect
    P3 Sg Masc    kataba         yaktubu
    P3 Sg Fem     katabat        taktubu
    P3 Dl Masc    katabaa        yaktubaani
    P3 Dl Fem     katabataa      taktubaani
    P3 Pl Masc    katabuwa       yaktubuwna
    P3 Pl Fem     katabna        yaktubna
    P2 Sg Masc    katabta        taktubu
    P2 Sg Fem     katabti        taktubiyna
    P2 Dl         katabtumaa     taktubaani
    P2 Pl Masc    katabtum       taktubuwna
    P2 Pl Fem     katabtunv2a    taktubna
    P1 Sg         katabtu        A?aktubu
    P1 Pl         katabnaa       naktubu

This is not morphology
Tokenization: split up the input into words, punctuation marks, digit groups, etc. before morphological analysis.
Part-of-speech tagging: resolve ambiguities after morphological analysis.
Stemming, also known as lemmatization: find out the ground form of a word, but ignore the morphological tags. This is sometimes done instead of proper morphological analysis, usually in quick-and-dirty ways.
All these techniques can be implemented using finite-state methods, e.g. XFST.
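Tokenization, being a regular process, can be illustrated with a single regular expression. The Python sketch below is only a rough approximation for English-like text; the pattern and the function name `tokenize` are assumptions of this example, not part of XFST or GF.

```python
import re

# Alternatives, tried left to right:
#   digit groups | words (optionally with an internal apostrophe part) |
#   single punctuation marks | underscore
TOKEN = re.compile(r"\d+|[^\W\d_]+(?:'[^\W\d_]+)?|[^\w\s]|_")

def tokenize(text):
    """Split the input into words, digit groups and punctuation marks."""
    return TOKEN.findall(text)
```

A form such as dog's stays one token, so that the morphological analyser can recognise it as a genitive, while whitespace is discarded.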

Part-of-speech tagging (= POS tagging)
Task: among the many possible morphological analyses, find the one that is correct in the given context.
    She plays the guitar.   play +V
    She likes your plays.   play +N
Statistical POS tagging: (+Pron, +V, +Det) is a more frequent trigram than (+Pron, +N, +Det).
Rule-based POS tagging (constraint grammar): after +Pron, +N is not allowed.
POS tagging is covered in another course - morphology just feeds it.
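The rule-based idea can be sketched in Python as a filter over the candidate tags delivered by morphological analysis. This is a toy illustration, not a constraint grammar implementation: the rule "after +Pron, +N is not allowed" comes from the slide, while the second rule (no +V directly after +Det) and the treatment of your as +Det are assumptions added so that both example sentences are resolved.

```python
# Tags that may not directly follow a given (unambiguous) tag.
FORBIDDEN_AFTER = {
    "+Pron": {"+N"},   # from the slide
    "+Det": {"+V"},    # assumed extra rule for the second example
}

def disambiguate(candidates):
    """candidates: one set of possible tags per token, in sentence order."""
    result = []
    previous = None                      # tag of the previous token, if unambiguous
    for tags in candidates:
        remaining = set(tags)
        if previous is not None:
            filtered = remaining - FORBIDDEN_AFTER.get(previous, set())
            if filtered:                 # never remove the last reading
                remaining = filtered
        result.append(remaining)
        previous = next(iter(remaining)) if len(remaining) == 1 else None
    return result

# She plays the guitar.
SENT1 = [{"+Pron"}, {"+V", "+N"}, {"+Det"}, {"+N"}]
# She likes your plays.
SENT2 = [{"+Pron"}, {"+V"}, {"+Det"}, {"+N", "+V"}]
```

The filter only removes readings and never invents one, which matches the division of labour on the slide: morphology proposes the analyses, tagging selects among them.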

Other material
The LREC-2010 tutorial to GF: http://www.grammaticalframework.org/doc/gf-lrec-2010.pdf
GF reference manual: http://www.grammaticalframework.org/doc/gf-refman.html
GF library synopsis: http://www.grammaticalframework.org/lib/doc/synopsis.html