The Alpino Grammar and Lexicon

RSAVG, Sections 2.4.3 & 2.4.4
Daniël de Kok

Overview
- Broad overview of Alpino
- The lexicon
- The grammar
- Problem section: modifiers, verb movement

Broad overview of Alpino
- Hdrug: environment for developing grammars/parsers/generators
- Lexicon/Grammar
- Tokenizer (finite state transducer)
- Part-of-speech tagger (Hidden Markov Model)
- Parser
- Generator
- Treebanking
Largely written in Prolog and C/C++

Parsing in Alpino
- Lexical analysis
- Left-corner parser (with goal weakening and memoization)
- Disambiguation (with n-best unpacking)
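To make the left-corner strategy concrete, here is a minimal recognizer sketch in plain Prolog. It is not Alpino's parser: the predicates rule/2, word/2, parse/3, left_corner/4 and parse_all/3 and the toy grammar are assumptions made for illustration, categories are atoms rather than attribute-value structures, and goal weakening, memoization and disambiguation are left out.

```prolog
% Toy grammar and lexicon (hypothetical, not taken from Alpino).
rule(np, [det, n]).
rule(n,  [a, n]).
word(de, det).
word(mooie, a).
word(groene, a).
word(auto, n).

% parse(?Cat, +Words, -Rest): recognise a Cat at the front of Words.
parse(Cat, [Word|Words], Rest) :-
    word(Word, WordCat),                 % bottom-up: start from the first word
    left_corner(WordCat, Cat, Words, Rest).

% left_corner(+Sub, ?Cat, +Words, -Rest): Sub has been found; grow it into Cat.
left_corner(Cat, Cat, Rest, Rest).
left_corner(Sub, Cat, Words, Rest) :-
    rule(Mother, [Sub|Siblings]),        % a rule whose left corner is Sub
    parse_all(Siblings, Words, Words1),  % top-down: predict the remaining daughters
    left_corner(Mother, Cat, Words1, Rest).

% parse_all(+Cats, +Words, -Rest): recognise a sequence of categories.
parse_all([], Rest, Rest).
parse_all([Cat|Cats], Words, Rest) :-
    parse(Cat, Words, Words1),
    parse_all(Cats, Words1, Rest).
```

The query `?- parse(np, [de, mooie, groene, auto], []).` succeeds, while `?- parse(np, [mooie, de, auto], []).` fails.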

Generation in Alpino
- Lexical prediction
- Chart generator
- Fluency ranking (with n-best unpacking)

Lexicon

Introduction
Alpino uses a strongly lexicalized grammar:
- Word descriptions carry detailed syntactic information
- A relatively small set of simple grammar rules
- Words are represented by attribute-value structures

Example
    v
      subj  np [ agr sg, case nom, dt [1] ]
      sc    np [ case acc, dt [2] ]
      dt    dt [ hd [ lex verft ], su [1], obj1 [2] ]
Figure 1: Simplified attribute-value structure for verft, the present-tense second/third person inflection of verven 'to paint'.

Static lexicon
- Many-to-many (M:M) mapping: words ↔ <tag, stem>
- Each word is associated with a complex tag
- The attribute-value structure is constructed from the tag
For example:
    # Inflection   # Root    # Tag
    advies         advies    noun(het,count,sg)
    adviezen       advies    noun(het,count,pl)
Dictionary size (2012):
- ~180,000 mappings
- ~190,000 mappings for named entities
Stored in a finite state automaton
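The mapping can be pictured as a table of Prolog facts. The sketch below is only an illustration: Alpino stores its dictionary in a finite state automaton, and the predicate names lex/3, readings/2 and forms/2 as well as the two verf entries and their tags are invented here to show the many-to-many character of the mapping.

```prolog
% lex(Word, Stem, Tag): one fact per mapping.
% The advies/adviezen entries come from the slide; the verf entries
% (including their tags) are hypothetical.
lex(advies,   advies, noun(het, count, sg)).
lex(adviezen, advies, noun(het, count, pl)).
lex(verf,     verf,   noun(de, mass, sg)).        % hypothetical entry
lex(verf,     verven, verb(present, first, sg)).  % hypothetical entry

% readings(+Word, -Pairs): all Tag-Stem pairs listed for a word form.
readings(Word, Pairs) :-
    findall(Tag-Stem, lex(Word, Stem, Tag), Pairs).

% forms(+Stem, -Words): all word forms listed for a stem.
forms(Stem, Words) :-
    findall(Word, lex(Word, Stem, _), Words).
```

One word form can have several readings (`readings(verf, Pairs)`), and one stem can have several word forms (`forms(advies, Words)`).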

Special entries
Some specific combinations of words cannot be derived using generic grammar rules. Consider:
    helemaal niemand (lit. 'at all nobody')
    * helemaal iemand
    * helemaal hij
helemaal is an intensifier for the pronoun niemand. Since it cannot apply to other pronouns, no generalization is possible.

Special entries (2)
The Alpino lexicon contains special entries for such word combinations. Since the dependency structure cannot be derived productively, it needs to be pre-packaged.
The tag of helemaal niemand:
    with_dt(pronoun(nwh,thi,sg,de,both,indef,strpro),
            dt(np,[mod=l(helemaal,adverb,advp,0,1),
                   hd=l(niemand,pronoun(nwh,thi,sg,de,both,indef,strpro),1,2)]))
Such entries require extra handling in parsing and generation.

Productive lexicon
The productive lexicon analyzes:
- Compounds
- Ordinals
- Unknown words

Grammar

Introduction
The Alpino grammar is written as Prolog rules:
    %% Rule head template
    grammar_rule(Identifier, LHS, RHS)
    %% Head for np -> det n
    grammar_rule(np_det_n, NP, [Det, N])
Approximately 850 construction-specific rules

Example rule
    grammar_rule(n_adj_n, NP, [ AP, N ] ) :-
        unmarked_n_adj_n_struct(N,AP,NP).

    unmarked_n_adj_n_struct(N,AP,NP) :-
        n_adj_n_struct(N,AP,NP),
        AP:agr <=> N:agr.

    n_adj_n_struct(N,AP,NP) :-
        NP => n,
        AP => a,
        N => n,
        % reduce spurious ambiguity in 'ziek zijn' ('to be ill')
        NP:subn => ~sub_indef_verb,
        ap_arg(AP),
        N:wh => nwh,
        % de hoeveelste overwinning was dat? ('the how-many-th victory was that?')
        NP:wh <=> AP:wh,
        %%

Principles
Rules use general predicates (principles) that are shared between different rules.
Example: percolate the dependency structure of the projected head to the category on the left-hand side of the rule.

Rule use
Rules are purely declarative (with a few exceptions).
Calling the goal for np_det_n,
    grammar_rule(np_det_n, NP, [Det, N])
will instantiate NP, Det, and N with attribute-value structures.
Consequence: we can store the grammar rules as Prolog facts.
Ideal: exploit first-argument indexing in parsing and generation.
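A small sketch of what storing rules as facts looks like. The categories here are plain terms such as np(Agr), a stand-in for Alpino's attribute-value structures, and the agreement sharing is an assumption made for the example, not the content of the real rules.

```prolog
% grammar_rule(Identifier, Mother, Daughters), stored as facts.
% The shared variable Agr plays the role of a reentrancy between
% the mother and its daughters (illustrative only).
grammar_rule(np_det_n, np(Agr), [det(Agr), n(Agr)]).
grammar_rule(n_adj_n,  n(Agr),  [ap(Agr),  n(Agr)]).

% rule_by_id(+Id, -Mother, -Daughters): with the identifier bound,
% first-argument indexing can pick the matching fact directly.
rule_by_id(Id, Mother, Daughters) :-
    grammar_rule(Id, Mother, Daughters).
```

Calling `?- rule_by_id(np_det_n, NP, [Det, N]), Det = det(sg).` binds NP to np(sg) and N to n(sg): instantiating one category propagates to the others through the shared variable.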

Handling modifiers

Introduction
Unfortunately, a context-free backbone and a dependency structure sometimes do not match as nicely as we would like.
A frequently occurring example: modifiers

Problem #1
Consider:
    (1) omdat hij met plezier een taart heeft gebakken
        because he with pleasure a cake has baked
met plezier is a modifier of gebakken; in the phrase structure, however, it is attached to a phrase headed by the auxiliary heeft.

Problem #1
    [sbar [comp omdat]
          [vp [vproj [np hij]
                     [vproj [pp met plezier]
                            [vproj [np een taart]
                                   [vproj [vc [v heeft] [vc gebakken]]]]]]]]
Figure 2.22: Derivation tree of omdat hij met plezier een taart heeft gebakken (because he with pleasure a cake has baked). Category types are used as node labels.

Problem #2
In cases where the syntactic head is also the head in the dependency structure, we want the head to have the full modifier list. For example:
    (2) de mooie snelle groene auto
        the beautiful fast green car
auto should have a modifier list containing mooie, snelle, and groene.
However, a proper (closed) Prolog list cannot be extended once it has been built.

Problem #2
    [np [det de]
        [4:n [a mooie]
             [3:n [a snelle]
                  [2:n [a groene]
                       [1:n auto]]]]]
Figure 2.23: Derivation tree for the phrase de mooie snelle groene auto (the beautiful fast green car). Rule identifiers are replaced by category types.

Solutions problem #2
1. Use a difference list for modifiers: apply a difference list append for each modifier that is found, and unify the tail with the empty list at the maximal projection.
    Mods  = Hole1           %% n:1
    Hole1 = [groene|Hole2]  %% n:2
    Hole2 = [snelle|Hole3]  %% n:3
    Hole3 = [mooie|Hole4]   %% n:4
    Hole4 = []              %% np
2. Use two separate attributes in the attribute-value structure: one for modifier collection (cmod) and one for the final list of modifiers (mod). The final list is reentrant among the categories and is unified at a maximal projection.
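The first solution can be sketched as runnable Prolog. The helper names dl_add/3, dl_close/2 and collect_mods/1 are invented for this example; the sketch shows the general difference-list technique under the assumption that modifiers are plain atoms, not how Alpino encodes it in its attribute-value structures.

```prolog
% A difference list is a pair List-Hole whose tail Hole is still unbound.

% dl_add(+Item, +DL0, -DL): add Item at the open end in constant time
% by binding the current hole to [Item|NewHole].
dl_add(Item, List-[Item|Hole], List-Hole).

% dl_close(+DL, -List): bind the remaining hole to [] to get a proper list.
dl_close(List-[], List).

% Collect the modifiers of 'de mooie snelle groene auto' from the
% innermost n upwards, then close the list at the np.
collect_mods(Mods) :-
    DL0 = Hole-Hole,            % n:1, nothing collected yet
    dl_add(groene, DL0, DL1),   % n:2
    dl_add(snelle, DL1, DL2),   % n:3
    dl_add(mooie,  DL2, DL3),   % n:4
    dl_close(DL3, Mods).        % np: tail unified with []
```

Because the list stays open until the np is reached, every projection can still extend it, which a closed list would not allow.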

Solution #2
cmod is used to collect modifiers, and mod is the list of all modifiers that were collected at the maximal projection. Figure 2.24 gives an impression of how these two attributes work for the derivation in Figure 2.23.
    np [ cmod [1], mod [1] ]
      de
      n [ cmod [1] <mooie, snelle, groene>, mod [1] ]
        mooie
        n [ cmod <snelle, groene>, mod [1] ]
          snelle
          n [ cmod <groene>, mod [1] ]
            groene
            auto
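A toy version of the two-attribute idea, assuming categories are written as terms n(CMod, Mod) and np(CMod, Mod) instead of attribute-value structures. The predicates word/2, n_adj_n/3, np_det_n/3 and demo/2 are hypothetical names for this sketch and take the daughters as arguments for brevity.

```prolog
% CMod is the list collected so far; Mod is one variable shared
% (reentrant) by every projection of the head.
word(auto, n([], _Mod)).    % bare noun: nothing collected yet

% n -> a n: push the adjective onto cmod, pass mod along unchanged.
n_adj_n(Adj, n(CMod, Mod), n([Adj|CMod], Mod)).

% np -> det n: at the maximal projection, mod is unified with cmod.
np_det_n(_Det, n(CMod, CMod), np(CMod, CMod)).

% Build 'de mooie snelle groene auto' bottom-up.
demo(N1, NP) :-
    word(auto, N1),
    n_adj_n(groene, N1, N2),
    n_adj_n(snelle, N2, N3),
    n_adj_n(mooie,  N3, N4),
    np_det_n(de, N4, NP).
```

After `?- demo(N1, NP).` the innermost noun N1 also carries the full list in its mod slot, although its own cmod is still empty: the shared variable is filled in only when the np is built.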

Solving problem #1
The second solution also solves problem #1: where appropriate, a syntactic head hands its modifiers over (here, an auxiliary passes them on to its verbal complement).
Example:
    add_modifier_to_dt([], Sign) :-
        Sign => v,
        Sign:vtype => vaux,
        Sign:dt:mod => [],
        Sign:mods <=> GiveMods,
        Sign:deps <=> [VC|_],
        VC => vc,
        VC:mods <=> VCMods,
        VC:cmods <=> VCCMods,
        alpino_wappend:wappend(GiveMods, VCCMods, VCMods).

As an attribute-value structure
    v
      deps   < vc [ mods [1], cmods [2] ] | _ >
      mods   [3]
      vtype  vaux
      dt     dt [ mod < > ]
    wappend( [3], [2], [1] )

Verb gaps

Finite verb movement
    (3) omdat ik hem het boek heb gegeven
        because I him the book have given
    (4) ik heb hem het boek gegeven
        I have him the book given
Usual analysis: Dutch has a verb-final word order; in main clauses the finite verb moves to the second position.

Verb movement in Alpino
Many different approaches:
- Continuous constituents
- Discontinuous constituents
Approach in Alpino:
- When a finite verb is found, assert a verb gap item with the necessary syntactic information
- Not very declarative, but efficient
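The flavour of this approach can be shown with a deliberately simplified sketch. Everything in it is an assumption for illustration: the predicates finite/2, vgap/1, scan/1 and verb_final/1, the shape of the stored information, and the idea of scanning the word list up front. It only demonstrates the general pattern of recording a gap item as a side effect with assertz/1.

```prolog
:- use_module(library(lists)).
:- dynamic vgap/1.

% Hypothetical mini-lexicon of finite verb forms and their information.
finite(heb,  v(aux, hebben)).
finite(geef, v(main, geven)).

% scan(+Words): record a verb gap item for every finite verb found.
% This is the non-declarative part: it changes the database.
scan(Words) :-
    retractall(vgap(_)),
    forall(( member(Word, Words), finite(Word, Info) ),
           assertz(vgap(Info))).

% verb_final(-V): the clause-final verb position may be filled by a
% recorded gap, which carries the finite verb's information.
verb_final(gap(Info)) :-
    vgap(Info).
```

After `?- scan([ik, heb, hem, het, boek, gegeven]).` the goal `verb_final(V)` yields V = gap(v(aux, hebben)), so a verb-final analysis can consume the gap where the subordinate clause would have heb.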

Subordinate clause
    max_xp(sbar)
      sbar(vp)
        omdat
        vp_vpx
          vpx_vproj
            vp_arg_v(np)
              np_pron_weak
                ik
              vp_arg_v(np)
                np_pron_weak
                  hem
                vp_arg_v(np)
                  np_det_n
                    het boek
                  vproj_vc
                    v_v_v
                      heb
                      vc_vb
                        vb_v
                          gegeven

Main clause
[Figure: derivation tree for ik heb hem het boek gegeven, rooted in max_xp(root) with a non_wh_topicalization(np) node for ik. The finite verb heb stands in second position, and a vgap node takes its place next to gegeven in the verb cluster.]

Main clause without auxiliary
[Figure: derivation tree for ik geef hem het boek, rooted in max_xp(root) with a non_wh_topicalization(np) node for ik. The finite verb geef stands in second position, and the verb cluster contains only a vgap node.]

The end