The Alpino Grammar and Lexicon

RSAVG, Sections 2.4.3 & 2.4.4
Daniël de Kok

Overview
- Broad overview of Alpino
- The lexicon
- The grammar
- Problem section:
  - Modifiers
  - Verb movement

Broad overview of Alpino
- Hdrug: environment for developing grammars/parsers/generators
- Lexicon/Grammar
- Tokenizer (finite state transducer)
- Part-of-speech tagger (Hidden Markov Model)
- Parser
- Generator
- Treebanking

Parsing in Alpino
- Lexical analysis
- Left-corner parser (with goal weakening and memoization)
- Disambiguation (with n-best unpacking)
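To make the term "left-corner parser" concrete, here is a minimal textbook-style left-corner recognizer in plain Prolog. It is only a sketch: the toy grammar (`rule/2`), the toy lexicon (`word/2`) and all predicate names are assumptions made for illustration, and it implements neither goal weakening nor memoization, so it should not be read as Alpino's actual parser.

```prolog
%% Toy grammar and lexicon (hypothetical, for illustration only).
rule(np, [det, n]).
rule(n,  [a, n]).

word(det, de).
word(a,   mooie).
word(n,   auto).

%% parse(+Goal, +Words, -Rest): recognize a Goal category bottom-up,
%% starting from the category of the first word (the "left corner").
parse(Goal, [Word|S0], S) :-
    word(Cat, Word),
    complete(Cat, Goal, S0, S).

%% complete(+Cat, +Goal, +S0, -S): Cat has been found; either it already
%% is the goal, or it is the left corner of a rule whose remaining
%% daughters are parsed before continuing upwards from the mother.
complete(Goal, Goal, S, S).
complete(Cat, Goal, S0, S) :-
    rule(Mother, [Cat|Rest]),
    parse_rest(Rest, S0, S1),
    complete(Mother, Goal, S1, S).

%% parse_rest(+Cats, +S0, -S): parse the remaining daughters in order.
parse_rest([], S, S).
parse_rest([C|Cs], S0, S) :-
    parse(C, S0, S1),
    parse_rest(Cs, S1, S).
```

The query `parse(np, [de, mooie, auto], [])` succeeds, while `parse(np, [mooie, de, auto], [])` fails.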

Generation in Alpino
- Lexical prediction
- Chart generator
- Fluency ranking (with n-best unpacking)

Lexicon

Introduction
Alpino uses a strongly lexicalized grammar:
- Word descriptions carry detailed syntactic information
- A relatively small set of simple grammar rules
- Words are represented by attribute-value structures

Example

    v [ subj: np [ agr:  sg
                   case: nom
                   dt:   [1] ]
        sc:   np [ case: acc
                   dt:   [2] ]
        dt:   dt [ hd:   [ lex: verft ]
                   su:   [1]
                   obj1: [2] ] ]

Figure 1: Simplified attribute-value structure for verft, the present-tense second/third-person inflection of verven 'to paint'.

Static lexicon
- Many-to-many (M:M) mapping: words <-> <tag, stem>
- Each word is associated with a complex tag
- The attribute-value structure is constructed from the tag
- For example:

      # Inflection   # Root    # Tag
      advies         advies    noun(het,count,sg)
      adviezen       advies    noun(het,count,pl)

- Dictionary size (2012):
  - ~180,000 mappings
  - ~190,000 mappings for named entities
- Stored in a finite state automaton
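The word-to-tag mapping can be pictured as a table of Prolog facts. The sketch below is only an illustration: the predicate names `lex/3`, `analyses/2` and `inflections/2` are hypothetical, and the real lexicon is stored in a finite state automaton rather than as facts. Only the two entries from the slide are included.

```prolog
%% lex(Inflection, Root, Tag): hypothetical fact format for the
%% static lexicon, using the two example entries for 'advies'.
lex(advies,   advies, noun(het,count,sg)).
lex(adviezen, advies, noun(het,count,pl)).

%% analyses(+Word, -Pairs): all Tag-Root pairs for a surface form.
analyses(Word, Pairs) :-
    findall(Tag-Root, lex(Word, Root, Tag), Pairs).

%% inflections(+Root, -Words): all surface forms that share a root.
inflections(Root, Words) :-
    findall(Word, lex(Word, Root, _), Words).
```

With these facts, `analyses(adviezen, Ps)` yields `Ps = [noun(het,count,pl)-advies]`, and `inflections(advies, Ws)` yields `Ws = [advies, adviezen]`.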

Special entries
- Some specific combinations of words cannot be derived using generic grammar rules
- Consider:

      helemaal niemand (lit.: at all nobody)
      * helemaal iemand
      * helemaal hij

- helemaal is an intensifier for the pronoun niemand
- Since it cannot apply to other pronouns: no generalization

Special entries (2)
- The Alpino lexicon contains special entries for such word combinations
- Since the dependency structure cannot be derived productively, it needs to be pre-packaged
- The tag of helemaal niemand:

      with_dt(pronoun(nwh,thi,sg,de,both,indef,strpro),
              dt(np,[mod=l(helemaal,adverb,advp,0,1),
                     hd=l(niemand,pronoun(nwh,thi,sg,de,both,indef,strpro),1,2)]))

- Such entries require extra handling in parsing and generation

Productive lexicon
The productive lexicon analyzes:
- Compounds
- Ordinals
- Unknown words

Grammar

Introduction
The Alpino grammar is written as Prolog rules:

    %% Rule head template
    grammar_rule(Identifier, LHS, RHS)
    %% Head for np -> det n
    grammar_rule(np_det_n, NP, [Det, N])

- Approximately 850 construction-specific rules

Example rule

    grammar_rule(n_adj_n, NP, [ AP, N ] ) :-
        unmarked_n_adj_n_struct(N,AP,NP).

    unmarked_n_adj_n_struct(N,AP,NP) :-
        n_adj_n_struct(N,AP,NP),
        AP:agr <=> N:agr.

    n_adj_n_struct(N,AP,NP) :-
        NP => n,
        AP => a,
        N => n,
        % reduce spur amb in 'ziek zijn'
        NP:subn => ~sub_indef_verb,
        ap_arg(AP),
        N:wh => nwh,
        % de hoeveelste overwinning was dat?
        NP:wh <=> AP:wh,
        %% ...

Principles
- Rules use general predicates/principles that are shared between different rules
- Example: percolate the dependency structure of the projected head to the left-hand side of the rule

Rule use
- Rules are purely declarative (with a few exceptions)
- Calling the goal for np_det_n,

      grammar_rule(np_det_n, NP, [Det, N])

  will instantiate NP, Det, and N with attribute-value structures
- Consequence: we can store the grammar rules as Prolog facts
- This is ideal for exploiting first-argument indexing in parsing and generation
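The idea of rules as facts, looked up by identifier, can be sketched in plain Prolog. This is a simplified stand-in, not Alpino's representation: categories are written as terms `cat(Type, Agr)` instead of attribute-value structures, and agreement is modeled by sharing one variable (for np_det_n this sharing is an assumption made for the example).

```prolog
%% grammar_rule(Identifier, LHS, RHS) stored as facts.
%% cat(Type, Agr) is a hypothetical, simplified category term;
%% a shared Agr variable stands in for a reentrancy such as AP:agr <=> N:agr.
grammar_rule(np_det_n, cat(np, Agr), [cat(det, Agr), cat(n,  Agr)]).
grammar_rule(n_adj_n,  cat(n,  Agr), [cat(ap,  Agr), cat(n,  Agr)]).

%% rule_by_id(+Id, -LHS, -RHS): the identifier is the first argument,
%% so a Prolog system with first-argument indexing can select the
%% matching fact without trying the other rules.
rule_by_id(Id, LHS, RHS) :-
    grammar_rule(Id, LHS, RHS).
```

Calling `rule_by_id(np_det_n, NP, [Det, N])` and then binding `N = cat(n, sg)` also instantiates `NP` to `cat(np, sg)` and `Det` to `cat(det, sg)`, because the agreement variable is shared across the rule.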

Handling modifiers

Introduction
- Unfortunately, the context-free backbone and the dependency structure sometimes do not match as nicely as we would like
- Frequently occurring example: modifiers

Problem #1
Consider:

(1) omdat hij met plezier een taart heeft gebakken
    because he with pleasure a cake has baked

met plezier is a modifier of gebakken; in the phrase structure, however, it is attached to a phrase headed by the auxiliary heeft.

Problem #1

    [sbar [comp omdat]
          [vp [vproj [np hij]
                     [vproj [pp met plezier]
                            [vproj [np een taart]
                                   [vproj [vc [v heeft]
                                              [vc gebakken]]]]]]]]

Figure 2.22: Derivation tree of omdat hij met plezier een taart heeft gebakken 'because he with pleasure a cake has baked'. Category types are used as node labels.

Problem #2
In cases where the syntactic head is also the head in the dependency structure, we want the head to have the full modifier list. For example:

(2) de mooie snelle groene auto
    the beautiful fast green car

- auto should have a modifier list containing mooie, snelle, and groene
- However, Prolog does not allow us to extend a well-formed (closed) list: once its tail is [], no further elements can be added by unification

Problem #2

    [np [det de]
        [4:n [a mooie]
             [3:n [a snelle]
                  [2:n [a groene]
                       [1:n auto]]]]]

Figure 2.23: Derivation tree for the phrase de mooie snelle groene auto 'the beautiful fast green car'. Rule identifiers are replaced by category types.

Solutions problem #2
1. Use a difference list for modifiers: apply a difference-list append for each modifier that is found, and unify the tail with the empty list at the maximal projection:

       Mods  = Hole1            %% n:1
       Hole1 = [groene|Hole2]   %% n:2
       Hole2 = [snelle|Hole3]   %% n:3
       Hole3 = [mooie|Hole4]    %% n:4
       Hole4 = []               %% np

2. Use two separate attributes in the attribute-value structure: one for modifier collection (cmod) and one for the final list of modifiers (mod). The final list is reentrant among the categories and is unified at a maximal projection.
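The difference-list solution can be tried out in plain Prolog. The following is a minimal sketch: the predicate names `add_mod/3`, `close_mods/2` and `demo/1` are made up for this example and are not taken from the Alpino sources.

```prolog
%% A difference list is written Front-Hole: Front is a list whose
%% tail is the still unbound variable Hole.

%% add_mod(+Mod, +DL0, -DL): add one modifier at the open end by
%% binding the old hole to [Mod|NewHole].
add_mod(Mod, Front-[Mod|Hole], Front-Hole).

%% close_mods(+DL, -List): at the maximal projection, unify the
%% remaining hole with [] to obtain a proper list.
close_mods(List-[], List).

%% demo(-Mods): the steps for 'de mooie snelle groene auto'.
demo(Mods) :-
    DL0 = Hole-Hole,             % n:1, no modifiers yet
    add_mod(groene, DL0, DL1),   % n:2
    add_mod(snelle, DL1, DL2),   % n:3
    add_mod(mooie,  DL2, DL3),   % n:4
    close_mods(DL3, Mods).       % np
```

The query `demo(Mods)` gives `Mods = [groene, snelle, mooie]`. Because the front of the list is the same variable at every level, the lowest n node sees the complete list as soon as the hole is closed at np.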

Solution #2
cmod is used to collect modifiers, and mod is the list of all modifiers that were collected at the maximal projection. Figure 2.24 gives an impression of how these two attributes work for the derivation in Figure 2.23.

    np [ cmod: [1], mod: [1] ]
      de
      n [ cmod: [1] <mooie, snelle, groene>, mod: [1] ]
        mooie
        n [ cmod: <snelle, groene>, mod: [1] ]
          snelle
          n [ cmod: <groene>, mod: [1] ]
            groene
            auto
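The interplay of the two attributes can also be mimicked in plain Prolog. In this sketch a node is a term `node(Cat, CMod, Mod)`; this term format and the simplified rule predicates `n_adj_n/3` and `np_det_n/3` are assumptions for illustration and only loosely mirror the grammar rules of the same name.

```prolog
%% node(Cat, CMod, Mod): CMod is the list collected so far,
%% Mod is the final list, shared (reentrant) between all levels.

%% n_adj_n(-Mother, +Adj, +Daughter): the mother's cmod extends the
%% daughter's cmod with the adjective; mod is passed up unchanged.
n_adj_n(node(n, [A|CMod], Mod), adj(A), node(n, CMod, Mod)).

%% np_det_n(-Mother, +Det, +Daughter): at the maximal projection,
%% the shared mod is unified with the collected cmod.
np_det_n(node(np, CMod, CMod), det(_), node(n, CMod, CMod)).

%% demo(-Head, -NP): 'de mooie snelle groene auto'.
demo(N1, NP) :-
    N1 = node(n, [], _),                 % auto
    n_adj_n(N2, adj(groene), N1),
    n_adj_n(N3, adj(snelle), N2),
    n_adj_n(N4, adj(mooie),  N3),
    np_det_n(NP, det(de), N4).
```

After `demo(N1, NP)`, the head node is `N1 = node(n, [], [mooie, snelle, groene])`: its own cmod is still empty, but its mod holds the full list, since every level shares the same mod variable and that variable is bound at np.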

Solving problem #1
The second solution also solves problem #1: where appropriate, syntactic heads should hand over their modifiers.
Example:

    add_modifier_to_dt([], Sign) :-
        Sign => v,
        Sign:vtype => vaux,
        Sign:dt:mod => [],
        Sign:mods <=> GiveMods,
        Sign:deps <=> [VC|_],
        VC => vc,
        VC:mods <=> VCMods,
        VC:cmods <=> VCCMods,
        alpino_wappend:wappend(GiveMods, VCCMods, VCMods).

As an attribute-value structure

    v [ deps:  < vc [ mods:  [1]
                      cmods: [2] ] | _ >
        mods:  [3]
        vtype: vaux
        dt:    dt [ mod: <> ] ]

    wappend([3], [2], [1])

Verb gaps

Finite verb movement

(3) omdat ik hem het boek heb gegeven
    because I him the book have given
(4) ik heb hem het boek gegeven
    I have him the book given

Usual analysis: Dutch has verb-final word order; in main clauses the finite verb moves to the second position.

Verb movement in Alpino
- Many different approaches:
  - Continuous constituents
  - Discontinuous constituents
- Approach in Alpino:
  - When a finite verb is found, assert a verb gap item with the necessary syntactic information
  - Not very declarative, but efficient
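The phrase "assert a verb gap item" can be illustrated with a dynamic predicate in plain Prolog. This is a rough sketch of the general idea only: the predicates `vgap/1`, `finite/2`, `scan/1` and `verb_slot/1`, as well as the toy verb information terms, are hypothetical and do not reflect how Alpino represents or stores gap items.

```prolog
:- use_module(library(lists)).
:- dynamic vgap/1.

%% finite(Word, Info): toy lexicon of finite verb forms (hypothetical).
finite(heb,  v(hebben, aux)).
finite(geef, v(geven,  main)).

%% scan(+Words): clear old gap items, then assert one vgap/1 item,
%% carrying the verb's information, for every finite verb in the input.
scan(Words) :-
    retractall(vgap(_)),
    forall(( member(W, Words), finite(W, Info) ),
           assertz(vgap(Info))).

%% verb_slot(-Item): the clause-final verb position of a main clause
%% can be filled by any asserted gap item instead of an overt verb.
verb_slot(gap(Info)) :-
    vgap(Info).
```

After `scan([ik, heb, hem, het, boek, gegeven])`, the only solution of `verb_slot(S)` is `S = gap(v(hebben, aux))`. The use of `assertz/1` is what makes this approach non-declarative: the available gap items depend on a side effect of scanning the input.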

Subordinate clause
[Figure: derivation tree, with rule identifiers as node labels, for omdat ik hem het boek heb gegeven. The finite verb heb is part of the clause-final verb cluster, together with gegeven.]

Main clause
[Figure: derivation tree, with rule identifiers as node labels, for ik heb hem het boek gegeven. ik is topicalized and heb is in second position; in the clause-final verb cluster, the position of the finite verb is filled by vgap, next to gegeven.]

Main clause without auxiliary
[Figure: derivation tree, with rule identifiers as node labels, for ik geef hem het boek. ik is topicalized and geef is in second position; the clause-final verb cluster contains only vgap.]

The end