The Prague Dependency Treebank (and WS02)

Jan Hajič
Institute of Formal and Applied Linguistics
School of Computer Science, Faculty of Mathematics and Physics
Charles University, Prague, Czech Republic

This Talk: an Overview
  The Prague Dependency Treebank
    The project
    The 3 annotation layers:
      morphology
      surface syntax (also: the lab)
      deep syntax
  Use of the deep representation:
    Machine translation
    Challenges for NL generation
5/7/2002, PreWS02 Summer School

The Prague Dependency Treebank Project (Czech Language Treebank), 1996-2004
  1998: PDT v. 0.5 released (JHU workshop)
    400k words annotated, unchecked
  2001: PDT 1.0 released (LDC)
    1.3MW annotated, morphology & surface syntax
  2004: PDT 2.0 release planned
    1.0MW annotated, underlying (deep) syntax: the tectogrammatical layer

Annotation Layers
  Morphology
    Tag (full morphology, 13 categories), lemma
  Analytical layer (surface syntax)
    Dependency, analytical function
  Tectogrammatical layer (underlying syntax)
    Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order)

Morphological Annotation: 13 categories

  Category     # of values  Example(s)
  POS          10           N (noun), Z (punctuation)
  SUBPOS       75           P (personal pron.), U (possessive adj.)
  GENDER       8            I (masc. inanimate), X (any), - (N.A.)
  NUMBER       4            P (plural), D (dual)
  CASE         9            1 (nominative), 6 (locative)
  POSSGENDER   4            M (masc. animate), F (feminine)
  POSSNUMBER   3            S (singular), P (plural)
  PERSON       5            1 (first),...
  TENSE        4            P (present), M (past)
  GRADE        5            3 (superlative)
  NEGATION     3            A (affirmative), N (negative)
  VOICE        3            A (active), P (passive)
  VAR          11           1 (1st variant), 6 (colloq. style), 8 (abbrev.)

Layer 1: Morphology
  Tag: 13 categories
    Example: AAFP3----3N---- "(to) the most uninteresting"
      A  Adjective
      A  Regular
      F  Feminine
      P  Plural
      3  Dative
      -  no poss. gender
      -  no poss. number
      -  no person
      -  no tense
      3  superlative
      N  negated
      -  no voice
      -  reserve1
      -  reserve2
      -  base var.
  Lemma: unique identifier
    Ex.: Books/verb -> book-1, went -> go, to/prep. -> To-1
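
The positional tag above can be unpacked mechanically. The following is a minimal sketch, not the PDT tooling: the position names are taken from the category table and from the order in which the example is explained on this slide, and the function names are made up for illustration.

```python
# Position names: the 13 categories plus the two reserve slots, in the order
# in which the slide's example tag is read (an assumption of this sketch).
POSITIONS = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE",
    "GRADE", "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]


def decode_tag(tag):
    """Map each character of a 15-character positional tag to its position name."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return dict(zip(POSITIONS, tag))


def set_categories(tag):
    """Keep only the positions that carry a value ('-' means not applicable)."""
    return {name: value for name, value in decode_tag(tag).items() if value != "-"}


decoded = set_categories("AAFP3----3N----")
```

For the example tag, `decoded` holds values only for POS, SUBPOS, GENDER, NUMBER, CASE, GRADE and NEGATION, which matches the reading "adjective, regular, feminine, plural, dative, superlative, negated".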

Layer 2: Analytical syntax
  Surface, dependency-based representation
  Every word gets a node, plus one (root)
  Interested in:
    dependency structure
    analytical function:
      Pred, Sb, Obj, Adv, Atr, Atv, Pnom;
      AuxV, AuxP, AuxC,...;
      Coord, Apos, parenthesis
      ExD

Layer 2: Analytical syntax
  Dependency + Analytical Function
  [Figure: dependency tree of the sentence below, with the "dependent" and "governor" ends of an edge labeled]
  The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.

Comparison: parse trees vs. dependency
  Compare:
    Lexicalized parse tree: S(walks) -> NP(John) VP(walks); NP(John) -> NNP John; VP(walks) -> VBZ walks
    Dependency tree: walks/VBZ governs John/NNP
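
A dependency tree like the one compared here needs no nonterminals: each word only records its governor. The sketch below is one possible encoding, assuming 1-based word positions with 0 as the extra root node; the `Node` class, the helper name, and the Sb/Pred labels on this English example are illustrative choices, not the PDT file format.

```python
from dataclasses import dataclass


@dataclass
class Node:
    form: str   # word form
    tag: str    # part-of-speech tag
    head: int   # position of the governor; 0 is the artificial root
    afun: str   # analytical function (illustrative labels)


# "John walks": John depends on walks, walks depends on the root.
sentence = [
    Node("John", "NNP", 2, "Sb"),
    Node("walks", "VBZ", 0, "Pred"),
]


def dependents(nodes, head_id):
    """Return the 1-based positions of all words governed by head_id."""
    return [i for i, node in enumerate(nodes, start=1) if node.head == head_id]
```

Calling `dependents(sentence, 2)` lists the words governed by "walks", and `dependents(sentence, 0)` finds the word attached to the root.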

Layer 3: Tectogrammatical
  Underlying (deep) syntax
  4 sublayers:
    dependency structure, (detailed) functors
    topic/focus and deep word order
    coreference (mostly grammatical only)
    all the rest (grammatemes): detailed functors, underlying gender, number,...

Analytical vs. Tectogrammatical annotation (TR: sublayer 1 only shown)
  [Figure: an analytical tree next to its tectogrammatical counterpart; callouts:]
    Underlying verb + tense
    Deep function
    Elided Actor in
    Another ellipsis...
    Prepositions out

Dependency structure
  Similar to the surface (Analytical) layer...
  ...but:
    certain nodes deleted: auxiliaries, non-autosemantic words, punctuation
    some nodes added: based on word (mostly verb, noun) valency
    some ellipsis resolution
    detailed dependency relation labels (functors)
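
The node-deletion step can be illustrated on its own. This is a rough sketch under strong simplifications: it removes every node whose analytical function starts with "Aux" and reattaches its dependents to the nearest remaining governor. It does not add nodes, resolve ellipsis, or assign functors, and the dictionary layout and the ExD label on the phrase head are assumptions made for the example.

```python
def collapse_aux(nodes):
    """Drop Aux* nodes and reattach their dependents.

    nodes maps a node id to (form, head_id, afun); head_id 0 is the root.
    """
    def kept_ancestor(head):
        # Climb past auxiliary governors until a kept node or the root is found.
        while head != 0 and nodes[head][2].startswith("Aux"):
            head = nodes[head][1]
        return head

    return {
        node_id: (form, kept_ancestor(head), afun)
        for node_id, (form, head, afun) in nodes.items()
        if not afun.startswith("Aux")
    }


# Analytical tree for "center of gravity"; the label on "center" is a placeholder.
analytical = {
    1: ("center", 0, "ExD"),
    2: ("of", 1, "AuxP"),
    3: ("gravity", 2, "Atr"),
}
collapsed = collapse_aux(analytical)
```

After the call, "gravity" hangs directly under "center" and the preposition node is gone, which is the "prepositions out" effect named on the comparison slide.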

Tectogrammatical Functors
  Actants: ACT, PAT, EFF, ADDR, ORIG
    cannot repeat in a clause, usually compulsory
  Free modifications (~ 50)
    can repeat; optional, sometimes compulsory
    Ex.: LOC, DIR1,...; TWHEN, TTILL,...; RESTR, DESC; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR
  Special
    Coordination, Rhematizers, Foreign phrases,...
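
The "actants cannot repeat in a clause" constraint lends itself to a simple consistency check. The sketch below assumes the functors of one verb's direct dependents are available as a flat list and ignores coordination, which would need extra handling; the function name is hypothetical.

```python
from collections import Counter

# The five actants named on the slide; every other functor is treated as a free modification.
ACTANTS = {"ACT", "PAT", "EFF", "ADDR", "ORIG"}


def repeated_actants(functors):
    """Return the actant labels that occur more than once among a verb's dependents."""
    counts = Counter(f for f in functors if f in ACTANTS)
    return sorted(f for f, n in counts.items() if n > 1)
```

A clause with dependents `["ACT", "LOC", "LOC", "PAT", "PAT"]` would be flagged for PAT only, since free modifications such as LOC may repeat.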

Tectogrammatical Example
  Analytical verb form:
    (he) allowed would-be to-be enrolled
    směl by být zapsán
  [Figure callouts: Collapsed; Additional attributes (grammatemes): conditional + allow]

Tectogrammatical Example
  Predicate with copula (state)
    (the) pool has-been already filled
    bazén byl již napuštěný

Tectogrammatical Example
  Passive construction (action)
    (The) book has-been translated [by Mr. X]
    Kniha byla přeložena
  [Figure callouts: Disappeared; Added]

Tectogrammatical Example
  Object
    (he) gave him a-book
    dal mu knihu
  Obj goes into ACT, PAT, ADDR, EFF or ORIG based on the governor's valency frame

Tectogrammatical Example
  Relative clause (embedded)
    (a) house, which is expensive, (we) (to-ourselves) will-not-buy
    dům, který je drahý, si nekoupíme

Tectogrammatical Example
  Incomplete phrases
    Peter works well, but Paul badly
    Petr pracuje dobře, ale Pavel špatně
  [Figure callout: Added]

Deep word order, topic/focus
  Deep word order:
    from old information to the new one (left-to-right) at every level (head included)
    projectivity by definition
      i.e., partial level-based order -> total d.w.o.
  Topic/focus/contrastive topic
    attribute of every node
    restricted by d.w.o. and other constraints

Deep word order, topic/focus
  Example:
    Analytical dep. tree: Baker bakes rolls. vs. Baker IC bakes rolls.

Coreference
  Grammatical (vs. textual)
  Ex.: Peter moved to Iowa after he finished his PhD.
    move (PRED)
      Peter (ACT)
      Iowa (DIR1)
      finish (TWHEN)
        he (ACT)
        PhD (PAT)
          he (APP)
  NB: poster about Control, this morning

Grammatemes
  Syntactic (= detailed functors)
    only for some functors:
      WHEN: before/after
      LOC: next-to, behind, in-front-of
  Lexical, underlying number (SG/PL), tense, modality, degree of comparison
    strictly only where necessary (agreement!)

The Valency Lexicon
  Valency frames
    each verb (+ some nouns, adjectives) has slots for functor/form pairs:
      give: ACT(Nom) PAT(Acc) ADDR(to+Dat)
  Basic set prepared in advance; annotators add entries on-the-go
    checking and approval process follows (consistency)
  Compare: Levin's Classes, Proposition Bank
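
A valency frame of this shape is easy to hold as a mapping from functor to surface form. The sketch below uses only the "give" frame quoted on the slide; the dictionary layout and the lookup function are illustrative assumptions, not the lexicon's actual format.

```python
# One frame, copied from the slide: give: ACT(Nom) PAT(Acc) ADDR(to+Dat)
VALENCY = {
    "give": {"ACT": "Nom", "PAT": "Acc", "ADDR": "to+Dat"},
}


def functor_for(verb, form):
    """Return the functor whose slot matches the surface form, or None if no slot fits."""
    frame = VALENCY.get(verb, {})
    for functor, slot_form in frame.items():
        if slot_form == form:
            return functor
    return None
```

This is the kind of lookup behind the earlier remark that an Obj is mapped to an actant "based on the governor's valency frame": `functor_for("give", "Acc")` selects PAT, while a form with no matching slot yields `None`.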

Tectogrammatical Annotation
  Manual annotation
    4 groups of annotators ~ 4 sublayers
  Special graphical tool (TrEd)
    Customizable graphical tree editor
  Preprocessing
    Data from Analytical Layer, preprocessed
    Online dependency function preassignment

The [Manual] Annotation Tool
  Perl/PerlTk based, platform-independent
    Linux, Windows 95/98/2000, Solaris,...
  Perl as the macro language
    unlimited online processing capability
  Flexibility for interactive checking
    split screen, graphical diff function
  Customization, printing, plugins

The TrEd Tree Editor
  Graphical tool TrEd
  [Screenshot of the main screen; callouts:]
    Original sentence: [This year's flu season is still quiet in Europe.]
    Editing window customization
    Run a macro
    Multiwindow editing/compare

Valency Lexicon in TrEd
  [Screenshot: lexicon entry "to write sth (about sth)"]

What Is It Good For?
  Machine Translation
    TL representation is closer to an interlingua than surface (analytical) syntax
      => less work in the transfer phase
      more work in parsing and generation
      ...but advantage in multilingual MT application
  Question answering
    same representation for questions and answers

Machine Translation Architecture
  Typical (structural) MT system:
  [Figure: source sentence -> Analysis (parsing) -> a parse -> Transfer -> Generation (synthesis) -> target sentence]

Machine Translation Architecture
  Tectogrammatical layer-based system:
  [Figure: three layers (morphological, analytical, tectogrammatical) between source and target sentence]
    Source side, bottom-up: morphology (tagging), parsing, (tectogrammatical) parsing
    Transfer at the tectogrammatical layer
    Target side, top-down: generation, linearization, morph. synthesis

Comparison: analytical layer

Comparison: tectogrammatical l.
  The [Homestead's] only remaining baker bakes the most famous rolls to the north of Long River.
  al-xabaaz al- axiir al-baaqii [fii Homestead] yaśmacu ashhar al-kruasaanaat ilaa shimaal min Long River.

The Three Crucial Steps
  Analytical (surface) -> Tectogrammatical
    additional parsing required
  Transfer
    minimal effort: only true transformations needed (like swimming ~ schwimmen gern)
  Generation
    back from Tectogrammatical representation to Analytical (surface syntax)

The Devil's In...
  The additional three steps:
  [Figure: the layer-based architecture again, with annotations]
    Source side: morphology (tagging), parsing, (tectogrammatical) parsing
    Layers: morphological layer, analytical layer, tectogrammatical layer
    Transfer
    Target side: Generation, linearization (trivial), morph. synthesis (easy)

The Devil's In...
  The additional three steps:
    Tectogrammatical parsing (source: analytical layer -> tectogrammatical layer)
    (Simple) transfer
    Generation (target: tectogrammatical layer -> analytical layer):
      - Deletions
      - Insertions: prepositions, conjunctions,...
      - Word order
      - Morphology

Components:...the Generation
  Deletions of nodes [rare if going into English]
  Insertions of nodes
    prepositions, conjunctions, punctuation
    splitting phrases/idioms/named entities
  Tree reorganization (numeric expressions)
  Surface word order (analytical tree: defined w.o.)
  Morphology (agreement, cases based on subcat)

Generation: Insertion of Prepositions
  Tectogrammatical layer:
    střed / center
      přitažlivost / gravity (APP.sg)
  Analytical layer:
    Czech: středu; přitažlivosti.nfs2 (Atr)
    English: center; of (AuxP); gravity.nn (Atr)
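
The insertion step for English can be pictured as a functor-driven rule. The sketch below hard-codes a single hypothetical rule (APP is realized with "of"), taken from the slide's example; a real system would learn or list many such rules, and the tuple layout and the blanket Atr label are assumptions of this sketch.

```python
# Hypothetical one-entry rule table: functor -> English preposition to insert.
INSERT_PREP = {"APP": "of"}


def to_analytical(head, dependents):
    """Build analytical nodes (form, afun, governor) for a noun and its modifiers.

    dependents is a list of (lemma, functor) pairs from the tectogrammatical tree.
    Every modifier is labeled Atr here, which only suits nominal attributes.
    """
    nodes = [(head, None, None)]
    for lemma, functor in dependents:
        prep = INSERT_PREP.get(functor)
        if prep is not None:
            # The inserted preposition becomes an AuxP node governing the modifier.
            nodes.append((prep, "AuxP", head))
            nodes.append((lemma, "Atr", prep))
        else:
            nodes.append((lemma, "Atr", head))
    return nodes


result = to_analytical("center", [("gravity", "APP")])
```

For the slide's example this produces the three analytical nodes center, of (AuxP) and gravity (Atr), with "gravity" attached below the inserted preposition.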

Generation: Surface word order
  Tectogrammatical layer:
    přijít.past / come.past
      včera / yesterday (TWHEN)
      Petr / Peter (ACT)
  Analytical layer:
    Czech: přijít.vb3sp; včera (Adv), Petr (Sb)
    English: come.vbd; Peter (Sb), yesterday (Adv)

Generation: Complex Input
  [Figure: a complex tectogrammatical input tree and its English translation]

Generation: How-To (1)
  Statistical, (perhaps) in two steps
    Analytical tree reconstruction
      everything except word order, i.e., includes morphology (tag assignment)
    Word Order
      projective trees assumed here [1]
      thus, it is sufficient to determine level-by-level word order
  [1] Additional step required for non-projective constructions [can be avoided for English]
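
Why level-by-level order suffices for projective trees can be seen in a few lines: once each governor knows the order of itself and its direct dependents, a recursive walk yields the whole sentence. This is a minimal sketch with hypothetical names; it identifies nodes by their word strings, so it assumes no word occurs twice.

```python
def linearize(node, level_order):
    """Expand a projective tree into a word sequence.

    level_order[node] lists the node itself and its direct dependents
    in surface order; leaves may be omitted from the mapping.
    """
    words = []
    for item in level_order.get(node, [node]):
        if item == node:
            words.append(node)
        else:
            # A dependent's whole subtree occupies one contiguous span.
            words.extend(linearize(item, level_order))
    return words


# One level from the word-order example: Peter (Sb), come, yesterday (Adv).
order = {"come": ["Peter", "come", "yesterday"]}
sentence = linearize("come", order)
```

Because every subtree is emitted as one contiguous block, the result is projective by construction, which is why non-projective constructions need the additional step mentioned in the footnote.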

Generation: How-To (2)
  Reconstruction: two possible ways
    transformation-based learning ([fn]tbl)
    probabilistic, by a dependency tree model:
      based on triplets <word,tag,afun> and dependency relation (governor,dependent)
      ~ Collins bilexical model, Charniak parser model, Bangalore & Rambow
      afun instead of nonterminals

Generation: How-To (3)
  Word order
    language model for a single level in the tree:
      <word,tag,afun> triples; includes head (no afun)
      Ex.: come.vbd, Peter.NNP (Sb), yesterday.adv (Adv)
    non-projective constructions (and some more) by classic n-gram LM
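
A single-level language model can be sketched as a search over permutations of the head and its dependents. The version below is deliberately reduced: it scores only the afun labels (with "HEAD" standing in for the governor, which has no afun at its own level) rather than full <word,tag,afun> triples, and all probabilities are invented for illustration.

```python
import math
from itertools import permutations

# Invented bigram probabilities over labels; "<s>" marks the start of the level.
PROB = {
    ("<s>", "Sb"): 0.6, ("<s>", "Adv"): 0.3, ("<s>", "HEAD"): 0.1,
    ("Sb", "HEAD"): 0.8, ("Sb", "Adv"): 0.1,
    ("HEAD", "Adv"): 0.6, ("HEAD", "Sb"): 0.1,
    ("Adv", "HEAD"): 0.3, ("Adv", "Sb"): 0.3,
}
FLOOR = 1e-6  # probability assigned to unseen bigrams


def score(labels):
    """Log-probability of a label sequence under the bigram table."""
    total, prev = 0.0, "<s>"
    for label in labels:
        total += math.log(PROB.get((prev, label), FLOOR))
        prev = label
    return total


def best_order(items):
    """Pick the highest-scoring permutation of (word, label) pairs."""
    return max(permutations(items), key=lambda p: score([label for _, label in p]))


level = [("come", "HEAD"), ("yesterday", "Adv"), ("Peter", "Sb")]
ordered = [word for word, _ in best_order(level)]
```

Exhaustive permutation search is only workable because a single level rarely has more than a handful of nodes; with the toy table above the subject-head-adverb order scores highest.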

Generation: How-To (4)
  Data
    trained on WSJ:
      converted to analytical dependency trees
        adapted Jason Eisner's head assignment rules
        added rules for heads of base NPs
        added rules for analytical functions
      rule-based parsing to tectogrammatical layer (for now; manual annotation will follow)
    i.e., TR-AR data available (English)

Some pointers
  Current version of PDT: v1.0
    morphology + analytical level
    1.3M words (train/dev test/eval test)
  http://ufal.mff.cuni.cz
    Projects -> Treebank
  http://www.ldc.upenn.edu
    LDC2001T10 (PDT v1.0)
  http://www.clsp.jhu.edu: Workshop 2002
    Using TL for MT Generation