From Morphology to Semantics: the Prague Dependency Treebank Family

Jan Hajič
Charles University in Prague, Institute of Formal and Applied Linguistics
LINDAT-Clarin and META-NET (CZ), Czech Republic
Sep. 7, 2012, PDT @ LDC 20

History
- LDC: Penn Treebank I (1993). We want it too!
- But: LDC's unlikely to do Czech (soon)
- Prague (old-time structuralist) tradition: dependency
- 1995: decision to build our own treebank
- Started 1996 with a specification grant
- Tool development, annotation since 1997
- First PDT (1.0) published in 2001 (LDC2001T10)
  - Morphology and syntax only, but > 1M words
- PDT 2.0: 2006 (LDC2006T01)
  - Full annotation & correction of 1.0
- Other treebanks: 2004, 2012 (more to come, also by other groups)

Prague Dependency Treebanks: the Basics
General features
- Multilayered annotation, interlinked layers
- Dependency-based syntax (both surface and deep)
- Includes semantic functions, valency dictionary(-ies)
- Information structure of the sentence (topic/focus)
- Grammatical and textual co-reference; new: bridging
- New: discourse relations (not published yet)
Languages
- Czech, English (also parallel), Arabic
- Indonesian, Urdu, Russian (student work on samples)
- (Auto) conversion from other treebanks (25 so far; experimental)
- Spoken: Czech and English (non-parallel, dialogs)

The Layers
Three basic layers
- Morphological layer
- Surface syntax ("a") layer
- Tectogrammatical layer: underlying syntax, semantic roles (valency), information structure, co-reference (anaphora)
Format
- Prague Markup Language (XML + Schema)
- (Speech: additional layers: audio, transcript)

Tectogrammatical vs. Analytical (Surface) Syntax
- Example sentence: In practice, that procedure will require making of certified copies.
- Figure annotations: predicate verb; location
- TR (tectogrammatical representation): no function words
- Re-inserted elided actor of "making"
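The contrast on this slide can be sketched in a few lines of Python. This is an illustration only, not PML and not the output of any PDT tool: the `TNode` class, the choice of which tokens count as function words, the `#Gen` lemma for the re-inserted actor, and every functor label other than the predicate and location named on the slide are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Surface tokens of the example sentence (punctuation left out for brevity).
TOKENS = ["In", "practice", "that", "procedure", "will", "require",
          "making", "of", "certified", "copies"]


@dataclass
class TNode:
    """One node of a hypothetical tectogrammatical tree."""
    lemma: str
    functor: str                     # illustrative label, not checked against PDT
    head: Optional[int] = None       # index of the content word in TOKENS, if any
    aux: List[int] = field(default_factory=list)        # function words folded in
    children: List["TNode"] = field(default_factory=list)


# Assumed analysis of the sentence; only PRED and LOC come from the slide.
tree = TNode("require", "PRED", head=5, aux=[4], children=[
    TNode("practice", "LOC", head=1, aux=[0]),
    TNode("procedure", "ACT", head=3, children=[TNode("that", "RSTR", head=2)]),
    TNode("making", "PAT", head=6, children=[
        TNode("#Gen", "ACT"),  # re-inserted elided actor: no surface token
        TNode("copy", "PAT", head=9, aux=[7],
              children=[TNode("certified", "RSTR", head=8)]),
    ]),
])


def walk(node):
    """Yield the node and all its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


nodes = list(walk(tree))
# Function words have no node of their own; they hang off a content node.
function_words = [TOKENS[i] for n in nodes for i in n.aux]
# Nodes without a surface head were added during annotation.
restored = [n.lemma for n in nodes if n.head is None]
# Every surface token should be reachable from exactly one deep node.
covered = sorted(i for n in nodes
                 for i in ([n.head] if n.head is not None else []) + n.aux)

print(len(TOKENS), "surface tokens ->", len(nodes), "tectogrammatical nodes")
print("function words without their own node:", function_words)
print("re-inserted nodes:", restored)
```

Keeping the surface indices on each deep node is one simple way to model the "interlinked layers" idea: the deep tree stays free of function words, yet every surface token remains reachable from it.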

PDT-style Treebanks (written language)
Czech
- Prague Dependency Treebank
  - Complex annotation, all levels, additional annotation
- Translation of Penn Treebank, aligned
  - Tectogrammatical layer only, no information structure
  - Analytical, morphology: automatic tools
  - Will be manually revised later
English
- Re-annotation of Penn Treebank, TR only so far
Arabic
- New morphology, analytical syntax, sample TR only

The Prague Czech-English Dependency Treebank (PCEDT) 2.0
- Parallel treebank
  - Aligned trees
  - Aligned nodes
- Dependency style ("Prague")
  - (Surface) syntax
  - Syntax & semantics ("tectogrammatics")
- Penn Treebank translation into Czech
  - Example: Názory na její tříměsíční perspektivu se různí. (roughly: "Opinions on its three-month outlook differ.")
- 1 million words
- Published June 2012 (LDC2012T08)
- Also available through LINDAT-Clarin (with browsing and search tools) and META-SHARE

PCEDT 2.0: The Alignment(s)
Czech-English alignments
- Sentence level (manual, natural due to translation), at both syntactic levels: 1:1
- Word (node) level: automatic, test section manually corrected (in part): m:n
Between annotation levels
- Tectogrammatics to surface syntax: m:n, incl. 1:0
- Surface syntax to word level: 1:1
- PTB syntax to surface syntax
(PCEDT 2.0 @ LREC 2012)
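The link cardinalities between annotation levels (m:n, including 1:0) can be made concrete with a small Python sketch. The node identifiers and the `links` list below are invented for illustration and do not follow the PCEDT file format; the point is only that a deep node may link to several surface nodes, to one, or to none.

```python
from collections import defaultdict

# Hypothetical links between layers: (tectogrammatical ids, surface ids).
links = [
    (("t-require",), ("a-will", "a-require")),   # auxiliary folded into the verb
    (("t-procedure",), ("a-procedure",)),        # plain one-to-one link
    (("t-generated-actor",), ()),                # re-inserted node, no surface form
]


def link_type(t_ids, a_ids):
    """Describe a link by how many nodes it joins on each side."""
    return f"{len(t_ids)}:{len(a_ids)}"


link_types = [link_type(t_ids, a_ids) for t_ids, a_ids in links]

# Reverse index: from a surface node back to the deep node(s) it belongs to.
a_to_t = defaultdict(list)
for t_ids, a_ids in links:
    for a_id in a_ids:
        a_to_t[a_id].extend(t_ids)

# Deep nodes with no surface counterpart (the 1:0 case).
unanchored = [t_id for t_ids, a_ids in links if not a_ids for t_id in t_ids]

print(link_types)
print(dict(a_to_t))
print(unanchored)
```

Storing links as pairs of id tuples rather than a plain dictionary keeps the m:n case representable without changing the data shape.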

Tectogrammatical annotation
- Manual (both languages)
- Valency lexicons attached
  - Eng: links to PropBank
- Co-reference integrated (Eng: BBN + more; Czech: manually)
- Alignment
  - Nodes: automatic / corrected manually (in part)
- Example: This temblor-prone city dispatched inspectors, firefighters and other earthquake-trained personnel *-1 to aid San Francisco.

PDT-style Treebanks (spoken language)
Specifics of spoken language
- Short sentences but unclear segmentation
  - Sentence breaks must be (re)annotated
- Ungrammatical (esp. for colloquial Czech)
  - Annotation based on written-language rules difficult if not impossible
  - Additional decisions: Change annotation? Change the input? (but original must be kept)

Spoken corpora
Solution: speech reconstruction
- Keep audio, word-for-word transcript
  - Adds two layers to the annotation scheme: audio, transcript
- Add edited text: LINKS to original transcript / audio
- Annotate edited text (using usual guidelines)
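A minimal Python sketch of the speech reconstruction idea follows: the edited text is stored separately, and each edited token links back to the transcript tokens it came from, which in turn carry audio offsets. The class names, the sample utterance and the timings are all made up for illustration; they are not taken from any PDT spoken corpus or its file format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TranscriptToken:
    text: str
    start: float   # audio offset in seconds (invented values)
    end: float


@dataclass(frozen=True)
class EditedToken:
    text: str
    sources: Tuple[int, ...]   # indices into the transcript; empty if added by the editor


# Word-for-word transcript, kept untouched.
transcript = [
    TranscriptToken("uh", 0.0, 0.3),
    TranscriptToken("I", 0.3, 0.4),
    TranscriptToken("I", 0.4, 0.5),
    TranscriptToken("seen", 0.5, 0.9),
    TranscriptToken("it", 0.9, 1.1),
]

# Edited ("reconstructed") text that the usual guidelines can be applied to.
edited = [
    EditedToken("I", (1, 2)),    # repetition merged into one token
    EditedToken("saw", (3,)),    # form corrected, link kept
    EditedToken("it", (4,)),
    EditedToken(".", ()),        # punctuation added, no source
]


def audio_span(token: EditedToken) -> Optional[Tuple[float, float]]:
    """Return the audio interval behind an edited token, or None if it has no source."""
    if not token.sources:
        return None
    return (min(transcript[i].start for i in token.sources),
            max(transcript[i].end for i in token.sources))


linked = {i for token in edited for i in token.sources}
dropped = [transcript[i].text for i in range(len(transcript)) if i not in linked]

print(" ".join(t.text for t in edited))
print(audio_span(edited[0]), audio_span(edited[3]))
print("transcript tokens left out of the edited text:", dropped)
```

Because the links point from the edited layer to the original, the transcript and audio never need to be modified, which matches the requirement that the original must be kept.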

Accompanying Tools
TrEd (http://ufal.mff.cuni.cz/tred)
- Annotation, view/browse and search environment
- Open source, Perl
- Search and visualization: PML-TQ
  - Powerful query language for complex NLP annotation, esp. tree-based
Treex (http://ufal.mff.cuni.cz/treex)
- Modular NLP processing environment
- Easy handling of complex NLP-annotated data
- Modules exist for Czech and English data processing, incl. 3rd-party tools integrated into Treex
- CPAN-distributed

Lessons Learned (1)
Positive experience
- Dependency style
- Separate layers of annotation
  - Most importantly: separate surface syntax vs. deep syntax
- Specific format and specific graphical tools (TrEd et al.)
  - Stand-off annotation
- Spoken annotation: "trick" with speech reconstruction
  - Still, additional guidelines needed
Negative experience
- Lots of time spent on consistency checking
- Annotator training: guidelines too detailed
  - Prevents crowdsourcing
- Lots of time goes to final quality checking and corrections
  - Min. 3 PY for PDT, PCEDT

Lessons Learned (2)
For future projects
- Annotation in small teams
  - Phenomenon-by-phenomenon
- Ongoing quality checking, time allotted for final QC
  - Error discovered at annotation time much cheaper to correct
  - Consequences for tool selection ("intelligent" annotation SW)
- Need for excellent software and annotators' support
  - Programmers' efforts always underestimated
  - Helpdesk for annotators important (usually former annotator)
  - Organization, statistics, "watchdog"
  - Single repository for annotated data
- Payment
  - Annotators' incentives work (for speed of annotation)
  - Speed of annotation vs. quality: almost no correlation
Acknowledgements
- Ministry of Education, Czech Rep.: LC536, ME09008, MSM0021620838, 7Ennnn
- Czech Science Foundation: GAP406/10/0875, GPP406/10/P193, GA405/09/0729
- Information Society Programme: 1ET101120503
- Charles University student grants: 116310, 158010, 3537/2011
- European projects (in part): 034434, 249119, 257528, 034291, 231720, 247762
- Univ. research funds (in part): ("PRVOUK")

Happy Birthday!