Tutorial on Universal Dependencies

Infrastructure, resources and tools for UD

Joakim Nivre (1), Daniel Zeman (2), Filip Ginter (3), Francis M. Tyers (4, 5)

(1) Department of Linguistics and Philology, Uppsala University, Sweden
(2) Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic
(3) Department of Information Technology, University of Turku, Finland
(4) Giela ja kultuvrra instituhtta, UiT Norgga árktalaš universitehta, Tromsø, Norway
(5) Arvutiteaduse instituut, Tartu Ülikool, Estonia

UD as of Now: Treebanks
- How many?
  - Languages: 50
  - Treebanks: 72
  - Trees: 642,000
  - Words: 12,400,000
- Can I use them?
  - Creative Commons and GPL-like: 30
  - Creative Commons Non-Commercial: 42
- Where from?
  - http://universaldependencies.org
  - Official release preferred over GitHub
  - Currently officially released: 70 treebanks
  - Twist: test sets currently withheld

UD Treebanks Come in Many Flavors and Sizes
- Annotation:
  - POS and base dependency relations compulsory: 72 treebanks
  - ...and additionally:
    - Forms + Features + Lemmas: 58
    - Forms - Features + Lemmas: 4
    - Forms - Features - Lemmas: 7
    - No Forms: 3 (Arabic-NYUAD, English-ESL, Japanese-KTC), for licensing reasons
- Size:
  - Smallest: approx. 1000 words (Swedish Sign Language, Kazakh, Sanskrit)
  - Largest: Czech with 1.3M words, Russian with 980K words

CoNLL-U Format
- Derived from CoNLL-X; overall logic the same, details differ
- Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
- Only ID, UPOS, HEAD, and DEPREL compulsory
- Distinguishing features:
  - Sentence-level metadata part of the format
  - Explicit (and compulsory!) representation of the original text
  - DEPS field encodes the enhanced dependencies (non-tree structure)
  - MISC field allows arbitrary data stored for every word
  - Empty nodes only referred to from the enhanced representation
  - Words as opposed to tokens
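To make the ten-column layout concrete, here is a minimal reading sketch in plain Python. It assumes tab-separated columns, `#`-prefixed metadata lines of the form `key = value`, and a blank line after each sentence. The sample sentence and the `read_conllu` helper are made up for illustration and are not part of any UD tool.

```python
# Column names as listed on the slide.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# Made-up two-word sentence, used only for illustration.
SAMPLE = (
    "# sent_id = demo-1\n"
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

def read_conllu(text):
    """Yield (metadata, rows) for each sentence in a CoNLL-U string."""
    meta, rows = {}, []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if rows:
                yield meta, rows
            meta, rows = {}, []
        elif line.startswith("#"):
            # Sentence-level metadata such as "# text = ...".
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
        else:
            # One word per line, ten tab-separated fields.
            rows.append(dict(zip(COLUMNS, line.split("\t"))))
    if rows:
        yield meta, rows

sentences = list(read_conllu(SAMPLE))
meta, rows = sentences[0]
print(meta["text"], [(r["FORM"], r["HEAD"], r["DEPREL"]) for r in rows])
```

The underscore marks an unspecified field, which is why only some columns carry values in the sample.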

CoNLL-U Format: Tokens vs. Words
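The token/word distinction can be sketched as follows, assuming the CoNLL-U convention that a multiword token gets a range ID such as `1-2` followed by the syntactic words it covers, and that empty nodes get decimal IDs such as `8.1`. The Spanish contraction below and the `split_tokens_and_words` helper are illustrative only.

```python
# (ID, FORM) pairs for the made-up fragment "del mar":
# the surface token "del" covers the syntactic words "de" and "el".
ROWS = [("1-2", "del"), ("1", "de"), ("2", "el"), ("3", "mar")]

def split_tokens_and_words(rows):
    """Return (surface tokens, syntactic words) for one sentence."""
    tokens, words = [], []
    covered_until = 0  # last word ID covered by a multiword token
    for id_, form in rows:
        if "-" in id_:
            # Range line: a surface token spanning several words.
            _start, end = map(int, id_.split("-"))
            tokens.append(form)
            covered_until = end
        elif "." in id_:
            # Empty node: belongs to the enhanced representation only.
            continue
        else:
            words.append(form)
            # A word outside any range is also its own surface token.
            if int(id_) > covered_until:
                tokens.append(form)
    return tokens, words

tokens, words = split_tokens_and_words(ROWS)
print(tokens, words)
```

Dependency relations attach to the words, while the tokens are what is needed to reproduce the original text.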

UD Infrastructure - Requirements
- 83 treebank repositories
- 100+ contributors
- Online documentation consisting of roughly 14,000 web pages
- Guidelines, universal and language-specific
- Discussions, decision making, validation
- Regular, carefully checked official releases
- A comparatively small group of core staff running the show
- Budget: $0

UD Infrastructure - GitHub
- GitHub in use from Day 1
- Documentation and data first
- Followed by exclusive use of the issue tracker for discussions and proposals
- Before: many email chains, chaos
- Practically everything happens openly

UD is Open

Data
- A GitHub repository for every treebank: UD_{Language}-{Treebank}
- master branch holds the most recent official release
- dev branch holds development data, not guaranteed to be valid
- Some teams use GitHub for development, others only to submit their data prior to the release
- No strict requirements on the workflow
- Official release: LINDAT, May & November, all treebanks which contain valid data
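The naming and branch conventions above can be turned into a small helper. This is a sketch under assumptions: the GitHub organization name `UniversalDependencies`, the replacement of spaces by underscores in language names, and the example treebank are not stated on the slide, and the helper functions are hypothetical.

```python
def repo_name(language, treebank):
    """Build a repository name following UD_{Language}-{Treebank}."""
    # Assumption: multiword language names use underscores.
    return "UD_{}-{}".format(language.replace(" ", "_"), treebank)

def clone_command(language, treebank, branch="master"):
    """Return a git command fetching one branch of a treebank repository.

    master = most recent official release, dev = development data.
    """
    # Assumption: repositories live under this GitHub organization.
    url = "https://github.com/UniversalDependencies/{}.git".format(
        repo_name(language, treebank))
    return ["git", "clone", "--branch", branch, "--single-branch", url]

print(" ".join(clone_command("Finnish", "FTB")))
print(" ".join(clone_command("Finnish", "FTB", branch="dev")))
```

Cloning `master` gives released data; data from `dev` should be validated before use, since it is not guaranteed to be valid.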

Docs
- One set of documentation for every language (not treebank)
- A GitHub repository holding mostly markdown pages
- Special care taken to make it easy to add tree visualizations and examples
- Stubs pre-generated when adding a new language
- 11,000+ commits from 80+ contributors
- Automatically regenerated on every push and published on GitHub Pages
- The issue tracker for the docs repository is where all the UD activity is happening
- Hundreds of issues, thousands of replies
- Documentation system: http://spyysalo.github.io/annodoc/

Workflow and Organization
- Highly chaotic, distributed
- All contributors given broad edit rights to all data, docs, and tools repositories
- Fully trust-based setup, git giving a safety net
- Joakim holds the honorary title of Chief Cat Herder and looks after the project as a whole, is obeyed unconditionally

Validation
- Script to validate treebank data
- Passing is compulsory
- Format validation
- Runs automatically every time a treebank is updated
- Indispensable especially close to an official release date
- Contributors: do we validate?
- Release team: whom to help next?
- http://universaldependencies.org/validation.html
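As a rough idea of what format validation involves, the sketch below checks a few properties of a single sentence. It is not the official validation script. The specific checks (ten columns, compulsory UPOS/HEAD/DEPREL, consecutive word IDs, HEAD within range, exactly one root) are assumptions chosen for illustration, and `check_sentence` is a hypothetical helper.

```python
def check_sentence(lines):
    """Return a list of error messages for the word lines of one sentence."""
    errors = []
    word_ids, heads = [], []
    for n, line in enumerate(lines, start=1):
        cols = line.split("\t")
        if len(cols) != 10:
            errors.append(f"line {n}: expected 10 columns, got {len(cols)}")
            continue
        id_, upos, head, deprel = cols[0], cols[3], cols[6], cols[7]
        if not id_.isdigit():
            continue  # multiword-token ranges and empty nodes are skipped here
        word_ids.append(int(id_))
        for name, value in (("UPOS", upos), ("HEAD", head), ("DEPREL", deprel)):
            if value == "_":
                errors.append(f"line {n}: {name} is compulsory")
        if head.isdigit():
            heads.append(int(head))
    if word_ids != list(range(1, len(word_ids) + 1)):
        errors.append("word IDs must be 1, 2, 3, ...")
    if any(h > len(word_ids) for h in heads):
        errors.append("HEAD points outside the sentence")
    if heads.count(0) != 1:
        errors.append("expected exactly one root (HEAD = 0)")
    return errors

GOOD = ["1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_",
        "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_"]
BAD = ["1\tDogs\tdog\t_\t_\t_\t2\tnsubj\t_\t_",
       "2\tbark\tbark\tVERB\t_\t_\t1\tdep\t_\t_"]

print(check_sentence(GOOD))
print(check_sentence(BAD))
```

A real validator covers far more (label inventories, feature syntax, metadata), so this only shows the general shape of such checks.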

Content Validation
- Runs automatically every time a treebank is updated
- Reports suspicious syntactic constructions
- Passing not compulsory at the moment
- Contributors: Is there anything odd-looking in my data?
- Release team: Overview of guideline adoption
- http://universaldependencies.org/svalidation.html

Tools and Resources
- UD is not just the treebanks
- Parsers trained on UD data
- Large multilingual parsebanks
- Query tools for treebanks and parsebanks
- Libraries for handling CoNLL-U
- Tree visualization tools
- Annotation tools

Parsers
- UDPipe and SyntaxNet
  - State-of-the-art parsers, free
  - Full-stack parsers: raw text in, parses out
  - Models trained on all of UD
- UDPipe demo & Web API
  - UDPipe Web API: get parsed text with a simple HTTP request
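A request to the UDPipe Web API might be put together as below. The slide gives no endpoint or parameters, so everything here is an assumption to be checked against the service documentation: the endpoint URL, the parameter names (`data`, `model`, `tokenizer`, `tagger`, `parser`), the placeholder model name `english`, and the `result` field of the JSON reply. The sketch builds the request without sending it; `parse` performs the actual network call.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the UDPipe REST service; verify before use.
UDPIPE_URL = "http://lindat.mff.cuni.cz/services/udpipe/api/process"

def build_request(text, model="english"):
    """Prepare a POST request asking for tokenizing, tagging and parsing."""
    params = {
        "data": text,      # raw text in
        "model": model,    # placeholder model name (assumption)
        "tokenizer": "",   # empty value = run this step with defaults (assumption)
        "tagger": "",
        "parser": "",
    }
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(UDPIPE_URL, data=body)

def parse(text, model="english"):
    """Send the request and return the parsed output (needs network access)."""
    with urllib.request.urlopen(build_request(text, model)) as response:
        reply = json.loads(response.read().decode("utf-8"))
    return reply["result"]  # assumed to hold the CoNLL-U output

request = build_request("Hello world")
print(request.get_method(), request.full_url)
```

Calling `parse("Hello world")` would then return the parse as text, provided the assumptions above match the live service.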

ParseySaurus
- Major improvement upon SyntaxNet's Parsey's Cousins
- Considerably improved models released mid-March 2017
- http://tiny.cc/psaurus (description)
- http://tiny.cc/psaurus-base (numbers)

ParseySaurus
- Average = 78%
- Median = 81%

Parsebanks
- UD-parsed corpora for 45 languages
- Data: CommonCrawl + Wiki + Perseus
- Parses: UDPipe
- Over 90B words total, 630GB zipped CoNLL-U files
- Languages: Ancient Greek, Arabic, Basque, Bulgarian, Catalan, ChineseT, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Norwegian-Bokmaal, Norwegian-Nynorsk, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Urdu, Uyghur, and Vietnamese

Syntactic Query: dep_search
- http://bionlp-www.utu.fi/dep_search
- Relatively expressive query language, especially geared towards dependencies and rich morphology
- Indexed:
  - Latest UD official release
  - dev branches, reindexed on every push
  - Up to 2 million trees for every language from the UD Parsebanks
- Web and API access
- Used by some during annotation
- Also serves as content validation back-end
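To illustrate the kind of question a dependency-oriented query answers, here is a small in-memory sketch in plain Python. It does not use the dep_search query language or its API. The example sentence, the word dictionaries, and the `query` helper are all made up for illustration.

```python
# Made-up parsed sentence "Dogs bark loudly".
WORDS = [
    {"id": 1, "form": "Dogs", "upos": "NOUN", "feats": "Number=Plur",
     "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "bark", "upos": "VERB", "feats": "_",
     "head": 0, "deprel": "root"},
    {"id": 3, "form": "loudly", "upos": "ADV", "feats": "_",
     "head": 2, "deprel": "advmod"},
]

def matches(word, by_id, upos=None, deprel=None, feat=None, head_upos=None):
    """Check one word against optional POS, relation, feature and head conditions."""
    if upos is not None and word["upos"] != upos:
        return False
    if deprel is not None and word["deprel"] != deprel:
        return False
    if feat is not None and feat not in word["feats"].split("|"):
        return False
    if head_upos is not None:
        head = by_id.get(word["head"])  # None for the root (head 0)
        if head is None or head["upos"] != head_upos:
            return False
    return True

def query(words, **conditions):
    """Return the forms of all words satisfying every given condition."""
    by_id = {w["id"]: w for w in words}
    return [w["form"] for w in words if matches(w, by_id, **conditions)]

# Plural nouns that are the subject of a verb.
print(query(WORDS, upos="NOUN", deprel="nsubj", feat="Number=Plur", head_upos="VERB"))
```

A dedicated query tool adds indexing and a compact query syntax on top of this idea, which is what makes searching millions of trees practical.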

Syntactic Query: PML Tree Query
- http://lindat.mff.cuni.cz/services/pmltq/
- A very expressive query language
- Indexed: official UD releases

Udapi
- A library and command line tool for processing UD data
- Python, Java, Perl
- Format conversions
- Initial v1-v2 conversion
- Validation tests
- Evaluation, filtering, statistics
- Tree visualization
- https://udapi.github.io

Tree Visualization Tools
    cat en-ud-dev.conllu | udapy -T | less -R

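Tree Visualization Tools
    cat en-ud-dev.conllu | udapy write.tikz

[Figure: rendered dependency tree for "Also, they have great customer service and a very knowledgeable staff"]

Form            UPOS    Relation
Also            ADV     advmod
,               PUNCT   punct
they            PRON    nsubj
have            VERB    root
great           ADJ     amod
customer        NOUN    compound
service         NOUN    obj
and             CCONJ   cc
a               DET     det
very            ADV     advmod
knowledgeable   ADJ     amod
staff           NOUN    conj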

Tree Visualization Tools
- http://spyysalo.github.io/conllu.js/
- http://spyysalo.github.io/annodoc/sdparse.html

Annotation Tools
- No official annotation tool (yet)
- A list of tools: http://universaldependencies.org/tools.html
- At present, none downright outstanding

Questions?