Treebank mining with GrETEL. Liesbeth Augustinus Frank Van Eynde

Treebank mining with GrETEL
Liesbeth Augustinus, Frank Van Eynde
GrETEL tutorial, 27 March 2015

GrETEL
Greedy Extraction of Trees for Empirical Linguistics
Search engine for treebanks
Treebank = syntactically annotated corpus
  o Penn Treebank (English)
  o TüBa (German)
  o LASSY, CGN, SoNaR (Dutch)

NEDERBOOMS
Exploitation of Dutch treebanks for research in linguistics
CLARIN project, October 2010 to February 2012
Goals:
  o User-friendly tools
  o Fast and accurate
Result:
  o GrETEL 1.0
  o http://nederbooms.ccl.kuleuven.be

GrETEL 2.0
Update of GrETEL 1.0
CLARIN project, June 2013 to July 2014
Goals:
  o Improve GUI
  o Make more data accessible
Result:
  o GrETEL 2.0
  o http://gretel.ccl.kuleuven.be

TREEBANKS
CGN treebank
  o Spoken Dutch
  o Stylistic & regional differences: conversations vs read texts, NL vs VL
  o ± 1M words
  o 130k sentences
  o Manually corrected
LASSY small
  o Written Dutch
  o Stylistic differences: Wikipedia vs legal texts
  o ± 1M words
  o 65k sentences
  o Manually corrected

TREEBANKS
SoNaR
  o Written Dutch
  o Stylistic differences: Wikipedia vs legal texts
  o ± 500M words
  o 41M sentences
  o Not corrected

GrETEL
Search engine for treebanks
Parser
  o E.g. Alpino (Van Noord 2006)

ALPINO PARSER
Dit is een zin. ('This is a sentence.') >> ALPINO parser >> XML trees
Query language: XPath

XPATH
//node[@cat="smain"
  and node[@rel="su" and @pt="vnw" and @lemma="dit"]
  and node[@rel="hd" and @pt="ww" and @lemma="zijn"]
  and node[@rel="predc" and @cat="np"
    and node[@rel="det" and @pt="lid" and @lemma="een"]
    and node[@rel="hd" and @pt="n" and @lemma="zin"]]]
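To make the query above concrete, here is a minimal Python sketch that applies the same conditions to a toy tree for "Dit is een zin." The XML is a hand-written simplification that keeps only the attributes named in the query; real Alpino output carries many more. The helper names `children_with` and `matches_query` are hypothetical. The standard-library ElementTree module supports only a subset of XPath, so the conditions are written out in plain Python rather than passed in as the XPath string.

```python
import xml.etree.ElementTree as ET

# Assumption: a hand-made, simplified Alpino-style tree for "Dit is een zin."
ALPINO_XML = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" pt="vnw" lemma="dit" word="Dit"/>
      <node rel="hd" pt="ww" lemma="zijn" word="is"/>
      <node rel="predc" cat="np">
        <node rel="det" pt="lid" lemma="een" word="een"/>
        <node rel="hd" pt="n" lemma="zin" word="zin"/>
      </node>
    </node>
  </node>
</alpino_ds>
""".strip()


def children_with(parent, **attrs):
    """Return the direct <node> children that carry all the given attribute values."""
    return [
        child for child in parent.findall("node")
        if all(child.get(name) == value for name, value in attrs.items())
    ]


def matches_query(node):
    """Mirror the XPath on the slide: an smain with subject 'dit', head 'zijn', NP 'een zin'."""
    if node.get("cat") != "smain":
        return False
    has_subject = bool(children_with(node, rel="su", pt="vnw", lemma="dit"))
    has_head = bool(children_with(node, rel="hd", pt="ww", lemma="zijn"))
    has_predc = any(
        children_with(np, rel="det", pt="lid", lemma="een")
        and children_with(np, rel="hd", pt="n", lemma="zin")
        for np in children_with(node, rel="predc", cat="np")
    )
    return has_subject and has_head and has_predc


root = ET.fromstring(ALPINO_XML)
# "//node" in XPath corresponds to visiting every <node> element in the tree.
hits = [n for n in root.iter("node") if matches_query(n)]
print(len(hits), "matching node(s)")
```

Each `and node[...]` clause in the XPath becomes one existence check on the children, which is why sibling order plays no role in whether a tree matches.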

GrETEL
2 search modes:
  o Example-based search
    advantage: no or limited knowledge of data structure and/or formal query languages needed
  o XPath search

The user:
  1. Example sentence
  2. Inspect parse
  3. Indicate relevant items of the sentence
  4. Select treebank
  5. (Adapt XPath)
  6. Inspect results
GrETEL:
  o Parser (Alpino)
  o Automatically generate XPath expression
  o Present results
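The step "automatically generate XPath expression" can be pictured with a short Python sketch. This is not GrETEL's actual implementation. It only assumes that the user's selection can be represented as a tree in which every node keeps just the attributes the user marked as relevant. The class `PatternNode` and the function `to_xpath` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PatternNode:
    """One node of the query tree; attrs holds only the attributes the user kept."""
    attrs: dict
    children: list = field(default_factory=list)


def to_xpath(node, top=True):
    """Serialise a PatternNode into an XPath expression in the style used on the slides."""
    conditions = [f'@{name}="{value}"' for name, value in node.attrs.items()]
    conditions += [to_xpath(child, top=False) for child in node.children]
    step = "//node" if top else "node"
    if not conditions:
        return step
    return f"{step}[{' and '.join(conditions)}]"


# Selection for an example such as "Hij heeft Marie horen zingen.":
# keep the auxiliary's lemma, and only the category and word class of the infinitives.
inner_vc = PatternNode({"rel": "vc", "cat": "inf"},
                       [PatternNode({"rel": "hd", "pt": "ww"})])
outer_vc = PatternNode({"rel": "vc", "cat": "inf"},
                       [PatternNode({"rel": "hd", "pt": "ww"}), inner_vc])
ipp_pattern = PatternNode({"cat": "smain"},
                          [PatternNode({"rel": "hd", "pt": "ww", "lemma": "hebben"}),
                           outer_vc])

print(to_xpath(ipp_pattern))
```

Dropping an attribute from a `PatternNode` makes the query more general, which is the effect of leaving an item unmarked in the selection matrix.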

OUTLINE
  o GrETEL in a nutshell
  o GrETEL demo
      o Case study
      o Search options
  o Conclusions

CASE STUDY
Infinitivus Pro Participio (IPP) constructions in Dutch
  Hij heeft Marie horen/*gehoord zingen.
  'He has heard Mary sing.'
  dat Jan niet is kunnen/*gekund komen.
  'that Jan was not able to come.'

GrETEL ONLINE
Screenshots: input, input parse, selection matrix, selection guidelines, treebank selection, query overview

RESULTS
IPP constructions in CGN
  Hij heeft Marie horen zingen. 'He has heard Mary sing.' 344 hits
  dat Jan niet is kunnen komen. 'that Jan was not able to come.' 24 hits
Screenshots: results table, results data (greedy search), results trees

MORE RESULTS
Option 1: Use different queries
  Hij heeft Marie horen zingen. 'He has heard Mary sing.' 344 hits
  dat hij Marie heeft horen zingen. 'that he has heard Mary sing.' 79 hits
  dat Jan niet is kunnen komen. 'that Jan was not able to come.' 24 hits
  Jan is niet kunnen komen. 'Jan was not able to come.' 120 hits
  TOTAL: 567 hits

MORE RESULTS
Option 2: Adapt query (via XPath Search)

Generated query:
//node[@cat="smain"
  and node[@rel="hd" and @pt="ww" and @lemma="hebben"]
  and node[@rel="vc" and @cat="inf"
    and node[@rel="hd" and @pt="ww"]
    and node[@rel="vc" and @cat="inf"
      and node[@rel="hd" and @pt="ww"]]]]

Adapted query:
//node[(@cat="smain" or @cat="ssub")
  and node[@rel="hd" and (@lemma="hebben" or @lemma="zijn")]
  and node[@rel="vc" and @cat="inf"
    and node[@rel="hd" and @pt="ww"]
    and node[@rel="vc" and @cat="inf"
      and node[@rel="hd" and @pt="ww"]]]]

566 hits (one sentence matches twice: fva400364 10)
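The difference between hits and matching sentences can be illustrated with a small Python sketch of the adapted query. The corpus is invented: the sentence ids, the helper `ipp_clause`, and the reduced trees are all assumptions for illustration, not CGN data. The conditions are written in plain Python because the standard-library ElementTree module does not evaluate `and`/`or` inside XPath predicates.

```python
import xml.etree.ElementTree as ET


def kids(parent, predicate):
    """Direct <node> children of parent that satisfy predicate."""
    return [child for child in parent.findall("node") if predicate(child)]


def is_verb_head(n):
    return n.get("rel") == "hd" and n.get("pt") == "ww"


def is_inf_vc(n):
    return n.get("rel") == "vc" and n.get("cat") == "inf"


def is_aux_head(n):
    return n.get("rel") == "hd" and n.get("lemma") in ("hebben", "zijn")


def is_ipp(node):
    """Adapted query: smain or ssub, auxiliary hebben/zijn, two stacked infinitival complements."""
    if node.get("cat") not in ("smain", "ssub"):
        return False
    if not kids(node, is_aux_head):
        return False
    return any(
        kids(outer, is_verb_head)
        and any(kids(inner, is_verb_head) for inner in kids(outer, is_inf_vc))
        for outer in kids(node, is_inf_vc)
    )


def ipp_clause(cat, aux, v1, v2):
    """Build a toy clause: auxiliary, infinitive v1, embedded infinitive v2."""
    return (
        f'<node cat="{cat}">'
        f'<node rel="hd" pt="ww" lemma="{aux}"/>'
        f'<node rel="vc" cat="inf">'
        f'<node rel="hd" pt="ww" lemma="{v1}"/>'
        f'<node rel="vc" cat="inf"><node rel="hd" pt="ww" lemma="{v2}"/></node>'
        f'</node></node>'
    )


# A plain perfect with a participle ("Hij heeft gezongen"): should not match.
PLAIN_PERFECT = (
    '<node cat="smain"><node rel="hd" pt="ww" lemma="hebben"/>'
    '<node rel="vc" cat="ppart"><node rel="hd" pt="ww" lemma="zingen"/></node></node>'
)

# Invented corpus; s4 contains two IPP clauses in one sentence.
corpus = {
    "s1": ipp_clause("smain", "hebben", "horen", "zingen"),
    "s2": ipp_clause("ssub", "zijn", "kunnen", "komen"),
    "s3": PLAIN_PERFECT,
    "s4": ipp_clause("smain", "hebben", "horen", "zingen")
          + ipp_clause("ssub", "zijn", "kunnen", "komen"),
}

hits_per_sentence = {}
for sentence_id, clauses in corpus.items():
    root = ET.fromstring(f'<node cat="top">{clauses}</node>')
    hits_per_sentence[sentence_id] = sum(is_ipp(n) for n in root.iter("node"))

total_hits = sum(hits_per_sentence.values())
matching_sentences = sum(1 for count in hits_per_sentence.values() if count > 0)
print(total_hits, "hits in", matching_sentences, "sentences")
```

A sentence with two matching clauses contributes two hits but only one sentence, which is how the number of hits and the number of sentences can differ.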

ADVANCED SEARCH

SEARCH OPTIONS
Below annotation matrix

WORD ORDER
PP-over-V
  o V + PP
    dat hij opstond met een kater.
    '... that he woke up with a hangover.'
  o PP + V
    dat hij met een kater opstond.
    lit. 'that he with a hangover woke-up'
    '... that he woke up with a hangover.'

WORD ORDER
PP-over-V in LASSY small
  o V + PP
    dat hij opstond met een kater.
    '... that he woke up with a hangover.'
  o 2,890 hits in 2,764 sentences
  o But: results include PP + V as well!

WORD ORDER
PP-over-V in LASSY small
  o V + PP + word order option
    dat hij opstond met een kater.
    '... that he woke up with a hangover.'
  o 787 hits in 775 sentences
  o Results only include V + PP
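Why the plain query also returns PP + V, and what a word order option has to add, can be sketched in Python. The sketch assumes each node carries a `begin` attribute with its token position, as in Alpino-style XML. The toy clauses, the `rel="mod"` label on the PP, and the function names are assumptions for illustration, and the check shown is not necessarily how GrETEL implements the option.

```python
import xml.etree.ElementTree as ET

# "dat hij opstond met een kater": verb at position 2, PP starts at 3.
V_PP = ET.fromstring(
    '<node cat="ssub">'
    '<node rel="su" pt="vnw" lemma="hij" begin="1"/>'
    '<node rel="hd" pt="ww" lemma="opstaan" begin="2"/>'
    '<node rel="mod" cat="pp" begin="3"/>'
    '</node>'
)

# "dat hij met een kater opstond": PP starts at 2, verb at position 5.
PP_V = ET.fromstring(
    '<node cat="ssub">'
    '<node rel="su" pt="vnw" lemma="hij" begin="1"/>'
    '<node rel="mod" cat="pp" begin="2"/>'
    '<node rel="hd" pt="ww" lemma="opstaan" begin="5"/>'
    '</node>'
)


def matches(clause, respect_order):
    """True if the clause has a verbal head and a PP, optionally requiring verb before PP."""
    if clause.get("cat") != "ssub":
        return False
    verbs = [n for n in clause.findall("node")
             if n.get("rel") == "hd" and n.get("pt") == "ww"]
    pps = [n for n in clause.findall("node") if n.get("cat") == "pp"]
    for verb in verbs:
        for pp in pps:
            # Without the order check, any verb/PP pair counts, in either order.
            if not respect_order or int(verb.get("begin")) < int(pp.get("begin")):
                return True
    return False


for name, clause in (("V + PP", V_PP), ("PP + V", PP_V)):
    print(name, matches(clause, respect_order=False), matches(clause, respect_order=True))
```

The structural conditions only say which daughters a node must have, so both orders satisfy them. Comparing positions is the extra constraint that separates V + PP from PP + V.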

IGNORE TOP NODE

CONTEXT

CONCLUSIONS
  o GrETEL: search engine for Dutch treebanks
  o Input = natural language example
  o Output = sample of similar sentences
  o Syntactic concordancer
  o Available online (via Mozilla Firefox)
  o No installation required

Try it yourself!
http://gretel.ccl.kuleuven.be
Thanks for your attention!