Treebank Usage
Stockholm University

Usage - Overview
1. Training a chunker / parser on a treebank = learning a probabilistic context-free grammar from a treebank
2. Evaluating a parser against a treebank
3. Using a treebank in education: for language learning, for linguistics education

Parser usages
A good introduction: Manning and Schütze: Foundations of Statistical NLP. MIT Press, 1999. Chap. 11: Probabilistic Context Free Grammars; Chap. 12: Probabilistic Parsing.
Three ways to use a probabilistic parser:
1. Probabilities for determining the best sentence: when the actual input is uncertain (e.g. a word lattice in speech recognition), to determine the most probable sentence.
2. Probabilities for faster parsing: to find the best parse more quickly.
3. Probabilities for choosing between parses: to choose the most likely parse tree among the many parse trees for the input string.

Grammar learning from a treebank
Automatic learning of grammars based solely on text input is impossible / hard, unless negative evidence is included. But automatic learning of grammars from treebanks is easy, and it provides probabilities on grammar rules: count all derivations, then compute the probabilities from the frequencies. The probabilities of all rules with the same mother node must sum to 1.
The Penn Treebank yields more than 10,000 rules; only about 4,000 appear more than once. Which rule is the most frequent?

  Rule           Frequency
  -> Det NN      2533
  -> Det AP NN   1255
  -> NN           501
  -> NE           388
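The counting procedure described above can be sketched in a few lines of Python. The tuple-based tree encoding and the two toy trees below are illustrative assumptions, not the slides' data; the logic (count local trees, then normalize per mother node) is exactly the procedure the slide describes.

```python
from collections import Counter, defaultdict

# A tiny hand-made "treebank": trees as (label, children...) tuples;
# leaves are plain strings. The two trees are illustrative only.
treebank = [
    ("S", ("NP", ("Det", "the"), ("NN", "dog")),
          ("VP", ("V", "barks"))),
    ("S", ("NP", ("NN", "Mary")),
          ("VP", ("V", "heard"), ("NP", ("NN", "Sue")))),
]

def productions(tree):
    """Yield (mother, daughter_labels) for every local tree."""
    label, *children = tree
    if children and isinstance(children[0], tuple):
        yield (label, tuple(c[0] for c in children))
        for c in children:
            yield from productions(c)

# 1. Count all derivations (local trees) in the treebank.
counts = Counter(p for t in treebank for p in productions(t))

# 2. Compute rule probabilities from the frequencies, so that the
#    probabilities of all rules with the same mother node sum to 1.
totals = defaultdict(int)
for (mother, _), n in counts.items():
    totals[mother] += n
pcfg = {rule: n / totals[rule[0]] for rule, n in counts.items()}

for (mother, daughters), p in sorted(pcfg.items()):
    print(f"{mother} -> {' '.join(daughters)}  {p:.2f}")
```

With the two toy trees, NP expands to Det NN once and to NN twice, so the estimated probabilities are 1/3 and 2/3, which sum to 1 as required.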

Problems with rule probabilities
Lexicalization needs to be taken into account. In a pure PCFG the probability of a rule such as VP -> V NP is independent of the verb. This is clearly wrong from a linguistic point of view.
Rule probabilities also depend on grammatical functions. Compare subject and object positions in English: an NP is much more likely to be realized as a pronoun in subject position (NP -> Pron), and to be realized with a prepositional attribute in object position (NP -> NP PP).

One solution: the grandparent node
Consider a tree with one NP in subject position and one NP in object position. Distinguishing local trees based on the grandparent node via node relabeling leads to improved parsing results. This is a way to take the derivation history into account! Transform the treebank trees, then proceed with PCFG extraction as before (Johnson, 1997): ~80% labeled precision and recall.

Relatives of probabilistic CF parsing
DOP: Data-oriented parsing (Rens Bod) is parsing via the recombination of parse trees of arbitrary depth.
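The node-relabeling transform can be sketched as follows. This is a minimal version of parent annotation in the spirit of Johnson (1997); the tuple tree encoding and the `^` separator in labels are assumptions for illustration, not the slides' notation.

```python
def annotate_parent(tree, parent=None):
    """Relabel every phrasal node with its parent's label, so that e.g.
    an NP directly under S becomes 'NP^S' and one under VP becomes
    'NP^VP'. Trees are (label, children...) tuples; leaves are strings."""
    label, *children = tree
    if not children or not isinstance(children[0], tuple):
        return tree  # preterminal: leave POS tags unannotated
    new_label = f"{label}^{parent}" if parent else label
    return (new_label, *(annotate_parent(c, label) for c in children))

t = ("S", ("NP", ("NN", "Mary")),
          ("VP", ("V", "heard"), ("NP", ("NN", "Sue"))))
annotated = annotate_parent(t)
print(annotated)
# The subject NP becomes NP^S and the object NP becomes NP^VP, so
# subsequent PCFG extraction learns separate expansion probabilities
# for the two positions.
```

After this transform, the ordinary rule-counting procedure yields distinct rules NP^S -> Pron and NP^VP -> NP PP, which is how the derivation history enters the grammar.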

DOP: Data-oriented parsing
Example: "Mary heard Sue."
Problems:
- How to store all possible trees.
- Slow parsing, since the highest-probability tree cannot be found efficiently; the Viterbi algorithm cannot be used.
- DOP is similar to parsing with Probabilistic Tree Adjoining Grammars.

Parser Evaluation: PARSEVAL
Labeled precision and recall of constituents:
  Precision P = # correct constituents / # constituents in parser output
  Recall R = # correct constituents / # constituents in gold standard
In addition: the number of crossing branches between parser output and gold standard.
In our example, the parser finds 6 constituents, all of them correct, while the gold standard (the treebank) contains 7:
  Precision = 6/6 = 1.0
  Recall = 6/7 = 0.86
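The labeled precision and recall computation can be sketched directly over sets of labeled spans. The span sets below are hypothetical (the slide's example trees were lost in extraction), but they reproduce the 6-out-of-7 situation discussed above.

```python
# PARSEVAL-style labeled precision / recall over constituent spans.
def parseval(parser_out, gold):
    """Each constituent is a (label, start, end) triple; a constituent
    is correct if the identical triple occurs in the gold standard."""
    correct = len(set(parser_out) & set(gold))
    precision = correct / len(parser_out)
    recall = correct / len(gold)
    return precision, recall

# Hypothetical example: the parser finds 6 constituents, all correct;
# the gold standard contains 7 (the parser missed one NP).
parser_out = [("S", 0, 7), ("NP", 0, 2), ("VP", 2, 7),
              ("NP", 3, 7), ("PP", 5, 7), ("NP", 6, 7)]
gold = parser_out + [("NP", 3, 4)]

p, r = parseval(parser_out, gold)
print(f"Precision = {p:.2f}, Recall = {r:.2f}")
# Precision = 1.00, Recall = 0.86
```

A full PARSEVAL scorer (such as the EVALB tool) additionally counts crossing brackets, i.e. parser constituents whose span overlaps a gold span without either containing the other.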

Problems of PARSEVAL
- The PARSEVAL measures are not very discriminating: Charniak's ('96) vanilla PCFG, which ignores all lexical content, worked well.
- Under the PARSEVAL measure it is quite easy to reproduce the tree structures given by the Penn Treebank.
- PARSEVAL measures success at the level of individual decisions; in NLP, consecutive decisions are more important and harder.
- The Penn Treebank's problem: the trees are too flat, and a non-standard adjunct structure is given to post-noun-head modifiers.
- The PARSEVAL measure seems too harsh on some specific problems.

Language / Linguistics Learning
Learning tasks over treebanks, from easy to difficult:
- Viewing / searching trees
- Labeling trees
- Combining subtrees
- Comparing trees
- Evaluating trees
- Drawing trees
Some problems: How to find rare constructions? How to avoid confusing the student with ungrammatical examples?

Interactive syntactic trees (from Eckhard Bick)
BuildTree: drag & drop constituents.

LabelTree: drag & drop syntactic functions.

Treebanks in Linguistics Courses
H. van Halteren: Syntactic Databases in the Classroom. In: Excursions into Syntactic Databases. Rodopi, 1997. Experiments in English syntax courses at Nijmegen University, based on the TOSCA Treebank. CLUE: Computer Library of Utterances for Exercises in Syntax.

CLUE Exercise Types
1. Mark an empty node, ask for its label: What is the label for node X?
2. Give a label, ask for the node (unlabeled tree): Which node is a prepositional complement?
3. Show a partial tree, ask for its reconstruction.
4. Show an incorrect tree, ask for a correction.

Studien-CD Linguistik
An introduction to (German) linguistics, developed at the University of Zurich (2001-2004) and published with an introductory linguistics book. It contains 100 German syntax trees across 10 different text genres (novel, medical abstract, weather report, interview, newspaper report, fairy tale), each in two views (complex vs. easy), to be used in self-learning as examples for word classes and for syntax structures.

Summary
- There is a straightforward way to derive a probabilistic context-free grammar from a treebank.
- But this PCFG will need optimization (e.g. lexicalization, context) for high-accuracy parsing.
- It is difficult to establish a good measure for parser evaluation (i.e. tree comparison). PARSEVAL is the measure in widespread use.
- Treebanks can be used in syntax education.