Corpora and Statistical Methods Lecture 11 Albert Gatt

Part 2: Statistical parsing

Preliminary issues: how parsers are evaluated

Evaluation
The issue: what objective criterion are we trying to maximise? In other words, under what objective function can I say that my parser does well (and how well)? This requires a gold standard.
Possibilities:
- strict match of the candidate parse against the gold standard
- match of components of the candidate parse against gold standard components

Evaluation
A classic evaluation metric is PARSEVAL, an initiative to compare parsers on the same data. It was not initially concerned with stochastic parsers, and it evaluates parser output piece by piece.
Main components: PARSEVAL compares the gold standard tree (typically the tree in a treebank) to the parser's tree and computes:
- precision
- recall
- crossing brackets

PARSEVAL: labeled recall
labeled recall = (# correct nodes in the candidate parse) / (# nodes in the treebank parse)
A correct node is a node in the candidate parse which:
- has the same node label (labels were originally omitted from PARSEVAL to avoid theoretical conflicts)
- spans the same words

PARSEVAL: labeled precision
labeled precision = (# correct nodes in the candidate parse) / (# nodes in the candidate parse)
That is, the proportion of correctly labelled and correctly spanning nodes in the candidate. A sketch of how these two scores could be computed is given below.
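
A minimal Python sketch of how labeled precision and recall could be computed, assuming each parse has been reduced to a set of (label, start, end) constituents; the representation and function name are illustrative, not part of any reference PARSEVAL implementation:

    def labeled_precision_recall(candidate, gold):
        # candidate, gold: sets of (label, start, end) tuples,
        # one per non-terminal node in the respective tree
        correct = candidate & gold          # same label and same span
        precision = len(correct) / len(candidate) if candidate else 0.0
        recall = len(correct) / len(gold) if gold else 0.0
        return precision, recall

    # Toy example (illustrative spans):
    gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
    cand = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)}
    print(labeled_precision_recall(cand, gold))   # (0.75, 0.75)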

Combining precision and recall
As usual, precision and recall can be combined into a single F-measure:
F = 1 / (α(1/P) + (1 − α)(1/R))
With α = 0.5 this reduces to the familiar F1 = 2PR / (P + R).
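
As a worked illustration with made-up numbers: a parser with P = 0.88 and R = 0.92 scores F1 = (2 × 0.88 × 0.92) / (0.88 + 0.92) ≈ 0.90.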

PARSEVAL: crossing brackets
The number of brackets in the candidate parse which cross brackets in the treebank parse, e.g. where the treebank has ((X Y) Z) and the candidate has (X (Y Z)).
Unlike precision and recall, this is an objective function to minimise.
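
A minimal sketch of a crossing-bracket test, assuming constituents are represented as (start, end) spans; the code is illustrative, not the PARSEVAL reference implementation:

    def crosses(span, other):
        # Two spans cross if they overlap but neither contains the other.
        (i, j), (k, l) = span, other
        return (i < k < j < l) or (k < i < l < j)

    def crossing_brackets(candidate_spans, gold_spans):
        # Number of candidate constituents that cross at least one gold constituent.
        return sum(1 for c in candidate_spans
                   if any(crosses(c, g) for g in gold_spans))

    # ((X Y) Z) vs (X (Y Z)) over words 0..2:
    gold = [(0, 2), (0, 3)]        # (X Y), ((X Y) Z)
    cand = [(1, 3), (0, 3)]        # (Y Z), (X (Y Z))
    print(crossing_brackets(cand, gold))   # 1: (Y Z) crosses (X Y)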

Current performance
Current parsers achieve:
- ca. 90% precision
- over 90% recall
- about 1% cross-bracketed constituents

Some issues with PARSEVAL
1. These measures evaluate parses at the level of individual decisions (nodes); they ignore the difficulty of reaching a globally correct solution by carrying out a correct sequence of decisions.
2. Success on crossing brackets depends on the kind of parse trees used: the Penn Treebank has very flat trees (not much embedding), so the likelihood of crossed brackets decreases.
3. In PARSEVAL, if a constituent is attached lower in the tree than in the gold standard, all its daughters are counted as wrong.

Probabilistic parsing with PCFGs: the basic algorithm

The basic PCFG parsing algorithm
Many statistical parsers use a version of the CYK algorithm.
Assumptions:
- CFG productions are in Chomsky Normal Form: A → B C or A → a.
- Indices are used between words: (0) Book (1) the (2) flight (3) through (4) Houston (5)
Procedure (bottom-up):
- traverse the input sentence left-to-right
- use a chart to store constituents, their span and their probability

Probabilistic CYK: example PCFG
S → NP VP [.80]
NP → Det N [.30]
VP → V NP [.20]
V → includes [.05]
Det → the [.4]
Det → a [.4]
N → meal [.01]
N → flight [.02]

Probabilistic CYK: initialisation
The flight includes a meal.

    for j = 1 to length(string) do:
        // lexical lookup
        chart[j-1, j] := {X : X -> word in G}
        // syntactic lookup
        for i = j-2 down to 0 do:
            chart[i, j] := {}
            for k = i+1 to j-1 do:
                for each A -> B C do:
                    if B in chart[i, k] and C in chart[k, j]:
                        chart[i, j] := chart[i, j] ∪ {A}

The chart, indexed by word boundaries 0 to 5, starts empty.

Probabilistic CYK: lexical step
The flight includes a meal.
Lexical lookup for "the": chart[0,1] = Det (.4)

Probabilistic CYK: lexical step
Lexical lookup for "flight": chart[1,2] = N (.02)

Probabilistic CYK: syntactic step
Syntactic lookup over span (0,2): NP → Det N gives chart[0,2] = NP (.3 × .4 × .02 = .0024)

Probabilistic CYK: lexical step
Lexical lookup for "includes": chart[2,3] = V (.05)

Probabilistic CYK: lexical step
Lexical lookup for "a": chart[3,4] = Det (.4)

Probabilistic CYK: lexical step
Lexical lookup for "meal": chart[4,5] = N (.01)

Probabilistic CYK: syntactic step
Syntactic lookup over span (3,5): NP → Det N gives chart[3,5] = NP (.3 × .4 × .01 = .0012)

Probabilistic CYK: syntactic step
Syntactic lookup over span (2,5): VP → V NP gives chart[2,5] = VP (.2 × .05 × .0012 = .000012)

Probabilistic CYK: syntactic step
Syntactic lookup over span (0,5): S → NP VP gives chart[0,5] = S (.8 × .0024 × .000012 ≈ .000000023)
Final chart:
    [0,1] Det .4    [0,2] NP .0024     [0,5] S .000000023
    [1,2] N .02
    [2,3] V .05     [2,5] VP .000012
    [3,4] Det .4    [3,5] NP .0012
    [4,5] N .01

Probabilistic CYK: summary
- Cells in the chart hold probabilities.
- The bottom-up procedure computes the probability of a parse incrementally.
- To obtain parse trees, cells need to be augmented with backpointers.
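
Below is a minimal Python sketch of this procedure for the toy grammar above; the data structures and names are illustrative, and each chart cell stores, for each non-terminal, its best probability together with a backpointer:

    from collections import defaultdict

    # Toy PCFG in CNF, taken from the example slides
    lexical = {                       # A -> word
        ("Det", "the"): 0.4, ("Det", "a"): 0.4,
        ("N", "flight"): 0.02, ("N", "meal"): 0.01,
        ("V", "includes"): 0.05,
    }
    binary = {                        # A -> B C
        ("S", "NP", "VP"): 0.8,
        ("NP", "Det", "N"): 0.3,
        ("VP", "V", "NP"): 0.2,
    }

    def pcyk(words):
        n = len(words)
        chart = defaultdict(dict)     # chart[(i, j)][A] = (best probability, backpointer)
        for j in range(1, n + 1):
            # lexical step: fill cell (j-1, j) from the j-th word
            for (A, w), p in lexical.items():
                if w == words[j - 1]:
                    chart[(j - 1, j)][A] = (p, words[j - 1])
            # syntactic step: combine smaller spans bottom-up
            for i in range(j - 2, -1, -1):
                for k in range(i + 1, j):
                    for (A, B, C), p in binary.items():
                        if B in chart[(i, k)] and C in chart[(k, j)]:
                            prob = p * chart[(i, k)][B][0] * chart[(k, j)][C][0]
                            if prob > chart[(i, j)].get(A, (0.0, None))[0]:
                                chart[(i, j)][A] = (prob, (k, B, C))
        return chart

    chart = pcyk("the flight includes a meal".split())
    print(chart[(0, 5)]["S"])   # (approx. 2.3e-08, (2, 'NP', 'VP')), matching the final chart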

Probabilistic parsing with lexicalised PCFGs
Main approaches, with a focus on Collins (1997, 1999); see also Charniak (1997).

Unlexicalised PCFG estimation
Charniak (1996) used Penn Treebank POS and phrasal categories to induce a maximum likelihood PCFG:
- only the relative frequency of local trees was used as the estimate for rule probabilities
- no smoothing or other techniques were applied
This works surprisingly well: 80.4% recall and 78.8% precision (crossing brackets were not reported). It suggests that most parsing decisions are mundane and can be handled well by an unlexicalised PCFG. A sketch of the estimation step follows.
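
As a rough sketch of this relative-frequency estimation (the data layout and function name are illustrative, not Charniak's code), assuming the treebank has already been read into a list of (lhs, rhs) rule occurrences:

    from collections import Counter

    def mle_rule_probs(rule_occurrences):
        # rule_occurrences: list of (lhs, rhs) pairs, one per local tree in the treebank
        rule_counts = Counter(rule_occurrences)
        lhs_counts = Counter(lhs for lhs, _ in rule_occurrences)
        # P(lhs -> rhs) = count(lhs -> rhs) / count(lhs), with no smoothing
        return {(lhs, rhs): c / lhs_counts[lhs] for (lhs, rhs), c in rule_counts.items()}

    rules = [("NP", ("Det", "N")), ("NP", ("Det", "N")), ("NP", ("NP", "PP")),
             ("VP", ("V", "NP"))]
    print(mle_rule_probs(rules)[("NP", ("Det", "N"))])   # 2/3, i.e. about 0.667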

Probabilistic lexicalised PCFGs
Standard format of lexicalised rules:
- associate the head word with each non-terminal, e.g. for "dumped sacks into ...": VP(dumped) → VBD(dumped) NP(sacks) PP(into)
- also associate the head tag with each non-terminal: VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
Types of rules:
- lexical rules expand pre-terminals to words, e.g. NNS(sacks,NNS) → sacks; their probability is always 1
- internal rules expand non-terminals, e.g. VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)

Estimating probabilities
Non-generative model: take an MLE estimate of the probability of an entire rule:
P = Count(VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)) / Count(VP(dumped,VBD))
Non-generative models suffer from serious data sparseness problems.
Generative model: estimate the probability of a rule by breaking it up into sub-rules.

Collins Model 1
Main idea: represent CFG rules as expansions into head + left modifiers + right modifiers:
LHS → STOP L_n L_(n-1) ... L_1 H R_1 ... R_(n-1) R_n STOP
Each L_i / R_i is of the form L/R(word, tag), e.g. NP(sacks,NNS). STOP is a special symbol indicating the left/right boundary.
Parsing: given the LHS, generate the head of the rule, then the left modifiers (until STOP) and the right modifiers (until STOP), working inside-out. Each step has a probability.

Collins Model 1: example
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
1. Head H(hw,ht): P_H(H(hw,ht) | Parent, hw, ht)
   e.g. P_H(VBD(dumped,VBD) | VP(dumped,VBD))

Collins Model 1: example (continued)
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
1. Head H(hw,ht): P_H(H(hw,ht) | Parent, hw, ht)
2. Left modifiers: Π_(i=1..n+1) P_L(L_i(lw_i, lt_i) | Parent, H, hw, ht), where L_(n+1) = STOP
   Here there are no left modifiers, so the only factor is P_L(STOP | VP(dumped,VBD), VBD(dumped,VBD)).

Collins Model 1: example (continued)
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
1. Head H(hw,ht): P_H(H(hw,ht) | Parent, hw, ht)
2. Left modifiers: Π_(i=1..n+1) P_L(L_i(lw_i, lt_i) | Parent, H, hw, ht)
3. Right modifiers: Π_(i=1..n+1) P_R(R_i(rw_i, rt_i) | Parent, H, hw, ht), where R_(n+1) = STOP
   Here:
   P_R(NP(sacks,NNS) | VP(dumped,VBD), VBD(dumped,VBD))
   P_R(PP(into,IN) | VP(dumped,VBD), VBD(dumped,VBD))
   P_R(STOP | VP(dumped,VBD), VBD(dumped,VBD))

Collins Model 1: example (continued)
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
1. Head H(hw,ht): P_H(H(hw,ht) | Parent, hw, ht)
2. Left modifiers: Π_(i=1..n+1) P_L(L_i(lw_i, lt_i) | Parent, H, hw, ht)
3. Right modifiers: Π_(i=1..n+1) P_R(R_i(rw_i, rt_i) | Parent, H, hw, ht)
4. Total probability: the product of (1) to (3).
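
A minimal sketch of this generative decomposition; the probability tables P_H, P_L and P_R stand in for distributions that would be estimated from a lexicalised treebank, and all numeric values below are made up:

    STOP = ("STOP", None, None)

    def collins_rule_prob(parent, head, left_mods, right_mods, P_H, P_L, P_R):
        # parent, head and each modifier are (label, headword, headtag) triples;
        # P_H, P_L, P_R are dictionaries standing in for the estimated distributions
        prob = P_H[(head, parent)]                  # 1. generate the head given the parent
        for L in list(left_mods) + [STOP]:          # 2. left modifiers, then STOP
            prob *= P_L[(L, parent, head)]
        for R in list(right_mods) + [STOP]:         # 3. right modifiers, then STOP
            prob *= P_R[(R, parent, head)]
        return prob                                 # 4. product of all the steps

    # Example rule from the slides (probability values are invented):
    parent = ("VP", "dumped", "VBD")
    head = ("VBD", "dumped", "VBD")
    mods = [("NP", "sacks", "NNS"), ("PP", "into", "IN")]
    P_H = {(head, parent): 0.7}
    P_L = {(STOP, parent, head): 0.9}
    P_R = {(mods[0], parent, head): 0.2, (mods[1], parent, head): 0.1,
           (STOP, parent, head): 0.8}
    print(collins_rule_prob(parent, head, [], mods, P_H, P_L, P_R))  # about 0.01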

Variations on Model 1: distance
Collins proposed to extend the rules by conditioning on the distance of each modifier from the head:
P_L(L_i(lw_i, lt_i) | Parent, H, hw, ht, distance_L(i-1))
P_R(R_i(rw_i, rt_i) | Parent, H, hw, ht, distance_R(i-1))
The distance is a function of the yield of the modifiers already seen; e.g. the distance for the R_2 probability is the words under R_1.

Using a distance function
The simplest kind of distance function is a tuple of binary features:
- Is the string of length 0?
- Does the string contain a verb?
Example uses:
- if the string has length 0, P_R should be higher: English is right-branching and most right modifiers are adjacent to the head verb
- if the string contains a verb, P_R should be higher: this accounts for the preference to attach dependencies to the main verb
A sketch of such a feature function is given below.
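
A minimal sketch of such a distance-feature function; the tag set and function name are illustrative (not Collins' actual implementation), and the intervening yield is assumed to be given as (word, tag) pairs:

    VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}   # Penn Treebank verb tags

    def distance_features(yield_so_far):
        # yield_so_far: (word, tag) pairs already generated between head and this modifier
        is_adjacent = len(yield_so_far) == 0            # is the string of length 0?
        contains_verb = any(tag in VERB_TAGS for _, tag in yield_so_far)
        return (is_adjacent, contains_verb)

    print(distance_features([]))                                      # (True, False)
    print(distance_features([("dumped", "VBD"), ("sacks", "NNS")]))   # (False, True)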

Further additions
- Collins Model 2: adds subcategorisation preferences and the distinction between complements and adjuncts.
- Model 3: further augmented to deal with long-distance (WH) dependencies.

Smoothing and backoff
Rules may condition on words that never occur in the training data. Collins used a 3-level backoff model, combined using linear interpolation:
1. use the head word: P_R(R_i(rw_i, rt_i) | Parent, H, hw, ht)
2. use the head tag: P_R(R_i(rw_i, rt_i) | Parent, H, ht)
3. parent only: P_R(R_i(rw_i, rt_i) | Parent)
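
A minimal sketch of the interpolation step; the lambda weights and dictionary lookups below are illustrative placeholders (Collins derives the interpolation weights from training-set counts rather than fixing them):

    def interpolated_prob(modifier, parent, head_tag, head_word,
                          p_full, p_tag, p_parent,
                          lambdas=(0.6, 0.3, 0.1)):
        # p_full:   estimate conditioned on parent, head word and head tag (level 1)
        # p_tag:    estimate conditioned on parent and head tag only       (level 2)
        # p_parent: estimate conditioned on the parent category only       (level 3)
        l1, l2, l3 = lambdas               # weights must sum to 1
        return (l1 * p_full.get((modifier, parent, head_word, head_tag), 0.0)
                + l2 * p_tag.get((modifier, parent, head_tag), 0.0)
                + l3 * p_parent.get((modifier, parent), 0.0))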

Other parsing approaches

Data-oriented parsing
An alternative to grammar-based models:
- does not attempt to derive a grammar from a treebank
- treebank data is stored as fragments of trees
- the parser uses whichever tree fragments seem useful

Data-oriented parsing
Suppose we want to parse Sue heard Jim. The corpus contains tree fragments that are potentially useful here, and the parser can combine these fragments to give a complete parse.

Data-oriented parsing
A single tree can have multiple, fundamentally distinct derivations. Parsing uses Monte Carlo simulation methods (sketched below):
- randomly produce a large sample of derivations
- use these to find the most probable parse
Disadvantage: very large samples are needed to make parses accurate, so this is potentially slow.
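
A minimal sketch of the Monte Carlo disambiguation step, abstracting away how a single random derivation is produced; sample_derivation is a hypothetical callable that builds one derivation and returns the (hashable) tree it yields:

    from collections import Counter

    def most_probable_parse(sample_derivation, n_samples=1000):
        # Sample many derivations and return the tree produced most often,
        # which approximates the most probable parse under the DOP model.
        trees = Counter(sample_derivation() for _ in range(n_samples))
        return trees.most_common(1)[0][0]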

Data-oriented parsing vs. PCFGs
Possible advantages of DOP:
- using partial trees directly accounts for lexical dependencies
- it also accounts for multi-word expressions and idioms (e.g. take advantage of)
- while PCFG rules only represent trees of depth 1, DOP fragments can represent trees of arbitrary depth
Similarities to PCFGs:
- tree fragments of depth 1 are equivalent to PCFG rules
- the probabilities estimated for grammar rules are exactly the same as for the corresponding tree fragments

History Based Grammars (HBG)
General idea: any derivational step can be influenced by any earlier derivational step (Black et al. 1993). The probability of expanding the current node is conditioned on all previous nodes along the path from the root.

History Based Grammars (HBG)
Black et al. (1993) lexicalise their grammar. Every phrasal node inherits two words:
- its lexical head H1
- a secondary head H2, deemed to be useful; e.g. the PP in the bank might have H1 = in and H2 = bank
Every non-terminal is also assigned:
- a syntactic category (Syn), e.g. PP
- a semantic category (Sem), e.g. with-data
In addition, an index I indicates which child of the parent node is being expanded.

HBG example (Black et al. 1993)

History Based Grammars (HBG)
Estimation of the probability of a rule R:
P(Syn, Sem, R, H1, H2 | Syn_p, Sem_p, R_p, I, H1_p, H2_p)
i.e. the probability of:
- the current rule R being applied
- its Syn and Sem categories
- its heads H1 and H2
conditioned on:
- the Syn and Sem of the parent node
- the rule R_p that gave rise to the parent
- the index I of this child relative to the parent
- the heads H1 and H2 of the parent
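
As a minimal sketch of what such a conditional table looks like (Black et al. actually estimate these distributions with decision trees over the history; the plain relative-frequency version below is only meant to make the conditioning explicit, and the tuple layout is illustrative):

    from collections import Counter

    def hbg_probs(events):
        # events: list of (outcome, context) pairs, where
        #   outcome = (Syn, Sem, R, H1, H2) of the child being expanded and
        #   context = (Syn_p, Sem_p, R_p, I, H1_p, H2_p) of its parent
        joint = Counter(events)
        context_counts = Counter(ctx for _, ctx in events)
        return {(out, ctx): c / context_counts[ctx] for (out, ctx), c in joint.items()}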

Summary
This concludes our overview of statistical parsing. We've looked at three important models, and we have also considered basic search techniques and algorithms.