Introduction to Computational Linguistics


Introduction to Computational Linguistics. Olga Zamaraeva (2018). Based on Bender (prev. years) and Levow (2016). University of Washington. May 8, 2018. 1 / 54

Projects: Please remember that Error Analysis goes beyond just running the package and obtaining numbers. Make sure to discuss your EA strategy for Milestone 3. Assignment 4: If you already did, expand; can start after today's lecture. Assignment 5: Can start after next Tuesday. Midterms: DO NOT DELAY. Training and Test data. Precision and Recall. 2 / 54

Why statistical parsing? PCFGs. Estimating rule probabilities. Ways to improve PCFGs. 3 / 54

Why statistical parsing? Your turn 4 / 54

Why statistical parsing? Parsing = making explicit structure that is inherent (implicit) in natural language strings. Most application scenarios that use parser output want just one parse, so we have to choose among all the possible analyses (disambiguation). CKY represents the ambiguities efficiently but does not solve them. Most application scenarios need robust parsers: we need some output for every input, even if it's not grammatical. 5 / 54

Probabilistic Context Free Grammars. N: a set of non-terminal symbols. Σ: a set of terminal symbols (disjoint from N). R: a set of rules, of the form A → β [p]. A: a non-terminal. β: a string of symbols from Σ or N. p: the probability of β given A. S: a designated start symbol. 6 / 54

How does this differ from CFG? How do we use it to calculate the probability of a parse? The probability of a sentence? What assumptions does that require? 7 / 54

How does this differ from a CFG? A probability is added to each rule. How do we use it to calculate the probability of a parse? Multiply the probabilities of the rules used. The probability of a sentence? Sum the probabilities of all its trees. What assumptions does that require? That the expansion of a node does not depend on the context. 8 / 54

Augment each production with the probability that the LHS will be expanded as the given RHS: P(A → β), or P(A → β | A), i.e. P(RHS | LHS). The sum over all possible expansions of a non-terminal is 1: Σ_β P(A → β) = 1. A PCFG is consistent if the sum of the probabilities of all sentences in the language is 1. Recursive rules can yield inconsistent grammars. We look at consistent grammars in this class. 9 / 54
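To make the sum-to-one constraint concrete, here is a minimal Python sketch (my illustration, not course code; the toy rule set and probabilities are invented) that stores a PCFG as (LHS, RHS, probability) triples and checks that every non-terminal's expansions sum to 1:

from collections import defaultdict

# Toy PCFG as (LHS, RHS, probability) triples; the numbers are illustrative.
rules = [
    ("S",   ("NP", "VP"), 1.0),
    ("NP",  ("Det", "N"), 0.6),
    ("NP",  ("PRP",),     0.4),
    ("VP",  ("V", "NP"),  1.0),
    ("Det", ("a",),       0.5),
    ("Det", ("the",),     0.5),
    ("N",   ("cat",),     0.7),
    ("N",   ("dog",),     0.3),
    ("PRP", ("I",),       1.0),
    ("V",   ("have",),    1.0),
]

totals = defaultdict(float)
for lhs, rhs, p in rules:
    totals[lhs] += p

# For every non-terminal A: the sum over beta of P(A -> beta) must be 1.
for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, f"{lhs} expansions sum to {total}, not 1"
print("All non-terminals have expansion probabilities summing to 1.")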

Example: probabilities. (Assume some small CFG grammar of the kind we saw before.) I have a cat. Do you have a cat? I have a dog. What is P(Det → a)? What is P(N → cat)? What is P(S → NP VP)? 10 / 54

Example: PCFG fragment. How would a real-life trained grammar differ from this one? 11 / 54

Disambiguation: A PCFG assigns a probability to each parse tree T for input S. Probability of T: the product of the probabilities of all the rules used to derive T, P(T, S) = Π_i P(RHS_i | LHS_i) (i is a step in the derivation). Why? 12 / 54

Disambiguation: A PCFG assigns a probability to each parse tree T for input S. Probability of T: the product of the probabilities of all the rules used to derive T. Why? Because P(S | T) = 1. 13 / 54
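As a small illustration of the product rule (my sketch, not from the slides; the nested-tuple tree encoding and the rule probabilities are assumptions), P(T) can be computed by walking the tree and multiplying the probability of every rule used:

# Toy rule probabilities, keyed by (LHS, RHS); the numbers are illustrative.
rule_prob = {
    ("S",   ("NP", "VP")):  0.8,
    ("NP",  ("Det", "N")):  0.3,
    ("VP",  ("V", "NP")):   0.2,
    ("Det", ("the",)):      0.4,
    ("Det", ("a",)):        0.4,
    ("N",   ("flight",)):   0.02,
    ("N",   ("meal",)):     0.05,
    ("V",   ("includes",)): 0.05,
}

def tree_prob(tree):
    """P(T) = product over derivation steps i of P(RHS_i | LHS_i)."""
    if isinstance(tree, str):              # a bare word contributes no rule
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for child in children:
        p *= tree_prob(child)
    return p

tree = ("S",
        ("NP", ("Det", "the"), ("N", "flight")),
        ("VP", ("V", "includes"),
               ("NP", ("Det", "a"), ("N", "meal"))))
print(tree_prob(tree))   # 0.8 * 0.3 * 0.4 * 0.02 * 0.2 * 0.05 * 0.3 * 0.4 * 0.05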

Parsing problem for PCFGs: Select T such that: ˆT(S) = argmax_{T s.t. S = yield(T)} P(T). The string S is the yield of the parse tree over S. Select the tree that maximizes the probability of the parse. Extend existing algorithms, e.g. CKY. Most modern parsers are based on CKY. 14 / 54

Argmax. Recall: the argmax function over a parameter returns the value of the parameter at which the value of the function is maximum. E.g., suppose f(x) is maximum when x = 0.5, with f(0.5) = 1: then max(f(x)) = 1 and argmax_x(f(x)) = 0.5. 15 / 54

Parsing problem for PCFGs:
ˆT(S) = argmax_{T s.t. S = yield(T)} P(T | S)
ˆT(S) = argmax_{T s.t. S = yield(T)} P(T, S) / P(S)
But P(S) is constant for each tree. Why?
ˆT(S) = argmax_{T s.t. S = yield(T)} P(T, S)
P(T, S) = P(T)
So, ˆT(S) = argmax_{T s.t. S = yield(T)} P(T)
16 / 54

Parsing problem for PCFGs:
ˆT(S) = argmax_{T s.t. S = yield(T)} P(T | S)
ˆT(S) = argmax_{T s.t. S = yield(T)} P(T, S) / P(S)
But P(S) is constant for each tree. Because it is the same sentence! (NB: it is not 1, but it still is the same number for all trees, so it doesn't matter in finding the max.)
ˆT(S) = argmax_{T s.t. S = yield(T)} P(T, S)
P(T, S) = P(T)
So, ˆT(S) = argmax_{T s.t. S = yield(T)} P(T)
17 / 54

Disambiguation 18 / 54

Assigning probability to a string: a PCFG can tell us what the probability of a sentence is (jointly with its structure). What's this useful for? 19 / 54

Assigning probability to a string: a PCFG can tell us what the probability of a sentence is (jointly with its structure). What's this useful for? Language modeling. Where have we seen language modeling? 20 / 54

Assigning probability to a string: a PCFG can tell us what the probability of a sentence is. What's this useful for? Language modeling. Where have we seen language modeling? N-grams! PCFGs are models which account for more context than N-grams. What are some applications? 21 / 54

Assigning probability to a string: a PCFG can tell us what the probability of a sentence is. What's this useful for? Language modeling. Where have we seen language modeling? N-grams! PCFGs are models which account for more context than N-grams. What are some applications? MT, ASR, spelling correction, grammar correction... 22 / 54

How to estimate rule probabilities? Get a Treebank. Gather all instances of each non-terminal. For each expansion of the non-terminal (= rule), count how many times it occurs: P(α → β | α) = C(α → β) / C(α). 23 / 54
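A minimal sketch of this counting procedure (my illustration; the nested-tuple tree encoding and the two-tree "treebank" are invented, not course data):

from collections import Counter

# Two tiny hand-made trees standing in for a treebank.
treebank = [
    ("S", ("NP", ("Det", "the"), ("N", "flight")),
          ("VP", ("V", "includes"), ("NP", ("Det", "a"), ("N", "meal")))),
    ("S", ("NP", ("Det", "the"), ("N", "meal")),
          ("VP", ("V", "includes"), ("NP", ("Det", "a"), ("N", "flight")))),
]

rule_counts = Counter()   # C(alpha -> beta)
lhs_counts = Counter()    # C(alpha)

def collect(tree):
    if isinstance(tree, str):
        return
    lhs, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(lhs, rhs)] += 1
    lhs_counts[lhs] += 1
    for child in children:
        collect(child)

for tree in treebank:
    collect(tree)

# P(alpha -> beta | alpha) = C(alpha -> beta) / C(alpha)
for (lhs, rhs), count in sorted(rule_counts.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{count / lhs_counts[lhs]:.2f}]")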

Probabilistic CKY (Ney, 1991): in each cell, store just the most probable edge for each non-terminal. (Btw, CKY does not appear to have a canonical citation?.. Several papers, from the late 50s into the 70s.) 24 / 54

Like regular CKY: Assume a grammar in Chomsky Normal Form (CNF). Productions: A → B C or A → w. Represent input with indices between words, e.g., 0 Book 1 that 2 flight 3 through 4 Houston 5. For input string length n and non-terminals V, cell [i,j,A] in the (n + 1) x (n + 1) x V matrix contains the probability that constituent A spans [i,j]. V is a 3rd dimension: for each non-terminal it stores the probabilities. 25 / 54

Note: this is pseudocode. '←' generally means assignment (e.g. x = 2). 'downto 0' means the number is decreasing with each iteration until it is 0. 26 / 54

PCKY grammar segment:
S → NP VP [0.80]
NP → Det N [0.30]
VP → V NP [0.20]
V → includes [0.05]
Det → the [0.40]
Det → a [0.40]
N → flight [0.02]
N → meal [0.05]
27 / 54

This flight includes a meal 28 / 54
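A compact probabilistic-CKY sketch in Python (my own illustration, not the course pseudocode), using the grammar segment from the previous slide plus an assumed Det → this rule with an invented probability so that the example sentence is covered; each chart cell stores, per non-terminal, the best probability for that span:

from collections import defaultdict

# A -> w rules: word -> list of (A, probability)
unary = {
    "this": [("Det", 0.05)],                 # assumed rule, not on the slide
    "the": [("Det", 0.40)], "a": [("Det", 0.40)],
    "flight": [("N", 0.02)], "meal": [("N", 0.05)],
    "includes": [("V", 0.05)],
}
# A -> B C rules: (A, B, C, probability)
binary = [
    ("S", "NP", "VP", 0.80),
    ("NP", "Det", "N", 0.30),
    ("VP", "V", "NP", 0.20),
]

def pcky(words):
    n = len(words)
    chart = defaultdict(dict)                # (i, j) -> {non-terminal: best probability}
    for i, w in enumerate(words):
        for a, p in unary.get(w.lower(), []):
            chart[(i, i + 1)][a] = p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):        # split point
                for a, b, c, p in binary:
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        prob = p * chart[(i, k)][b] * chart[(k, j)][c]
                        if prob > chart[(i, j)].get(a, 0.0):
                            chart[(i, j)][a] = prob   # keep only the best edge
    return chart[(0, n)].get("S", 0.0)

print(pcky("This flight includes a meal".split()))   # probability of the best S parse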

Problems with PCFGs: The independence assumption is problematic. What does the independence assumption mean? What is the evidence that it's wrong? Not sensitive to lexical dependencies. What does that mean? 29 / 54

Problems with PCFGs: The independence assumption is problematic. What does the independence assumption mean? NP → Pronoun vs. NP → Noun: the choice will be made independently of other things going on in the tree. What is the evidence that it's wrong? NP → Pronoun is much more likely to be in the subject position. Not sensitive to lexical dependencies. What does that mean? Verb/preposition subcategorization; coordination ambiguity. 30 / 54

Problems with PCFGs: Switchboard corpus (Francis et al., 1999):
          Pronoun   Non-Pronoun
Subject   91%       9%
Object    34%       66%
But estimated probabilities:
NP → DT NN [0.28]
NP → PRP [0.25]
31 / 54

Problems with PCFGs 32 / 54

Problems with PCFGs 33 / 54

Problems with PCFGs: NP → NP PP has fairly high probability in English (a.k.a. the WSJ). So a PCFG trained on it will always prefer this kind of attachment. Remember how pervasive this issue was shown to be in Kummerfeld et al. 34 / 54

Lexical dependencies: The workers dumped the sacks into a bin. How to capture the fact that there is greater affinity between "dumped" and "into" than between "sacks" and "into"? 35 / 54

Coordination Why would these parses have exactly the same probabilities? 36 / 54

Ways to improve PCFGs: Split the non-terminals: rename each non-terminal based on its parent (NP-S vs. NP-VP); hand-written rules to split pre-terminal categories; automatically search for optimal splits through a split-and-merge algorithm. Lexicalized PCFGs: add the identity of the lexical head to each node label. Data sparsity problem → smoothing again. 37 / 54

Parent annotation Can distinguish Subject from Object now (how?) 38 / 54

Parent annotation Advantages: Captures structural dependency in grammars Disadvantages? 39 / 54

Parent annotation Did parent annotation help with the left tree? What did we do in the right tree? 40 / 54

Parent annotation. Advantages: captures structural dependency in grammars. Disadvantages? Increases the number of rules in the grammar; decreases the amount of data available for training per rule. 41 / 54
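A minimal sketch of parent annotation (my illustration; the nested-tuple tree encoding is an assumption): every phrasal non-terminal label is rewritten as Label-Parent, as in NP-S vs. NP-VP, before rules are read off and counted:

def parent_annotate(tree, parent="ROOT"):
    """Rename every phrasal non-terminal as 'Label-Parent'; pre-terminals and words stay as is."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return (label, children[0])           # pre-terminal
    return (f"{label}-{parent}",) + tuple(parent_annotate(c, label) for c in children)

tree = ("S", ("NP", ("PRP", "I")),
             ("VP", ("V", "have"), ("NP", ("Det", "a"), ("N", "cat"))))
print(parent_annotate(tree))
# ('S-ROOT', ('NP-S', ('PRP', 'I')), ('VP-S', ('V', 'have'), ('NP-VP', ('Det', 'a'), ('N', 'cat'))))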

Lexicalized PCFGs. Lexicalized rules: Best-known parsers: the Collins and Charniak parsers. Each non-terminal is annotated with its lexical head, e.g. a verb with a verb phrase, a noun with a noun phrase. Each rule must identify an RHS element as the head; heads propagate up the tree. Conceptually like adding 1 rule per head value: VP(dumped) → VBD(dumped) NP(sacks) PP(into); VP(dumped) → VBD(dumped) NP(cats) PP(into). In practice, the data would be very sparse to train something like this. 42 / 54
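A minimal sketch of head lexicalization (my illustration; the head-percolation table below is a drastically simplified assumption, not the Collins or Charniak head rules): each non-terminal is annotated with the head word propagated up from its head child:

# Which child categories can supply the head, in order of preference (simplified).
HEAD_CHILD = {"S": ["VP"], "VP": ["VBD", "VB"], "NP": ["NNS", "NN"], "PP": ["IN"]}

def lexicalize(tree):
    """Return (annotated tree, head word) for a nested-tuple tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        word = children[0]
        return (f"{label}({word})", word), word
    annotated, heads = [], {}
    for child in children:
        ann, head = lexicalize(child)
        annotated.append(ann)
        heads[child[0]] = head
    head = None
    for cat in HEAD_CHILD.get(label, []):
        if cat in heads:
            head = heads[cat]
            break
    if head is None:                          # fallback: head of the first child
        head = next(iter(heads.values()))
    return (f"{label}({head})",) + tuple(annotated), head

tree = ("VP", ("VBD", "dumped"),
              ("NP", ("DT", "the"), ("NNS", "sacks")),
              ("PP", ("IN", "into"), ("NP", ("DT", "a"), ("NN", "bin"))))
print(lexicalize(tree)[0][0])   # VP(dumped)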

Lexicalized PCFGs 43 / 54

Disambiguation example 44 / 54

Disambiguation example 45 / 54

Improving PCFGs: Tradeoffs. Tensions: Note that the running time of the parser is not just O(n^3); it is actually O(n^3 · |G|), where |G| is the size of the grammar. Increasing accuracy increases the grammar size. Lexicalized and specialized grammars can be huge. This increases processing times, and increases training data requirements (why?). How can we balance? 46 / 54

Improving PCFGs: Tradeoffs. Tensions: Note that the running time of the parser is not just O(n^3); it is actually O(n^3 · |G|), where |G| is the size of the grammar. Increasing accuracy increases the grammar size. Lexicalized and specialized grammars can be huge. This increases processing times, and increases training data requirements (why?): sparsity (remember why we couldn't compute joint probabilities directly for N-gram models?). How can we balance? 47 / 54

Solutions: Beam thresholding (inspired by the beam search algorithm). Main idea: keep only the top k most probable partial parses; retain only k choices per cell. Collins parser: make further independence assumptions. 48 / 54

Solutions: Beam thresholding (inspired by the beam search algorithm). Main idea: keep only the top k most probable partial parses; retain only k choices per cell. Collins parser: make further independence assumptions. Aren't we going in circles? Related: is CFG a theory of human language? 49 / 54
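A minimal sketch of beam thresholding (my illustration; the cell format matches the PCKY sketch earlier, and the choice of k is arbitrary): after a chart cell is filled, keep only its k most probable entries:

import heapq

def prune_cell(cell, k=5):
    """Keep only the k most probable non-terminal entries of a chart cell.

    cell: dict mapping non-terminal -> probability, as in the PCKY sketch.
    """
    if len(cell) <= k:
        return cell
    return dict(heapq.nlargest(k, cell.items(), key=lambda item: item[1]))

# Example: a crowded cell reduced to its 2 best entries.
cell = {"NP": 0.006, "VP": 0.0001, "S": 0.00004, "PP": 0.002}
print(prune_cell(cell, k=2))   # {'NP': 0.006, 'PP': 0.002}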

Gold Standard Obtained from a Treebank What is the problem with this? 50 / 54

Evaluation metric: Parseval 51 / 54

Evaluation metric: Parseval 52 / 54

Example: Precision and Recall
Gold standard: (S (NP (A a)) (VP (B b) (NP (C c)) (PP (D d))))
Hypothesis: (S (NP (A a)) (VP (B b) (NP (C c) (PP (D d)))))
G: S(0,4) NP(0,1) VP(1,4) NP(2,3) PP(3,4)
H: S(0,4) NP(0,1) VP(1,4) NP(2,4) PP(3,4)
LP: 4/5, LR: 4/5, F1: 4/5
53 / 54
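For concreteness, a small Python sketch of the labeled precision/recall computation for this example (my illustration; the constituents are written out by hand as (label, start, end) triples, and extracting spans from the bracketed strings is omitted):

gold = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 3), ("PP", 3, 4)}
hypo = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4), ("PP", 3, 4)}

matched = len(gold & hypo)   # labeled constituents found in both trees
lp = matched / len(hypo)     # labeled precision: correct / hypothesis constituents
lr = matched / len(gold)     # labeled recall: correct / gold constituents
f1 = 2 * lp * lr / (lp + lr)
print(lp, lr, f1)            # 0.8 0.8 0.8, i.e. LP = LR = F1 = 4/5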

Balancing efficiency with accuracy in PCFGs: fixing one problem often creates new problems. PCFGs don't seem to help us understand much about language. Why do we need to implement parsers for that anyway? Next: syntactic theory and then unification-based parsing. 54 / 54