Introduction to Computational Linguistics Olga Zamaraeva (2018) Based on Bender (prev. years) and Levow (2016) University of Washington May 8, 2018 1 / 54
Projects Please remember that Error Analysis goes beyond just running the package and obtaining numbers Make sure to discuss your EA strategy for Milestone 3 Assignment 4 If you already did, expand Can start after today's lecture Assignment 5 Midterms: Can start after next Tuesday DO NOT DELAY Training and Test data Precision and Recall 2 / 54
Why statistical parsing? PCFGs Estimating rule probabilities Ways to improve PCFGs 3 / 54
Why statistical parsing? Your turn 4 / 54
Why statistical parsing? Parsing = making explicit structure that is inherent (implicit) in natural language strings Most application scenarios that use parser output want just one parse Have to choose among all the possible analyses (disambiguation) CKY represents the ambiguities efficiently but does not solve them Most application scenarios need robust parsers Need some output for every input, even if it's not grammatical 5 / 54
Probabilistic Context Free Grammars N: a set of non-terminal symbols Σ: a set of terminal symbols (disjoint from N) R: a set of rules, of the form A → β [p] A: non-terminal β: string of symbols from Σ or N p: probability of β given A S: a designated start symbol 6 / 54
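To make the definition concrete, here is a minimal sketch of how the (N, Σ, R, S) tuple could be written down in Python; the symbols, rules, and probabilities are illustrative toy values, not taken from the lecture.

# Toy PCFG: every name and probability below is made up for illustration.
N = {"S", "NP", "VP", "Det", "N", "V"}          # non-terminal symbols
Sigma = {"a", "cat", "dog", "have"}             # terminal symbols, disjoint from N
R = {                                           # rules A -> beta [p], grouped by LHS A
    "S":   [(("NP", "VP"), 1.0)],
    "NP":  [(("Det", "N"), 1.0)],
    "VP":  [(("V", "NP"), 1.0)],
    "Det": [(("a",), 1.0)],
    "N":   [(("cat",), 0.5), (("dog",), 0.5)],
    "V":   [(("have",), 1.0)],
}
S = "S"                                         # designated start symbol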
How does this differ from CFG? How do we use it to calculate the probability of a parse? The probability of a sentence? What assumptions does that require? 7 / 54
How does this differ from CFG? added probability to each rule How do we use it to calculate the probability of a parse? multiply the probabilities of the rules used The probability of a sentence? sum of the probabilities of all trees that yield it What assumptions does that require? expansion of a node does not depend on the context 8 / 54
Augment each production with the probability that the LHS will be expanded as the RHS P(A → B) or P(A → B | A), i.e. P(RHS | LHS) Sum over all possible expansions is 1: Σ_β P(A → β) = 1 A PCFG is consistent if the sum of the probabilities of all sentences in the language is 1. Recursive rules can yield inconsistent grammars We look at consistent grammars in this class 9 / 54
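Continuing the toy sketch above, checking the first condition (each non-terminal's expansions sum to 1) is a short helper; consistency, i.e. the probabilities of all sentences summing to 1, is a separate and stronger property, as the slide notes.

def expansions_sum_to_one(rules, tol=1e-9):
    # Sum over all possible expansions of each LHS should be 1.
    return all(abs(sum(p for _, p in exps) - 1.0) < tol for exps in rules.values())

print(expansions_sum_to_one(R))   # True for the toy grammar above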
Example: probabilities (Assume some small CFG grammar of the kind we saw before.) I have a cat. Do you have a cat? I have a dog. What is P(Det → a)? What is P(N → cat)? What is P(S → NP VP)? 10 / 54
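One way to work the counts out, assuming the question "Do you have a cat?" is analyzed with a separate rule such as S → Aux NP VP (an assumption, since the grammar fragment is not shown here): P(Det → a) = C(Det → a) / C(Det) = 3/3 = 1; P(N → cat) = C(N → cat) / C(N) = 2/3; P(S → NP VP) = C(S → NP VP) / C(S) = 2/3.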
Example: PCFG fragment How would a real-life trained grammar differ from this one? 11 / 54
Disambiguation A PCFG assigns a probability to each parse tree T for input S Probability of T: product of the probabilities of all rules used to derive T: P(T, S) = Π_i P(RHS_i | LHS_i) (i is a step in the derivation) Why? 12 / 54
Disambiguation A PCFG assigns a probability to each parse tree T for input S Probability of T: product of the probabilities of all rules used to derive T Why? Because P(S | T) = 1 (the tree fully determines the string) 13 / 54
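A minimal sketch of the product-of-rules computation, using (label, children) tuples for trees and a dictionary of rule probabilities keyed by (LHS, RHS); the probabilities in the usage example are the ones from the PCKY grammar segment later in the deck.

def tree_prob(tree, rule_probs):
    # Multiply the probability of every rule used in the derivation of the tree.
    label, children = tree
    if isinstance(children, str):                     # pre-terminal over a word
        return rule_probs[(label, (children,))]
    rhs = tuple(child[0] for child in children)
    p = rule_probs[(label, rhs)]
    for child in children:
        p *= tree_prob(child, rule_probs)
    return p

probs = {("NP", ("Det", "N")): 0.30, ("Det", ("a",)): 0.40, ("N", ("meal",)): 0.05}
t = ("NP", [("Det", "a"), ("N", "meal")])
print(tree_prob(t, probs))    # 0.30 * 0.40 * 0.05 = 0.006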
Parsing problem for PCFGs Select T such that: T̂(S) = argmax_{T s.t. S=yield(T)} P(T) The string S is the yield of the parse tree over S Select the tree that maximizes the probability of the parse Extend existing algorithms, e.g. CKY Most modern parsers are based on CKY 14 / 54
Argmax Recall: the argmax function over a parameter returns the value of the parameter at which the value of the function is maximum E.g. f(x) is maximum when x = 0.5 (suppose f(0.5) = 1) max(f(x)) = 1 argmax_x f(x) = 0.5 15 / 54
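In code, max gives the maximum and max(..., key=...) gives the argmax over a candidate set; f below is an assumed example function chosen so that its peak is at x = 0.5 with f(0.5) = 1, as on the slide.

candidates = [0.0, 0.25, 0.5, 0.75, 1.0]
f = lambda x: 1 - 4 * (x - 0.5) ** 2        # assumed function, peaks at x = 0.5, f(0.5) = 1
print(max(f(x) for x in candidates))        # 1.0  -> max(f(x))
print(max(candidates, key=f))               # 0.5  -> argmax_x f(x)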
Parsing problem for PCFGs T̂(S) = argmax_{T s.t. S=yield(T)} P(T | S) = argmax_{T s.t. S=yield(T)} P(T, S) / P(S) But P(S) is constant for each tree Why? T̂(S) = argmax_{T s.t. S=yield(T)} P(T, S) P(T, S) = P(T) So, T̂(S) = argmax_{T s.t. S=yield(T)} P(T) 16 / 54
Parsing problem for PCFGs T̂(S) = argmax_{T s.t. S=yield(T)} P(T | S) = argmax_{T s.t. S=yield(T)} P(T, S) / P(S) But P(S) is constant for each tree Because it is the same sentence! (NB: it is not 1 but it still is the same number for all trees, so it doesn't matter in finding the max) T̂(S) = argmax_{T s.t. S=yield(T)} P(T, S) P(T, S) = P(T) So, T̂(S) = argmax_{T s.t. S=yield(T)} P(T) 17 / 54
Disambiguation 18 / 54
Assigning probability to a string A PCFG can tell us what the probability of a sentence is (jointly with its structure) What's this useful for? 19 / 54
Assigning probability to a string A PCFG can tell us what the probability of a sentence is (jointly with its structure) What's this useful for? Language modeling Where have we seen language modeling? 20 / 54
Assigning probability to a string A PCFG can tell us what the probability of a sentence is What's this useful for? Language modeling Where have we seen language modeling? N-grams! PCFGs are models which account for more context than N-grams What are some applications? 21 / 54
Assigning probability to a string A PCFG can tell us what the probability of a sentence is What's this useful for? Language modeling Where have we seen language modeling? N-grams! PCFGs are models which account for more context than N-grams What are some applications? MT, ASR, Spelling correction, Grammar correction... 22 / 54
How to estimate rule probabilities? Get a Treebank Gather all instances of each non-terminal For each expansion of the non-terminal (= rule), count how many times it occurs P(α → β | α) = C(α → β) / C(α) 23 / 54
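A minimal sketch of this counting over a toy treebank, reusing the (label, children) tree representation from the earlier sketches; in practice the trees would be read from an actual treebank rather than written by hand.

from collections import Counter

def estimate_rule_probs(treebank):
    rule_counts, lhs_counts = Counter(), Counter()
    def visit(node):
        label, children = node
        if isinstance(children, str):                 # pre-terminal rule A -> w
            rhs = (children,)
        else:                                         # internal rule A -> B C ...
            rhs = tuple(child[0] for child in children)
            for child in children:
                visit(child)
        rule_counts[(label, rhs)] += 1
        lhs_counts[label] += 1
    for tree in treebank:
        visit(tree)
    # P(alpha -> beta) = C(alpha -> beta) / C(alpha)
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}

toy_treebank = [("NP", [("Det", "a"), ("N", "cat")]),
                ("NP", [("Det", "a"), ("N", "dog")])]
print(estimate_rule_probs(toy_treebank)[("N", ("cat",))])   # 0.5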
Probabilistic CKY (Ney, 1991): in each cell, store just the most probable edge for each non-terminal (Btw, CKY does not appear to have a canonical citation?.. Several papers, from the late 50s into the 70s) 24 / 54
Like regular CKY Assume grammar in Chomsky Normal Form (CNF) Productions: A → B C or A → w Represent input with indices b/t words E.g., 0 Book 1 that 2 flight 3 through 4 Houston 5 For an input string of length n and non-terminals V Cell [i,j,A] in an (n+1) × (n+1) × V matrix contains the probability that constituent A spans [i,j] V is a 3rd dimension: for each non-terminal it stores the probabilities 25 / 54
Note: this is pseudocode ← generally means assignment (e.g. x = 2) downto 0 means the number is decreasing with each iteration until it is 0 26 / 54
PCKY grammar segment S → NP VP [0.80] NP → Det N [0.30] VP → V NP [0.20] V → includes [0.05] Det → the [0.40] Det → a [0.40] N → flight [0.02] N → meal [0.05] 27 / 54
This flight includes a meal 28 / 54
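Here is a rough, minimal sketch of probabilistic CKY over the grammar segment above, keeping only the best probability per (span, non-terminal) as in the pseudocode slide; Det → this [0.05] is an assumed rule added so the example sentence is covered, and back-pointers for recovering the best tree are omitted.

from collections import defaultdict

binary = {("NP", "VP"): [("S", 0.80)],          # S  -> NP VP [0.80]
          ("Det", "N"): [("NP", 0.30)],         # NP -> Det N [0.30]
          ("V", "NP"):  [("VP", 0.20)]}         # VP -> V NP  [0.20]
lexical = {"includes": [("V", 0.05)], "the": [("Det", 0.40)], "a": [("Det", 0.40)],
           "flight": [("N", 0.02)], "meal": [("N", 0.05)],
           "this": [("Det", 0.05)]}             # Det -> this [0.05] is assumed

def pcky(words):
    n = len(words)
    table = defaultdict(dict)                   # (i, j) -> {A: best probability of A over [i, j]}
    for i, w in enumerate(words):               # width-1 spans: lexical rules
        for a, p in lexical.get(w, []):
            table[(i, i + 1)][a] = p
    for width in range(2, n + 1):               # wider spans, shortest first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):           # every split point
                for b, pb in table[(i, k)].items():
                    for c, pc in table[(k, j)].items():
                        for a, pr in binary.get((b, c), []):
                            p = pr * pb * pc
                            if p > table[(i, j)].get(a, 0.0):
                                table[(i, j)][a] = p
    return table[(0, n)].get("S", 0.0)

print(pcky("this flight includes a meal".split()))   # probability of the best S parse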
Problems with PCFGs Independence assumption is problematic What does the independence assumption mean? What is the evidence that it's wrong? Not sensitive to lexical dependencies What does that mean? 29 / 54
Problems with PCFGs Independence assumption is problematic What does the independence assumption mean? NP → Pronoun vs NP → Noun The choice will be made independently of other things going on in the tree What is the evidence that it's wrong? NP → Pronoun is much more likely to be in the subject position Not sensitive to lexical dependencies What does that mean? Verb/preposition subcategorization Coordination ambiguity 30 / 54
Problems with PCFGs Switchboard corpus (Francis et al., 1999): Subject: Pronoun 91%, Non-Pronoun 9%; Object: Pronoun 34%, Non-Pronoun 66% But estimated probabilities: NP → DT NN [0.28] NP → PRP [0.25] 31 / 54
Problems with PCFGs 32 / 54
Problems with PCFGs 33 / 54
Problems with PCFGs NP → NP PP has fairly high probability in English (aka WSJ) So a PCFG trained on it will always prefer this kind of attachment Remember how pervasive this issue was shown to be in Kummerfeld et al. 34 / 54
Lexical dependencies The workers dumped the sacks into a bin How to capture the fact that there is greater affinity between dumped and into than between sacks and into? 35 / 54
Coordination Why would these parses have exactly the same probabilities? 36 / 54
Ways to improve PCFGs Split the non-terminals Rename each non-terminal based on its parent (NP-S vs. NP-VP) Hand-written rules to split pre-terminal categories Automatically search for optimal splits through a split-and-merge algorithm Lexicalized PCFGs: add identity of the lexical head to each node label Data sparsity problem → smoothing again 37 / 54
Parent annotation Can distinguish Subject from Object now (how?) 38 / 54
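A minimal, illustrative sketch of the transformation over the same (label, children) trees as before: every non-terminal gets its parent's label appended, so a subject NP becomes NP-S while an object NP becomes NP-VP; pre-terminals are left alone here.

def parent_annotate(tree, parent=None):
    label, children = tree
    if isinstance(children, str):                 # leave pre-terminals/words as-is
        return (label, children)
    new_label = label + "-" + parent if parent else label
    return (new_label, [parent_annotate(child, label) for child in children])

t = ("S", [("NP", [("PRP", "we")]),
           ("VP", [("VBD", "saw"), ("NP", [("PRP", "it")])])])
print(parent_annotate(t))
# ('S', [('NP-S', [('PRP', 'we')]), ('VP-S', [('VBD', 'saw'), ('NP-VP', [('PRP', 'it')])])])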
Parent annotation Advantages: Captures structural dependency in grammars Disadvantages? 39 / 54
Parent annotation Did parent annotation help with the left tree? What did we do in the right tree? 40 / 54
Parent annotation Advantages: Captures structural dependency in grammars Disadvantages? Increases number of rules in grammar Decreases amount of data available for training per rule 41 / 54
Lexicalized CFG Lexicalized rules: Best known parsers: Collins, Charniak parsers Each non-terminal annotated with its lexical head E.g. verb with verb phrase, noun with noun phrase Each rule must identify an RHS element as head Heads propagate up the tree Conceptually like adding 1 rule per head value VP(dumped) → VBD(dumped) NP(sacks) PP(into) VP(dumped) → VBD(dumped) NP(cats) PP(into) In practice, the data would be very sparse for training something like this 42 / 54
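A rough sketch of head propagation with a tiny, assumed head-rule table (not Collins' or Charniak's actual head rules): each node looks up which child supplies its head, the head word percolates up, and node labels become e.g. VP(dumped).

HEAD_CHILD = {"S": ["VP"], "VP": ["VBD", "VB"],          # assumed, much-simplified head rules
              "NP": ["NN", "NNS"], "PP": ["IN"]}

def head_word(tree):
    label, children = tree
    if isinstance(children, str):                         # pre-terminal: the word is the head
        return children
    for candidate in HEAD_CHILD.get(label, []):
        for child in children:
            if child[0] == candidate:
                return head_word(child)
    return head_word(children[0])                         # fallback: leftmost child

def lexicalize(tree):
    label, children = tree
    if isinstance(children, str):
        return (label + "(" + children + ")", children)
    return (label + "(" + head_word(tree) + ")", [lexicalize(child) for child in children])

vp = ("VP", [("VBD", "dumped"),
             ("NP", [("DT", "the"), ("NNS", "sacks")]),
             ("PP", [("IN", "into"), ("NP", [("DT", "a"), ("NN", "bin")])])])
print(lexicalize(vp)[0])    # VP(dumped), with NP(sacks) and PP(into) inside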
Lexicalized CFG 43 / 54
Disambiguation example 44 / 54
Disambiguation example 45 / 54
Improving PCFGs: Tradeoffs Tensions: Note: The running time of the parser is not just O(n^3) It is actually O(n^3 |G|) where |G| is the size of the grammar Increasing accuracy increases the grammar size Lexicalized and specialized grammars can be huge This increases processing times And increases training data requirements (why?) How can we balance? 46 / 54
Improving PCFGs: Tradeoffs Tensions: Note: The running time of the parser is not just O(n^3) It is actually O(n^3 |G|) where |G| is the size of the grammar Increasing accuracy increases the grammar size Lexicalized and specialized grammars can be huge This increases processing times And increases training data requirements (why?) sparsity (remember why we couldn't compute joint probabilities directly for N-gram models?) How can we balance? 47 / 54
Solutions Beam Thresholding (inspired by the Beam Search algorithm) Main idea: Keep only the top k most probable partial parses Retain only k choices per cell Collins parser: make further independence assumptions 48 / 54
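A minimal sketch of the per-cell pruning idea: once a chart cell is filled (a dict from non-terminal to probability, as in the PCKY sketch earlier), keep only its k most probable entries; the labels and numbers below are illustrative.

def prune_cell(cell, k):
    # Keep only the k most probable non-terminals in one chart cell.
    top = sorted(cell.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(top)

print(prune_cell({"NP": 0.006, "NML": 0.0004, "QP": 0.00001}, k=2))
# {'NP': 0.006, 'NML': 0.0004}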
Solutions Beam Thresholding (inspired by the Beam Search algorithm) Main idea: Keep only the top k most probable partial parses Retain only k choices per cell Collins parser: make further independence assumptions Aren't we going in circles? related: is CFG a theory of human language? 49 / 54
Gold Standard Obtained from a Treebank What is the problem with this? 50 / 54
metric: Parseval 51 / 54
metric: Parseval 52 / 54
Example: Precision and Recall Gold standard: (S (NP (A a)) (VP (B b) (NP (C c)) (PP (D d)))) Hypothesis: (S (NP (A a)) (VP (B b) (NP (C c) (PP (D d))))) G: S(0,4) NP(0,1) VP(1,4) NP(2,3) PP(3,4) H: S(0,4) NP(0,1) VP(1,4) NP(2,4) PP(3,4) LP: 4/5 LR: 4/5 F1: 4/5 53 / 54
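The same numbers can be computed by set intersection over labeled (label, start, end) constituents; this is a sketch of the Parseval bookkeeping for the example above, with pre-terminals excluded as in the listed constituents.

gold = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 3), ("PP", 3, 4)}
hyp  = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4), ("PP", 3, 4)}

correct = len(gold & hyp)                         # 4 constituents match
precision = correct / len(hyp)                    # labeled precision: 4/5
recall = correct / len(gold)                      # labeled recall: 4/5
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)                      # 0.8 0.8 0.8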
Balancing efficiency with accuracy in PCFGs: Fixing one problem often creates new problems PCFGs don't seem to help us understand much about language Why do we need to implement parsers for that anyway? Next: Syntactic theory and then unification-based parsing 54 / 54