Better Syntactic Parsing with Lexical-Semantic Features from Auto-parsed Data Yoav Goldberg (actual work by Eliyahu Kiperwasser) ICRI-CI Retreat, May 2015
Language
People use language to communicate. Language is everywhere:
conversations, newspapers, scientific articles, medicine (patient records), patents, law, product reviews, blogs, Facebook, Twitter...
A lot of text. We need to understand what's being said. This is where we come in.
NLP: text → meaning
What does it mean to understand? I focus on the building blocks.
This talk is about syntactic parsing
Syntactic Parsing
Sentences in natural language have structure. Linguists create theories defining these structures:
- the mainstream theory can be quite convoluted
- countless debates regarding many corner cases
- most linguists agree on the basics (the "boring stuff")
- the boring stuff is actually very useful
This talk: Dependency Structures
A syntactic representation in which:
- every word is a node in a tree
- there is a single ROOT node
- there are no non-word nodes other than ROOT
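As a purely illustrative aside, here is one common way to store such a tree in code: one head index per word, with the artificial ROOT at position 0. The sentence, indices, and labels below are made up for the example, not taken from the talk.

```python
# A dependency tree as a head-index array: every word has exactly one head,
# and exactly one word attaches to the artificial ROOT node at index 0.

sentence = ["ROOT", "the", "black", "fox", "will", "jump", "over", "the", "fence"]

# heads[i] = index of the head of word i (entry 0 is a placeholder for ROOT).
heads  = [0, 3, 3, 5, 5, 0, 5, 8, 6]
labels = ["-", "det", "amod", "nsubj", "aux", "root", "prep", "det", "pobj"]

def arcs(heads):
    """Yield the (head, modifier) index pairs encoded by the head array."""
    for m, h in enumerate(heads):
        if m == 0:          # skip the ROOT placeholder itself
            continue
        yield h, m

for h, m in arcs(heads):
    print(f"{sentence[h]:>6} -> {sentence[m]:<6} ({labels[m]})")
```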
Syntactic Parsing The soup, which I expected to be good, was bad
Syntactic Parsing
[figure: dependency parse of "The soup, which I expected to be good, was bad", with arc labels det, subj, rcmod, rel, xcomp, subj, aux, acomp, acomp, root]
Syntactic Parsing The gromp, which I furpled to be drogby, was spujky
Syntactic Parsing
[figure: the same dependency structure over "The gromp, which I furpled to be drogby, was spujky"]
We can go a long way without the words, based on structural cues alone.
Syntactic Parsing
But sometimes words do matter. Compare:
I ate pizza with olives
I ate pizza with friends
The correct analysis depends on the words.
Parsers are created using machine learning, based on a training set of (sentence, tree) pairs. For English, we have about 40,000 such pairs. That is too small to learn word-word interactions.
Semi-supervised learning: unannotated data is cheap. Use a lot of unannotated data to improve lexical coverage.
This talk: improve parsing accuracy using a lot of unannotated text.
Prior Semi-supervised Parsing (State-of-the-art)
Simple Semi-supervised Dependency Parsing (Koo et al., 2008):
- Take a large amount of unannotated text.
- Run a word clustering algorithm to learn word clusters; now each word is associated with a cluster.
- Use the cluster identities as additional features in a supervised parser.
With the Brown clustering algorithm and a good set of cluster-based features, this produces state-of-the-art results.
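A hedged sketch of what such cluster-based features can look like: Brown clustering assigns each word a bit-string, and features typically use prefixes of that string so that nearby clusters share features. The bit-strings, prefix lengths, and feature names below are illustrative, not Koo et al.'s actual templates.

```python
# Illustrative cluster-prefix features for a head-modifier pair.
# Prefixes of the Brown bit-strings act as coarser, shared cluster identities.

brown = {                     # hypothetical cluster bit-strings
    "pizza":   "110100",
    "olives":  "110101",
    "friends": "111001",
    "ate":     "0101",
}

def cluster_features(head, mod, prefix_lens=(4, 6)):
    feats = []
    for k in prefix_lens:
        hc = brown.get(head, "UNK")[:k]   # head cluster prefix of length k
        mc = brown.get(mod, "UNK")[:k]    # modifier cluster prefix of length k
        feats.append(f"hc{k}={hc}_mc{k}={mc}")
    return feats

print(cluster_features("ate", "olives"))
```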
Note: the clustering metric is not related to the parsing task. We take a different approach
Auto-Parsed Data
[diagram: train a model on the parsed (annotated) data; use it to parse a large unannotated corpus, producing auto-parsed data; extract auto-parsed features from that data; train a new model on the parsed data plus these features; use it to predict the test-set annotations]
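A sketch of the pipeline in the diagram. The callables are parameters here because the talk does not name a concrete parser implementation; this is not a real library API.

```python
# Semi-supervised pipeline: baseline parser -> auto-parsed data ->
# association features -> retrained parser.

def semi_supervised_pipeline(train_parser, extract_assoc_features,
                             annotated_treebank, unannotated_corpus):
    # 1. Train a baseline parser on the small annotated treebank.
    base_parser = train_parser(annotated_treebank, extra_features=None)

    # 2. Parse a large unannotated corpus with it ("auto-parsed data").
    auto_parsed = [base_parser.parse(sent) for sent in unannotated_corpus]

    # 3. Collect head-modifier statistics from the auto-parsed trees.
    assoc_features = extract_assoc_features(auto_parsed)

    # 4. Retrain on the same treebank, now with the association features.
    return train_parser(annotated_treebank, extra_features=assoc_features)
```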
Graph-based Parsing
parse(sent) = argmax_{tree ∈ Trees(sent)} score(sent, tree)
score(sent, tree) = Σ_{part ∈ tree} w · φ(sent, part) + Σ_{(h,m) ∈ tree} assoc(h, m)
We add a term for each head-modifier word pair in the tree.
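A minimal sketch of this arc-factored score with the added association term. The feature extractor phi, weight dictionary w, and assoc function are placeholders for the components described on the following slides.

```python
# Arc-factored tree scoring: standard features plus an association term
# for every (head, modifier) pair in the tree.

def score_tree(sent, tree, w, phi, assoc):
    """sent: list of words; tree: set of (head, modifier) index pairs."""
    s = 0.0
    for h, m in tree:
        s += sum(w.get(f, 0.0) for f in phi(sent, h, m))  # standard arc features
        s += assoc(sent[h], sent[m])                       # added lexical association term
    return s
```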
Auto-Parsed Features
[figure: a head-modifier arc in the fragment "the black fox ... will jump over" (DET ADJ NN AUX VERB PREP), with the head h and modifier m marked]
assoc(h, m) = w · φ_lex(h, m)
Features in φ_lex(h, m):
- bin(S(h, m))
- bin(S(h, m)) ∘ dist(h, m)
- bin(S(h, m)) ∘ pos(h) ∘ pos(m)
- bin(S(h, m)) ∘ pos(h) ∘ pos(m) ∘ dist(h, m)
The term S(h, m) measures how well h and m fit together.
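A possible rendering of these feature templates as code. The bin edges and the distance bucketing are illustrative choices, not necessarily the ones used in the experiments.

```python
# Binned association features for a head-modifier pair, following the
# templates above: bin(S), bin(S)+distance, bin(S)+POS, bin(S)+POS+distance.

def bin_score(s, edges=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Map a continuous association score in [0, 1] to a small discrete bin id."""
    return sum(s > e for e in edges)

def phi_lex(h_word, m_word, h_pos, m_pos, dist, S):
    b = bin_score(S(h_word, m_word))
    d = min(abs(dist), 5)                 # bucketed head-modifier distance
    return [
        f"bin={b}",
        f"bin={b}_dist={d}",
        f"bin={b}_hp={h_pos}_mp={m_pos}",
        f"bin={b}_hp={h_pos}_mp={m_pos}_dist={d}",
    ]
```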
Auto-Parsed Features: S(h, m) examples
Example (h, m) pairs, ranked by S(h, m): (officer, chief), (well, as), (year, last), (ate, pizza), (dog, the), (ate, dog), (dog, thirsty), ... (dog, professional), (dog, ate), (USD, 1999)
Estimating S(h, m)
Method 1: Rank Percentile
Let D be a list of (h, m) pairs, sorted according to their frequency. Let R(h, m) be the index of (h, m) in the list.
S_Rank(h, m) = R(h, m) / |D|
Cons: need to store all observed pairs; does not generalize to new pairs; is this really a good metric?
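A small sketch of the rank-percentile estimate computed from pair counts, under one literal reading of the formula above (list sorted by ascending frequency, so frequent pairs get a high percentile). The example counts are made up.

```python
# Rank-percentile association score: S_Rank(h, m) = R(h, m) / |D|.
from collections import Counter

def build_s_rank(pair_counts: Counter):
    ranked = sorted(pair_counts, key=pair_counts.get)      # ascending frequency
    n = len(ranked)
    index = {pair: i + 1 for i, pair in enumerate(ranked)} # R(h, m)

    def s_rank(h, m):
        return index.get((h, m), 0) / n   # 0 for unseen pairs (no evidence)
    return s_rank

counts = Counter({("dog", "the"): 300, ("ate", "pizza"): 120, ("dog", "ate"): 2})
s_rank = build_s_rank(counts)
print(s_rank("ate", "pizza"))   # 2/3: second most frequent of three pairs
```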
Estimating S(h, m)
Method 2: word vectors
Log-bilinear embedding model:
Σ_{(m,h) ∈ D} ln σ(v_m · v_h) + Σ_{m ∈ D_m, h ∈ D_h} ln σ(−v_m · v_h)
(this is the negative-sampling model from word2vec (Mikolov et al., 2013))
Represent each head word h and modifier word m as a vector. Dot products of compatible pairs receive high scores; dot products of bad pairs receive low scores.
S_Vec(h, m) = σ(v_h · v_m)
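A sketch of S_Vec, assuming head and modifier vectors have already been trained with such a model over auto-parsed (h, m) pairs. The vectors below are made up for illustration.

```python
# Vector-based association score: sigmoid of the head/modifier dot product.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

head_vecs = {"ate":   np.array([0.9, -0.2, 0.4])}   # v_h, hypothetical
mod_vecs  = {"pizza": np.array([1.1,  0.1, 0.3])}   # v_m, hypothetical

def s_vec(h, m):
    return float(sigmoid(head_vecs[h] @ mod_vecs[m]))

print(s_vec("ate", "pizza"))   # a compatible pair gets a score well above 0.5
```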
Estimating S(h, m)
Method 3: sigmoid-PMI
Levy and Goldberg (2014) show that the optimal solution for the negative-sampling embedding model of Mikolov et al. is achieved when:
v_h · v_m = PMI(h, m)
Use this as our metric:
S_PMI(h, m) = σ(PMI(h, m)) = p(h, m) / (p(h, m) + p(h)·p(m))
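A sketch of S_PMI computed directly from head-modifier counts, using the identity σ(PMI(h, m)) = p(h, m) / (p(h, m) + p(h)·p(m)); the probability estimates are simple relative frequencies.

```python
# Sigmoid-PMI association score from observed (head, modifier) counts.
from collections import Counter

def build_s_pmi(pair_counts: Counter):
    total = sum(pair_counts.values())
    h_counts, m_counts = Counter(), Counter()
    for (h, m), c in pair_counts.items():
        h_counts[h] += c
        m_counts[m] += c

    def s_pmi(h, m):
        p_hm = pair_counts[(h, m)] / total   # joint probability
        p_h  = h_counts[h] / total           # head marginal
        p_m  = m_counts[m] / total           # modifier marginal
        denom = p_hm + p_h * p_m
        return p_hm / denom if denom > 0 else 0.0
    return s_pmi
```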
Results (1)
                             Dev     Test
Baseline                     91.97   91.57
Base + HM(S_Rank)            92.31   91.81
Base + HM(S_Vec)             92.31   91.92
Base + HM(S_PMI)             92.32   91.96
Base + Brown                 92.16   92.05
Base + Brown + HM(S_PMI)     92.46   92.21
We can do better: use more context.
Auto-Parsed Features (Context)
Instead of word pairs, we look at relations between word triplets.
[figure: the same fragment "the black fox ... will jump over", now with a three-word window m_{-1} m_0 m_{+1} on the modifier side and h_{-1} h_0 h_{+1} on the head side]
Problem: gathering reliable statistics over pairs of trigrams requires an enormous annotated corpus.
Solution: decompose the structure into smaller parts.
Decomposition
Idea from vector-space models: represent each (word, position) pair as a vector:
v_{h_0}, v_{h_{+1}}, v_{h_{-1}}, v_{m_0}, v_{m_{+1}}, v_{m_{-1}}
Model a triplet pair as a dot product of sums:
(v_{h_{-1}} + v_{h_0} + v_{h_{+1}}) · (v_{m_{-1}} + v_{m_0} + v_{m_{+1}})
Expanding the terms, we get:
assoc(h_{-1} h_0 h_{+1}, m_{-1} m_0 m_{+1}) = Σ_{i=-1}^{+1} Σ_{j=-1}^{+1} α_{ij} · assoc_{ij}(h_i, m_j)
Auto-Parsed Features (Context)
[the black fox ... will jump over]
assoc(h_{-1} h_0 h_{+1}, m_{-1} m_0 m_{+1}) =
  α_{-1,-1}·assoc_{-1,-1}(the, will) + α_{-1,0}·assoc_{-1,0}(the, jump) + α_{-1,+1}·assoc_{-1,+1}(the, over) +
  α_{0,-1}·assoc_{0,-1}(black, will) + α_{0,0}·assoc_{0,0}(black, jump) + α_{0,+1}·assoc_{0,+1}(black, over) +
  α_{+1,-1}·assoc_{+1,-1}(fox, will) + α_{+1,0}·assoc_{+1,0}(fox, jump) + α_{+1,+1}·assoc_{+1,+1}(fox, over)
assoc_{ij}(h, m) = w_{ij} · φ^{ij}_lex(h, m)
Features in φ^{ij}_lex(h, m):
- bin(S_{ij}(h, m))
- bin(S_{ij}(h, m)) ∘ dist(h, m)
- bin(S_{ij}(h, m)) ∘ pos(h) ∘ pos(m)
- bin(S_{ij}(h, m)) ∘ pos(h) ∘ pos(m) ∘ dist(h, m)
The terms S_{ij}(h, m) are estimated as before.
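A sketch of how the decomposed triplet association could be computed: nine position-specific pairwise terms, one per (i, j) offset combination. Here assoc_ij and the alpha weights are placeholders for the learned, feature-based components described above; the padding token is an illustrative choice.

```python
# Triplet association as a weighted sum of nine position-specific pairwise terms.

OFFSETS = (-1, 0, 1)

def triplet_assoc(sent, h, m, assoc_ij, alphas):
    """sent: list of words; h, m: head/modifier indices;
    assoc_ij(i, j, hw, mw): pairwise score for offsets (i, j);
    alphas: dict mapping (i, j) to a learned weight."""
    total = 0.0
    for i in OFFSETS:
        for j in OFFSETS:
            hw = sent[h + i] if 0 <= h + i < len(sent) else "*PAD*"
            mw = sent[m + j] if 0 <= m + j < len(sent) else "*PAD*"
            total += alphas[(i, j)] * assoc_ij(i, j, hw, mw)
    return total
```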
Results (2)
                             Dev     Test
Baseline                     91.97   91.57
Base + HM(S_Rank)            92.31   91.81
Base + HM(S_Vec)             92.31   91.92
Base + HM(S_PMI)             92.32   91.96
Base + Brown                 92.16   92.05
Base + Brown + HM(S_PMI)     92.46   92.21
Base + TRIP(S_Rank)          92.31   91.92
Base + TRIP(S_Vec)           92.48   92.26
Base + TRIP(S_PMI)           92.55   92.37
Base + Brown + TRIP(S_PMI)   92.76   92.44
Large improvement in accuracy. First method to improve over Brown clusters. State-of-the-art results for a first-order model.
To summarize
- Semi-supervised dependency parsing: features from auto-parsed data, modeling the interaction between word triplets.
- Ideas inspired by word embeddings... but explicit counts work better for us.
- State-of-the-art results; first method to improve over Brown clusters.
Thank You