Better Syntactic Parsing with Lexical-Semantic Features from Auto-parsed Data Yoav Goldberg (actual work by Eliyahu Kiperwasser) ICRI-CI Retreat, May 2015
Language
People use language to communicate. Language is everywhere:
conversations, newspapers, scientific articles, medicine (patient records), patents, law, product reviews, blogs, Facebook, Twitter...
A lot of text. We need to understand what's being said. This is where we come in.
NLP: text → meaning
What does it mean to understand? I focus on the building blocks.
This talk is about syntactic parsing
Syntactic Parsing
Sentences in natural language have structure. Linguists create theories defining these structures:
- the mainstream theory can be quite convoluted
- countless debates regarding many corner cases
- most linguists agree on the basics (the "boring stuff")
- the boring stuff is actually very useful
This talk: Dependency Structures
A syntactic representation in which:
- every word is a node in a tree
- there is a single ROOT node
- there are no non-word nodes other than ROOT
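As a purely illustrative aside, here is one common way to store such a tree in code: one head index per word, with the artificial ROOT at position 0. The sentence, indices, and labels below are made up for the example, not taken from the talk.

```python
# A dependency tree as a head-index array: every word has exactly one head,
# and exactly one word attaches to the artificial ROOT node at index 0.

sentence = ["ROOT", "the", "black", "fox", "will", "jump", "over", "the", "fence"]

# heads[i] = index of the head of word i (entry 0 is a placeholder for ROOT).
heads  = [0, 3, 3, 5, 5, 0, 5, 8, 6]
labels = ["-", "det", "amod", "nsubj", "aux", "root", "prep", "det", "pobj"]

def arcs(heads):
    """Yield the (head, modifier) index pairs encoded by the head array."""
    for m, h in enumerate(heads):
        if m == 0:          # skip the ROOT placeholder itself
            continue
        yield h, m

for h, m in arcs(heads):
    print(f"{sentence[h]:>6} -> {sentence[m]:<6} ({labels[m]})")
```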
Syntactic Parsing The soup, which I expected to be good, was bad
Syntactic Parsing
[figure: dependency parse of "The soup, which I expected to be good, was bad", with arc labels det, subj, rcmod, rel, xcomp, subj, aux, acomp, acomp, root]
Syntactic Parsing The gromp, which I furpled to be drogby, was spujky
Syntactic Parsing
[figure: the same dependency structure over "The gromp, which I furpled to be drogby, was spujky"]
We can go a long way without the words, based on structural cues alone.
Syntactic Parsing
But sometimes words do matter. Compare:
I ate pizza with olives
I ate pizza with friends
The correct analysis depends on the words.
Parsers are created using machine learning, based on a training set of (sentence, tree) pairs. For English, we have about 40,000 such pairs. That is too small to learn word-word interactions.
Semi-supervised learning: unannotated data is cheap. Use a lot of unannotated data to improve lexical coverage.
This talk: improve parsing accuracy using a lot of unannotated text.
Prior Semi-supervised Parsing (State-of-the-art)
Simple Semi-supervised Dependency Parsing (Koo et al., 2008):
- Take a large amount of unannotated text.
- Run a word clustering algorithm to learn word clusters; now each word is associated with a cluster.
- Use the cluster identities as additional features in a supervised parser.
With the Brown clustering algorithm and a good set of cluster-based features, this produces state-of-the-art results.
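A hedged sketch of what such cluster-based features can look like: Brown clustering assigns each word a bit-string, and features typically use prefixes of that string so that nearby clusters share features. The bit-strings, prefix lengths, and feature names below are illustrative, not Koo et al.'s actual templates.

```python
# Illustrative cluster-prefix features for a head-modifier pair.
# Prefixes of the Brown bit-strings act as coarser, shared cluster identities.

brown = {                     # hypothetical cluster bit-strings
    "pizza":   "110100",
    "olives":  "110101",
    "friends": "111001",
    "ate":     "0101",
}

def cluster_features(head, mod, prefix_lens=(4, 6)):
    feats = []
    for k in prefix_lens:
        hc = brown.get(head, "UNK")[:k]   # head cluster prefix of length k
        mc = brown.get(mod, "UNK")[:k]    # modifier cluster prefix of length k
        feats.append(f"hc{k}={hc}_mc{k}={mc}")
    return feats

print(cluster_features("ate", "olives"))
```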
Note: the clustering metric is not related to the parsing task. We take a different approach
Auto-Parsed Data
[diagram: train a model on the parsed (annotated) data; use it to parse a large unannotated corpus, producing auto-parsed data; extract auto-parsed features from that data; train a new model on the parsed data plus these features; use it to predict the test-set annotations]
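A sketch of the pipeline in the diagram. The callables are parameters here because the talk does not name a concrete parser implementation; this is not a real library API.

```python
# Semi-supervised pipeline: baseline parser -> auto-parsed data ->
# association features -> retrained parser.

def semi_supervised_pipeline(train_parser, extract_assoc_features,
                             annotated_treebank, unannotated_corpus):
    # 1. Train a baseline parser on the small annotated treebank.
    base_parser = train_parser(annotated_treebank, extra_features=None)

    # 2. Parse a large unannotated corpus with it ("auto-parsed data").
    auto_parsed = [base_parser.parse(sent) for sent in unannotated_corpus]

    # 3. Collect head-modifier statistics from the auto-parsed trees.
    assoc_features = extract_assoc_features(auto_parsed)

    # 4. Retrain on the same treebank, now with the association features.
    return train_parser(annotated_treebank, extra_features=assoc_features)
```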
Graph-based Parsing
parse(sent) = argmax_{tree ∈ Trees(sent)} score(sent, tree)
score(sent, tree) = Σ_{part ∈ tree} w · φ(sent, part) + Σ_{(h,m) ∈ tree} assoc(h, m)
We add a term for each head-modifier word pair in the tree.
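A minimal sketch of this arc-factored score with the added association term. The feature extractor phi, weight dictionary w, and assoc function are placeholders for the components described on the following slides.

```python
# Arc-factored tree scoring: standard features plus an association term
# for every (head, modifier) pair in the tree.

def score_tree(sent, tree, w, phi, assoc):
    """sent: list of words; tree: set of (head, modifier) index pairs."""
    s = 0.0
    for h, m in tree:
        s += sum(w.get(f, 0.0) for f in phi(sent, h, m))  # standard arc features
        s += assoc(sent[h], sent[m])                       # added lexical association term
    return s
```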
Auto-Parsed Features
[figure: a head-modifier arc in the fragment "the black fox ... will jump over" (DET ADJ NN AUX VERB PREP), with the head h and modifier m marked]
assoc(h, m) = w · φ_lex(h, m)
Features in φ_lex(h, m):
- bin(S(h, m))
- bin(S(h, m)) ∘ dist(h, m)
- bin(S(h, m)) ∘ pos(h) ∘ pos(m)
- bin(S(h, m)) ∘ pos(h) ∘ pos(m) ∘ dist(h, m)
The term S(h, m) measures how well h and m fit together.
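A possible rendering of these feature templates as code. The bin edges and the distance bucketing are illustrative choices, not necessarily the ones used in the experiments.

```python
# Binned association features for a head-modifier pair, following the
# templates above: bin(S), bin(S)+distance, bin(S)+POS, bin(S)+POS+distance.

def bin_score(s, edges=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Map a continuous association score in [0, 1] to a small discrete bin id."""
    return sum(s > e for e in edges)

def phi_lex(h_word, m_word, h_pos, m_pos, dist, S):
    b = bin_score(S(h_word, m_word))
    d = min(abs(dist), 5)                 # bucketed head-modifier distance
    return [
        f"bin={b}",
        f"bin={b}_dist={d}",
        f"bin={b}_hp={h_pos}_mp={m_pos}",
        f"bin={b}_hp={h_pos}_mp={m_pos}_dist={d}",
    ]
```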
Auto-Parsed Features: S(h, m) examples
Example (h, m) pairs, ranked by S(h, m): (officer, chief), (well, as), (year, last), (ate, pizza), (dog, the), (ate, dog), (dog, thirsty), ... (dog, professional), (dog, ate), (USD, 1999)
Estimating S(h, m)
Method 1: Rank Percentile
Let D be a list of (h, m) pairs, sorted according to their frequency. Let R(h, m) be the index of (h, m) in the list.
S_Rank(h, m) = R(h, m) / |D|
Cons: need to store all observed pairs; does not generalize to new pairs; is this really a good metric?
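A small sketch of the rank-percentile estimate computed from pair counts, under one literal reading of the formula above (list sorted by ascending frequency, so frequent pairs get a high percentile). The example counts are made up.

```python
# Rank-percentile association score: S_Rank(h, m) = R(h, m) / |D|.
from collections import Counter

def build_s_rank(pair_counts: Counter):
    ranked = sorted(pair_counts, key=pair_counts.get)      # ascending frequency
    n = len(ranked)
    index = {pair: i + 1 for i, pair in enumerate(ranked)} # R(h, m)

    def s_rank(h, m):
        return index.get((h, m), 0) / n   # 0 for unseen pairs (no evidence)
    return s_rank

counts = Counter({("dog", "the"): 300, ("ate", "pizza"): 120, ("dog", "ate"): 2})
s_rank = build_s_rank(counts)
print(s_rank("ate", "pizza"))   # 2/3: second most frequent of three pairs
```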
Estimating S(h, m)
Method 2: word vectors
Log-bilinear embedding model:
Σ_{(m,h) ∈ D} ln σ(v_m · v_h) + Σ_{m ∈ D_m, h ∈ D_h} ln σ(−v_m · v_h)
(this is the negative-sampling model from word2vec (Mikolov et al., 2013))
Represent each head word h and modifier word m as a vector. Dot products of compatible pairs receive high scores; dot products of bad pairs receive low scores.
S_Vec(h, m) = σ(v_h · v_m)
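A sketch of S_Vec, assuming head and modifier vectors have already been trained with such a model over auto-parsed (h, m) pairs. The vectors below are made up for illustration.

```python
# Vector-based association score: sigmoid of the head/modifier dot product.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

head_vecs = {"ate":   np.array([0.9, -0.2, 0.4])}   # v_h, hypothetical
mod_vecs  = {"pizza": np.array([1.1,  0.1, 0.3])}   # v_m, hypothetical

def s_vec(h, m):
    return float(sigmoid(head_vecs[h] @ mod_vecs[m]))

print(s_vec("ate", "pizza"))   # a compatible pair gets a score well above 0.5
```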
Estimating S(h, m)
Method 3: sigmoid-PMI
Levy and Goldberg (2014) show that the optimal solution for the negative-sampling embedding model of Mikolov et al. is achieved when:
v_h · v_m = PMI(h, m)
Use this as our metric:
S_PMI(h, m) = σ(PMI(h, m)) = p(h, m) / (p(h, m) + p(h)·p(m))
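A sketch of S_PMI computed directly from head-modifier counts, using the identity σ(PMI(h, m)) = p(h, m) / (p(h, m) + p(h)·p(m)); the probability estimates are simple relative frequencies.

```python
# Sigmoid-PMI association score from observed (head, modifier) counts.
from collections import Counter

def build_s_pmi(pair_counts: Counter):
    total = sum(pair_counts.values())
    h_counts, m_counts = Counter(), Counter()
    for (h, m), c in pair_counts.items():
        h_counts[h] += c
        m_counts[m] += c

    def s_pmi(h, m):
        p_hm = pair_counts[(h, m)] / total   # joint probability
        p_h  = h_counts[h] / total           # head marginal
        p_m  = m_counts[m] / total           # modifier marginal
        denom = p_hm + p_h * p_m
        return p_hm / denom if denom > 0 else 0.0
    return s_pmi
```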
Results (1)
                             Dev     Test
Baseline                     91.97   91.57
Base + HM(S_Rank)            92.31   91.81
Base + HM(S_Vec)             92.31   91.92
Base + HM(S_PMI)             92.32   91.96
Base + Brown                 92.16   92.05
Base + Brown + HM(S_PMI)     92.46   92.21
We can do better: use more context.
Auto-Parsed Features (Context)
Instead of word pairs, we look at relations between word triplets.
[figure: the same fragment "the black fox ... will jump over", now with a three-word window m_{-1} m_0 m_{+1} on the modifier side and h_{-1} h_0 h_{+1} on the head side]
Problem: gathering reliable statistics over pairs of trigrams requires an enormous annotated corpus.
Solution: decompose the structure into smaller parts.
Decomposition
Idea from vector-space models: represent each (word, position) pair as a vector:
v_{h_0}, v_{h_{+1}}, v_{h_{-1}}, v_{m_0}, v_{m_{+1}}, v_{m_{-1}}
Model a triplet pair as a dot product of sums:
(v_{h_{-1}} + v_{h_0} + v_{h_{+1}}) · (v_{m_{-1}} + v_{m_0} + v_{m_{+1}})
Expanding the terms, we get:
assoc(h_{-1} h_0 h_{+1}, m_{-1} m_0 m_{+1}) = Σ_{i=-1}^{+1} Σ_{j=-1}^{+1} α_{ij} · assoc_{ij}(h_i, m_j)
Auto-Parsed Features (Context)
[the black fox ... will jump over]
assoc(h_{-1} h_0 h_{+1}, m_{-1} m_0 m_{+1}) =
  α_{-1,-1}·assoc_{-1,-1}(the, will) + α_{-1,0}·assoc_{-1,0}(the, jump) + α_{-1,+1}·assoc_{-1,+1}(the, over) +
  α_{0,-1}·assoc_{0,-1}(black, will) + α_{0,0}·assoc_{0,0}(black, jump) + α_{0,+1}·assoc_{0,+1}(black, over) +
  α_{+1,-1}·assoc_{+1,-1}(fox, will) + α_{+1,0}·assoc_{+1,0}(fox, jump) + α_{+1,+1}·assoc_{+1,+1}(fox, over)
assoc_{ij}(h, m) = w_{ij} · φ^{ij}_lex(h, m)
Features in φ^{ij}_lex(h, m):
- bin(S_{ij}(h, m))
- bin(S_{ij}(h, m)) ∘ dist(h, m)
- bin(S_{ij}(h, m)) ∘ pos(h) ∘ pos(m)
- bin(S_{ij}(h, m)) ∘ pos(h) ∘ pos(m) ∘ dist(h, m)
The terms S_{ij}(h, m) are estimated as before.
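A sketch of how the decomposed triplet association could be computed: nine position-specific pairwise terms, one per (i, j) offset combination. Here assoc_ij and the alpha weights are placeholders for the learned, feature-based components described above; the padding token is an illustrative choice.

```python
# Triplet association as a weighted sum of nine position-specific pairwise terms.

OFFSETS = (-1, 0, 1)

def triplet_assoc(sent, h, m, assoc_ij, alphas):
    """sent: list of words; h, m: head/modifier indices;
    assoc_ij(i, j, hw, mw): pairwise score for offsets (i, j);
    alphas: dict mapping (i, j) to a learned weight."""
    total = 0.0
    for i in OFFSETS:
        for j in OFFSETS:
            hw = sent[h + i] if 0 <= h + i < len(sent) else "*PAD*"
            mw = sent[m + j] if 0 <= m + j < len(sent) else "*PAD*"
            total += alphas[(i, j)] * assoc_ij(i, j, hw, mw)
    return total
```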
Results (2)
                             Dev     Test
Baseline                     91.97   91.57
Base + HM(S_Rank)            92.31   91.81
Base + HM(S_Vec)             92.31   91.92
Base + HM(S_PMI)             92.32   91.96
Base + Brown                 92.16   92.05
Base + Brown + HM(S_PMI)     92.46   92.21
Base + TRIP(S_Rank)          92.31   91.92
Base + TRIP(S_Vec)           92.48   92.26
Base + TRIP(S_PMI)           92.55   92.37
Base + Brown + TRIP(S_PMI)   92.76   92.44
Large improvement in accuracy. First method to improve over Brown clusters. State-of-the-art results for a first-order model.
To summarize
- Semi-supervised dependency parsing: features from auto-parsed data, modeling the interaction between word triplets.
- Ideas inspired by word embeddings... but explicit counts work better for us.
- State-of-the-art results; first method to improve over Brown clusters.
Thank You