Better Syntactic Parsing with Lexical-Semantic Features from Auto-parsed Data
Yoav Goldberg (actual work by Eliyahu Kiperwasser)
ICRI-CI Retreat, May 2015

Language. People use language to communicate, and language is everywhere: conversations, newspapers, scientific articles, medicine (patient records), patents, law, product reviews, blogs, Facebook, Twitter...

A lot of text. We need to understand what's being said. This is where we come in.

NLP: text → meaning. What does it mean to understand? I focus on the building blocks.

This talk is about syntactic parsing

Syntactic Parsing
- Sentences in natural language have structure
- Linguists create theories defining these structures
- The mainstream theory can be quite convoluted
- There are countless debates regarding many corner cases
- But most linguists agree on the basics ("the boring stuff")
- And the boring stuff is actually very useful

This talk - Dependency Structures. A syntactic representation in which:
- every word is a node in a tree
- there is a single ROOT node
- there are no non-word nodes other than ROOT
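
To make this concrete, here is a minimal sketch (our own illustration, not code from the talk) of a dependency tree stored as head indices and arc labels:

```python
sentence = ["The", "soup", "was", "bad"]
# heads[i] is the 1-based index of the head of word i+1; 0 denotes ROOT.
heads = [2, 3, 0, 3]                       # The->soup, soup->was, was->ROOT, bad->was
labels = ["det", "subj", "root", "acomp"]

for i, word in enumerate(sentence):
    head = "ROOT" if heads[i] == 0 else sentence[heads[i] - 1]
    print(f"{word} --{labels[i]}--> {head}")
```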

Syntactic Parsing The soup, which I expected to be good, was bad

Syntactic Parsing. [Dependency tree over the sentence, with arcs labeled det, subj, rcmod, rel, xcomp, aux, acomp, root] The soup, which I expected to be good, was bad

Syntactic Parsing The gromp, which I furpled to be drogby, was spujky

Syntactic Parsing. [The same dependency tree, with the same arc labels, over the nonsense sentence] The gromp, which I furpled to be drogby, was spujky

We can go a long way without the words, based on structural cues alone.

Syntactic Parsing. But sometimes words do matter. Compare: "I ate pizza with olives" vs. "I ate pizza with friends"; the correct analysis depends on the words.

Parsers are created using machine learning, based on a training set of (sentence, tree) pairs. For English we have about 40,000 such pairs. That is too small to learn word-word interactions. Semi-supervised learning: unannotated data is cheap, so use a lot of unannotated data to improve lexical coverage.

This talk: improve parsing accuracy using a lot of unannotated text.

Prior Semi-supervised Parsing. State of the art: Simple Semi-supervised Dependency Parsing (Koo et al., 2008).
- Take a large amount of unannotated text.
- Use a word clustering algorithm to learn word clusters; now each word is associated with a cluster.
- Use cluster identities as additional features in a supervised parser.
When using the Brown clustering algorithm, with a good set of cluster-based features, this produces state-of-the-art results.
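
For concreteness, a minimal sketch of Brown-cluster features in the style of Koo et al. (our own illustration; the feature-string format and prefix lengths are assumptions):

```python
# brown[word] is a Brown cluster id, conventionally a bit string such
# as '110100'; short prefixes act as coarse, POS-like word classes.
def cluster_features(h: str, m: str, brown: dict) -> list:
    feats = []
    for prefix in (4, 6):
        ch = brown.get(h, "UNK")[:prefix]
        cm = brown.get(m, "UNK")[:prefix]
        feats.append(f"HC{prefix}={ch}|MC{prefix}={cm}")
    return feats
```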

Note: the clustering metric is not related to the parsing task. We take a different approach.

Auto-Parsed Data. [Pipeline diagram: train a model on the annotated (parsed) data; use it to parse a large unannotated corpus, yielding auto-parsed data; extract auto-parsed features from that data; train again with these features and predict the test-set annotations.]
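
A minimal sketch of this pipeline under our own naming assumptions (train_parser, extract_assoc_features, and the parser API are hypothetical placeholders, not the talk's code):

```python
def run_pipeline(annotated_treebank, unannotated_sentences, test_sentences):
    # 1. Train a baseline parser on the annotated treebank.
    baseline = train_parser(annotated_treebank, extra_features=None)

    # 2. Parse a large unannotated corpus with the baseline parser.
    auto_parsed = [baseline.parse(sent) for sent in unannotated_sentences]

    # 3. Collect lexical association statistics from the auto-parsed trees.
    assoc_stats = extract_assoc_features(auto_parsed)

    # 4. Retrain the parser with the association features available.
    final = train_parser(annotated_treebank, extra_features=assoc_stats)

    # 5. Predict annotations for the test set.
    return [final.parse(sent) for sent in test_sentences]
```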

Graph-based Parsing

$$\mathrm{parse}(sent) = \operatorname*{argmax}_{tree \in \mathrm{Trees}(sent)} \mathrm{score}(sent, tree)$$

$$\mathrm{score}(sent, tree) = \sum_{part \in tree} w \cdot \phi(sent, part) \;+\; \sum_{(h,m) \in tree} \mathrm{assoc}(h, m)$$

We add a term for each head-modifier word pair in the tree.
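
A small sketch of the extended arc-factored scorer (our illustration; we assume a candidate tree is given as a list of (head, modifier) arcs and that the feature score and assoc function are supplied):

```python
from typing import Callable, List, Tuple

Arc = Tuple[int, int]  # (head position, modifier position)

def tree_score(arcs: List[Arc],
               arc_feature_score: Callable[[Arc], float],
               assoc: Callable[[int, int], float]) -> float:
    """Standard first-order feature score plus one assoc(h, m)
    term per head-modifier pair in the candidate tree."""
    total = 0.0
    for (h, m) in arcs:
        total += arc_feature_score((h, m))  # w . phi(sent, part)
        total += assoc(h, m)                # lexical association term
    return total
```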

Auto-Parsed Features. [Figure: "the black fox ... will jump over", tagged DET ADJ NN AUX VERB PREP, with m marking the modifier word and h the head word]

$$\mathrm{assoc}(h, m) = w \cdot \phi_{lex}(h, m)$$

Features in φ_lex(h, m):
- bin(S(h, m))
- bin(S(h, m)) ∘ dist(h, m)
- bin(S(h, m)) ∘ pos(h) ∘ pos(m)
- bin(S(h, m)) ∘ pos(h) ∘ pos(m) ∘ dist(h, m)

The term S(h, m) measures how well h and m fit together.
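
A sketch of how these binned feature templates might be instantiated (the bin boundaries and feature-string format are our own assumptions):

```python
def binned(score: float) -> str:
    """Discretize a continuous association score into a coarse bin."""
    for threshold in (0.1, 0.25, 0.5, 0.75, 0.9):
        if score < threshold:
            return f"<{threshold}"
    return ">=0.9"

def lex_features(s_hm: float, pos_h: str, pos_m: str, dist: int) -> list:
    b = binned(s_hm)
    return [
        f"BIN={b}",
        f"BIN={b}|DIST={dist}",
        f"BIN={b}|POS={pos_h},{pos_m}",
        f"BIN={b}|POS={pos_h},{pos_m}|DIST={dist}",
    ]

# e.g. lex_features(0.8, "VERB", "NN", 3)
```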

Auto-Parsed Features. S(h, m) examples: (officer, chief), (well, as), (year, last), (ate, pizza), (dog, the), (ate, dog), (dog, thirsty), ..., (dog, professional), (dog, ate), (USD, 1999)

Estimating S(h,m)

Estimating S(h, m). Method 1: rank percentile. Let D be a list of (h, m) pairs, sorted according to their frequency, and let R(h, m) be the index of (h, m) in the list.

$$S_{Rank}(h, m) = \frac{R(h, m)}{|D|}$$

Cons: need to store all observed pairs; does not generalize to new pairs; is this really a good metric?
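
A minimal sketch of the rank-percentile score (our illustration; the slide does not fix the sort direction, so we assume least-to-most frequent, giving frequent pairs a high percentile):

```python
from collections import Counter

def build_s_rank(pair_counts: Counter):
    # Sort pairs from least to most frequent (our assumption), so more
    # frequent pairs receive a higher percentile.
    ranked = sorted(pair_counts, key=pair_counts.get)
    rank = {pair: i + 1 for i, pair in enumerate(ranked)}
    n = len(ranked)

    def s_rank(h, m):
        return rank.get((h, m), 0) / n  # unseen pairs get 0
    return s_rank
```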

Estimating S(h, m). Method 2: word vectors. Log-bilinear embedding model:

$$\sum_{(m,h) \in D} \ln \sigma(v_m \cdot v_h) \;+\; \sum_{m \sim D_m,\, h \sim D_h} \ln \sigma(-v_m \cdot v_h)$$

(this is the negative-sampling model from word2vec; Mikolov et al., 2013). Represent each head word h and modifier word m as a vector; dot products of compatible pairs receive high scores, dot products of bad pairs receive low scores.

$$S_{Vec}(h, m) = \sigma(v_h \cdot v_m)$$
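
A sketch of computing S_Vec from trained vectors (our illustration; v_head and v_mod are assumed lookup tables from word to vector, trained with the objective above):

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def s_vec(h: str, m: str, v_head: dict, v_mod: dict) -> float:
    if h not in v_head or m not in v_mod:
        return 0.5  # our choice of back-off for unseen words
    return sigmoid(float(np.dot(v_head[h], v_mod[m])))
```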

Estimating S(h, m). Method 3: sigmoid-PMI. Levy and Goldberg (2014) show that the optimal solution for the negative-sampling embedding model of Mikolov et al. is achieved when:

$$v_h \cdot v_m = \mathrm{PMI}(h, m)$$

Use this as our metric:

$$S_{PMI}(h, m) = \sigma(\mathrm{PMI}(h, m)) = \frac{p(h, m)}{p(h, m) + p(h)\,p(m)}$$
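
A sketch of S_PMI computed directly from counts over the auto-parsed data, with no embedding training (our illustration; the count tables are assumed given):

```python
from collections import Counter

def build_s_pmi(pair_counts: Counter, h_counts: Counter, m_counts: Counter):
    total_pairs = sum(pair_counts.values())
    total_h = sum(h_counts.values())
    total_m = sum(m_counts.values())

    def s_pmi(h, m):
        # sigma(PMI(h, m)) simplifies to p(h,m) / (p(h,m) + p(h) p(m)).
        p_hm = pair_counts[(h, m)] / total_pairs
        p_h = h_counts[h] / total_h
        p_m = m_counts[m] / total_m
        denom = p_hm + p_h * p_m
        return p_hm / denom if denom > 0 else 0.0
    return s_pmi
```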

Results (1)

                          Dev     Test
Baseline                  91.97   91.57
Base + HM(S_Rank)         92.31   91.81
Base + HM(S_Vec)          92.31   91.92
Base + HM(S_PMI)          92.32   91.96
Base + Brown              92.16   92.05
Base + Brown + HM(S_PMI)  92.46   92.21

We can do better: use more context.

Auto-Parsed Features (Context). Instead of word pairs, we look at relations between word triplets. [Figure: trigram windows m_{-1} m_0 m_{+1} and h_{-1} h_0 h_{+1} over "the black fox ... will jump over"] Problem: gathering reliable statistics over pairs of trigrams requires an enormous annotated corpus. Solution: decompose the structure into smaller parts.

Decomposition. Idea from vector-space models: represent each (word, position) pair as a vector: v_{h_0}, v_{h_{+1}}, v_{h_{-1}}, v_{m_0}, v_{m_{+1}}, v_{m_{-1}}. Model a triplet pair as a dot product of sums:

$$(v_{h_{-1}} + v_{h_0} + v_{h_{+1}}) \cdot (v_{m_{-1}} + v_{m_0} + v_{m_{+1}})$$

Expanding the terms, we get:

$$\mathrm{assoc}(h_{-1} h_0 h_{+1},\ m_{-1} m_0 m_{+1}) = \sum_{i=-1}^{1} \sum_{j=-1}^{1} \alpha_{ij}\, \mathrm{assoc}_{ij}(h_i, m_j)$$

Auto-Parsed Features (Context). [Figure: trigram windows m_{-1} m_0 m_{+1} and h_{-1} h_0 h_{+1} over "the black fox ... will jump over"]

assoc(h_{-1} h_0 h_{+1}, m_{-1} m_0 m_{+1}) =
α_{-1,-1} assoc_{-1,-1}(the, will) + α_{-1,0} assoc_{-1,0}(the, jump) + α_{-1,+1} assoc_{-1,+1}(the, over) +
α_{0,-1} assoc_{0,-1}(black, will) + α_{0,0} assoc_{0,0}(black, jump) + α_{0,+1} assoc_{0,+1}(black, over) +
α_{+1,-1} assoc_{+1,-1}(fox, will) + α_{+1,0} assoc_{+1,0}(fox, jump) + α_{+1,+1} assoc_{+1,+1}(fox, over)
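
A sketch of this decomposed triplet score (our illustration; assoc_ij and the alpha weights are assumed given, and in the parser the weights are learned):

```python
OFFSETS = (-1, 0, 1)

def triplet_assoc(h_trigram: dict, m_trigram: dict, assoc_ij, alpha) -> float:
    """h_trigram/m_trigram map offsets to words, e.g.
    {-1: "will", 0: "jump", 1: "over"}; the score is a weighted
    sum over the nine cross-window word pairs."""
    total = 0.0
    for i in OFFSETS:
        for j in OFFSETS:
            total += alpha[(i, j)] * assoc_ij(i, j, h_trigram[i], m_trigram[j])
    return total
```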

$$\mathrm{assoc}_{ij}(h, m) = w_{ij} \cdot \phi^{ij}_{lex}(h, m)$$

Features in φ^ij_lex(h, m):
- bin(S_ij(h, m))
- bin(S_ij(h, m)) ∘ dist(h, m)
- bin(S_ij(h, m)) ∘ pos(h) ∘ pos(m)
- bin(S_ij(h, m)) ∘ pos(h) ∘ pos(m) ∘ dist(h, m)

The terms S_ij(h, m) are estimated as before.

Results (2)

                            Dev     Test
Baseline                    91.97   91.57
Base + HM(S_Rank)           92.31   91.81
Base + HM(S_Vec)            92.31   91.92
Base + HM(S_PMI)            92.32   91.96
Base + Brown                92.16   92.05
Base + Brown + HM(S_PMI)    92.46   92.21
Base + TRIP(S_Rank)         92.31   91.92
Base + TRIP(S_Vec)          92.48   92.26
Base + TRIP(S_PMI)          92.55   92.37
Base + Brown + TRIP(S_PMI)  92.76   92.44

- Large improvement in accuracy
- First method to improve over Brown clusters
- State-of-the-art results for a first-order model

To summarize:
- Semi-supervised dependency parsing with features from auto-parsed data
- Modeling the interaction between word triplets
- Ideas inspired by word embeddings... but explicit counts work better for us
- State-of-the-art results: the first method to improve over Brown clusters

Thank You