A STRUCTURED LEARNING APPROACH TO TEMPORAL RELATION EXTRACTION


Qiang Ning, Zhili Feng, Dan Roth
Computer Science, University of Illinois, Urbana-Champaign & University of Pennsylvania

TOWARDS NATURAL LANGUAGE UNDERSTANDING
1. ...
2. ...
3. ...
4. ...
...
11. Reasoning with respect to Time

UNDERSTANDING TIME IN TEXT
Understanding time is key to understanding events.
- Applications: timeline construction (e.g., news stories, clinical records), time-slot filling, Q&A, causality analysis, pattern discovery, etc.
These applications depend on two fundamental tasks:
1. Time expression extraction and normalization
   - Time that is expressed explicitly
   - "yesterday" -> 2017-09-09
   - "Thursday after labor day" -> 2017-08-31
   - 2 time expressions in every 100 tokens (in TempEval3 datasets)
2. Temporal relation extraction
   - Time that is expressed implicitly
   - "A happens BEFORE/AFTER B"
   - 12 temporal relations in every 100 tokens (in TempEval3 datasets)

GRAPH REPRESENTATION OF TEMPORAL RELATIONS
"In Los Angeles that lesson was brought home today when tons of earth cascaded down a hillside, ripping two houses from their foundations. No one was hurt, but firefighters ordered the evacuation of nearby homes and said they'll monitor the shifting ground until March 23rd."
Five relation types: Before; After; Include; Included; Equal
[Figure: temporal graph over the events ripping, hurt, cascaded, ordered, and monitor, with edges labeled BEFORE or INCLUDED]

CHALLENGE I: STRUCTURE
Structure of a temporal graph [Bramsen et al. 06; Chambers & Jurafsky 08; Do et al. 12]
- Symmetry: A BEFORE B <=> B AFTER A
- Transitivity: A BEFORE B + B BEFORE C => A BEFORE C
Relations are highly interrelated, but existing methods learn models by considering a single pair at a time.
[Figure: "Existing methods" treat ripping vs hurt, ripping vs cascaded, and ripping vs monitor as separate pairs; "Expectation" is a single graph over ripping, hurt, cascaded, ordered, and monitor with BEFORE and INCLUDED edges]

CHALLENGE II: MISSING RELATIONS
Most of the relations are left unannotated.
[Figure: "Ground Truth" vs "Provided Annotation (TempEval3)" graphs over ripping, hurt, cascaded, ordered, and monitor; edges are BEFORE, INCLUDED, or MISSING]
Missing relations arise in three scenarios:
- The annotators did not look at a pair of events (e.g., long distance)
- The annotators could not decide among multiple options
- The annotators disagreed
The annotation task is difficult if done at a single event pair level.
Problems of existing approaches
Addressing both challenges
- Structured prediction
- Dealing with missing relations in the annotation

EXISTING APPROACHES
Local methods [1-4]
- Learn models or design rules that make pairwise decisions between each pair of events
- Global consistency (i.e., symmetry and transitivity) is not enforced
[Figure: events A, B, C; inconsistency may exist in local methods]
Local methods + Global Inference (L+I) [5-7]
- Formulate the problem as an integer linear program (ILP) over the entire graph, on top of pre-learnt local models
- Consistency guaranteed: structural requirements are added as declarative constraints to the ILP
- Performance improved: local decisions may be corrected via global consideration
[Figure: events A, B, C under L+I; consistency is enforced via ILP]
[1] Mani et al., ACL2006
[2] Chambers et al., ACL2007
[3] Bethard, ClearTK-TimeML: TempEval 2013
[4] Laokulrat et al., SEM2013
[5] Bramsen et al., EMNLP2006
[6] Chambers and Jurafsky, EMNLP2008
[7] Do et al., EMNLP2012

CHALLENGE I: CONSISTENT DECISION MAKING IS NOT SUFFICIENT
Neither local methods nor L+I methods account for structural constraints in the learning phase, but information from other events is often necessary.
"tons of earth cascaded down a hillside, ripping two houses ... firefighters ordered the evacuation of nearby homes"
(What's the temporal relation between ripping and ordered? It's difficult to tell.)
- As a result, (ripping, ordered)=before cannot be supported by the local information, and learning from it results in overfitting.
- However, observing that (ripping, ordered)=before actually results from (ripping, cascaded)=included and (cascaded, ordered)=before, rather than from the local text itself, supports better learning.
[Figure: the unknown edge ripping ? ordered is resolved through the path ripping, cascaded, ordered]

PROPOSED APPROACH: INFERENCE-BASED TRAINING
Local Training (Perceptron)
  For each $(x, y)$:
    $\hat{y} = \mathrm{sgn}(w^T x)$
    If $\hat{y} \neq y$: update $w$
- $(x, y)$: feature and label for a single pair of events
- When learning from $(x, y)$, the algorithm is unaware of decisions with respect to other pairs.
IBT (Structured Perceptron)
  For each $(X, Y)$:
    $\hat{Y} = \arg\max_{Y \in C} W^T X$
    If $\hat{Y} \neq Y$: update $W$
- $(X, Y)$: features and labels from a whole document
- $Y \in C$: enforce consistency through constraint $C$.
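The contrast between the two training loops can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the additive update rules are the standard perceptron and structured perceptron updates (the slide only says "update"), and `infer_symmetric`, the feature vectors, and the two-label toy documents are hypothetical stand-ins for the real constrained inference step and features.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def local_perceptron(data, dim, epochs=10):
    """Local training: every (x, y) pair is handled in isolation, y in {-1, +1}."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            y_hat = 1 if dot(w, x) >= 0 else -1
            if y_hat != y:  # standard perceptron update (assumed)
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def predict(W, X, infer):
    """Score every pair of a document, then decode all pairs jointly."""
    scores = [[dot(w_r, x) for w_r in W] for x in X]
    return infer(scores)

def structured_perceptron(docs, n_labels, dim, infer, epochs=10):
    """IBT: one weight vector per label; decoding sees the whole document."""
    W = [[0.0] * dim for _ in range(n_labels)]
    for _ in range(epochs):
        for X, Y in docs:
            Y_hat = predict(W, X, infer)
            if Y_hat != Y:
                for x, y, y_hat in zip(X, Y, Y_hat):
                    if y != y_hat:  # promote gold label, demote predicted label
                        W[y] = [a + b for a, b in zip(W[y], x)]
                        W[y_hat] = [a - b for a, b in zip(W[y_hat], x)]
    return W

def infer_symmetric(scores):
    """Hypothetical toy constraint: a document holds the pairs (A, B) and (B, A),
    labels are 0=before and 1=after, and the two labels must be opposite."""
    candidates = [[0, 1], [1, 0]]
    return max(candidates, key=lambda Y: sum(s[y] for s, y in zip(scores, Y)))

# Toy data (hypothetical): two documents with two event pairs each.
docs = [([[1.0, 0.0], [0.0, 1.0]], [0, 1]),
        ([[0.0, 1.0], [1.0, 0.0]], [1, 0])]
W = structured_perceptron(docs, n_labels=2, dim=2, infer=infer_symmetric)
w = local_perceptron([([1.0, 1.0], 1), ([-1.0, -1.0], -1)], dim=2)
```

The only structural difference is where the prediction comes from: the local loop thresholds a single score, while the structured loop calls a decoder that can only return labelings satisfying the constraint.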

PROPOSED APPROACH: INFERENCE-BASED TRAINING
Inference step
- $E$: event node set; $Y$: temporal label set
- $I_r(ij)$: Boolean variable for event pair $(i, j)$ being relation $r$
- $f_r(ij)$: softmax score of event pair $(i, j)$ being relation $r$
- $\bar{r}$: the reverse of relation $r$
- $r_m$: temporal relations implied by $r_1$ and $r_2$
$$\hat{I} = \arg\max_{I} \sum_{ij \in E} \sum_{r \in Y} f_r(ij)\, I_r(ij)$$
s.t. for all $i, j, k \in E$:
- Uniqueness: $\sum_{r} I_r(ij) = 1$
- Symmetry: $I_r(ij) = I_{\bar{r}}(ji)$
- Generalized transitivity: $I_{r_1}(ij) + I_{r_2}(jk) - \sum_{m} I_{r_m}(ik) \le 1$
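To make the inference step concrete, here is a small sketch that solves the same objective by exhaustive search instead of an ILP solver, which is only feasible for a handful of events. It assumes a reduced label set (before, after, equal), a deliberately partial transitivity table, and made-up scores; none of these come from the slides.

```python
from itertools import permutations, product

LABELS = ("before", "after", "equal")
REVERSE = {"before": "after", "after": "before", "equal": "equal"}

# Partial composition table (assumption): (r1, r2) -> relations allowed for (i, k).
# Combinations that are absent impose no constraint.
IMPLIED = {("before", "before"): {"before"}, ("after", "after"): {"after"}}
for r in LABELS:
    IMPLIED[("equal", r)] = {r}
    IMPLIED[(r, "equal")] = {r}

def is_consistent(labels, n):
    """Generalized transitivity over every ordered triple of events."""
    for i, j, k in permutations(range(n), 3):
        implied = IMPLIED.get((labels[i, j], labels[j, k]))
        if implied is not None and labels[i, k] not in implied:
            return False
    return True

def global_inference(scores, n):
    """scores[(i, j)][r] is the local score of relation r for every pair i < j."""
    pairs = sorted(scores)
    best, best_total = None, float("-inf")
    for choice in product(LABELS, repeat=len(pairs)):  # uniqueness: one label per pair
        labels = {}
        for (i, j), r in zip(pairs, choice):
            labels[i, j] = r
            labels[j, i] = REVERSE[r]                  # symmetry
        if not is_consistent(labels, n):
            continue
        total = sum(scores[p][labels[p]] for p in pairs)
        if total > best_total:
            best, best_total = labels, total
    return {p: best[p] for p in pairs}

# Made-up local scores: the pairwise argmax gives before, before, after,
# which violates transitivity.
scores = {
    (0, 1): {"before": 0.8, "after": 0.1, "equal": 0.1},
    (1, 2): {"before": 0.7, "after": 0.2, "equal": 0.1},
    (0, 2): {"before": 0.4, "after": 0.5, "equal": 0.1},
}
local = {p: max(s, key=s.get) for p, s in scores.items()}
joint = global_inference(scores, n=3)
```

In this example the joint solution flips the weakest local decision, (0, 2), from after to before, because that is the highest-scoring labeling that satisfies all three constraint families.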

PROPOSED APPROACH: INFERENCE-BASED TRAINING
Constraint-Driven Learning (CoDL)
- Makes use of unannotated data
- Chang et al., Guiding semi-supervision with constraint-driven learning. ACL 2007.
- Chang et al., Structured learning with constrained conditional models. Machine Learning, 2012.

RESULTS (CHALLENGE I)
When gold related pairs are known (TE3, Task C, Relation only)

System      Method  Precision  Recall  F1
UTTime [1]  Local   55.6       57.4    56.5
AP          Local   58.0       55.3    56.6
AP+ILP      L+I     62.2       61.1    61.6
SP+ILP      S+I     69.1       65.5    67.2

L+I: enforcing constraints only at decision time. S+I: enforcing constraints during learning.
[1] Laokulrat et al., UTTime: Temporal relation classification using deep syntactic features, SEM2013

HOWEVER, REALISTICALLY
When gold related pairs are NOT known (TE3, Task C)

System       Method  Precision  Recall  F1
ClearTK [1]  Local   37.2       33.1    35.1
AP           Local   35.3       37.1    36.1
AP+ILP       L+I     35.7       35.0    35.3
SP+ILP       S+I     32.4       45.2    37.7

- Performance drops significantly.
- Structured learning is not helping as much as previously in the presence of missing, vague relations.
- Existing methods of handling vague relations are ineffective:
  - Simply add vague to the temporal label set
  - Train a classifier or design rules for vague vs. non-vague
[1] Bethard, ClearTK-TimeML: A minimalist approach to TempEval 2013

CHALLENGE II: MISSING RELATIONS
Most of the relations are left unannotated.
[Figure: "Ground Truth" vs "Provided Annotation (TempEval3)" graphs over ripping, hurt, cascaded, ordered, and monitor; edges are BEFORE, INCLUDED, or MISSING]
The annotation task is difficult if done at a single event pair level.
- Some of the missing relations can be inferred: saturate the graph via symmetry and transitivity.
- The vast majority cannot.
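Saturating a graph via symmetry and transitivity amounts to a fixed-point computation, sketched below. The composition table is a small illustrative subset chosen for this example (it is not the full table used in the work), and the sketch assumes the input annotations are already consistent, so it never has to resolve a conflict.

```python
REVERSE = {"before": "after", "after": "before",
           "includes": "included", "included": "includes", "equal": "equal"}

# Illustrative subset of the composition table (assumption): (r1, r2) -> r
# means that (a, b)=r1 and (b, c)=r2 imply (a, c)=r.
COMPOSE = {
    ("before", "before"): "before",
    ("after", "after"): "after",
    ("includes", "includes"): "includes",
    ("included", "included"): "included",
    ("included", "before"): "before",
    ("included", "after"): "after",
    ("before", "includes"): "before",
    ("after", "includes"): "after",
}

def saturate(edges):
    """Add every relation implied by symmetry or transitivity until nothing changes."""
    closed = dict(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), r in list(closed.items()):          # symmetry
            if (b, a) not in closed:
                closed[b, a] = REVERSE[r]
                changed = True
        for (a, b), r1 in list(closed.items()):         # transitivity
            for (c, d), r2 in list(closed.items()):
                if b == c and a != d and (a, d) not in closed:
                    r = COMPOSE.get((r1, r2))
                    if r is not None:
                        closed[a, d] = r
                        changed = True
    return closed

annotated = {("ripping", "cascaded"): "included",
             ("cascaded", "ordered"): "before"}
closed = saturate(annotated)
```

With the two annotated edges from the running example, the closure recovers (ripping, ordered)=before along with the three reversed edges. Pairs that are not connected through annotated relations stay missing, which is the case the slide says covers the vast majority.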

HANDLING VAGUE RELATIONS
1. Ignore vague labels during training
   - Many vague pairs are not really vague but rather pairs that the annotators failed to look at.
   - The imbalance between vague and non-vague relations makes it hard to learn a good vague classifier.
   - The Vague relation is fundamentally different from other relation types. If (A, B) = BEFORE, then it's always BEFORE regardless of other events. But if (A, B) = VAGUE, the relation can change if more context is provided.
2. Apply post-filtering using KL divergence
   - For each pair, we have a predicted distribution over possible relations.
   - Compute the KL divergence of this distribution with the uniform distribution, and filter out predictions that have a low score:
     $$\delta_i = \sum_{m=1}^{M} f_{r_m}(i) \log\big(M f_{r_m}(i)\big)$$
     where $M$ is the number of labels and $f_r(i)$ is the score of relation $r$ for pair $i$.
   - High similarity to the uniform distribution, $\delta_i < t$, implies an unconfident prediction: change the decision to Vague.
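The post-filtering rule is short enough to write out directly. The sketch below follows the formula on the slide; the threshold value, the label names, and the example distributions are assumptions chosen for illustration.

```python
import math

def kl_from_uniform(probs):
    """delta = sum_m f_m * log(M * f_m), the KL divergence from the uniform distribution.
    Zero-probability terms contribute nothing to the sum."""
    m = len(probs)
    return sum(p * math.log(m * p) for p in probs if p > 0)

def post_filter(scores, threshold):
    """Return 'vague' when the predicted distribution is too close to uniform,
    otherwise the highest-scoring relation."""
    delta = kl_from_uniform(list(scores.values()))
    if delta < threshold:
        return "vague"
    return max(scores, key=scores.get)

# Hypothetical predicted distributions for two event pairs.
confident = {"before": 0.90, "after": 0.05, "includes": 0.02,
             "included": 0.02, "equal": 0.01}
unsure = {"before": 0.2, "after": 0.2, "includes": 0.2,
          "included": 0.2, "equal": 0.2}
t = 0.1  # hypothetical threshold; in practice it would be tuned on held-out data
decisions = [post_filter(confident, t), post_filter(unsure, t)]
```

A uniform distribution gives a divergence of exactly zero, so it is always relabeled as vague for any positive threshold, while a sharply peaked distribution keeps its top label.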

RESULTS (CHALLENGE II)
When gold related pairs are NOT known (TE3, Task C)
Apply the post-filtering method proposed above

System       Method  Precision  Recall  F1
ClearTK [1]  Local   37.2       33.1    35.1
AP           Local   35.3       37.1    36.1
AP+ILP       L+I     35.7       35.0    35.3
SP+ILP       S+I     32.4       45.2    37.7
Applying post-filtering method for vague relations:
SP+ILP       S+I     33.1       49.2    39.6
CoDL+ILP     S+I     35.5       46.5    40.3

[1] Bethard, ClearTK-TimeML: A minimalist approach to TempEval 2013

OVERALL RESULTS
- The TempEval3 dataset is known to suffer from TLINK sparsity issues.
- Timebank-dense is another dataset with much denser TLINK annotations.
- Significant improvement over CAEVO, the previously best system on Timebank-dense.

System       Method  Precision  Recall  F1
ClearTK [1]  Local   46.04      20.90   28.74
CAEVO [2]    L+I     54.17      39.49   45.68
SP+ILP       S+I     45.34      48.68   46.95
CoDL+ILP     S+I     45.57      51.89   48.53

[1] Bethard, ClearTK-TimeML: A minimalist approach to TempEval 2013
[2] Chambers et al., Dense event ordering with a multi-pass architecture. TACL 2014

CONCLUSION
- Identifying temporal relations between events is a highly structured task.
  - This also results in low-quality annotation (vague relations).
- This work shows that:
  - Using structured information during learning is important.
  - The structure can be exploited in an unsupervised way (via CoDL) to further improve results.
  - Vagueness is the result of a lack of information rather than a concrete relation. KL-driven post-filtering is shown to be an effective way to treat vague relations.
- A lot more work is needed on temporal reasoning from text.
Thanks