CS674 NLP. Information Extraction: Acquiring extraction patterns. Claire Cardie, Cornell University


Information extraction
- Acquiring extraction patterns
- Learning approaches
- Semi-automatic methods for extraction from unstructured text
- Fully automatic methods for extraction from structured text
- Finite-state methods

IE system: input

SAN SALVADOR, 15 JAN 90 (ACAN-EFE) -- [TEXT] ARMANDO CALDERON SOL, PRESIDENT OF THE NATIONALIST REPUBLICAN ALLIANCE (ARENA), THE RULING SALVADORAN PARTY, TODAY CALLED FOR AN INVESTIGATION INTO ANY POSSIBLE CONNECTION BETWEEN THE MILITARY PERSONNEL IMPLICATED IN THE ASSASSINATION OF JESUIT PRIESTS.

"IT IS SOMETHING SO HORRENDOUS, SO MONSTROUS, THAT WE MUST INVESTIGATE THE POSSIBILITY THAT THE FMLN (FARABUNDO MARTI NATIONAL LIBERATION FRONT) STAGED THIS ASSASSINATION TO DISCREDIT THE GOVERNMENT," CALDERON SOL SAID.

SALVADORAN PRESIDENT ALFREDO CRISTIANI IMPLICATED FOUR OFFICERS, INCLUDING ONE COLONEL, AND FIVE MEMBERS OF THE ARMED FORCES IN THE ASSASSINATION OF SIX JESUIT PRIESTS AND TWO WOMEN ON 16 NOVEMBER AT THE CENTRAL AMERICAN UNIVERSITY.

IE system: output

 1. DATE                      - 15 JAN 90
 2. LOCATION                  EL SALVADOR: CENTRAL AMERICAN UNIVERSITY
 3. TYPE                      MURDER
 4. STAGE OF EXECUTION        ACCOMPLISHED
 5. INCIDENT CATEGORY         TERRORIST ACT
 6. PERP: INDIVIDUAL ID       "FOUR OFFICERS"
                              "ONE COLONEL"
                              "FIVE MEMBERS OF THE ARMED FORCES"
 7. PERP: ORGANIZATION ID     "ARMED FORCES", "FMLN"
 8. PERP: CONFIDENCE          REPORTED AS FACT
 9. HUM TGT: DESCRIPTION      "JESUIT PRIESTS"
                              "WOMEN"
10. HUM TGT: TYPE             CIVILIAN: "JESUIT PRIESTS"
                              CIVILIAN: "WOMEN"
11. HUM TGT: NUMBER           6: "JESUIT PRIESTS"
                              2: "WOMEN"
12. EFFECT OF INCIDENT        DEATH: "JESUIT PRIESTS"
                              DEATH: "WOMEN"

Issues for learning extraction patterns
- Training data is difficult to obtain
  - IE answer keys provide some supervisory information (the string to be extracted and its label), but often not enough
  - No direct means for learning set fills
- Training examples usually encode the output of earlier levels of syntactic and semantic analysis
  - No standard training set available
  - When these preprocessing components change, examples must be regenerated
- Standard off-the-shelf learning algorithms tend to work less well than those specifically tailored to the task

Learning IE patterns from examples
- Goal
  - Given a training set of documents paired with human-produced filled extraction templates [answer keys],
  - learn extraction patterns for each slot using an appropriate machine learning algorithm.
- Options
  - Memorize the fillers of each slot
  - Generalize the fillers using
    - p-o-s tags?
    - phrase structure (NP, V) and grammatical roles (SUBJ, OBJ)?
    - semantic categories?

Learning IE patterns
- Methods vary with respect to
  - The model class of pattern learned (e.g. lexically based regular expression, syntactic-semantic pattern)
  - Training corpus requirements
  - Amount and type of human feedback required
  - Degree of pre-processing necessary
  - Background knowledge presumed

Autoslog [Riloff 1993]
- Learns syntactico-semantic patterns (called concept nodes)
- (from Cardie [1997])

Autoslog algorithm
- Noun phrase extraction only
- Relies on a small set of pattern templates
    <active-voice-verb> <direct object>=<target-np>
    <subject>=<target-np> <active-voice-verb>
    <subject>=<target-np> <passive-voice-verb>
    <passive-voice-verb> by <object>=<target-np>
- The templates are domain-independent, so they require little modification when switching domains
- Requires a partial parser
- Assumes the semantic category(ies) for each slot are known, and all potential slot fillers can be tested w.r.t. them

Autoslog algorithm
1. Find the sentence from which the noun phrase originated.
2. Present the sentence to the partial parser.
3. Apply the pattern templates in order.
4. When a pattern applies, generate a concept node definition from the matched constituent, its context, the slot type (from the answer key), and the (predefined) semantic class for the filler.

Learned terrorism patterns
    <victim> was murdered
    <perpetrator> bombed
    <perpetrator> attempted to kill
    was aimed at <target>

Natural disasters patterns
- <subject> = disaster-event (earthquake) registered (active)
- registered (active) <direct obj> = magnitude
    "Yesterday's earthquake registered 6.9 on the Richter scale."
- measuring (gerund) <direct obj> = magnitude
    "measuring 6.9"
- aid (noun) in/to/for (prep) <obj> = disaster-event-location/victim
    "sending medical aid to Afghanistan"
    "sending medical aid to earthquake victims"
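The template-matching step of the algorithm above can be sketched in a few lines. This is a minimal illustration, not Riloff's implementation: the `Clause` record stands in for the partial parser's output, and its field names, the `propose_pattern` function, the dictionary layout of the concept node, and the example sentence are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clause:
    """Hypothetical partial-parser output for one clause."""
    subject: Optional[str]
    verb: str
    voice: str                        # "active" or "passive"
    direct_object: Optional[str] = None
    by_object: Optional[str] = None   # NP inside a "by" prepositional phrase


# The four templates listed in the text, tried in that order.
# Each entry: (required voice, clause role holding the target NP, pattern layout).
TEMPLATES = [
    ("active", "direct_object", "{verb} <{slot}>"),
    ("active", "subject", "<{slot}> {verb}"),
    ("passive", "subject", "<{slot}> {verb}"),
    ("passive", "by_object", "{verb} by <{slot}>"),
]


def propose_pattern(clause, target_np, slot, semantic_class):
    """Return a concept-node-like dict for the first template that applies."""
    for voice, role, layout in TEMPLATES:
        if clause.voice == voice and getattr(clause, role) == target_np:
            return {
                "trigger": clause.verb,
                "voice": voice,
                "role": role,
                "slot": slot,                      # slot type from the answer key
                "semantic_class": semantic_class,  # predefined class for the filler
                "pattern": layout.format(verb=clause.verb, slot=slot),
            }
    return None  # no template applies; nothing is proposed


# Hypothetical sentence: "THE MAYOR was murdered", answer key says victim = "THE MAYOR".
clause = Clause(subject="THE MAYOR", verb="was murdered", voice="passive")
node = propose_pattern(clause, "THE MAYOR", "victim", "HUMAN")
print(node["pattern"])
```

Because the first applicable template wins, the order of `TEMPLATES` matters; a proposed node would still need the human review discussed in the slides that follow.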

Advantages and Disadvantages
- Learns bad patterns as well as good patterns
  - Too general (e.g. triggered by "is" or "are" or by verbs not tied to the domain)
  - Too specific
  - Just plain wrong
    - Parsing errors
    - Target NPs occur in a prepositional phrase and Autoslog can't determine the trigger (e.g. is it the preceding verb or the preceding NP?)
- Requires that a person review the proposed extraction patterns, discarding bad ones
- No computational linguist needed (?)
- Reduced human effort from 1200-1500 hours to ~4.5 hours
- F-measure dropped from 50.5 to 48.7 (for one test set); from 41.9 to 41.8 (for a second test set)

Autoslog-TS
- Largely unsupervised
- Two sets of documents: relevant, not relevant
- Apply pattern templates to extract every NP in the texts
- Compute relevance rate for each pattern i:
    Pr(relevant text | text contains i) = freq of i in relevant texts / freq of i in corpus
- Sort patterns according to relevance rate and frequency:
    relevance rate * log(freq)

Autoslog-TS: learning extraction patterns
- Human review of learned patterns required
- Also requires labeling the semantic category of the extracted slot filler
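The Autoslog-TS ranking step is simple arithmetic, and a small sketch may make it concrete. Assumptions made here: each document is represented as the list of patterns that fired in it (one entry per firing), the function name `rank_patterns` and the toy data are hypothetical, and the logarithm is taken base 2 since the slide only says "log".

```python
import math
from collections import Counter


def rank_patterns(relevant_docs, irrelevant_docs):
    """Score each pattern by relevance rate * log2(frequency), best first.

    Each document is a list of the patterns that fired in it.
    """
    rel_freq = Counter(p for doc in relevant_docs for p in doc)
    total_freq = rel_freq + Counter(p for doc in irrelevant_docs for p in doc)

    scored = []
    for pattern, freq in total_freq.items():
        relevance_rate = rel_freq[pattern] / freq   # Pr(relevant | contains pattern)
        score = relevance_rate * math.log2(freq)    # log base is an assumption
        scored.append((pattern, relevance_rate, score))
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Toy corpus (hypothetical pattern firings).
relevant = [
    ["<victim> was murdered", "<perpetrator> bombed", "<subject> is"],
    ["<victim> was murdered", "<perpetrator> bombed"],
]
irrelevant = [
    ["<subject> is", "<subject> is"],
    ["<subject> is", "<perpetrator> bombed"],
]

ranked = rank_patterns(relevant, irrelevant)
for pattern, rate, score in ranked:
    print(f"{score:6.3f}  rate={rate:.2f}  {pattern}")
```

The frequency term rewards patterns that fire often, so a pattern seen once scores zero however relevant it is, while a very general pattern such as `<subject> is` is held down by its low relevance rate.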

Information extraction
- Acquiring extraction patterns
- Learning approaches
- Semi-automatic methods for extraction from unstructured text
- Fully automatic methods for extraction from structured text
- Finite-state methods

Covering algorithms
- E.g. Crystal [Soderland et al. 1995]
- Allows for more complicated patterns
  - Can test the target NP or any constituent in its context for
    - presence of any word or sequence of words
    - semantic class of heads or modifiers
- Covering algorithm: successively generalizes the extraction patterns until the generalization produces errors
  1. Generate the most specific pattern possible for every phrase to be extracted in the training corpus.
  2. For each pattern, P, find the most similar pattern P' and relax the constraints of each just enough to unify P and P'.
  3. Test the new extraction pattern E against the training corpus. If its error rate is < threshold T, add E to the set of patterns, replacing P and P'.
  4. Repeat the process on E until the error tolerance is exceeded.
  5. Move on to the next pattern, P, in the original set.

Extraction patterns for semistructured text
- If extracting from automatically generated web pages, simple regex patterns usually work.
- Specify an item to extract for a slot using a regular expression pattern.
    Price pattern: \$\d+(\.\d{2})?\b
- May require a preceding (pre-filler) pattern to identify the proper context.
    Amazon list price:
      Pre-filler pattern: <b>list Price:</b> <span class=listprice>
      Filler pattern: \$\d+(\.\d{2})?\b
- May require a succeeding (post-filler) pattern to identify the end of the filler.
    Amazon list price:
      Pre-filler pattern: <b>list Price:</b> <span class=listprice>
      Filler pattern: .+
      Post-filler pattern: </span>
(slides 47-51: Ray Mooney)

Simple template extraction
- Extract slots in order, starting the search for the filler of slot n+1 where the filler for slot n ended. Assumes slots always appear in a fixed order.
    Title
    Author
    List price
- Alternatively, make patterns specific enough to identify each filler when always starting from the beginning of the document.
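The in-order strategy for simple template extraction can be sketched as below. The page fragment, the slot names, and the three patterns are hypothetical and chosen only to show why the starting position matters; each pattern captures its filler in group 1.

```python
import re

# Hypothetical page fragment; note the unrelated "$5.00" before the title.
PAGE = (
    "Save $5.00 today! <h1>An Example Book</h1> <i>by A. N. Author</i> "
    "<b>List Price:</b> <span class=listprice>$24.95</span>"
)

# Slots in the fixed order assumed by the strategy: title, author, list price.
SLOTS = [
    ("title", re.compile(r"<h1>(.+?)</h1>")),
    ("author", re.compile(r"by\s+(.+?)</i>")),
    ("list_price", re.compile(r"(\$\d+(?:\.\d{2})?)\b")),
]


def extract_in_order(text, slots):
    """Fill slots left to right, resuming where the previous filler ended."""
    filled = {}
    pos = 0
    for name, pattern in slots:
        match = pattern.search(text, pos)
        if match is None:
            filled[name] = None      # leave the slot empty, keep the position
            continue
        filled[name] = match.group(1)
        pos = match.end(1)           # next search starts after this filler
    return filled


template = extract_in_order(PAGE, SLOTS)
print(template)

# Searching from the start of the document instead picks up the wrong price.
print(SLOTS[2][1].search(PAGE).group(1))
```

With this toy page, the loose price pattern is only safe because the search resumes after the author; run from the beginning of the document it would need more context, which is the alternative the last bullet describes.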
Rapier system learns three regex-style patterns for each slot: pre-filler, filler, post-filler
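As a rough illustration of a pre-filler / filler / post-filler triple, the sketch below hand-writes the three parts and joins them into one regular expression. It does not show how Rapier learns or represents its patterns; the helper names `build_slot_pattern` and `extract`, the sample page, and the choice to treat the context strings as literals with case-insensitive matching are assumptions of this example.

```python
import re


def build_slot_pattern(pre_filler, filler, post_filler=""):
    """Join a literal pre-filler, a regex filler, and a literal post-filler.

    Only the filler is captured (group 1); the context is matched but not
    returned.
    """
    return re.compile(
        re.escape(pre_filler) + "(" + filler + ")" + re.escape(post_filler),
        re.IGNORECASE,  # tolerate "list Price" vs. "List Price" in the page
    )


def extract(pattern, text):
    """Return the captured filler, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical page fragment with two prices.
PAGE = (
    "<b>Our Price:</b> $19.96 "
    "<b>List Price:</b> <span class=listprice>$24.95</span>"
)

PRE = "<b>list Price:</b> <span class=listprice>"
bare = re.compile(r"(\$\d+(\.\d{2})?\b)")
with_pre = build_slot_pattern(PRE, r"\$\d+(\.\d{2})?\b")
with_pre_and_post = build_slot_pattern(PRE, r".+", "</span>")

print(extract(bare, PAGE))               # first price on the page
print(extract(with_pre, PAGE))           # price in the list-price context
print(extract(with_pre_and_post, PAGE))  # anything between the two contexts
```

The filler `.+` is greedy, so on a line containing several `</span>` tags it would run to the last one; a non-greedy `.+?` is the usual safeguard in that situation.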

Extraction patterns for semistructured text
- If extracting from more natural, unstructured, human-written text, some NLP will usually help.
  - Part-of-speech (POS) tagging
    - Mark each word as a noun, verb, preposition, etc.
  - Syntactic parsing
    - Identify phrases: NP, VP, PP
  - Semantic word categories (e.g. from WordNet)
    - KILL: kill, murder, assassinate, strangle, suffocate
- E.g. Rapier's extraction patterns can use POS or phrase tags.
    Crime victim:
      Prefiller: [POS: V, Hypernym: KILL]
      Filler: [Phrase: NP]

Set fill extraction
- If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot.
    Job category
    Company type
- Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.
- When won't this work?

XML and IE
- If relevant documents were all available in standardized XML format, IE would be unnecessary.
- But
  - Difficult to develop a universally adopted DTD format for the relevant domain.
  - Difficult to manually annotate documents with appropriate XML tags.
  - Commercial industry may be reluctant to provide data in easily accessible XML format.
- IE provides a way of automatically transforming semi-structured or unstructured data into an XML compatible format.
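To make the last point concrete, here is a minimal sketch of serializing a filled template as XML with Python's standard library. The slot values are taken from the example output template earlier in the lecture; the element and attribute names (`incident`, `slot`, `filler`, `name`) are invented for illustration and do not come from any DTD.

```python
import xml.etree.ElementTree as ET

# A few slots from the example template; each slot maps to a list of fillers.
template = {
    "DATE": ["15 JAN 90"],
    "TYPE": ["MURDER"],
    "PERP: INDIVIDUAL ID": [
        "FOUR OFFICERS",
        "ONE COLONEL",
        "FIVE MEMBERS OF THE ARMED FORCES",
    ],
    "HUM TGT: DESCRIPTION": ["JESUIT PRIESTS", "WOMEN"],
}


def template_to_xml(filled, root_tag="incident"):
    """Build one <slot name="..."> element per slot, one <filler> per value."""
    root = ET.Element(root_tag)
    for slot, fillers in filled.items():
        slot_el = ET.SubElement(root, "slot", name=slot)
        for filler in fillers:
            ET.SubElement(slot_el, "filler").text = filler
    return root


xml_text = ET.tostring(template_to_xml(template), encoding="unicode")
print(xml_text)
```

Slot names are stored as attribute values rather than tag names because names such as `PERP: INDIVIDUAL ID` contain spaces and are not legal XML element names.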