Concept Chunking. Introduction. Overview. Example. Why concept chunking? What is a concept? Sander Canisius Text Mining February 22, 2005


Concept Chunking
Sander Canisius
Text Mining, February 22, 2005

Overview
- Introduction
- Techniques
- Applications

Introduction

Example

Apologies, as always, for any cross-postings...

CALL FOR PAPERS
THE CHALLENGE OF IMAGE RETRIEVAL
A Workshop on Content-Based Image & Video Retrieval
February 5, 1998, University of Northumbria at Newcastle, UK

IMPORTANT DATES:
Deadline for Submission: 24 November 1997
Notification of Acceptance: 8 December 1997
Camera-ready Papers due: 23 January 1998

This one-day Research Workshop forms the first day of a two-day conference on image retrieval to be held in Newcastle upon Tyne on 5-6 February 1998. It aims to provide a forum for presenting new research...

Slot types
- conferenceacronym
- conferencehomepage
- conferencename
- workshopacronym
- workshopcamerareadycopydate
- workshopdate
- workshophomepage
- workshoplocation
- workshopname
- workshopnotificationofacceptancedate
- workshoppapersubmissiondate

What is a concept?
- A textual information unit that is an instance of one of several domain-specific types
- For example, disease names in a medical domain, submission dates in calls for papers, etc.
- Some standard concept chunking / information extraction domains:
  - Seminar announcements
  - Molecular biology
  - News reports on terrorist attacks

Why concept chunking? An example for information retrieval
- Standard information retrieval engines treat tokens in a document as atomic units that are equal to terms in a search query if and only if their byte strings match exactly
- For the Google query Java, all top-ranked matches are about the programming language, which is of no use to someone searching for information about the island
- One way of overcoming this issue would be to somehow tell Google that you are looking for the island Java rather than the programming language
- For example, island(java)
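The contrast between exact token matching and a typed query such as island(java) can be sketched as follows. This is a minimal illustration, not an actual search engine interface: the toy collection `DOCS`, its annotation format, and the function names `token_search` and `concept_search` are all assumptions made for the example.

```python
# Hypothetical toy collection: each document carries its raw tokens plus
# concept annotations of the form (concept_type, surface_string).
DOCS = [
    {"id": "d1",
     "tokens": ["Java", "is", "compiled", "to", "bytecode"],
     "concepts": [("programming_language", "Java")]},
    {"id": "d2",
     "tokens": ["Java", "is", "an", "island", "in", "Indonesia"],
     "concepts": [("island", "Java")]},
]


def token_search(term, docs):
    """Exact string match on tokens: every document containing the term matches."""
    return [d["id"] for d in docs if term in d["tokens"]]


def concept_search(concept_type, term, docs):
    """Match only documents in which `term` is annotated with `concept_type`."""
    wanted = (concept_type, term.lower())
    return [d["id"] for d in docs
            if any((ctype, surface.lower()) == wanted
                   for ctype, surface in d["concepts"])]


print(token_search("Java", DOCS))              # both documents match
print(concept_search("island", "java", DOCS))  # only the island document matches
```

The typed query only works if the concept annotations exist, which is what automatic concept chunking is meant to provide.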

Why concept chunking? Question answering with concepts
- Research done within the IMIX Rolaquad project
- Several levels of annotation are combined to perform question answering

Automatic concept finding
- The previous examples assume that information about the (types of) concepts in a document is available to the system
- This would be the case if the author of a document added this information; in practice, however, this seldom happens
- The other solution, then, is to predict the concepts in a document automatically => concept chunking

Concept chunking in a broader NLP context
- For concept chunking, it is useful to have linguistic knowledge about the text at hand
- This information can be generated by other automated NLP components
- For example, tokenisation, part-of-speech tagging, (shallow) syntactic parsing, ...

Similar NLP tasks
- Both information extraction (IE) and named-entity recognition (NER) share some resemblance with concept chunking
- However:
  - Concept chunking does not take the document context into account (no document-centred approach)
  - The goal of IE is to construct structured database records from an unstructured document; concept chunking and NER only mark the relevant concepts in a text
  - IE also includes co-reference resolution
- In many cases, it is difficult to distinguish the three tasks; one could even argue that concept chunking is a subtask of named-entity recognition, which is itself a subtask of information extraction

Properties of concept chunking
- Often, there is interaction between the concepts to be predicted
- For example, in conference announcements, the conference date usually follows shortly after the conference name
- There might even be interaction between different levels of annotation

Properties of concept chunking: interaction between levels of annotation
- A document in which the words compiler, source code and object-oriented occur most likely deals with a programming-related topic
- Another document in which the words tourism, capital and population occur would more likely be about tourist information
- In the first type of document, the word Java is more likely to refer to the programming language, whereas in the second type of document, the island Java would make much more sense
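The interaction between document-level vocabulary and the reading of Java can be illustrated with a simple cue-word count. This is only a sketch under stated assumptions: the cue lists are taken from the examples above, the function name `guess_java_sense` is hypothetical, and a real system would learn such evidence from data rather than hard-code it.

```python
# Cue words taken from the slide examples. "source code" is split into two
# tokens here because the sketch works on single whitespace-separated tokens.
CUE_WORDS = {
    "programming_language": {"compiler", "source", "code", "object-oriented"},
    "island": {"tourism", "capital", "population"},
}


def guess_java_sense(tokens):
    """Return the sense whose cue words occur most often among the tokens.

    Returns None when no cue word occurs at all. On a tie, the sense listed
    first in CUE_WORDS wins, which is an arbitrary choice for this sketch.
    """
    lowered = {t.lower() for t in tokens}
    scores = {sense: len(cues & lowered) for sense, cues in CUE_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_java_sense("The Java compiler turns source code into bytecode".split()))
print(guess_java_sense("Java has a large population and its capital attracts tourism".split()))
```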

Techniques

Automatic concept chunking
- The goal of automatic concept chunking is to mark the location (chunk identification) and the type (chunk classification) of all concept instances in a text, without requiring any input from human experts

Evaluation metrics
- Precision: the percentage of chunks predicted by the system that are correct
- Recall: the percentage of correct chunks that are predicted by the system
- F-score: the harmonic mean of recall and precision, that is, F = 2PR/(P+R)

Techniques: quick overview
- Lexicon look-up?
  - Does not generalise to unseen instances
  - A word may be ambiguous with respect to its concept type (for example, Java)
- Knowledge-based approach
  - Human experts construct a set of rules with which concepts can be identified in a text
- Learning approach
  - Automated learning algorithms induce a model with which concepts can be identified in a text

Knowledge-based vs. learning

Knowledge-based approach
- Advantages
  - Human experience can be used to quickly distinguish good rules from bad ones
- Disadvantages
  - Laborious, time-intensive development process
  - Requires the availability of human expertise
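The evaluation metrics can be computed at chunk level as in the sketch below. It assumes, as a choice made for this example only, that a chunk is a (start, end, type) triple and that a predicted chunk counts as correct only when all three values match a gold-standard chunk exactly; the function name `evaluate` and the sample data are hypothetical.

```python
def evaluate(gold, predicted):
    """Chunk-level precision, recall and F-score.

    A predicted chunk is correct only if its boundaries and its type both
    match a gold-standard chunk exactly (an assumption of this sketch).
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Chunks as (start_token, end_token, concept_type); invented sample data.
gold = [(0, 1, "programming_language"), (5, 7, "date"), (9, 10, "location")]
pred = [(0, 1, "programming_language"), (5, 6, "date")]

p, r, f = evaluate(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

In this sample the second predicted chunk has the right type but the wrong end boundary, so under exact matching it counts against both precision and recall.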

Learning approach
- Advantages
  - There is no need for human experts
  - Techniques are largely domain independent
  - Exceptions are not likely to be overlooked
- Disadvantages
  - (Large amounts of) example data are required to train most common machine learning algorithms
  - Knowledge acquisition bottleneck vs. data acquisition bottleneck
  - The resulting model might not be easily understandable by a human observer

Creating a rule for the concept <programming-language>
- Observation: a programming language concept is always a proper noun
- Context predicates:
  - written in <NOUN>
  - <NOUN> compiler
- But what to do with:
  - Java is a beautiful programming language
  - Java is a beautiful island
- Long-distance dependencies can be problematic
- Rules can be created by a human expert or automatically by a rule-induction algorithm
- Other machine learning algorithms may use more abstract models than rules

Identification & classification: parallel vs. sequential
- Parallel: one classifier performs chunk identification and classification at the same time
- Sequential: one classifier performs chunk identification; another performs classification
  - Might be useful if there are several similar but different concept types (workshop date, submission date, camera-ready date)
  - In that case, identification can focus on correctly identifying a higher-level concept (date), and classification can focus on disambiguating already identified phrases

Chunk identification: most common methods
- Chunking-as-tagging (IOB tagging)
  - Each token is assigned a tag denoting whether it is outside a chunk (O), inside a chunk (I), or the first token of a chunk that immediately follows another chunk (B)
  - Possible problem: discontinuous chunks. For example, Java/I programming/O language/I
- Open/close bracketing
  - Tokens that start or end a chunk are assigned a [ or ] symbol, respectively
  - Possible problem: unmatched brackets. For example, [ Java programming ] language ]

Chunk classification
- Parallel identification and classification
  - Append the concept type to the identification tag
  - For example, I-programming_language, or [programming_language
- Sequential classification
  - Compress the chunk found in the identification step into a single unit
  - For example, concatenating all words (may lead to sparse data!)
  - Orthographic properties
  - Encode the chunk as a bag-of-words

Instances for concept chunking
- Granularity: tokens, characters, syntactic chunks
- Sliding window
- Any feature that may be useful: POS tag, syntactic chunk, orthographic information, seed lists
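Turning typed IOB tags (the parallel setting, for example I-programming_language) back into chunks can be sketched as follows. The function name `iob_to_chunks` is hypothetical, and the decoder assumes the tag scheme described above: O for outside, I-type for inside, and B-type for a chunk that directly follows another chunk. As an additional assumption of this sketch, a change of type between two adjacent I tags also starts a new chunk.

```python
def iob_to_chunks(tokens, tags):
    """Decode typed IOB tags into a list of (concept_type, text) chunks."""
    chunks = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, ctype = "O", None
        else:
            prefix, ctype = tag.split("-", 1)
        # B always opens a new chunk; I opens one after O or after a
        # chunk of a different type.
        starts_new = prefix == "B" or (prefix == "I" and ctype != current_type)
        if (prefix == "O" or starts_new) and current_words:
            chunks.append((current_type, " ".join(current_words)))
            current_words, current_type = [], None
        if prefix != "O":
            current_words.append(token)
            current_type = ctype
    if current_words:  # flush a chunk that runs to the end of the text
        chunks.append((current_type, " ".join(current_words)))
    return chunks


tokens = ["Deadline", "for", "Submission", ":", "24", "November", "1997"]
tags = ["O", "O", "O", "O"] + ["I-workshoppapersubmissiondate"] * 3
print(iob_to_chunks(tokens, tags))
```

Note that a tag sequence such as Java/I programming/O language/I decodes into two separate chunks, which is exactly the discontinuous-chunk limitation mentioned above.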

Machine learning techniques
- General-purpose classifiers: k-NN, Maximum Entropy, SVM, ...
- Sequence learners: Hidden Markov Models, Conditional Random Fields

Class skewedness
- In concept chunking, there is often an unbalanced class distribution, where the majority class is the negative class
- Standard machine learning techniques try to optimise towards accuracy
- As a result, they may converge to always predicting the majority class
- This leads to high precision, but low recall
- Sampling may be used for dealing with class skewedness
  - Up-sampling: add copies of positive instances
  - Down-sampling: remove negative instances

Applications

Two concept chunking domains
- Call for papers domain
- IMIX Rolaquad: medical concept finding

Call for papers domain
- Example document and slot types: the workshop announcement (THE CHALLENGE OF IMAGE RETRIEVAL) and the slot types listed in the Introduction

CfP domain: approach
- Double classification for dealing with class skewedness: first select relevant sentences, then do concept chunking on the selected sentences
- Features: POS tags, orthographic information, named entities
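Up-sampling and down-sampling for a skewed class distribution can be sketched as below. This is an illustration under assumptions, not the procedure used in the experiments: instances are taken to be (features, label) pairs with "O" as the negative class, and the function names and the `factor` and `keep_fraction` parameters are hypothetical.

```python
import random


def up_sample(instances, factor):
    """Add copies of positive instances so that each occurs `factor` times."""
    positives = [inst for inst in instances if inst[1] != "O"]
    return instances + positives * (factor - 1)


def down_sample(instances, keep_fraction, seed=0):
    """Keep all positive instances and a random fraction of the negatives."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [inst for inst in instances
            if inst[1] != "O" or rng.random() < keep_fraction]


# Toy training set: 8 negative instances ("O") and 2 positive ones ("I").
data = [(f"w{i}", "O") for i in range(8)] + [("Java", "I"), ("1998", "I")]

print(len(up_sample(data, 3)))      # 8 negatives + 2 * 3 positives
print(len(down_sample(data, 0.5)))  # 2 positives + a random subset of negatives
```

Sampling should be applied to the training data only; the test data keeps its original class distribution so that precision and recall remain meaningful.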

CfP: results
- Precision: 66.5
- Recall: 40.9
- F-score: 50.6
- State-of-the-art performance: F-score 73.5; however, this uses a document-centred approach

Medical concepts

POKKEN of variola major, een besmettelijke, door het variola virus verwekte ziekte. De ziekte is door het intensieve wereldwijde `eradicatieprogramma' van de Wereldgezondheidsorganisatie (WHO), officieel sinds 8 mei 1980 volledig uitgeroeid. Het pokkenvirus wordt nu nog slechts in een aantal laboratoria bewaard.

(English translation: SMALLPOX or variola major, a contagious disease caused by the variola virus. Thanks to the intensive worldwide "eradication programme" of the World Health Organization (WHO), the disease has officially been completely eradicated since 8 May 1980. The smallpox virus is now kept only in a number of laboratories.)

Slot types
- disease
- disease_feature
- disease_symptom
- method_of_diagnosis
- person
- person_feature
- body_part
- bodily_function
- treatment
- advice
- micro-organism
- duration

Medical concepts: approach
- Relatively new project
- Only simple tagging applied so far

Medical concepts: results
- Precision: 69.09
- Recall: 66.03
- F-score: 67.53
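As a quick consistency check, the reported F-scores can be recomputed from the reported precision and recall with F = 2PR/(P+R). The subscripts CfP and med are labels introduced here for the two domains.

```latex
\begin{align*}
F_{\mathrm{CfP}} &= \frac{2 \cdot 66.5 \cdot 40.9}{66.5 + 40.9}
                  = \frac{5439.7}{107.4} \approx 50.6 \\[4pt]
F_{\mathrm{med}} &= \frac{2 \cdot 69.09 \cdot 66.03}{69.09 + 66.03}
                  = \frac{9124.03}{135.12} \approx 67.53
\end{align*}
```

Both values agree with the F-scores listed above.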