Two hierarchical text categorization approaches for BioASQ semantic indexing challenge. BioASQ challenge 2013 Valencia, September 2013

Francisco J. Ribadas (ribadas@uvigo.es), Víctor M. Darriba (darriba@uvigo.es)
Compilers and Languages Group, Universidade de Vigo (Spain), http://www.grupocole.org/

Luis M. de Campos (lci@decsai.ugr.es), Alfonso E. Romero (aeromero@cs.rhul.ac.uk)
Research Group of Uncertainty Treatment in Artificial Intelligence, Universidad de Granada (Spain), http://decsai.ugr.es/gte/

Outline
1. Motivation and objectives
2. Description of our systems
3. The systems in BioASQ: tested configurations
4. Conclusions and future work

Motivation
- Joint work of the CoLe group (Univ. of Vigo) and the UTAI group (Univ. of Granada)
- Previous independent work on related small/medium-size problems
  - Legal documents: parliamentary initiatives (UTAI), public grants and subsidies (CoLe)
  - Medium-size thesauri (EUROVOC + custom thesaurus)
  - Both dealing with Spanish texts (also Galician for CoLe)
  - Minimal linguistic processing (no tagging, no lemmatization, no NER)
- Thesaurus topic assignment treated as a hierarchical text categorization problem
  - Top-down scheme using a local classifier per node approach (CoLe)
  - Bayesian network induced from the thesaurus hierarchy (UTAI)

Objectives
1. Test the scalability of our proposals with large real-world data
   - BioASQ Task 1A: Large-Scale Online Biomedical Semantic Indexing
   - Large hierarchy of descriptors and large training set
   - Size and time restrictions
2. Evaluate the suitability of a pure text categorization approach for semantic indexing with MeSH
   - Minimal linguistic processing
   - Very different domain, complex terminology

framework: Origins of our systems
- Text categorization on a public grants/subsidies collection
  - Small custom thesaurus (1800 descriptors)
  - Medium/large-size documents
  - A few labeling inconsistencies in the training documents
  - Additional requirement: return many results (aim for high recall)
  - Human curators will postprocess the system output
- Text categorization on a parliamentary initiatives collection
  - EUROVOC thesaurus (4000 descriptors)
  - Very small documents (1-2 paragraphs)
  - Additional requirement: return many results (aim for high recall)
  - Human curators will postprocess the system output

framework (I)
Generic framework for hierarchical categorization (under development)
- Top-down Local Classifier per Node approach
  - A local binary classifier is trained for each node in the hierarchy: is the current node (or one of its descendants) pertinent as a label?
  - Pachinko-like top-down traversal of the local classifiers
  - Able to deal with tree and DAG structured taxonomies
- Plug-in architecture with several components for:
  - selecting sets of positive examples with a bottom-up procedure
  - selecting sets of negative examples
  - feature selection at each local model (IG, Chi squared, ...)
  - the classification algorithm that performs the routing decisions at each local model
  - dealing with unbalanced classes (weighting, boundary negative examples, splitting the negative example set across an ensemble of classifiers)
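The top-down traversal can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy DAG, the keyword-based stand-in for the local binary classifiers, and the rule that an accepted node is emitted as a label when none of its children is accepted are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy taxonomy (node -> children). "B2" has two parents,
# so this is a DAG rather than a tree.
CHILDREN: Dict[str, List[str]] = {
    "ROOT": ["A", "B"],
    "A": ["A1", "B2"],
    "B": ["B1", "B2"],
    "A1": [], "B1": [], "B2": [],
}

# Stand-in for the trained local classifiers: a node "accepts" a document
# if the document shares a keyword with the node or its descendants.
KEYWORDS: Dict[str, Set[str]] = {
    "A": {"heart", "artery"},
    "B": {"lung", "artery"},
    "A1": {"heart"},
    "B1": {"lung"},
    "B2": {"artery"},
}


def keyword_router(node: str, doc: Set[str]) -> bool:
    """Toy routing decision: is this node (or a descendant) pertinent?"""
    return bool(KEYWORDS[node] & doc)


def top_down_labels(doc: Set[str],
                    children: Dict[str, List[str]],
                    accepts: Callable[[str, Set[str]], bool],
                    root: str = "ROOT") -> Set[str]:
    """Descend from the root, following only children whose local classifier accepts."""
    labels: Set[str] = set()
    decisions: Dict[str, bool] = {}   # each local classifier is evaluated once
    expanded: Set[str] = set()        # avoid re-expanding nodes reached via two parents

    def accepted(node: str) -> bool:
        if node not in decisions:
            decisions[node] = accepts(node, doc)
        return decisions[node]

    def visit(node: str) -> None:
        if node in expanded:
            return
        expanded.add(node)
        passed = [c for c in children[node] if accepted(c)]
        # Assumed output rule: an accepted node with no accepted child is a label.
        if not passed and node != root:
            labels.add(node)
        for child in passed:
            visit(child)

    visit(root)
    return labels


print(top_down_labels({"artery", "stenosis"}, CHILDREN, keyword_router))
```

In a real system each call to `accepts` would be replaced by the prediction of the classifier trained for that node, and a rejection near the root prunes the whole subtree below it.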

framework (II)
Specific features for large-scale hierarchical text categorization
- Textual feature computation backed by a Lucene index
- Bottom-up positive example selection (from the positive example sets of descendant nodes)
  - avoids unmanageable training sets at the top levels
  - random selection among the descendants' positive examples
  - k-means clustering based selection: select the examples closest to the centroids
- Guided top-down search using a simplified k-nearest neighbours step
  - query the Lucene index to get a set of promising labels from the most similar documents
  - the top-down search starts at the grandparent nodes of those labels
  - avoids premature discarding of useful paths
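The guided search can be sketched as follows. This is only an illustration under stated assumptions: Jaccard overlap on token sets stands in for the Lucene similarity query, the tiny index and hierarchy are invented, and the fallback used when a label has no parent or grandparent is a choice made here, not something the slides specify.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical indexed training documents: (token set, assigned labels).
INDEX: List[Tuple[Set[str], Set[str]]] = [
    ({"heart", "valve"}, {"A1"}),
    ({"lung", "artery"}, {"B1"}),
    ({"kidney"}, {"B1"}),
]

# Hypothetical hierarchy, stored as node -> parents.
PARENTS: Dict[str, List[str]] = {
    "A1": ["A"], "A": ["C"], "C": ["ROOT"],
    "B1": ["B"], "B": ["ROOT"], "ROOT": [],
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap, used here in place of a Lucene similarity score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_labels(query: Set[str],
                     index: List[Tuple[Set[str], Set[str]]],
                     k: int) -> Set[str]:
    """Collect the labels of the k documents most similar to the query."""
    ranked = sorted(index, key=lambda doc: jaccard(query, doc[0]), reverse=True)
    labels: Set[str] = set()
    for _tokens, doc_labels in ranked[:k]:
        labels |= doc_labels
    return labels


def start_nodes(labels: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Grandparents of the candidate labels; fall back to the parent or the label itself."""
    starts: Set[str] = set()
    for label in labels:
        label_parents = parents.get(label, [])
        if not label_parents:
            starts.add(label)
            continue
        for parent in label_parents:
            grandparents = parents.get(parent, [])
            starts.update(grandparents or [parent])
    return starts


promising = candidate_labels({"heart", "valve", "surgery"}, INDEX, k=1)
print(promising, start_nodes(promising, PARENTS))
```

Starting two levels above the candidate labels keeps the search local while still letting the classifiers reach siblings and cousins of the labels suggested by the neighbours.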

framework (III)
Contextual routing decisions (not tested on BioASQ data)
- Objective: try to reduce false negatives (mainly at the top levels)
- Two classifiers per node: one content based, one context based
- Exploits bottom-up information (metafeatures) coming from the content-based routing decisions made by descendant nodes (and optionally by ancestor and sibling nodes)
- Roughly inspired by classifier chain approaches in multilabel classification
- Adds to the content-based features a set of metafeatures about the decisions of the surrounding models
- Moderate performance improvements, at a high training/classification cost

(I)
Builds a Bayesian network using:
1. the thesaurus hierarchical structure
2. terms (tokens) taken from descriptor labels and non-descriptor labels
3. terms (tokens) taken from training documents
Elements
- Concept nodes (representing thesaurus concepts/nodes)
- Descriptor and Non-Descriptor nodes (representing descriptor and non-descriptor labels)
- Term nodes (representing words [tokens])
Every concept node C is linked with three virtual nodes
- H_C: info. from BT (Broader Term) relationships in the thesaurus
- E_C: info. from descriptor and non-descriptor labels (synonymy)
- T_C: info. from training documents
An efficient OR-gate model is used to define the conditional probabilities.
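The slides do not give the exact form of the OR-gate model. A common canonical choice is the noisy-OR gate without a leak term, sketched below as an assumption; the parent names and weights are hypothetical. Its appeal is that the conditional probability needs one weight per parent instead of a table that grows exponentially with the number of parents.

```python
from math import prod  # Python 3.8+
from typing import Dict, Set


def noisy_or(weights: Dict[str, float], active: Set[str]) -> float:
    """P(node is relevant | parents) = 1 - product over active parents of (1 - w_i)."""
    return 1.0 - prod(1.0 - w for name, w in weights.items() if name in active)


# Hypothetical weights linking two term nodes to a descriptor node.
weights = {"renal": 0.5, "failure": 0.2}

print(noisy_or(weights, {"renal", "failure"}))  # both terms present
print(noisy_or(weights, {"failure"}))           # only one term present
print(noisy_or(weights, set()))                 # no active parent
```

With both parents active the value is 1 - (0.5 x 0.8) = 0.6, and with no active parent it is 0, so each additional piece of evidence can only raise the probability.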

(II) Thesaurus fragment (D: descriptors, ND: non-descriptors)

(II) Bayesian network built from the thesaurus hierarchy and the descriptor and non-descriptor equivalence relationships

(II) Bayesian network after adding terms from training documents

for BioASQ challenge
Own concept taxonomy with a DAG structure, extracted from the 2013 XML version of MeSH
- Hierarchical relationships created from TreeNumber elements
- TreeNumbers describe the places a MeSH descriptor occupies inside the 16 concept taxonomies
Results
- DAG with 26,702 nodes (excludes 151 descriptors from subhierarchy V)
- 36,647 parent-child relationships, with only two cycles
  - descriptors D009014 (Morals) + D004989 (Ethics)
  - descriptors D006885 (Hydroxybutyrates) + D020155 (3-Hydroxybutyric Acid)
- 108,117 related terms (synonyms or lexical variants), used as non-descriptor labels

Problem: spurious relationships in the final DAG taxonomy
- Example: eye appears both as a part of the face and as a sense organ, which leads to eyebrows being considered an element of a sense organ
- There is no disambiguation info. in the training data to avoid it
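How merging tree positions produces such spurious paths can be shown with a small sketch. It assumes the usual MeSH convention that the parent position of a dotted tree number is obtained by dropping its last segment; the tree numbers and descriptor names below are invented for illustration and are not real MeSH values.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical descriptors with the tree positions they occupy.
TREE_NUMBERS: Dict[str, List[str]] = {
    "face": ["X01.10"],
    "sense_organs": ["X02"],
    "eye": ["X01.10.20", "X02.30"],   # listed under face AND under sense organs
    "eyebrows": ["X01.10.20.40"],     # listed only under the "face" position of eye
}


def build_edges(tree_numbers: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Derive (parent, child) descriptor pairs by truncating each tree number."""
    owner = {tn: desc for desc, tns in tree_numbers.items() for tn in tns}
    edges: Set[Tuple[str, str]] = set()
    for desc, tns in tree_numbers.items():
        for tn in tns:
            if "." not in tn:
                continue                      # top-level position, no parent
            parent_tn = tn.rsplit(".", 1)[0]
            if parent_tn in owner:
                edges.add((owner[parent_tn], desc))
    return edges


def ancestors(node: str, edges: Set[Tuple[str, str]]) -> Set[str]:
    """All nodes reachable upwards from `node`; safe even if the graph has cycles."""
    parents: Dict[str, Set[str]] = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found: Set[str] = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


edges = build_edges(TREE_NUMBERS)
print(sorted(ancestors("eyebrows", edges)))
```

Once the two positions of `eye` are collapsed into a single node, `sense_organs` becomes an ancestor of `eyebrows`, even though no tree number places eyebrows under sense organs.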

for BioASQ challenge
Only elementary text processing on training and validation documents
- stop-word removal using a standard English stop-word list
- default English stemmer from the Snowball project
Also: an alternative collection built by extracting word bigrams from descriptor labels and document text after stop-word removal
- a simple way to capture some complex terms (but far from perfect)
Scalability limitations
- Extract a reduced training set of 1,242,670 documents (10 %)
  - the 50 most representative documents for every descriptor, taken from the Lucene index
- Split into 5 groups (248,534 training instances each) for training
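The bigram collection and the five-way split can be sketched in a few lines. The stop-word list here is a tiny illustrative one rather than the standard list the authors used, stemming is omitted, the underscore-joined bigram format is an assumption, and the round-robin split is only one possible way to form the groups, since the slides do not say how the split was made.

```python
import re
from typing import List, Sequence

# Tiny illustrative stop-word list (assumption, not the list used by the authors).
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "to", "with"})


def tokens(text: str) -> List[str]:
    """Lowercase, split on non-alphanumerics, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def word_bigrams(text: str) -> List[str]:
    """Adjacent token pairs, computed after stop-word removal."""
    toks = tokens(text)
    return [f"{a}_{b}" for a, b in zip(toks, toks[1:])]


def split_round_robin(items: Sequence, n_groups: int) -> List[list]:
    """Deal items into n_groups of (almost) equal size."""
    return [list(items[i::n_groups]) for i in range(n_groups)]


print(word_bigrams("Acute failure of the kidney"))
print(split_round_robin(list(range(10)), 5))
print(1_242_670 // 5)  # size of each of the 5 groups reported on the slide
```

Because stop words are removed first, "failure of the kidney" yields the bigram `failure_kidney`, which is how this representation captures some multiword terms while remaining far from a real term extractor.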

framework: Tested configurations (I)
Previous parameter tuning phase with a small dataset from subhierarchy [C] Diseases
1. Effectiveness of bottom-up positive example selection
   - random document selection vs. k-means based document selection
   - up to 500, 1000 and 2000 selected instances per node
2. Effectiveness of guided top-down classification
3. Usefulness of word bigram based features vs. single token features
Results
- k-means based document selection performs better, but training takes almost twice as long
- better results with larger numbers of positive instances
- guided top-down search improved classification time and quality
- word bigrams did not appear to help

Tested configurations (II)
1. Effectiveness of aggregating the results of 5 models vs. single model performance
2. Usefulness of word bigrams as instance features using a single model
Results
- marginal improvements when the results of 5 models were combined
- great improvement due to the word bigram representation

BioASQ Task 1A: submitted configurations
- 1: framework with
  - k-means bottom-up positive example selection (2000 documents per node)
  - IG as local feature selection (100 features)
  - SVM as local content-based classifier
- 2: same as 1, employing a guided top-down search approach
- 2-NE: same as 2, using word bigrams as textual features
- REBAYCT: combination of 5 models trained with 5 splits of the reduced training set
- REBAYCT2: single model using the word bigram alternative collection
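Information gain (IG) as a local feature selector can be sketched as below. This is a textbook formulation for binary term presence and a binary node label, not necessarily the exact variant used in the submitted runs; the four toy documents and their labels are invented.

```python
from math import log2
from typing import List, Sequence, Set


def entropy(positives: int, total: int) -> float:
    """Binary entropy (in bits) of a set with `positives` positive items out of `total`."""
    if total == 0 or positives == 0 or positives == total:
        return 0.0
    p = positives / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(docs: Sequence[Set[str]], labels: Sequence[int], term: str) -> float:
    """IG(term) = H(label) - H(label | term present or absent)."""
    n = len(docs)
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    conditional = sum(len(part) / n * entropy(sum(part), len(part))
                      for part in (with_term, without_term))
    return entropy(sum(labels), n) - conditional


def top_features(docs: Sequence[Set[str]], labels: Sequence[int], k: int) -> List[str]:
    """Keep the k terms with the highest information gain (ties broken alphabetically)."""
    vocabulary = sorted(set().union(*docs))
    return sorted(vocabulary,
                  key=lambda t: information_gain(docs, labels, t),
                  reverse=True)[:k]


# Toy training set for one local node: 1 = positive example, 0 = negative example.
docs = [{"heart", "patient"}, {"heart", "study"}, {"lung", "patient"}, {"lung", "study"}]
labels = [1, 1, 0, 0]
print(top_features(docs, labels, k=2))
```

In the submitted configuration the same idea would be applied independently at every node, keeping the 100 highest-scoring features for that node's classifier.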

Conclusions
- Two different hierarchical text categorization systems were evaluated in the 2013 BioASQ challenge
- Quite far from the top performing systems in the challenge, but some improvements were made over the original systems employed in the first batch
- With minor changes, our systems were able to deal with a problem larger than the ones that originated them
- The tested configurations and the BioASQ challenge results give us some insights for improvement
  - Do large training data sets make sophisticated machine learning approaches unnecessary?
  - The contribution of training text to the classifications was more important than the contributions of structure and descriptor labels
  - Guided top-down search in the framework employs a very simple kind of k-NN prefiltering

Future work
- Advanced NLP processing of documents and descriptor labels (POS tagging, NER, ...)
- framework
  - Carry out a deeper parameter tuning
  - Exploit the guided top-down search approach
  - Exploit (and optimize) the context-based routing approach
- Evolve towards a sort of active learning approach with better training document selection
- Exploit word bigrams and more powerful text processing approaches to improve the quality of the input data