Guido Boella Dipartimento di Informatica Università di Torino FP7-ICT-2013-SME-DCA

EuroVoc classifier. Guido Boella, Dipartimento di Informatica, Università di Torino. FP7-ICT-2013-SME-DCA

Overview: Introduction; Background; Our approach; Pre-processing of the texts; Evaluation.

Introduction. Classification of legal texts deals with large numbers of documents and usually involves intensive manual work, which is slow and costly. Hence the need for automation.

Eurovoc thesaurus. Eurovoc is a multilingual, multidisciplinary thesaurus with about 7,000 categories (also called classes, labels, or descriptors from now on) covering the activities of the EU, and of the European Parliament in particular. It contains terms in several languages and is managed by the Publications Office of the European Union, an interinstitutional office whose task is to issue the publications of the institutions of the European Union. Eurovoc is an ontology-based information collector that groups and links concepts through different types of relationships. The top level of the scheme is defined by 21 general concepts.

Background (1): multi-label text classification. Each document can belong to more than one label (category). Problem: most algorithms only support mono-labeled datasets. Solutions: adapt existing algorithms to deal with multiple labels, or transform multi-labeled datasets into mono-labeled ones.

Background (2): transformation algorithms. Remove from the dataset all the documents that have more than one label, or randomly select one of the multiple labels of each document and discard the rest. These are very naïve solutions, with bad results.
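The two naïve transformations can be sketched as follows. This is only an illustration: the toy dataset, the function names `drop_multilabel` and `pick_random_label`, and the fixed random seed are assumptions, not part of the slides.

```python
import random

def drop_multilabel(dataset):
    """Keep only the documents that carry exactly one label."""
    return [(doc, labels[0]) for doc, labels in dataset if len(labels) == 1]

def pick_random_label(dataset, seed=0):
    """Keep every document, but retain one randomly chosen label for each."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [(doc, rng.choice(sorted(labels))) for doc, labels in dataset]

# Hypothetical toy dataset: (document id, list of labels)
dataset = [
    ("doc1", ["A"]),
    ("doc2", ["A", "B", "C"]),
    ("doc3", ["B"]),
]

dropped = drop_multilabel(dataset)    # doc2 disappears entirely
sampled = pick_random_label(dataset)  # doc2 keeps one of A, B, C
```

Both variants throw information away: the first loses whole documents, the second loses labels, which is consistent with the poor results reported on the slide.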

Background (3): transformation algorithms. Label power set: each distinct set of labels is treated as a single label. Example: if the labels of a document are A, B, and C, the system transforms them into the single label ABC. Weakness: it may lead to datasets with a large number of classes and few examples per class. One binary classifier per label: a binary classifier is learned for each label in the data. To classify a new document, it must be passed through all the classifiers to determine its associated set of labels. Weakness: with thousands of categories (as in the data that we will use), this strategy becomes unsustainable.
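A minimal sketch of these two transformations, assuming a toy dataset of (document id, label list) pairs. The function names `label_powerset` and `binary_relevance` are illustrative, and plain concatenation of label names is used only to mirror the "ABC" example; real label identifiers would need a separator.

```python
def label_powerset(dataset):
    """Map each distinct label set to one combined label, e.g. A, B, C -> 'ABC'."""
    return [(doc, "".join(sorted(labels))) for doc, labels in dataset]

def binary_relevance(dataset):
    """Build one binary (label present / absent) training set per label."""
    all_labels = sorted({label for _, labels in dataset for label in labels})
    return {
        label: [(doc, label in labels) for doc, labels in dataset]
        for label in all_labels
    }

# Hypothetical toy dataset
dataset = [
    ("doc1", ["A", "B", "C"]),
    ("doc2", ["B"]),
    ("doc3", ["A", "C"]),
]

powerset = label_powerset(dataset)
per_label = binary_relevance(dataset)
```

With three labels the power set already yields three distinct combined classes for three documents, and the second transformation needs one classifier per label, which illustrates why both become impractical with thousands of categories.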

Our approach (1): main idea. Each n-labeled document becomes a collection of n minor documents (each one associated with only one label), to which a state-of-the-art classification technique for mono-labeled datasets is then applied. Problem: how to segment the original document, that is, how to choose the features to keep for each of the new mono-label documents?

Our approach (2). [Figure: a multi-labeled document (Category A, Category B, Category C) is segmented into mono-labeled documents, which are passed to a state-of-the-art technique for mono-labeled datasets.]

Our approach (3): segmentation. We compute the Pointwise Mutual Information (PMI) between categories and features (terms): $M_{i,j} = \log \frac{P_{i,j}}{P_i \, P_j}$, where $P_{i,j}$ is the probability of having a non-zero co-occurrence value for the i-th feature and the j-th category in the whole corpus, and $P_i$ and $P_j$ are the individual probabilities. The purpose of $M$ is to capture the strength of the associations between features and categories.
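A minimal sketch of how such a PMI table could be computed, using the standard definition PMI = log(P_ij / (P_i * P_j)). Several points are assumptions made for illustration: probabilities are estimated as fractions of documents, the natural logarithm is used, pairs that never co-occur are simply left out of the table, and the name `pmi_matrix` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def pmi_matrix(corpus):
    """corpus: list of (set of terms, set of categories), one pair per document.
    Returns {(term, category): PMI} for every pair that co-occurs at least once."""
    n_docs = len(corpus)
    term_count, cat_count, pair_count = Counter(), Counter(), Counter()
    for terms, cats in corpus:
        term_count.update(terms)  # documents containing the term
        cat_count.update(cats)    # documents labeled with the category
        pair_count.update((t, c) for t in terms for c in cats)
    pmi = {}
    for (t, c), n_tc in pair_count.items():
        p_tc = n_tc / n_docs
        p_t = term_count[t] / n_docs
        p_c = cat_count[c] / n_docs
        pmi[(t, c)] = math.log(p_tc / (p_t * p_c))
    return pmi

# Hypothetical toy corpus
corpus = [
    ({"tax", "fraud"}, {"A"}),
    ({"tax"}, {"A", "B"}),
    ({"fish"}, {"B"}),
    ({"fish", "tax"}, {"B"}),
]
M = pmi_matrix(corpus)
```

In this toy corpus "fraud" appears only with category A, so its PMI with A is positive, while "tax" is spread over both categories and has a slightly negative PMI with B.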

Our approach (4): segmentation. For each original document vector $d$ to be segmented, given the set of categories $S_d$ to which it belongs, the system creates $n = |S_d|$ new document vectors $d_k$ (each one associated with exactly one class) in the following way: [equation missing from the transcript], where $k \in S_d$ represents the category associated with the new vector, and where $\mathrm{sel}(f_i)$ is a selection function that can assume the following values: [values missing from the transcript].

Our approach (5): segmentation variant with selection parameter Q. That is, $\mathrm{sel}_Q(f_i)$ is equal to $f_i$ if there exists a subset $S'_d$ of $S_d$, of cardinality Q, such that each of its elements has a PMI value with feature $f_i$ greater than or equal to that of all the elements of $S_d$ outside $S'_d$. This way, the system allows the use of feature $f_i$ for exactly Q segmented vectors.
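One way to read the selection rule is that each feature of a document is kept only in the Q segmented vectors whose categories have the highest PMI with that feature. The sketch below follows that reading and is not the authors' implementation: the function name `segment`, the dictionary-based vectors, the PMI values, and the tie-breaking by input order are all assumptions.

```python
def segment(doc, categories, pmi, q=1):
    """Split one multi-labeled document vector into one vector per category.

    doc: {term: weight}; categories: the labels of the document (S_d);
    pmi: {(term, category): score}. Each term is copied into the q vectors
    whose categories rank highest by PMI with that term.
    """
    segments = {k: {} for k in categories}
    for term, weight in doc.items():
        ranked = sorted(
            categories,
            # Pairs missing from the table rank last (assumed not to occur in practice)
            key=lambda k: pmi.get((term, k), float("-inf")),
            reverse=True,  # stable sort: ties keep the input order of categories
        )
        for k in ranked[:q]:
            segments[k][term] = weight
    return segments

# Hypothetical PMI scores and document vector
pmi = {
    ("tax", "A"): 0.9, ("tax", "B"): 0.1,
    ("fish", "A"): -0.2, ("fish", "B"): 0.7,
    ("eu", "A"): 0.3, ("eu", "B"): 0.4,
}
doc = {"tax": 2.0, "fish": 1.0, "eu": 3.0}

one_each = segment(doc, ["A", "B"], pmi, q=1)
shared = segment(doc, ["A", "B"], pmi, q=2)
```

With q=1 every feature ends up in exactly one segmented vector; raising q lets a feature that is informative for several categories appear in several of them, at the cost of less distinct training documents.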

Pre-processing of the texts

Evaluation. Measures: Precision and Recall (and F-Measure). Data: JRC-Acquis (http://ipsc.jrc.ec.europa.eu/?id=198), 23,472 documents (5 language versions).
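For reference, precision, recall, and F-measure on multi-label output can be computed as in the sketch below. The slides do not say which averaging scheme was used, so micro-averaging over all (document, label) decisions is an assumption here, as are the function name `micro_prf` and the toy label sets.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over parallel lists of label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # labels correctly predicted
        fp += len(p - g)  # labels predicted but not in the gold set
        fn += len(g - p)  # gold labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and predicted label sets for two documents
gold = [{"A", "B"}, {"C"}]
predicted = [{"A"}, {"C", "D"}]
precision, recall, f1 = micro_prf(gold, predicted)
```

In this toy case two labels are correct, one is spurious, and one is missed, so precision, recall, and F1 all equal 2/3.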
