Utility Theory, Minimum Effort, and Predictive Coding


Utility Theory, Minimum Effort, and Predictive Coding. Fabrizio Sebastiani (joint work with Giacomo Berardi and Andrea Esuli), Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa, Italy. DESI V, Roma, IT, 14 June 2013.

What I'll be talking about

A talk about text classification ("predictive coding"), about humans in the loop, and about how best to support their work. I will be looking at scenarios in which

1. text classification technology is used for identifying documents belonging to a given class / relevant to a given query...
2. ... but the level of accuracy that can be obtained from the classifier is not considered sufficient...
3. ... with the consequence that one or more human assessors are asked to inspect (and correct where appropriate) a portion of the classification decisions, with the goal of increasing overall accuracy.

How can we support / optimize the work of the human assessors?


A worked out example

The classifier's initial contingency table, and how F1 = 2TP / (2TP + FP + FN) improves as the annotator inspects and corrects misclassified documents one at a time:

  TP = 4, FP = 3, FN = 4, TN = 9   ->  F1 = 0.53  (initial automatic classification)
  TP = 5, FP = 3, FN = 3, TN = 9   ->  F1 = 0.63  (one FN corrected into a TP)
  TP = 5, FP = 2, FN = 3, TN = 10  ->  F1 = 0.67  (one FP corrected into a TN)
  TP = 6, FP = 2, FN = 2, TN = 10  ->  F1 = 0.75  (one FN corrected into a TP)
  TP = 6, FP = 1, FN = 2, TN = 11  ->  F1 = 0.80  (one FP corrected into a TN)
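A minimal Python sketch that replays this inspect-and-correct sequence: starting from the classifier's table (TP = 4, FP = 3, FN = 4, TN = 9), each correction turns one false negative into a true positive or one false positive into a true negative, and F1 climbs from 0.53 to 0.80.

```python
def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Initial contingency table produced by the automatic classifier.
tp, fp, fn, tn = 4, 3, 4, 9
print(f"initial F1 = {f1(tp, fp, fn):.2f}")

# Inspect-and-correct steps, in the order shown in the worked example.
for fix in ["FN", "FP", "FN", "FP"]:
    if fix == "FN":     # predicted N but truly Y: becomes a TP
        fn, tp = fn - 1, tp + 1
    else:               # predicted Y but truly N: becomes a TN
        fp, tn = fp - 1, tn + 1
    print(f"after correcting one {fix}: F1 = {f1(tp, fp, fn):.2f}")
```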

What I'll be talking about (cont'd)

We need methods that, given a desired level of accuracy, minimize the assessors' effort necessary to achieve it; alternatively, given an available amount of human assessors' effort, maximize the accuracy that can be obtained through it. This can be achieved by ranking the automatically classified documents in such a way that, by starting the inspection from the top of the ranking, the cost-effectiveness of the annotators' work is maximized. We call the task of generating such a ranking Semi-Automatic Text Classification (SATC).


What I'll be talking about (cont'd)

Previous work has addressed SATC via techniques developed for active learning. In both cases, the automatically classified documents are ranked with the goal of having the human annotator start inspecting/correcting from the top; however, in active learning the goal is providing new training examples, while in SATC the goal is increasing the overall accuracy of the classified set. We claim that a ranking generated à la active learning is suboptimal for SATC [1].

[1] G. Berardi, A. Esuli, F. Sebastiani. A Utility-Theoretic Ranking Method for Semi-Automated Text Classification. Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), Portland, US, 2012.


Outline of this talk

1. We discuss how to measure error reduction (i.e., increase in accuracy)
2. We discuss a method for maximizing the expected error reduction for a fixed amount of annotation effort
3. We show some promising experimental results


Error Reduction, and How to Measure It

Assume we have:
1. a class (or "query") c;
2. a classifier h for c;
3. a set of unlabeled documents D that we have automatically classified by means of h, so that every document in D is associated with a binary decision (Y or N) and with a confidence score (a positive real number);
4. a measure of accuracy A, ranging in [0,1].

Error Reduction, and How to Measure It (cont'd)

We will assume that A is

  F1 = (2 · Precision · Recall) / (Precision + Recall) = 2TP / (2TP + FP + FN)

but any set-based measure of accuracy (i.e., based on a contingency table) may be used. An amount of error, measured as E = (1 − A), is present in the automatically classified set D. Human annotators inspect-and-correct a portion of D with the goal of reducing the error present in D.


Error Reduction, and How to Measure It (cont'd)

We define error at rank n (written E(n)) as the error still present in D after the annotator has inspected the documents at the first n rank positions; E(0) is the initial error generated by the automated classifier, and E(|D|) = 0. We define error reduction at rank n (written ER(n)) as

  ER(n) = (E(0) − E(n)) / E(0)

i.e., the error reduction obtained by the annotator who inspects the documents at the first n rank positions. ER(n) ∈ [0, 1]: ER(n) = 0 indicates no reduction, while ER(n) = 1 indicates total elimination of error.
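A small sketch of how ER(n) can be computed in an offline evaluation, where the true labels of the ranked documents are known; here A is F1, E = 1 − A, and the function and variable names are illustrative.

```python
def f1_from_labels(predicted, true):
    """F1 computed from parallel lists of booleans (predicted and true labels)."""
    tp = sum(p and t for p, t in zip(predicted, true))
    fp = sum(p and not t for p, t in zip(predicted, true))
    fn = sum((not p) and t for p, t in zip(predicted, true))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def error_reduction(predicted, true, n):
    """ER(n) = (E(0) - E(n)) / E(0), with E = 1 - F1.

    `predicted` and `true` are assumed to be sorted by the ranking under
    evaluation; inspecting the first n documents replaces their predicted
    labels with the true ones."""
    e0 = 1.0 - f1_from_labels(predicted, true)
    corrected = list(true[:n]) + list(predicted[n:])
    en = 1.0 - f1_from_labels(corrected, true)
    return (e0 - en) / e0 if e0 else 0.0
```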


Error Reduction, and How to Measure It (cont'd)

[Plot: Error Reduction (ER) as a function of Inspection Length]


Error Reduction, and How to Maximize It

Problem: how should we rank the documents in D so as to maximize the expected error reduction?

Intuition 1: documents that have a higher probability of being misclassified should be ranked higher.

Intuition 2: documents that, if corrected, bring about a higher gain (i.e., a bigger impact on A) should be ranked higher. Here, consider that a false positive and a false negative may have different impacts on A (e.g., when A = F_β, for any value of β).

Bottom line: documents that have a higher utility (= probability × gain) should be ranked higher.


Error Reduction, and How to Maximize It (cont'd)

Given a set Ω of mutually disjoint events, a utility function is defined as

  U(Ω) = Σ_{ω ∈ Ω} P(ω) · G(ω)

where P(ω) is the probability of occurrence of event ω and G(ω) is the gain obtained if event ω occurs. We can thus estimate the utility, for the purpose of increasing A, of manually inspecting a document d as

  U(TP, TN, FP, FN) = P(FP) · G(FP) + P(FN) · G(FN)

(correct decisions, TP and TN, contribute zero gain, since inspecting a correctly classified document changes nothing), provided we can estimate:
- P(FP) and G(FP), if d has been labelled with class c;
- P(FN) and G(FN), if d has not been labelled with class c.
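Read as a ranking rule, this says: score each automatically classified document by its probability of misclassification times the gain its correction would bring, and inspect documents in decreasing score order. A minimal sketch, assuming a calibrated probability estimator and gain estimates (discussed on the next slide) are available; the document representation and names are illustrative.

```python
def utility(doc, p_misclassified, gain_fp, gain_fn):
    """Expected gain of inspecting `doc`: probability of misclassification
    times the gain of the correction it would trigger."""
    # If the classifier assigned the class (Y), a misclassification would be a
    # false positive; if it did not (N), it would be a false negative.
    gain = gain_fp if doc["predicted"] == "Y" else gain_fn
    return p_misclassified(doc) * gain

def rank_for_inspection(docs, p_misclassified, gain_fp, gain_fn):
    """Order the automatically classified documents for manual inspection."""
    return sorted(docs,
                  key=lambda d: utility(d, p_misclassified, gain_fp, gain_fn),
                  reverse=True)
```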


Error Reduction, and How to Maximize It (cont'd)

Estimating P(FP) and P(FN) (the probability of misclassification) can be done by converting the confidence score returned by the classifier into a probability of correct classification. This is tricky: it requires probability calibration via a generalized sigmoid function, to be optimized via k-fold cross-validation.

Gains G(FP) and G(FN) can be defined "differentially", i.e.:
- the gain obtained by correcting a FN is (A_{FN→TP} − A), the accuracy after the FN is turned into a TP, minus the current accuracy;
- the gain obtained by correcting a FP is (A_{FP→TN} − A).

Gains need to be estimated, by estimating the contingency table on the training set via k-fold cross-validation. Key observation: in general, G(FP) ≠ G(FN).
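A sketch of both estimation steps under these assumptions: the "differential" gains are computed on a contingency table estimated on the training set, while the confidence-to-probability conversion is shown with a plain logistic squash rather than the generalized sigmoid, tuned by k-fold cross-validation, that the method actually calls for.

```python
import math

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def differential_gains(tp, fp, fn):
    """G(FP) = A_{FP->TN} - A and G(FN) = A_{FN->TP} - A, with A = F1.
    The counts are estimates obtained on the training set via k-fold CV
    (assumed here to satisfy fp >= 1 and fn >= 1)."""
    a = f1(tp, fp, fn)
    gain_fp = f1(tp, fp - 1, fn) - a      # one FP turned into a TN
    gain_fn = f1(tp + 1, fp, fn - 1) - a  # one FN turned into a TP
    return gain_fp, gain_fn

def p_misclassified(confidence):
    """Map a (positive) confidence score to a probability of misclassification.
    Placeholder logistic squash; the actual method calibrates a generalized
    sigmoid via k-fold cross-validation."""
    return 1.0 - 1.0 / (1.0 + math.exp(-confidence))
```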


Experiments

Learning algorithms: MP-Boost, SVMs

Datasets:

  Dataset         #Cats   #Training   #Test   Macro-F1 (MP-Boost)   Macro-F1 (SVMs)
  Reuters-21578     115        9603    3299                 0.608             0.527
  OHSUMED-S          97       12358    3652                 0.479             0.478

Baseline: ranking by probability of misclassification, equivalent to applying our ranking method with G(FP) = G(FN) = 1
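In terms of the ranking sketch given earlier, the baseline amounts to calling the same ranking function with both gains set to 1 (an illustrative call that reuses the functions sketched above; the document representation is assumed):

```python
# Illustrative documents: predicted label plus the classifier's confidence score.
docs = [{"predicted": "Y", "confidence": 0.4},
        {"predicted": "N", "confidence": 2.1},
        {"predicted": "Y", "confidence": 1.3}]

prob_mis = lambda d: p_misclassified(d["confidence"])

# Baseline: rank purely by probability of misclassification, i.e. G(FP) = G(FN) = 1.
baseline_ranking = rank_for_inspection(docs, prob_mis, gain_fp=1.0, gain_fn=1.0)
```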

[Plot: Error Reduction (ER) vs. Inspection Length for the Random, Baseline, Utility-theoretic, and Oracle rankings. Learner: MP-Boost; Dataset: Reuters-21578; Type: Macro]

[Plot: Error Reduction (ER) vs. Inspection Length for the Random, Baseline, Utility-theoretic, and Oracle rankings. Learner: SVMs; Dataset: Reuters-21578; Type: Macro]

[Plot: Error Reduction (ER) vs. Inspection Length for the Random, Baseline, Utility-theoretic, and Oracle rankings. Learner: MP-Boost; Dataset: OHSUMED-S; Type: Macro]

[Plot: Error Reduction (ER) vs. Inspection Length for the Random, Baseline, Utility-theoretic, and Oracle rankings. Learner: SVMs; Dataset: OHSUMED-S; Type: Macro]

A few side notes

This approach allows the human annotator to know, at any stage of the inspection process, what the estimated accuracy is at that stage: accuracy is estimated at the beginning of the process via k-fold cross-validation, and the estimate is updated after each correction is made. The approach also lends itself to having more than one assessor working in parallel on the same inspection task.

Recent research I have not discussed today:
- a dynamic SATC method in which gains are updated after each correction is performed;
- microaveraging- and macroaveraging-oriented methods.

Concluding Remarks

Take-away message: semi-automatic text classification needs to be addressed as a task in its own right. Active learning typically makes use of probabilities of misclassification but does not make use of gains, so a ranking à la active learning is suboptimal for SATC. The use of utility theory means that the ranking algorithm is optimized for a specific accuracy measure: choose the accuracy measure that best mirrors your application needs (e.g., F_β with β > 1), and choose it well! SATC is important, since in more and more application contexts the accuracy obtainable via completely automatic text classification is not sufficient; more and more frequently, humans will need to enter the loop.

Thank you!