A Bayesian Hierarchical Model for Comparing Average F1 Scores


A Bayesian Hierarchical Model for Comparing Average F1 Scores
Dell Zhang (Birkbeck, University of London, UK), Jun Wang (University College London, UK), Xiaoxue Zhao (University College London, UK), Xiaoling Wang (East China Normal University, China)
17 Nov 2015



- Text Classification
Definition: Automatic text classification is a fundamental technique in information retrieval.
Applications: topic categorisation, spam filtering, sentiment analysis, message routing, ...
Performance measure: the F1 score.

- F1 Score
Definition: the harmonic mean of precision (p) and recall (r).
Two averaging methods:
  Micro-averaged F1 score (MiF1): gives equal weight to each classification decision.
  Macro-averaged F1 score (MaF1): gives equal weight to each class.
Limitation: a point estimate of the F1 score does not tell us how reliable the classifier will be on unseen data.
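Both averages can be computed directly with scikit-learn (the labels below are illustrative):

```python
# Micro- vs macro-averaged F1 on a small illustrative prediction.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

mi_f1 = f1_score(y_true, y_pred, average="micro")  # equal weight per decision
ma_f1 = f1_score(y_true, y_pred, average="macro")  # equal weight per class
print(round(mi_f1, 3), round(ma_f1, 3))
```

Here the poorly-predicted minority class 1 drags MaF1 below MiF1, showing how the two averages weight classes differently.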


Goal: Assess the uncertainty of a classifier's performance as measured by MiF1 and MaF1.


- Frequentist Performance Comparison
NHST: Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999:
  uses the s-test to compare two classifiers' accuracy scores
  uses the t-test to compare two classifiers' performance measures in the form of proportions

- Frequentist Performance Comparison
Deficiencies of NHST:
  It can only reject the null hypothesis; it can never accept the null hypothesis.
  It will reject the null hypothesis even when the performance difference is very close to zero.
  For complex performance measures, classifiers can only be compared at the category level, not at the document level.

- Bayes Factor
1. Bayes factor: D. Barber, "Are two classifiers performing equally? A treatment using Bayesian hypothesis testing," IDIAP, Tech. Rep., 2004; D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.
2. Deficiencies of the Bayes factor:
   It is sensitive to the choice of prior distribution in the alternative model.
   The null hypothesis can be strongly preferred even with very few data and very large uncertainty in the estimate of the performance difference.

- Bayesian Estimation
1. Bayesian estimation: C. Goutte and E. Gaussier, "A probabilistic interpretation of precision, recall and F-score, with implication for evaluation," in Proceedings of the 27th European Conference on IR Research (ECIR), 2005.
2. However, it is restricted to a single F1 score for binary classification with two classes only.
3. In contrast, our proposed approach opens up many possibilities for adaptation or extension.


- True Classification
Multi-class single-label classification:
  M different classes; N labelled test documents.
  The documents' true class labels y_i are i.i.d.
  µ = (µ_1, ..., µ_M): the probabilities that a test document truly belongs to each class.
  n = (n_1, ..., n_M): the true size of each class.
  n follows a multinomial distribution with parameter µ, where Σ_{j=1}^M n_j = N.
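The generative step for n can be sketched with NumPy (the values of M, N and µ are illustrative):

```python
# Draw the true class sizes n from a multinomial with class probabilities mu.
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 1000
mu = np.array([0.4, 0.3, 0.2, 0.1])  # true class probabilities (sum to 1)
n = rng.multinomial(N, mu)           # true size of each class; n.sum() == N
print(n)
```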

- True Classification
[Figure: the probabilistic graphical model for estimating the uncertainty of average F1 scores, with variables µ, n, c_j, θ_j, ω_j, ψ and hyperparameters α, β, η.]

- Predicted Classification
Class level:
  θ_j = (θ_{j1}, ..., θ_{jM}): the probabilities that a document with true class label j is classified into each of the classes.
  ω_j = (ω_{j1}, ..., ω_{jM}): the parameters of θ_j's Dirichlet prior.
Model level:
  η: the overall tendency of making correct predictions, with
  ω_{jk} = η if k = j, and ω_{jk} = (1 - η)/(M - 1) if k ≠ j, for k = 1, ..., M.
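The construction of ω_j from η can be sketched as below; the values of M and η are illustrative, and make_omega is a hypothetical helper implementing the slide's formula literally:

```python
# Build the Dirichlet-prior parameter vectors omega_j from eta.
import numpy as np

def make_omega(M, eta, j):
    """omega_jk = eta if k == j, else (1 - eta) / (M - 1)."""
    w = np.full(M, (1.0 - eta) / (M - 1))
    w[j] = eta
    return w

omega = np.vstack([make_omega(5, 0.8, j) for j in range(5)])
print(omega[0])  # mass eta on the correct class, the rest spread evenly
```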


- Performance
The confusion matrix C presents the classification results:
  C is an M × M matrix.
  c_{jk} is the number of documents with true class label j but predicted class label k.
  c_j follows a multinomial distribution with parameter θ_j, where Σ_{k=1}^M c_{jk} = n_j.

- Performance
µ presents the true classification of the documents; ω presents the predicted classification.
Treat the performance measure (either MiF1 or MaF1) as a random variable ψ, which is a function of µ and ω.
For example, for MiF1:
  Precision = (Σ_{j=1}^M tp_j) / (Σ_{j=1}^M (tp_j + fp_j)) = Σ_{j=1}^M µ_j θ_{jj}
  Recall = (Σ_{j=1}^M tp_j) / (Σ_{j=1}^M (tp_j + fn_j)) = Σ_{j=1}^M µ_j θ_{jj}
In multi-class single-label classification, MiF1 = Precision = Recall.
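The identity MiF1 = Precision = Recall can be checked numerically; in single-label multi-class classification all three micro-averages coincide with accuracy (the labels below are illustrative):

```python
# Check: micro-averaged precision, recall and F1 all equal accuracy.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)
assert f1_score(y_true, y_pred, average="micro") == acc
assert precision_score(y_true, y_pred, average="micro") == acc
assert recall_score(y_true, y_pred, average="micro") == acc
print(acc)
```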

- Performance
For two models A and B, the difference in overall performance is represented by δ = ψ_A - ψ_B.
We estimate the uncertainty of the difference between the two models by examining the posterior probability distribution of δ.
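The posterior of δ can be approximated by Monte Carlo. The sketch below is a simplification, not the paper's full hierarchical model: it uses independent flat Dirichlet priors on µ and each θ_j (ignoring the ω/η hierarchy), targets MiF1 only, and the confusion matrices are made up:

```python
# Simplified posterior sampling of delta = psi_A - psi_B for micro-F1.
import numpy as np

rng = np.random.default_rng(1)

def micro_f1_samples(C, n_samples=5000, alpha=1.0, beta=1.0):
    """Posterior draws of MiF1 = sum_j mu_j * theta_jj given confusion
    matrix C (rows = true class, columns = predicted class)."""
    C = np.asarray(C, dtype=float)
    n = C.sum(axis=1)                              # observed true class sizes
    mu = rng.dirichlet(alpha + n, size=n_samples)  # posterior of mu
    theta_diag = np.column_stack([
        rng.dirichlet(beta + C[j], size=n_samples)[:, j]  # posterior of theta_jj
        for j in range(C.shape[0])
    ])
    return (mu * theta_diag).sum(axis=1)

C_A = [[90, 5, 5], [10, 80, 10], [5, 15, 80]]      # classifier A (illustrative)
C_B = [[85, 10, 5], [15, 70, 15], [10, 20, 70]]    # classifier B (illustrative)
delta = micro_f1_samples(C_A) - micro_f1_samples(C_B)
print(delta.mean(), np.quantile(delta, [0.025, 0.975]))
```

Summaries such as the HDI of δ, or the probability mass inside a ROPE, then follow directly from these draws.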


- Dataset
A standard benchmark dataset for text classification: 20newsgroups (http://qwone.com/~jason/20newsgroups/).
  60% subset for training; 40% subset for testing.
  Filtered by stripping newsgroup-related metadata.
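This setup can be reproduced with scikit-learn's built-in loader; the remove tuple strips headers, footers and quoted text:

```python
# Load 20newsgroups with the standard train/test split (roughly 60/40)
# and newsgroup-related metadata stripped out.
from sklearn.datasets import fetch_20newsgroups

remove = ("headers", "footers", "quotes")
train = fetch_20newsgroups(subset="train", remove=remove)
test = fetch_20newsgroups(subset="test", remove=remove)
print(len(train.data), len(test.data), len(train.target_names))
```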

- Classifiers
Classification algorithms:
  Naive Bayes (NB): Bernoulli event model (NB_Bern); multinomial event model (NB_Mult).
  Linear Support Vector Machine (SVM): L1 penalty (SVM_L1); L2 penalty (SVM_L2).
Implementation of these algorithms: the Python library scikit-learn.
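The four classifier variants can be instantiated in scikit-learn roughly as follows (hyperparameters are left at defaults, not necessarily those used in the experiments):

```python
# The two Naive Bayes event models and the two linear SVM penalties.
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

classifiers = {
    "NB_Bern": BernoulliNB(),                       # Bernoulli event model
    "NB_Mult": MultinomialNB(),                     # multinomial event model
    "SVM_L1": LinearSVC(penalty="l1", dual=False),  # L1-penalised linear SVM
    "SVM_L2": LinearSVC(penalty="l2"),              # L2-penalised linear SVM
}
```

Each would typically be trained on count or TF-IDF features extracted from the documents.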

- Results
Comparing MaF1 between NB_Bern and NB_Mult.
[Figure: confusion matrices of NB_Bern and NB_Mult over the 20 true vs. predicted class labels, and the posterior distribution of δ: mean = -0.109, 100.0% < 0 < 0.0%, 0.0% in ROPE, 95% HDI [-0.123, -0.094].]
Conclusion: NB_Bern is significantly outperformed by NB_Mult.

- Results
Comparing MaF1 between SVM_L1 and SVM_L2.
[Figure: confusion matrices of SVM_L1 and SVM_L2 over the 20 true vs. predicted class labels, and the posterior distribution of δ: mean = -0.016, 98.0% < 0 < 2.0%, 7.3% in ROPE, 95% HDI [-0.031, -0.001].]
Conclusion: SVM_L1 is only slightly outperformed by SVM_L2.

- Results
Comparing MaF1 between NB_Mult and SVM_L2.
[Figure: confusion matrices of NB_Mult and SVM_L2 over the 20 true vs. predicted class labels, and the posterior distribution of δ: mean = +0.022, 0.2% < 0 < 99.8%, 1.3% in ROPE, 95% HDI [+0.007, +0.037].]
Conclusion: NB_Mult works a lot better than SVM_L2.
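The HDI and ROPE figures on these result slides can be computed from posterior draws of δ. The sketch below uses simulated normal draws in place of the real posterior, and the ROPE half-width of 0.01 is an assumption:

```python
# 95% HDI and fraction of posterior mass inside a ROPE around zero.
import numpy as np

rng = np.random.default_rng(2)
delta = rng.normal(-0.109, 0.0075, size=20000)  # stand-in for posterior draws

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples."""
    s = np.sort(samples)
    k = int(np.ceil(mass * len(s)))
    widths = s[k - 1:] - s[:len(s) - k + 1]
    i = int(np.argmin(widths))
    return s[i], s[i + k - 1]

lo, hi = hdi(delta)
in_rope = float(np.mean(np.abs(delta) < 0.01))  # ROPE = [-0.01, 0.01] (assumed)
print(f"HDI 95% [{lo:.3f}, {hi:.3f}], {100 * in_rope:.1f}% in ROPE")
```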

- Conclusion
The main contribution of this paper is a Bayesian estimation approach to assessing the uncertainty of average F1 scores in multi-class text classification. We make an interval estimate, instead of a simplistic point estimate, of a text classifier's future performance on unseen data.
Extensions:
  To multi-class multi-label classification.
  To comparing classifiers on any type of data, e.g., images.