
Tagger Evaluation Given Hierarchical Tag Sets

I. Dan Melamed (dan.melamed@westgroup.com), West Group
Philip Resnik (resnik@umiacs.umd.edu), University of Maryland

arXiv:cs/0008007v1 [cs.CL] 10 Aug 2000

Abstract. We present methods for evaluating human and automatic taggers that extend current practice in three ways. First, we show how to evaluate taggers that assign multiple tags to each test instance, even if they do not assign probabilities. Second, we show how to accommodate a common property of manually constructed gold standards that are typically used for objective evaluation, namely that there is often more than one correct answer. Third, we show how to measure performance when the set of possible tags is tree-structured in an is-a hierarchy. To illustrate how our methods can be used to measure inter-annotator agreement, we show how to compute the kappa coefficient over hierarchical tag sets.

1. Introduction

Objective evaluation has been central in advancing our understanding of the best ways to engineer natural language processing systems. A major challenge of objective evaluation is to design fair and informative evaluation metrics, and algorithms to compute those metrics. When the task involves any kind of tagging (or "labeling"), the most common performance criterion is simply exact match: exactly matching the right answer scores a point, and no other answer scores any points. This measure is sometimes adjusted for the expected frequency of matches occurring by chance (Carletta, 1996).

Resnik and Yarowsky (1997; to appear), henceforth R&Y, have argued that the exact match criterion is inadequate for evaluating word sense disambiguation (WSD) systems. R&Y proposed a generalization capable of assigning partial credit, thus enabling more informative comparisons on a finer scale. In this article, we present three further generalizations. First, we show how to evaluate non-probabilistic assignments of multiple tags. Second, we show how to accommodate gold standards in which there is more than one correct answer. Third, we show how to measure performance when the set of possible tags is tree-structured in an is-a hierarchy. To illustrate how our methods can be applied to the comparison of human taggers, we show how to compute the kappa coefficient (Siegel and Castellan, 1988) over hierarchical tag sets.

Table I. Hypothetical output of four WSD systems on a test instance, where the correct sense is (2). The exact match criterion would assign zero credit to all four systems. Source: (Resnik and Yarowsky, 1997)

    Sense of "interest" (in English)     System 1   System 2   System 3   System 4
    (1) monetary (e.g. on a loan)           .47        .85        .28       1.00
    (2) stake or share (correct)            .42        .05        .24        .00
    (3) benefit/advantage/sake              .06        .05        .24        .00
    (4) intellectual curiosity              .05        .05        .24        .00

Our methods depend on the tree structure of the tag hierarchy, but not on the nature of the nodes in it. For example, although these generalizations were motivated by the SENSEVAL exercise (Palmer and Kilgarriff, this issue), the mathematics applies just as well to any tagging task that might involve hierarchical tag sets, such as part-of-speech tagging or semantic tagging (Chinchor, 1998). With respect to word sense disambiguation in particular, questions of whether part-of-speech and other syntactic distinctions should be part of the sense inventory are orthogonal to the issues addressed here.

2. Previous Work

Work on tagging tasks such as part-of-speech tagging and word sense disambiguation has traditionally been evaluated using the exact match criterion, which simply computes the percentage of test instances for which exactly the correct answer is obtained. R&Y noted that, even if a system fails to uniquely identify the correct tag, it may nonetheless be doing a good job of narrowing down the possibilities. To illustrate the myopia of the exact match criterion, R&Y used the hypothetical example in Table I. Some of the systems in the table are clearly better than others, but all would get zero credit under the exact match criterion. R&Y proposed the following measure, among others, as a more discriminating alternative:

    Score(A) = Pr_A(c | w, context(w))    (1)

In words, the score for system A on test instance w is the probability assigned by the system to the correct sense c given w in its context.
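As a concrete illustration, here is a minimal Python sketch of Equation 1 applied to the four hypothetical systems of Table I. The dictionary of outputs simply transcribes the table; the short sense keys are our own shorthand for the four senses listed there.

    # Equation 1: a system's score on a test instance is the probability
    # it assigns to the correct tag.
    def score(distribution, correct_sense):
        return distribution.get(correct_sense, 0.0)

    # Hypothetical outputs from Table I for the senses of "interest";
    # the short keys are shorthand for the four senses listed there.
    systems = {
        "System 1": {"monetary": .47, "stake": .42, "benefit": .06, "curiosity": .05},
        "System 2": {"monetary": .85, "stake": .05, "benefit": .05, "curiosity": .05},
        "System 3": {"monetary": .28, "stake": .24, "benefit": .24, "curiosity": .24},
        "System 4": {"monetary": 1.0, "stake": .00, "benefit": .00, "curiosity": .00},
    }

    for name, dist in systems.items():
        print(name, score(dist, "stake"))  # the correct sense is (2) "stake or share"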

In the example in Table I, System 1 would get a score of 0.42 and System 4 would score zero.

3. New Generalizations

The generalizations below start with R&Y's premise that, given a probability distribution over tags and a single known correct tag, the algorithm's score should be the probability that the algorithm assigns to the correct tag.

3.1. Non-probabilistic Algorithms

Algorithms that output multiple tags but do not assign probabilities should be treated as assigning uniform probabilities over the tags that they output. For example, an algorithm that considers tags A and B as possible, but eliminates tags C, D and E for a word with 5 tags in the reference inventory, should be viewed as assigning probabilities of .5 each to A and B, and probability 0 to each of C, D, and E. Under this policy, algorithms that deterministically select a single tag are viewed as assigning 100% of the probability mass to that one tag, like System 4 in Table I. These algorithms would get the same score from Equation 1 as from the exact match criterion.

3.2. Multiple Correct Tags

Given multiple correct tags for a given word token, the algorithm's score should be the sum of all probabilities that it assigns to any of the correct tags; that is, multiple tags are interpreted disjunctively. This is consistent with instructions provided to the SENSEVAL annotators: "In general, use disjunction ... where you are unsure which tag to apply" (Krishnamurthy and Nicholls, 1998). In symbols, we build on Equation 1:

    Score(A) = Σ_{t=1}^{C} Pr_A(c_t | w, context(w))    (2)

where t ranges over the C correct tags. Even if it is impossible to know for certain whether annotators intended a multi-tag annotation as disjunctive or conjunctive, the disjunctive interpretation gives algorithms the benefit of the doubt.
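A short sketch of how Sections 3.1 and 3.2 combine, using illustrative tag names of our own: a bare multi-tag output is converted to a uniform distribution, and Equation 2 then sums the probability mass falling on any correct tag.

    # Section 3.1: a non-probabilistic output of k tags is treated as a
    # uniform distribution assigning 1/k to each output tag (0 elsewhere).
    def as_distribution(output_tags):
        return {tag: 1.0 / len(output_tags) for tag in output_tags}

    # Equation 2: sum the probabilities assigned to any of the correct tags,
    # interpreting multiple correct tags disjunctively (Section 3.2).
    def score(distribution, correct_tags):
        return sum(distribution.get(tag, 0.0) for tag in correct_tags)

    # An algorithm narrows a five-tag inventory {A, B, C, D, E} to A and B:
    dist = as_distribution(["A", "B"])      # {"A": 0.5, "B": 0.5}
    print(score(dist, ["A"]))               # 0.5
    print(score(dist, ["A", "C"]))          # 0.5: only A carries any mass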

3.3. Tree-structured Tag Sets

The same scoring criterion can be used for structured tag sets as for unstructured ones: What is the probability that the algorithm assigns to any of the correct tags? The complication for structured tag sets is that it is not obvious how to compare tags that are in a parent-child relationship.

The probabilistic evaluation of taggers can be extended to handle tree-structured tag sets, such as HECTOR (Atkins, 1993), if the structure is interpreted as an is-a hierarchy. For example, if word sense A.2 is a sub-sense of word sense A, then any word token of sense A.2 also is-a token of sense A. Under this interpretation, the problem can be solved by defining two kinds of probability distributions:

1. Pr(occurrence of parent tag | occurrence of child tag)
2. Pr(occurrence of child tag | occurrence of parent tag)

In a tree-structured is-a hierarchy Pr(parent | child) = 1, so the first one is easy. The second one is harder, unfortunately; in general, these ("downward") probabilities are unknown. Given a sufficiently large training corpus, the downward probabilities can be estimated empirically. However, in cases of very sparse training data, as in SENSEVAL, such estimates are likely to be unreliable, and may undermine the validity of experiments based on them. In the absence of reliable prior knowledge about tag distributions over the various tag-tree branches, we appeal to the maximum entropy principle, which dictates that we assume a uniform distribution of sub-tags for each tag. This assumption is not as bad as it may seem. It will be false in most individual cases, but if we compare tagging algorithms by averaging performance over many different word types, most of the biases should come out in the wash.

Now, how do we use these conditional probabilities for scoring? The key is to treat each non-leaf tag as underspecified. For example, if sense A has just the two subsenses A.1 and A.2, then tagging a word with sense A is equivalent to giving it a probability of one half of being sense A.1 and one half of being sense A.2, given our assumption of uniform downward probabilities. This interpretation applies both to the tags in the output of tagging algorithms and to the manual (correct, reference) annotations.

4. Example

Suppose our sense inventory for a given word is as shown in Figure 1.

    A                    B
    ├── A.1              ├── B.1
    │    ├── A.1a        ├── B.2
    │    └── A.1b        └── B.3
    └── A.2

    Figure 1. Example tag inventory.

Under the assumption of uniform downward probabilities, we start by deducing that Pr(A.1 | A) = .5, Pr(A.1a | A.1) = .5 (so Pr(A.1a | A) = .25), Pr(B.2 | B) = 1/3, and so on. If any of these conditional probabilities is reversed, its value is always 1; for example, Pr(A | A.1a) = 1. Next, these probabilities are applied in computing Equation 2, as illustrated in Table II.

Table II. Examples of the scoring scheme, for the tag inventory in Figure 1.

    Manual Annotation   Algorithm's Output   Score
    B                   A                    0
    A                   A                    1
    A                   A.1                  1
    A                   A.1b                 1
    A.1                 A                    .5
    A.1 and A.2         A                    .5 + .5 = 1
    A.1a                A                    .25
    A.1a and B.2        B                    Pr(B.2 | B) = 1/3
    A.1a and B.2        A.1                  .5
    A.1a and B.2        A.1 and B.2          .5 × .5 + .5 × 1 = .75
    A.1a and B.2        A.1 and B            .5 × .5 + .5 × .333 = .41666
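The scores in Table II can be reproduced mechanically by distributing every tag over the leaves of the Figure 1 inventory with uniform downward probabilities. The sketch below assumes the tree is given as a parent-to-children map; distribute and score are helper names of our own.

    # The Figure 1 inventory, as a parent-to-children map; tags absent from
    # the map (A.1a, A.1b, A.2, B.1, B.2, B.3) are leaves.
    CHILDREN = {"A": ["A.1", "A.2"], "A.1": ["A.1a", "A.1b"],
                "B": ["B.1", "B.2", "B.3"]}

    def distribute(tag, mass=1.0):
        """Spread a tag's probability mass over its leaves, splitting it
        uniformly among the children at every level (Section 3.3)."""
        kids = CHILDREN.get(tag)
        if not kids:                        # leaf tag: all the mass stays here
            return {tag: mass}
        leaves = {}
        for kid in kids:
            for leaf, p in distribute(kid, mass / len(kids)).items():
                leaves[leaf] = leaves.get(leaf, 0.0) + p
        return leaves

    def score(output_tags, correct_tags):
        """Equation 2 over a tree-structured inventory: the mass the
        leaf-distributed output places on leaves under any correct tag."""
        output = {}
        for tag in output_tags:             # uniform over a bare multi-tag output
            for leaf, p in distribute(tag, 1.0 / len(output_tags)).items():
                output[leaf] = output.get(leaf, 0.0) + p
        correct_leaves = set().union(*(distribute(t) for t in correct_tags))
        return sum(output.get(leaf, 0.0) for leaf in correct_leaves)

    print(score(["A"], ["A.1"]))                 # 0.5     (row 5 of Table II)
    print(score(["B"], ["A.1a", "B.2"]))         # 0.3333  (row 8)
    print(score(["A.1", "B"], ["A.1a", "B.2"]))  # 0.41666 (last row)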

5. Inter-Annotator Agreement Given Hierarchical Tag Sets

Gold standard annotations are often validated by measurements of inter-annotator agreement. The computation of any statistic that may be used for this purpose necessarily involves comparing tags to see whether they are the same. Again, the question arises as to how to compare tags that are in a parent-child relationship. We propose the same answer as before: Treat non-leaf tags as underspecified.

To compute agreement statistics under this proposal, every non-leaf tag in each annotation is recursively distributed over its children, using uniform downward probabilities. The resulting annotations involve only the most specific possible tags, which can never be in a parent-child relationship. Agreement statistics can then be computed as usual, taking into account the probabilities distributed to each tag.

One of the most common measures of pairwise inter-annotator agreement is the kappa coefficient (Siegel and Castellan, 1988):

    K = (Pr(A) − Pr(E)) / (1 − Pr(E))    (3)

where Pr(A) is the proportion of times that the annotators agree and Pr(E) is the probability of agreement by chance. Once the annotations are distributed over the leaves L of the tag inventory, these quantities are easy to compute. Given a set of test instances T,

    Pr(A) = (1/|T|) Σ_{t∈T} Σ_{l∈L} Pr(l | annotation_1(t)) · Pr(l | annotation_2(t))    (4)

    Pr(E) = Σ_{l∈L} Pr(l)^2    (5)

Computing these probabilities over just the leaves of the tag inventory ensures that the importance of non-leaf tags is not inflated by double-counting.
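Continuing the sketch above (it reuses distribute and the Figure 1 inventory), kappa can be computed from Equations 3-5 once each annotator's tags are distributed over the leaves. Estimating Pr(l) in Equation 5 from the pooled leaf-distributed annotations of both annotators is our assumption; the article does not fix a particular estimator.

    def kappa(annotations1, annotations2):
        """Equation 3, with Pr(A) from Equation 4 and Pr(E) from Equation 5,
        for one tag per annotator per test instance."""
        n = len(annotations1)
        dists1 = [distribute(t) for t in annotations1]
        dists2 = [distribute(t) for t in annotations2]
        # Equation 4: average leaf-level agreement across test instances.
        pr_a = sum(sum(d1.get(l, 0.0) * d2.get(l, 0.0) for l in set(d1) | set(d2))
                   for d1, d2 in zip(dists1, dists2)) / n
        # Equation 5: chance agreement, with Pr(l) estimated (our assumption)
        # from the pooled leaf-distributed annotations of both annotators.
        pooled = {}
        for d in dists1 + dists2:
            for leaf, p in d.items():
                pooled[leaf] = pooled.get(leaf, 0.0) + p / (2 * n)
        pr_e = sum(p * p for p in pooled.values())
        return (pr_a - pr_e) / (1 - pr_e)

    # Annotator 2 refines annotator 1's tag A to A.1 on the second instance;
    # the parent-child disagreement earns partial credit rather than zero.
    print(kappa(["A.1a", "A", "B.2"], ["A.1a", "A.1", "B.2"]))  # about 0.62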

6. Conclusion

We have presented three generalizations of standard evaluation methods for tagging tasks. Our methods are based on the principle of maximum entropy, which minimizes potential evaluation bias. As with the R&Y generalization in Equation 1, and the exact match criterion before it, our methods produce scores that can be justifiably interpreted as probabilities. Therefore, decision processes can combine these scores with other probabilities in a maximally informative way by using the axioms of probability theory.

Our generalizations make few assumptions, but even these few assumptions lead to some limitations on the applicability of our proposal. First, although we are not aware of any algorithms that were designed to behave this way, our methods are not applicable to algorithms that conjunctively assign more than one tag per test instance. A potentially more serious limitation is our interpretation of tree-structured tag sets as is-a hierarchies. There has been considerable debate, for example, about whether this interpretation is valid for such well-known tag sets as HECTOR and WordNet.

This work can be extended in a number of ways. For example, it would not be difficult to generalize our methods from trees to hierarchies with multiple inheritance, such as WordNet (Fellbaum, 1998).

References

Atkins, S.: 1993, "Tools for computer-aided lexicography: the Hector project". In: Papers in Computational Lexicography: COMPLEX '93. Budapest.

Carletta, J.: 1996, "Assessing agreement on classification tasks: the Kappa statistic". Computational Linguistics 22(2), 249-254.

Chinchor, N. (ed.): 1998, Proceedings of the 7th Message Understanding Conference. Columbia, MD: Science Applications International Corporation (SAIC). Online publication at http://www.muc.saic.com/proceedings/muc_7_toc.html.

Fellbaum, C. (ed.): 1998, WordNet: An Electronic Lexical Database. MIT Press.

Krishnamurthy, R. and D. Nicholls: 1998, "Peeling an onion: the lexicographer's experience of manual sense-tagging". In: SENSEVAL Workshop. Sussex, England.

Resnik, P. and D. Yarowsky: 1997, "A perspective on word sense disambiguation methods and their evaluation". In: M. Light (ed.): ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How? Washington, D.C.

Resnik, P. and D. Yarowsky: to appear, "Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation". Natural Language Engineering.

Siegel, S. and N. J. Castellan, Jr.: 1988, Nonparametric Statistics for the Behavioral Sciences. Second edition. McGraw-Hill.
