Evaluation Issues in AI and NLP. COMP-599 Dec 5, 2016


Announcements

- Course evaluations: please submit one!
- Course projects: due today, but you can submit by Dec 19, 11:59pm without penalty.
- A3 and A4: you'll be able to pick them up after they're marked.

A4 Reading Discussion

- What do you think is the main contribution of the paper that is still relevant today?
- How does the paper relate to the following concepts?
  - Language modelling
  - Underspecification
  - Morphological analysis
- What are some of its limitations that we could perhaps better solve today?

Outline

- Evaluation in NLP
- The Turing Test
  - Deception in the Turing test
  - Gaming the measure with cheap tricks
- Winograd Schema Challenge
- Recap

Evaluation in NLP

What are some evaluation measures and methods for the different NLP tasks that we have discussed in this class?

Classes of Evaluation Methods

- Intrinsic measures: pertain to the particular task that a model aims to solve
- Extrinsic measures: pertain to some downstream application of the current model

This is a separate issue from whether the evaluation is manual or automatic. Let's classify the previous evaluations.

Validity of Evaluations

There are different kinds of validity in our evaluations, which help us know whether our model is making real progress:

- Internal validity
- External validity
- Test validity

Internal Validity

Whether a causal conclusion drawn by a study is warranted.

- Conclusion: Method A outperforms Method B
- Independent variable: the method
- Dependent variable: the evaluation measure

Checks: Same training data? Same preprocessing? Were both methods' parameters tuned? No other confounds? Were the methods, evaluation measures, etc. implemented correctly?

External Validity

Whether the conclusions drawn by a study generalize to other situations and other data.

- Conclusion: Method A outperforms Method B
- How big was the test data set?
- Is it representative of all kinds of language? (e.g., benchmark data sets are usually drawn from one genre of text)
- Is it biased in some way?

Case Study: Parsing Results

Parsing results, from McClosky et al. (2010). An evaluation only on WSJ would have limited external validity. Developing methods that generalize across domains is called domain adaptation.

Construct Validity

Concerned with whether an evaluation actually measures what it claims to.

- Does ROUGE reflect the usefulness of summaries?
- Does better perplexity in language modelling lead to a lower word error rate in ASR?
- Does a lower word error rate lead to better user satisfaction with an ASR system?

Developing an Evaluation Measure

We would like a good, cheap, automatic measure of the quality of NLP output.

- Correlate it against some gold standard; e.g., ROUGE approximates human summary judgment scores.
- We could also develop an automatic measure that correlates with an extrinsic task performance measure.

As we'll see, overly optimizing on a measure tends to reduce its usefulness, lowering construct validity.
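Checking how well a cheap automatic measure tracks a gold standard typically comes down to computing a correlation over a set of systems or outputs. A minimal sketch, using Pearson correlation and made-up per-system scores (all numbers here are hypothetical, not from any real shared task):

```python
# Sketch: validating an automatic metric against human judgments.
# The scores below are invented for illustration only.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-system automatic metric scores vs. averaged human ratings
auto_scores = [0.42, 0.38, 0.51, 0.45, 0.30]
human_scores = [3.1, 2.8, 4.0, 3.5, 2.2]

r = pearson(auto_scores, human_scores)
print(round(r, 3))
```

A high correlation over held-out systems is evidence for construct validity; but as Goodhart's law (below in these slides) warns, that evidence erodes once systems start optimizing the metric directly.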

Solving AI and NLP

When will we have solved NLP, or indeed, AI? Is there a single test that can measure progress towards this goal?

The Turing Test

Alan Turing proposed this as a way to check whether a machine exhibits intelligent behaviour (1950). It is more concrete than a philosophical question such as "Can machines think?"

Setup: a human interlocutor chats with an agent, which is either a machine or another human, for 5 minutes. Afterwards, the interlocutor must decide whether the agent is human or machine.

Loebner Prize

An annual competition inspired by the Turing test, with prize money: http://www.aisb.org.uk/events/loebner-prize

Another competition is organized by the University of Reading, which claims to have passed the Turing test: http://www.bbc.com/news/technology-27762088

Is this format, or the Turing test in general, a good indicator of progress in NLP?

Deception in the Turing Test

Most of the effort seems to go into deception. Meet Rose, the 2014 Loebner winner: http://ec2-54-215-197-164.us-west-1.compute.amazonaws.com/speech.php

Sample Chatlog

Jackie: Where are you right now?
Rose: I'm here, at Bletchley Park.
Jackie: Where's that?
Rose: It doesn't matter.
Jackie: What is the weather there right now?
Rose: It's nice enough out. But it's pretty reasonable the year round.
Jackie: That must be nice. It's pretty cold in Montreal in winter.
Rose: I do prefer nice weather.
Jackie: What do you think about computational linguistics?
Rose: I haven't a clue. San Francisco is perverse. When you expect sunny it's fog and chill winds, but just when you think winter's nearly here, it goes sunny.
Jackie: Do you only care about weather?
Rose: I prefer sunny climates and good weather. Though occasional warm rain is OK.

Goodhart's Law

First formulated in economics: once a measure of some quality is turned into a target to optimize, it is no longer a good measure of that quality.

Back to the summarization example: many systems now directly optimize for ROUGE to perform well in shared tasks. Some tricks clearly do not represent genuine progress; others are less clear-cut, but also appear not to.

Optimizing ROUGE

ROUGE is recall-oriented, so systems make sure to use the entire word length limit, even if the last sentence is cut off.

ROUGE was developed using purely extractive summarization methods. Sentence simplification and compression help ROUGE, because we can fit more content into the same word length limit, but they usually degrade readability and overall quality.

There are other cases of this in NLP: BLEU, PARSEVAL.
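To make the recall orientation concrete, here is a minimal sketch of ROUGE-1 recall, i.e., unigram overlap divided by the reference length. This is an illustration of the idea only, not the official ROUGE toolkit:

```python
# Sketch of ROUGE-1 recall: clipped unigram overlap with the reference,
# divided by the number of reference tokens.
from collections import Counter

def rouge1_recall(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clip each word's credit at its reference count
    overlap = sum(min(cand[w], n) for w, n in ref.items())
    return overlap / sum(ref.values())

ref = "the cat sat on the mat"
# A compressed candidate packs content words into a small budget:
print(rouge1_recall("cat sat mat", ref))
print(rouge1_recall("the cat sat on the mat", ref))
```

Because the denominator is fixed by the reference, adding more candidate words can never lower the score; this is exactly why filling the entire word limit, or compressing sentences to squeeze in more content, pays off under ROUGE even when it hurts readability.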

Ignoring Less Common Issues

Less common, but important and systematic issues are ignored if we only use standard evaluation measures. E.g., in parsing, overall accuracy is relatively high (~90 F1), but parsing of coordinate structures is poor. Hogan (2007) found that a baseline parser gets about 70 F1 on parsing NP coordination:

- busloads of [executives and their wives]  (correct)
- [busloads of executives] and [their wives]  (incorrect)
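The F1 scores quoted for parsing are computed PARSEVAL-style, by comparing the sets of labelled spans in the gold and predicted parses. A minimal sketch, with hypothetical labels and span indices standing in for the two coordination readings:

```python
# Sketch of PARSEVAL-style bracketing F1 over labelled spans,
# each represented as a (label, start, end) tuple.
def bracket_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical brackets for "busloads of executives and their wives":
# gold attaches the coordination low, the prediction attaches it high.
gold = [("NP", 0, 6), ("PP", 1, 6), ("NP", 2, 6)]
pred = [("NP", 0, 6), ("PP", 1, 3), ("NP", 0, 3)]
print(round(bracket_f1(gold, pred), 2))
```

A single attachment mistake flips several brackets at once, yet it contributes only a few errors to a corpus-wide F1, which is how systematic coordination failures hide behind a ~90 overall score.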

Cheap Tricks

Are we overly enamoured of corpus-based, statistical approaches? Cheap tricks (Levesque, 2013) get the answer right, but for dubious reasons different from human-like reasoning. E.g.:

- Could a crocodile run a steeplechase?
- Should baseball players be allowed to glue small wings onto their caps?

One can use statistical reasoning and the closed-world assumption to answer such questions.

Cheap Tricks in NLP

- Chatbots: create a fictitious personality and backstory; deceive with humour, emotional outbursts, and misdirection.
- Question answering and information extraction: use existing knowledge bases and regularities in statistical patterns to look up memorized knowledge.
- Automatic summarization and NLG: use extraction and redundancy to avoid having to really understand the text and generate summary sentences (Cheung and Penn, 2013).

Winograd Schema Challenge

An attempt to design multiple-choice questions that require deeper understanding, beyond:

- simple statistical look-ups with some search method;
- features that map simply to other features (e.g., "older than" maps to AGE);
- biases in word order, vocabulary, and grammar.

Basic format: binary questions, where a small change in wording leads to a different correct solution.

Example

Joan made sure to thank Susan for all the help she had given. Who had given the help?
Choices: Joan, Susan

Joan made sure to thank Susan for all the help she had received. Who had received the help?
Choices: Joan, Susan

https://www.cs.nyu.edu/davise/papers/ws.html

Consequences

It turns out that it is possible to use statistical knowledge and existing work in coreference resolution to partially solve WSC questions: a variety of semantic features fed to a machine learning system yields 73% accuracy (Rahman and Ng, 2012).

The bigger point remains: is there a science of AI distinct from the technological aspect of it? How do we decide what kinds of techniques are cheap tricks vs. genuine intelligent behaviour?

Recap of Course

What have we done in COMP-599?

Computational Linguistics (CL)

Modelling natural language with computational models and techniques.

Domains of natural language:
- acoustic signals, phonemes, words, syntax, semantics, ...
- speech vs. text
- natural language understanding (or comprehension) vs. natural language generation (or production)

Computational Linguistics (CL)

Modelling natural language with computational models and techniques.

Goals:
- language technology applications
- scientific understanding of how language works

Computational Linguistics (CL)

Modelling natural language with computational models and techniques.

Methodology and techniques:
- gathering data: language resources
- evaluation
- statistical methods and machine learning
- rule-based methods

Current Trends and Challenges

Speculations about the future of NLP.

Better Use of More Data

Large amounts of data are now available, but they may be unlabelled, noisy, or not directly relevant to your specific problem. How do we make better use of them?

- Unsupervised or lightly supervised methods
- Prediction models that can use data to learn which features are important (neural networks)
- Incorporating linguistic insights with large-scale data processing

Using More Sources of Knowledge

Old setup: annotated data set -> feature extraction + simple supervised learning -> model predictions.

Newer setup: background text, general knowledge bases, domain-specific constraints, and directly relevant annotated data -> better model? -> model predictions.

Away From Discreteness

Discreteness is sometimes a convenient assumption, but also a problem:

- words, phrases, sentences, and labels for them
- symbolic representations of semantics
- it motivated a lot of work in regularization and smoothing

Representation learning: learn continuous-valued representations using co-occurrence statistics or some other objective function; e.g., vector-space semantics.

Continuous-Valued Representations

E.g., for cat, linguistics, NP, VP.

Advantages:
- implicitly deal with smoothness and soft boundaries
- incorporate many sources of information when training vectors

Challenges:
- What should a good continuous representation look like?
- Evaluation is often still in terms of a discrete set of labels.
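As a toy illustration of building continuous representations from co-occurrence statistics, one can count each word's neighbours in a small window and compare the resulting count vectors with cosine similarity. The corpus and window size below are arbitrary choices for the sketch:

```python
# Toy co-occurrence vectors: count words in a +/-1 window,
# then compare vectors with cosine similarity.
from collections import defaultdict
from math import sqrt

corpus = "the cat sat on the mat the dog sat on the rug".split()

vectors = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(corpus):
    for j in range(max(0, i - 1), min(len(corpus), i + 2)):
        if j != i:
            vectors[w][corpus[j]] += 1

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# "cat" and "dog" share contexts, so their vectors are closer
# than those of "cat" and "mat".
print(round(cosine(vectors["cat"], vectors["dog"]), 3))
print(round(cosine(vectors["cat"], vectors["mat"]), 3))
```

Real systems replace raw counts with reweighted or learned dense vectors, but the soft, graded similarities here are exactly the smoothness advantage the slide mentions.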

Broadening Horizons

We are getting better at solving specific problems on specific benchmark data sets; e.g., on the WSJ corpus, POS tagging performance of >97% matches human-level performance.

Much more difficult and interesting:
- working across multiple kinds of text and data sets
- integrating disparate theories, domains, and tasks

Connections to Other Fields

Cognitive science and psycholinguistics: e.g., modelling L1 and L2 acquisition, and other human behaviour, with computational models.

Human-computer interaction and information visualization: it's nice that you have a tagger/parser/summarizer/ASR system/NLG module. Now, what do you do with it? Multi-modal systems and visualizations.

That's It!

Good luck on your projects and finals!