Data Collection and System Evaluation. SD Study Meeting Nurul Lubis, AHC Lab, NAIST


Contents
1. Introduction to Spoken Dialogue Systems
 1.4 Data Collection and Dialogue Corpora
 1.5 Evaluating Spoken Dialogue Systems
6. Methodologies and Practices of Evaluation
 6.2 Concepts
 6.3 Measures
 6.4 Frameworks

Data Collection and Dialogue Corpora

Corpus? Corpora?
- A machine-readable collection of speech and text that can be used to observe occurrences of particular phenomena
- Serves as a resource (in data-driven approaches) and as a basis for analysis, design, or evaluation

Examples
- SEMAINE (McKeown+, 2013)
- SWITCHBOARD (Godfrey+, 1992): telephone conversations between strangers; 2,430 conversations, 240 hours of speech

How to create such a corpus?
1. Recording
2. Transcribing
3. Annotating

Recording
What to collect?
- The dialogues collected must be representative of the system to be developed
- For hotel reservation? Flight booking?
How to collect it?
- Record conversations between humans: not representative of spoken dialogue systems
- Simulate the interaction instead: Wizard of Oz, user simulation

Wizard of Oz
- The user interacts with what appears to be the system
- The system is actually operated by a human "wizard" following a scenario
- However, this is expensive in both time and money

User Simulation
- A user simulator is used in place of a real user to generate responses to the system's queries
- Simulators can be rule-based or statistical (see the sketch below)
- Simulated data can be used to evaluate different aspects of dialogue systems, e.g., in early development stages or when a functionality is changed
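As a rough illustration of the rule-based option above, here is a minimal sketch of a user simulator. The flight-booking domain, the slot names, the dialogue acts, and the response rules are all illustrative assumptions, not taken from the slides.

```python
import random

# Hypothetical user goal for a flight-booking domain (made-up example data).
USER_GOAL = {"from": "Osaka", "to": "Tokyo", "date": "Monday"}

def simulate_user(system_act, slot=None):
    """Return a simulated user response to a system dialogue act."""
    if system_act == "request" and slot in USER_GOAL:
        return "inform({}={})".format(slot, USER_GOAL[slot])
    if system_act == "confirm" and slot in USER_GOAL:
        # Occasionally deny, to exercise the system's error handling.
        return "affirm()" if random.random() > 0.1 else "negate()"
    if system_act == "bye":
        return "bye()"
    return "null()"  # fall back when no rule applies

# Example: drive the simulator with a fixed system script.
for act, slot in [("request", "from"), ("request", "date"), ("confirm", "to")]:
    print(act, slot, "->", simulate_user(act, slot))
```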

Transcribing
- Putting what the speaker is saying into text
- May be orthographic: employs the standard spelling system of each target language
- Phonetic notation may be used to represent pronunciation and prosodic information
- Much effort has been put into standardizing the representation and data format of transcriptions: Extensible Markup Language (XML), Extensible MultiModal Annotation markup language (EMMA)

Annotating
- Captures other important information about the interaction in the corpus, relevant to the system to be developed
- A markup language may be used
- For example: discourse structure, dialogue acts, emotion

Emotion annotation scheme of the RECOLA database (Ringeval+, 2013)

Evaluation of Spoken Dialogue Systems

Which system is better?

Which system is better?
- Intuitively, one dialogue system should be preferred over another if it enables interaction with the user in a more natural and efficient manner
- We want quantitative measures for this!

6.2 Some Distinctions and Definitions

Evaluation Approaches: Where to conduct the evaluation?
REAL SITUATION
- Evaluation setup is complex
- Costly
- May cause unexpected results
LABORATORY
- Users adapt to the environment, so their behavior is not 100% natural
- Does not reflect the difficult conditions in which the system will actually be used
- The evaluation result is regarded as an upper bound on the system's performance

Evaluation Approaches: Empirical or theoretical?
From more theoretical to less theoretical:
- Verifying the consistency of a certain model
- Assessing predictions that the model makes about the application domain
- Collecting data on the basis of which empirical models can be compared and elaborated

Different assumptions will lead to different evaluation setups

Different evaluation goals will lead to different evaluation types:
- Functional: Is the system functionally appropriate?
- Performance: How well does the system perform on the particular task it is designed for?
- Usability and quality: Are the potential and real users satisfied with the system's performance? How do the users perceive the system when using it?
- Reusability: Is the system flexible and portable?

6.3 Evaluation Measures

Qualitative or Quantitative?
QUALITATIVE
- Aims at forming a conceptual model of the system: What does it do? Why do errors occur? Which part should be changed?
- Subjective! The results of the evaluation are preferences and tendencies
QUANTITATIVE
- Tries to address the evaluation using quantified metrics
- Objective! Results in claims with strong predictive power
The distinction between qualitative and quantitative evaluation is rather ambiguous
- Evaluation with quantitative measures always has subjective elements in it
- We need to calculate inter-annotator agreement to check the objectivity of the result

We can evaluate different aspects of an algorithm (see the measurement sketch below):
- With respect to time: How long does it take to complete the task?
- With respect to complexity: How many rules are needed? How many steps of inference are performed? How much memory is used?
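A minimal sketch of how such time and memory measurements can be taken in Python; `run_dialogue_task` is a hypothetical placeholder for the algorithm under evaluation, not something from the slides.

```python
import time
import tracemalloc

def run_dialogue_task():
    # Placeholder workload standing in for the algorithm under evaluation.
    return sum(i * i for i in range(100_000))

tracemalloc.start()
start = time.perf_counter()
run_dialogue_task()
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
tracemalloc.stop()

print(f"time: {elapsed:.4f} s, peak memory: {peak / 1024:.1f} KiB")
```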

We can evaluate different aspects of an algorithm in terms of correctness: Is the algorithm giving the correct output? We calculate several measurements based on the confusion matrix.

Confusion matrix (ground truth vs. system's prediction):
                 Predicted Yes        Predicted No
Actual Yes       true positive (tp)   false negative (fn)
Actual No        false positive (fp)  true negative (tn)

Metrics:
- Recall: R = tp / (tp + fn)
- Precision: P = tp / (tp + fp)
- F measure: F = 2RP / (R + P)
- Percentage accuracy: acc = (tp + tn) / (tp + tn + fp + fn)
- Percentage error: err = (fp + fn) / (tp + tn + fp + fn)
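These metrics can be computed directly from the four confusion-matrix counts; a small sketch, where the function name and the example counts are illustrative:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the standard confusion-matrix metrics listed above."""
    total = tp + tn + fp + fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    accuracy = (tp + tn) / total
    error = (fp + fn) / total
    return {"R": recall, "P": precision, "F": f_measure,
            "acc": accuracy, "err": error}

# Example: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives.
print(confusion_metrics(tp=40, fp=10, fn=5, tn=45))
```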

Different tasks call for different evaluations:
- In speech recognition: WER = (I + D + S) / #words x 100, where I = insertions, D = deletions, S = substitutions (see the sketch below)
- In language modeling: perplexity, the complexity (average branching factor) of a search path
- In translation and segmentation: an analogous error rate, (I + D + S) / #words x 100, computed on the translation or segmentation result aligned with the reference; for translation, BLEU (based on n-gram precision) is also widely used
- In spoken dialogue systems? We need more holistic measures!
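As referenced above, a minimal sketch of WER computed via word-level edit distance (Levenshtein with insertions, deletions, and substitutions); the example sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate, in percent, of hypothesis against reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref) * 100

print(wer("book a flight to tokyo", "book flight to kyoto"))  # 40.0
```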

Spoken dialogue systems need to be evaluated from different perspectives
EFFICIENCY
- Length of dialogue
- Time required to complete a task
- User and system response times
- Number of help requests, barge-ins, and repair utterances
- Correction time
EFFECTIVENESS
- Number of completed tasks
- Number of completed subtasks
- Number of successful transactions

Spoken dialogue systems need to be evaluated from different perspectives
USABILITY
- Asking potential users to test the system in realistic situations
- Collecting users' opinions through interviews and questionnaires
- Observing the interaction
- Comparing the system's performance with an expert's

6.4 Evaluation Frameworks

A correct evaluation methodology should fulfill these requirements [Möller, 2009]:
- Validity: the method should measure what it is intended to measure
- Reliability: the method should provide stable results across repeated applications
- Objectivity: the method should achieve inter-experimenter agreement on the results
- Sensitivity: the method should measure small variations in what it is intended to measure
- Robustness: the method should provide results independently of variables that are extraneous to the measured construct

6.4.1 The PARADISE (PARAdigm for DIalogue System Evaluation) Framework [Walker+, 1997, 1998, 2001] [Litman and Pan, 2002]

Viewing the system's task as an indicator of user satisfaction
- The framework measures the system's performance with the help of features that relate to task success and task cost
- It assumes that the main objective is to maximize user satisfaction
- To do that, we need to maximize task success while minimizing cost

PARADISE structure of objectives:
- Maximize user satisfaction
  - Maximize task success: kappa coefficient
  - Minimize cost
    - Efficiency measures: number of utterances, length of utterances
    - Qualitative measures: response delay, number of corrections, repair ratio
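The slides do not show it, but this objective tree is operationalized in PARADISE (Walker+, 1997) as a performance function whose weights are learned by linear regression against user satisfaction ratings; a sketch:

```latex
\text{Performance} = \alpha \cdot \mathcal{N}(\kappa) \;-\; \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i)
```

where $\kappa$ is the kappa task-success measure, $c_i$ are the cost measures (efficiency and qualitative), $\mathcal{N}$ is a Z-score normalization, and $\alpha$ and $w_i$ are the regression weights.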

The task is represented by an attribute-value matrix (AVM)
- How does the actual interaction (the user's answers) compare to the ideal case (the possible correct values)?

The AVM helps measure task success (see the sketch below):
1. Calculate the agreement between the actual interaction and the ideal case
2. Calculate the kappa value to measure how hard it was for the user to perform the task using the system
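As referenced above, a minimal sketch of step 2, a kappa coefficient computed from an AVM confusion matrix; the matrix counts and attribute-value labels are made-up example data.

```python
def kappa(matrix):
    """Cohen's kappa from a square confusion matrix of counts.

    Rows are the ideal (scenario) attribute values, columns the
    values actually attained in the dialogue.
    """
    total = sum(sum(row) for row in matrix)
    # P(A): observed agreement, the diagonal mass.
    p_a = sum(matrix[i][i] for i in range(len(matrix))) / total
    # P(E): agreement expected by chance, from the row/column marginals.
    p_e = sum(
        (sum(matrix[i]) / total) * (sum(row[i] for row in matrix) / total)
        for i in range(len(matrix))
    )
    return (p_a - p_e) / (1 - p_e)

M = [[20, 2, 1],   # e.g., attribute value "Tokyo"
     [3, 15, 2],   # "Osaka"
     [1, 1, 18]]   # "Kyoto"
print(f"kappa = {kappa(M):.3f}")
```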

The AVM is not a perfect representation of the system's tasks
- We have to represent all possible task parameters and their values; as the size increases, the possibilities for confusion also increase, making agreement more difficult to achieve
- It is not possible to model open-ended goals
- It is too simple to model complex task structures
- It is only suitable when the number of attributes and their values is small

6.4.2 QoE (Quality of Experience) Evaluation [Möller, 2002, 2005, 2009]

QoE evaluation emphasizes the user's experience in the evaluation process
- The goal of QoE evaluation is to establish the level of system quality as perceived by the user
- Quality refers to both system characteristics and the interaction
- It can be measured through evaluative judgments, comparing what users perceive, their actual experience of the system, and the expectations and knowledge they have of other, similar systems

Quality aspects are influenced by many different factors
- To assess users' experience with the system, their previous expectations must be distinguished from their actual experience with the system
- Users have different expectations of the system depending on their: knowledge, attitudes, desires, experience with similar systems

The impact of expectation on experience is measured with a two-stage evaluation process (see the sketch below)
- Users are given the same set of evaluation questions twice:
  1. Before using the system, to collect users' expectations
  2. After using the system, to collect users' experience
- Comparing the two sets of answers is expected to reveal a more accurate view of the system
- If the difference is positive, the experience fulfilled the expectation, and vice versa; the impact of the experience is estimated by the magnitude of the difference
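As referenced above, a minimal sketch of scoring such a two-stage questionnaire; the questions and the 1-5 ratings are made-up example data, and only the expectation-vs-experience differencing reflects the scheme described on the slide.

```python
questions = ["ease of use", "naturalness", "response speed"]
expectation = {"ease of use": 4, "naturalness": 3, "response speed": 4}  # stage 1
experience  = {"ease of use": 5, "naturalness": 2, "response speed": 4}  # stage 2

for q in questions:
    diff = experience[q] - expectation[q]
    # Non-negative difference: the experience met or exceeded the expectation.
    verdict = "fulfilled" if diff >= 0 else "not fulfilled"
    print(f"{q}: expectation={expectation[q]}, experience={experience[q]}, "
          f"difference={diff:+d} ({verdict})")

# The mean difference summarizes the overall impact of the experience.
mean_diff = sum(experience[q] - expectation[q] for q in questions) / len(questions)
print(f"mean difference: {mean_diff:+.2f}")
```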