Data Collection and System Evaluation
SD Study Meeting
Nurul Lubis, AHC Lab, NAIST
Contents
1. Introduction to Spoken Dialogue Systems
   1.4 Data Collection and Dialogue Corpora
   1.5 Evaluating Spoken Dialogue Systems
6. Methodologies and Practices of Evaluation
   6.2 Concepts
   6.3 Measures
   6.4 Frameworks
Data Collection and Dialogue Corpora
Corpus? Corpora? A machine-readable collection of speech and text that can be used to observe the occurrences of particular phenomena. Serves as a resource (in data-driven approaches) and as a basis for analysis, design, or evaluation.
Examples: SEMAINE (McKeown+, 2013). SWITCHBOARD (Godfrey+, 1992): telephone conversations between strangers; 2,430 conversations; 240 hours of speech.
How to create such a corpus? Recording, transcribing, annotating.
Recording. What to collect? The dialogues collected must be representative of the system to be developed: for hotel reservation? Flight booking? How to collect them? Recording conversations between humans is not representative of spoken dialogue systems, so we simulate the system instead: Wizard of Oz or user simulation.
Wizard of Oz. The user interacts with the system following a scenario, but the system is actually operated by a human wizard. However, this is expensive in both time and money.
User simulation. A user simulator is used in place of the real user and generates responses to the system's queries; it may be rule-based or statistical. Simulated data can be used to evaluate different aspects of dialogue systems, for example in an early development stage, when a functionality is changed. A sketch of the rule-based variant follows.
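As an illustration, here is a minimal sketch of a rule-based user simulator, assuming a hypothetical hotel-style slot-filling domain; the act types and slot names (`request`, `inform`, `city`, `nights`, and so on) are my own, not from the slides.

```python
import random

# Minimal sketch of a rule-based user simulator (hypothetical domain:
# slot names and dialogue-act labels are illustrative assumptions).
class RuleBasedUserSimulator:
    def __init__(self, goal):
        self.goal = dict(goal)    # e.g. {"city": "Osaka", "nights": "2"}
        self.informed = set()     # slots already conveyed to the system

    def respond(self, act_type, slot=None, value=None):
        """Map a system dialogue act to a user dialogue act."""
        if act_type == "request" and slot in self.goal:
            self.informed.add(slot)
            return ("inform", slot, self.goal[slot])
        if act_type == "confirm":
            if self.goal.get(slot) == value:
                return ("affirm", slot, value)
            return ("deny", slot, self.goal.get(slot))
        if self.informed == set(self.goal):
            return ("bye", None, None)   # goal fully conveyed
        # Fallback: volunteer a not-yet-mentioned slot from the goal
        slot = random.choice(sorted(set(self.goal) - self.informed))
        self.informed.add(slot)
        return ("inform", slot, self.goal[slot])

sim = RuleBasedUserSimulator({"city": "Osaka", "nights": "2"})
print(sim.respond("request", "city"))           # ('inform', 'city', 'Osaka')
print(sim.respond("confirm", "city", "Osaka"))  # ('affirm', 'city', 'Osaka')
```

A statistical simulator would replace the hand-written rules with response distributions estimated from a corpus.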
Transcribing. Putting what the speaker is saying into text. Transcription may be orthographic, employing the standard spelling system of the target language; phonetic notation may be used to represent pronunciation and prosodic information. Much effort has been put into standardizing the representation and data format of transcriptions, e.g., the Extensible Markup Language (XML) and the Extensible MultiModal Annotation markup language (EMMA).
Annotating. Capturing other important information about the interaction in the corpus, relevant to the system to be developed. A markup language may be used. Examples: discourse structure, dialogue acts, emotion.
Figure: emotion annotation scheme of the RECOLA database (Ringeval+, 2013)
Evaluation of Spoken Dialogue Systems
Which system is better?
Which system is better? Intuitively, one dialogue system should be preferred to another if it enables interaction with the user in a more natural and efficient manner. We want quantitative measures for this!
6.2 Some Distinctions and Definitions
Evaluation Approaches: Where to conduct the evaluation?
REAL SITUATION: the evaluation setup is complex and costly, and may produce unexpected results.
LABORATORY: users adapt to the environment, so their behavior is not 100% natural, and the setting does not reflect the difficult conditions in which the system will actually be used; the evaluation result is therefore regarded as an upper bound on the system's performance.
Evaluation Approaches: Empirical or Theoretical? Evaluations range from more to less theoretical: verifying the consistency of a certain model; assessing the predictions that the model makes about the application domain; collecting data on the basis of which empirical models can be compared and elaborated.
Different assumptions will lead to different evaluation setups
Different evaluation goals will lead to different evaluation types:
Evaluation Type         Evaluation Question
Functional              Is the system functionally appropriate?
Performance             How well does the system perform on the particular task it is designed for?
Usability and quality   Are the potential and real users satisfied with the system's performance? How do the users perceive the system when using it?
Reusability             Is the system flexible and portable?
6.3 Evaluation Measures
Qualitative or Quantitative?
QUALITATIVE: aims at forming a conceptual model of the system. What does it do? Why do errors occur? Which part should be changed? Subjective! The results of the evaluation are preferences and tendencies.
QUANTITATIVE: tries to address the evaluation using quantified metrics. Objective! Results in claims with strong predictive power.
The distinction between qualitative and quantitative evaluation is rather blurred: quantitative measures always have subjective elements in them, so we need to calculate inter-annotator agreement to assess the objectivity of the result.
We can evaluate different aspects of an algorithm. With respect to time: how long does it take to complete the task? With respect to complexity: how many rules are needed? How many inference steps are performed? How much memory is used? A small measurement sketch follows.
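For instance, running time and peak memory can be measured with the standard library; `run_algorithm` below is a hypothetical stand-in for the component under evaluation.

```python
import time
import tracemalloc

def run_algorithm():
    # Hypothetical stand-in for the component being evaluated
    return sum(i * i for i in range(100_000))

tracemalloc.start()
start = time.perf_counter()
run_algorithm()
elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
tracemalloc.stop()
print(f"time: {elapsed:.4f} s, peak memory: {peak} bytes")
```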
We can evaluate different aspects of an algorithm in terms of correctness: is the algorithm giving the correct output? We calculate several measurements based on the confusion matrix:

                          System's prediction
                          Yes                    No
Ground truth   Yes        true positive (tp)     false negative (fn)
               No         false positive (fp)    true negative (tn)

Recall:              R   = tp / (tp + fn)
Precision:           P   = tp / (tp + fp)
F-measure:           F   = 2RP / (R + P)
Percentage accuracy: acc = (tp + tn) / (tp + tn + fp + fn)
Percentage error:    err = (fp + fn) / (tp + tn + fp + fn)
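Written out as code, the metrics above are one-liners; the counts in the example are made up.

```python
# The confusion-matrix metrics from the table above as plain functions.
def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def f_measure(tp, fp, fn):
    r, p = recall(tp, fn), precision(tp, fp)
    return 2 * r * p / (r + p)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    return (fp + fn) / (tp + tn + fp + fn)

# Example with made-up counts: tp=40, tn=50, fp=5, fn=5
print(recall(40, 5), precision(40, 5))   # 0.888... 0.888...
print(f_measure(40, 5, 5))               # 0.888...
print(accuracy(40, 50, 5, 5))            # 0.9
```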
Different tasks call for different evaluations.
In speech recognition: WER = (I + D + S) / N × 100, where I = insertions, D = deletions, S = substitutions, and N = the number of words in the reference.
In language modeling: perplexity, the complexity (average branching factor) of a search path.
In translation and segmentation tasks: scores such as BLEU, computed on the translation or segmentation result aligned with the reference (BLEU measures n-gram overlap with the reference rather than edit distance).
In spoken dialogue systems? We need more holistic measures!
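As a concrete sketch, WER can be computed with the standard dynamic-programming word alignment (Levenshtein distance); the example sentences are invented.

```python
# Sketch: word error rate via word-level Levenshtein alignment.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,               # substitution (or match)
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One deletion ("a") and one insertion ("now") over 5 reference words
print(wer("book a flight to osaka", "book flight to osaka now"))  # 40.0
```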
Spoken dialogue systems need to be evaluated from different perspectives.
EFFICIENCY: length of dialogue; time required to complete a task; user and system response time; number of help requests, barge-ins, and repair utterances; correction time.
EFFECTIVENESS: number of completed tasks; number of completed subtasks; number of successful transactions.
Spoken dialogue systems need to be evaluated from different perspectives.
USABILITY: asking potential users to test the system in realistic situations; collecting users' opinions through interviews and questionnaires; observing the interaction; comparing the system's performance with an expert's.
6.4 Evaluation Frameworks
A correct evaluation methodology should fulfill these requirements [Möller, 2009]:
Validity: the method should measure what it is intended to measure.
Reliability: the method should provide stable results across repeated applications.
Objectivity: the method should achieve inter-experimenter agreement on the result.
Sensitivity: the method should measure small variations in what it is intended to measure.
Robustness: the method should provide results independent of variables that are extraneous to the measured construct.
6.4.1 The PARADISE (PARAdigm for DIalogue System Evaluation) Framework [Walker+, 1997, 1998, 2001; Litman and Pan, 2002]
Viewing the system's task as an indicator of user satisfaction. This framework measures the system's performance with the help of features that relate to task success and task cost. It assumes that the main objective is to maximize user satisfaction; to do that, we need to maximize task success while minimizing cost.
PARADISE structure of objectives:
Maximize user satisfaction
  Maximize task success: kappa coefficient
  Minimize costs
    Efficiency measures: number of utterances, length of utterances, response delay
    Qualitative measures: number of corrections, repair ratio
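The slide omits the formula, but PARADISE (Walker+, 1997) combines these objectives into a single performance function, with task success and the cost measures Z-score normalized and weighted by linear regression against user-satisfaction ratings:

```latex
% PARADISE performance function (Walker+, 1997):
% kappa measures task success, the c_i are the cost measures,
% \mathcal{N} is Z-score normalization, and the weights \alpha, w_i
% are fit by regressing user-satisfaction ratings.
\mathrm{Performance} = \alpha \,\mathcal{N}(\kappa) - \sum_{i=1}^{n} w_i \,\mathcal{N}(c_i)
```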
The task is represented by an attribute-value matrix (AVM): how does the actual interaction (the user's answers) compare to the ideal case (the possible correct values)?
The AVM helps measure task success: (1) calculate the agreement between the actual interaction and the ideal case; (2) calculate the kappa value to measure how hard it was for the user to perform the task using the system. A small worked example follows.
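Here is a sketch of that kappa computation over a confusion matrix of AVM attribute values, using kappa = (P(A) − P(E)) / (1 − P(E)); the 2×2 matrix is made-up illustration data.

```python
# Task-success kappa from a confusion matrix M over AVM attribute
# values: M[i][j] counts dialogues whose ideal (key) value was i and
# whose actual value was j. P(A) is the observed agreement (diagonal);
# P(E) is the agreement expected by chance from the column totals.
def paradise_kappa(m):
    total = sum(sum(row) for row in m)
    p_a = sum(m[i][i] for i in range(len(m))) / total
    p_e = sum((sum(row[j] for row in m) / total) ** 2
              for j in range(len(m)))
    return (p_a - p_e) / (1 - p_e)

# Made-up data for one attribute with two possible values,
# e.g. depart-city in {Osaka, Tokyo}
m = [[45, 5],
     [10, 40]]
print(paradise_kappa(m))  # ~0.697
```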
The AVM is not a perfect representation of the system's tasks. We have to represent all possible task parameters and their values, and as the size increases, the possibilities for confusion also increase, making agreement more difficult to achieve. It is not possible to model open-ended goals, and the representation is too simple to model complex task structures. It is only suitable when the number of attributes and their values is small.
6.4.2 QoE (Quality of Experience) Evaluation [Möller, 2002, 2005, 2009]
QoE evaluation emphasizes the user's experience in the evaluation process. The goal of QoE evaluation is to establish the level of system quality as perceived by the user. Quality refers to the system's characteristics and to the interaction, and can be measured through users' evaluative judgments: users compare what they perceive, i.e., their actual experience of the system, with the expectations and knowledge they have of other, similar systems.
Quality aspects are influenced by many different factors. To assess users' experience with the system, their previous expectations must be distinguished from their actual experience with the system. Users have different expectations of the system depending on their knowledge, attitudes, desires, and experience with similar systems.
The impact of expectation on experience is measured with a two-stage evaluation process. Users are given the same set of evaluation questions twice: first before using the system, to collect the users' expectations, and second after using it, to collect the users' experience. Comparing the two sets of answers is expected to reveal a more accurate view of the system: if the difference is positive, the experience fulfilled the expectation, and vice versa. The impact of the experience is estimated by the magnitude of the difference.
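As a sketch of the comparison step, assuming a numeric rating scale (the question labels and 5-point values below are invented):

```python
# Two-stage QoE comparison: per-question difference between post-use
# (experience) and pre-use (expectation) ratings on a 5-point scale.
expectation = {"easy to use": 4, "understands me": 5, "pleasant": 3}
experience  = {"easy to use": 4, "understands me": 3, "pleasant": 4}

diffs = {q: experience[q] - expectation[q] for q in expectation}
print(diffs)  # {'easy to use': 0, 'understands me': -2, 'pleasant': 1}

# Positive difference: experience met or exceeded the expectation;
# negative: it fell short. The mean difference estimates the overall impact.
print(sum(diffs.values()) / len(diffs))  # -0.333...
```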