Introduction WOZ setup

Transcription:

1 Introduction - WOZ setup
- Wizard-of-Oz (WOZ) setups use a hidden, human component
- The WOZ experimental protocol calls for holding all other input and output constant, so that the only unknown variable is who does the internal processing (Paek, 2001)
- WOZ systems appear automated to the user
- Gather data for a fully automated system

2 Introduction - WOZ performance
- Assume user behavior is similar between the WOZ and automated (AUT) setups
- In one system, training with AUT data gave rise to better performance than training with WOZ data (Drummond and Litman, 2011)
- System automation differences may have caused the performance gap
- Differences in user behavior may weaken automated system performance

3 Introduction - goal
[Diagram: user belief (WOZ or AUT) against true operator (WOZ or AUT); differences?]
- Investigate differences in WOZ and AUT user behaviors
- Hypothesized that what users say and how they say it will differ between WOZ and AUT setups

4 Outline
- Introduction
- Dialogue System
- Post-hoc Experiment
- Results
- Conclusions

5 Dialogue System - ITSPOKE
- Our data comes from the Intelligent Tutoring Spoken Dialogue System (ITSPOKE)
- We draw from two prior experiments (one WOZ, one AUT) (Forbes-Riley and Litman(a), 2011; Forbes-Riley and Litman(b), 2011)
- Baseline, non-adaptive conditions of those experiments
- Users tutored in basic Newtonian physics
- Dialogues illustrated one or more basic physics concepts

6 Dialogue System - sample dialogue
Tutor text is shown on a screen and read aloud via text-to-speech, and the user responds verbally to the tutor's queries.
Tutor: So what are the forces acting on the packet after it's dropped from the plane?
Student: um gravity then well air resistance is negligible just gravity
Tutor: Fine. So what's the direction of the force of gravity on the packet?
Student: vertically down

7 Dialogue System - workflow
[Workflow diagram. Labels: Front End; Dialogue Manager; Student; Student Audio; Microphone; WOZ: Human Wizard; AUT: Automatic Speech Recognition, Natural Language Understanding, Correctness Evaluation; Screen Display; Tutor Audio; Text Display; Speech Synthesizer; Tutor Question Text; Next Tutor State; Current Tutor State]

8 Dialogue System - two user groups
- Setups varied by the component for understanding and evaluating responses: one human, one automated
- Each student participated in only one setup
- Students were not informed whether the system was fully automated
- Distinct student group responses constitute the data

10 Post-hoc Experiment
- Determine whether differences exist between WOZ and AUT responses
- Compared features of user turns to each question individually
- The table below shows the number of users and the dialogue turns they took in each setup, over the 111 questions asked in both setups

System  #Users  #Turns
WOZ     21      1542
AUT     25      2034

11 Post-hoc Experiment - features
- Prosodic features: length of the pause before speech began, speech duration, pitch, and energy (RMS)
- Pitch and energy: maximum, minimum, mean, and standard deviation
- 10 total prosodic features
- Normalized each prosodic feature using the same algorithm as the live system
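
The count of 10 prosodic features follows from two scalar features (pause, duration) plus four summary statistics for each of pitch and energy. A minimal Python sketch of that bookkeeping is given here; the function names, the input format (lists of per-frame values), and the choice of population standard deviation are assumptions for illustration, and the live system's normalization step is not reproduced because the slides do not describe it.

```python
from statistics import mean, pstdev

def summarize(track):
    """Return max, min, mean, and standard deviation of one contour."""
    return {
        "max": max(track),
        "min": min(track),
        "mean": mean(track),
        "std": pstdev(track),  # population std; an assumption, not from the slides
    }

def prosodic_features(pause_s, duration_s, pitch_hz, rms):
    """Assemble the 10 prosodic features for a single user turn."""
    features = {"pause": pause_s, "duration": duration_s}
    for name, track in (("pitch", pitch_hz), ("rms", rms)):
        for stat, value in summarize(track).items():
            features[f"{name}_{stat}"] = value
    return features

# Hypothetical turn: values are made up for illustration only.
example = prosodic_features(0.4, 1.5, [110.0, 120.0, 130.0], [0.02, 0.04, 0.06])
print(len(example))  # 2 scalar features + 4 stats x 2 contours
```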

12 Post-hoc Experiment - features
- Lexical features: Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001)
- Tentative (T): maybe, perhaps, and guess
- Prepositions (P): to, with, and above
- The utterance "Maybe above" would receive the feature vector <0, ..., 0, T=50, 0, ..., 0, P=50, 0, ..., 0>
- Used human transcriptions for all utterances
- 69 total LIWC lexical category features
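
The "Maybe above" example can be reproduced with a small Python sketch that scores each category as the percentage of words in the utterance belonging to it. The two-category dictionary below is a toy stand-in built only from the words listed on the slide; it is not the LIWC lexicon, and the function name is hypothetical.

```python
# Toy stand-in for the LIWC dictionary: only the words named on the slide.
TOY_CATEGORIES = {
    "tentative": {"maybe", "perhaps", "guess"},
    "prepositions": {"to", "with", "above"},
}

def category_percentages(utterance, categories=TOY_CATEGORIES):
    """Percentage of the utterance's words that fall in each category."""
    words = utterance.lower().split()
    if not words:
        return {name: 0.0 for name in categories}
    return {
        name: 100.0 * sum(w in vocab for w in words) / len(words)
        for name, vocab in categories.items()
    }

vector = category_percentages("Maybe above")
print(vector)  # {'tentative': 50.0, 'prepositions': 50.0}
```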

13 Post-hoc Experiment
Looked for response feature differences for each question in two ways:
1) A statistical comparison of features
2) Response classification via machine learning

14 Outline
- Introduction
- Dialogue System
- Post-hoc Experiment
- Results
  - Statistical Comparison of Features
  - Response Classification Experiments
- Conclusions

15 Statistical Comparison of Features
- For each question, all features were compared between WOZ and AUT responses
- Welch's unpaired, two-tailed t-tests
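
For readers unfamiliar with Welch's test, the following Python sketch computes its t statistic and the Welch-Satterthwaite degrees of freedom for two samples with unequal variances. The function name and the sample values are illustrative assumptions; the slides do not say which software was used. The two-tailed p-value would then be read from Student's t distribution with the returned degrees of freedom (for instance, `scipy.stats.ttest_ind` with `equal_var=False` performs the whole test).

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    # Squared standard error contributed by each sample (sample variance / n).
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up feature values for one question, for illustration only.
woz_values = [2.0, 4.0, 6.0, 8.0]
aut_values = [1.0, 2.0, 3.0]
t_stat, dof = welch_t(woz_values, aut_values)
print(round(t_stat, 3), round(dof, 3))
```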

16 Statistical Comparison of Features
- Possible that differences were inherent in the WOZ/AUT student groups
- Created control groups with evenly mixed, randomly selected WOZ/AUT students
- We report only questions for which at least one feature differed between WOZ and AUT but not between these two control groups
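
One way such control groups could be built is sketched below in Python: shuffle each student group, split it in half, and give each control group one half of the WOZ students and one half of the AUT students. The exact procedure is not stated on the slide, so the splitting rule, the function name, and the student identifiers are all assumptions.

```python
import random

def make_control_groups(woz_users, aut_users, seed=0):
    """Split users into two control groups, each mixing WOZ and AUT evenly."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    woz, aut = list(woz_users), list(aut_users)
    rng.shuffle(woz)
    rng.shuffle(aut)
    half_w, half_a = len(woz) // 2, len(aut) // 2
    group_a = woz[:half_w] + aut[:half_a]
    group_b = woz[half_w:] + aut[half_a:]
    return group_a, group_b

# Hypothetical identifiers; the group sizes (21 and 25) come from the slides.
woz_ids = [f"woz{i}" for i in range(21)]
aut_ids = [f"aut{i}" for i in range(25)]
control_a, control_b = make_control_groups(woz_ids, aut_ids)
print(len(control_a), len(control_b))  # 22 24
```

A feature difference that also appears between `control_a` and `control_b` cannot be attributed to the setup, which is why such questions are excluded from the report.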

17 Statistical Comparison of Features
The number of questions for which at least one feature differed statistically significantly (p < 0.05) between WOZ and AUT responses:

Feature Set  #Questions  %Corpus by Turns
Prosodic     42          46.22%
Lexical      33          35.46%
Either       61          66.86%
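
The "%Corpus by Turns" column is read here as the share of all user turns that belong to the flagged questions; that reading is an assumption, since the slide does not define the measure. Under that assumption, a short Python sketch with hypothetical question identifiers and turn counts:

```python
def percent_corpus_by_turns(turns_per_question, flagged_questions):
    """Share of all turns (in percent) that belong to the flagged questions."""
    total = sum(turns_per_question.values())
    flagged = sum(turns_per_question[q] for q in flagged_questions)
    return 100.0 * flagged / total

# Made-up counts for three questions; not taken from the corpus.
turns = {"q1": 40, "q2": 35, "q3": 25}
share = percent_corpus_by_turns(turns, {"q1", "q3"})
print(share)  # 65.0
```

Weighting by turns rather than counting questions keeps a frequently asked question from counting the same as a rarely asked one.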

18 Statistical Comparison of Features
- 10/10 prosodic and 29/69 lexical features differed significantly (p < 0.05) for at least one question
- Features differing for at least 10% of the corpus:

Feature           %Corpus  #Questions  #WOZ>AUT
Duration          22.15%   19          1
RMS Min           16.86%   15          14
Dictionary Words  15.13%   13          11
pronoun           12.56%   10          10
social            11.35%   9           8
funct             10.99%   9           9
Six Letter Words  10.91%   9           0

19 Statistical Comparison of Features
- Users used more words with the wizarded system

Feature           %Corpus  #Questions  #WOZ>AUT
Dictionary Words  15.13%   13          11
pronoun           12.56%   10          10
social            11.35%   9           8
funct             10.99%   9           9

- There exist features which differ for a substantial number of questions

20 Statistical Comparison of Features
A question for which the Dictionary Words feature was greater for WOZ responses:
Tutor: So how do these two forces' directions compare?

Most common responses:
"they are opposite": WOZ(9) AUT(2)
"opposite": WOZ(3) AUT(8)

Longest responses:
WOZ: the relationship between the two forces directions are towards each other since the sun is pulling the gravitational force of the earth
AUT: they are opposite directions

22 Response Classification Experiments
- Use classification models to distinguish the WOZ/AUT setup
- A J-48 model was trained and tested for each question
- Accuracy compared against a majority-class baseline
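
A majority-class baseline always predicts the more frequent setup label, so its accuracy is simply that label's share of the responses. The Python sketch below shows the comparison; the J-48 model itself (Weka's C4.5 decision tree) is not reimplemented, and the label counts and model accuracies are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def beats_baseline(model_accuracy, labels):
    """True if a classifier's accuracy exceeds the majority-class baseline."""
    return model_accuracy > majority_baseline_accuracy(labels)

# Hypothetical responses to one question: 9 from WOZ users, 11 from AUT users.
labels = ["WOZ"] * 9 + ["AUT"] * 11
print(majority_baseline_accuracy(labels))  # 0.55
print(beats_baseline(0.70, labels))        # True
```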

23 Response Classification Experiments
- 97 questions considered in total
- 21/97 outperformed the majority-class baseline
- 32.79% of the corpus by turns

24 Response Classification Experiments Would you like to do another problem?

25 Response Classification Experiments This result is consistent with literature (Schechtman and Horowitz, 2003; Rosé and Torrey, 2005) that suggests that users interacting with automated systems will be more curt

26 Response Classification Experiments
Now let's find the forces exerted on the car in the vertical direction during the collision. First, what vertical force is always exerted on an object near the surface of the earth?

28 Discussion
- There exist significant differences between user responses to a wizarded and an automatic dialogue system's questions
- Contribution of the wizard was limited to speech recognition and correctness evaluation

29 Discussion
- Results suggest that user speech changes as a result of user confidence in the system's accuracy
- Relationship between user confidence and user speech may be analogous to observed differences in past experiments
- These results suggest ways in which raw wizarded data may fall short of ideal for training an automated system

30 Future Work - exploration
- Measure how the observed differences change over the course of the dialogue
- Use different methods of normalization for user speech values

31 Future Work - solutions
- Intentional wizard error could be introduced to frustrate the user, analogous to intentional errors produced in user simulation (Lee and Eskenazi, 2012)
- Generalizable statistical classification domain adaptation (Daumé and Marcu, 2006) and adaptation demonstrated to work well in NLP-specific domains (Jiang and Zhai, 2007)

32 DIFFERENCES IN USER RESPONSES TO A WIZARD-OF-OZ VERSUS AUTOMATED SYSTEM
Jesse Thomason and Diane Litman
University of Pittsburgh