Social Robots and Human-Robot Interaction Ana Paiva. Lecture 8. Dialogues with Robots


Our goal: Build Social Intelligence

The problem
When and how should a robot act or say something to the user?
- What to say (message content)
- When to say it (timing, turn-taking)
- How to say it (gestures, non-verbal behaviours)
But in order to do that, the robot needs to understand what the user said.

Communication is hard and miscommunication is easy

Let's look at communication: say/order → perceive → understand → respond → reply (what, how, when)

Let's look at communication: say/order → perceive → understand → respond

Let's look at communication: say/order → perceive → understand → respond. How can a robot understand an order from a user and relate it to actions and objects in the physical world?

Perceive & Understand: Giving Commands to the Robot
Goal: to infer groundings in the world. Use a specific representation: Spatial Description Clauses (SDCs). Each SDC corresponds to a constituent of the linguistic input and contains a figure f, a relation r, and a variable number of landmarks l. Each SDC has a type:
- EVENT: an action sequence that takes place in the world (e.g. "Move the tire pallet").
- OBJECT: a thing in the world (e.g. the forklift, the truck, the person).
- PLACE: a place in the world (e.g. "next to the tire pallet").
- PATH: a path or path fragment through the world (e.g. "past the truck").
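A minimal sketch of how an SDC could be represented in code; the class and field names are illustrative assumptions, not the original system's implementation:

```python
# Illustrative sketch of a Spatial Description Clause (SDC); names are
# assumptions for exposition, not the original system's API.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class SDCType(Enum):
    EVENT = "event"    # an action sequence, e.g. "Move the tire pallet"
    OBJECT = "object"  # a thing in the world, e.g. "the truck"
    PLACE = "place"    # a place in the world, e.g. "next to the tire pallet"
    PATH = "path"      # a path fragment, e.g. "past the truck"


@dataclass
class SDC:
    sdc_type: SDCType
    figure: Optional[str] = None                        # f: what the clause is about
    relation: Optional[str] = None                      # r: e.g. "next to", "past"
    landmarks: List[str] = field(default_factory=list)  # l: zero or more landmarks


# "Put the tire pallet next to the truck" could decompose into nested SDCs:
place = SDC(SDCType.PLACE, relation="next to", landmarks=["the truck"])
event = SDC(SDCType.EVENT, figure="the tire pallet", relation="put",
            landmarks=["next to the truck"])
```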

Creating a System Based on a Collected Corpus of Actions
A data-driven approach: to train the system, a corpus of natural language commands was collected; the commands were paired with robot actions and environment state sequences; the corpus was used both to train the model and to evaluate the end-to-end performance of the system.

Giving Commands to the Robot: Creating a Corpus
Using videos of action sequences on Amazon Mechanical Turk, a corpus was created by collecting language associated with each video. The videos showed a simulated robotic forklift performing an action such as picking up a pallet or moving through the environment. Paired with each video, there was a complete log of the state of the environment and the robot's actions. Subjects were asked to type a natural language command that would cause an expert human forklift operator to carry out the action shown in the video. Commands were collected from 45 subjects for twenty-two different videos showing the forklift executing an action in a simulated warehouse.
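One way such a corpus entry might be stored; the record layout below is hypothetical, for illustration only:

```python
# Hypothetical record for one corpus item: the command typed by a subject,
# paired with the logged robot actions and environment states from the video
# the subject watched. Field names are illustrative only.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class CorpusEntry:
    command: str                     # e.g. "Pick up the tire pallet"
    actions: List[str]               # robot action sequence shown in the video
    state_log: List[Dict[str, Any]]  # environment state at each time step
    video_id: str                    # which of the 22 videos was shown


entry = CorpusEntry(
    command="Pick up the pallet of tires and put it on the truck",
    actions=["approach(pallet)", "lift(pallet)", "place(pallet, truck)"],
    state_log=[{"t": 0, "forklift": (0, 0)}, {"t": 1, "forklift": (2, 0)}],
    video_id="video_07",
)
```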

Evaluation of the Corpus
The model was assessed on its performance at predicting the correspondence between the acquired structures (SDCs) and groundings. An evaluation was also performed using known correct and incorrect command-video pairs:
- C1: subjects saw a command paired with the original video that a different subject watched when creating the command.
- C2: subjects saw the command paired with a random video that was not used to generate the original command.

Giving Commands to the robot

Robot Learning through Dialogue
Problem with the previous approach: the interaction has to be learned (by collecting data) and is thus limited to that particular domain. If we place no restrictions on speech, interpreting a command given to a robot is a challenging problem, as users may say all kinds of things.

Approach: Robot Learning through Dialogue and Access to the Web
Learning and using task-relevant knowledge from human-robot dialogue and access to the Web.

Robot Learning through Dialogue and Access to the Web
KnoWDiaL: an approach for robot learning of task-relevant environmental knowledge from human-robot dialogue and access to the Web.

Robot Learning through Dialogue and Access to the Web
The speech recognizer returns a set of possible interpretations; these interpretations are the input for the first component of KnoWDiaL, a frame-semantic parser. The parser labels the list of speech-to-text candidates and stores them in pre-defined frame elements, such as action references, locations, objects, or people.
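A sketch of the kind of frame such a parser could produce for one speech-to-text candidate; the structure and labels are assumptions for exposition, not KnoWDiaL's actual format:

```python
# Illustrative frame with the element types named above; not KnoWDiaL's code.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandFrame:
    action: Optional[str] = None    # action reference, e.g. "bring", "go to"
    object: Optional[str] = None    # object mentioned, e.g. "a coffee"
    location: Optional[str] = None  # location mentioned, e.g. "the meeting room"
    person: Optional[str] = None    # person mentioned, e.g. "Bob"


# Each speech-to-text candidate is labelled separately; the grounding step
# later decides which interpretation to trust.
candidate = "bring a coffee to the lab"
frame = CommandFrame(action="bring", object="a coffee", location="the lab")
```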

Robot Learning through Dialogue and Access to the Web
The Knowledge Base stores groundings of commands encountered in previous dialogues. A grounding is simply a probabilistic mapping of a specific frame element obtained from the frame-semantic parser to locations in the building or tasks the robot can perform.

Robot Learning through Dialogue and Access to the Web
The Grounding Model uses the information stored in the Knowledge Base to infer the correct action to take when a command is received.
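A minimal sketch of this idea, with a toy knowledge base of learned probabilities and a grounding lookup; purely illustrative, not KnoWDiaL's code:

```python
# Toy probabilistic grounding: frame elements learned in past dialogues map to
# rooms/tasks with probabilities, and the grounding model picks the most
# likely referent. Illustrative only.
knowledge_base = {
    # (frame element type, phrase) -> {grounding: probability}
    ("location", "the lab"): {"room_7408": 0.8, "room_7412": 0.2},
    ("action", "bring"):     {"task_deliver": 0.9, "task_goto": 0.1},
}


def ground(element_type, phrase):
    """Return the most probable grounding for a frame element, or None."""
    candidates = knowledge_base.get((element_type, phrase))
    if not candidates:
        return None  # unknown: ask the user or query the Web instead
    return max(candidates.items(), key=lambda kv: kv[1])


print(ground("location", "the lab"))  # ('room_7408', 0.8)
```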

Robot Learning through Dialogue and Access to the Web
Sometimes not all of the parameters required to ground a spoken command are available in the Knowledge Base. When this happens, the Grounding Model resorts to OpenEval, the fourth component of KnoWDiaL. OpenEval is able to extract information from the World Wide Web to fill in missing parameters of the Grounding Model.
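A hypothetical sketch of that fallback, building on the grounding sketch above; `open_eval_confidence` is a stand-in for OpenEval, not its real interface:

```python
def open_eval_confidence(predicate, args):
    """Stand-in for OpenEval: return a Web-derived confidence in [0, 1]."""
    raise NotImplementedError  # would issue Web queries and score the evidence


def ground_with_fallback(element_type, phrase, candidate_groundings):
    kb_result = ground(element_type, phrase)  # `ground` from the previous sketch
    if kb_result is not None:
        return kb_result
    # No stored grounding: score each candidate with Web-derived evidence.
    scored = [(g, open_eval_confidence("locationOf", (phrase, g)))
              for g in candidate_groundings]
    return max(scored, key=lambda kv: kv[1])
```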

Robot Learning through Dialogue and Access to the Web: Example

Let's look at communication: say/order → perceive → understand → respond → reply (what, how, when)

Components of a Dialogue System for a Social Robot
Perceive & Understand
- NL (natural language) system: a parser and generator using a grammar developed for human-robot conversation.
- Speech recogniser: can use a speech recognition server with a language model for the language we need (the recognised utterances may be mapped to a logical form).
Generate and Synthesize
- NL (natural language) system: a generator using a grammar developed for human-robot conversation.
- TTS (text-to-speech): a speech synthesizer for the robot's speech output.
Manage Dialogue and Non-verbal Communication
- Dialogue manager: coordinates multi-modal inputs from the user and interprets the user's actions (through several modules) as dialogue moves; it must also update and maintain the dialogue context (and common ground), handling questions and miscommunication events.
- Gesture and non-verbal behaviour handler (associated with communicative functions).

Typical Architecture
- Dialogue Manager: decision making and action selection
- Understanding Module: non-verbal behaviour understanding; verbal behaviour understanding (NLU)
- Generation Module: non-verbal behaviour generation (gaze and gesture); verbal behaviour generation (NLG)
- Input: vision processing, speech recognition
- Output: motion control, text-to-speech
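A skeleton of how these modules might be wired into one perceive-decide-generate loop; the interfaces are illustrative assumptions, not a specific robot framework:

```python
class DialogueManager:
    """Coordinates the modules of the architecture above in one processing loop.
    Module interfaces here are illustrative assumptions only."""

    def __init__(self, nlu, nlg, nonverbal, tts, motion):
        self.nlu, self.nlg = nlu, nlg
        self.nonverbal, self.tts, self.motion = nonverbal, tts, motion
        self.context = []  # dialogue history / common ground

    def step(self, recognized_speech, vision_frame):
        # Understanding: fuse verbal and non-verbal input into a dialogue move.
        move = self.nlu.interpret(recognized_speech, context=self.context)
        gaze_and_gesture = self.nonverbal.interpret(vision_frame)
        self.context.append((move, gaze_and_gesture))

        # Decision making and action selection.
        response = self.select_response(move, gaze_and_gesture)

        # Generation: verbal output (NLG + TTS) and non-verbal output (gaze, gesture).
        self.tts.say(self.nlg.realize(response))
        self.motion.play(self.nonverbal.plan(response))

    def select_response(self, move, gaze_and_gesture):
        raise NotImplementedError  # policy: answer, clarify, repair miscommunication, ...
```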

Non-verbal Communication as a Mechanism for Managing Conversations
There are different mechanisms for managing conversations:
- Interlocutors of a conversation engage in discourse at varying levels of involvement: their participant roles, or footing, form the participation structure of the conversation.
- Roles shift among conversational participants through a turn-taking mechanism, which allows interlocutors to seamlessly exchange speaking turns, interrupt, etc.
- Participants in a conversation create a discourse, that is, a composition of discourse segments in particular structures [Grosz and Sidner 1986]. Such structures signal shifts in topic or how information is organized. Speakers also produce a number of cues that signal these structures, enable contributions from other participants, or direct attention to important information (these signals include not only verbal cues but also non-verbal cues, in particular gaze and gestures).

Let's focus on non-verbal communication
With robots, given their physical embodiment, we can add to verbal communication some level of non-verbal communication: gaze, head nods.

Gaze
During interaction, people look each other in the eye while listening and talking; without eye contact, people do not feel they are in communication. Gaze provides a number of potential social cues that people can use to learn about the social context, about the environment (objects and events), or even about the internal (emotional and intentional) states of others. Gaze cues serve a number of functions in conversations:
- Clarify who is addressed
- Help the speaker hold the floor (turn-taking)

Types of Gaze Direction
- Mutual gaze: the attention of two individuals is directed at one another.
- Gaze following: individual A detects that B's gaze is not directed towards them and follows B's line of sight onto a point in space.
- Joint attention: similar to gaze following, except that there is a focus of attention, for example an object, that the two individuals A and B are looking at, at the same time.
- Shared attention: a combination of mutual attention and joint attention, so the focus of both A's and B's attention is not only on the object but also on each other (example: I know you're looking at X, and you know that I'm looking at X).
- Theory of mind: uses a combination of the previous attentional processes and higher-order cognitive strategies, allowing one to reason about the other's attention.
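A toy illustration of the first three categories as a function of where each individual is looking; shared attention and theory of mind additionally require knowing that each individual is aware of the other's attention, so they are not covered here:

```python
def classify_gaze(a_target, b_target, target_is_object=True):
    """Toy labelling of a dyadic gaze configuration.

    a_target / b_target are "A", "B", or the name of a point/object in space.
    """
    if a_target == "B" and b_target == "A":
        return "mutual gaze"
    if a_target == b_target:
        # Same line of sight: a concrete object of interest makes this joint
        # attention; otherwise A is merely following B's gaze to a point.
        return "joint attention" if target_is_object else "gaze following"
    return "unrelated gaze"


print(classify_gaze("B", "A"))      # mutual gaze
print(classify_gaze("cup", "cup"))  # joint attention
```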

Gaze
Gaze cues can also be used by social robots to serve these functions in conversations:
- Clarify who is addressed
- Help the speaker hold the floor (turn-taking)
- Help signal changes in the topic of conversation

Gaze and Mechanisms for Managing Conversations
There are different gaze-based mechanisms for managing conversations:
- (Who) Role-signaling mechanisms (participation structure): speaker gaze cues may signal the roles of interlocutors [Bales et al. 1951].
- (When) Turn-taking mechanisms (conversation structure): speaker gaze cues can facilitate turn exchanges, producing turn-yielding, turn-taking, and floor-holding gaze signals [Kendon 1967].
- (What and How) Topic-signaling mechanisms (information structure): patterns in gaze shifts can be temporally aligned with the structure of the speaker's discourse, signaling changes in topic or shifts between thematic units of utterances [Cassell et al. 1999b].
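These mechanisms could be turned into a simple rule-based gaze policy; the events and rules below are illustrative assumptions, not a published controller:

```python
from typing import Optional


def choose_gaze_target(event: str, addressee: str,
                       bystander: Optional[str] = None) -> str:
    """Map a conversational event to a gaze target using the mechanisms above."""
    if event == "take_turn":           # floor-holding: glance away while planning
        return "away"
    if event == "yield_turn":          # turn-yielding: look at the next speaker
        return addressee
    if event == "topic_shift":         # topic-signaling: gaze shift marks the new theme
        return "away"
    if event == "acknowledge_bystander" and bystander:
        return bystander               # role-signaling: brief glance marks a bystander
    return addressee                   # default: keep addressing the main interlocutor
```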

Modeling Conversational Gaze Mechanisms: Approaches
- Theory-driven: based on theories of human communication and the role of gaze, models can be built to replicate certain identified functions (e.g. use of Politeness Theory by Brown and Levinson 1987).
- Empirically driven: based on experiments and data collected in the precise scenarios for which the robot's gaze behaviour will be built.
- A combination of both (theory- and empirically-driven).

Case study

Case: Modeling Conversational Gaze Mechanisms
Goal: build a model of gaze behaviour based on both theory and empirical data.
Initial data collection:
- to capture the basic spatial and temporal parameters of gaze cues;
- to capture aspects of the conversational mechanisms that signal information, conversation, and participation structures.

Data collection for a Gaze model

Data collection for a Gaze model
- Subjects' gaze behavior was captured using high-definition cameras placed across from their seats.
- Subjects' speech was captured using stereo microphones attached to their collars.
- The cameras provided video sequences of subjects' faces (from hair to chin).
- An additional camera on the ceiling was used to capture the interaction space.
- In total, there were 45 minutes of video for each subject and 180 minutes of data for each triad from the four cameras.
- The final analysis included an examination of the video data; coding and descriptive statistics (in particular coding of speech and gaze events from the video); calculating the frequencies of, and co-occurrences among, events; and computing the distribution parameters for the temporal and spatial properties of these events.

Analysis of data for a Gaze model: Where Do Speakers Look?

Analysis of data for a Gaze model: How Much Time Do Speakers Spend Looking at Each Target?
The speaker looked at his addressees for the majority of the time: 74%, 76%, and 71% in the two-party, two-party-with-bystander, and three-party conversations, respectively. In the first two scenarios, the speaker looked at the bodies of his addressees more than he looked at their faces (26% and 25% at the faces, and 48% and 51% at the bodies). In all scenarios, the speaker spent a significant amount of time looking away from the addressees (26%, 16%, and 29% of the time in the three conversational situations, respectively).
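Proportions like these can seed a simple probabilistic gaze-target generator for a robot speaker; a minimal sketch using the two-party numbers reported above (a timing model would decide when to resample):

```python
# Minimal sketch: pick a gaze target from the two-party proportions above
# (26% addressee's face, 48% addressee's body, 26% environment).
import random

GAZE_DISTRIBUTION_TWO_PARTY = {
    "addressee_face": 0.26,
    "addressee_body": 0.48,
    "environment":    0.26,
}


def sample_gaze_target(dist=GAZE_DISTRIBUTION_TWO_PARTY) -> str:
    targets, weights = zip(*dist.items())
    return random.choices(targets, weights=weights, k=1)[0]


print(sample_gaze_target())
```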

Analysis of data for a Gaze model
Each thematic field was mapped onto the speech timeline along with gaze shifts (4000-millisecond periods before the beginning and after the end of the thematic field). This mapping allowed the identification of patterns in gaze shifts that occurred at the onset of each thematic field and quantification of how frequently each pattern occurred. Two main recurring patterns of gaze shifts were found in the two-party conversation and the two-party-with-bystander conversation, and another set of two patterns in the three-party conversation.

Analysis of data for a Gaze model

Analysis of data for a Gaze model: Who is who (roles) in a conversation?

Analysis of data for a Gaze model: Who is who (roles) in a conversation?
Three gaze cues were identified that signal the participant roles of the interlocutors:
- Greetings and summonses.
- The body of the conversation: the speaker spent the majority of his speaking time looking at addressees (74% of the time, versus 26% at the environment, in the first scenario), and looked towards the addressee, bystander, and environment 76%, 8%, and 16% of the time, respectively (second scenario).
- Turn-exchanges.

Gaze patterns in the social robot

Evaluation of the Implemented Gaze Patterns

Hypotheses of the study
Hypothesis 1. Subjects will correctly interpret the footing signals that the robot communicates to them and conform to these roles in their participation in the conversation.
Hypothesis 2. Addressees will have better recall of the details of the information presented by the robot than bystanders and overhearers will, as the robot will look toward the addressees significantly more.
Hypothesis 3. Addressees and bystanders will evaluate the robot more positively than overhearers will.
Hypothesis 4. Addressees will express stronger feelings of groupness (with the robot and the other subject) than bystanders and overhearers will.

Conditions of the study
A total of 72 subjects participated in the experiment, in 36 trials.
Condition 1. The robot produced gaze cues for an addressee and an overhearer (ignoring the individual in the latter role), following the norms of a two-party conversation.
Condition 2. Gaze cues were produced for an addressee and a bystander, signaling the participation structure of a two-party conversation with a bystander.
Condition 3. The robot produced gaze cues for two addressees, following the participant roles of a three-party conversation.

Variables of the study
The manipulation of the robot's gaze behavior was the only independent variable. The dependent variables involved three kinds of measurements: behavioral, objective, and subjective.
Behavioral. Subjects' behavior was captured using high-definition cameras; from the video and audio data it was measured whether subjects took turns in responding to the robot and how long they spoke.
Objective. Subjects' recall of the information presented by the robot was measured using a post-experiment questionnaire.
Subjective. Measures included subjects' affective state using the PANAS scale [Watson et al. 1988], perceptions of the robot's physical, social, and intellectual characteristics using a scale developed to evaluate humanlike agents [Parise et al. 1996], feelings of closeness to the robot [Aron et al. 1992], feelings of groupness and ostracism [Williams et al. 2000], perceptions of the task (how much they enjoyed and attended to it), and demographic information.

Results

Let's focus on non-verbal communication
With robots, given their physical embodiment, we can add to verbal communication some level of non-verbal communication: gaze, head nods.

Head Motion (Head Nods)
Head motion naturally occurs during speech utterances and can be either intentional or unconscious. There is a strong relationship between head motion and dialogue acts (including turn-taking functions), and between dialogue acts and prosodic features.

Case study

A Model for Robotic Head Nods & Tilts
In the proposed model, nods are generated at the center of the last syllable of utterances with strong phrase boundaries (k, g, q) and of backchannels (bc).
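A minimal sketch of that timing rule; the utterance representation (dialogue-act tag plus syllable intervals) is a hypothetical assumption for illustration:

```python
# Sketch of the nod-timing rule above: place a nod at the centre of the last
# syllable of an utterance whose dialogue-act tag is a strong phrase boundary
# (k, g, q) or a backchannel (bc). Data structures are hypothetical.
NOD_TRIGGER_TAGS = {"k", "g", "q", "bc"}


def nod_time(utterance):
    """Return the time (s) at which to peak the nod, or None for no nod."""
    if utterance["dialogue_act"] not in NOD_TRIGGER_TAGS:
        return None
    start, end = utterance["syllables"][-1]  # last syllable of the utterance
    return (start + end) / 2.0               # centre of that syllable


utt = {"dialogue_act": "k", "syllables": [(0.0, 0.3), (0.3, 0.55), (0.55, 0.9)]}
print(nod_time(utt))  # 0.725
```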

Head Motion (Head Nods)
Eleven conversation passages with durations between 10 and 20 seconds, including fillers and turn-keeping functions (f and k3), were randomly selected from a database. Head rotation angles (nod, shake, and tilt angles) were computed by the head motion generation model for each conversation passage. Video clips were recorded for each stimulus, resulting in 33 stimuli (11 conversation passages and 3 motion types) for each robot type.

Video

Results

The problem revisited
When and how should a robot act or say something to the user?
- What to say (message content)
- When to say it (timing, turn-taking)
- How to say it (gestures, non-verbal behaviours)
This is a complicated task.

Discussion