Social Robots and Human-Robot Interaction. Ana Paiva. Lecture 8: Dialogues with Robots
Our goal: Build Social Intelligence
The problem: when and how should a robot act or say something to the user?
- What to say (message content)
- When to say it (timing, turn-taking)
- How to say it (gestures, non-verbal behaviours)
But in order to do that, the robot first needs to understand what the user said.
Communication is hard and miscommunication is easy
Let's look at communication: say/order → perceive → understand → respond → reply (what, how, when)
Let's look at communication: say/order → perceive → understand → respond. How can a robot understand an order from a user and relate it to actions and objects in the physical world?
Perceive & Understand: Giving Commands to the Robot
Goal: to infer groundings in the world.
Use a specific description: Spatial Description Clauses (SDCs). Each SDC corresponds to a constituent of the linguistic input and contains: a figure f, a relation r, and a variable number of landmarks l.
Each SDC has a type:
EVENT: an action sequence that takes place in the world (e.g. "Move the tire pallet").
OBJECT: a thing in the world (e.g. the forklift, the truck, the person).
PLACE: a place in the world (e.g. "next to the tire pallet").
PATH: a path or path fragment through the world (e.g. "past the truck").
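As a rough illustration of this representation, the nested figure/relation/landmark structure of an SDC can be sketched as a small data class. The class and field names below are my own shorthand for the lecture, not the original system's implementation:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SDC:
    """One Spatial Description Clause: a constituent of the command."""
    type: str                      # EVENT, OBJECT, PLACE, or PATH
    figure: Optional[str] = None   # f: the entity the clause is about
    relation: Optional[str] = None # r: e.g. "move", "next to", "past"
    landmarks: List["SDC"] = field(default_factory=list)  # l: 0..n landmarks

# "Move the tire pallet next to the truck" might decompose as:
cmd = SDC(type="EVENT", relation="move", landmarks=[
    SDC(type="OBJECT", figure="the tire pallet"),
    SDC(type="PLACE", relation="next to", landmarks=[
        SDC(type="OBJECT", figure="the truck")])])
```

Note how a PLACE clause nests an OBJECT clause as its landmark, mirroring the recursive structure of the linguistic input.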
Creating a System Based on a Collected Corpus of Actions
Create a system using a data-driven approach:
- To train the system, a corpus of natural language commands was collected;
- The commands were paired with robot actions and environment state sequences;
- The corpus was used both to train the model and to evaluate end-to-end performance of the system.
Giving Commands to the Robot: Creating a Corpus
Using videos of action sequences on Amazon's Mechanical Turk, a corpus was created by collecting language associated with each video. The videos showed a simulated robotic forklift performing an action such as picking up a pallet or moving through the environment. Paired with each video, there was a complete log of the state of the environment and the robot's actions. Subjects were asked to type a natural language command that would cause an expert human forklift operator to carry out the action shown in the video. Commands were collected from 45 subjects for twenty-two different videos showing the forklift executing an action in a simulated warehouse.
Evaluation of the Corpus
The model was assessed in terms of its performance at predicting the correspondence between the acquired structures (SDCs) and groundings. An evaluation was also performed using known correct and incorrect command-video pairs:
C1: subjects saw a command paired with the original video that a different subject watched when creating the command.
C2: subjects saw the command paired with a random video that was not used to generate the original command.
Giving Commands to the robot
Robot Learning through Dialogue
Problem with the previous approach: the interaction has to be learned (by collecting data), and thus it is limited to that particular domain. If we place no restrictions on speech, interpreting a command given to a robot is a challenging problem, as users may say all kinds of things.
Approach: Robot Learning through Dialogue and Access to the Web
Learning and using task-relevant knowledge from human-robot dialog and access to the Web.
Robot Learning through Dialogue and Access to the Web
KnoWDiaL, an approach for robot learning of task-relevant environmental knowledge from human-robot dialog and access to the Web.
Robot Learning through Dialogue and Access to the Web
The speech recognizer returns a set of possible interpretations; these interpretations are the input for the first component of KnoWDiaL, a frame-semantic parser. The parser labels the list of speech-to-text candidates and stores them in pre-defined frame elements, such as action references, locations, objects, or people.
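To make the idea concrete, here is a toy sketch of the kind of labelling such a parser produces. The keyword lists and frame-element names are illustrative assumptions, far simpler than KnoWDiaL's actual parser:

```python
def parse_command(utterance: str) -> dict:
    """Toy frame-semantic labelling of one speech-to-text candidate.
    The action-word lexicon is a made-up illustration."""
    frame = {"action": None, "object": None, "location": None}
    words = utterance.lower().rstrip(".?!").split()
    action_words = {"bring", "take", "go", "deliver"}
    for i, w in enumerate(words):
        if w in action_words and frame["action"] is None:
            frame["action"] = w
        elif w == "to":  # naively treat the rest as the location
            frame["location"] = " ".join(words[i + 1:])
            frame["object"] = " ".join(words[1:i]) or None
            break
    return frame

parse_command("Bring the coffee to the lab")
# -> {'action': 'bring', 'object': 'the coffee', 'location': 'the lab'}
```

In the real system each speech-to-text candidate would be parsed this way, and the resulting frames scored against the Knowledge Base.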
Robot Learning through Dialogue and Access to the Web
The Knowledge Base stores groundings of commands encountered in previous dialogues. A grounding is simply a probabilistic mapping of a specific frame element obtained from the frame-semantic parser to locations in the building or tasks the robot can perform.
Robot Learning through Dialogue and Access to the Web
The Grounding Model uses the information stored in the Knowledge Base to infer the correct action to take when a command is received.
Robot Learning through Dialogue and Access to the Web
Sometimes, not all of the parameters required to ground a spoken command are available in the Knowledge Base. When this happens, the Grounding Model resorts to OpenEval, the fourth component of KnoWDiaL. OpenEval is able to extract information from the World Wide Web to fill missing parameters of the Grounding Model.
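The overall flow (Knowledge Base lookup first, web-based fallback second) can be sketched as follows. The names `knowledge_base` and `open_eval`, the room identifiers, and all probabilities are placeholder assumptions, not the real KnoWDiaL interfaces:

```python
knowledge_base = {
    # (frame element type, phrase) -> {grounding: probability},
    # accumulated from previous dialogues (values invented here)
    ("location", "the lab"): {"room_7408": 0.9, "room_7412": 0.1},
}

def open_eval(element_type: str, phrase: str) -> dict:
    """Stand-in for the web-based evaluator: the real component queries
    the Web to score candidate groundings for an unseen phrase."""
    return {"room_unknown": 0.5}  # dummy result for illustration

def ground(element_type: str, phrase: str) -> str:
    dist = knowledge_base.get((element_type, phrase))
    if dist is None:              # KB has no entry: fall back to the Web
        dist = open_eval(element_type, phrase)
    return max(dist, key=dist.get)  # return the most probable grounding

ground("location", "the lab")   # -> 'room_7408' (found in the KB)
```

After a successful dialogue, the confirmed grounding would be written back into the Knowledge Base, so the web fallback is needed less often over time.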
Robot Learning through Dialogue and Access to the Web: Example
Let's look at communication: say/order → perceive → understand → respond → reply (what, how, when)
Perceive & Understand: Components of a Dialogue System for a Social Robot
Perceive and understand:
- NL (natural language) system: a parser using a grammar developed for human-robot conversation;
- A speech recogniser: can use a speech recognition server with a language model for the language we need (recognised utterances may correspond to a logical form mapped into a string).
Generate and synthesize:
- NL (natural language) system: a generator using a grammar developed for human-robot conversation;
- TTS (text-to-speech): a speech synthesizer for robot speech output.
Manage dialogue and non-verbal communication:
- A dialogue manager: co-ordinates multi-modal inputs from the user and interprets the actions of the user (through several modules) as dialogue moves. The dialogue manager must also update and maintain the dialogue context (and common ground), handling questions and miscommunication events;
- A gestures and non-verbal behaviour handler (associated with communicative functions).
Typical Architecture
- Dialogue Manager: decision making and action selection system;
- Understanding Module: non-verbal behaviour understanding; verbal behaviour understanding (NLU);
- Generation Module: non-verbal behaviour generation (gaze and gesture); verbal behaviour generation (NLG);
- Inputs: vision processing; speech recognition system;
- Outputs: motion control; text-to-speech.
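One way to read this architecture is as a loop in which the dialogue manager mediates between the understanding and generation modules. A skeletal sketch, with all class and method names invented for illustration:

```python
class DialogueManager:
    """Toy dialogue manager: coordinates understanding (nlu), verbal
    generation (nlg), non-verbal generation (gaze_gen), and output (tts)."""
    def __init__(self, nlu, nlg, gaze_gen, tts):
        self.nlu, self.nlg = nlu, nlg
        self.gaze_gen, self.tts = gaze_gen, tts
        self.context = []                 # dialogue context / common ground

    def on_user_input(self, speech, vision):
        move = self.nlu(speech, vision)   # interpret input as a dialogue move
        self.context.append(move)         # update common ground
        reply = self.decide(move)         # decision making / action selection
        self.tts(self.nlg(reply))         # verbal output
        self.gaze_gen(reply)              # time-aligned non-verbal output

    def decide(self, move):
        # A real manager would select among clarification, answers,
        # repair of miscommunication, etc.; here we just echo.
        return {"act": "answer", "content": move}
```

The point of the sketch is the control flow: every user turn passes through understanding, context update, decision making, and coordinated verbal plus non-verbal generation.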
Non-verbal Communication as a Mechanism for Managing Conversations
There are different mechanisms for managing conversations:
- Interlocutors of a conversation engage in discourse at varying levels of involvement: their participant roles or footing, i.e. the participation structure of the conversation;
- Roles shift among conversational participants via a turn-taking mechanism, which allows interlocutors to seamlessly exchange speaking turns, interrupt, etc.;
- Participants in a conversation create a discourse, that is, a composition of discourse segments in particular structures [Grosz and Sidner 1986]. Such structures signal shifts in topic in the discourse or how information is organized. Speakers also produce a number of cues that signal these structures, enable contributions from other participants, or direct attention to important information (these signals include not only verbal cues but also nonverbal cues, in particular gaze and gestures).
Let's focus on non-verbal communication. With robots, given their physical embodiment, we can add to the verbal communication some level of non-verbal communication: gaze; head nods.
Gaze
During interaction, people look each other in the eye while listening and talking. Without eye contact, people do not feel they are in communication! Gaze provides a number of potential social cues that can be used by people to learn about the social context, about the environment (objects and events), or even about the internal (emotional and intentional) states of others. Gaze cues serve a number of functions in conversations:
- Clarify who is addressed;
- Help the speaker hold the floor (turn-taking).
Types of Gaze Direction
- Mutual gaze: the attention of two individuals is directed to one another;
- Gaze following: individual A detects that B's gaze is not directed towards them, and follows the line of sight of B onto a point in space;
- Joint attention: similar to gaze following, except that there is a focus of attention, for example an object, that two individuals A and B are looking at at the same time;
- Shared attention: a combination of mutual attention and joint attention, so the focus of both individuals A's and B's attention is not only on the object but also on each other (example: I know you're looking at X, and you know that I'm looking at X);
- Theory of mind: uses a combination of the previous attentional processes and higher-order cognitive strategies, allowing one to reason about the other's attention.
Gaze
Gaze cues can also be used in social robots to serve functions in conversations:
- Clarify who is addressed;
- Help the speaker hold the floor (turn-taking);
- Help in signaling changes in topics of conversation.
Gaze and Mechanisms for Managing Conversations
There are different mechanisms based on gaze for managing conversations:
(Who) Role-signaling mechanisms (participation structure): speaker gaze cues may signal the roles of interlocutors [Bales et al. 1951];
(When) Turn-taking mechanisms (conversation structure): speaker gaze cues can facilitate turn-exchanges (producing turn-yielding, turn-taking, and floor-holding gaze signals) [Kendon 1967];
(What and How) Topic-signaling mechanisms (information structure): patterns in gaze shifts can be temporally aligned with the structure of the speaker's discourse, signaling changes in topic or shifts between thematic units of utterances [Cassell et al. 1999b].
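These three mechanisms can be caricatured as a mapping from conversational events to gaze targets. The event names and target-selection rules below are simplifying assumptions for illustration, not the models from the cited studies:

```python
def gaze_target(event: str, addressees: list, bystanders: list) -> str:
    """Pick a gaze target for a robot speaker given a conversational event.
    A toy rule set standing in for the three gaze mechanisms."""
    if event == "address":      # role-signaling: look at who you address
        return addressees[0]
    if event == "yield_turn":   # turn-taking: gaze hands over the floor
        return addressees[-1]
    if event == "topic_shift":  # topic-signaling: brief gaze away marks
        return "away"           # the boundary between thematic units
    return addressees[0]        # default: keep gaze on the addressee
```

A real controller would also schedule the timing and duration of each gaze shift, which the empirical study below measures.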
Modeling Conversational Gaze Mechanisms: Approaches
- Theory-driven: based on theories of human communication and the role of gaze, models can be built to replicate certain identified functions (e.g. use of Politeness Theory, Brown and Levinson 1987);
- Empirically driven: based on experiments and data collected in the precise scenarios for which we will build the robot's gaze behaviour;
- A combination of both (theory- and empirically driven).
Case study
Case: Modeling Conversational Gaze Mechanisms
Goal: build a model for gaze behaviour based on both theory and empirical data.
Initial data collection:
- to capture the basic spatial and temporal parameters of gaze cues;
- to capture aspects of conversational mechanisms that signal information, conversation, and participation structures.
Data collection for a Gaze model
Data collection for a Gaze model
- Subjects' gaze behavior was captured using high-definition cameras placed across from their seats.
- Subjects' speech was captured using stereo microphones attached to their collars.
- The cameras provided video sequences of subjects' faces (from hair to chin).
- An additional camera on the ceiling was used to capture the interaction space.
- In total, there were 45 minutes of video for each subject and 180 minutes of data for each triad from four cameras.
- The final analysis included an examination of the video data, coding and descriptive statistics (in particular, coding of speech and gaze events from the video), calculating the frequencies of and co-occurrences among events, and computing the distribution parameters for the temporal and spatial properties of these events.
Analysis of data for a Gaze model: Where Do Speakers Look?
Analysis of data for a Gaze model: How Much Time Do Speakers Spend Looking at Each Target?
- The speaker looked at his addressees for the majority of the time: 74%, 76%, and 71% in the two-party, two-party-with-bystander, and three-party conversations, respectively.
- In the first two scenarios, the speaker looked at the bodies of his addressees more than he looked at their faces (26% and 25% at the faces vs. 48% and 51% at the bodies).
- In all scenarios, the speaker spent a significant amount of time looking away from addressees (26%, 16%, and 29% of the time in the three conversational situations, respectively).
Analysis of data for a Gaze model
Each thematic field was mapped onto the speech timeline along with gaze shifts (4000-millisecond periods before the beginning and after the end of the thematic field). This mapping allowed the identification of patterns in gaze shifts that occurred at the onset of each thematic field, and the frequency of occurrence of each pattern to be quantified. Two main recurring patterns of gaze shifts were found in the two-party conversation and the two-party-with-bystander conversation, and another set of two patterns in the three-party conversation.
Analysis of data for a Gaze model
Analysis of data for a Gaze model: Who is who (roles) in a conversation?
Analysis of data for a Gaze model: Who is who (roles) in a conversation Three gaze cues were identified to signal the participant roles of his interlocutors. Greetings and summonses. The body of the conversation. the speaker spent the majority of his speaking time looking at addressees (74% of the time and the environment 26% of the time in 1 st scenario and looked towards the addressee, bystander, and the environment 76%, 8%, and 16% of the time, respectively. Turn-exchanges
Gaze patterns in the social robot
Evaluation of the Implemented Gaze Patterns
Hypotheses of the study
Hypothesis 1. Subjects will correctly interpret the footing signals that the robot communicates to them and conform to these roles in their participation in the conversation.
Hypothesis 2. Addressees will have better recall of the details of the information presented by the robot than bystanders and overhearers will, as the robot will look toward the addressees significantly more.
Hypothesis 3. Addressees or bystanders will evaluate the robot more positively than overhearers will.
Hypothesis 4. Addressees will express stronger feelings of groupness (with the robot and the other subject) than bystanders and overhearers will.
Conditions of the study
A total of 72 subjects participated in the experiment, in 36 trials.
- Condition 1: the robot produced gaze cues for an addressee and an overhearer (ignoring the individual in the latter role), following the norms of a two-party conversation.
- Condition 2: gaze cues were produced for an addressee and a bystander, signaling the participation structure of a two-party conversation with a bystander.
- Condition 3: the robot produced gaze cues for two addressees, following the participant roles of a three-party conversation.
Variables of the study
The manipulation in the robot's gaze behavior was the only independent variable. The dependent variables involved three kinds of measurements: behavioral, objective, and subjective.
- Behavioral: subjects' behavior was captured using high-definition cameras, and from the video and audio data it was measured whether subjects took turns in responding to the robot and how long they spoke.
- Objective: subjects' recall of the information presented by the robot was measured using a post-experiment questionnaire.
- Subjective: subjects' affective state was measured using the PANAS scale [Watson et al. 1988], along with perceptions of the robot's physical, social, and intellectual characteristics using a scale developed to evaluate humanlike agents [Parise et al. 1996], feelings of closeness to the robot [Aron et al. 1992], feelings of groupness and ostracism [Williams et al. 2000], perceptions of the task (how much subjects enjoyed and attended to the task), and demographic information.
Results
Let's focus on non-verbal communication. With robots, given their physical embodiment, we can add to the verbal communication some level of non-verbal communication: gaze; head nods.
Head Motion (Head Nods)
Head motion naturally occurs during speech utterances, and can be either intentional or unconscious. There is a strong relationship between head motion and dialogue acts (including turn-taking functions), and between dialogue acts and prosodic features.
Case study
A Model for Robotic Head Nods & Tilts
In the proposed model, nods are generated at the center of the last syllable of utterances with strong phrase boundaries (k, g, q) and of backchannels (bc).
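The placement rule can be sketched as a small timing function. The dialogue-act tag names follow the slide (k, g, q, bc), but the syllable-timing input format is an assumption made for this sketch:

```python
NOD_TAGS = {"k", "g", "q", "bc"}  # strong phrase boundaries + backchannels

def nod_time(dialogue_act: str, syllables: list):
    """Return the time (in seconds) at which to centre a head nod,
    or None if the dialogue act does not trigger a nod.
    syllables: list of (start_sec, end_sec) pairs for the utterance."""
    if dialogue_act not in NOD_TAGS or not syllables:
        return None               # no nod for other dialogue acts
    start, end = syllables[-1]    # last syllable of the utterance
    return (start + end) / 2.0    # nod at the centre of that syllable

nod_time("k", [(0.0, 0.3), (0.3, 0.7)])  # -> 0.5
```

A motion controller would then schedule the nod trajectory so that its peak coincides with the returned timestamp.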
Head Motion (Head Nods)
Eleven conversation passages with durations between 10 and 20 seconds, including fillers and turn-keeping functions (f and k3), were randomly selected from a database. Head rotation angles (nod, shake, and tilt angles) were computed by the head motion generation model for each conversation passage. Video clips were recorded for each stimulus, resulting in 33 stimuli (11 conversation passages × 3 motion types) for each robot type.
Video
Results
The problem revisited: when and how should a robot act or say something to the user?
- What to say (message content)
- When to say it (timing, turn-taking)
- How to say it (gestures, non-verbal behaviours)
This is a complicated task.
Discussion