Situated Generation: Introduction. Konstantina Garoufi, 18 October 2011
Non-situated language Context form and content of discourse (purpose of discourse) I'm going to a party. Do you want to come? OK.
Non-situated language Use the wheelpuller to remove the flywheel. Context form and content of discourse (purpose of discourse)?
Situated language Appelt (1982)
Situated language Context form and content of discourse purpose of discourse objects of the scene in the visual field spatial configuration gestures, gaze history of interaction task at hand...
Challenges of situated language generation Non-linguistic context, in addition to the linguistic one Interplay between language and action: language can itself bring about changes to the non-linguistic context, e.g. by causing the hearer to perform an action Real-time system performance required
Outline What is situated language generation? Challenges Modeling linguistic and non-linguistic context Understanding the interplay between language and action Performing in a dynamic environment in real time Summary and discussion
What is context? Context is what constrains a problem solving without intervening in it explicitly. Brézillon (1999)
Context modeling for language generation What is the relationship between formalization of context and natural language ideas of context? Which phenomena and inferences observed in natural language are context-independent and which ones always depend on context? How to automatically identify context-provided constraints resulting in conveying additional or different aspects of information?
Linguistic context modeling Dial Your Disc (DYD) system One of the first generation systems with a dedicated context model Generation of spoken monologues about W. A. Mozart's instrumental compositions van Deemter & Odijk (1997)
How DYD works
Context modeling in DYD Find a level of representation that is both rich and explicit enough to allow a system of rules to exploit the information in there for contextually appropriate utterances Set up a data structure and fill it with information (the context model) Formulate rules that exploit this data structure
DYD's context model Knowledge state: Which information has been expressed so far, and when? Topic state: Which topics have already been dealt with, which are still to be considered? Context state: Which objects have been introduced? How and when? Dialogue state: What recordings have been selected so far?
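The four states above can be pictured as one data structure plus rules that read from it. The following is only a minimal sketch under assumptions: the class `ContextModel`, its field names, and the `refer` rule are hypothetical illustrations, not the representation or rule set actually used in DYD.

```python
from dataclasses import dataclass, field


@dataclass
class ContextModel:
    """Hypothetical container for the four states listed on the slide."""
    # Knowledge state: fact -> step at which it was expressed
    knowledge: dict[str, int] = field(default_factory=dict)
    # Topic state: topics already dealt with, topics still to be considered
    topics_done: list[str] = field(default_factory=list)
    topics_pending: list[str] = field(default_factory=list)
    # Context state: object -> (how it was introduced, when)
    introduced: dict[str, tuple[str, int]] = field(default_factory=dict)
    # Dialogue state: recordings selected so far
    selected_recordings: list[str] = field(default_factory=list)

    def refer(self, obj: str, full: str, reduced: str, step: int) -> str:
        """Illustrative rule: use the reduced description for an object
        that has already been introduced, the full one otherwise."""
        if obj in self.introduced:
            return reduced
        # First mention: record how and when the object was introduced.
        self.introduced[obj] = ("full description", step)
        return full
```

A rule like `refer` only consults and updates the context state; comparable rules could consult the knowledge state to avoid repeating facts, or the topic state to choose what to talk about next.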
What information does that encompass? both syntactic and semantic some generally required, some system-specific granularity subject to application (here: speech generation, so prosody is important)
Has DYD's context model solved all our problems? Consider the following text: M. Walker will give a presentation later today in the same room as where the opening session was held. He is currently in the coffee room, just around the corner and he might be an interesting person for setting up a project on ubiquitous computing. Is DYD's context model sufficient here?
Multidimensional context modeling Parrot-Talk system Human agents in the physical world are supported by software agents Text is generated for output on a wearable device (parrot) Conference center application: parrots search for information and encounters with other users who share same interests Geldof (1999)
Context dimensions in Parrot-Talk Linguistic: How far ahead in the discourse have objects been mentioned? Extra-linguistic temporal: date, time physical: how close is target user? social implicature: what is target user doing? User profile: interest in which topics and persons?
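To make the dimensions above concrete, here is a small sketch of how they might be bundled and queried. Everything in it is an assumption for illustration: the class `SituatedContext`, its fields, the `worth_mentioning` rule, and the 20-metre threshold are invented and are not taken from Parrot-Talk.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class SituatedContext:
    # Linguistic: entity -> number of utterances since its last mention
    mentioned: dict[str, int]
    # Extra-linguistic, temporal: current date and time
    now: datetime
    # Extra-linguistic, physical: person -> distance to the user in metres
    distance_m: dict[str, float]
    # Extra-linguistic, social: what the user is currently doing
    user_activity: str
    # User profile: topics the user is interested in
    interests: set[str]


def worth_mentioning(ctx: SituatedContext, person: str, topic: str,
                     max_distance_m: float = 20.0) -> bool:
    """Hypothetical rule: point out a person only if the topic matches the
    user's interests, the person is nearby, and the user is not busy."""
    nearby = ctx.distance_m.get(person, float("inf")) <= max_distance_m
    free = ctx.user_activity != "attending talk"
    return topic in ctx.interests and nearby and free
```

The point of the sketch is that a single generation decision can depend on several dimensions at once, which a purely linguistic context model cannot express.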
Multimodal context in GRE Richer notions of multimodal context, with focus on generation of referring expressions (GRE): deictic pointing gestures; current focus of attention; three-dimensional salience: linguistic, inherent, and focus space salience. Examples from the slide figures: "this black block"; focus space: "the white block"; focus space: "that white block to the left of the black one". van der Sluis & Krahmer (2001)
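The idea of three-dimensional salience can be sketched in a few lines. This is an illustration under stated assumptions only: the class `Block`, the integer scores, the simple summation, and the `is_distinguishing` test are hypothetical and may differ from the weighting actually defined by van der Sluis & Krahmer (2001).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    colour: str
    linguistic: int = 0  # e.g. recently mentioned in the discourse
    inherent: int = 0    # e.g. perceptually prominent in the scene
    focus: int = 0       # e.g. located in the current focus space

    @property
    def salience(self) -> int:
        # Assumption: the three dimensions are simply added up.
        return self.linguistic + self.inherent + self.focus


def is_distinguishing(target: Block, colour: str, scene: list[Block]) -> bool:
    """True if `target` has this colour and is strictly more salient
    than every other block of the same colour in the scene."""
    if target.colour != colour:
        return False
    rivals = [b for b in scene if b.colour == colour and b is not target]
    return all(b.salience < target.salience for b in rivals)
```

Under this sketch, a short description such as "the white block" is acceptable when the intended block is the most salient white block, even though other white blocks exist in the scene.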
A few years later: OSU Quake 2004 corpus of two-party situated problem-solving dialogs. Deictic and exophoric (i.e. situational) reference; language calibrated against spatial arrangement of world; perceptual limitations. Byron & Fosler-Lussier (2006) [Figure: page excerpt from the corpus paper, including Figure 6: Sketch of recording hardware]
A sister corpus: SCARE 15 spontaneous English dialogue sessions Each session records the joint problem-solving of a pair of human partners working through a treasure-hunt-style task in a 3D virtual world Stoia et al. (2008)
The SCARE corpus instruction giver (IG) guides instruction follower (IF) through completing tasks IF's view of the world, as displayed on IG's monitor IG's map of the world Stoia et al. (2008)
Example interaction http://slate.cse.ohio-state.edu/quake-corpora/scare
Example transliterated walk forward and go through the first door you see [pause] and then go through the next one right in front of it [pause] yeah that one [pause] ok [disfluency - w] and then turn to your right [pause] and then hit the button in the middle [pause]
Example step-by-step walk forward and go through the first door you see and then go through the next one right in front of it } navigation and then turn to your right and then hit the button in the middle } referring expression generation
What is happening here? More than mere referring expression generation! Looks like the IG is manipulating the extralinguistic context of the discourse in a way that allows him to use a linguistic utterance of lower cognitive complexity How can a generation system model that?
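One way to think about the question on this slide is a generator that first changes what the hearer can see and only then refers. The sketch below is a toy illustration under assumptions: the `Obj` type, the room names, and the `next_instruction` rule are invented here and are not a model proposed in the cited work.

```python
from typing import NamedTuple


class Obj(NamedTuple):
    id: str
    kind: str      # e.g. "button"
    room: str      # where the object is located
    position: str  # e.g. "in the middle"


def next_instruction(target: Obj, player_room: str, world: list[Obj]) -> str:
    """Navigate first; refer only once the target is in the same room."""
    if player_room != target.room:
        # Changing the non-linguistic context: bring the target into view.
        return f"go to the {target.room}"
    same_kind = [o for o in world
                 if o.room == player_room and o.kind == target.kind]
    if len(same_kind) == 1:
        # No distractors of this kind are visible: a bare noun suffices.
        return f"hit the {target.kind}"
    # Visible distractors remain: add the position to single out the target.
    return f"hit the {target.kind} {target.position}"
```

After the navigation step the set of distractors shrinks to the objects in the target's room, so a much shorter referring expression is enough than one that would have to be unique in the whole world.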
Leßmann et al. (2006) Virtual Baufix game
Virtual agent Max Can interpret: natural language instructions gestures 3D object manipulations Can generate: natural language instructions gestures 3D object manipulations prosody speech, eye gaze, emotional facial expressions
Cooperative problem solving with Max 1 User: Let us build a propeller. 2 Max: Okay. 3 Max: First, a bolt needs to be inserted 4 User: Which bolt? 5 Max: in the middle of a bar. 6 Max: Any bolt. 7 User puts the yellow bolt into the first hole of a three-hole bar. 8 Max: No, that was the wrong hole. Max undoes the action. Leßmann et al. (2006)
Cooperative problem solving with Max 9 Max: Again, a bolt needs to be inserted in the middle hole of a bar. 10 User puts the yellow bolt into the middle hole of a three-hole-bar. 11 Max: Alright. Now, insert <pointing gesture> this bolt into another three-hole-bar. 12 User: Do it yourself. 13 Max: Okay, I will do it. 14 Max connects the objects as instructed. 15 Max: And now the bars need to be turned <iconic gesture> crosswise. Leßmann et al. (2006)
Cooperative problem solving with Max A long pause occurs where the user does not react. 16 Max: Well, then I will perform the construction step. 17 The objects are rotated as instructed. 18 Max: Great. We have just completed the assembly of a propeller. Figure 2: Max assists the human partner in building a propeller Leßmann et al. (2006)
The GIVE Challenge User plays 3D game in virtual world, connected over the Internet Natural language generation system generates instructions in real time, e.g. "move forward 2 steps!", "press the blue button!" Koller et al. (2010)
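Instructions like the ones on this slide could be produced from a sequence of planned actions. The following is a minimal sketch, assuming a plan is just a list of action strings; the function `verbalise` and its aggregation of repeated "forward" moves are hypothetical and do not describe any particular GIVE system.

```python
from itertools import groupby


def verbalise(plan: list[str]) -> list[str]:
    """Turn a list of planned actions into instruction strings,
    merging consecutive 'forward' moves into a single instruction."""
    instructions = []
    for action, run in groupby(plan):
        count = len(list(run))
        if action == "forward":
            unit = "step" if count == 1 else "steps"
            instructions.append(f"move forward {count} {unit}!")
        else:
            # Other actions are verbalised one by one, unchanged.
            instructions.extend(f"{action}!" for _ in range(count))
    return instructions
```

In a real-time setting such a function would have to be called again whenever the user deviates from the plan, since the remaining plan, and with it the appropriate next instruction, changes.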
Website
Demo http://www.give-challenge.org
Summary Situated language generation is a useful task, but comes with many challenges Fundamental questions about the nature of context in situated communication are open; no unified account of the notions of situated context exists The interplay between language and action is not yet fully explored However, we'll see over the next weeks that a lot has been achieved - stay tuned!
Course slides and literature http://www.ling.uni-potsdam.de/~garoufi/page.php?id=generierung
References Appelt (1982). Planning natural-language utterances to satisfy multiple goals. Brézillon (1999). Context in problem solving: a survey. Byron & Fosler-Lussier (2006). The OSU Quake 2004 corpus of two-party situated problem-solving dialogs. Geldof (1999). Parrot-Talk requires multiple context dimensions. Koller et al. (2010). The First Challenge on Generating Instructions in Virtual Environments. Leßmann et al. (2006). Situated interaction with a virtual human - perception, action, and cognition. Stoia et al. (2008). SCARE: A Situated Corpus with Annotated Referring Expressions. van Deemter & Odijk (1997). Context modeling and the generation of spoken discourse. van der Sluis & Krahmer (2001). Generating referring expressions in a multimodal context.