The Intellimedia WorkBench - an environment for building multimodal systems

Tom Brøndsted {tb@cpk.auc.dk}, Lars Bo Larsen {lbl@cpk.auc.dk}, Michael Manthey {manthey@cs.auc.dk}, Paul Mc Kevitt {pmck@cpk.auc.dk}, Thomas Moeslund {tbm@cpk.auc.dk}, Kristian G. Olesen {kgo@vision.auc.dk}
Institute of Electronic Systems, Aalborg University, Denmark.

1 Poster Summary

1.1 Background

Driven by the current move towards multimodal interaction, an activity was initiated at Aalborg University to integrate the expertise present in a number of previously separate research groups [IMM 1997]. Among these are speech and natural language processing, spoken dialogue systems, vision-based gesture recognition, decision support and machine learning systems. This activity has resulted in the establishment of an "Intellimedia WorkBench". The workbench is a physical as well as a software platform enabling research and education within the area of multimodal user interfaces. The workbench makes a set of tools available which can be used in a variety of applications. The devices are a mixture of commercially available products (e.g. the speech recogniser and synthesiser), custom-made products (e.g. the laser system) and modules developed by the project team (e.g. the gesture recogniser and natural language parser).

1.2 Architecture

Fig. 1. Architecture of the workbench: speech synthesiser, dialogue manager, natural language parser, gesture recogniser, laser pointer, blackboard, speech recogniser, microphone array, domain model and Topsy.

A very open architecture has been chosen to allow for easy integration of new modules. The central module is a blackboard, which stores information about the system's current state, history, etc. All modules communicate through the exchange of semantic frames with other modules or the blackboard. The process synchronisation and intercommunication are based on the DACS IPC platform, developed by the SFB360 project at Bielefeld University [Fink et al 1996]. DACS allows the modules to be distributed across a number of servers.
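
To make the blackboard pattern concrete, here is a minimal sketch in Python of a blackboard that stores posted frames and forwards them to subscribing modules. It is an illustration only: the class and method names are ours, and the actual workbench communicates via DACS IPC rather than in-process callbacks.

    # Minimal sketch of blackboard-mediated frame exchange (illustrative only;
    # the real workbench uses the DACS IPC platform, not in-process callbacks).
    from collections import defaultdict

    class Blackboard:
        def __init__(self):
            self.history = []                     # every frame ever posted
            self.subscribers = defaultdict(list)  # producer name -> callbacks

        def subscribe(self, producer, callback):
            # Register interest in frames coming from a given module.
            self.subscribers[producer].append(callback)

        def post(self, frame):
            # Store the frame and forward it to all interested modules.
            self.history.append(frame)
            for callback in self.subscribers[frame["module"]]:
                callback(frame)

    # Example: a dialogue manager reacting to frames from the parser.
    bb = Blackboard()
    bb.subscribe("parser", lambda f: print("dialogue manager received:", f))
    bb.post({"module": "parser", "intention": "query",
             "input": "Show me Hanne's office"})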

The architecture is shown in figure 1, and the physical layout in figure 2. Figure 1 shows the blackboard as the central element with a number of modules around it. Presently, modules for speech recognition, parsing, speech synthesis, 2D visual gesture recognition and a laser pointing device are integrated into the application described below. Furthermore, a sound source locator (a microphone array) and a machine learning system [Manthey 1998] are included in the WorkBench.

Fig. 2. Physical layout of the workbench. The camera and the laser are mounted in the ceiling. The microphone array is placed on the wall.

The present application is a multimodal campus information system. A model (blueprint) of a building layout is placed on the workbench (see figure 2), and the system allows the user to ask questions about the locations of persons, offices, labs, etc. Typical inquiries are about routes from one location to another, where a given person's office is located, etc. Input is simultaneous speech and/or gestures (pointing to the plan). Output is synchronised speech synthesis and pointing (using the laser beam to point and draw routes on the map).

Frame semantics. A frame semantics has been developed for integrated perception in the spirit of Minsky (1975), consisting of (1) input, (2) output, and (3) integration frames for representing the meaning or semantics of intended user input and system output. Frames are produced by all modules in the system and are placed on the blackboard, where they can be read by all modules. The format of the frames is a predicate-argument structure, and we have produced a BNF definition of that format. Frames represent some crucial elements such as module, input/output, intention, location, and time-stamp. Module is simply the name of the module producing the frame (e.g. parser). Inputs are the input recognised, whether spoken (e.g. "Show me Hanne's office") or gestural (e.g. pointing coordinates), and outputs the intended output, whether spoken (e.g. "This is Hanne's office.") or gestural (e.g. pointing coordinates). Time-stamps can include the times a given event commenced and completed. The frame semantics also includes two keys for language/vision integration: reference and spatial relations.
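
The BNF definition itself is not reproduced in this summary, but frame instances of the kind described might look as follows; the concrete field names and values are our own illustration of the module, input/output, intention, location and time-stamp keys listed above, not the project's actual format.

    # Hypothetical frame instances in predicate-argument style; the field
    # names mirror the keys described in the text, not the project's BNF.

    input_speech_frame = {
        "module": "parser",
        "direction": "input",
        "intention": "query",
        "utterance": "Show me Hanne's office",
        "time": (10.3, 11.1),        # event start / end, in seconds
    }

    input_gesture_frame = {
        "module": "gesture",
        "direction": "input",
        "intention": "pointing",
        "location": (412, 230),      # 2D coordinates on the building plan
        "time": (10.5, 10.9),
    }

    output_frame = {
        "module": "dialogue_manager",
        "direction": "output",
        "intention": "inform",
        "utterance": "This is Hanne's office.",
        "location": (655, 310),      # where the laser should point
        "time": (11.4, None),
    }

An integration frame would then tie the speech and gesture frames together by reference (resolving "Hanne's office" against the pointing coordinates) and by their overlapping time-stamps.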

1.3 Architecture

The workbench presently includes the following modules:

Speech recogniser. Speech recognition is handled by the graphvite [Power et al 1997] real-time continuous speech recogniser. It is based on Hidden Markov Models of triphones for acoustic decoding of English or Danish. The recognition process focuses on recognition of speech concepts and ignores non-content words or phrases. In the present application domain, speech concepts are routes, names and commands, which are modelled as phrases. A finite state network describing the phrases is created in accordance with the domain model and the grammar for the natural language parser described below.

Speech synthesiser. The speech synthesiser is the Infovox [Infovox 1994], which in the present version is capable of synthesising Danish and English. It is a rule-based formant synthesiser and can simultaneously cope with multiple languages, e.g. pronounce a Danish name within an English utterance.

Natural language parser. The natural language parser [Brøndsted 1997] is based on a compound feature-based (so-called unification) grammar formalism for extracting semantics from the one-best output written by the speech recogniser to the blackboard. The parser carries out a syntactic constituent analysis of input and subsequently maps values into semantic frames of the type described above. The rules used for syntactic parsing are based on a subset of the EUROTRA formalism (lexical rules and structure-building rules) [Beck 1991]. Semantic rules define certain syntactic subtrees and which frames to create if the subtrees are found in the syntactic parse trees. For each syntactic parse tree, the parser generates only one predicate, and all created semantic frames are arguments or sub-arguments of this predicate. If syntactic parsing cannot complete, the parser can return the found frame fragments to the blackboard.

Gesture recogniser. A design principle of imposing as few physical constraints as possible on the user (e.g. no data gloves or touch screens) led to the inclusion of a vision-based gesture recogniser. It tracks a pointer (or the user's finger) via a camera mounted in the ceiling. Using one camera, the gesture recogniser is able to track 2D pointing gestures in real time. In the current application there are two gestures: pointing and not-pointing. In future versions of the system, other kinds of gestures, such as marking an area or indicating a direction, will be included. The camera continuously captures images, which are digitised by a frame-grabber. From each digitised image the background is subtracted, leaving only the motion (and some noise) within the image. This motion is analysed in order to find the direction of the pointing device and its edge. By temporal segmentation of these two parameters, a clear indication of the position the user is pointing to at a given time is found. The error of the tracker is less than one pixel (through an interpolation process) for the pointer.
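
As a rough illustration of the background-subtraction step just described, the following Python/NumPy sketch locates a pointer tip in a grey-scale image. The threshold and the tip heuristic are invented for this sketch; the real recogniser additionally performs temporal segmentation and sub-pixel interpolation.

    # Toy background-subtraction tracker (illustrative; the threshold and
    # tip heuristic are not those of the actual gesture recogniser).
    import numpy as np

    def find_pointer_tip(image, background, threshold=30):
        # Pixels that differ sufficiently from the background are "motion".
        motion = np.abs(image.astype(int) - background.astype(int)) > threshold
        ys, xs = np.nonzero(motion)
        if len(xs) == 0:
            return None                      # nothing moved: not-pointing
        # Crude heuristic: the motion pixel farthest from the motion centroid
        # is taken as the tip of the pointing device.
        cx, cy = xs.mean(), ys.mean()
        i = np.argmax((xs - cx) ** 2 + (ys - cy) ** 2)
        return int(xs[i]), int(ys[i])

    # Synthetic example with 8-bit grey images:
    bg = np.zeros((240, 320), dtype=np.uint8)
    img = bg.copy()
    img[100:120, 50:200] = 255               # a bright "pointer" blob
    print(find_pointer_tip(img, bg))         # one end of the blob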

Laser pointer. A laser system is mounted next to the camera, acting as a "system pointer". It is used for showing positions and drawing routes on the map. The laser beam is controlled in real time (30 kHz). It can scan frames containing up to 600 points with a refresh rate of 50 Hz, thus drawing very steady images on the workbench surface. It is controlled by a standard Pentium host computer. The tracker and the laser pointer are carefully calibrated in order to work together; an automatic calibration procedure has been set up, involving both the camera and the laser.

Sound source locator. A microphone array [Leth-Espensen and Lindberg 1995] is used to locate a sound source, e.g. a person speaking. (This module is not hooked up at present.) Depending upon the placement of a maximum of 12 microphones, it calculates the position in 2 or 3 dimensions. It is based on measurement of the delays with which a sound wave arrives at the different microphones. From this information the location of the sound source can be identified. Another application of the array is to use it to focus on a specific location, thus enhancing any acoustic activity at that location.

Domain model. The demonstrator domain model holds information on the institute's buildings and the people who work there. The purpose of the model is to be able to answer queries about who lives where, etc. The domain model associates information about coordinates, rooms, persons, etc. The model is organised in a hierarchical structure: areas, buildings and rooms. Rooms are described by an identifier for the room (room number) and the type of the room (office, corridor, etc.). For offices there is also a description of the tenants of the room by a number of attributes (first and second name, affiliation, etc.). The model includes functions that return information about a room or a person. Possible inputs are coordinates or room number for rooms and name for persons, but in principle any attribute can be used as a key and any other attribute can be returned. Further, a path planner is provided, calculating the shortest route between two locations.
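
The paper does not spell the planner out; as a sketch under our own assumptions (an adjacency graph of rooms and corridors, breadth-first search over it, invented room identifiers), the shortest-route function might look like this:

    # Illustrative path planner over an invented room adjacency graph; the
    # actual domain model and planner implementation are not published here.
    from collections import deque

    PLAN = {
        "A2-101": ["corridor-A2"],
        "A2-105": ["corridor-A2"],
        "corridor-A2": ["A2-101", "A2-105", "corridor-B1"],
        "corridor-B1": ["corridor-A2", "B1-201"],
        "B1-201": ["corridor-B1"],
    }

    def shortest_route(start, goal):
        # Breadth-first search: returns the shortest route as a room list.
        queue, seen = deque([[start]]), {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in PLAN[path[-1]]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None                           # no route exists

    print(shortest_route("A2-101", "B1-201"))
    # -> ['A2-101', 'corridor-A2', 'corridor-B1', 'B1-201']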

Dialogue manager. The dialogue manager makes decisions about which actions to take and accordingly sends commands to the output modules via the blackboard. In the present version, the functionality of the dialogue manager is mainly to react to the information coming in from the speech/NLP and gesture modules by sending synchronised commands to the laser pointer and the speech synthesiser modules. Phenomena such as clarification sub-dialogues are not included at present.

Topsy. The basis of the Phase Web paradigm [Manthey 1997], and its incarnation in the form of a program called Topsy, is to represent knowledge and behaviour in the form of hierarchical relationships between the mutual exclusion and co-occurrence of events. (In AI parlance, Topsy is a distributed, associative, continuous-action, partial-order planner that learns from experience.) Relative to multimedia, integrating independent data from multiple media begins with noticing that what ties such otherwise independent inputs together is the fact that they occur simultaneously (more or less). This is also Topsy's basic operating principle, but it is further combined with the notion of mutual exclusion, and thence with hierarchies of such relationships [Manthey 1998].

1.4 Goals

Two major goals are behind the establishment of the workbench. One is to facilitate research, especially on the integration of visual and linguistic (spoken) information, and the other is to make a platform available for postgraduate student projects. An M.Sc. postgraduate programme in intelligent multimedia has recently been set up [IMM 1997], and the workbench will play an important role by enabling students to rapidly build advanced user interfaces including multiple modalities.

References

[Beck 1991] Beck, A.: "Description of the EUROTRA Framework". In: Studies in Machine Translation and Natural Language Processing, vol. 2, ed. C. Copeland et al., 1991.
[Brøndsted 1997] http://www.kom.auc.dk/~tb/nlparser
[Fink et al 1996] Fink, G.A. et al.: "A Distributed System for Integration of Speech and Image Understanding". In: Rogelio Soto (ed.): Proceedings of the International Symposium on Artificial Intelligence, Cancun, Mexico, 1996, pp. 117-126.
[IMM 1997] http://www.kom.auc.dk/cpk/speech/mmui/
[Infovox 1994] "INFOVOX Text-to-speech converter. User's manual". Telia Promoter Infovox, 1994.
[Leth-Espensen and Lindberg 1995] "Application of microphone arrays for remote voice pickup - RVP project, final report". Center for PersonKommunikation, Aalborg University, 1995.
[Manthey 1997] http://www.cs.auc.dk/topsy/
[Manthey 1998] Manthey, M.: "The Phase Web Paradigm". Int'l J. of General Systems, special issue on General Physical Systems Theories, K. Bowden (ed.). In press.
[Minsky 1975] Minsky, M.: "A framework for representing knowledge". In: The Psychology of Computer Vision, P.H. Winston (ed.), New York: McGraw-Hill, 1975, pp. 211-217.
[Power et al 1997] "The graphvite Book", for graphvite Version 1.0. Entropic Cambridge Research Laboratory Ltd, 1997.