Imitation Learning Using Graphical Models

Deepak Verma and Rajesh P.N. Rao
Dept. of Computer Science & Engineering, University of Washington, Seattle, WA, USA
{deepak,rao}@cs.washington.edu
http://neural.cs.washington.edu/

J.N. Kok et al. (Eds.): ECML 2007, LNAI 4701, pp. 757-764, 2007. (c) Springer-Verlag Berlin Heidelberg 2007

Abstract. Imitation-based learning is a general mechanism for rapid acquisition of new behaviors in autonomous agents and robots. In this paper, we propose a new approach to learning by imitation based on parameter learning in probabilistic graphical models. Graphical models are used not only to model an agent's own dynamics but also the dynamics of an observed teacher. Parameter tying between the agent-teacher models ensures consistency and facilitates learning. Given only observations of the teacher's states, we use the expectation-maximization (EM) algorithm to learn both dynamics and policies within graphical models. We present results demonstrating that EM-based imitation learning outperforms pure exploration-based learning on a benchmark problem (the FlagWorld domain). We additionally show that the graphical model representation can be leveraged to incorporate domain knowledge (e.g., state space factoring) to achieve significant speed-up in learning.

1 Introduction

Learning by imitation is a general mechanism for rapidly acquiring new skills or behaviors in humans and robots. Several approaches to imitation have previously been proposed (e.g., [1,2]). Many of these treat the problem of imitation as trajectory-following, where the goal is to follow the teacher's trajectory as closely as possible. However, imitation often involves the need to infer intentions and goals, which introduces considerable uncertainty into the problem, beyond the uncertainty already present in the observation process and in the environment. Previous models of imitation have typically not been probabilistic and are therefore not geared towards handling uncertainty. There have been some recent efforts in modeling goal-based imitation [3], but these either assume that the dynamics of the environment are given or need to learn the dynamics using a time-consuming exploration stage.

A different approach to imitation is based on ideas from the field of Reinforcement Learning (RL) [4]. In reinforcement learning, the agent is assumed to receive rewards in certain states, and the agent's goal is to learn a state-to-action mapping ("policy") that maximizes the total expected future reward. The computational challenge of solving the RL problem is hard for a variety of reasons: (1) the state space is often exponential in the number of attributes, and (2) for uncertain environments with large state spaces, the agent needs to perform a large amount of exploration to learn a model of the environment before learning a good policy.

These problems can be ameliorated by using imitation [5] (or apprenticeship [6]), where a teacher exhibits the optimal behavior that is observed by the student, or the teacher guides the student to the most important states for exploration. Price and Boutilier formulate this in the RL framework as Implicit Imitation [7], in which the student learns the dynamics of the environment by passively observing the teacher, without any explicit communication regarding what actions to take. This speeds up the learning of policies. However, these approaches rely on knowing or inferring an explicit reward function in the environment, which may not always be available or easy to infer.

In this paper, we propose a new approach to imitation that is based on probabilistic Graphical Models (GMs). We pose the problem of imitation learning as learning the parameters of the underlying GM for the mentor's and observer's behavior (we use the terms mentor/teacher and observer/student interchangeably in this paper). To facilitate the transfer of knowledge from mentor to observer, we tie the parameters of the dynamics for the mentor with those of the observer, and update the observer's policy using the learned mentor policy. Parameters are learned using the expectation-maximization (EM) algorithm for learning in GMs from partial data. Our approach provides a principled approach to imitation based completely on an internal GM representation, allowing us to leverage the growing number of efficient inference and learning techniques for GMs.

2 Graphical Models for Imitation

Notation: We use capital letters for variables and lower case letters to denote specific instances. We assume there are two agents, the observer A^o and the mentor A^m, operating in the environment.¹,² Let Ω_S be the set of states in the environment and Ω_A the set of all possible actions available to the agent (both finite). At time t, the agent is in state S_t and executes action A_t. The agent's state changes in a stochastic manner given by the transition probability P(S_{t+1} | S_t, A_t), which is assumed to be independent of t, i.e., P(S_{t+1} = s' | S_t = s, A_t = a) = τ_{s'sa}. When obvious from context, we use s for S_t = s and a for A_t = a, etc. For each state s and action a, there is a real-valued reward R^m(s, a) for the mentor (R^o(s, a) for the observer) associated with being in state s and executing the action a (with negative values denoting undesirable states or the cost of the action). The parameters described above define a Markov Decision Process (MDP) [9]. Solving an MDP typically involves computing an optimal policy a = π(s) that maximizes total expected future reward (either finite-horizon cumulative reward or discounted infinite-horizon cumulative reward) when action a is executed in state s.

¹ We use the superscript to distinguish the two agents and omit it for common variables (e.g., the dynamics of the environment).
² For simplicity of exposition, we assume that the agents operate (non-interactively) in the same environment. However, as discussed in [8], this assumption is not essential and one can apply the techniques discussed here to the more general setting where observer and mentor(s) have different action and state spaces.
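As a concrete reading of this notation, the following minimal sketch (ours, not the authors' code; the array names, the uniform initialization, and the FlagWorld sizes are assumptions borrowed from Sec. 3.1) stores the tabular quantities that the observer must estimate:

```python
import numpy as np

# Illustrative sizes: 33 FlagWorld locations x 2^3 flag configurations = 264 states,
# and 4 actions (N, E, S, W).
n_states, n_actions = 264, 4

# Transition model, tied between mentor and observer:
# tau[s_next, s, a] = P(S_{t+1} = s_next | S_t = s, A_t = a) = tau_{s'sa}
tau = np.full((n_states, n_states, n_actions), 1.0 / n_states)

# Mentor policy to be estimated: pi_m[a, s] = pi^m(a | s)
pi_m = np.full((n_actions, n_states), 1.0 / n_actions)

# Per-agent reward R^o(s, a); not required by the imitation formulation itself.
reward_o = np.zeros((n_states, n_actions))
```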

In a typical Reinforcement Learning problem, the dynamics and the reward function are not known, and one cannot therefore compute an optimal policy directly. One can learn both these functions by exploration, but this requires the agent to execute a large number of exploration steps before an optimal policy can be computed. Learning can be greatly sped up via implicit imitation [7], which involves an agent (the observer) observing another agent (the mentor) who has similar goals. The main idea is to allow the agent to quickly learn the parameters in the relevant portion of the state space, thereby cutting down on the exploration required to compute a near-optimal policy.

We assume that the mentor follows a stationary policy π^m(s) which defines its behavior completely. The observer is only able to observe the sequence of states the mentor has been in (S^m_{1:t}) and not the actions: this is important because some of the most useful forms of imitation learning are those in which the teacher's actions are not available, e.g., when a robot must learn by watching a human; in such a scenario, the robot can observe body poses but has no access to the human's actions (muscle or motor commands). The task of the observer is then to compute the best estimate of the dynamics τ̂ and the mentor policy π̂^m, given its own history S^o_{1:t}, A^o_{1:t} and the mentor's state history S^m_{1:t}. Note that π^m can be completely independent of the observer's reward function R^o: in fact, the problem as formulated above does not require the introduction of a reward function at all. The goal is simply to imitate the mentor by estimating and executing the mentor's policy. In the special case where the mentor is optimizing the same reward function as the observer, π^m becomes the optimal MDP policy. Note that since the observer cannot see the actions that the mentor took and the transition parameters are not given, the problem is different from other approaches which speed up RL via imitation [8,10].

2.1 Generative Graphical Model

Both the mentor and the observer are solving an MDP. One key observation we make is that, given the mentor policy, the action choice and dynamics can be modeled easily using a generative model based on the well-known graphical model for an MDP shown in Fig. 1(a). One does not need to know the mentor's reward model, as π^m completely explains the observed mentor state sequence. The figure shows the 2-slice representation of the Dynamic Bayesian Network (DBN) used to model the imitation problem. Since we are assuming that the two agents are operating in the same environment, they have the same transition parameters (τ^m = τ^o = τ). Note that the two graphical models (for the mentor and the observer respectively) are disconnected, as the two agents are non-interacting. The mentor's actions are guided by the optimal mentor policy P(A^m_t = a | S^m_t = s) = π^m(a|s) and the observer's actions by the policy P(A^o_t = a | S^o_t = s) = π^o_t(a|s). Unlike the mentor, the observer updates its policy over time (hence the subscript t on π^o_t). We require only the mentor to have a stationary policy. The mentor observations s^m_{1:T} are generated by sampling the DBN. In our experiments, when a goal state is reached, we jump to the start state in the next step. T thus represents the total number of steps taken by the agent, which could span multiple episodes of reaching a goal state.
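To make the generative view concrete, the sketch below (an illustration under the array layout assumed above; `start_state` and `goal_states` are hypothetical arguments) samples a mentor state sequence s^m_{1:T} from the mentor half of the DBN, including the jump back to the start state after a goal state is reached:

```python
import numpy as np

def sample_mentor_states(tau, pi_m, start_state, goal_states, T, rng=np.random.default_rng(0)):
    """Sample s^m_{1:T} from the mentor's half of the DBN. Actions are drawn internally
    but not returned, mirroring the assumption that the observer sees states only."""
    n_states, _, n_actions = tau.shape
    states = [start_state]
    s = start_state
    for _ in range(T - 1):
        if s in goal_states:
            s = start_state                           # jump to the start state in the next step
        else:
            a = rng.choice(n_actions, p=pi_m[:, s])   # A^m_t ~ pi^m(. | s)
            s = rng.choice(n_states, p=tau[:, s, a])  # S^m_{t+1} ~ P(. | s, a)
        states.append(int(s))
    return states
```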

Fig. 1. Model and Domain for Imitation. (a) Graphical model representation for imitation: the two-slice DBNs for the mentor (S^m_t, A^m_t) and the observer (S^o_t, A^o_t), with the transition parameters τ_{s'sa} tied between the two models. (b) The FlagWorld domain, with start state S, goal G, and flags F1, F2 and F3.

3 Imitation Via Parameter Learning

Our approach to imitation is based on estimating the unknown parameters θ = (τ, π^m) of the graphical model in Fig. 1(a) given the observed data as evidence, i.e., θ̂ = (τ̂, π̂^m) = argmax_θ P(θ | s^m_{1:T}, s^o_{1:T}, a^o_{1:T}). Note that the evidence does not include the mentor actions A^m_{1:T}. This means that the data is incomplete, as not all nodes of the graphical model are observed. A well-known approach to learning the parameters of a GM from incomplete data [11] is to use the expectation-maximization (EM) algorithm [12]. Although any parameter learning method could be used, we use EM in the present study since it is a general-purpose, well-understood algorithm widely used in machine learning. The EM algorithm involves starting with an initial estimate θ^0 (chosen randomly or incorporating any prior knowledge), which is then iteratively improved by performing the following two steps:

Expectation: The current set of parameters θ^i is used to compute a distribution (expectation) over the hidden nodes: h(a^m_{1:T}) = P(a^m_{1:T} | θ^i, s^m_{1:T}, s^o_{1:T}, a^o_{1:T}). This allows the expected sufficient statistics to be computed for the complete data set.

Maximization: The distribution h is then used to compute the new parameters θ^{i+1} which maximize the (expected) log-likelihood of the evidence:

θ^{i+1} = argmax_θ Σ_{a^m_{1:T}} h(a^m_{1:T}) log P(s^m_{1:T}, a^m_{1:T}, s^o_{1:T}, a^o_{1:T} | θ)

When states and actions are discrete, the new estimate can be computed by simply using the expected counts. The two steps above are performed alternately until convergence. The method is guaranteed to improve performance in each iteration, in the sense that the incomplete log-likelihood of the data (log P(s^m_{1:T}, s^o_{1:T}, a^o_{1:T} | θ^i)) increases in every iteration and converges to a local maximum [12]. We then use the estimate θ̂ to control the observer. In particular, the observer combines the learned mentor policy π̂^m with an exploration strategy to arrive at the policy π^o_t.
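Because the mentor's states are observed and only its actions are hidden, the E-step posterior in this model factorizes over time: P(A^m_t = a | s^m_t, s^m_{t+1}, θ) ∝ π^m(a | s^m_t) τ(s^m_{t+1} | s^m_t, a). The sketch below is our illustration of one EM iteration under that observation (using the array layout assumed earlier; the final time step, episode-reset transitions, and smoothing are handled only crudely here):

```python
import numpy as np

def em_step(tau, pi_m, mentor_states, observer_transitions, alpha=1e-3):
    """One EM iteration. E-step: per-step posterior over the hidden mentor actions.
    M-step: re-estimate tau and pi^m from expected (mentor) and observed (observer) counts.
    alpha is a small smoothing pseudo-count."""
    n_states, _, n_actions = tau.shape
    trans_counts = np.full((n_states, n_states, n_actions), alpha)  # counts for tau[s', s, a]
    act_counts = np.full((n_actions, n_states), alpha)              # counts for pi_m[a, s]

    # E-step over the mentor state sequence (last step omitted for brevity).
    for s, s_next in zip(mentor_states[:-1], mentor_states[1:]):
        post = pi_m[:, s] * tau[s_next, s, :]    # P(A^m_t = a | s, s') up to normalization
        post /= post.sum()
        act_counts[:, s] += post                 # expected counts for the mentor policy
        trans_counts[s_next, s, :] += post       # expected counts for the tied dynamics

    # Observer transitions (s, a, s') are fully observed; tau is tied across agents.
    for s, a, s_next in observer_transitions:
        trans_counts[s_next, s, a] += 1.0

    # M-step: normalize the counts.
    new_tau = trans_counts / trans_counts.sum(axis=0, keepdims=True)
    new_pi_m = act_counts / act_counts.sum(axis=0, keepdims=True)
    return new_tau, new_pi_m
```

Iterating this step until the parameters stop changing plays the role of the EM loop described above.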

3.1 Parameter Learning Results

Domain: We tested our approach on a benchmark problem known as the FlagWorld domain [13], shown in Fig. 1(b). The agent's objective is to reach the goal state G starting from the state S and pick up a subset of the three flags located at states F1, F2 and F3. It receives a reward of 1 point for each flag picked up, but rewards are discounted by a factor of γ = 0.99 at each time step until the goal is reached; the latter constraint favors shortest paths to the goal. The environment is a standard maze environment used in RL [4] in that each action (N, E, S, W) takes the agent to the intended state with a high probability (0.9) and to a state perpendicular to the intended state with a small probability (0.1). Probability mass that would go into a wall or outside the maze is assigned to the state in which the action was taken. This domain is interesting in that there are 264 states (33 locations, augmented with a boolean attribute for each flag picked up), resulting in a large number of parameters to be learned (τ(s, a, :) and π^m(a|s) for every state-action pair). However, the optimal policy path is sparse and hence only a small subset of parameters needs to be learned to compute a near-optimal policy, making the domain ideal for demonstrating the utility of imitation as a medium to speed up RL.

Exploration versus Exploitation: We used the ε-greedy method to trade off exploration of the domain against exploitation of the current learned policy: a random action is chosen with probability ε, with ε gradually decreased over time to favor exploration initially and exploitation of the learned policy in later time steps.
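One possible reading of this ε-greedy schedule is sketched below (the exponential decay and the constants eps0 and decay are our assumptions; the paper only states that ε is gradually decreased over time):

```python
import numpy as np

def observer_action(s, pi_m_hat, t, rng, eps0=1.0, decay=0.999):
    """epsilon-greedy action choice for the observer: explore with probability eps_t,
    otherwise exploit the current estimate of the mentor policy."""
    n_actions = pi_m_hat.shape[0]
    eps_t = eps0 * (decay ** t)                 # favors exploration early, exploitation later
    if rng.random() < eps_t:
        return int(rng.integers(n_actions))     # random exploratory action
    return int(np.argmax(pi_m_hat[:, s]))       # greedy w.r.t. the learned mentor policy
```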

Results: The results of EM-based learning are shown in Fig. 2(a) (averaged over 5 runs). The parameters were learned in batch mode, with T increased in fixed increments, and the reward received in the final steps of each batch was reported. The average reward received is shown in the top right corner. Also shown are the error in the parameters (mean absolute difference w.r.t. the true parameters³), the log-likelihood of the learned parameters, and the value function of the start state under the current estimate of the observer policy, V_{π̂^o}(S), w.r.t. the true transition parameters. The results show that the observer is able to learn the mentor policy to a high degree of accuracy, though not perfectly. The uncertain dynamics of the environment lead it to collect less reward than the mentor, as the optimal policy is not learned everywhere. An important point to note is that the error in the parameters is still quite high even when the observer policy is quite good, thereby confirming the intuition that only a small (relevant) subset of parameters needs to be learned well before the agent can start exploiting a learned policy.

Figure 2(b) compares the relative quality of the learned policy with a number of pure exploration-based techniques used in [13]. The bars represent the average discounted reward obtained in the 2nd stage, i.e., in the steps following an initial 1st stage of exploration. For ParamImit (our algorithm), the average is taken after only a short initial period of exploration. The rightmost bar is the Mentor value. As can be seen, ParamImit outperforms all the exploration strategies with far less experience.

³ The error between uniformly random parameters and the true parameters is 1.5 for π^m and 1.75 for τ.

Fig. 2. Imitation Learning Results for the FlagWorld Domain. (a) (Clockwise) Error in the parameters (mean absolute difference w.r.t. the true parameters), average reward received, the log-likelihood of the learned parameters, and the value function of the start state V_{π̂^o}(S) w.r.t. the true transition parameters. (b) Comparison of the learned policy (ParamImit) with some popular exploration techniques (measured in terms of average discounted reward obtained per fixed block of steps). ParamImit outperforms all the pure exploration-based methods.

3.2 Factored Graphical Model

A major advantage of using a graphical-models-based approach to imitation is the ability to leverage domain knowledge to speed up learning. For example, the number of true parameters in the FlagWorld is actually much smaller than the number learned in the previous section, since there are only 33 locations for which the transition parameters need to be learned: the dynamics are the same irrespective of which flags have been picked up. To reflect this fact, we can factor the mentor state S^m_t into a location L^m_t and a flag status ("Picked Flag") variable PF^m_t, as shown in Fig. 3(a) (and similarly for the observer). This reduces the number of transition parameters significantly (from τ_{s'sa} to τ_{l'la}).

We can incorporate domain knowledge about the flags by defining the CPT P(PF_{t+1} | L_{t+1}, PF_t) as⁴:

P(PF_{t+1} | L_{t+1}, PF_t) = δ(PF_{t+1}, pf(PF_t, i))   if L_{t+1} = F_i
                            = δ(PF_{t+1}, PF_t)           otherwise

where pf(PF_t, i) is the deterministic function which maps the old value of PF_t to one in which the i-th flag is picked up.

⁴ This is a common trick used in GMs to encode deterministic domain knowledge.
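Because this CPT is deterministic, it can be hard-coded rather than learned. A minimal sketch is given below (the bit-mask encoding of PF and the `flag_locations` list are our assumptions, not details from the paper):

```python
def pf(pf_old, i):
    """Deterministic flag update: return the flag status with the i-th flag marked as
    picked up (flag status encoded here as a 3-bit mask)."""
    return pf_old | (1 << i)

def flag_cpt(pf_next, l_next, pf_old, flag_locations):
    """P(PF_{t+1} = pf_next | L_{t+1} = l_next, PF_t = pf_old): a delta distribution,
    so no parameters need to be learned for the flag-status variable."""
    if l_next in flag_locations:
        i = flag_locations.index(l_next)   # the flag F_i located at l_next
        return 1.0 if pf_next == pf(pf_old, i) else 0.0
    return 1.0 if pf_next == pf_old else 0.0
```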

Fig. 3. Fast Learning using Factored Graphical Models. (a) Factored model for FlagWorld (only the mentor model is shown): the state is split into location L^m_t and picked-flag status PF^m_t, with location transition parameters τ_{l'la} and policy π^m(a | l, pf). (b) Results using the factored model. Note the speed-up in learning w.r.t. the unfactored case (Fig. 2(a)).

The results of EM-based parameter learning for the factored graphical model are shown in Fig. 3(b). As expected, the error in the transition parameters goes down much more rapidly than in the unfactored case (compare with Fig. 2(a)).

4 Conclusion

This paper introduces a new framework for learning by imitation based on modeling the imitation process in terms of probabilistic graphical models. Imitative policies are learned in a principled manner using the expectation-maximization (EM) algorithm. The model achieves transfer of knowledge by tying the parameters for the mentor's dynamics with those of the observer. Our results⁵ demonstrate that the mentor's policy can be estimated directly from observations of the mentor's state sequences and that significant speed-up in learning can be achieved by exploiting the graphical models framework to factor the state space in accordance with domain knowledge. Our current work is focused on testing the approach more exhaustively, especially in the context of robotic imitation. Not only do graphical models provide a computationally efficient framework for general imitation, they are also being used for modeling behavior [14]. An exciting prospect of using graphical models for imitation is the ease of extension to models with more abstraction, including partially observable, hierarchical, and relational models.

⁵ Additional results are presented in the extended version of the paper, available at http://neural.cs.washington.edu/. In particular, we show how learning can be further sped up by incorporating reward information collected along the way. We also demonstrate the generality of parameter learning by extending the graphical model to learn task-oriented policies.

Acknowledgments. This material is based upon work supported by ONR, the Packard Foundation, and NSF Grants 13335 and 5.

References

1. Schaal, S.: Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 233-242 (1999)
2. Dautenhahn, K., Nehaniv, C.: Imitation in Animals and Artifacts. MIT Press, Cambridge, MA (2002)
3. Verma, D., Rao, R.P.N.: Goal-based imitation as probabilistic inference over graphical models. In: NIPS 18 (2006)
4. Sutton, R.S., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)
5. Atkeson, C.G., Schaal, S.: Robot learning from demonstration. In: Proc. 14th ICML, pp. 12-20 (1997)
6. Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement learning. In: ICML, pp. 1-8 (2004)
7. Price, B., Boutilier, C.: Accelerating reinforcement learning through implicit imitation. JAIR 19, 569-629 (2003)
8. Price, B., Boutilier, C.: A Bayesian approach to imitation in reinforcement learning. In: IJCAI, pp. 712-720 (2003)
9. Boutilier, C., Dean, T., Hanks, S.: Decision-theoretic planning: Structural assumptions and computational leverage. JAIR 11, 1-94 (1999)
10. Ratliff, N.D., Bagnell, J.A., Zinkevich, M.A.: Maximum margin planning. In: ICML, pp. 729-736 (2006)
11. Heckerman, D.: A tutorial on learning with Bayesian networks. Technical report, Microsoft Research, Redmond, Washington (1995)
12. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1-38 (1977)
13. Dearden, R., Friedman, N., Andre, D.: Model-based Bayesian exploration. In: UAI-99, San Francisco, CA, pp. 150-159 (1999)
14. Griffiths, T.L., Tenenbaum, J.B.: Structure and strength in causal induction. Cognitive Psychology 51(4), 334-384 (2005)