Imitative Policies for Reinforcement Learning


Dana Dahlstrom and Eric Wiewiora
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA, USA

Abstract

We discuss a reinforcement learning framework where learners observe experts interacting with the environment. Our approach is to construct from these observations exploratory policies which favor selection of actions the expert has taken. This imitation strategy can be applied at any stage of learning, and requires neither that information regarding reinforcement be conveyed from the expert to the learner nor that the learner have any explicit knowledge of its reinforcement structure. We show that learning with an imitative policy can be faster than passively observing an expert or learning from direct experience alone. We also show that imitative policies are robust to sub-optimal experts.

1 Introduction

In its standard formulation, reinforcement learning assumes a solipsistic environment where the learner's task is to find a reward-maximizing control policy starting with no prior knowledge and receiving no outside help; the only available information is first-hand experience. In the real world, however, learners are usually part of social networks from which they can cull some knowledge pertaining to their tasks. For such social learners, "Learning is more often a transfer than a discovery" [1].

To take full advantage of social knowledge requires sophisticated communication. Besides techniques for accomplishing tasks, it is useful to know potential pitfalls and how to avoid them, or how to recover afterward. Human language can encode such complex information, but it should be possible to transfer some knowledge without the full generality of language.

A relatively simple way to learn from others is by imitating them. If their behavior is expedient, imitation may yield better performance much faster than lone experimentation. Of course, others may not be perfect, or they may have different goals entirely. Thus it is generally unwise to merely imitate; one should pay attention to first-hand experience as well. When imitated behavior is not perfect, it may be possible to improve upon it by experimentation.

The approach we propose is to use observations of others to guide the learner's exploration. We assume the learner can observe the states and actions of others but not necessarily their rewards. We also assume the learner can identify its own states with those of others, and that the actions it observes others taking are available to it. These assumptions are consistent

with a model in which the learner and those it observes are of the same species but may have different goals.

2 Background

One way to use an expert is simply as another source of experience. In what Whitehead has termed Learning by Watching (LBW), the learner observes an expert's state-action-reinforcement triples and uses them as if they were its own [2]. Lin integrates this strategy with his experience replay technique, which is to record experience sequences and replay them in chronologically backward order to speed up reinforcement propagation through the value function [3, 4]. Price and Boutilier present an LBW approach that doesn't require knowing the expert's actions: a special placeholder action is assumed for every expert state transition [5].

Other approaches use the expert in entirely different ways. For instance, the learning agent of Utgoff and Clouse queries the expert for a recommended action when its confidence in its own decision is too low [6]. Alternatively, the expert might take the initiative, intervening to suggest an action when it sees fit, as suggested later by Clouse and Utgoff [7]. This active role, intentionally guiding the learner's exploration, is appropriately called teaching.

LBW is appealing not only for its simplicity, but for its generality. Very little is demanded of the expert: it need not respond to queries or actively make suggestions to the learner, but must merely act normally in the environment. What's more, using LBW the learner can benefit from observing others even when they exhibit poor performance, though in practice it is more useful to observe an expert than a novice [8].

An underlying assumption in LBW is that the learner knows the reinforcement it would receive in the expert's position.¹ In our approach the learner uses the observed actions to guide its own exploration, so the approach is still applicable when this assumption does not hold. In that the expert guides the learner's exploration, our approach is related to that of Clouse and Utgoff, but does not require the expert to intentionally teach the learner.

2.1 Markov decision processes

Most reinforcement learning techniques model the learning environment as a Markov decision process (MDP) [9]. An MDP is a quadruple (S, A, T, R), where S is the set of states, A is the set of actions, T(s' | s, a) is the probability of transitioning to state s' when performing action a in state s, and R(s, a, s') is the reinforcement received when action a is performed in state s and there is a transition to state s'. The reinforcement learning task is to find a policy π : S → A that maximizes the total discounted reinforcement Σ_{t=0}^∞ γ^t r_t, where r_t is the reinforcement received at time t and γ is the discount rate determining the relative importance of future versus immediate reinforcement.

¹ If the expert's reinforcements are observable and the learner has the same reinforcement structure as the expert, then the learner effectively knows what reinforcement it would receive.
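To make the formalism concrete, the following is a minimal sketch, not taken from the paper, of a discrete MDP represented with Python dictionaries together with the discounted return of a reinforcement sequence. The two-state example and all identifiers are illustrative assumptions.

```python
# Minimal illustrative sketch of a discrete MDP and the discounted return.
# The two-state example and all names are assumptions, not the paper's.

STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# T[(s, a)] is a distribution over next states s'.
T = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s1": 1.0},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s1": 1.0},
}

# R[(s, a, s_next)] is the reinforcement for that transition.
R = {
    ("s0", "right", "s1"): 1.0,  # reaching s1 is rewarded; unlisted transitions give 0
}

def discounted_return(reinforcements, gamma=0.9):
    """Total discounted reinforcement: sum over t of gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(reinforcements))

print(discounted_return([0.0, 1.0, 1.0]))  # 0 + 0.9*1 + 0.81*1 = 1.71
```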

2.2 Q-learning

Q-learning is a reinforcement learning algorithm based on estimating the expected total discounted reinforcement Q(s, a) received when taking action a in state s [9]. An experience is a quadruple (s, a, r, s') where action a is taken in state s, resulting in reinforcement r and a transition to next state s'. We consider an implementation of Q-learning which stores Q values in a tabular format; for each experience, a table entry is updated according to the rule

    Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a'} Q(s', a'))

where α is the learning rate. The greedy policy π_g(s) = argmax_a Q(s, a) is optimal when the Q values are accurate. To guarantee the entries in the table converge to the true Q values, all state-action pairs must be explored infinitely often. This can be ensured by using an exploration strategy such as ε-greedy: with probability ε, choose an action uniformly at random; otherwise choose the greedy action π_g(s).

3 Imitative policies

We present two imitative policies for reinforcement learning and show how they integrate with Q-learning. Action biasing modifies the greedy objective so observed actions are more likely to be taken; estimated policy exploration replaces ε-greedy's uniform distribution with an estimate of the expert's policy. Both strategies are applicable whenever observation of the expert is available, be it prior to direct experience, in parallel with it, or intermittent. Both mechanisms use the count count_e(s, a) of the times the expert has been observed taking action a in state s. Action biasing also uses count(s) and count_e(s): the number of times the learner and the expert, respectively, have visited s.

3.1 Action biasing

Action biasing is similar to Whitehead's Biasing Binary Learning with an External Critic (BB-LEC) framework [2]. In action biasing, the policy is defined with respect to the bias function

    B(s, a) = +b   if count_e(s, a) > 0
              0    if count_e(s) = 0
              −b   otherwise

where the nominal bias b is the magnitude of the reward for taking an action the expert has taken, or the penalty for taking one it hasn't. The learner uses a biased ε-greedy policy

    π_b(s) = a ~ U(A)                                                   with probability ε
             argmax_a [ δ^count(s) B(s, a) + (1 − δ^count(s)) Q(s, a) ]  otherwise

where U(A) is a uniform distribution over the actions and δ ∈ (0, 1) is the decay rate of the bias influence. As the learner gains experience, the influence of the bias approaches zero and π_b converges to the greedy policy π_g.
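The following is a minimal, hypothetical Python sketch, not the authors' code, of the tabular Q-learning update combined with biased ε-greedy action selection. The weighting that decays the bias follows the convergence behavior stated above, and all function names, table structures, and default values other than α = 0.5 and ε = 0.1 are assumptions.

```python
import random
from collections import defaultdict

# Illustrative sketch only; names and structure are assumptions, not the paper's code.

Q = defaultdict(float)      # Q[(s, a)] -> estimated discounted return
count = defaultdict(int)    # count[s] -> learner visits to s
count_e = defaultdict(int)  # count_e[s] and count_e[(s, a)] -> expert observation counts

def q_update(s, a, r, s_next, actions, alpha=0.5, gamma=0.99):
    """Tabular Q-learning update for one experience (s, a, r, s'); gamma here is illustrative."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

def bias(s, a, b=1.0):
    """B(s, a): +b if the expert took a in s, 0 if s was never observed, -b otherwise."""
    if count_e[s] == 0:
        return 0.0
    return b if count_e[(s, a)] > 0 else -b

def biased_epsilon_greedy(s, actions, epsilon=0.1, delta=0.99):
    """Action biasing: mix the bias with Q; the bias weight decays as count(s) grows."""
    if random.random() < epsilon:
        return random.choice(actions)
    w = delta ** count[s]  # assumed form of the decay; approaches 0 with experience
    return max(actions, key=lambda a: w * bias(s, a) + (1 - w) * Q[(s, a)])
```

In use, observing the expert would increment count_e[s] and count_e[(s, a)], while the learner's own steps increment count[s] and call q_update.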

3.2 Estimated policy exploration

Another approach is to estimate the expert's policy and choose exploratory actions probabilistically according to this estimation. For discrete MDPs the observed policy can be estimated as a multinomial distribution: given a count of the times the expert has taken each action in each state, Lidstone's law of succession gives the estimated expert policy

    π_e(a | s) = (count_e(s, a) + λ) / Σ_{a'} [count_e(s, a') + λ]

where the flattening parameter λ > 0 determines how much weight to assign to the uniform prior distribution versus the observed distribution of the expert's actions. This estimate can replace the usual uniform distribution over actions in ε-greedy:

    π(s) = a ~ π_e(· | s)     with probability ε
           argmax_a Q(s, a)   otherwise

When the expert has never been observed in state s, the estimated expert policy π_e(a | s) is a uniform distribution over the actions, and in this case the policy is equivalent to standard ε-greedy. There are other ways to estimate multinomial distributions that may be preferable to Lidstone's law in some circumstances [10]. Estimating the expert's policy as a multinomial distribution accounts for nondeterminism but cannot capture the dependence upon previous states when the Markov assumption does not hold, such as in partially observable Markov decision processes (POMDPs). Sequence prediction techniques are apt for this more complex task. A minimal sketch of estimated policy exploration appears below, after the experimental setup.

4 Experiments with pong

We have experimented with imitative policies in learning a control policy for a pong game. The pong board is a rectangle in which a point-sized ball bounces. The agent controls a 2-unit-wide paddle which can move left or right 1 unit per time step along the bottom edge of the board. The goal is to position the paddle to intercept the ball whenever it reaches the bottom edge. The ball is launched from one of 12 positions on the top edge with a downward velocity of 1 unit per time step and an integral horizontal velocity ranging from −2 to 2 units per time step. An episode begins when the ball is launched and ends when it contacts the bottom edge. Hitting the ball yields a +1 reward; missing it yields a −1 penalty.

We compare the estimated policy exploration and action biasing methods to LBW and standard Q-learning. In our experiments, the learners interact with the environment in three distinct, interleaved phases, with one episode per phase for each cycle:

observation: the agent observes expert experience.
exploration: the agent learns from direct experience.
evaluation: the agent neither observes nor learns; it merely exploits a greedy policy.

We measure performance during the evaluation phase to isolate the effects of the methods on the learned greedy policy, without the noise introduced by random exploration. So that the standard Q-learner gets the same amount of total experience as the others, it explores on its own during the time they observe the expert's actions.

All the learners use a learning rate α = 0.5 and a discount rate γ = . The action biasing learner uses a bias decay rate δ = . The estimated policy exploration learner uses a flattening parameter λ = 1, and its exploration rate ε = 1000/( k) decays based on the number k of exploration episodes. The rest of the learners use a constant exploration rate ε = 0.1.
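The following minimal sketch, referenced at the end of Section 3.2, illustrates estimated policy exploration: Lidstone smoothing of the observed expert action counts, used as the exploratory distribution in ε-greedy. It is a hypothetical illustration, not the authors' implementation; identifiers and default values other than λ = 1 and ε = 0.1 are assumptions.

```python
import random
from collections import defaultdict

# Illustrative sketch; identifiers are assumptions, not the paper's code.

Q = defaultdict(float)      # Q[(s, a)]
count_e = defaultdict(int)  # count_e[(s, a)]: times the expert was seen taking a in s

def estimated_expert_policy(s, actions, lam=1.0):
    """Lidstone estimate: pi_e(a|s) proportional to count_e(s, a) + lambda."""
    weights = [count_e[(s, a)] + lam for a in actions]
    total = sum(weights)
    return [w / total for w in weights]

def imitative_epsilon_greedy(s, actions, epsilon=0.1):
    """Explore by sampling from the estimated expert policy instead of uniformly."""
    if random.random() < epsilon:
        probs = estimated_expert_policy(s, actions)
        return random.choices(actions, weights=probs, k=1)[0]
    return max(actions, key=lambda a: Q[(s, a)])
```

When the expert has never been observed in a state, all weights are equal, so exploration is uniform and the behavior reduces to standard ε-greedy, as noted in Section 3.2.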

[Figure 1: Observing a perfect expert. Figure 2: Observing an imperfect expert. Both figures plot performance over the previous 100 evaluation episodes against the number of observation and exploration episodes, for policy imitation, action biasing, learning by watching, standard Q-learning, and the expert.]

We have created two expert pong agents. The perfect expert predicts where the ball will next contact the bottom edge and moves the paddle there directly. The imperfect expert attempts to keep the paddle directly beneath the ball at all times. When the ball's horizontal velocity is faster than the paddle can move, this strategy results in frequent misses.

4.1 Observing the perfect expert

Figure 1 shows the benefit of imitative policies in conjunction with LBW; the imitative learners did Q updates using the expert experiences in addition to using their respective imitative policies. Both imitative learners learned faster than LBW, and all three outperformed standard Q-learning. Imitative exploration is most useful in the initial stage of learning, when reinforcement has not yet propagated far through the Q table; imitative policies can pick up some of the expert's rewarding behavior even before this happens.

4.2 Observing the imperfect expert

Figure 2 demonstrates the effects of imitating the imperfect expert, which misses the ball approximately 20% of the time. Again the imitative learners did updates using the expert experiences. All the learners that observe the expert initially learn faster than standard Q-learning, but the imitative learners do not outperform LBW. By the end of the run, all learners have reached approximately the same level of performance, but the standard Q-learner's performance has a steeper upward slope than the rest. Once the learners surpass the expert's performance, standard Q-learning has an advantage: the others spend their observation phases watching the imperfect expert while the standard Q-learner explores based on its superior policy. In short, the influence of an imperfect expert makes it more difficult to improve beyond its level.

4.3 Unobservable expert reinforcements

Unlike in LBW, the imitative learners do not need to know the expert's reinforcements. Figure 3 shows the performance of imitative policies without LBW updates. The imitative learners are tested against the standard Q-learner, which again receives additional exploration episodes instead of observations. The results show that even though the imitative learners make half as many updates to their Q tables, they are not dramatically outperformed by Q-learning. In fact, the action biasing learner that imitates the perfect expert does slightly better. If observation is less costly than direct experience, an imitative policy is likely a good choice.
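Below is a minimal, hypothetical sketch of the interleaved observation, exploration, and evaluation protocol described in Section 4, with one episode per phase in each cycle. The environment and agent interfaces (env.reset, env.step, agent.observe_expert, agent.update, agent.act_exploratory, agent.act_greedy, expert.act) are assumed names, not the authors' code.

```python
# Illustrative sketch of the three-phase training cycle; all interfaces are assumptions.

def run_episode(env, choose_action, on_step=None):
    """Run one pong episode; optionally pass each (s, a, r, s') to on_step."""
    s, done, total = env.reset(), False, 0.0
    while not done:
        a = choose_action(s)
        s_next, r, done = env.step(a)
        if on_step is not None:
            on_step(s, a, r, s_next)
        total += r
        s = s_next
    return total

def train(env, agent, expert, cycles=1000):
    evaluation_returns = []
    for _ in range(cycles):
        # Observation: watch the expert act and record its state-action pairs
        # (when LBW is used, the agent also does Q updates from this experience).
        run_episode(env, expert.act, on_step=agent.observe_expert)
        # Exploration: act with the imitative exploratory policy and learn.
        run_episode(env, agent.act_exploratory, on_step=agent.update)
        # Evaluation: exploit the greedy policy without observing or learning.
        evaluation_returns.append(run_episode(env, agent.act_greedy))
    return evaluation_returns
```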

[Figure 3: Observation without reinforcements. Performance over the previous 100 evaluation episodes against the number of observation and exploration episodes, for policy imitation and biased actions with the perfect and imperfect experts, and for standard Q-learning.]

5 Discussion

When they can observe a reasonably good expert, learners can improve much faster than by direct experience alone. If the learner must trade off observation against direct experience, however, this benefit is not without cost. Probably the most problematic situation is imitating a bad expert, in which case the learner is biased away from more rewarding actions. Because our methods decay the influence of observations as the learner gains its own experience, imitating a bad expert will only slow convergence to an optimal policy rather than prevent it altogether.

Future work on imitation in reinforcement learning may include developing a principled way to decay imitation based upon the relative performance of the expert and the learner; our separate exploration and evaluation phases could facilitate this kind of comparison. If there is a cost for observation, another related problem is deciding when it would be beneficial to observe rather than explore. It may also be productive to incorporate imitative policies into other reinforcement learning frameworks such as learning with eligibility traces, generalizing function approximators, or model-based methods.

References

[1] Steven D. Whitehead. A study of cooperative mechanisms for faster reinforcement learning. Technical Report 365, Department of Computer Science, University of Rochester, 1991.

[2] Steven D. Whitehead. A complexity analysis of cooperative mechanisms in reinforcement learning. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), volume 2. AAAI Press / The MIT Press, July 1991.

[3] Long-Ji Lin. Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), volume 2. AAAI Press / The MIT Press, July 1991.

[4] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning, and teaching. Machine Learning, 8(3/4), 1992.

[5] Bob Price and Craig Boutilier. Implicit imitation in multiagent reinforcement learning. In Machine Learning: Proceedings of the Sixteenth International Conference. Morgan Kaufmann, June 1999.

[6] Paul E. Utgoff and Jeffery A. Clouse. Two kinds of training information for evaluation function learning. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), volume 2. AAAI Press / The MIT Press, July 1991.

[7] Jeffery A. Clouse and Paul E. Utgoff. A teaching method for reinforcement learning. In Machine Learning: Proceedings of the Ninth International Workshop (ML92). Morgan Kaufmann, July 1992.

[8] Ming Tan. Multi-agent reinforcement learning: Independent versus cooperative agents. In Machine Learning: Proceedings of the Tenth International Conference, June 1993.

[9] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

[10] Eric Sven Ristad. A natural law of succession. Technical Report CS-TR, Department of Computer Science, Princeton University, May 1995.
