Policy Reuse in a General Learning Framework


Policy Reuse in a General Learning Framework Fernando Martínez-Plumed, Cèsar Ferri, José Hernández-Orallo, María José Ramírez-Quintana CAEPIA 2013 September 15, 2013 1 / 31

Table of contents 1 Introduction 2 The gerl System 3 Reusing Past Policies 4 Conclusions and Future Work 2 / 31

Introduction Reusing knowledge acquired in previous learning processes in order to improve or accelerate the learning of future tasks is an appealing idea. The knowledge transferred between tasks can be viewed as a bias in the learning of the target task based on the information learned in the source task. [Diagram: knowledge produced by the learning systems of several source tasks is fed into the learning system of the target task.] 3 / 31

Introduction Research on transfer learning has attracted more and more attention since 1995, under different names and in different areas: learning to learn, life-long learning, knowledge transfer, inductive transfer, multitask learning, knowledge consolidation, incremental/cumulative learning, meta-learning, reinforcement learning, reframing. 4 / 31

Introduction Reinforcement Learning. The knowledge is transferred in several ways (see [Taylor and Stone, 2009] for a survey): Modifying the learning algorithm [Fernandez and Veloso, 2006, Mehta, 2005]. Biasing the initial action-value function [J.Carroll, 2002]. Mapping between actions and/or states [Liu and Stone, 2006, Price and Boutilier, 2003]. 5 / 31

Introduction We present a general rule-based learning setting where operators can be defined and customised for each kind of problem. The generalisation/specialisation operators to use depend on the structure of the data. Heuristics are rethought in an adaptive and flexible way, with a model-based reinforcement learning approach. http://users.dsic.upv.es/~fmartinez/gerl.html 6 / 31

gerl Flexible architecture [Lloyd, 2001] (1/2): Designing customised systems for applications with complex data. Operators can be modified and fine-tuned for each problem. In contrast to: Specialised systems (incremental models [Daumé III and Langford, 2009, Maes et al., 2009]). Feature transformations (kernels [Gärtner, 2005] or distances [Estruch et al., 2006]). Fixed operators (Plotkin's lgg [Plotkin, 1970], Inverse Entailment [Muggleton, 1995], inverse narrowing and CRG [Ferri et al., 2001]). 7 / 31

gerl Flexible architecture [Lloyd, 2001] (2/2): A population of rules and programs evolved as in an evolutionary programming setting (LCS [Holmes et al., 2002]). A Reinforcement Learning-based heuristic. Optimality criteria (MML/MDL) [Wallace and Dowe, 1999]. The Erlang functional programming language [Virding et al., 1996]. This is a challenging proposal, not sufficiently explored in machine learning. 8 / 31

Architecture A given problem (E+ and E−) and a (possibly empty) background knowledge (BK). member([1, 2, 3], 3) → true 9 / 31

Architecture Flexible architecture which works with populations of rules (unconditional/conditional equations) and programs written in Erlang. member([X|Y], Z) when true → member(Y, Z) 9 / 31
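As a sketch of what such rules look like once assembled into a program (our own illustrative Erlang code, not taken from the slides), a recursive member/2 definition consistent with the examples above could be:

%% Hypothetical target program for the membership problem;
%% member([1, 2, 3], 3) evaluates to true.
member([Z | _], Z) -> true;
member([_ | Y], Z) -> member(Y, Z).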

Architecture The population evolves as in an evolutionary programming setting. 9 / 31

Architecture Operators are applied to rules for generating new rules and combined with existing or new programs. 9 / 31

Architecture Reinforcement Learning-based heuristic to guide the learning. 9 / 31

Architecture Appropriate operators + MML-based optimality criteria + Reinforcement Learning-based heuristic. 9 / 31

Architecture [Diagram: the system (environment) holds the problem, the evidence E (e+, e−), the background knowledge, the populations of rules R and programs P, the operators O, the combiners C, and the rule and program generators; the reinforcement module (agent) observes the state and the reward and selects an action {o, ρ}, i.e. an operator and the rule to apply it to.] As a result, this architecture can be seen as a meta-learning system, that is, as a system for writing machine learning systems. 9 / 31

Why Erlang? Erlang/OTP [Virding et al., 1996] is a functional programming language developed by Ericsson, designed from the ground up for writing scalable, fault-tolerant, distributed, non-stop and soft real-time applications. A free and open-source language with a large community of developers behind it. Reflection and higher-order functions. A single representation language: operators, examples, models and background knowledge are all represented in the same language. 10 / 31
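As a small illustration of the reflection point (our own sketch using only the standard erl_scan and erl_parse modules, not gerl code), a rule given as text can be turned into an abstract syntax tree that operators can then inspect and rewrite:

%% In the Erlang shell:
{ok, Tokens, _End} = erl_scan:string("member([X|Y], Z) when true -> member(Y, Z). ").
{ok, Form} = erl_parse:parse_form(Tokens).
%% Form is now an abstract syntax tree (plain Erlang terms) that a
%% generalisation or specialisation operator can traverse and rewrite.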

Operators over Rules and Programs The definition of customised operators is one of the key concepts of our proposal. In gerl, the set of rules R is transformed by applying a set of operators o ∈ O. Operators perform modifications over any subpart of a rule in order to generalise or specialise it. gerl provides two meta-operators able to define well-known generalisation and specialisation operators in Machine Learning. 11 / 31
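As an illustration (a minimal sketch under assumed data structures, not gerl's actual internals), a generalisation operator in the spirit of the replace(L_i, X_i) operators used later in the Playtennis example can be written directly over rules represented as Erlang terms:

-module(gen_ops).
-export([replace/2]).

%% A rule is represented here as {Functor, Args, Result}, where each argument
%% is either {const, C} or {var, N}. replace/2 generalises a rule by turning
%% its I-th argument into a fresh variable.
replace(I, {Functor, Args, Result}) ->
    NewArgs = lists:sublist(Args, I - 1) ++ [{var, I}] ++ lists:nthtail(I, Args),
    {Functor, NewArgs, Result}.

%% Example:
%% gen_ops:replace(2, {playtennis, [{const,sunny}, {const,cool}, {const,normal}, {const,weak}], yes})
%% returns {playtennis, [{const,sunny}, {var,2}, {const,normal}, {const,weak}], yes}.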

RL-based heuristics Heuristics must be overhauled: at each state of the learning process, the system must decide which operator to apply and over which rule. A Reinforcement Learning (RL) [Sutton and Barto, 1998] approach suits our purposes perfectly. Our decision problem is a four-tuple ⟨S, A, τ, ω⟩ where: S: state space (s_t = ⟨R, P⟩). A: O × R (a = ⟨o, ρ⟩). τ: S × A → S (transition function). ω: S × A → R (reward function). 12 / 31

MML/MDL-based Optimality According to the MDL/MML philosophy, the optimality of a program p is defined as the weighted sum of two simpler heuristics, namely, a complexity-based heuristic (which measures the complexity of p) and a coverage heuristic (which measures how well p fits the evidence): Cost(p) = β1 · MsgLen(p) + β2 · MsgLen(E | p) 13 / 31

MML/MDL-based Optimality According to the MDL/MML philosophy, the optimality of a program p is defined as the weighted sum of two simpler heuristics, namely, a complexity-based heuristic (which measures the complexity of p) and a coverage heuristic (which measures how well p fits the evidence): Cost(p) = β1 · MsgLen(p) + β2 · (MsgLen({e ∈ E+ : p ⊭ e}) + MsgLen({e ∈ E− : p ⊨ e})) 13 / 31
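A minimal sketch of this criterion (our own code; msg_len/1 is a crude placeholder coding scheme and Covers is assumed to be a user-supplied coverage test, neither of which is specified in the slides):

-module(mml_cost).
-export([cost/5]).

%% Cost(P) = B1 * MsgLen(P) + B2 * (MsgLen(uncovered E+) + MsgLen(covered E-)).
cost(Program, Pos, Neg, Covers, {B1, B2}) ->
    Uncovered = [E || E <- Pos, not Covers(Program, E)],
    FalsePos  = [E || E <- Neg, Covers(Program, E)],
    B1 * msg_len(Program) + B2 * (msg_len(Uncovered) + msg_len(FalsePos)).

%% Placeholder: message length approximated by the size of the serialised term;
%% a real MML coding would be considerably more refined.
msg_len(Term) -> byte_size(term_to_binary(Term)).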

RL-based heuristics The potentially infinite number of states and actions makes the application of classical RL algorithms infeasible, so states and actions are described by abstract features: States. ṡ_t = ⟨Φ1, Φ2, Φ3⟩: global optimality (Φ1), average size of rules (Φ2), average size of programs (Φ3). Actions. ȧ = ⟨o, φ1, φ2, φ3, φ4, φ5, φ6, φ7, φ8⟩: operator (o), size (φ1), positive coverage rate (φ2), negative coverage rate (φ3), NumVars (φ4), NumCons (φ5), NumFuncs (φ6), NumStructs (φ7), isRec (φ8). Transitions. Transitions are deterministic: a transition τ evolves the current sets of rules and programs by applying the selected operator (together with the rule) and the combiners. Rewards. The optimality criterion seen above is used to feed the rewards. 14 / 31
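A minimal sketch of these abstract descriptions as Erlang records (the record and field names are our own, not gerl's):

%% State features: global optimality, average rule size, average program size.
-record(state, {global_opt, avg_rule_size, avg_prog_size}).

%% Action features: the operator plus eight features of the rule it is applied to.
-record(action, {op, size, pos_cov_rate, neg_cov_rate,
                 num_vars, num_cons, num_funcs, num_structs, is_rec}).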

Modelling the state-value function: using a regression model We use a hybrid between value-function methods (which update a state-value matrix) and model-based methods (which learn models for τ and ω) [Sutton, 1998]. We generalise the action-value function Q(s, a) of Q-learning [Watkins and Dayan, 1992] (which returns quality values q ∈ R) by a supervised model Q_M : S × A → R. gerl uses linear regression by default for generating Q_M, which is retrained periodically from Q. Q_M is used to obtain the best action ȧ for the state ṡ_t as follows: ȧ_t = arg max_{ȧ ∈ A} {Q_M(ṡ_t, ȧ)} 15 / 31
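A minimal sketch of this selection step (assuming the trained regression model Q_M is available as a fun from state and action feature vectors to a predicted quality value):

%% Return the action with the highest predicted quality for the current state
%% (assumes Actions is non-empty).
best_action(QM, State, Actions) ->
    Scored = [{QM(State, A), A} || A <- Actions],
    {_BestQ, BestA} = lists:max(Scored),
    BestA.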

Modelling the state-value function: using a regression model Columns: state (s) = Φ1 Φ2 Φ3; action (a) = o φ1 φ2 φ3 φ4 φ5 φ6 φ7 φ8; quality q (last value in each row). 161.32 17.92 1 1 17.92 0.11 0 0 4 2 0 0 1 161.32 17.92 1 4 17.92 0.11 0 0 4 2 0 0 1 140.81 17.92 1 2 15.33 0.11 0 1 3 2 0 0 0.82 161.32 17.92 1 3 15.33 0.11 0 1 3 2 0 0 0.82 161.32 17.92 1 2 15.33 0.11 0 1 3 2 0 0 0.82 161.32 17.92 1 2 15.33 0.22 0 1 3 2 0 0 0.85 161.32 17.92 1 1 15.33 0.11 0.2 1 3 2 0 0 0.79 Once the system has started, at each step, Q is updated using the following formula: Q[s_t, a_t] ← α · [w_{t+1} + γ · max_{a_{t+1}} Q_M(s_{t+1}, a_{t+1})] + (1 − α) · Q[s_t, a_t]   (1) 16 / 31
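A minimal sketch of the update in equation (1), with Q kept in a map keyed by {State, Action} (an assumed representation, not gerl's actual data structure; w_{t+1} is the reward obtained from the optimality criterion):

%% One learning step: blend the new estimate with the stored value
%% (assumes NextActions is non-empty).
update_q(Q, QM, S, A, Reward, NextS, NextActions, Alpha, Gamma) ->
    BestNext = lists:max([QM(NextS, A2) || A2 <- NextActions]),
    OldQ = maps:get({S, A}, Q, 0.0),
    NewQ = Alpha * (Reward + Gamma * BestNext) + (1 - Alpha) * OldQ,
    maps:put({S, A}, NewQ, Q).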

Example: Playtennis Id e+ 1 playtennis(overcast, hot, high, weak) → yes 2 playtennis(rain, mild, high, weak) → yes 3 playtennis(rain, cool, normal, weak) → yes 4 playtennis(overcast, cool, normal, strong) → yes 5 playtennis(sunny, cool, normal, weak) → yes 6 playtennis(rain, mild, normal, weak) → yes 7 playtennis(sunny, mild, normal, strong) → yes 8 playtennis(overcast, mild, high, strong) → yes 9 playtennis(overcast, hot, normal, weak) → yes Table 1: Set of positive examples E+ (Playtennis problem) Id o 1 replace(L1, X1) 2 replace(L2, X2) 3 replace(L3, X3) 4 replace(L4, X4) Table 3: Set of operators O Id e− 1 playtennis(sunny, hot, high, weak) → yes 2 playtennis(sunny, hot, high, strong) → yes 3 playtennis(rain, cool, normal, strong) → yes 4 playtennis(sunny, mild, high, weak) → yes 5 playtennis(rain, mild, high, strong) → yes Table 2: Set of negative examples E− (Playtennis problem) 17 / 31

Example: Playtennis (Table 1, the set of positive examples E+, as on the previous slide.) Step 0 Id ρ MsgLen(ρ) Opt(ρ) Cov+[ρ] Cov−[ρ] 1 playtennis(overcast, hot, high, weak) → yes 17.92 161.32 1 [1] 0 [] 2 playtennis(rain, mild, high, weak) → yes 17.92 161.32 1 [2] 0 [] 3 playtennis(rain, cool, normal, weak) → yes 17.92 161.32 1 [3] 0 [] 4 playtennis(overcast, cool, normal, strong) → yes 17.92 161.32 1 [4] 0 [] 5 playtennis(sunny, cool, normal, weak) → yes 17.92 161.32 1 [5] 0 [] 6 playtennis(rain, mild, normal, weak) → yes 17.92 161.32 1 [6] 0 [] 7 playtennis(sunny, mild, normal, strong) → yes 17.92 161.32 1 [7] 0 [] 8 playtennis(overcast, mild, high, strong) → yes 17.92 161.32 1 [8] 0 [] 9 playtennis(overcast, hot, normal, weak) → yes 17.92 161.32 1 [9] 0 [] Table 4: Set of rules generated R Columns: state (s) = Φ1 Φ2 Φ3; action (a) = o φ1 φ2 φ3 φ4 φ5 φ6 φ7 φ8; quality q 161.32 17.92 1 1 17.92 0.11 0 0 4 2 0 0 1 161.32 17.92 1 4 17.92 0.11 0 0 4 2 0 0 1 Table 5: Matrix Q 17 / 31

Example: Playtennis Step 1 Id ρ MsgLen(ρ) Opt(ρ) Cov+[ρ] Cov−[ρ] 1 playtennis(overcast, hot, high, weak) → yes 17.92 161.32 1 [1] 0 [] 2 playtennis(rain, mild, high, weak) → yes 17.92 161.32 1 [2] 0 [] 3 playtennis(rain, cool, normal, weak) → yes 17.92 161.32 1 [3] 0 [] 4 playtennis(overcast, cool, normal, strong) → yes 17.92 161.32 1 [4] 0 [] 5 playtennis(sunny, cool, normal, weak) → yes 17.92 161.32 1 [5] 0 [] 6 playtennis(rain, mild, normal, weak) → yes 17.92 161.32 1 [6] 0 [] 7 playtennis(sunny, mild, normal, strong) → yes 17.92 161.32 1 [7] 0 [] 8 playtennis(overcast, mild, high, strong) → yes 17.92 161.32 1 [8] 0 [] 9 playtennis(overcast, hot, normal, weak) → yes 17.92 161.32 1 [9] 0 [] 10 playtennis(sunny, X2, normal, weak) → yes 15.34 158.74 1 [5] 0 [] Table 4: Set of rules generated R Id o 1 replace(L1, X1) 2 replace(L2, X2) 3 replace(L3, X3) 4 replace(L4, X4) Table 3: Set of operators O ȧ_{t=1} = arg max_{ȧ ∈ A} {Q_M(ṡ_t, ȧ)} = ⟨2, 5⟩ Columns: state (s) = Φ1 Φ2 Φ3; action (a) = o φ1 φ2 φ3 φ4 φ5 φ6 φ7 φ8; quality q 161.32 17.92 1 1 17.92 0.11 0 0 4 2 0 0 1 161.32 17.92 1 4 17.92 0.11 0 0 4 2 0 0 1 140.81 17.92 1 2 15.33 0.11 0 1 3 2 0 0 0.82 Table 5: Matrix Q 17 / 31

Example: Playtennis (Steps 1-5) Id ρ MsgLen(ρ) Opt(ρ) Cov+[ρ] Cov−[ρ] 1 playtennis(overcast, hot, high, weak) → yes 17.92 161.32 1 [1] 0 [] 2 playtennis(rain, mild, high, weak) → yes 17.92 161.32 1 [2] 0 [] 3 playtennis(rain, cool, normal, weak) → yes 17.92 161.32 1 [3] 0 [] 4 playtennis(overcast, cool, normal, strong) → yes 17.92 161.32 1 [4] 0 [] 5 playtennis(sunny, cool, normal, weak) → yes 17.92 161.32 1 [5] 0 [] 6 playtennis(rain, mild, normal, weak) → yes 17.92 161.32 1 [6] 0 [] 7 playtennis(sunny, mild, normal, strong) → yes 17.92 161.32 1 [7] 0 [] 8 playtennis(overcast, mild, high, strong) → yes 17.92 161.32 1 [8] 0 [] 9 playtennis(overcast, hot, normal, weak) → yes 17.92 161.32 1 [9] 0 [] 10 playtennis(sunny, X2, normal, weak) → yes 15.34 158.74 1 [5] 0 [] 11 playtennis(overcast, cool, X3, strong) → yes 15.34 158.74 1 [4] 0 [] 12 playtennis(overcast, X2, normal, weak) → yes 15.34 158.74 1 [9] 0 [] 13 playtennis(rain, X2, normal, weak) → yes 15.34 140.81 2 [3,6] 0 [] 14 playtennis(X1, hot, high, weak) → yes 15.34 176.66 1 [1] 1 [1] Table 4: Set of rules generated R Columns: state (s) = Φ1 Φ2 Φ3; action (a) = o φ1 φ2 φ3 φ4 φ5 φ6 φ7 φ8; quality q 161.32 17.92 1 1 17.92 0.11 0 0 4 2 0 0 1 161.32 17.92 1 4 17.92 0.11 0 0 4 2 0 0 1 140.81 17.92 1 2 15.33 0.11 0 1 3 2 0 0 0.82 161.32 17.92 1 3 15.33 0.11 0 1 3 2 0 0 0.82 161.32 17.92 1 2 15.33 0.11 0 1 3 2 0 0 0.82 161.32 17.92 1 2 15.33 0.22 0 1 3 2 0 0 0.85 161.32 17.92 1 1 15.33 0.11 0.2 1 3 2 0 0 0.79 Table 5: Matrix Q 17 / 31

Reusing Past Policies Columns: state (s) = Φ1 Φ2 Φ3; action (a) = o φ1 φ2 φ3 φ4 φ5 φ6 φ7 φ8; quality q 161.32 17.92 1 1 17.92 0.11 0 0 4 2 0 0 1 161.32 17.92 1 4 17.92 0.11 0 0 4 2 0 0 1 140.81 17.92 1 2 15.33 0.11 0 1 3 2 0 0 0.82 161.32 17.92 1 3 15.33 0.11 0 1 3 2 0 0 0.82 161.32 17.92 1 2 15.33 0.11 0 1 3 2 0 0 0.82 161.32 17.92 1 2 15.33 0.22 0 1 3 2 0 0 0.85 161.32 17.92 1 1 15.33 0.11 0.2 1 3 2 0 0 0.79 The abstract representation of states and actions (the Φ and φ features) allows the system not to start from scratch but to reuse the optimal information: actions that were successfully applied to certain states in a previous task are reused when the system reaches a new state with similar features. Thanks to this abstract representation, it does not matter how different the source and target tasks are. 18 / 31

Reusing Past Policies The table Q_S can be viewed as knowledge acquired during the learning process that can be transferred to a new situation. When gerl learns the new task, Q_S is used to train a new model Q_M^T (see footnote 1). Q_S is used from the first learning step onwards and is afterwards updated with the new information acquired using the model Q_M^T. [Diagram: the source-task table Q_S[s, a] ("previous knowledge") is carried over into the target-task table Q_T[s, a], which is then extended, step by step, with new entries ("new knowledge").] 1 We do not transfer the model Q_M^S, since it may not have been retrained with the last information added to the table Q_S (because of the periodicity of training). 19 / 31
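A minimal sketch of this reuse step (the function and variable names are our own; TrainModel stands for whatever regression routine gerl uses to fit Q_M from the table):

%% Start the target task from the source table Q_S rather than an empty table;
%% as explained above, the model is retrained, not transferred.
start_target_task(SourceQ, TrainModel) ->
    TargetQ = SourceQ,
    QM = TrainModel(TargetQ),
    {TargetQ, QM}.   %% the normal learning loop then keeps updating TargetQ.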

An illustrative example of Transfer Knowledge List processing problems as a structured prediction domain: 1 d → c: replaces d by c (trans([t, r, a, d, e]) → [t, r, a, c, e]). 2 e → ing: replaces e by ing at the last position of a list (trans([t, r, a, d, e]) → [t, r, a, d, i, n, g]). 3 d → pez: replaces d by pez at any position of a list (trans([t, r, a, d, e]) → [t, r, a, p, e, z, e]). 4 Prefix over: adds the prefix over (trans([t, r, a, d, e]) → [o, v, e, r, t, r, a, d, e]). 5 Suffix mark: adds the suffix mark (trans([t, r, a, d, e]) → [t, r, a, d, e, m, a, r, k]). 20 / 31
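For reference, our own Erlang implementations of the five target transformations (not the programs that gerl actually synthesises):

-module(list_tasks).
-export([trans_d_to_c/1, trans_e_to_ing/1, trans_d_to_pez/1,
         trans_prefix_over/1, trans_suffix_mark/1]).

%% 1: replace every d by c.
trans_d_to_c(L) -> [case X of d -> c; _ -> X end || X <- L].
%% 2: replace a trailing e by i, n, g.
trans_e_to_ing(L) ->
    case lists:last(L) of
        e -> lists:droplast(L) ++ [i, n, g];
        _ -> L
    end.
%% 3: replace d by p, e, z at any position.
trans_d_to_pez(L) -> lists:append([case X of d -> [p, e, z]; _ -> [X] end || X <- L]).
%% 4: add the prefix o, v, e, r.
trans_prefix_over(L) -> [o, v, e, r | L].
%% 5: add the suffix m, a, r, k.
trans_suffix_mark(L) -> L ++ [m, a, r, k].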

An illustrative example of Transfer Knowledge Since we want to analyse the ability of the system to improve the learning process when reusing past policies: 1 we solve each of the previous problems separately and, 2 we then reuse the policy learnt while solving one problem to solve the rest (including itself). The set of operators used consists of the user-defined operators and a small number of non-relevant operators (20). To make the experiments independent of the operator index, we set up 5 random orders for them. Each problem has 20 positive instances e+ and no negative ones. 21 / 31

An illustrative example of Transfer Knowledge Problem: d → c | e → ing | d → pez | Prefix over | Suffix mark. Steps: 108.68 | 76.76 | 74.24 | 61.28 | 62.28. Table: Results not reusing previous policies (average number of steps). Problem \ Policy from: d → c | e → ing | d → pez | Prefix over | Suffix mark. d → c: 65.68 | 58 | 70.64 | 48.84 | 49.12. e → ing: 66.48 | 50.04 | 56.4 | 45.2 | 45.36. d → pez: 56.36 | 49.6 | 57.32 | 52.24 | 45.84. Prefix over: 58.8 | 48.96 | 60.6 | 43.8 | 46.88. Suffix mark: 102.72 | 64.4 | 67.32 | 56.16 | 57.48. Average: 70.01 | 54.2 | 62.46 | 49.25 | 48.94. Table: Results reusing policies (average number of steps). From each problem we extract 5 random samples of ten positive instances in order to learn a policy from them with each of the five operator orders (5 problems × 5 samples × 5 operator orders = 125 different experiments). 22 / 31

Conclusions and Future Work One of the problems of reusing knowledge from previous learning problems to new ones is the representation and abstraction of this knowledge. In this paper we have investigated how policy reuse can be useful (even in cases where the problems have no operators in common), simply because some abstract characteristics of two learning problems are similar at a more general level. 23 / 31

Conclusions and Future Work There are many other things to explore in the context of gerl: Include features for the operators. A measure of similarity between problems (which would help us to better understand when the system is able to detect these similarities). Apply the ideas in this paper to other kinds of systems (LCS, RL and other evolutionary techniques). Apply these ideas to psychometric tests (IQ tests): odd-one-out problems, Raven's matrices, Thurstone letter series. 24 / 31

Thanks 25 / 31

References I [Daumé III and Langford, 2009] Daumé III, H. and Langford, J. (2009). Search-based structured prediction. [Estruch et al., 2006] Estruch, V., Ferri, C., Hernández-Orallo, J., and Ramírez-Quintana, M. J. (2006). Similarity functions for structured data. An application to decision trees. Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial, 10(29):109-121. [Fernandez and Veloso, 2006] Fernandez, F. and Veloso, M. (2006). Probabilistic policy reuse in a Reinforcement Learning agent. In AAMAS '06, pages 720-727. ACM Press. 26 / 31

References II [Ferri et al., 2001] Ferri, C., Hernández-Orallo, J., and Ramírez-Quintana, M. (2001). Incremental learning of functional logic programs. In FLOPS, pages 233-247. [Gärtner, 2005] Gärtner, T. (2005). Kernels for Structured Data. PhD thesis, Universität Bonn. [Holmes et al., 2002] Holmes, J. H., Lanzi, P., and Stolzmann, W. (2002). Learning classifier systems: New models, successful applications. Information Processing Letters. 27 / 31

References III [J.Carroll, 2002] Carroll, J. (2002). Fixed vs Dynamic Sub-transfer in Reinforcement Learning. In ICMLA 02. CSREA Press. [Liu and Stone, 2006] Liu, Y. and Stone, P. (2006). Value-function-based transfer for reinforcement learning using structure mapping. In AAAI, pages 415-420. [Lloyd, 2001] Lloyd, J. W. (2001). Knowledge representation, computation, and learning in higher-order logic. [Maes et al., 2009] Maes, F., Denoyer, L., and Gallinari, P. (2009). Structured prediction with reinforcement learning. Machine Learning Journal, 77(2-3):271-301. 28 / 31

References IV [Mehta, 2005] Mehta, N. (2005). Transfer in variable-reward hierarchical reinforcement learning. In Proc. of the Inductive Transfer workshop at NIPS. [Muggleton, 1995] Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing. [Plotkin, 1970] Plotkin, G. (1970). A note on inductive generalization. Machine Intelligence, 5. [Price and Boutilier, 2003] Price, B. and Boutilier, C. (2003). Accelerating Reinforcement Learning through implicit imitation. Journal of Artificial Intelligence Research, 19. 29 / 31

References V [Sutton, 1998] Sutton, R. (1998). Reinforcement Learning: An Introduction. MIT Press. [Sutton and Barto, 1998] Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. [Taylor and Stone, 2009] Taylor, M. and Stone, P. (2009). Transfer learning for Reinforcement Learning domains: A survey. Journal of Machine Learning Research, 10(1):1633-1685. [Virding et al., 1996] Virding, R., Wikström, C., and Williams, M. (1996). Concurrent Programming in ERLANG (2nd ed.). Prentice Hall International (UK) Ltd., Hertfordshire, UK. 30 / 31

References VI [Wallace and Dowe, 1999] Wallace, C. S. and Dowe, D. L. (1999). Minimum message length and Kolmogorov complexity. Computer Journal, 42:270-283. [Watkins and Dayan, 1992] Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279-292. 31 / 31