Accelerate Learning Processes by Avoiding Inappropriate Rules in Transfer Learning for Actor-Critic

Similar documents
Reinforcement Learning by Comparing Immediate Reward

Axiom 2013 Team Description Paper

Georgetown University at TREC 2017 Dynamic Domain Track

Lecture 10: Reinforcement Learning

ISFA2008U_120 A SCHEDULING REINFORCEMENT LEARNING ALGORITHM

Continual Curiosity-Driven Skill Acquisition from High-Dimensional Video Inputs for Humanoid Robots

AMULTIAGENT system [1] can be defined as a group of

Iterative Cross-Training: An Algorithm for Learning from Unlabeled Web Pages

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

TD(λ) and Q-Learning Based Ludo Players

Speeding Up Reinforcement Learning with Behavior Transfer

Learning and Transferring Relational Instance-Based Policies

Modeling user preferences and norms in context-aware systems

High-level Reinforcement Learning in Strategy Games

On the Combined Behavior of Autonomous Resource Management Agents

An investigation of imitation learning algorithms for structured prediction

Laboratorio di Intelligenza Artificiale e Robotica

Learning Optimal Dialogue Strategies: A Case Study of a Spoken Dialogue Agent for

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

Automating the E-learning Personalization

Improving Action Selection in MDP s via Knowledge Transfer

FF+FPG: Guiding a Policy-Gradient Planner

A Case-Based Approach To Imitation Learning in Robotic Agents

Probabilistic Latent Semantic Analysis

A Reinforcement Learning Variant for Control Scheduling

Exploration. CS : Deep Reinforcement Learning Sergey Levine

Applying Fuzzy Rule-Based System on FMEA to Assess the Risks on Project-Based Software Engineering Education

Transfer Learning Action Models by Measuring the Similarity of Different Domains

Software Maintenance

On-Line Data Analytics

Reducing Features to Improve Bug Prediction

Laboratorio di Intelligenza Artificiale e Robotica

Task Completion Transfer Learning for Reward Inference

Improving Fairness in Memory Scheduling

2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases

Lecture 1: Machine Learning Basics

Seminar - Organic Computing

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

Learning Prospective Robot Behavior

College Pricing and Income Inequality

Erkki Mäkinen State change languages as homomorphic images of Szilard languages

College Pricing and Income Inequality

A heuristic framework for pivot-based bilingual dictionary induction

Carter M. Mast. Participants: Peter Mackenzie-Helnwein, Pedro Arduino, and Greg Miller. 6 th MPM Workshop Albuquerque, New Mexico August 9-10, 2010

Truth Inference in Crowdsourcing: Is the Problem Solved?

(Sub)Gradient Descent

AUTOMATED TROUBLESHOOTING OF MOBILE NETWORKS USING BAYESIAN NETWORKS

2 User Guide of Blackboard Mobile Learn for CityU Students (Android) How to download / install Bb Mobile Learn? Downloaded from Google Play Store

Task Completion Transfer Learning for Reward Inference

What is a Mental Model?

Using focal point learning to improve human machine tacit coordination

Functional Skills Mathematics Level 2 assessment

Agent-Based Software Engineering

Rule Learning With Negation: Issues Regarding Effectiveness

Robot Learning Simultaneously a Task and How to Interpret Human Instructions

Regret-based Reward Elicitation for Markov Decision Processes

Physics 270: Experimental Physics

LEARNING AGREEMENT FOR STUDIES

Utilizing Soft System Methodology to Increase Productivity of Shell Fabrication Sushant Sudheer Takekar 1 Dr. D.N. Raut 2

Parsing of part-of-speech tagged Assamese Texts

Learning Methods for Fuzzy Systems

A Case Study: News Classification Based on Term Frequency

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Data Fusion Models in WSNs: Comparison and Analysis

Action Models and their Induction

Adaptive Generation in Dialogue Systems Using Dynamic User Modeling

Software Security: Integrating Secure Software Engineering in Graduate Computer Science Curriculum

Generating Test Cases From Use Cases

Chapter 2. Intelligent Agents. Outline. Agents and environments. Rationality. PEAS (Performance measure, Environment, Actuators, Sensors)

Automatic Discretization of Actions and States in Monte-Carlo Tree Search

Visual CP Representation of Knowledge

FOR TEACHERS ONLY. The University of the State of New York REGENTS HIGH SCHOOL EXAMINATION PHYSICAL SETTING/PHYSICS

BODY LANGUAGE ANIMATION SYNTHESIS FROM PROSODY AN HONORS THESIS SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE OF STANFORD UNIVERSITY

CS177 Python Programming

Abnormal Activity Recognition Based on HDP-HMM Models

Developing True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability

An OO Framework for building Intelligence and Learning properties in Software Agents

Artificial Neural Networks written examination

Discriminative Learning of Beam-Search Heuristics for Planning

INTERMEDIATE ALGEBRA PRODUCT GUIDE

WE GAVE A LAWYER BASIC MATH SKILLS, AND YOU WON T BELIEVE WHAT HAPPENED NEXT

Abstractions and the Brain

Evolutive Neural Net Fuzzy Filtering: Basic Description

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Running Head: STUDENT CENTRIC INTEGRATED TECHNOLOGY

Detecting English-French Cognates Using Orthographic Edit Distance

An Empirical and Computational Test of Linguistic Relativity

Lab Reports for Biology

GRADUATE STUDENT HANDBOOK Master of Science Programs in Biostatistics

Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data

Knowledge-Based - Systems

Language Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus

Reduce the Failure Rate of the Screwing Process with Six Sigma Approach

Mining Topic-level Opinion Influence in Microblog

Agents and environments. Intelligent Agents. Reminders. Vacuum-cleaner world. Outline. A vacuum-cleaner agent. Chapter 2 Actuators

KIS MYP Humanities Research Journal

Generative models and adversarial training

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics

Intelligent Agents. Chapter 2. Chapter 2 1

Stopping rules for sequential trials in high-dimensional data


Departmental Bulletin Paper / 紀要論文: Accelerate Learning Processes by Avoiding Inappropriate Rules in Transfer Learning for Actor-Critic. TAKANO, Toshiaki; TAKASE, Haruhiko; TSURUOKA, Shinji. Proceedings of the Second International Workshop on Regional Innovation Studies (IWRIS2010). http://hdl.handle.net/10076/11661

Accelerate Learning Processes by Avoiding Inappropriate Rules in Transfer Learning for Actor-Critic

Toshiaki TAKANO, Haruhiko TAKASE, Hiroharu KAWANAKA and Shinji TSURUOKA
Graduate School of Engineering, Mie University, Japan
Graduate School of Regional Innovation Studies, Mie University, Japan
takano@ip.elec.mie-u.ac.jp

Abstract

This paper aims to accelerate the learning processes of the actor-critic method, one of the major reinforcement learning algorithms, by transfer learning. In general, reinforcement learning is used to solve optimization problems: learning agents autonomously acquire a policy to accomplish the target task, and to do so they require long learning processes of trial and error. Transfer learning is an effective method to accelerate the learning processes of machine learning algorithms; it accelerates learning by using prior knowledge in the form of a policy for a source task. We propose an effective transfer learning algorithm for the actor-critic method. The two basic issues in transfer learning are how to select an effective source policy and how to reuse it without negative transfer; in this paper, we mainly discuss the latter. We propose a reuse method based on our selection method, which uses the forbidden rule set. The forbidden rule set is the set of rules that cause immediate failure of a task, and it is used to foresee the similarity between a source policy and the target policy. Agents should not transfer the inappropriate rules in the selected policy. In actor-critic, a policy is constructed from two parameter sets: action preferences and state values. To avoid inappropriate rules, agents reuse only reliable action preferences, and state values that imply preferred actions. We perform simple experiments to show the effectiveness of the proposed method. In conclusion, the proposed method accelerates the learning processes for the target tasks.

Keywords: Reinforcement learning, actor-critic method, transfer learning

1 Introduction

Acceleration of learning processes is one of the important issues in machine learning, especially in reinforcement learning [1, 2]. Reinforcement learning makes an agent's decision rules for its actions suitable for a given environment. Since agents have no information to solve a target task at the beginning of learning, they have to gather information by trial and error, which requires long learning processes to acquire enough information. Therefore, many researchers try to accelerate learning processes [3, 4, 5].

Transfer learning [6] is an effective method to accelerate learning processes in some machine learning algorithms. It is based on the idea that knowledge used to solve source tasks, called source policies, accelerates the learning process of a target task. The important processes in transfer learning for reinforcement learning are the selection of effective source policies and the reuse of the selected policies; we focus on the latter. In this paper, we aim to propose an effective reuse method for a policy selected by our previously proposed method [7]. In detail, agents reuse each parameter of reinforcement learning in the selected policy. Here, we treat the actor-critic method, one of the major reinforcement learning algorithms.

2 Accelerating a Learning Process by Transfer Learning

In this section, we briefly explain the actor-critic method and the framework of transfer learning.

2.1 Actor-critic Method

Actor-critic is one of the popular reinforcement learning algorithms [1].
It finds a policy Π that maximizes the quantity

    R_t = Σ_τ γ^τ r_{t+τ}    (1)

for given tasks. Here, r_{t+τ} is the reward obtained τ steps after time t from a stochastic reward function R : S × A → ℝ, and γ is a predefined parameter called the discount rate. S is a finite set of states and A is a finite set of actions. The actor-critic method has a structure separated into an actor and a critic (see Fig. 1). The actor decides an action according to action preferences: an action preference p(s, a) is a parameter defined as the preference of the action a ∈ A in the state s ∈ S. The critic evaluates the action based on the reward r and the state values: a state value v(s) represents the estimated value of the state s. Each state value is updated according to the reward, and each action preference is updated according to the state values, repeatedly.
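To make this update cycle concrete, the following Python sketch shows a minimal tabular actor-critic episode. It is a generic illustration rather than the authors' implementation; the environment interface (env.reset / env.step) and the parameter names alpha, beta, and gamma are assumptions, chosen to match the parameter names that appear later in Section 4.1.

    import numpy as np

    def softmax(prefs):
        # Numerically stable softmax over the action preferences p(s, .)
        z = prefs - prefs.max()
        e = np.exp(z)
        return e / e.sum()

    def actor_critic_episode(env, p, v, alpha=0.05, beta=0.05, gamma=0.95):
        """Run one learning episode of tabular actor-critic.
        p: action-preference table of shape (n_states, n_actions)
        v: state-value table of shape (n_states,)"""
        s = env.reset()
        done = False
        while not done:
            # Actor: draw an action from the softmax of the preferences p(s, .)
            a = int(np.random.choice(len(p[s]), p=softmax(p[s])))
            s_next, r, done = env.step(a)
            # Critic: TD error based on the reward and the state values
            target = r if done else r + gamma * v[s_next]
            delta = target - v[s]
            v[s] += alpha * delta        # update the state value
            p[s, a] += beta * delta      # update the action preference
            s = s_next
        return p, v

The actor draws actions from a softmax over the preferences p(s, ·), and the critic's TD error drives both updates, mirroring the alternating updates described above.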

Fig. 1: Framework of the actor-critic method (the agent, consisting of an actor with action preferences p(s, a) and a critic with state values v(s), interacts with the environment through the state s, the action a, and the reward r).

2.2 Transfer Learning

In this paper, we discuss transfer learning in the actor-critic method. Figure 2 illustrates its framework. First, agents learn various source tasks and construct a database of policies. Second, an agent for the target task refers to the database and selects a source policy similar to the optimal policy for the target task. Finally, the agent trains on the target task based on the selected policy. Since the selected policy would contain information effective for the target task, the learning process of the target task would be accelerated.

Fig. 2: Framework of transfer learning (source policies learned for the source tasks are stored in a database and transferred to the target policy for the target task).

Transfer learning reuses a source policy from the database that has the same domain as the target task. We define the domain as follows.

Definition 1. A domain D is a tuple <S, A>. A task Ω is a tuple <D, T, R>, where T is a stochastic state transition function T : S × A × S → ℝ that gives the probability that the action a in the state s1 will lead to the state s2.

The domain is defined by many researchers independently. For example, Fernández defined a domain as a tuple <S, A, T> [8]. We intend our definition to keep the application of the proposed method wide.

2.3 Our Previous Work for Transfer Learning

We proposed the selection method for transfer learning in our previous work [7]. In [7], we introduced two concepts: the forbidden rule set and the concordance rate. The former is a set of rules that cause immediate failure of a task. The latter is defined as follows.

Definition 2. The state s is an equivalent state if all source forbidden rules related to the state s agree with the ones for the target task. The concordance rate of the source forbidden rule set is the rate of equivalent states against all states.

A high concordance rate of a source forbidden rule set means that the corresponding policy is effective for the target task. Since the complete forbidden rule set for the target task is unknown during the training phase, agents compute the concordance rate based on the incomplete forbidden rule set found so far. They select the knowledge with the highest concordance rate from the database if that concordance rate is greater than a given transfer threshold θ. Here, a high threshold brings precise similarity and little transfer, and a low threshold brings the opposite.

3 Proposal

In this section, we propose a reuse method for the actor-critic method. The method accepts source tasks whose state value table and action preference table have the same size as the ones for the target task.

3.1 Reuse Method Based on the Selected Policy

Agents cannot completely foresee the optimal target policy by using our selection method. Therefore, the selected policy may include inappropriate rules, which slow down the learning process for the target task.

We discuss a method that reuses action preferences and state values instead of a policy in the form of a set of rules. Since the function of each parameter is different, they should be reused in consideration of their characteristics.

Action preferences should be transferred carefully, since they are directly used to decide the agent's action. Only reliable action preferences should be reused. Rules related to an equivalent state would be reliable, since all of their forbidden rules agree.
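As an illustration of Definition 2 and of the selection step of [7], the following Python sketch computes equivalent states and the concordance rate from forbidden rule sets and picks a source policy from the database. The data structures (forbidden rules as (state, action) pairs, database entries as (P_s, V_s, F_s) tuples), the function names, and the exact reading of "agree" are our assumptions, not code from the paper.

    def equivalent_states(source_forbidden, target_forbidden, states):
        """Definition 2, under one reading of 'agree': a state s is equivalent
        if the source forbidden rules related to s coincide with the target
        forbidden rules known so far for s. Rules are (state, action) pairs."""
        eq = set()
        for s in states:
            src = {rule for rule in source_forbidden if rule[0] == s}
            tgt = {rule for rule in target_forbidden if rule[0] == s}
            if src == tgt:
                eq.add(s)
        return eq

    def concordance_rate(source_forbidden, target_forbidden, states):
        # Rate of equivalent states against all states
        return len(equivalent_states(source_forbidden, target_forbidden, states)) / len(states)

    def select_source_policy(database, target_forbidden, states, theta=0.2):
        """database: list of (P_s, V_s, F_s) tuples (action preferences,
        state values, forbidden rule set). Returns the entry with the highest
        concordance rate, or None if that rate does not exceed theta."""
        best, best_rate = None, 0.0
        for entry in database:
            _, _, F_s = entry
            rate = concordance_rate(F_s, target_forbidden, states)
            if rate > best_rate:
                best, best_rate = entry, rate
        return best if best_rate > theta else None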
The agent merges reliable source action preferences into the current action preferences by equation (2):

    p_t(s, a) ← p_t(s, a) + ζ p_s(s, a),   ∀s ∈ equivalent states, ∀a ∈ A.   (2)

Here, the subscripts t and s mean target and source, respectively. The transfer efficiency ζ is a fixed parameter that controls the effect of the reused action preferences. To prevent negative transfer, the transfer efficiency is restricted to 0 < ζ < 1.

State values can be reused more aggressively. State values have less impact on negative transfer than action preferences, since they affect the agent's decision only indirectly. Agents reuse only reliable action preferences, which are selected according to forbidden rules; this implies that the reliable action preferences may not contain information related to preferred actions. To compensate for this, preferred actions are reused through state values. Agents transfer only positive state values, because agents tend to move to states that have higher state values. They merge the source state values into their own state values by equation (3):

    v_t(s) ← v_t(s) + η v_s(s),   ∀s ∈ {s | v_s(s) > 0, s ∈ S}.   (3)

Here, the transfer efficiency η is a fixed parameter that controls the effect of the reused state values. As with the transfer efficiency ζ, η is restricted to 0 < η < 1.

3.2 Whole Algorithm Flow

In this section, we show the complete transfer algorithm. In the training phase, an agent learns the target task Ω_t. It searches for a policy to transfer every time it receives a reward, and it transfers the policy if the policy is different from the last selected one. Figure 3 shows the pseudo code of this phase. Given the policy database D and the target task Ω_t, the algorithm produces the optimal policy, represented by the final action preferences P.

    initialize parameters P and V.
    forbidden rule set F ← ∅
    the latest transferred item (P_p, V_p, F_p) ← ( )
    while( the agent does not satisfy the termination conditions ) {
        observe state s.
        decide action a.
        receive reward r.
        if( a is a forbidden action ) {
            add (s, a) into F.
            the most effective item (P_e, V_e, F_e) ← ( )
            the highest concordance rate C_e ← 0
            foreach( (P_d, V_d, F_d) in database D ) {
                C ← concordance rate of F_d to F.
                if( C > C_e ) {
                    (P_e, V_e, F_e) ← (P_d, V_d, F_d).
                    C_e ← C.
                }
            }
            if( C_e > θ && (P_e, V_e, F_e) != (P_p, V_p, F_p) ) {
                merge P_e into P according to equation (2).
                merge V_e into V according to equation (3).
                (P_p, V_p, F_p) ← (P_e, V_e, F_e).
            }
        } else {
            update P and V (actor-critic method).
        }
    }

Fig. 3: Pseudo code to learn the target task
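The transfer branch of this loop can be sketched in Python as follows. merge_action_preferences implements equation (2) and merge_state_values implements equation (3); the default values of zeta, eta, and theta follow Section 4.1, and select_source_policy / equivalent_states refer to the hypothetical helpers sketched earlier. This is an illustrative sketch under those assumptions, not the authors' code.

    def merge_action_preferences(p_t, p_s, equivalent, zeta=0.5):
        # Equation (2): add the source preferences, only for equivalent states.
        # p_t and p_s are numpy arrays of shape (n_states, n_actions).
        for s in equivalent:
            p_t[s, :] += zeta * p_s[s, :]
        return p_t

    def merge_state_values(v_t, v_s, eta=0.05):
        # Equation (3): add only the positive source state values.
        # v_t and v_s are numpy arrays of shape (n_states,).
        mask = v_s > 0
        v_t[mask] += eta * v_s[mask]
        return v_t

    def transfer_step(p, v, database, target_forbidden, states, last_transferred,
                      theta=0.2, zeta=0.5, eta=0.05):
        """One pass of the transfer branch in Fig. 3, called when a forbidden
        action has just been added to the target forbidden rule set."""
        selected = select_source_policy(database, target_forbidden, states, theta)
        # Transfer only if something was selected and it differs from the
        # last transferred item.
        if selected is None or selected is last_transferred:
            return p, v, last_transferred
        p_s, v_s, f_s = selected
        eq = equivalent_states(f_s, target_forbidden, states)
        p = merge_action_preferences(p, p_s, eq, zeta)
        v = merge_state_values(v, v_s, eta)
        return p, v, selected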
4 Experiments

In this section, we perform simple experiments to show the effectiveness of the proposed method by comparing it with π-reuse [8].

4.1 Experiment Settings

We use simple maze tasks for our experiments. Each maze consists of 7 × 7 cells, and each cell is either a coordinate (a passable cell) or a pit. An agent moves from the start cell to the goal cell through coordinates only. The agent moves in the four directions one cell at a time, and decides its action by sensing its location; it repeats observation, decision, and action every time it moves one cell. Here, the domain D is defined with S = {s_1, s_2, ..., s_49} and A = {up, down, left, right}. State labels are arranged in row-major order from the upper-left corner to the lower-right corner. The state s_9 is the start cell and s_41 is the goal cell for all tasks. Rewards are defined as follows: r = -50 for actions that leave the coordinates, r = 100 for actions that reach the goal, and r = -25 for every 100th move. The state transition function T is defined as follows: for all moves in the same direction as the agent's action, the transition probability is 0.9. Agents deviate to the right of their action with transition probability 0.05, and to the left in the same manner. They never move opposite to their action, and they never remain stationary.

We prepare three mazes as target tasks (see Figure 4) and 24 mazes as source tasks. In Figure 4, white cells are coordinates and black cells are pits. First, we prepare a database by training an agent on each source task; this database is used in common for the following experiments. An agent finishes its learning process when it reaches the goal cell in ten episodes in a row. Each episode is the subsequence of the learning process in which the agent moves from the start cell to a pit or to the goal cell.
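For completeness, the maze tasks described above could be encoded as the sketch below. The class name, the 0-based state indices (cell s_k maps to index k-1 so that the tables of the earlier sketches can be indexed directly), the empty default pit layout, and the signs of the failure rewards are assumptions; the actual mazes of Fig. 4 are not reproduced here.

    import numpy as np

    class MazeTask:
        """Hedged sketch of a 7 x 7 maze task from Section 4.1. Cells are
        numbered in row-major order and exposed as 0-based indices 0..48;
        the pit layout is a placeholder, not a maze from Fig. 4."""
        MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # up, down, left, right
        TURN_RIGHT = {0: 3, 3: 1, 1: 2, 2: 0}
        TURN_LEFT = {0: 2, 2: 1, 1: 3, 3: 0}

        def __init__(self, pits=(), start=8, goal=40, rng=None):
            # start=8 and goal=40 are the 0-based indices of cells s_9 and s_41.
            self.pits, self.start, self.goal = set(pits), start, goal
            self.rng = rng or np.random.default_rng()
            self.steps = 0

        def reset(self):
            self.s, self.steps = self.start, 0
            return self.s

        def step(self, action):
            self.steps += 1
            # Stochastic transition: intended move with probability 0.9,
            # deviation to the right or to the left with probability 0.05 each.
            roll = self.rng.random()
            if roll < 0.9:
                a = action
            elif roll < 0.95:
                a = self.TURN_RIGHT[action]
            else:
                a = self.TURN_LEFT[action]
            row, col = divmod(self.s, 7)
            row, col = row + self.MOVES[a][0], col + self.MOVES[a][1]
            s_next = row * 7 + col
            # Leaving the coordinates (off the grid or into a pit) ends the
            # episode with a failure reward; the reward signs are assumptions.
            if not (0 <= row < 7 and 0 <= col < 7) or s_next in self.pits:
                return self.s, -50.0, True
            self.s = s_next
            if s_next == self.goal:
                return s_next, 100.0, True
            r = -25.0 if self.steps % 100 == 0 else 0.0   # penalty every 100th move
            return s_next, r, False

For example, MazeTask(pits={10, 11, 17}) defines one hypothetical maze, and repeatedly calling actor_critic_episode on it gives the baseline (non-transfer) training loop.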

Parameters of the actor-critic method are as follows: discount rate γ = 0.95, learning rate α = 0.05, and step-size parameter β = 0.05. The agent decides its action by the soft-max method during learning. The transfer threshold θ is 0.2, and the fixed transfer efficiencies ζ and η are 0.5 and 0.05, respectively. Each experiment was repeated for 2000 trials.

Fig. 4: Mazes of the target tasks A, B, and C (white cells are coordinates, black cells are pits).

4.2 Acceleration of Learning Processes

In this section, we discuss the effect of the proposed method. Agents learn each target task by three methods: the original actor-critic method, the proposed method, and the π-reuse method.

Table 1: Number of episodes for each transfer method (values in parentheses are the number of training failures)

            Original       Proposed       π-reuse
    Ω_A     250.4 (38)     221.2 (16)     255.9 (41)
    Ω_B     231.1 (66)     195.7 (31)     228.9 (67)
    Ω_C     281.1 (147)    281.7 (142)    271.0 (149)

In Table 1, Ω_A, Ω_B, and Ω_C show the results for the target tasks A, B, and C, respectively. Each value is the average number of episodes, and each value in parentheses is the number of training failures. Grayed cells in the original table mark results that show significant differences (p < 0.05) from the original method in the leftmost column. The numbers of learning episodes with π-reuse hardly differ from those of the original actor-critic method, whereas the proposed method tends to shorten the learning processes compared with the original ones. From these results, the proposed method reuses the selected policy while avoiding inappropriate rules and thus accelerates the learning process.

5 Conclusion

In this paper, we proposed a reuse method for transfer learning in actor-critic. The method allows a learning agent to avoid inappropriate rules for the current task. In detail, it merges the action preferences and state values of the selected policy into the current parameters. We performed simple experiments to show the effectiveness of the proposed method. As a result, the proposed method accelerates the learning process for the current task.

References

[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[2] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore, "Reinforcement Learning: A Survey", Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[3] Marco Wiering and Jürgen Schmidhuber, "Fast Online Q(λ)", Machine Learning, vol. 33, pp. 105-115, 1998.
[4] Arthur Plínio de S. Braga and Aluízio F. R. Araújo, "Influence zones: A strategy to enhance reinforcement learning", Neurocomputing, vol. 70, pp. 21-34, 2006.
[5] Laëtitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat, "Reward Function and Initial Values: Better Choices for Accelerated Goal-Directed Reinforcement Learning", Lecture Notes in Computer Science, vol. 4131, pp. 840-849, 2006.
[6] Sinno Jialin Pan and Qiang Yang, "A Survey on Transfer Learning", Technical Report HKUST-CS08-08, Dept. of Computer Science and Engineering, Hong Kong University of Science and Technology, 2008.
[7] Toshiaki Takano, Haruhiko Takase, Hiroharu Kawanaka, Hidehiko Kita, Terumine Hayashi, and Shinji Tsuruoka, "Detection of the effective knowledge for knowledge reuse in Actor-Critic", Proceedings of the 19th Intelligent System Symposium and the 1st International Workshop on Aware Computing, pp. 624-627, 2009.
[8] Fernando Fernández and Manuela Veloso, "Probabilistic Policy Reuse in a Reinforcement Learning Agent", Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 720-727, 2006.