Free-energy-based Reinforcement Learning in a Partially Observable Environment

Makoto Otsuka 1,2, Junichiro Yoshimoto 1,2 and Kenji Doya 1,2

1 - Initial Research Project, Okinawa Institute of Science and Technology, 12-22 Suzaki, Uruma, Okinawa 904-2234, Japan
2 - Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan

Abstract. Free-energy-based reinforcement learning (FERL) can handle Markov decision processes (MDPs) with high-dimensional state spaces by approximating the state-action value function with the negative equilibrium free energy of a restricted Boltzmann machine (RBM). In this study, we extend the FERL framework to handle partially observable MDPs (POMDPs) by incorporating a recurrent neural network that learns a memory representation sufficient for predicting future observations and rewards. We demonstrate that the proposed method successfully solves POMDPs with high-dimensional observations without any prior knowledge of the environmental hidden states and dynamics. After learning, task structures are implicitly represented in the distributed activation patterns of the hidden nodes of the RBM.

1 Introduction

Partially observable Markov decision processes (POMDPs) are versatile enough to model sequential decision making in the real world. However, state-of-the-art algorithms for POMDPs [1, 2] assume prior knowledge of the environment: in particular, a set of hidden states that makes the environment Markovian, and the transition and observation probabilities over those states. They also have difficulty handling high-dimensional sensory inputs.

The use of an undirected counterpart of Bayesian networks has yielded a new algorithm for handling Markov decision processes (MDPs) with a large state space [3]. In this free-energy-based reinforcement learning (FERL), a restricted Boltzmann machine (RBM) is used to approximate the state-action value function as the negative free energy of the RBM. In this study, we extend the FERL framework to handle POMDPs using Whitehead's recurrent-model architecture [4]. The proposed method can handle high-dimensional observations and solve POMDPs without any prior knowledge of the environmental state set and dynamics.

2 Free-energy-based reinforcement learning framework

We briefly review the FERL framework for MDPs [3]. In this framework, the agent is realized by an RBM (Fig. 1(a)). The visible layer V is composed of binary state nodes S and action nodes A, and the hidden layer is composed of binary hidden nodes H. A state node S_i is connected to a hidden node H_k by the connection weight w_ik, and an action node A_j is connected to a hidden node H_k by the connection weight u_jk. A hidden node h_k takes a binary value with probability Pr(h_k = 1) = σ(Σ_i w_ik s_i + Σ_j u_jk a_j), where σ(z) ≡ 1/(1 + exp(−z)).

The free energy of the system, which is the negative log-partition function of the posterior probability over h given a configuration (S = s, A = a) in thermal equilibrium at unit temperature, is given by

  F(s, a) = −s⊤W ĥ − a⊤U ĥ + Σ_{k=1}^{K} [ ĥ_k log ĥ_k + (1 − ĥ_k) log(1 − ĥ_k) ],

where W ≡ [w_ik] and U ≡ [u_jk] are matrix notations of the connection weights, and ĥ_k ≡ σ([W⊤s + U⊤a]_k) is the conditional expectation of h_k given the configuration (s, a); [·]_k denotes the k-th component of the vector enclosed within the brackets.

The network is trained so that the negative free energy approximates the state-action value function, i.e., −F(s, a) ≈ Q(s, a) ≡ E[r + γ Q(s′, a′) | s, a], where r, s′, and a′ are the reward, next state, and next action, and γ is the discount factor for future rewards. By applying the SARSA(0) algorithm with a function approximator [5], we obtain a simple update rule for the network parameters:

  Δw_ik = α (r_{t+1} − γ F(s_{t+1}, a_{t+1}) + F(s_t, a_t)) s_{i,t} ĥ_{k,t},   (1a)
  Δu_jk = α (r_{t+1} − γ F(s_{t+1}, a_{t+1}) + F(s_t, a_t)) a_{j,t} ĥ_{k,t},   (1b)

where the subscript t denotes the time step and α denotes the learning rate. To select an action at a given state s, we used the softmax action selection rule with inverse temperature β,

  π(s, a) = Pr(a | s) ∝ exp{−β F(s, a)},   (2)

by calculating the free energies for each action.
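To make this concrete, the following Python sketch implements the pieces just described for a small RBM: the free energy F(s, a), the softmax action selection of Eq. (2), and the SARSA(0)-style updates of Eqs. (1a)-(1b). It is a minimal illustration under our own assumptions (class name, initialization, and hyperparameter values), not the authors' implementation.

```python
import numpy as np


class FERLAgent:
    """Free-energy-based RL agent: Q(s, a) is approximated by -F(s, a) of an RBM
    whose visible layer holds binary state nodes and one-hot action nodes."""

    def __init__(self, n_state, n_action, n_hidden,
                 alpha=0.01, gamma=0.9, beta=5.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_state, n_hidden))   # weights w_ik
        self.U = 0.01 * self.rng.standard_normal((n_action, n_hidden))  # weights u_jk
        self.alpha, self.gamma, self.beta = alpha, gamma, beta
        self.n_action = n_action

    def _h_hat(self, s, a):
        # Conditional expectation of each hidden node given the configuration (s, a)
        return 1.0 / (1.0 + np.exp(-(self.W.T @ s + self.U.T @ a)))

    def free_energy(self, s, a):
        h = self._h_hat(s, a)
        neg_entropy = np.sum(h * np.log(h + 1e-12) + (1.0 - h) * np.log(1.0 - h + 1e-12))
        return -s @ self.W @ h - a @ self.U @ h + neg_entropy

    def select_action(self, s):
        # Softmax action selection: pi(s, a) proportional to exp(-beta * F(s, a))   (Eq. 2)
        actions = np.eye(self.n_action)
        f = np.array([self.free_energy(s, a) for a in actions])
        p = np.exp(-self.beta * (f - f.min()))          # shift for numerical stability
        p /= p.sum()
        return actions[self.rng.choice(self.n_action, p=p)]

    def sarsa_update(self, s, a, r, s_next, a_next, terminal=False):
        # TD error with Q = -F; the bootstrap term is dropped on terminal transitions
        bootstrap = 0.0 if terminal else -self.gamma * self.free_energy(s_next, a_next)
        delta = r + bootstrap + self.free_energy(s, a)
        h = self._h_hat(s, a)
        self.W += self.alpha * delta * np.outer(s, h)   # Eq. (1a)
        self.U += self.alpha * delta * np.outer(a, h)   # Eq. (1b)
```

Because the free energy is evaluated once per candidate action, action selection scales linearly with the number of discrete actions.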

Fig. 1: Models for handling high-dimensional inputs. (a) An actor-only architecture for MDPs. (b) A predictor-actor architecture for POMDPs.

Fig. 2: Digit matching T-maze task. The optimal action at the T-junction is indicated by arrows.

3 Model architecture

We incorporate Whitehead's recurrent-model architecture [4] into the FERL framework for solving POMDPs, as shown in Fig. 1(b). The architecture consists of two modules: an Elman-type recurrent neural network (RNN) for one-step prediction (the predictor) and an RBM for state-action value estimation (the actor).

The predictor module predicts the upcoming observation y_t and reward r_t on the basis of the memory m_t, which is supposed to summarize the history of all past events. (Here r denotes a scalar reward, whereas the vector notation r denotes a bit coding of the scalar reward with respect to all possible rewards.) At each time t, the memory is given by the sigmoid function σ(·) of a linear transformation of the previous observation, action, and memory (y_{t−1}, a_{t−1}, m_{t−1}). Once the memory m_t is given, the network predicts (y_t, r_t) as the sigmoid function of a linear mapping of m_t. All linear coefficients (weights and biases) of the network are trained by the backpropagation through time (BPTT) algorithm [6].

The actor module regards the combination of the current observation and the predictor's memory, (y_t, m_t), as the state vector s. The actor is trained by the SARSA(0) algorithm with Eq. (1).
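A bare-bones sketch of such a predictor is given below. It is an illustrative reconstruction rather than the authors' code: the parameter names (W_y, W_a, W_m, V_y, V_r) and the initialization are ours, and training by BPTT [6], which would unroll step() over an episode and backpropagate the prediction error, is not shown.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ElmanPredictor:
    """One-step predictor: the memory m_t is a sigmoid of a linear transformation of
    (y_{t-1}, a_{t-1}, m_{t-1}), and (y_t, r_t) are then predicted from m_t alone."""

    def __init__(self, n_obs, n_action, n_reward, n_memory, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: 0.1 * rng.standard_normal(shape)
        self.W_y, self.W_a, self.W_m = init(n_memory, n_obs), init(n_memory, n_action), init(n_memory, n_memory)
        self.b_m = np.zeros(n_memory)
        self.V_y, self.b_y = init(n_obs, n_memory), np.zeros(n_obs)
        self.V_r, self.b_r = init(n_reward, n_memory), np.zeros(n_reward)

    def step(self, y_prev, a_prev, m_prev):
        # Memory update from the previous observation, action, and memory
        m = sigmoid(self.W_y @ y_prev + self.W_a @ a_prev + self.W_m @ m_prev + self.b_m)
        # One-step predictions of the current observation and the bit-coded reward
        y_pred = sigmoid(self.V_y @ m + self.b_y)
        r_pred = sigmoid(self.V_r @ m + self.b_r)
        return m, y_pred, r_pred
```

The actor then treats the concatenation of the current observation and this memory, (y_t, m_t), as its state vector.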

4 Experiments

We designed a matching T-maze task in order to show the proposed model's ability to solve POMDPs without any prior knowledge of the environmental state set and dynamics. The matching T-maze task is an extension of the non-Markovian grid-based T-maze task [7], designed to investigate the coding and combinatorial usage of task-relevant information. The agent can execute four possible actions: go one step North, West, East, or South. At each time step, the agent observes a binary vector that depends on its position in the maze.

In the first experiment, the observation is composed of five bits encoding the position: (1) the start position, (2) the middle of the corridor, (3) the T-junction, (4) the left goal, and (5) the right goal, plus two bits of signals specifying the rewarding goal position, observed at the start position and the T-junction only. In the second experiment, observations are 784-dimensional binary hand-written digits (Fig. 2).

An episode ends either when the agent steps into a goal state or after the number of action selections exceeds a fixed limit. If the two signals at the start position and the T-junction are the same, the agent receives a positive reward at the right goal and a negative reward at the left goal; if the two signals are not the same, the reward condition is reversed. When the agent hits a wall, the underlying environmental state does not change and the agent receives a negative reward; otherwise, the agent receives a small negative reward at each step. At the beginning of each episode, the two signals are independently and randomly selected and then fixed.

We used 7 or 784 observation nodes Y, 4 reward nodes R, and 20 memory nodes M for the predictor module, and 20 hidden nodes H and 4 action nodes A for the actor module.

Fig. 3: Average weighted prediction errors of observations and rewards. The top and bottom rows show the errors for the training and test datasets, respectively. The vertical and horizontal axes of each panel indicate the training epoch of the RNN and the step t within an episode, respectively.

4.1 Matching T-maze task with orthogonal bit codes

The predictor was first trained on episodic training data with step lengths varying from 3 to 7, collected by random action selection, repeatedly over multiple epochs (Fig. 3). Using the pre-trained predictor, the proposed predictor-actor model successfully learned the optimal policy in this task (Figs. 4(a) and 4(b)).

Figs. 5(a) and 5(b) show the activations of the actor's state nodes (M, Y) and of its hidden nodes H at the T-junction, respectively. The memory layer retained the signal observed at the start position (Fig. 5(a)). Principal component analysis (PCA) of the activation patterns in the actor's network revealed that the four signal conditions were well separated even before the actor's learning started (Fig. 6(a)). This clear separation in a high-dimensional space was helpful for state representation in the actor module, in that it allowed the agent to learn the optimal policy. In addition, the activations of the hidden nodes showed a gradual separation of firing patterns through the actor's learning process, as though the activations were functionally differentiated (Figs. 5(b) and 6(b)).
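As a rough sketch of how the two modules interact during the actor's learning phase, the loop below wires together the hypothetical FERLAgent and ElmanPredictor sketched earlier. The environment interface (reset()/step()), the zero initial memory, and the episode count are assumptions; the predictor is assumed to have already been pre-trained with BPTT on randomly collected episodes, as described above, and is kept fixed here.

```python
import numpy as np


def train_actor(env, predictor, actor, n_episodes=100):
    """SARSA(0) training of the FERL actor on the augmented state s_t = (y_t, m_t).
    `env` is assumed to expose reset() -> y0 and step(a) -> (y, r, done)."""
    episode_returns = []
    for _ in range(n_episodes):
        y = env.reset()
        m = np.zeros(predictor.W_m.shape[0])          # initial memory (assumed zero)
        a = actor.select_action(np.concatenate([y, m]))
        done, total_reward = False, 0.0
        while not done:
            y_next, r, done = env.step(a)
            m_next, _, _ = predictor.step(y, a, m)    # next memory from (y_t, a_t, m_t)
            s, s_next = np.concatenate([y, m]), np.concatenate([y_next, m_next])
            a_next = actor.select_action(s_next)
            actor.sarsa_update(s, a, r, s_next, a_next, terminal=done)
            y, m, a = y_next, m_next, a_next
            total_reward += r
        episode_returns.append(total_reward)
    return episode_returns
```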

Fig. 4: Performance of the predictor-actor model in the matching T-maze task with low-dimensional bit-coded observations. (a) Discounted return. (b) Terminal reward. The error bars show one standard deviation across runs; the theoretically optimal performance is indicated by the dotted lines.

Fig. 5: Activation patterns of the actor's nodes at the T-junction. The bit patterns enclosed within the parentheses indicate the four conditions of the signals at the two positions (start, T-junction). (a) Actor's state nodes, composed of M and Y. (b) Conditional activation of the actor's hidden nodes H.

Fig. 6: PCA of the actor's activation patterns over all T-junction visits. (a) Actor's state nodes. (b) Actor's hidden nodes. The size of each marker reflects the number of steps taken to reach the goal; the smallest marker indicates 3 steps.
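The projections shown in Figs. 5 and 6 amount to collecting the actor's node activations at every T-junction visit and projecting them onto the first two principal components. A minimal sketch of that analysis is given below; the data-collection step and variable names are assumptions, not the authors' analysis code.

```python
import numpy as np


def pca_2d(activations):
    """Project rows of `activations` (one vector of node activations per T-junction visit)
    onto their first two principal components."""
    X = np.asarray(activations, dtype=float)
    X = X - X.mean(axis=0)                               # center each node's activation
    _, _, Vt = np.linalg.svd(X, full_matrices=False)     # rows of Vt are principal directions
    return X @ Vt[:2].T                                  # (n_visits, 2) scores: PC1 and PC2


# Example usage (the collected activation arrays are hypothetical):
# state_scores  = pca_2d(state_activations_at_t_junction)    # actor's state nodes (m_t, y_t)
# hidden_scores = pca_2d(hidden_activations_at_t_junction)   # actor's hidden nodes h_t
```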

Fig. 7: Performance of the predictor-actor model in the matching T-maze task with high-dimensional pixel observations. (a) Discounted return. (b) Terminal reward.

4.2 Matching T-maze task with high-dimensional observations

In the second task, pixel images of hand-written digits were used as observations. The performance of the agent remained suboptimal, as shown in Fig. 7(a); however, the agent still showed a tendency to select the correct goal, as shown in Fig. 7(b). This indicates that the information about the initial signal was at least retained in the predictor's hidden nodes.

5 Conclusion and future work

In this study, we extended the FERL framework to handle POMDPs. Neither the state transition probabilities nor the true set of underlying Markovian states was given a priori. We used this approach to handle high-dimensional observations and obtained preliminary results. To improve the performance of this architecture, separating the predictors for observations and rewards could be helpful; with this modification, several nuisance parameters would be removed and a scalar reward could be handled directly.

References

[1] J. Hoey and P. Poupart. Solving POMDPs with continuous or large discrete observation spaces. In IJCAI, volume 19, page 1332, 2005.
[2] M. Toussaint, L. Charlin, and P. Poupart. Hierarchical POMDP controller optimization by likelihood maximization. In UAI, 2008.
[3] B. Sallans and G. E. Hinton. Reinforcement learning with factored states and actions. Journal of Machine Learning Research, 5:1063-1088, 2004.
[4] S. D. Whitehead and L. J. Lin. Reinforcement learning of non-Markov decision processes. Artificial Intelligence, 73:271-306, 1995.
[5] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1998.
[6] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550-1560, 1990.
[7] B. Bakker. Reinforcement learning with long short-term memory. In NIPS, pages 1475-1482, 2002.