
Planning in POMDPs using MDP heuristics

Polymenakos Kyriakos
Oxford University
Supervised by Shimon Whiteson
kpol@robots.ox.ac.uk

Abstract

Partially observable Markov decision processes (POMDPs) provide a powerful framework for tackling discrete, partial-information, stochastic problems. Exact solution of large POMDPs is often computationally intractable; online algorithms have had significant success on these larger problems. In this work we survey possible extensions of one such algorithm (POMCP), replacing, or combining, the rollouts that the algorithm performs to evaluate positions with other heuristic methods based on solving the underlying MDP. In this way, POMDP solvers can benefit from the great advances in the MDP state of the art. According to the experiments performed, the MDP heuristic has a positive effect on the algorithm's performance, ranging from alleviating the need for hand-crafted rollout policies to significantly outperforming the original algorithm using less computational time, depending on the problem.

1 Introduction

POMDP solvers use both online and offline approaches. Depending on the demands of the application, especially on the time available for action selection, one of the two is deemed more suitable than the other. In general, full-width offline planners, such as (1), (2), (3), exploit factored representations and/or smart heuristics in order to effectively explore a search space that is typically very large. Online planners benefit from the fact that they do not necessarily have to compute policies for the entire search space, allowing them to focus on the generally much smaller subset of relevant states. We expand on one such algorithm, the Partially Observable Monte Carlo Planner (POMCP) (4), by combining it with heuristics. Various heuristics have been proposed in the POMDP literature. Here we use variations of the MDP heuristic (5), whose main idea is to approximate the value (expected return) of a state in the POMDP problem by estimating the return of one or more states of the underlying MDP. In the original algorithm, this estimation is performed using rollouts (simulations of possible outcomes of the problem under a simple, sometimes even random, policy). Depending on the problem and other parameters, rollouts can be more or less computationally expensive than other methods of estimation, as well as more or less accurate. We explore the effect of replacing rollouts with the MDP heuristic on three benchmark problems, as presented in (4).

2 Background

2.1 The POMDP framework

Markov decision processes (MDPs) provide a powerful framework for tackling discrete, stochastic decision problems. For every state s ∈ S and every action a ∈ A there are transition probabilities determining the distribution over next states s'. There is a reward function determining the agent's reward for each transition from s to s' under action a.

The accumulated rewards make up the agent's return over an episode, or over part of an episode. In partially observable domains, the agent does not have access to the state, but only to certain observations o ∈ O. The observations are received according to observation probabilities determined by s and a. The set of past action-observation pairs forms the history. The agent's task is to maximise its expected return and, to do so, it follows a policy π(h, a) that maps histories to probability distributions over actions. An optimal policy is a policy that maximises the agent's expected return. The belief state is the probability distribution over states s given the history h, and it represents the agent's belief about the state of the environment.

2.2 The POMCP algorithm

A brief description of POMCP is presented in this section; more details can be found in (4). POMCP combines an Upper Confidence Tree (UCT) search, which selects actions, with a particle filter that updates the belief state. The UCT tree is a tree of histories, with each node representing a unique sequence of action-observation pairs. In the original algorithm, a set number of simulations is run for each action to be selected. Each simulation has two stages: a first stage, where actions are selected to maximise an augmented value that favours high expected return and high uncertainty about that return, and a second stage, where action selection follows a rollout policy. The algorithm switches from stage one to stage two when a new node is encountered, and uses the rollout to evaluate it. As a result, a single node is added to the search tree for every simulation. After the action is selected, the agent performs it and receives an observation, and all the irrelevant branches of the search tree are pruned.

The particle filter approximates the belief state with a set of states. Every time the agent performs an action and receives an observation, the belief state update is performed by sampling from this set of states, simulating the action performed by the agent, and then comparing the observation received during simulation with the actual observation received from the environment; if they match, the particle "passes" through the filter and becomes a member of the new set of states. This is repeated until the number of particles in the updated belief state reaches a set target.

It is worth looking more closely at the part of the POMCP algorithm where we intervene. When UCT creates a new node in the search tree and its value needs to be estimated, a rollout is performed: the evolution of the episode is simulated, with actions selected by some simple (history-based) policy, until the episode terminates or some limit on the number of time steps is reached. The policy used depends on the level of domain knowledge available to the algorithm. Without domain knowledge, a random rollout is performed, where at each time step an action is selected uniformly at random. With an intermediate level of domain knowledge, only legal actions are sampled, and finally, with preferred actions, a simple but predetermined, hand-crafted, domain-specific policy is used for action selection. It should be noted that each new node is estimated by a single rollout, which is associated with a single particle. The estimate obtained has high variance, and its computational cost is at most that of performing the maximum number of steps, which in the default algorithm setting is 100.
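As a concrete illustration, the following is a minimal Python sketch of this single-rollout leaf evaluation. The simulator.step interface, the rollout_policy callable, and the default parameter values are assumptions for illustration, not POMCP's actual implementation.

def rollout_value(simulator, state, history, rollout_policy, gamma=0.95, max_steps=100):
    # Single-rollout estimate of a new node's value. `simulator.step` and
    # `rollout_policy` are hypothetical stand-ins for the generative model and
    # the (history-based) rollout policy described above; `history` is a list
    # of (action, observation) pairs.
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = rollout_policy(history)
        state, observation, reward, done = simulator.step(state, action)
        history = history + [(action, observation)]
        ret += discount * reward
        discount *= gamma
        if done:
            break
    return ret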
2.3 The MDP heuristic

The MDP heuristic provides a relatively fast way of estimating the expected return of the POMDP, given the belief state and the expected returns of states in the underlying MDP (6). To calculate the expected return of the POMDP, the heuristic averages the returns of the MDP states that compose the belief state. Given the value of any state s ∈ S of the MDP, we need to estimate V(b) of the POMDP, which is approximated by V̂(b), given by

    V̂(b) = Σ_{s ∈ S} V_MDP(s) b(s)

Depending on domain parameters, namely the size of the state space, the expected returns of the MDP can be supplied to the POMDP solver by different means. In the simplest case, the MDP is solved exactly, an optimal policy is determined and, for that policy, the expected return of each state is stored in a table. The POMDP solver, when it needs an estimate for a state, looks up the appropriate value in this table, computed offline. When the number of states does not allow the exact solution of the MDP and/or storing the expected return of each state, other methods have to be employed. For example, a function approximator, such as a neural network, can be used to estimate the expected return given the state. In other cases, if the MDP solution is easily derived from the structure of the state, the expected return can be computed explicitly. It should be noted that a trivial MDP, whose expected return under optimal play is easy to compute, does not necessarily give rise to a trivial POMDP (this is the case in battleship, as we will see shortly).
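For illustration, here is a minimal Python sketch of how this estimate can be computed from POMCP's particle approximation of the belief. The name v_mdp and the particle-list representation are assumptions: v_mdp stands for whichever offline source of MDP values is available (table, closed form, or function approximator).

def mdp_heuristic_value(particles, v_mdp):
    # particles: list of sampled states approximating the belief b
    # v_mdp: callable mapping an MDP state to its pre-computed expected return
    # The particle average approximates the belief-weighted sum over states.
    return sum(v_mdp(s) for s in particles) / len(particles)

# Toy example with a two-state belief: 3 particles in state 0, 1 in state 1.
values = {0: 10.0, 1: -10.0}
estimate = mdp_heuristic_value([0, 0, 0, 1], values.get)   # 0.75*10 + 0.25*(-10) = 5.0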

3 Problems

We address three known problems, following (4). They have very different characteristics, which call for different approaches in applying the proposed methods. The problems are ordered by number of states, with rocksample having the fewest.

3.1 Rocksample

Rocksample is a well-known problem inspired by planetary exploration. The agent, or robot, moves in a gridworld of size k×k. There also exist n rocks, at predetermined positions. The rocks can be good or bad, with good rocks being valuable enough to sample, while the bad rocks are not. The robot's task is to navigate to the positions of the good rocks and perform a sampling action. To differentiate between good and bad rocks from a distance, the agent has a noisy long-distance sensor, which it can use with a check action to receive a (noisy) measurement of the quality of one of the rocks. The measurement's accuracy deteriorates exponentially as the distance between robot and rock increases. The robot gets no reward for move or check actions, a positive reward of 10 for sampling a good rock, a negative reward of -10 for sampling a bad rock, and a positive reward of 10 for exiting the grid to the right, which terminates the episode. In our case we experiment with rocksample[7,8], where the grid size is k=7 and the number of rocks is n=8. The number of actions is then 13: 4 directions of movement, sampling, and checking any of the 8 rocks. The total number of states is 12544 (49 grid positions × 2^8 rock-quality configurations).

3.2 Battleship

Battleship is based on a popular board game, where each player secretly places a number of ships on a grid, and the players then take turns shooting at one or more positions on the opponent's grid, receiving hit-or-miss feedback. The aim of the game is to find and sink the opponent's ships first. In our case there is one player. There is a negative reward of -1 for each time step, and a positive reward equal to the total number of positions on the grid, obtained when all the ships are sunk. This way, if the player has to fire on every position of the grid to win, the final return is 0. This challenging POMDP has approximately 10^18 states, and the number of possible actions is equal to the number of grid positions minus the ones already shot at (shooting twice at the same position is not allowed, and would not make sense anyway).

3.3 Pocman

Pocman is a partially observable version of the popular arcade game Pacman, introduced by (4). In the original game the agent, Pacman, moves in a gridworld maze, collecting food while being chased by ghosts. The game terminates when Pacman collects all the food or is caught by a ghost (multiple lives are usually available, but this is outside our scope). In the partially observable version we are, in a sense, playing Pacman from the point of view of Pacman itself: we cannot observe the whole maze, along with the ghost and food positions.
Instead, Pocman receives 10 observation bits corresponding to his senses: four observation bits for sight, indicating whether he can see a ghost in each of the four cardinal directions; one observation bit indicating whether he can hear a ghost, which is possible when the Manhattan distance between Pocman and the ghost is 2 or less; four observation bits indicating whether he can feel a wall in each of the four directions; and, finally, one observation bit for smelling food, indicating the existence of food in the adjacent (diagonally and cardinally) grid positions. The number of states is approximately 10^56, there are 4 actions, and there are 1024 observations.
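As an illustration of this observation space, here is a small Python sketch that packs such a 10-bit observation into a single integer in [0, 1024); the predicate inputs are hypothetical and would be computed elsewhere from the true game state.

def encode_pocman_observation(see_ghost, feel_wall, hear_ghost, smell_food):
    # see_ghost, feel_wall: length-4 booleans (one per direction);
    # hear_ghost, smell_food: single booleans.
    # The packing order is arbitrary here; 2^10 = 1024 matches the observation
    # count given above.
    bits = list(see_ghost) + list(feel_wall) + [hear_ghost, smell_food]
    code = 0
    for b in bits:
        code = (code << 1) | int(bool(b))
    return code

# Example: a ghost visible to the north, walls to the east and west.
obs = encode_pocman_observation([True, False, False, False],
                                [False, True, False, True],
                                hear_ghost=False, smell_food=False)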

4 Methods

The common core idea is to provide the POMCP algorithm with q-values obtained offline by solving the underlying MDP of each POMDP. These q-values are used by the POMDP solver to estimate the expected return of states. In the original algorithm this estimation is done by performing rollouts. The q-values of the MDP provide an alternative estimator, which differs in its statistical properties and computational demands. Less computational time spent estimating the leaf states of the search tree means more time to expand the tree. To capture that, we slightly modified the original algorithm from performing a set number of simulations per turn (and thus creating search trees with a set number of nodes) to expanding the search tree for a certain amount of time. The differences between the three problems pose different constraints on the implementation of this idea, from the solution of the MDP to the way the supplied q-values are processed. In this section we delve into these differences and present and justify the design decisions made.

4.1 Rocksample

Rocksample, with its 12544 states, allows for an exact solution of its MDP. Let us define the MDP first. Given perfect information, the agent knows beforehand not only where the rocks are, but also which rocks are good and which are not. The planning task is thus to get to the good rocks as fast as possible, take samples, and exit the grid to the east. The actions that check rocks are not useful in this domain, since their sole purpose is to acquire observations about rock quality, which is already given. We solve this particularly simple MDP exactly using value iteration.

The q-values obtained from the MDP, though, clearly tend to overestimate the return obtainable in the POMDP. This holds for the MDP heuristic in general, but specific properties of the rocksample task that exacerbate this issue should be mentioned. Firstly, there are actions (namely the check actions) that do not alter the environment, but only reduce the uncertainty about the state. Reasonably, the check actions are not chosen by the MDP solver, but they are chosen by the POMDP solver. Even without an explicit cost, taking them reduces the total return by increasing the number of moves and, as a result, the exponent of the discount. Furthermore, the MDP solver assumes devotion to a single plan, spanning to the terminal state, that takes the robot to all the good rocks. Possible changes of plan along the way add to the number of moves and result in lower return. Finally, in the POMDP setting there is of course the possibility of sampling a bad rock and receiving a significant negative reward as a result, a possibility that does not concern an agent following an optimal policy in the MDP setting.

This systematic overestimation of the q-values can be addressed in different ways, and dealing with it effectively can result in a performance improvement. Several approaches were tried. We present results obtained with the unprocessed MDP values; with the MDP values divided by an arbitrary factor of 5, to showcase this idea of consistent overestimation; and, finally, with a heuristic where the MDP values are obtained by solving the MDP with the POMDP's discount (0.95) applied twice per step. This is equivalent to taking an action every two time steps, while the discount keeps reducing the accumulated rewards.
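As an illustration of the offline step, here is a minimal, generic value-iteration sketch in Python/NumPy; the tabular arrays P and R and the parameter values are assumptions about the interface, not the actual code used for rocksample. The resulting table of q-values (possibly rescaled, as discussed above) is what the online solver looks up.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    # P: transition tensor of shape (A, S, S), P[a, s, s'] = Pr(s' | s, a).
    # R: expected immediate rewards of shape (A, S).
    # Returns V (shape (S,)) and the q-values Q (shape (A, S)) of the optimal policy.
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * (P @ V)        # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] V[s']
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new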
4.2 Battleship

Battleship has more states than rocksample, and we would not be able to conventionally solve the MDP and store all the q-values in some data structure. Taking a closer look at the underlying MDP, though, makes it obvious that we do not need to store MDP values. If we assume perfect information for this task, we end up with a problem where we know where the opponent's ships are located, and we just have to shoot at every grid position they occupy, except the ones we have already shot at. The optimal return is then equal to the size of the grid minus the number of ship positions not already shot at. These values are readily available to the algorithm and are used directly.
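A small Python sketch of this closed-form value follows; the argument names are hypothetical, with ship_cells and cells_shot standing for the sets of occupied and already-fired-at grid positions.

def battleship_mdp_value(grid_size, ship_cells, cells_shot):
    # Under perfect information the optimal policy fires only at occupied cells
    # that have not been hit yet: each such shot costs -1, and sinking every
    # ship yields a terminal reward equal to the number of grid positions.
    remaining = len(set(ship_cells) - set(cells_shot))
    return grid_size - remaining

# Example with arbitrary numbers: 100 grid positions, 17 occupied cells,
# 3 of them already hit -> value 86.
value = battleship_mdp_value(100, set(range(17)), {0, 1, 2})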

4.3 Pocman

The MDP underlying the Pocman task is the well-known Pacman video game. Perfect information means we get to see the agent's position in the maze, the whole extent of the maze, the ghosts' positions, as well as the food pellets. Pocman has approximately 10^56 states, and solving the MDP exactly is infeasible. A reasonable approach is to solve the MDP approximately, using Q-learning with a neural network, and then, at execution time (online), to estimate the q-value of a state by performing a forward pass of that state through the neural network.

Deciding on a neural network architecture that is powerful enough to provide good estimates without demanding significant online computation is a challenge, and the implementation effort is also not trivial. As a first take on this problem, we settled on a simple neural network, accepting as input a preprocessed version of the task state based on a set of features. After training the network, its outputs for all possible (encoded) inputs are calculated and stored in a table, replicating the tabular approach used in rocksample, with the difference that the values are approximations and not the expected returns under optimal play. This part of the work is still in progress, with different methods for approximately solving the MDP and for integrating the result with the POMDP solver still to be examined.
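A heavily hedged Python/PyTorch sketch of this tabulation step is given below; the feature size, the architecture, and the omission of the training loop are all assumptions for illustration, not the network used in the experiments.

import torch
import torch.nn as nn

N_FEATURES = 16   # size of the hand-crafted state encoding (assumed)
N_ACTIONS = 4     # Pocman's four movement actions

class QNet(nn.Module):
    # Small MLP mapping an encoded state to one q-value per action.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS),
        )

    def forward(self, x):
        return self.net(x)

def tabulate_state_values(qnet, all_encoded_states):
    # After (Q-learning style) training, evaluate the network once for every
    # encoded input and keep max_a Q(s, a) as the state-value estimate,
    # mirroring the offline table used for rocksample.
    with torch.no_grad():
        q = qnet(all_encoded_states)      # shape (num_states, N_ACTIONS)
        return q.max(dim=1).values        # shape (num_states,)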

5 Experiments and Results

5.1 Experimental Setup

The different methods are tested with a set time limit for computation per move. Because of that, and because of hardware differences, for the comparison with (4) we repeat their experiments and report the values we obtained. We run the POMCP algorithm without domain-specific knowledge other than the set of legal moves, except when stated otherwise. The performance of each method is in most cases evaluated by the average discounted return, obtained over a number of episodes (250-500), run with a variety of time limits.

5.2 Rocksample

For rocksample we want to compare the use of rollouts with the q-values of the MDP heuristic. Our comparison is based on the average discounted return. A total of 250 episodes is run to evaluate each method, with the time limit set to 5 values ranging from 0.032 to 0.512 seconds, doubling the allowed time at every step. The discounted return is averaged and presented in Figure 1. In Figure 1 we can see that the MDP heuristic outperforms simple rollouts, and that refining it further can increase performance even more. The fact that naively dividing the MDP values by 5 performs better than increasing the discount indicates that there might be even better ways of preprocessing the q-values, but optimising this choice is not the aim of this work.

Figure 1: Rocksample, average discounted return

5.3 Battleship

For battleship, no variants of the MDP heuristic are used, since the unprocessed values already outperformed POMCP with rollouts significantly. A total of 500 episodes is run for every method, with 10 different time limits used to showcase the expected increase in performance and to allow comparison between the performance of different methods under different time limits. Even allowing domain knowledge to the POMCP solver (the preferred-actions knowledge setting), the MDP q-values resulted in significantly higher returns. To push that result further, we scaled the problem up by increasing the size of the grid and the number of ships. The largest case tackles a problem with a 20x20 grid and 7 ships. To our knowledge, the battleship problem has not been addressed at that scale before. Since battleship does not use discounting, the results represent undiscounted return. The time limits for computational time per move vary from 0.0001 to 0.0512 seconds. The observed increase in performance obtained by using the MDP heuristic is remarkable, and probably reflects properties of the problem itself; we attempt to give an explanation in the next section. Still, the beneficial effect of using the heuristic in this domain is beyond doubt.

Figure 2: Battleship (10,5), average discounted return vs. allowed search time

Figure 3: Battleship (20,7), average discounted return vs. allowed search time

Figure 4: Pocman, average return

5.4 Pocman

For Pocman, the two methods, rollouts and the MDP heuristic, are compared in terms of undiscounted and discounted return. 250 episodes are run to evaluate each method, with time limits varying as in rocksample, between 0.032 and 0.512 seconds. From Figure 4 we can conclude that the MDP heuristic offers an advantage over the default algorithm, despite the very simplistic approximation of the MDP solution. The small difference in discounted return can be explained by the relatively long episodes and the significant discount factor (0.95). We assume that for similar reasons the authors of (4) report the undiscounted return in their comparisons.

6 Conclusions and future work

Examining the results presented above, we can draw some important conclusions. Firstly, the MDP heuristic can be used in conjunction with the POMCP algorithm, replacing the need for other types of domain knowledge. Solving the underlying MDP is a well-posed problem, with a plethora of tools available to address it, in contrast with the domain knowledge used in the preferred-actions setting of POMCP, which cannot be formalised in a consistent way across domains.

In battleship, where the original POMCP is greatly outperformed, the MDP heuristic shows a huge margin of improvement over the original findings. In the larger domain, even with domain knowledge and 512 times more time to build the search tree, POMCP still accumulates less reward. It should be noted that in (4), in battleship, the authors reported only a slight increase in performance over the baseline method they used. This is an indication that the problem poses a special challenge for the original POMCP algorithm, a challenge that is overcome by introducing the MDP heuristic. Our estimate is that this is caused by the structure of the reward function, which does not guide the UCT search effectively, as well as by the high variance of the rollouts. In combination, these leave the POMCP algorithm without a robust way of evaluating states. The q-values seem to cover this gap effectively, even though they grossly overestimate the expected return. More work should be done to back up this conclusion.

For Pocman, as already stated, the values passed to the POMCP solver are rough approximations, and a more careful implementation might prove beneficial. Exploring this prospect, pushing the MDP heuristic to its limits, and experimentally showing the tradeoff between estimation precision and computational demands seems a natural extension of the work presented here.

References

[1] M. T. J. Spaan and N. Vlassis, "Perseus: Randomized point-based value iteration for POMDPs," Journal of Artificial Intelligence Research, vol. 24, pp. 195-220, 2005.

[2] H. Kurniawati, D. Hsu, and W. S. Lee, "SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces," 2008.

[3] J. Pineau, G. Gordon, and S. Thrun, "Point-based value iteration: An anytime algorithm for POMDPs," in International Joint Conference on Artificial Intelligence (IJCAI), pp. 1025-1032, August 2003.

[4] D. Silver and J. Veness, "Monte-Carlo planning in large POMDPs," 2010. http://papers.nips.cc/paper/4031-monte-carlo-planning-in-large-p

[5] S. Ross, J. Pineau, S. Paquet, and B. Chaib-draa, "Online planning algorithms for POMDPs." http://arxiv.org/abs/1401.3436

[6] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling, "Learning policies for partially observable environments: Scaling up," 1995. https://www.researchgate.net/publication/2598543_Learning_policies_for_partially_observable_environments_Scaling_up