POMDP Learning using Qualitative Belief Spaces

Bruce D'Ambrosio
Computer Science Dept., Oregon State University
Corvallis, OR 97331-3202
dambrosi@research.cs.orst.edu

Abstract

We present Κ-abstraction as a method for automatically generating small discrete belief spaces for partially observable Markov decision problems (POMDPs). This permits direct application of existing reinforcement learning methods to POMDPs. We show results from applying these methods to a 256-state POMDP, and discuss the types of problems for which the method is suitable.

Topic: Algorithms and Architectures

Introduction

Many ongoing problems, such as monitoring and repair of on-line systems, are naturally formulated as partially observable Markov decision problems (POMDPs). Informally, a Markov decision problem (MDP) model includes a state space model; an action model; a transition model describing how the state space evolves in response to actions; a reward model reflecting (typically) the cost of actions and the rewards (or costs) associated with various states; and a performance model describing how agent performance is to be scored. A partially observable MDP (POMDP) is one in which we assume the agent does not have direct access to the current state, but has only limited (perhaps noisy) evidence (see [Cassandra et al, 94] for a good overview of the POMDP problem and solution methods). Unfortunately, exact solution remains an elusive goal for all but the most trivial POMDPs [Littman et al, 95], and even approximation methods have achieved only limited success [Parr & Russell, 95].

In general, the solution to a Markov decision problem is a policy, a mapping from each state in the state space to the optimal action given that state. The policy can be represented as a valuation across system states, so solution methods often compute this value function rather than the policy mapping. For a POMDP, either of these is of limited use, since the agent generally does not know the current state precisely. A standard approach is to transform the problem into that of finding a policy which takes as its argument the belief state of the agent, rather than the actual system state. Under the assumption that the agent uses appropriate Bayesian belief updating procedures, it can be shown that an optimal policy for the (presumably fully observable) belief state is also optimal with respect to the underlying POMDP. Unfortunately, even when the underlying state space is discrete, the belief state is continuous. As a result, it would seem that methods developed for solving discrete-state MDPs would not apply to POMDPs.

It has been shown that the value function for a POMDP must be piecewise linear and convex. As a result, many POMDP solution methods, exact and approximate, have focused on building representations of the value function as the max of a set of planes in belief × value space (however, see [Singh et al, 94] for an exception). This representation has the advantage that it can approximate the exact solution arbitrarily closely (and in some cases can represent the optimal value function exactly). However, it suffers from two disadvantages. First, the number of planes needed tends to grow very rapidly, restricting the approach to very small problems. Second, the size of each vector is linear in the size of the state space. This is a severe limitation in many problems.
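As background for what follows, the Bayesian belief update that underlies this transformation is $b'(s') \propto P(o \mid s', a) \sum_s P(s' \mid s, a)\, b(s)$. The sketch below is a minimal illustration of that update for a discrete POMDP, not code from the paper; the table layouts T and O and all names are assumptions.

```python
# Minimal sketch of the standard Bayesian belief update that turns a POMDP
# into a belief-state MDP (illustrative, not from the paper).
# Assumed layouts: T[a][s][s2] = P(s2 | s, a), O[a][s2][o] = P(o | s2, a).

def belief_update(belief, action, obs, T, O):
    """Return the posterior belief after taking `action` and observing `obs`."""
    n = len(belief)
    new_belief = [
        O[action][s2][obs] * sum(T[action][s][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(new_belief)  # normalizing constant P(obs | belief, action)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [p / total for p in new_belief]
```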

We are investigating an alternate approach, in which we compute a discrete approximation to the belief space and then use standard reinforcement learning methods to compute optimal Q-values with respect to the discretized state space (a Q-value is a function mapping from belief × action to value). The primitive element of our representation is a Κ-abstracted belief state. While the representation for a single plane in piecewise-linear-convex value-function methods grows linearly with state-space size, Κ-abstracted belief-state representations grow only with the log of the state-space size. We have successfully applied this abstraction method to a small test problem (256 states) which is, to our knowledge, the largest POMDP solved to date. In the remainder of this paper we first describe our algorithm, Κ-RL. Next, we present an experimental domain we have been studying, that of on-line maintenance, and the results of some experiments applying Κ-RL in this domain. We close with a discussion of the results and a review of related work.

The Κ-RL algorithm

Kappa abstraction of belief states

The Kappa calculus [Goldszmidt, 95] is a qualitative representation of probability in which a probability is represented as $\kappa(p) = \log_{\epsilon} p$. Several elegant algorithms have been developed based on this representation, algorithms which we believe have applicability to the POMDP policy-computation problem. In this work, however, we merely use the representation itself as an intermediate stage in our qualitative abstraction process.

A belief state is a probability distribution over the underlying state space. Assuming states are ordered by decreasing probability, we define a Kappa abstraction of a belief state as

$$K_k = \sum_{\substack{s_i \in S \\ \kappa(p(s_i)) \le k}} i \, |S|^{\kappa(p(s_i))}$$

That is, an abstraction to a given Κ level is an integer whose value depends only on the ordering of states down to belief level Κ = k. The transformation of a probability to a Κ value requires an epsilon to serve as the base of the logarithm. We use 0.95 times the posterior probability of the highest-probability state as our epsilon; that is, the epsilon used is different at each time step.

As an example, consider a system with 4 states (s0-s3) and an agent belief state of [.125, .5, .25, .125], that is, p(s0) = .125, p(s1) = .5, and so on. Then epsilon is .475, and

K0 = 1
K1 = 1·1 + 2·4 = 9
K2 = 1·1 + 2·4 + 3·16 + 0·16 = 57

Alternate encodings could, of course, be developed. In fact, the above encoding unnecessarily collapses non-equivalent states, and perhaps makes some unnecessary distinctions. We will discuss these points later. Under this definition, Kappa abstraction maps a belief state into an integer in the range

$$0 \le K_k < |S|^{k+1}$$

Since the abstraction is discrete, we can directly apply standard reinforcement learning methods, such as Q-learning, to this representation.

There are two issues worth discussing before we present experimental results. First, how many Kappa states are there? Second, the computation of the belief state from observation data can be complex; is a method that relies on this computation feasible? The size of the Kappa-abstracted belief space is quite large; therefore, the proposed abstraction does not lend itself to MDP solution techniques which require explicit representation of the entire state space. Luckily, there exists a class of methods, namely reinforcement learning methods, which only require that we explicitly represent those states actually encountered in the solution process.
We have, at this time, no strong theoretical argument for why the number of Κ-states encountered should be small, even for relatively small problems. However, we will show experimental evidence that it remains quite small for surprisingly large problems (e.g., a POMDP with a state space of 256 elements). We will discuss later the circumstances under which we believe this is likely to occur.
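As a concrete illustration (not the authors' implementation), the following sketch computes the Κ-abstraction integer for a dense belief vector, assuming κ(p) is taken as the integer part of log_ε p; names are illustrative, and the script reproduces the worked example above.

```python
# Illustrative sketch of the kappa abstraction described above (assumed
# details: dense belief vector over a small state space, floor of log_eps p).
import math

def kappa_levels(belief):
    """Map each state's probability to an integer kappa level.

    epsilon is 0.95 * the highest posterior probability, as in the paper,
    so the most probable state is always at level 0.
    """
    eps = 0.95 * max(belief)
    return [math.floor(math.log(p) / math.log(eps)) if p > 0 else None
            for p in belief]

def kappa_abstraction(belief, k):
    """Encode a belief state as the single integer K_k.

    Each state s_i with kappa level <= k contributes i * |S|**level, so the
    integer depends only on which states rank above the cutoff and at what
    qualitative level.
    """
    n = len(belief)
    levels = kappa_levels(belief)
    return sum(i * n ** lvl
               for i, lvl in enumerate(levels)
               if lvl is not None and lvl <= k)

# Reproduces the worked example: belief [.125, .5, .25, .125] over s0..s3.
belief = [0.125, 0.5, 0.25, 0.125]
print([kappa_abstraction(belief, k) for k in range(3)])  # -> [1, 9, 57]
```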

We do have strong reasons for believing that the observation-to-Κ-abstraction transformation is tractable, at least in our domain of interest. For our domain we use a belief net model, which decomposes the overall state space into a set of component states and adds a set of probabilistic relationships among component states, other aspects of internal system state, and observables. Given this model, incremental inference procedures exist (e.g., IPI [D'Ambrosio, 93]) which can enumerate state-space elements in decreasing probability order. The time to obtain each state-space element is linear in the number of components, or logarithmic in the state-space size. Thus, the overall time to compute a Κ-abstraction is $n \log(|S|)$, where n is the number of states with Κ level k or lower. We refer to the worst-case n for a problem as the confusability of a POMDP. As n approaches 1, the problem looks more and more like a fully observable MDP.

Computing the projected belief state

We use the Κ-abstraction of the current belief state as our Q-value table index. We must also, however, compute an updated prior for the next stage, based on the current beliefs, the selected action, and the state transition matrix. Again, this computation could easily grow intractable as the state-space size grows. We apply the same techniques: we use a belief net model of state transition, and use the same incremental computation methods to compute an approximation of the projected belief. We use the same Κ methods for ranking and selecting which state-space elements to keep in our representation, but we keep the actual probabilities of these states, rather than Kappa-abstracting them. Since we will only be maintaining one Κ belief state in memory at a time, there is no particular advantage to performing the full Κ abstraction on the projected belief. Note that this abstraction process essentially assigns zero probability to most elements of the state space. Assuming again that the projected belief state is [.125, .5, .25, .125], we compute the following Κ-projected belief states:

Κ0 = [δ, 1.0 - δ, δ, δ]
Κ1 = [δ, .66, .33, δ]
Κ2 = [.125, .5, .25, .125]

Notice that, in order to allow for the possibility that the actual system state is one of the states below the Kappa cutoff, we insert small non-zero probabilities for all states below the cutoff. Again, we only explicitly represent those states with belief above the cutoff. As long as the number of such states is small, computation is fast and the representation size remains tractable. We refer to the Κ threshold used in table lookup as Κt, and the threshold used in computing the projection as Κp.

We use a standard reinforcement learning algorithm, Q-learning, with this representation. We use Q-learning, rather than a possibly more efficient learning of the value function, because of the difficulty of using a value-function representation at run time: in our domain the simple one-step lookahead required to compute the optimal act, given a state and a value function, is quite costly (using a value table requires a one-step look-ahead at run time, and that look-ahead requires taking an expectation over possible observations, i.e., performing the observation-to-belief transformation for every possible next observation). The basic update function is then

$$Q(b_k, a) \leftarrow (1 - \alpha)\, Q(b_k, a) + \alpha \bigl( R + \gamma \max_{a'} Q(b'_k, a') \bigr)$$

Since we expected b_k to be sparsely distributed over a very large range, we use a hash table for storing Q-values.
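A minimal sketch of these two steps, reusing kappa_levels and kappa_abstraction from the earlier sketch; δ, the discount factor, and all other names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of (1) the kappa-projection, which keeps real probabilities for states
# above the cutoff and a small delta elsewhere, and (2) a standard Q-learning
# update keyed on the kappa-abstracted belief integer stored in a hash table.
# DELTA and GAMMA are illustrative values; kappa_levels / kappa_abstraction
# refer to the earlier sketch.
from collections import defaultdict

DELTA = 1e-4   # small probability assigned to states below the cutoff
GAMMA = 0.95   # discount factor (illustrative value)

def kappa_project(projected_belief, k_p):
    """Keep actual probabilities of states at kappa level <= k_p, delta elsewhere."""
    levels = kappa_levels(projected_belief)
    kept = [p if lvl is not None and lvl <= k_p else DELTA
            for p, lvl in zip(projected_belief, levels)]
    total = sum(kept)
    return [p / total for p in kept]   # renormalize

Q = defaultdict(float)   # hash table: (K_k(belief), action) -> value

def q_update(belief, action, reward, next_belief, actions, k_t, alpha):
    """One Q-learning step over the kappa-abstracted belief space."""
    b = kappa_abstraction(belief, k_t)
    b2 = kappa_abstraction(next_belief, k_t)
    best_next = max(Q[(b2, a2)] for a2 in actions)
    Q[(b, action)] = (1 - alpha) * Q[(b, action)] + alpha * (reward + GAMMA * best_next)
```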
Experiments

The Task: On-Line Maintenance

Our study problem is the in situ diagnosis and repair of simple digital systems. Diagnosis is often formulated as a static, detached process, the goal of which is the assessment of the exact (or most probable) state of some external system. In contrast, we view diagnosis as a dynamic, practical activity by an agent engaged with a changing and uncertain world. Further, we extend the task to include the repair task, to focus diagnostic activity. (We will use "system" or "agent" to refer to our diagnostic system, and "equipment" to refer to the target physical system.) Our formulation of embedded diagnosis has the following characteristics:

(1) the equipment under diagnosis continues to operate while being diagnosed, and multiple faults can occur (and can continue to occur after an initial fault is detected);
(2) the agent has only limited observational ability: it cannot directly observe component state;
(3) the agent senses equipment operation through a set of fixed sensors and one or more movable probes;
(4) there is a known fixed cost per unit time while the equipment is malfunctioning (i.e., any component is in a faulted state);
(5) action alternatives include probing test points, replacing individual components, and simply waiting for the next sense report, each action having a corresponding cost;
(6) the agent can only perform one action at a time; and
(7) the overall task is to minimize total cost over some extended time period during which several failures can be expected to occur.

We term this task the On-Line Maintenance task, and an agent intended for performing such a task an On-Line Maintenance Agent (OLMA) [D'Ambrosio, 92 & 96]. An interesting aspect of this formulation, from the perspective of the machine diagnosis community, is that diagnosis is not a direct goal. A precise diagnosis is neither always obtainable nor necessary. Indeed, it is not obvious a priori which elements of a diagnosis are even relevant to the decision at hand.

One final comment: the problem is surprisingly complex. The simple problem instance studied in this paper is well beyond the capability of current exact POMDP solution methods (the MDP state space for the simple 4-gate problem studied here has 256 states, ignoring the stochastic behavior of our model of the unknown mode!). Deterministic policies which consider only current observations perform quite poorly, failing to ever repair some faults. Stochastic policies can perform reasonably well, but that is the subject of another paper.

The four gate circuit

Our first study problem in this domain was a simple four-gate digital circuit known as a half adder. The circuit diagram and the corresponding belief network are shown in Figure 1. Each component is modeled as having four possible states: ok, stuck-at-0 (in which the output is always 0, regardless of the inputs), stuck-at-1, and unknown (in which the output is a stochastic function, independent of the inputs and uniformly distributed over {0, 1}). Since each gate has four possible states, the overall state space has 4^4 or 256 states (multiple faults are possible, so all 256 states are reachable, and even likely over the length of the long training runs we used). We used uniform failure probabilities (.002 for each failure state), chosen to produce an interesting number of failures over a reasonable-length simulated test run. The agent was given the values of I1, I2, O, and Carry as its standard observation set. Possible actions included the replacement of any one component, the probe of either P1 or P2 (in which case the respective value was added to the observation set for the next cycle), or no action. The reward was -1 for each cycle in which at least one component was faulted, plus -6 for a replacement action or -1 for a probe action.
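For concreteness, a small sketch of the component fault model just described; the gate function and the sample calls are illustrative assumptions, not the authors' simulator.

```python
# Illustrative sketch of the four-mode component fault model (ok, stuck-at-0,
# stuck-at-1, unknown). The NOR example is an assumption for concreteness.
import itertools, random

FAULT_MODES = ("ok", "stuck-at-0", "stuck-at-1", "unknown")

def gate_output(mode, nominal_fn, *inputs):
    """Output of a gate in a given fault mode."""
    if mode == "ok":
        return nominal_fn(*inputs)
    if mode == "stuck-at-0":
        return 0
    if mode == "stuck-at-1":
        return 1
    return random.randint(0, 1)   # "unknown": uniform over {0, 1}, input-independent

nor = lambda a, b: int(not (a or b))

# Four gates with four modes each -> 4**4 = 256 joint fault states.
STATES = list(itertools.product(FAULT_MODES, repeat=4))
print(len(STATES))                            # 256
print(gate_output("stuck-at-1", nor, 1, 1))   # 1, although NOR(1, 1) would be 0
```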
We trained using an initial alpha of .2, decreasing to 0 over the course of an epoch. After each epoch we ran a short evaluation run in which we re-initialized the simulation to a known good state, and measured the total reward over the evaluation run. We then reset alpha to .2 and ran another epoch (without resetting the Q-value table). The first epoch was 8000 iterations; subsequent epochs doubled the length of the previous epoch (i.e., 16000, 32000, 64000, 128000). We terminated training after the 128000-iteration epoch for Κt = 0, 1 and after 256000 for Κt = 2, 3.
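A rough sketch of this training protocol, under stated assumptions: `q_learning_step` and `evaluation_run` are hypothetical placeholders for the simulator and agent loop; only the epoch doubling and alpha decay follow the text.

```python
# Sketch of the training schedule described above (illustrative only).

def train(q_learning_step, evaluation_run, first_epoch=8000, n_epochs=5, alpha0=0.2):
    epoch_len = first_epoch
    for _ in range(n_epochs):
        for t in range(epoch_len):
            alpha = alpha0 * (1.0 - t / epoch_len)   # decay alpha from 0.2 toward 0 within the epoch
            q_learning_step(alpha)                   # one simulated cycle plus Q-table update
        evaluation_run()                             # re-initialize to a known good state and score
        epoch_len *= 2                               # 8000, 16000, 32000, 64000, 128000
```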

[Figure 1: 4 Gate Circuit and Bayes Net]

[Figure 2: Typical Cost/Failure (×60) and Table Size during training]

Figure 2 shows behavior over time for a typical training run. Cost/failure is multiplied by 60 to put it on the same scale as Q-table size. The algorithm very rapidly learns to repair single faults. As a result, the first cost/failure data point (at 8000 cycles) is already reasonably low.

Table 1 below shows the number of entries in the Q-value table (state × action) at the end of training for the twelve combinations of Κp and Κt we explored. We considered the possibility that the problem is such that the actual number of belief states visited is low. We tested this possibility by training using the exact belief state rather than the Κ-abstracted belief state (exact belief states were truncated at 5 significant digits for this comparison only, to reduce numerical stability problems). Up to a limit of 12,500 iterations (at which point the program ran out of swap space; exact belief states are large objects!) the number of revisited exact belief states remained insignificant.

Table 1 also shows our performance metric, cost/failure, for each of these combinations. These values are averages over 50-75 failures per cell. As a comparison, our best finite-lookahead on-line algorithm achieves a cost/failure of 21.0 on this same problem when it is not charged for computation time (the on-line algorithm takes about 8 seconds per decision, whereas the Κ-RL policy can be executed in under half a second, including computing the Κ-abstracted belief state and the Κ-projection for the next belief state). We see that the policy (especially when Κp = 4) significantly outperforms the on-line algorithm (see [D'Ambrosio, 96] for details on the on-line approaches we have tried). The difference between row 0 and rows 2 and 4 is statistically significant.

Kp\Kt      0            1            2            3            Row Avg.
0        3481/32.7    3810/31.3    4040/16.0    4530/27.1    3965/25.9
2        1691/19.4    2404/22.3    2902/18.3    4323/16.8    2830/18.8
4        1450/17.8    2022/17.4    2918/18.5    3722/19.4    2528/18.25
Col Avg  2207/22.6    2745/22.7    3286/17.6    4192/21.6

Table 1: Table Size / Cost-Per-Failure for the 4-gate circuit

The seven gate circuit

The confusability of the half-adder is relatively low. An examination of the circuit reveals that we have enough information to distinguish between faults occurring in the upper and lower halves of the circuit, so at most two components can be the cause of any single fault. Any Κ level that excludes multiple simultaneous faults (as do many of those we tested) will encounter a maximum of four states at Κ = 0. To make the problem more challenging, we increased the confusability index of our second test problem, a seven-gate circuit.

Unfortunately, Κ-RL is not tractable for the full seven-gate problem. The problem is not that the number of Κ-abstracted states would grow too large to represent, but rather that learning would take too long to converge. The most difficult Κ-states to learn Q-values for are the intermediate states which occur shortly after a state change is detected and before it has been isolated. These states tend to be transitory and infrequently occurring. As a result, very long training runs would be required before convergence (we estimate at least 10^8 cycles).

As a warm-up to attacking the full problem, we ran an experiment in which we restricted the simulation to producing only one fault at any one time. That is, once one component had faulted, another fault could not occur until the first was repaired. While this might seem to make the problem trivial (the single-fault state space has only 22 states), two complications make this an interesting problem. First, we constructed the experiment such that the confusability was much higher (as many as 5 gates could be confused for some single faults). Second, while the simulation was restricted to single-fault mode, the learner was not restricted to considering only single faults. This caused it to enter a subset of the multiple-fault belief space: situations occurred in which the learner was not sure it had repaired a fault; that is, a number of states other than the all-ok state were in the Κ-abstracted projected belief state. However, the fault actually had been repaired, and a second fault occurred. As far as the agent could tell, this might in fact have been a multiple-fault state.

We achieved results on this problem similar to those on the 4-gate problem. Q-table size at convergence ranged from 2800 (Κt=0, Κp=4) to 15,300 (Κt=4, Κp=4), and the performance metric (cost/failure) showed only small differences with Κt or Κp variation over the range tested (0, 2, 4). Space precludes further discussion; details are available in the full paper.

Discussion

Table size remains quite tractable for both the 4-gate and 7-gate problems. We believe further training would not significantly increase the number of states encountered: plots of table size vs. length of training clearly show asymptotic behavior, and performance over a large number of failures indicates convergence has been attained. There are two characteristics of our study domain that, together, enabled our approach. First, the domain has low confusability; that is, at any time the agent has a pretty good idea where it is in the state space. Second, the belief models are such that the high-probability states can be identified quickly.
The characteristic which enables this is the skewness of the probability distributions (see [D'Ambrosio, 93] for details). We discovered rather late several problems with our Κ-abstraction process. The current abstraction distinguishes between states with the same Κ values but different ordering. Conversely, it fails to distinguish between states with or without state-space element zero. Also, it maps 1(0)+3(0) (Κ 0) into the same value as 0(0)+2(1) (Κ 1). Presumably a better mapping would only improve the performance of the algorithm.

We do not believe that the approach as it stands scales well. As discussed earlier, the primary problem is not space, but rather training time. There are simply too many possible multiple faults to visit each enough times to assure convergence over the reachable Κ-abstracted belief space. We are exploring further abstraction methods to cope with this problem.

Related Work

Jordan and Singh [Singh et al, 94] have studied stochastic policies for POMDPs. We have not yet applied their methods to our domain, but expect that the resulting on-line performance will not match that of Κ-RL generated policies. Kaelbling et al. [Littman et al, 95] have developed exact algorithms for POMDPs. Russell and Parr [Parr & Russell, 95] have developed an elegant approximation method for compactly representing value functions. We attempted to apply their method to the 4-gate problem, but were unable to obtain convergence with up to 8 planes. We are continuing to study their approach, and may have more to report in the final version of the paper. However, we note that even 8 planes take more space than the largest Κ-abstracted Q-value table in Table 1.

Conclusion

We have presented Κ-abstraction, a method for automatically generating qualitative belief spaces for POMDPs. We have shown successful application to a 256-state problem, and discussed the problem characteristics required for successful application of the method (the key being low confusability). Finally, we have concluded that further scaling up will require even more powerful abstraction methods.

Acknowledgments

This work was done with the support of NSF grants IRI-950330 and CDA-921672. It benefited from many useful discussions with Robert Fung, Thomas Dietterich, and Prasad Tadepalli.

References

[Cassandra et al, 94] Cassandra, A., Kaelbling, L. and Littman, M. Acting Optimally in Partially Observable Stochastic Domains. Brown University Technical Report CS-94-20, 1994.

[D'Ambrosio, 92] D'Ambrosio, B. Value-driven Real-time Diagnosis. In Proceedings of the Third International Workshop on the Principles of Diagnosis, October 1992.

[D'Ambrosio, 93] D'Ambrosio, B. Incremental Probabilistic Inference. In Proceedings of the Ninth Annual Conference on Uncertainty in Artificial Intelligence, pp. 301-308, July 1993. Morgan Kaufmann.

[D'Ambrosio, 96] D'Ambrosio, B. Some Experiments with Real-time Decision Algorithms. To appear in Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence, July 1996. Morgan Kaufmann.

[Goldszmidt, 95] Goldszmidt, M. Fast Belief Updating Using Order of Magnitude Probabilities. In Uncertainty in Artificial Intelligence: Proceedings of the Eleventh Conference, pp. 208-216, July 1995. Morgan Kaufmann.

[Littman et al, 95] Littman, M., Cassandra, A. and Kaelbling, L. Learning Policies for Partially Observable Environments: Scaling Up. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 362-370, 1995.

[Parr & Russell, 95] Parr, R. and Russell, S. Approximating Optimal Policies for Partially Observable Stochastic Domains. In Proceedings of IJCAI-95, pp. 1088-1094, 1995.

[Singh et al, 94] Singh, S., Jaakkola, T. and Jordan, M. Learning without State-Estimation in Partially Observable Markovian Decision Processes. In Machine Learning: Proceedings of the Eleventh International Conference, pp. 284-292, 1994.