SOLVING POMDPS BY COMMITTING TO REACTIVE POLICIES

Nicolas Meuleau

The intuition behind this work comes from Wiering and Schmidhuber's HQ-learning (HQL) algorithm (1997). The idea is that, although the optimal policy of a partially observable Markov decision process (POMDP) is not a reactive policy (RP) (i.e. it is not a mapping from observations to actions), it can always be represented as a sequence of RPs. In other words, a POMDP may be solved by committing to the execution of a certain RP until its completion condition is satisfied, then switching to another RP until its termination condition holds, and so on. In the worst case, the RP must change at every time step.
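As a minimal sketch of this idea (assuming a generic simulator interface with reset/step methods and RPs given as observation-to-action tables; all names are illustrative, not the talk's notation):

```python
# Minimal sketch of executing a POMDP policy given as a sequence of
# reactive policies (RPs).  The environment interface and all names here
# are illustrative assumptions.

def run_rp_sequence(env, rps, completions, max_steps=1000):
    """Execute each RP until its completion condition fires, then switch.

    rps[i]         : dict mapping observation -> action (a reactive policy)
    completions[i] : function (observation, steps_in_rp) -> bool
    """
    obs = env.reset()
    total_reward, i, steps_in_rp = 0.0, 0, 0
    for _ in range(max_steps):
        action = rps[i][obs]                 # RP is purely reactive: obs -> action
        obs, reward, done = env.step(action)
        total_reward += reward
        steps_in_rp += 1
        if done:
            break
        if completions[i](obs, steps_in_rp) and i + 1 < len(rps):
            i += 1                           # commit to the next RP in the sequence
            steps_in_rp = 0
    return total_reward
```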

An important feature of such RP-sequences is the nature of the condition governing the transition from one RP to another. More precisely, two elements matter: the completion condition of each RP, and the rule used to choose the next RP when the current one completes. In the most general case, these two features depend on the complete previous history of the system, i.e. the complete sequence of observation-action pairs since time 0. However, a non-negligible class of problems is solved optimally by RP-sequences using much simpler sequencing conditions. For instance, the switch of RP may depend only on the current observation, or only on the number of time steps elapsed since the current RP was initiated.
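Under the same illustrative assumptions as above, these two simpler sequencing conditions can be written as completion predicates compatible with the execution sketch:

```python
# Two of the simpler completion conditions mentioned above, written as
# predicates compatible with the execution sketch (illustrative only).

def observation_trigger(target_obs):
    """Complete the current RP as soon as a given observation is seen."""
    return lambda obs, steps_in_rp: obs == target_obs

def timeout_trigger(horizon):
    """Complete the current RP after a fixed number of time steps."""
    return lambda obs, steps_in_rp: steps_in_rp >= horizon

# Example: run rps[0] until observation 'door' is seen, then rps[1] for 10 steps.
# completions = [observation_trigger('door'), timeout_trigger(10)]
```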

Our current research aims at developing algorithms that exploit prior knowledge of the nature of the optimal sequencing conditions (i.e. the completion conditions of the RPs and the rule for choosing the next RP) to find optimal RP-sequences. In this talk, I'll present a class of learning algorithms (including HQL) that work through a simulated-experience protocol, and whose convergence is not guaranteed. Then I'll examine the possibility of deriving an exact, convergent algorithm.
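One member of this class of algorithms can be sketched, very roughly, as Q-learning at the macro level over which RP from a fixed library to commit to next, trained on simulated episodes. This is only an illustration in the spirit of such algorithms, not HQ-learning as published, and like the algorithms discussed its convergence is not guaranteed; the interface and parameters are assumptions.

```python
import random
from collections import defaultdict

# Simplified sketch: Q-learning over "which RP to commit to next", using
# simulated episodes.  Illustrative only; not the published HQL algorithm.

def learn_rp_sequencing(env, rp_library, completions, episodes=5000,
                        alpha=0.1, gamma=0.99, epsilon=0.1, max_steps=200):
    Q = defaultdict(float)                      # Q[(obs_at_switch, rp_index)]

    def choose_rp(obs):
        if random.random() < epsilon:
            return random.randrange(len(rp_library))
        return max(range(len(rp_library)), key=lambda k: Q[(obs, k)])

    for _ in range(episodes):
        obs = env.reset()
        k = choose_rp(obs)
        switch_obs, ret, discount, steps_in_rp = obs, 0.0, 1.0, 0
        for _ in range(max_steps):
            obs, reward, done = env.step(rp_library[k][obs])
            ret += discount * reward
            discount *= gamma
            steps_in_rp += 1
            if done or completions[k](obs, steps_in_rp):
                # Semi-Markov-style update on the macro decision just completed.
                target = ret if done else ret + discount * max(
                    Q[(obs, j)] for j in range(len(rp_library)))
                Q[(switch_obs, k)] += alpha * (target - Q[(switch_obs, k)])
                if done:
                    break
                k = choose_rp(obs)
                switch_obs, ret, discount, steps_in_rp = obs, 0.0, 1.0, 0
    return Q
```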

There has recently been a large literature on hierarchical planning and so-called ``macro-actions''. In these approaches, a macro-action is defined as a local (most often reactive) policy that can be applied at any time, but only in a limited part of the state space. One could say that these macro-actions are local but permanent. Most recent work aims at developing computational tools for planning with macro-actions while preserving the Markov property. The algorithms presented here exploit a more powerful ability of macro-actions: they can establish a kind of Markov property at the macro level even when it is absent at the primitive level. These macro-actions of a new type can be applied in any state, but not at any time, since committing to the execution of a single macro-action until a certain point is central to the algorithms. One could say that they are global but temporary. Thus, the two approaches invert the roles played by time and space in the decomposition of the problem.
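The contrast can be made concrete with two illustrative data structures (the names and fields are assumptions, not a standard formalism):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set

@dataclass
class LocalPermanentMacro:
    """Classical macro-action: usable only in part of the state space,
    but available there at any time ("local but permanent")."""
    policy: Dict[Any, Any]          # local (often reactive) policy
    applicable_states: Set[Any]     # where it may be invoked

@dataclass
class GlobalTemporaryMacro:
    """RP-style macro-action: usable in any state, but once started the
    agent commits to it until its termination condition fires
    ("global but temporary")."""
    policy: Dict[Any, Any]                   # reactive policy: obs -> action
    terminated: Callable[[Any, int], bool]   # (obs, steps since start) -> bool
```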
