Classically, the definition of a reinforcement learning algorithm requires the choice of two basic components:
- a technique for computing and storing the estimated value of each
state-action pair;
- a rule for selecting actions.
The problem of action selection is not trivial: always choosing the
action currently estimated to be best often makes the learner converge
to a sub-optimal policy. The environment must be explored by sometimes
performing actions estimated to be non-optimal. There is thus a constant
dilemma between two contradictory objectives:
- EXPLOITATION (of past experience), that is, maximization of the expected
reward, knowing that this cannot be done with certainty because we only
have estimates of some variables;
- EXPLORATION of the environment, in order to make the estimates more
accurate or up-to-date, by choosing the least-known actions.
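The simplest illustration of this trade-off is the classical epsilon-greedy rule (a standard textbook device, not the technique proposed in the thesis): with a small probability the learner deliberately picks an action other than the estimated best one. A minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Exploit the current value estimates with probability 1 - epsilon,
    otherwise explore by picking an action uniformly at random."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # EXPLORATION
    return max(range(len(q_values)), key=lambda a: q_values[a])   # EXPLOITATION
```

With epsilon = 0 this degenerates into the pure greedy rule whose pitfalls are described above.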
In my thesis, I proposed a technique for dealing with the exploration vs. exploitation (X/X) dilemma during the adaptive optimization of stationary Markovian decision processes (MDPs). This problem has already been addressed by many researchers in adaptive control and reinforcement learning. However, efficient solutions seemed to be lacking for multi-state decision models such as MDPs, as opposed to bandit problems, which represent the single-state case.
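For the single-state bandit case, uncertainty-based selection rules are well understood. As a hedged illustration (a generic UCB-style rule, not necessarily the exact bonus form used in the thesis), each arm's value estimate can be inflated by a term that shrinks as the arm is sampled more often:

```python
import math

def ucb_action(counts, means, t, c=2.0):
    """UCB-style bandit rule: estimated mean plus an uncertainty bonus
    that decreases with the number of times the arm has been pulled."""
    def score(a):
        if counts[a] == 0:
            return float('inf')  # force each untried arm to be sampled once
        return means[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(range(len(means)), key=score)
```

The difficulty addressed in the thesis is precisely that such local rules do not transfer directly to the multi-state case.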
This technique can be implemented both in direct (model-free) algorithms such as Watkins' Q-learning and in indirect (model-based) algorithms such as adaptive dynamic programming. It is based on two principles:
- define local measures of uncertainty in the form of EXPLORATION
BONUSES, using the theory of bandit problems (following Leslie's
work);
- treat these bonuses as rewards and back-propagate them through the
DP or temporal-difference (TD) mechanism.
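The second principle can be sketched in a few lines of Python. The names and the simple count-based bonus below are illustrative assumptions (the thesis's exact bonus form, derived from bandit theory, is not reproduced here); what matters is that the bonus is added to the immediate reward inside the TD update, so the ordinary Q-learning machinery back-propagates it like any other reward:

```python
import math
from collections import defaultdict

def q_update(Q, counts, s, a, r, s_next, actions, alpha=0.1, gamma=0.95, beta=0.5):
    """One Q-learning step in which a count-based exploration bonus is
    assimilated to reward, hence back-propagated by the TD mechanism."""
    counts[(s, a)] += 1
    bonus = beta / math.sqrt(counts[(s, a)])      # uncertainty shrinks with experience
    target = (r + bonus) + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```

Because the bonus enters the TD target, states leading to poorly-known regions acquire inflated values themselves, which draws the learner toward those regions.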
Using local measures of uncertainty is convenient but dangerous: without
care, it leads to algorithms that can be misled by particular structures
of the environment. This is what happens in most current applications.
The back-propagation of exploration bonuses is an elegant way to avoid this drawback. It was already used by Sutton (1990) to deal with non-stationarity in an indirect algorithm, and mentioned by Thrun (1992a, 1992b). I had to adapt it to the stationary Markovian case and to purely direct algorithms such as Q-learning.
Extensive numerical simulations have been run and have shown the effectiveness of this technique.