Classically, the definition of a reinforcement learning algorithm requires the choice of two basic components:
- a technique for computing and storing the estimated value of each
state-action pair;
- a rule for selecting actions.
The problem of action selection is not trivial: always choosing the
action currently estimated to be best often makes the learner converge
to a sub-optimal policy. The environment must be explored by sometimes
performing actions estimated to be non-optimal. There is thus a constant
dilemma between two contradictory objectives:
- EXPLOITATION (of past experience), that is, maximization of the expected
reward, knowing that this cannot be done with certainty because we only
have estimates of some variables;
- EXPLORATION of the environment, in order to make the estimates more
accurate or up-to-date, by choosing the least-known actions.
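The simplest illustration of this trade-off is the classical epsilon-greedy rule (a standard textbook device, not the technique proposed in the thesis): with a small probability the learner deliberately picks an action other than the estimated best one. A minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Exploit the current value estimates with probability 1 - epsilon,
    otherwise explore by picking an action uniformly at random."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # EXPLORATION
    return max(range(len(q_values)), key=lambda a: q_values[a])   # EXPLOITATION
```

With epsilon = 0 this degenerates into the pure greedy rule whose pitfalls are described above.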
In my thesis, I proposed a technique for dealing with the exploration vs. exploitation (X/X) dilemma during the adaptive optimization of stationary Markovian decision processes (MDPs). This problem has already been addressed by many researchers in adaptive control and reinforcement learning. However, efficient solutions seemed to be lacking for multi-state decision models such as MDPs, as opposed to bandit problems, which represent the single-state case.
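For the single-state bandit case, uncertainty-based selection rules are well understood. As a hedged illustration (a generic UCB-style rule, not necessarily the exact bonus form used in the thesis), each arm's value estimate can be inflated by a term that shrinks as the arm is sampled more often:

```python
import math

def ucb_action(counts, means, t, c=2.0):
    """UCB-style bandit rule: estimated mean plus an uncertainty bonus
    that decreases with the number of times the arm has been pulled."""
    def score(a):
        if counts[a] == 0:
            return float('inf')  # force each untried arm to be sampled once
        return means[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(range(len(means)), key=score)
```

The difficulty addressed in the thesis is precisely that such local rules do not transfer directly to the multi-state case.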
This technique can be implemented both in direct (model-free) algorithms such as Watkins' Q-learning and in indirect (model-based) algorithms such as adaptive dynamic programming. It is based on two principles:
- define local measures of uncertainty in the form of EXPLORATION
BONUSES, using the theory of bandit problems (following Leslie's
work);
- treat these bonuses as rewards and back-propagate them through the
DP or temporal-difference (TD) mechanism.
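The second principle can be sketched in a few lines of Python. The names and the simple count-based bonus below are illustrative assumptions (the thesis's exact bonus form, derived from bandit theory, is not reproduced here); what matters is that the bonus is added to the immediate reward inside the TD update, so the ordinary Q-learning machinery back-propagates it like any other reward:

```python
import math
from collections import defaultdict

def q_update(Q, counts, s, a, r, s_next, actions, alpha=0.1, gamma=0.95, beta=0.5):
    """One Q-learning step in which a count-based exploration bonus is
    assimilated to reward, hence back-propagated by the TD mechanism."""
    counts[(s, a)] += 1
    bonus = beta / math.sqrt(counts[(s, a)])      # uncertainty shrinks with experience
    target = (r + bonus) + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```

Because the bonus enters the TD target, states leading to poorly-known regions acquire inflated values themselves, which draws the learner toward those regions.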
Using local measures of uncertainty is convenient but dangerous: without
care, it leads to algorithms that can be misled by particular structures
of the environment. This is what happens in most current applications.
The back-propagation of exploration bonuses is an elegant way to avoid this drawback. It was already used by Sutton (1990) to deal with non-stationarity in an indirect algorithm, and mentioned by Thrun (1992a, 1992b). I had to adapt it to the stationary Markovian case and to purely direct algorithms such as Q-learning.
Extensive numerical simulations have been run and have shown the effectiveness of this technique.