Tech Report CS-96-11

Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Algorithms

Csaba Szepesv\'ari and Michael L. Littman

November 1996


The problem of maximizing the expected total discounted reward in a completely observable Markovian environment, i.e., a Markov decision process (MDP), models a particular class of sequential decision problems. Algorithms have been developed for making optimal decisions in MDPs given either an MDP specification or the opportunity to interact with the MDP over time. Recently, other sequential decision-making problems have been studied prompting the development of new algorithms and analyses. We describe a new generalized model that subsumes MDPs as well as many of the recent variations. We prove some basic results concerning this model and develop generalizations of value iteration, policy iteration, model-based reinforcement-learning, and Q-learning that can be used to make optimal decisions in the generalized model under various assumptions. Applications of the theory to particular models are described, including risk-averse MDPs, exploration-sensitive MDPs, sarsa, Q-learning with spreading, two-player games, and approximate max picking via sampling. Central to the results are the contraction property of the value operator and a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.

(complete text in pdf or gzipped postscript)