# Tech Report CS-07-05

## Adapting to a Stochastically Changing Environment: The Dynamic Multi-Armed Bandits Problem

### Aleksandrs Slivkins and Eli Upfal

#### May 2007

### Abstract:

We introduce and study a *dynamic* version of the multi-armed bandit problem. In a multi-armed bandit problem there are k reward distribution functions associated with the rewards of playing each of k strategies (slot machine arms). The distributions are initially unknown to the player. The player iteratively plays one strategy per round, observes the associated reward, and decides on the strategy for the next iteration. The goal of an algorithm is to maximize reward by balancing the use of acquired information (exploitation) with learning new information (exploration).

In the basic (``static'') multi-armed bandit problem the reward functions are time-invariant. An algorithm can eventually obtain (almost) perfect information of the reward functions and can adapt its strategy so its reward asymptotically converges to that of an optimal strategy. The quality of an algorithm is therefore measured by the expected cost (or ``regret'') incurred during an initial finite time interval. In the *dynamic* multi-armed problem the reward functions stochastically and gradually change in time. In this setting a player has to continuously balance explore and exploit steps to adapt to the dynamic reward functions. In analyzing algorithms for the dynamic case we consider an infinite time horizon and the goal here is to minimize the average cost per step, where the cost per step is the difference between the reward of the algorithm and the reward of an (hypothetical) algorithm that at every step plays the arm with the maximum expectation at that step (this definition is stronger than the standard regret function which compares the reward of an algorithm to the reward of the best *time-invariant* strategy). Our goal here is to characterize the steady state cost of learning and adapting to changing reward functions, as a function of the stochastic rate of the change.

The dynamic multi-armed bandit problem models a variety of practical online, game-against- nature type optimization settings. While building on the work done for the static and adversarial versions of the problem, algorithms and analysisfor the dynamic setting require a novel approach based on different stochastic tools.

(complete text in pdf)