When you are confronted with a decision, there are a number of different alternatives (actions) to choose from. Choosing the best action requires thinking about more than just the immediate effects of your actions. The immediate effects are often easy to see, but the long-term effects are not always as transparent. Sometimes actions with poor immediate effects can have better long-term ramifications. You would like to choose the action that makes the right tradeoff between the immediate rewards and the future gains, yielding the best possible solution. People do this type of reasoning daily, and a Markov decision process (MDP) is a way to model problems so that we can automate this process.
If you can model the problem as an MDP, then there are a number of algorithms that will allow you to automatically solve the decision problem. We will first talk about the components of the model that are required. We follow this by defining what it means to solve the problem and, finally, we describe a simple algorithm for finding a solution given a model of this type.
The four components of an MDP model are: a set of states, a set of actions, the effects of the actions (the transitions), and the immediate value of the actions (the rewards). We now discuss these in more detail.
The most powerful aspect of the MDP is that the effects of an action can be probabilistic. Imagine we are specifying the effects of doing action 'a1' in state 's1'. If there were no question about how 'a1' changes the world, we could simply say that the effect of 'a1' is to leave the process in state 's2'. However, many decision processes have actions that are not this simple. Sometimes an action usually results in state 's2', but occasionally it might result in state 's3'. MDPs let you capture these more complex actions by specifying, for each action, the set of possible resulting states and the probability with which each state results.
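To make the four components concrete, here is one way they might be written down as plain Python data structures. This is only a sketch: the state and action names echo the examples above, but the particular probabilities and reward values are invented purely for illustration.

```python
# A minimal sketch of the four MDP components as plain Python data
# structures. The state and action names echo the text above; the
# specific probabilities and reward numbers are made up for illustration.

states = ['s1', 's2', 's3']      # the set of states
actions = ['a1', 'a2']           # the set of actions

# Effects of actions: for each (state, action) pair, a probability
# distribution over resulting states. Doing 'a1' in 's1' usually leads
# to 's2', but occasionally to 's3', as described above.
transitions = {
    ('s1', 'a1'): {'s2': 0.9, 's3': 0.1},
    ('s1', 'a2'): {'s1': 1.0},
    ('s2', 'a1'): {'s3': 1.0},
    ('s2', 'a2'): {'s1': 0.5, 's3': 0.5},
    ('s3', 'a1'): {'s3': 1.0},
    ('s3', 'a2'): {'s3': 1.0},
}

# Immediate value of actions: a reward for each (state, action) pair.
rewards = {
    ('s1', 'a1'): 0.0,
    ('s1', 'a2'): 1.0,
    ('s2', 'a1'): 10.0,
    ('s2', 'a2'): 0.0,
    ('s3', 'a1'): 0.0,
    ('s3', 'a2'): 0.0,
}
```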
Although the policy is what we are after, we will actually compute a value function. A value function is similar to a policy, except that instead of specifying an action for each state, it specifies a numerical value for each state. Without going into the details, suffice it to say that we can easily derive the policy once we have the value function, so we will focus on finding this value function.
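As a rough illustration of how a policy can be read off from a value function, here is a sketch that continues the Python example above. It uses the standard one-step lookahead rule and a discount factor gamma that weighs future value against immediate reward; the discount factor is part of the usual formulation but is not discussed in this section, so treat it as an assumption.

```python
# Deriving a policy from a value function, continuing the sketch above
# (uses the states, actions, transitions and rewards defined there).
# The discount factor gamma is an assumption from the standard
# formulation; it is not covered in this section.

gamma = 0.95

def greedy_policy(value_function):
    """For each state, pick the action whose immediate reward plus
    expected (discounted) value of the resulting state is largest."""
    policy = {}
    for s in states:
        best_action, best_score = None, float('-inf')
        for a in actions:
            expected_next = sum(p * value_function[s_next]
                                for s_next, p in transitions[(s, a)].items())
            score = rewards[(s, a)] + gamma * expected_next
            if score > best_score:
                best_action, best_score = a, score
        policy[s] = best_action
    return policy

# With some value function in hand (the numbers here are made up),
# extracting a policy is a single sweep over the states.
some_value_function = {'s1': 2.0, 's2': 11.0, 's3': 0.0}
print(greedy_policy(some_value_function))
```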
When we talked about the transitions of the model, we said that we simply needed to specify, for each starting state and action, the possible resulting states and their probabilities. This assumes that the next state depends only upon the current state (and action). There are situations where the effects of an action might depend not only on the current state, but also upon the last few states. The MDP model will not let you model these situations directly. The assumption made by the MDP model is that the next state is solely determined by the current state (and current action). This is called the Markov assumption.
We would say that the dynamics of the process are Markovian, and this has important ramifications for solving the problem. The most important is that our policies and value functions can have a simple form: we can choose our action based solely upon the current state, without any memory of how we got there.
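Continuing the hypothetical Python sketch above, this is what that simple form amounts to in practice: a policy can be nothing more than a table from states to actions, consulted using only the current state.

```python
# Because the dynamics are Markovian, a policy never needs to look at
# the history of the process: a plain table mapping each state to an
# action is enough. The particular choices below are illustrative only.

policy = {'s1': 'a1', 's2': 'a1', 's3': 'a2'}

def choose_action(current_state):
    """Act using only the current state; no record of past states or
    past actions is required."""
    return policy[current_state]

print(choose_action('s2'))   # -> 'a1'
```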