When you are confronted with a decision, there are a number of different alternatives (actions) to choose from. Choosing the best action requires thinking about more than just the immediate effects of your actions. The immediate effects are often easy to see, but the long-term effects are not always as transparent. Sometimes an action with poor immediate effects can have better long-term ramifications. You would like to choose the action that makes the right tradeoff between the immediate rewards and the future gains, to yield the best possible solution. People do this type of reasoning daily, and a Markov decision process is a way to model problems so that we can automate this process.

If you can model the problem as an `MDP`, then there are a
number of algorithms that will allow you to automatically solve the
decision problem. We will first talk about the components of the
model that are required. We follow this by defining what it means to
solve the problem and, finally, we describe a simple algorithm for
finding a solution given a model of this type.

The four components of an `MDP` model are: a set of states, a
set of actions, the effects of the actions and the immediate value of
the actions. We now discuss these in more detail.
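As a concrete illustration, here is a minimal sketch of how these four components might be written down as plain Python data. The state names, action names, probabilities, and reward numbers are made-up placeholders, and the probabilistic form of the transitions is explained in the next paragraph.

```python
# A hedged sketch of the four MDP components as plain Python data.
# All names and numbers below are illustrative placeholders.
states = ["s1", "s2", "s3"]
actions = ["a1", "a2"]

# Transitions: for each (state, action) pair, a distribution over next states.
transitions = {
    ("s1", "a1"): {"s2": 0.8, "s3": 0.2},
    ("s1", "a2"): {"s1": 1.0},
    # ... one entry for every (state, action) pair
}

# Rewards: the immediate value of taking each action in each state.
rewards = {
    ("s1", "a1"): 5.0,
    ("s1", "a2"): 0.0,
    # ... one entry for every (state, action) pair
}
```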

The most powerful aspect of the `MDP` is that the effects of an
action can be probabilistic. Imagine we are specifying the effects of
doing action '`a1`' in state '`s1`'. If there were no question about
how '`a1`' changes the world, we could simply say that the effect of
'`a1`' is to leave the process in state '`s2`'. However, many
decision processes have actions that are not this simple. Sometimes
an action usually results in state '`s2`', but occasionally it might
result in state '`s3`'. `MDP`s handle these more complex actions by
letting you specify a set of resulting states and the probability
with which each one occurs.
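Continuing the sketch above, the snippet below draws a resulting state for '`a1`' in '`s1`' according to such a specified distribution; the 0.8/0.2 split is an assumed example, not a value given in the text.

```python
import random

# Assumed example distribution for the effect of 'a1' in 's1'.
effects_of_a1_in_s1 = {"s2": 0.8, "s3": 0.2}

def sample_next_state(effects):
    """Draw a next state according to the specified probabilities."""
    next_states = list(effects.keys())
    probabilities = list(effects.values())
    return random.choices(next_states, weights=probabilities, k=1)[0]

print(sample_next_state(effects_of_a1_in_s1))  # usually 's2', occasionally 's3'
```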

Although the policy is what we are after, we will actually compute a
*value function*. Without going into the details, suffice it
to say that we can easily derive the policy if we have the value
function. Thus, we will focus on finding this value function. A
value function is similar to a policy, except that instead of
specifying an action for each state, it specifies a numerical value
for each state.
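As a rough sketch of how a policy can be read off from a value function, the code below performs a one-step greedy lookahead over transition and reward dictionaries in the placeholder format of the earlier sketch. The value function `V` and the discount factor `gamma` are assumed inputs here; neither is defined in the text above, and a full model and solution method would supply them.

```python
# A minimal sketch (not the text's algorithm): derive a policy from a value
# function V by picking, in each state, the action whose expected immediate
# reward plus discounted value of the resulting states is highest.
def greedy_policy(states, actions, transitions, rewards, V, gamma=0.95):
    policy = {}
    for s in states:
        best_action, best_value = None, float("-inf")
        for a in actions:
            # Expected immediate reward plus discounted value of successors.
            q = rewards[(s, a)] + gamma * sum(
                p * V[s_next] for s_next, p in transitions[(s, a)].items()
            )
            if q > best_value:
                best_action, best_value = a, q
        policy[s] = best_action
    return policy
```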

When we talked about the transitions of the model, we said that we
simply needed to specify, for each starting state and action, the
possible resulting states and their probabilities. This assumes that
the next state depends only upon the current state (and action).
There are situations where the effects of an action might depend not
only on the current state, but also upon the last few states. The
`MDP` model will not let you model these situations directly. The
assumption made by the `MDP` model is that the next state depends
solely on the current state (and current action). This is called the
*Markov* assumption.

We would say that the dynamics of the process are Markovian, and this has important ramifications for solving these problems. The most important is that our policies and value functions can take a simple form: we can choose our action based solely upon the current state.