We will now show an example of value iteration proceeding on a problem
for a horizon length of `3`. This example will provide some
useful insights, making the connection between the figures and the
concepts that are needed to explain the general problem. For this
problem, we assume the `POMDP` has two states, two actions and
three observations.

We start with the first horizon. The value function here will
represent the best we can hope to do (in terms of value) if we are
limited to taking a single action. This is the simplest case;
normally (for horizon length `h`) we need to trade off the
immediate rewards and the future rewards. However, when you have a
horizon of `1`, there is no future and the value function
becomes nothing but the immediate rewards.

Since we have two states and two actions, our `POMDP` model
will include four separate immediate reward values. These are the
values of doing each action in each state. These values are defined
over the discrete state space of the `POMDP`, but it is
easy to get the value of doing a particular action in a particular
belief state. To do this we simply use the probabilities in the
belief state to weight the value of each state.

As an example: let action `a1` have a value of `1` in
state `s1` and `0` in state `s2` and let action
`a2` have a value of `0` in state `s1` and
`1.5` in state `s2`. If our belief state is `[ 0.25
0.75 ]` then the value of doing action `a1` in this belief state is
`0.25 x 1 + 0.75 x 0 = 0.25`. Similarly, action `a2`
has value `0.25 x 0 + 0.75 x 1.5 = 1.125`. We can display
these values over belief space with the figure below. This is, in
fact, *the* horizon `1` value function.
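To make the weighting concrete, here is a minimal sketch of this calculation in Python (the array layout and use of `numpy` are our own illustrative choices, not part of any particular `POMDP` software):

```python
import numpy as np

# Immediate rewards R(s, a) from the example:
# rows are states s1, s2; columns are actions a1, a2.
R = np.array([[1.0, 0.0],
              [0.0, 1.5]])

b = np.array([0.25, 0.75])     # belief state [ P(s1), P(s2) ]

# Value of each action at b is the belief-weighted sum of its rewards.
action_values = b @ R          # [0.25, 1.125]

# The horizon 1 value of b is the best of these values; taking this
# maximum at every belief point gives the horizon 1 value function.
print(action_values, action_values.max())
```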

With the horizon `1` value function we are now ready to
construct the horizon `2` value function. This part of the
tutorial is the most crucial for understanding `POMDP`
solution procedures. Once you understand how we will build the
horizon `2` value function, you should have the necessary
intuition behind `POMDP` value functions to understand the
various algorithms.

Our goal in building this new value function is to find the best
action (or highest value) we can achieve using only two actions (i.e.,
the horizon is `2`) for every belief state. To show how to
construct this new value function, we break the problem down into a
series of three steps. We will first show how to compute the value of
a belief state for a given action and observation. Then we will show
how to compute the value of a belief state given only an action.
Finally, we will show how to compute the actual value for a belief
state.

We start with the problem: given a particular belief state `b`,
what is the value of doing action `a1`, if after the action we
receive observation `z1`? In other words, we want to find the
best value possible for a single belief state when the immediate
action and observation are fixed.

The value of a belief state for horizon `2` is simply the value
of the immediate action plus the value of the next action. In
general, we would like to find the best possible value which would
include considering all possible sequences of two actions. However,
since in our restricted problem our immediate action is fixed, the
immediate value is fully determined. We can use the immediate rewards
for action `a1` to find the value of `b` just like we
did in constructing the horizon `1` value function. Recall
that the horizon `1` value function is nothing but the
immediate reward function.

The only question is what the best achievable value is for the belief
state that results from our initial belief state `b` when we
perform action `a1` and observe `z1`. This isn't really
much of a problem at all, since we know our initial belief state, the
action and the resulting observation. This is all that is required to
transform `b` into the unique resulting belief state, which we
will call `b'`. This new belief state will be the belief state
we are in when we have one more action to perform; our horizon length
is `2` and we just did one of the actions. We know what the
best values are for every belief state when there is a single action
left to perform; this is exactly what our horizon `1` value
function tells us.
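The belief transformation itself is the standard belief update. A minimal sketch, assuming transition and observation models `T` and `O` laid out as shown in the comments (these are placeholders; the tutorial's example does not specify them):

```python
import numpy as np

def update_belief(b, a, z, T, O):
    """Transform belief b into the resulting belief b' after action a and observation z.

    Assumed layouts (not specified by the tutorial's example):
      T[a][s, s_next] = P(s_next | s, a)   transition model
      O[a][s_next, z] = P(z | s_next, a)   observation model
    """
    # Unnormalized: b'(s') = P(z | s', a) * sum_s P(s' | s, a) * b(s)
    b_next = O[a][:, z] * (b @ T[a])
    # The normalizing constant is P(z | b, a), the probability of seeing z at all.
    return b_next / b_next.sum()
```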

The figure below shows this process. On the left is the immediate
reward function and on the right is the horizon `1` value
function. The immediate rewards for action `a2` are shown with
a dashed line, since they are not of immediate interest when
considering the fixed action `a1`.

Recall that what we are concerned with at this point is finding the
value of the belief state `b` with the fixed action and
observation. We have everything we need to calculate this value; we
know what immediate reward we will get and we know the best
value for the transformed belief state `b'`. Simply summing
these two values gives us the value of belief state `b` given
that we take action `a1` and observe `z1`. As a side
effect, we also know the best next action to take.
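A short sketch of this sum, reusing the hypothetical `update_belief` helper above and representing the horizon `1` value function by its immediate-reward vectors:

```python
def value_given_action_observation(b, a, z, R, T, O, horizon1_vectors):
    """Value of belief b when the first action a and the observation z are fixed.

    horizon1_vectors holds one vector per action (the immediate rewards),
    since the horizon 1 value function is just the immediate reward function.
    """
    immediate = b @ R[:, a]                    # reward for doing a in b
    b_next = update_belief(b, a, z, T, O)      # transformed belief b' (sketch above)
    best_future = max(b_next @ alpha for alpha in horizon1_vectors)
    return immediate + best_future             # the argmax would give the best next action
```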

Suppose we want to find the value for another belief state, given the
same action and observation. We simply repeat this process, which
means all that we really need to do is transform the belief state and
use the horizon `1` value function to find what value it has
(the immediate rewards are easy to get). Now suppose we want to find
the value for all the belief points given this fixed action and
observation. This seems a little harder, since there are far too many
points to do this for individually. Fear not, this can actually be done
fairly easily.

Our horizon `1` value function is a function of our transformed
belief state `b'` and our transformed belief state is a
function of the initial belief state `b`, since the action and
observation are fixed. It turns out that we can directly construct a
function over the entire belief space from the horizon `1`
value function that has the belief transformation built in. This
gives us a function which directly tells us the value of each belief
state after the action `a1` is taken and observation
`z1` is seen. The figure below shows this transformation.
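A sketch of how such a transformed function, call it `S(a, z)`, could be built directly from the horizon `1` vectors (again assuming the placeholder `T` and `O` layouts from the belief-update sketch):

```python
import numpy as np

def transformed_vectors(a, z, T, O, vectors):
    """Build S(a, z): one transformed vector for each horizon 1 vector.

    Each transformed vector alpha' satisfies
        alpha'(s) = sum_{s'} P(s'|s, a) * P(z|s', a) * alpha(s'),
    so b . alpha' evaluates the next step's value directly from the
    original belief b, with the belief transformation built in.
    """
    weighted = T[a] * O[a][:, z]     # [s, s'] = P(s'|s, a) * P(z|s', a)
    return [weighted @ alpha for alpha in vectors]
```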

We previously set out to solve the simple problem of finding the value
of a belief state given a fixed action and observation. We have now
shown this, and in fact shown how to find the value of all belief
states given a fixed action and observation. Next we want to show how
to compute the value of a belief state given only the action.

When we were looking at individual points, getting their immediate
reward value, transforming them and getting the resulting belief
state's value, we were computing the conditional value. This is the
value *if* we see observation `z1`. However, because
the observations are probabilistic, we are not guaranteed to see
`z1`. In this example, there are three possible observations
and each one can lead to a separate resulting belief state. The
figure below shows the full situation when we fix our first action to
be `a1`.

This might still be a bit cloudy, so let us do an example. Suppose
that we compute the values of the resulting belief states for belief
state `b`, action `a1` and all three observations and
find that the values for each resulting belief state are: `z1:0.8,
z2:0.7, z3:1.2`. These are the values we were initially
calculating when we were doing things one belief point at a time. We
can also compute the probability of getting each of the three
observations for the given belief state and action, and find them to
be: `z1:0.6, z2:0.25, z3:0.15`. Then the horizon `2`
value of the belief state `b` when we fix the action at
`a1` is `0.6x0.8 + 0.25x0.7 + 0.15x1.2 = 0.835` plus the
immediate reward of doing action `a1` in `b`.
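In code, this expectation over observations is just a weighted sum (the numbers here are the ones from the example above):

```python
# Values of the resulting belief states, one per observation (from the example).
future_values = {"z1": 0.8, "z2": 0.7, "z3": 1.2}

# Probability of each observation given belief state b and action a1.
obs_probs = {"z1": 0.6, "z2": 0.25, "z3": 0.15}

expected_future = sum(obs_probs[z] * future_values[z] for z in future_values)
print(expected_future)   # 0.835; adding the immediate reward of a1 in b
                         # gives the horizon 2 value of b with the action fixed
```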

In fact, the transformed value function `S(a1,z1)` we showed
before actually factors in the probabilities of the observation. So
in reality, the `S()` function is not quite what we claimed; we
claimed that it was the next belief state value of each belief state
for the fixed action and *given* the observation. In reality,
the `S()` function already has the probability of the
observation built into it.
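Written out in the usual `POMDP` notation (transition probabilities `T`, observation probabilities `O`; these symbols are our own shorthand, the tutorial does not define them), the point is the identity

$$
S(a_1, z_1)(b) \;=\; \sum_{s} b(s) \sum_{s'} T(s' \mid s, a_1)\, O(z_1 \mid s', a_1)\, \alpha^{*}(s') \;=\; P(z_1 \mid b, a_1)\, V_1(b'),
$$

where `alpha*` is the maximizing horizon `1` vector and `b'` is the belief that results from `b` after action `a1` and observation `z1`. The observation probability appears because the belief update's normalizing constant is never divided out.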

Let's look at the situation we currently have with the figure below.

So what is the horizon `2` value of a belief state, given a
particular action `a1`? Well, it depends not only on the value
of doing action `a1` but also upon what action we do next
(where the horizon length will be `1`). However, what we do
next will depend upon what observation we get. For a given belief
state and observation, we can look at the `S()` function
partition to decide what the best next action to do is. This
is best seen with a figure.

In fact, if we fix the action to be `a1` and the future
strategy to be the same as it is at point `b` (namely:
`z1:a2, z2:a1, z3:a1`) we can find the value of every single
belief point for that particular strategy. To do this we simply sum
all of the appropriate line segments. We use the line segment for
the immediate rewards of the `a1` action and the line segments
from the `S()` functions for each observation's future
strategy. This gives us a single linear segment (since adding lines
gives you lines) over all belief space representing the value of
adopting the strategy of doing `a1` and the future strategy of
`(z1:a2, z2:a1, z3:a1)`. The notation for the future strategies
just indicates an action for each observation we can get.
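As a sketch, the line segment for such a strategy is just a vector sum; here `S[a][z][next_a]` is an assumed layout holding the transformed vector for doing `a`, seeing `z`, and then doing `next_a` (see the earlier sketch):

```python
import numpy as np

# Immediate reward vectors for each action in the running example.
R = {"a1": np.array([1.0, 0.0]), "a2": np.array([0.0, 1.5])}

def strategy_vector(S, a, future_strategy):
    """Line segment for: do action a now, then follow future_strategy,
    which maps each observation to the action taken after seeing it."""
    return R[a] + sum(S[a][z][next_a] for z, next_a in future_strategy.items())

# Value of doing a1 and then the future strategy (z1:a2, z2:a1, z3:a1):
# alpha = strategy_vector(S, "a1", {"z1": "a2", "z2": "a1", "z3": "a1"})
# value_at_b = b @ alpha
```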

We derived this particular future strategy from the belief point
`b` and it is the best future strategy for that belief point.
However, just because we can compute the value of this future strategy
for each belief point, doesn't mean it is the best strategy for all
belief points. So for which belief points is this the best future
strategy? This is actually easy to see from the partition
diagram.

If there was only the action `a1` in our model, then the value
function shown in the previous figure would be the horizon `2`
value function. However, because there is another action, we must
compare the value of the other action with the value of action
`a1` before we have the true horizon `2` value
function, since we are interested in finding the best value for each
belief state.

We can repeat the whole process we did for action `a1` for the
other action. This includes constructing the `S()` functions
for the action `a2` and all the observations. From these we
can find the value function for that action. Below is the value
function and partition for action `a2`.

This whole process took a long time to explain, but it is not nearly as
complicated as it might seem. We will show how to construct the
horizon `3` policy from the horizon `2` policy in a
slightly accelerated manner. The steps are the same, but we can now
eliminate a lot of the discussion and the intermediate steps which we
used to aid the explanation.

First we transform the horizon `2` value function for action
`a1` and all the observations.

Next we find the value function by adding the immediate rewards and
the `S()` functions for each of the useful strategies. The
partition that this value function will impose is easy to construct by
simply looking at the partitions of the `S()` functions.
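For completeness, here is a naive enumeration sketch of this step, reusing the hypothetical `transformed_vectors` helper from before; a real solver would also prune the line segments that are nowhere the maximum:

```python
from itertools import product
import numpy as np

def next_value_function(R, T, O, prev_vectors, num_actions, num_observations):
    """One step of value iteration: horizon n vectors -> horizon n+1 vectors.

    Naive enumeration: for each action, choose one transformed vector per
    observation and add those choices to the action's immediate rewards.
    """
    new_vectors = []
    for a in range(num_actions):
        # S(a, z) for every observation (see the earlier transformed_vectors sketch).
        S = [transformed_vectors(a, z, T, O, prev_vectors)
             for z in range(num_observations)]
        for choice in product(*S):
            new_vectors.append(R[:, a] + sum(choice))
    return new_vectors
```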

In the figure below, we show the `S()` partitions for action
`a1` and the value function for action `a1` with a
horizon of `3`. The partition this value function imposes is
also shown.

We then construct the value function for the other action, put them
together and see which line segments we can get rid of. Here are the
`S()` function partitions, the value function, and the value
function's partition for the action `a2`.

This concludes our example. The concepts and procedures can be
applied over and over to any horizon length. This is the way we do
value iteration on the `CO-MDP` derived from the
`POMDP`.
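Putting the sketches together, the whole procedure might be driven by a loop like this (again, purely illustrative and without any pruning):

```python
def solve_finite_horizon(R, T, O, horizon, num_actions, num_observations):
    """Finite-horizon value iteration sketch, built from the pieces above.

    The horizon 1 value function is just the immediate reward vectors; each
    further step repeats the construction walked through in this tutorial.
    """
    vectors = [R[:, a] for a in range(num_actions)]        # horizon 1
    for _ in range(horizon - 1):
        vectors = next_value_function(R, T, O, vectors,
                                      num_actions, num_observations)
    return vectors
```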