We will now show an example of value iteration proceeding on a problem for a horizon length of 3. This example provides the insights needed to connect the figures to the concepts used to explain the general problem. For this problem, we assume the POMDP has two states, two actions and three observations.
We start with the first horizon. The value function here will represent the best we can hope to do (in terms of value) if we are limited to taking a single action. This is the simplest case; normally (for horizon length h) we need to trade off the immediate rewards and the future rewards. However, when you have a horizon of 1, there is no future and the value function becomes nothing but the immediate rewards.
Since we have two states and two actions, our POMDP model will include four separate immediate reward values. These are the values of doing each action in each state. These values are defined over the discrete state space of the POMDP, but it is easy to get the value of doing a particular action in a particular belief state. To do this we simply use the probabilities in the belief state to weight the value of each state.
As an example: let action a1 have a value of 1 in state s1 and 0 in state s2 and let action a2 have a value of 0 in state s1 and 1.5 in state s2. If our belief state is [ 0.25 0.75 ] then the value of doing action a1 in this belief state is 0.25 x 1 + 0.75 x 0 = 0.25. Similarly, action a2 has value 0.25 x 0 + 0.75 x 1.5 = 1.125. We can display these values over belief space with the figure below. This is, in fact, the horizon 1 value function.
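To make the arithmetic concrete, here is a minimal sketch of this computation, using only the reward numbers from the example above; everything else about the model is left out.

```python
import numpy as np

# Immediate rewards R[a][s]: the value of doing action a in state s.
R = {
    "a1": np.array([1.0, 0.0]),   # a1: value 1 in s1, 0 in s2
    "a2": np.array([0.0, 1.5]),   # a2: value 0 in s1, 1.5 in s2
}

b = np.array([0.25, 0.75])        # belief state [ P(s1) P(s2) ]

for a, r in R.items():
    print(a, float(b @ r))        # a1 -> 0.25, a2 -> 1.125

# The horizon 1 value of b is the larger of the two, i.e. the upper
# surface of the two reward lines over belief space.
print("V1(b) =", max(float(b @ r) for r in R.values()))   # 1.125
```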
With the horizon 1 value function we are now ready to construct the horizon 2 value function. This part of the tutorial is the most crucial for understanding POMDP solution procedures. Once you understand how we build the horizon 2 value function, you should have the intuition about POMDP value functions necessary to understand the various algorithms.
Our goal in building this new value function is to find the best action (or highest value) we can achieve using only two actions (i.e., the horizon is 2) for every belief state. To show how to construct this new value function, we break the problem down into a series of three steps. We will first show how to compute the value of a belief state for a given action and observation. Then we will show how to compute the value of a belief state given only an action. Finally, we will show how to compute the actual value for a belief state.
We start with the problem: given a particular belief state b, what is the value of doing action a1, if after the action we receive observation z1? In other words, we want to find the best value possible for a single belief state when the immediate action and observation are fixed.
The value of a belief state for horizon 2 is simply the value of the immediate action plus the value of the next action. In general, we would like to find the best possible value, which would mean considering all possible sequences of two actions. However, since in our restricted problem the immediate action is fixed, the immediate value is fully determined. We can use the immediate rewards for action a1 to find the value of b just like we did in constructing the horizon 1 value function. Recall that the horizon 1 value function is nothing but the immediate reward function.
The only question is what the best achievable value is for the belief state that results from our initial belief state b when we perform action a1 and observe z1. This isn't really much of a problem at all, since we know our initial belief state, the action and the resulting observation. This is all that is required to transform b into the unique resulting belief state, which we will call b'. This new belief state is the belief state we are in when we have one more action to perform; our horizon length is 2 and we just did one of the actions. We know the best values for every belief state when there is a single action left to perform; this is exactly what our horizon 1 value function tells us.
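The transformation from b to b' is the standard belief update. Here is a sketch of it; the transition and observation probabilities below are made-up numbers for illustration, since the tutorial does not specify the model.

```python
import numpy as np

# Illustrative model for action a1 (not the tutorial's actual numbers).
T_a1 = np.array([[0.7, 0.3],          # T[s, s'] = P(s' | s, a1)
                 [0.4, 0.6]])
O_a1 = np.array([[0.5, 0.3, 0.2],     # O[s', z] = P(z | s', a1)
                 [0.1, 0.6, 0.3]])

def belief_update(b, T, O, z):
    # Unnormalized: b'(s') = P(z | s', a) * sum_s P(s' | s, a) * b(s)
    unnorm = O[:, z] * (b @ T)
    return unnorm / unnorm.sum()      # normalize by P(z | b, a)

b = np.array([0.25, 0.75])
b_prime = belief_update(b, T_a1, O_a1, z=0)   # z1 is observation index 0
print(b_prime)                                # the unique resulting belief b'
```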
The figure below shows this process. On the left is the immediate reward function and on the right is the horizon 1 value function. The immediate rewards for action a2 are shown with a dashed line, since they are not of immediate interest when considering the fixed action a1.
Recall that what we are concerned with at this point is finding the value of the belief state b with the fixed action and observation. We have everything we need to calculate this value: we know the immediate reward we will get, and we know the best value for the transformed belief state b'. Simply summing these two values gives us the value of belief state b given that we take action a1 and observe z1. As a side effect, we also know the best next action to take.
Suppose we want to find the value for another belief state, given the same action and observation. We simply repeat this process, which means all we really need to do is transform the belief state and use the horizon 1 value function to find its value (the immediate rewards are easy to get). Now suppose we want to find the value for all the belief points given this fixed action and observation. This seems a little harder, since there are far too many points to do this for. Fear not, it can actually be done fairly easily.
Our horizon 1 value function is a function of our transformed belief state b', and our transformed belief state is a function of the initial belief state b, since the action and observation are fixed. It turns out that we can directly construct a function over the entire belief space from the horizon 1 value function that has the belief transformation built in. This gives us a function which directly tells us the value of each belief state after the action a1 is taken and observation z1 is seen. The figure below shows this transformation.
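One way to see this construction concretely: each line of the horizon 1 value function can be transformed into a new line over the original belief space. A sketch, reusing the illustrative model numbers from the previous snippet (they are assumptions, not the tutorial's model):

```python
import numpy as np

T_a1 = np.array([[0.7, 0.3], [0.4, 0.6]])            # P(s' | s, a1)
O_a1 = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])  # P(z | s', a1)

# Horizon 1 value function: one vector (line) per action, indexed by state.
V1 = [np.array([1.0, 0.0]),    # line contributed by action a1
      np.array([0.0, 1.5])]    # line contributed by action a2

def S(T, O, z, V):
    # For each horizon 1 line alpha, build a transformed line whose dot
    # product with the *original* belief b gives the value of the
    # transformed belief b' (weighted, as discussed shortly, by the
    # probability of actually seeing z).
    return [T @ (O[:, z] * alpha) for alpha in V]

for line in S(T_a1, O_a1, 0, V1):    # the S(a1, z1) function
    print(line)
```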
We previously decided to solve the simple problem of finding the value of a belief state, given a fixed action and observation. We have now done this, and in fact showed how to find the value of all belief states given a fixed action and observation. Next we want to show how to compute the value of a belief state given only the action.
When we were looking at individual points, getting their immediate reward value, transforming them and getting the resulting belief state's value, we were computing the conditional value: the value if we see observation z1. However, because the observations are probabilistic, we are not guaranteed to see z1. In this example, there are three possible observations and each one can lead to a separate resulting belief state. The figure below shows the full situation when we fix our first action to be a1.
This might still be a bit cloudy, so let us do an example. Suppose that we compute the values of the resulting belief states for belief state b, action a1 and all three observations, and find that the values for each resulting belief state are: z1:0.8, z2:0.7, z3:1.2. These are the values we were initially calculating when we were doing things one belief point at a time. We can also compute the probability of getting each of the three observations for the given belief state and action, and find them to be: z1:0.6, z2:0.25, z3:0.15. Then the horizon 2 value of the belief state b when we fix the action at a1 is 0.6 x 0.8 + 0.25 x 0.7 + 0.15 x 1.2 = 0.835 plus the immediate reward of doing action a1 in b.
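The same weighted sum, spelled out; the per-observation values and probabilities are the ones quoted above.

```python
values = {"z1": 0.8, "z2": 0.7, "z3": 1.2}    # value of each resulting belief
probs  = {"z1": 0.6, "z2": 0.25, "z3": 0.15}  # P(z | b, a1)

future = sum(probs[z] * values[z] for z in values)
print(future)   # 0.835; adding the immediate reward of a1 in b gives the
                # horizon 2 value of b for the fixed action a1
```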
In fact, the transformed value function S(a1,z1) we showed before already factors in the probability of the observation. So the S() function is not quite what we claimed; we claimed that it was the value of the next belief state for each belief state, given the fixed action and observation. In reality, the S() function already has the probability of the observation built into it.
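A quick numerical check of this claim, again with the illustrative model numbers from the earlier sketches: dotting a transformed S(a1,z1) line with the original belief b gives P(z1 | b, a1) times the horizon 1 value of the transformed belief b'.

```python
import numpy as np

T_a1 = np.array([[0.7, 0.3], [0.4, 0.6]])            # P(s' | s, a1)
O_a1 = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])  # P(z | s', a1)
alpha = np.array([0.0, 1.5])      # one horizon 1 line (the one from a2)
b = np.array([0.25, 0.75])

s_line = T_a1 @ (O_a1[:, 0] * alpha)    # transformed line for (a1, z1)

unnorm = O_a1[:, 0] * (b @ T_a1)        # unnormalized next belief
p_z1 = unnorm.sum()                     # P(z1 | b, a1)
b_prime = unnorm / p_z1                 # normalized next belief b'

print(b @ s_line)                       # 0.07875 ...
print(p_z1 * (b_prime @ alpha))         # ... and the same 0.07875
```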
Let's look at the situation we currently have with the figure below.
So what is the horizon 2 value of a belief state, given a particular action a1? Well, it depends not only on the value of doing action a1 but also upon what action we do next (where the horizon length will be 1). However, what we do next will depend upon what observation we get. For a given belief state and observation, we can look at the S() function partition to decide what the best next action is. This is best seen with a figure.
In fact, if we fix the action to be a1 and the future strategy to be the same as it is at point b (namely: z1:a2, z2:a1, z3:a1) we can find the value of every single belief point for that particular strategy. To do this we simply sum all of the appropriate line segments. We use the line segment for the immediate rewards of the a1 action and the line segments from the S() functions for each observation's future strategy. This gives us a single linear segment (since adding lines gives you lines) over all belief space representing the value of adopting the strategy of doing a1 and the future strategy of (z1:a2, z2:a1, z3:a1). The notation for the future strategies just indicates an action for each observation we can get.
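A sketch of building that single line segment, reusing the illustrative model numbers from the earlier snippets; the strategy is a1 followed by (z1:a2, z2:a1, z3:a1).

```python
import numpy as np

R = {"a1": np.array([1.0, 0.0]), "a2": np.array([0.0, 1.5])}
T_a1 = np.array([[0.7, 0.3], [0.4, 0.6]])            # P(s' | s, a1)
O_a1 = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])  # P(z | s', a1)
V1 = {"a1": np.array([1.0, 0.0]), "a2": np.array([0.0, 1.5])}

def S(T, O, z, V):
    # Transformed horizon 1 lines for a fixed action and observation z.
    return {name: T @ (O[:, z] * alpha) for name, alpha in V.items()}

future_strategy = {0: "a2", 1: "a1", 2: "a1"}   # z1:a2, z2:a1, z3:a1

# Adding lines gives a line: the immediate reward line for a1 plus one
# S(a1, z) line per observation yields a single linear segment whose dot
# product with any belief b is the value of following this strategy from b.
segment = R["a1"] + sum(S(T_a1, O_a1, z, V1)[act]
                        for z, act in future_strategy.items())
print(segment)
```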
We derived this particular future strategy from the belief point b, and it is the best future strategy for that belief point. However, just because we can compute the value of this future strategy for each belief point doesn't mean it is the best strategy for all belief points. So for which belief points is this the best future strategy? This is actually easy to see from the partition diagram.
If there was only the action a1 in our model, then the value function shown in the previous figure would be the horizon 2 value function. However, because there is another action, we must compare the value of the other action with the value of action a1 before we have the true horizon 2 value function, since we are interested in finding the best value for each belief state.
We can repeat the whole process we did for action a1 for the other action. This includes constructing the S() functions for the action a2 and all the observations. From these we can find the value function for that action. Below is the value function and partition for action a2.
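Putting both actions together amounts to one full dynamic programming backup: enumerate every (action, future strategy) segment, then keep only the segments that are best somewhere in belief space. Below is a sketch under the same illustrative model assumptions (the a2 transition and observation numbers are also made up); exact algorithms do the pruning with linear programs rather than a grid of belief points.

```python
import numpy as np
from itertools import product

R = {"a1": np.array([1.0, 0.0]), "a2": np.array([0.0, 1.5])}
T = {"a1": np.array([[0.7, 0.3], [0.4, 0.6]]),
     "a2": np.array([[0.2, 0.8], [0.5, 0.5]])}
O = {"a1": np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]),
     "a2": np.array([[0.4, 0.4, 0.2], [0.2, 0.3, 0.5]])}

def backup(V):
    """One value iteration step: from the horizon n lines V, build the
    horizon n+1 lines (with naive grid-based pruning)."""
    candidates = []
    for a in ("a1", "a2"):
        # S(a, z) lines for each observation, built from the current V.
        S_az = [[T[a] @ (O[a][:, z] * alpha) for alpha in V] for z in range(3)]
        # One candidate segment per future strategy (a line choice per z).
        for strat in product(range(len(V)), repeat=3):
            seg = R[a] + sum(S_az[z][i] for z, i in enumerate(strat))
            candidates.append(seg)
    # Keep a segment only if it is the best one at some belief on a grid.
    beliefs = [np.array([p, 1.0 - p]) for p in np.linspace(0.0, 1.0, 101)]
    kept = {max(range(len(candidates)), key=lambda i: candidates[i] @ b)
            for b in beliefs}
    return [candidates[i] for i in sorted(kept)]

V1 = [R["a1"], R["a2"]]      # horizon 1 value function = immediate rewards
V2 = backup(V1)              # horizon 2 value function
print(len(V2), V2)
```

Applying the same backup to its own output (V3 = backup(V2)) is exactly the horizon 3 construction described next.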
This whole process took a long time to explain, but it is not nearly as complicated as it might seem. We will show how to construct the horizon 3 value function from the horizon 2 value function in a slightly accelerated manner. The steps are the same, but we can now eliminate a lot of the discussion and the intermediate steps we used to aid the explanation.
First, we transform the horizon 2 value function for action a1 and all the observations.
Next we find the value function by adding the immediate rewards and the S() functions for each of the useful strategies. The partition that this value function will impose is easy to construct by simply looking at the partitions of the S() functions.
The figure below shows the S() partitions for action a1 and the value function for action a1 with a horizon of 3. The partition this value function imposes is also shown.
We then construct the value function for the other action, put them together, and see which line segments we can get rid of. Here are the S() function partitions, the value function, and the value function's partition for action a2.
This concludes our example. The concepts and procedures can be applied over and over to any horizon length. This is the way we do value iteration on the CO-MDP derived from the POMDP.