The purpose of this tutorial is to explain how to create and use options in the OO-MDP Toolbox. This tutorial assumes that you are mostly familiar with the concept of options already and merely explains how to create and use them. In brief, options are a formalism for creating and using temporally extended actions in MDPs. An option is defined by a three-tuple: a set of states in which the option can be initiated (think of these as effectively the preconditions of an action); a policy indicating what the agent will do during the execution of the option; and termination conditions, specified as a probability distribution over states, indicating how likely the option is to stop executing in any given state. In particular, this tutorial will focus on subgoal options. A subgoal option is an option that will take the agent to a subgoal state of the task. The advantage of subgoal options is that they allow the agent to quickly traverse the state space, and if the subgoal options move the agent closer to the actual goal, then they will likely accelerate learning and planning for the overall task. Subgoal options are also most typically defined as having deterministic termination conditions, in which the option terminates with probability 1 only when the agent reaches the defined subgoal of the option or enters a state in which the policy for the option is undefined.
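Formally, following the notation of Sutton, Precup, and Singh (1999) (cited later in this tutorial), these three components are often written as the tuple below; the subgoal case shown after it is simply a restatement of the deterministic termination condition just described.

```latex
o = \langle I, \pi, \beta \rangle, \qquad
I \subseteq S, \quad \pi : S \rightarrow A, \quad \beta : S \rightarrow [0,1]

% Deterministic termination for a subgoal option with subgoal state set G:
\beta(s) =
\begin{cases}
1 & \text{if } s \in G \text{ or } \pi \text{ is undefined in } s \\
0 & \text{otherwise}
\end{cases}
```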
This tutorial will assume that you have already gone through the Basic Planning and Learning tutorial and will extend the class created in that example. If you have not already gone through that tutorial, it is advised that you do, or at a minimum, copy and implement the code at the end of the tutorial so that you can subclass it in this tutorial.
Options are implemented in the OO-MDP Toolbox as an abstract extension of the Action class. As a result, options can be easily incorporated into any of the planning and learning algorithms present. In particular, whether a state is in an option's initiation state set can be determined by the typical Action method applicableInState(State s, String [] params). An option can also be executed from a state by simply calling the standard Action method performAction(State s, String [] params). The Option class also defines a number of additional methods for describing the properties of the option (since different planning and learning algorithms may have different requirements of the kinds of options they can use). The following is a list of some of those methods.
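For example, because an Option is an Action, client code can test and execute it exactly as it would a primitive action. The sketch below is only an illustration: it assumes the relevant toolbox classes are imported (package paths depend on your version of the toolbox), that performAction returns the resulting State as it does for primitive actions, and the method name itself is hypothetical.

```java
// Minimal usage sketch: an Option behaves like any other Action.
public static State runOptionIfApplicable(Option o, State s){
	String [] noParams = new String[]{};
	// applicableInState checks whether s is in the option's initiation state set
	if(o.applicableInState(s, noParams)){
		// performAction runs the option's policy until its termination condition is
		// satisfied and (it is assumed here) returns the resulting state
		return o.performAction(s, noParams);
	}
	// the option cannot be initiated in s; return the state unchanged
	return s;
}
```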
isMarkov()
Should return true or false. A Markov option is any option whose behavior and termination properties can be completely determined from each state of execution. An option that performs a fixed sequence of actions, or that terminates after a set amount of time, is not a Markov option because its behavior in any given state depends on the actions that came before it.
usesDeterministicTermination()
Should return true or false. As the method name implies, indicates whether the termination conditions are deterministic or not. Subgoal options typically have deterministic termination conditions because they only terminate when the agent reaches the defined subgoal or enters a state for which the option's policy is undefined.
usesDeterministicPolicy()
This method indicates whether the policy of the option is deterministic. This may be important to planning algorithms that assess the possible outcomes of actions and their probabilities. If both the policy and the termination conditions are deterministic, then it may simplify the calculation.
In addition to the above methods that describe properties of the option, the following methods define how the option operates.
probabilityOfTermination(State s, String [] params)
This method returns the probability of termination in any given state--the termination conditions of the option.
initiateInStateHelper(State s, String [] params)
This method is called whenever the option is about to begin execution. In particular, it is automatically called when the performAction method is called. If you are implementing a non-Markovian option, it may be useful to initialize here any data structures needed to determine behavior in future states.
oneStepActionSelection(State s, String [] params)
The abstract Option superclass automatically handles much of the action overhead, such as the performAction and performActionHelper methods of the Action superclass. Therefore, option implementations only need to implement this method to enable option execution. Specifically, this method returns the action specified by the option's policy.
getActionDistributionForState(State s, String [] params)
Much like the similar method in the Policy class, this method defines the action probability distribution of the option's policy. While this method does not need to be implemented to support option execution, it will need to be defined if planning algorithms that use the Bellman update are used (more on that below).
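To tie these methods together, below is a rough sketch of what a deterministic, Markov subgoal option subclass might look like. This is a sketch rather than a drop-in implementation: the constructor and any other overrides required by your version of the Option class are omitted, the nested SubgoalTest interface is hypothetical, the Policy methods used (isDefinedFor, getAction, getActionDistributionForState) are assumed to exist on your Policy implementation, and the return types GroundedAction and List<ActionProb> are assumptions based on the Policy class analogy above.

```java
import java.util.List;
// imports of the toolbox's Option, Policy, State, GroundedAction, and ActionProb
// classes are omitted; package paths depend on your version of the toolbox

// Sketch of a deterministic, Markov subgoal option (not a drop-in class).
public class SubgoalOptionSketch extends Option {

	protected Policy policy;        // the policy the option follows
	protected SubgoalTest subgoal;  // hypothetical: reports whether a state is a subgoal state

	@Override
	public boolean isMarkov() {
		return true;   // behavior and termination depend only on the current state
	}

	@Override
	public boolean usesDeterministicTermination() {
		return true;   // terminates with probability 1 at the subgoal (or where the policy is undefined)
	}

	@Override
	public boolean usesDeterministicPolicy() {
		return true;   // assumes the underlying policy selects a single action per state
	}

	@Override
	public double probabilityOfTermination(State s, String [] params) {
		// deterministic termination: 1 at the subgoal or where the policy is undefined, 0 elsewhere
		if(this.subgoal.satisfiedIn(s) || !this.policy.isDefinedFor(s)){
			return 1.;
		}
		return 0.;
	}

	@Override
	public void initiateInStateHelper(State s, String [] params) {
		// Markov option: no per-execution data structures to initialize
	}

	@Override
	public GroundedAction oneStepActionSelection(State s, String [] params) {
		// the single action the option's policy selects in s (return type assumed)
		return this.policy.getAction(s);
	}

	@Override
	public List<ActionProb> getActionDistributionForState(State s, String [] params) {
		// mirrors the Policy class's action distribution (return type assumed)
		return this.policy.getActionDistributionForState(s);
	}

	// Hypothetical helper interface used only for this sketch.
	public interface SubgoalTest {
		boolean satisfiedIn(State s);
	}
}
```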
Option planning and learning algorithms will typically need to keep track of the discounted reward received during the execution of an option. However, different tasks may use different reward functions. Therefore, the Option class provides the method keepTrackOfRewardWith(RewardFunction rf, double discount), which planning and learning algorithms can call so that the option keeps track of the cumulative discounted reward received from the reward function rf while the option is executed. Implementing option classes do not have to do any work to support this functionality themselves. Once the option has been provided a reward function and discount factor using this method, information about the last execution of the option can be retrieved with the methods getLastCumulativeReward() and getLastNumSteps(), since the number of steps taken in the execution of an option is also often important to option planning and learning algorithms.
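As a rough illustration of how a learning or planning algorithm might use these methods, the hypothetical helper below executes an option while tracking its discounted return; the printed output stands in for whatever SMDP-style update a real algorithm would perform with the cumulative reward and step count (it is also assumed here that performAction returns the resulting State and getLastNumSteps returns an int).

```java
// Sketch: executing an option while tracking its cumulative discounted reward.
public static void executeAndReport(Option o, State s, RewardFunction rf, double gamma){

	// tell the option which reward function and discount factor to accumulate with
	o.keepTrackOfRewardWith(rf, gamma);

	State sPrime = o.performAction(s, new String[]{});

	double r = o.getLastCumulativeReward(); // discounted reward accumulated during execution
	int k = o.getLastNumSteps();            // number of primitive steps the option took

	// a learning algorithm would typically use r and k in an SMDP-style update, e.g.
	// Q(s,o) <- Q(s,o) + alpha * ( r + gamma^k * max_a Q(s',a) - Q(s,o) )
	System.out.println("Option return " + r + " over " + k + " steps; reached " + sPrime);
}
```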
While these methods can be called explicitly, it may be useful to be able to query the reward received from an option implicitly through a reward function. Since reward functions are typically defined only with respect to primitive actions, a wrapper reward function class called OptionEvaluatingRF is provided. This class takes another reward function as a parameter. When the action specified in the reward query method is a primitive action, the provided reward function is queried and its reward is returned. If the provided action is an option, then the getLastCumulativeReward method on the option is queried and returned as the reward. Note that this approach will only work if the state in which the reward function is being queried was in fact the last state in which the option was executed. When options are added to an OOMDPPlanner object, it will automatically convert the planner's reward function to an OptionEvaluatingRF wrapper of the previously specified reward function (more on this later in the tutorial).
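If you do need to construct the wrapper yourself, a minimal sketch might look like the following; the single-argument constructor is assumed from the description above, and nothing else is required since the wrapping is otherwise transparent.

```java
// Sketch: wrap a task reward function so that reward queries for options return
// the option's last cumulative discounted reward, while primitive actions are
// delegated to the original reward function.
public static RewardFunction makeOptionAware(RewardFunction taskRF){
	return new OptionEvaluatingRF(taskRF); // constructor signature assumed from the description above
}
```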
It is important to note that planning algorithms that use the full Bellman update, such as Value Iteration, rely on options not only being able to supply the probability of transitioning to other states, but also the length of time it is expected to take to get there and the expected reward received from executing the option in each state. To more compactly represent this information, the transition probabilities usually incorporate the discounted number of steps into the probability of transitioning from one state to another after applying an option (see the options work by Sutton, Precup, and Singh (1999) for more information). To facilitate these calculations, the method getTransitions(State s, String [] params) of the Option class is expected to return not just the probability of reaching each state (as it is with the Action class), but the probability of reaching that state multiplied by the discount factor raised to the number of steps taken (summed over all possible numbers of steps it could take). More specifically, for the Option class, getTransitions(State s, String [] params) returns a List<TransitionProbability> object. Each TransitionProbability object is defined by two variables, a state (TransitionProbability.s) and a probability (TransitionProbability.p). The probability value represents

P(s' | s, o) = Σ_k γ^k p(s', k),

where s is the state passed to the getTransitions method, o is the option object on which the method is called and which will be applied in state s, s' is the state to which it may transition (stored in TransitionProbability.s and paired with the probability TransitionProbability.p), k ranges over the possible numbers of steps the option could take from state s, γ is the discount factor, and p(s', k) is the probability that the option will transition from state s to s' in exactly k steps. This return value is in contrast to what is typically returned by the Action class, which is the undiscounted probability of transitioning to a state. Any planning algorithm will have to account for this difference in the return values of Option objects and Action objects, but in general, providing the discounted probability value makes computing Bellman updates easier.
To account for the expected reward received from executing an option, the Option class adds an additional method: getExpectedRewards(State s, String [] params).
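Taken together, these two methods supply exactly the quantities a Bellman backup over options needs. As an illustrative (not toolbox-specific) formulation, writing R(s, o) for the expected discounted reward returned by getExpectedRewards and P(s' | s, o) for the discounted probability returned by getTransitions, the backup takes the standard SMDP form:

```latex
V(s) \leftarrow \max_{o} \Big[\, R(s,o) + \sum_{s'} P(s' \mid s,o)\, V(s') \,\Big],
\qquad\text{where}\qquad
P(s' \mid s,o) = \sum_{k} \gamma^{k}\, p(s',k)
```

Note that no extra discount factor appears in front of the sum because the discounting over the option's duration is already folded into P(s' | s, o).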
By default, the getTransitions and getExpectedRewards methods are already defined by the abstract Option class and do not have to be implemented by any subclass option implementation; however, they require that the reward function to be used is supplied through the method keepTrackOfRewardWith, as well as a hashing factory to cache the values after they have been computed, which can be supplied through the method setExpectationHashingFactory. Existing planners in the code base already ensure that these methods are called on options using the reward function and hashing factory with which the planner is defined. Any OOMDPPlanner subclass will also automatically set the reward function information. However, planners that will make use of the expected transition/reward methods will need to manually call the setExpectationHashingFactory method.
Although expected rewards and probabilities will automatically be computed for any subclass of the Option class, to ensure completeness the computation may not be very fast if there are many possible stochastic transitions and terminations. To help manage the expectation calculations, a probability cutoff is defined in the Option class that causes the search for termination states to stop when the probability of reaching them is lower than the cutoff. You can adjust this cutoff manually using the method setExpectationCalculationProbabilityCutoff. If computation of the possible termination states is especially difficult, it may be worth considering overriding the getTransitions and getExpectedRewards methods in the subclass and manually defining them in a computationally more tractable way.
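For planners that do not already perform this setup for you, the fragment below sketches the calls described above. The variables option, rf, gamma, hashingFactory, and s are assumed to be defined elsewhere (the hashing factory type depends on your version of the toolbox), and the cutoff value shown is an arbitrary example.

```java
// Sketch: preparing an option so that its expected transitions and rewards can be
// queried for Bellman updates.
option.keepTrackOfRewardWith(rf, gamma);                    // reward function and discount factor
option.setExpectationHashingFactory(hashingFactory);        // cache expectation results once computed
option.setExpectationCalculationProbabilityCutoff(0.001);   // example cutoff; tune as needed

// discounted transition probabilities, as described above
List<TransitionProbability> tps = option.getTransitions(s, new String[]{});
```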