The purpose of this tutorial is to familiarize you with some of the planning and learning algorithms in the OO-MDP Toolbox. Specifically, it will cover instantiating a GridWorld domain from the existing implementation in the domains package (which you should download first if you have not already), creating a task for it, and solving that task with Q-learning, Sarsa, BFS, DFS, A*, and Value Iteration. The tutorial will also show you how to visualize the results using existing tools in the OO-MDP Toolbox package.
For this tutorial we will start by making a class that has data members for all the domain- and task-relevant properties. In the tutorial we will call this class "BasicBehavior", but feel free to name it whatever you like. Since we will also be running the examples from this class, we'll include a main method.
import domain.gridworld.*;
import oomdptb.oomdp.Domain;
import oomdptb.oomdp.ObjectInstance;
import oomdptb.oomdp.RewardFunction;
import oomdptb.oomdp.State;
import oomdptb.oomdp.StateParser;
import oomdptb.oomdp.TerminalFunction;
import oomdptb.behavior.planning.StateConditionTest; //needed for the goalCondition member; the import path may vary with your toolbox version
import oomdptb.behavior.statehashing.DiscreteStateHashFactory;

public class BasicBehavior {

    GridWorldDomain             gwdg;
    Domain                      domain;
    StateParser                 sp;
    RewardFunction              rf;
    TerminalFunction            tf;
    StateConditionTest          goalCondition;
    State                       initialState;
    DiscreteStateHashFactory    hashingFactory;

    public static void main(String[] args) {
        //we'll fill this in later
    }

}
If you're already familiar with MDPs in general, the importance of most of these data members will be pretty obvious. However, we will walk through what each one is and why we're going to need it.
GridWorldDomain gwdg
This object is a DomainGenerator provided in the domains package. We will use it to create a basic grid world domain for our demonstration.
Domain domain
The domain object is a fundamental OO-MDP object. It defines a set of attributes, object classes, propositional functions, and actions (along with the actions' transition dynamics).
StateParser sp
A StateParser object is used to convert OO-MDP states to and from strings. This is useful if you want to be able to record planning and learning results to files.
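To make that contract concrete, here is a minimal sketch of the idea. It uses a simplified stand-in class rather than the toolbox's actual StateParser interface, and reduces a grid world "state" to just the agent's (x, y) position:

```java
// Simplified stand-in for a state parser (not the toolbox's StateParser API):
// serializes an agent position to a string and parses it back, so that the
// round trip recovers the original state.
class SimpleGridStateParser {

    //write the state (here just the agent's position) as "x,y"
    String stateToString(int[] agentPos) {
        return agentPos[0] + "," + agentPos[1];
    }

    //recover the position from its string form
    int[] stringToState(String str) {
        String[] parts = str.split(",");
        return new int[]{Integer.parseInt(parts[0]), Integer.parseInt(parts[1])};
    }
}
```

The real toolbox parser does the same kind of round trip, but over full OO-MDP State objects with all of their object instances and attributes.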
RewardFunction rf
A RewardFunction is an object that returns a double-valued reward for any given state-action-state transition. It is a fundamental component of every MDP, and it is what the agent's behavior seeks to maximize.
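As an illustration of the concept, here is a uniform-cost reward function written against a simplified stand-in interface (the toolbox's actual RewardFunction operates on State and action objects rather than strings). Returning -1 for every transition makes maximizing reward equivalent to reaching the goal in as few steps as possible:

```java
// Simplified stand-in for a reward function interface (not the toolbox's
// exact RewardFunction signature).
interface SimpleRewardFunction {
    double reward(String s, String a, String sPrime);
}

// A common choice for goal-directed tasks: every transition costs -1, so
// the reward-maximizing behavior is the shortest path to termination.
class UniformCostRF implements SimpleRewardFunction {
    public double reward(String s, String a, String sPrime) {
        return -1.0;
    }
}
```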
TerminalFunction tf
A common form of MDP is the episodic MDP: an MDP that ends in some specific state or set of states. A typical reason to define an episodic MDP is when there is a goal state the agent is trying to reach; in such cases, the goal state is a terminal state. There may be other reasons to define terminal states as well, but either way, the TerminalFunction object defines which states are terminal.
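The shape of such an object can be sketched as follows, again with a simplified stand-in interface rather than the toolbox's actual TerminalFunction (which takes a full OO-MDP State), and with states encoded as "x,y" strings purely for illustration:

```java
// Simplified stand-in for a terminal function interface.
interface SimpleTerminalFunction {
    boolean isTerminal(String s);
}

// Terminal when the agent reaches a designated goal cell, with the state
// reduced to an "x,y" position string for this sketch.
class AtGoalTF implements SimpleTerminalFunction {
    private final String goal;

    AtGoalTF(String goal) {
        this.goal = goal;
    }

    public boolean isTerminal(String s) {
        return s.equals(this.goal);
    }
}
```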
StateConditionTest goalCondition
Not all planning algorithms are designed to maximize reward functions. Many are instead defined as search algorithms that seek specific goal states. A StateConditionTest object operates much like a TerminalFunction, except that it can be used to specify any kind of state check. In this tutorial we will use it to specify goal states.
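A goal condition of this kind might look like the following sketch. As before, this is a simplified stand-in (the toolbox's actual StateConditionTest operates on full OO-MDP State objects), with the state reduced to an "x,y" string:

```java
// Simplified stand-in for a state condition test interface. Unlike a
// terminal function, a condition test can encode any state check, not
// just termination.
interface SimpleStateConditionTest {
    boolean satisfies(String s);
}

// Satisfied when the agent's position matches a given goal cell.
class AtCellGC implements SimpleStateConditionTest {
    private final int gx, gy;

    AtCellGC(int gx, int gy) {
        this.gx = gx;
        this.gy = gy;
    }

    public boolean satisfies(String s) {
        String[] xy = s.split(",");
        return Integer.parseInt(xy[0]) == gx && Integer.parseInt(xy[1]) == gy;
    }
}
```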
State initialState
To perform any planning or learning, we naturally need to specify an initial state from which to perform it! An OO-MDP state consists of an arbitrary set of instantiated object classes from a given domain. An instantiated object of an object class means that a value is defined for each attribute of the object class. A state may also consist of an arbitrary number of object instances for any given class, but in some domains you may typically have only one instance of each. In the GridWorld domain, for instance, there will be one instance of the agent object (which specifies the agent's position) and one instance of a location object, which will be used to specify a goal location to which the agent should go.
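The object-instance idea can be made concrete with a minimal sketch (these are illustrative classes, not the toolbox's ObjectInstance and State types): an object instance pairs an object class name with a value for each attribute, and a state is just a collection of such instances.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of an OO-MDP object instance (not the toolbox's
// ObjectInstance class): a class name plus a value for each attribute.
class SketchObjectInstance {
    final String objectClass;
    final Map<String, Integer> attributes = new HashMap<>();

    SketchObjectInstance(String objectClass) {
        this.objectClass = objectClass;
    }

    SketchObjectInstance set(String attribute, int value) {
        this.attributes.put(attribute, value);
        return this;
    }
}

// Illustrative sketch of an OO-MDP state: an arbitrary set of instances.
class SketchState {
    final List<SketchObjectInstance> objects = new ArrayList<>();

    SketchState add(SketchObjectInstance o) {
        this.objects.add(o);
        return this;
    }
}
```

Under this sketch, a grid world initial state would hold one agent instance with its x and y position set, and one location instance marking the goal cell.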
DiscreteStateHashFactory hashingFactory
To perform planning and learning, there needs to be some way to look up previously examined states in data structures, and to do so efficiently requires some way to compute hash codes for states. The DiscreteStateHashFactory provides a general means to do this for any discrete OO-MDP domain. A nice property of the DiscreteStateHashFactory object is that hashing results are invariant to specific object references. For instance, if a state contained two otherwise identical block objects that were in different positions in the world and you merely swapped their positions, the DiscreteStateHashFactory would compute the same hash code for each state, since they represent the same OO-MDP state.
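One way this invariance can be achieved, shown here as an illustrative sketch rather than the DiscreteStateHashFactory's actual implementation, is to hash a state as the sorted multiset of its objects' attribute signatures, so that it no longer matters which object instance carries which values:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of object-identity-invariant state hashing (not the toolbox's
// DiscreteStateHashFactory implementation).
class InvariantStateHash {

    static int hash(List<Map<String, Integer>> objects) {
        List<String> signatures = new ArrayList<>();
        for (Map<String, Integer> o : objects) {
            //canonical per-object form: attributes in sorted key order
            signatures.add(new TreeMap<>(o).toString());
        }
        //sorting the signatures makes the order of the objects irrelevant
        Collections.sort(signatures);
        return signatures.hashCode();
    }

    //two blocks with swapped positions hash to the same value, since the
    //two states represent the same OO-MDP state
    static boolean demo() {
        Map<String, Integer> block1 = new TreeMap<>();
        block1.put("x", 1); block1.put("y", 2);
        Map<String, Integer> block2 = new TreeMap<>();
        block2.put("x", 5); block2.put("y", 6);
        return hash(Arrays.asList(block1, block2)) == hash(Arrays.asList(block2, block1));
    }
}
```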