Tutorial: Creating and Using Options


Creating Subgoal Options and their Policies

With the initiation states and termination conditions defined, we now need to define the policy for our options. As previously noted, we could define this manually; however, it is much easier to simply let one of the OO-MDP Toolbox planners do the work for us! Using existing tools, we can compactly create the subgoal option policy and the option object itself.

Since we will need to create eight different subgoal options (two for each room), it will be convenient to create a single method that takes the room boundaries and the subgoal hallway for which an option should be created and returns the option. Before we do so, we need to make a few decisions. First, we need to decide whether to use state abstraction in the option's policy. Ideally, we would like our options' policies to be invariant to the position of any location objects in the state, so we want an abstraction that ignores location information. Another decision is which planner to use for creating the option policies. Value Iteration (VI) is an appealing choice for a couple of reasons: (1) it works in both deterministic and stochastic versions of the grid world domain (although in our example we are only using the deterministic version), and (2) we need to extract a policy for every state in the initiation conditions, and since VI computes a policy for all reachable states of a task, it is fairly efficient at doing so. Finally, we also need to decide what kind of policy to derive from the planner. For VI, we could use a standard GreedyQ policy; however, that policy randomly selects among actions with the same Q-value, and stochastic policies can increase computation time when determining the expected transition probabilities of options. A better choice is therefore the GreedyDeterministicQPolicy, which always selects the same action among those tied for the maximum Q-value, making it a deterministic policy.

Given these choices, we can now define our method for creating a subgoal option for a given set of room boundaries and a hallway subgoal. This method is defined below.

public Option getRoomOption(String name, int leftBound, int rightBound, 
			int bottomBound, int topBound, int hx, int hy){
		
	StateConditionTestIterable inRoom = new InRoomStateCheck(leftBound, rightBound, 
														bottomBound, topBound);
														
	StateConditionTest atHallway = new AtPositionStateCheck(hx, hy);
	
	DiscreteMaskHashingFactory hashingFactory = new DiscreteMaskHashingFactory();
	hashingFactory.setAttributesForClass(GridWorldDomain.CLASSAGENT, 
				domain.getObjectClass(GridWorldDomain.CLASSAGENT).attributeList);
	
	OOMDPPlanner planner = new ValueIteration(domain, new LocalSubgoalRF(inRoom, atHallway), 
				new LocalSubgoalTF(inRoom, atHallway), 0.99, hashingFactory, 0.001, 50);
	
	PlannerDerivedPolicy p = new GreedyDeterministicQPolicy();
	
	return new SubgoalOption(name, inRoom, atHallway, planner, p);
}
				

Note that the first two lines simply create instances of the classes we defined for specifying initiation states and termination conditions. The next two lines define a state hashing factory that will be used by the planner to create our policy. You saw these defined in the previous Basic Planning and Learning tutorial; however, this time we are using a different factory: DiscreteMaskHashingFactory. This factory is similar to the previously discussed DiscreteStateHashFactory in that both compute hash values based on the attributes of the specific classes provided to the hashing factory. In this case, we told the hashing factory to use only the agent's x and y attribute values for computing hash codes. The difference is that the DiscreteMaskHashingFactory also masks any other attributes and object classes in state equality checks. Here, two states will be judged the same if their agents are in the same position. As a result, the location object information is abstracted out of the policy that will be created, and we could use these options in tasks with any number of location objects in any position without their values affecting the execution of the options.
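To make the effect of the masking concrete, the hypothetical check below compares two states whose agents occupy the same cell but whose location objects differ. It is a minimal sketch that assumes the hashing factory exposes a hashState method returning comparable hash tuples, as the discussion above implies; treat the exact calls as assumptions rather than confirmed toolbox API.

//a minimal sketch; hashState and the equality semantics of its return value are assumptions
DiscreteMaskHashingFactory hf = new DiscreteMaskHashingFactory();
hf.setAttributesForClass(GridWorldDomain.CLASSAGENT, 
			domain.getObjectClass(GridWorldDomain.CLASSAGENT).attributeList);

State s1 = GridWorldDomain.getOneAgentNLocationState(domain, 1);
GridWorldDomain.setAgent(s1, 2, 2);
GridWorldDomain.setLocation(s1, 0, 5, 8);

State s2 = GridWorldDomain.getOneAgentNLocationState(domain, 1);
GridWorldDomain.setAgent(s2, 2, 2);
GridWorldDomain.setLocation(s2, 0, 9, 9); //location object placed somewhere else

//because location attributes are masked, the two states should be treated as equal
boolean sameUnderAbstraction = hf.hashState(s1).equals(hf.hashState(s2)); //expected: true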

Creating the VI planner is done in the usual way, but it is worth drawing attention to the reward function and terminal function it is passed. For the reward function, we pass a reward function designed specifically to facilitate creating subgoal options: LocalSubgoalRF, which takes the set of initiation states and the subgoal as parameters. By default it returns -1 everywhere, unless the agent reaches a state that is in neither the initiation conditions nor the subgoal conditions, in which case it returns -Double.MAX_VALUE as a penalty for failure. This kind of reward function is usable by many planners, including deterministic planners; however, you can also modify the rewards through different constructors. For instance, you can change the reward received when reaching the subgoal, the reward received when reaching a state outside both the initiation states and the subgoal, and the default reward for other transitions. The LocalSubgoalTF terminal function works similarly in spirit, returning true whenever the agent reaches a state that is either not in the initiation state set or is a subgoal state. After creating the planner, we define a PlannerDerivedPolicy object to be an instance of the GreedyDeterministicQPolicy class. PlannerDerivedPolicy is an abstract extension of the Policy class that provides a method for passing it an OOMDPPlanner instance. This method will be used implicitly in the subgoal option creation.
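As a rough illustration of the default behavior just described (this is not the actual LocalSubgoalRF or LocalSubgoalTF source, only a hypothetical restatement of their logic), the reward and termination checks amount to the following, where inRoom and atHallway are the condition tests created earlier:

//hypothetical sketch of the default local subgoal reward and termination logic
double localSubgoalReward(State sprime, StateConditionTest inRoom, StateConditionTest atHallway){
	if(inRoom.satisfies(sprime) || atHallway.satisfies(sprime)){
		return -1.; //default step reward
	}
	return -Double.MAX_VALUE; //left the initiation region without reaching the subgoal
}

boolean localSubgoalTerminal(State s, StateConditionTest inRoom, StateConditionTest atHallway){
	//terminate on reaching the subgoal or on leaving the initiation state set
	return atHallway.satisfies(s) || !inRoom.satisfies(s);
}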

Note that the planner is not told to do any planning; instead it is passed to the SubgoalOption constructor along with the name of the option, the initiation state test/iterator, the subgoal test, and the policy that will use the planner's results. When this constructor is used, the option automatically calls the planner on every state that the initiation state test iterator iterates over, thereby defining the policy for the option to use. Creation of the option is therefore complete once the constructor has been given this information. In this case we chose VI as the planner for our option's policy, but this constructor would work with any other OOMDPPlanner it is passed, provided the planner supports the kind of domain required of it (e.g., stochastic versus deterministic).
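Conceptually, the work the constructor performs can be pictured as the loop below. This is only a sketch of the behavior described above, not the actual SubgoalOption source, and the setPlanner name for hooking the planner into the PlannerDerivedPolicy is an assumption:

//conceptual sketch of what the SubgoalOption constructor does with the objects it receives
Iterator it = inRoom.iterator();
while(it.hasNext()){
	planner.planFromState((State)it.next()); //run the planner from every initiation state
}
p.setPlanner(planner); //assumed name of the PlannerDerivedPolicy method that receives the planner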

With the ability to create an option for any specified room and hallway, we will finally create one more method that will add all eight room options to any given planner.

public void addRoomsOptionsToPlanner(OOMDPPlanner planner){
		
	planner.addNonDomainReferencedAction(this.getRoomOption("blt", 0, 4, 0, 4, 1, 5));
	planner.addNonDomainReferencedAction(this.getRoomOption("blr", 0, 4, 0, 4, 5, 1));
	
	planner.addNonDomainReferencedAction(this.getRoomOption("tlr", 0, 4, 6, 10, 5, 8));
	planner.addNonDomainReferencedAction(this.getRoomOption("tlb", 0, 4, 6, 10, 1, 5));
	
	planner.addNonDomainReferencedAction(this.getRoomOption("trb", 6, 10, 5, 10, 8, 4));
	planner.addNonDomainReferencedAction(this.getRoomOption("trl", 6, 10, 5, 10, 5, 8));
	
	planner.addNonDomainReferencedAction(this.getRoomOption("brt", 6, 10, 0, 3, 8, 4));
	planner.addNonDomainReferencedAction(this.getRoomOption("brl", 6, 10, 6, 3, 5, 1));
	
	
}
				

Note that the addNonDomainReferencedAction method is defined for every OOMDPPlanner instance and can be used to tell a planner to use additional actions that are not defined in the domain instance the planner was given. These could be additional non-option actions, although in our case we are passing it options. This method also automatically checks whether the passed-in actions are option instances and, if so, calls the appropriate methods on them to provide the reward function and discount factor they use to keep track of reward. It also automatically switches the planner's reward function to an instance of OptionEvaluatingRF wrapped around the previously provided reward function (recall that this reward function simply retrieves the reward last returned by an option whenever the action argument to the reward function is an option).
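The wrapping behavior can be pictured roughly as follows. This is only a hypothetical illustration of the idea; the reward method signature and the getLastCumulativeReward accessor name are assumptions, not confirmed toolbox API:

//rough, hypothetical sketch of an OptionEvaluatingRF-style wrapper around a source reward function
public double reward(State s, GroundedAction ga, State sprime){
	if(ga.action instanceof Option){
		//an option tracks the discounted reward it accumulated during its last execution
		return ((Option)ga.action).getLastCumulativeReward(); //assumed accessor name
	}
	return this.sourceRF.reward(s, ga, sprime); //otherwise defer to the wrapped reward function
}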

Using Options with Planning and Learning Algorithms

To use the options we created with planning and learning algorithms, we can make a small modification to the planning methods we defined in the Basic Planning and Learning tutorial. Specifically, after the instantiation of an OOMDPPlanner instance, we simply need to call our addRoomsOptionsToPlanner method on it. For instance, the modified Q-learning method is shown below.

public void QLearningExample(String outputPath){
		
	if(!outputPath.endsWith("/")){
		outputPath = outputPath + "/";
	}
	
	//creating the learning algorithm object; discount= 0.99; initialQ=-90.0; learning rate=0.9
	LearningAgent agent = new QLearning(domain, rf, tf, 0.99, hashingFactory, -90.0, 0.9);
	this.addRoomsOptionsToPlanner((OOMDPPlanner)agent);
	
	//run learning for 100 episodes
	for(int i = 0; i < 100; i++){
		EpisodeAnalysis ea = agent.runLearningEpisodeFrom(initialState);
		ea.writeToFile(String.format("%se%03d", outputPath, i), sp);
		System.out.println(i + ": " + ea.numTimeSteps());
	}
	
}
				

Note that the only difference between this method and the one previously defined in the Basic Planning and Learning tutorial is the method call that adds the options; everything else works implicitly. Indeed, you can add options to any of the planners defined in the OO-MDP Toolbox and they will work! Try modifying the methods for other planners, like BFS, to use options in the same way (a sketch follows the main method below). Of course, before you do so, you will want to set the main method to call this method and visualize the results, so modify your main method as follows.

public static void main(String[] args) {
		
		
	OptionsExample example = new OptionsExample();
	String outputPath = "output"; //directory to record results
	
	
	example.QLearningExample(outputPath);
	
	//run the visualizer
	example.visualize(outputPath);

}
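If you also want to try options with a deterministic planner like BFS, the modification is the same single line. Below is a minimal sketch, assuming the goalCondition, hashingFactory, rf, tf, and sp members inherited from the BasicBehavior class of the previous tutorial, and the BFS and SDPlannerPolicy usage shown there:

public void BFSOptionsExample(String outputPath){
	
	if(!outputPath.endsWith("/")){
		outputPath = outputPath + "/";
	}
	
	//BFS ignores reward; it searches for a state satisfying the goal condition
	DeterministicPlanner planner = new BFS(domain, goalCondition, hashingFactory);
	this.addRoomsOptionsToPlanner(planner);
	planner.planFromState(initialState);
	
	//capture the resulting plan in a policy and record it to a file
	Policy p = new SDPlannerPolicy(planner);
	p.evaluateBehavior(initialState, rf, tf).writeToFile(outputPath + "planResult", sp);
	
}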
				

Before concluding, it is worth noting some properties of analyzing results that include options. By default, the QLearning class will return EpisodeAnalysis objects from the runLearningEpisodeFrom method that decompose an option's execution into the sequence of primitive actions it took, annotating each primitive action with the option that executed it and the step of the option at which it was taken. For instance, in the episode visualizer, you should see results like the below in an episode sequence.

Note that the "blr" is the name of the option that was executed; the (x), where x is a number indicates the step of the option's exeuctution, and the name to the right of '-' is the name of the primitive action taken by the option. You, however, prefer either only have the primitive actions listed in name, or not have the options decoposed into primitives in the EpisodeAnalysis object at all. If you want the options decomposed into primitives but not annotated, then you can use the QLearning method toggleShouldAnnotateOptionDecomposition to change that behavior. The result is episodes will look like the agent only took primitive actions, even though those primtive acitons may have been produced by options. If you do not want options decomposed into primitives and only want to which options were executed, you can use the method toggleShouldDecomposeOption to change the behavior. If disable options from being decomposed in the returned EpisodeAnalysis object, you will get results where only the option name is printed without listing the steps it took, like in the below image.

Planning algorithm results are captured by wrapping a Policy object around the planner and generating an episode from it. By default, the Policy evaluate methods will also return EpisodeAnalysis objects with decomposed and annotated options, but you can change this behavior by calling similar methods on the Policy object. Specifically, you can call the methods evaluateMethodsShouldDecomposeOption and evaluateMethodsShouldAnnotateOptionDecomposition to change these settings.
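For example, the calls might look like the following sketch, which assumes each of these toggle methods takes a boolean flag:

//assumed: each toggle method accepts a boolean flag
QLearning agent = new QLearning(domain, rf, tf, 0.99, hashingFactory, -90.0, 0.9);

//keep the primitive decomposition but drop the option annotations
agent.toggleShouldAnnotateOptionDecomposition(false);

//or report only which options were executed, without their primitive steps
agent.toggleShouldDecomposeOption(false);

//for planning algorithms, the equivalent settings are on the Policy wrapped around the planner
//(p here could be, e.g., the SDPlannerPolicy from the BFS sketch above)
p.evaluateMethodsShouldDecomposeOption(false);
p.evaluateMethodsShouldAnnotateOptionDecomposition(false);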

Conclusion

This concludes the Options Tutorial. You are encouraged to try using options in the Four Rooms domain with all of the planners defined in the previous Basic Planning and Learning Tutorial. For your convenience, all of the code that was created in this tutorial is listed below.
import java.util.Iterator;

import oomdptb.behavior.*;
import oomdptb.behavior.learning.LearningAgent;
import oomdptb.behavior.learning.tdmethods.*;
import oomdptb.behavior.options.*;
import oomdptb.behavior.planning.*;
import oomdptb.behavior.planning.commonpolicies.*;
import oomdptb.behavior.planning.deterministic.*;
import oomdptb.behavior.planning.deterministic.informed.Heuristic;
import oomdptb.behavior.planning.deterministic.informed.astar.AStar;
import oomdptb.behavior.planning.deterministic.uninformed.bfs.BFS;
import oomdptb.behavior.planning.deterministic.uninformed.dfs.DFS;
import oomdptb.behavior.planning.stochastic.valueiteration.ValueIteration;
import oomdptb.behavior.statehashing.DiscreteMaskHashingFactory;
import oomdptb.oomdp.ObjectInstance;
import oomdptb.oomdp.State;
import domain.gridworld.GridWorldDomain;

public class OptionsExample extends BasicBehavior{

	
	

	
	public static void main(String[] args) {
		
		
		OptionsExample example = new OptionsExample();
		String outputPath = "output"; //directory to record results
		
		
		
		example.QLearningExample(outputPath);
		
		//run the visualizer
		example.visualize(outputPath);

	}
	
	
	public OptionsExample(){
		super();
		
		//override initial state goal location to be on a hallway where options can be most exploited
		GridWorldDomain.setLocation(initialState, 0, 5, 8);
	}
	
	
	public void QLearningExample(String outputPath){
		
		if(!outputPath.endsWith("/")){
			outputPath = outputPath + "/";
		}
		
		//creating the learning algorithm object; discount= 0.99; initialQ=-90.0; learning rate=0.9
		LearningAgent agent = new QLearning(domain, rf, tf, 0.99, hashingFactory, -90.0, 0.9);
		this.addRoomsOptionsToPlanner((OOMDPPlanner)agent);
		
		//run learning for 100 episodes
		for(int i = 0; i < 100; i++){
			EpisodeAnalysis ea = agent.runLearningEpisodeFrom(initialState);
			ea.writeToFile(String.format("%se%03d", outputPath, i), sp);
			System.out.println(i + ": " + ea.numTimeSteps());
		}
		
	}
	
	
	public void addRoomsOptionsToPlanner(OOMDPPlanner planner){
		
		planner.addNonDomainReferencedAction(this.getRoomOption("blt", 0, 4, 0, 4, 1, 5));
		planner.addNonDomainReferencedAction(this.getRoomOption("blr", 0, 4, 0, 4, 5, 1));
		
		planner.addNonDomainReferencedAction(this.getRoomOption("tlr", 0, 4, 6, 10, 5, 8));
		planner.addNonDomainReferencedAction(this.getRoomOption("tlb", 0, 4, 6, 10, 1, 5));
		
		planner.addNonDomainReferencedAction(this.getRoomOption("trb", 6, 10, 5, 10, 8, 4));
		planner.addNonDomainReferencedAction(this.getRoomOption("trl", 6, 10, 5, 10, 5, 8));
		
		planner.addNonDomainReferencedAction(this.getRoomOption("brt", 6, 10, 0, 3, 8, 4));
		planner.addNonDomainReferencedAction(this.getRoomOption("brl", 6, 10, 6, 3, 5, 1));
		
		
	}
	
	
	public Option getRoomOption(String name, int leftBound, int rightBound, 
			int bottomBound, int topBound, int hx, int hy){
		
		StateConditionTestIterable inRoom = new InRoomStateCheck(leftBound, rightBound, 
															bottomBound, topBound);
															
		StateConditionTest atHallway = new AtPositionStateCheck(hx, hy);
		
		DiscreteMaskHashingFactory hashingFactory = new DiscreteMaskHashingFactory();
		hashingFactory.setAttributesForClass(GridWorldDomain.CLASSAGENT, 
					domain.getObjectClass(GridWorldDomain.CLASSAGENT).attributeList);
		
		OOMDPPlanner planner = new ValueIteration(domain, new LocalSubgoalRF(inRoom, atHallway), 
					new LocalSubgoalTF(inRoom, atHallway), 0.99, hashingFactory, 0.001, 50);
		
		PlannerDerivedPolicy p = new GreedyDeterministicQPolicy();
		
		return new SubgoalOption(name, inRoom, atHallway, planner, p);
	}
	
	
	
	class InRoomStateCheck implements StateConditionTestIterable{

		int		leftBound;
		int		rightBound;
		int		bottomBound;
		int 	topBound;
		
		
		public InRoomStateCheck(int leftBound, int rightBound, int bottomBound, int topBound){
			this.leftBound = leftBound;
			this.rightBound = rightBound;
			this.bottomBound = bottomBound;
			this.topBound = topBound;
		}
		
		
		@Override
		public boolean satisfies(State s) {
			
			ObjectInstance agent = s.getObjectsOfTrueClass(GridWorldDomain.CLASSAGENT).get(0);
			
			int ax = agent.getDiscValForAttribute(GridWorldDomain.ATTX);
			int ay = agent.getDiscValForAttribute(GridWorldDomain.ATTY);
			
			if(ax >= this.leftBound && ax <= this.rightBound && 
				ay >= this.bottomBound && ay <= this.topBound){
				
				return true;
			}
			
			return false;
		}
	
		@Override
		public Iterator iterator() {
	
			return new Iterator() {
	
				int ax=leftBound;
				int ay=bottomBound;
				
				@Override
				public boolean hasNext() {
					
					if(ay <= topBound){
						return true;
					}
					
					return false;
				}
	
				@Override
				public State next() {
					
					State s = GridWorldDomain.getOneAgentNLocationState(domain, 0);
					GridWorldDomain.setAgent(s, ax, ay);
					
					ax++;
					if(ax > rightBound){
						ax = leftBound;
						ay++;
					}
					
					return s;
				}
	
				@Override
				public void remove() {
					throw new UnsupportedOperationException();
				}
			};
			
		}
	
		@Override
		public void setStateContext(State s) {
			//do not need to do anything here
		}
		
		
		
		
	}	
	
	
	class AtPositionStateCheck implements StateConditionTest{

		int x;
		int y;
			
		public AtPositionStateCheck(int x, int y){
			this.x = x;
			this.y = y;
		}
		
		@Override
		public boolean satisfies(State s) {
	
			ObjectInstance agent = s.getObjectsOfTrueClass(GridWorldDomain.CLASSAGENT).get(0);
			
			int ax = agent.getDiscValForAttribute(GridWorldDomain.ATTX);
			int ay = agent.getDiscValForAttribute(GridWorldDomain.ATTY);
			
			if(ax == this.x && ay == this.y){
				return true;
			}
			
			return false;
		}
			
	}

}
End.