Tutorial: Basic Planning and Learning

Planning with Value Iteration

A common stochastic planner is Value Iteration (VI). An advantage of VI is that it computes a policy for the entire state space reachable from the initial state passed to the planFromState method. To set up a VI planner, define the following method.

public void ValueIterationExample(String outputPath){
		
	if(!outputPath.endsWith("/")){
		outputPath = outputPath + "/";
	}
	
	
	OOMDPPlanner planner = new ValueIteration(domain, rf, tf, 0.99, hashingFactory, 0.001, 100);
	planner.planFromState(initialState);
	
	//create a Q-greedy policy from the planner
	Policy p = new GreedyQPolicy((QComputablePlanner)planner);
	
	//record the plan results to a file
	p.evaluateBehavior(initialState, rf, tf).writeToFile(outputPath + "planResult", sp);
	
}
				

VI is a planning method defined for the classic MDP formalism, so unlike the previous deterministic planners, its constructor takes as input the reward function and termination function rather than a goal condition. VI also takes a discount factor, which specifies how strongly future rewards are valued relative to immediate rewards. In this case a fairly large value of 0.99 is used, which means the agent values near-future rewards almost as much as immediate ones (for example, a reward received 10 steps in the future is discounted by 0.99^10 ≈ 0.90).

The last two parameters to the constructor specify stopping conditions for planning. The second-to-last parameter specifies that planning will stop when the maximum change in the value function of any state falls below the given threshold (0.001 in this case). The last parameter specifies the maximum number of updates for each state that may occur before planning is stopped (100 in this case), regardless of whether the value function change threshold was ever crossed.
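
For example, if you wanted a tighter convergence criterion, you could construct the planner with a smaller value-change threshold and a larger update cap; the specific numbers below are purely illustrative.

//illustrative settings only: stop when no state's value changes by more than
//0.0001, or after at most 500 updates per state
OOMDPPlanner tighterPlanner = new ValueIteration(domain, rf, tf, 0.99, hashingFactory, 0.0001, 500);
tighterPlanner.planFromState(initialState);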

Since VI is a stochastic planning algorithm rather than a deterministic one like the previous algorithms we used, we cannot capture its planning results in an SDPlannerPolicy Policy class. Instead, a policy can be derived from the value function the planner estimates for each state using the GreedyQPolicy class, which can be created for any planner that adheres to the QComputablePlanner interface, as the VI algorithm does.
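
If you want to inspect the Q-values that the greedy policy acts on, a sketch like the following may help. It assumes the QComputablePlanner interface exposes a getQs method returning a list of QValue objects with publicly accessible action and value fields; verify those names against the interface in your version of the toolbox.

//sketch only: the getQs method and the QValue fields used here are assumptions
QComputablePlanner qPlanner = (QComputablePlanner)planner;
List<QValue> qValues = qPlanner.getQs(initialState);
for(QValue qv : qValues){
	System.out.println(qv.a.toString() + ": " + qv.q);
}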

Once the solution is captured in a Policy object, the results can be recorded and visualized in the same way as before. Try setting the main method to call our newly defined VI example method now.
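
For example, in the main method you can comment out the BFS call and uncomment the VI call (this mirrors the full code listing at the end of this tutorial):

//example.BFSExample(outputPath);
example.ValueIterationExample(outputPath);

//run the visualizer
example.visualize(outputPath);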

Learning with Q-Learning

All of the previous examples used planning algorithms to solve our task. In this section we will diverge from that and use a learning algorithm, Q-learning, to solve the task. Ultimately, learning algorithms are utilized in much the same way as planning algorithms, except that you will run multiple episodes of learning to solve the task (or one very long episode if it is a continuing task rather than an episodic one). The method you should define to utilize Q-learning is shown below.

public void QLearningExample(String outputPath){
		
	if(!outputPath.endsWith("/")){
		outputPath = outputPath + "/";
	}
	
	//creating the learning algorithm object; discount= 0.99; initialQ=0.0; learning rate=0.9
	LearningAgent agent = new QLearning(domain, rf, tf, 0.99, hashingFactory, 0., 0.9);
	
	//run learning for 100 episodes
	for(int i = 0; i < 100; i++){
		EpisodeAnalysis ea = agent.runLearningEpisodeFrom(initialState);
		ea.writeToFile(String.format("%se%03d", outputPath, i), sp); 
		System.out.println(i + ": " + ea.numTimeSteps());
	}
	
}				
				

Let's first look at the constructor. Rather than a planner instance, we're creating a LearningAgent instance, which provides methods for learning. QLearning is an implementation of the LearningAgent interface and, like the Value Iteration planner, takes as parameters the reward function, termination function, and discount factor rather than a goal condition. The last two parameters of the constructor are the initial Q-value to use for previously untaken state-action pairs (which we set to 0.0) and the learning rate, which we set fairly high since this is a simple deterministic task. Note that this constructor will by default set Q-learning to use a 0.1 epsilon-greedy policy. There are other constructors that let you choose which learning policy to use, and there is also a mutator for setting it after construction if you'd like to use a different policy (a sketch of doing so follows).
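
As a sketch of swapping in a less random learning policy, you might do something like the following; the setLearningPolicy mutator name and the EpsilonGreedy policy class are assumptions here, so check your version of the toolbox for the exact names.

QLearning qAgent = new QLearning(domain, rf, tf, 0.99, hashingFactory, 0., 0.9);
//assumed API: a mutator for the learning policy and an epsilon-greedy policy class
qAgent.setLearningPolicy(new EpsilonGreedy(qAgent, 0.05));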

With the QLearning instance created, we next run learning episodes rather than telling it to plan a solution. To make sure a decent solution is learned, we will let learning run for 100 episodes, so we set up a loop to do so. To run a learning episode, we call the method runLearningEpisodeFrom on the LearningAgent instance and pass it the initial state from which the learning episode should begin. Since this method performs learning for an entire episode, it returns an EpisodeAnalysis object that stores the results of the episode. Once we have this episode, we simply save it to a file as we did with the planning results in previous examples (taking care to give each episode a different name so that we can view them all). We also added a line to print the number of steps taken in each learning episode to the terminal so that we can follow the agent's overall progress as it learns.

After that, you can call this method from your main method and run the results! This time, when the EpisodeSequenceVisualizer launches, you will find 100 episode files in the list, one for each episode of learning that occurred. You should find that the behavior in the earlier episodes is quite bad, but as the agent gains experience in the world, its performance improves. You may find that even after a large amount of learning the behavior remains slightly random; this randomness is a result of learning with the epsilon-greedy policy, which always takes a random action with some probability. Alternatively, you could capture the final learning results in a GreedyQPolicy like we did with VI and record those results, as sketched below.
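
A minimal sketch of that alternative, placed after the learning loop, is shown here; it assumes the QLearning instance implements the QComputablePlanner interface (which GreedyQPolicy requires), and the output file name learnedResult is just an example.

//sketch: capture the learned greedy behavior after the learning loop
//(assumes the QLearning instance can be cast to QComputablePlanner)
Policy learnedPolicy = new GreedyQPolicy((QComputablePlanner)agent);
learnedPolicy.evaluateBehavior(initialState, rf, tf).writeToFile(outputPath + "learnedResult", sp);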

One final thing to note is that while QLearning adheres to the LearningAgent interface, it is also an instance of OOMDPPlanner, which means we could have called the planFromState method on it. For the QLearning algorithm, the planFromState method will automatically run some number of learning episodes for you. By default it will only run one (which is unlikely to yield very good results!), but you can adjust its termination conditions in a way similar to how the VI planner terminates. Specifically, you can set a maximum number of learning episodes to run with the setMaxEpisodesForPlanning method, or you can specify a Q-value change threshold with the setMaxQChangeForPlanningTerminaiton method, which will terminate planning when the maximum Q-value change within an episode is less than the threshold. A sketch of this usage follows.
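
In this sketch the method names follow the text above; the episode count and threshold values are illustrative.

QLearning qPlanner = new QLearning(domain, rf, tf, 0.99, hashingFactory, 0., 0.9);
qPlanner.setMaxEpisodesForPlanning(100); //run at most 100 learning episodes
qPlanner.setMaxQChangeForPlanningTerminaiton(0.001); //or stop earlier once Q-values barely change
qPlanner.planFromState(initialState);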

Learning with Sarsa(λ) Learning

A learning algorithm similar to Q-learning is Sarsa(λ). The first difference between the two algorithms is that Sarsa(λ) updates Q-values with respect to the Q-value of the next action actually taken, rather than the maximum Q-value of the next state (see Wikipedia for more information). The second, and larger, difference is that Sarsa(λ) will also update the Q-values of state-action pairs experienced earlier in the episode, with each of those updates weighted by an amount controlled by λ (the pseudocode sketch below summarizes the difference in update targets). Define the method that follows the sketch to solve our task with Sarsa(λ).
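
Conceptually (this is pseudocode, not toolbox code), the one-step update targets differ as follows:

Q-learning target: r + discount * max over a' of Q(s', a')
Sarsa target:      r + discount * Q(s', a'), where a' is the action the agent actually takes next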

public void SarsaLearningExample(String outputPath){
		
	if(!outputPath.endsWith("/")){
		outputPath = outputPath + "/";
	}
	
	//discount= 0.99; initialQ=0.0; learning rate=0.5; lambda=1.0
	LearningAgent agent = new SarsaLam(domain, rf, tf, 0.99, hashingFactory, 0., 0.5, 1.0);
	
	//run learning for 100 episodes
	for(int i = 0; i < 100; i++){
		EpisodeAnalysis ea = agent.runLearningEpisodeFrom(initialState);
		ea.writeToFile(String.format("%se%03d", outputPath, i), sp);
		System.out.println(i + ": " + ea.numTimeSteps());
	}
	
}				
				

You will notice that this method looks nearly identical to the Q-learning example, except that this time a SarsaLam instance is constructed. Additionally, we lowered the learning rate to 0.5 (typically you should use lower learning rates when you use a higher value of λ). The last parameter of the constructor is the λ value, which we set to 1.0. A value of λ=1.0 effectively makes the algorithm behave like an online Monte Carlo method, in which the effects of all future interactions in an episode are fully considered when updating each Q-value of the episode.

Otherwise, the rest is the same; you can call this method from the main method and give it a shot! You should find that learning is much faster than Q-learning when this higher value of λ is used. Like QLearning, the SarsaLam instance also supports the planFromState method, and the conditions for planning termination can be set in the same way (SarsaLam is actually a subclass of QLearning).

Conclusion

This ends our tutorial on implementing basic planning and learning algorithms in the OO-MDP Toolbox. There are other planning and learning algorithms in the toolbox, but hopefully this tutorial has explained the core concepts well enough that you will be able to try different algorithms easily. As a future exercise, we encourage you to do just that within the code you've created! The complete set of code that we wrote in this tutorial is shown below for your convenience.

import domain.gridworld.*;
import oomdptb.behavior.*;
import oomdptb.behavior.learning.*;
import oomdptb.behavior.learning.tdmethods.*;
import oomdptb.behavior.planning.*;
import oomdptb.behavior.planning.commonpolicies.GreedyQPolicy;
import oomdptb.behavior.planning.deterministic.*;
import oomdptb.behavior.planning.deterministic.informed.Heuristic;
import oomdptb.behavior.planning.deterministic.informed.astar.AStar;
import oomdptb.behavior.planning.deterministic.uninformed.bfs.BFS;
import oomdptb.behavior.planning.deterministic.uninformed.dfs.DFS;
import oomdptb.behavior.statehashing.DiscreteStateHashFactory;
import oomdptb.behavior.planning.stochastic.valueiteration.ValueIteration;
import oomdptb.oomdp.Domain;
import oomdptb.oomdp.ObjectInstance;
import oomdptb.oomdp.RewardFunction;
import oomdptb.oomdp.State;
import oomdptb.oomdp.StateParser;
import oomdptb.oomdp.TerminalFunction;
import oomdptb.oomdp.common.*;
import oomdptb.oomdp.visualizer.Visualizer;

public class BasicBehavior {

	
	
	GridWorldDomain 			gwdg;
	Domain						domain;
	StateParser 				sp;
	RewardFunction 				rf;
	TerminalFunction			tf;
	StateConditionTest			goalCondition;
	State 						initialState;
	DiscreteStateHashFactory	hashingFactory;
	
	
	public static void main(String[] args) {
		
		
		BasicBehavior example = new BasicBehavior();
		String outputPath = "output/"; 
		
		
		//uncomment the example you want to see (and comment-out the rest)
		
		example.BFSExample(outputPath);
		//example.DFSExample(outputPath);
		//example.AStarExample(outputPath);
		//example.ValueIterationExample(outputPath);
		//example.QLearningExample(outputPath);
		//example.SarsaLearningExample(outputPath);
		
		
		//run the visualizer
		example.visualize(outputPath);

	}
	
	
	public BasicBehavior(){
	
		//create the domain
		gwdg = new GridWorldDomain(11, 11);
		gwdg.setMapToFourRooms(); 
		domain = gwdg.generateDomain();
		
		//create the state parser
		sp = new GridWorldStateParser(domain); 
		
		//define the task
		rf = new UniformCostRF(); 
		tf = new SinglePFTF(domain.getPropFunction(GridWorldDomain.PFATLOCATION)); 
		goalCondition = new TFGoalCondition(tf);
		
		//set up the initial state of the task
		initialState = GridWorldDomain.getOneAgentOneLocationState(domain);
		GridWorldDomain.setAgent(initialState, 0, 0);
		GridWorldDomain.setLocation(initialState, 0, 10, 10);
		
		//set up the state hashing system
		hashingFactory = new DiscreteStateHashFactory();
		hashingFactory.setAttributesForClass(GridWorldDomain.CLASSAGENT, 
				domain.getObjectClass(GridWorldDomain.CLASSAGENT).attributeList); 
		
			
	}
	
	
	public void visualize(String outputPath){
		Visualizer v = GridWorldVisualizer.getVisualizer(domain, gwdg.getMap());
		EpisodeSequenceVisualizer evis = new EpisodeSequenceVisualizer(v, 
								domain, sp, outputPath);
	}
	
	
	
	
	public void BFSExample(String outputPath){
		
		if(!outputPath.endsWith("/")){
			outputPath = outputPath + "/";
		}
		
		//BFS ignores reward; it just searches for a goal condition satisfying state
		DeterministicPlanner planner = new BFS(domain, goalCondition, hashingFactory); 
		planner.planFromState(initialState);
		
		//capture the computed plan in a partial policy
		Policy p = new SDPlannerPolicy(planner);
		
		//record the plan results to a file
		p.evaluateBehavior(initialState, rf, tf).writeToFile(outputPath + "planResult", sp);
			
	}				



	public void DFSExample(String outputPath){
		
		if(!outputPath.endsWith("/")){
			outputPath = outputPath + "/";
		}
		
		//DFS ignores reward; it just searches for a goal condition satisfying state
		DeterministicPlanner planner = new DFS(domain, goalCondition, hashingFactory);
		planner.planFromState(initialState);
		
		//capture the computed plan in a partial policy
		Policy p = new SDPlannerPolicy(planner);
		
		//record the plan results to a file
		p.evaluateBehavior(initialState, rf, tf).writeToFile(outputPath + "planResult", sp);
		
	}
	
	
	
	
	public void AStarExample(String outputPath){
		
		if(!outputPath.endsWith("/")){
			outputPath = outputPath + "/";
		}
		
		
		Heuristic mdistHeuristic = new Heuristic() {
			
			@Override
			public double h(State s) {
				
				String an = GridWorldDomain.CLASSAGENT;
				String ln = GridWorldDomain.CLASSLOCATION;
				
				ObjectInstance agent = s.getObjectsOfTrueClass(an).get(0); 
				ObjectInstance location = s.getObjectsOfTrueClass(ln).get(0); 
				
				//get agent position
				int ax = agent.getDiscValForAttribute(GridWorldDomain.ATTX);
				int ay = agent.getDiscValForAttribute(GridWorldDomain.ATTY);
				
				//get location position
				int lx = location.getDiscValForAttribute(GridWorldDomain.ATTX);
				int ly = location.getDiscValForAttribute(GridWorldDomain.ATTY);
				
				//compute Manhattan distance
				double mdist = Math.abs(ax-lx) + Math.abs(ay-ly);
				
				return -mdist;
			}
		};
		
		//provide A* the heuristic as well as the reward function so that it can keep
		//track of the actual cost
		DeterministicPlanner planner = new AStar(domain, rf, goalCondition, 
			hashingFactory, mdistHeuristic);
		planner.planFromState(initialState);
		
		
		//capture the computed plan in a partial policy
		Policy p = new SDPlannerPolicy(planner);
		
		//record the plan results to a file
		p.evaluateBehavior(initialState, rf, tf).writeToFile(outputPath + "planResult", sp);
		
	}	
	
	
	
	public void ValueIterationExample(String outputPath){
		
		if(!outputPath.endsWith("/")){
			outputPath = outputPath + "/";
		}
		
		
		OOMDPPlanner planner = new ValueIteration(domain, rf, tf, 0.99, hashingFactory,
								0.001, 100);
		planner.planFromState(initialState);
		
		//create a Q-greedy policy from the planner
		Policy p = new GreedyQPolicy((QComputablePlanner)planner);
		
		//record the plan results to a file
		p.evaluateBehavior(initialState, rf, tf).writeToFile(outputPath + "planResult", sp);
		
	}
	
	
	public void QLearningExample(String outputPath){
		
		if(!outputPath.endsWith("/")){
			outputPath = outputPath + "/";
		}
		
		//discount= 0.99; initialQ=0.0; learning rate=0.9
		LearningAgent agent = new QLearning(domain, rf, tf, 0.99, hashingFactory, 0., 0.9);
		
		//run learning for 100 episodes
		for(int i = 0; i < 100; i++){
			EpisodeAnalysis ea = agent.runLearningEpisodeFrom(initialState);
			ea.writeToFile(String.format("%se%03d", outputPath, i), sp); 
			System.out.println(i + ": " + ea.numTimeSteps());
		}
		
	}
	
	
	public void SarsaLearningExample(String outputPath){
		
		if(!outputPath.endsWith("/")){
			outputPath = outputPath + "/";
		}
		
		//discount= 0.99; initialQ=0.0; learning rate=0.5; lambda=1.0
		LearningAgent agent = new SarsaLam(domain, rf, tf, 0.99, hashingFactory,
						0., 0.5, 1.0);
		
		//run learning for 100 episodes
		for(int i = 0; i < 100; i++){
			EpisodeAnalysis ea = agent.runLearningEpisodeFrom(initialState);
			ea.writeToFile(String.format("%se%03d", outputPath, i), sp);
			System.out.println(i + ": " + ea.numTimeSteps());
		}
		
	}	

	
}

				
End.