David Abel


Simple RL


I just finished up the alpha version of simple_rl, a library for running Reinforcement Learning experiments in Python 2.7. The library is designed to generate quick and easily reproducible results. As the name suggests, it's intended to be simple and easy to use. In this post I'll give an overview of the library's features by going over some example experiments, which I'm hoping will serve as a mini tutorial. I'll assume those reading this are familiar with Reinforcement Learning (RL). For those who aren't, I suggest reading Ian Osband's excellent writeup introducing RL.

To install, simply enter:

pip install simple_rl

Or clone the repository linked above and install via the usual:

python setup.py install

The only dependencies are numpy and matplotlib, though if you want to run experiments in the OpenAI Gym, you'll also need that installed.

Example 1: Chain

The main workhorse for simple_rl is the run_agents_on_mdp function from the run_experiments submodule (simple_rl.run_experiments). This function takes a list of Agents and an MDP instance as input, runs each agent on the MDP, stores all results in cur_dir/results/mdp-name/*, and generates (and opens) a plot of the learning curves. We can also control various parameters of the experiment, like the number of episodes, the number of steps an agent takes per episode, and the total number of instances of each agent type to run (for confidence intervals).

For example, the following runs Q-Learning and R-Max on a simple Chain MDP from [Strens 2000]:

# Imports
import simple_rl
from simple_rl.agents import QLearnerAgent, RMaxAgent
from simple_rl.tasks import ChainMDP
from simple_rl.run_experiments import run_agents_on_mdp

# Setup MDP, Agents, and run.
chain_mdp = ChainMDP(5)
ql_agent = QLearnerAgent(chain_mdp.actions)
rm_agent = RMaxAgent(chain_mdp.actions)
run_agents_on_mdp([ql_agent, rm_agent], chain_mdp)

Running this will output relevant experiment information (which is also stored in the same directory as the results), and will continually update the status of the experiment to the console:

Running experiment:
    num_instances : 10
    num_episodes : 50
    num_steps : 10

qlearner-softmax is learning.
    Instance 1 of 10.

When the experiment is finished (which on my laptop takes around 2 seconds), the console produces the following:

--- TIMES ---
rmax-h4 agent took 1.366 seconds.
qlearner-softmax agent took 0.352 seconds.

Mean last episode: (qlearner-softmax) : 22.178 (conf_interv: 4.88 )
Mean last episode: (rmax-h4) : 143.66 (conf_interv: 34.50 )

Also, a plot showing the learning curves of the two algorithms will open:

Chain Results

Simple! If you want to control various knobs of the experiment (like the number of steps taken per episode or the number of episodes), these are parameters of the run_agents_on_mdp function. For instance, if we wanted to run 5 instances of each agent (for confidence intervals), for 50 episodes, with 25 steps per episode, we would call the following:

run_agents_on_mdp([ql_agent, rm_agent], chain_mdp, num_episodes=50, num_instances=5, num_steps=25)

Which produced the following learning curves:

Chain Results
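To make the three knobs (num_instances, num_episodes, num_steps) concrete, here's a rough, self-contained sketch of the kind of nested loop an experiment runner like this performs. This is not simple_rl's implementation: ToyChain, AlwaysForwardAgent, and run_sketch are hypothetical names made up for illustration, and the toy chain only loosely follows the chain from [Strens 2000] (it has no action slipping).

```python
class ToyChain(object):
    """Hypothetical 5-state chain (illustration only, no slipping)."""

    ACTIONS = ["forward", "back"]

    def __init__(self, num_states=5):
        self.num_states = num_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward)."""
        if action == "forward":
            # Moving forward only pays off once the end of the chain is reached.
            at_end = self.state == self.num_states - 1
            reward = 10.0 if at_end else 0.0
            self.state = min(self.state + 1, self.num_states - 1)
        else:
            # Going back gives a small immediate reward and resets progress.
            reward = 2.0
            self.state = 0
        return self.state, reward


class AlwaysForwardAgent(object):
    """Hypothetical non-learning agent, used to keep the example deterministic."""

    def __init__(self, actions):
        self.actions = actions

    def act(self, state, reward):
        return "forward"


def run_sketch(make_agent, make_env, num_instances=10, num_episodes=50, num_steps=10):
    """Return results[instance][episode] = total reward collected in that episode."""
    results = []
    for _ in range(num_instances):
        agent, env = make_agent(), make_env()  # fresh agent and MDP per instance
        episode_returns = []
        for _ in range(num_episodes):
            state, reward, total = env.reset(), 0.0, 0.0
            for _ in range(num_steps):
                action = agent.act(state, reward)
                state, reward = env.step(action)
                total += reward
            episode_returns.append(total)
        results.append(episode_returns)
    return results


results = run_sketch(lambda: AlwaysForwardAgent(ToyChain.ACTIONS), ToyChain,
                     num_instances=2, num_episodes=3, num_steps=10)
print(results[0])  # [60.0, 60.0, 60.0]
```

Each instance gets its own agent, so the per-episode totals can be compared across instances to put confidence intervals on a learning curve.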

Example 2: Taxi

In addition to defining MDPs using a traditional state enumeration method, simple_rl has support for defining MDPs with an Object-Oriented Representation, introduced by [Diuk et al. 2008]. With objects it becomes much easier to code up more complex problems, such as the Taxi problem from [Dietterich 2000].

Running experiments on Taxi is nearly identical to the above example. The only added complexity is specifying certain properties of the Taxi instance: where the taxi starts, where each passenger is and where they want to go, and where any walls are.

Let's also switch the Q-Learner's exploration strategy from softmax to epsilon-greedy, and add a randomly acting agent. To set up the Taxi MDP we'll end up with:

# Imports
import simple_rl
from simple_rl.agents import QLearnerAgent, RMaxAgent, RandomAgent
from simple_rl.tasks import TaxiOOMDP
from simple_rl.run_experiments import run_agents_on_mdp

# Setup Taxi OO-MDP.
agent = {"x":1, "y":1, "has_passenger":0}
passengers = [{"x":4, "y":3, "dest_x":2, "dest_y":2, "in_taxi":0}]
walls = []
taxi_mdp = TaxiOOMDP(5, 5, agent_loc=agent, walls=walls, passengers=passengers)

# Setup agents and run.
ql_agent = QLearnerAgent(taxi_mdp.actions, explore="uniform")
rm_agent = RMaxAgent(taxi_mdp.actions)
rand_agent = RandomAgent(taxi_mdp.actions)

run_agents_on_mdp([ql_agent, rm_agent, rand_agent], taxi_mdp)

Running the above code produces (and opens) the following plot in about 1 second on my laptop:

Taxi Results

The learning curves indicate that the agents have not learned within the allotted number of steps/episodes, which is not surprising given the difficult exploration problem posed by Taxi: agents will keep acting randomly until they reach the goal once. Let's increase the number of steps per episode and number of episodes:

Taxi Results

Aha! Learning.
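Since exploration is what makes Taxi hard, here's a small sketch of the two exploration strategies mentioned above, epsilon-greedy and softmax, written over a plain dictionary of Q-values. The function names and signatures are my own for illustration and are not simple_rl's internals. Note that when all Q-values are tied (as they are before any reward has been seen), the greedy choice below breaks ties at random, so the agent is effectively acting randomly.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    actions = sorted(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q_values.values())
    # Break ties among equally good actions at random.
    best_actions = [a for a in actions if q_values[a] == best]
    return rng.choice(best_actions)


def softmax_probs(q_values, temperature=1.0):
    """Boltzmann distribution over actions: higher Q-values get more probability."""
    actions = sorted(q_values)
    max_q = max(q_values.values())  # subtract the max for numerical stability
    weights = [math.exp((q_values[a] - max_q) / temperature) for a in actions]
    total = sum(weights)
    return dict((a, w / total) for a, w in zip(actions, weights))


q = {"north": 1.0, "south": 0.0}
print(epsilon_greedy(q, epsilon=0.0))  # north
print(softmax_probs({"north": 0.0, "south": 0.0}))
```

With epsilon set to 0 the first call is purely greedy, and with tied Q-values the softmax distribution is uniform.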

Example 3: OpenAI Gym (CartPole)

To run experiments in the gym, the setup is almost identical:

# Imports
import simple_rl
from simple_rl.agents import LinearApproxQLearnerAgent, RandomAgent
from simple_rl.tasks import GymMDP
from simple_rl.run_experiments import run_agents_on_mdp

# Gym MDP
gym_mdp = GymMDP(env_name='CartPole-v0')

# Setup agents and run.
lin_agent = LinearApproxQLearnerAgent(gym_mdp.actions)
rand_agent = RandomAgent(gym_mdp.actions)

run_agents_on_mdp([lin_agent, rand_agent], gym_mdp, num_instances=25, num_episodes=1, num_steps=1000)

Unsurprisingly, with such a small amount of data, the agents don't do too well on a relatively challenging problem:

Cartpole Results

Also note that here we set the number of episodes to 1, so the x-axis of the generated plot automatically switches from per-episode to per-step cumulative reward across instances. The num_instances parameter controls how many independent copies of each agent are run in order to compute confidence intervals.
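For the curious, here's a sketch of how a per-step cumulative reward curve with confidence intervals can be computed from the raw per-instance rewards. The helper names are hypothetical, and the interval uses a normal approximation (1.96 standard errors for roughly 95%), which is an assumption on my part and may not match exactly what the library computes.

```python
import math


def cumulative(rewards):
    """Running total of a list of per-step rewards."""
    total, out = 0.0, []
    for r in rewards:
        total += r
        out.append(total)
    return out


def mean_and_conf_interval(samples, z=1.96):
    """Mean and half-width of a normal-approximation confidence interval."""
    n = len(samples)
    mean = sum(samples) / float(n)
    if n < 2:
        return mean, 0.0
    variance = sum((x - mean) ** 2 for x in samples) / float(n - 1)
    return mean, z * math.sqrt(variance / n)


def learning_curve(per_instance_rewards):
    """For each step, the (mean, conf_interval) of cumulative reward across instances."""
    curves = [cumulative(rewards) for rewards in per_instance_rewards]
    return [mean_and_conf_interval(step_values) for step_values in zip(*curves)]


# Two instances, three steps each.
for mean, interval in learning_curve([[1, 1, 1], [1, 0, 1]]):
    print(mean, interval)
```

More instances shrink the interval, which is why the example above uses num_instances=25 even though there is only a single episode.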

Those are the basics! I'm hoping to add an interface to MALMO and other major AI testbeds at some point, too, as well as a few Deep RL agents.

Code Overview

The code of simple_rl basically consists of the following:

Adding an MDP

To add a new MDP, make an MDP subclass with the following components:

That's it! If you want to make an OO-MDP, take a look at the TaxiOOMDP implementation.
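To give a feel for what an object-oriented state looks like, here's a sketch built from the same attribute dictionaries used in the Taxi example above. It is not the library's implementation: make_state, is_goal, and state_key are hypothetical helpers, and the goal condition (every passenger out of the taxi and at their destination) is my assumption about the task.

```python
def make_state(agent, passengers, walls):
    """An OO-MDP style state: object classes mapped to lists of attribute dicts."""
    return {
        "agent": [dict(agent)],
        "passenger": [dict(p) for p in passengers],
        "wall": [dict(w) for w in walls],
    }


def passenger_delivered(passenger):
    """A passenger is done once dropped off at their destination."""
    return (passenger["in_taxi"] == 0
            and passenger["x"] == passenger["dest_x"]
            and passenger["y"] == passenger["dest_y"])


def is_goal(state):
    return all(passenger_delivered(p) for p in state["passenger"])


def state_key(state):
    """Hashable summary of a state, so a tabular agent can index Q-values by it."""
    return tuple(
        (cls, tuple(tuple(sorted(obj.items())) for obj in objs))
        for cls, objs in sorted(state.items())
    )


agent = {"x": 1, "y": 1, "has_passenger": 0}
passengers = [{"x": 4, "y": 3, "dest_x": 2, "dest_y": 2, "in_taxi": 0}]
state = make_state(agent, passengers, walls=[])
print(is_goal(state))  # False
```

The appeal of the object view is that the state is described by what is in the world rather than by an index into an enumerated state list.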

Adding an Agent

To add a new Agent, make an AgentClass subclass with an act method that takes as input a State and a Reward (float) and outputs an action in the MDP's ACTION list. Typically, each agent takes an ACTION list as input and hands it off in a super call. That's it! Check out RandomAgentClass for a simple example.
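As a rough sketch of that structure, here's a toy agent. The Agent base class below is a stand-in I've defined so the example runs on its own; it is not simple_rl's actual base class, and StickyRandomAgent is a made-up agent for illustration.

```python
import random


class Agent(object):
    """Stand-in base class for this sketch (not simple_rl's AgentClass)."""

    def __init__(self, name, actions):
        self.name = name
        self.actions = list(actions)

    def act(self, state, reward):
        raise NotImplementedError


class StickyRandomAgent(Agent):
    """Hypothetical agent: repeats its last action after a positive reward,
    otherwise picks a new action uniformly at random."""

    def __init__(self, actions, seed=None):
        # Hand the action list off to the base class in a super call.
        super(StickyRandomAgent, self).__init__("sticky-random", actions)
        self.rng = random.Random(seed)
        self.prev_action = None

    def act(self, state, reward):
        if self.prev_action is None or reward <= 0:
            self.prev_action = self.rng.choice(self.actions)
        return self.prev_action


agent = StickyRandomAgent(["left", "right"], seed=0)
first = agent.act("s0", 0.0)
print(agent.act("s1", 1.0) == first)  # True
```

The only contract that matters is the act(state, reward) signature: the reward passed in is the one received for the previous action.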

I hope some folks find this useful! Let me know if you have suggestions or come across any bugs.