CSCI2951-F: Learning and Sequential Decision Making
Brown University
Fall 2012
Michael L. Littman
Time: MW 3-4:20
Place: Brown CIT 506
Semester: Fall 2012
Michael's office hours: CIT 301, by appointment
(mlittman@cs.brown.edu).
Description: Through a combination of classic papers and more
recent work, the course explores automated decision making from a
computer-science perspective. It examines efficient algorithms, where
they exist, for single-agent and multiagent planning as well as
approaches to learning near-optimal decisions from experience. Topics
will include Markov decision processes, stochastic and repeated games,
partially observable Markov decision processes, and reinforcement
learning. Of particular interest will be issues of generalization,
exploration, and representation. Depending upon enrollment, each
student may be expected to present a published research paper and will
participate in a group project to create a reinforcement-learning
system for a video game. Participants should have taken a
graduate-level computer science course and should have some exposure
to reinforcement learning from a previous computer-science class or
seminar; check with the instructor if you are not sure.
Calendar
 9/5: Please read Chapters 1 and 2 of Littman (1996). We talked about
the space of sequential decision making problems. Results: geometric
discounting implies no decision reversals.
 9/10 [quiz open]: Please read Chapters 1 and 2 of Littman (1996).
Value iteration and Bellman's equation. Results: value of a policy is
the solution to a system of linear equations, best policy is stationary
and deterministic, sum of discounts is 1/(1-gamma).
 9/12: We discussed Q-learning and the model-based/model-free
distinction. We broke early and attended Peter Dayan's talk at 4pm.
Results: linear programming can solve MDPs in polynomial time.
 9/17 [quiz closed] (Rosh Hashana): Policy iteration, convergence proof of
VI, start on TD. Results: value iteration is a contraction mapping,
each iteration of policy iteration results in an improvement, policy
iteration converges in finite time, the absolute difference between the
kth order statistics of two lists is at most the largest pairwise
absolute difference between the lists.
 9/19 [quiz closed]: Class in CIT 219 from now on (except where marked).
TD(lambda). Read Sutton (1988).
 9/24 [quiz closed]: Generalized Q-learning convergence.
Read Littman and Szepesvári (1996).
 9/26 [quiz closed] (Yom Kippur): Zero-sum games.
Read Littman (1994).
 10/1 [quiz closed] (Michael no jury duty): Class in CIT 506. General-sum
games, Nash equilibria and grid games.
Read: Greenwald and Hall (2003).
 10/3 [quiz closed] (Michael no jury duty): Back in 219. Nash equilibria in
repeated games.
Read: Littman and Stone (2003) and Munoz de Cote and Littman (2008).
 10/8 (Brown 3-day weekend): No class.
 10/10 [quiz closed]: Exploration, bandits, algorithms.
Fong (1995).
 10/15 [quiz closed]: KWIK learning and efficient RL.
Li, Littman, Walsh (2008).
 10/17 [quiz closed]: POMDPs. Littman (2009).
 10/22 [quiz closed]: PSRs.
Littman, Sutton and Singh (2002)
 10/24 [quiz closed]: Shaping.
Ng, Harada, Russell (1999)
 10/29 (Hurricane Sandy): No class!
 10/31 (Halloween): In-class time for project groups to organize.
 11/5: Options. (Guest speaker: George Konidaris)
Sutton, Precup, Singh (1999).
 11/7: Generalization.
Gordon (1995): Introduces averagers and "hill car the hard way".
Baird (1995) provides a counterexample for more general convergence.
 11/12 [quiz open]: Policy gradient.
Baxter and Bartlett (1999). Some discussion of Pegasus.
 11/14: LSPI.
Lagoudakis and Parr (2001).
 11/19: UCT.
Kocsis and Szepesvári (2006).
 11/21 (Thanksgiving): No class!
 11/26 [quiz open]: Bayesian RL.
Poupart, Vlassis, Hoey, and Regan (2006).
 11/28: Memory-based RL.
Smart and Kaelbling (2000)
 12/3: Inverse reinforcement learning.
Ziebart et al. (2008).
 12/5: viewer's choice (GTD?)
 12/10 (reading period, department holiday party): No class!
 12/11 Project presentations!
 12/12 Project presentations!
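
Several of the results listed in the calendar above (value iteration, the contraction property, and the 1/(1-gamma) sum of discounts) can be illustrated with a minimal sketch. The two-state MDP, its state and action names, and the helper functions are all invented here for illustration; none of it is course material.

```python
# Hypothetical two-state MDP, invented purely for illustration.
# P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the reward.
GAMMA = 0.9

P = {
    0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]},
}
R = {
    0: {"stay": 0.0, "go": 0.0},
    1: {"stay": 1.0, "go": 0.0},
}


def q_value(V, s, a):
    """One-step lookahead: immediate reward plus discounted expected value."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def bellman_backup(V):
    """Apply one Bellman optimality backup to the value function V."""
    return {s: max(q_value(V, s, a) for a in P[s]) for s in P}


def max_norm_diff(V1, V2):
    """Max-norm distance between two value functions."""
    return max(abs(V1[s] - V2[s]) for s in V1)


def value_iteration(tol=1e-10, max_iters=10_000):
    """Repeat Bellman backups until successive iterates differ by less than tol."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        V_next = bellman_backup(V)
        if max_norm_diff(V, V_next) < tol:
            return V_next
        V = V_next
    return V


def greedy_policy(V):
    """Stationary deterministic policy that is greedy with respect to V."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


if __name__ == "__main__":
    V = value_iteration()
    print(V, greedy_policy(V))
```

In this toy problem, staying in state 1 earns reward 1 on every step, so its value approaches the sum of discounts 1/(1-gamma) = 10. Applying `bellman_backup` to any two value functions shrinks their max-norm distance by at least a factor of gamma, which is the contraction property behind the convergence proof.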
Suggested Project Papers
 Dasgupta and Maskin (2004): Talks about hyperbolic discounting. No
algorithm to replicate, but a project could be to implement the
discounting rule and then to show an MDP environment that exhibits
interestingly different behavior.
 Lee et al. (2010): Does "level-k" reasoning to act in a multiagent
environment where each level has a best response computed by RL.
Could be interesting applied to a collection of grid games I
created.

Singh and Sutton (1996): Compares TD with replacing traces and
TD with accumulating traces. Could be interesting to do both and
then compare to a model-based approach.

Wiewiora (2003), Ng, Harada, Russell (1999): Introduces potential-based
shaping. Show the impact on learning time. Could compare to
model-based approaches as in Asmuth et al. (2008).

Fong (1995), Koenig and Simmons (1993): These papers compare
competing exploration algorithms.
 Read Kearns and Singh (1998): This paper generalizes efficient
learning to problems with stochastic transitions. I don't know if E3
has been implemented.
 Littman and Stone (2004), Munoz de Cote and Littman (2008): Shows how
to compute a Nash equilibrium for repeated Markov games. I have a set
of grid games that would be interesting to run the algorithm on.


Greenwald and Hall (2003): Introduce CEVI.
Zinkevich et al. (2005) show some games that are ill-behaved.

Cassandra, Kaelbling, Littman (1994),
Cassandra, Littman, Zhang (1997): Algorithms for POMDPs.
 Wunder et al. (2010): Presents classes of 2x2 games that induce
different behavior from Q-learning. Implement "spoiled child" and
see how IQL differs from human play.
 Todorov (2006): Introduces a variant of MDPs that can be solved via
linear equations!
 Konidaris et al. (2011):
Supposedly a paper that shows how to set lambda in TD(lambda).
 Dabney and Barto (2012):
Supposedly a paper that shows how to set alpha in TD(lambda).
 Kolter and Ng (2009): Feature selection in LSTD.
 Silver and Veness (2010): Using techniques from Go to attack large POMDPs.
 Sutton et al. (2000): Policy gradient paper.
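
For the hyperbolic-discounting project idea above, a minimal sketch of a starting point might compare how geometric and hyperbolic discounting rank a smaller, sooner reward against a larger, later one as both recede into the future. The hyperbolic form 1/(1 + k*delay) used here is the common textbook form and is an assumption; it may not match the exact rule in Dasgupta and Maskin (2004). The reward sizes, the gap, and all function names are likewise made up for illustration.

```python
def geometric(delay, gamma=0.9):
    """Geometric (exponential) discount factor for a reward `delay` steps away."""
    return gamma ** delay


def hyperbolic(delay, k=1.0):
    """Assumed hyperbolic discount factor 1 / (1 + k * delay)."""
    return 1.0 / (1.0 + k * delay)


def prefers_larger_later(discount, wait, small=10.0, large=15.0, gap=5):
    """True if `large` at time wait + gap is valued above `small` at time wait."""
    return large * discount(wait + gap) > small * discount(wait)


def preference_profile(discount, horizon=30):
    """Preference between the two rewards as both are pushed further away."""
    return [prefers_larger_later(discount, wait) for wait in range(horizon)]


if __name__ == "__main__":
    print("geometric: ", preference_profile(geometric))
    print("hyperbolic:", preference_profile(hyperbolic))
```

Under geometric discounting the comparison reduces to `small` versus `large * gamma**gap`, which does not depend on `wait`, so the preference never flips. With the assumed hyperbolic form the agent takes the smaller reward when it is immediate but prefers the larger one when both are far off, which is the kind of decision reversal an MDP environment for the project could be built to expose.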
Upcoming Papers/Topics
Other Papers
Sutton (1990)
Silver, Sutton, and Mueller (2008).
Optional:
Chaslot, Winands, Herik, Uiterwijk, and Bouzy (2008)
Topics and Papers
The RL survey referred to below is
Kaelbling, Littman, Moore (1996).
 Markov decision processes and algorithms.
Survey, Sections 1 and 3.
Littman, Dean, Kaelbling (1995).
 TD(lambda).
Survey, Section 4.1.
 Q-learning/Convergence.
Survey, remainder of Section 4, Section 5.
 Exploration.
Survey, Section 2.
 Repeated Games.
Hart and Mas-Colell (2000).
Greenwald and Jafari (2003).
 Generalization and convergence.
Survey, Sections 6.1, 6.2.
Baird (1995).
Gordon (1995).
 Partially observable environments.
Survey, Section 7.
 RL in POMDPs.
Chrisman (1992).
Loch and Singh (1998).
 Hierarchy.
Survey, remainder of Section 6.
Dietterich (1998).
 Policy search.
 Nonstationary environments.
 Instance-based RL.
Ormoneit and Sen (1999).
 Applications.
Survey, Sections 8 and 9.
Crites and Barto (1996),
Tesauro (1992).
Other Topics I'd Love To Talk About
UCT and Go. Recent Alberta work on function approximation. Bayesian
RL. Natural policy gradient. RL in Neuroscience. Unlearning in
SARSA(0) in Tetris. Ramon et al.
RL Links
The URL for this page is
http://www.cs.rutgers.edu/~mlittman/courses/seq09/.