CSCI2951-F: Learning and Sequential Decision Making
Michael L. Littman
Time: MW 3-4:20
Place: Brown CIT 506
Semester: Fall 2012
Michael's office hours: CIT 301, by appointment
Description: Through a combination of classic papers and more
recent work, the course explores automated decision making from a
computer-science perspective. It examines efficient algorithms, where
they exist, for single agent and multiagent planning as well as
approaches to learning near-optimal decisions from experience. Topics
will include Markov decision processes, stochastic and repeated games,
partially observable Markov decision processes, and reinforcement
learning. Of particular interest will be issues of generalization,
exploration, and representation. Depending upon enrollment, each
student may be expected to present a published research paper and will
participate in a group project to create a reinforcement-learning
system for a video game. Participants should have taken a
graduate-level computer science course and should have some exposure
to reinforcement learning from a previous computer-science class or
seminar; check with instructor if not sure.
- 9/5: Please read Chapters 1 and 2 of
Littman (1996). We talked about the space of sequential decision-making
problems. Results: geometric discounting implies no decision reversals.
- 9/10 [quiz open]: Please read Chapters 1 and 2 of
Littman (1996). Value iteration and Bellman's equation. Results: the
value of a policy is the solution to a system of linear equations, the
best policy is stationary and deterministic, and the sum of discounts is 1/(1-gamma).
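The 9/10 results can be checked concretely. Below is a minimal value-iteration sketch on a hypothetical two-state, two-action MDP (all transition and reward numbers are invented for illustration, not from the readings): iterating Bellman's equation converges to the optimal values, and a state whose best action pays reward 2 forever is worth 2/(1-gamma).

```python
# Value iteration on a hypothetical 2-state, 2-action MDP (numbers invented).
# T[s][a] is a list of (next_state, probability); R[s][a] is the reward.
GAMMA = 0.9
T = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

def backup(V):
    """One application of Bellman's optimality equation."""
    return {s: max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])
                   for a in T[s])
            for s in T}

def value_iteration(tol=1e-10):
    V = {s: 0.0 for s in T}
    while True:
        V2 = backup(V)
        if max(abs(V2[s] - V[s]) for s in T) < tol:
            return V2
        V = V2

V = value_iteration()
# Action 1 in state 1 pays 2 forever, so V(1) = 2/(1 - 0.9) = 20,
# and V(0) = 1 + 0.9 * 20 = 19.
```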
- 9/12: We discussed Q-learning and the model-based/model-free distinction. We broke early for
Dayan's talk at 4pm. Results: linear programming can solve MDPs
in polynomial time.
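For contrast with the model-based view, here is a minimal tabular Q-learning sketch (environment, step size, and exploration rate are all made up for illustration). The agent never touches a transition model; it learns from sampled transitions only:

```python
import random

random.seed(0)

# Hypothetical 2-state environment (numbers invented): the agent is
# model-free -- it only sees sampled rewards and next states.
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1

def step(s, a):
    """Return (reward, next_state) for one sampled transition."""
    if s == 0:
        return (0.0, 0) if a == 0 else (1.0, 1)
    return (0.0, random.choice([0, 1])) if a == 0 else (2.0, 1)

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
s = 0
for _ in range(20000):
    if random.random() < EPS:
        a = random.choice([0, 1])                   # explore
    else:
        a = max((0, 1), key=lambda act: Q[s, act])  # exploit
    r, s2 = step(s, a)
    # Move Q(s, a) toward the sampled Bellman target.
    Q[s, a] += ALPHA * (r + GAMMA * max(Q[s2, 0], Q[s2, 1]) - Q[s, a])
    s = s2
```

After many steps, Q(1, 1) approaches the optimal value 2/(1 - 0.9) = 20 from the model-based calculation, without ever estimating the model.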
- 9/17 [quiz closed] (Rosh Hashanah): Policy iteration, convergence proof of
VI, start on TD. Results: value iteration is a contraction mapping,
each iteration of policy iteration results in an improvement, policy
iteration converges in finite time, and the absolute difference between the kth
order statistics of two lists is at most the maximum pairwise absolute
difference between the lists.
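The contraction claim can be spot-checked numerically: one Bellman backup brings any two value functions at least a factor of gamma closer in max norm. A toy check, with all MDP numbers invented for illustration:

```python
# Numerical check of the contraction property of the Bellman backup:
# ||BV - BU||_inf <= gamma * ||V - U||_inf for any value functions V, U.
GAMMA = 0.9
T = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}

def backup(V):
    """One Bellman optimality backup."""
    return {s: max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in T[s][a])
                   for a in T[s])
            for s in T}

def dist(V, U):
    """Max-norm distance between two value functions."""
    return max(abs(V[s] - U[s]) for s in V)

# Two arbitrary value functions, max-norm distance 15.
V = {0: 5.0, 1: -3.0}
U = {0: 0.0, 1: 12.0}
```

Applying `backup` once to both must shrink their distance to at most 0.9 * 15 = 13.5, which is what drives value iteration's geometric convergence.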
- 9/19 [quiz closed]: Class in CIT 219 from now on (except where marked). TD(lambda).
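As a reminder of the mechanics, here is a small TD(lambda) sketch with accumulating eligibility traces, evaluating the uniform-random policy on a hypothetical five-state random walk (all constants are illustrative choices, not from the readings):

```python
import random

random.seed(1)

# TD(lambda) with accumulating traces on a hypothetical 5-state random walk:
# falling off the left end pays 0, off the right end pays 1.
# The true state values are (i + 1) / 6 for states i = 0..4.
N, GAMMA, LAM, ALPHA = 5, 1.0, 0.8, 0.05
V = [0.0] * N
for _ in range(5000):
    e = [0.0] * N                 # eligibility traces, reset per episode
    s = N // 2
    while True:
        s2 = s + random.choice([-1, 1])
        r = 1.0 if s2 == N else 0.0
        v_next = V[s2] if 0 <= s2 < N else 0.0
        delta = r + GAMMA * v_next - V[s]
        e[s] += 1.0               # accumulate, don't replace
        for i in range(N):
            V[i] += ALPHA * delta * e[i]
            e[i] *= GAMMA * LAM   # decay credit for older states
        if not 0 <= s2 < N:
            break
        s = s2
```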
- 9/24 [quiz closed]: Generalized Q-learning convergence. Read Littman
and Szepesvári (1996).
- 9/26 [quiz closed] (Yom Kippur): Zero-sum games.
- 10/1 [quiz closed] (Michael: no jury duty): Class in CIT 506. General-sum
games, Nash equilibria, and grid games. Read:
Greenwald and Hall (2003).
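As a tiny illustration of the equilibrium idea (using the standard Prisoner's Dilemma payoffs, not a game from the readings), pure-strategy Nash equilibria of a bimatrix game can be found by checking mutual best responses:

```python
# Enumerate pure-strategy Nash equilibria of a 2x2 bimatrix game by
# checking that each player's action is a best response to the other's.
# Payoffs are the Prisoner's Dilemma: actions 0 = cooperate, 1 = defect.
ROW = [[3, 0], [5, 1]]   # row player's payoff ROW[i][j]
COL = [[3, 5], [0, 1]]   # column player's payoff COL[i][j]

def pure_nash(row, col):
    """Return all (i, j) where neither player can gain by deviating."""
    eqs = []
    for i in range(2):
        for j in range(2):
            if (row[i][j] >= max(row[k][j] for k in range(2)) and
                    col[i][j] >= max(col[i][k] for k in range(2))):
                eqs.append((i, j))
    return eqs
```

For these payoffs, mutual defection (1, 1) is the unique pure equilibrium even though mutual cooperation pays both players more.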
- 10/3 [quiz closed] (Michael: no jury duty): Back in CIT 219. Nash equilibria in
repeated games. Read Littman and Stone (2003)
and Munoz de Cote and Littman (2008).
- 10/8 (Brown 3-day weekend): No class.
- 10/10 [quiz closed]: Exploration and bandits.
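A minimal bandit-exploration sketch using the standard UCB1 index (the arm means below are made up): arms with wide confidence bonuses get tried, but pulls concentrate on the best arm.

```python
import math
import random

random.seed(0)

# UCB1 on a hypothetical 3-armed Bernoulli bandit (arm means invented).
MEANS = [0.2, 0.5, 0.8]
counts = [0] * 3          # pulls per arm
totals = [0.0] * 3        # total reward per arm
for t in range(1, 5001):
    if 0 in counts:
        a = counts.index(0)   # pull each arm once first
    else:
        # Empirical mean plus an exploration bonus that shrinks with pulls.
        a = max(range(3), key=lambda i: totals[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]))
    reward = 1.0 if random.random() < MEANS[a] else 0.0
    counts[a] += 1
    totals[a] += reward
```

After 5000 pulls, the highest-mean arm dominates the pull counts while the others were still sampled enough to rule them out.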
- 10/15 [quiz closed]: KWIK learning and efficient RL.
Li, Littman, Walsh (2008).
- 10/17 [quiz closed]: POMDPs. Littman (2009).
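The core POMDP operation is the Bayesian belief update, b'(s') proportional to O(s', o) times the sum over s of T(s, a, s') b(s). A sketch on a hypothetical two-state problem (all probabilities invented):

```python
# Bayesian belief update for a hypothetical 2-state POMDP (numbers invented).
# For the single action considered, T[s][s2] = Pr(s2 | s) and
# O[s2] = Pr(observation o | landing in s2).
T = [[0.9, 0.1], [0.2, 0.8]]
O = [0.7, 0.3]

def belief_update(b):
    """Predict through T, weight by O, renormalize."""
    unnorm = [O[s2] * sum(T[s][s2] * b[s] for s in range(2))
              for s2 in range(2)]
    z = sum(unnorm)   # probability of the observation
    return [x / z for x in unnorm]

b2 = belief_update([0.5, 0.5])
# Unnormalized: 0.7 * 0.55 = 0.385 and 0.3 * 0.45 = 0.135, so the
# observation shifts belief toward state 0.
```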
- 10/22 [quiz closed]: PSRs.
Littman, Sutton, and Singh (2002).
- 10/24 [quiz closed]: Shaping.
Ng, Harada, Russell (1999).
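The key fact in the Ng, Harada, and Russell paper is that potential-based shaping rewards, F(s, s') = gamma*phi(s') - phi(s), telescope: they change every trajectory's discounted return by a policy-independent constant, which is why optimal policies are preserved. A numeric check on a made-up trajectory and potential function:

```python
# Potential-based shaping: adding F(s, s') = gamma*phi(s') - phi(s) to each
# reward shifts any trajectory's discounted return by a constant that
# depends only on the start and end states, not on the policy.
GAMMA = 0.9
states = [0, 1, 2, 1, 3]                  # hypothetical visited states
rewards = [1.0, -0.5, 2.0, 0.0]           # reward for each transition
phi = {0: 5.0, 1: -2.0, 2: 0.5, 3: 1.0}   # arbitrary potential function

raw = sum(GAMMA**t * r for t, r in enumerate(rewards))
shaped = sum(GAMMA**t * (r + GAMMA * phi[states[t + 1]] - phi[states[t]])
             for t, r in enumerate(rewards))
# Telescoping: shaped - raw == gamma^T * phi(s_T) - phi(s_0), where T = 4.
```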
- 10/29 (Hurricane Sandy): No class!
- 10/31 (Halloween): In-class time for project groups to organize.
- 11/5: Options. (Guest speaker: George.) Read Sutton,
Precup, Singh (1999).
- 11/7: Generalization.
Gordon (1995): Introduces averagers and "hill-car the hard way." Baird
(1995) provides a counterexample for more general function approximation.
- 11/12 [quiz open]: Policy gradient.
Baxter and Bartlett (1999). Some discussion of Pegasus.
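A minimal REINFORCE-style policy-gradient sketch (not the Baxter-Bartlett algorithm itself): a logistic policy over two bandit arms, updated along reward-weighted score-function gradients. All constants and arm means are made up for illustration:

```python
import math
import random

random.seed(0)

# Policy-gradient sketch on a hypothetical 2-armed Bernoulli bandit.
# The policy is logistic in one preference parameter theta; the update is
# the score-function (REINFORCE) gradient of expected reward.
MEANS = [0.2, 0.8]   # invented arm means
theta = 0.0          # preference for arm 1
ALPHA = 0.1
for _ in range(3000):
    p1 = 1.0 / (1.0 + math.exp(-theta))   # Pr(pull arm 1)
    a = 1 if random.random() < p1 else 0
    r = 1.0 if random.random() < MEANS[a] else 0.0
    # For this logistic parameterization, d/dtheta log pi(a) = a - p1.
    theta += ALPHA * r * (a - p1)
```

Since arm 1 pays more on average, stochastic gradient ascent pushes theta up and the policy shifts toward arm 1 without ever forming a value function.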
- 11/14: LSPI.
Lagoudakis and Parr (2001).
- 11/19: UCT.
Kocsis and Szepesvári (2006).
- 11/21 (Thanksgiving): No class!
- 11/26 [quiz open]: Bayesian RL.
Poupart, Vlassis, Hoey, and Regan (2006).
- 11/28: Memory-based RL.
Smart and Kaelbling (2000)
- 12/3: Inverse reinforcement learning. Ziebart
et al. (2008).
- 12/5: viewer's choice (G-TD?)
- 12/10 (reading period, department holiday party): No class!
- 12/11 Project presentations!
- 12/12 Project presentations!
Suggested Project Papers
- and Maskin (2004): Talks about hyperbolic discounting. No
algorithm to replicate, but a project could be to implement the
discounting rule and then to show an MDP environment that exhibits
interestingly different behavior.
- et al. (2010): Does "level k" reasoning to act in a multiagent
environment where each level has a best response computed by RL.
Could be interesting applied to a collection of grid games I have.
- Singh and Sutton (1996): Compares TD with replacing traces and
TD with accumulating traces. Could be interesting to do both and
then compare to a model-based approach.
- Ng, Harada, Russell (1999): Introduces potential-based
shaping. Show the impact on learning time. Could compare to
model-based approaches, as in Asmuth et al. (2008).
- and Simmons (1993): These papers do comparisons of
competing exploration algorithms.
- Read Kearns and Singh (1998): This paper generalizes efficient
learning to problems with stochastic transitions. I don't know if
E3 has ever been implemented.
- (2004), Munoz de Cote
and Littman (2008): Shows how to compute a Nash equilibrium
for repeated Markov games. I have a set of grid games that it would
be interesting to run the algorithm on.
- Greenwald and Hall (2003): Introduces correlated-Q learning.
Zinkevich et al. (2005) show some games that are ill-behaved.
- Cassandra, Kaelbling, Littman (1994) and
Cassandra, Littman, Zhang (1997): Algorithms for POMDPs.
- et al. (2010): Presents classes of 2x2 games that induce
different behavior from Q-learning. Implement "spoiled child" and
see how IQL differs from human play.
- (2006): Introduces a variant of MDPs that can be solved via
- et al. (2011): Supposedly a paper that shows how to set lambda
in TD(lambda).
- and Barto (2012): Supposedly a paper that shows how to set alpha
in TD(lambda).
- Kolter and Ng (2009): Feature selection in LSTD.
- Silver and Veness (2010): Using techniques from Go to attack large POMDPs.
- Sutton et al. (2000): Policy gradient paper.
- Silver, Sutton, and Mueller (2008).
- Chaslot, Winands, Herik, Uiterwijk, and Bouzy (2008).
Topics and Papers
The RL survey referred to below is
Kaelbling, Littman, Moore (1996).
- Markov decision processes and algorithms.
Survey, Sections 1 and 3.
Littman, Dean, Kaelbling (1995).
Survey, Section 4.1.
Survey, remainder of Section 4, Section 5.
Survey, Section 2.
- Repeated Games.
Hart and Mas-Colell (2000).
Greenwald and Jafari (2003).
- Generalization and convergence.
Survey, Sections 6.1, 6.2.
- Partially observable environments.
Survey, Section 7.
- RL in POMDPs.
Loch and Singh (1998).
Survey, remainder of Section 6.
- Policy search.
- Non-stationary environments.
- Instance-based RL.
Ormoneit and Sen (1999).
Survey, Sections 8 and 9.
Crites and Barto (1996).
Other Topics I'd Love To Talk About
UCT and Go. Recent Alberta work on function approximation. Bayesian
RL. Natural policy gradient. RL in Neuroscience. Unlearning in
SARSA(0) in Tetris. Ramon et al.