CSCI2951-F: Learning and Sequential Decision Making
Michael L. Littman (email@example.com, CIT 301)
TA: Hannah Quay-de la Vallee (firstname.lastname@example.org)
design: James MacGlashan (email@example.com, CIT 460)
Time: TTh 3-4:20
Place: Brown CIT 506
Semester: Fall 2013
Office hours: Hannah 4-5pm Tuesdays and by appointment.
Description: Through a combination of classic papers and more
recent work, the course explores automated decision making from a
computer-science perspective. It examines efficient algorithms, where
they exist, for single agent and multiagent planning as well as
approaches to learning near-optimal decisions from experience. Topics
include Markov decision processes, stochastic and repeated games,
partially observable Markov decision processes, and reinforcement
learning. Of particular interest will be issues of generalization,
exploration, and representation. Students will replicate a result in
a published paper in the area. Participants should have taken a
graduate-level machine-learning course and should have had some
exposure to reinforcement learning from a previous computer-science
class or seminar; check with instructor if not sure.
Prerequisites: CSCI 1950F or CSCI 1420 or permission of the instructor.
Online quizzes: There will be one online quiz per class to solidify the concepts.
Result replication presentation: Students will form into small
groups of two to four, and select a relevant paper from the literature.
They will choose a graph in the paper and create an independent
implementation/replication of this result. Students often find that
important parameters needed to replicate the result are not stated in
the paper and that obtaining the same pattern of results is sometimes
not possible. Students will present their work at the end of the
semester. Grades are based on the fidelity of the replication (25%),
how well they show they understand the original paper (25%), the
quality of the presentation itself in terms of clarity and creativity
(25%), and their short written report (25%). The grade on this
project will represent 50% of the final grade in the class.
BURLAP: We will try to use and extend BURLAP, the Brown-UMBC
Reinforcement Learning and Planning system, to learn about the
algorithms in the class and for the result replications.
Grading: Final grade is derived from: Online quizzes (50%),
result replication presentation (50%).
Suggested project papers.
- 9/5 (Rosh Hashana) [quiz
closed]: Please read Chapters 1 and 2
(1996). The space of sequential decision-making problems,
geometric discounting implies no decision reversals. Work out a
concrete example of a reversal.
- 9/10 [quiz closed]: Please read Chapters 1 and 2
(1996). Define MDP. Talk about representations (math, graph,
table, OO). Work out a concrete example of an MDP and its value.
Policies and optimality. Discuss 1/(1-gamma). Bellman's
- 9/12 [quiz closed]: Value
iteration and Bellman's equation. Value of a
policy is the solution to a system of linear equations. Define
contraction mapping and show value iteration contacts.
- 9/17 [quiz closed]: Hoeffding bounds for
epsilon-delta assurance of Monte Carlo. Q-learning with a fixed
policy. Connection to averaging, learning rates, and Bellman
- 9/19 [quiz closed]: Model-based estimate.
TD(lambda). Read Sutton
- 9/24 [quiz closed] (James): VI for policy optimization in BURLAP.
- 9/26 [quiz closed]: Convergence proof of VI.
and Szepesvári (1996). Q-learning. The absolute difference between the
kth order statistic of two lists is less than the pairwise absolute
differences between the lists.
- 10/1 : Generalized Q-learning
convergence. Policy iteration. Each iteration of policy
iteration results in an improvement, policy iteration
converges in finite time.
- 10/3 [quiz closed]: Linear programming. Linear programming can solve MDPs in polynomial
time. MDP dual LP. Zero-sum games.
- 10/8 [quiz closed]: General sum
games, Nash equilibria. Nash equilibria in
and Stone (2003).
- 10/10 [quiz closed]: Stochastic games. Minimax-Q.
Grid games. Read:
Greenwald and Hall (2003),
Munoz de Cote and
- 10/15 [quiz closed, EC planned]: Exploration, bandits,
- 10/17 [quiz closed]: KWIK learning and efficient
Littman, Walsh (2008).
- 10/22 [EC planned]:
- 10/24 : HSVI.
Smith, Simmons (2005).
- 10/29[EC planned]: Shaping.
Ng, Harada, Russell (1999),
Littman, Zinkov (2008).
- 10/31: Project discussion, break early for Artzi talk.
- 11/5: Options. Sutton,
Precup, Singh (1999),
Hester, Stone (2008) (including
- 11/7: Generalization.
Gordon (1995): Introduces averagers and "hill car the hard
(1995) provides a counter example for more general
- 11/12 : Policy gradient.
Baxter and Bartlett (1999).
- 11/14 [EC planned]: LSPI.
Lagoudakis and Parr (2001).
- 11/19: UCT.
Kocsis and Szepesvári (2006).
- 11/21 : Brad Knox speaks in class on
training RL agents. In 368.
- 11/26 [EC planned]: Bayesian RL.
Poupart, Vlassis, Hoey, and Regan 2006).
- 11/28 (Thanksgiving): No class!
- 12/3: Memory-based RL.
Smart and Kaelbling (2000)
- 12/5: Inverse reinforcement learning. Zeibart
et al. (2008). Babes et al. (2011).
- 12/10 Project presentations!
- 12/12 Project presentations!