Fall 2015

Michael L. Littman (mlittman@cs.brown.edu, CIT 301)

BURLAP assistance: James MacGlashan (jmacglashan@cs.brown.edu)

__Time__: TTh 2:30-3:50

__Place__: Brown CIT 368

__Semester__: Fall 2015

**Office hours**: By appointment.

**Prerequisites**: CSCI 1950F or CSCI 1420 or permission of the instructor.

**Online quizzes**: There will be one online quiz per class to solidify the concepts.

**Result replication presentation**: Students will form small groups
of two to four and select a relevant paper from the literature.
They will choose a graph in the paper and create an independent
implementation/replication of this result. Students often find that
important parameters needed to replicate the result are not stated in
the paper and that obtaining the same pattern of results is sometimes
not possible. Students will present their work at the end of the
semester. Grades are based on the fidelity of the replication (25%),
how well they show they understand the original paper (25%), the
quality of the presentation itself in terms of clarity and creativity
(25%), and their short written report (25%). The grade on this
project will represent 50% of the final grade in the class.

**BURLAP**: We will use BURLAP, the Brown-UMBC Reinforcement
Learning and Planning Java library, to learn about the algorithms in
the class and for the result replications.

**Grading**: Final grade is derived from: Online quizzes (50%),
result replication presentation (50%).

- Please read Chapters 1 and 2 of Littman (1996). The space of sequential decision-making problems; geometric discounting implies no decision reversals. Work out a concrete example of a reversal.
- Define MDP. Talk about representations (math, graph, table, OO). Work out a concrete example of an MDP and its value. Policies and optimality. Discuss 1/(1-gamma). Bellman's equation. BURLAP.
- Value iteration and Bellman's equation. The value of a policy is the solution to a system of linear equations. Define contraction mappings and show that value iteration contracts.
- Hoeffding bounds for epsilon-delta assurance of Monte Carlo. Q-learning with a fixed policy. Connection to averaging, learning rates, and Bellman equation.
- Model-based estimate. TD(lambda). Read Sutton (1988).
- VI for policy optimization in BURLAP.
- Convergence proof of VI. Read Littman and Szepesvári (1996). Q-learning. The absolute difference between the kth order statistics of two lists is at most the maximum pairwise absolute difference between the lists.
- Generalized Q-learning convergence. Policy iteration. Each iteration of policy iteration results in an improvement, so policy iteration converges in finite time.
- Linear programming. Linear programming can solve MDPs in polynomial time. MDP dual LP. Zero-sum games. Read Littman (1994).
- General sum games, Nash equilibria. Nash equilibria in repeated games. Read: Littman and Stone (2003).
- Stochastic games. Minimax-Q. Grid games. Read: Greenwald and Hall (2003), Munoz de Cote and Littman (2008).
- Exploration, bandits, algorithms. Fong (1995).
- KWIK learning and efficient RL. Li, Littman, Walsh (2008).
- POMDPs. Littman (2009).
- HSVI. Smith, Simmons (2005).
- Shaping. Ng, Harada, Russell (1999), Asmuth, Littman, Zinkov (2008).
- Project discussion.
- Options. Sutton, Precup, Singh (1999), Jong, Hester, Stone (2008) (including slides).
- Generalization. Gordon (1995): introduces averagers and "hill car the hard way". Baird (1995) provides a counterexample for more general convergence.
- Policy gradient. Baxter and Bartlett (1999).
- LSPI. Lagoudakis and Parr (2001).
- UCT. Kocsis and Szepesvári (2006).
- Training RL agents.
- Bayesian RL. Poupart, Vlassis, Hoey, and Regan (2006).
- Memory-based RL. Smart and Kaelbling (2000).
- Inverse reinforcement learning. Ziebart et al. (2008), Babes et al. (2011).
- Action priors.
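To make the early topics concrete, here is a minimal sketch of the "value of a policy is the solution to a system of linear equations" point, in plain Python with NumPy rather than BURLAP. The two-state MDP, its rewards, and the fixed policy are all invented for illustration:

```python
import numpy as np

# Policy evaluation as a linear system: under a fixed policy, the Bellman
# equation V = r + gamma * P V is linear, so (I - gamma P) V = r.
# Hypothetical two-state chain; the policy always takes the "continue" action.
gamma = 0.9
P = np.array([[0.0, 1.0],    # under the policy, state 0 -> state 1
              [0.0, 1.0]])   # state 1 is absorbing
r = np.array([0.0, 2.0])     # expected one-step reward under the policy

V = np.linalg.solve(np.eye(2) - gamma * P, r)
# Closed form for this chain: V[1] = 2/(1-gamma) = 20, V[0] = gamma * V[1] = 18,
# illustrating the 1/(1-gamma) horizon factor discussed in lecture.
```

This also shows why evaluating a fixed policy never needs iteration in the tabular case: one linear solve suffices.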
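For the value iteration / contraction mapping lectures, a minimal sketch of VI on the same hypothetical two-state MDP (again plain Python, not BURLAP; all numbers are illustrative). The sup-norm stopping test works because the Bellman backup is a gamma-contraction:

```python
import numpy as np

# Hypothetical deterministic MDP: in state 0, action 0 ("safe") gives reward 1
# and stays; action 1 ("risky") gives reward 0 and moves to state 1.
# State 1 is absorbing with reward 2 under both actions.
gamma = 0.9
T = np.array([[0, 1], [1, 1]])       # T[s, a] = next state
R = np.array([[1.0, 0.0], [2.0, 2.0]])  # R[s, a] = reward

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * V[T]             # Bellman backup: Q(s,a) = R(s,a) + gamma V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # contraction => geometric convergence
        V = V_new
        break
    V = V_new
# Fixed point: V*(1) = 2/(1-gamma) = 20, and "risky" is optimal in state 0,
# so V*(0) = 0 + gamma * 20 = 18 (beats 1 + gamma * 18 = 17.2 for "safe").
```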
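For the Hoeffding-bound lecture, a sketch of the epsilon-delta sample-size calculation for Monte Carlo estimation of a [0,1]-bounded mean; the Bernoulli(0.3) target distribution and the parameter values are made up for illustration:

```python
import math
import random

# Hoeffding: for n i.i.d. samples in [0, 1], taking
#   n >= ln(2/delta) / (2 * eps^2)
# gives |empirical mean - true mean| <= eps with probability >= 1 - delta.
eps, delta = 0.05, 0.05
n_samples = math.ceil(math.log(2 / delta) / (2 * eps ** 2))  # 738 here

# Monte Carlo estimate of the mean of a Bernoulli(0.3) "return".
rng = random.Random(0)
estimate = sum(rng.random() < 0.3 for _ in range(n_samples)) / n_samples
```

Note the bound is distribution-free: it depends only on the range of the samples, which is why it applies directly to bounded per-state returns in RL.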
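Finally, for the Q-learning lectures, a sketch of the tabular TD update on the same invented two-state MDP used above (plain Python, not BURLAP; the learning rate, exploration rate, and reset probability are arbitrary choices):

```python
import random

# Tabular Q-learning with epsilon-greedy exploration.
gamma, alpha, eps = 0.9, 0.1, 0.2
T = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}      # next state
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 2.0}  # reward

rng = random.Random(0)
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
s = 0
for _ in range(50000):
    if rng.random() < eps:
        a = rng.choice([0, 1])
    else:
        a = max((0, 1), key=lambda act: Q[(s, act)])
    s2, rew = T[(s, a)], R[(s, a)]
    # TD update toward the sampled Bellman target
    target = rew + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s2
    if rng.random() < 0.05:  # occasional reset so state 0 keeps being visited
        s = 0
# The learned values approach the VI fixed point: Q(0,1) -> 18 > Q(0,0) -> 17.2,
# so the greedy policy recovers the optimal "risky" action in state 0.
```

This connects the Q-learning update to the averaging view from lecture: with a deterministic MDP, each update is a convex combination of the old estimate and the Bellman target.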