Fall 2017

Prof. Michael L. Littman (mlittman@cs.brown.edu, CIT 301)

__Time__: TTh 10:30-11:50

__Place__: Brown CIT 316 (subject to change)

__Semester__: Fall 2017

__Web page__: http://cs.brown.edu/courses/cs2951f/

Office hours: By appointment.

**Course goals**: Students should understand the main concepts of
interest to the field of reinforcement learning, be able to implement
standard algorithms and understand how to apply them to relevant
problems, and be prepared to be able to contribute new results to the
field.

**Prerequisites**: CSCI 1950F or CSCI 1420 or permission of the instructor.

**Flipped structure**: Students need to watch the lectures recorded
by the professor and
available online.
Class time will be used for problem solving and discussion.

**Result replication presentation**: Students will form into small
groups of two to four, and select a relevant paper from the literature.
They will choose a graph in the paper and create an independent
implementation/replication of this result. Students often find that
important parameters needed to replicate the result are not stated in
the paper and that obtaining the same pattern of results is sometimes
not possible. Students will present their work at the end of the
semester. Grades are based on the fidelity of the replication (25%),
how well they show they understand the original paper (25%), the
quality of the presentation itself in terms of clarity and creativity
(25%), and their short written report (25%). The grade on this
project will represent 50% of the final grade in the class.

**Grading**: Final grade is derived from: Per-class quizzes (50%),
result replication presentation (50%).

- 9/7:
**RL intro**.

RL Demo, Intro slides. - 9/12:
**Lesson: Introduction to Reinforcement Learnng.**(Michael out).

Pedro Tsividis on learning to play like humans. - 9/14:
**Lesson: Smoov and Curly's Bogus Journey**. - 9/19:
**Lesson: Reinforcement Learnng Basics**.

*Read Chapters 1 and 2 of Littman (1996)*. - 9/21:
**Lesson: TD and Friends**.

*Read Sutton (1988)*. - 9/26:

*Read Thomas, Niekum, Theocharous, and Konidaris (2015)*. - 9/28:
**Lesson: Convergence**.

*Read Littman and Szepesvári (1996)*. - 10/3: (Michael out).

George Konidaris on alternative notions of return. - 10/5:
- 10/10:
**Lesson: AAA**.

*Read Kocsis and Szepesvári (2006)*. - 10/12: Project discussion.

- 10/17:
**Lesson: Messing with Rewards**.

*Read Ng, Harada, Russell (1999) and Asmuth, Littman, Zinkov (2008)*. - 10/19:
**Lesson: Exploring Exploration**.

*Read Fong (1995).* - 10/24:

*Read Li, Littman, Walsh (2008) and Poupart, Vlassis, Hoey, and Regan (2006)*. - 10/26:
**Lesson: Generalization**.

*Read Gordon (1995), Baird (1995) and Smart and Kaelbling (2000)*. - 10/31:

*Read Baxter and Bartlett (1999) and Lagoudakis and Parr (2001)*. - 11/2:
**Lesson: Partially Observable MDPs**.

*Read Littman (2009) and Smith, Simmons (2005).* - 11/7: (Michael out).
- 11/9:
**Lesson: Options**.

*Read Sutton, Precup, Singh (1999), Jong, Hester, Stone (2008) (including slides), and Silver and Ciosek (2012)*. - 11/14:
**Lesson: Game Theory**.

*Read Littman and Stone (2003).* - 11/16: (Michael out).
- 11/21:
**Lesson: Game Theory Reloaded**.

*Read Littman (1994) and Greenwald and Hall (2003)*. - 11/28:
**Lesson: Game Theory Revolutions**.

*Read Munoz de Cote and Littman (2008)*. - 11/30:
**Lesson: CCC**.

*Read Zeibart et al. (2008) and Babes et al. (2011)*. - 12/5:
**Discussion: Deep RL**.

- 12/7:
**Lesson: Outroduction to Reinforcement Learning**.

*Read Minh et al. (2015)*. - 12/12: No class.
- 12/14:
**Presentations**.

- The space of sequential decision-making problems, geometric discounting implies no decision reversals. Work out a concrete example of a reversal.
- Define MDP. Talk about representations (math, graph, table, OO). Work out a concrete example of an MDP and its value. Policies and optimality. Discuss 1/(1-gamma). Bellman's equation.
- Value iteration and Bellman's equation. Value of a policy is the solution to a system of linear equations. Define contraction mapping and show value iteration contacts.
- Model-based estimate. TD(lambda).
- Hoeffding bounds for epsilon-delta assurance of Monte Carlo. Q-learning with a fixed policy. Connection to averaging, learning rates, and Bellman equation.
- Coding VI for policy optimization.
- Convergence proof of VI. Q-learning. The absolute difference between the k-th order statistic of two lists is less than the pairwise absolute differences between the lists.
- Generalized Q-learning convergence. Policy iteration. Each iteration of policy iteration results in an improvement, policy iteration converges in finite time.
- Linear programming. Linear programming can solve MDPs in polynomial time. MDP dual LP. Zero-sum games.
- General sum games, Nash equilibria. Nash equilibria in repeated games.
- Stochastic games. Minimax-Q. Grid games.
- Exploration, bandits, algorithms.
- KWIK learning and efficient RL.
- POMDPs.
- HSVI.
- Shaping.
- Project discussion.
- Options.
- Generalization. Introduce averagers and "hill car the hard way". Provide a counter example for more general convergence.
- Policy gradient.
- LSPI.
- UCT.
- Training RL agents.
- Bayesian RL.
- Memory-based RL.
- Inverse reinforcement learning.
- Deep reinforcement learning.