CSCI2951-F: Learning and Sequential Decision Making
Brown University
Fall 2017
Prof. Michael L. Littman (mlittman@cs.brown.edu, CIT 301)
Time: TTh 10:30-11:50 
Place: Brown CIT 316 (subject to change) 
Semester: Fall 2017 
Web page: http://cs.brown.edu/courses/cs2951f/ 
Office hours: By appointment. 
General Orientation to the Course: Through a combination of
classic papers and more recent work, the course explores automated
decision making from a computer-science perspective.  It examines
efficient algorithms, where they exist, for single agent and
multiagent planning as well as approaches to learning near-optimal
decisions from experience.  Topics include Markov decision processes,
stochastic and repeated games, partially observable Markov decision
processes, and reinforcement learning.  Of particular interest will be
issues of generalization, exploration, and representation.  Students
will replicate a result in a published paper in the area.
Participants should have taken a machine-learning course and should
have had some exposure to reinforcement learning from a previous
computer-science class or seminar; check with instructor if not
sure. Students should already know how to program. No particular
language will be required, but Python is growing interest in the
community. 
Course goals: Students should understand the main concepts of
interest to the field of reinforcement learning, be able to implement
standard algorithms and understand how to apply them to relevant
problems, and be prepared to be able to contribute new results to the
field. 
 
		      
Prerequisites: CSCI 1950F or CSCI 1420 or permission of the instructor. 
Flipped structure: Students need to watch the lectures recorded
by the professor and
available online. 
Class time will be used for problem solving and discussion. 
    
Result replication presentation: Students will form into small
groups of two to four, and select a relevant paper from the literature.
They will choose a graph in the paper and create an independent
implementation/replication of this result.  Students often find that
important parameters needed to replicate the result are not stated in
the paper and that obtaining the same pattern of results is sometimes
not possible.  Students will present their work at the end of the
semester.  Grades are based on the fidelity of the replication (25%),
how well they show they understand the original paper (25%), the
quality of the presentation itself in terms of clarity and creativity
(25%), and their short written report (25%).  The grade on this
project will represent 50% of the final grade in the class.
Grading: Final grade is derived from: Per-class quizzes (50%),
result replication presentation (50%).
Schedule
  -  9/7: RL  intro. 
 RL
	Demo, Intro slides.
-  9/12: Lesson: Introduction to Reinforcement
    Learnng.  (Michael out). 
 Pedro Tsividis on learning to play like humans.
-  9/14: Lesson: Smoov and Curly's Bogus Journey.
  
-  9/19: Lesson: Reinforcement Learnng Basics.
 Read Chapters 1 and 2
    of Littman
    (1996).
-  9/21: Lesson: TD and Friends.
 Read Sutton
  (1988).
-  9/26: 
 Read Thomas,
        Niekum, Theocharous, and Konidaris (2015).
-  9/28: Lesson: Convergence. 
 Read Littman
	and Szepesvári (1996).
-  10/3: (Michael out).
 George Konidaris on alternative notions of return.
-  10/5: 
  
-  10/10: Lesson: AAA.
 Read Kocsis
	and Szepesvári (2006).
-  10/12: Project discussion.
 
-  10/17: Lesson: Messing with Rewards.
 Read  
	Ng, Harada, Russell (1999) and Asmuth,
	Littman, Zinkov (2008).
-  10/19: Lesson: Exploring Exploration.
 Read Fong
	(1995).
-  10/24: 
 Read Li,
	Littman, Walsh (2008) and 
	Poupart, Vlassis, Hoey, and Regan (2006).
-  10/26: Lesson: Generalization.
 Read 
	Gordon (1995), Baird
	(1995) and Smart
	and Kaelbling (2000).
-  10/31: 
 Read 
	Baxter and Bartlett (1999) and Lagoudakis
	and Parr (2001).
-  11/2: Lesson: Partially Observable MDPs.
 Read Littman
    	(2009) and Smith, Simmons
	(2005).
-  11/7: (Michael out).
 Dave Abel presents "SimpleRL" RL package.
-  11/9: Lesson: Options.
 Read Sutton,
	Precup, Singh (1999),
	Jong,
	Hester, Stone (2008) (including 
	slides),
	and Silver and
	Ciosek (2012).
-  11/14: Lesson: Game Theory.
 Read Littman and Stone (2003).
-  11/16: (Michael out). 
 Jun Ki Lee presents implementing deep RL algorithms in TensorFlow.
-  11/21: Lesson: Game Theory Reloaded.
 Read Littman
	(1994) and 
	Greenwald and Hall (2003).
-  11/28: Lesson: Game Theory Revolutions.
 Read Munoz
	de Cote and Littman (2008).
-  11/30: Lesson: CCC.
 Read  Zeibart
	et al. (2008) and Babes et al. (2011).
-  12/5: Discussion: Deep RL.
 Read Minh
	et al. (2015).
 Robot learning videos:
-  12/7: Presentations.
  
-  12/12: Presentations.
  
-  12/14: Presentations.
  
-  12/15: Presentations.
Topics and Paper Links
-  The space of sequential decision-making problems,
    geometric discounting implies no decision reversals.  Work out a
    concrete example of a reversal.
-  Define MDP.  Talk about representations (math, graph,
    table, OO).  Work out a concrete example of an MDP and its value.
    Policies and optimality.  Discuss 1/(1-gamma).  Bellman's
    equation.
-  Value iteration and Bellman's equation.  Value of a 
  policy is the solution to a system of linear equations.  Define
    contraction mapping and show value iteration contacts.
-  Model-based estimate.
  TD(lambda).
-  Hoeffding bounds for epsilon-delta assurance of Monte
  Carlo. Q-learning with a fixed policy.  Connection to averaging,
  learning rates, and Bellman equation. 
- Coding VI for policy optimization.
- Convergence proof of VI. 
    Q-learning.  The absolute difference between the
	k-th order statistic of two lists is less than the pairwise absolute
	differences between the lists.  
- Generalized Q-learning convergence.  Policy iteration.  Each
	iteration of policy iteration results in an improvement, policy iteration
	converges in finite time.
- Linear programming.  Linear programming can solve MDPs in polynomial
	time.  MDP dual LP.  Zero-sum games. 
- General sum games, Nash equilibria.  Nash equilibria in repeated
	games. 
- Stochastic games.  Minimax-Q. Grid games.   
- Exploration, bandits, algorithms.
- KWIK learning and efficient RL.  
- POMDPs.
- HSVI.
-  Shaping.
- Project discussion.
- Options. 
- Generalization. Introduce averagers and "hill car the hard
	way". Provide a counter example for more general
	convergence.
- Policy gradient.
- LSPI. 
- UCT. 
- Training RL agents.
- Bayesian RL.
- Memory-based RL.
- Inverse reinforcement learning.
-  Deep reinforcement learning.