Department of Computer Science and Technology

Course pages 2025–26 (working draft)

Reinforcement Learning

Principal lecturer: Dr Rika Antonova
Taken by: MPhil ACS, Part III
Code: L171
Term: Michaelmas
Hours: 16 (8 x two-hour lectures)
Format: In-person lectures
Prerequisites: Multivariable calculus, linear algebra, probability, machine learning

Aims

The aim of this module is to present state-of-the-art reinforcement learning (RL) methods, to motivate students to understand RL theory, and to help them develop the skills needed to implement deep RL methods.


RL has seen unprecedented success in recent years. However, the majority of RL methods still require intricate skills and insight to apply successfully. The goal of this module is to communicate the promising aspects of RL while also ensuring that students understand the limitations of current RL methods.

The assessment will consist of a theory test (in-class, closed-book), a test of the students’ understanding of how to code RL methods (in-class, closed-book), a mini-project, and a participation component (all described in the ‘Assessment’ section below).

Objectives

Students will learn the following technical concepts:

  • fundamental RL terminology and mathematical formalism; a brief history of RL and its connection to neuroscience and biological systems
  • RL methods for discrete action spaces, e.g. deep Q-learning and large-scale Monte Carlo Tree Search
  • methods for exploration, modelling uncertainty, and partial observability for RL
  • modern policy gradient and actor-critic methods
  • concepts needed to construct model-based RL and Model Predictive Control methods
  • approaches to make RL data-efficient and ways to enable simulation-to-reality transfer
  • examples of fine-tuning foundation models and large language models (LLMs) with human feedback; safe RL concepts; examples of using RL for safety validation
  • examples of using RL for scientific discovery

Students will also gain practical experience with implementing and analysing RL methods to uncover their strengths and shortcomings, and with proposing novel extensions to improve the performance of existing RL methods in a short mini-project.

Syllabus

Topic 1: Introduction and Fundamentals

  • Overview of RL: foundational ideas, history, and books; connection to neuroscience and biological systems; recent industrial applications and research demonstrations
  • Mathematical fundamentals: Markov decision processes, Bellman equations, policy and value iteration, temporal difference learning (see the value-iteration sketch after this list)
  • Short intro to RL libraries and environments
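
As a concrete illustration of the mathematical fundamentals above, here is a minimal value-iteration sketch in Python. The tabular MDP representation (`P[s][a]` as a list of `(prob, next_state, reward)` tuples) is an illustrative assumption, not part of the course materials:

```python
# Value iteration on a small tabular MDP -- a minimal sketch.
# Assumes P[s][a] = [(prob, next_state, reward), ...] (illustrative format).
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality backup until the values converge."""
    V = np.zeros(n_states)
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            # Q(s, a) = sum over s' of p(s'|s,a) * (r + gamma * V(s'))
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            V_new[s] = max(q)  # Bellman optimality: V(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

Policy iteration alternates full policy evaluation and greedy improvement instead of the single combined backup shown here.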

Topic 2: RL in Discrete Action Spaces

  • Q-learning, function approximation, and deep Q-learning; nonstationarity in RL and its implications for deep learning; example applications (video games, with Atari as the initial example); a tabular Q-learning sketch follows this list
  • Monte Carlo Tree Search; example applications (AlphaGo)
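
As a bridge between the tabular theory and deep Q-learning, below is a minimal tabular Q-learning sketch with epsilon-greedy exploration. It assumes a Gymnasium-style environment with integer observations and discrete actions; deep Q-learning replaces the table with a neural network (plus replay buffers and target networks to cope with nonstationarity), but the TD update has the same form:

```python
# Tabular Q-learning with epsilon-greedy exploration -- a minimal sketch.
# Assumes a Gymnasium-style env: reset() -> (obs, info),
# step(a) -> (obs, reward, terminated, truncated, info), with integer obs.
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else act greedily.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD target bootstraps from the best next action; zero at terminals.
            target = r + (0.0 if terminated else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```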

Topic 3: Policy Gradient and Actor-critic Methods for Continuous Action Spaces

  • Policy gradient theorem, actor-critic methods (SPG, DDPG); a REINFORCE-style sketch follows this list
  • Proximal policy optimisation; example applications
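
To make the policy gradient theorem concrete, here is a minimal REINFORCE-style loss in PyTorch (a sketch, not the lecture code, and the episode format is an illustrative assumption); actor-critic methods and PPO replace the Monte Carlo returns below with learned value estimates and a clipped surrogate objective:

```python
# Monte Carlo policy gradient (REINFORCE) loss -- a minimal sketch.
# Assumes one completed episode: a list of log pi(a_t | s_t) tensors and
# a list of scalar rewards (illustrative episode format).
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    # Discounted returns-to-go: G_t = r_t + gamma * G_{t+1}.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.tensor(returns)
    # Normalising returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Negative sign: optimisers minimise, but the policy gradient ascends
    # E[sum_t G_t * log pi(a_t | s_t)].
    return -(torch.stack(log_probs) * returns).sum()
```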

Topic 4: Exploration, Uncertainty, Data-efficient RL and Simulation-to-reality Transfer

  • Multi-armed bandits, Bayesian optimisation, regret analysis (a UCB1 sketch follows this list)
  • Data-efficient learning from real data (e.g. policy search in robotics), real-to-sim inference and differentiable simulation, data-efficient simulation-to-reality transfer
  • RL for physical systems (successful examples in locomotion, open problems in contact-rich robot manipulation)
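
As a small worked example of the explore/exploit trade-off underlying the bandit material, here is a UCB1 sketch; `pull(arm)` is a hypothetical stand-in for the environment and is assumed to return stochastic rewards in [0, 1]:

```python
# UCB1 for a multi-armed bandit -- a minimal sketch.
# `pull(arm)` is a hypothetical environment callback returning rewards in [0, 1].
import math

def ucb1(pull, n_arms, horizon):
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play each arm once to initialise the estimates
        else:
            # Pick the arm maximising empirical mean + exploration bonus.
            arm = max(range(n_arms),
                      key=lambda a: means[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # running-mean update
    return means, counts
```

The bonus term shrinks as an arm is pulled more often, which is what yields the logarithmic regret bounds discussed in the regret-analysis material.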

Topic 5: Partial Observability, Memory, and Sequence Modelling

  • Introduction to sequence modelling with transformers and RL with transformer-based methods
  • Partially observable Markov decision processes; probabilistic methods for belief and memory modelling (a belief-update sketch follows this list)
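
Below is a minimal discrete Bayes-filter sketch for the POMDP belief update, assuming tabular models with illustrative array shapes `T[s, a, s']` (transitions) and `O[s', a, o]` (observations):

```python
# Discrete POMDP belief update (Bayes filter) -- a minimal sketch.
# Assumes tabular models: T[s, a, s'] transition probabilities and
# O[s', a, o] observation probabilities (shapes are illustrative).
import numpy as np

def belief_update(b, a, o, T, O):
    """b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    predicted = T[:, a, :].T @ b      # predict: sum_s T[s, a, s'] * b[s]
    updated = O[:, a, o] * predicted  # correct: weight by observation likelihood
    return updated / updated.sum()    # normalise back to a distribution
```

Transformer- and RNN-based methods can be viewed as learning such a belief (memory) state implicitly from observation histories.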

Topic 6: Model-based RL and Model Predictive Control; Residual RL

  • Learning dynamics models from data; model-based RL with learned models
  • Model predictive control; residual RL (model-based or model-free); a random-shooting MPC sketch follows this list
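
To illustrate how a learned model can be used for control, here is a minimal random-shooting MPC sketch; `model(s, a)` (a learned one-step dynamics model) and `reward(s, a)` are illustrative placeholders:

```python
# Random-shooting model predictive control -- a minimal sketch.
# `model(s, a) -> next_state` and `reward(s, a) -> float` are
# illustrative placeholders for a learned dynamics model and reward.
import numpy as np

def mpc_action(state, model, reward, action_dim,
               horizon=10, n_samples=200, rng=None):
    """Sample random action sequences, roll them out through the model,
    and return the first action of the best-scoring sequence."""
    rng = rng if rng is not None else np.random.default_rng()
    best_return, best_first = -np.inf, None
    for _ in range(n_samples):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in seq:
            total += reward(s, a)
            s = model(s, a)  # one-step prediction with the learned model
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first  # execute one action, then replan (receding horizon)
```

Residual RL instead learns a corrective policy on top of such a base controller (model-based or model-free).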

Topic 7: RL with Human Feedback; Safe RL and RL for Validation

  • Fine-tuning large language models (LLMs) and other foundation models with human feedback (TRLX, RL4LMs, a lightweight overview of RLHF)
  • A review of safe RL, with an example: optimising commercial HVAC systems using policy improvement with constraints; improving safety using RL for validation, with examples in autonomous driving, autonomous flight, and aircraft collision avoidance
  • Examples of RL for molecular design and drug discovery, active learning for synthesising new materials, RL for theorem proving, and RL for nuclear fusion experiments

Topic 8: Student Presentations (+ an invited talk)

  • Students will give short mini-project presentations: a figure with an algorithm overview, and at least two plots with the main results as they will appear in the mini-project report. (The report is due several days after the last class meeting, but the figures should be the same as in this in-class presentation.)
  • One of the invited talks can be scheduled for this class meeting as well.

Assessment

The module will be assessed through a test of RL theory (20%), a test of coding skills for RL methods (20%), a mini-project (50%), and participation (10%). Students will also be required to complete two brief practical exercises in preparation for the coding test. These will not be marked; instead, they will serve as checkpoints to ensure students are preparing for the test. Similarly, there will be a brief (unmarked) in-class theory quiz to check that students are preparing well for the theory test. We will discuss the unmarked coding exercises and quiz answers during class, with opportunities for students to earn participation points.

  • Theory test (20%): To assess understanding of RL fundamentals, students will take a theory test (30 minutes, in-class, closed-book).
  • Coding test (20%): Students will be required to demonstrate understanding of the implementation of foundational and state-of-the-art RL methods (30 minutes, in-class, closed-book).
  • Mini-project (50%): This mini-project will focus on improving RL under resource constraints. It will build on the take-home exercises, but for the project, the students will be challenged to push beyond the state of the art: maintain high reward performance while minimising memory and compute resources. Students can apply innovations they learn from lectures and assigned readings, seek further insights from other sources, and even show research potential by creating novel RL algorithms.
    Students will work in pairs, with one partner focusing on minimising memory (space) and the other on minimising compute (time). Hence, the project will facilitate a clear separation of individual contributions while still encouraging teamwork. The results will be ranked based on the memory and compute resources used by the proposed methods for the given reward/success-rate targets.
    Students will need to submit a short report that describes the proposed method and shows key plots that present the results. Each pair should submit one report (maximum 2000 words total; maximum 1000 words per student), at least one algorithm summary figure, and at least two figures with main results (one showing memory (space) usage and the other compute (time) usage improvements compared to the baseline). Each paragraph and plot should be marked with the name of the student who contributed, so that the contributions of each student can be clearly separated for marking.
  • Participation (10%): Students will be expected to attend class sessions in person, contribute at least one substantial explanation or comment to the in-class discussions of the unmarked take-home coding exercises, and briefly present the results of their mini-project. This component will help to ensure lively and active class meetings, where students can practise articulating their knowledge.

Recommended Reading

Books

[S&B] Reinforcement Learning: An Introduction (second print edition). Richard S. Sutton, Andrew G. Barto. [Available from the book’s website as a free PDF updated in 2022]

[CZ] Algorithms for Reinforcement Learning. Csaba Szepesvari. [Available from the book’s website as a free PDF updated in 2019]

[MK] Algorithms for Decision Making. Mykel J. Kochenderfer, Tim A. Wheeler, Kyle H. Wray. [Available from the book’s website as a free PDF updated in 2023]

[DB] Reinforcement Learning and Optimal Control. Dimitri Bertsekas. [Available from the book’s website as a free PDF updated in 2023]