
Reward Learning Theory

Cluster A

The theoretical foundations of reward learning: how RLHF can (in principle) recover an aligned objective, when underspecification and misspecification break that story, and how assistance games reframe the problem.

By Leon Lang (Iliad), Joar Skalse (King's College London)

Prerequisites

It is useful to know the basics of reinforcement learning, reinforcement learning from human feedback (RLHF), and AI alignment, as taught in other modules of this course.

Fast track

Read the slides of the first lecture. If you have more time, do the first exercise on partial observability, which is one instantiation of misspecification.

Main content

Learn more

Here we list further readings, which are largely papers mentioned in Leon's lecture slides.