Reward Learning Theory
The theoretical foundations of reward learning — how RLHF can (in principle) recover an aligned objective, when underspecification and misspecification break that story, and how assistance games reframe the problem.
By Leon Lang (Iliad), Joar Skalse (King's College London)
Prerequisites
It is useful to know the basics of reinforcement learning, reinforcement learning from human feedback (RLHF), and AI alignment, as taught in other modules of this course.
Fast track
Read the slides of the first lecture. If you have more time, do the first exercise, which treats partial observability as one instance of misspecification.
Main content
-
Lecture slides: Intro to Reward Learning Theory by Leon Lang
-
Reward Learning Theory: Exercises (also available without solutions). The exercise sheet is based on the following two papers:
-
Lecture slides: Towards a Formal Theory of Reward Learning, With Application to Inverse Reinforcement Learning by Joar Skalse. More context:
Learn more
Here we list further readings, which are largely papers mentioned in Leon's lecture slides.
-
Faulty reward functions in the wild: A basic reward specification problem.
-
Reward Learning methods and frameworks:
-
Deep reinforcement learning from human preferences: The most basic and popular approach to reward learning
-
Reward-rational (implicit) choice: A unifying formalism for reward learning: A generalization that contains RLHF as a special case
-
Algorithms for Inverse Reinforcement Learning: Another special case
-
Preferences Implicit in the State of the World: Another special case
-
Cooperative Inverse Reinforcement Learning: A generalization
-
Benefits of Assistance over Reward Learning: An adapted framework for said generalization
-
Underspecification and misspecification
-
Occam's razor is insufficient to infer the preferences of irrational agents: A case of underspecification, in which the relationship between the human's reward function and the human's policy is left open
-
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback: A case of misspecification, in which humans are assumed to observe the environment fully even though they observe it only partially.
-
Modeling Human Beliefs about AI Behavior for Scalable Oversight: An approach that corrects for the previous misspecification by modeling the human's beliefs. However, this can sometimes lead to underspecification, where the reward function remains too uncertain to be learned safely.
-
AI Alignment with Changeable and Influenceable Reward Functions: This paper breaks with the typical assumption of a fixed reward function.
-
Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback: This paper breaks with the typical assumption of a single reward function.
-
On how reward learning falls within AI alignment:
-
Alignment targets
-
The reward learning agenda assumes that human values can be expressed via reward functions. However, some papers argue:
-
Even if human values are captured by a reward function, it is contentious whether we should decompose the alignment problem into first learning that reward function and then optimizing it:
-
Even if we solved outer alignment via reward learning, we would still be left with the inner alignment problem of finding a policy that "cares for" this objective:
-
Risks from Learned Optimization in Advanced Machine Learning Systems
-
Reinforcement Learning textbook: This book is on reinforcement learning, which can be regarded as a very basic conceptualization of the inner alignment problem
-
See also Joar Skalse's sequence on the theoretical foundations of reward learning.
-
I can also recommend reading the work of Anca Dragan and her many students on reward learning.
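To make the underspecification theme in the readings above concrete, here is a minimal Python sketch. It is not taken from any of the listed papers. It assumes a Bradley-Terry-style preference model over trajectory returns (of the kind commonly used in RLHF) and a Boltzmann (softmax) model of human action choice. The function names, reward values, and the parameter `beta` are hypothetical and chosen only for illustration. The sketch shows two ways in which the data cannot pin down the reward: shifting all returns by a constant leaves preference probabilities unchanged, and a Boltzmann-rational human with reward R behaves identically to an "anti-rational" human with reward -R.

```python
import math


def preference_prob(return_a, return_b):
    """Bradley-Terry-style probability that trajectory A is preferred to B."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def boltzmann_policy(rewards, beta):
    """Softmax action distribution: P(a) is proportional to exp(beta * R(a))."""
    logits = [beta * r for r in rewards]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Ambiguity 1: adding a constant to both returns does not change the
# preference probability, so preference data cannot identify that constant.
original = preference_prob(3.0, 1.0)
shifted = preference_prob(3.0 + 5.0, 1.0 + 5.0)

# Ambiguity 2: hypothetical rewards for three actions. A rational planner
# (beta > 0) with reward R and an anti-rational planner (beta < 0) with
# reward -R produce the same policy, so behaviour alone cannot tell them apart.
reward = [1.0, 0.0, -2.0]
negated = [-r for r in reward]
rational = boltzmann_policy(reward, beta=2.0)
anti_rational = boltzmann_policy(negated, beta=-2.0)

print(abs(original - shifted) < 1e-12)
print(all(abs(p - q) < 1e-12 for p, q in zip(rational, anti_rational)))
```

Both checks compare quantities that are mathematically equal, which is why no amount of additional data of the same kind would separate the two hypotheses in either case.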