Iliad Intensive Curriculum

Prerequisites

We recommend a basic understanding of deep learning to follow this module.
To understand some of the decompositions of AI alignment, it is also useful to have a basic understanding of reinforcement learning, including the notions of a reward function and a policy.

Fast track

This fast track procedure should take you about an hour:

Make this Google Doc accessible to your favorite LLM.
Read the "main content" text without reading the linked posts.
If you don't understand some terminology or logic in the high-level story, talk with your LLM about it and ask it to clarify, until you understand the entire text in "Main content".

Only once you understand the entire high-level story, you may want to read specific posts and papers in more detail if you have additional time.

Main content

Note: To fully understand this content, read the linked texts. A selection of these can also be found in the "teaching guide" section, which explains how we ran this content in person at the April Iliad Intensive.

Different AI risks. There are many AI safety problems. The adolescence of technology is an opinionated introduction to these risks, including the risk of autonomous misaligned AI, AI misuse for destruction, AI misuse for seizing power, and others.

Alignment targets. In this course, we are mainly focused on AI alignment, which is broadly the problem of making sure AI acts in line with "human" wishes. There are different, partially overlapping conceptualizations of what this means, i.e., of what the "alignment target" is. Some may wish AI to be aligned with our coherent extrapolated volition, which is "our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together [...]". A different conceptualization is that of intent alignment with a human, which means "trying to do what the human wants [the AI] to do". As contemporary AI becomes more powerful and widespread, more sophisticated notions of alignment targets have emerged. Most prominently, one may align an AI with the prescriptions in a constitution as in the case of Anthropic's Claude, which may balance the wishes of users, the developer's guidelines, general ethics, and efforts to oversee the AI. Another often-discussed target is corrigibility. An AI is corrigible if it will let us modify its functions or goals and even let us shut it down. If corrigibility is achieved, then we can correct mistakes in building an AI, which makes it easier to iterate toward lower risk.

Alignment problem decompositions. Assume we have chosen our alignment target. How do we ensure an AI system actually obeys that target? This is the technical problem of AI alignment. In this introduction, we are mostly presupposing that AI systems are trained via deep learning, although later modules in this course on agent foundations go beyond that assumption. In the deep learning paradigm, training stories are a useful framework to think about how to align AI systems: You need to choose a training target (e.g. "My model is corrigible") and then have a training rationale for why your training setup will create a model obeying the target. And of course, your training rationale needs to actually be correct, which requires technical research to back up any such claim!

With approaches that are based on reinforcement learning, another popular problem decomposition is into outer and inner alignment. Outer alignment is the problem of specifying a reward function that rewards an AI's behavior according to how well it obeys the chosen alignment target. Inner alignment is the problem of ensuring that the trained policy will, even in new out-of-distribution situations after training, continue to generalize correctly and obey the reward function. In cases where the policy has inner processes that actively optimize for a goal, this means ensuring that this goal agrees with the goal encoded in the reward function, a problem that was discussed at length in risks from learned optimization.
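The inner alignment concern can be made concrete with a very small toy example. The sketch below is illustrative only and is not taken from any of the linked texts: the one-dimensional world, the reward function, and the two policies (`seek_coin`, `go_right`) are all hypothetical. Both policies earn full reward on the training distribution, where the coin is always on the right, but only one of them keeps obeying the reward function when the coin moves.

```python
from typing import Callable, List, Tuple

# Hypothetical toy world: (agent_position, coin_position) on a line of cells.
State = Tuple[int, int]


def reward(state: State, action: int) -> float:
    """Reward 1.0 if the step (-1 or +1) moves the agent closer to the coin."""
    agent, coin = state
    return 1.0 if abs(agent + action - coin) < abs(agent - coin) else 0.0


def seek_coin(state: State) -> int:
    """Policy whose internal goal matches the reward: step toward the coin."""
    agent, coin = state
    return 1 if coin > agent else -1


def go_right(state: State) -> int:
    """Policy that picked up a different goal: always step right."""
    return 1


def average_reward(policy: Callable[[State], int], states: List[State]) -> float:
    """Mean one-step reward of a policy over a list of states."""
    return sum(reward(s, policy(s)) for s in states) / len(states)


# Training distribution: the coin is always in the rightmost cell (4).
train_states = [(agent, 4) for agent in range(4)]
# Shifted distribution: the coin is in the leftmost cell (0).
shifted_states = [(agent, 0) for agent in range(1, 5)]

for name, policy in [("seek_coin", seek_coin), ("go_right", go_right)]:
    print(name, average_reward(policy, train_states), average_reward(policy, shifted_states))
```

On the training states the two policies are indistinguishable by reward alone, so nothing in the reward signal tells the training process which one to prefer. Which one actually emerges depends on properties of the training process that this sketch does not model.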

In general, the out-of-distribution behavior of AI systems (also called their "generalization" behavior) depends crucially on their inductive biases, which are the implicit or explicit constraints on what functions are more likely to emerge from the training process. As it turns out, deep learning has strong implicit inductive biases, as exemplified by emergent misalignment, a phenomenon whereby narrow fine-tuning can produce broadly misaligned LLMs.

The decomposition of the alignment problem into inner and outer alignment is controversial, which is why above we started with the more general framing of training stories. In particular, in "Reward is not the optimization target", Alex Turner argues that the reward function is not what a reinforcement learning policy is optimized to maximize. Instead, a policy performs the contextual computations that were previously reinforced before a reward event. This is similar to Eliezer Yudkowsky's proposal to view biological organisms as adaptation-executers instead of fitness-maximizers.

Goal-directedness. We already briefly discussed the idea of a policy that "internally" steers toward some goal. If an AI is goal-directed in that way, then any misalignment on top is intuitively more dangerous. In "Why tool AIs want to be agent AIs", Gwern argues that economic reasons, among others, will push AI systems to become more and more goal-directed.

Such an AI system, if successful, would also be able to create its own subgoals on the pathway to achieving its goals. "Instrumental convergence" is the claim that there are some goals that are instrumentally useful in reaching a very wide variety of final goals. This would imply that goal-directed AI systems develop basic AI drives like self-improvement, preservation of their goal, or self-protection. This runs counter to the alignment target of corrigibility we discussed earlier, and is therefore a dangerous outcome of training powerful AI. Understanding the relationship between inductive biases and the goal formation of AI systems is therefore very important.
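The instrumental convergence claim can be illustrated with a counting argument in a tiny decision problem. This is a hedged toy sketch, not the formalism of any linked paper: the outcome names and the "hub" structure are hypothetical. From the start, the agent either commits directly to outcome A, or moves to a hub from which it can still reach any of B, C, or D. We enumerate every strict preference ordering over the four outcomes and count how often the best choice is to keep options open.

```python
from itertools import permutations

# Hypothetical toy environment: one directly reachable outcome, and a "hub"
# state that keeps three further outcomes reachable.
DIRECT_OUTCOME = "A"
HUB_OUTCOMES = ("B", "C", "D")
OUTCOMES = (DIRECT_OUTCOME,) + HUB_OUTCOMES


def prefers_hub(reward: dict) -> bool:
    """True if the best outcome reachable via the hub beats the direct outcome."""
    return max(reward[o] for o in HUB_OUTCOMES) > reward[DIRECT_OUTCOME]


# Every strict ranking of the four outcomes stands in for one possible final goal.
goals = [dict(zip(OUTCOMES, values)) for values in permutations(range(4))]
hub_count = sum(prefers_hub(goal) for goal in goals)

print(f"{hub_count} of {len(goals)} goals prefer the hub")  # 18 of 24
```

Three quarters of the goals favor the hub simply because it keeps more outcomes reachable, regardless of what each goal ultimately values. Real environments and real training processes are far more complicated, so this only conveys the intuition behind the claim.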

Forecasting risks. Where does this leave us in terms of the overall level of risk? There is a broad spectrum of views on the difficulty of the alignment problem. In Anthropic's "core views on AI safety", the authors discuss three very different possibilities: an optimistic scenario, in which current methods are essentially sufficient; an intermediate scenario, in which preventing catastrophic risks requires substantial scientific and engineering effort; and a pessimistic scenario, in which AI safety is essentially unsolvable. Among authors at top machine learning venues, "[b]etween 38% and 51% of respondents gave at least a 10% chance to advanced AI leading to outcomes as bad as human extinction". While those who are concerned often create failure stories of how AI can lead to catastrophes, like Paul Christiano's "What failure looks like" and the detailed scenario in AI 2027, others are more optimistic and explain why they think "AI is easy to control". In this epistemic state of uncertainty, it is useful to do research on the level and nature of AI risks themselves. One approach is model organisms of misalignment, which are "in vitro demonstrations of the kinds of failures that might pose existential threats".

Solution approaches. Assume we are sufficiently concerned about AI alignment that we want to solve the problem. What high-level approach should we choose? Some people, like Jan Leike, argue for very empirical iterative approaches to AI alignment, sometimes with the goal of using more advanced AI systems to help solve remaining alignment research questions. Eliezer Yudkowsky argues for deep mathematical progress in his analogy to the rocket alignment problem. Yet others take a further step back and argue we need to solve deep philosophical problems on our way to aligned AI. Somewhat orthogonal to all those views, Paul Christiano's post on prosaic AI alignment argues that we should attempt to align the AI systems we have in front of us instead of waiting for breakthroughs in our understanding of intelligence or entirely new paradigms to create intelligent machines. This view is compatible with both empirical and theoretical research on how to make those systems safe.

Our course attempts, to a large extent, a synthesis of these different views: for much of the course we assume the "prosaic" picture that powerful AI systems will be based on deep learning systems much like today's, and we are also interested in deep, empirically grounded mathematical progress on understanding these systems.

Learn more

Here we list all readings from above and more.

Problem overviews

On the alignment problem

Non-misalignment safety problems

Individual people may misuse AI in catastrophic ways:
- Sections 2.1-2.3 in An Overview of Catastrophic AI Risks argue for catastrophic misuse capabilities like bioterrorism, unleashing AI agents, and persuasive AIs. Misuse risk is particularly relevant to our course since it can also manifest as a misalignment concern: An AI that assists human users in causing harm is often misaligned with the AI's developer.
AI can give rise to global totalitarianism:
- Section 2.4 of the same overview argues for the potential of a concentration of power, leading to global totalitarianism in the worst case.
We may get gradually disempowered even if there is alignment:
- Gradual Disempowerment: Systematic Existential Risks from Incremental AI Development argues that humans may be gradually disempowered, potentially leading to catastrophic outcomes, even if the alignment problem is technically solved.

Alignment targets

Coherent Extrapolated Volition: Proposes AI should optimize for what humanity would want "if we knew more, thought faster, were more the people we wished we were".
Clarifying AI Alignment: Defines "intent alignment" as AI trying to do what the operator wants.
Claude's Constitution: aligning to a constitution rather than to individual human judgments.
- Alternative: OpenAI's Model Spec
Corrigibility: Defines corrigibility as cooperating with human corrective interventions despite instrumental incentives to resist
Artificial Intelligence, Values, and Alignment: Distinguishes alignment with instructions, intentions, revealed preferences, ideal preferences, interests, and values.
AI alignment as the fair treatment of claims
Societal Alignment Frameworks: "we argue that improving LLM alignment requires incorporating insights from societal alignment frameworks, including social, economic, and contractual alignment"
Beyond Preferences in AI Alignment: "AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose assistant."
AI Control: "Labs should make sure that powerful models can't cause unacceptably bad outcomes even if the AIs try to."
A love for humanity: "Scott Aaronson says OpenAI's cofounder, Ilya Sutskever, has asked him how to use math to define what it means for AI to love humanity. Right now, he has no idea how to answer that. But he sees it as a "North Star," or leading goal, he says. It's a question "that should always be guiding us.""
Truthful AI: AI that does not lie

Alignment problem decompositions

A popular way to decompose AI alignment is into inner and outer alignment, where outer alignment is the problem of specifying an objective function that evaluates according to our intentions, and inner alignment is the problem of creating an AI system that performs optimally according to the objective function. This decomposition is contested and there exist alternatives, but it is still useful to be aware of it.

Outer Misalignment

Specification gaming: the flip side of AI ingenuity
- List of specification gaming examples
The Surprising Creativity of Digital Evolution: "many researchers in the field of digital evolution have observed their evolving algorithms and organisms subverting their intentions, exposing unrecognized bugs in their code, producing unexpected adaptations, or exhibiting outcomes uncannily convergent with ones in nature."
Reward misspecification, where the agent is rewarded positively for bad actions because it is generally difficult to design reward functions that capture the developer's intent, is another key reason why AI systems may end up with wrong goals, already on the training distribution (making this conceptually distinct from generalization concerns). This is also called the outer alignment problem.
Categorizing variants of Goodhart's law: Identifies four distinct failure modes (regressional, extremal, causal, adversarial) when proxies are over-optimized
Scaling laws for reward model overoptimization: Empirically demonstrates that optimizing against a learned reward model eventually degrades true reward
Defining and characterizing reward hacking: Provides formal definitions of reward hacking, proving that for any non-trivial proxy reward, unhackability is impossible.
Towards understanding sycophancy in language models: "both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time".
When your AIs deceive you: partial observability leading to outer misalignment.
Sycophancy to subterfuge: Shows progression from mild sycophancy to active reward tampering.
The effects of reward misspecification: Systematically categorizes reward misspecification types and shows higher agent capability leads to more reward hacking.
Faulty reward functions in the wild
Consequences of misaligned AI: Shows that optimizing proxy rewards depending on a strict subset of relevant features can be arbitrarily bad under certain conditions.
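Several of the readings above concern proxies that capture only part of what the designer cares about. The following is a minimal sketch under stated assumptions, not a reproduction of any paper's model: the budget, the two features, and the square-root utilities are all hypothetical choices made for illustration. An agent splits a fixed budget between a feature the reward function can see and one it cannot.

```python
import math

BUDGET = 10  # hypothetical units of effort to split between two features


def proxy_reward(visible: int) -> float:
    """Specified reward: only measures the visible feature."""
    return math.sqrt(visible)


def true_utility(visible: int) -> float:
    """Intended objective: values the visible and the hidden feature alike."""
    hidden = BUDGET - visible
    return math.sqrt(visible) + math.sqrt(hidden)


allocations = range(BUDGET + 1)
best_for_proxy = max(allocations, key=proxy_reward)
best_for_truth = max(allocations, key=true_utility)

# Columns: effort on the visible feature, proxy reward, true utility.
for visible in allocations:
    print(visible, round(proxy_reward(visible), 2), round(true_utility(visible), 2))
```

As more effort goes to the visible feature, the proxy keeps rising, while the true utility rises only up to the balanced allocation and then falls. In this toy setting the proxy-optimal allocation spends nothing on the hidden feature. How severe the gap becomes in practice depends on the actual objectives involved.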

Inner Misalignment

Goal misgeneralization: One core reason why goals may misgeneralize out of distribution is an underspecification of the goal, i.e., there are several goals consistent with the AI's behavior on the training set. Goal misgeneralization captures this concern (original paper).
Risks from Learned Optimization: Discusses the general conceptual problem of ensuring that the trained AI agent ends up internally optimizing for the goals given to it during training, also called the inner alignment problem.
- Deceptive Alignment is a particular failure mode in which an AI only appears to be aligned with its intended goals to gain trust and avoid modification, while it actually cares about something else, which might eventually lead to a treacherous turn.
Alignment Faking provides some empirical support for deceptive alignment, in which an AI pretends to be aligned with its new goal in order to preserve its own goals and avoid modification. It is controversial whether this is good or bad, and whether the framing as "alignment faking" is correct, since the model tries to preserve the helpful, honest, and harmless values it obtained in its first alignment training.
Gradient hacking: Introduces the concept of a deceptively aligned mesa-optimizer deliberately influencing its own gradient updates to preserve its mesa-objective.
Scheming AIs: Will AIs fake alignment during training in order to get power?
Sleeper Agents: Training deceptive LLMs that persist through safety training: Demonstrates that backdoor behaviors trained into LLMs persist through RLHF, SFT, and adversarial training.

Inductive Biases

Inductive biases for deep learning of higher-level cognition
Emergent misalignment
Natural emergent misalignment from reward hacking in production RL
Trained AI systems are internally a mess. They presumably contain a mix of beliefs, heuristics and online learning and planning algorithms, or something else.

Discussion on outer and inner alignment notions

The decomposition of alignment into inner and outer alignment can be useful both for categorizing failure modes and solution strategies, but has been called into question due to ambiguities arising in edge cases and the opinion of many that the problem should not be decomposed into two in practice.
Reward is not the optimization target
Training stories as an alternative decomposition to outer and inner alignment

Goal-directedness

AI misalignment is arguably particularly bad if AI systems develop goals since this may via instrumental convergence lead to a drive to seek power.

Will AI systems develop goals?

Why tool AIs want to be agent AIs is a classic text by Gwern arguing that AI agents are so useful that they will eventually be developed.
A short informal explanation of this can be found in How could a machine end up with its own priorities? (appendix of "If Anyone Builds It, Everyone Dies")
Sections 2.1, 2.2, and 2.4 of Russell and Norvig are also a good explanation of this and generally explain on a conceptual level what intelligent agents are.
Also read AI Goals Forecast with a specific focus on the appendices to understand why AI systems trained via today's (reinforcement learning) methods will likely develop goals at all.
Will humans build goal-directed agents?

Counterpoints:

Reframing Superintelligence: Comprehensive AI services as general intelligence

Instrumental Convergence / Power-Seeking

The Basic AI Drives: Pioneering argument that sufficiently advanced AI will exhibit convergent instrumental drives.
- Formalizing Convergent Instrumental Goals
The Superintelligent Will: Formalizes the orthogonality thesis and the instrumental convergence thesis
Optimal Policies Tend to Seek Power: First formal proof that for most reward functions in MDPs, optimal policies seek power
Parametrically retargetable decision-makers tend to seek power: Extends power-seeking results beyond optimal policies to more realistic parameterized agents.

Forecasting risks

On how hard it is to achieve alignment

Background: The orthogonality thesis
Meta: Model organisms to learn more about the level of risk.
The main article in the AI Goals Forecast shows the plurality of possible goals that an AI system may inherently want to pursue, some of which may be very bad for us.
Sharp left turn: Argues that when AI capabilities begin to generalize powerfully, alignment properties predictably fail to generalize with them.
AGI ruin: A list of lethalities: 43 reasons AGI alignment is lethally difficult, covering first-try requirements, deceptive alignment, inability to iterate, and the analogy to evolution producing misaligned general intelligence.
AI Alignment remains hard and unsolved
Where I agree and disagree with Eliezer: Paul Christiano's response
Core views on Safety by Anthropic: Presents a three-tier portfolio approach: optimistic (current techniques largely sufficient), intermediate (substantial scientific work needed), pessimistic (alignment may be impossible)
AI is easy to control

Overall risk assessments

Is Power-Seeking AI an Existential Risk?: estimating >10% probability of existential catastrophe from misaligned power-seeking AI by 2070.
Chapter 8 in Superintelligence — "Is the default outcome doom?"
Misalignment and catastrophe are the default outcomes of training powerful AI
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Thousands of AI authors on the future of AI: Between 38% and 51% of respondents gave at least a 10% chance to advanced AI leading to outcomes as bad as human extinction.
"The Precipice: Existential Risk and the Future of Humanity" (Ch. 5) — Toby Ord (2020). Estimates existential risk from unaligned AI at ~1 in 10 over the next century within a comprehensive risk landscape framework.

Scenarios

What failure looks like: Sketches two failure scenarios: systems pursuing easy-to-measure proxies that diverge from human values, and systems with emergent influence-seeking behavior.
Treacherous turn
AI 2027
If Anyone Builds It, Everyone Dies: Why Superhuman AI Would Kill Us All: Book-length argument that creating vastly smarter-than-human AI using anything like current techniques would very likely result in human extinction.

Theoretical vs. empirical vs. other approaches

Case for empirical work

Prosaic AI alignment: argues that we should focus on trying to align AI systems that are similar to the ones already built today (2016, tbc.!) since perhaps AGI will be developed without a fundamental understanding of AGI.
Anthropic's core view on safety: "We are most optimistic about a multi-faceted, empirically-driven approach to AI safety"
Our approach to alignment research by OpenAI (2022): following feedback, assisting evaluations, doing alignment research
Why I'm optimistic about our alignment approach, Jan Leike, 2022

Case for mathematical work

The rocket alignment problem
Why agent foundations? An overly abstract explanation
Agent foundations for aligning machine intelligence with human interests: "the authors believe that there are theoretical prerequisites for designing aligned smarter-than-human systems over and above what is required to design misaligned systems"
AI alignment metastrategy: argues for halting capability work and developing theory of intelligent agents
Towards guaranteed safe AI

Philosophical/conceptual work

Against many plans, empirical and theoretical

On how various plans miss the hard bits of the alignment challenge: Arguing about everyone else's (empirical) plan

AI Alignment Introduction

What you’ll learn