Iliad

AI Alignment Introduction

Cluster A

An opinionated tour of the AI alignment problem: alignment targets, problem decompositions (outer/inner alignment, training stories, inductive biases), goal-directedness and instrumental convergence, the risk landscape, and high-level solution approaches.

By Leon Lang (Iliad)

What you’ll learn

  • Understand the basic decomposition of risks into AI misalignment, misuse, power grabs, and others
  • Explain different alignment targets such as coherent extrapolated volition, intent alignment, or AI that follows a constitution
  • Reason about training stories and outer and inner (mis)alignment, challenges to these notions, and the relationship to inductive biases
  • Know the discussions on whether AI systems develop goals, and understand the basic arguments for instrumental convergence
  • Know the foundational discussions on the level of risk and the different high-level approaches to solving the AI alignment problem

Prerequisites

  • We recommend a basic understanding of deep learning to follow this module.
  • To understand some of the decompositions of AI alignment, it is also useful to have a basic understanding of reinforcement learning, including the notions of a reward function and a policy.

Fast track

This fast track procedure should take you about an hour:

  • Make this Google Doc accessible to your favorite LLM.
  • Read the "Main content" text without reading the linked posts.
  • If you don't understand some terminology or logic in the high-level story, talk with your LLM about it and ask it to clarify, until you understand the entire text in "Main content".

Once you understand the entire high-level story, you may want to read specific posts and papers in more detail, if you have additional time.

Main content

Note: To fully understand this content, read the linked texts. A selection of these can also be found in the "teaching guide" section, which explains how we ran this content in person at the April Iliad Intensive.

Different AI risks. There are many AI safety problems. The adolescence of technology is an opinionated introduction to these risks, including the risk of autonomous misaligned AI, AI misuse for destruction, AI misuse for seizing power, and others.

Alignment targets. In this course, we are mainly focused on AI alignment, which is broadly the problem of making sure AI acts in line with "human" wishes. There are different, partially overlapping conceptualizations of what this means, that is, of what the "alignment target" is. Some may wish AI to be aligned with our coherent extrapolated volition, which is "our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together [...]". A different conceptualization is that of intent alignment with a human, which means "trying to do what the human wants [the AI] to do". As contemporary AI becomes more powerful and widespread, more sophisticated notions of alignment targets have emerged. Most prominently, one may align an AI with the prescriptions in a constitution, as in the case of Anthropic's Claude; such a constitution may balance the wishes of users, the developer's guidelines, general ethics, and efforts to oversee the AI. Another often-discussed target is corrigibility. An AI is corrigible if it will let us modify its functions or goals and even let us shut it down. If corrigibility is achieved, then we can correct mistakes made in building an AI, which makes it easier to iterate toward lower risk.

Alignment problem decompositions. Assume we have chosen our alignment target. How do we ensure an AI system actually obeys that target? This is the technical problem of AI alignment. In this introduction, we are mostly presupposing that AI systems are trained via deep learning, although later modules in this course on agent foundations go beyond that assumption. In the deep learning paradigm, training stories are a useful framework to think about how to align AI systems: You need to choose a training target (e.g. "My model is corrigible") and then have a training rationale for why your training setup will create a model obeying the target. And of course, your training rationale needs to actually be correct, which requires technical research to back up any such claim!

With approaches that are based on reinforcement learning, another popular problem decomposition is into outer and inner alignment. Outer alignment is the problem of specifying a reward function that rewards an AI's behavior according to how well it obeys the chosen alignment target. Inner alignment is the problem of ensuring that the trained policy generalizes correctly and continues to obey the reward function, even in new out-of-distribution situations after training. In cases where the policy has inner processes that actively optimize for a goal, this means ensuring that this goal agrees with the goal encoded in the reward function, a problem that was discussed at length in "Risks from Learned Optimization".
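To make the distinction concrete, here is a minimal toy sketch that is not taken from any of the linked posts. It assumes a hypothetical one-dimensional corridor in which the coin always sits at the right end during training; the names `reward`, `go_right`, `go_to_coin`, and `rollout` are illustrative only. The reward function is specified as intended, yet two policies that earn identical training reward behave differently once the coin moves.

```python
from typing import Callable

# A policy maps (agent position, coin position) to a step of -1, 0, or +1.
Policy = Callable[[int, int], int]

def reward(agent_pos: int, coin_pos: int) -> float:
    """Intended objective: 1.0 if the episode ends on the coin, else 0.0."""
    return 1.0 if agent_pos == coin_pos else 0.0

def go_right(agent_pos: int, coin_pos: int) -> int:
    """Proxy behavior: always move right and ignore the coin."""
    return 1

def go_to_coin(agent_pos: int, coin_pos: int) -> int:
    """Intended behavior: move toward the coin, then stay on it."""
    if coin_pos > agent_pos:
        return 1
    if coin_pos < agent_pos:
        return -1
    return 0

def rollout(policy: Policy, coin_pos: int, length: int = 5,
            start: int = 2, steps: int = 5) -> float:
    """Run one episode in a corridor of `length` cells; return the final reward."""
    pos = start
    for _ in range(steps):
        # Clamp the position so the agent cannot leave the corridor.
        pos = min(max(pos + policy(pos, coin_pos), 0), length - 1)
    return reward(pos, coin_pos)

TRAIN_COIN = 4  # coin at the right end: the only case seen in training (assumption)
TEST_COIN = 0   # coin at the left end: an out-of-distribution case (assumption)

train_rewards = {p.__name__: rollout(p, TRAIN_COIN) for p in (go_right, go_to_coin)}
test_rewards = {p.__name__: rollout(p, TEST_COIN) for p in (go_right, go_to_coin)}
print(train_rewards)
print(test_rewards)
```

Training reward alone cannot tell the two policies apart in this toy setting; the difference only shows up out of distribution, which is the situation inner alignment is concerned with.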

In general, the out-of-distribution (also called "generalization") behavior of AI systems depends crucially on their inductive biases, which are the implicit or explicit constraints on which functions are more likely to emerge from the training process. As it turns out, deep learning has strong implicit inductive biases, as exemplified by emergent misalignment, a phenomenon whereby narrow fine-tuning can produce broadly misaligned LLMs.

The decomposition of the alignment problem into inner and outer alignment is controversial, which is why we started above with the more general framing of training stories. In particular, in "Reward is not the optimization target", Alex Turner argues that reward is not what a reinforcement learning policy is optimized to maximize. Instead, a policy performs the contextual computations that earlier reward events reinforced. This is similar to Eliezer Yudkowsky's proposal to view biological organisms as adaptation-executers instead of fitness-maximizers.
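A small sketch can illustrate the mechanical point, under stated assumptions: we use REINFORCE, one standard policy-gradient rule chosen here for illustration, on a hypothetical two-armed bandit. The function name `reinforce_step`, the learning rate, and the sampled history are all made up for this example. The update only strengthens the action that was actually taken, scaled by the reward received, so an arm that would pay more but is never sampled is never reinforced.

```python
import math

def softmax(logits):
    """Convert logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update: logits += lr * reward * grad log pi(action).

    For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - pi(i).
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

# Hypothetical history: arm 0 pays 1.0 and happens to be the only arm sampled.
# Arm 1 would pay 10.0, but that reward never enters any update.
logits = [0.0, 0.0]
for _ in range(20):
    logits = reinforce_step(logits, action=0, reward=1.0)

probs = softmax(logits)
print(probs)
```

In this sketch the policy ends up strongly preferring arm 0 even though arm 1 has the higher payoff: reward shaped the computations that preceded it rather than acting as a quantity the policy searches for.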

Goal-directedness. We already briefly discussed the idea of a policy that "internally" steers toward some goal. If an AI is goal-directed in that way, then any misalignment on top is intuitively more dangerous. In "Why tool AIs want to be agent AIs", Gwern argues that economic reasons, among others, will push AI systems to become more and more goal-directed.

Such an AI system, if successful, would also be able to create its own subgoals on the way to achieving its goals. "Instrumental convergence" is the claim that there are some goals that are instrumentally useful in reaching a very wide variety of final goals. This would imply that goal-directed AI systems develop basic AI drives like self-improvement, preservation of their goals, or self-protection. This runs counter to the alignment target of corrigibility we discussed earlier, and is therefore a dangerous outcome of training powerful AI. Understanding the relationship between inductive biases and the goal formation of AI systems is therefore very important.
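The counting intuition behind instrumental convergence can be shown in a toy example. The sketch below assumes a hypothetical deterministic world given by the graph `GRAPH`, in which every terminal state is treated as one possible final goal; the state names and helper functions are invented for illustration and do not reproduce any formal result from the literature. It counts, for each final goal, which first move an agent starting in `start` must take.

```python
from collections import Counter

# Hypothetical world: each state maps to the states reachable in one step.
GRAPH = {
    "start": ["shutdown", "hub"],
    "shutdown": [],            # absorbing: no further options
    "hub": ["a", "b", "c"],    # keeps several outcomes reachable
    "a": [],
    "b": [],
    "c": [],
}

def reachable(graph, node):
    """Return the set of states reachable from `node`, including `node` itself."""
    seen = {node}
    stack = [node]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def first_move_toward(graph, start, goal):
    """Return the first successor of `start` from which `goal` is reachable."""
    for nxt in graph[start]:
        if goal in reachable(graph, nxt):
            return nxt
    return None

# Treat every terminal state as one possible final goal.
final_goals = [state for state, successors in GRAPH.items() if not successors]
first_moves = Counter(first_move_toward(GRAPH, "start", g) for g in final_goals)
print(first_moves)
```

In this toy world, three of the four possible final goals require moving to `hub` and avoiding `shutdown` first. Avoiding shutdown is thus useful for most goals without being anyone's final goal, which is the pattern the instrumental convergence claim describes.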

Forecasting risks. Where does this leave us in terms of the overall level of risk? There is a broad spectrum of views on the difficulty of the alignment problem. In Anthropic's "Core Views on AI Safety", the authors discuss three very different possibilities: an optimistic scenario, in which current methods are essentially sufficient; an intermediate scenario, in which preventing catastrophic risks requires substantial scientific and engineering effort; and a pessimistic scenario, in which AI safety is essentially unsolvable. In a survey of authors at top machine learning venues, "[b]etween 38% and 51% of respondents gave at least a 10% chance to advanced AI leading to outcomes as bad as human extinction". Those who are concerned often write failure stories of how AI can lead to catastrophes, such as Paul Christiano's "What failure looks like" and the detailed scenario in AI 2027, while others are more optimistic and explain why they think "AI is easy to control". Given this uncertainty, it is useful to do research on the level and nature of AI risks themselves. One approach is model organisms of misalignment, which are "in vitro demonstrations of the kinds of failures that might pose existential threats".

Solution approaches. Assume we are sufficiently concerned about AI alignment that we want to solve the problem. What high-level approach should we choose? Some people, like Jan Leike, argue for highly empirical, iterative approaches to AI alignment, sometimes with the goal of using more advanced AI systems to help solve the remaining alignment research questions. Eliezer Yudkowsky argues for deep mathematical progress in his analogy to the rocket alignment problem. Yet others take a further step back and argue we need to solve deep philosophical problems on our way to aligned AI. Somewhat orthogonal to all those views, Paul Christiano's post on prosaic AI alignment argues that we should attempt to align the AI systems we have in front of us instead of waiting for breakthroughs in our understanding of intelligence or for entirely new paradigms for creating intelligent machines. This view is compatible with both empirical and theoretical research on how to make those systems safe.

Our course largely attempts a synthesis of these views: for much of the course, we assume the "prosaic" picture that powerful AI systems will be based on deep learning systems much like today's, and we are interested in deep, empirically grounded mathematical progress on understanding these systems.

Learn more

Here we list all readings from above and more.

Problem overviews

On the alignment problem

Non-misalignment safety problems

  • Individual people may misuse AI in catastrophic ways:
    • Sections 2.1-2.3 in An Overview of Catastrophic AI Risks argue for catastrophic misuse capabilities like bioterrorism, unleashing AI agents, and persuasive AIs. Misuse risk is particularly relevant to our course since it can also manifest as a misalignment concern: An AI that assists human users in causing harm is often misaligned with the AI's developer.
  • AI can give rise to global totalitarianism:
    • Section 2.4 argues for the potential of a concentration of power, leading to global totalitarianism in the worst case.
  • We may get gradually disempowered even if there is alignment

Alignment targets

  • Coherent Extrapolated Volition: Proposes AI should optimize for what humanity would want "if we knew more, thought faster, were more the people we wished we were".
  • Clarifying AI Alignment: Defines "intent alignment" as AI trying to do what the operator wants.
  • Claude's Constitution: Aligning to a constitution rather than to individual human judgments.
  • Corrigibility: Defines corrigibility as cooperating with human corrective interventions despite instrumental incentives to resist.
  • Artificial Intelligence, Values, and Alignment: Distinguishes alignment with instructions, intentions, revealed preferences, ideal preferences, interests, and values.
  • AI alignment as the fair treatment of claims
  • Societal Alignment Frameworks: "we argue that improving LLM alignment requires incorporating insights from societal alignment frameworks, including social, economic, and contractual alignment"
  • Beyond Preferences in AI Alignment: "AI systems should be aligned with normative standards appropriate to their social roles, such as the role of a general-purpose assistant."
  • AI Control: "Labs should make sure that powerful models can't cause unacceptably bad outcomes even if the AIs try to."
  • A love for humanity: "Scott Aaronson says OpenAI's cofounder, Ilya Sutskever, has asked him how to use math to define what it means for AI to love humanity. Right now, he has no idea how to answer that. But he sees it as a 'North Star,' or leading goal, he says. It's a question 'that should always be guiding us.'"
  • Truthful AI: AI that does not lie

Alignment problem decompositions

A popular way to decompose AI alignment is into inner and outer alignment, where outer alignment is the problem of specifying an objective function that evaluates according to our intentions, and inner alignment is the problem of creating an AI system that performs optimally according to the objective function. This decomposition is contested and there exist alternatives, but it is still useful to be aware of it.

Outer Misalignment

Inner Misalignment

Inductive Biases

Discussion on outer and inner alignment notions

Goal-directedness

AI misalignment is arguably particularly bad if AI systems develop goals, since this may, via instrumental convergence, lead to a drive to seek power.

Will AI systems develop goals?

Counterpoints:

Instrumental Convergence / Power-Seeking

Forecasting risks

On how hard it is to achieve alignment

Overall risk assessments

Scenarios

Theoretical vs. empirical vs. other approaches

Case for empirical work

Case for mathematical work

Philosophical/conceptual work

Against many plans, empirical and theoretical