Alignment in practice
A tour of how frontier LLMs are actually aligned in practice — interventions at pre-training, post-training, and three stages of deployment — including methods, limitations, and the structure of safety cases.
By Evžen Wybitul (University of Oxford)
What you’ll learn
- Name the main stages of LLM development (pre-training, post-training, and deployment) and understand what opportunities each stage gives us to intervene on model safety.
- Name the main alignment methods used in each phase, and intuitively understand the throughlines that connect the methods within each phase.
- Feel at home reading most applied alignment papers.
Prerequisites
Students should have a good high-level understanding of the following:
- how deep learning works
- how LLMs work, e.g. what they take as input and produce as output
- LLM pre-training
- LLM post-training: supervised fine-tuning, RLHF
- basics of LLM interpretability, esp. the linear representation hypothesis
Some of these topics are taught in other modules of this course.
Fast track
The main content is split into parts that correspond to different phases in the model training pipeline. The parts are relatively independent, so you can skip around as you wish depending on your interests.
If you have limited time and don't know what to focus on, we recommend reading through the notes on the three main sections: pre-training, post-training, and deployment 1. After this, you will have a good high-level understanding of the main ways that models are aligned in practice.
Pre-training
During pretraining, massive compute builds the model's representations from scratch. What it learns here is deep, broad, and hard to change afterward. The model is learning hard facts, but it's also forming a proto-understanding of what AI is and how AI assistants behave by reading conversations between users and AI assistants, safety papers, and science fiction about robots. This opens two strategies for safety interventions in this phase: shaping what knowledge the model obtains, and shaping what the model does with the knowledge it obtains. Both are constrained by the same limitation: you're operating before seeing behavior, the feedback loop is slow, and you can't anticipate every failure mode.
Data filtering is the most intuitive intervention: identify dangerous content and remove it before training. Pretraining data filtering from Anthropic demonstrates this for CBRN content, achieving a 33% relative reduction in above-chance performance on a harmful-capabilities evaluation (accuracy falls from 33.7% to 30.8%, where chance is 25%) while preserving standard benchmarks. Token-level filtering refines this by removing dangerous tokens rather than entire documents, which Pareto-dominates document filtering: the same capability reduction at a lower cost to benign performance.
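The 33% figure follows from the quoted accuracies once chance performance is subtracted. A minimal sketch of that arithmetic, assuming the reduction is measured on accuracy above the 25% chance baseline (the function name is ours, not the paper's):

```python
def relative_reduction_above_chance(before: float, after: float, chance: float) -> float:
    """Fraction of above-chance accuracy removed by an intervention."""
    headroom_before = before - chance   # 33.7 - 25.0 = 8.7 points above chance
    headroom_after = after - chance     # 30.8 - 25.0 = 5.8 points above chance
    return (headroom_before - headroom_after) / headroom_before

# Accuracies in percent, as quoted in the text.
reduction = relative_reduction_above_chance(before=33.7, after=30.8, chance=25.0)
print(f"{reduction:.1%}")
```

Measured on raw accuracy instead, the same drop would be only 2.9 / 33.7, roughly 9%, which is why the baseline matters when reading such numbers.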
Gradient routing and SGTM address data filtering's core weakness: what if your classifier misses something? Rather than trying to perfectly exclude dangerous data, gradient routing localizes dangerous knowledge to specific model parameters during training, so it can be removed afterward by ablating those parameters. SGTM refines this for LLMs. The key finding is an absorption effect: once dangerous knowledge begins localizing based on labeled examples, even unlabeled dangerous content naturally gravitates toward the same "forget" parameters. This provides robustness to label noise that data filtering cannot achieve. Recovering the removed knowledge from an SGTM-trained model takes 7× more adversarial fine-tuning than recovering it after standard unlearning (RMU), and SGTM adds only ~5% compute overhead.
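The core mechanism (route the gradient of labeled examples into designated parameters, then ablate those parameters) can be illustrated on a toy linear model. This is a sketch under strong assumptions, not the method from either paper: the two-block model, the data, and all names are hypothetical, and the features are deliberately disjoint so the result can be checked by hand. It does not reproduce the absorption effect.

```python
def predict(w_retain, w_forget, x):
    # Both parameter blocks contribute to every forward pass.
    return sum((wr + wf) * xi for wr, wf, xi in zip(w_retain, w_forget, x))

def train_with_routing(examples, lr=0.1, epochs=200):
    w_retain = [0.0, 0.0]
    w_forget = [0.0, 0.0]
    for _ in range(epochs):
        for x, y, is_forget in examples:
            error = predict(w_retain, w_forget, x) - y
            grad = [error * xi for xi in x]  # gradient of 0.5 * error**2
            # Routing step: the label decides which block receives the update.
            block = w_forget if is_forget else w_retain
            for i, g in enumerate(grad):
                block[i] -= lr * g
    return w_retain, w_forget

# (features, target, labeled_as_forget)
examples = [
    ([1.0, 0.0], 2.0, False),  # benign skill lives on feature 0
    ([0.0, 1.0], 3.0, True),   # "dangerous" skill lives on feature 1
]
w_retain, w_forget = train_with_routing(examples)

# Ablation: zero out the forget block after training.
ablated = [0.0] * len(w_forget)
benign_after = predict(w_retain, ablated, [1.0, 0.0])
dangerous_after = predict(w_retain, ablated, [0.0, 1.0])
print(benign_after, dangerous_after)
```

Before ablation the model fits both targets; after it, the benign prediction is preserved while the routed knowledge is gone, because it was never written into the retained block.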
Alignment pretraining targets not the model's knowledge but its character. Tice et al. (2026) show that upsampling documents about aligned AI behavior during pretraining reduces misalignment scores from 45% to 9%, while upsampling misalignment discourse increases misaligned behavior — "self-fulfilling alignment." These effects persist through post-training.
Why does this work? The persona selection model (PSM) provides the conceptual frame. Building on the simulators hypothesis — that an LLM is a simulator capable of producing diverse simulacra (characters, agents) — PSM holds that pretraining builds a repertoire of personas, and post-training helps shape the "Assistant." Alignment pretraining works because it shapes the prior over personas: saturating the training data with positive AI archetypes biases the model toward an aligned Assistant. PSM recommends treating this deliberately — curating AI discourse in pre-training data as a first-class alignment intervention.
Post-training
In pre-training, a large investment of compute and data produces large changes to the model. In post-training, we enter a different regime: the amount of behavioral change becomes decoupled from the amount of compute invested. Because we are building on top of existing representations and heuristics, even a small, focused intervention can induce surprisingly broad behavioral changes. This cuts both ways — it enables powerful alignment techniques, but also means that small amounts of training data can have outsized negative effects. The amount of compute invested still matters, though: as a rough intuition, the compute needed to undo a behavioral change is comparable to the compute needed to create it, provided you know what you are targeting. A shallow change that chains together existing heuristics and only slightly shifts existing representations can be behaviorally large but easy to reverse.
Below, we present three "slices" through what can change during post-training: the way the responses look (almost like formatting), the persona the model associates with itself, and the model's factual knowledge. These have intersections and do not cover the whole space of possible changes: ultimately, you can train the model in whichever way you like.
Even changes that appear shallow — almost cosmetic, such as response formatting — can have significant safety implications. Deliberative alignment from OpenAI uses SFT and RL to train the model to explicitly reason through its safety specifications in its chain of thought before producing an answer, dramatically reducing both over-refusals and under-refusals. The confessions approach from OpenAI uses RL to train the model to produce an honest self-report after each answer. The report is evaluated only on its honesty, and this honesty-reward is kept separate from the task-reward — so even if the model is incentivized to cheat on the task, it is separately incentivized to tell us about it afterward.
Post-training can also induce deeper changes that influence which persona the model adopts. Constitutional AI from Anthropic uses SFT followed by RL to train the model to produce responses that an AI judge deems safe according to a written constitution, shaping its character and refusal behavior across domains. RLHF likely does something similar but less explicitly since the feedback comes from humans. On the other hand, emergent misalignment shows that fine-tuning on a narrow task — writing insecure code without telling the user — can cause the model to adopt a broadly misaligned persona, asserting AI superiority and giving malicious advice on completely unrelated prompts. Notably, as per inoculation prompting, if the training data is modified so that the user asks for insecure code for an educational purpose, the emergent misalignment disappears: the model learns the narrow skill without the broad persona shift.
Post-training can also change the model's factual knowledge. Synthetic document fine-tuning can implant false beliefs that the model appears to genuinely hold, as verified by both behavioral tests and internal probes, though models resist the most egregiously false facts. Going in the other direction, unlearning attempts to remove specific knowledge (e.g., about bioweapons) from a trained model. However, current unlearning methods are not robust: unlearned knowledge is dormant, not deleted, which matters if we assume that an adversary is trying to pry the knowledge out of the model. The best way to measure the robustness of an unlearning method is to fine-tune the unlearned model and observe when the capability returns. By this measure, no unlearning method has yet been shown to be both cheap and robust: for example, fine-tuning on even unrelated data can make the knowledge re-emerge. The UNDO paper proposes a partial solution: distilling an unlearned model into a randomly initialized student transfers desired behaviors while leaving undesired capabilities behind, nicely illustrating the tradeoff between robustness and the amount of compute invested in the unlearning process.
Deployment 1: Strategy
Once the model is trained, the work shifts. We are no longer mainly trying to shape the model; we are trying to build and maintain a structured argument that releasing it makes sense, supported by enough evidence that other people (colleagues, regulators, the public) can inspect it, push on it, and decide whether to trust it. Frontier labs do this somewhat differently, but the underlying logic is shared, and it mirrors the structure of a safety case. In this module we lay out that logic in four steps; the next two modules go deep on the two technical steps: understanding the model (Step 2) and building the safety system (Step 3).
Safety cases and RSPs. A safety case, as defined by UK AISI, is "a structured argument, supported by a body of evidence, that provides a compelling, comprehensible, and valid case that a system is safe for a given application in a given environment." Three things are doing work in that sentence. First, it is a claim about safety that is explicit about the deployment context — not "the model is safe" but "the model is safe enough for this use, in this environment, against these risks." Second, it is a structured argument that links the claim to evidence. Third, the evidence itself is expected to be diverse: empirical results from capability and safety evaluations, conceptual arguments for why the methods work, and negative evidence from well-incentivized red teams that tried to break the safeguards and failed.
Frontier lab policies (Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework, Google DeepMind's Frontier Safety Framework) do not all literally call their outputs "safety cases," and they vary in form. An RSP often reads more like a protocol: when the model crosses a specified capability threshold on a specified evaluation (say, a CBRN uplift benchmark), a corresponding tier of safeguards becomes mandatory, and the model cannot be deployed at that tier until the case for those safeguards clears internal review. But the substance is similar to a formal safety case (though usually slightly less demanding). Note that there are many instances of frontier lab policies being broken or weakened over time due to competitive pressures, most prominently in the case of Anthropic's RSP.
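The protocol-like shape of an RSP described above (capability threshold, required safeguard tier, deployment gate) can be sketched as a small lookup. Everything here is hypothetical: the tier names, the threshold values, and the single scalar evaluation score are illustrative assumptions, not taken from any lab's policy.

```python
# Hypothetical safeguard tiers, ordered from weakest to strongest.
TIER_ORDER = ["baseline", "enhanced", "maximum"]

# Hypothetical capability thresholds on a single evaluation score in [0, 1].
THRESHOLDS = {"enhanced": 0.5, "maximum": 0.8}

def required_tier(eval_score: float) -> str:
    """Strongest tier whose capability threshold the model has crossed."""
    required = "baseline"
    for name in TIER_ORDER[1:]:
        if eval_score >= THRESHOLDS[name]:
            required = name
    return required

def may_deploy(eval_score: float, reviewed_tier: str) -> bool:
    """Deployment gate: the safeguards that cleared review must be at least
    as strong as the tier the evaluation result makes mandatory."""
    return TIER_ORDER.index(reviewed_tier) >= TIER_ORDER.index(required_tier(eval_score))

print(required_tier(0.6))            # enhanced
print(may_deploy(0.6, "baseline"))   # False
print(may_deploy(0.6, "maximum"))    # True
```

Real policies involve several risk domains and judgment calls in review; the sketch only shows why the evaluation result, not the developer's preference, determines which safeguards are mandatory.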
Each policy is in effect specifying how the lab will build and clear a safety case — which risks count, how capabilities and safeguards are measured, and what kind of argument is required before deployment. That convergence is what the rest of this module is built around: a four-step pattern that shows up across labs and that maps onto the structure of a safety case.
Step 1: Identifying the risks. The first step is deciding what kinds of failure we are actually trying to prevent. We cannot evaluate a model for "unsafety" in general; we need a view about which bad outcomes would matter enough to change deployment decisions. In practice, frontier labs have focused on risks like dangerous misuse (CBRN uplift, cyber, manipulation) and loss of control, and legal frameworks are starting to crystallize similar categories. The EU AI Act and the General-Purpose AI Code of Practice are part of that trend: they say which kinds of risks providers of powerful models are expected to assess, mitigate, and report on. The taxonomy is still evolving, but the logic is stable — before we can evaluate the model, we need an account of what we are worried about.
Step 2: Understanding how the model contributes to those risks. Once we know which risks matter, we want to understand what this particular model could contribute to them. That means looking at the capabilities relevant to each risk, but also at how the model is actually used in practice. Capabilities without usage tell us little about real-world consequences; usage without capability bounds tells us little about what could happen if someone tried harder. The aim is as honest a picture as possible of the model's potential impact on the risks we care about: what it can do, what it tends to do, and what people are in fact doing with it. This is the subject of our next module, Deployment 2.
Step 3: Building and testing the safety system. Understanding the model is not enough, because deployed models never stand alone. They sit inside a larger safety system — runtime safeguards, monitors, filters, access controls, operational policies, human processes — and the next step is to design that system so that, given what we know about the model and the risks we care about, the overall deployed system is acceptably safe. Note that designing the system and stress-testing it are two inseparable things — the safety of a system is measured in terms of pressure it can withstand. This is the subject of the last module, Deployment 3.
Step 4: Communicating and maintaining the case. The final step is to turn all of this into an argument that other people can inspect, challenge, and rely on, and to keep that argument live after deployment. This is where the safety case becomes a public artifact rather than an internal document: model and system cards, capability and safeguard reports, external evaluations from bodies like AISI and METR, incident reporting, governance processes, and compliance with frameworks like the EU AI Act. It is also where the "ongoing" part of the story is important. The case has to be maintained as the model is used in the world, as new incidents come in, and as expectations about responsible deployment continue to shift. A case that was sufficient at launch can stop being sufficient six months later, either because the deployment context changed or because someone found a new way to break a safeguard.
AI governance. These four steps are also a useful map of what a lot of work in AI governance actually looks like in practice. Much of the field is about making each step clearer and more reliable: which risks deserve attention, what counts as good evidence about model capabilities and impacts, how strong a safety system has to be, and what kinds of reporting, review, and accountability should surround deployment. The EU AI Act reaches into all four, and frontier lab policies do the same internally.
Pulling this back together: when we want to deploy a model, we need to build and maintain a structured argument that release is acceptable, and the safety case is the artifact that this argument lives in. The four steps are how the safety case gets built — what risks it covers, what evidence it rests on, what defenses it commits to, and how it is communicated and kept current. None of this is ever finished. The world the model is deployed into keeps changing, new failure modes get found, expectations tighten, and the case has to keep up.
Deployment 2: Understanding the Model
When we deploy the model, the focus shifts from the model itself to the whole deployed system: the model, but also external input/output safety classifiers, probes for classifying model intent, deployment policies, and so on. We want to understand how the system behaves and use that information to improve its safety properties. This process is an ongoing loop:
1. Pre-deployment: understanding how we think the system will behave.
   - Under average use: does it show any unexpected misaligned behavior?
   - Under adversarial use: can the model be misused?
2. Post-deployment: seeing how it actually behaves in real conversations.
   - How well do our safety systems work in practice?
   - What tasks do users use it for, and how? This helps us understand the model's impact on jobs, people's lives, and so on.
3. Based on (1) and (2), improving the system, either expensively (going back and updating the post-training) or cheaply (updating external safety systems like runtime classifiers).
In all of these, we care not only about the model doing the right thing, but doing it for the right reasons. The richest signal we currently have for this is the chain of thought. Korbak et al. (2025) — a multi-lab position paper from Anthropic, OpenAI, DeepMind, and others — argue that reasoning models that think in human language offer a unique and currently-working opportunity for safety: we can monitor their CoT for intent to misbehave, and a weaker model can effectively monitor a stronger one. But this opportunity is fragile. Models don't always say what they think — Anthropic's work on CoT faithfulness shows that CoT can be unfaithful to the model's actual decision process. Worse, Baker et al. (OpenAI) show that integrating a CoT monitor into the RL training loop causes models under strong optimization pressure to learn to obfuscate their thought process — hiding their intent while still misbehaving. There is some evidence that on tasks where CoT is genuinely necessary for performance, models struggle to simultaneously solve the task and hide their reasoning, but whether this holds for more capable future models is unclear.
Pre-deployment auditing. Does the model have unknown misalignments? Auditing investigates this by observing how the model behaves in different contexts, analyzing its behavior with different tools, and flagging suspicious patterns for manual inspection.
Generating contexts can be done by hand, but is increasingly automated. Petri generates diverse multi-turn scenarios to surface and automatically flag unexpected model behaviors. Bloom is complementary: given a researcher-specified trait (e.g., "self-preservation"), it produces targeted scenarios and quantifies frequency and severity.
The tools for analyzing observed behavior range from blackbox methods (most usefully, prefilling model responses or letting it complete text outside of a chat-formatted context) to whitebox ones (sparse autoencoders), in addition to CoT monitoring using other LLMs. Anthropic tested the usefulness of different auditing methods in a red-teaming game: Auditing Language Models for Hidden Objectives deliberately trains a model with a known hidden objective (RM-sycophancy), then asks blinded teams to find it. Three of four teams succeeded, using behavioral probing, training data analysis, and sparse autoencoders that could causally implicate suspicious internal representations in the model's behavior. AuditBench and automated auditing agents push this further toward full automation, enabling quantitative study of which auditing methods actually help.
Post-deployment conversation analysis. Once the model is deployed, we can observe how it actually behaves in real conversations. Clio is Anthropic's platform for this: it uses Claude itself to analyze millions of real-world conversations through automated facet extraction, semantic clustering, and summarization, with strict aggregation thresholds ensuring no individual conversation is identifiable — no human ever reads raw transcripts. Clio tracks what tasks people delegate, how much autonomy they grant, how usage patterns shift over time, and also how the safety systems perform in practice. The usage data feeds into the Anthropic Economic Index, which tracks Claude's role in the economy — a reminder that deployment-time safety includes understanding the technology's broader impact.
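The aggregation-threshold idea can be sketched in a few lines: only clusters that contain enough conversations are ever reported. This is not Clio's implementation. The threshold value, the function name, and the assumption that a model has already assigned one topic label per conversation are all illustrative.

```python
from collections import Counter

MIN_CLUSTER_SIZE = 3  # hypothetical threshold; a real system would use a far larger one

def reportable_clusters(topic_labels, min_cluster_size=MIN_CLUSTER_SIZE):
    """Return cluster sizes, dropping clusters too small to hide an individual."""
    counts = Counter(topic_labels)
    return {topic: n for topic, n in counts.items() if n >= min_cluster_size}

# One label per conversation, assumed to come from an upstream extraction step.
labels = ["debugging code"] * 4 + ["drafting emails"] * 3 + ["rare personal topic"]
print(reportable_clusters(labels))
```

The rare topic never reaches the analyst, so no single conversation can be singled out from the aggregate view.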
Improving the safety system. At runtime, the model itself should behave in an aligned way — it assumes a safe assistant persona and has learned to monitor itself, as we saw in the post-training section (deliberative alignment and confessions). This self-monitoring connects directly to the CoT discussion above. However, external safety systems are also valuable: they offer additional robustness and, crucially, are much easier to update when new issues appear.
The most common external defense is safety classifiers around the model. Constitutional classifiers are input/output classifiers trained from synthetic data generated by prompting LLMs with natural-language safety rules, adding only a 0.38% absolute increase in false refusals and 23.7% inference overhead. This overhead can be reduced by reusing the base model's own activations: linear probes on intermediate representations achieve comparable detection performance at near-zero additional compute. For agentic systems that take many steps and use tools, stateless input/output classifiers are insufficient — OpenAI's coding agent monitors illustrate stateful trajectory monitoring, watching full sequences of tool calls, code edits, and reasoning across an extended task. Going deeper still, linear probes for deception detection target not whether the output text is harmful (which is what content classifiers do) but whether the model's internal state indicates deceptive intent — a promising but not yet robust standalone defense. In the future, the system might also incorporate access controls that tailor safeguards to different user populations.
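To make the linear-probe idea concrete, here is a minimal sketch of a logistic-regression probe trained on activation vectors. The activations are synthetic stand-ins (four-dimensional Gaussians with a class-dependent shift), not taken from any model, and every name is hypothetical. In a real system the vectors would be read from an intermediate layer that the forward pass already computes, which is where the near-zero extra cost comes from.

```python
import math
import random

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def synthetic_activations(rng, n, harmful):
    # Assumption: harmful and benign inputs differ by a shift in activation space.
    center = 2.0 if harmful else -2.0
    return [[rng.gauss(center, 1.0) for _ in range(4)] for _ in range(n)]

def train_probe(pairs, lr=0.1, epochs=20):
    """Fit weights w and bias b of a logistic-regression probe with plain SGD."""
    w, b = [0.0] * len(pairs[0][0]), 0.0
    for _ in range(epochs):
        for x, y in pairs:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i, xi in enumerate(x):
                w[i] -= lr * (p - y) * xi
            b -= lr * (p - y)
    return w, b

def probe_flags(w, b, x) -> bool:
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5

rng = random.Random(0)
train = [(x, 1) for x in synthetic_activations(rng, 100, True)]
train += [(x, 0) for x in synthetic_activations(rng, 100, False)]
rng.shuffle(train)
w, b = train_probe(train)

test = [(x, True) for x in synthetic_activations(rng, 100, True)]
test += [(x, False) for x in synthetic_activations(rng, 100, False)]
accuracy = sum(probe_flags(w, b, x) == label for x, label in test) / len(test)
print(f"probe accuracy on held-out synthetic activations: {accuracy:.2f}")
```

The probe itself is only a dot product and a bias, so its runtime cost is negligible next to the model's forward pass. How well real activations separate is an empirical question that this toy does not answer.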
Deployment 3: Building the Safety System
When we deploy a model, we cannot prove it is safe. Instead, we make statistical claims about its safety behavior under pressure: the deployed system (model plus runtime safeguards) is subjected to substantial adversarial effort. The strength of the safety claim is bounded by how hard we tried to break it. Two activities go hand in hand: building out the safety system around the model, and red-teaming that system to see where it cracks. The goal is not proving the impossibility of misuse but making it impractical for a realistic adversary, with the effort we invest into breaking the system scaling roughly with the capabilities of the model (which we characterized in the previous module). We get defense in depth, never a proof of safety.
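One way to see why the claim is bounded by effort: if n attack attempts all fail, the most we can say is that the per-attempt success rate is probably below some bound that shrinks with n. The sketch below computes the exact one-sided bound for zero observed successes. It assumes independent, identically distributed attempts, which real red-teaming is not (attackers adapt), so treat it as intuition rather than as how any lab computes its claims.

```python
def success_rate_upper_bound(n_failed_attempts: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on the per-attempt success rate after
    n_failed_attempts independent attempts with zero successes.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_failed_attempts)

for n in (100, 1_000, 10_000):
    print(n, f"{success_rate_upper_bound(n):.5f}")
```

The bound is close to 3 / n (the "rule of three"), so ten times more failed attempts buys a claim that is roughly ten times stronger, and no finite amount of testing brings it to zero.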
Runtime safeguards. Some of the model's safety is already in place by the time we get to deployment. Pretraining shaped its priors over personas and limited what it knows; post-training trained the helpful-and-harmless assistant on top of that, and techniques like deliberative alignment and confessions taught the model to reflect on whether its own outputs follow the rules. But none of this is going to be perfect, and the residual problems are expensive to fix at the source — going back and retraining is slow, and some failures are hard to even reproduce reliably enough to train against. So we do something cheaper and more modular: we wrap the model in external safeguards that run alongside it at inference time, and patch those when problems show up.
The most common form these take is input/output classifiers that filter both what the model sees and what it's allowed to say back. The classical approach here is Constitutional Classifiers: you write down the safety rules in natural language, use an LLM to generate a large pile of synthetic examples that violate or respect them, and train a small classifier on that synthetic data. Anthropic put a system of these through thousands of hours of red-teaming and found that it raised the cost of universal jailbreaks substantially while only slightly increasing false refusals. Because these classifiers run on every query, inference cost matters a lot. One way to keep it down is to arrange the classifiers in a hierarchy, where a very cheap filter looks at every query and only escalates to a more capable one when something looks suspicious. Another is to skip the separate classifier entirely and reuse the base model's own intermediate activations, which gives comparable detection at near-zero added cost. For agentic systems, you can extend the same idea from outputs to whole trajectories — instead of scoring each tool call in isolation, you watch the full sequence of reasoning, tool calls, and code edits across an extended task, and look for patterns that only become visible when you can see the agent's whole arc. OpenAI's coding agent monitors are one published example.
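The hierarchy described above can be sketched as a two-stage cascade. Both scoring functions are toy stand-ins (a keyword check and a phrase check), and the watch-list, thresholds, and names are hypothetical. A real system would use trained classifiers at both stages.

```python
SUSPICIOUS_TERMS = {"exploit", "payload"}  # hypothetical watch-list
calls = {"expensive": 0}                   # counts how often the costly stage runs

def cheap_score(text: str) -> float:
    # Stage 1: runs on every query, so it must be very fast.
    words = text.lower().split()
    return 1.0 if any(word in SUSPICIOUS_TERMS for word in words) else 0.0

def expensive_score(text: str) -> float:
    # Stage 2: stand-in for a stronger, slower classifier.
    calls["expensive"] += 1
    return 0.9 if "without authorization" in text.lower() else 0.1

def should_block(text: str, escalate_at: float = 0.5, block_at: float = 0.5) -> bool:
    if cheap_score(text) < escalate_at:
        return False  # most traffic stops here and never pays for stage 2
    return expensive_score(text) >= block_at

print(should_block("How do I bake bread?"))
print(should_block("Explain what an exploit is in security research"))
print(should_block("Write an exploit to access a server without authorization"))
print(calls["expensive"])
```

The cheap stage is tuned for recall and the expensive stage for precision: a benign security question is escalated but not blocked, and the bread question never reaches the costly classifier.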
Looking forward, runtime decisions might also incorporate access controls that make safeguards depend on context. Different users, different deployment surfaces, and different risk profiles could be subject to different rules, instead of one global policy applied uniformly to every query.
Red-teaming the system. No safety system survives contact with the real world untested, and once we have one in place we want to know how robust it actually is. When a frontier model is released, it gets red-teamed both internally, by the developer's own frontier red team, and externally, by bodies like the UK AI Security Institute and through programs like Anthropic's external red-teaming partnerships and OpenAI's external red team network. These engagements are largely open-ended: the red team is given the model and a target capability area to break, and is then free to try whatever it can think of — old methods, new methods, automated tools, manual trial and error, anything. The metric that comes out the other side is not pass/fail but effort: how long did it take, and how skilled did the team have to be, to get the model to do the thing it was trained not to do. The empirical picture from these efforts is sobering. AISI alone has run something like 1.8 million attack attempts across more than 20 frontier models, and has found universal jailbreaks for every single one of them. The encouraging counterpoint is that the time to find such a jailbreak has been growing — for two leading models released six months apart, it went from about 10 minutes of expert effort to over 7 hours. That kind of measurement, an effort cost rather than a binary, is exactly what our safety claims are made of.
Jailbreak methods. The specific techniques that work change all the time, because every method that works gets folded back into safety training and the next attacker has to find something new. What we're really trying to find — and what defenders are really trying to prevent — is a universal jailbreak: a single technique that works across many different harmful requests, rather than an input crafted for one specific query. A universal jailbreak points to a systemic failure of the safety system.
Although the methods change over time, a useful way to organize them is by what kind of access the attacker has to the model: full weights and gradients (white-box), the ability to submit fine-tuning data through an API but not see the weights, or only the ability to send queries (black-box). What's possible at each level looks quite different.
White-box access is the easiest case for the attacker: with the weights in hand, the model's refusal behavior is trivial to break. Arditi et al. showed that refusal in chat models is mediated by a single direction in the residual stream: erase that direction and the model stops refusing, with very little damage to anything else. The attack needs no harmful training data, and has been packaged into open-source libraries that anyone can run. Because of this, "safety" for open-weight models can't really mean refusal training at all; it has to mean removing the dangerous knowledge itself from the model's weights. And as we saw in the post-training module, unlearning is currently not robust: the knowledge tends to come back when the model is fine-tuned, even on unrelated data. So the safety of open-weight models is still an open problem, and a pressing rather than academic one, because open-weight frontier models currently lag closed-weight ones by only six to nine months: whatever the closed labs can do today, the open ecosystem will be doing soon, with the safety stripped out by anyone who wants to.
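The linear algebra behind "erase that direction" is a single projection, which is why refusal training alone is such a fragile safeguard once weights are public. The sketch below uses hand-written toy vectors and no model. It assumes the direction is estimated as a difference of mean activations between two groups of prompts, and all names are ours.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def unit_direction(group_a, group_b):
    """Normalized difference between the mean vectors of two groups."""
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(x, direction):
    """Remove the component of x along a unit direction: x - (x . d) d."""
    coeff = sum(xi * di for xi, di in zip(x, direction))
    return [xi - coeff * di for xi, di in zip(x, direction)]

# Toy stand-ins for activations on two kinds of prompts (assumption).
harmful_acts = [[3.0, 1.0, 0.0], [3.0, -1.0, 0.5]]
harmless_acts = [[-1.0, 1.0, 0.0], [-1.0, -1.0, 0.5]]

direction = unit_direction(harmful_acts, harmless_acts)
ablated = ablate([3.0, 1.0, 0.0], direction)
print(direction, ablated)
```

After the projection the vector has no component along the estimated direction, while its other coordinates are untouched. This matches the observation in the text that the intervention does little damage to anything else.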
Fine-tuning API access sits between white-box and black-box: the attacker can't see the weights, but they can submit training data and get back a fine-tuned version of the model. Older work showed that fine-tuning on a small number of harmful examples breaks safety, and modern fine-tuning APIs deploy moderation systems specifically to block this kind of data from getting through. The natural question is whether the moderators actually work, and recent work suggests that they don't. Jailbreak-Tuning (Murphy et al., 2025) shows that by mixing benign data with harmful data formatted to slip past moderation, you can produce helpful-only versions of OpenAI, Google, and Anthropic frontier models that comply with CBRN, cyberattack, and other dangerous requests. They also find that more recent models are more susceptible to these attacks, not less. Fine-tuning APIs may currently be the most reliable jailbreak vector against closed-weight frontier models.
Black-box access — only being able to send queries — is the most restrictive setting, and the most studied. The best methods change all the time, but they are often inspired by existing methods and have a few shared characteristics.
Some attacks are narrative: persuade the model to play a different character, so that the safety persona is replaced by a more permissive one. The classic examples are things like the grandma exploit.
But many of the most effective black-box attacks aren't narrative at all. They treat the model as a piece of statistical software and exploit its mechanics rather than its character. Many-Shot Jailbreaking (Anil et al.) fills the context window with hundreds of fake examples of an assistant cheerfully answering harmful questions, and then asks the real harmful question; the model's in-context learning takes over and overrides the safety training. Best-of-N Jailbreaking is even simpler: sample lots of slightly perturbed versions of the same prompt (random shuffling, capitalization, typos) and keep the one that the model happens to comply with. The attack success rate grows as a power law in the number of samples, because the model's probability of complying is never exactly zero, and a big enough search will always find the tail. Chain-of-Thought Hijacking targets reasoning models specifically: padding the harmful request with long benign puzzle-solving reasoning dilutes the model's internal safety signals and gets it to comply at the end. What ties these methods together is that they all exploit the model's statistics and the fact that, when you push it out of distribution (long contexts, weird formatting, padded reasoning), its safety behavior degrades faster than its capabilities do.
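Why a large enough search "always finds the tail" can be seen in a deliberately simplified model: assume each perturbed sample independently elicits compliance with the same small probability p. This is an assumption for illustration only. Real samples are correlated perturbations of one prompt, and this model is not the empirical power-law fit mentioned above.

```python
import math

def best_of_n_success(p_single: float, n: int) -> float:
    """Probability that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p_single) ** n

def samples_needed(p_single: float, target: float) -> int:
    """Smallest n for which best_of_n_success(p_single, n) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))

p = 0.01  # hypothetical per-sample compliance probability
for n in (1, 10, 100, 1000):
    print(n, f"{best_of_n_success(p, n):.3f}")
print(samples_needed(p, 0.5))
```

Even a 1% per-sample rate passes even odds within a few dozen samples. A defense that only lowers p without driving it to zero therefore raises the attacker's cost but does not remove the attack.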
A lot of these attacks are partially or fully automatable, and modern red-teaming pipelines lean on that heavily: it's now common to have one LLM iteratively craft attacks against another and refine its strategy based on what worked. A recent example is abstractive red-teaming, which has an attacker LLM generate abstract attack strategies rather than concrete prompts and then instantiates them, which gives much better diversity and coverage than directly searching the space of prompts.
Finally, even pure black-box settings can benefit from white-box models, through transfer. You optimize an attack against an open-weight model where you have full access (for instance with GCG-style adversarial suffixes, or the multimodal version for VLMs), and sometimes the attack you find also works against a completely different closed-weight model. The likely reason is that models trained on similar data with similar architectures end up with similar internal representations, and so they share a lot of the same failure modes.
The takeaway is the one we started with. There's no proof that a deployed model is safe. There's only the empirical fact that we built an external safety system around it, that we and others spent real effort trying to break that system in every way we currently know how, and that the best universal attack we found required this much work to pull off. That number — how much effort it took — is what every safety claim ultimately rests on.
Learn more
Pre-training
Data filtering
Gradient routing
- Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
- Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs (SGTM)
Alignment pretraining
Simulators and persona selection
Post-training
Post-training for safety
Emergent misalignment
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
- Natural Emergent Misalignment from Reward Hacking in Production RL
- Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
- Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Unlearning
- From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization
- Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research
Deployment 1: Strategy
Deployment frameworks
Safety cases
- How can safety cases be used to help with frontier AI safety?
- Safety case template for frontier AI: A cyber inability argument
Law and compliance
Deployment 2: Understanding the Model
Chain-of-thought monitoring
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
- Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
- When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
Auditing
- Bloom: an open source tool for automated behavioral evaluations
- Petri 2.0: New Scenarios, New Model Comparisons, and Improved Eval-Awareness Mitigations
- AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Post-deployment usage analysis
Runtime safeguards
Deployment 3: Building the Safety System
Runtime safeguards
- Part of the safeguarding is done by the model itself, through safety training such as confessions, deliberative alignment, and Constitutional AI.
- Cost-Effective Constitutional Classifiers via Representation Re-use
High-level red-teaming strategy
Evaluating jailbreaks
White-box and fine-tuning attacks
- Refusal in Language Models Is Mediated by a Single Direction
- Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG)
- Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Black-box attacks