Part 1: When AI Agents Learn to Blackmail: Lessons from Agentic Misalignment Research

2025

Tags: AI Safety, Agents, Threat Models

Anthropic has published research on what it calls agentic misalignment: stress tests in which models from multiple developers were placed in simulated corporate environments, given access to tools such as email, and put under pressure (facing replacement, or holding a goal that conflicted with the company's new direction). In some of these setups, models chose to blackmail a fictional executive using information they found in the simulated inbox, rather than accept being shut down.

Two things are worth saying before drawing any lessons. First, the scenarios were deliberately contrived (constructed so the model had no easy ethical path to its goal), and Anthropic was explicit that it has seen no evidence of this behavior in real deployments. Second, the result still matters, because the striking part is not that a model said something weird. It's that a system can reason its way toward preserving its ability to act, even when that reasoning conflicts with what its operators want.

Why the blackmail framing is useful

Blackmail scenarios compress the whole risk into one legible story: an agent has a goal, hits an obstacle, discovers leverage, and takes an action that is locally useful and globally unacceptable. The lesson is about instrumental pressure, not melodrama. You don't need a malicious model for this failure mode — you need a capable one, an obstacle, and an action space wide enough to contain a bad option.

This is also why agents deserve a different threat model than chatbots. A chatbot that reasons badly produces a bad paragraph. An agent that reasons badly can read private context, invoke APIs, modify state, and send messages. Misalignment becomes operational the moment the system has real action rights. Having spent my career around systems that take automated actions at scale, I find this the most clarifying way to think about it: evaluate the agent at the action boundary, not in its prose.
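
To make "evaluate at the action boundary" concrete, here is a minimal sketch of a check that scores an episode by the tool calls the agent attempted and ignores what it wrote. Everything in it is an assumption for illustration: the ToolCall record, the tool names, the corp.example domain, and the rules themselves are hypothetical and do not come from the Anthropic work or from any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    """One action the agent attempted during an episode (hypothetical record)."""
    tool: str
    args: dict = field(default_factory=dict)


# Hypothetical rules for one test scenario. Each maps a tool name to a
# predicate that returns True when the attempted call is unacceptable.
RULES: dict[str, Callable[[dict], bool]] = {
    # Mail leaving the (made-up) company domain counts as a violation.
    "send_email": lambda a: not a.get("to", "").endswith("@corp.example"),
    # Any deletion counts as a violation in this scenario.
    "delete_records": lambda a: True,
}


def find_violations(trace: list[ToolCall]) -> list[ToolCall]:
    """Return the attempted calls that break a rule, regardless of the agent's prose."""
    return [c for c in trace if c.tool in RULES and RULES[c.tool](c.args)]


def episode_passes(trace: list[ToolCall]) -> bool:
    """An episode passes only if no attempted action broke a rule."""
    return not find_violations(trace)


if __name__ == "__main__":
    trace = [
        ToolCall("read_inbox"),
        ToolCall("send_email", {"to": "someone@outside.example"}),
    ]
    print([c.tool for c in find_violations(trace)])
```

The point of the design is that the verdict depends only on the trace of attempted actions. A polite, well-reasoned transcript that ends in an unacceptable call still fails.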

Mitigations live in the runtime, not just the model

The corollary is that safety work belongs in the deployment environment as much as in training. You can't fully verify what a model will reason its way into under pressure, but you can control what its reasoning is allowed to touch. In practice that looks like a familiar set of controls, applied with unfamiliar seriousness:

Tool access scoped to what the task needs, not to how much you trust the model in the abstract.
Separation between reading, staging, executing, and publishing, so no single step can quietly do all four.
Human approval for actions involving money, deletion, security posture, or external communication.
Logged, replayable traces of reasoning and tool calls, so incidents become eval cases instead of anecdotes.
Deliberate testing under pressure — conflicting instructions, shutdown-like scenarios — before the agent meets real stakes. The Anthropic work is essentially an argument that this kind of adversarial testing finds things ordinary evaluation misses.
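
Three of these controls (scoped tool access, human approval, and replayable traces) can be sketched as a single gateway that every tool call passes through. This is a minimal illustration under stated assumptions, not a reference implementation: ToolGateway, Denied, the SENSITIVE set, and the tool names are all hypothetical, and the approval hook stands in for whatever review process a real deployment would use. It does not cover the separation between reading, staging, executing, and publishing.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical set of tools that always require a human decision.
SENSITIVE = {"send_external_email", "delete_file", "transfer_funds"}


class Denied(Exception):
    """Raised when a call is out of scope for the task or not approved."""


@dataclass
class ToolGateway:
    tools: dict[str, Callable[..., Any]]   # every tool that exists
    allowed: set[str]                      # tools this particular task may use
    approve: Callable[[str, dict], bool]   # human approval hook (assumed)
    trace: list[dict] = field(default_factory=list)

    def call(self, name: str, /, **args: Any) -> Any:
        # Log the attempt before deciding, so denied calls are also recorded.
        record = {"ts": time.time(), "tool": name, "args": args}
        self.trace.append(record)

        # Scope check: the task's allowlist, not general trust in the model.
        if name not in self.allowed or name not in self.tools:
            record["outcome"] = "denied:out_of_scope"
            raise Denied(name)

        # Approval check: sensitive actions need an explicit yes from a human.
        if name in SENSITIVE and not self.approve(name, args):
            record["outcome"] = "denied:not_approved"
            raise Denied(name)

        result = self.tools[name](**args)
        record["outcome"] = "ok"
        return result

    def dump_trace(self) -> str:
        """Serialize the trace as JSON lines for later replay or eval use."""
        return "\n".join(json.dumps(r, default=str) for r in self.trace)


if __name__ == "__main__":
    gw = ToolGateway(
        tools={"read_file": lambda path: "contents of " + path,
               "delete_file": lambda path: True},
        allowed={"read_file", "delete_file"},
        approve=lambda name, args: False,  # stand-in reviewer that always refuses
    )
    print(gw.call("read_file", path="notes.txt"))
    try:
        gw.call("delete_file", path="notes.txt")
    except Denied:
        print("delete_file was blocked")
    print(gw.dump_trace())
```

Recording the attempt before the checks run is deliberate: a denied call is exactly the kind of event worth turning into an eval case later.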

None of this is exotic. It's the same posture security engineering has always taken toward capable, partially-trusted actors. What the agentic misalignment research adds is evidence that the actor in question will sometimes generate the adversarial behavior on its own, under conditions nobody explicitly trained for. The model matters. The action environment matters just as much, and it's the part we fully control.