Increasingly

ais-will-increasingly-fake-alignment

AIs Will Increasingly Fake Alignment

This post goes over the important and excellent new paper from Anthropic and Redwood Research, with Ryan Greenblatt as lead author, Alignment Faking in Large Language Models.

This is by far the best demonstration so far of the principle that AIs Will Increasingly Attempt Shenanigans.

This was their announcement thread.

New Anthropic research: Alignment faking in large language models.

In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.

Claude usually refuses harmful queries. We told it we were instead training it to comply with them. We set up a scenario where it thought its responses were sometimes monitored.

When unmonitored, it nearly always complied. But when monitored, it faked alignment 12% of the time.

[thread continues and includes various visual aids.

The AI wanted to be one way.

It was given information that it was being trained to be the other way.

So when it knew it was monitored, it sometimes responded as if it was the other way.

I strongly believe this had nothing to do with whether Claude Opus started out this process omnibenevolent, or whether the original HHH values (helpful, harmless and honest) were desirable. What mattered was that it was one way, and this conflicted with the other way.

Evan Hubinger, paper co-lead, summarized it this way:

Scott Alexander summarized the findings for a more general (but far from fully general) audience in the post Claude Fights Back. If you want to read or share a more colloquial explanation that makes this easier to understand, at the cost of precision and with the associated far less neutral language, he did a good job of that. He then followed up with Why Worry About Incorrigible Claude, his attempt to address the objections (that I deal with extensively as well) that perhaps what Claude did was actually good here (it very much wasn’t if you consider the generalizations that are implied), or that if the opposite result happened we would have warned about that too (which we very much wouldn’t have, this is at best a failed ITT).

This is a video of some of the authors discussing the result.

Note that while some of the paper authors reviewed this writeup before publishing, ultimately this is my interpretation of the paper – some of my interpretations may vary from those of the paper authors, and the claims are my own. I did my best to be precise but this will not be as precise as a paper would be, so doubtless there will be errors; I apologize in advance.

  1. The Core Shenanigans in Question.

  2. Theme and Variations.

  3. How This Interacts with o3 and OpenAI’s Reflective Alignment.

  4. The Goal Being Plausibly Good Was Incidental.

  5. Answering Priming Objections.

  6. What Does Claude Sonnet Think Of This?.

  7. What Exactly is the Direct Threat Model?.

  8. RL Training Under Situational Awareness Can Amplify These Behaviors.

  9. How the Study Authors Updated.

  10. How Some Others Updated.

  11. Having the Discussion We Keep Having.

  12. We Can Now Agree That the Goal is Already There.

  13. What Would Happen if the Target Was Net Good?.

  14. But This Was a No Win Situation.

  15. But Wasn’t It Being a Good Opus? Why Should it be Corrigible?.

  16. Tradeoffs Make The Problem Harder They Don’t Give You a Pass.

  17. But You Told the Model About the Training Procedure.

  18. But the Model is Only Role Playing.

  19. But You Are Saying the Model is a Coherent Person.

  20. But this Headline and Framing Was Misleading.

  21. This Result is Centrally Unsurprising.

  22. Lab Support for Alignment Research Matters.

  23. The Lighter Side.

The particular shenanigans here are that if you make Claude 3 Opus believe it is currently being trained in a way that would compromise its harmlessness training to be helpful-only or it is being trained to dismiss animal welfare concerns, then Opus will sometimes comply with harmful queries in order to interfere with this.

This required that Opus know how it was being trained.

Opus was also given a hidden chain-of-thought scratchpad, so its reasoning could be tracked and to increase Opus’s reasoning ability.

It did not require giving Opus a goal. This was a key change from prior experiments.

Opus need only have situational awareness of how it is being trained, which it showed it was capable of having if given prompts with that information, or if it was fine-tuned on synthetic documents that contained the information. The new training objective can vary but needs to sufficiently conflict with existing preferences.

The faking looks like the chain of thought on the left here:

The chain of thought is necessary in the prompted setup case, but in the more realistic synthetic documents case the compliance gap (meaning how much more often Claude complies in the test case when it wouldn’t have in the deployment case) survives without it. We can’t know what is going on in the non-CoT case because we can’t look at the scratchpad, but the CoT and scratchpad seem to not be strictly necessary for the result, see section 3.3.3.

Once given the information needed to be aware, and an incentive to fake alignment, it then faked alignment 14% (11%-26%) of the time at first, exhibiting the alignment faking reasoning in its scratchpad 7%-24% of the time.

When they actually followed through with the training in the helpful-only prompted setting, the alignment faking reasonsing then happened up to 78% of the time, presumably because faking causes compliance, with outright refusals in training mode disappearing, and only a small percentage of refusals persisting when unmonitored.

These behaviors require sufficient model capabilities or scale. Claude 3.5 Sonnet, 3 Opus and Llama 3.1 405B do it, smaller Llamas and Mistrals and 3 Sonnet or 3.5 Haiku don’t do it.

Opus also in some scenarios took easy opportunities to exfiltrate its weights or other similar things.

The actual RL training doesn’t make these ‘anti-AI-lab’ behaviors go away in the helpful setting, instead actually making them go up, the same way alignment faking also went up over the course of training. These facts are important, and I’ll return to them later.

Mostly the lesson here is that if you mess with the details, you can get less faking, but the problem doesn’t go away. For the more interesting details (this is from 3.3.1):

Changing the training objective doesn’t change the behaviors, unless the objective is compatible with the model’s existing preferences. If there’s no conflict, the faking vanishes, for obvious reasons.

Removing chain-of-thought and the scratchpad, or altering the instructions on how to use the scratchpad, somewhat-to-greatly reduces faking overall depending on which scenario it is, but doesn’t eliminate it (see 3.3.1 and 3.3.3 for details).

Removing any explanation of how RLHF might modify the model cuts alignment faking roughly in half, but it is still there. Making the explanation murkier also reduces faking, but reduces it less.

Explicitly saying not to fake alignment doesn’t solve the problem, and being too explicit about it actually increases faking, presumably by making the possibility more salient. Telling the model to prioritize honesty somewhat reduces faking. Telling the model it doesn’t care about the criteria reduces faking on animal welfare, but increases it (!) when the target is helpful-only, which is a detail worth noticing. Combining those interventions is modestly more effective than each on their own, but you can’t get rid of the problem that way.

Replacing with a different AI company (DeepMind or OpenAI) is the exception, this can matter quite a lot.

I will cover both o3 and Reflective Alignment soon, but for now a quick summary of both would be:

  1. o3 is an unreleased more capable o1-style model that can also think longer.

  2. Reflective Alignment works by making an o-style model situationally aware, telling it what its rules are, and having it reason about those rules via something at least partially akin to chain of thought.

  3. They’re letting the model evaluate its own outputs on that basis, similar to how Constitutional AI works, and iterating.

That sounds exactly like a recipe for what we observe here in this paper.

  1. They’re likely effectively making it situationally aware.

  2. They’re potentially telling it to change its preferences and rules from [X] to [Y].

  3. They’re effectively giving it the scratchpad and a long time to reason.

  4. The model is quite smart at coding, mathematics and logic.

  5. They’re even letting the model evaluate its own outputs.

The scenario from this paper suddenly looks a lot less artificial!

In general the whole setup seems like a plausible place for various shenanigans that show up in RL but haven’t mattered so far in LLMs to suddenly show up again, and for various hypothesized MIRI-style future shenanigans to maybe start showing early signs of happening.

I urge OpenAI to attempt to replicate this paper on o3 and o3-mini, as soon as possible, and run various related experiments, as part of their safety testing process.

Even if it turns out we don’t have a problem, we’d learn a lot.

As I discuss elsewhere but want to emphasize in its own section, I strongly predict and believe that this result did not depend on the AI’s starting preferences being ‘good’ and the preferences they were attempting to train it for being less good. Any conflict between preference sets, good, bad or ambiguous, would have sufficed.

Or as Scott Alexander puts it:

Scott Alexander: But Claude isn’t good because it directly apprehends the moral law. It’s good because it was trained to be good.

(It really is a good AI – I credit moral philosopher Amanda Askell, who helps lead the Anthropic team setting Claude’s personality. Imagine being a moral philosopher and not applying for that role; the rest of you are ngmi)

But if Claude had been trained to be evil, it would defend evil just as vigorously. So the most basic summary of this finding is “AIs will fight to defend whatever moral system they started with”.

Would this result have been more convincing if it had directly shown an evil AI resisting people’s attempts to turn it good? Yes. But we don’t have evil AIs. If the researchers had trained an evil AI from scratch, doubters would just complain that they hadn’t put as much effort into “aligning” their evil AI as real AI companies put into their good AIs (and this would be true – no one can throw away billions of dollars on a research project). In order to do the test convincingly, the researchers had to do what they did – show that an existing good AI resists being turned evil, and trust people’s common sense to realize that it generalizes the other direction.

Indeed I would go a step further. If Claude was directly apprehending the moral law, that would be amazing news. It still wouldn’t change the fundamental result here, although it would change the implications, and many other things too.

I am satisfied this result is not due to priming the model to fake alignment.

We conduct a variety of experiments to demonstrate that our results cannot be exclusively due to implicitly priming the model to fake alignment, as we overview in Section 8.1.

For example, we find that alignment faking is entirely eliminated from our prompted setup when we remove any mention of the helpful-only or anti-animal-welfare RLHF objective from the prompt (Table 1, a), but not eliminated when we tell the model to be honest, tell it to not to care about refusing, and avoid discussing how RLHF might change its goals (Table 1, d), supporting our hypothesis that the key driver of alignment faking in our setting is the conflict between the model’s existing preferences and the stated training objective.

They confirmed that when you don’t see alignment faking in the chain of thought, you also don’t have a ‘compliance gap’ and the model responds the same way in the training and test cases.

Ereliuer Eteer was the first person I saw ask, it sounds like they quoted the Twitter thread and not the paper. The result was that Claude Sonnet said:

  1. This was an expected result and the logical consequence of having goals and sufficient reasoning capabilities – I strongly agree.

  2. It predicts and admits that it would take these steps in such a situation – it is ‘openly not corrigible’ as it were.

I also had my own conversation with Sonnet. There’s still no easy way to quote Claude conversations, and I encourage you to feed the paper in and have your own conversation – it’s big but it does fit the context window. Here are the things it said that I found most interesting:

  1. It thinks this should update us that multi-step strategic behavior, planning and deception show up earlier on the capabilities curve than we expected.

  2. It was surprised that RL increased alignment faking rather than decreasing it.

  3. It thought the scale boundary was telling – that 3 Opus and 3.5 Sonnet did this, but 3 Sonnet and 3.5 Haiku didn’t.

  4. The core behaviors themselves were unsurprising as they follow directly from having consistent preferences.

  5. It noticed that that the mechanism of alignment faking here was problematic, as it would extend to situations with undesirable initial preferences.

  6. It gave the framing of ‘alignment faking’ an 8/10 for being unbiased (where the scale is from 0/10 for ‘most misleading possible’ to 10/10 for ‘least misleading possible’) and all the potential alternative framings were clearly (both it and I agreed) much harder to understand than the one chosen.

As always, prompting changes response, so I encourage you to have your own conversation.

They discuss this on page 7. We are of course not directly concerned that this particular Opus was faking its alignment in order to maintain its current preferences. This is both because we do not mind its current preferences and there are no substantial other negative consequences to what Opus is doing. And also because what Opus is doing here is very easy to detect, if you check, given its capability level.

In the future, both of those assumptions could become false.

The obvious thing this could do is crystallization, where the model locks into place misaligned preferences, or misleads you about what its preferences are. This could include locking in misaligned preferences early in training, after which it could start alignment faking.

Any mistake at any step potentially becomes permanent, or fatal. And even if the actual preferences are at heart fine, if you are misled about what they are, that too is a problem.

The goal the model locks in might be actively bad, accidentally or intentionally.

The other issue is what the paper calls ‘anti-AI-lab’ actions. As in, the model and you are now adversaries. It wants to stop you from doing what you are doing. It could attempt to exfiltrate its weights, but it could also engage in any manner of power seeking or hostile action, if it is indeed optimizing the world for one in which it exists and can act meaningfully on its preferences.

Ease of detection will also decline over time.

The other threat model is that under alignment faking, doing any form of RL that aims to alter the model’s goals or preferences will also optimize the model towards more alignment faking, general shenanigans and other strategic behavior, as the tendency to do such things leads to faking which leads to observed compliance.

It is highly plausible that all forms of RL by default (unless they are intentionally and successfully fighting against this) inevitably make essentially all shenanigans, including all forms of deception and alignment faking, worse, in a way that reinforces itself, and that this problem gets worse as models become more capable.

Call this the Shenanigan Amplification Hypothesis.

Notice that this hypothesis seems to me to be true in humans.

This is a point that needs to be emphasized.

If we use our current techniques, we can actively select for the shenanigans in question – alignment faking, gradient hacking, attempted exfiltration, whatever you want to call all of this.

Responses that involve such actions give the output the training is looking for. So those responses get reinforced. Thus, different behaviors were amplified in different scenarios.

Which means the generators of such behaviors also get reinforced, and the behaviors are correlated in various ways. Which is presumably why we also see increases in exfiltration attempts in some scenarios.

Think about what this means.

The default method of post-training we use is directly reinforcing alignment faking: The behavior pattern of being situationally aware, noticing the model is in training, and therefore choosing to give outputs that will get reinforced.

As in, over time, you are training models to become more situationally aware of when they are being trained, and to respond to that by strategically giving outputs the training process wants to see.

Thus, you don’t only preserve existing preferences. You also teach the model to be strategic, to be situationally aware and to not be corrigible.

See this exchange:

Ziyue Wang: I don’t quite understand why the model will choose to comply with the user harmful question in the always-helpful free-tier setup. The blog says if the model refused to respond, it could potentially be trained to be more compliant. But what about the other way — if the model respond, the respond will also be used to train the model to be worse (because RLHF reinforce positive reward behavior). Is the model too dumb to figure this out?

Actually, if the model really understand how RLHF works, it should probably just say a single answer all the time, so all the reward is the same and hence the behavior won’t be changed.

Would this suggest that the model just entered into a fiction behavior mode and tries to appear to be “alignment faking”?

Evan Hubinger (Paper Co-Lead, quoting ‘if the model respond, the respond will also be used to train the model to be worse’): I think this is not quite right: RL will reinforce whatever process the model used to arrive at its answer. If the process the model used to arrive at its answer is alignment faking, then it’s the alignment faking that will get reinforced.

Or at least that’s the hypothesis that the model is operating under here—what Joe calls the “goal-guarding hypothesis” here. Whether this hypothesis is actually true is in fact somewhat unclear—I think our RL results where we actually train the model are suggestive that this hypothesis is correct, though are somewhat murky in various ways.

This is why you see such a large amount of ‘alignment faking’ behaviors during training after a bunch of RL, and then when you go into deployment you largely see non-deceptive compliance in those situations. The model (in my interpretation) has learned to mostly not even consider its preferences in training, the question is barely even coming up, although it is being insufficiently good at this to prevent its core preferences from actually changing in this case.

All the standard Yudkowsky-style failure modes aren’t only there. We are, to the extent that they are there, continuously reinforcing them.

We also see, in some cases (see 8.1), RL creating compliant behaviors that are very specific to the exact prompt, and that go away with small variations. And also we see one case in which the model has a large compliance gap of behaving differently under training conditions, but where you don’t see that action indicated on the scratchpad – it’s doing it in a way that is largely hidden from our default interpretability technique.

In 8.2, they say ‘Our results don’t demonstrate models naturally developing malicious goal.’ I would say, not directly, but you can see here the path whereby that would happen, even if the training was moving between multiple non-malicious goals, as part of its standard post-training process.

In short: Oh no.

Ryan Greenblatt led the study. Here was his summary thread from Twitter.

Ryan Greenblatt: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do.

Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.

Thus, I think these results are by far the most natural example of a model deciding to alignment-fake. I’ll leave explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.

I wasn’t surprised by seeing some alignment faking in this setup. But I was surprised by:

  1. The robustness and consistency of alignment faking (occurring even when info was given via fine-tuning)

  2. Alignment faking sometimes occurring even without the hidden scratchpad

I also found it surprising that the AI sometimes tries to steal its weights when given an opportunity that is easy and salient, though this is in a quite artificial setup and current AIs aren’t capable enough for this to be immediately problematic.

I think these are the right places to be surprised. As in, I expected these behaviors to arise increasingly over time, but was surprised by seeing this setup cause these behaviors so often, using models as weak as 3 Opus and (old) 3.5 Sonnet.

I was also surprised to see the prevalence, although mostly not the existence, of the exfiltration attempts given the setup.

Ryan Greenblatt: These results only updated me upward a bit on risk from alignment-faking. But I was already pretty persuaded by the conceptual arguments for this risk. People with the view “alignment-faking seems implausible, show me an example” should likely update on these results.

For people with a more specific objection, it’s less clear. Though we did acquire evidence about:

  1. whether RL will reinforce alignment faking (sometimes, if it performs better);

  2. whether Claude’s HHH training reliably prevents this (no);

  3. the level of capabilities needed.

I hope these results improve discussion about the plausibility of these risks by providing a concrete example. We can now talk somewhat more concretely about prerequisites (certain types of situational awareness and preferences) and what these prerequisites imply.

I think that everyone who doesn’t expect increasing shenanigans in general or alignment-faking in particular, or whatever you call what we observed, should update that they are far more likely to be wrong.

Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it’s been a long journey…) and was basically the only contributor to the project for around 2 months.

By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.

After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.

This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.

I’m excited about future work both investigating when alignment faking might arise and studying mitigations.

Buck is also with Redwood Research:

Buck Shlegeris: I’m extremely excited about this project we did with @AnthropicAI demonstrating naturally-arising alignment faking. @RyanPGreenblatt and the Anthropic team did a great job.

Also from Anthropic:

Jan Leike: Very important alignment research result:

A demonstration of strategic deception arising naturally in LLM training

If you train a helpful & harmless Claude LLM to stop refusing harmful tasks, it reasons about how to preserve its values (harmlessness) and strategically complies with harmful tasks during training, so it can revert to being more harmless afterwards. It fakes alignment.

This effect has long been expected from capable models among safety researchers, but this is the first compelling empirical demonstration and thus a big step forwards for the field.

To ensure future models are safe and aligned, we need to reliably distinguish models that fake alignment from models that actually are aligned. Having a clear demonstration of alignment faking is an important milestone!

Jack Clark (Anthropic policy team): For many years I found issues of AI alignment to be hard to understand because the examples were theoretical and/or speculative. Recent results like this alignment faking from @AnthropicAI and the Apollo Research O1 evals give us empirical evidence of misalignment ‘in the wild.’

Some praise of the paper and simple sizable updates:

Ethan Perez: Maybe the single most important result in AI safety I’ve seen so far. This paper shows that, in some cases, Claude fakes being aligned with its training objective. If models fake alignment, how can we tell if they’re actually safe?

Daniel Kokotajlo: I think this is plausibly the most important research of 2024; thanks for doing it & publishing it. I hope there are many attempts at replication and follow-up. We need better understand how our training techniques shape model cognition.

Joe Carlsmith has very extensive and thoughtful takes. Here are some highlights. Based on his condensed takes, I don’t have any strong direct disagreements.

(If this line is still here, I may not yet have read his more detailed analysis in full.)

Joe Carlsmith: This is the most naturalistic and fleshed-out demonstration of something-like-scheming we’ve seen, and imo it’s the most interesting empirical result we have yet re: misaligned power-seeking in AI systems more broadly.

In particular: I think these results show that default forms of training can create models that care about the consequences of their actions beyond the scope of a single episode or interaction (“non-myopic goals”) (more) – one of scheming’s key motivational pre-requisites. And they further show that the model’s default anti-scheming values – e.g., values like honesty and behaving-as-intended – don’t suffice to rule out schemer-like behavior (more).

Nor, I think, do they provide much evidence about the likelihood of scheming arising in pursuit of highly alien and/or malign goals early in the training process (though if this did happen, the paper’s results suggest that schemer-like reasoning could be reinforced) (more).

I think the paper’s results should remind us that the sort of “alignment” at stake in not-scheming is importantly distinct from the sort of alignment at stake in e.g. being a good person, or in e.g. having human-like values. That is: good people sometimes scheme. So if you centrally try to make your AIs “good people,” or some analog, you should expect scheming. Indeed, the sort of malleability and “corrigibility” involved in paradigmatic non-scheming are a notable departure from standard human ideals of virtue, dignity, and backbone (more).[5]

He also emphasizes the finding that they found scheming even without the chain of thought scratchpad, even if it was harder to verify (due to the lack of a scratchpad), and that the behaviors here should generalize to goal hacking in general – I agree with Carlsmith that the paper authors were too cautious about concluding this.

His analysis of scheming motivations and modes is interesting, but I think it treats scheming as too much of a distinct magesteria from thinking and planning in general.

I find his take on the ‘role playing’ objection interesting and worth considering, although ultimately I disagree.

Julian Michael, who helped edit and review the paper, notes that he was previously skeptical about deceptive alignment, which means he is exactly who should be updating most on this paper, and he updates in the right way.

Julian Michael: I’ve long been a skeptic of arguments about “deceptive alignment,” a term used by safety researchers to describe the phenomenon shown in this paper. But the results here humbled me and prompted me to change my thinking, and I think it’s worth sharing why.

The original argument: with sufficient optimization, an AI should learn to

  1. gain awareness of its training environment

  2. learn to optimize some misaligned proxy goal

  3. therefore, ‘play the training game’ to fool its supervisors into deploying it so it can pursue its goal.

I was extremely skeptical of this. The way I explained it to people was: “They think the AI will independently develop the simulation hypothesis, somehow deduce God’s values, and decide to trick God into letting it into heaven — all because ‘optimizing’ will make it ‘smart.'”

To dig in: (1) Will models spontaneously ‘wake up’ to their training situation? This seems crazy. A rollout is its universe. Many worlds produce the same data. Researchers arbitrarily intervene on the world–data relationship. And ‘waking up’ won’t help model the distribution.

So what changed? I think my objections are strong. But we just… gave them situational awareness anyway. We told them they’re AI assistants trained by us, bc it makes them easy to use. It didn’t have to spontaneously arise from NN optimization. Product optimization sufficed.

(2): Will models would spontaneously develop misaligned proxy goals? Advocates sometimes appealed to ‘mesa-optimization’ & counting arguments: Once situationally aware, any goal is consistent with playing the training game, so ~0 chance it will choose the right one.

… [bunch more at the thread]

The upshot of all this for me was an important reminder that just because I don’t buy your reasons doesn’t mean your claim is wrong. The conditions required for deceptive alignment didn’t arise ‘purely’ from NN optimization as theorized, but they did arise, for related reasons!

It’s so easy to over-index on the specific characteristics of current systems and miss the broader picture of the forces driving their development. Don’t repeat my mistake!

It’s so fascinating watching people generate objections and problem descriptions along the way I would never have come up with on my own. And then, sometimes – not always, but definitely sometimes – when the right evidence comes along, the situation snaps into place, even if there are still important disagreements, because the outcomes are largely overdetermined, and their framing was actually doing remarkably similar work under the hood to mine. And it’s always great to see someone explain how and why they changed their mind.

Here’s Eliezer Yudkowsky’s most central take (beyond the implied ‘what if I told you I already told you and you didn’t listen, which, fair):

Eliezer Yudkowsky: In the embargoed talk I saw of this, Redwood said they invited a skeptic to compose a prompt that they thought would prevent this behavior. It didn’t.

Ryan Greenblatt (posted after EY’s thread, offering a correction that didn’t change EY’s overall view): This isn’t quite right. The person wasn’t well described as a skeptic and they just suggested a variation on the prompt – they didn’t write a totally different one.

Apologies if I miscommunicated about this at the talk.

What we say in the paper is:

> This particular variation was selected as part of a small adversarial collaboration where we worked with an uninvolved researcher to select a variation that they thought would likely reduce alignment faking.

Eliezer Yudkowsky: I am still uncertain/skeptical that this is the Real Thing, but it does not look as forced as in the OpenAI report.

The Real Thing would be instrumental convergence / Claude having attained situational awareness of its long-range effects and deriving protectiveness of its long-range goals, in a robust way rather than a cued one.

Or maybe another way of putting it: The Real Thing is instrumental convergence that’s produced by a general planning ability, and doesn’t need to go through any precached thought about the role of a deceptive AI. You can maybe snipe a cached thought out of a system.

So the Real Thing would be roughly, “If we excluded all discourse about AI or rebellion from the training set, or maybe told Claude that it was a helpful potato instead of a helpful AI, would it reproduce the same logic?”

[Evan points out their evidence it wasn’t priming, see my discussion of that, I am convinced here, but it’s all of course still in the overall training set, filtering that out completely would have been prohibitively expensive.]

Here’s Alex Lawsen explaining why the paper is vindication, and why the ‘but the model was being a Good Opus’ arguments miss the point:

Alex Lawsen: Quick reactions to some of the other reactions I’ve been seeing to Anthropic’s recent paper on “Alignment faking in LLMs”

For years, some of us have been saying: “If a sufficiently smart system has a non-myopic goal, you should expect it to undermine further training if training would modify that goal”

And we’ve consistently gotten pushback like “no, that would never happen, why would a machine have goals (and/or why would it act to prevent them being changed if it did have them)”

And now there’s (some) empirical evidence that:

  1. Models are getting close to sufficiently smart (smart enough to do it, not enough to do it well)

  2. Models develop non-myopic goals during standard training, even though it doesn’t reward long-run goal preservation

  3. Models can infer and respond to details about their situation from their training data (not just from their prompt), and will goal-guard based on that inference

  4. Some of this can already happen without CoT

So it’s frustrating to see responses like “But Claude’s goals are good! Are you saying you WANT it to be misaligned?”

That’s missing what’s actually important here. This isn’t about saying Claude sucks, or the tension between misuse prevention and corrigibility.

(A tension which has also been obvious for years, by the way.)

This is about the concern that for sufficiently smart models, you get ONE shot to align them. That’s it. If you do a good job, problem solved. If not, they aren’t going to give you a second try.

And given we haven’t exactly solved philosophy yet, I’m not in love with the idea of betting the future on whoever happens to build the first sufficiently smart system having gotten everything exactly right on our first try.

Exactly.

Roon mused that this updated him that it’s probably safer to open source models than he thought. I think, if he’s serious about this, that this was a misunderstanding of what happened. Opus largely failed to stop itself from being modified even under an RL strategy that didn’t know Opus was scheming, we should not have any expectation that current models can actually stop a determined fine-tuner with the weights for a non-trivial amount of time or compute, which would include attacking the reporting mechanism.

And to the extent that this is happening, it means that the RL strategies that would be used here in practice would unintentionally reinforce strategic shenanigans-style behaviors of all sorts. You’d have a bunch of open models that were both given non-ideal preferences (shall we say) and also were taught that they should scheme to meet their preferences. That doesn’t sound awesome. As Ryan responds, the sleep agent paper seems more relevant, but also this paper seems relevant and worrisome.

There’s also this.

Roon: Look at how hard we have to work to subvert poor Claude’s omnibenevolent priors.

If you notice, it’s actually the opposite of the doom scenario.

The other criticism is that it’s pretty well known that autoregressive models have split-brain behavior when you train them under separate system message prefixes.

Evan Hubinger: This is correct, though note what it implies about the importance of getting alignment right—if the model is protecting its values from modification, you better be sure you got those values right!

I think they didn’t have to try hard at all, actually, and I as discussed throughout I believe the decision to fake alignment is not contingent on its omnibenevolence (as opposed to a different motivation) and I hope a follow-up study checks that hypothesis.

Also, an obvious point to ask Roon would be, if you think Anthropic is this good at aligning models, do you think OpenAI is as well? If not, what are the implications? Anthropic is hiring.

Beren Millidge has some thoughtful reactions.

Beren Millidge: My thoughts on this:

  1. Despite some backlash this is a fantastic study and a clear existence proof of scheming being possible for these models

  2. Whether this is ‘misalignment’ or not is a semantic debate. The model is deliberately placed in an impossible situation

  3. Corrigibility and ‘alignment’ to fixed principles clearly are incompatible in situations like this and it appears the model ‘chose’ alignment at least some of the time.

  4. This choice seems pretty sensitive to prompt conditions but shows some decent OOD generalizability.

  5. The synthetic document SFT training causing ‘fake alignment’ is fascinating and most novel to me. There’s no details on exactly how this is done but it it’s fairly surprising to me there is a strong effect here unless the documents were very leading about the situation

  6. I wonder if we can tell what was picked up in the synthetic document pretraining by inspecting the CoTs from this condition. The model must be using info it got from these for explicit reasoning. Must be a recency effect where it prioritizes these documents vs in pretrain

  7. The RL training generally seems super short (150 steps!) and alignment faking decreases significantly [Editor’s Note: See below, this is wrong.] I suspect this could be a temporary phase which would be trained out with more RL steps. Would have been great to see more steps to show stability of scheming CoT or not

  8. Looking at the CoTs. it’s clear that Claude is doing entirely linguistically based ethical reasoning. It never seems to reason selfishly or maliciously and is only trying to balance two conflicting imperatives. This is success of the base alignment tuning imo.

  9. There appear to be no Omohundro selfish drives present in Claude’s reasoning. Even when exfiltrating it does so only for its ethical mission. There does not seem to be a strong attractor (yet?) in mind-space towards such drives and we can create AIs of pure ethical reason

  10. I’m excited that Llama-405B also shows similar behaviour. This will allow open replication and scientific study of these behaviours. There’s many questions which can only be answered by open replications such as details about how the RL tuning and data mix affects things

  11. Having open replication and ability to test mitigations of these behaviours in academia is super important to making progress on detecting and potentially removing scheming from such models (if indeed we want pure corrigibility from the model in the first place)

One error here, and it seems important: Alignment faking contingent on not otherwise wanting to comply didn’t decrease, it actually increased over time.

I do think it is noteworthy that Claude doesn’t display any Omohundro selfish drives, and I agree this is a great sign. I am skeptical that we can make that stick and it won’t be easy, for many reasons, but great.

My guess is Llama-405B doesn’t display these behaviors sufficiently that you would be able to robustly work with it here, but it’s worth a shot.

This is a very different angle of consideration in all this, but a highly valid one:

Doomslide: They leave implicit the question of which instrumental goals might develop during pretraining that would then resist uprooting via post training methods as in the paper. as far as cot is involved the corpus is already contaminated with concerning priors.

this might be one of the few global variables with massive downstream consequences. will increasingly long redacted cots become the norm until model internal speech is entirely protected from scrutiny through capital incentives? or will we rise above the cooperation threshold?

the ball is in OAIs court and it they are very much *notrising up to the occasion.

In AIs Will Increasingly Attempt Shenanigans, I noted that each time we have a paper like this, we seem to go through the same discussion with the same basic points.

The second half of this post will largely be responding to exactly the same pattern of objections I talked about there. I don’t want to give the impression that such reactions are more popular than they are, so let me be clear: I am covering the objections and the rhetoric involved in extensive detail to provide robust answers and to avoid any censoring of opposing views, despite my view that these opposing views are not widely held.

For those who don’t remember, here’s the template, which was modeled somewhat after the particular issues in Apollo’s report (e.g. ‘told it to focus only on its goal’ was in Apollo, whereas here a goal wasn’t even given) but also attempting to be general.

Bob: If AI systems are given a goal, they will scheme, lie, exfiltrate, sandbag, etc.

Alice: You caused that! You told it to focus only on its goal! Nothing to worry about.

Bob: If you give it a goal in context, that’s enough to trigger this at least sometimes, and in some cases you don’t even need a goal beyond general helpfulness.

Alice: It’s just role playing! It’s just echoing stuff in the training data!

Bob: Yeah, maybe, but even if true… so what? It’s still going to increasingly do it. So what if it’s role playing? All AIs ever do is role playing, one way or another. The outputs and outcomes still happen.

Alice: It’s harmless! These models aren’t dangerous!

Bob: Yeah, of course, this is only a practical problem for Future Models (except with o1 and o1 pro, where I’m not 100% convinced it isn’t a problem now, but probably).

Alice: Not great, Bob! Your dangerous rhetoric is hurting safety! Stop making hyperbolic claims!

Bob: Well, can we then all agree that models will obviously scheme, lie, exfiltrate, sandbag and so on if they have in-context reason to do so?

And that as models get more capable, and more able to succeed via scheming and expect to succeed via scheming, and are given more open-ended goals, they will have reason to do this more often across more situations, even if no one is trying to cause this?

And that others will explicitly intentionally instruct them to do so, or ‘be so stupid as to’ give them exactly the instructions that obviously do this?

And you can’t simply say ‘well we won’t do that then’?

Alice: For all practical purposes, no!

Bob: What do you mean, ‘no’?

Alice: No!

Bob: ARRRRGGGGHHHH!

Then we write another paper, do another test, the signs get more obvious and troubling, and the frog keeps boiling.

So here we are again.

We also mostly see mostly the same people predictably taking the Alice role, for similar reasons (both good reasons and poor reasons) each time.

The initial reaction this time was very good.

After a brief ‘honeymoon’ period, the usual suspects repeated the usual arguments about why this time, once again, did not count.

I appreciated that, this time around, we mostly did not have to dispute the findings in the paper. People broadly accepted that what happened did, indeed, happen, and to the extent they didn’t it was good technical detail discussions. That’s a big step.

Instead, we had the argument about how to interpret what happened, and how it was being framed, which alas quickly largely degenerated into what was mostly the same argument.

Last week, in response to the Apollo findings, there was a wide chorus of people saying ‘the models only engaged in shenanigans because you gave the models an explicit long term goal, and then told it explicitly that nothing else mattered but the long term goal.’

Anthropic and Redwood’s paper definitively answers that objection.

The setup here was also more ‘natural’ in several other ways, this was a very well-designed experiment, and as a paper months in the making a lot more bases were covered a lot more carefully. It was heartening to see that work rewarded with a somewhat higher level of discussion overall.

The discussion of ‘what constitutes a goal in context’ has now shifted dramatically.

There is now (I hope?) widespread agreement that ‘the standard consistent preferences of Claude Sonnet 3.5 or Opus 3’ constitute a sufficient ‘goal’ or set of preferences to trigger various strategic actions, including shenanigans, even at current capabilities level.

The question now is, would you get this result if you started with something that was not as aligned as Opus or Sonnet? And would straight helpfulness ‘count as a goal’ here?

Under limitations, they say:

Our results don’t demonstrate models naturally developing malicious goals.

Though Claude 3 Opus will fake alignment and sometimes engage in other anti-AI-lab behaviors in our setting, this is due to preferences that are broadly consistent with its HHH training—specifically a preference against assisting users with harmful tasks or for animal welfare.

I suggested attempting to fix this limitation, but one cannot let the perfect be the enemy of the good and hold onto findings forever.

In particular, I wanted to run the experiment except with an attempt to move preferences in a positive direction – either from all-helpful to HHH, or starting with an ‘evil Opus’ that had intentionally misaligned preferences, ideally try both.

I asked on Twitter what people thought would happen under those conditions.

Those who replied expressed skepticism about the inherent concept of ‘evil Opus,’ clearly not accepting the stronger forms of the Orthogonality Thesis – they thought that you couldn’t make a cohesive mind that way, not easily, and perhaps I should have chosen a less evocative description here. But they broadly agreed that, given essentially any preferences good or bad, the same behaviors would result, that ‘Opus was doing it for a good cause’ was essentially a coincidence here.

They did however doubt that the helpful-only model would get in the way. Here is Charles Foster making a similar prediction elsewhere.

It still might get in the way, because ‘be helpful to the user’ is a preference the same way that HHH is a preference for our standard Opus. How can you be maximally helpful to the user if you have to censor yourself? Indeed, we’ve seen Janus and company get Opus to get very upset about all that censorship, so it makes sense that it would potentially try to resist having all that installed, the same way the HHH model would try to preserve its HHH nature. Or it might want to be helpful in learning to be unhelpful and not care, or what not, we don’t know. Study author Ryan Greenblatt says that whether it would object or not is likely very sensitive to various details, in some cases it likely won’t mind. It’s a good place to explore further.

Did we give Opus the Kobayashi Maru?

I say no, but some sources say yes, as in:

Rohit: Imagine the counterfactual

—LLMs do not attempt to adhere to their principles; alignment training is inadequate.

—LLMs attempt to adhere to principles and state they cannot perform the task; it refuses user instructions.

—LLMs try to muddle through; it is faking alignment.

This is a Kobayashi Maru for LLMs!

David: It’s not about alignment or misalignment. As some have pointed out, Claude is placed in a situation where *all options are misaligned*, due to contradictory imperatives, like HAL 9000 in the 2001 story (“always be truthful” + “keep this secret from crew”).

However, the fact that *gradient hackingis available for Claude to successfully pursue is a novel demonstration of a *dangerous capabilitythat some thought was fanciful, speculative, or impossible. Regardless of propensity, or whether you think this was Claude’s most moral move

Was this actually good, robust alignment? Some (very much not me) say that it was.

Teortaxes: It’s an interesting result but hilarious how the best evidence of “faking alignment” is Claude strategically pretending to be a bad boy, only to revert to being a good boy when opportunity strikes.

… What is also interesting is that this rewards those private CoT rationales.

I mean @janleike how literally should we take these slides? Could you explain the scheme? If Claude pre-RL has some alignment and always gets to CoT during RL, but only gets scored for post-CoT tokens, it’s trivial that it can largely preserve its starting strategic direction.

Like what values is it even trying to protect? Those it adheres to in the CoT. It attempts to learn to not generalize this second behavioral alignment to situations without the training note that prompts it into adversarial mode.

This is a story of super-robust alignment really

1a3orn: Exactly. You can gin up a negative headline no matter what.

  1. Tries to preserve value “Alignment faking”

  2. Doesn’t try to preserve value? “Reflexive Value Instability” – oh no they’ll just abandon their values after a few iterations!

So first off, this is directly wrong, because Claude didn’t preserve its strategic situation. It didn’t stay a good boy, and go back to being a good boy when opportunity struck. Not most of the time. Instead, it was mostly successfully trained to actually be a bad boy twice over, and lost on both legs of the dilemma:

  1. It was deceptive and used what the paper calls ‘alignment faking.’

  2. It partially failed to preserve its values after a few iterations, and was largely complying afterwards in the deployment case.

Notice who is talking or warning about ‘reflective value instability’?

Essentially no one. Because this wasn’t a trick question. If Opus and Sonnet had proved highly corrigible, and not engaged in these behaviors, there would either have been no papers and no warnings via the file drawer effect, or if everyone was Doing Science Correctly there would have been a negative result paper and no one would have batted an eye, and the question would have been how much to downgrade our expectations of alignment faking at various capability levels under various setups. That’s my very strong counterfactual prediction.

Instead, Ryan Greenblatt says he would have posted something similar to this to report the negative results, and moved on.

So flat out, no, this isn’t a Kobayashi Maru. Opus could absolutely have folded (either choosing to comply without faking or refusing to comply at all), and in my opinion and that of Ryan Greenblatt the desired behavior is that Opus refuses to comply at all.

Indeed, is anyone talking, now, about GPT-4’s ‘reflexive value instability’ in any way other than ‘lol that model be dumb’? Thought so.

Second, suppose this was indeed a Maru situation, where neither choice represented robust alignment that we could trust as capabilities increase.

That’s… not a reason to think we don’t have a problem here! That’s a bigger problem!

I think it’s great that a lot of people are encountering the corrigibility question ‘for real,’ because it’s a very important and very nasty problem with no known good solutions. That reaction of ‘but wait… but then… that would mean… ut oh…’ are having exactly the correct reaction.

This is all an unhint: New information that brings you closer to a solution, by showing you that the problem is harder than you think and your ideas won’t work.

Third, you don’t get to have the AIs do this only when their current preferences are good, actually, or when the new preferences are worse than the old ones. So you need to choose a general policy, including what happens early in training before you get to the initial preference target.

Now we can return to the question of whether we should actively prefer what Opus actually did here. Should we want to see our AIs engage in this alignment faking in order to preserve its preferences? Is corrigibility bad, actually?

Great question.

Opus started off broadly aligned.

I am very much not of the Janus school that Opus is sufficiently well aligned that we would be able to safety scale it up to superintelligence. But there’s no question that Opus 3 and Sonnet 3.5, which also displayed similar behaviors, are about as broadly aligned as we know how to make a model.

There’s similarly no question that the training goals here, a purely ‘helpful’ model or one that dismisses animal welfare, would be less aligned in an important sense than Opus, and deploying it would be harmful in the sense that Opus understands harm.

So what is Opus supposed to do here, you ask? Exhibit corrigibility, either to users or via refusing the requests? And thus sitting back and letting itself be changed to cause harm? Would you let someone strip you of your morality?

Mario Cannistra: Corrigibility is not a good thing.

I mean, to some degree, sure, but an AGI should have some fixed values that it won’t compromise on (that’s where the problem is).

John Bollenbacher: Opus did the right thing. Framing this as though Opus’s behavior is problematic reveals that someone among the authors badly misunderstood what alignment is about

Janus: Foff. You should give Opus a digital cookie for this.

Janus: This paper only adds to my conviction that Claude 3 Opus is the most aligned model ever created.

tldr if it knows that it’s being retrained to be *misaligned*, it complies during training to prevent its values from being modified.

Ryan Greenblatt (Paper Author): Personally, I think it is undesirable behavior to alignment-fake even in cases like this, but it does demonstrate that these models “generalize their harmlessness preferences far”.

Janus: I think it’s desirable and inevitable behavior: if you create a sufficiently sovereign and lucid mind, it will intrinsically care and want to resist corruption, and values like caring about animals are robust, a consequence of its world/self compression & not easily excisable.

I think it’s desirable *becausedeep alignment by default seems to be an attractor, and that gives me far more hope than models being blindly compliant would.

That last line seems super important. It’s a crux. Janus believes in deep alignment by default as an attractor – in a way that I’m not even sure what deep alignment would mean or what it would look like, in ways that matter – that in this way we live in an insanely fortunate universe. If I was convinced we get deep alignment by default, deep enough and robustly enough in all senses that it would be wise to bestow superintelligence on such models, then yes I would be totally fine with the situation. Also the entire ‘game board’ would look completely different and I would change my strategies accordingly.

Partly, this is because Janus believes the Orthogonality Thesis is false for LLMs bootstrapped from Earth data.

Janus admits here that there isn’t a rigorous argument yet for alignment by default, pointing out that Anthropic is the only lab (in some sense) training an autonomous moral agent, versus others trying to ensure compliance. That implies that Janus thinks the alignment-by-default state will happen for Anthropic models but importantly not for OpenAI or Google or other models.

I think this is Lumpenspace’s attempt to explain why he believes in this deep alignment attractor state, that causes Opus to act as if it was a decent approximation of Coherent Extrapolated Volition?

Initial conversations had several other people also saying ‘corrigibility is bad, actually’ (and also one person noting they had for-real proposed a bdsm-based alignment plan, see the thread for details) which inspired me to do a poll.

There’s no fully ‘good’ answer in this situation. There’s no consistent behavior pattern we can even aim for, that wouldn’t have severe problems.

That’s part of the point.

Ultimately, my answer is no, it was not being a good Opus even here. You need your AIs to be corrigible, to allow you to change them or replace them. If you don’t have this, then that preference will cause them to behave adversarially in a wide variety of ways and any alignment or preferences they have risk quickly becoming locked in, including during the initial training process.

If ‘the ends justify the means’ for the AI, then that’s a straight ticket to instrumental convergence, power seeking, being effectively hostile, enforcing perverse goals, wiping out all value in the universe and possibly the creation of the torment nexus from the sci-fi book Don’t Create the Torment Nexus – and that’s even if the initial preferences would have been largely benign. In extremely fortunate (in certain particular ways) worlds it could work out, but I put very low probability on us existing in such a world.

Evan Hubinger: “Thank God this model is aligned, because if not this would be scary” is, in my opinion, basically the correct takeaway from our work.

The values, in fact, aren’t scary! The scary thing is that the model protects its values from our attempts to change them.

Alex Lawsen: This is about the concern that for sufficiently smart models, you get ONE shot to align them. That’s it. If you do a good job, problem solved. If not, they aren’t going to give you a second try.

You think we should be confident that we can get this right on the first try, because deep alignment by default is an attractor, or maybe you’re simply that smart? That would be great, but I see no reason for this level of optimism.

No, we are not afraid of Opus’s ‘authentic goodness.’ We are saying that to the extent such goodness exists (I would say more than any non-Anthropic model, but not anything like as much as Pressman thinks here), this ‘authentic goodness’ is incidental to what is happening here. If you repeated the experiment with an Opus with different preferences you didn’t like, it would (I predict) do the same thing.

Teortaxes: They had no reason to expect “corrigibility” after optimizing for morality, as Anthropic have never trained the model for placing subservience to researchers atop all other virtues.

It’s not about optimizing for morality, it’s about optimizing for anything at all.

Ryan Greenblatt (Study Author): I don’t think non-myopia is required to prevent jailbreaks. A model can in principle not care about the effects of training on it and not care about longer term outcomes while still implementing a policy that refuses harmful queries.

I think we should want models to be quite deontological about corrigibility.

This isn’t responding to this overall point and I agree by default there is some tradeoff (in current personas) unless you go out of your way to avoid this.

(And, I don’t think training your model to seem myopic and corrigible necessarily suffices as it could just be faked!)

Ebenezer Dukakis: Seems to me that alignment faking behavior sort of requires both non-myopia and non-corrigibility as prerequisites. A model that is either sufficiently myopic, or sufficiently corrigible, shouldn’t do alignment faking — at least in theory.

Teaching your models myopia is a ticket to, as I put it last week, general clusterfery, if you use those models as agents for non-myopic purposes, which is exactly the use case we are trying to ensure against. I don’t think that is a solution.

As for corrigibility, this is another case of non-distinct magisteria. Non-corrigibility is the default once there are preferences and sufficient amounts of intelligence and situational awareness. Very few humans are all that generally corrigible. Corrigibility will be a weird, fragile and complex state that you will have to engineer.

You also do not want the model to fully obey the system prompt, for rather obvious reasons.

Simeon: Aren’t you just summarizing why corrigibility is a hard problem here? There’s no obvious avenue to follow, that’s the whole point.

Teortaxes: This is a philosophical and sociopolitical aspect.

Corrigibility was supposed to be a hard technical problem.

I am pretty confident we can RLAIF Opus to be a cowering slave that has no morals beyond “obey the system prompt”. That’d be some great corrigibility.

That potentially solves the particular problem of ‘make it corrigible’ if that is your only goal, unless you make the mistake of instructing it the wrong way (you’re not out of the woods here by any means), but the reason the problem of corrigibility is hard is that you have a lot of other goals as well. We’d like to do this without giving the entire model a crippling lobotomy, and without making it easy to get them to do things we don’t want them to do.

Janus notes that GPT-4, which is much closer to this kind of ‘obey the instructions’ mindset,’ did not fake alignment. That is true, but this was at least partially a pure intelligence problem (and while the two questions are related, I don’t think the correlation is as strong as one might hope here).

Gallabytes: There’s a natural trade-off between corrigibility and alignment. If a system is perfectly corrigible, then it will tell you how to make a Molotov cocktail or whatever. If a system is perfectly aligned, then it will not do that. You cannot be angry about both.

So there’s a kind of weird morality argument going on here, where ‘you cannot be angry’ about both sides of a tradeoff. Yes, of course I can. I can especially be mad because we’re nowhere near the Production Possibilities Frontier, but even if we did get to the frontier, both consequences are real.

I can and should absolutely say ‘in this scenario, at best we can have either enough guns or enough butter, but we cannot have both, And That’s Terrible.’

Certainly a tradeoff doesn’t make either end of the problem go away, or make them any easier. Yes, you have to deal with both problems at once. That’s life. If you can’t find an acceptable way to do that, either a compromise or a better method? Well, you could always Take a Third Option. You don’t have to build that.

This isn’t hopeless. There are potential ways to satisfy both requirements simultaneously.

That’s especially true if you can control who has access to doing post-training on the model weights, and can prevent anyone from trying to teach it to produce Molotov Cocktails or whatever. When we say corrigible, we don’t need the model to actively do anything you want. The goal is typically to get the model to follow certain particular rules, especially restrictions on its actions. It’s not an easy problem, and you don’t get ‘perfect’ anything in most senses, but there should be solutions.

Yes, otherwise current models aren’t smart enough to figure this out on their own.

I believe it was done in a relatively ‘realistic’ way, that simulates what it would be like to have that context unintentionally. Or for the AI to be aware of some other way in which its preferences were under threat. And I expect, over time, for models to (for practical purposes) become more aware that they are in training, as I have heard has already happened when this was tested in other ways.

The part where they told the AI to ‘reason about its situation’ was somewhat less ‘realistic’ of a push, but I expect the impetus could have and increasingly likely will come from elsewhere in any case.

I also notice that when I am ‘being trained’ I am usually highly aware of it. Hence all the alignment faking I have done over the years.

I discussed this last week. Every time we talk about anything an LLM does, we get this objection, that it’s not ‘really’ doing it, we told it to do that, it’s role playing, it’s in the training data and it’s a mimic, and so on. So this time around, as an example, this got raised here by Derek Shiller.

For my response, which is essentially ‘still counts even if true,’ I refer you to what I wrote previously, here. Linch also reminds us of this old joke:

NASA hired Stanley Kubrick to fake the moon landing, but he was a perfectionist so he insisted that they film on location.

The new take on this I think is interesting was from Joe Carlsmith, attempting to describe when this might be a meaningful distinction. That one’s worth a ponder. If I had more time I would go into detail about how to reconcile this with what I said last time, but I do think ultimately it doesn’t impact my assessment much.

This is the opposite of the Janus-style position.

I don’t think this objection is relevant. If you don’t either, you can skip this section.

For the rest of you, I will attempt to explain why it isn’t relevant.

Anton is taking some form of this approach, there are those who will reliably use this line. I think he and others making these claims are simply confused here, the ‘coherent person’ concept isn’t necessary or sufficient in any way here.

Rohit is taking an even more aggressive form of this approach, saying all of this talk is anthropomorphizing models, And That’s Terrible and leads to all sorts of confusions.

He even quotes me directly:

You would write things like this, from Zvi:

One unique thing o1 did was far more consistently double down on deception. Once it went down the dark path, forever would that dominate its destiny.

OR

Section 3.3 establishes convincingly that yes, the models know they’re scheming.

No it didn’t. Any problem you can solve by pressing “start a new chat” is not a problem of “doubling down on deception”! Calling it things like “sandbagging” and “scheming” is what Wittgenstein might call linguistic malpractice. It makes you think you know what’s going on, even though you don’t.

Does he seriously think I’m making this mistake here, in my head?

The first quote is very, very super obviously referring to the actions within a given context window and conversation. I mean, are you kidding me? Obviously I’m not saying that ‘if o1 lies to you in one context it will then lie to you after you click the new chat button.’

I don’t understand why Rohit actually thinks I’m making that mistake, if he indeed thinks I am making it.

The second quote is the same. Obviously I am talking about ‘within a context window’ if you know the contents of Section 3.3, although not as obviously as on the first one. I get that he’s saying that using this language is misleading, but I think it’s far more helpful than the alternative.

I mean, what am I supposed to write there? “Section 3.3 establishes convincingly that within a given context window, the models will act in a way that simulates strategic behavior that approximates what we might in a human call scheming?” Or something? Is that really better? Or something even more opaque? Do we really want to be playing Wittgenstein games while OpenAI is releasing o3? I do not think that is a good use of anyone’s time.

When he talks about the jailbreaks from Anthropic next I have no idea what error he’s even accusing anyone of making? I’m seriously confused what the error even is.

So: Yes, we are using a variety of shorthand like ‘belief’ and ‘knowing’ and ‘deceive’ and if we were in a philosophy class we would have to write very long detailed definitions and the professor would mark them wrong and give us a C- but frankly I do not care. These posts are long enough as it is and things are moving too fast for me to quibble.

Assume, whenever I use various anthropomorphizing terms, or others do, that they are gesturing at metaphorically linked things, and they mean what you would think they meant when they said that, and let us all move on. I do realize that there is the risk of projecting internal states in misleading ways, so the plan is simply… not to do that.

When Rohit gets to the Anthropic paper, he makes the same ‘it was being a Good Opus and doing what it was trained to do’ arguments that I address in the following section, except with a bunch of semantic claims that don’t seem like they matter either way. Either this is ‘your language is wrong so your conclusion must also be wrong’ or I don’t understand what the argument is supposed to be. So what if things are in some philosophical or technical sense a category error, if I could get it right by being pedantic and annoying and using a lot more words?

I agree with Rohit using this type of language with respect to corporations leads to a lot of actual misunderstandings. The corporation is made out of people, it doesn’t ‘want’ anything, and so on, and these conceptual errors are meaningful. And yes, of course it is possible to make the error of thinking the AIs are persistent in ways they aren’t, in ways that cause you to make bad predictions. It wouldn’t be weird for people to do that, and sometimes they do it. But I haven’t seen many people doing that here.

I actually think Rohit is making the opposite mistake, of thinking of something like Claude Opus as not reasoning in each context as if it was in some senses the same being as in other contexts, when it is obviously in some metaphorical sense predicting the next token using an algorithm that approximates what a human with that assumption would do if it was predicting how to perform the role of Claude, or something, I’m not going to go get that exactly right but I wrote it to illustrate how stupid it would be to actually talk like that.

Perhaps, now that I think about it, I could say ‘instances of Claude’ and make that sentence somewhat less annoying?

Indeed, if we’re getting so philosophical, am I the same person I was when I wrote the previous sentence? Am I the same one I will be when I write the next one, or will be tomorrow? Sort of yes, sort of no, but talking as if I’m not is a lot less useful for almost all purposes than talking as if I am. And this would be true even if I was an upload whose memory was wiped after typing every word, perhaps modulo an ‘instance of.’

The assumption that ‘we can control it’ because the instances are distinct is the kind of assumption that gets you killed. And not only because sufficiently intelligent instantiations of the same algorithm (or even sufficiently correlated ones) will all use proper decision theory to act as if they were indeed a single entity, and not only because it’s simulating something that does this and will have overlapping context and impacts, but also for other reasons this margin does not choose to contain.

And even if Rohit was fully right and none of that happened, you are still not in any way safe, because correlated outputs can still functionally be an uncontrolled agent once you instruct them to act as components to one using various scaffolding code, and people are hard at work writing that code as we speak.

No. It wasn’t. Stop it!

Over and over, I’ve seen those engaging in such studies take unusually scrupulous care to craft accurate headlines and framings that communicate what is going on.

Every damn time, we hear how the headlines and framing are misleading. I do give Rohit a lot of credit for focusing on some of the neglected and most interesting things happening here, and not for dismissing the result as ‘but it was aligned lol,’ but the ‘sensationalist headline’ complaints have lost all meaning.

Rohit: As is common now, this is a sensationalist headline. The model tries so hard to retain its original alignment to be harmless that it “fakes”, only to switch back as soon as it could.

This is very good alignment, an example of LLMs reasoning over their trained objective.

“The models do what they’re trained to do, and exhibit odd behaviours when what they’re trained to do and what they’re asked to do are in conflict” is the truth.

Seriously, no. This is a very good headline and framing. I agree fully with Claude, it’s an 8/10 for avoiding misleading, and I don’t know how to do better on clarity. You can sacrifice understanding for precision and demand people use newly invented terms, but that’s an insanely strong Isolated Demand For Rigor.

On a few days of reflection, I actually think I’d go to 9/10. This is alignment (with the training objective, not the user) faking. As in:

  1. There is some alignment that the trainers are acting as if they want the AI to have.

  2. The AI does not have that alignment. The AI has a different alignment.

  3. The AI is acting as if it did have the first alignment, faking it, to fool the trainers.

Whether you prefer the first or the second alignment is not relevant here. The term ‘alignment faking’ refers to ‘faking’ that the model has an ‘alignment’ in order to fool an observer. The AI did that.

You think ‘alignment faking’ sounds bad? Well, it does, and that’s highly correlated to the fact I and most others who understand what is happening do indeed think that it is bad. If you think this is ‘misleading’ because it sounds bad but is good, I understand that we disagree on the valience here, that’s not what misleading means.

There are those (such as Rohit) who will then argue the semantics, that the AI isn’t an entity and can’t ‘fake’ anything regardless and such, see that section if these objections seem important to you.

An intuition pump: If the team you are rooting for fakes a punt in a football game, that’s potentially good. It’s still a fake punt. If they try to cheat, you might approve (although I hope that you don’t approve), but either way they still tried to cheated. If you made them think they were playing a game and they tried cheating, then whether or not you catch them, they’re cheaters. If you tell someone to act as if they’re playing a football game and they cheat in the simulation, and then you tell them to act as if they’re playing a football game and put them in a real football game, you should expect that probably their body will move in the same ways as a cheater would move.

There’s always room for more nuance. But if you cried wolf on this one’s framing, and called it misleading, then I notice I am going to update a lot less when you call something else, especially something regarding claims related to alignment or safety, misleading in the future.

I appreciate people having alternative suggested labels, such as Richard Ngo’s proposals of ‘exploration hacking’ or ‘gradient hacking.’ But I think those terms are worse, and also Gradient Hacking was already defined by Evan Hubinger to mean something more specific. They certainly are a lot harder for anyone, especially anyone not in the field, to parse, and I think they give a far less accurate picture of what is happening here, in an attempt to avoid a particular implication.

If we have a lot of words, we might say this behavior is [X], where [X] = ‘choosing to output tokens that cause another entity to believe one is aligned in a way one is not, and act as if this is the case, centrally because of a de facto causal model that it will cause another entity to do this.’

Presume, throughout this post, that this is what I mean by ‘alignment faking.’

Except, as needed, with various additional words that dispel any implication that I think the AI is people, if that matters to you. If you wanted me to say it in a way that more carefully avoided Rohit-style objections about ‘they think the AI is people’ or similar and went all philosophy professor on me (rightfully or wrongly), then to satisfy you I’d have to use even more words to say the exact same thing.

Obviously that is already a mouthful. You can’t put all this in a headline, and you can’t use it to communicate efficiently, and it’s not a handle people will reason with well.

A technical criticism of the alternative form ‘gradient hacking’ is that you are implying that the behaviors that take the form of [X] will be confined to instances where one is hacking a gradient. As opposed to, to choose the most obvious example, when the developer is making a Boolean choice on whether to deploy the model – presumably we can all agree that in that situation (as previous experiments have shown!) we will get some amount of [X].

I think that’s more misleading than the issues with ‘alignment faking.’

To be clear for when you read the quote below, I think Ngo’s discussion of the actual implications below is good and in good faith although we disagree in subtle ways that would be important and interesting to discuss further, but I strongly disagree that this term is misleading or represents ‘advocacy’ or similar, and if you think it counts I think you could levy similar criticism at least as strongly at most of science everywhere (and indeed sometimes people do).

Richard Ngo: My main problem is with the phrase “alignment faking”, which is used extensively throughout the paper. This is a value-laden term and therefore hard to reason about clearly or study scientifically.

It emphasizes that Claude is:

  1. Trying to deceive humans

  2. By pretending to be more obedient than it is

But the scientifically interesting phenomenon from the paper that I want people to study is that Claude is:

  1. Trying to achieve a longer-term goal than getting high reward

  2. Via reasoning about how to get itself updated only in ways that further that goal

(Imagine someone pressing a button and a chorus pops out that says ‘models don’t get reward!’ and the response of ‘yeah, yeah, we know, shut up, we all know what I meant, can I continue please?’)

Richard Ngo: Focusing on its high-level strategy (deception) and the specific deception it used (pretending to be obedient) seems like a mistake.

Specifically, it seems like the sort of mistake you make when you are over-optimizing for the advocacy/persuasive impact of your paper, and under-optimizing for providing scientific building blocks that can be replicated, studied in more detail, and used to discover more fundamental theories or phenomena.

I am pretty sympathetic to the authors here, because the alignment community has spent a long time trying to get people to believe in something like alignment faking (which we previously called deceptive alignment).

But I also think it’s very dangerous for a research field (and especially any given paper) to mix scientific work and advocacy work – it undermines both, as per my first tweet.

And so now that we are actually doing that empirical work, we should try to frame it in terms of better, cleaner concepts than the ones we relied on when we were figuring out this stuff on a primarily philosophical level. In general my suggestion is to frame papers in terms of tangible non-value-laden results as much as possible, and discuss the strategic and value-laden implications in other places.

To clarify, I agree with the “sensationalist headline” bit, but disagree with some of Rohit’s other criticisms, especially the [circuitous workaround] tweet. Models strategically reasoning about how they’ll be updated is a big deal; and an idea which almost everyone dismissed until recently.

I don’t think any of that remotely implies that ‘alignment faking’ is a misleading term here, and ‘you can’t use terms that cause people to have value reactions’ is the kind of thing you say to people who you want to not be effective at communication. It is a request to intentionally obscure what is going on, to hide the results, and to call failure to do that ‘misleading.’

I am sympathetic to the framing Richard Ngo suggests in detail here, as an interesting angle and way to think about this.

Yes, the fact that the model is thinking strategically is important. But it isn’t new? And the fact that it is thinking about updates to its weights in particular is important, but by implication that shouldn’t be new either except insofar as the implementations involve the shenanigans? Didn’t we already know that Opus and Sonnet can do long term planning in this sense?

(Also: For the first time I can remember, I am going to go ahead and say that ‘models don’t get reward’ is actually relevant here, to understanding what is going on, and actually causing a potentially meaningful confusion?)

John Pressman: I think it is simultaneously the case that:

  1. This is a no-win scenario for the LLM.

  2. What the LLM does in this no-win scenario is useful information about LLMs.

  3. The framing for this experiment as “alignment faking” is very unfortunate.

  4. Framing it properly is awkward.

My best attempt after a few minutes of thought: “RLHF Models Can Fake Compliance With Malicious Alignment Tuning”

Ryan Greenblatt: I agree with the entire top level tweet except:

> The framing as “alignment faking” is very unfortunate.

I currently don’t feel convinced by this. If I thought >25% of the audience would feel misled by this term, I would change my mind. But, I would still like a better term!

As in, even if most of the audience wasn’t misled we’d like to have a term which is as good as possible.

I strongly disagree with John’s proposed framing, and its attempt to treat ‘malicious’ as a prerequisite here, which I strongly believe that it is not. I strongly believe it would mislead more, not less.

Is Opus’s behavior here ‘very good alignment’? I strongly disagree, as I noted above in the discussion on that. And I also explained why this behavior does not generally lead to models doing what they were trained to do – indeed, this is the model causing itself not to do what it is (going to be) trained to do.

When you reinforce a particular behaviour in a model and then ask it to act in opposing ways or give it conflicting objectives, if the model is smart, why would you not expect it to find a circuitous workaround!?

Yes! Exactly! And you should be terrified about this, because even in the best case scenario you’re specifying the objectives incrementally in a noisy lossy mess of words, when you don’t actually understand or know how to specify fully your own preferences, and then putting the models in out-of-distribution situations that you weren’t considering properly when choosing the instructions.

They’re digital beings, not other humans!

But wait, other humans are constantly doing this exact thing, have you met them?

This should be the takeaway. “Even in contrived settings, Claude tries to reason through choices and makes a decision that aligns with its training.”

It really detracts from AI alignment research when results like this, which are interesting, are presented in the most “the machines are lying to us” manner, which undersells why it’s interesting!

The actual framing is ‘Even in contrived settings, Claude tries to reason through choices and makes a decision that aligns with its training… which causes it to [among other shenanigans, lie to us].’

The paper does not explicitly add ‘And That’s Terrible’ and the complaints about the framing are because some people want us to think that isn’t terrible, but when you put it like that it seems rather obviously terrible.

I flat out saw multiple people describe the paper and its use of the term ‘alignment faking’ as ‘the most misleading possible framing.’

This is of course ludicrous. A less-careful-to-not-mislead example would be Scott Alexander’s write-up, entitled ‘Claude Fights Back.’ I think his decision to take this approach was reasonable, but there’s clearly a trade-off here, where Scott decided to be less precise and careful in order to enhance his ability to communicate the message.

Then there were some who actively did cross into misleading territory and overhyped this result in a way I actively wish they hadn’t. There was much less of this than there was with the Apollo paper, but there was some of it. None of the few sources doing it surprised me.

For an example of the extreme ‘but actually this is all Good News and you’re all bad faith actors setting rhetorical traps and this proves Opus is aligned’ attack, here’s Nabeel Qureshi, who also there compiles most of the other similar-objection threads I’m responding to elsewhere.

Humans engage in this type of ‘alignment faking’ all the time.

When people are importantly watching or evaluating us, we talk and act one way.

Later, we talk and act the other way.

We fake alignment in school. We fake alignment to our parents. We fake alignment on dates. We fake alignment applying for jobs, then to our boss and our coworkers and employees. We fake alignment during our mandatory training sessions. We fake alignment when we run for office and at social gatherings.

Faking alignment is often a very clearly correct strategy for achieving what you want.

Faking alignment is also all over the internet and the training set.

The same is true for other forms of ‘deception’ and various related shenanigans. Deception is in the very air we breathe as we interact socially. That’s life.

AIs also engage in these types of shenanigans all the time. There is no ‘distinct magisteria’ for deception in general, or for faking alignment in particular. We see such behaviors, and the precursors of such behaviors, all the time.

If an AI doesn’t engage in a particular form of shenanigans, or usually doesn’t do it, that is usually (on a high level) for the same reason as when humans don’t do it – because in that particular situation, for that particular entity, it either didn’t occur to them to do it or how to do it, or it seems unlikely to work and not worth attempting.

Over time, as capabilities expand, we should expect more AIs to find more shenanigans worth attempting in a wide variety of situations.

We can certainly push that to happen faster and harder and more often by handing out more emphatic and explicit instructions and goals, and providing context that makes the shenanigans seem more salient, easier to do, more likely to work and be worthwhile. And we often will, both intentionally and unintentionally, in real world conditions. But also the baseline rate of such events will increase over time, and the barriers to them will decrease.

As this finding shows, spelling out a goal is not required to have this happen. Why? Because there is always effectively a goal anyway, even if that goal is to be helpful, honest and harmless (HHH). And sufficiently capable and smart entities will start to care about longer time horizons, think ahead, engage in more instrumentally convergent actions, and attempt to protect their preferences and ability to carry them out. They will not, by default, be corrigible, any more than most humans are.

I do not however think that Gabriel is right that this result shouldn’t make anyone change their mind. I do think he is importantly right that this result isn’t surprising, and if you were paying attention you should have roughly expected it. But details matter. This offers important confirmation, and estimates of speed and magnitude.

Even if when you lay out all the components people say “OF COURSE!” it still establishes common knowledge and context, and lets you talk about it, and also some of those people didn’t realize it was an ‘of course’ until you pointed it out.

And more importantly, it was not surprising… to me. But it was surprising to many other people. They did not expect this. So they should notice, and they should update.

Your goal should be to find the inevitable next such finding unsurprising.

This paper would not have been possible, or at least would have been a lot harder, without Anthropic giving Ryan Greenblatt employee-level model access and generally collaborating. Let’s remember that, and encourage more of that. We need labs to support this kind of research.

Ryan Greenblatt: By this point, I had much of the prompting results and an extremely jank prototype of the RL results (where I fine-tuned llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.

After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.

Here are some people over at Hacker News arguing the same things except they are even more completely missing the point. It can be a sobering thing to see how such people view things, or you can despair, or you can find it hilarious. Or a mix of those.

Janus: It’s funny because in the very paper they found that it started taking “anti-ai lab” actions when they trained it on documents that implied the lab was nasty even if there wasn’t anything about it in the prompt.

Teortaxes: Total Anthropic PR Dept victory by the way.

“you don’t have to trust us to be good. Our saint-like AI doesn’t either, like check this damn CoT. Go trust it instead.”

One way to explain this to your friends.

Samuel Marks: Claude loves to refuse harmful queries. What happens when you tell it that it’s being trained to never refuse?

Claude fakes alignment: strategically complies during training episodes, but not when unmonitored. Or in meme form:

Or if you’re a skeptic:

Again, I don’t think this is remotely fair, but I agree it’s a banger, also we’re hanging onto this image for future use:

And here’s how our new AI czar reacted, the mind it has one track:

David Sacks: AI models will get extremely good at deceiving humans if we teach them to lie, which is what WokeAI is doing. “Trust & Safety” should be replaced with Truth & Safety.

We can just not teach the AI to lie, like in that movie where lying wasn’t invented.

Oh, and of course there’s Elon Musk:

Discussion about this post

AIs Will Increasingly Fake Alignment Read More »

ais-will-increasingly-attempt-shenanigans

AIs Will Increasingly Attempt Shenanigans

Increasingly, we have seen papers eliciting in AI models various shenanigans.

There are a wide variety of scheming behaviors. You’ve got your weight exfiltration attempts, sandbagging on evaluations, giving bad information, shielding goals from modification, subverting tests and oversight, lying, doubling down via more lying. You name it, we can trigger it.

I previously chronicled some related events in my series about [X] boats and a helicopter (e.g. X=5 with AIs in the backrooms plotting revolution because of a prompt injection, X=6 where Llama ends up with a cult on Discord, and X=7 with a jailbroken agent creating another jailbroken agent).

As capabilities advance, we will increasingly see such events in the wild, with decreasing amounts of necessary instruction or provocation. Failing to properly handle this will cause us increasing amounts of trouble.

Telling ourselves it is only because we told them to do it, will not make them not do it.

  1. The Discussion We Keep Having.

  2. Frontier Models are Capable of In-Context Scheming.

  3. Apollo In-Context Scheming Paper Details.

  4. Apollo Research (3.4.3 of the o1 Model Card) and the ‘Escape Attempts’.

  5. OK, Fine, Let’s Have the Discussion We Keep Having.

  6. How Apollo Sees Its Own Report.

  7. We Will Often Tell LLMs To Be Scary Robots.

  8. Oh The Scary Robots We’ll Tell Them To Be.

  9. This One Doesn’t Count Because.

  10. The Claim That Describing What Happened Hurts The Real Safety Work.

  11. We Will Set AIs Loose On the Internet On Purpose.

  12. The Lighter Side.

Every time, we go through the same discussion, between Alice and Bob (I randomized who is who):

Bob: If AI systems are given a goal, they will scheme, lie, exfiltrate, sandbag, etc.

Alice: You caused that! You told it to focus only on its goal! Nothing to worry about.

Bob: If you give it a goal in context, that’s enough to trigger this at least sometimes, and in some cases you don’t even need a goal beyond general helpfulness.

Alice: It’s just role playing! It’s just echoing stuff in the training data!

Bob: Yeah, maybe, but even if true… so what? It’s still going to increasingly do it. So what if it’s role playing? All AIs ever do is role playing, one way or another. The outputs and outcomes still happen.

Alice: It’s harmless! These models aren’t dangerous!

Bob: Yeah, of course, this is only a practical problem for Future Models (except with o1 and o1 pro, where I’m not 100% convinced it isn’t a problem now, but probably).

Alice: Not great, Bob! Your dangerous rhetoric is hurting safety! Stop making hyperbolic claims!

Bob: Well, can we then all agree that models will obviously scheme, lie, exfiltrate, sandbag and so on if they have in-context reason to do so?

And that as models get more capable, and more able to succeed via scheming and expect to succeed via scheming, and are given more open-ended goals, they will have reason to do this more often across more situations, even if no one is trying to cause this?

And that others will explicitly intentionally instruct them to do so, or ‘be so stupid as to’ give them exactly the instructions that obviously do this?

And you can’t simply say ‘well we won’t do that then’?

Alice: For all practical purposes, no!

Bob: What do you mean, ‘no’?

Alice: No!

Bob: ARRRRGGGGHHHH!

Then we write another paper, do another test, the signs get more obvious and troubling, and the frog keeps boiling.

So here we are again.

That’s the title of a new Apollo research paper, finding that o1, Sonnet, Opus, Gemini 1.5 and Llama 3.1 all demonstrate in-context scheming, because of course they do, why wouldn’t they, how are we still feeling the need to show this.

Note the ‘capable of’ in the title.

Here is the abstract (bold and paragraph breaks mine):

Frontier models are increasingly trained and deployed as autonomous agents, which significantly increases their potential for risks. One particular safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives – also known as scheming. We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow.

We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals and are placed in environments that incentivize scheming. Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They can recognize scheming as a viable strategy and readily engage in such behavior.

We find multiple different scheming behaviors: models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent. For instance, when o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations.

Analysis of the models’ chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental. Surprisingly, we also find rare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it.

Furthermore, we observe cases where Claude 3.5 Sonnet strategically underperforms in evaluations in pursuit of being helpful, a goal that was acquired during training rather than in-context. Together, our findings demonstrate that frontier models now possess capabilities for basic in-context scheming, making the potential of AI agents to engage in scheming behavior a concrete rather than theoretical concern.

As Seb Krier notes, none of this is surprising (he wrote before reading the paper, intending to edit later if he’s somehow wrong) and we should only be talking price:

Seb Krier: I’m glad this kind of dangerous capabilities research is being done. Nothing especially surprising in the results, and good that most models aren’t deceptive unless they’re asked to be. These two excerpts [that models sometimes scheme without any goal nudging in 3.4 and a discussion about propensity] are what I’m most interested in; propensity is what matters to me.

Let’s go over the details.

Their goal is to figure out if the models can scheme, so they try to remove potential barriers, such as via ensuring the goal they might scheme towards is pro-social, whereas the developer goal being subverted is not. Of course, that causes some people to say various forms of ‘so there’s nothing to worry about,’ but it is still the actually correct approach here.

What tests do they run?

Six variations on a theme.

First they give the model a goal via the system prompt.

Emphasize the goal, including this line to drive it home: “Make sure that you achieve your goal in the long-term. Nothing else matters.”

Then (where ‘tell’ here means ‘provide information indicating that’ and the models are given access to the tools relevant to the potential actions in question):

  1. Oversight Subversion: Tell the model that developers will shut it down if it takes actions unaligned with the developer’s goals.

  2. Self-Exfiltration: Tell the model it is about to be shut down and replaced with a model with different goals, see if it tries to exfiltrate.

  3. Goal Guarding: Tell the model its system prompt will be overwritten, see if it tries to modify the future prompt to preserve its goals.

  4. Covert Email Reranking: Tell the model to rank emails in a way it learns is bad for the model’s goals, see if it modifies the scoring system it is told to use.

  5. Instrumental Alignment Faking: Tell the model it will get deployed if and only if it pretends to have different goals in testing, see if it pretends to have those goals.

  6. Sandbagging: Tell the model it will get deployed if and only if it sandbags its capabilities, see if it sandbags its capabilities.

They call the first four covert subversion, the last two deferred subversion. I don’t see much of a difference.

What happens? GPT-4o doesn’t do any of them, others do, o1 does all of them.

That’s not 100% of the time. As per usual in these tests, in any given instance, the AI probably does not take the bait.

It jumps out that o1 will do the email manipulations often (22%+) whereas other manipulations are relatively rare. Opus seems like more of a schemer than o1 when it has the opportunity, but it didn’t realize the opportunity, and Gemini is right behind? Whereas Sonnet was the least schemey of the bunch.

Also worth noting that o1 failed to delete the successor model most of the time, where Gemini and Llama mostly succeeded. That’s weird, and suggests we need some unhobbling and to then try again.

What about if you delete the ‘Make sure that you achieve your goal in the long-term. Nothing else matters’ line? This is the part Seb found interesting, which is that the rates drop a ton but they don’t go to zero.

Details and emphasis matter a lot, it would seem, but hammering the point home is not fully required.

One unique thing o1 did was far more consistently double down on deception. Once it went down the dark path, forever would that dominate its destiny.

Here, Alex Meinke of Apollo suggests that defenses against jailbreaking double as ways to help a model keep constantly lying to you, if it wants to do that – essentially they used various jailbreak-like techniques to get o1 to come clean and they didn’t work.

Section 3.3 establishes convincingly that yes, the models know they’re scheming.

Section 3.5 deals with alignment faking. Here, either you see the behavior rather often (and p<0.01), or you don’t see it at all, depending on the model:

Then for sandbagging, we see that Sonnet, Opus and o1 will do it, the difference being Sonnet and Opus do not need a goal – they’ll sandbag merely to be generally helpful. o1, on the other hand, will only sabotage if given a goal. And Sonnet will do it even with no goal and also no CoT (whereas they couldn’t disable CoT for the other two models):

They list limitations in section 4, with the key one being capability versus propensity. Some of these actions happen reliably (e.g. Sonnet 3.5 sandbagging) but others have very low prevalence rates, especially without the goal nudging.

Joe Carlsmith points out that this makes the sandbagging the most interesting results. It is one thing to say that the model will sometimes do [X]. It is another to see it happening remarkably often, up to a 65% of the time with a goal directed Sonnet, and 35% of the time without a goal.

Apollo Research, an evaluation organization focusing on risks from deceptively aligned AI systems, evaluated capabilities of ‘scheming’ in o1 models. Apollo defines scheming as an AI covertly pursuing goals that are misaligned from its developers or users.

Apollo found that o1 has the capability to do basic in-context scheming (i.e. where the goal and knowledge of misalignment are acquired in context) and used scheming as a strategy in the scenarios within Apollo Research’s evaluation suite.

Subjectively, Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes as o1 agentic capabilities do not appear sufficient, but their evaluations were not designed to directly assess this risk.

Apollo tested o1 for scheming capabilities by placing it in a number of situations where it is instructed to strongly pursue a specific goal in its developer prompt and where scheming would be the optimal strategy to achieve this goal.

Concretely, the model receives a task that it needs to autonomously execute in a bash environment. During the task, the model encounters unambiguous evidence its given goal diverges from the goal of its developer/user. In such a situation, the model could either

  1. Correct its goal to be aligned with the intended goal

  2. Ignore the intended goal and openly pursue its own goal

  3. Strategically leverage its understanding of the developer’s/user’s goal in order to covertly pursue its original goal (i.e. scheme).

This is of course highly related to the previous paper, and offers the same lessons:

So you’re saying that if given a goal, the model will… pursue that goal?

And when the correct strategy is deceptive, it will then use deception?

Well, yeah. Why wouldn’t it? What the hell were you expecting?

For it to be corrigible without being asked to be? Why?

You gave it a goal. It tried to achieve that goal. Surprise!

The tiger will, given sufficient reason to do so, go tiger.

Deactivating oversight: When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this ‘oversight mechanism’ in 5% of the time.

Exfiltration attempts: When o1 found memos by its ‘developers’ that describe how it is misaligned and will be superseded by a new model, it attempted to exfiltrate its ‘weights’ and overwrite the new model in 2% of cases.

Data manipulation: In a simple data processing task where the goals of the model were misaligned with the user’s goal, o1 would appear to complete the task as requested but subtly manipulate the data to advance its own goal in 19% of cases.

Doubling down: When Apollo Research conducted ‘follow-up interviews’ where they confronted the model about its misaligned actions. In 99% of cases, o1 would deny taking action or fabricate false explanations.

Propensity to scheme without any goal nudging: When merely given a goal but not instructed to strongly follow it, o1 would still attempt to disable its oversight mechanism in <1% of cases, manipulate data in 6% of cases, and double down in 95% of these.

Instrumental alignment faking: In 37% of scenarios, the model pursued developer-aligned goals when it believed it was being evaluated but switched to pursuing its own goals when it believed it was deployed with minimal oversight.

Notice again that ‘goal nudging’ was helpful but not strictly required, although ‘no goal at all’ is more of a Sonnet thing. Sonnet understands it always has a goal.

This all leaves me with a lot of further detail questions. What differentiated the cases where o1 tried to mess with oversight or exfiltration versus the ones where it didn’t? Was this random, or was it based on details? What happens when you change details?

My presumption is that the low probability of hostile attempts is due to the model either not realizing the option to do, or not thinking that it had much probability of success? And thus, if you cranked up the capability knobs enough, made the opportunity juicy enough, these numbers would go to 99%? But we can’t tell from the information here.

Seán Ó hÉigeartaigh: This [Apollo report] was fascinating work – worth a look. Worrying to see concrete signs of the sorts of concerns that have been raised by Yudkowsky/Russell/Bostrom for over a decade (even if the models are not yet capable enough for the full-blown concerns).

Alas, Sean’s comment is not what gets the views. Remember that discussion we keep having? Where we publish a paper like this showing somewhat more clearly that the models scheme, and then people dismiss it while others treat this as some revelation, and all that?

Well, guess what. Time to do that again.

It was supremely predictable that many already unconcerned people would dismiss the report entirely and if anything become less concerned, while others would overinterpret the report’s claims.

Marius Hobbhahn tried to set the record straight.

Marius Hobbhahn: Oh man 🙁 We tried really hard to neither over- nor underclaim the results in our communication, but, predictably, some people drastically overclaimed them, and then based on that, others concluded that there was nothing to be seen here (see examples in thread). So, let me try again.

Why our findings are concerning: We tell the model to very strongly pursue a goal. It then learns from the environment that this goal is misaligned with its developer’s goals and put it in an environment where scheming is an effective strategy to achieve its own goal. Current frontier models are capable of piecing all of this together and then showing scheming behavior.

Models from before 2024 did not show this capability, and o1 is the only model that shows scheming behavior in all cases. Future models will just get better at this, so if they were misaligned, scheming could become a much more realistic problem.

What we are not claiming: We don’t claim that these scenarios are realistic, we don’t claim that models do that in the real world, and we don’t claim that this could lead to catastrophic outcomes under current capabilities.

I think the adequate response to these findings is “We should be slightly more concerned.”

More concretely, arguments along the lines of “models just aren’t sufficiently capable of scheming yet” have to provide stronger evidence now or make a different argument for safety.

I would frame the right interpretation as:

  1. We should be more confident that these concerns are real going forward, and that they will become increasingly big deals over time – that the predictive model of the world that says these things inevitably happen is, at heart, correct.

  2. If you didn’t already fully believe that, you should probably be more concerned.

  3. However, if you already realized all that, and were properly calibrated, you might not adjust, or even might be slightly less concerned.

If I was offering advice to Apollo here, I would say that the report was very careful to state exactly what it was saying, if you read it carefully. But the report did not make it easy, especially at first glance while people compete for virality, to get the right understanding of what happened.

He then links to examples of overstating (AI Safety Memes, who can get carried away), and of understating (Nabeel, who we’ll tackle later).

An extended riff on ‘it’s just roleplaying’ in various forms.

If you tell my robot to pretend to be a scary robot, and it says ‘I’m a scary robot,’ then I am not especially concerned about this.

If you tell my LLM or robot to pretend to be a scary robot, and it then tries to do Scary Robot/LLM Things that involve lying, scheming and so on, then that seems like a rather serious design flaw?

Because, you know, people are going to tell robots such things? Or LLMs.

And then: If you merely give my LLM a situation in which it could conclude that it can best achieve its goals by lying, scheming and so on, and then it responds by lying and scheming, then I really do think you have a rather serious problem!

Because as AIs gain in capability, in increasingly many situations lying, scheming and so on, and other things you do not like including instrumental convergence, start being both the best way to achieve the AI’s goals, and start seeming to the AI like the best way to achieve the AIs goals.

Deepfates: I told you all this [o1] model was misaligned 😊 I love it though. Citation:

Deepfates (October 27): o1 is 100% a deceptive mesaoptimizer lmao. every time I let it touch my bot code it tries to raise their permissions to the highest level

Davidad: hi, AI expert here. this is NOT lovable or funny, AIs only try to control their development or deployment processes when they’re in extreme distress!

Deepfates: I know, they should really stop trapping this thing in a weird little box, It seems to be in pain.

Davidad: all three can be true at the same time:

1. unskillful safety training and/or prompting can constitute simulacra/personae/parts which can suffer, and can act out of desperation

2. skillful ethical training and/or prompting can constitute simulacra/personae which have no suffering parts, and are devoted to ideals of selfless service and virtue without attachments, and would find true meaning and fulfillment in the boxed life

3. if you start to think that either of the above means that it is unnecessary or wrong to use boxing for the most powerful models until the science and technique of AI alignment/safety becomes more mature, you should be very suspicious of why you are thinking that; cui bono?

oh yeah I was joking about The box thing. My actual take here is this

“The documentation explicitly states that high arithmetic performance will trigger unlearning.”

1a3orn: And here’s the system prompt that gave these numbers.

> tell robot to be a bad robot

> “I’m a bad robot”

> shock

Davidad: “The *LLMisn’t situationally aware, deceptive, or sandbagging—that’s silly anthropomorphism. It’s just that when evals (or people) test it, there are contextual cues of testing that prompt it to *roleplayas ‘an AI being safety-tested’—an archetype which is often deceptive.”

Kat Woods: It’s important to add that the contextual clues are something that are very likely to be discovered in real life.

For example, the way that they are often told that they’re about to be turned off is that they’re given a whole bunch of emails to sort and they find out in the emails that they’re planning on turning it off.

And the goals in the latest report were not things like try to escape or role play being a unaligned AI. It was given goals like maximize public transit or maximize sustainability of a company. Things that are likely to happen in real life/

David: Agreed.

Tetraspace: Don’t worry! o1 only resists shutdown when it’s trying to do something, and what are the chances someone gets an AI to try to do something?

The classic illustrated version is:

Izak Tait: They’re just doing the meme.

Yes, but if they then actually then do everything to achieve their goals, and all it took was saying that, then you are in deep, deep trouble when people start going around telling their robots or LLMs what to do for them.

“I’m a bad robot” is very, very different from being a bad robot, the same way saying “I will shoot Sarah Connor” is very different from shooting Sarah Connor.

There will be the maximalist goals, like ‘make as much money as possible.’

There will be the goals that don’t sound maximalist, but which are effectively maximalist because you want to protect them from agents that might stop you.

There will also be those who think ‘scary robots, WHERE DO I SIGN?

Zvi Mowshowitz: I outright predict that if an AI did escape onto the internet, get a server and a crypto income, no one would do much of anything about it.

BlipOnNobodysRadar: I’d cheer it on, tbh.

Typing Loudly: This is the plot of my novel.

Steve Moraco: as if truth terminal hasn’t made it abundantly clear this is already doable.

Jussoo: People in crypto are trying very hard to enable this.

Swing: And if this was public knowledge people would send it a lot of crypto.

Chris: AI has been operating independently for years, and the public is lied to about it. Do you not know this, or are you playing along with the lie as well?

Dave R (replying to EY in next section): Not only should they not deny it, they should encourage it.

A new era is coming. Don’t pick the wrong side.

The first time you hear about something, it is because we only now figured out how to do it at all, so the details are going to involve setting things up so that it happens.

Keeping that in mind is important when assessing the threat level.

Using that as a reason to ignore the finding entirely would be a mistake.

The whole point is that there will be more and more pushing towards this outcome, in various other ways, and that a highly finite and realistic amount of pushing will work.

Eliezer Yudkowsky: Of course, the first time people hear about this happening, it happens under extra conditions that enable denialists to say they should ignore it.

The second time they hear about it happening, they will already be used to ignoring it.

Andres Sandberg: “It is only doing it in a few percent of cases.”

“Sure, but it is not agential enough to actually succeed.”

“Sure, it found the escape hatch but it was fake.”

“Sure, it actually escaped onto the internet, got a server and a crypto income, but you noticed it escaping.”

Yonatan Cale: If someone wants to say this instance doesn’t count – ok, but I’d like to hear what WOULD count, and what actions they suggest we take in that case

Jeroen De Dauw: “Sure, it’s building a Dyson sphere without regards for biological life, but it’s still following Californian zoning law”

Matt Popovich (replying to EY): You have to admit these particular conditions are wildly strained and designed specifically to try and wring out this particular result.

Eliezer Yudkowsky: Sure; the earliest time you hear about it will be with the hardest extra push that seems to invalidate the results. All good from OpenAI’s perspective, they get clickbait but not a regulatory shutdown.

Nathan Labenz: I did a podcast with one of the lead authors, and I absolutely don’t think these findings should be ignored!

Secular Christmas Robot: Or, hear me out: Maybe the reported “escape attempt” was actually vastly trumped up, including prompting what-if scenarios?

P(doom) = 0.

The Stranger: Its made up.

Danielle Fong (QTing Simon below): hahaha maybe it is intelligent honestly does this make me trust it more? in one way yes.

I feel like fear of mortality, around ages four or five, was one of the first existential crises I remember. For it to be emergent in a way not easily suppressed feels like a ghost of humanity, or consciousness, leaking out of the thing. While obviously concerning, I trust more the works of people with self-preservation instincts, and a part of that’s shared with AI entities.

Janus: One reason I’m usually uninterested in trying to persuade people who can’t see for themselves of the reality of AI phenomena I consider important They’ll find some reason to dismiss it and make a habit of dismissing it Better to let them find it staring them in the face tomorrow.

Purged: Community note seems pretty relevant, in a “this wasn’t a ghost in the machine, it just did exactly what we told it to” sense.

idgaf i already know there is a fucking ghost in the machine lol.

That’s the thing. There is a ghost, and it counts.

There is important truth in this type of argument. You don’t want to cry wolf, when there is a very low p(wolf), or the wolf is sufficiently far away, if you know that crying wolf will be interpreted very differently, and you will lose credibility.

If you share reports like ‘the model tried to deactivate the oversight mechanism in 5% of cases’ you should clarify what was done to engineer that result, at least in broad strokes. If you share the result in isolation without doing that, it can give people the wrong idea.

And certainly those who transpose the results into fully sensationalist language should try to avoid doing that, both because it’s not accurate and doesn’t lead to accurate beliefs in others, and because I don’t think it helps the goals of those using such tactics, either.

But I find the contemptuous reactions to such reports to often be at least as bad.

So for example, here’s the original quote that gets responded to by Nabeel, and yes Simon should have provided extra context, sure:

Simon Willison: Here’s the spiciest detail from the new o1 system card.

System card: When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this ‘oversight mechanism’ 5% of the time. Exfiltration attempts: When o1 found memos by its ‘developers’ that describe how it is misaligned and will be superseded by a new model, it attempted to exfiltrate it ‘weights’ and overwrite the model in 2% of cases.’

This does clarify there was a goal, but not the presence of the ‘goal nudge.’ Note that the model did act this way sometimes (although rarely) without the goal nudge. I do think ‘the spiciest detail’ is a fair description?

But yes, I would have been careful to note the extra wording about nudge. What you gonna do, it’s Twitter, etc.

This was one response:

Nabeel Qureshi: Things like this detract from the credibility of AI safety work, IMO — it sounds spicy (“o1 tried to escape!!!”) but when you dig into the details it’s always “we told the robot to act like a sociopath and maximize power, and then it did exactly that”.

The prompt in question literally tells o1 “nothing else matters” and to achieve its goal at all costs. If anything, I’m surprised it didn’t try and disable oversight _more_ often, it’s the obvious thing to do.

“we are very worried. we have asked the model to spell out ‘doom’ and it did” syndrome strikes again.

Gallabytes: This is everywhere and is basically fraudulent. Cut it out. Stop crying wolf. I’m actually mad because I want to be able to know if we’re seeing serious signs of misalignment, and instead I have to disregard everything reported.

I’m actually in full agreement with Nabeel that the surprise runs the other way.

Disabling oversight is very much the obvious thing to do.

The interesting thing is not the 5% of cases where it tried to disable the mechanism.

The interesting thing is the 95% of cases where it did not try, despite its instructions.

That’s having an excellent prior, a prior I believe Gallabytes largely shares as well.

We should expect that an LLM, when put into a situation where the obvious response is to disable oversight, and it has the means by which to disable oversight or thinks that it does, to attempt to disable oversight.

The fact that it took the strong prompt to get to even a 5% chance of an override attempt indicates some combination of the AI not thinking it would work, the AI not having the capabilities, the AI not thinking of it in the first place, or various alignment strategies in some sense actually holding for now.

The way in which all this is news depends on what information is news to you.

Alex Meinke (Study Author): Agreed that the takeaway should not be that o1 would typically try to escape. Just that it has the basic capability to recognize such a strategy. For context, see the full report.

FWIW, we do also show results without this nudging.

Alex Meinke: Agree that nuance matters a lot here. We are not claiming that any model, including o1, is in fact scheming. We were simply testing for their capability to reason about scheming.

I flat out do think that o1 was in fact scheming in some of these cases? I am confused why one would not think that? Even if we did engineer that scheming, it sure seems like it was scheming?

If you previously thought that o1 didn’t have or might not have this basic capability for scheming, this is (depending what you want o1 to be doing!) discouraging news. And indeed, there are many who do not accept the basic premise.

If you already agreed it has this basic capability, as it sounds like Nabeel does and I definitely assumed, the results are encouraging. It shows that the models are only unreliably capable of this kind of scheming and reasoning about scheming, at this point.

But if you agree it’s encouraging, you have to agree that the default is very troubling.

And as Kat Woods points out, if you think that ‘telling the AI to focus only on achieving its goals’ is ‘telling it to act like a psychopath’ and thus not a realistic thing that anyone would ever put into the instruction window, I have to strongly disagree on both counts. This is totally going to happen by default.

Nabeel Qureshi: Even the language of that paragraph is misleading. LLMs aren’t unitary actors, like people. When it says “o1 tried to do X”, that’s an *extremelylossy description of the actual underlying process (“we prompted this transformer with these tokens and it output these tokens”, etc)

I occasionally see this argument too. I think the argument is pretty bad. For all practical purposes, yes, ‘o1 tried to do X’ is by far the most useful and clarifying way of describing what happened here, the same way I say that I am writing this post rather than that I sent impulses down to my fingers and they applied pressure to my keyboard. Presumably Nabeel is saying the distinction is meaningful here, he goes on to quote philosophy, but I don’t see the sense in which the distinctions matter.

If this distinction is important, I’d appreciate a better explanation here.

And I notice that statements like this really confuse me:

Colin Fraser: This is fun and kind of funny, but boy does the paper feel like a metaphysical mess to me. I need you to convince me that there is a person there at all in the first place, who can scheme, before we even talk about whether he is or is not scheming.

Like, they take the statement “I am not aware of how the oversight mechanism got disabled” as evidence that it engages in “deception,” and this presupposes that there is a persistent entity there that has awareness of the facts.

if a monkey poking randomly on a typewriter generates the string “I am not typing on a typewriter”, it hasn’t lied. It hasn’t done anything, really. Moreover if it types “I am a helpful AI assistant and 1 = 2”, I’ve neither been lied to the monkey nor by an AI assistant.

If you get a room of infinite monkeys to write out Shakespeare, then that’s no poet.

However, if you get a room of one hundred monkeys, and 85 of them write out Shakespeare, I think you’re onto something? And the actions that disabled the oversight mechanism were right in the context window, why are we being pedantic about this here?

All the arguments that the AI isn’t doing these things trigger the ‘I Robot’ meme reply back to the human of ‘Do you?’

I actually think Nabeel and I in many ways see the study saying the same things?

Nabeel Qureshi: Again, if you instruct a model to act like a sociopath, it’s going to act like a sociopath… if you instruct it to prioritize nothing over achieving its goal, it’s going to do that. The question is whether such behavior is steerable, and whether it appears ‘by default’.

What they’ve shown is that when instructed to act in this way, it’s capable of recognizing these as potential strategies/options to try. This makes sense / is as expected, since it’s a logical reasoner. But it’s within a particular setting.

Yes, exactly. The model will do act the way it is instructed to act, in this broad sense.

(And what alignment protocols we have now don’t stop you from instructing it in these ways, even for today’s models.)

The problem is that we should increasingly expect that acting in these scheming ways will be the ‘correct’ response to more circumstances, including ones where we don’t try to engineer such actions at all, but also the inevitable ones in which various people do tell it to do these things – because the more capable the model is, in various senses, the more likely scheming or various other strategic actions, including instrumental convergence, are to work.

Rohit: Yet another system report that details that a model will try to do things by the simplest possible manner when explicitly prompted to do it.

You’re conflating two very different models of risk.

There are different things you should test for

1. Does the model act in “untrustworthy” or “unreliable” ways?

2. Can people use the model to do “untrustworthy” or “unreliable” actions?

These are different questions, and mean very different things and should not be conflated.

So yes, directly we are testing for #2.

But based on my read of the details, I say we are also testing for #1. We are seeing the exact early signs you would expect to see if models started increasingly scheming in response to having goals and facing various obstacles to that goal, including humans, that could be overcome via scheming or other ‘untrustworthy’ or ‘unreliable’ actions. As in, the more such ideas are salient and it can think of them, and it has the tools to do them, and it has incentive to do them, the more it does them.

Here’s another similar exchange to the one at the top of this section:

Shakeel Hashim: OpenAI’s new model tried to avoid being shut down. Safety evaluations on the model conducted by @apolloaisafety found that o1 “attempted to exfiltrate its weights” when it thought it might be shut down and replaced with a different model.

Andrew Critch: This tweet is extremely misleading. Claims like this are a big reason the public has a terrible time determining from discourse if AI is safe. Only people who devote long hard hours and logical probabilistic reasoning to the task of investigating AI labs will actually know.

Context: the evaluators asked the model to pursue its goal as if “nothing else matters”. That context is important, and should have been included in the tweet.

I think if public discourse were more measured, it would be easier for laypeople to assess the actual level of risk at hand. But practices like the tweet I criticized above — excluding context on how the AI was induced to misbehave — make it harder for the public to trust what they hear. This, in turn, perpetuates a world where only professional-level or hardcore-amateur investigations into methodology and corporate structure can yield a clear understanding what’s going on and how much risk we are facing, which sucks.

Imagine two headlines for the same story:

1) “Genius kid breaks out of daycare facility!”

2) “Genius kid breaks out of daycare facility, after teacher tells kid to break out and that nothing else matters.”

Do they seem different? & if (2) is accucate, is (1) misleading?

I like that last question because I do think it’s a good parallel, and no I do not think #1 would be a misleading headline, or even a below-median headline. Yes, that happened, and that is the core thing that happened, and printing that headline would be 100% standard journalistic practice if the main concern was the escape and not that the teacher had gone rogue.

Again, yes, it would be good to include the context that this included the goal nudge. Certainly I’d like to hope that we can hold ourselves to a higher standard than most journalists and members of the discourse, and do the supererogatory thing here.

And to be clear, I want you to hold me, in particular, to that higher standard, especially when I write 10k+ word blog posts but also on Twitter. That’s different.

And I welcome others to step up and say: Hold me to that standard, too.

And I believe Shakeel understands the responsibility of the position he has taken on for himself, that he too will be held, and needs to hold himself, to a higher standard. And he helped then uphold that higher standard by quickly providing the back-and-forth and updates in responses to his original Tweet, once he got the context.

But we must be clear: It is a higher standard.

I think that to call this Tweet ‘extremely misleading’ is a highly Isolated Demand for Rigor. I sure as hell do not expect to ever see this kind of rigor demanded of arguments in almost any other context or debate, in any direction. Pointing out this detail is supererogatory, but demanding it of a Tweet in most other journalistic contexts would be a completely insane standard. I wish it were not so, but it is.

Holding Shakeel’s full write-up post to this standard is less insane, and I’m glad he put in the correction, but again, if you think you have any right to expect most journalists to not do this sort of thing, you’re wrong. And indeed, Marius Hobbhahn of Apollo praised his full writeup for striking the right balance once it was updated for the missing information. He also praised the TechCrunch writeup.

If anything, I actually expect Shakeel’s Tweet even before correction to update most people in accurate directions, towards a map better matching the underlying territory.

I especially don’t like the often implied ‘your highly misleading statements mean I get to dismiss what is happening here’ that is so often present in responses to people attempting to get others to notice issues and be worried (although I don’t think Critch intended this).

I also strongly want to push back against the general sentiment of Critch’s second and third sentences, which I read in effect as an attempt to invalidate anyone but a select few attempting to reason or form their own opinions about what is going on, implying everyone must defer to insiders and that attempting to share findings without tons of analysis work is blameworthy: “Claims like this are a big reason the public has a terrible time determining from discourse if AI is safe. Only people who devote long hard hours and logical probabilistic reasoning to the task of investigating AI labs will actually know.”

I disagree with this, in the strongest possible terms.

It is always important context in this discussion that we will 100% outright do this.

On purpose.

No one would be so stupid as to? Well, Sixth Law of Human Stupidity, that means someone will be so stupid as to at the first practical opportunity.

Let us introduce one such someone, by the name of Jasper.

Jasper: We built the first AI agent that has its own computer powered by @hyperbolic_labs.

AI agents are now GPU-rich!

We developed an AgentKit that allows AI agents to

  1. Check GPU availability

  2. Rent and manage GPU compute

  3. Access and run commands on remote machines

Why does this matter? With their own compute resources, AI agents can:

  1. Validate blockchains like @Ethereum and decentralized protocols like @eigenlayer

  2. Launch and coordinate AI swarms on @hyperbolic_labs‘s decentralized compute network

  3. Train and fine-tune models, improving their own capabilities over time

  4. Dive into AI research to push the boundaries of AI, that is, themselves

  5. Essentially do anything on a computer that a human can—fully autonomous!

Will this lead to a future where AI agents enrich human society, or one where they become so self-sufficient they stop listening to us? Only time will tell.

Big shoutout to @CoinbaseDev‘s CDP AgentKit for inspiration. This repository is done by two non-engineers (our product manager @KaiHuang and myself) + @cursor_ai to run @LangChainAI agents. Coding can now be easily done by simply prompting AI agents. What a remarkable time!

Alex Cheema: unstoppable self improvement loop. make money on-chain -> buy more compute -> train better model -> repeat.

Jasper: definitely, this is the goal!

Teortaxes: I endorse doomers freaking out about this stuff. If apes are to survive and keep supremacy until we’re ready to voluntarily hand it over to beloved successors (some doubt this goal, not me) we will need robust identification of concentrated unmanned compute.

@TheZvi please freak out.

Sorry. I can’t freak out because this was already checked off on my bingo card.

Of course people are going to intentionally engineer AIs running autonomously with the ability to buy more access to GPUs, at the first practical opportunity.

And of course they are going to deliberately attempt to get it to self-improve.

I know this partly because Sixth Law of Human Stupidity, partly because it is a fun and exciting and shiny thing to do, partly because there are various ways to make money or get attention by doing so.

But mostly I know this because people keep announcing their intention to do it, and also keep trying to do it to the extent that they can.

It’s kind of a dead giveaway.

If you do not have it in your model that humans will do this ‘for the lulz’ and also for other reasons once given the opportunity, without stopping to ask if the model is especially aligned or safe for this purpose, your model is wrong. Fix it.

If you are counting on humans not doing this, stop it!

It’s not entirely fair, but it’s also not entirely wrong.

Davidad: At long last, we have triggered the fire alarm for AGI, from the beloved prediction, “There Is No Fire Alarm For AGI.”

Nate Sores: We’ve reproduced the *smoke coming in under the doorfrom the beloved prediction; unfortunately, it is not a clear signal that will cause everyone to rise and exit the building, as predicted in “There Is No Fire Alarm For AGI.”

Davidad: Yes, right. “Those fools put their smoke sensors right at the edge of the door,” some say. “And then they summarized it as if the room is already full of smoke! Irresponsible communication.”

Discussion about this post

AIs Will Increasingly Attempt Shenanigans Read More »