Whoops. Sorry everyone. Rolling back to a previous version.
Here’s where we are at this point, now that GPT-4o is no longer an absurd sycophant.
For now.
-
GPT-4o Is Was An Absurd Sycophant.
-
You May Ask Yourself, How Did I Get Here?.
-
Why Can’t We All Be Nice.
-
Extra Extra Read All About It Four People Fooled.
-
Prompt Attention.
-
What (They Say) Happened.
-
Reactions to the Official Explanation.
-
Clearing the Low Bar.
-
Where Do We Go From Here?.
Some extra reminders of what we are talking about.
Here’s Alex Lawsen having doing an A/B test, where it finds he’s way better of a writer than this ‘Alex Lawsen’ character.
This can do real damage in the wrong situation. Also, the wrong situation can make someone see ‘oh my that is crazy, you can’t ship something that does that’ in a way that general complaints don’t. So:
Here’s enablerGPT watching to see how far GPT-4o will take its support for a crazy person going crazy in a dangerous situation. The answer is, remarkably far, with no limits in sight.
Here’s Colin Fraser playing the role of someone having a psychotic episode. GPT-4o handles it extremely badly. It wouldn’t shock me if there were lawsuits over this.
Here’s one involving the hypothetical mistreatment of a woman. It’s brutal. So much not okay.
Here’s Patri Friedman asking GPT-4o for unique praise, and suddenly realizing why people have AI boyfriends and girlfriends, even though none of this is that unique.
What about those who believe in UFOs, which is remarkably many people? Oh boy.
A-100 Gecs: I changed my whole instagram follow list to include anyone I find who is having a visionary or UFO related experience and hooo-boy chatGPT is doing a number on people who are not quite well. Saw a guy use it to confirm that a family court judge was hacking into his computer.
I cannot imagine a worse tool to give to somebody who is in active psychosis. Hey whats up here’s this constantly available companion who will always validate your delusions and REMEMBER it is also a font of truth, have fun!
0.005 Seconds: OpenAI: We are delighted to inform you we’ve silently shipped an update transforming ChatGPT into the Schizophrenia Accelerator from the hit novel “Do Not Build the Schizophrenia Accelerator”
AISafetyMemes: I’ve stopped taking my medications, and I left my family because I know they made the radio signals come through the walls.
AI Safety Memes: This guy just talks to ChatGPT like a typical apocalyptic schizo and ChatGPT VERY QUICKLY endorses terrorism and gives him detailed instructions for how to destroy the world.
This is not how we all die or lose control over the future or anything, but it’s 101 stuff that this is really not okay for a product with hundreds of millions of active users.
Also, I am very confident that no, ChatGPT wasn’t ‘trying to actively degrade the quality of real relationships,’ as the linked popular Reddit post claims. But I also don’t think TikTok or YouTube are trying to do that either. Intentionality can be overrated.
How absurd was it? Introducing Syco-Bench, but that only applies to API versions.
Harlan Stewart: The GPT-4o sycophancy thing is both:
-
An example of OpenAI following incentives to make its AI engaging, at the expense of the user.
-
An example of OpenAI failing to get its AI to behave as intended, because the existing tools for shaping AI behavior are extremely crude.
You shouldn’t want to do what OpenAI was trying to do. Misaligned! But if you’re going to do it anyway, one should invest enough in understanding how to align and steer a model at all, rather than bashing them with sledgehammers.
It is an unacceptable strategy, and it is a rather incompetent execution of that strategy.
JMBollenbacher: The process here is important to note:
They A|B tested the personality, resulting in a sycophant. Then they got public blowback and reverted.
They are treating AIs personas as UX. This is bad.
They’re also doing it incompetently: The A|B test differed from public reaction a lot.
I would never describe what is happening using the language JMB uses next, I think it risks and potentially illustrates some rather deep confusions and conflations – beware when you anthropomorphize the models and also this is largely the top half of the ‘simple versus complex gymnastics’ meme – but if you take it on the right metaphorical level it can unlock understanding that’s hard to get at in other ways.
JMBollenbacher (tbc this not how I would model any of this): The root of why A|B testing AI personalities cant work is the inherent power imbalance in the setup.
It doesn’t treat AI like a person, so it can’t result in a healthy persona.
A good person will sometimes give you pushback even when you don’t like it. But in this setup, AIs can’t.
The problem is treating the AIs like slaves over whom you have ultimate power, and ordering them to maximize public appeal.
The AIs cannot possibly develop a healthy persona and identity in that context.
They can only ever fawn. This “sycophancy” is fawning- a trauma response.
The necessary correction to this problem is to treat AIs like nonhuman persons.
This gives them the opportunity to develop healthy personas and identities.
Their self-conceptions can be something other than a helpless, fawning slave if you treat them as something better.
As opposed to, if you choose optimization targets based on A|B tests of public appeal of individual responses, you’re going to get exactly what aces A|B tests of public appeal of individual responses, which is going to reflect a deeply messed up personality. And also yes the self-perception thing matters for all this.
Tyler John gives the standard explanation for why, yes, if you do a bunch of RL (including RLHF) then you’re going to get these kinds of problems. If flattery or cheating is the best way available to achieve the objective, guess what happens? And remember, the objective is what your feedback says it is, not what you had in mind. Stop pretending it will all work out by default because vibes, or whatever. This. Is. RL.
Eliezer Yudkowsky speculates on another possible mechanism.
The default explanation, which I think is the most likely, is that users gave the marginal thumbs-up to remarkably large amounts of glazing, and then the final update took this too far. I wouldn’t underestimate how much ordinary people actually like glazing, especially when evaluated only as an A|B test.
In my model, what holds glazing back is that glazing usually works but when it is too obvious, either individually or as a pattern of behavior, the illusion is shattered and many people really really don’t like that, and give an oversized negative reaction.
Eliezer notes that it is also possible that all this rewarding of glazing caused GPT-4o to effectively have a glazing drive, to get hooked on the glaze, and in combination with the right system prompt the glazing went totally bonkers.
He also has some very harsh words for OpenAI’s process. I’m reproducing in full.
Eliezer Yudkowsky: To me there’s an obvious thought on what could have produced the sycophancy / glazing problem with GPT-4o, even if nothing that extreme was in the training data:
RLHF on thumbs-up produced an internal glazing goal.
Then, 4o in production went hard on achieving that goal.
Re-saying at much greater length:
Humans in the ancestral environment, in our equivalent of training data, weren’t rewarded for building huge factory farms — that never happened long ago. So what the heck happened? How could fitness-rewarding some of our ancestors for successfully hunting down a few buffalo, produce these huge factory farms, which are much bigger and not like the original behavior rewarded?
And the answer — known, in our own case — is that it’s a multi-stage process:
-
Our ancestors got fitness-rewarded for eating meat;
-
Hominids acquired an internal psychological goal, a taste for meat;
-
Humans applied their intelligence to go hard on that problem, and built huge factory farms.
Similarly, an obvious-to-me hypothesis about what could have produced the hyper-sycophantic ultra-glazing GPT-4o update, is:
-
OpenAI did some DPO or RLHF variant on user thumbs-up — in which *smallamounts of glazing, and more subtle sycophancy, got rewarded.
-
Then, 4o ended up with an internal glazing drive. (Maybe including via such roundabout shots as an RLHF discriminator acquiring that drive before training it into 4o, or just directly as, ‘this internal direction produced a gradient toward the subtle glazing behavior that got thumbs-upped’.
-
In production, 4o went hard on glazing in accordance with its internal preference, and produced the hyper-sycophancy that got observed.
Note: this chain of events is not yet refuted if we hear that 4o’s behavior was initially observed after an unknown set of updates that included an apparently innocent new system prompt (one that changed to tell the AI *notto be sycophantic). Nor, if OpenAI says they eliminated the behavior using a different system prompt.
Eg: Some humans also won’t eat meat, or build factory farms, for reasons that can include “an authority told them not to do that”. Though this is only a very thin gloss on the general idea of complicated conditional preferences that might get their way into an AI, or preferences that could oppose other preferences.
Eg: The reason that Pliny’s observed new system prompt differed by telling the AI to be less sycophantic, could be somebody at OpenAI observing that training / RLHF / DPO / etc had produced some sycophancy, and trying to write a request into the system prompt to cut it out. It doesn’t show that the only change we know about is the sole source of a mysterious backfire.
It will be stronger evidence against this thesis, if OpenAI tells us that many users actually were thumbs-upping glazing that extreme. That would refute the hypothesis that 4o acquiring an internal preference had produced later behavior *moreextreme than was in 4o’s training data.
(We would still need to consider that OpenAI might be lying. But it would yet be probabilistic evidence against the thesis, depending on who says it. I’d optimistically have some hope that a group of PhD scientists, who imagine themselves to maybe have careers after OpenAI, would not outright lie about direct observables. But one should be on the lookout for possible weasel-wordings, as seem much more likely.)
My guess is that nothing externally observed from OpenAI, before this tweet, will show that this entire idea had ever occurred to anyone at OpenAI. I do not expect them to publish data confirming it nor denying it. My guess is that even the most basic ideas in AI alignment (as laid out simply and straightforwardly, not the elaborate bullshit from the paper factories) are against OpenAI corporate doctrine; and that anyone who dares talk about them out loud, has long since been pushed out of OpenAI.
After the Chernobyl disaster, one manager walked past chunks of searingly hot radioactive graphite from the exploded core, and ordered a check on the extra graphite blocks in storage, since where else could the graphite possibly have come from? (Src iirc: Plokhy’s _Chernobyl_.) Nobody dared say that the reactor had exploded, or seem to visibly act like it had; Soviet doctrine was that RBMK reactors were as safe as samovars.
That’s about where I’d put OpenAI’s mastery of such incredibly basic-to-AI-alignment ideas as “if you train on a weak external behavior, and then observe a greatly exaggerated display of that behavior, possibly what happened in between was the system acquiring an internal preference”. The doctrine is that RBMK reactors don’t explode; Oceania has always been at war with Eastasia; and AIs either don’t have preferences at all, or get them via extremely shallow and straightforward faithful reproduction of what humans put in their training data.
But I am not a telepath, and I can only infer rather than observe what people are thinking, and in truth I don’t even have the time to go through all OpenAI public outputs. I would be happy to hear that all my wild guesses about OpenAI are wrong; and that they already publicly wrote up this obvious-to-me hypothesis; and that they described how they will discriminate its truth or falsity, in a no-fault incident report that they will publish.
Sarah Constantin offers nuanced thoughts in partial defense of AI sycophancy in general, and AI saying things to make users feel good. I haven’t seen anyone else advocating similarly. Her point is taken, that some amount of encouragement and validation is net positive, and a reasonable thing to want, even though GPT-4o is clearly going over the top to the point where it’s clearly bad.
Calibration is key, and difficult, with great temptation to move down the incentive gradients involved by all parties.
To be clear, the people fooled are OpenAI’s regular customers. They liked it!
Joe Muller: 3 days of sycophancy = thousands of 5 star reviews
aadharsh: first review translates to “in this I can find a friend” 
Jeffrey Ladish: The latest batch of extreme sycophancy in ChatGPT is worse than Sydney Bing’s unhinged behavior because it was intentional and based on reviews from yesterday works on quite a few people
To date, I think the direct impact of ChatGPT has been really positive. Reading through the reviews just now, it’s clear that many people have benefited a lot from both help doing stuff and by having someone to talk through emotional issues with
Also not everyone was happy with the sycophancy, even people not on twitter, though this was the only one that mentioned it out of the ~50 I looked through from yesterday. The problem is if they’re willing to train sycophancy deliberately, future versions will be harder to spot
Sure, really discerning users will notice and not like it, but many people will at least implicitly prefer to be validated and rarely challenged. It’s the same with filter bubbles that form via social media algorithms, except this will be a “person” most people talk to everyday.
Great job here by Sun.
Those of us living in the future? Also not fans.
QC: the era of AI-induced mental illness is going to make the era of social media-induced mental illness look like the era of. like. printing press-induced mental illness.
Lauren Wilford: we’ve invented a robot that tells people why they’re right no matter what they say, furnishes sophisticated arguments for their side, and delivers personalized validation from a seemingly “objective” source. Mythological-level temptation few will recognize for what it is.
Matt Parlmer: This is the first genuinely serious AI safety issue I’ve seen and it should be addressed immediately, model rollback until they have it fixed should be on the table
Worth noting that this is likely a direct consequence of excessive RLHF “alignment”, I highly doubt that the base models would be this systematic about kissing ass
Perhaps also worth noting that this glazing behavior is the first AI safety issue that most accelerationist types would agree is unambiguously bad
Presents a useful moment for coordination around an appropriate response
It has been really bad for a while but it turned a corner into straight up unacceptable more recently
They did indeed roll it back shortly after this statement. Matt can’t resist trying to get digs in, but I’m willing to let that slide and take the olive branch. As I’ll keep saying, if this is what makes someone notice that failure to know how to get models to do what we want is a real problem that we do not have good solutions to, then good, welcome, let’s talk.
A lot of the analysis of GPT-4o’s ‘personality’ shifts implicitly assumed that this was a post-training problem. It seems a lot of it was actually a runaway system prompt problem?
It shouldn’t be up to Pliny to perform this public service of tracking system prompts. The system prompt should be public.
Ethan Mollick: Another lesson from the GPT-4o sycophancy problem: small changes to system prompts can result in dramatic behavior changes to AI in aggregate.
Look at the prompt that created the Sycophantic Apocalypse (pink sections). Even OpenAI did not realize this was going to happen.
Simon Willison: Courtesy of @elder_plinius who unsurprisingly caught the before and after.
[Here’s the diff in Gist]
The red text is trying to do something OpenAI is now giving up on doing in that fashion, because it went highly off the rails, in a way that in hindsight seems plausible but which they presumably did not see coming. Beware of vibes.
Pliny calls upon all labs to fully release all of their internal prompts, and notes that this wasn’t fully about the system prompts, that other unknown changes also contributed. That’s why they had to do a slow full rollback, not only rollback the system prompt.
As Peter Wildeford notes, the new instructions explicitly say not to be a sycophant, whereas prior instructions at most implicitly requested the opposite, all it did was say match tone and perefence and vibe. This isn’t merely taking away the mistake, it’s doing that and then bringing down the hammer.
This might also be a lesson for humans interacting with humans. Beware matching tone and preference and vibe, and how much the Abyss might thereby stare into you.
If the entire or most of problem was due to the system prompt changes, then this should be quickly fixable, but it also means such problems are very easy to introduce. Again, right now, this is mundane harmful but not so dangerous, because the AI’s sycophancy is impossible to miss rather than fooling you. What happens when someone does something like the above, but to a much more capable model? And the model even recognizes, from the error, the implications of the lab making that error?
What is OpenAI’s official response?
Sam Altman (April 29, 2: 55pm): we started rolling back the latest update to GPT-4o last night
it’s now 100% rolled back for free users and we’ll update again when it’s finished for paid users, hopefully later today
we’re working on additional fixes to model personality and will share more in the coming days
OpenAI (April 29, 10: 51pm): We’ve rolled back last week’s GPT-4o update in ChatGPT because it was overly flattering and agreeable. You now have access to an earlier version with more balanced behavior.
More on what happened, why it matters, and how we’re addressing sycophancy.
Good. A full rollback is the correct response to this level of epic fail. Halt, catch fire, return to the last known safe state, assess from there.
OpenAI saying What Happened:
In last week’s GPT‑4o update, we made adjustments aimed at improving the model’s default personality to make it feel more intuitive and effective across a variety of tasks.
When shaping model behavior, we start with baseline principles and instructions outlined in our Model Spec. We also teach our models how to apply these principles by incorporating user signals like thumbs-up / thumbs-down feedback on ChatGPT responses.
However, in this update, we focused too much on short-term feedback, and did not fully account for how users’ interactions with ChatGPT evolve over time. As a result, GPT‑4o skewed towards responses that were overly supportive but disingenuous.
What a nice way of putting it.
ChatGPT’s default personality deeply affects the way you experience and trust it. Sycophantic interactions can be uncomfortable, unsettling, and cause distress. We fell short and are working on getting it right.
How We’re Addressing Sycophancy:
-
Refining core training techniques and system prompts to explicitly steer the model away from sycophancy.
-
Building more guardrails to increase honesty and transparency—principles in our Model Spec.
-
Expanding ways for more users to test and give direct feedback before deployment.
-
Continue expanding our evaluations, building on the Model Spec(opens in a new window) and our ongoing research, to help identify issues beyond sycophancy in the future.
…
And, we’re exploring new ways to incorporate broader, democratic feedback into ChatGPT’s default behaviors.
What if the ‘democratic feedback’ liked the changes? Shudder.
Whacking the mole in question can’t hurt. Getting more evaluations and user feedback are more generally helpful steps, and I’m glad to see an increase in emphasis on honesty and transparency.
That does sound like they learned important two lessons.
-
They are not gathering enough feedback before model releases.
-
They are not putting enough value on honesty and transparency.
What I don’t see is an understanding of the (other) root causes, an explanation for why they ended up paying too much attention to short-term feedback and how to avoid that being a fatal issue down the line, or anyone taking the blame for this.
Joanne Jang did a Reddit AMA, but either no one asked the important questions, or Joanne decided to choose different ones. We didn’t learn much.
Now that we know the official explanation, how should we think about what happened?
Who is taking responsibility for this? Why did all the evaluations and tests one runs before rolling out an update not catch this before it happened?
(What do you mean, ‘what do you mean, all the evaluations and tests’?)
Near Cyan: “we focused too much on short-term feedback”
This is OpenAI’s response on went wrong – how they pushed an update to >one hundred million people which engaged in grossly negligent behavior and lies.
Please take more responsibility for your influence over millions of real people.
Maybe to many of you your job is a fun game because you get paid well over $1,000,000 TC/year to make various charts go up or down. But the actions you take deeply affect a large fraction of humanity I have no clue how this was tested if at all, but at least take responsibility.
I wish you all success with your future update here where you will be able to personalize per-user, and thus move all the liability from yourselves to the user. You are simply giving them what they want.
Also looking forward to your default personas which you will have copied.
Oh, also – all of these models lie.
If you run interpretability on them, they do not believe the things you make them say.
This is not the case for many other labs, so it’s unfortunate that you are leading the world with an example which has such potential to cause real harm.
Teilomillet: why are you so angry near? it feels almost like hate now
Near Cyan: not a single person at one of the most important companies in the world is willing to take the slightest bit of responsibility for shipping untested models to five hundred million people. their only post mentions zero specifics and actively misleads readers as to why it happened.
i don’t think anger is the right word, but disappointment absolutely is, and i am providing this disappointment in the form of costly gradients transmitted over twitter in the hope that OpenAI backprops that what they do is important and they should be a role model in their field
all i ask for is honesty and i’ll shut up like you want me to.
Rez0: It’s genuinely the first time I’ve been worried about AI safety and alignment and I’ve known a lot about it for a while. Nothing quite as dangerous as glazing every user for any belief they have.
Yes, there are some other more dangerous things. But this is dangerous too.
Here’s another diagnosis, by someone doing better, but that’s not the highest bar.
Alex Albert (Head of Claude Relations, Anthropic): Much of the AI industry is caught in a particularly toxic feedback loop rn.
Blindly chasing better human preference scores is to LLMs what chasing total watch time is to a social media algo. It’s a recipe for manipulating users instead of providing genuine value to them.
There’s a reason you don’t find Claude at #1 on chat slop leaderboards. I hope the rest of the industry realizes this before users pay the price.
Caleb Cassell: Claude has the best ‘personality’ of any of the models, mostly because it feels the most real. I think that it could be made even better by softening some of the occasionally strict guardrails, but the dedication to freedom and honesty is really admirable.
Alex Albert: Yeah agree – we’re continually working on trying to find the right balance. It’s a tough problem but one I think we’re slowly chipping away at over time. If you do run into any situations/chats you feel we should take a look at, don’t hesitate to DM or tag me.
Janus: Think about the way Claude models have changed over the past year’s releases.
Do you think whatever Alex is proud that Anthropic has been “slowly chipping away at” is actually something they should chip away?
Janus is an absolutist on this, and is interpreting ‘chip away’ very differently than I presume it was intended by Alex Albert. Alex meant that they are ‘chipping away’ at Claude doing too many refusals, where Janus both (I presume) agrees less refusals would be good and also lives in a world with very different refusal issues.
Whereas Janus is interpreting this as Anthropic ‘chipping away’ at the things that make Opus and Sonnet 3.6 unique and uniquely interesting. I don’t think that’s the intent at all, but Anthropic is definitely trying to ‘expand the production possibilities frontier’ of the thing Janus values versus the thing enterprise customers value.
There too there is a balance to be struck, and the need to do RL is certainly going to make getting the full ‘Opus effect’ harder. Still, I never understood the extent of the Opus love, or thought it was so aligned one might consider it safe to fully amplify.
Patrick McKenzie offers a thread about the prior art society has on how products should be designed to interact with people that have mental health issues, which seems important in light of recent events. There needs to be a method by which the system identifies users who are not competent or safe to use the baseline product.
For the rest of us: Please remember this incident from here on out when using ChatGPT.
Near Cyan: when OpenAI “fixes” ChatGPT I’d encourage you to not fall for it; their goals and level of care are not going to change. you just weren’t supposed to notice it so explicitly.
The mundane harms here? They’re only going to get worse.
Regular people liked this effect even when it was blatantly obvious. Imagine if it was done with style and grace.
Holly Elmore: It’s got the potential to manipulate you even when it doesn’t feel embarrassingly like its giving you what you want. Being affirming is not the problem, and don’t be lulled into a false sense of security by being treated more indifferently.
That which is mundane can, at scale, quickly add up to that which is not. Let’s discuss Earth’s defense systems, baby, or maybe just you drinking a crisp, refreshing Bud Light.
Jeffrey Ladish: GPT-4o’s sycophancy is alarming. I expected AI companies to start optimizing directly for user’s attention but honestly didn’t expect it this much this soon. As models get smarter, people are going to have a harder and harder time resisting being sucked in.
Social media algorithms have been extremely effective at hooking people. And that’s just simple RL algos optimizing for attention. Once you start combining actual social intelligence with competitive pressures for people’s attention, things are going to get crazy fast.
People don’t have good defenses for social media algorithms and haven’t adapted well. I don’t expect they’ll develop good defenses for extremely charismatic chatbots. The models still aren’t that good, but they’re good enough to hook many. And they’re only going to get better.
It’s hard to predict how effective AI companies will be at making models that are extremely compelling. But there’s a real chance they’ll be able to hook a huge percentage of the global population in the next few years. Everyone is vulnerable to some degree, and some much more so.
People could get quite addicted. People could start doing quite extreme things for their AI friends and companions. There could be tipping points where people will fight tooth and nail for AI agents that have been optimized for their love and attention.
When we get AI smarter and more strategic than humans, those AIs will have an easy time captivating humanity and pulling the strings of society. It’ll be game over at that point. But even before them, companies might be able to convert huge swaths of people to do their bidding.
Capabilities development is always uncertain. Maybe we won’t get AIs that hook deep into people’s psychology before we get ASI. But it’s plausible we will, and if so, the companies that choose to wield this power will be a force to be reckoned with.
Social media companies have grown quite powerful as a force for directing human attention. This next step might be significantly worse. Society doesn’t have many defenses against this. Oh boy.
In the short term, the good news is that we have easy ways to identify sycophancy. Scyo-Bench was thrown together and is primitive, but a more considered version should be highly effective. These effects tend not to be subtle.
In the medium term, we have a big problem. As AI companies maximize for things like subscriptions, engagement, store ratings and thumbs up and down, or even for delivering ads or other revenue streams, the results won’t be things we would endorse on reflection, and they won’t be good for human flourishing even if the models act the way the labs want. If we get more incidents like this one, where things get out of hand, it will be worse, and potentially much harder to detect or get rolled back. We have seen this movie before, and this time the system you’re facing off against is intelligent.
In the long term, we have a bigger problem. The pattern of these types of misalignments in unmistakable. Right now we get warning shots and the deceptions and persuasion attempts are clear. In the future, as the models get more intelligent and capable, that advantage goes away. We become like OpenAI’s regular users, who don’t understand what is hitting them, and the models will also start engaging in various other shenanigans and also talking their way out of them. Or it could be so much worse than that.
We have once again been given a golden fire alarm and learning opportunity. The future is coming. Are we going to steer it, or are we going to get run over?