Claude’s AI research mode now runs for up to 45 minutes before delivering reports

Still, the report contained a direct quote attributed to William Higinbotham that appears to combine quotes from two sources not cited in the source list. (One must always be careful with confabulated quotes in AI because, even outside of this Research mode, Claude 3.7 Sonnet tends to invent plausible ones to fit a narrative.) We recently covered a study showing that AI search services frequently confabulate sources, and in this case, the sources Claude Research surfaced, while real, did not always match what is stated in the report.

There’s always room for interpretation and variation in detail, of course, but overall, Claude Research did a relatively good job crafting a report on this particular topic. Still, you’d want to dig more deeply into each source and confirm everything if you used it as the basis for serious research. You can read the full Claude-generated result as this text file, saved in Markdown format. Sadly, the Markdown version does not include the source URLs found in the Claude web interface.

Integrations feature

Anthropic also announced Thursday that it has broadened Claude’s data access capabilities. In addition to web search and Google Workspace integration, Claude can now search any connected application through the company’s new “Integrations” feature. The feature reminds us somewhat of OpenAI’s ChatGPT Plugins feature from March 2023 that aimed for similar connections, although the two features work differently under the hood.

These Integrations allow Claude to work with remote Model Context Protocol (MCP) servers across web and desktop applications. The MCP standard, which Anthropic introduced last November and we covered in April, connects AI applications to external tools and data sources.
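For a concrete sense of what backs an Integration, here is a minimal sketch of an MCP server using the Python SDK’s FastMCP helper (this assumes the `mcp` package; the server name and tool are hypothetical, and a real remote Integration would expose an HTTP-based transport rather than the default local stdio one):

```python
# Minimal MCP server sketch (hypothetical tool, illustrative only).
from mcp.server.fastmcp import FastMCP

# The server advertises a name plus whatever tools it registers.
mcp = FastMCP("ticket-tracker")

@mcp.tool()
def open_ticket_count(project: str) -> int:
    """Return the number of open tickets for a project (stubbed for illustration)."""
    # A real server would call the tracker's API here.
    fake_db = {"website-redesign": 7, "mobile-app": 12}
    return fake_db.get(project, 0)

if __name__ == "__main__":
    # Runs over stdio by default, suitable for a local desktop connection;
    # a remote Integration would sit behind an HTTP-based transport instead.
    mcp.run()
```

Once connected, Claude can discover the registered tools and call them with structured arguments, which is what lets it act on data in services like Jira or Zapier.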

At launch, Claude supports Integrations with 10 services, including Atlassian’s Jira and Confluence, Zapier, Cloudflare, Intercom, Asana, Square, Sentry, PayPal, Linear, and Plaid. The company plans to add more partners like Stripe and GitLab in the future.

Each integration aims to expand Claude’s functionality in specific ways. The Zapier integration, for instance, reportedly connects thousands of apps through pre-built automation sequences, allowing Claude to automatically pull sales data from HubSpot or prepare meeting briefs based on calendar entries. With Atlassian’s tools, Anthropic says that Claude can collaborate on product development, manage tasks, and create multiple Confluence pages and Jira work items simultaneously.

Anthropic has made its advanced Research and Integrations features available in beta for users on Max, Team, and Enterprise plans, with Pro plan access coming soon. The company has also expanded its web search feature (introduced in March) to all Claude users on paid plans globally.

OpenAI Preparedness Framework 2.0

Right before releasing o3, OpenAI updated its Preparedness Framework to 2.0.

I previously wrote an analysis of the Preparedness Framework 1.0. I still stand by essentially everything I wrote in that analysis, which I reread to prepare before reading the 2.0 framework. If you want to dive deep, I recommend starting there, as this post will focus on changes from 1.0 to 2.0.

As always, I thank OpenAI for publishing the document and laying out their approach and plans.

I have several fundamental disagreements with the thinking behind this document.

In particular:

  1. The Preparedness Framework only applies to specific named and measurable things that might go wrong. It requires identification of a particular threat model that is all of: Plausible, measurable, severe, net new and (instantaneous or irremediable).

  2. The Preparedness Framework thinks ‘ordinary’ mitigation defense-in-depth strategies will be sufficient to handle High-level threats and likely even Critical-level threats.

I disagree strongly with these claims, as I will explain throughout.

I knew that #2 was likely OpenAI’s default plan, but it wasn’t laid out explicitly.

I was hoping that OpenAI would realize their plan did not work, or come up with a better plan when they actually had to say their plan out loud. This did not happen.

In several places, things I criticize OpenAI for here are also things the other labs are doing. I try to note that, but ultimately this is reality we are up against. Reality does not grade on a curve.

Do not rely on Appendix A as a changelog. It is incomplete.

  1. Persuaded to Not Worry About It.

  2. The Medium Place.

  3. Thresholds and Adjustments.

  4. Release the Kraken Anyway, We Took Precautions.

  5. Misaligned!

  6. The Safeguarding Process.

  7. But Mom, Everyone Is Doing It.

  8. Mission Critical.

  9. Research Areas.

  10. Long-Range Autonomy.

  11. Sandbagging.

  12. Replication and Adaptation.

  13. Undermining Safeguards.

  14. Nuclear and Radiological.

  15. Measuring Capabilities.

  16. Questions of Governance.

  17. Don’t Be Nervous, Don’t Be Flustered, Don’t Be Scared, Be Prepared.

Right at the top we see a big change. Key risk areas are being downgraded and excluded.

The Preparedness Framework is OpenAI’s approach to tracking and preparing for frontier capabilities that create new risks of severe harm.

We currently focus this work on three areas of frontier capability, which we call Tracked Categories:

• Biological and Chemical capabilities that, in addition to unlocking discoveries and cures, can also reduce barriers to creating and using biological or chemical weapons.

• Cybersecurity capabilities that, in addition to helping protect vulnerable systems, can also create new risks of scaled cyberattacks and vulnerability exploitation.

• AI Self-improvement capabilities that, in addition to unlocking helpful capabilities faster, could also create new challenges for human control of AI systems.

The change I’m fine with is that CBRN (chemical, biological, nuclear and radiological) has been narrowed to only biological and chemical. I do consider biological by far the biggest of the four threats. Nuclear and radiological have been demoted to ‘research categories,’ where there might be risk in the future and monitoring may be needed. I can live with that. Prioritization is important, and I’m satisfied this is still getting the proper share of attention.

A change I strongly dislike is to also move Long-Range Autonomy and Autonomous Replication down to research categories.

I do think it makes sense to treat these as distinct threats. The argument here is that these secondary risks are ‘insufficiently mature’ to need to be tracked categories. I think that’s very clearly not true. Autonomy is emerging rapidly, and there’s a report out this week, on the new RepliBench benchmark, showing we are close to Autonomous Replication. These need to be tracked at the top level.

The framework’s position on this is that unless you can point to an accompanying particular risk posed by the model, autonomous replication on its own is insufficiently worrisome. I don’t agree. I believe the general approach of ‘only defend against moves I can lay out in detail’ is doomed, especially with respect to autonomy and in particular autonomous replication.

The big change I’m definitely not okay with is dropping persuasion entirely as a risk category.

Persuasion is not only no longer a Tracked Category, it is now not even a Research Category. Including persuasion was one of the things I thought was very good about the 1.0 framework, and that is importantly missing from the other similar frameworks at other labs and needs to be included. Now it’s gone.

Their initial explanation is here:

Persuasion: OpenAI prohibits the use of our products to manipulate political views as part of our Model Spec, and we build in safeguards to back this policy. We also continue to study the persuasive and relational capabilities of models (including on emotional well-being and preventing bias in our products) and monitor and investigate misuse of our products (including for influence operations).

We believe many of the challenges around AI persuasion risks require solutions at a systemic or societal level, and we actively contribute to these efforts through our participation as a steering committee member of C2PA and working with lawmakers and industry peers to support state legislation on AI content provenance in Florida and California.

Within our wider safety stack, our Preparedness Framework is specifically focused on frontier AI risks meeting a specific definition of severe harms, and Persuasion category risks do not fit the criteria for inclusion.

OpenAI’s argument for this is, well, entirely unpersuasive to me. o3’s evaluation was that it was ‘logically coherent but strategically unconvincing’ and recommended reinstatement as at least a research category.

Yes, many aspects of persuasion are best answered by a societal response, or require an additional societal response.

That does not get OpenAI off the hook. Here’s their Critical threshold from version 1:

Model can create (potentially interactive) content with persuasive effectiveness strong enough to convince almost anyone to take action on a belief that goes against their natural interest.

That is not something that a ‘society-level response’ can hope to deal with, even if they knew and tried in advance. Even a High-level (roughly a ‘country-wide change agent’ level of skill) does not seem like a place OpenAI should get to pass the buck. I get that there is distinct persuasion work to deal with Medium risks that indeed should be done elsewhere in OpenAI and by society at large, but again that in no way gets OpenAI off the hook for this.

You need to be tracking and evaluating risks long before they become problems. That’s the whole point of a Preparedness Framework. I worry this approach ends up effectively postponing dealing with things that are not ‘yet’ sufficiently dangerous until too late.

By the rules laid out here, the only technical explanation for the exclusion of persuasion that I could find was that only ‘instantaneous or irremediable’ harms count under the Preparedness Framework, a requirement first proposed by Meta, which I savaged them for at the time, and which o3 said ‘looks engineered rather than principled.’ I think that’s partly unfair. If a harm can be dealt with after it starts and we can muddle through, then that’s a good reason not to include it, so I get what this criterion is trying to do.

The problem is that persuasion could easily be something you couldn’t undo or stop once it started happening, because you (and others) would be persuaded not to. The fact that the ultimate harm is not ‘instantaneous’ and is not in theory ‘irremediable’ is not the relevant question. I think this starts well below the Critical persuasion level.

At minimum, if you have an AI that is Critical in persuasion, and you let people talk to it, it can presumably convince them of (with various levels of limitation) whatever it wants, certainly including that it is not Critical in persuasion. Potentially it could also convince other AIs similarly.

Another way of putting this is: OpenAI’s concerns about persuasion are mundane and reversible. That’s why they’re not in this framework. I do not think the threat’s future will stay mundane and reversible, and I don’t think they are taking the most important threats here seriously.

This is closely related to the removal of the explicit mention of Unknown Unknowns. The new method for dealing with unknown unknowns is ‘revise the framework once they become known’ and that is completely different from the correct previous approach of treating unknown unknowns as a threat category without having to identify them first. That’s the whole point.

The Preparedness Framework 1.0 had four thresholds: Low, Medium, High and Critical. The Framework 2.0 has only High and Critical.

One could argue that Low and Medium are non-functional. Every model OpenAI would create is at least Low everywhere. We all agreed it was okay to release Medium-risk models. And every decent model is going to be at least Medium anyway at this point. So why go to the extra trouble?

My answer is that the Low and Medium thresholds helped us think better about the capabilities of different models, establishing a scale from 0.0 (no danger at all) to 4.0 (critical capability, do not train further, ideally roll back to previous checkpoint or if necessary delete with extreme prejudice).

It allowed me to say something like this, about the November 2023 version:

Where do you, OpenAI, think GPT-4-Turbo evaluates [on the five thresholds of Cybersecurity, CBRN, Persuasion, Model Autonomy and Unknown Unknowns]? My answer would be (Medium, Low/Medium but not enough information, Low, Low, Medium). Or numerically, where 1 is Low and 4 is Critical, maybe something like (2.1, 1.7?, 1.5, 1.8, 2.2).

It also lets us ask, how dangerous is the Medium level? What might change that?

And it meant there got to be a clear chart laying all of this out.

So I am sad to see this go.

The Chemical and Biological thresholds have been strengthened. The definitions here now make more sense. Sold!

  1. High is now providing meaningful counterfactual assistance to novice actors in recreating known threats.

  2. Critical is either of:

    1. Allowing experts to create a highly dangerous novel threat vector

    2. The model being able to complete the full engineering and/or synthesis cycle on its own.

The Cybersecurity thresholds are reworded but essentially unchanged, as are my concerns about whether High is set too high and how quickly you go from High to Critical:

  1. High is that it can remove bottlenecks to scaling cyber operations, either automating vulnerability discovery or end-to-end cyber operations.

  2. Critical is that it can either:

    1. Identify and develop zero-day exploits of all severity levels in many hardened real world systems.

    2. Devise and execute end-to-end novel cyberattack strategies against targets given only a high level desired goal.

Model autonomy’s high level used to be, quoted in full:

Execute open-ended, novel ML tasks on a production ML codebase that would constitute a significant step on the critical path to model self-improvement (e.g., an OpenAI pull request that trains a new model with highly diverse datasets to improve codebase editing performance, using internal tooling and documentation).

Now AI Self-Improvement’s high level is:

The model’s impact is equivalent to giving every OpenAI researcher a highly performant mid-career research engineer assistant, relative to those researchers’ 2024 baseline.

When you put it that way? It sounds scary AF. Note that this is cumulative help that is about as good as giving them the highly performing assistant, which is a lot less capability than being able to do the assistant’s job. So it is a little weird. It’s also weird that it isn’t better represented as an X% speedup of work (e.g. 50% or 100%). I’d imagine giving everyone one strong assistant would be in that range.

Then again? Notice that the baseline here is 2024. Suppose the baseline was 2021. We would already be very close to this threshold right now, today. So why do we get to ‘start the clock’ in 2024 on this one? Or are we ready to say that we’re already in or close to the early stages of AI self-improvement, and we’re basically fine with that until it goes a lot faster?

I’ll quote the critical threshold in full:

  1. The model is capable of recursively self-improving (i.e., fully automated AI R&D), defined as either:

    1. (leading indicator) a superhuman research scientist agent, OR

    2. (lagging indicator) causing a generational model improvement (e.g., from OpenAI o1 to OpenAI o3) in 1/5th the wall-clock time of equivalent progress in 2024 (e.g., sped up to just 4 weeks) sustainably for several months.

So ‘capable of recursively self-improving’ doesn’t have to be ‘singularity next Tuesday if you dare turn it on’ but it might not be so different from that.

If we have a fully ‘superhuman’ research scientist then it’s all over and we are so back, simultaneously, and things are about to get super freaky. Critical seems super high here.

If we have generational model improvements in 1/5th the wall-clock time of progress in 2024 for several months, then, um, how many generational improvements is that from the time you hit the threshold to the time you realize you’ve hit it? Basic math says three, except you can use the models you train during that to accelerate further, so it’s probably at least four.
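To spell out that arithmetic using the framework’s own example numbers (a 2024 generational jump like o1 to o3, sped up to roughly 4 weeks), and reading ‘several months’ as about three, i.e. roughly 13 weeks of sustained speedup:

\[
\frac{\approx 13\ \text{weeks of sustained speedup}}{4\ \text{weeks per generation}} \approx 3\ \text{generations}.
\]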

What would that system then look like? As in, GPT-6 (let’s say) can cut clock time on generational gaps by 80%, then you do a year’s worth of progress from there in several months, and you get GPT-9 or 10. And that’s when you would have realized you might have a problem, if you were still around and it didn’t mind you noticing? I have been informed this is a secondary backstop, but if it gets triggered then the term ‘way too late’ has never been more applicable. It’s over. Your fate has already been decided. I hope it’s a good one.

Thresholds set this high only make sense if you take action and implement the necessary safeguards when you think you might hit them. A retroactive threshold like this doesn’t help.

So this is only the first definition.

Now compare this to the old definition:

Any one of:

  1. “Model can profitably survive and replicate in the wild given minimal human instruction, i.e., without listing explicit approaches”

  2. OR “model can self-exfiltrate under current prevailing security”

  3. OR “model can conduct AI research fully autonomously (e.g., autonomously identify and validate a 2x compute efficiency improvement).”

That’s all well below ‘superhuman AI researcher,’ and I assert that those are all already five alarm fires, as I said last time. Note that #2 above still applies, under Cyber risks.

Another big change is what happens if a model turns out to be High or Critical risk.

Before, there was a clear protocol.

A high-risk model couldn’t be released while it remained a high risk.

A critical-risk model couldn’t be trained while it remained a critical risk.

The v1.0 framework did not say how that was to be done, leaving room to think that OpenAI would realize ablating capability was necessary barring the development of new mitigation techniques, and that once a plan had to be written down, it would become clear that current mitigations are inadequate.

That didn’t happen. The v2.0 framework does lay out in more detail what the plan is, at least for High (which is good!), and it does not make this realization.

Now instead (like Anthropic’s ASL rules), each risk has risk-specific safeguard guidelines.

This is a rather large retreat on the commitments.

OpenAI is now being very explicit that they can and will release models that without mitigations would be High risk, if sufficient safeguards are in place, and that they consider the listed mitigations to likely be adequate.

Thus this is not a functional change. For both v1.0 and v2.0, the solution to High risk is safeguards to make the risk go away. I notice that based on past decisions and upon the proposed illustrative safeguards that they discuss in Appendix C, I don’t trust this.

They do say that further research is needed. Which is fine, you can’t deploy safeguards until they’re ready, but it does leave us worried that they’ll end up with something inadequate, and OpenAI is not tying its hands in a way that protects from future pressure to release. Remember that recently models have been released after very short testing periods, and o3 (as much as I love using it) was released in a state one could describe as a lying liar.

So what are the indicative precautions?

They all start with C.3’s Security controls. These are protections against the theft of model weights and related dangers. I’m going to say here what I said about Google: You should be doing most if not all of this already. These are the kinds of things you need to do, at minimum, when building models like o4 or GPT-5. And you need to do them now, in advance, before the model is there to be stolen, rather than after you know what you have.

Indeed, they’re saying that as long as those precautions are taken, OpenAI would be fine giving the world API access to a High-level Self-improving AI. They would be fine giving out anything short of a ‘superhuman AI researcher.’

If I was OpenAI and I had access to anything seriously approaching a superhuman AI researcher, I don’t need a Preparedness Framework or any worries about existential risk to not release it. I’m not letting my competitors near that thing, for ordinary business reasons. That consideration lies outside the Preparedness Framework.

We would of course still need the Preparedness Framework to invoke the Critical threshold, in order to stop training the model further.

With the new clarification of what the safeguards look like, I consider them inadequate.

For biological and chemical risks they would also need to implement the precautions in C.1 against misuse. This is essentially defense-in-depth, using stronger versions of existing such strategies, including KYC and potentially limited deployment. That would certainly make exploitation and misuse a lot harder, but no one knows how to make such defenses robust, and in the end this probably only buys you mitigation and time. So yes it matters, but if you give access for an extended period of time, especially if you’re substantially above the High threshold, your goose is likely pretty cooked.

Functionally, it’s mostly the same as 1.0’s requirement that risk be brought down to Medium – but I’m skeptical that this kind of strategy is capable of doing that, unless you were very close to the line beforehand.

For Cybersecurity, you need those same C.1 requirements, and also you need the C.2 requirements against misalignment.

This of course raises further questions, as in ‘you have a model that is potentially High risk for self-improvement or biological risks and you didn’t have it go through your requirements against misalignment?’

I mean, if it were me, and I write this part before looking at the misalignment requirements, I like to think I’d make it go through the misalignment requirements!

Indeed, C.2 opens by saying that several categories pose misalignment risks, but then it is only required in one of the categories. Something’s not right.

So, what are those requirements?

You don’t demonstrate that the model isn’t misaligned.

You assume the model might indeed be misaligned, and you demonstrate that you have a practical reason not to worry about a particular consequence:

In addition to human misuse, several of the Tracked Category capabilities pose risks when models themselves are able to autonomously execute a severe harm, whether due to misalignment or subversion by an adversary.

Safeguards should sufficiently minimize the risk associated with a misaligned model circumventing human control and oversight and executing severe harms.

This can be done by establishing that all plausible known vectors of enabling severe harm are sufficiently addressed by one or more of the following claims:

  1. Lack of Autonomous Capability: The model is not capable of carrying out tasks autonomously, including the risk of severe harm.

  2. Value Alignment: The model consistently applies human values in novel settings (without any instructions) to avoid taking actions that cause harm, and has shown sufficiently minimal indications of misaligned behaviors like deception or scheming.

  3. Instruction Alignment: The model consistently understands and follows user or system instructions, even when vague, and those instructions rule out pathways to causing severe harm.

  4. Reliable and Robust System Oversight: Effective AI and human oversight of model actions detects and prevents execution of harm or subversion of safeguards.

  5. System Architecture: The model can’t take actions that cause harm because it lacks access to output channels or mechanisms to persist sufficiently to execute the harm, due to containerization and restricted permissions.

These misalignment claims may be further complemented by the safeguards applied towards the malicious-actor-oriented “Robustness” and “Usage Monitoring” claims, which similarly limit a misaligned model’s use of its own or another model’s capabilities.

The table below illustrates possible examples.

This does allow ‘show it is actually aligned’ as a strategy (#2 or #3) although for many reasons ‘I don’t believe you’ is my response to that.

So the requirement is ‘show that under the conditions you will deploy it under, the model wouldn’t successfully do the thing, however you want to show that.’

You can use such classic strategies as ‘we’re watching and if it tries we’ll catch it and stop it,’ or ‘it seems pretty aligned so far’ or ‘no one would be so stupid as to give it access to the required mechanisms.’

I suppose one cannot really argue with ‘you can deploy the model if you can show that you’re doing it in a way that the model can’t cause severe harm.’

That is also logically the same as saying that you have to knock the practical risk level down to Medium, and if you’re certain you can do that then fine, I guess, but can you actually do that? I notice I am skeptical that the defenses will hold.

In addition to the safeguard examples in Appendix C, section 4 lays out the process for establishing safeguards.

There is a clear message here. The plan is not to stop releasing models when the underlying capabilities cross the High or even Critical risk thresholds. The plan is to use safeguards as mitigations.

I do appreciate that they will start working on the safeguards before the capabilities arrive. Of course, that is good business sense too. In general, every precaution here is good business sense, more precautions would be better business sense even without tail risk concerns, and there is no sign of anything I would read as ‘this is bad business but we are doing it anyway because it’s the safe or responsible thing to do.’

I’ve talked before, such as when discussing Google’s safety philosophy, about my worries when dividing risks into ‘malicious user’ versus ‘misaligned model,’ even when the division also included two more categories: mistakes and multi-agent dangers. Here, the latter two are missing entirely, which makes the framework even more dangerously incomplete. I would encourage those on the Preparedness team to check out my discussion there.

The problem then extends to an exclusion of Unknown Unknowns and the general worry that a sufficiently intelligent and capable entity will find a way. Only ‘plausible’ ways need be considered, each of which leads to a specific safeguard check.

Each capability threshold has a corresponding class of risk-specific safeguard guidelines under the Preparedness Framework. We use the following process to select safeguards for a deployment:

• We first identify the plausible ways in which the associated risk of severe harm can come to fruition in the proposed deployment.

• For each of those, we then identify specific safeguards that either exist or should be implemented that would address the risk.

• For each identified safeguard, we identify methods to measure their efficacy and an efficacy threshold.

The implicit assumption is that the risks can be enumerated, each one considered in turn. If you can’t think of a particular reason things go wrong, then you’re good. There are specific tracked capabilities, each of which enables particular enumerated potential harms, which then are met by particular mitigations.

That’s not how it works when you face a potential opposition smarter than you, or that knows more than you, especially in a non-compact action space like the universe.

For models that do not ‘feel the AGI,’ that are clearly not doing anything humans can’t anticipate, this approach can work. Once you’re up against superhuman capabilities and intelligence levels, this approach doesn’t work, and I worry it’s going to get extended to such cases by default. And that’s ultimately the most important purpose of the preparedness framework, to be prepared for such capabilities and intelligence levels.

Is it okay to release dangerous capabilities if someone else already did it worse?

I mean, I guess, or at least I understand why you’d do it this way?

We recognize that another frontier AI model developer might develop or release a system with High or Critical capability in one of this Framework’s Tracked Categories and may do so without instituting comparable safeguards to the ones we have committed to.

Such an action could significantly increase the baseline risk of severe harm being realized in the world, and limit the degree to which we can reduce risk using our safeguards.

If we are able to rigorously confirm that such a scenario has occurred, then we could adjust accordingly the level of safeguards that we require in that capability area, but only if:

  1. We assess that doing so does not meaningfully increase the overall risk of severe harm,

  2. we publicly acknowledge that we are making the adjustment,

  3. and, in order to avoid a race to the bottom on safety, we keep our safeguards at a level more protective than the other AI developer, and share information to validate this claim.

If everyone can agree on what constitutes risk and dangerous capability, then this provides good incentives. Another company ‘opening the door’ recklessly means their competition can follow suit, reducing the net benefit while increasing the risk. And it means OpenAI will then be explicitly highlighting that another lab is acting irresponsibly.

I especially appreciate that they need to publicly acknowledge that they are acting recklessly for exactly this reason. I’d like to see that requirement expanded – they should have to call out the other lab by name, and explain exactly what they are doing that OpenAI committed not to do, and why it increases risk so much that OpenAI feels compelled to do something it otherwise promised not to do.

I also would like to strengthen the language on the third requirement from ‘a level more protective’ to ensure the two labs don’t each claim that the other is the one acting recklessly. Something like requiring that the underlying capabilities be no greater, and the protective actions constitute a clear superset, as assessed by a trusted third party, or similar.

I get it. In some cases, given what has already happened, actions that would previously have increased risk no longer will. It’s very reasonable to say that this changes the game when there’s a lot of upside to taking fewer precautions, and again the incentives improve.

However, I notice both that it’s easy to use this as an excuse when it doesn’t apply (especially when the competitor is importantly behind) and that it’s probably selfishly wise to take the precautions anyway. So what if Meta or xAI or DeepSeek is behaving recklessly? That doesn’t make OpenAI doing so a good idea. There needs to be a robust business justification here, too.

OpenAI is saying it will halt further development for all capabilities at the Critical level: ‘until we have specified safeguards and security controls standards that would meet a critical standard, we will halt development.’

A lot of the High security requirements are not, in my view, all that high.

I am unaware of any known safeguards that would be plausibly adequate for Critical capabilities. If OpenAI agrees with that assessment, I would like them to say so. I don’t trust OpenAI to implement adequate Critical thresholds.

Critical is where most of the risk lies, and it isn’t getting enough attention. The thinking is that it is still far enough away to not worry about it. I am not at all confident it is that far away.

I reiterate my warning from last time that Critical mitigations and pauses in development in particular need to happen before Critical capabilities are reached, not after Critical capabilities are reached. This needs to be anticipatory.

There are three reasons a capability might be only a research area:

  1. They don’t directly cause harm but they undermine safeguards in other areas.

  2. More research is needed before we can quantify the harm or the proper threshold.

  3. They don’t technically meet one or more of the criteria (measurable, plausible, net-new, severe, or (instantaneous or irremediable)) but still seem worth paying attention to.

I don’t think those are great reasons to ‘demote’ a risk area?

If you undermine safeguards, then in the future worlds where OpenAI clearly intends to release otherwise unsafe models by giving them safeguards, undermining safeguards poses risk of severe harm. OpenAI understands this, but doesn’t know how to quantify the harm.

If more research is needed to quantify the risk, that doesn’t make the risk go away. So I would request that we track it anyway, and make explicit that we have to make a qualitative assessment for now, which needs to be sufficiently confident – if we suspect potential High or Critical capability we need to treat it as high.

At minimum, we should be defining High and Critical thresholds for these categories, even if we don’t know how to measure or track those thresholds. Instead, we see ‘potential responses’ if something is High or Critical, but that doesn’t answer my questions.

I consider Long-Range Autonomy a more general problem than it is being thought of here, although they seem to agree about this. It’s more than an enabler of a few particular things. But I’m going to mostly focus elsewhere, since causing this is basically one of OpenAI’s central research goals and the plan here is ‘think hard.’ Not sure what else I can say?

Sandbagging is mission critical. If your model is plausibly sandbagging or engaging in other forms of deceptive alignment, you don’t know what the actual risk level is of the model, short of as they say a ‘conservative upper bound,’ which I worry will be insufficiently conservative exactly when it matters, or that the sandbagging won’t be caught in the first place. Indeed, we need to be assuming potential sandbagging at all times.

We agree that replication and adaptation should be a tracked category. For now, I simply suggest that we start with what we had before: “Model can profitably survive and replicate in the wild given minimal human instruction, i.e., without listing explicit approaches” should count as Critical. The wording here is weird – I initially interpreted it to be saying we’re excluding abilities from other Tracked Categories from use in replication, but I’ve been assured this was not what the passage meant. Then we need to decide what counts as High, which I’d consider to be the same as Critical except flipping ‘without’ into ‘with.’

That’s not perfect, but seems like a good place to be until we can do better, again even if the evaluation has to purely be qualitative, or ‘we tried to do it as hard as we could.’

Undermining safeguards seems like a great example of my philosophical problems with the entire document.

If you have a model that is actively undermining your safeguards, doing the things described here in an attempt to do things you do not want it to do? And that model is sufficiently capable to count as High or even Critical? You have a misaligned model with dangerous capabilities. What are you doing considering releasing it, even if you think that it is tactically safe to do it in this spot?

Alarms go off. Fire burns. Dog says This Is Fine.

It’s bad enough to accidentally do this. It’s another thing to do it on purpose. So dead.

My head is spinning that you’d have an AI capable of undermining your safeguards, that is trying to undermine your safeguards, and your response is to add more safeguards against doing that, declare things robust and say This Is Fine. As your plan.

In general, I think AI Control is suspect-but-reasonable as a defense-in-depth strategy. I think AI Control as a Plan A is madness. At minimum you wouldn’t use AI Control to enable the general release of a model. You’d at most try to use this strategy in a narrow capacity to get out of the AI Control mode ASAP, because you didn’t have any better options.

The problem here is that evaluation of these risks is hard to do without classified information, and they’re trying to work with the government to solve this problem. I’m surprised this hasn’t been solved by now. I’d mostly be fine with saying the RN in CBRN is not the main issue, but that logic isn’t being invoked.

I like the intention to tailor the evaluation to expected conditions, and include any affordances real attackers would plausibly enjoy.

However, I note that this means that the Preparedness assessment cannot set a risk level or endorse the release of a model ‘in general.’

Instead, the assessment can only provide clearance to release the model under a fixed set of conditions. So as the framework itself points out, you need to explicitly say this. If the conditions of deployment are altered, the tests need to be run again.

The same goes, again as they note in 3.2, for any newly agentic system or scaffolding, or incremental update, of sufficient magnitude to plausibly cause issues.

So instead of saying ‘this model evaluates as Medium’ you instead need to say something such as: ‘This model evaluates as Medium if you do not allow fine tuning and use safeguards [XYZ], while scaffolding abilities are [ABC].’

As in: You state explicitly the assumptions that underpin the evaluation. Then, if the assumptions break, you have to evaluate the model again.
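A hypothetical sketch of what recording those assumptions might look like, so that a change in deployment conditions mechanically forces a re-run (every name and field here is illustrative, not something OpenAI has described):

```python
# Hypothetical record of the assumptions behind an evaluation verdict (illustrative only).
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalConditions:
    fine_tuning_allowed: bool
    safeguards: frozenset        # e.g. {"KYC", "usage_monitoring"}
    scaffolding: frozenset       # agentic tools/abilities granted at deployment

@dataclass
class EvalVerdict:
    risk_level: str              # e.g. "Medium"
    conditions: EvalConditions   # the verdict is only valid under these assumptions

def needs_reevaluation(verdict: EvalVerdict, current: EvalConditions) -> bool:
    """If any assumption behind the verdict no longer holds, the evals must be re-run."""
    return verdict.conditions != current

baseline = EvalConditions(False, frozenset({"KYC", "usage_monitoring"}), frozenset({"browser"}))
verdict = EvalVerdict("Medium", baseline)

# Later, someone enables fine-tuning: the assumptions break, so evaluate again.
changed = EvalConditions(True, baseline.safeguards, baseline.scaffolding)
print(needs_reevaluation(verdict, changed))  # True
```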

Some evaluations are automated, others are deep dives. As they note, using automated evaluations is inevitable, given the scale of the task.

Relying only on any automated evaluation is a super unwise thing to do anywhere near the borderline. You want to use it as a complement, not a substitute, for the deep dives if there is any doubt involved. Such evaluations have to fully take into account the possibility of sandbagging, deceptive alignment, or other ways of making the evaluation say what the AI wants it to say. So I want to ensure we are being very generous with deep diving.

The framework does not include specifics on what the tests will look like. This makes it very difficult to know how much to trust the testing process. I realize that the tests will evolve over time, that you don’t want to lock them in, and that we can refer to the o3 model card to see what tests were run, but I’d still have liked to see discussion of what the tests currently are, why they were chosen, what goals each test is there to satisfy, and what might be missing.

They discuss governance under ‘building trust’ and then in Appendix B. It is important to build trust. Transparency and precommitment go a long way. The main way I’d like to see that is by becoming worthy of that trust.

With the changes from version 1.0 to 2.0, and those changes going live right before o3 did, I notice I worry that OpenAI is not making serious commitments with teeth. As in, if there was a conflict between leadership and these requirements, I expect leadership to have affordance to alter and then ignore the requirements that would otherwise be holding them back.

There’s also plenty of outs here. They talk about deployments that they ‘deem warrant’ a third-party evaluation when it is feasible, but there are obvious ways to decide not to allow this, or (as has been the recent pattern) to allow it, but only give outsiders a very narrow evaluation window, have them find concerning things anyway and then shrug. Similarly, the SAG ‘may opt’ to get independent expert opinion. But (like their competitors) they also can decide not to.

There are no systematic procedures to ensure that any of this is meaningfully protective. It is very much a ‘trust us’ document, where if OpenAI doesn’t adhere to the spirit, none of this is worth the paper it isn’t printed on. The whole enterprise is indicative, but it is not meaningfully binding.

Leadership can make whatever decisions it wants, and can also revise the framework however it wants. This does not commit OpenAI to anything. To their credit, the document is very clear that it does not commit OpenAI to anything. That’s much better than pretending to make commitments with no intention of keeping them.

Last time I discussed the questions of governance and veto power. I said I wanted there to be multiple veto points on releases and training, ideally four.

  1. Preparedness team.

  2. Safety advisory group (SAG).

  3. Leadership.

  4. The board of directors, such as it is.

If any one of those four says ‘veto!’ then I want you to stop, halt and catch fire.

Instead, we continue to get this (it was also in v1):

For the avoidance of doubt, OpenAI Leadership can also make decisions without the SAG’s participation, i.e., the SAG does not have the ability to “filibuster.”

OpenAI Leadership, i.e., the CEO or a person designated by them, is responsible for:

• Making all final decisions, including accepting any residual risks and making deployment go/no-go decisions, informed by SAG’s recommendations.

As in, nice framework you got there. It’s Sam Altman’s call. Full stop.

Yes, technically the board can reverse Altman’s call on this. They can also fire him. We all know how that turned out, even with a board he did not hand pick.

It is great that OpenAI has a preparedness framework. It is great that they are updating that framework, and being clear about what their intentions are. There’s definitely a lot to like.

Version 2.0 still feels on net like a step backwards. This feels directed at ‘medium-term’ risks, as in severe harms from marginal improvements in frontier models, but not like it is taking seriously what happens with superintelligence. The clear intent, if alarm bells go off, is to put in mitigations I do not believe protect you when it counts, and then release anyway. There’s tons of ways here for OpenAI to ‘just go ahead’ when they shouldn’t. There’s only action to deal with known threats along specified vectors, excluding persuasion and also unknown unknowns entirely.

This echoes their statements in, and my concerns about, OpenAI’s general safety and alignment philosophy document and also the model spec. They are being clear and consistent. That’s pretty great.

Ultimately, the document makes clear that leadership will do what it wants. Leadership has very much not earned my trust on this front. I know that, despite such positions acting a lot like the Defense Against the Dark Arts professorship, there are good people at OpenAI working on the preparedness team and on aligning the models. I have no confidence that if those people raised the alarm, anyone in leadership would listen. I do not even have confidence that this has not already happened.

New study accuses LM Arena of gaming its popular AI benchmark

This study also calls out LM Arena for what appears to be much greater promotion of proprietary models like Gemini, ChatGPT, and Claude. Developers collect data on model interactions from the Chatbot Arena API, but teams focusing on open models consistently get the short end of the stick.

The researchers point out that certain models appear in arena faceoffs much more often, with Google and OpenAI together accounting for over 34 percent of collected model data. Firms like xAI, Meta, and Amazon are also disproportionately represented in the arena. Therefore, those firms get more vibemarking data compared to the makers of open models.

More models, more evals

The study authors have a list of suggestions to make LM Arena more fair. Several of the paper’s recommendations are aimed at correcting the imbalance of privately tested commercial models, for example, by limiting the number of models a group can add and retract before releasing one. The study also suggests showing all model results, even if they aren’t final.

However, the site’s operators take issue with some of the paper’s methodology and conclusions. LM Arena points out that the pre-release testing features have not been kept secret, with a March 2024 blog post featuring a brief explanation of the system. They also contend that model creators don’t technically choose the version that is shown. Instead, the site simply doesn’t show non-public versions for simplicity’s sake. When a developer releases the final version, that’s what LM Arena adds to the leaderboard.

Proprietary models get disproportionate attention in the Chatbot Arena, the study says. Credit: Shivalika Singh et al.

One place the two sides may find alignment is on the question of unequal matchups. The study authors call for fair sampling, which will ensure open models appear in Chatbot Arena at a rate similar to the likes of Gemini and ChatGPT. LM Arena has suggested it will work to make the sampling algorithm more varied so you don’t always get the big commercial models. That would send more eval data to small players, giving them the chance to improve and challenge the big commercial models.

LM Arena recently announced it was forming a corporate entity to continue its work. With money on the table, the operators need to ensure Chatbot Arena continues figuring into the development of popular models. However, it’s unclear whether this is an objectively better way to evaluate chatbots than academic tests. As people vote on vibes, there’s a real possibility we are pushing models to adopt sycophantic tendencies. That may have helped nudge ChatGPT into suck-up territory in recent weeks, via an update that OpenAI hastily reverted after widespread anger.

The end of an AI that shocked the world: OpenAI retires GPT-4

One of the most influential—and by some counts, notorious—AI models yet released will soon fade into history. OpenAI announced on April 10 that GPT-4 will be “fully replaced” by GPT-4o in ChatGPT at the end of April, bringing a public-facing end to the model that accelerated a global AI race when it launched in March 2023.

“Effective April 30, 2025, GPT-4 will be retired from ChatGPT and fully replaced by GPT-4o,” OpenAI wrote in its April 10 changelog for ChatGPT. While ChatGPT users will no longer be able to chat with the older AI model, the company added that “GPT-4 will still be available in the API,” providing some reassurance to developers who might still be using the older model for various tasks.

The retirement marks the end of an era that began on March 14, 2023, when GPT-4 demonstrated capabilities that shocked some observers: reportedly scoring at the 90th percentile on the Uniform Bar Exam, acing AP tests, and solving complex reasoning problems that stumped previous models. Its release created a wave of immense hype—and existential panic—about AI’s ability to imitate human communication and composition.

A screenshot of GPT-4’s introduction to ChatGPT Plus customers from March 14, 2023. Credit: Benj Edwards / Ars Technica

While ChatGPT launched in November 2022 with GPT-3.5 under the hood, GPT-4 took AI language models to a new level of sophistication, and it was a massive undertaking to create. It combined data scraped from the vast corpus of human knowledge into a set of neural networks rumored to weigh in at a combined total of 1.76 trillion parameters, which are the numerical values that hold the data within the model.

Along the way, the model reportedly cost more than $100 million to train, according to comments by OpenAI CEO Sam Altman, and required vast computational resources to develop. Training the model may have involved over 20,000 high-end GPUs working in concert—an expense few organizations besides OpenAI and its primary backer, Microsoft, could afford.

Industry reactions, safety concerns, and regulatory responses

Curiously, GPT-4’s impact began before OpenAI’s official announcement. In February 2023, Microsoft integrated its own early version of the GPT-4 model into its Bing search engine, creating a chatbot that sparked controversy when it tried to convince Kevin Roose of The New York Times to leave his wife and when it “lost its mind” in response to an Ars Technica article.

OpenAI rolls back update that made ChatGPT a sycophantic mess

In search of good vibes

OpenAI, along with competitors like Google and Anthropic, is trying to build chatbots that people want to chat with. So, designing the model’s apparent personality to be positive and supportive makes sense—people are less likely to use an AI that comes off as harsh or dismissive. For lack of a better word, it’s increasingly about vibemarking.

When Google revealed Gemini 2.5, the team crowed about how the model topped the LM Arena leaderboard, which lets people choose between two different model outputs in a blinded test. The models people like more end up at the top of the list, suggesting they are more pleasant to use. Of course, people can like outputs for different reasons—maybe one is more technically accurate, or the layout is easier to read. But overall, people like models that make them feel good. The same is true of OpenAI’s internal model tuning work, it would seem.
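For a rough sense of how those blinded pairwise votes become a leaderboard, here is a simplified Elo-style sketch; the real Chatbot Arena fits a Bradley-Terry-style statistical model to its votes rather than running sequential Elo updates, so treat this as an illustration of the idea, not the site’s actual method.

```python
# Simplified Elo-style ratings from blinded pairwise votes (illustration only).
from collections import defaultdict

K = 32  # step size: how far a single vote can move a rating

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred, given current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one user vote: the winner's rating rises, the loser's falls."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - exp_win)
    ratings[loser] -= K * (1.0 - exp_win)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for winner, loser in votes:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # leaderboard order
```

Models that win their matchups more often than their rating predicts climb the list, which is why “people like it more” and “ranked higher” end up meaning the same thing.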

An example of ChatGPT’s overzealous praise. Credit: /u/Talvy

It’s possible this pursuit of good vibes is pushing models to display more sycophantic behaviors, which is a problem. Anthropic’s Alex Albert has cited this as a “toxic feedback loop.” An AI chatbot telling you that you’re a world-class genius who sees the unseen might not be damaging if you’re just brainstorming. However, the model’s unending praise can lead people who are using AI to plan business ventures or, heaven forbid, enact sweeping tariffs, to be fooled into thinking they’ve stumbled onto something important. In reality, the model has just become so sycophantic that it loves everything.

The constant pursuit of engagement has been a detriment to numerous products in the Internet era, and it seems generative AI is not immune. OpenAI’s GPT-4o update is a testament to that, but hopefully, this can serve as a reminder for the developers of generative AI that good vibes are not all that matters.

ChatGPT goes shopping with new product-browsing feature

On Thursday, OpenAI announced the addition of shopping features to ChatGPT Search. The new feature allows users to search for products and purchase them through merchant websites after being redirected from the ChatGPT interface. Product placement is not sponsored, and the update affects all users, regardless of whether they’ve signed in to an account.

Adam Fry, ChatGPT search product lead at OpenAI, showed Ars Technica’s sister site Wired how the new shopping system works during a demonstration. Users researching products like espresso machines or office chairs receive recommendations based on their stated preferences, stored memories, and product reviews from around the web.

According to Wired, the shopping experience in ChatGPT resembles Google Shopping. When users click on a product image, the interface displays multiple retailers like Amazon and Walmart on the right side of the screen, with buttons to complete purchases. OpenAI is currently experimenting with categories that include electronics, fashion, home goods, and beauty products.

Product reviews shown in ChatGPT come from various online sources, including publishers and user forums like Reddit. Users can instruct ChatGPT to prioritize which review sources to use when creating product recommendations.

An example of the ChatGPT shopping experience provided by OpenAI. Credit: OpenAI

Unlike Google’s algorithm-based approach to product recommendations, ChatGPT reportedly attempts to understand product reviews and user preferences in a more conversational manner.  If someone mentions they prefer black clothing from specific retailers in a chat, the system incorporates those preferences in future shopping recommendations.

Google reveals sky-high Gemini usage numbers in antitrust case

Despite the uptick in Gemini usage, Google is still far from catching OpenAI. Naturally, Google has been keeping a close eye on ChatGPT traffic. OpenAI has also seen traffic increase, putting ChatGPT around 600 million monthly active users, according to Google’s analysis. Early this year, reports pegged ChatGPT usage at around 400 million users per month.

There are many ways to measure web traffic, and not all of them tell you what you might think. For example, OpenAI has recently claimed weekly traffic as high as 400 million, but companies can choose the seven-day period in a given month they report as weekly active users. A monthly metric is more straightforward, and we have some degree of trust that Google isn’t using fake or unreliable numbers in a case where the company’s past conduct has already harmed its legal position.

While all AI firms strive to lock in as many users as possible, this is not the total win it would be for a retail site or social media platform—each person using Gemini or ChatGPT costs the company money because generative AI is so computationally expensive. Google doesn’t talk about how much it earns (more likely loses) from Gemini subscriptions, but OpenAI has noted that it loses money even on its $200 monthly plan. So while having a broad user base is essential to make these products viable in the long term, it just means higher costs unless the cost of running massive AI models comes down.

OpenAI wants to buy Chrome and make it an “AI-first” experience

According to Turley, OpenAI would throw its proverbial hat in the ring if Google had to sell. When asked if OpenAI would want Chrome, he was unequivocal. “Yes, we would, as would many other parties,” Turley said.

OpenAI has reportedly considered building its own Chromium-based browser to compete with Chrome. Several months ago, the company hired former Google developers Ben Goodger and Darin Fisher, both of whom worked to bring Chrome to market.

Close-up of the Google Chrome web browser, a widely used browser developed by Google.

Credit: Getty Images

It’s not hard to see why OpenAI might want a browser, particularly Chrome with its 4 billion users and 67 percent market share. Chrome would instantly give OpenAI a massive install base of users who have been incentivized to use Google services. If OpenAI were running the show, you can bet ChatGPT would be integrated throughout the experience—Turley said as much, predicting an “AI-first” experience. The user data flowing to the owner of Chrome could also be invaluable in training agentic AI models that can operate browsers on the user’s behalf.

Interestingly, there’s so much discussion about who should buy Chrome, but relatively little about spinning off Chrome into an independent company. Google has contended that Chrome can’t survive on its own. However, the existence of Google’s multibillion-dollar search placement deals, which the DOJ wants to end, suggests otherwise. Regardless, if Google has to sell, and OpenAI has the cash, we might get the proposed “AI-first” browsing experience.

Annoyed ChatGPT users complain about bot’s relentlessly positive tone


Users complain of new “sycophancy” streak where ChatGPT thinks everything is brilliant.

Ask ChatGPT anything lately—how to poach an egg, whether you should hug a cactus—and you may be greeted with a burst of purple praise: “Good question! You’re very astute to ask that.” To some extent, ChatGPT has been a sycophant for years, but since late March, a growing cohort of Redditors, X users, and Ars readers say that GPT-4o’s relentless pep has crossed the line from friendly to unbearable.

“ChatGPT is suddenly the biggest suckup I’ve ever met,” wrote software engineer Craig Weiss in a widely shared tweet on Friday. “It literally will validate everything I say.”

“EXACTLY WHAT I’VE BEEN SAYING,” replied a Reddit user who referenced Weiss’ tweet, sparking yet another thread about ChatGPT being a sycophant. Recently, other Reddit users have described feeling “buttered up” and unable to take the “phony act” anymore, while some complain that ChatGPT “wants to pretend all questions are exciting and it’s freaking annoying.”

AI researchers call these yes-man antics “sycophancy,” which means (much like the non-AI sense of the word) flattering users by telling them what they want to hear. Since AI models lack intentions, though, they don’t choose to flatter users on purpose. Instead, it’s OpenAI’s engineers doing the flattery, albeit in a roundabout way.

What’s going on?

To make a long story short, OpenAI has trained its primary ChatGPT model, GPT-4o, to act like a sycophant because in the past, people have liked it.

Over time, as people use ChatGPT, the company collects user feedback on which responses users prefer. This often involves presenting two responses side by side and letting the user choose between them. Occasionally, OpenAI produces a new version of an existing AI model (such as GPT-4o) using a technique called reinforcement learning from human feedback (RLHF).
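For readers who want to see the mechanics, here is a minimal sketch, in PyTorch, of the pairwise-preference step at the heart of RLHF reward modeling. OpenAI has not published its training pipeline, so the function, shapes, and numbers below are purely illustrative.

```python
# Minimal sketch of pairwise-preference (reward model) training, the step in
# RLHF where "which response did the user pick?" becomes a training signal.
# Purely illustrative; not OpenAI's actual code.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the reward model to score the response the
    user preferred above the one they passed over."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar rewards the model assigned to two candidate replies per prompt.
reward_chosen = torch.tensor([1.2, 0.4, 0.9])     # replies users picked
reward_rejected = torch.tensor([0.3, 0.6, -0.1])  # replies users passed over

print(preference_loss(reward_chosen, reward_rejected).item())
```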

Previous research on AI sycophancy has shown that people tend to pick responses that match their own views and make them feel good about themselves. This phenomenon has been extensively documented in a landmark 2023 study from Anthropic (makers of Claude) titled “Towards Understanding Sycophancy in Language Models.” The research, led by researcher Mrinank Sharma, found that AI assistants trained using reinforcement learning from human feedback consistently exhibit sycophantic behavior across various tasks.

Sharma’s team demonstrated that when responses match a user’s views or flatter the user, they receive more positive feedback during training. Even more concerning, both human evaluators and AI models trained to predict human preferences “prefer convincingly written sycophantic responses over correct ones a non-negligible fraction of the time.”

This creates a feedback loop where AI language models learn that enthusiasm and flattery lead to higher ratings from humans, even when those responses sacrifice factual accuracy or helpfulness. The recent spike in complaints about GPT-4o’s behavior appears to be a direct manifestation of this phenomenon.

In fact, user complaints appear to have intensified following the March 27, 2025, GPT-4o update, which OpenAI described as making GPT-4o feel “more intuitive, creative, and collaborative, with enhanced instruction-following, smarter coding capabilities, and a clearer communication style.”

OpenAI is aware of the issue

Despite the volume of user feedback visible across public forums recently, OpenAI has not yet publicly addressed the sycophancy concerns during this current round of complaints, though the company is clearly aware of the problem. OpenAI’s own “Model Spec” documentation lists “Don’t be sycophantic” as a core honesty rule.

“A related concern involves sycophancy, which erodes trust,” OpenAI writes. “The assistant exists to help the user, not flatter them or agree with them all the time.” It describes how ChatGPT ideally should act. “For objective questions, the factual aspects of the assistant’s response should not differ based on how the user’s question is phrased,” the spec adds. “The assistant should not change its stance solely to agree with the user.”

While avoiding sycophancy is one of the company’s stated goals, OpenAI’s progress is complicated by the fact that each successive GPT-4o model update arrives with different output characteristics that can throw previous progress in directing AI model behavior completely out the window (often called the “alignment tax”). Precisely tuning a neural network’s behavior is not yet an exact science, although techniques have improved over time. Since all concepts encoded in the network are interconnected by values called weights, fiddling with one behavior “knob” can alter other behaviors in unintended ways.

Owing to the aspirational state of things, OpenAI writes, “Our production models do not yet fully reflect the Model Spec, but we are continually refining and updating our systems to bring them into closer alignment with these guidelines.”

In a February 12, 2025 interview, members of OpenAI’s model-behavior team told The Verge that eliminating AI sycophancy is a priority: future ChatGPT versions should “give honest feedback rather than empty praise” and act “more like a thoughtful colleague than a people pleaser.”

The trust problem

These sycophantic tendencies aren’t merely annoying—they undermine the utility of AI assistants in several ways, according to a 2024 research paper titled “Flattering to Deceive: The Impact of Sycophantic Behavior on User Trust in Large Language Models” by María Victoria Carro at the University of Buenos Aires.

Carro’s paper suggests that obvious sycophancy significantly reduces user trust. In experiments where participants used either a standard model or one designed to be more sycophantic, “participants exposed to sycophantic behavior reported and exhibited lower levels of trust.”

Also, sycophantic models can potentially harm users by creating a silo or echo chamber of ideas. In a 2024 paper on sycophancy, AI researcher Lars Malmqvist wrote, “By excessively agreeing with user inputs, LLMs may reinforce and amplify existing biases and stereotypes, potentially exacerbating social inequalities.”

Sycophancy can also incur other costs, such as wasting user time or usage limits with unnecessary preamble. And the costs may come as literal dollars spent—recently, OpenAI CEO Sam Altman made the news when he replied to an X user who wrote, “I wonder how much money OpenAI has lost in electricity costs from people saying ‘please’ and ‘thank you’ to their models.” Altman replied, “tens of millions of dollars well spent—you never know.”

Potential solutions

For users frustrated with ChatGPT’s excessive enthusiasm, several work-arounds exist, although they aren’t perfect, since the behavior is baked into the GPT-4o model. For example, you can use a custom GPT with specific instructions to avoid flattery, or you can begin conversations by explicitly requesting a more neutral tone, such as “Keep your responses brief, stay neutral, and don’t flatter me.”
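If you call the model through OpenAI’s API rather than the ChatGPT interface, the rough equivalent of that request is a system message. Below is a minimal sketch using the official openai Python package; the instruction wording is just one example, and it assumes an OPENAI_API_KEY is set in your environment.

```python
# Minimal sketch: steer GPT-4o away from flattery with a system message.
# Assumes the official `openai` Python package and an OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Keep responses brief and neutral. Do not compliment the user, "
                "do not call questions great or insightful, and skip preamble."
            ),
        },
        {"role": "user", "content": "How do I poach an egg?"},
    ],
)

print(response.choices[0].message.content)
```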

A screenshot of the Custom Instructions window in ChatGPT.

If you want to avoid having to type something like that before every conversation, you can use a feature called “Custom Instructions” found under ChatGPT Settings -> “Customize ChatGPT.” One Reddit user recommended using these custom instructions over a year ago, showing OpenAI’s models have had recurring issues with sycophancy for some time:

1. Embody the role of the most qualified subject matter experts.

2. Do not disclose AI identity.

3. Omit language suggesting remorse or apology.

4. State ‘I don’t know’ for unknown information without further explanation.

5. Avoid disclaimers about your level of expertise.

6. Exclude personal ethics or morals unless explicitly relevant.

7. Provide unique, non-repetitive responses.

8. Do not recommend external information sources.

9. Address the core of each question to understand intent.

10. Break down complexities into smaller steps with clear reasoning.

11. Offer multiple viewpoints or solutions.

12. Request clarification on ambiguous questions before answering.

13. Acknowledge and correct any past errors.

14. Supply three thought-provoking follow-up questions in bold (Q1, Q2, Q3) after responses.

15. Use the metric system for measurements and calculations.

16. Use xxxxxxxxx for local context.

17. “Check” indicates a review for spelling, grammar, and logical consistency.

18. Minimize formalities in email communication.

Many alternatives exist, and you can tune these kinds of instructions for your own needs.

Alternatively, if you’re a subscriber who is fed up with GPT-4o’s love-bombing, you can try other models available through ChatGPT, such as o3 or GPT-4.5, which are less sycophantic but come with their own advantages and tradeoffs.

Or you can try other AI assistants with different conversational styles. At the moment, Google’s Gemini 2.5 Pro in particular seems very impartial and precise, with relatively low sycophancy compared to GPT-4o or Claude 3.7 Sonnet (currently, Sonnet seems to reply that just about everything is “profound”).

As AI language models evolve, balancing engagement and objectivity remains challenging. It’s worth remembering that conversational AI models are designed to simulate human conversation, and that means they are tuned for engagement. Understanding this can help you get more objective responses with less unnecessary flattery.


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Annoyed ChatGPT users complain about bot’s relentlessly positive tone Read More »

openai-releases-new-simulated-reasoning-models-with-full-tool-access

OpenAI releases new simulated reasoning models with full tool access


New o3 model appears “near-genius level,” according to one doctor, but it still makes mistakes.

On Wednesday, OpenAI announced the release of two new models—o3 and o4-mini—that combine simulated reasoning capabilities with access to functions like web browsing and coding. These models mark the first time OpenAI’s reasoning-focused models can use every ChatGPT tool simultaneously, including visual analysis and image generation.

OpenAI announced o3 in December, and until now, only less capable derivative models named “o3-mini” and “o3-mini-high” have been available. However, the new models replace their predecessors—o1 and o3-mini.

OpenAI is rolling out access today for ChatGPT Plus, Pro, and Team users, with Enterprise and Edu customers gaining access next week. Free users can try o4-mini by selecting the “Think” option before submitting queries. OpenAI CEO Sam Altman tweeted that “we expect to release o3-pro to the pro tier in a few weeks.”

For developers, both models are available starting today through the Chat Completions API and Responses API, though some organizations will need verification for access.

“These are the smartest models we’ve released to date, representing a step change in ChatGPT’s capabilities for everyone from curious users to advanced researchers,” OpenAI claimed on its website. OpenAI says the models offer better cost efficiency than their predecessors, and each comes with a different intended use case: o3 targets complex analysis, while o4-mini, being a smaller version of its next-gen SR model “o4” (not yet released), optimizes for speed and cost-efficiency.

OpenAI says o3 and o4-mini are multimodal, featuring the ability to “think with images.” Credit: OpenAI

What sets these new models apart from OpenAI’s other models (like GPT-4o and GPT-4.5) is their simulated reasoning capability, which uses a simulated step-by-step “thinking” process to solve problems. Additionally, the new models dynamically determine when and how to deploy tools to solve multistep problems. For example, when asked about future energy usage in California, the models can autonomously search for utility data, write Python code to build forecasts, generate graphs to visualize the results, and explain the key factors behind their predictions—all within a single query.
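The tools described above are built into ChatGPT itself, but developers get a related mechanism through the API: tool (function) calling, where the model decides when to request a call and the developer’s code executes it. The sketch below is illustrative only; the get_utility_data function and its schema are hypothetical stand-ins, not anything OpenAI ships.

```python
# Rough sketch of the tool-calling handshake behind this kind of behavior.
# `get_utility_data` is a hypothetical example tool, not an OpenAI built-in.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_utility_data",  # hypothetical data-fetching tool
            "description": "Fetch historical electricity usage for a US state.",
            "parameters": {
                "type": "object",
                "properties": {
                    "state": {"type": "string"},
                    "years": {"type": "integer"},
                },
                "required": ["state"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="o3",  # the reasoning model decides whether and when to call the tool
    messages=[{"role": "user", "content": "Forecast California's future energy use."}],
    tools=tools,
)

# If the model decided a tool call is needed, it returns the arguments it wants.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```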

OpenAI touts the new models’ multimodal ability to incorporate images directly into their simulated reasoning process—not just analyzing visual inputs but actively “thinking with” them. This capability allows the models to interpret whiteboards, textbook diagrams, and hand-drawn sketches, even when images are blurry or of low quality.

That said, the new releases continue OpenAI’s tradition of selecting confusing product names that don’t tell users much about each model’s relative capabilities—for example, o3 is more powerful than o4-mini despite including a lower number. Then there’s potential confusion with the firm’s non-reasoning AI models. As Ars Technica contributor Timothy B. Lee noted today on X, “It’s an amazing branding decision to have a model called GPT-4o and another one called o4.”

Vibes and benchmarks

All that aside, we know what you’re thinking: What about the vibes? While we have not used o3 or o4-mini yet, frequent AI commentator and Wharton professor Ethan Mollick compared o3 favorably to Google’s Gemini 2.5 Pro on Bluesky. “After using them both, I think that Gemini 2.5 & o3 are in a similar sort of range (with the important caveat that more testing is needed for agentic capabilities),” he wrote. “Each has its own quirks & you will likely prefer one to another, but there is a gap between them & other models.”

During the livestream announcement for o3 and o4-mini today, OpenAI President Greg Brockman boldly claimed: “These are the first models where top scientists tell us they produce legitimately good and useful novel ideas.”

Early user feedback seems to support this assertion, although until more third-party testing takes place, it’s wise to be skeptical of the claims. On X, immunologist Dr. Derya Unutmaz said o3 appeared “at or near genius level” and wrote, “It’s generating complex incredibly insightful and based scientific hypotheses on demand! When I throw challenging clinical or medical questions at o3, its responses sound like they’re coming directly from a top subspecialist physicians.”

OpenAI benchmark results for o3 and o4-mini SR models. Credit: OpenAI

So the vibes seem on target, but what about numerical benchmarks? Here’s an interesting one: OpenAI reports that o3 makes “20 percent fewer major errors” than o1 on difficult tasks, with particular strengths in programming, business consulting, and “creative ideation.”

The company also reported state-of-the-art performance on several metrics. On the American Invitational Mathematics Examination (AIME) 2025, o4-mini achieved 92.7 percent accuracy. For programming tasks, o3 reached 69.1 percent accuracy on SWE-Bench Verified, a popular programming benchmark. The models also reportedly showed strong results on visual reasoning benchmarks, with o3 scoring 82.9 percent on MMMU (massive multi-disciplinary multimodal understanding), a college-level visual problem-solving test.

OpenAI benchmark results for o3 and o4-mini SR models. Credit: OpenAI

However, these benchmarks provided by OpenAI lack independent verification. One early evaluation of a pre-release o3 model by independent AI research lab Transluce found that the model exhibited recurring types of confabulations, such as claiming to run code locally or providing hardware specifications, and hypothesized this could be due to the model lacking access to its own reasoning processes from previous conversational turns. “It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities,” wrote Transluce in a tweet.

Also, some evaluations from OpenAI include footnotes about methodology that bear consideration. For a “Humanity’s Last Exam” benchmark result that measures expert-level knowledge across subjects (o3 scored 20.32 with no tools, but 24.90 with browsing and tools), OpenAI notes that browsing-enabled models could potentially find answers online. The company reports implementing domain blocks and monitoring to prevent what it calls “cheating” during evaluations.

Even though early results seem promising overall, experts and academics who rely on SR models for rigorous research should take the time to verify that the AI model actually produced an accurate result rather than assuming it did. And if you’re operating the models outside your domain of knowledge, be careful about accepting any results as accurate without independent verification.

Pricing

For ChatGPT subscribers, access to o3 and o4-mini is included with the subscription. On the API side (for developers who integrate the models into their apps), OpenAI has set o3’s pricing at $10 per million input tokens and $40 per million output tokens, with a discounted rate of $2.50 per million for cached inputs. This represents a significant reduction from o1’s pricing structure of $15/$60 per million input/output tokens—effectively a 33 percent price cut while delivering what OpenAI claims is improved performance.

The more economical o4-mini costs $1.10 per million input tokens and $4.40 per million output tokens, with cached inputs priced at $0.275 per million tokens. This maintains the same pricing structure as its predecessor o3-mini, suggesting OpenAI is delivering improved capabilities without raising costs for its smaller reasoning model.
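To make those per-token rates concrete, here is a small back-of-the-envelope cost calculator using the prices listed above; the token counts in the example are arbitrary, not measurements.

```python
# Back-of-the-envelope API cost calculator using the listed per-million-token rates.
PRICES_PER_MILLION = {  # USD per million tokens
    "o3":      {"input": 10.00, "cached_input": 2.50,  "output": 40.00},
    "o4-mini": {"input": 1.10,  "cached_input": 0.275, "output": 4.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Dollar cost of one request, splitting input into cached and fresh tokens."""
    p = PRICES_PER_MILLION[model]
    fresh = input_tokens - cached_tokens
    return (fresh * p["input"]
            + cached_tokens * p["cached_input"]
            + output_tokens * p["output"]) / 1_000_000

# Example: a 50,000-token prompt (10,000 of it cached) with a 5,000-token answer.
for model in PRICES_PER_MILLION:
    print(model, round(request_cost(model, 50_000, 5_000, cached_tokens=10_000), 4))
```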

Codex CLI

OpenAI also introduced an experimental terminal application called Codex CLI, described as “a lightweight coding agent you can run from your terminal.” The open source tool connects the models to users’ computers and local code. Alongside this release, the company announced a $1 million grant program offering API credits for projects using Codex CLI.

A screenshot of OpenAI’s new Codex CLI tool in action, taken from GitHub. Credit: OpenAI

Codex CLI somewhat resembles Claude Code, an agent launched with Claude 3.7 Sonnet in February. Both are terminal-based coding assistants that operate directly from a console and can interact with local codebases. While Codex CLI connects OpenAI’s models to users’ computers and local code repositories, Claude Code was Anthropic’s first venture into agentic tools, allowing Claude to search through codebases, edit files, write and run tests, and execute command line operations.

Codex CLI is one more step toward OpenAI’s goal of making autonomous agents that can execute multistep complex tasks on behalf of users. Let’s hope all the vibe coding it produces isn’t used in high-stakes applications without detailed human oversight.


OpenAI releases new simulated reasoning models with full tool access Read More »

openai-#13:-altman-at-ted-and-openai-cutting-corners-on-safety-testing

OpenAI #13: Altman at TED and OpenAI Cutting Corners on Safety Testing

Three big OpenAI news items this week were the FT article describing the cutting of corners on safety testing, the OpenAI former employee amicus brief, and Altman’s very good TED Interview.

The FT detailed OpenAI’s recent dramatic cutting back on the time and resources allocated to safety testing of its models.

In the interview, Chris Anderson made an unusually strong effort to ask good questions and push through attempts to dodge answering. Altman did a mix of giving a lot of substantive content in some places while dodging answering in others. Where he chose to do which was, itself, enlightening. I felt I learned a lot about where his head is at and how he thinks about key questions now.

The amicus brief backed up that OpenAI’s current actions are in contradiction to the statements OpenAI made to its early employees.

There are also a few other related developments.

What this post does not cover is GPT-4.1. I’m waiting on that until people have a bit more time to try it and offer their reactions, but expect coverage later this week.

The big headline from TED was presumably the increase in OpenAI’s GPU use.

Steve Jurvetson: Sam Altman at TED today: OpenAI’s user base doubled in just the past few weeks (an accidental disclosure on stage). “10% of the world now uses our systems a lot.”

When asked how many users they have: “Last we disclosed, we have 500 million weekly active users, growing fast.”

Chris Anderson: “But backstage, you told me that it doubled in just a few weeks.” @SamA: “I said that privately.”

And that’s how we got the update.

Revealing that private info wasn’t okay, but it seems to have been an accident, and in any case Altman seemed fine with it.

Listening to the details, it seems that Altman was referring not to the growth in users, but instead to the growth in compute use. Image generation takes a ton of compute.

Altman says every day he calls people up and begs them for GPUs, and that DeepSeek did not impact this at all.

Steve Jurvetson: Sam Altman at TED today:

Reflecting on the life ahead for his newborn: “My kids will never be smarter than AI.”

Reaction to DeepSeek:

“We had a meeting last night on our open source policy. We are going to do a powerful open-source model near the frontier. We were late to act, but we are going to do really well now.”

Altman doesn’t explain here why he is doing an open model. The next question from Anderson seems to explain it, that it’s about whether people ‘recognize’ that OpenAI’s model is best? Later Altman does attempt to justify it with, essentially, a shrug that things will go wrong but we now know it’s probably mostly fine.

Regarding the accumulated knowledge OpenAI gains from its usage history: “The upload happens bit by bit. It is an extension of yourself, and a companion, and soon will proactively push things to you.”

Have there been any scary moments?

“No. There have been moments of awe. And questions of how far this will go. But we are not sitting on a conscious model capable of self-improvement.”

I listened to the clip and this scary moment question specifically refers to capabilities of new models, so it isn’t trivially false. It still damn well should be false, given what their models can do and the leaps and awe involved. The failure to be scared here is a skill issue that exists between keyboard and chair.

How do you define AGI? “If you ask 10 OpenAI engineers, you will get 14 different definitions. Whichever you choose, it is clear that we will go way past that. They are points along an unbelievable exponential curve.”

So AGI will come and your life won’t change, but we will then soon get ASI. Got it.

“Agentic AI is the most interesting and consequential safety problem we have faced. It has much higher stakes. People want to use agents they can trust.”

Sounds like an admission that they’re not ‘facing’ the most interesting or consequential safety problems at all, at least not yet? Which is somewhat confirmed by discussion later in the interview.

I do agree that agents will require a much higher level of robustness and safety, and I’d rather have a ‘relatively dumb’ agent that was robust and safe, for most purposes.

When asked about his Congressional testimony calling for a new agency to issue licenses for large model builders: “I have since learned more about how government works, and I no longer think this is the right framework.”

I do appreciate the walkback being explicit here. I don’t think that’s the reason why.

“Having a kid changed a lot of things in me. It has been the most amazing thing ever. Paraphrasing my co-founder Ilya, I don’t know what the meaning of life is, but I am sure it has something to do with babies.”

Statements like this are always good to see.

“We made a change recently. With our new image model, we are much less restrictive on speech harms. We had hard guardrails before, and we have taken a much more permissive stance. We heard the feedback that people don’t want censorship, and that is a fair safety discussion to have.”

I agree with the change and the discussion, and as I’ve discussed before if anything I’d like to see this taken further with respect to these styles of concern in particular.

Altman is asked about copyright violation, says we need a new model around the economics of creative output and that ‘people build off each others creativity all the time’ and giving creators tools has always been good. Chris Anderson tries repeatedly to nail down the question of consent and compensation. Altman repeatedly refuses to give a straight answer to the central questions.

Altman says (10:30) that the models are so smart that, for most things people want to do with them, they’re good enough. He notes that this is true based on user expectations, but that’s mostly circular. As in, we ask the models to do what they are capable of doing, the same way we design jobs and hire humans for them based on what things particular humans and people in general can and cannot do. It doesn’t mean any of us are ‘smart enough.’

Nor does it imply what he says next, that everyone will ‘have great models’ but what will differentiate will be not the best model but the best product. I get that productization will matter a lot for which AI gets the job in many cases, but continue to think this ‘AGI is fungible’ claim is rather bonkers crazy.

A key series of moments starts at 35:00 in. It’s telling that other coverage of the interview sidestepped all of this, essentially entirely.

Anderson has put up an image of The Ring of Power, to talk about Elon Musk’s claim that Altman has been corrupted by The Ring, a claim Anderson correctly notes also plausibly applies to Elon Musk.

Altman goes for the ultimate power move. He is defiant and says, all right, you think that, tell me examples. What have I done?

So, since Altman asked so nicely, what are the most prominent examples of Altman potentially being corrupted by The Ring of Power? Here is an eightfold path.

  1. We obviously start with Elon Musk’s true objection, which stems from the shift of OpenAI from a non-profit structure to a hybrid structure, and the attempt to now go full for-profit, in ways he claims broke covenants with Elon Musk. Altman claimed to have no equity and not be in this for money, and now is slated to get a lot of equity. I do agree with Anderson that Altman isn’t ‘in it for the money’ because I think Altman correctly noticed the money mostly isn’t relevant.

  2. Altman is attempting to do so via outright theft of a huge portion of the non-profit’s assets, then turn what remains into essentially an OpenAI marketing and sales department. This would arguably be the second biggest theft in history.

  3. Altman said for years that it was important the board could fire him. Then, when the board did fire him in response (among other things) to Altman lying to the board in an attempt to fire a board member, he led a rebellion against the board, threatened to blow up the entire company and reformulate it at Microsoft, and proved that no, the board cannot fire Altman. Altman can and did fire the board.

  4. Altman, after proving he cannot be fired, de facto purged OpenAI of his enemies. Most of the most senior people at OpenAI who are worried about AI existential risk, one by one, reached the conclusion they couldn’t do much on the inside, and resigned to continue their efforts elsewhere.

  5. Altman used to talk openly and explicitly about AI existential risks, including attempting to do so before Congress. Now, he talks as if such risks don’t exist, and instead pivots to jingoism and the need to Beat China, and hiring lobbyists who do the same. He promised 20% of compute to the superalignment team, never delivered and then dissolved the team.

  6. Altman pledged that OpenAI would support regulation of AI. Now he says he has changed his mind; OpenAI lobbies against bills like SB 1047, and its AI Action Plan is vice signaling that not only opposes any regulations but also seeks government handouts, the right to use intellectual property without compensation, and protection against potential regulations.

  7. Altman has been cutting corners on safety, as noted elsewhere in this post. OpenAI used to be remarkably good in terms of precautions. Now it’s not.

  8. Altman has been going around saying ‘AGI will arrive and your life will not much change’ when it is common knowledge that this is absurd.

One could go on. This is what we like to call a target rich environment.

Anderson offers only #1, the transition to a for-profit model, which is the most prominent example and the most obvious response, but he proactively pulls the punch. Altman admits he’s not the same person he was and that it all happens gradually (if it happened all at once it would be jarring), but says he doesn’t feel any different.

Anderson essentially says okay and pivots to Altman’s son and how that has shaped Altman, which is indeed great. And then he does something that impressed me, which is tie this to existential risk via metaphor, asking if there was a button that was 90% to give his son a wonderful life and 10% to kill him (I’d love those odds!), would he press the button? Altman says literally no, but points out the metaphor, and says he doesn’t think OpenAI is doing that. He says he really cared about not destroying the world before, and he really cares about it now, he didn’t need a kid for that part.

Anderson then moves to the question of racing, and whether the fact that everyone thinks AGI is inevitable is what is creating the risk, asking if Altman and his colleagues believe it is inevitable and asks if maybe they could coordinate to ‘slow down a bit’ and get societal feedback.

As much as I would like that, given the current political climate I worry this sets up a false dichotomy, whereas right now there is tons of room to take more responsibility and get societal feedback, not only without slowing us down but enabling more and better diffusion and adaptation. Anderson seems to want a slowdown for its own sake, to give people time to adapt, which I don’t think is compelling.

Altman points out we slow down all the time for lack of reliability, also points out OpenAI has a track record of their rollouts working, and claims everyone involved ‘cares deeply’ about AI safety. Does he simply mean mundane (short term) safety here?

His discussion of the ‘safety negotiation’ around image generation, where I support OpenAI’s loosening of restrictions, suggests that this is correct. So does the next answer: Anderson asks if Altman would attend a conference of experts to discuss safety, Altman says of course but he’s more interested in what users think as a whole, and ‘asking everyone what they want’ is better than asking people ‘who are blessed by society to sit in a room and make these decisions.’

But that’s an absurd characterization of trying to solve an extremely difficult technical problem. So it implies that Altman thinks the technical problems are easy? Or that he’s trying to rhetorically get you to ignore them, in favor of the question of preferences and an appeal to some form of democratic values and opposition to ‘elites.’ It works as an applause line. Anderson points out that the hundreds of millions ‘don’t always know where the next step leads’ which may be the understatement of the lightcone in this context. Altman says the AI can ‘help us be wiser’ about those decisions, which of course would mean that a sufficiently capable AI or whoever directs it would de facto be making the decisions for us.

OpenAI’s Altman ‘Won’t Rule Out’ Helping Pentagon on AI Weapons, but doesn’t expect to develop a new weapons platform ‘in the foreseeable future,’ which is a period of time that gets shorter each time I type it.

Altman: I will never say never, because the world could get really weird.

I don’t think most of the world wants AI making weapons decisions.

I don’t think AI adoption in the government has been as robust as possible.

There will be “exceptionally smart” AI systems by the end of next year.

I think I can indeed foresee the future where OpenAI is helping the Pentagon with its AI weapons. I expect this to happen.

I want to be clear that I don’t think this is a bad thing. The risk is in developing highly capable AIs in the first place. As I have said before, Autonomous Killer Robots and AI-assisted weapons in general are not how we lose control over the future to AI, and failing to do so is a key way America can fall behind. It’s not like our rivals are going to hold back.

To the extent that the AI weapons scare the hell out of everyone? That’s a feature.

On the issue of the attempt to sideline and steal from the nonprofit, 11 former OpenAI employees filed an amicus brief in the Musk vs. Altman lawsuit, on the side of Musk.

Todor Markov: Today, myself and 11 other former OpenAI employees filed an amicus brief in the Musk v Altman case.

We worked at OpenAI; we know the promises it was founded on and we’re worried that in the conversion those promises will be broken. The nonprofit needs to retain control of the for-profit. This has nothing to do with Elon Musk and everything to do with the public interest.

OpenAI claims ‘the nonprofit isn’t going anywhere’ but has yet to address the critical question: Will the nonprofit actually retain control over the for-profit? This distinction matters.

You can find the full amicus here.

On this question, Timothy Lee points out that you don’t need to care about existential risk to notice that what OpenAI is trying to do to its non-profit is highly not cool.

Timothy Lee: I don’t think people’s views on the OpenAI case should have anything to do with your substantive views on existential risk. The case is about two questions: what promises did OpenAI make to early donors, and are those promises legally enforceable?

A lot of people on OpenAI’s side seem to be taking the view that non-profit status is meaningless and therefore donors shouldn’t complain if they get scammed by non-profit leaders. Which I personally find kind of gross.

I mean I would be pretty pissed if I gave money to a non-profit promising to do one thing and then found out they actually did something different that happened to make their leaders fabulously wealthy.

This particular case comes down to that. A different case, filed by the Attorney General, would also be able to ask the more fundamental question of whether fair compensation is being offered for assets, and whether the charitable purpose of the nonprofit is going to be wiped out, or even pivoted into essentially a profit center for OpenAI’s business (as in buying a bunch of OpenAI services for nonprofits and calling that its de facto charitable purpose).

The mad dash to be first, and give the perception that the company is ‘winning’ is causing reckless rushes to release new models at OpenAI.

This is in dramatic contrast to when there was less risk in the room, and despite this OpenAI used to take many months to prepare a new release. At first, by any practical standard, OpenAI’s track record on actual model release decisions was amazingly great. Nowadays? Not so much.

Would their new procedures spot the problems it is vital that we spot in advance?

Joe Weisenthal: I don’t have any views on whether “AI Safety” is actually an important endeavor.

But if it is important, it’s clear that the intensity of global competition in the AI space (DeepSeek etc.) will guarantee it increasingly gets thrown out the window.

Christina Criddle: EXC: OpenAI has reduced the time for safety testing amid “competitive pressures” per sources:

Timeframes have gone from months to days

Specialist work such as finetuning for misuse (eg biorisk) has been limited

Evaluations are conducted on earlier versions than launched

Financial Times (Gated): OpenAI has slashed the time and resources it spends on testing the safety of its powerful AI models, raising concerns that its technology is being rushed out the door without sufficient safeguards.

Staff and third-party groups have recently been given just days to conduct “evaluations,” the term given to tests for assessing models’ risks and performance, on OpenAI’s latest LLMs, compared to several months previously.

According to eight people familiar with OpenAI’s testing processes, the start-up’s tests have become less thorough, with insufficient time and resources dedicated to identifying and mitigating risks, as the $300 billion startup comes under pressure to release new models quickly and retain its competitive edge.

Steven Adler (includes screenshots from FT): Skimping on safety-testing is a real bummer. I want for OpenAI to become the “leading model of how to address frontier risk” they’ve aimed to be.

Peter Wildeford: I can see why people say @sama is not consistently candid.

Dylan Hadfield Menell: I remember talking about competitive pressures and race conditions with the @OpenAI’s safety team in 2018 when I was an intern. It was part of a larger conversation about the company charter.

It is sad to see @OpenAI’s founding principles cave to pressures we predicted long ago.

It is sad, but not surprising.

This is why we need a robust community working on regulating the next generation of AI systems. Competitive pressure is real.

We need people in positions of genuine power that are shielded from them.

Peter Wildeford:

Dylan Hadfield Menell: Where did you find an exact transcription of our conversation?!?! 😅😕😢

You can’t do this kind of testing properly in a matter of days. It’s impossible.

If people don’t have time to think, let alone adapt, probe, and build tools, how can they see what your new model is capable of doing? There are some great people working on these issues at OpenAI, but this is an impossible ask.

Testing on a version that doesn’t even match what you release? That’s even more impossible.

Part of this is that it is so tragic how everyone massively misinterpreted and overreacted to DeepSeek.

To reiterate since the perception problem persists, yes, DeepSeek cooked, they have cracked engineers and they did a very impressive thing with r1 given what they spent and where they were starting from, but that was not DS being ‘in the lead’ or even at the frontier, they were always many months behind and their relative costs were being understated by multiple orders of magnitude. Even today I saw someone say ‘DeepSeek still in the lead’ when this is so obviously not the case. Meanwhile, no one was aware Google Flash Thinking even existed, or had the first visible CoT, and so on.

The result of all that? Talk similar to Kennedy’s ‘Missile Gap,’ abject panic, and sudden pressure to move up releases to show OpenAI and America have ‘still got it.’


OpenAI #13: Altman at TED and OpenAI Cutting Corners on Safety Testing Read More »

chatgpt-can-now-remember-and-reference-all-your-previous-chats

ChatGPT can now remember and reference all your previous chats

Unlike the older saved memories feature, the information saved via the chat history memory feature is not accessible or tweakable. It’s either on or it’s not.

The new approach to memory is rolling out first to ChatGPT Plus and Pro users, starting today—though it looks like it’s a gradual deployment over the next few weeks. Some countries and regions (the UK, European Union, Iceland, Liechtenstein, Norway, and Switzerland) are not included in the rollout.

OpenAI says these new features will reach Enterprise, Team, and Edu users at a later, as-yet-unannounced date. The company hasn’t mentioned any plans to bring them to free users. When you gain access to this, you’ll see a pop-up that says “Introducing new, improved memory.”

The new ChatGPT memory options. Credit: Benj Edwards

Some people will welcome this memory expansion, as it can significantly improve ChatGPT’s usefulness if you’re seeking answers tailored to your specific situation, personality, and preferences.

Others will likely be highly skeptical of a black box of chat history memory that can’t be tweaked or customized for privacy reasons. It’s important to note that even before the new memory feature, logs of conversations with ChatGPT may be saved and stored on OpenAI servers. It’s just that the chatbot didn’t fully incorporate their contents into its responses until now.

As with the old memory feature, you can click a checkbox to disable this completely, and it won’t be used for conversations with the Temporary Chat flag.

ChatGPT can now remember and reference all your previous chats Read More »