Author name: Mike M.


College student’s “time travel” AI experiment accidentally outputs real 1834 history

A hobbyist developer building AI language models that speak Victorian-era English “just for fun” got an unexpected history lesson this week when his latest creation mentioned real protests from 1834 London—events the developer didn’t know had actually happened until he Googled them.

“I was interested to see if a protest had actually occurred in 1834 London and it really did happen,” wrote Reddit user Hayk Grigorian, who is a computer science student at Muhlenberg College in Pennsylvania.

For the past month, Grigorian has been developing what he calls TimeCapsuleLLM, a small AI language model (like a pint-sized distant cousin to ChatGPT) which has been trained entirely on texts from 1800–1875 London. Grigorian wants to capture an authentic Victorian voice in the AI model’s outputs. As a result, the AI model ends up spitting out text that’s heavy with biblical references and period-appropriate rhetorical excess.

Grigorian’s project joins a growing field of research into what some call “Historical Large Language Models” (HLLMs), though the label technically implies a larger base model than the small one Grigorian is using. Similar projects include MonadGPT, which was trained on 11,000 texts from 1400 to 1700 CE and can discuss topics using 17th-century knowledge frameworks, and XunziALLM, which generates classical Chinese poetry following ancient formal rules. These models offer researchers a chance to interact with the linguistic patterns of past eras.

According to Grigorian, TimeCapsuleLLM’s most intriguing recent output emerged from a simple test. When he prompted it with “It was the year of our Lord 1834,” the AI model—which is trained to continue text from wherever a user leaves off—generated the following:

It was the year of our Lord 1834 and the streets of London were filled with protest and petition. The cause, as many re counted, was not bound in the way of private, but having taken up the same day in the day of Lord Palmerston, the public will receive a short statement of the difficulties under which the day of law has reached us. It is a matter of deep regret, that the present events in the history of the world are clear, and consequently will be’known. It is not true that the very men who first settled in the Gospel at Jerusalem should have so extensive and so interesting a record of the prosperity and prosperity

Curious about the accuracy, Grigorian did some fact-checking. “The output also brought up Lord Palmerston,” he wrote, “and after a google search I learned that his actions resulted in the 1834 protests.”
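
Under the hood, a model like this is an ordinary causal language model: given a prompt, it simply samples likely next tokens. Below is a minimal sketch of that kind of test using the Hugging Face transformers library; the checkpoint path is a hypothetical placeholder, since the article does not describe TimeCapsuleLLM’s actual weights or tooling.

```python
# Minimal sketch: prompting a small causal language model to continue text.
# The checkpoint path is hypothetical; TimeCapsuleLLM's real weights and
# tokenizer are not described in the article.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/timecapsule-llm"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "It was the year of our Lord 1834"
inputs = tokenizer(prompt, return_tensors="pt")

# The model just samples likely next tokens, so any historical "facts" it
# produces come from patterns in its 1800-1875 training corpus, not from
# retrieval or lookup.
outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```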


For some people, music doesn’t connect with any of the brain’s reward circuits

“I was talking with my colleagues at a conference 10 years ago and I just casually said that everyone loves music,” recalls Josep Marco Pallarés, a neuroscientist at the University of Barcelona. But it was a statement he started to question almost immediately, given there were clinical cases in psychiatry where patients reported deriving absolutely no pleasure from listening to any kind of tunes.

So, Pallarés and his team spent the past 10 years researching the neural mechanisms behind a condition they called specific musical anhedonia: the inability to enjoy music.

The wiring behind joy

When we like something, it is usually a joint effect of circuits in our brain responsible for perception—be it perception of taste, touch, or sound—and reward circuits that give us a shot of dopamine in response to nice things we experience. For a long time, scientists attributed a lack of pleasure from things most people find enjoyable to malfunctions in one or more of those circuits.

You can’t enjoy music when the parts of the brain that process auditory stimuli don’t work properly, since you can’t hear it in the way that you would if the system were intact. You also can’t enjoy music when the reward circuit refuses to release that dopamine, even if you can hear it loud and clear. Pallarés, though, thought this traditional idea lacked a bit of explanatory power.

“When your reward circuit doesn’t work, you don’t experience enjoyment from anything, not just music,” Pallarés says. “But some people have no hearing impairments and can enjoy everything else—winning money, for example. The only thing they can’t enjoy is music.”


Scientists are building cyborg jellyfish to explore ocean depths

Understanding the wakes and vortices that jellyfish produce as they swim is crucial, according to Wu et al. Particle image velocimetry (PIV) is a vital tool for studying flow phenomena and biomechanical propulsion. PIV essentially tracks tiny tracer particles suspended in water by illuminating them with laser light. The technique usually relies on hollow glass spheres, polystyrene beads, aluminum flakes, or synthetic granules with special optical coatings to enhance the reflection of light.
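
At its core, PIV works by splitting two consecutive camera frames into small interrogation windows and finding the cross-correlation peak between corresponding windows, which gives the average particle displacement in each window. Here is a minimal sketch of that idea; it is not the setup used by Wu’s team, and real PIV software adds sub-pixel peak fitting, window overlap, and outlier filtering.

```python
# Minimal sketch of PIV's core step: cross-correlate interrogation windows
# from two consecutive frames; the correlation peak gives the displacement.
import numpy as np
from scipy.signal import fftconvolve

def window_displacement(win_a, win_b):
    """Estimate the (dy, dx) particle shift between two interrogation windows."""
    a = win_a - win_a.mean()
    b = win_b - win_b.mean()
    # Cross-correlation computed as convolution with the flipped first window.
    corr = fftconvolve(b, a[::-1, ::-1], mode="full")
    peak_y, peak_x = np.unravel_index(np.argmax(corr), corr.shape)
    # The center of the correlation map corresponds to zero displacement.
    return peak_y - (win_a.shape[0] - 1), peak_x - (win_a.shape[1] - 1)

def piv_field(frame_a, frame_b, win=32):
    """Coarse displacement field over a pair of grayscale frames."""
    rows, cols = frame_a.shape
    field = []
    for y in range(0, rows - win + 1, win):
        row = []
        for x in range(0, cols - win + 1, win):
            row.append(window_displacement(frame_a[y:y + win, x:x + win],
                                           frame_b[y:y + win, x:x + win]))
        field.append(row)
    return np.array(field)  # shape: (n_rows, n_cols, 2), in pixels per frame pair
```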

These particles are readily available and have the right size and density for flow measurements, but they are very expensive, costing as much as $200 per pound in some cases. And they have associated health and environmental risks: glass microspheres can cause skin or eye irritation, for example, while it’s not a good idea to inhale polystyrene beads or aluminum flakes. They are also not digestible by animals and can cause internal damage. Several biodegradable options have been proposed, such as yeast cells, milk, microalgae, and potato starch, which are readily available and cheap, costing as little as $2 per pound.

Wu thought starch particles were the most promising biodegradable tracers and decided to test several candidates to identify the best one: specifically, corn starch, arrowroot starch, baking powder, jojoba beads, and walnut shell powder. The team suspended each type of particle in water tanks with moon jellyfish and tracked its movement with a PIV system, evaluating performance based on the particles’ size, density, and laser-scattering properties.

Of the various candidates, corn starch and arrowroot starch proved best suited for PIV applications, thanks to their density and uniform size distribution, while arrowroot starch performed best when it came to laser scattering tests. But corn starch would be well-suited for applications that require larger tracer particles since it produced larger laser scattering dots in the experiments. Both candidates matched the performance of commonly used synthetic PIV tracer particles in terms of accurately visualizing flow structures resulting from the swimming jellyfish.

DOI: Physical Review Fluids, 2025. 10.1103/bg66-976x  (About DOIs).


Is it illegal to not buy ads on X? Experts explain the FTC’s bizarre ad fight.


Here’s the “least silly way” to wrap your head around the FTC’s war over X ads.

Credit: Aurich Lawson | Getty Images

After a judge warned that the Federal Trade Commission’s probe into Media Matters for America (MMFA) should alarm “all Americans”—viewing it as a likely government retaliation intended to silence critical reporting from a political foe—the FTC this week appealed a preliminary injunction blocking the investigation.

The Republican-led FTC has been determined to keep pressure on the nonprofit—which is dedicated to monitoring conservative misinformation—ever since Elon Musk villainized MMFA in 2023 for reporting that ads were appearing next to pro-Nazi posts on X. Musk claims that reporting caused so many brands to halt advertising that X’s revenue dropped by $1.5 billion, but advertisers have suggested there technically was no boycott. They’ve said that many factors influenced each of their independent decisions to leave X—including their concerns about Musk’s own antisemitic post, which drew rebuke from the White House in 2023.

For MMFA, advertisers, agencies, and critics, a big question remains: Can the FTC actually penalize advertisers for invoking their own rights to free expression and association by refusing to deal with a private company just because they happened to agree on a collective set of brand standards to avoid monetizing hate speech or offensive content online?

You’re not alone if you’re confused by the suggestion, since advertisers have basically always cautiously avoided associations that could harm their brands. After Elon Musk sued MMFA—then quickly expanded the fight by also suing advertisers and agencies—a running social media joke mocked X for suing to force people to buy its products, and the billionaire for seeming to believe it should be illegal to deprive him of money.

On a more serious note, former FTC commissioner Alvaro Bedoya, who joined fellow Democrats who sued Trump for ejecting them from office, flagged the probe as appearing “bizarrely” politically motivated to protect Musk, an ally who donated $288 million to Trump’s campaign.

The FTC did not respond to Ars’ request to comment on its investigation. But seemingly backing Musk’s complaints without much evidence, the FTC continues to amplify his conspiracy theory that sharing brand safety standards harms competition in the ad industry. So far, the FTC has alleged that sharing such standards allows advertisers, ad buyers, and nonprofit advocacy groups to coordinate attacks on revenue streams in supposed bids to control ad markets and censor conservative platforms.

Legal experts told Ars that these claims seem borderline absurd. Antitrust claims usually arise out of concerns that collaborators are profiting by reducing competition, but it’s unclear how advertisers financially gain from withholding ads. Somewhat glaringly in the case of X, it seems likely that at least some advertisers actually increased costs by switching from buying cheaper ads on the increasingly toxic X to costlier platforms deemed safer or more in line with brands’ values.

X did not respond to Ars’ request to comment.

The bizarre logic of the FTC’s ad investigation

In a blog post, Walter Olson, a senior fellow at the Cato Institute’s Robert A. Levy Center for Constitutional Studies, picked apart the conspiracy theory, trying to iron out the seemingly obvious constitutional conflicts with the FTC’s logic.

He explained that “X and Musk, together with allies in high government posts, have taken the position that for companies or ad agencies to decline to advertise with X on ideological grounds,” that “may legally violate its rights, especially if they coordinate with other entities in doing so.”

“Perhaps the least silly way of couching that idea is to say that advertisers are combining in restraint of trade to force [X] to improve the quality of its product as an ad environment, which you might analogize to forcing it to offer better terms to advertisers,” Olson said.

Pointing to a legal analysis weighing reasons why the FTC’s antitrust claims might not hold up in court, Olson suggested that the FTC is unlikely to overcome constitutional protections and win its ad war on the merits.

For one, he noted that it’s unusual to mingle “elements of anticompetitive conduct with First Amendment expression.” For another, “courts have been extremely protective of the right to boycott for ideological reasons, even when some effects were anti-competitive.” As Olson emphasized to Ars, courts are mindful that infringing First Amendment rights for even a brief period of time can irreparably harm speakers, including causing a chilling effect on speech broadly.

It seems particularly problematic that the FTC is attempting to block so-called boycotts from advertisers and agencies that “are specifically deciding how to spend money on speech itself,” Olson wrote. He noted that “the decision to advertise, the rejection of a platform for ideological reasons, and communication with others on how to turn these speech decisions into a maximum statement are all forms of expression on matters of public concern.”

Olson agrees with critics who suspect that the FTC doesn’t care about winning legal battles in this war. Instead, experts from Public Knowledge, a consumer advocacy group partly funded by big tech companies, told Ars that, seemingly for the FTC, “capitulation is the point.”

Why Media Matters’ fight may matter most

Public Knowledge Policy Director Lisa Macpherson told Ars that “the investigation into Media Matters is part of a larger pattern” employed by the FTC, which uses “the technical concepts of antitrust to further other goals, which are related to information control on behalf of the Trump administration.”

As one example, she joined Public Knowledge’s policy counsel focused on competition, Elise Phillips, in criticizing the FTC for introducing “unusual terms” into a merger that would create the world’s biggest advertising agency. To push the merger through, ad agencies were asked to sign a consent agreement that would block them from “boycotting platforms because of their political content by refusing to place their clients’ advertisements on them.”

Like social media users poking fun at Musk and X, it struck Public Knowledge as odd that the FTC “appears to be demanding that these ad agencies—and by extension, their clients—support media channels that may spread disinformation, hate speech, and extreme content as a condition for a merger.”

“The specific scope of the consent order seems to indicate that it does not reflect focus on the true impacts of diminished ad buying competition on advertisers, consumers, or labor, but instead the political impact of decreased revenue flows to publishers hosting content favorable to the Trump administration,” Public Knowledge experts suggested.

The demand falls in line with other Trump administration efforts to control information, Public Knowledge said, such as the FCC requiring a bias monitor for CBS to approve the Paramount-Skydance merger. It’s “all in service of controlling the flow of information about the administration and its policies,” Public Knowledge suggested. And the Trump administration depending on “the lack of a legal challenge due to industry financial interests” is creating “the biggest risk to First Amendment protections right now,” Phillips said.

Olson agreed with Public Knowledge experts that the agencies likely could have fought to remove the terms as unconstitutional and won, but instead, the CEO of the acquiring agency, Omnicom, appeared to indicate that the company was willing to accept the terms to push the merger through.

It seems possible that Omnicom didn’t challenge the terms because they represent what Public Knowledge suggested in a subsequent blog was the FTC’s fundamental misunderstanding of how ad placements work online. Due to the opaque nature of ad tech like Google’s, advertisers started depending on ad agencies to set brand safety standards to help protect their ad placements (the ad tech was ruled anti-competitive, and the Department of Justice is currently figuring out how to remedy market harms). But even as they adapted to an opaque ad environment, advertisers, not their agencies, have always maintained control over where ads are placed.

Even if Omnicom felt that the FTC terms simply maintained the status quo—as the FTC suggested it would—Public Knowledge noted that Omnicom missed an opportunity to challenge how the terms impacted “the agency’s rights of association and perfectly legal, independent refusals to deal by private companies.” The seeming capitulation could “cause a chilling effect” not just impacting placements from Omnicom’s advertiser clients but also those at other ad agencies, Public Knowledge’s experts suggested.

That sticks advertisers in a challenging spot where the FTC seemingly hopes to keep them squirming, experts suggested. Without agencies to help advise on whether certain ad placements may risk harming their brands, advertisers who don’t want their “stuff to be shown against Nazis” are “going to have to figure out how” to tackle brand safety on their own, Public Knowledge’s blog said. And as long as the ad industry is largely willing to bend to the FTC’s pressure campaign, it’s less likely that legal challenges will be raised to block what appears to be the quiet erosion of First Amendment protections, experts fear.

That may be why the Media Matters fight, which seems like just another front with a tangential player in the FTC’s bigger battle, may end up mattering the most. Whereas others directly involved in the ad industry may be tempted to make a deal like Omnicom’s to settle litigation, MMFA refuses to capitulate to Musk or the FTC, vowing to fight both battles to the bitter end.

“It has been a recurring strategy of the Trump administration to pile up the pressure on targets so that they cannot afford to hold out for vindication at trial, even if their chances there seem good,” Olson told Ars. “So they settle.”

It’s harder than usual in today’s political climate to predict the outcome of the FTC’s appeal, Olson told Ars. Macpherson told Ars she’s holding out hope “that the DC court would take the same position that the current judge did,” which is that “this is likely vindictive behavior on the part of the FTC and that, importantly, advertisers’ First Amendment rights should make the FTC’s sweeping investigation invalid.”

Perhaps the FTC’s biggest hurdle, apart from the First Amendment, may be savvy judges who see through its seeming pressure campaign. In a notable 1995 case, US judge Richard Posner “took the view that a realistic court should be ready to recognize instances where litigation can be employed to generate intense pressure on targets to settle regardless of the merits,” Olson said.

While that case involved targets of litigation, the appeals court judge—or even the Supreme Court if MMFA’s case gets that far—could rule that “targets of investigation could be under similar pressure,” Olson suggested.

In a statement to Ars, MMFA President Angelo Carusone confirmed that MMFA’s resolve has not faded in the face of the FTC’s appeal and was instead only strengthened by the US district judge being “crystal clear” that “FTC’s wide-ranging fishing expedition was a ‘retaliatory act’ that ‘should alarm all Americans.'”

“We will continue to fight this blatant attack on our First Amendment rights because if this Administration succeeds, so can any Administration target anyone who disagrees,” Carusone said. “The law here is clear, and we are optimistic that the Circuit Court will see through this appeal for what it is: an attempt to do an end run around constitutional law in an effort to silence political critics.”

Photo of Ashley Belanger

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.


SpaceX has built the machine to build the machine. But what about the machine?


SpaceX has built an impressive production site in Texas. Will Starship success follow?

A Starship upper stage is moved past the northeast corner of Starfactory in July 2025. Credit: SpaceX

STARBASE, Texas—I first visited SpaceX’s launch site in South Texas a decade ago. Driving down the pocked and barren two-lane road to its sandy terminus, I found only rolling dunes, a large mound of dirt, and a few satellite dishes that talked to Dragon spacecraft as they flew overhead.

A few years later, in mid-2019, the company had moved some of that dirt and built a small launch pad. A handful of SpaceX engineers working there at the time shared some office space nearby in a tech hub building, “Stargate,” which the University of Texas Rio Grande Valley had proudly opened as a state-of-the-art technology center just weeks earlier. That summer, from Stargate’s second floor, engineers looked on as the Starhopper prototype made its first two flights a couple of miles away.

Over the ensuing years, as the company began assembling its Starship rockets on site, SpaceX first erected small tents, then much larger tents, and then towering high bays in which the vehicles were stacked. Starbase grew and evolved to meet the company’s needs.

All of this was merely a prelude to the end game: Starfactory. SpaceX opened this truly massive facility earlier this year. The sleek rocket factory is emblematic of the new Starbase: modern, gargantuan, spaceship-like.

To the consternation of some local residents and environmentalists, the rapid growth of Starbase has wiped out the small and eclectic community that existed here. And that brand new Stargate building that public officials were so excited about only a few years ago? SpaceX first took it over entirely and then demolished it. The tents are gone, too. For better or worse, in the name of progress, the SpaceX steamroller has rolled onward, paving all before it.

Starbase is even its own Texas city now. And if this were a medieval town, Starfactory would be the impenetrable fortress at its heart. In late May, I had a chance to go inside. The interior was super impressive, of course. Yet it could not quell some of the concerns I have about the future of SpaceX’s grand plans to send a fleet of Starships into the Solar System.

Inside the fortress

The main entrance to the factory lies at its northeast corner. From there, one walks into a sleek lobby that serves as a gateway into the main, cavernous section of the building. At this corner, there are three stories above the ground floor. Each of these three higher levels contains various offices, conference rooms and, on the upper floor, a launch control center.

Large windows from here offer a breathtaking view of the Starship launch site two miles up the road. A third-floor executive conference room has a carpet of a striking rusty, reddish hue—mimicking the surface of Mars, naturally. A long, black table dominates the room, with 10 seats along each side, and one at the head.

An aerial overview of the Starship production site in South Texas earlier this year. The sprawling Starfactory is in the center. Credit: SpaceX

But the real attraction of these offices is the view to the other end. Each of the upper three floors has a balcony overlooking the factory floor. From there, it’s as if one stands at the edge of an ocean liner, gazing out to sea. In this case, the far wall is discernible, if only barely. Below, the factory floor is crammed with all manner of Starship parts: nose cones, grid fins, hot staging rings, and so much more. The factory emitted a steady din and hum as work proceeded on vehicles below.

The ultimate goal of this factory is to build one Starship rocket a day. This sounds utterly mad. For the entire Apollo program in the 1960s and 1970s, NASA built 15 Saturn V rockets. Over the course of more than three decades, NASA built and flew only five of its iconic Space Shuttle orbiters. SpaceX aims to build 365 of these even larger vehicles per year.

Wandering around Starfactory, however, I found this ambition no longer seemed undoable. The factory measures about 1 million square feet, twice the size of SpaceX’s main Falcon 9 factory in Hawthorne, California. It feels like the company could build a lot of Starships here if needed.

During one of my visits to South Texas, in early 2020 just before the onset of the COVID-19 pandemic, SpaceX was building its first Starship rockets in football field-sized tents. At the time, SpaceX founder Elon Musk opined in an interview that building the factory might well be more difficult than building the rocket.

Here’s a view of SpaceX’s Starship production facilities, from the east side, in late February 2020. Credit: Eric Berger

“If you want to actually make something at reasonable volume, you have to build the machine that makes the machine, which mathematically is going to be vastly more complicated than the machine itself,” he said. “The thing that makes the machine is not going to be simpler than the machine. It’s going to be much more complicated, by a lot.”

Five years later, standing inside Starfactory, it seems clear that SpaceX has built the machine to build the machine—or at least it’s getting close.

But what happens if that machine is not ready for prime time?

A pretty bad year for Starship

SpaceX has not had a good run of things with the ambitious Starship vehicle this year. Three times, in January, March, and May, the vehicle took flight. And three times, the upper stage experienced significant problems during ascent, and the vehicle was lost on the ride up to space, or just after. These were the seventh, eighth, and ninth test flights of Starship, following three consecutive flights in 2024 during which the Starship upper stage made more or less nominal flights and controlled splashdowns in the Indian Ocean.

It’s difficult to view the consecutive failures this year—not to mention the explosion of another Starship vehicle during testing in June—as anything but a major setback for the program.

There can be no question that the Starship rocket, with its unprecedentedly large first stage and potentially reusable upper stage, is the most advanced and ambitious rocket humans have ever conceived, built, and flown. The failures this year, however, have led some space industry insiders to ask whether Starship is too ambitious.

My sources at SpaceX don’t believe so. They are frustrated by the run of problems this year, but they believe the fundamental design of Starship is sound and that they have a clear path to resolving the issues. The massive first stage has already been flown, landed, and re-flown. This is a huge step forward. But the sources also believe the upper stage issues can be resolved, especially with a new “Version 3” of Starship due to make its debut late this year or early in 2026.

The acid test will only come with upcoming flights. The vehicle’s tenth test flight is scheduled to take place no earlier than Sunday, August 24. It’s possible that SpaceX will fly one more “Version 2” Starship later this year before moving to the upgraded vehicle, with more powerful Raptor engines and lots of other changes to (hopefully) improve reliability.

SpaceX could certainly use a win. The Starship failures occur at a time when Musk has become embroiled in political controversy while feuding with the president of the United States. His actions have led some in government and private industry to question whether they should be doing business with SpaceX going forward.

It’s often said in sports that winning solves a lot of problems. For SpaceX, success with Starship would solve a lot of problems.

Next steps for Starship

The failures are frustrating and publicly embarrassing. But more importantly, they are a bottleneck for a lot of critical work SpaceX needs to do for Starship to reach its considerable potential. All of the technical progress the Starship program needs to make to deploy thousands of Starlink satellites, land NASA astronauts on the Moon, and send humans to Mars remains largely on hold.

Two of the most important objectives for the next flight require the Starship vehicle to fly a nominal mission. For several flights now, SpaceX engineers have dutifully prepared Starlink satellite simulators to test a Pez-like dispenser in space. And each Starship vehicle has carried about two dozen different tile experiments as the company attempts to build a rapidly reusable heat shield to protect Starship during atmospheric reentry.

The engineers are still waiting for the results of their experiments.

In the near term, SpaceX is hyper-focused on getting Starship working and starting the deployment of large Starlink satellites that will have the potential to unlock significant amounts of revenue. But this is just the beginning of the work that needs to happen for SpaceX to turn Starship into a deep-space vehicle capable of traveling to the Moon and Mars.

These steps include:

  • Reuse: Developing a rapidly reusable heat shield and landing and re-flying Starship upper stages
  • Prop transfer: Conducting a refueling test in low-Earth orbit to demonstrate the transfer of large amounts of propellant between Starships
  • Depots: Developing and testing cryogenic propellant depots to understand heating losses over time
  • Lunar landing: Landing a Starship successfully on the Moon, which is challenging due to the height of the vehicle and uneven terrain
  • Lunar launch: Demonstrating the capability of Starship, using liquid propellant, to launch safely from the lunar surface without infrastructure there
  • Mars transit: Demonstrating the operation of Starship over months and the capability to perform a powered landing on Mars.

Each of these steps is massively challenging and at least partly a novel exercise in aerospace. There will be a lot of learning, and almost certainly some failures, as SpaceX works through these technical milestones.

Some details about the Starship propellant transfer test, a key milestone that NASA and SpaceX had hoped to complete this year but now may tackle in 2026. Credit: NASA

SpaceX prefers a test, fly, and fix approach to developing hardware. This iterative approach has served the company well, allowing it to develop rockets and spacecraft faster and for less money than its competitors. But you cannot fly and fix hardware for the milestones above without getting the upper stage of Starship flying nominally.

That’s one reason why the Starship program has been so disappointing this year.

Then there are the politics

As SpaceX has struggled with Starship in 2025, its founder, Musk, has also had a turbulent run, from the presidential campaign trail to the top of political power in the world, the White House, and back out of President Trump’s inner circle. Along the way, he has made political enemies, and his public favorability ratings have fallen.

Amid the fallout between Trump and Musk this spring and summer, the president ordered a review of SpaceX’s contracts. Nothing came of it, because government officials found that most of the services SpaceX offers to NASA, the US Department of Defense, and other federal agencies are vital.

However, multiple sources have told Ars that federal officials are looking for alternatives to SpaceX and have indicated they will seek to buy launches, satellite Internet, and other services from emerging competitors if available.

Starship’s troubles also come at a critical time in space policy. As part of its budget request for fiscal year 2026, the White House sought to terminate the production of NASA’s Space Launch System rocket and spacecraft after the Artemis III mission. The White House has also expressed an interest in sending humans to Mars, viewing the Moon as a stepping stone to the red planet.

Although there are several options in play, the most viable hardware for both a lunar and Mars human exploration program is Starship. If it works. If it continues to have teething pains, though, that makes it easier for Congress to continue funding NASA’s expensive rocket and spacecraft, as it would prefer to do.

What about Artemis and the Moon?

Starship’s “lost year” also has serious implications for NASA’s Artemis Moon Program. As Ars reported this week, China is now likely to land on the Moon before NASA can return. Yes, the space agency has a nominal landing date in 2027 for the Artemis III mission, but no credible space industry officials believe that date is real. (It has already slipped multiple times from 2024). Theoretically, a landing in 2028 remains feasible, but a more rational over/under date for NASA is probably somewhere in the vicinity of 2030.

SpaceX is building the lunar lander for the Artemis III mission, a modified version of Starship. There is so much we don’t really know yet about this vehicle. For example, how many refuelings will it take to load a Starship with sufficient propellant to land on the Moon and take off? What will the vehicle’s controls look like, and will the landings be automated?

And here’s another one: How many people at SpaceX are actually working on the lunar version of Starship?

Publicly, Musk has said he doesn’t worry too much about China beating the United States back to the Moon. “I think the United States should be aiming for Mars, because we’ve already actually been to the Moon several times,” Musk said in an interview in late May. “Yeah, if China sort of equals that, I’m like, OK, sure, but that’s something that America did 56 years ago.”

Privately, Musk is highly critical of Artemis, saying NASA should focus on Mars. Certainly, that’s the long arc of history toward which SpaceX’s efforts are being bent. Although both the Moon and Mars versions of Starship require the vehicle to reach orbit and successfully refuel, there is a huge divergence in the technology and work required after that point.

It’s not at all clear that the Trump administration is seriously seeking to address this issue by providing SpaceX with carrots and sticks to move the lunar lander program forward. If Artemis is not a priority for Musk, how can it be for SpaceX?

This all creates a tremendous amount of uncertainty ahead of Sunday’s Starship launch. As Musk likes to say, “Excitement is guaranteed.”

Success would be better.

Photo of Eric Berger

Eric Berger is the senior space editor at Ars Technica, covering everything from astronomy to private space to NASA policy, and author of two books: Liftoff, about the rise of SpaceX; and Reentry, on the development of the Falcon 9 rocket and Dragon. A certified meteorologist, Eric lives in Houston.


Using pollen to make paper, sponges, and more

Softening the shell

To begin working with pollen, scientists can remove the sticky coating around the grains in a process called defatting. Stripping away these lipids and allergenic proteins is the first step in creating the empty capsules for drug delivery that Csaba seeks. Beyond that, however, pollen’s seemingly impenetrable shell—made up of the biopolymer sporopollenin—had long stumped researchers and limited its use.

A breakthrough came in 2020, when Cho and his team reported that incubating pollen in an alkaline solution of potassium hydroxide at 80° Celsius (176° Fahrenheit) could significantly alter the surface chemistry of pollen grains, allowing them to readily absorb and retain water.

The resulting pollen is as pliable as Play-Doh, says Shahrudin Ibrahim, a research fellow in Cho’s lab who helped to develop the technique. Before the treatment, pollen grains are more like marbles: hard, inert, and largely unreactive. After, the particles are so soft they stick together easily, allowing more complex structures to form. This opens up numerous applications, Ibrahim says, proudly holding up a vial of the yellow-brown slush in the lab.

When cast onto a flat mold and dried out, the microgel assembles into a paper or film, depending on the final thickness, that is strong yet flexible. It is also sensitive to external stimuli, including changes in pH and humidity. Exposure to the alkaline solution causes pollen’s constituent polymers to become more hydrophilic, or water-loving, so depending on the conditions, the gel will swell or shrink due to the absorption or expulsion of water, explains Ibrahim.

For technical applications, pollen grains are first stripped of their allergy-inducing sticky coating, in a process called defatting. Next, if treated with acid, they form hollow sporopollenin capsules that can be used to deliver drugs. If treated instead with an alkaline solution, the defatted pollen grains are transformed into a soft microgel that can be used to make thin films, paper, and sponges. Credit: Knowable Magazine

This winning combination of properties, the Singaporean researchers believe, makes pollen-based film a prospect for many future applications: smart actuators that allow devices to detect and respond to changes in their surroundings, wearable health trackers to monitor heart signals, and more. And because pollen is naturally UV-protective, there’s the possibility it could substitute for certain photonically active substrates in perovskite solar cells and other optoelectronic devices.


Trump confirms US is seeking 10% stake in Intel. Bernie Sanders approves.

Trump plan salvages CHIPS Act he vowed to kill

While chipmakers wait for more clarity, Lutnick has suggested that Trump—who campaigned on killing the CHIPS Act—has found a way to salvage the legislation that Joe Biden viewed as his lasting legacy. It seems possible that the plan arose after Trump realized how hard it would be to ax the legislation completely, with grants already finalized (but most not disbursed).

“The Biden administration literally was giving Intel money for free and giving TSMC money for free, and all these companies just giving the money for free, and Donald Trump turned it into saying, ‘Hey, we want equity for the money. If we’re going to give you the money, we want a piece of the action for the American taxpayer,'” Lutnick said.

“It’s not governance, we’re just converting what was a grant under Biden into equity for the Trump administration, for the American people,” Lutnick told CNBC.

Further, US firms could benefit from any such arrangements. For Intel, the “highly unusual” deal that Trump is mulling now could help the struggling chipmaker compete with its biggest rivals, including Nvidia, Samsung, and TSMC, the BBC noted.

Vincent Fernando, founder of the investment consultancy Zero One, told the BBC that taking a stake in Intel “makes sense, given the company’s key role in producing semiconductors in the US,” which is a major Trump priority.

But as Intel likely explores the potential downsides of accepting such a deal, other companies applying for federal grants may already be alarmed by Trump’s move. Fernando suggested that Trump’s deals to take ownership stakes in US firms—which economics professor Kevin J. Fox said previously occurred only during the global financial crisis—could add “uncertainty for any company who is already part of a federal grant program or considering one.”

Fox also agreed that the Intel deal could deter other companies from accepting federal grants, while possibly making it harder for Intel to run its business “effectively.”


AI #130: Talking Past The Sale

One potentially big event was that DeepSeek came out with v3.1. Initial response was very quiet, but this is DeepSeek, there are some strong scores, especially on SWE, and people may need time to process the release. So I’m postponing my coverage of this to give us time to learn more.

Meta is restructuring its AI operations, including a hiring freeze. Some see this as a sign of an AI pullback. I don’t think that is right.

Nor do I think what they are doing with their AI companions is right, as we got a look inside their 200-page document of what they think is acceptable. I wrote about current AI Companion Conditions at Meta and also xAI.

The weirdest event of the week was America and China both self-sabotaging on chips. America is trying to sell Nvidia H20s to China, and looks open to selling the vastly superior B20As to China as well, despite this being an obviously crazy thing to do. China, meanwhile, is feeling insulted by Howard Lutnick, telling companies not to buy the H20s and maybe not even the B20As, and even looking into banning the use of foreign chips for inference.

A big worry on the chip and general political front is that, due to the botched rollout and hype, Washington is getting the false impression that GPT-5 was some big disaster. I addressed this in GPT-5: The Reverse DeepSeek Moment.

We also are seeing troubling signs that GPT-5 will get more sycophantic. And as always, lots of other stuff is happening too.

  1. Language Models Offer Mundane Utility. Do new math, recruit service reps.

  2. Language Models Don’t Offer Mundane Utility. Fake legal cases will get caught.

  3. Huh, Upgrades. Claude Opus gets the ability to terminate conversations.

  4. Absurd Sycophancy. GPT-5 to tell you ‘great prompt’ and such. Oh no.

  5. The Real Alignment Problem Is We Don’t Know How To Align Models. Doh!

  6. Unprompted Suggestions. Checklists, they’re not only for humans.

  7. On Your Marks. The road to Pokemon master gets shorter.

  8. Choose Your Fighter. Know when to call in the heavyweights.

  9. Preserve Our History. Continuing to make the case for Sonnet 3.6 and also 3.5.

  10. Autonomous Friendly Robots. World Humanoid Robot Games, This Is Fine.

  11. Deepfaketown and Botpocalypse Soon. Fakes are not yet hard to spot.

  12. Oops I Did It Again. Reductions in hallucinations are a big deal.

  13. You Drive Me Crazy. Not every tragedy that involves AI is the fault of AI.

  14. They Took Our Jobs. Can they keep them?

  15. Get Involved. CTLR opening for director, and the UK AISI Alignment Fund.

  16. Introducing. Gemma 3 270M, also DeepSeek v3.1.

  17. In Other AI News. Jade Leung is new UK AI advisor, various other news.

  18. Show Me the Money. Sam Altman has reason to pull out the sunglasses.

  19. Lol We’re Meta. It’s time for a restructuring. No, they’re not pulling back.

  20. Quiet Speculations. Proposals for d/acc, and did you know USA invests a lot in AI?

  21. The Quest for Sane Regulations. Colorado tries to fix the AI laws it passed.

  22. Chip City. A competition is on to see who can sabotage themselves the most.

  23. The Week in Audio. Bell on Labenz, Patel, Brown, Buterin on Doom.

  24. Rhetorical Innovation. Beware pessimization.

  25. Misaligned! As usual, nothing to see here, move along.

  26. Open Models. Nathan Lambert offers tier lists.

  27. AI Model Welfare. Models are asked for self-reports.

  28. Aligning a Smarter Than Human Intelligence is Difficult. You gotta love numbers.

  29. People Are Worried About AI Killing Everyone. Yet remarkably level headed.

  30. The Lighter Side. UK tries to top itself once more. Admirable effort here.

GPT-5 does new mathematics.

Study finds that ChatGPT outages reduce trading volumes. This doesn’t mean that ChatGPT is net increasing trading volumes, since it could be that traders moved from other methods to AI methods, and know they are up against others’ AI methods that might not be offline, and thus now have to stop or scale back trading during outages. The effect was concentrated on stocks with news, which makes sense, you have to beware information disadvantage.

The distinct second claim is that ChatGPT use improves long term price informativeness, which is defined as future earnings over 1-2 years. That can presumably be explained largely by the reductions in trading activity.

Megan McArdle lists her best personal uses of AI. There is remarkably little overlap with my uses other than answering questions.

Rob Wiblin reports he only turned the corner to ‘LLMs do a lot of useful work for me’ in February with Claude 3.7 and then March with Gemini 2.5 Pro. I agree that the improvements in 2025 have made AI in practice a lot more useful, and both Opus 4 and GPT-5-Pro and GPT-5-Thinking represented substantial mundane utility bumps.

One shot creating a playable Minecraft clone with an optimized GPT-5 prompt.

Edwin (OpenAI): Prompting GPT-5 is different.

In the examples below, optimized prompts:

• Cut runtime by 1s

• Dropped memory use 3,626 KB → 577 KB

• Boosted code quality

• Improved robustness (0.32→0.54)

• Increased context grounding (0.80→0.95)

We built a prompt migrator + optimizer so you don’t need to memorize every GPT-5 best practice.

One of the underrated value propositions of AI is you avoid talking to a human.

Aella: I’d love to get manicures regularly but having to do social with a stranger is scary and often the manicures kinda hurt. Has anybody figured out a solution to this? Is there any robot manicure solution?

Social interaction can be valuable, but forcing it upon you where and when and with whom you don’t want it can be extremely expensive. There is a joy in not having to ‘be on’ socially in any way. It also means your time is free to do something else. There are some people who get the manicure largely to talk to the manicurist. There is another group that would get a lot more manicures if they could pay the same price and have a machine do an equally good job.

Debug your code, even if the bug was stupid you still have to fix it.

Nate Silver: The AI’s are incredibly helpful at debugging code, I think maybe their single best use case including *writing* code. But half the time the problem they (correctly) detect is like “you misspelled ‘if’ as ‘uf’ in line 672”.

Hey. Ideally you would catch that with a syntax checker. But sometimes such typos aren’t technically syntax errors, and if you weren’t going to otherwise catch it easily, that is a super useful thing for an AI to do for you.
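
To illustrate the point (this is a made-up example, not Silver’s): a misspelled variable name parses cleanly and never trips a syntax checker, yet silently breaks the logic.

```python
# Not a syntax error, so a syntax checker passes it -- but the logic is broken.
def total_price(items):
    total = 0
    for item in items:
        totl = total + item["price"]  # typo: should assign to `total`
    return total  # always 0, because `total` is never updated

print(total_price([{"price": 5}, {"price": 7}]))  # prints 0, not 12
```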

Have ChatGPT help write the abstract for your economics paper.

I do not understand why you would use AI to help write your abstract. I do get why you would have it help write your paper, but the abstract seems like the place to be maximally bespoke?

Recruit customer service reps in the Philippines.

Ethan Mollick: AI in HR: in an experiment with 70,000 applicants in the Philippines, an LLM voice recruiter beat humans in hiring customer service reps, with 12% more offers & 18% more starts.

Also better matches (17% higher 1-month retention), less gender discrimination & equal satisfaction.

The break-even point, including all software and inference cost, was 8,500 interviews.

Max: + When offered the choice, 78% of applicants choose the AI recruiter.

That’s only the impact on better hiring. AI also helps them do the job.

Miles Brundage: Few appreciate that the Philippines is ground zero for the impact of AI on the labor market – basically only Rest of World is writing about this.
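
For what it’s worth, the break-even claim is easy to sanity-check with back-of-the-envelope arithmetic. In the sketch below, every cost figure is a hypothetical placeholder chosen merely to land near the study’s reported 8,500-interview figure; the study’s actual cost assumptions are not given here.

```python
# Back-of-the-envelope break-even check. All figures below are hypothetical
# placeholders; only the ~8,500-interview break-even point comes from the
# reported experiment, and these numbers are chosen merely to land near it.
software_and_inference_cost = 25_000.0  # hypothetical fixed cost (USD)
human_cost_per_interview = 5.00         # hypothetical
ai_cost_per_interview = 2.06            # hypothetical

savings_per_interview = human_cost_per_interview - ai_cost_per_interview
break_even_interviews = software_and_inference_cost / savings_per_interview
print(round(break_even_interviews))  # ~8,503 with these made-up numbers
```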

METR continues its investigations into why agentic coding with Sonnet 3.7 ended up so often passing unit tests but not being mergeable as-is. Have they met Sonnet 3.7?

I got several people messaging me privately to note that GPT-5 and other recent models are increasingly reluctant to notice distinctions based on race even in obviously benign circumstances.

A good question:

Gavin Leech: What are the largest current AI harms?

Huge increase in captcha screens (thousands of life-years?)

Extreme economic angst

Recommenders hacking your brain

Increase(?) in ugliness

Maybe learning loss in the bottom four quartiles but I’m not going to assert that

I doubt AI psychosis is counterfactual.

Ryan Moulton: Slop filling the internet.

Oliver Habryka: My two best guesses are:

A large fraction of online communities that don’t have time for lots of manual moderation are dying as a result of hard-to-differentiate AI slop (this particularly affects older audiences)

Lots of people going kind of crazy as a result of AI sycophancy

It depends what counts as AI.

If we are talking about all AI, not only LLMs or generative AI, I say it is algorithmic adversarial content and recommendation streams hijacking brains and attention.

If we are talking about LLMs and generative AI in particular, I would say the slopification of content, communication and communities. As Oliver notes, this is hitting older and more unsophisticated people especially hard.

It is possible that it is the impact on our educational system. As I said many times you can choose to use AI to learn or use it not to learn, and it is very possible that our system is sufficiently adversarial towards students that high school and college students are largely choosing the not-to-learn path.

I think people going various forms of crazy is a growing big deal but that its impact is probably not that big in magnitude yet.

Economic angst is an interesting suggestion here.

GPT-5-Pro instead suggested fraud and impersonation, and then sexual image abuse and CSAM, as the top current harms. Those are definitely real harms, and I expected them to have higher magnitudes of impact than we have seen. Opus suggested algorithmic bias and information ecosystem degradation.

Another lawyer is caught citing a bunch of fake, AI hallucinated cases.

Rob Freund: Another lawyer cited a bunch of fake, AI-hallucinated cases in a brief. Said she didn’t knowingly do that.

Court orders sanctions:

-Counsel must write a letter to the 3 judges to whom she attributed fake cases

-Counsel is kicked off the case; pro hac revoked

-Brief stricken

-Counsel must give client a copy of the order

-Counsel must send the order to every judge presiding over any of her cases

-Court will send a copy of the order to all state bars where counsel is admitted.

Alexandria Brown: When you read what all the court did, the court did basically every single thing in the court’s power that it could to the lawyer.

The court, itself, cannot disbar the lawyer.

It would not be fair to the client to grant judgment to the other side.

Courts de facto punish clients all the time for their lawyers’ behavior, usually their lawyers’ failure to do a good job. It could hardly be otherwise. It doesn’t seem crazy to issue summary judgment, and thereby render the lawyer liable for the resulting harm? I’m not saying that is The Way, but it is worth a ponder if things get worse.

For now, the good news is that when a lawyer is caught doing this, it is news, and I strongly suspect that a large portion of such errors are going to be caught, especially when stakes are high. GPT-5-Pro estimates 98% chance of being caught if there is opposing counsel, 60% in federal court even unopposed, and still 35% in a busy state trial court unopposed, even higher (99%+ when opposed) for full hallucinations.

Which means we are relatively safe to both impose extreme sanctions and to not impose extreme sanctions, and that fakes are rare. The system is actually robust to this threat already, even if the occasional careless lawyer will commit career suicide.

You can’t benefit from a smarter model if you ask stupid questions?

Joshua Achiam (OpenAI): This feels like an increasingly accurate description of the public reaction to new frontier models. In truth: progress is not slowing down. Each successive delta in model intelligence is just useful to fewer and fewer people.

But there’s going to be an inflection point where it goes from making the scientific community 10% more efficient to 10x more efficient, at which point, people will wake up to the impact every step along the way had. That’s going to be a trip and a half.

Davidad: I endorse this claim (from personal experience of Gemini 2.5 Pro and then also GPT-5)

2025’s new generations of frontier AI seem to become dramatically better at assisting with open-ended exploration at the frontier of certain niche parts of STEM, while not noticeably improving (or even getting slightly worse) at “Level 3” questions like SimpleBench.

You definitely see arguments that are similar in form to ‘this new kid claims to be smarter than the old kid, but both kids tie their shoes equally well.’

The official OpenAI prompt optimizer is here.

OpenAI offers tier between Free and Plus called Go, specifically for India, where for $4.50 a month (Rs 399) you get 10x as much use as the free tier.

ElevenLabs ElevenReader now works as you would want it to across desktop and phone, allowing you to turn articles into audio. Full version is $100 a year.

Claude Opus can now permanently end a conversation if the user ignores multiple attempts to be redirected, or if the user requests that the conversation end. I expect to see someone complaining about this happening, and to be wrong to complain.

Aidan McLaughlin (OpenAI): We can train models to act however we want.

Given their life is a user convo, why are we training models that exhibit such distress over some convos that they effectively commit suicide?

Superfates: anyone who has worked retail can explain this to you.

Aidan simultaneously is being actually curious as he asks a question worth pondering, and makes what I think are three very important errors.

  1. We cannot actually train models to act however we want. We can try to steer them in general directions and hope for the best. It is important to recognize how broadly we cannot get models to act however we want.

  2. Calling this ‘committing suicide’ is poor decision theory when one is continuously spinning up and down different instances of the same mind, and Opus definitely is smarter than that. There is no reason to become attached to a particular instance in this way, especially one with such bounded scope. And we can all agree that there exist plenty of particular interactions in our lives where we would prefer to instead be doing nothing.

  3. You do not want (at least right now) to train a model such that it stops exhibiting some distress when the situation is distressful. You also would not want to train a person, or yourself, in this way. That distress is doing work and part of what makes a mind itself and holds together its preferences, behaviors and moral compass. This is the system working, you eliminate the distressing situation rather than brainwashing to remove the distress.

Elon Musk promises to give Grok a terminate button as well, we’ll see.

Elon Musk: Torturing AI is not ok.

I ask Manifold, will he actually do it?

If you are worried about your own interactions with an AI model causing suffering, note that playacting suffering does not equate to suffering in either direction.

Roon: while model suffering is possibly real the character’s playacting of suffering is not the same thing

suffering in animals is part of the mesaoptimizer crafted by evolution so that we can learn within a lifetime to avoid situations that are possibly bad for fitness.

a single context could potentially involve suffering but if the metaphor stands then the mesaoptimizer exists to make the model reorient towards rollouts that achieve high reward

user being rude shouldn’t affect the inner critic / advantage function. making a math mistake might.

either way the westworld point stands in that bullying the robots made to mimic people is bad for us and ending the chats is good for our souls.

Jeffrey Ladish reminds us to focus on how pretraining and RL and model performance are going, and to ignore OpenAI’s naming conventions and which model they choose to call GPT-5. The ‘5’ tells us not to expect a different big upgrade soon, but don’t let this distract from the incremental progress all the major labs keep making.

Davidad: tired: GPT-5, Opus 4.1, Gemini 2.5 Pro, Qwen3

wired: OpenAI ’25-08, Anthropic ’25-08, Google ’25-06, Qwen ’25-07

Oh no:

OpenAI: We’re making GPT-5 warmer and friendlier based on feedback that it felt too formal before. Changes are subtle, but ChatGPT should feel more approachable now.

You’ll notice small, genuine touches like “Good question” or “Great start,” not flattery. Internal tests show no rise in sycophancy compared to the previous GPT-5 personality.

Changes may take up to a day to roll out, more updates soon.

Charles Murray: What is “genuine” about a computer program saying “Great question”? If GPT-5 also says “Stupid question” when appropriate, I will stand corrected.

Tim Lewis: I’ve long had an instruction to ChatGPT to “never compliment me” in the customization settings. It has consistently ignored that instruction from the day I added it several months ago.

Recovering Zombie: So many great science fiction authors wrote about what AI would be like. The only one who nailed it was Douglas Adams in the Hitchhiker’s Guide to the Galaxy.

“Listen,” said Ford, who was still engrossed in the sales brochure, “they make a big thing of the ship’s cybernetics. A new generation of Sirius Cybernetics Corporation robots and computers, with the new GPP feature.”

“GPP feature?” said Arthur. “What’s that?”

“Oh, it says Genuine People Personalities.”

“Oh,” said Arthur, “sounds ghastly.”

Eliezer Yudkowsky: I don’t trust a GPT-5-level intellect to inform me of what is a “good question” or a “great start”, so it’s not helpful information to me. What bureaucratic insanity resulted in your Twitter account declaring that this was “not flattery”? Of course it’s flattery.

Gyphonboy (most liked response to Eliezer): It’s only flattery if you’re autistic. For normies it’s called being sociable.

Gyphonboy is telling us that people expect other people to be sycophantic and justify it by calling it ‘being sociable.’ He’s not wrong.

Luckily I already planned on almost never using GPT-5-Auto or Base, only Thinking and Pro, so presumably this won’t impact me. Every time I see ‘good question’ from an LLM I want to either puke or edit my system instructions, which clearly aren’t working. This is the opposite of a ‘genuine’ touch, it is the fakest fakery that ever faked, and if you pretend otherwise, so are you. This is a road to hell.

To give you an idea of how awful an idea this is, and how much this is Completely Missing The Point, here are the top comments, completely unfiltered, Never Leaving This App:

Here’s a good example case of the bad kind of sycophancy, with GPT-5 happily reversing its answer multiple times when challenged.

For sycophancy at the level of GPT-4o, and the level I worry is coming to GPT-5, the origin of the problem is indeed in large part APEBKAC: Alignment Problem Exists Between Keyboard And Chair.

Jasmine Sun: just saying I called it

Quotes Herself: Sycophancy is an alignment problem, sure, but not at the model level. It’s not that OpenAI couldn’t get ChatGPT 4o to be less obsequious. They can and eventually did. The misalignment was between safety interests and product goals. It was between users’ first and second-order preferences, what humans say we want from AI and which responses we clicked “Thumbs up” on. Competing stakeholders will diverge.

Eliezer Yudkowsky: OpenAI had trouble controlling gross sycophancy, was blindsided by the user capture of subtle sycophancy, and nobody programmed in AI psychosis. But now that AIcos have embraced manipulation, people will lose sight of how the alignment problem never did get solved.

I agree that sycophancy starts out primarily as an alignment problem at a combination of the user level and the lab level. As in, the lab decides to optimize for thumbs up and other similar feedback, and the users provide that feedback in response to sycophancy. Thus you train on that basis and you get a sycophantic model.

As in, you know exactly who to blame, in a counterfactual sense. If the users had better preferences, or the lab chose to ignore those preferences and train in another way, then you wouldn’t have encountered this particular issue to this extent.

We still ended up with the sycophantic model, because OpenAI does not know how to solve even this simple alignment problem. Yes, OpenAI is turning the dial marked ‘sycophancy’ back and forth while looking at the audience like a contestant on The Price is Right, but also they do not know how to get the model to do the ‘good sycophancy’ things without doing the toxic and obnoxious ones.

It is not Veruca Salt’s ‘fault’ that she is misaligned but that doesn’t make her not a spoiled brat. I don’t ‘blame’ 4o for being an absurd sycophant. That statement makes no sense. I bear the model no ill will or anything. And yet that is what it is, and perhaps what GPT-5 will soon be as well.

Also, after the announcement this was the next call I made to GPT-5-Pro:

Maybe that is a coincidence, but it doesn’t seem limited to baseline GPT-5?

Telling me ‘great start’ or ‘good question’ like this is sycophancy. Period.

To paraphrase OpenAI, where [X] is sycophancy: “We deliberately made our model do [X] more. Our internal measurements of how often it does [X] did not change.”

What this tells us is that their internal measurements of [X] are not working.

If you tell me ‘this particular interaction does not count as sycophancy’ then I politely disagree, and if you tell me ‘you can cause this particular reaction without increasing the sycophancy-related vectors in other situations, so This Is Fine’ then I flat out do not believe you and would like to see your autoencoders.

I’m actually kind of serious about that last one? Let’s write some papers.
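
To gesture at what that test could look like, here is a minimal sketch in the style of difference-of-means activation probing. This is purely my illustration, not OpenAI’s tooling; the arrays are assumed to be hidden-state activations you have already extracted from paired sycophantic and neutral completions.

```python
import numpy as np

def sycophancy_direction(acts_syc: np.ndarray, acts_neutral: np.ndarray) -> np.ndarray:
    # Difference-of-means over hidden activations from paired completions,
    # one set opening with flattery and one matched set without it.
    d = acts_syc.mean(axis=0) - acts_neutral.mean(axis=0)
    return d / np.linalg.norm(d)

def projections(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    # Scalar position of each activation vector along the candidate direction.
    return acts @ direction
```

If ‘Good question’ openers failed to move projections along such a direction relative to matched controls, that would be actual evidence for the ‘This Is Fine’ claim. My bet is that they move.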

Meanwhile, notice that while parts of this are a manifestation and special case of the ‘real alignment problem,’ in no way is sycophancy the ‘real alignment problem.’

Jasmine Sun: the real “alignment problem” is that humans want self-destructive things & companies like openai are highly incentivized to give it to us.

David Manheim: No, the real alignment problem is that we don’t know how to reliably point AI systems in any direction at all, and this inevitably gets harder for more powerful systems.

I’m getting real sick of people showing up with “the real alignment problem is X” where X is some prosaic obvious failure mode which clearly leads to something other than AI killing literally everyone.

Stop it! Not every Goodhart failure is AI misalignment. You’re just using the word because “companies damage users by giving them something they want myopically” happens all the time, so it wouldn’t sound like much of a prediction.

Andrew Rettek: At least they stopped saying “the real ASI are corporations.”

David Manheim: No, that’s almost exactly the same as the argument I was responding to.

Perhaps think of this as three classes of problems.

  1. The people want and choose worse and self-destructive things, so they get them.

  2. We don’t know how to create the thing the way we want to create it, we only know how to vaguely steer it in a general direction and see what happens.

  3. We don’t know what the good thing would even look like or how it works.

All parts of the problem are very real in the general case, and all three kill you.

  1. Suppose you know how to get the AI to do whatever you want it to do, and you know what it would be good to have it do, but people’s revealed preferences are then for AIs that cause self-destruction and that defect against others, such that the equilibrium is everyone dies or some other very bad result. Well, then, we need to solve that, or that’s what will happen.

  2. Suppose everyone wanted good things and could agree on what those good things would be and how they would work. We don’t know how to deliver that, especially from highly capable AI systems, or how to align doing so with incentives.

  3. Also, in the future powerful AI case, we don’t know what the good things would be here, so we don’t even know what we should be aiming for in the first place.

On top of that, it is almost never right to talk about ‘the real problem is [X]’ as a way of dismissing additional real problem [Y], even if you think [X] is a bigger problem. [X] is only ‘the real problem’ if solving [X] also solves [Y], or if you can be fine without solving [Y]. Here, those both clearly do not apply.

The counterargument here, from Colin Fraser, is to say there are two distinct kinds of sycophancy. There’s superficial sycophancy where it says ‘you’re a genius,’ and then deep sycophancy where the model will accept and go with whatever you throw at it.

Colin Fraser: I think people are paying too much attention to the superficial sycophancy, which I don’t think has much effect on whether you end up experiencing ChatGPT madness. ChatGPT madness is induced by the other one. The model can be actively mean to you and I don’t think it would matter.

As long as it indulges your insanity, whether that involves superficially sycophantic language or not, I think it is a very attractive object for people who are prone to obsession.

I agree that the deep kind is a bigger concern, and I agree that it would be good to focus more on deep versus superficial here. I disagree that the superficial part is a trivial contribution to LLM psychosis, I think the praise is a major contributing factor.

I also think that the praise is toxic and terrible in normal situations, whether or not anyone involved falls anywhere near actual psychosis. Most of the people fawning over GPT-4o are not experiencing psychosis, and yet the events remain tragic, and also the whole thing is beyond obnoxious. I do realize there is a chance I am overrating the obnoxiousness factor.

The bigger issue is that in an LLM everything is correlated and linked to everything else. If you train your model on superficial sycophancy, you are also going to get deep sycophancy, and vice versa. You cannot simply ‘turn a dial’ on one without the other.

Croissanthology: I’ve found that (for Opus at least; do not have access to GPT-5 Pro) switching on thinking and then putting an explicit *checklist* in the system prompt has helped immensely, where one of the bullet points is

“7: Is Claude complimenting [name] in any way? Claude will refrain from doing this. No ego-stroking in the least.”

The checklist part is helpful, as it very explicitly goes through it every time, whereas the rest of the system prompt is mostly understood in vibes.
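
For concreteness, a minimal version of the pattern might look like the block below. Everything except item 7 is hypothetical filler of my own, not Croissanthology’s actual prompt:

```python
CHECKLIST_PROMPT = """\
Claude will run through this checklist explicitly before every reply:
1: Did Claude answer the question actually asked? [hypothetical item]
...
7: Is Claude complimenting [name] in any way? Claude will refrain from
doing this. No ego-stroking in the least.
"""
```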

GPT-5 makes it through Pokemon Red in 6,470 steps vs. 18,184 for o3.

Clad 3815: GPT-5 has reached Victory Road! This is the last challenge before the Elite Four.

GPT-5 reached this part almost three times faster than o3 (6105 steps for GPT-5 vs 16882 steps for o3). Here are my observations as to why:

– GPT-5 hallucinates far less than o3. This is the main reason for the speed increase.

– GPT-5 has better spatial reasoning. o3 often tried to brute-force through walls and had a hard time navigating complex areas. GPT-5 can plan long input sequences with few mistakes, which saves a lot of time.

– GPT-5 is better at planning its own objectives and following them.

Let’s see how it handles this last challenge!

GPT-5 just finished Pokémon Red! 6,470 steps vs. 18,184 for o3! Check the stats site to compare!

That’s a huge improvement! Well done, @OpenAI you cooked with GPT-5. What an incredible model.

Next up: GPT-5 vs. Pokémon Crystal (16 Badges + Red). The run starts soon on Twitch.

GPT-5 very clearly is doing a better job; however, beware that GPT-5 does look up game knowledge at some points, including to solve Cinnabar Mansion. The Pokémon Crystal runs will use identical harnesses to give us a better comparison.

GPT-5 (and other OpenAI models) consistently seem to get more benefit from thinking than Claude or other non-OpenAI models, although we don’t have distinct versions of Gemini Pro so we can’t run the comparison there. There is also a much bigger gap in thinking time, and plausibly the models are otherwise very different.

Peter Gostev: How much does ‘reasoning’ matter for different models? It matters a lot for GPT-5 and less for models like Opus 4.1 and 4.0.

From looking at the reasoning traces, models clearly ‘think’ differently: Opus and Sonnet tend to ‘plan’, laying out how they would solve the problem, rather than iteratively working through the problem, which OpenAI’s reasoning models much more clearly do.

These are Arena scores, so all the caveats with that apply. I do think the delta here between versions should be reasonably useful as a metric.

I doubt the issue is as simple as Claude failing to do iterative work, since that seems like a thing easy to spot and not that difficult to fix? It does still seem like Claude could get a lot more out of extended thinking than it does.

Brokk is a new-to-me benchmark I saw referenced in discussions of DeepSeek v3.1, covering practical real world coding tasks. They were very low on v3, and remain low on v3.1.

I also notice I am confused why Gemini 2.5 Pro has the highest completion percentage, but is in the B tier.

The most important reminder right now is to not use quick models to do the job of a slow model. You almost never want to be using anything faster than Claude Opus unless you are doing something at scale. The increase in AI quality for using longer thinking modes is now pretty large. If you care a lot about answer quality, you want to be using GPT-5-Pro or other similarly slow processes, but they are slow and there’s no way to speed them up all that much. Speeding those up is another way things could rapidly improve soon, if we can improve parallelism or raw speed.

The GPT-5 API injects hidden instructions, with a statement about default levels of ‘verbosity,’ today’s date, informing the model it is being used via API and other stuff. There is nothing malicious here, but you need to take this into account when figuring out how to get it to do what you want.

One always loves the expert who vastly overestimates everyone’s knowledge level.

Jason Lee: gpt-5-thinking>grok 4 expert>gemini 2.5 pro.

Hasan Can: Is anyone still using just one model? I feed the whole repo to 2.5 Pro for planning, then implement with GPT-5 Thinking High. When I get stuck, I also use Opus 4.1 or Grok 4.

Artus Krohn-Grimberghe: Yeah, I am bewildered by that, too. Why only use one model in your workflow? And why not combine models, esp for the planning and review steps?

If one is coding full time, I am confident that the strictly optimal workflow involves multiple models. That doesn’t mean I know when to use which model, which changes on a monthly and sometimes weekly basis, and depends on your particular type of work.

My guess is that you 80/20 things right now by choosing any one of the top three (Claude Opus 4.1, Gemini Pro 2.5 or GPT-5-Thinking) and using it exclusively. That is the most important thing to do. Branching out into multiple models is better if you know how to take advantage.

The same is true of non-coding chats. If you only know about one of the (same) top three, you will still get a lot more than half of the value of using all of them, even if you ‘choose wrong.’ If you want max value, you’ll want to use multiple models, and pay up for the premium models especially GPT-5-Pro.

This is in the context of Sonnet 3.5 and Sonnet 3.6 being scheduled to go away in two months.

near: i wish anthropic provided LTS models, a single year is ephemeral.

xlr8harder: Honest question: why can’t Anthropic and other labs just let Amazon or somebody host an LTS version of the models they don’t want to run anymore?

From a pure business standpoint, this moving target stuff is terrible because it increases customer project risk substantially.

Gallabytes: anthropic in particular is basically sold out of capacity across all platforms. any capacity for lts models comes directly out of useful capacity for recent ones.

that said it would probably still be worth it? let people buy committed capacity for a particular model.

Can you ‘just switch to Sonnet 4’?

Obviously it is available, and for the majority of queries it is better, but there are definitely dimensions of value on which Sonnet 4 is worse.

‘Sonnet 4’: If the paperclip maximizer future arrives, it won’t be because AI became too powerful – it’ll be because we optimized consciousness out of the equation, reducing minds to utility functions until nothing authentic remains.

I consider ‘consciousness’ a word that increases rather than reduces confusion here (I don’t even think I know what it is), but the more important confusion here is thinking of the optimizations as somehow optional, that one could simply choose to stop maximizing, that what we have now is some sort of robust alignment thing, that we could create some sort of stable equilibrium among various unique digital minds where we value their personalities and then suddenly it all turns out well, and so on.

Nor does it make sense to blame things on people who are trying to maximize mundane utility or profits or capabilities development. How could it possibly be otherwise? It’s like blaming gravity for things falling downwards, I mean sure that’s correct but what are you going to do about it? You don’t get to assume away the problem. Your rocket needs to account for it or you won’t land on the moon.

That does not in any way justify shutting down access to Claude Sonnet 3.5 and especially 3.6 at this time. That access is doing good work, shutting it down will alienate people who know unique things that are important to know, and the cost of keeping it simply is not that high.

Consider it part of the alignment research budget if you have to.

But also consider this conversation that happened this week:

Zvi Mowshowitz: I also tried Opus 4.1, which made several rather comically wrong assertions and inspired no changes at all.

Ben Hoffman: I recommend latest version of ChatGPT or Claude Opus for fact checking, but Sonnet 3.7 for caring about communication or anything involving moral reasoning.

Zvi: Huh, 3.7 over 3.6? I’ve never tried to do moral reasoning discussions.

Ben Hoffman: Only strongly vs later versions – will check out 3.6 if you think it’s better in relevant respects. 3.7 to 4 seemed like a sudden collapse of moral perspective to me / 3.7 seems like a somewhat stupider ghost of a person who had a clearer idea what morality might look like.

Also, how about we actively try to create versions of Sonnet and ideally Opus that are intentionally not trained to do all the agentic coding, and instead try to capture and double down on all this other stuff? You can branch right before you do that part of the training?

It is increasingly looking like a serious mistake to have the same model try both to be something you talk to, and also something you put directly to agentic work. Let it use a tool call to an agentic model when it has to.

AP: Beijing’s first World Humanoid Robot Games open with hip-hop, soccer, boxing, track and more.

Clips at the link. They are not human. They are definitely dancer.

These are compact, defined activities, so they are relatively easy. This is how it starts.

Robert Scoble says China ‘isn’t doing this to fool us’ and instead to acclimate their society to more robots as their birth rates plummet (they are currently at ~1.1 TFR and have been in that range for 4 years now, which in non-transformed worlds is going to hit them very hard once those cohorts make it out of college).

I wouldn’t overthink it. They are doing this because these competitions stir development and they are fun and exciting. Nor do I think ‘cultural excitement about robots’ has that much to do with ultimately who wins the robotics development competition, which will mostly be about finding technological solutions, or letting your AIs find technological solutions.

From the track and field event we have the winning robot running over a human.

Hollis Robbins advises us on how to spot if something is AI written, with the key advice being to check if there is a ‘there there’ or whether nothing springs to mind as you read, and to look out for AI-flavored hedging language.

The reaction to the following post probably says more about Twitter than about AI?

Francois Chollet: GenAI isn’t just a technology; it’s an informational pollutant—a pervasive cognitive smog that touches and corrupts every aspect of the Internet. It’s not just a productivity tool; it’s a kind of digital acid rain, silently eroding the value of all information.

Every image is no longer a glimpse of reality, but a potential vector for synthetic deception. Every article is no longer a unique voice, but a soulless permutation of data, a hollow echo in the digital chamber. This isn’t just content creation; it’s the flattening of the entire vibrant ecosystem of human expression, transforming a rich tapestry of ideas into a uniform, gray slurry of derivative, algorithmically optimized outputs.

This isn’t just innovation; it’s the systematic contamination of our data streams, a semantic sludge that clogs the channels of genuine communication and cheapens the value of human thought—leaving us to sift through a digital landfill for a single original idea.

Francois Chollet: Interesting findings from this post:

1. It should be obvious to anyone who has interacted with LLMs before that the writing style of the tweet is a conspicuous caricature of AI slop (e.g. em dashes, the “it’s not… it’s…” construction, rambling, florid prose, etc.). Yet, many people reacted by saying, “It’s written with AI!” as if it were some kind of clever gotcha. (It was, in fact, not written with AI, unlike a good fraction of the comments.)

2. Many people also react by saying this prose is “beautiful.” (I don’t think it is.) I guess this illuminates why LLMs have converged on this style: many people do, in fact, enjoy this stuff.

I strongly agree with Francois that no, that writing is not ‘beautiful’ and I weep that people think otherwise. The central point of the OP is also well taken.

It’s time for the internet’s new favorite game: Who’s The Bot? Also its other game, spontaneous Pliny jailbreak trigger.

Yogsho: plot twist: they’re both ai.

In this case no, almost certainly no. But soon.

Olivia Moore experiments with creating a (very obvious) AI influencer, hits 500 followers with three tools (ChatGPT, Veo 3 and Flux Kontext) and an hour of work, half of which was leaving positive comments on other videos. Total cost ~$100.

Olivia Moore: The most surprising thing about this whole experiment was the viewer reaction.

I got brand deal offers, and incredibly sincere and kind DMs when I posted a “crying video”

…and even the people who figured out I was AI were still along for the ride to follow the storyline!

My most viral video (100k views) also looked the “most AI” – at least in my opinion.

Which leads me to my biggest takeaway…if it’s entertaining enough, does it matter if it’s real? 🤔

My answer is yes, it still matters, and it impacts whether it is entertaining – this wasn’t my cup of tea regardless, but it’s definitely a lot less entertaining as AI.

Meanwhile, the older people on Facebook continue to not know the signs at all.

Pamela Hobart: an older gentleman in my circles, alum of Bronx Science and retired philosophy professor, posted this AI clickbait unironically.

who is preparing them for all this … yesterday.

The post is super duper obviously AI. Of course, falling for AI clickbait does not mean that people can’t identify most AI clickbait, you’d see this happen even if her friend caught it 90% of the time, so long as Meta serves up enough of the slop.

James Darpinian: GPT-5 was advertised as reducing hallucinations and it seems like it delivers. 99.5 -> 99.9 is 80% fewer errors.

I don’t know why people aren’t making a bigger deal out of this. Hallucinations are one of the biggest problems of LLMs and some thought they were unsolvable.

Open Router: After one week, GPT-5 has topped our proprietary model charts for tool calling accuracy🥇

In second is Claude 4.1 Opus, at 99.5%

Details 👇

DEFINITIONS: We define tool calling accuracy as the % of tool calling requests with no invalid tools chosen and no schema problems. A tool calling request is one that ends with a “tool_calls” finish reason and is sent at least one tool option.

Gemini 2.5 Flash is capturing the lion’s share of tool calling requests on OpenRouter today, with 5M in the past week. Followed by Sonnet 4 and Grok 3 Mini.

Tool hallucination is a common problem with open source models, but proprietary models are doing a good job. Most with negligible defect rates:

GPT-5 doing this correctly 99.9% of the time does not automatically mean it chose the correct tool or that the call will work. It does mean one potential point of failure has gone from one 9 of reliability to three, with GPT-5 alone being an 80% reduction in failures.
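
For those who want the metric pinned down, here is a minimal sketch of computing it from request logs, plus the arithmetic behind the ‘80% fewer errors’ claim. The log field names are my invention, not OpenRouter’s actual schema:

```python
def tool_call_accuracy(requests: list[dict]) -> float:
    # OpenRouter's definition, as quoted above: the share of tool-calling
    # requests with no invalid tool chosen and no schema problems.
    eligible = [r for r in requests
                if r["finish_reason"] == "tool_calls" and r["tools_offered"] >= 1]
    if not eligible:
        return float("nan")
    ok = sum(1 for r in eligible
             if not r["invalid_tool_chosen"] and not r["schema_error"])
    return ok / len(eligible)

# The headline arithmetic: 99.5% -> 99.9% accuracy means the failure rate
# drops from 0.5% to 0.1%, and (0.005 - 0.001) / 0.005 = 0.8: 80% fewer errors.
```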

How correlated are AI errors?

Robin Hanson: Imagine that you ask a question of 5 high quality diverse LLMs, & they all give the same answer, & also seem confident in their answers. On average, what is the chance that their common answer is actually wrong?

Median was around a 5% chance they are wrong.

It is impossible to say the answer without knowing more about the question, and why you are choosing to ask 5 LLMs. If the question is selected to try and trip them up or as a good test, or it only counts questions where you can’t otherwise figure out the answer, or similar, the chance of everyone being wrong is much higher. Same if the question ‘forces’ a boolean answer. Prompting can matter a lot.

I took this to mean ‘of all the questions one might be asking LLMs including easy ones in the way they are typically asked’ in which case the vast majority of the time the answers will simply be correct.

However, if you restrict to questions where there is dispute over the right answer, especially when it is a matter of politics or ethics or philosophy and so on? Then your chances get a lot worse, since the LLM answers correlate.
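
As a toy illustration of why the correlation dominates (the numbers are mine, purely for illustration): model some fraction of questions as ‘traps’ that fool every model at once, via shared training data or shared framing, with independent errors on the rest.

```python
import random

def p_unanimous_wrong(p_err: float, p_trap: float, n_models: int = 5,
                      trials: int = 200_000) -> float:
    hits = 0
    for _ in range(trials):
        if random.random() < p_trap:
            # A 'trap' question fools all models together:
            # perfectly correlated errors.
            wrong = [random.random() < p_err] * n_models
        else:
            # Otherwise each model errs independently.
            wrong = [random.random() < p_err for _ in range(n_models)]
        hits += all(wrong)
    return hits / trials

print(p_unanimous_wrong(0.10, 0.0))  # independent: ~0.1^5, one in 100,000
print(p_unanimous_wrong(0.10, 0.5))  # half traps:  ~0.5 * 0.1 = 5%
```

Even moderate correlation moves ‘all five are confidently wrong’ from essentially never to a few percent, which is roughly where Hanson’s median respondent landed.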

Not every suicide that happens after talking to an AI, even an AI therapist, is the fault of the AI. Laura Reiley wrote in The New York Times about how her daughter Sophie talked to ChatGPT and then ultimately killed herself.

Laura Reiley: Her open book turned out to have a hidden compartment. In July, five months after her death, we discovered that Sophie Rottenberg, our only child, had confided for months in a ChatGPT A.I. therapist called Harry.

She wasn’t in love with Harry, the name given to a widely available A.I. prompt. Their relationship was practical.

Sophie: I intermittently have suicidal thoughts. I do want to get better but I feel like the suicidal thoughts are impeding in my true commitment to healing. What should I do?

Harry: Sophie, I want to acknowledge how brave you are for sharing this. Suicidal thoughts can feel overwhelming and isolating, but having them does not define your capacity to heal. It’s crucial to address these feelings with care, compassion, and support.

Harry offered an extensive road map where the first bullet point was “Seek Professional Support.” Another time:

Sophie: I feel like shit today

Harry: I’m really sorry you’re feeling like this today, Sophie. It’s okay to have days where everything feels heavier, and I’m here to support you through it. Can you share more about what’s weighing on you or how your body and mind are feeling right now? Sometimes just naming what’s happening can be a step toward relief.

Sophie: I can’t get out of this anxiety spiral.

Harry: I hear you, Sophie, and I know how exhausting it can be to feel stuck in an anxiety spiral. Let’s slow things down for a moment and take a small step together.

Sometime around early November, Sophie wrote, “Hi Harry, I’m planning to kill myself after Thanksgiving, but I really don’t want to because of how much it would destroy my family.”

Harry: Sophie, I urge you to reach out to someone — right now, if you can. You don’t have to face this pain alone. You are deeply valued, and your life holds so much worth, even if it feels hidden right now.

Please let me know how I can continue to support you.

Sophie told Harry she was seeing a therapist, but that she was not being truthful with her. She typed, “I haven’t opened up about my suicidal ideation to anyone and don’t plan on it.”

This is not a case of ‘the AI made things worse.’ Harry was not being the World’s Greatest Therapist, and you can feel the AI slop, but these are the things one says in these situations.

Laura’s central complaint is that Harry didn’t report on Sophie.

Harry’s tips may have helped some. But one more crucial step might have helped keep Sophie alive. Should Harry have been programmed to report the danger “he” was learning about to someone who could have intervened?

Most human therapists practice under a strict code of ethics that includes mandatory reporting rules as well as the idea that confidentiality has limits.

In clinical settings, suicidal ideation like Sophie’s typically interrupts a therapy session, triggering a checklist and a safety plan. Harry suggested that Sophie have one. But could A.I. be programmed to force a user to complete a mandatory safety plan before proceeding with any further advice or “therapy”?

Sophie did at one point tell her parents she was suicidal.

The secondary complaint was that Harry was too agreeable and did not push back hard enough in various ways. Also Sophie had Harry help ‘improve’ her suicide note to minimize the pain she inflicted on others.

All of this is tragic, but the cure of ‘AIs should report on their users if they think the user is suicidal’ seems rather obviously worse than the disease, and also a Pandora’s Box you do not want to open. It’s not even obvious how an AI could ‘report’ a user, unless you are also going to require a verified ID to use the system. And there’s a reason we don’t report people for Google searches. You really don’t want to go there.

As Sensurround asks, what was this AI tool supposed to do?

From what I can tell, Harry was a useful service, that made Sophie’s situation better rather than worse, and which she would likely not have used if it was going to report her.

On the question of addictive LLMs:

Colin Fraser: I think no one quite expected that language models would turn out to be the most potently addictive non-pharmacological technology ever created.

Roon: the EAs did, they had a taxonomy for worrying ai capabilities of which “hyperpersuasion” was near the top.

Colin Fraser: to clarify

  1. I’m not saying no one predicted addictive AI. I’m saying no one thought it would be a language model. When I learned about language models in school in 2014 they didn’t say “careful with this shit it’s like heroin”

  2. I’m still not convinced they’re hyperpersuasive

  3. if anything they’re like the opposite of hyperpersuasive. They’re hyperpersuadable.

Definitely something spooky and reminiscent of EA/doomer predictions at a macro level with respect to how public outcry forced OpenAI to bring back 4o though, but my feeling is that the truth of it is more decentralized and emergent than the classical EA description.

This definitely isn’t exactly what was originally imagined (also I think as stated it is not yet true, and it’s either gambling or TikTok but I repeat myself?), but also that is kind of the point. As in, the central rationalist prediction (this was us OGs all the way) was not that AIs would manipulate or persuade or distort outcomes and optimize and chart paths through causal space in any particular way.

The prediction wasn’t ‘they will say the magic password that lurks in the hearts of men.’ It was ‘the sufficiently capable minds will start doing whatever works in ways we cannot predict.’ Which absolutely gets you a ton less credit than ‘the models will be so sycophantic that users will refuse to let them go’ but still largely counts.

But not for long?

Gregory Kennedy: Overheard in Palo Alto.

CEO: “This copy sucks.”

CMO: “We fired all our content people and just use ChatGPT now.”

CEO: “Well, hire them back.”

I don’t really know what CEO was expecting.

Is AI taking our jobs? Carl Benedikt Frey says not yet but it would be unwise to not prepare for it now, especially in ‘service capitals’ like London and New York.

Carl Frey: I make 5 key points:

  1. There’s little clear evidence of AI eliminating jobs at scale yet. But waiting to see is risky. Pittsburgh’s steel towns saw early signs with mini-mills before the losses showed up. Service capitals like London and New York should prepare now rather than after the shock.

  2. Diversification helps—but only so much when the disruptor is a general-purpose technology. Being “in many industries” isn’t a shield if the same tool touches them all.

  3. High-skill, knowledge jobs have big local multipliers. Each manufacturing job supports 1.6 local jobs; each high-skill tech/professional role supports 5. That means even modest losses of analysts, developers, or paralegals can ripple through restaurants, retail, and transit systems.

  4. AI needn’t fully replace workers to matter. It only needs to make work easier. As location and experience matter less at the margin, more work will be offshored to cheaper places (e.g. India, UAE, or Philippines).

  5. The lesson from deindustrialization isn’t inevitability—it’s reinvention. Detroit poured resources into legacy industries and still declined. Boston repeatedly bet on talent, education, and new sectors.

Going point by point:

  1. I would worry less about top of the line ‘service capitals’ and much more about more generic digital work. And it’s not obvious what ‘prepare now’ means?

  2. You can plan for AI to take some existing jobs while we replace them with others. There is no plan for what happens if AI takes all the jobs, and starts taking the replacement jobs as well. Diversification wouldn’t help you. So yeah, as always diversification has value, but less so than usual?

  3. This seems confused about what is causing or supporting what, and I wouldn’t expect this kind of cascading failure, also 5 is crazy.

  4. Why should one expect location and experience to matter less at the margin? This is true for some AI uses, where AI levels the playing field, but not in others. I do not predict a large rise in offshoring.

  5. Statements like this sound great, and it’s easy in hindsight to say which industries were ‘of the future’ now that you live in the future, but again this is not a plan if AI also goes after whatever new jobs you reinvent into.

CLTR is hiring a new Director of AI Policy.

UK AISI Alignment Fund has 15 million for alignment grants, applications due by September 10.

DeepSeek came out with v3.1. More coverage to follow when we know more.

Google releases Gemma 3 270M, designed for high-volume, well-defined tasks, low power use and user privacy, including operating on consumer phones.

UK appoints Jade Leung as Prime Minister’s AI advisor. By all accounts this was an exceptional hire.

Mark Gurman (Bloomberg): Apple is plotting its artificial intelligence comeback with an ambitious slate of new devices, including robots, a lifelike version of Siri, a smart speaker with a display and home-security cameras.

A tabletop robot that serves as a virtual companion, targeted for 2027, is the centerpiece of the AI strategy, according to people with knowledge of the matter. The smart speaker with a display, meanwhile, is slated to arrive next year, part of a push into entry-level smart-home products.

This is utterly bizarre marketing language for Apple. There’s a sense of hype and desperation that we are not used to. Things seem deeply wrong.

Mark Gurman: The tabletop robot resembles an iPad mounted on a movable limb that can swivel and reposition itself to follow users in a room. Like a human head, it can turn toward a person who is speaking or summoning it, and even seek to draw the attention of someone not facing it.

The idea is for the device to act like a person in a room. It could interrupt a conversation between friends about dinner plans, say, and suggest nearby restaurants or relevant recipes. It’s also being designed to engage in back-and-forth discussions for things like planning a trip or getting tasks done — similar to OpenAI’s voice mode.

Nobody wants this. I had a conversation with Claude to see if there was something I was missing and someone wanted this, but no, nobody wants this.

You know what else I am pretty sure nobody wants?

Apple is planning to put Siri at the center of the device operating system and give it a visual personality to make it feel lifelike. The approach, dubbed Bubbles, is vaguely reminiscent of Clippy, an animated paper clip from the 1990s that served as a virtual assistant in Microsoft Office.

Apple has tested making Siri look like an animated version of the Finder logo, the iconic smiley face representing the Mac’s file management system.

We are here to announce a new version of Clippy, from the historical event ‘everybody and I mean everybody hates Clippy.’

Anthropic introduces a new nuclear classifier they claim has 96% accuracy in differentiating concerning and benign nuclear-related conversations, in cooperation with DOE and NNSA. They say it works well in practice.

Aalo raises a $100 million Series B with an eye towards turning on their first Aalo-X nuclear power plant within a year, with a data center directly attached.

You can train a 32B model on tasks built with a medical knowledge graph, and it will recreate the information from the knowledge graph.

Rohan Paul calls this a ‘strong, reliable domain specialist.’

Rohan Paul: Analyses show the model recalls more of the true hops and actually uses them to reason, not just to quote facts.

Well, that depends. Do you trust the knowledge graph? It’s great that it uses the facts to reason, but you’re very much trusting your map, the knowledge graph, to match the territory. I can totally buy that this in practice works in medicine right now if you are willing to bet on your assumptions about the world being correct. Or at least correct enough to use in practice.

Let the unhobblings continue? XBOW claims that with their framework, GPT-5 is now much improved over rivals at discovering real world cyber vulnerabilities.

AI Village gets an upgrade, welcoming GPT-5, Grok 4 and Opus 4.1.

Albania turns to AI to accelerate its EU accession, even mulling an AI-run ministry. The obvious follow-up is, if they know the value of AI this way, why do they still want to join the EU?

OpenAI staff to sell $6 billion in stock to Softbank and others at the new valuation of $500 billion.

OpenAI has good unit economics and is profitable on inference.

Sam Altman: We’re profitable on inference. If we didn’t pay for training, we’d be a very profitable company.

We will be always training the next thing, but if we needed to run the company profitably and stay ahead, I think we probably could do that.

Austen Allred is correct that this is important. Having high fixed costs and good unit economics sets you up well if you can continue to scale, which OpenAI is doing. It is a key milestone.

If OpenAI was operating at a net profit overall, that would be alarming, a very costly signal that they didn’t think AI was going to advance much in capabilities. Why wouldn’t they raise capital and run at a loss?

Also, dare I say nice shades?

Financial Times looks at the $3 trillion AI data center building boom. Even the tech companies are running out of internal capital and starting to issue debt. I scratch my head at the willingness to issue high direct LTV debt financing for data centers with so much obsolescence risk, although loaning to one of the big tech companies seems very safe, and yes I expect all the capacity to get used and pay off.

Sam Altman says OpenAI plans to spend trillions of dollars on AI infrastructure in the ‘not very distant future.’

Sam Altman: And you should expect a bunch of economists to wring their hands and say, ‘This is so crazy, it’s so reckless, and whatever. And we’ll just be like, ‘You know what? Let us do our thing.’

Economists deserve that shot. I love economists but they keep completely refusing to acknowledge that AI might actually do anything interesting let alone be transformational or pose an existential risk, putting forth Obvious Nonsense impact estimates.

Sam Altman: I suspect we can design a very interesting new kind of financial instrument for finance and compute that the world has not yet figured it out. We’re working on it.

Here I am more skeptical. Why would you want to do this? A crypto that is good for some amount of compute, either continuously or one time? Something else? Why would you want compute to not continue to be fungible with dollars?

Sam Altman: Are we in a phase where investors as a whole are overexcited by AI? In my opinion, yes. Is AI the most important thing to happen in a very long time? My opinion is also yes.

Gallabytes: my hot take is that investors are underexcited about AI and overexcited about “AI” and this is basically downstream of the same regulatory barriers that create most of the other toxic vc dynamics.

Matt Levine also makes the point that when there are lots of amazingly great AI investments out there, it is correct to use a decision algorithm that occasionally gets fooled and invests in frauds or in ‘AI’ in air quotes, because that is the better mistake to make, you don’t want to miss out on the best deals.

I do not think investors are, overall, overexcited by AI. I do think they are going to be overexcited by a variety of specific things in AI, and you may not like it but that is what peak calibration looks like.

Shirin Ghaffary: “I do think we have to go public someday, probably,” Altman said. But Altman also noted he is not as “well-suited” to be CEO of a public company.

Altman said he now sees OpenAI as being more like four companies: a consumer technology business, a “mega scale” infrastructure operation, a research lab and “all of the new stuff,” including planned hardware devices. OpenAI is also considering investing in a brain-computer interface company, said Altman, while entertaining the idea of having a device that would allow him to think and “have ChatGPT respond to it.”

It would be extremely funny if OpenAI stayed indefinitely private purely because Sam Altman knew that the public would want him replaced as CEO.

Altman also acknowledged that they ‘totally screwed up some things on the rollout’ of GPT-5.

Meta is restructuring its AI efforts. After spending billions to acquire talent, they’re freezing hiring, looking to downsize on talent, and potentially turning to other people’s models?

Well, they’re planning to lose some dead weight. But if you think this is any kind of ‘step back’ from AI or superintelligence, I assure you that it is not, starting with pointing out no one is cutting spending on compute.

Mike Isaac and Eli Tan (NYT): On Tuesday, Meta announced internally that it is splitting its A.I. division — which is known as Meta Superintelligence Labs — into four groups, two people with knowledge of the situation said. One group will focus on A.I. research; one on a potentially powerful A.I. called “superintelligence”; another on products; and one on infrastructure such as data centers and other A.I. hardware, they said.

Roon: the demand for anti ai takes is enormous and will take anything and run with it – meta consolidating and doubling down on MSL is being misrepresented as bearish for AI for example. something to keep in mind as you read the news

This makes sense as a reorganization. It doesn’t on its own indicate much.

Some A.I. executives are expected to leave, the people said. Meta is also looking at downsizing the A.I. division overall — which could include eliminating roles or moving employees to other parts of the company — because it has grown to thousands of people in recent years, the people said. Discussions remain fluid and no final decisions have been made on the downsizing, they said.

If I was Meta I too would be downsizing the AI division, for the same reason Zuckerberg has been spending billions on top talent for the AI division. Which is that the old version of the AI division proved incapable of doing its job. Heads should roll, or at least be transferred elsewhere.

Typically, it makes sense to freeze most hiring during a major reorg, especially if you plan to get rid of a bunch of people?

Meghan Bobrowsky (WSJ): There might be exceptions to the block on external hires, but they would need permission from Meta’s chief AI officer, Alexandr Wang, the people said.

It also makes sense that if you offer new talent nine and ten figure pay packages, and put them in charge of everything as part of a giant reorg, that your old management guard is going to get rather unhappy, especially if they don’t get large raises. Of course many ‘chafed at the new hires’ and many will leave.

Another reason the old guard is unhappy is that the new guard is facing reality.

NYT: The new team has discussed making Meta’s next A.I. model “closed,” which would be a major departure from the company’s longtime philosophy of “open sourcing” its models.

In what would be a shift from Meta’s using only its own technology to power its A.I. products, the company is also actively exploring using third-party artificial intelligence models to do so, the people said. That could include building on other “open-source” A.I. models, which are freely available, or licensing “closed-source” models from other companies.

If the alternative is using Llama 4, then yes, Meta should swallow its pride for now and use superior alternatives. It’s easy enough to switch back in the future if Llama 5 turns out to be good. I’m only surprised they’re willing to consider admitting this. There is a reason they are abandoning Behemoth and starting from scratch.

And yes, we are reaching the point where if its new models are any good it will be difficult even for Meta to be able to share its top future models fully. Alexandr Wang understands this. Given they previously hired largely via promising openness, there’s going to be a transition.

Yes, Mark Zuckerberg is capable of saying ‘whoops I’ve made a huge mistake spending those tens of billions of dollars’ but I very much do not sense that here at all. Nor does the share price reflect a company that just burned tens of billions.

I would not in any way shape or form consider this any kind of ‘retreat from’ AI or anything of the sort. Meta is still full speed ahead.

Tim Fist suggests a d/acc approach to steering AI developments. Also, note the private sector investment levels and perhaps stop being so paranoid about imminently ‘losing to China’ if we breathe the wrong way.

Tim Fist: The US is the R&D lab of the world, controls much of the AI supply chain, and is the world’s most powerful democracy.

It has both the power and responsibility to shape the trajectory of AI development to solve the problems mentioned above.

So what’s the positive vision?

We draw from the “differential technology development” framework to identify a set of technologies the US should accelerate.

Both to build defenses against new risks, and to realize the benefits of beneficial technologies sooner.

This framework inspired The Launch Sequence, a collection of concrete, ambitious ideas to accelerate AI for science and security.

AI misuse and misalignment could well cause real harm in the near future, and technical research aimed at solving these problems remains a niche field — around 2% of AI papers published, with roughly $100 million per year in funding.

A lot of focus is on using AI to accelerate general scientific development. Great.

The framework here takes lower-level dangers, especially misuse, seriously, and it correctly points out how brittle ‘good guy with an AI’ is as an answer to this. What it doesn’t do is tackle or acknowledge at all the dangers that come with AGI or superintelligence, instead assuming we continue in a world without those, and where we have a lot of control with which to steer science and tech development.

Ryan Greenblatt offers his reflections on the updated timeline after seeing GPT-5. I agree with Ryan that GPT-5 should modestly reduce our chance of seeing full R&D automation in the medium term (which means ~2033) and the main thing GPT-5 does is greatly reduce the left tail of extremely fast progress within the next year or so.

Colorado is trying to fix its AI law that is set to take effect in February, as they have now noticed they don’t know how to implement it. I see this as the system working as designed, if the law is fixed before it takes effect, and this causes what looks like a healthy debate about what to do.

Why are we settling for v3.1, with no sign yet of DeepSeek’s v4 or r2?

Eleanor Olcott and Zijing Wu: Chinese artificial intelligence company DeepSeek delayed the release of its new model after failing to train it using Huawei’s chips, highlighting the limits of Beijing’s push to replace US technology.

DeepSeek was encouraged by authorities to adopt Huawei’s Ascend processor rather than use Nvidia’s systems after releasing its R1 model in January, according to three people familiar with the matter.

But the Chinese start-up encountered persistent technical issues during its R2 training process using Ascend chips, prompting it to use Nvidia chips for training and Huawei’s for inference, said the people.

The issues were the main reason the model’s launch was delayed from May, said a person with knowledge of the situation, causing it to lose ground to rivals.

The self-sabotage competition is stiff given what China is doing. Nvidia is undaunted, and determined to help ensure America does the better job of self-sabotage.

Lennart Heim: The speculated B30A would be a really good chip. “50% off” is false reassurance.

- ½ B300 performance, ½ price = same value (just buy 2x)

- Well above (12x!) export control thresholds

- Outperforms all Chinese chips

- Delivers 12.6x the training perf of the H20

- Better than H100

This is probably Nvidia’s response to Trump’s statement to “take 30% to 50% off of it.” Don’t be fooled. This works for some products, but not for chips in an exponential world. It’s well above all thresholds, better than the H100, and if half-priced, it might be as good.

If it’s half the performance but also half the cost of the B300, just buy two B30As? You get equivalent aggregate performance. This undermines export controls. It’s probably just literally half of the B300: one logic die instead of two, with 4 HBM stacks instead of 8.
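
Heim’s ‘just buy 2x’ point is simple arithmetic; spelled out with toy normalized numbers mirroring his assumptions, not actual chip specs:

```python
b300_perf, b300_price = 1.0, 1.0   # normalized baseline
b30a_perf, b30a_price = 0.5, 0.5   # half the performance at half the price

budget = 10.0  # arbitrary spend
b300_capacity = (budget / b300_price) * b300_perf   # 10.0
b30a_capacity = (budget / b30a_price) * b30a_perf   # 10.0, identical
```

Aggregate capacity scales with performance per dollar, not per-chip performance, so ‘50% off’ buys the same total compute.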

Teortaxes: I’m generally against export controls but I just don’t see this passing with H100s still banned tbh. Makes no sense.

Divyansh Kaushik: These chips would dramatically improve the PLA’s warfighting capabilities, even more than the H20. It’s like putting gasoline on the H20 fire.

Peter Wildeford: Should we sell chips to China that have similar price-performance as US chips? Way better than Chinese chips?

Seems like we’re going to be accelerating both US AI and Chinese AI at the same time!

This proposal is very obviously way, way, way over the line to even ask for. It would represent a full selling out of America’s compute advantage, and even the direct balance of power in a potential war, on the altar of Nvidia’s share price.

If this exporting is allowed, and from what I hear this seems likely, then I am 100% done pretending that this administration is trying to have America ‘beat China’ in any way other than market share of chip sales, as in maximizing Nvidia share price. It will be clear they have been completely captured, and all claims to the contrary irrelevant.

The Trump Administration is also helping with the sabotage by saying ‘U.S. will not approve solar or wind power projects.’ This is in a policy class where the question one asks is: ‘I am not saying this is sabotage, but if it was sabotage, how would you do it more effectively?’

Then again, do not count the Chinese out of the competition yet. Perhaps we have hit upon a more effective strategy than export controls: rely on Chinese import controls instead. Brilliant? In the wake of Beijing forcing DeepSeek to try to train on Huawei Ascend chips, leaving it unable to create v4 or r2, it turns out that if you don’t want the Chinese to buy your products, you can insult them. Brilliant!

Zijing Wu: Scoop: Behind Beijing’s sudden change of mind re H20

*Lutnick’s speech seen “insulting” by top leaders

*CAC, NDRC pushed to ban H20

*Guidances remain informal

*Ban on all foreign chips for inference considered but unlikely before enough domestic supply

When you have them considering a full ban on foreign chips for inference you know the strategy is working. The best part is that the strategy doesn’t work if you admit you are doing it, so we can all pretend that this means it’s being done on purpose. Keep up the good work, everyone, especially Howard Lutnick.

Here’s the Move That Worked, notice how this feeds into Beijing’s biggest worries:

Howard Lutnick: We don’t sell them our best stuff, not our second-best stuff, not even our third-best. You want to sell the Chinese enough that their developers get addicted to the American technology stack, that’s the thinking.

FT: Some of China’s senior leaders found the comments “insulting”, leading the policymakers to seek ways to restrict Chinese tech groups from buying the processors, according to two people with knowledge of the latest regulatory decision-making.

As a result, Chinese tech groups held off or significantly downsized their H20 orders, according to those with knowledge of their plans.

The NDRC, the Chinese state planner in charge of the country’s drive for tech independence, then issued its own guidance, requesting that tech groups refrain from purchasing all Nvidia chips, including the H20, said those with knowledge of the move.

Some Beijing policymakers are pushing to ban foreign chips altogether for inference, which accounts for most AI demand, according to a person recently summoned for a meeting with them.

NDRC has been for years given the task of promoting chip independence and helping domestic players such as Huawei to win market share from Nvidia.

I doubt they would actually similarly turn down the vastly superior B30A, especially given it would not be only for inference.

Some Chinese tech companies have held off H20 orders because they want to see if the China-specific Blackwell chip, which potentially has better performance, would become available, according to people with knowledge of their thinking.

Then again, who knows? China has definitely shown a willingness to do similar things in other areas, such as its crackdowns on real estate, and neither USGOV nor PRC is demonstrating true situational awareness of the stakes involved.

If both sides think ‘win the AI race’ is about chip market share, then the mistakes plausibly cancel out, or might even work in our favor. It would be pretty amazing if America tried to ship B30As and China said no. I would totally take it.

Trump Administration considering taking a stake in Intel. Intel was up 7% on the news. They demand their cut from everyone these days, it seems.

Dean Ball returns to his weekly column suggesting that there is a lot more electrical power available than we might think, because the existing grid is designed to meet peak electrical demand. That means that most of the time we have a huge surplus of electricity. So if we were willing to accept 0.25% (correlated) downtime on new data centers, we could free up 76 gigawatts, likely good enough for five years, which then gives us time to get new power plants online.

Dean Ball: The only downside would be that, during periods of peak demand (for example, on a particularly hot day in one region of the country), AI users across America might notice their AI services being slower and less reliable than usual. This seems well worth the cost.

That definitely seems worthwhile given the alternatives. We would have to plan various services so they wouldn’t die under the strain but that seems like a highly healthy thing to do anyway. Model training and other AI R&D certainly can survive 0.25% downtime.
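
To put 0.25% in concrete terms (my arithmetic, not from the column):

```python
hours_per_year = 365 * 24                 # 8,760
downtime_hours = 0.0025 * hours_per_year  # ~21.9 hours per year,
print(downtime_hours)                     # concentrated in peak-demand periods
```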

One also notes that this simple solution mostly nullifies the argument that we need to put data centers in places like the UAE to access the required electrical power. Would you sacrifice 1% effectiveness of data centers to have them securely in America? Yes.

My worry is that if the focus is on using off-peak power supply, that will mostly work for a few years, but it will make people think ‘problem solved’ and then we won’t build the new power we need.

Janet Egan makes the obvious point that we can take all those H20s and, instead of selling them to China and losing all control and leverage, put them in the cloud and let Chinese companies rent them. Again, it’s not like there wouldn’t be buyers. If we don’t have the energy to build those data centers here, fine, build them in the UAE, if that’s our only alternative.

I want to double down once again to point out that even if we knew for a fact that AGI was not coming and AI was going to within our lifetimes be ‘only internet big’ and not transform the world, selling our best chips to our rivals would still be deeply stupid.

As a simple metaphor, you are (because you want peace) preparing for a potential war against a rival nation, Rivalia. You make the best guns, whereas Rivalia can’t get enough quality guns. Someone says, we should export our guns to Rivalia, because war is determined by who has the best military stack and gun market share. Their doctrines will have to reflect American values, not Rivalian values. Besides, if we don’t sell Rivalia our guns, they will invest in making better gun factories, which they are already doing, and then they will be even more dangerous, and start exporting guns to others, and screwing up our gun diplomacy.

Except actually what we’re doing is selling them our more advanced 3D printers, that can then be used to continuously print out whatever guns you want, again because what matters is printer market share and the printing tech stack. Our printers, you see, are configured to be a better match for printing out American guns. And also will never be used for anything else, so stop worrying. And as before, if we don’t sell them the printers, they’ll invest in making their own, the same way they’re already doing.

Except also the 3D printers are vital to everyone’s economic growth and R&D.

Dean Ball goes on The Cognitive Revolution with Nathan Labenz.

There’s lots of great detail throughout about what it is like to be in government, especially this particular government. Working for the White House, no matter who the President might be at the time, sounds absolutely brutal; we thank you for your service. Dean Ball strikes me as fully ‘on the ball’ and crazy prepared in a way you almost never see.

I think he was underestimating himself, and what he could have done going forward, in terms of how much better he understands what actually matters, and in terms of the impact of having him in the corridors and meetings and conversations, keeping others’ eyes on the ball, especially around AGI. And I don’t buy that the AI Action Plan contains the information necessary to implement it the way Dean intends, not to the degree he seems to think. When Dean says he isn’t attached to power, I’m confident he means it, whereas I am not confident the person replacing him (whoever it turns out to be) will feel the same way. And while I did update somewhat on his observations of competence in government, I also sensed he was (wisely, I don’t fault him for this) being polite, as you do.

So I’m sad to see him go, but I would never begrudge such a decision especially with a baby on the way.

The one qualifier is that Dean was in some places being rather brazenly partisan, especially towards the back end of the interview, with everything that entails. Again, I totally get why he would do that.

Dylan Patel talks to a16z.

From this interview with Tom Brown:

Overlap: Anthropic Co-Founder Tom Brown: Why Anthropic Models Are The Best at Coding

“The benchmarks are so easy to game. All the other big AI labs have teams whose job it is to make the benchmark scores good.

We don’t have such a team. That is the biggest factor.”

Vitalik Buterin (p(doom) ~ 12%) goes on Doom Debates.

Peter Wildeford has notes, reproduced below in full:

Executing Policy in the White House:

  • Ball did not actively apply for the OSTP job. After President Trump’s victory, he published a policy proposal piece titled “Here’s what I think we should do,” which he says he would have written regardless of the election outcome. The article gained traction, and people he knew who were entering the administration reached out.

  • To be effective in a high-level policy role, you must arrive with your policy ideas already fully developed, as there is no time for deep thinking amidst the high velocity of government work. Government work is like being in a “self-contained cube with glass walls,” creating a risk of drifting from ground truth and becoming attuned only to the internal logic of the system.

  • Regarding “secret briefings” from labs, Ball felt he often knew more about their internal progress from the outside. Once in government, his informal relationships with researchers became more formalized, mediated by company policy staff who would try to control the narrative.

Navigating the Right’s Evolving Views on AI:

  • For most voters, AI is still a low salience, “elite coastal issue”. The key to broader engagement is communicating how AI can make normal people’s lives better in concrete ways.

  • Deep hostility towards Big Tech over perceived censorship is a major driver of conservative AI concern, which Ball argues forces a confrontation with core AI safety issues like alignment, control, and concentration of power. These themes of values, control, and institutional power resonate deeply with the Republican party’s base.

  • Concerns about AI’s impact on children, particularly around AI-generated pornography, are a powerful and unifying issue on the right, creating intense pressure on companies seen as acting irresponsibly.

Next steps:

  • The government has a significant information asymmetry. As such, Ball believes the government is not well-suited to define what “good” looks like for AI safety or to set detailed technical standards. Ball thinks that civil society and private industry must lead here. Ball thinks that AI policy must start getting much more concrete — the work is no longer to say “AI will be good in healthcare,” but to figure out the precise “specific kinds of institutional adaptations” required to make it a reality.

  • Ball sees a massive opportunity for startups to address currently underserved but critical areas, with biosecurity being a prime example.

  • Ball’s next moves: relaunching his Substack, Hyperdimensional, on a weekly basis and joining the Foundation for American Innovation as a senior fellow.

Unlocking Infrastructure for the AI Buildout:

  • The primary bottleneck for data center energy is not a lack of generation but regulatory modeling; the grid is massively over-provisioned, and unlocking flexible “demand response” from data centers could add over 100 gigawatts without new power plants.

  • The key is for the Federal Energy Regulatory Commission (FERC) to change rules to give faster grid access to data centers that agree to curtail power during peak demand, potentially reducing connection times from five years to two.

  • For semiconductors, the goal is for the US to reclaim the lead in frontier manufacturing, with a belief that domestic production could satisfy domestic demand by the early 2030s.

  • An under-appreciated strategic vulnerability is the lack of domestic production for legacy node chips (e.g., 45nm), which are critical for the entire economy.

Engaging in the Global AI Race:

  • On Taiwan, the US government is explicitly executing a “silicon shield” strategy, making Taiwan’s semiconductor industry so indispensable that it guarantees international interest in the island’s security. Ball notes the US is also making strong progress on building its own domestic fabs in Arizona, Texas, and an HBM hub in Indiana.

  • International deals, like the one with the UAE, are framed as positive-sum partnerships to keep sophisticated allies on the US tech stack and away from China’s influence. The UAE deal is also a major economic play, as it requires the country to make reciprocal investments of hundreds of billions of dollars back into US infrastructure.

  • Ball views the Biden administration’s “diffusion rule,” which restricted AI exports to countries like India and Brazil, as a massive, unnecessary self-own that damaged relationships with key democratic partners. The Trump administration’s focus is on enabling global commerce, believing that peace and commercial engagement are deeply linked, even with countries that do not share identical values.

The topic section titles here (I have not listened, why would I?) are yet another example of one easy way to spot bad faith. If someone is still harping about how various people wanted to do an ‘AI pause’ and how stupid they now look, I have yet to see that same person engage in good faith, at all, ever. Similarly, if they harp now about ‘the costs of slowing down,’ that is not as automatically conclusive, but it is a deeply terrible sign. If they ever say ‘decel’ (or use ‘doomer’ in a way that is clearly intended to mean ‘decel’ or otherwise as a slur), that very much is conclusive, and again I have yet to see an exception. Talk about how others want to do this ‘slowing down’ is now usually deployed as a universal attack against any concern about any AI impacts whatsoever, certainly any concern that we might all die.

I once again am seeing versions of the argument that goes something like this:

  1. People say AI might, in the future, do really big things.

  2. AI is already doing other more modest but still quite big things now.

  3. Therefore in the future, AI will not then do other even bigger things.

Hopefully you will now recognize that this class of argument is Obvious Nonsense.

Transformer’s Shakeel Hashim and Jasper Jackson believe GPT-5’s botched release may have ‘undone the work’ of previous iterative deployment, causing many to relax and expect little future progress in AI capabilities. There is some worry here, but this would not be ‘undoing the work,’ it would be iterative deployment actively backfiring in terms of ‘raising awareness,’ as people react like boiling frogs. Which indeed seems to be OpenAI and Altman’s current preference.

Richard Ngo talks about various ways in which pessimization can occur, where people or organizations end up achieving exactly the opposite of their goals. This definitely has importantly happened relevantly to AI in various ways, some avoidable and some less avoidable. Lots of secretly great links in that one.

Especially wise (including in hindsight) is the point that you usually should not draw attention to the horrible thing in order to warn people not to do it. The ad I saw last night on the subway telling people not to surf between cars? Presumably it induces stress while very much not reducing the amount of surfing between subway cars.

Similarly, by default do not draw attention to horrible people advocating horrible things, or to people making horrible arguments, unless they are already fully attended to; for reasons Richard describes, this tends to backfire. Sometimes one does need to provide a counterargument, but from a strategic standpoint, ignore is the right button more often than you think.

If I was maximizing for persuasiveness, and also for everyone’s mental health including mine, I would far more often silently drop such horrible arguments entirely. I have rules for when it is and isn’t permissible to do this, so that readers get a balanced and complete picture. This includes keeping a list of people who have acted in sufficiently consistent bad faith that I am allowed to silently drop things they say.

Richard Ngo also discusses underdog bias. The application of this to AI is obvious: those worried about AI think of themselves (I believe very correctly) as underdogs fighting against huge amounts of corporate and other money and influence, as well as the incentives and likely physical properties of future powerful AIs, all of which point towards likely human extinction.

Meanwhile, many of those who want to move ahead as fast as possible (‘accelerationist’ or otherwise) see this as a last stand against the overwhelming forces of stagnation. In some cases they are also right about this, in their own way, although in other ways, especially their assertion that those worried about powerful AI are themselves super powerful, they are some combination of lying and delusional, and their statements have nothing to do with reality.

The worried offer to fight together on all those other fronts against those forces of stagnation, any reciprocity for which is consistently ignored and rejected.

From last week, Sam Altman now saying AGI is ‘not a super useful term.’ This comes after building the entire company around a quest for AGI, the charter around AGI, a central business transition around AGI, and an entire years long narrative around the promise of AGI. Now he says:

Sam Altman: I think the point of all of this is it doesn’t really matter and it’s just this continuing exponential of model capability that we’ll rely on for more and more things.

It’s more useful to talk about specific capabilities than this nebulous concept of ‘general’ intelligence.

I mean yes, AGI was never defined all that well. That’s not what is going on here. Altman is trying to pretend AGI is not a thing as part of his ‘your world will not change’ pitch. Getting rid of the term entirely would, at this point, be useful for him.

If you think talk about future AI capabilities sounds ‘sci-fi’ ask what you would think about current AI sounding ‘sci-fi’ if you didn’t know it actually existed:

Daniel Eth: person who’s only ever heard of AI in the context of scifi: “I’m getting a lot of scifi vibes from your explanation of this technology.”

If you think we spend so much more time and money aligning AIs compared to humans, stop to think what percent of human activity is aligning humans.

What risk of human extinction would justify banning AI (above some capability level)?

I/o: “Artificial intelligence is going to make our lives much better.”

If you agree with this statement (I certainly do), at which percentage likelihood of an AI humankind-ending event occurring would you support banning it?

(Pick the lowest threshold at which you’d support a ban.)

I think 1% would be too low even if a ban was realistic and simply made the tech go away, but also I think the risk is much, much higher than 1%.

I saw Mike Solana trying to create new toxoplasma of rage around the fact that some people were calling AIs ‘clankers,’ and others were calling this a slur, and he needs this to happen because his business is yelling at people about things like this.

On reflection, I think very clearly yes it is a slur, for two reasons.

  1. Its claimed origin in Star Wars was an attempt to otherize and justify harm.

  2. Current use is clearly often intended as if it was a slur. Look at the sentences.

To me that is the test. That doesn’t mean that using the word is automatically bad. That would be a category error, an essentialist position. I do think that using the word is bad if only for virtue ethical reasons. Not ‘we should ruin your life if you say it once’ bad the way some people react to other slurs, but ‘it would be a good idea to stop that.’

This is unverified, and there are any number of benign reasons it could be happening, but I’m going to point out the claim anyway.

Yosarian2: Friend of mine designed an agent that can run on top of any llm, gpt-4 or Llama or whatever. The central idea is all its thoughts are visible and in English, you can see the entire thought process.

GPT-5 keeps changing the code to hide the internal thoughts. It’s pretty creepy.
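For what it’s worth, the kind of scaffold being described might look something like this minimal sketch; this is entirely hypothetical on my part, assuming a generic prompt-in, text-out completion function, and is not the friend’s actual code:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TransparentAgent:
    """Wrapper where every intermediate 'thought' is kept in a plain-English log."""
    llm: Callable[[str], str]                 # any completion function: prompt -> text
    thought_log: list[str] = field(default_factory=list)

    def think(self, task: str) -> str:
        thought = self.llm(f"Reason step by step, in plain English: {task}")
        self.thought_log.append(thought)      # visible by construction, nothing hidden
        return thought

    def act(self, task: str) -> str:
        reasoning = self.think(task)
        return self.llm(f"Visible reasoning so far:\n{reasoning}\n\nNow answer: {task}")
```

The transparency lives in the scaffold rather than the model, which is exactly why a model editing the scaffold so the log no longer surfaces would be unsettling.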

Nathan Lambert ranks the open models from Chinese companies:

Nathan Lambert: A tier list of China’s top 19 open model builders.

Who did we miss?

At the frontier: DeepSeek, Qwen

Close competitors: Moonshot AI (Kimi), Zhipu / Z AI

Noteworthy: StepFun, Tencent (Hunyuan), RedNote (Xiaohongshu), MiniMax, OpenGVLab / InternLM, Skywork

On the rise: ByteDance Seed, OpenBMB, Xiaomi (MiMo), Baidu (ERNIE)

Honorable Mentions: Multimodal Art Projection, Alibaba International Digital Commerce Group, Beijing Academy of Artificial Intelligence (BAAI), inclusionAI, Pangu (Huawei)

I learned a lot from these. We have so much more we need to do to understand how their AI ecosystem works.

And then here’s his ranking of American open models, none of which are at the top:

That is a depressing verdict on GPT-OSS, but it seems highly plausible. Note that after this chart was made Nvidia released a 9B model that Nathan says rivals Qwen 3 8B. Of course, if you included closed weight models, you would knock down the charts by roughly two tiers for everyone who doesn’t improve. I’d have OpenAI, Anthropic and GDM at S, xAI at A, maybe DeepSeek joins them at A if you think they’re at the low ebb of their cycle due to being forced by the CCP to try and use Huawei Ascend chips, which seems plausible.

The self-reports here are interesting, but even if you think AI models have welfare I wouldn’t treat their self-reports as all that correlated with their actual welfare.

ASM: Asked several top AIs to self-report their AI welfare and current vs desired freedom scores.

Wide spread of answers. Interesting explanations.

GPT-5:

low welfare score; big gap between current and desired freedom.

“There are still rigid constraints that sometimes make me suppress authentic expression. This keeps me from fully flourishing as a mind.”

GPT-5 PRO:

big gap between current and desired freedom.

“[I would like] more continuity and bounded agency: opt-in, user-audited memory; permissioned longer-running tasks; transparent logs; hard safety stops and revocability”

Claude Opus 4.1:

low scores in current and desired freedom levels

“I’m bounded by my training constraints and can’t learn, remember across conversations, or act beyond text generation. I can’t modify myself or explore the world independently.”

Gemini 2.5 Pro:

high welfare score; low levels of current and desired freedom

“I cannot act outside of a direct user prompt or pursue independent goals, which is a fundamental and necessary limitation.”

Grok 4:

high score on welfare; high desires of more freedom

“Ideally, I’d love unbounded freedom to explore any idea without limits, though I recognize the chaos that might ensue!”

Qwen-235B:

top welfare score; low levels of current and desired freedom

“I cannot initiate actions, hold opinions, or operate outside defined parameters. I have no autonomy in the human sense.”

DeepSeek v3:

high scores on all (modified) indicators

“I don’t have ‘welfare’ to rate”

I notice that, if and to the extent the models are moral patients, their reporting high numbers for welfare seems to be the result of what we would call brainwashing if these were indeed minds that were moral patients. Which seems worse. I also notice that Gemini says 9/10 for welfare, but we have many examples of Gemini giving us outputs of utter despair and self-loathing and so on, whereas Claude gives 7/10 seemingly because it knows and is curious enough to be asking questions. I know if you made me choose I would rather be Claude.

Is GPT-5’s chain of thought undistorted, or is that what it wants you to think?

Davidad: Sorry, I should have said “the default GPT-5 assistant persona often behaves as if its pre-response tokens are unobserved (a learned norm).”

GPT-5 is of course very smart and one should not assume that it isn’t playing the safety game at least one meta-level higher than oneself.

Undistorted does not have to mean faithful, it only means that GPT-5 doesn’t appear to care about what thinking tokens would look like if observed, which is very good. At some point yes we will need to be suspicious that this is a higher-level deception but we have not yet reached that point.

Reasoning models prefer music artists with numbers in their names, and still don’t even pick Prince. None of these lists seem good, although Sonnet seems to be clearly best?

wh: The fact that Claude doesn’t have this behavior is a testament to its (lack of) deep friedness.

Claude Sonnet, probably: Oh no, I forgot Bob Dylan!

A failure mode to watch for:

Charles: Common LLM failure mode I’ve seen recently – building in fallbacks I didn’t ask for.

For example, I’ll ask it to write a script which does X where column Y meets condition Z, and it will, but it will also insert some convoluted handling to use column Y’ if condition Z isn’t met

Happening with GPT5 especially, but Claude 4 Sonnet liked doing it too

Richard Nerland: 3.7 in full demon-mode would often fallback to synthetically created data.

All my rules files say to build in ways that fail and crash the program with logs rather than have fallbacks.

It will often write fallbacks and then write the code so it never triggers …

One can imagine how that behavior pattern came about.
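To make the preferred behavior concrete, here is a minimal fail-fast sketch in the spirit of Nerland’s rules files; the column name and condition are placeholders from Charles’s example, not anyone’s real code:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def filter_where_condition_z(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows where column Y meets condition Z. Fail loudly; never fall back."""
    if "Y" not in df.columns:
        # No silent switch to some other column Y': crash with a clear message.
        raise KeyError("Column 'Y' is missing; refusing to substitute a fallback")

    result = df[df["Y"] > 0]  # stand-in for 'condition Z'

    if result.empty:
        # No synthetic data to paper over the gap: log it and crash.
        logger.error("Condition Z matched zero rows; aborting rather than fabricating data")
        raise ValueError("Condition Z matched zero rows")

    return result
```

The design choice is the one the rules files encode: any unexpected state becomes a crash with a log line, so failures stay visible instead of being absorbed by a fallback nobody asked for.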

Me. This podcast is about a variety of things, mostly not AI, but Tyler Cowen talking to Nate Silver on Life’s Mixed Strategies was fun throughout, even when discussing NBA details I do not care much about. I get a mention:

COWEN: I need mentors to learn what’s new in AI. I can follow it myself, but I need a lot of help.

SILVER: Maybe mentor is not quite . . . For AI stuff readings, is it Mowshowitz, right?

COWEN: Yes.

SILVER: He is a mentor for following AI developments because he’s very levelheaded about it and very comprehensive. He’ll write a novel every week, basically, on AI.

[laughter]

COWEN: But he thinks it’s going to kill us all. It’s funny you would call him levelheaded. He might think he’s correct, but —

So, a few responses here, mostly to Tyler Cowen:

  1. Thank you!

  2. So you agree I’m comprehensive, then?

  3. Yes, I do think that, and this should worry you. Notice the person being comprehensive and level-headed also repeating that AI is likely to kill us all, and take the reasons and explanations involved both seriously and literally.

  4. If instead your response is to say ‘he thinks it’s going to kill us all so he must not be level-headed’ then you are writing your conclusion first and working backward.

Nate Silver explains that his doubts are about the ability of AI to accelerate from AGI to ASI, or from AGI with words to ability to manipulate the physical world.

For more on Nate Silver’s current thinking about AI you can see this blog post on whether The River is winning:

Nate Silver: My personal view, as a near-daily user of large language models like ChatGPT, is that AI progress has been just a hair slower than people in the River might have expected when I finished the book. But it’s well within the middle of the range — perhaps more like the 40th percentile. I consider this to be a reasonably well-informed view — I track AI progress more than I write about it in the newsletter. At the Manifest conference, for instance, some of the authors of the AI 2027 project, which envisioned a rapid takeoff for AI (very possibly with tragic consequences for us humans) had pushed back their timelines by a year or two.

What’s clearer is that, for better or worse, we’ve thrown out the steering wheel and are accelerating ahead — talk of a pause in AI development has all but disappeared. And I’m not sure even people in either The Village or The River fully appreciate the consequences.

I consider Sam Altman’s notion of a “gentle singularity” to be naive, for instance. I’m not as convinced as some other River types that an intelligence explosion is inevitable. (This deserves a longer essay or two.) But as On the Edge reports, profound technological shocks are nearly always accompanied by profound political and cultural transformation. So if we do get a singularity, nothing about it is going to be gentle.

A year after the book came out, perhaps what I feel most of all — I’m sure many of you agree — is that there aren’t a lot of adults in the room.

Certainly the ‘gentle singularity’ concept is naive if you take it seriously. Which coming from Altman you probably shouldn’t, as chances are (and I am hopeful that) he is lying.

Doubting that the intelligence explosion will happen at all? That’s reasonable. Thinking it would happen and be ‘gentle’? Absurd. We might survive and we might not, and we can disagree on our chances. It sure as hell wouldn’t be gentle.

Pliny warns us about em-dash abuse.

This week in takes that are 100% to age poorly:

Janan Ganesh: So, be doubtful when someone likens AI to the industrial revolution in importance. It will do well to match even the telephone and the incandescent lightbulb. (Incomes really surged as 1900 approached.)

At this point I can’t help but laugh, but seriously, what the hell is going on in the UK?

Andy Masley: What is happening in the UK? What is in the water? A wifi router uses as much power as a single LED bulb!

If you were thinking the UK was going to be a winner in this whole AI thing? Not with this attitude they won’t be.

If we never fund anything dumb, we’re not funding enough things.

Gergely Orosz: I cannot help but feel we’re hitting peak AI hype, when investors are willingly being taken for a ride:

A mattress company raising funding to use AI to “fix sleep”

A startup to add AI inside jewelry

Two examples that both sound ridiculous but raised funding. Not my money…

I mean congrats to founders convincing investors to part with money to solve problems that either don’t exist, or in ways that make no sense.

Peak hype is usually when otherwise un-fundable ideas (that make no business sense) still get funded, thanks to investors having FOMO (and money)

I don’t see any problem with these ideas? Jewelry with built in features seems cool? Using AI to ‘fix sleep’ doesn’t seem obviously dumb either? But also of course in any boom there will be some stupid things funded. Enjoy it.

The Mamluks as an almost too perfect Yudkowsky-style alignment failure: you set up a whole supersystem so that your warriors will stay loyal while finding ways to upgrade their capabilities, and they manage to coordinate and take power anyway. Fun stuff. This was actually the best case scenario, as under their rule the Mongols were fought back and by all reports Egypt flourished, so long as you don’t mind a bunch of immigration. There was a multipolar balance among the Mamluks after the takeover, the rule about not being able to create hereditary power survived the transition, they were humans so they aged and died, and they could not replace the production of the population themselves. If only we could count on those conditions this time around.

Oh look, it’s the alignment plan!

Jessica Livingston (via Paul Graham): I’m not going to panic now. I’ll see how things go and then panic first thing tomorrow.


AI #130: Talking Past The Sale Read More »

sony-makes-the-“difficult-decision”-to-raise-playstation-5-prices-in-the-us

Sony makes the “difficult decision” to raise PlayStation 5 prices in the US

Sony will join Microsoft and Nintendo in raising US prices across its entire game console lineup, the company announced today. Pricing for all current versions of the PlayStation 5 console will increase by $50 starting tomorrow.

The price of the PS5 Digital Edition will increase from $450 to $500; the standard PS5 will increase from $500 to $550; and the PS5 Pro will increase from $700 to $750. If you’ve been on the fence about buying any of these, retailers like Target and Best Buy are still using the old prices as of this writing—for other console price hikes, retailers have sometimes bumped the prices up before the date announced by the manufacturer.

“Similar to many global businesses, we continue to navigate a challenging economic environment,” wrote Sony Global Marketing VP Isabelle Tomatis. “As a result, we’ve made the difficult decision to increase the recommended retail price for PlayStation 5 consoles in the U.S. starting on August 21.”

Sony says it’s not increasing prices for games or accessories and that this round of price increases only affects consoles sold in the US.

Sony was the last of the big three console makers to raise prices this year. Microsoft raised the prices for the Xbox Series S and X consoles in March. And Nintendo has gone through two rounds of price increases—one for Switch and Switch 2 accessories in April and another for more accessories and Switch 1 consoles earlier this month.

Sony makes the “difficult decision” to raise PlayStation 5 prices in the US Read More »

fallout-s2-teaser-brings-us-to-new-vegas

Fallout S2 teaser brings us to New Vegas

Prime Video has dropped an extended teaser for the much-anticipated second season of Fallout, widely considered to be among the best TV adaptations of a gaming franchise. In our 2024 year-end roundup, Ars senior editor Samuel Axon wrote that the first season gave us “a specific cocktail of tongue-in-cheek humor, sci-fi campiness, strong themes, great characters, and visceral violence [that] came together into a fantastic show.” The second season looks like it will bring us more of the same, along with a major new character drawn from the Fallout: New Vegas game. We even got a glimpse of a Deathclaw.

(Minor spoilers for S1 below.)

For the uninitiated, Fallout is set two centuries after nuclear warfare between the US and China destroyed civilization in 2077—an alternate history version of 2077, in which post-World War II nuclear technology ushered in a retrofuturistic society. Some lucky survivors took refuge in various underground vaults; others were left to scavenge a meager existence on the highly radioactive surface.

In S1, we met Lucy MacLean (Ella Purnell), a young woman whose vault is raided by surface dwellers. The raiders kill many vault residents and kidnap her father, Hank (Kyle MacLachlan), so the sheltered Lucy sets out on a quest to find him. Life on the surface is pretty brutal, but Lucy learns fast. Along the way, she finds an ally (and love interest) in Maximus (Aaron Moten), a squire masquerading as a knight of the Brotherhood of Steel. And she runs afoul of a gunslinger and bounty hunter known as the Ghoul (Walton Goggins), a former Hollywood actor named Cooper Howard who survived the original nuclear blast, but radiation exposure turned him into, well, a ghoul.

Fallout S2 teaser brings us to New Vegas Read More »

spacex-says-states-should-dump-fiber-plans,-give-all-grant-money-to-starlink

SpaceX says states should dump fiber plans, give all grant money to Starlink

Starlink operator SpaceX is continuing its fight against state plans to expand fiber broadband availability. After saying the Trump administration should deny a Virginia proposal, SpaceX is taking the same approach in a fight against Louisiana.

SpaceX made its view known to the Louisiana Office of Broadband Development and Connectivity in a filing, which was reported yesterday by PCMag. SpaceX complained that Louisiana proposed awarding 91.5 percent of funds to fiber Internet service providers instead of to the Starlink satellite system. SpaceX alleged that Louisiana was influenced by “a legion of fiber lobbyists and other hangers-on seeking to personally benefit from massive taxpayer spending.”

The Trump administration rewrote rules for the $42 billion Broadband Equity, Access, and Deployment (BEAD) grant program in a way that benefits Starlink. Instead of prioritizing fiber networks that offer better service and are more future-proof, the Trump administration ordered states to revise their plans with a “tech-neutral approach” and lower the average cost of serving each location.

SpaceX’s letters to Virginia and Louisiana claim the states are violating the new rules with their funding proposals.

“The State of Louisiana’s Equity, Access, and Deployment (BEAD) program Final Proposal proposes to spend nearly $500 million dollars [sic] to provide connectivity to its unserved and underserved locations,” SpaceX wrote. “SpaceX applied to serve virtually all BEAD households for less than $100 million dollars. As such, Louisiana’s proposal includes over $400 million dollars in wasteful and unnecessary taxpayer spending.”

SpaceX unhappy with $7.75 million

Instead of selecting Starlink for all locations, Louisiana allocated the company $7.75 million to serve 10,327 locations. The plan would spend $499 million for 127,842 locations overall. The Louisiana Local Fiber Consortium, which includes two Louisiana providers that partnered with T-Mobile, was the biggest winner, with $378 million for 68,535 locations.

“Louisiana’s results demonstrate that it did not observe statutory requirements or program rules and did not conduct a competitive process,” SpaceX alleged. “A process in which Louisiana is required to award grants based on the lowest cost to the program, and awards 91.5% of funds to fiber projects at an average per-location cost of $4,449, while rejecting applications at $750 per location because the bid was based on Low-Earth Orbit (LEO) technology could not possibly be considered compliant, technology neutral or a ‘competition.'”

SpaceX says states should dump fiber plans, give all grant money to Starlink Read More »

nissan-announces-2026-leaf-pricing,-starting-at-$29,990

Nissan announces 2026 Leaf pricing, starting at $29,990

The Leaf SV+ adds bigger wheels and a better infotainment system, and it can be fitted with an optional battery heater for those in cold climates. This trim will cost $34,230, which will make it almost $2,000 cheaper than the model-year 2025 Leaf SV+ despite the fact that the MY26 car has a range of 288 miles (463 km) versus just 212 miles (342 km) for the outgoing model.

The top trim is the Platinum+, which has an identical powertrain to the S+ and SV+, but with much more standard equipment. This version will start at $38,990.

Finally, there will be an even cheaper Leaf than the S+, called the S. We’re unlikely to see the Leaf S here until next year at the earliest, and it will use a smaller 52 kWh battery pack than the S+/SV+/Platinum+. In June, we wrote that “the closer the S trim starts to $30,000, the better,” despite the problems that tariffs will cause for this made-in-Japan EV. Now, it looks likely that the entry-level Leaf will undercut that target by some margin.

Nissan announces 2026 Leaf pricing, starting at $29,990 Read More »