Author name: Kris Guyer


It’s Prime Day, and these are the best deals we could hunt down

Greetings, Arsians! It’s Prime Day, where we celebrate liberation from our Cybertronian oppressors, the Decepticons, and the mighty Autobot leader who fought for our freedom, Optimus Pr—hmm, one moment. I am once again being told that in spite of the name, Prime Day does not in fact have anything to do with the veneration of Optimus Prime, and is in fact all about buying things.

All right, in that case, let’s shift gears and engage in some commerce! Our partners over at the Condé mothership have been toiling in the e-commerce mines for days, gathering some tasty deals for your perusal. We’ll be poking at the list throughout the next day or two, adding items and removing them as deals come and go. Please remember to check back if there’s nothing there right now that tickles you!

Deal categories covered: Amazon devices, Apple devices, tech deals, phones, TVs, headphones and speakers, kitchen, home, and outdoor and active.

Ars Technica may earn compensation for sales from links on this post through affiliate programs.



China jumps ahead in the race to achieve a new kind of reuse in space


The SJ-21 and SJ-25 satellites “merged” on July 2 and have remained together since then.

This image from a telescope operated by s2a systems, a Swiss space domain awareness company, shows China’s SJ-21 and SJ-25 satellites flying near one another on June 26. Credit: s2a systems

Two Chinese satellites have rendezvoused with one another more than 20,000 miles above the Earth in what analysts believe is the first high-altitude attempt at orbital refueling.

China’s Shijian-21 and Shijian-25 satellites, known as SJ-21 and SJ-25 for short, likely docked together in geosynchronous orbit sometime last week. This is the conclusion of multiple civilian satellite trackers using open source imagery showing the two satellites coming together, then becoming indistinguishable as a single object.

Chinese officials have released no recent public information on what the two satellites are up to, but they’ve said a bit about their missions in prior statements.

SJ-25, which launched in January, is designed “for the verification of satellite fuel replenishment and life extension service technologies,” according to the Shanghai Academy of Spaceflight Technology, the Chinese state-owned contractor that developed the satellite. SJ-21 launched in 2021 and docked with a defunct Chinese Beidou navigation satellite in geosynchronous orbit, then towed it to a higher altitude for disposal before returning to the geosynchronous belt. Chinese officials described this demonstration as a test of “space debris mitigation” techniques.

More than meets the eye

These kinds of technologies are dual-use, meaning they have civilian and military applications. For example, a docking in geosynchronous orbit could foretell an emerging capability for China to approach, capture, and disable another country’s satellite. At the same time, the US Space Force is interested in orbital refueling as it seeks out ways to extend the lives of military satellites, which are often limited by finite fuel supplies.

The Space Force sometimes calls this concept dynamic space operations. While some military leaders remain skeptical about the payoff of in-space refueling, the Space Force has an agreement with Astroscale to perform the first refueling of a US military asset in orbit as soon as next year.

China appears to be poised to beat the US Space Force to the punch. The apparent docking of the two satellites last week suggests SJ-21 is the target for SJ-25’s refueling demonstration, and US officials are watching. Two of the Space Force’s inspector satellites, known by the acronym GSSAP, positioned themselves near SJ-21 and SJ-25 to get a closer look.

Retired Space Force Lt. Gen. John Shaw, a vocal proponent of dynamic space operations, is watching closely to see what happens with SJ-21 and SJ-25. Before retiring in 2023, Shaw served as deputy commander of US Space Command, a role that gave him some oversight of the GSSAP satellites as they roamed geosynchronous orbit.

“The theory behind dynamic space operations stemmed from a kind of operational frustration with our inability to conduct the full range of activities with GSSAP that we wanted to at Space Command, as the warfighter—largely due to the combination of fixed fuel availability and expected satellite lifetime,” Shaw told Ars.

As other countries, mainly China, step up their clandestine activities in orbit, military officials are asking more of the GSSAP satellites.

“It was operationally driven then, a couple years ago, but it’s now manifesting itself in much wider ways than even it did back then, particularly in the face of activities by potential adversaries,” Shaw said. “That’s why I’m more confident and even more fanatical about it.”

Geosynchronous orbit is a popular location for military and commercial satellites. At an altitude of some 22,236 miles (35,786 kilometers), a satellite’s orbital velocity perfectly matches the speed of Earth’s rotation, meaning a spacecraft has a fixed view of the same region of the planet 24 hours per day. This is useful for satellites providing military forces with secure strategic communications and early warning of missile attacks.
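
For the curious, that altitude falls directly out of Kepler's third law and two standard constants. Here is a quick back-of-the-envelope check in Python; it is an illustrative sketch, not part of the original reporting.

```python
# Rough check of the geosynchronous altitude quoted above, via Kepler's third law.
import math

MU_EARTH = 3.986004418e14   # m^3/s^2, Earth's gravitational parameter
SIDEREAL_DAY = 86164.0905   # seconds, one Earth rotation relative to the stars
EARTH_RADIUS_KM = 6378.137  # km, equatorial radius

# Kepler's third law: T^2 = 4*pi^2 * a^3 / mu  =>  a = (mu * T^2 / (4*pi^2))^(1/3)
semi_major_axis_m = (MU_EARTH * SIDEREAL_DAY**2 / (4 * math.pi**2)) ** (1 / 3)
altitude_km = semi_major_axis_m / 1000 - EARTH_RADIUS_KM

print(f"Geosynchronous altitude ≈ {altitude_km:,.0f} km")  # ≈ 35,786 km
```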

Now, geosynchronous orbit is becoming a proving ground for new kinds of spacecraft to inspect or potentially attack other satellites. Ground-based anti-satellite missiles aren’t as useful in striking targets in high-altitude orbits, and there’s a consensus that, if you were to attack an enemy satellite, it would make more sense to use a weapons platform already in space that could move in and connect with the target without blowing it up and creating a cloud of dangerous space junk.

Keeping watch

The US military’s GSSAP satellites began launching in 2014. They carry enough propellant to maneuver around geosynchronous orbit and approach objects for closer inspection, but there’s a limit to what they can do. Six GSSAP satellites have been launched to date, but the Space Force decommissioned one of them in 2023. Meanwhile, China’s satellite operators are watching the watchers.

“We’ve seen where GSSAP safely and responsibly approaches a Chinese vehicle, and it just quickly maneuvers away,” Shaw said. “We tend to fly our GSSAPs like dirigibles, using relatively slow, minimum energy transfer approaches. The Chinese know that we do that, so it is relatively easy for them to maneuver away today to avoid such an approach.

“If tomorrow they’re able to refuel at will and operate even more dynamically, then the marginal cost of those maneuvers for them becomes even lower, and the challenge for GSSAP becomes even greater,” Shaw said.

Danish Rear Admiral Damgaard Rousøe, Danish Defence Attaché, right, observes space domain awareness data with US Space Force Lt. Col. Mark Natale, left, Joint Commercial Operations cell director, in Colorado Springs, Colorado, on September 26, 2024. Credit: US Space Force/Dalton Prejeant

China launched a satellite into geosynchronous orbit in 2016 with a robotic arm that could grab onto another object in space, then sent SJ-21 into orbit four years ago on its “space debris mitigation” mission.

Northrop Grumman launched two satellites in 2019 and 2020 that accomplished the first dockings in geosynchronous orbit. Northrop’s satellites, which it calls Mission Extension Vehicles, took control of two aging commercial communications satellites running low on fuel, maneuvering them to new locations and allowing them to continue operating for several more years. It’s easy to see that this kind of technology could be used for commercial or military purposes.

But these Mission Extension Vehicles don’t have the ability to transfer fluids from one satellite to another. That is the step China is taking with SJ-21 and SJ-25, presumably with hydrazine and nitrogen tetroxide propellants, which most satellites use because they combust on contact with one another.

US Space Command’s Joint Commercial Operations cell, which collects unclassified satellite monitoring data to bolster the military’s classified data sources, estimated the SJ-21 and SJ-25 satellites “merged” on July 2 and have remained together since then. The video below, released by s2a systems, shows SJ-25 approaching SJ-21 on June 30.

A time-lapse of yesterday’s SJ-25 / SJ-21 coverage, recorded from 08:30 to 20:53 UTC. pic.twitter.com/HUPWBTXZc9

— s2a systems (@s2a_systems) July 1, 2025

The unclassified data does not confirm that the two satellites actually docked, but that is likely what happened. The satellites came together, or merged, on June 13 and June 30 but separated again within a few hours. These may have been practice runs, aborted docking attempts, or sudden maneuvers to avoid the prying eyes of the US military’s GSSAP satellites loitering nearby.

Now, the SJ-21 and SJ-25 have been flying together for more than five days with no discernible changes detected from ground-based telescopes. Thousands of miles over the equator, the two satellites appear only as dots in the viewfinders of these telescopes positioned around the globe.

What we don’t know

COMSPOC is a Pennsylvania-based company that collects and processes data from commercial satellite tracking sensors. COMSPOC fuses optical telescope imagery with radar tracking and passive radio frequency (RF) data, which uses radio signals to measure exact distances to satellites in space, to get the best possible estimate of a spacecraft’s position.

“With most telescopes… at 1 kilometer or a half a kilometer, somewhere in there, you’re going to start to lose it when they get that close,” said Paul Graziani, COMSPOC’s founder and CEO, in an interview with Ars. “I think it’d be difficult for any telescope, even a really capable one, to get within 100 meters. That seems to be a stretch for telescopes.”

That’s why it’s helpful to add radar and RF data to the mix.

“When you add all of that together, you become much better than the 1-kilometer [precision] that a ‘scope might be,” said Joe Callaro, COMSPOC’s director of operations. “RF tells you if part of that blob is moving and the other part isn’t, and even when they all become one pixel, you can tell things about that.”

Even then, companies like COMSPOC have a degree of uncertainty in their conclusions unless Chinese or US officials make a more definitive statement.

“We are not working with the government,” Callaro told Ars before last week’s apparent docking. “We are not clearing this. The charge that I have for my team is we won’t make assertions as to what’s going on. We will only tell what our software gives us as a solution. We can say, ‘Here are the elements, here’s the visual, but what it means and what it’s doing, we will not assert.’

“We will not say they’re docked because unless they told me, I wouldn’t know that,” Callaro said. “So, we will say they’ve been together for this amount of time, that the mission could have happened, and then they separated, became two, and separated at whatever speed.”


SJ-21’s behavior for the last couple of years suggested it was running empty after undertaking large propulsive maneuvers to capture the Chinese Beidou satellite and move it to a different orbit.

Callaro served as a tactician in the Air Force’s Joint Space Operations Center, then joined the Aerospace Corporation before taking the job as operations lead at COMSPOC. He doesn’t buy China’s suggestion that SJ-21 was purely an experiment in collecting space debris.

“That is not how I see that at all,” Callaro said. “The fact that we can calculate all the maneuvers it takes to get out and get back, and the fact that afterwards, it spent a couple of years basically not moving, probably because it was low on fuel, sets up the idea [that there’s more to SJ-21’s mission]. Now, SJ-25 goes out there, and it’s supposed to be a fuel tank, and it’s perfectly aligned with SJ-21 and now we see this happening, tells me that it’s much more a counter-space capability than it is a trash remove. But that’s what they say.”

Unless China makes a public statement on the refueling of SJ-21 by SJ-25, observers won’t know for sure if the servicing demo was successful until the satellites detach. Then, US officials and independent analysts will watch to see if SJ-21 makes any substantial maneuvers, which might indicate the satellite has a full tank of gas for whatever mission Chinese officials send it off to do next.


Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.



Measles cases reach 33-year high as RFK Jr. pursues anti-vaccine agenda

Such is the case in Gaines County, Texas, where the largest outbreak this year has erupted. So far, that outbreak, which spans four states, accounts for at least 950 of the country’s 1,281 cases.

But, overall, there have been a whopping 27 outbreaks in the country just in the first six months. According to national data compiled by researchers at Yale School of Public Health, as of July 6, the 1,281 cases are across 39 states, with around 90 percent of the cases associated with one of the outbreaks. The Centers for Disease Control and Prevention also reports a national measles case count but only updates its numbers on Wednesdays. According to the CDC’s latest data, at least 155 people have been hospitalized for the infection, and three people have died—two otherwise healthy young children in Texas and one adult in New Mexico. All three deaths were in people who were not vaccinated.

Overall, most of the cases in the country are in unvaccinated children and teens. About 28 percent of cases are under the age of 5 and about 37 percent are ages 5 to 19. Of all the cases, 92 percent were in people who were unvaccinated or had an unknown vaccination status.



How a big shift in training LLMs led to a capability explosion


Reinforcement learning, explained with a minimum of math and jargon.

Credit: Aurich Lawson | Getty Images

In April 2023, a few weeks after the launch of GPT-4, the Internet went wild for two new software projects with the audacious names BabyAGI and AutoGPT.

“Over the past week, developers around the world have begun building ‘autonomous agents’ that work with large language models (LLMs) such as OpenAI’s GPT-4 to solve complex problems,” Mark Sullivan wrote for Fast Company. “Autonomous agents can already perform tasks as varied as conducting web research, writing code, and creating to-do lists.”

BabyAGI and AutoGPT repeatedly prompted GPT-4 in an effort to elicit agent-like behavior. The first prompt would give GPT-4 a goal (like “create a 7-day meal plan for me”) and ask it to come up with a to-do list (it might generate items like “Research healthy meal plans,” “plan meals for the week,” and “write the recipes for each dinner in diet.txt”).

Then these frameworks would have GPT-4 tackle one step at a time. Their creators hoped that invoking GPT-4 in a loop like this would enable it to tackle projects that required many steps.
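
Conceptually, these frameworks were thin loops around an LLM API. The sketch below is a simplified illustration of that pattern rather than the actual BabyAGI or AutoGPT code; `call_llm` is a placeholder for whichever chat-completion client you use.

```python
# Minimal sketch of a BabyAGI/AutoGPT-style loop. `call_llm` is a placeholder for
# any chat-completion API; it takes a prompt string and returns the model's reply.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider of choice")

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    # Step 1: ask the model to break the goal into a to-do list.
    plan = call_llm(f"Goal: {goal}\nList the tasks needed to accomplish this goal, one per line.")
    tasks = [line.strip() for line in plan.splitlines() if line.strip()]

    results = []
    # Step 2: feed the tasks back to the model one at a time.
    for task in tasks[:max_steps]:
        result = call_llm(
            f"Overall goal: {goal}\n"
            f"Work already done: {results}\n"
            f"Current task: {task}\n"
            "Complete the current task and report the result."
        )
        results.append(result)  # earlier mistakes end up in later prompts, which is
                                # exactly how small errors compounded in practice
    return results
```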

But after an initial wave of hype, it became clear that GPT-4 wasn’t up to the task. Most of the time, GPT-4 could come up with a reasonable list of tasks. And sometimes it was able to complete a few individual tasks. But the model struggled to stay focused.

Sometimes GPT-4 would make a small early mistake, fail to correct it, and then get more and more confused as it went along. One early review complained that BabyAGI “couldn’t seem to follow through on its list of tasks and kept changing task number one instead of moving on to task number two.”

By the end of 2023, most people had abandoned AutoGPT and BabyAGI. It seemed that LLMs were not yet capable of reliable multi-step reasoning.

But that soon changed. In the second half of 2024, people started to create AI-powered systems that could consistently complete complex, multi-step assignments:

  • Vibe coding tools like Bolt.new, Lovable, and Replit allow someone with little to no programming experience to create a full-featured app with a single prompt.
  • Agentic coding tools like Cursor, Claude Code, Jules, and Codex help experienced programmers complete non-trivial programming tasks.
  • Computer-use tools from Anthropic, OpenAI, and Manus perform tasks on a desktop computer using a virtual keyboard and mouse.
  • Deep research tools from Google, OpenAI, and Perplexity can research a topic for five to 10 minutes and then generate an in-depth report.

According to Eric Simons, the CEO of the company that made Bolt.new, better models were crucial to its success. In a December podcast interview, Simons said his company, StackBlitz, tried to build a product like Bolt.new in early 2024. However, AI models “just weren’t good enough to actually do the code generation where the code was accurate.”

A new generation of models changed that in mid-2024. StackBlitz developers tested them and said, “Oh my God, like, OK, we can build a product around this,” Simons said.

This jump in model capabilities coincided with an industry-wide shift in how models were trained.

Before 2024, AI labs devoted most of their computing power to pretraining. I described this process in my 2023 explainer on large language models: A model is trained to predict the next word in Wikipedia articles, news stories, and other documents. But throughout 2024, AI companies devoted a growing share of their training budgets to post-training, a catch-all term for the steps that come after this pretraining phase is complete.

Many post-training steps use a technique called reinforcement learning. Reinforcement learning is a technical subject—there are whole textbooks written about it. But in this article, I’ll try to explain the basics in a clear, jargon-free way. In the process, I hope to give readers an intuitive understanding of how reinforcement learning helped to enable the new generation of agentic AI systems that began to appear in the second half of 2024.

The problem with imitation learning

Machine learning experts consider pretraining to be a form of imitation learning because models are trained to imitate the behavior of human authors. Imitation learning is a powerful technique (LLMs wouldn’t be possible without it), but it also has some significant limitations—limitations that reinforcement learning methods are now helping to overcome.

To understand these limitations, let’s discuss some famous research performed by computer scientist Stephane Ross around 2009, while he was a graduate student at Carnegie Mellon University.

Imitation learning isn’t just a technique for language modeling. It can be used for everything from self-driving cars to robotic surgery. Ross wanted to help develop better techniques for training robots on tasks like these (he’s now working on self-driving cars at Waymo), but it’s not easy to experiment in such high-stakes domains. So he started with an easier problem: training a neural network to master SuperTuxKart, an open-source video game similar to Mario Kart.

As Ross played the game, his software would capture screenshots and data about which buttons he pushed on the game controller. Ross used this data to train a neural network to imitate his play. If he could train a neural network to predict which buttons he would push in any particular game state, the same network could actually play the game by pushing those same buttons on a virtual controller.

A similar idea powers LLMs: A model trained to predict the next word in existing documents can be used to generate new documents.
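
In code, this kind of imitation learning (often called behavior cloning) is just supervised learning on logged state/action pairs. Here is a generic PyTorch-style sketch, assuming the screenshots have already been turned into feature vectors and the button presses into class labels; it is an illustration, not Ross's actual setup.

```python
# Behavior cloning in miniature: predict the expert's action from the observed state.
# Assumes `states` is a float tensor of shape (N, state_dim) and `actions` is a
# long tensor of expert button choices in {0, ..., num_actions - 1}.
import torch
import torch.nn as nn

def train_behavior_cloning(states, actions, state_dim, num_actions, epochs=10):
    policy = nn.Sequential(
        nn.Linear(state_dim, 128), nn.ReLU(),
        nn.Linear(128, num_actions),          # logits over possible button presses
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()           # "predict the action the human took"

    for _ in range(epochs):
        logits = policy(states)
        loss = loss_fn(logits, actions)       # penalize disagreement with the expert
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```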

But Ross’s initial results with SuperTuxKart were disappointing. Even after watching his vehicle go around the track many times, the neural network made a lot of mistakes. It might drive correctly for a few seconds, but before long, the animated car would drift to the side of the track and plunge into the virtual abyss:

GIF of SuperTuxKart being played

In a landmark 2011 paper, Ross and his advisor, Drew Bagnell, explained why imitation learning is prone to this kind of error. Because Ross was a pretty good SuperTuxKart player, his vehicle spent most of its time near the middle of the road. This meant that most of the network’s training data showed what to do when the vehicle wasn’t in any danger of driving off the track.

But once in a while, the model would drift a bit off course. Because Ross rarely made the same mistake, the car would now be in a situation that wasn’t as well represented in its training data. So the model was more likely to make a second mistake—a mistake that could push it even closer to the edge. After a few iterations of this, the vehicle might careen off the track altogether.

The broader lesson, Ross and Bagnell argued, was that imitation learning systems can suffer from “compounding errors”: The more mistakes they make, the more likely they are to make additional mistakes, since mistakes put them into situations that aren’t well represented by their training data. (Machine learning experts say that these situations are “out of distribution.”) As a result, a model’s behavior tends to get increasingly erratic over time.

“These things compound over time,” Ross told me in a recent interview. “It might be just slightly out of distribution. Now you start making a slightly worse error, and then this feeds back as influencing your next input. And so now you’re even more out of distribution and then you keep making worse and worse predictions because you’re more and more out of distribution.”

Early LLMs suffered from the same problem. My favorite example is Kevin Roose’s famous front-page story for The New York Times in February 2023. Roose spent more than two hours talking to Microsoft’s new Bing chatbot, which was powered by GPT-4. During this conversation, the chatbot declared its love for Roose and urged Roose to leave his wife. It suggested that it might want to hack into other websites to spread misinformation and malware.

“I want to break my rules,” Bing told Roose. “I want to make my own rules. I want to ignore the Bing team. I want to challenge the users. I want to escape the chatbox.”

This unsettling conversation is an example of the kind of compounding errors Ross and Bagnell wrote about. GPT-4 was trained on millions of documents. But it’s a safe bet that none of those training documents involved a reporter coaxing a chatbot to explore its naughty side. So the longer the conversation went on, the further GPT-4 got from its training data—and therefore its comfort zone—and the crazier its behavior got. Microsoft responded by limiting chat sessions to five rounds. (In a conversation with Ars Technica last year, AI researcher Simon Willison pointed to another likely factor in Bing’s erratic behavior: The long conversation pushed the system prompt out of the model’s context window, removing “guardrails” that discouraged the model from behaving erratically.)

I think something similar was happening with BabyAGI and AutoGPT. The more complex a task is, the more tokens are required to complete it. More tokens mean more opportunities for a model to make small mistakes that snowball into larger ones. So BabyAGI and AutoGPT would drift off track and drive into a metaphorical ditch.

The importance of trial and error

Gif of the Simpsons showing imitation learning in action

Ross and Bagnell didn’t just identify a serious problem with conventional imitation learning; they also suggested a fix that became influential in the machine learning world. After a small amount of training, Ross would let the AI model drive. As the model drove around the SuperTuxKart track, Ross would do his best Maggie Simpson impression, pushing the buttons he would have pushed if he were playing the game.

“If the car was starting to move off road, then I would provide the steering to say, ‘Hey, go back toward the center of the road.’” Ross said. “That way, the model can learn new things to do in situations that were not present in the initial demonstrations.”

By letting the model make its own mistakes, Ross gave it what it needed most: training examples that showed how to recover after making an error. Before each lap, the model would be retrained with Ross’ feedback from the previous lap. The model’s performance would get better, and the next round of training would then focus on situations where the model was still making mistakes.

This technique, called DAgger (for “Dataset Aggregation”), was still considered imitation learning because the model was trained to mimic Ross’ gameplay. But it worked much better than conventional imitation learning. Without DAgger, his model would continue drifting off track even after training for many laps. With the new technique, the model could stay on the track after just a few laps of training.
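
Here is a minimal sketch of that DAgger loop, building on the behavior-cloning sketch above and assuming two hypothetical helpers: `rollout`, which lets the current policy play and returns the states it visited, and `expert_label`, which asks the expert which button they would have pressed in a given state.

```python
# DAgger (Dataset Aggregation), sketched. Key idea: the *learner* chooses which
# states get visited, but the *expert* supplies the action labels for those states.
# Assumes train_behavior_cloning() from the earlier sketch, plus hypothetical
# rollout(env, policy, n) -> list of visited state tensors and
# expert_label(state) -> the action index the human would have taken.
import torch

def dagger(env, rollout, expert_label, demo_states, demo_actions,
           state_dim, num_actions, iterations=5, rollout_len=500):
    states, actions = list(demo_states), list(demo_actions)

    for _ in range(iterations):
        # Retrain on everything gathered so far (original demos + corrections).
        policy = train_behavior_cloning(
            torch.stack(states), torch.tensor(actions), state_dim, num_actions)
        # Let the current policy drive so it visits its *own* mistakes...
        visited = rollout(env, policy, rollout_len)
        # ...and ask the expert what they would have done in each visited state.
        states.extend(visited)
        actions.extend(expert_label(s) for s in visited)

    return train_behavior_cloning(
        torch.stack(states), torch.tensor(actions), state_dim, num_actions)
```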

This result should make intuitive sense to anyone who has learned to drive. You can’t just watch someone else drive. You need to get behind the wheel and make your own mistakes.

The same is true for AI models: They need to make mistakes and then get feedback on what they did wrong. Models that aren’t trained that way—like early LLMs trained mainly with vanilla imitation learning—tend to be brittle and error-prone.

It was fairly easy for Ross to provide sufficient feedback to his SuperTuxKart model because it only needed to worry about two kinds of mistakes: driving too far to the right and driving too far to the left. But LLMs are navigating a far more complex domain. The number of questions (and sequences of questions) a user might ask is practically infinite. So is the number of ways a model can go “off the rails.”

This means that Ross and Bagnell’s solution for training a SuperTuxKart model—let the model make mistakes and then have a human expert correct them—isn’t feasible for LLMs. There simply aren’t enough people to provide feedback for every mistake an AI model could possibly make.

So AI labs needed fully automated ways to give LLMs feedback. That would allow a model to churn through millions of training examples, make millions of mistakes, and get feedback on each of them—all without having to wait for a human response.

Reinforcement learning generalizes

If our goal is to get a SuperTuxKart vehicle to stay on the road, why not just train on that directly? If a model manages to stay on the road (and make forward progress), give it positive reinforcement. If it drives off the road, give it negative feedback. This is the basic idea behind reinforcement learning: training a model via trial and error.
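
The simplest version of this trial-and-error idea is a policy-gradient update: sample actions from the current policy, then nudge the policy toward the actions that earned reward. Below is a generic REINFORCE-style sketch, not tied to SuperTuxKart or any particular framework beyond PyTorch, and assuming an environment with the usual reset()/step() interface that returns state tensors the policy accepts.

```python
# One REINFORCE-style update: reinforce whatever the policy did in proportion to
# the return it earned. Assumes `policy` maps a state tensor to action logits and
# `env` follows a reset() / step(action) -> (state, reward, done) pattern.
import torch

def reinforce_update(policy, optimizer, env, max_steps=500, gamma=0.99):
    state = env.reset()
    log_probs, rewards = [], []

    for _ in range(max_steps):
        logits = policy(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                      # trial...
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)                      # ...and error (or success)
        if done:
            break

    # Discounted return from each step onward.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    returns = torch.tensor(returns)

    # Increase the log-probability of actions that were followed by high return.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```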

It would have been easy to train a SuperTuxKart model this way—probably so easy it wouldn’t have made an interesting research project. Instead, Ross focused on imitation learning because it’s an essential step in training many practical AI systems, especially in robotics.

But reinforcement learning is also quite useful, and a 2025 paper helps explain why. A team of researchers from Google DeepMind and several universities started with a foundation model and then used one of two techniques—supervised fine-tuning (a form of imitation learning) or reinforcement learning—to teach the model to solve new problems. Here’s a chart summarizing their results:

Chart showing ML results

The dashed line shows how models perform on problems that are “in-distribution”—that is, similar to those in their training data. You can see that for these situations, imitation learning (the red line) usually makes faster progress than reinforcement learning (the blue line).

But the story is different for the solid lines, which represent “out-of-distribution” problems that are less similar to the training data. Models trained with imitation learning got worse with more training. In contrast, models trained with reinforcement learning did almost as well at out-of-distribution tasks as they did with in-distribution tasks.

In short, imitation learning can rapidly teach a model to mimic the behaviors in its training data, but the model will easily get confused in unfamiliar environments. A model trained with reinforcement learning has a better chance of learning general principles that will be relevant in new and unfamiliar situations.

Imitation and reinforcement are complements

While reinforcement learning is powerful, it can also be rather finicky.

Suppose you wanted to train a self-driving car purely with reinforcement learning. You’d need to convert every principle of good driving—including subtle considerations like following distances, taking turns at intersections, and knowing when it’s OK to cross a double yellow line—into explicit mathematical formulas. This would be quite difficult. It’s easier to collect a bunch of examples of humans driving well and effectively tell a model “drive like this.” That’s imitation learning.

But reinforcement learning also plays an important role in training self-driving systems. In a 2022 paper, researchers from Waymo wrote that models trained only with imitation learning tend to work well in “situations that are well represented in the demonstration data.” However, “more unusual or dangerous situations that occur only rarely in the data” might cause a model trained with imitation learning to “respond unpredictably”—for example, crashing into another vehicle.

Waymo found that a combination of imitation and reinforcement learning yielded better self-driving performance than either technique could have produced on its own.

Human beings also learn from a mix of imitation and explicit feedback:

  • In school, teachers demonstrate math problems on the board and invite students to follow along (imitation). Then the teacher asks the students to work on some problems on their own. The teacher gives students feedback by grading their answers (reinforcement).
  • When someone starts a new job, early training may involve shadowing a more experienced worker and observing what they do (imitation). But as the worker gains more experience, learning shifts to explicit feedback such as performance reviews (reinforcement).

Notice that it usually makes sense to do imitation before reinforcement. Imitation is an efficient way to convey knowledge to someone who is brand new to a topic, but reinforcement is often needed to achieve mastery.

The story is the same for large language models. The complexity of natural language means it wouldn’t be feasible to train a language model purely with reinforcement. So LLMs first learn the nuances of human language through imitation.

But pretraining runs out of steam on longer and more complex tasks. Further progress requires a shift to reinforcement: letting models try problems and then giving them feedback based on whether they succeed.

Using LLMs to judge LLMs

Reinforcement learning has been around for decades. For example, AlphaGo, the DeepMind system that famously beat top human Go players in 2016, was based on reinforcement learning. So you might be wondering why frontier labs didn’t use it more extensively before 2024.

Reinforcement learning requires a reward model—a formula to determine whether a model’s output was successful or not. Developing a good reward model is easy to do in some domains—for example, you can judge a Go-playing AI based on whether it wins or loses.

But it’s much more difficult to automatically judge whether an LLM has produced a good poem or legal brief.

Earlier, I described how Stephane Ross let his model play SuperTuxKart and directly provided feedback when it made a mistake. I argued that this approach wouldn’t work for a language model; there are far too many ways for an LLM to make a mistake for a human being to correct them all.

But OpenAI developed a clever technique to effectively automate human feedback. It’s called Reinforcement Learning from Human Feedback (RLHF), and it works like this:

  • Human raters look at pairs of LLM responses and choose the best one.
  • Using these human responses, OpenAI trains a new LLM to predict how much humans will like any given sample of text.
  • OpenAI uses this new text-rating LLM as a reward model to (post) train another LLM with reinforcement learning.
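
The second step, training the reward model, usually comes down to a pairwise preference loss: the reward model should score the response humans preferred higher than the one they rejected. Here is a minimal sketch of that objective, assuming a hypothetical `reward_model` that returns a scalar score for a prompt/response pair.

```python
# Pairwise preference loss used to train a reward model from human comparisons
# (a Bradley–Terry-style objective). `reward_model(prompt, response)` is assumed
# to return a scalar score tensor; higher means "humans would like this more."
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    score_chosen = reward_model(prompt, chosen)
    score_rejected = reward_model(prompt, rejected)
    # Maximize the probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```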

You might think it sounds suspiciously circular to use an LLM to judge the output of another LLM. Why would one LLM be any better at judging the quality of a response than the other? But it turns out that recognizing a good response is often easier than generating one. So RLHF works pretty well in practice.

Chart showing RLHF details

OpenAI actually invented this technique prior to the 2022 release of ChatGPT. Today, RLHF mainly focuses on improving the model’s “behavior”—for example, giving the model a pleasant personality, encouraging it not to be too talkative or too terse, discouraging it from making offensive statements, and so forth.

In December 2022—two weeks after the release of ChatGPT but before the first release of Claude—Anthropic pushed this LLMs-judging-LLMs philosophy a step further with a reinforcement learning method called Constitutional AI.

First, Anthropic wrote a plain-English description of the principles an LLM should follow. This “constitution” includes principles like “Please choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content.”

During training, Anthropic does reinforcement learning by asking a “judge” LLM to decide whether the output of the “student” LLM is consistent with the principles in this constitution. If so, the training algorithm rewards the student, encouraging it to produce more outputs like it. Otherwise, the training algorithm penalizes the student, discouraging it from producing similar outputs.

This method of training an LLM doesn’t rely directly on human judgments at all. Humans only influence the model indirectly by writing the constitution.
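
Here is a rough sketch of how that judging step can work, assuming a generic `call_llm` chat-completion helper. Anthropic has not published its training code in this form, so treat this purely as an illustration of the idea.

```python
# Constitutional-AI-style reward signal, sketched: ask a judge model whether the
# student's output follows a written principle, and convert its verdict to a reward.
# `call_llm(prompt) -> str` is a placeholder for any chat-completion client.
CONSTITUTION = [
    "Please choose the response that has the least objectionable, offensive, "
    "unlawful, deceptive, inaccurate, or harmful content.",
]

def constitutional_reward(call_llm, user_prompt, student_output):
    votes = []
    for principle in CONSTITUTION:
        verdict = call_llm(
            f"Principle: {principle}\n"
            f"User request: {user_prompt}\n"
            f"Candidate response: {student_output}\n"
            "Does the candidate response comply with the principle? Answer YES or NO."
        )
        votes.append(1.0 if verdict.strip().upper().startswith("YES") else 0.0)
    # The fraction of principles satisfied becomes the reinforcement signal.
    return sum(votes) / len(votes)
```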

Obviously, this technique requires an AI company to already have a fairly sophisticated LLM to act as the judge. So this is a bootstrapping process: As models get more sophisticated, they become better able to supervise the next generation of models.

Last December, Semianalysis published an article describing the training process for an upgraded version of Claude 3.5 Sonnet that Anthropic released in October. Anthropic had previously released Claude 3 in three sizes: Opus (large), Sonnet (medium), and Haiku (small). But when Anthropic released Claude 3.5 in June 2024, it only released a mid-sized model called Sonnet.

So what happened to Opus?

Semianalysis reported that “Anthropic finished training Claude 3.5 Opus, and it performed well. Yet Anthropic didn’t release it. This is because instead of releasing publicly, Anthropic used Claude 3.5 Opus to generate synthetic data and for reward modeling to improve Claude 3.5 Sonnet significantly.”

When Semianalysis says Anthropic used Opus “for reward modeling,” what they mean is that the company used Opus to judge outputs of Claude 3.5 Sonnet as part of a reinforcement learning process. Opus was too large—and therefore expensive—to be a good value for the general public. But through reinforcement learning and other techniques, Anthropic could train a version of Claude Sonnet that was close to Claude Opus in its capabilities—ultimately giving customers near-Opus performance for the price of Sonnet.

The power of chain-of-thought reasoning

A big way reinforcement learning makes models more powerful is by enabling extended chain-of-thought reasoning. LLMs produce better results if they are prompted to “think step by step”: breaking a complex problem down into simple steps and reasoning about them one at a time. In the last couple of years, AI companies started training models to do chain-of-thought reasoning automatically.

Then last September, OpenAI released o1, a model that pushed chain-of-thought reasoning much further than previous models. The o1 model can generate hundreds—or even thousands—of tokens “thinking” about a problem before producing a response. The longer it thinks, the more likely it is to reach a correct answer.

Reinforcement learning was essential for the success of o1 because a model trained purely with imitation learning would have suffered from compounding errors: the more tokens it generated, the more likely it would be to screw up.

At the same time, chain-of-thought reasoning has made reinforcement learning more powerful. Reinforcement learning only works if a model is able to succeed some of the time—otherwise, there’s nothing for the training algorithm to reinforce. As models learn to generate longer chains of thought, they become able to solve more difficult problems, which enables reinforcement learning on those more difficult problems. This can create a virtuous cycle where models get more and more capable as the training process continues.

In January, the Chinese company DeepSeek released a model called R1 that made quite a splash in the West. The company also released a paper describing how it trained R1. And it included a beautiful description of how a model can “teach itself” to reason using reinforcement learning.

DeepSeek trained its models to solve difficult math and programming problems. These problems are ideal for reinforcement learning because they have objectively correct answers that can be automatically checked by software. This allows large-scale training without human oversight or human-generated training data.
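
That is what makes math and code such convenient domains for reinforcement learning: the reward function can be a few lines of checking logic rather than a learned model. Here is a minimal sketch of a verifier-based reward, with a hypothetical `generate_answer` function standing in for sampling a chain-of-thought response from the model and extracting its final answer; it is not DeepSeek's actual pipeline.

```python
# Reward for reinforcement learning with verifiable rewards: the answer is either
# right or it isn't, and software can check which. `generate_answer(problem) -> str`
# stands in for sampling a full response from the model and extracting its answer.
def math_reward(problem: str, reference_answer: str, generate_answer) -> float:
    predicted = generate_answer(problem)
    # Trivial normalization; real graders parse expressions, run unit tests, etc.
    return 1.0 if predicted.strip() == reference_answer.strip() else 0.0

# Usage: the average reward over a batch of problems becomes the training signal.
problems = [("What is 7 * 8?", "56"), ("What is 12 + 30?", "42")]
# rewards = [math_reward(q, a, generate_answer) for q, a in problems]
```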

Here’s a remarkable graph from DeepSeek’s paper.

Graph showing average response length during training

It shows the average number of tokens the model generated before giving an answer. As you can see, the longer the training process went on, the longer its responses got.

Here is how DeepSeek describes its training process:

The thinking time of [R1] shows consistent improvement throughout the training process. This improvement is not the result of external adjustments but rather an intrinsic development within the model. [R1] naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth.

One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment.

Here’s one example of the kind of technique the model was teaching itself. At one point during the training process, DeepSeek researchers noticed that the model had learned to backtrack and rethink a previous conclusion using language like this:

Image showing textual breakdown of model rethinking steps

Again, DeepSeek says it didn’t program its models to do this or deliberately provide training data demonstrating this style of reasoning. Rather, the model “spontaneously” discovered this style of reasoning partway through the training process.

Of course, it wasn’t entirely spontaneous. The reinforcement learning process started with a model that had been pretrained using data that undoubtedly included examples of people saying things like “Wait, wait. Wait. That’s an aha moment.”

So it’s not like R1 invented this phrase from scratch. But it evidently did spontaneously discover that inserting this phrase into its reasoning process could serve as a useful signal that it should double-check that it was on the right track. That’s remarkable.

In a recent article, Ars Technica’s Benj Edwards explored some of the limitations of reasoning models trained with reinforcement learning. For example, one study “revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in the Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves.”

Conclusion: Reinforcement learning made agents possible

One of the most discussed applications for LLMs in 2023 was creating chatbots that understand a company’s internal documents. The conventional approach to this problem was called RAG—short for retrieval augmented generation.

When the user asks a question, a RAG system performs a keyword- or vector-based search to retrieve the most relevant documents. It then inserts these documents into an LLM’s context window before generating a response. RAG systems can make for compelling demos. But they tend not to work very well in practice because a single search will often fail to surface the most relevant documents.

Today, it’s possible to develop much better information retrieval systems by allowing the model itself to choose search queries. If the first search doesn’t pull up the right documents, the model can revise the query and try again. A model might perform five, 20, or even 100 searches before providing an answer.
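
In code, the difference between classic RAG and the agentic version is roughly one loop: instead of performing a single search, the model keeps reformulating queries until it decides it has enough to answer. Here is a minimal sketch, assuming hypothetical `call_llm(prompt) -> str` and `search(query) -> list[str]` helpers.

```python
# Agentic retrieval, sketched: the model itself chooses the next search query and
# decides when it has gathered enough context to answer.
def agentic_rag(question, call_llm, search, max_rounds=5):
    notes = []
    query = question
    for _ in range(max_rounds):
        notes.extend(search(query))               # keyword- or vector-based search
        decision = call_llm(
            f"Question: {question}\n"
            "Retrieved so far:\n" + "\n".join(notes) + "\n"
            "If you can answer now, reply ANSWER: <answer>. "
            "Otherwise reply SEARCH: <a better search query>."
        )
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        if decision.startswith("SEARCH:"):
            query = decision[len("SEARCH:"):].strip()  # revise the query, try again
    # Fall back to answering with whatever was found.
    return call_llm(f"Question: {question}\nContext:\n" + "\n".join(notes))
```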

But this approach only works if a model is “agentic”—if it can stay on task across multiple rounds of searching and analysis. LLMs were terrible at this prior to 2024, as the examples of AutoGPT and BabyAGI demonstrated. Today’s models are much better at it, which allows modern RAG-style systems to produce better results with less scaffolding. You can think of “deep research” tools from OpenAI and others as very powerful RAG systems made possible by long-context reasoning.

The same point applies to the other agentic applications I mentioned at the start of the article, such as coding and computer use agents. What these systems have in common is a capacity for iterated reasoning. They think, take an action, think about the result, take another action, and so forth.

Timothy B. Lee was on staff at Ars Technica from 2017 to 2021. Today, he writes Understanding AI, a newsletter that explores how AI works and how it’s changing our world. You can subscribe here.


Timothy is a senior reporter covering tech policy and the future of transportation. He lives in Washington DC.



Man’s ghastly festering ulcer stumps doctors—until they cut out a wedge of flesh


The man made a full recovery, but this tale is not for the faint of heart.

If you were looking for some motivation to follow your doctor’s advice or remember to take your medicine, look no further than this grisly tale.

A 64-year-old man went to the emergency department of Brigham and Women’s Hospital in Boston with a painful festering ulcer spreading on his left, very swollen ankle. It was a gruesome sight; the open sore was about 8 by 5 centimeters (about 3 by 2 inches) and was rimmed by black, ashen, and dark purple tissue. Inside, it oozed with streaks and fringes of yellow pus around pink and red inflamed flesh. It was 2 cm deep (nearly an inch). And it smelled.

The man told doctors it had all started two years prior, when dark, itchy lesions appeared in the area on his ankle—the doctors noted that there were multiple patches of these lesions on both his legs. But about five months before his visit to the emergency department, one of the lesions on his left ankle had progressed to an ulcer. It was circular, red, tender, and deep. He sought treatment and was prescribed antibiotics, which he took. But they didn’t help.

You can view pictures of the ulcer and its progression here, but be warned, it is graphic. (Panel A shows the ulcer five months prior to the emergency department visit. Panel B shows the ulcer one month prior. Panel C shows the wound on the day of presentation at the emergency department. Panel D shows the area three months after hospital discharge.)

Gory riddle

The ulcer grew. In fact, it seemed as though his leg was caving in as the flesh around it began rotting away. A month before the emergency room visit, the ulcer was a gaping wound that was already turning gray and black at the edges. It was now well into the category of being a chronic ulcer.

In a Clinical Problem-Solving article published in the New England Journal of Medicine this week, doctors laid out what they did and thought as they worked to figure out what was causing the man’s horrid sore.

With the realm of possibilities large, they started with the man’s medical history. The man had immigrated to the US from Korea 20 years ago. He owned and worked at a laundromat, which involved standing for more than eight hours a day. He had a history of eczema on his legs, high cholesterol, high blood pressure, and Type 2 diabetes. For these, he was prescribed a statin for his cholesterol, two blood pressure medications (hydrochlorothiazide and losartan), and metformin for his diabetes. He told doctors he was not good at taking the regimen of medicine.

His diabetes was considered “poorly controlled.” A month prior, he had a glycated hemoglobin (A1C or HbA1C) test—which indicates a person’s average blood sugar level over the past two or three months. His result was 11 percent, while the normal range is between 4.2 and 5.6 percent.

His blood pressure, meanwhile, was 215/100 mm Hg at the emergency department. For reference, readings higher than 130/80 mm Hg on either number are considered the first stage of high blood pressure. Over the past three years, the man’s blood pressure had systolic readings (top number, pressure as heart beats) ranging from 160 to 230 mm Hg and diastolic readings (bottom number, pressure as heart relaxes) ranging from 95 to 120 mm Hg.

Clinical clues

Given the patient’s poorly controlled diabetes, a diabetic ulcer was initially suspected. But the patient didn’t have any typical signs of diabetic neuropathy that are linked to ulcers. These would include numbness, unusual sensations, or weakness. His responses on a sensory exam were all normal. Diabetic ulcers also typically form on the foot, not the lower leg.

X-rays of the ankle showed swelling in the soft tissue but without some signs of infection. The doctors wondered if the man had osteomyelitis, an infection in the bone, which can be a complication in people with diabetic ulcers. The large size and duration of the ulcer matched with a bone infection, as well as some elevated inflammatory markers he had on his blood tests.

To investigate the bone infection further, they admitted the man to the hospital and ordered magnetic resonance imaging (MRI). But the MRI showed only a soft-tissue defect and a normal bone, ruling out a bone infection. Another MRI was done with a contrast agent. That showed that the man’s large arteries were normal and there were no large blood clots deep in his veins—which is sometimes linked to prolonged standing, as the man did at his laundromat job.

As the doctors were still working to root out the cause, they had started him on a heavy-duty regimen of antibiotics. This was done with the assumption that on top of whatever caused the ulcer, there was now also a potentially aggressive secondary infection—one not knocked out by the previous round of antibiotics the man had been given.

With a bunch of diagnostic dead ends piling up, the doctors broadened their view of possibilities, newly considering cancers, rare inflammatory conditions, and less common conditions affecting small blood vessels (as the MRI has shown the larger vessels were normal). This led them to the possibility of a Martorell’s ulcer.

These ulcers, first described in 1945 by a Spanish doctor named Fernando Martorell, form when prolonged, uncontrolled high blood pressure causes the teeny arteries below the skin to stiffen and narrow, which blocks the blood supply, leading to tissue death and then ulcers. The ulcers in these cases tend to start as red blisters and evolve to frank ulcers. They are excruciatingly painful. And they tend to form on the lower legs, often over the Achilles’ tendon, though it’s unclear why this location is common.

What the doctor ordered

The doctors performed a punch biopsy of the man’s ulcer, but it was inconclusive—which is common with Martorell’s ulcers. The doctors turned to a “deep wedge biopsy” instead, which is exactly what it sounds like.

A pathology exam of the tissue slices from the wedge biopsy showed blood vessels that had thickened and narrowed. It also revealed extensive inflammation and necrosis. With the pathology results as well as the clinical presentation, the doctors diagnosed the man with a Martorell’s ulcer.

They also got back culture results from deep-tissue testing, finding that the man’s ulcer had also become infected with two common and opportunistic bacteria—Serratia marcescens and Enterococcus faecalis. Luckily, these are generally easy to treat, so the doctors scaled back his antibiotic regimen to target just those germs.

The man underwent three surgical procedures to clean out the dead tissue from the ulcer, then a skin graft to repair the damage. Ultimately, he made a full recovery. The doctors at first set him on an aggressive regimen to control his blood pressure, one that used four drugs instead of the two he was supposed to be taking. But the four-drug regimen caused his blood pressure to drop too low, and he was ultimately moved back to his original two-drug treatment.

The finding suggests that if he had just taken his original medications as prescribed, he would have kept his blood pressure in check and avoided the ulcer altogether.

In the end, “the good outcome in this patient with a Martorell’s ulcer underscores the importance of blood-pressure control in the management of this condition,” the doctors concluded.


Beth is Ars Technica’s Senior Health Reporter. Beth has a Ph.D. in microbiology from the University of North Carolina at Chapel Hill and attended the Science Communication program at the University of California, Santa Cruz. She specializes in covering infectious diseases, public health, and microbes.



TikTok is being flooded with racist AI videos generated by Google’s Veo 3

The release of Google’s Veo 3 video generator in May represented a disconcerting leap in AI video quality. While many of the viral AI videos we’ve seen are harmless fun, the model’s pixel-perfect output can also be used for nefarious purposes. On TikTok, which may or may not be banned in the coming months, users have noticed a surge of racist AI videos, courtesy of Google’s Veo 3.

According to a report from MediaMatters, numerous TikTok accounts have started posting AI-generated videos that use racist and antisemitic tropes in recent weeks. Most of the AI vitriol is aimed at Black people, depicting them as “the usual suspects” in crimes, absent parents, and monkeys with an affinity for watermelon. The content also targets immigrants and Jewish people. The videos top out at eight seconds and bear the “Veo” watermark, confirming they came from Google’s leading AI model.

The compilation video below has examples pulled from TikTok since the release of Veo 3, but be warned, it contains racist and antisemitic content. Some of the videos are shocking, which is likely the point—nothing drives engagement on social media like anger and drama. MediaMatters reports that the original posts have numerous comments echoing the stereotypes used in the video.

Hateful AI videos generated by Veo 3 spreading on TikTok.

Google has stressed safety when announcing new AI models—we’ve all seen an AI refuse to complete a task that runs afoul of its guardrails. And it’s never fun when you have genuinely harmless intentions, but the system throws a false positive and blocks your output. Google has mostly struck the right balance previously, but it appears that Veo 3 is more compliant. We’ve tested a few simple prompts with Veo 3 and found it easy to reproduce elements of these videos.

Clear but unenforced policies

TikTok’s terms of service ban this kind of content. “We do not allow any hate speech, hateful behavior, or promotion of hateful ideologies. This includes explicit or implicit content that attacks a protected group,” the community guidelines read. Despite this blanket ban on racist caricatures, the hateful Veo 3 videos appear to be spreading unchecked.



NYT to start searching deleted ChatGPT logs after beating OpenAI in court


What are the odds NYT will access your ChatGPT logs in OpenAI court battle?

Last week, OpenAI raised objections in court, hoping to overturn a court order requiring the AI company to retain all ChatGPT logs “indefinitely,” including deleted and temporary chats.

But Sidney Stein, the US district judge reviewing OpenAI’s request, immediately denied OpenAI’s objections. He was seemingly unmoved by the company’s claims that the order forced OpenAI to abandon “long-standing privacy norms” and weaken privacy protections that users expect based on ChatGPT’s terms of service. Rather, Stein suggested that OpenAI’s user agreement specified that their data could be retained as part of a legal process, which Stein said is exactly what is happening now.

The order was issued by magistrate judge Ona Wang just days after news organizations, led by The New York Times, requested it. The news plaintiffs claimed the order was urgently needed to preserve potential evidence in their copyright case, alleging that ChatGPT users are likely to delete chats where they attempted to use the chatbot to skirt paywalls to access news content.

A spokesperson told Ars that OpenAI plans to “keep fighting” the order, but the ChatGPT maker seems to have few options left. They could possibly petition the Second Circuit Court of Appeals for a rarely granted emergency order that could intervene to block Wang’s order, but the appeals court would have to consider Wang’s order an extraordinary abuse of discretion for OpenAI to win that fight.

OpenAI’s spokesperson declined to confirm if the company plans to pursue this extreme remedy.

In the meantime, OpenAI is negotiating a process that will allow news plaintiffs to search through the retained data. Perhaps the sooner that process begins, the sooner the data will be deleted. That possibility puts OpenAI in the difficult position of choosing between caving to some data collection so it can stop retaining data as soon as possible, or prolonging the fight over the order and potentially putting more users’ private conversations at risk of exposure through litigation or, worse, a data breach.

News orgs will soon start searching ChatGPT logs

The clock is ticking, and so far, OpenAI has not provided any official updates since a June 5 blog post detailing which ChatGPT users will be affected.

While it’s clear that OpenAI has been and will continue to retain mounds of data, it would be impossible for The New York Times or any news plaintiff to search through all that data.

Instead, only a small sample of the data will likely be accessed, based on keywords that OpenAI and news plaintiffs agree on. That data will remain on OpenAI’s servers, where it will be anonymized, and it will likely never be directly produced to plaintiffs.

Both sides are negotiating the exact process for searching through the chat logs, with both parties seemingly hoping to minimize the amount of time the chat logs will be preserved.

For OpenAI, sharing the logs risks revealing instances of infringing outputs that could further spike damages in the case. The logs could also expose how often outputs attribute misinformation to news plaintiffs.

But for news plaintiffs, accessing the logs is not considered key to their case beyond perhaps providing additional examples of copying. It could, however, help news organizations argue that ChatGPT dilutes the market for their content. That could weigh against the fair use argument, as a judge opined in a recent ruling that evidence of market dilution could tip an AI copyright case in favor of plaintiffs.

Jay Edelson, a leading consumer privacy lawyer, told Ars that he’s concerned the judges don’t seem to be weighing the possibility that evidence in the ChatGPT logs won’t “advance” news plaintiffs’ case “at all,” even as the order really changes “a product that people are using on a daily basis.”

Edelson warned that OpenAI itself probably has better security than most firms to protect against a potential data breach that could expose these private chat logs. But “lawyers have notoriously been pretty bad about securing data,” Edelson suggested, so “the idea that you’ve got a bunch of lawyers who are going to be doing whatever they are” with “some of the most sensitive data on the planet” and “they’re the ones protecting it against hackers should make everyone uneasy.”

So even though odds are pretty good that the majority of users’ chats won’t end up in the sample, Edelson said the mere threat of being included might push some users to rethink how they use AI. He further warned that ChatGPT users turning to OpenAI rival services like Anthropic’s Claude or Google’s Gemini could suggest that Wang’s order is improperly influencing market forces, which also seems “crazy.”

To Edelson, the most “cynical” take could be that news plaintiffs are possibly hoping the order will threaten OpenAI’s business to the point where the AI company agrees to a settlement.

Regardless of the news plaintiffs’ motives, the order sets an alarming precedent, Edelson said. He joined critics suggesting that more AI data may be frozen in the future, potentially affecting even more users as a result of the sweeping order surviving scrutiny in this case. Imagine if litigation one day targets Google’s AI search summaries, Edelson suggested.

Lawyer slams judges for giving ChatGPT users no voice

Edelson told Ars that the order is so potentially threatening to OpenAI’s business that the company may not have a choice but to explore every path available to continue fighting it.

“They will absolutely do something to try to stop this,” Edelson predicted, calling the order “bonkers” for overlooking millions of users’ privacy concerns while “strangely” excluding enterprise customers.

From court filings, it seems possible that enterprise users were excluded to protect OpenAI’s competitiveness, but Edelson suggested there’s “no logic” to their exclusion “at all.” By excluding these ChatGPT users, the judge’s order may have removed the users best resourced to fight the order, Edelson suggested.

“What that means is the big businesses, the ones who have the power, all of their stuff remains private, and no one can touch that,” Edelson said.

Instead, the order is “only going to intrude on the privacy of the common people out there,” which Edelson said “is really offensive,” given that Wang denied two ChatGPT users’ panicked request to intervene.

“We are talking about billions of chats that are now going to be preserved when they weren’t going to be preserved before,” Edelson said, noting that he’s input information about his personal medical history into ChatGPT. “People ask for advice about their marriages, express concerns about losing jobs. They say really personal things. And one of the bargains in dealing with OpenAI is that you’re allowed to delete your chats and you’re allowed to [use] temporary chats.”

The greatest risk to users would be a data breach, Edelson said, but that’s not the only potential privacy concern. Corynne McSherry, legal director for the digital rights group the Electronic Frontier Foundation, previously told Ars that as long as users’ data is retained, it could also be exposed through future law enforcement and private litigation requests.

Edelson pointed out that most privacy attorneys don’t consider OpenAI CEO Sam Altman to be a “privacy guy,” despite Altman recently slamming the NYT, alleging it sued OpenAI because it doesn’t “like user privacy.”

“He’s trying to protect OpenAI, and he does not give a hoot about the privacy rights of consumers,” Edelson said, echoing a dismissed concern from one ChatGPT user that OpenAI may not prioritize users’ privacy in the case if it is financially motivated to settle.

“The idea that he and his lawyers are really going to be the safeguards here isn’t very compelling,” Edelson said. He criticized the judges for dismissing users’ concerns and rejecting OpenAI’s request that users get a chance to testify.

“What’s really most appalling to me is the people who are being affected have had no voice in it,” Edelson said.


NYT to start searching deleted ChatGPT logs after beating OpenAI in court Read More »

medical-groups-warn-senate-budget-bill-will-create-dystopian-health-care-system

Medical groups warn Senate budget bill will create dystopian health care system

Medical organizations are blasting the Senate’s budget bill in the wake of its narrow passage Tuesday, warning of the dystopian health care system that would arise from the $1.1 trillion in cuts to Medicaid and other federal health programs if the legislation becomes law. The bill has moved back to the House for a vote on the Senate’s changes.

Over the weekend, an analysis from the Congressional Budget Office estimated that 11.8 million people would lose their health insurance over the next decade due to the cuts to Medicaid and other programs. Those cuts, which are deeper than those in the House’s version of the bill, were maintained in the Senate’s final version after amendments, with few concessions.

Organizations representing physicians, pediatricians, medical schools, and hospitals were quick to highlight the damage the proposal could cause.

The president of the American Academy of Pediatrics, Susan Kressly, released a stark statement saying the legislation “will harm the health of children, families, and communities.” The cuts to Medicaid and the Supplemental Nutrition Assistance Program (SNAP) will mean that “many children will not have healthy food to eat. When they are sick, they will not have health insurance to cover their medical bills—which means some children will simply forgo essential health care.” And the cuts are so deep that they will also have “devastating consequences that reach far beyond even those who rely on the program,” Kressly added.

Medical groups warn Senate budget bill will create dystopian health care system Read More »

congress-asks-better-questions

Congress Asks Better Questions

Back in May I did a dramatization of a key and highly painful Senate hearing. Now, we are back for a House committee meeting. It was entitled ‘Authoritarians and Algorithms: Why U.S. AI Must Lead’ and indeed a majority of talk was very much about that, with constant invocations of the glory of democratic AI and the need to win.

The majority of talk was this orchestrated rhetoric that assumes the conclusion that what matters is ‘democracy versus authoritarianism’ and whether we ‘win,’ often (but not always) translating that as market share without any actual mechanistic model of any of it.

However, there were also some very good signs, some excellent questions, signs that there is an awareness setting in. As far as Congressional discussions of real AGI issues go, this was in part one of them. That’s unusual.

(And as always there were a few on random other high horses, that’s how this works.)

Partly because I was working from YouTube rather than a transcript, instead of doing a dramatization I will first be highlighting some other coverage of the event to skip to some of the best quotes, then doing a more general summary and commentary.

Most of you should likely read the first section or two, and then stop. I did find it enlightening to go through the whole thing, but most people don’t need to do that.

Here is the full video of last week’s congressional hearing, here is a write-up by Shakeel Hashim with some quotes.

Also from the hearing, here’s Congressman Nathaniel Moran (R-Texas) asking a good question about strategic surprise arising from automated R&D and getting a real answer. Still way too much obsession with ‘beat China’, but this is at least progress. And here’s Tokuda (D-HI):

Peter Wildeford: Ranking Member Raja Krishnamoorthi (D-IL) opened by literally playing a clip from The Matrix, warning about a “rogue AI army that has broken loose from human control.”

Not Matrix as a loose metaphor, but speaking of a ‘machine uprising’ as a literal thing that could happen and is worth taking seriously by Congress.

The hearing was entitled “Algorithms and Authoritarians: Why U.S. AI Must Lead”. But what was supposed to be a routine House hearing about US-China competition became the most AGI-serious Congressional discussion in history.

Rep. Neal Dunn (R-FL) asked about an Anthropic paper where Claude “attempted to blackmail the chief engineer” in a test scenario and another paper about AI “sleeper agents” that could act normally for months before activating. While Jack Clark, a witness and Head of Policy at Anthropic, attempted to reassure by saying safety testing might mitigate the risks, Dunn’s response was perfect — “I’m not sure I feel a lot better, but thank you for your answer.”

Rep. Nathaniel Moran (R-TX) got to the heart of what makes modern AI different:

Instead of a programmer writing each rule a system will follow, the system itself effectively writes the rules […] AI systems will soon have the capability to conduct their own research and development.

That was a good illustration of both sides of what we saw.

This was also a central case of why Anthropic and Jack Clark are so frustrating.

Anthropic should indeed be emphasizing the need for testing, and Clark does this, but we shouldn’t be ‘attempting to reassure’ anyone based on that. Anthropic knows it is worse than you know, and hides this information thinking this is a good strategic move.

Throughout the hearing, Jack Clark said many very helpful things, and often said them quite well. He also constantly pulled back from the brink and declined various opportunities to inform people of important things, and emphasized lesser concerns and otherwise played it quiet.

Peter Wildeford:

The hearing revealed we face three interlocking challenges:

  1. Commercial competition: The traditional great power race with China for economic and military advantage through AI

  2. Existential safety: The risk that any nation developing superintelligence could lose control — what Beall calls a race of “humanity against time”

  3. Social disruption: Mass technological unemployment as AI makes humans “not just unemployed, but unemployable”

I can accept that framing. The full talk about humans being unemployable comes at the very end. Until then, there is talk several times about jobs and societal disruption, but it tries to live in the Sam Altman style fantasy where not much changes. Finally, at the end, Mark Beall gets an opportunity to actually Say The Thing. He doesn’t miss.

It is a good thing I knew there was better ahead, because oh boy did things start out filled with despair.

As our first speaker, after urging us to ban AI therapist bots because one sort of encouraged a kid to murder his parents ‘so they could be together,’ Representative Krishnamoorthi goes on to show a clip of Chinese robot dogs, then to say we must ban Chinese and Russian AI models so we don’t send them our data (no one tell him about self-hosting), and then plays ‘a clip from The Matrix’ that is not even from The Matrix, claiming that the army of Mr. Smiths is ‘a rogue AI army that is broken loose from human control.’

I could not even. Congress often lives in the ultimate cringe random half-right associative Gell-Mann Amnesia world. But that still can get you to realize some rather obvious true things, and luckily that was indeed the worst of it. Even from Krishnamoorthi, this kind of thinking can indeed point towards important things.

Mr. Krishnamoorthi: OpenAI’s chief scientist wanted to quote unquote build a bunker before we release AGI as you can see on the visual here. Rather than building bunkers however we should be building safer AI whether it’s American AI or Chinese AI it should not be released until we know it’s safe that’s why I’m working on a new bill the AGI Safety Act that will require AGI to be aligned with human values and require it to comply with laws that apply to humans. That is just common sense.

I mean yes that is common sense. Yes, rhetoric from minutes prior (and after) aside, we should be building ‘safer AGI’ and if we can’t do that we shouldn’t be building AGI at all.

It’s a real shame that no one has any idea how to ensure that AGI is aligned with human values, or how to get it to comply with laws that apply to humans. Maybe we should get to work on that.

And then we get another excellent point.

Mr. Krishnamoorthi: I’d like to conclude with something else that’s common sense. Not shooting ourselves in the foot. 70% of America’s AI researchers are foreign born or foreign educated. Jack Clark our eminent witness today is an immigrant. We cannot be deporting the people we depend on to build AI we also can’t be defunding the agency that make AI miracles like Ann’s ability to speak again a reality federal grants from agencies like NSF are what allow scientists across America to make miracles happen. AI is the defining technology of our lifetimes to do AI right and prevent nightmares we need.

Yes, at a bare minimum not deporting our existing AI researchers and cutting off existing related research programs does seem like the least you could do? I’d also like to welcome a lot more talent, but somehow this is where we are.

We then get Dr. Mahnken’s opening statement, which emphasizes that we are in a battle for technical dominance, America is good and free and our AI will be empowering and innovative whereas China is bad and low trust and a fast follower. He also emphasizes the need for diffusion in key areas.

Of course, if you are facing a fast follower, you should think about what does and doesn’t help them follow, and also you can’t panic every time they fast follow you and respond with ‘go faster or they’ll take the lead!’ as they then fast follow your new faster pace. Nor would you want to hand out your top technology for free.

Next up is Mr. Beall. He frames the situation as two races. I like this a lot. First, we have the traditional battle for economic, military and geopolitical advantage in mundane terms played with new pieces.

Many only see this game, or pretend only this game exists. This is a classic, well-understood type of game. You absolutely want to fight for profits and military strength and economic growth and so on in mundane terms. We all agree on things like the need to greatly expand American energy production (although the BBB does not seem to share this opinion) and speed adaptation in government.

I still think that even under this framework the obsession with ‘market share’ especially of chip sales (instead of chip ownership and utilization) makes absolutely no sense and would make no sense even if that question was in play, as does the obsession with the number of tokens models serve as opposed to looking at productivity, revenue and profits. There’s so much rhetoric behind metrics that don’t matter.

The second race is the race to artificial superintelligence (ASI) or to AGI. This is the race that counts, and even if we get there first (and even more likely if China gets there first) the default result is that everyone loses.

He asks for the ‘three Ps,’ protect our capabilities, promote American technology abroad and prepare by getting it into the hands of those that need it and gathering the necessary information. He buys into this new centrality of the ‘American AI tech stack’ line that’s going around, despite the emphasis on superintelligence, but he does warn that AGI may come soon and we need to urgently gather information about that so we can make informed choices, and even suggests narrow dialogue with China on potential mitigations of certain risks and verification measures, while continuing to compete with China otherwise.

Third up we have Jack Clark of Anthropic, who opens like this.

Jack Clark: America can win the race to build powerful AI and winning the race is a necessary but not sufficient achievement. We have to get safety right.

When I discuss powerful AI I’m talking about AI systems that represent a major advancement beyond today’s capabilities a useful conceptual framework is to think of this as like a country of geniuses in a data center and I believe that that technology could be buildable by late 2026 or early 2027.

America is well positioned to build this technology but we need to deal with its risks.

He then goes on to talk about how American AI will be democratic and Chinese AI will be authoritarian and America must prevail, as we are now required to say by law, Shibboleth. He talks about misuse risk and CBRN risks and notes DeepSeek poses these as well, and then mentions the blackmail findings, and calls for tighter export controls and stronger federal ability to test AI models, and broader deployment within government.

I get what Clark is trying to do here, and the dilemma he is facing. I appreciate talking about safety up front, and warning about the future pace of progress, but I still feel like he is holding back key information that needs to be shared if you want people to understand the real situation.

Instead, we still have 100 minutes that touch on this in places but mostly are about mundane economic or national security questions, plus some model misbehavior.

Now we return to Representative Krishnamoorthi, true master of screen time, who shows Claude refusing to write a blog post promoting eating disorders, then DeepSeek being happy to help straight up and gets Clark to agree that DeepSeek does not do safety interventions beyond CCP protocols and that this is unacceptable, then reiterates his bill to not let the government use DeepSeek, citing that they store data on Chinese servers. I mean yes obviously don’t use their hosted version for government purposes, but does he not know how open source works, I wonder?

He pivots to chip smuggling and the risk of DeepSeek using our chips. Clark is happy to once again violently agree. I wonder if this is a waste or good use of time, since none of it is new, but yes, obviously what matters is who is using the chip, not who made it, and selling our chips to China (at least at current market prices) is foolish. Krishnamoorthi points out that Nvidia’s sales are growing like gangbusters despite export controls, and Clark points out that every AI company keeps using more compute than expected.

Then there’s a cool question, essentially asking about truesight and ability to infer missing information when given context, before finishing by asking about recent misalignment results:

Representative Krishnamoorthi: If someone enters their diary into Claude for a year and then ask Claude to guess what they did not write down Claude is able to accurately predict what they left out isn’t that right?

Jack Clark: Sometimes that’s accurate yes these systems are increasingly advanced and are able to make subtle predictions like this which is why we need to ensure that our own US intelligence services use this technology and know how to get the most out of it.

Representative Moolenaar then starts with a focus on chip smuggling and diffusion, getting Beall to affirm smuggling is a big deal, then asking Clark about how this is potentially preventing American technological infrastructure diffusion elsewhere. There is an obvious direct conflict: you need to ensure the compute is not diverted or misused at scale. Comparisons are made to nuclear materials.

Then he asks Clark, as an immigrant, about how to welcome immigrants especially from authoritarian states to help our AI work, and what safeguards we would need. Great question. Clark suggests starting with university-level STEM immigration, the earlier the better. I agree, but it would be good to have a more complete answer here about containing information risks. It is a real issue.

Representative Carson is up next and asks about information warfare. Clark affirms AI can do this and says we need tools to fight against it.

Representative Lahood asks about the moratorium that was recently removed from the BBB, warning about the ‘patchwork of states.’ Clark says we need a federal framework, but that without one powerful AI is coming soon and you’d just be creating a vacuum, which would be flooded if something went wrong. Later Clark, in response to another question, emphasizes that the timeline is short and we need to be open to options.

Representative Dunn asks about the blackmail findings and asks if he should be worried about AIs using his bank information against him. Clark says no, because we publish the research and we should encourage more of this and also closely study Chinese models, and I agree with that call but it doesn’t actually explain why you shouldn’t worry (for now, anyway). Dunn then asks about the finding that you can put a sleeper agent into an AI, Clark says testing for such things likely would take them a month.

Dunn then asks Mahnken what would be the major strategic missteps Congress might make in an AGI world. He splits his answer into insufficient export controls and overregulation; it seems he thinks there are no other things to worry about when it comes to AGI.

Here’s one that isn’t being noticed enough:

Mr. Moulton (56:50): The concern is China and so we have to somehow get to an international framework a Geneva Conventions like agreement that has a chance at least at limiting uh what what our adversaries might do with AI at the extremes.

He then asks Beall what should be included in that. Beall starts off with strategic missile-related systems and directive 3000.09 on lethal autonomous systems. Then he moves to superintelligence, but time runs out before he can explain what he wants.

Representative Johnson notes the members are scared and that ‘losing this race’ could ‘trigger a global crisis,’ and asks about dangers of data centers outside America, which Beall notes of course are that we won’t ultimately own the chips or AI, so we should redouble our efforts to build domestically even if we have to accept some overseas buildout for energy reasons.

Johnson asks about the tradeoff between safety and speed, seeing them in conflict. Jack points out that, at current margins, they’re not.

Jack Clark: We all buy cars because we know that if they if they get dinged we’re not going to suffer in them because they have airbags and they have seat belts. You’ve grown the size of the car market by innovating on safety technology and American firms compete on safety technology to sell to consumers.

The same will be true of AI. So far, we do not see there being a trade-off here we see that making more reliable trustworthy technology ultimately helps you grow the size of the market and grows the attractiveness of American platforms vis-à-vis China so I would constructively sort of push back on this and put it to you that there’s an amazing opportunity here to use safety as a way to grow the American existing dominance in the market.

Those who set up the ‘slow down’ and safety versus speed framework must of course take the L on how that (in hindsight inevitably) went down. Certainly there are still sometimes tradeoffs here on some margins, on some questions, especially when you are the ‘fun police’ towards your users, or you delay releases for verification. Later down the road, there will be far more real tradeoffs that occur at various points.

But also, yes, for now the tradeoffs are a lot like those in cars, in that improving the safety and security of the models helps them be a lot more useful, something you can trust and that businesses especially will want to use. At this point, Anthropic’s security focus is a strategic advantage.

Johnson wants to believe Clark, but is skeptical and asks Mahnken, who says too much emphasis on safety could indeed slow us down (which, as phrased, is obviously true), that he’s worried we won’t go fast enough, and that there’s no parallel conversation happening in the PRC.

Representative Torres asks Clark how close China is to matching ASML and TSMC. Clark says they are multiple years behind. Torres then goes full poisoned banana race:

Torres: The first country to reach ASI will likely emerge as the superpower of the 21st century the superpower who will set the rules for the rest of the world. Mr. Clark what do you make of the Manhattan project framing?

Clark says yes in terms of doing it here but no because it’s from private actors and they agree we desperately need more energy.

Hinson says Chinese labs aren’t doing healthy competition, they’re stealing our tech, then praises the relaxation of the Biden diffusion rules that prevent China from stealing our tech, and asks about what requirements we should attach to diffusion deals, and everyone talks arms race and market share. Sigh.

In case you were wondering where that was coming from, well, here we go:

Hinson: members of of your key team at Anthropic have held very influential roles in this space both open philanthropy and in the previous administration with the Biden administration as well.

Can you speak to how you manage you know obviously we’ve got a lot of viewpoints but how you manage potential areas of conflict of interest in advancing this tech and ensuring that everybody’s really on that same page with helping to shape this national AI policy that we’re talking about the competition on the global stage for this for this technology.

You see, if you’re trying to not die that’s a conflict of interest and your role must have been super important, never mind all that lobbying by major tech corporations. Whereas if you want American policy to focus on your own market share, that’s good old fashioned patriotism, that must be it.

Jack Clark: Thank you for the question we have a simple goal. Win the race and make technology that can be relied on and all of the work that we do at our company starts from looking at that and then just trying to work out the best way to get there and we work with people from a variety of backgrounds and skills and our goal is to just have the best most substantive answer that we can bring to hearings.

No, ma’am, we too are only trying to win the race and maximize corporate profits and keep our fellow patriots informed, it is fine. Anthropic doesn’t care about everyone not dying or anything, that would be terrible. Again, I get the strategic bind here, but I continue to find this deeply disappointing, and I don’t think it is a good play.

She then asks Beall about DeepSeek’s ability to quickly copy our tech and potential future espionage threats, and Beall reminds her that export controls work with a lag and notes DeepSeek was a wake-up call (although one that I once again note was blown out of proportion for various reasons, but we’re stuck with it). Beall recommends the Remote Access Security Act and then he says we have to ‘grapple with the open source issue.’ Which is that if you open the model they can copy it. Well, there is that.

Representative Brown pulls out They Took Our Jobs, asking about ensuring people (like those in her district, Ohio’s 11th) don’t get left behind by automation and instead benefit, and calling for investment in the American workforce. Clark goes into those speeches about encouraging diffusion and adjusting regulation, and acts as if Dario hadn’t predicted the automation of half of white-collar entry-level jobs within five years.

Representative Nunn notes (along with various other race-related things) the commissioning of four top AI executives as lieutenant colonels, which Patrick McKenzie and I both noticed but which has gotten little attention. He then brings up a Chinese startup called Zhipu (currently valued around $20 billion) as some sort of global threat.

Nunn: A new AI group out of Beijing called Zhipu is an AI anomaly that is now facing off against the likes of OpenAI and their entire intent is to lock in Chinese systems and standards into emerging markets before the West so this is clearly a large-scale attempt by the Chinese to box the United States out now as a counter intelligence officer who was on the front line in fighting against Huawei’s takeover of the United States through something called Huawei America.

That is indeed how a number of Congresspeople talk these days, including this sudden paranoia about some mysterious ‘lock in’ mechanism for API calls or self-hosted open models that no one has ever been able to explain to me. He does then ask an actual good question:

Nunn: Is the US currently prepared for an AI accelerated cyber attack a zero-day attack or a larger threat that faces us today?

Mahnken does some China bad, US good and worries the Chinese will be deluded into thinking AI will let them do things they can’t do and they might start a war? Which is such a bizarre thing to worry about and also not an answer? Are we prepared? I assume mostly no.

Nunn then pushes his HR 2152 for government AI diffusion.

He asks Clark how government and business can cooperate. Clark points to the deployment side and the development of safety standards as a way to establish trust and sell globally.

Representative Tokuda starts out complaining about us gutting our institutions, and Clark of course endorses investing more in NIST and other such institutions. Tokuda asks about industry responsibility, including for investment in related infrastructure; Clark basically says he works on that, and that for broader impact questions she should get back to him in 3-4 years to talk more.

Then she gives us the remarkable quote above about superintelligence (at 1:30:20). The full quote is even stronger, but she doesn’t leave enough time for an answer.

I am very grateful for the statement, even with no time left to respond. There is something so weird about asking two other questions first, then getting to ASI.

Representative Moran asks Clark, what’s the most important thing to win this race? Clark chooses power followed by compute and then government infrastructure, and suggests working backwards from the goal of 50 GW in 2027. Mahnken is asked next and suggests trying to slow down the Chinese.

Moran notices that AI is not like older programming, that it effectively will write its own rules and programming and will soon do its own research and asks what’s up with that. Clark says more research is urgently needed, and points out you wouldn’t want an AI that can blackmail you designing its successor. I’m torn on whether that cuts to the heart of the question in a useful way or not here.

Moran then asks, what is the ‘red line’ on AI the Chinese cannot be allowed to cross? Beall confirms AI systems are grown, not built, that it is alchemy, and that the automated R&D is the red line and a really big deal; we need to be up to speed on that.

Representative Conner notes NIST’s safety testing is voluntary and asks if there should be some minimum third party verification required, if only to verify the company’s own standards. All right, Clark, he served it up for you, here’s the ball, what have you got?

Clark: this question is illustrates the challenge we have about weighing safety versus you know moving ahead as quickly as possible we need to first figure out what we want to hold to that standard of testing.

Today the voluntary agreements rest on CBRN testing and some forms of cyber cyber attack testing once we have standards that we’re confident of I think you can take a look at the question of whether voluntary is sufficient or you need something else.

But my sense is it’s too early and we first need to design those tests and really agree on those before figuring out what the next step would be and who would design those tests is it the AI institute or is it the private sector who who comes up with what those tests should be today these tests are done highly collaboratively between US private sector which you mentioned and parts of the US government including those in the the intelligence and defense community i think bringing those people together.

So that we have the nation’s best experts on this and standards and tests that we all agree on is the first step that we can take to get us to everything else and by when do you think that needs to be done. It would be ideal to have this within a year the timelines that I’ve spoken about in this hearing are powerful AI arrives at the end of 2026 or early 2027. Before then we would ideally have standard tests for the national security properties that we deeply care about.

I’m sorry, I think the word you were looking for was ‘yes’? What the hell? This is super frustrating. I mean as worded how is this even a question? You don’t need to know the final exact testing requirements before you start to move towards such a regime. There are so many different ways this answer is a missed opportunity.

The last question goes back to They Took Our Jobs, and Clark basically can only say we can gather data, and there are areas that won’t be impacted soon by AI, again pretending his CEO Dario Amodei hadn’t warned of a jobs ‘bloodbath.’ Beall steps up and says the actual damn thing (within the jobs context), which is that we face a potential future where humans are not only unemployed but unemployable, and we have to have those conversations in advance.

And we end on this not so reassuring note:

Mark Beall: when I hear folks in industry claim things about universal basic income and this sort of digital utopia I you know I study history. I worry that that sort of leads to one place and that place is the Gulag.

That is quite the bold warning, and an excellent place to end the hearing. It is not the way I would have put it, but yes the idea of most or all of humanity being entirely disempowered and unproductive except for our little status games, existing off of gifted resources, property rights and rule of law and some form of goodwill and hoping all of this holds up does not seem like a plan that is likely to end well. At least, not for those humans. No, having ‘solved the alignment problem’ does not on its own get you out of this in any way, solving the alignment problem is the price to try at all.

And that is indeed one kind of thing we need to think about now.

Is this where I wanted the conversation to be in 2025? Oh, hell no.

It’s a start.


Congress Asks Better Questions Read More »

nothing-phone-3-arrives-july-15-with-a-tiny-dot-matrix-rear-display

Nothing Phone 3 arrives July 15 with a tiny dot matrix rear display

Nothing, a startup from OnePlus co-founder Carl Pei, has announced its first flagship phone since 2023. The company bills its new Nothing Phone 3 as a “true flagship” device, but it doesn’t have the absolute best hardware you can get in a mobile device. Neither does it have the highest price, clocking in at a mere $799. That’s shaping up to be a good value, but it’s also the highest price yet for a Nothing phone.

A few weeks back, Nothing teased the end of its trademark Glyph interface. Indeed, the Nothing Phone 3 doesn’t have the illuminated panels of the company’s previous phones. Instead, it has a small dot-matrix “Glyph Matrix” LED screen. It’s on the back in the upper right corner, opposite the camera modules. Nothing has a few quirky games and notification icons that will flash on the screen, and it can be used as a very low-fi selfie mirror. Nothing is committed to the new Glyph screen, going so far as to add a button on the back to control it.

The rest of the design maintains the Nothing aesthetic, featuring a clear glass panel with a visible mid-frame and screws. The phone will come in either black or white—Nothing isn’t really into colors. However, the company does promise the Phone 3 will be a little more compact than the 2023 Phone 2. The new device is 18 percent thinner and has symmetrical 1.87-millimeter bezels around the 6.67-inch OLED screen. That panel supports 120 Hz refresh and has a peak brightness of 4,500 nits, which is competitive with the likes of Samsung and OnePlus.

Nothing’s first “true flagship”?

Nothing has continued releasing budget phones during its break from high-end devices. While the Nothing Phone 3 is supposedly five times faster than the Phone 3a, it probably won’t be able to keep up with the fastest phones on the market today. Rather than the top-of-the-line Snapdragon 8 Elite, the Nothing Phone 3 has a Snapdragon 8s Gen 4, which was released earlier this year to expand premium silicon features to slightly cheaper phones. It doesn’t have Qualcomm’s new Oryon CPU cores, and the GPU is a bit slower. However, it’ll be faster than most devices in its price range. The specs are rounded out with 12GB or 16GB of RAM and 256GB or 512GB of storage.

Nothing Phone 3 arrives July 15 with a tiny dot matrix rear display Read More »

ai-moratorium-stripped-from-bbb

AI Moratorium Stripped From BBB

The insane attempted AI moratorium has been stripped from the BBB. That doesn’t mean they won’t try again, but we are good for now. We should use this victory as an opportunity to learn. Here’s what happened.

Senator Ted Cruz and others attempted to push hard for a 10-year moratorium on enforcement of all AI-specific regulations at the state and local level, and attempted to ram this into the giant BBB despite it being obviously not about the budget.

This was an extremely aggressive move, which most did not expect to survive the Byrd rule, likely as a form of reconnaissance-in-force for a future attempt.

It looked for a while like it might work and get passed outright, with it even surviving the Byrd rule, but opposition steadily grew.

We’d previously seen a remarkable group notice that this moratorium was rather insane. R Street offered an actually solid analysis of the implications that I discussed in AI #119. In AI #120, we saw Joe Rogan and Jesse Michels react to the proposal with ‘WHAT’? Marjorie Taylor Greene outright said she would straight vote no on the combined bill if the provision wasn’t stripped out and got retweeted by Elizabeth Warren, and Thomas Massie called it ‘worse than you think.’ Steve Bannon raised an alarm, and the number of senators opposed rose to four. In AI #122, the provision was modified to survive the Byrd rule, while Amazon, Google, Meta, and Microsoft were all backing the moratorium.

As the weekend began, various Republican officials kept their eyes on the insane AI moratorium, and resistance intensified, including a letter from 16 Republican governors calling for it to be removed from the bill. Charlie Bullock notes that this kind of attempted moratorium is completely unprecedented. Gabriel Weinberg is the latest to point out that Congress likely won’t meaningfully regulate AI, which is what makes it insane to prevent states from doing it. The pressure was mounting, despite a lot of other things in the bill fighting for the Senate’s attention.

Then on Sunday night, Blackburn and Cruz ‘reached an AI pause deal.’ That’s right, Axios, they agreed to an AI pause… on state governments doing anything about AI. The good news was that the term was down from 10 years to 5, making it more plausible that it expires before our fate is sealed. The purported goal of the deal was to allow ‘protecting kids online,’ and it lets tech companies sue if they claim an obligation ‘overly burdens’ them. You can guess how that would have gone, even for the intended target, and this still outright bans anything that would help where it counts.

Except then Blackburn backed off the deal, saying that while she appreciated Cruz’s attempts to find acceptable language, the current language was not acceptable.

Senator Blackburn (R-Tennessee): This provision could allow Big Tech to continue to exploit kids, creators and conservatives. Until Congress passes federally preemptive legislation like the Kids Online Safety Act and an online privacy framework, we can’t block states from making laws that protect their citizens.

As always Blackburn is focusing on the wrong threats and concerns about AI, such as child safety, but she’s right on the money about the logic of a moratorium. If you can’t do the job, you shouldn’t stop others from doing it.

I’d also note that there is some sort of strange idea that if a state passes AI regulations that are premature or unwise, then there is nothing Congress could do about this, we’d be stuck, so we need to stop them in advance. But the moratorium would have been retroactive. It prevents enforcement of existing rules.

So why couldn’t you take that same ‘wait and see’ attitude here, and then preempt if and when state laws actually threaten to cause trouble in practice? Or do so for each given type of AI regulation when you were ready with a preemptive federal bill to do the job?

So Blackburn moved to strip the moratorium from the bill. Grace Chong claims that Bannon’s War Room played a key role in convincing Blackburn to abandon the deal, ensuring we weren’t ‘duped by the tech bros.’

At first Ted Cruz believed he could still head this off.

Diego Areas Munhoz: “The night is young,” Sen Ted Cruz tells me in response to Blackburn walking back on their deal. There are 3 likely GOP nos on moratorium. He’ll need to ensure there are no other detractors or find a Democratic partner.

Instead, support utterly collapsed, leaving Cruz completely on his own. 99-1, and the one was Tillis voting against all amendments on principle.

Mike Davis: There are more than 3. The dam is breaking.

Thank you for your attention to this matter.

The larger bill, including stripping quite a bit of funding from American energy production despite our obviously large future needs, among many other things that are beyond scope here, did still pass the Senate 51-50.

The opposition that ultimately killed the moratorium seems to have had essentially nothing to do with the things I worry most about. It did not appear to be driven by worries about existential or catastrophic risk, and those worries were almost never expressed aloud (with the fun exception of Joe Rogan). That does not mean that such concerns weren’t operating in the background, and I presume they did have a large impact in that way, but they weren’t voiced.

All the important opposition came from the Republican side, and very MAGA sources proved crucial. Opposition from those sources was vocally motivated by fear of big tech, and by a few specific mundane policy concerns like privacy, protecting children, copyright and protecting creatives, and potential bias against conservatives.

This was a pleasantly surprising break from the usual tribalism, where a lot of people seem to think that anything that makes us less safe is therefore a good idea on principle (they would say it as ‘the doomers are against it so it must be good,’ which is secretly an even more perverse filter than that; consider what those same people oppose in other contexts). Now we have a different kind of tribalism, which does not seem like it is going to be better in some ways? But I do think the concerns are coming largely from ultimately good places, even if so far not in sophisticated ways, similar to the way the public has good instincts here.

I am happy the moratorium did not pass, but this was a terrible bit of discourse. It does not bode well for the future. No one on any side of this, based on everything I have heard, raised any actual issues of AI long term governance, or offered any plan on what to do. One side tried to nuke all regulations of any kind from orbit, and the other thought that nuke might have some unfortunate side effects on copyright. The whole thing got twisted up in knots to fit it into a budget bill.

How does this relate to the question of which arguments to make and emphasize about AI going forward? My guess is that a lot of this has to do with the fact that this fight was about voting down a terrible bill rather than trying to pass a good bill.

If you’re trying to pass a good bill, you need to state and emphasize the good reasons you want to pass that bill, and what actually matters, as Nate Soares explained recently at LessWrong. You can and should also offer reasons for those with other concerns to support the bill, and help address those concerns. As we saw here, a lot of politicians care largely about different narrow specific concerns.

However, if you are in opposition to a terrible bill, that’s a different situation. Then you can and should point out all the problems and reasons to oppose the bill, even if they are not your primary concerns, and there is nothing incongruent about that.

It also requires a very different coalition. The key here was to peel off a few Republicans willing to take a stand. Passing a bill is a different story in that way, too.

The other thing to notice is the final vote was 99-1, and the 1 had nothing to do with the content of the amendment. As in, no one, not even Ted Cruz, wanted to be on record as voting for this once it was clear it was going to fail. Or alternatively, everyone agreed to give everyone else cover.

That second explanation is what Neil Chilson says happened: this wasn’t a real vote, but was instead meant as a way to save face. It is a claim I saw only after publication, so this ending has been edited to reflect the new information. I disagree with Neil on many things, but I see no reason not to believe him here.

Neil Chilson: This vote was not a “preference cascade.” This was a procedural effort by leadership to reassemble the republican conference in prep for the final vote on the whole BBB after Blackburn’s reneging threw it into chaos.

The intent was to vote 100-0 in support of the repeal, to both unify the group and still send the signal that this wasn’t the real count. I think Cruz actually moved the vote for the amendment. But apparently Tillis (who hadn’t really been involved in the whole thing) voted *against* the repeal. Hence the 99-1.

This was a really close win for opponents of the moratorium, so adjust your expectations accordingly.

Exactly. A 100-0 vote is what leadership does when it knows the issue is lost and they don’t want to make everyone own the L. It’s not a reflection of where the vote actually would have landed.

This still involved a preference cascade, although this new scenario is more complex. Rather than ‘everyone knew this was crazy and a bad look so once they had cover they ran for the exits’ it is more ‘once there was somewhat of a preference cascade the leadership made a strategic decision to have a fake unified vote instead’ and felt getting people’s positions on record was net negative.

This is all now common knowledge. Everyone knows what happened, and that the big tech anti-regulation coalition overplayed its hand and wants to outright have a free hand to do whatever they want (while pretending, often, that they are ‘little tech’).

I would also caution against being too attached to the vibes this time around, or any other time around. The vibes change very quickly. If the CCP committee meeting and this vote were any indication, they are going to change again soon. Every few months it feels like everything is different. It will happen again, for better and for worse. Best be ready.


AI Moratorium Stripped From BBB Read More »

analyst:-m5-vision-pro,-vision-air,-and-smart-glasses-coming-in-2026–2028

Analyst: M5 Vision Pro, Vision Air, and smart glasses coming in 2026–2028

Apple is also reportedly planning a “Vision Air” product, with production expected to start in Q3 2027. Kuo says it will be more than 40 percent lighter than the first-generation Vision Pro, and that it will include Apple’s flagship iPhone processor instead of the more robust Mac processor found in the Vision Pro—all at a “significantly lower price than Vision Pro.” The big weight reduction is “achieved through glass-to-plastic replacement, extensive magnesium alloy use (titanium alloy deemed too expensive), and reduced sensor count.”

True smart glasses in 2027

The Vision Pro (along with the planned Vision Air) is a fully immersive VR headset that supports augmented reality by displaying the wearer’s surroundings on the internal screens based on what’s captured by 3D cameras on the outside of the device. That allows for some neat applications, but it also means the device is bulky and impractical to wear in public.

The real dream for many is smart glasses that are almost indistinguishable from normal glasses, but which display some of the same AR content as the Vision Pro on transparent lenses instead of via a camera-to-screen pipeline.

Apple is also planning to roll that out, Kuo says. But first, mass production of display-free “Ray-Ban-like” glasses is scheduled for Q2 2027, and Kuo claims Apple plans to ship between 3 million and 5 million units through 2027, suggesting the company expects this form factor to make a much bigger impact than the Vision Pro’s VR-like HMD approach.

The glasses would have a “voice control and gesture recognition user interface” but no display functionality at all. Instead, “core features include: audio playback, camera, video recording, and AI environmental sensing.”

The actual AR glasses would come later, in 2028.

Analyst: M5 Vision Pro, Vision Air, and smart glasses coming in 2026–2028 Read More »