Author name: Mike M.

Google’s Pixel 10a arrives on March 5 for $499 with specs and design of yesteryear

It’s that time of year—a new budget Pixel phone is about to hit virtual shelves. The Pixel 10a will be available on March 5, and pre-orders go live today. The 9a will still be on sale for a while, but the 10a will be headlining Google’s store. However, you might not notice unless you keep up with the Pixel numbering scheme. This year’s A-series Pixel is virtually identical to last year’s, both inside and out.

Last year’s Pixel 9a was a notable departure from the older design language, but Google made few changes for 2026. We liked that the Pixel 9a emphasized battery capacity and moved to a flat camera bump, and this time, it’s really flat. Google says the camera now sits totally flush with the back panel. This is probably the only change you’ll be able to identify visually.

Specs at a glance: Google Pixel 9a vs. Pixel 10a
Phone | Pixel 9a | Pixel 10a
SoC | Google Tensor G4 | Google Tensor G4
Memory | 8GB | 8GB
Storage | 128GB, 256GB | 128GB, 256GB
Display | 1080×2424 6.3″ pOLED, 60–120 Hz, Gorilla Glass 3, 2,700 nits (peak) | 1080×2424 6.3″ pOLED, 60–120 Hz, Gorilla Glass 7i, 3,000 nits (peak)
Cameras | 48 MP primary, f/1.7, OIS; 13 MP ultrawide, f/2.2; 13 MP selfie, f/2.2 | 48 MP primary, f/1.7, OIS; 13 MP ultrawide, f/2.2; 13 MP selfie, f/2.2
Software | Android 15 (at launch), 7 years of OS updates | Android 16, 7 years of OS updates
Battery | 5,100 mAh, 23 W wired charging, 7.5 W wireless charging | 5,100 mAh, 30 W wired charging, 10 W wireless charging
Connectivity | Wi-Fi 6e, NFC, Bluetooth 5.3, sub-6 GHz 5G, USB-C 3.2 | Wi-Fi 6e, NFC, Bluetooth 6.0, sub-6 GHz 5G, USB-C 3.2
Measurements | 154.7×73.3×8.9 mm; 185 g | 153.9×73×9 mm; 183 g

Google also says the new Pixel will have a slightly upgraded screen. The resolution, size, and refresh rate are unchanged, but peak brightness has been bumped from 2,700 nits to 3,000 nits (the same as the base model Pixel 10). Plus, the cover glass has finally moved beyond Gorilla Glass 3 to Gorilla Glass 7i, which supposedly has improved scratch and drop protection.

Pixel 10a in Berry

Credit: Google


Google notes that more of the phone is constructed from recycled material, 100 percent for the aluminum frame and 81 percent for the plastic back. There’s also recycled gold, tungsten, cobalt, and copper inside, amounting to about 36 percent of the phone’s weight. The phone also continues to have a physical SIM slot, which was removed from the Pixel 10 series last year. The device’s USB-C 3.2 port can also charge slightly faster than the 9a (30 W versus 23 W), and wireless charging has gone from 7.5 W to 10 W. There are no Qi2 magnets inside, though.

Internally, the Pixel 10a is even more like its predecessor. Unlike past A-series phones, this one doesn’t have the latest Tensor chip—it’s sticking with the same Tensor G4 from the 9a. That’s a bummer, as the G5 was a bigger leap than most of Google’s chip upgrades. The company says it stuck with the G4 to “balance affordability and performance.”


X-rays reveal kingfisher feather structure in unprecedented detail

A spongy nanostructure

The Northwestern team started looking at kingfisher feathers in tian-tsui objects via postdoc Madeline Meier, who has a background in chemistry and nanostructures and was interested in combining that expertise with studies of cultural heritage. The first step was to identify the bird species whose feathers were used in Qing dynasty screens and panels, as well as the other materials those objects contained. Researchers carefully scraped away the topmost layers and imaged the feathers with scanning electron microscopy to get a better look at the underlying nanostructure. Hyperspectral imaging revealed how different areas of the screens absorbed and reflected light.

The team also made use of the center’s partnership with Chicago’s Field Museum, comparing the screen feathers with the museum’s vast collection of taxidermied bird species. The screens and panels contained feathers from common kingfishers and black-capped kingfishers, as well as mallard ducks (used to add green hues). Finally, X-ray fluorescence and Fourier-transform infrared spectroscopy enabled them to create a map of the various chemicals used in the gilding, pigments, glues, and other materials.

Most recently, the lab has partnered with Argonne National Laboratory and used synchrotron radiation to get an even better look at the nanostructure of kingfisher feathers. Synchrotron radiation differs from conventional X-rays in that it’s a thin beam of very high-intensity X-rays generated within a particle accelerator. Electrons are fired into a linear accelerator (linac), get a speed boost in a small synchrotron, and are injected into a storage ring, where they zoom along at near-light speed. A series of magnets bends and focuses the electrons, and in the process, they give off X-rays, which can then be focused down beam lines.

That makes it ideal for noninvasive imaging, since, in general, the shorter the wavelength used (and the higher the light’s energy), the finer the details one can image and/or analyze. It has become a popular technique for imaging fragile archaeological artifacts without damaging them—like Qing dynasty headdresses with inlays of kingfisher feathers. In this case, the imaging revealed that the feathers’ microscopic ridges have an underlying semi-ordered, porous, sponge-like structure that reflects and scatters light, thereby giving the feathers their gloriously brilliant hues.
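The wavelength-resolution link at work here is the standard diffraction limit (my gloss, not a formula from the study): the smallest feature you can resolve scales with the wavelength of the probe light.

$$ d \approx \frac{\lambda}{2\,\mathrm{NA}}, \qquad E = \frac{hc}{\lambda} $$

Here d is the smallest resolvable feature, λ is the wavelength, NA is the numerical aperture of the focusing optics, and E is the photon energy. Hard X-rays, with wavelengths around 0.1 nm, can in principle resolve structure thousands of times finer than visible light at roughly 400 to 700 nm.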

“Long admired in Chinese poetry and art, kingfisher feathers have amazing optical properties,” co-author Maria Kokkori said. “Our discoveries not only enhance our understanding of historical materials but also reshape how we think about artistic and scientific innovation, and the future of sustainable materials.”


Inside the DHS forum where ICE agents trash talk one another


Complaining about your job

Forum members have discussed their discomfort with mass deportation efforts.

Credit: Al Drago/Getty Images

Every day, people log into an online forum for current and former Homeland Security Investigations (HSI) officers to share their thoughts on the news of the day and complain about their colleagues in Immigration and Customs Enforcement (ICE).

“ERO is too busy dressing up as Black Ops Commandos with Tactical body armor, drop down thigh rigs, balaclavas, multiple M4 magazines, and Punisher patches, to do an Admin arrest of a non criminal, non-violent EWI that weighs 90 pounds and is 5 foot 2, inside a secure Federal building where everyone has been screened for weapons,” wrote one user in July 2025. (ERO stands for Enforcement and Removal Operations; along with HSI, it’s one of the two major divisions of ICE, and is responsible for detaining and deporting immigrants.)

The forum describes itself as a space for current and prospective HSI agents, “designed for the seasoned HSI Special Agent as well as applicants for entry level Special Agent positions.” HSI is the division within ICE whose agents are normally responsible for investigating crimes like drug smuggling, terrorism, and human trafficking.

In the forum, users discuss their discomfort with the US’s mass deportation efforts, debate the way federal agents have interacted with protesters and the public, and complain about the state of their working conditions. Members have also had heated discussions about the shooting of two protesters in Minneapolis, Renee Good and Alex Pretti, and the ways immigration enforcement has taken place around the US.

The forum is one of several related forums where people working in different parts of the Department of Homeland Security (DHS) share experiences and discuss specific details of their work. WIRED previously reported on a forum where current and former deportation officers from ICE and Customs and Border Protection (CBP) similarly complained about their jobs and discussed the way the agency was conducting immigration raids. The HSI forum appears to be linked, even including some of the same members.

People do not need to show proof of their employment to join these forums, and the platform does not appear to be heavily moderated. WIRED has not confirmed the individual identities of these posters, though posters share details that likely would only be known to those intimately familiar with the job. There are more than 2,000 members with posts going back to at least 2004.

DHS and ICE did not respond to requests for comment.

Following the killings of both Good and Pretti, the forum’s members were heavily divided. In a January 12 thread, five days after Good was shot by ICE agent Jonathan Ross, a poster who has been a part of the forum since 2016 wrote, “IMHO, the situation with ICE Operations have gotten to an unprecedented level of violence from both the Suspects and the General Public. I hope the AG is looking at the temporary suspension of Civil Liberties, (during and in the geographic locales where ICE Operations are being conducted).”

A user who joined the forum in 2018 and identifies as a recently retired agent responded, “This is an excellent idea and well warranted. These are organized, well financed civil disturbances, dare I say an INSURRECTION?!?”

In a January 16 post titled “The Shooting,” some posters took a more nuanced view. “I get that it is a good shoot legally and all that, but all he had to do was step aside, he nearly shot one of his partners for Gods sake!” wrote a poster who first joined the forum in March 2022. “A USC woman non-crim shot in the head on TV for what? Just doesn’t sit well with me… A seasoned SRT guy who was able to execute someone while holding a phone seems to me he could have simply got out of the way.” SRT refers to ICE’s elite special response team, who undergo special training to operate in high-risk situations. USC refers to US citizens.

“You clearly haven’t been TDY anywhere. Yes, they were going to arrest her for 111,” responded another poster who joined the forum in June 2018. “Tons of USCs are being arrested for it daily.” 111 refers to the part of the US criminal code that deals with assaulting, resisting, or impeding federal officers; TDY refers to “temporary duty,” where officers are pulled to different locations for a limited period of time.

“Can’t believe we have ‘supposed agents’ here questioning the shooting of a domestic terrorist,” wrote a third user who joined the forum in December 2025. (In the wake of the shooting, DHS Secretary Kristi Noem called Good a “domestic terrorist.”)

“If you think a fat unarmed lesbian in a Honda is a ‘Terrorist’ then you are a fake ass cop!” the original poster replied. “I have worked real Terrorism cases, and I am not saying it was a bad shoot and not defending her. I am just saying it did not have to happen.”

Later in the thread, a poster who joined the forum in July 2023 replied, “Remember, these are the same agents who think J6’ers were just misunderstood rowdy tourists, and that Ashley Babbitt is a national hero…and if you dare say something negative about Trump, or try to hold him accountable, you’re suddenly a leftie, communist, lunatic (even though I’m a Republican).”

In a different thread following the shooting of Pretti on January 24 by a CBP agent, a poster who has been a part of the forum since 2023 and who also identified as a retired agent wrote, “Yet another ‘justified’ fatal shooting…They all carry gun belts and vest with 9,000 pieces of equipment on them and then best they can do is shoot a guy in the back.”

The thread devolved into posters debating whether members of the January 6 insurrection were domestic terrorists, and why Kyle Rittenhouse, who shot and killed two people during a 2020 Black Lives Matter protest, was apprehended alive.

“I just want to mention, we all get emotions are heightened right now. But I highly doubt being a legacy customs guy you ever did anything where the risk was beyond the potential for paper cuts,” wrote a user who first joined the forum in June 2025. “It’s a new day with new threats in an environment you never fathomed in your career.”

Even before the shootings of Good and Pretti, members of the forum questioned the wisdom of bringing HSI into the Trump administration’s mass mobilization around immigration enforcement. HSI deals specifically with criminal cases and investigations, but living and working in the US without documentation is a civil offense, and the majority of immigrants who were detained or deported in 2025 had not committed any crimes.

One poster complained that doing so was pulling HSI resources away from more urgent casework.

“The use of 1811s — HSI or otherwise — for administrative immigration enforcement is a complete misuse of resources,” wrote a user who joined the forum in October 2022 in a January 7 post. 1811s refers to a category of law enforcement officers, generally called special agents, who conduct criminal investigations. “They could be doing these crime surges for literally any type of federal criminal investigations (drugs, child exploitation, gangs, etc.), and it would be a much better use of resources. Not only that, our reputations would still be intact.”

Others in the forum have complained about HSI’s relationship with ICE’s ERO teams. “It’s pretty bad when ERO at a large metropolitan city get’s backed up with 30 bodies and they call the SA’s in to process,” wrote a poster who has been a part of the forum since 2010 in a July 7, 2025 post. SAs refers to special agents. “I guess that is what happens when they have not done any immigration work in decades.”

“Complete opposite in our [area of responsibility],” a poster who first joined the forum in 2012 replied. “No one has a clue what most of ero is doing and are asking us to be included on anything immigration we’re doing and introduce them to DEA contacts working investigations involving illegals.”

A third poster, who has been a forum member since 2024, added that “ERO does essentially nothing. I walked in the office the other day and the HSI SAs were doing jail pickups and processing. The ERO folks were gathered around a desk drinking coffee and joking around.” In cases where ICE has a request out to jails for a person they’re pursuing, known as an immigration detainer, jails will hold that person for up to “48 hours beyond the time they would ordinarily release them” to allow ICE to pick them up.

In the lead up to federal immigration authorities’ operations in Minneapolis, members complained about long hours. “How are RHAs expected to go on TDYs with NO days off and lots of [overtime] when they are all capped out (biweekly and yearly [sic]),” complained a user who first joined the forum in December 2004 in a post from December 7, 2025. RHAs refers to rehired annuitants, or retired federal agents who have returned to the job and continue to collect their federal retirement benefits. “ERO has NO caps.”

“I’m capped out so only getting paid for 5 days at 10 hours a day,” wrote the user who first joined the forum in 2010 in another thread (overtime pay rules can vary from agency to agency). ”Anything over 50 hours a week and I’m working for free.”

Others in the forum said they were waiting for their promised sign-on bonuses, and expressed disappointment with what they saw in their paychecks. For rehired annuitants, ICE offered a signing bonus of up to $50,000. “Not sure how they calculated the current pay from the super check received today, but mine can’t be right,” a poster who joined the forum in 2021 wrote in an October post. “My super check netted me a grand total of $600 more.”

In another thread on bonuses, a user who has been a forum member since 2005 replied, “I got a deposit last night or early this morning. It looks like 10k after taxes plus my regular pay check. Not sure yet. However the deal was 20K. WTF?!”

In a December thread, other members discussed the way immigration agents had begun to interact more aggressively with protesters. The user identified as a retired agent wrote, “I’ve seen a lot of videos lately of HSI or ERO agents getting triggered by civilians taking photos or videos of them or their vehicles. In several of the videos the agents are seen jumping out of their GOVs, manhandling the civilian, and smashing or confiscating their phones.” The user expressed bewilderment about the behavior, writing that they “would have been fired and/or prosecuted for something like this. I believe everyone knows at this point that taking photos/videos is a protected act unless someone is clearly impeding or obstructing (which doesn’t always appear to be the case).”

Another poster, who joined the forum in September 2025, replied, “Ah…Cell phone video. You can make them tell what ever story you want with creative editing.”

As part of the response to immigration operations, particularly in Minnesota, civilians have organized to monitor federal agents, coordinating to witness and record their operations, and sometimes tailing suspected ICE vehicles and checking license plates in Signal groups. Federal agents, in turn, have been seen taking photos and videos of protesters, with one legal observer in Maine claiming that an agent told her she would be added to a terror watchlist. (In testimony this week, Todd Lyons, acting director of ICE, told members of Congress that ICE was not making such a list of US citizens.)

In posts throughout the forum, members also complain about their access to gear and the agency’s technology. “Apparently there is enough money to buy a bunch of ICE marked cars but not get us some basic protective gear…” wrote one user on January 27, who joined the forum in 2025.

“I also have a suspicion that HQ or the [Executive Associate Director] have not advocated to get us gear to handle all the nut job protesters,” they wrote in a follow-up post.

On a thread named “Alien Processing” that started in July 2025, posters complained about the agency’s case-processing software. “How is it that with all the technology we have and an entire fkn building full of computer geeks this fkn agency cannot make a fkn system that works properly and effectively in a simple user friendly fashion? This Eagle crap is a total mess!” one poster wrote. EAGLE refers to the EID Arrest Guide for Law Enforcement, part of the Enforcement Integrated Database (EID), a system used to process the biometric and personal information of people arrested by ICE. “It takes longer to process a fkn alien than it does to actually catch them. We dont need 10,000 new ICE Officers/Agents, just hire fkn people to process them so we can do our jobs of catching them.”

Members also talked about their preferred pieces of ICE tech: Another user, who joined the forum in March 2025, responded to the “Alien Processing” thread, writing “Mobile Fortify is the best thing that has come out in a long time,” in reference to the mobile facial recognition app used by federal agents to identify people in the field.

According to DHS’s 2025 AI Use Case Inventory, agents have been able to use Mobile Fortify since May 2025. The app uses AI, trained with CBP’s “Vetting/Border Crossing Information/ Trusted Traveler Information,” to match a picture taken by agents and “contactless” fingerprints with existing records. 404 Media reported that the app has misidentified at least one person—perhaps because, as WIRED has reported, it wasn’t designed to be used for what ICE is using it for.

Though ICE’s surge in Minnesota appears to be entering a drawdown, the agency is continuing to expand its footprint across the US, and investing in a network of detention centers and large warehouses for holding immigrants, all indicating that detentions and deportations are not expected to slow down.

“Put yourself in the shoes of the guys in the street strung out on crazy op tempo, being threatened and antagonized all day, having inept leadership, low morale, and then having to fight every formerly low risk non-crim (or barely crim) because they are all hyped up on victim status and liberal energy. Plus hyper partisan radicalization on both sides,” the user who joined the forum in June 2025 wrote. “If you think the news is enraging you now, wait till this spring/summer when we need to fill the mega detention centers.”

This story originally appeared on wired.com.


Wired.com is your essential daily guide to what’s next, delivering the most original and complete take you’ll find anywhere on innovation’s impact on technology, science, business and culture.


Sideways on the ice, in a supercar: Stability control is getting very good


To test stability control, it helps to have a wide-open space with very low grip.

A blue McLaren Artura drifting on a frozen lake

You can tell this photo was taken a day or two before we were there because the sun came out. Credit: McLaren


SAARISELKÄ, FINLAND—If you’re expecting it, the feeling in the pit of your stomach when the rear of your car breaks traction and begins to slide is rather pleasant. It’s the same exhilaration we get from roller coasters, but when you’re in the driver’s seat, you’re in charge of the ride.

When you’re not expecting it, though, there’s anxiety instead of excitement and, should the slide end with a crunch, a lot more negative emotions, too.

Thankfully, fewer and fewer drivers will have to experience that kind of scare thanks to the proliferation and sophistication of modern electronic stability and traction control systems. These electronic safety nets have grown in capability for more than 30 years; they became mandatory in the early 2010s and have prevented countless crashes along the way.

By cutting engine power and braking each wheel individually, the computers that keep a watchful eye on things like lateral acceleration and wheel spin work to ensure the car goes where the driver wants it, rather than sideways or backward into whatever solid object lies along the new path of motion.
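To make that concrete, here is a toy sketch of the core idea: compare the rotation the driver is asking for with what the car is actually doing, then trim power and brake individual wheels to close the gap. Everything in it (the names, gains, and simple bicycle model) is my own illustration, not any automaker’s actual calibration.

```python
# Illustrative sketch of an electronic stability control (ESC) loop. All
# names, gains, and thresholds are hypothetical; production systems fuse
# many more sensors and run validated, vehicle-specific calibrations.

def desired_yaw_rate(speed_mps: float, steer_angle_rad: float,
                     wheelbase_m: float = 2.64) -> float:
    """Bicycle-model estimate of the rotation rate the driver is requesting."""
    return speed_mps * steer_angle_rad / wheelbase_m

def esc_step(measured_yaw: float, speed: float, steer_angle: float,
             gain: float = 0.8, deadband: float = 0.05) -> dict:
    """One control step: compare intended vs. actual rotation and intervene."""
    error = desired_yaw_rate(speed, steer_angle) - measured_yaw
    if abs(error) < deadband:  # car is going where it's pointed; do nothing
        return {"torque_cut": 0.0, "brake": {}}
    # Rotating faster than requested (oversteer): brake the outside front.
    # Rotating slower than requested (understeer): brake the inside rear.
    wheel = "front_outer" if error * steer_angle < 0 else "rear_inner"
    strength = min(1.0, gain * abs(error))
    return {"torque_cut": strength, "brake": {wheel: strength}}

# Example: steering left at 20 m/s, but the rear has stepped out and the
# car is rotating faster than the steering input requests (oversteer).
print(esc_step(measured_yaw=0.9, speed=20.0, steer_angle=0.1))
```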

Obviously, the quickest way to find out whether this all works is to turn it off. And then find a slippery road, or just drive like an oaf. Yet even when automakers let journalists loose on racetracks, they invariably require that we keep some of the electronic safety net turned on. Even on track, you can hit things that will crumple a car—or worse—and with modern tire technology being what it is, the speeds involved when cars do let go tend to be quite high, particularly if it’s dry.

An orange McLaren Artura, seen from behind on a frozen lake. The rear is encrusted with snow.

The Artura is probably my favorite McLaren, as it’s smaller and more versatile than the more expensive, more powerful machines in the range.

Credit: Jonathan Gitlin


There are few environments that are more conducive to exploring the limits and capabilities of electronic chassis control. Ideally, you want a lot of wide-open space free of wildlife and people and a smooth, low-grip surface. A giant sand dune would work. Or a frozen lake. Which is why you can sometimes find automotive engineers hanging out in these remote, often extreme locations, braving the desert’s heat or an Arctic chill as they work on a prototype or fine-tune the next model.

And it’s no secret that sliding a car on the ice is a lot of fun. So it’s not surprising that a cottage tourism industry exists that—for a suitable fee—will bring you north of the Arctic Circle, where you can work on your car control and get some insight into just how hard those electronics are capable of working.

That explains why I left an extremely cold Washington, DC, to travel to an even colder Saariselkä in Finland, where McLaren operates its Arctic Experience program on a frozen lake in nearby Ivalo. The company does some development work here, though more of it happens across the border in Sweden. But for a few weeks each winter, it welcomes customers to its minimalist lodge to work on their car control. And earlier this month, Ars was among a group of journalists who got an abbreviated version of the experience.

Our car for the day was a Ventura Orange McLaren Artura, the brand’s plug-in hybrid supercar, wearing Pirelli’s Sottozero winter tires, each augmented by a few hundred metal spikes. Its total power and torque output is 671 hp (500 kW) and 531 lb-ft (720 Nm) combined from a 3.0 L twin-turbo V6 that generates 577 hp (430 kW) and 431 lb-ft (584 Nm), plus an axial flux electric motor that contributes an additional 94 hp (70 kW) and 166 lb-ft (225 Nm). All of that is sent to the rear wheels via an eight-speed dual-clutch transmission.

A McLaren Artura winter tire fitted with studs

Winter tires work well on snow, but for ice, you really need studs.

Credit: Jonathan Gitlin


Where most hybrids use the electric motor to boost efficiency, McLaren mostly uses it to boost performance, providing an immediate shove and filling gaps in the torque band where necessary. In electric-only mode, the Artura will drive on the motor alone, right up to that mode’s 81 mph (130 km/h) speed limit. Being the sort of curious nerd I am, I took the opportunity to try all the different modes.

Once I got control of my stomach, that is.
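Incidentally, that torque-fill arrangement also explains why the combined 531 lb-ft is less than the naive 431 + 166 = 597 sum: the engine and motor hit their peaks at different revs. Here is a toy sketch of the idea, with invented curve shapes and only the peak figures taken from the spec sheet above.

```python
# Toy sketch of hybrid "torque fill." Curve shapes are invented; only the
# peak values (431 lb-ft engine, 166 lb-ft motor) come from the spec sheet.
# Because the two peaks arrive at different revs, the combined peak quoted
# by McLaren (531 lb-ft) is lower than the naive 431 + 166 = 597 sum.

def engine_torque(rpm: int) -> float:
    """Turbo V6: weak off boost, ramping to a plateau once spooled."""
    return 431.0 * min(1.0, rpm / 3500.0) ** 2

def motor_torque(rpm: int) -> float:
    """Electric motor: full torque from zero rpm, tapering off up top."""
    return max(0.0, 166.0 - max(0.0, rpm - 2000.0) * 0.06)

for rpm in (1000, 2000, 3000, 4000, 6000):
    combined = engine_torque(rpm) + motor_torque(rpm)
    print(f"{rpm:4d} rpm: {combined:5.1f} lb-ft combined")
```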

Are you sure you should drink that?

Our first exercise was ironically the hardest: driving sideways around a plain old circle. A couple of these had been scribed into the ice—which freezes from November until April and was 28 inches (70 cm) thick, we learned—along with more than a dozen other, more involved courses. Even under the best of conditions, the Sun spends barely six hours a day on its shallow curve from horizon to horizon at this time of year. On the day of our visit, the horizon was an indistinct thing as heavy gray skies blended with the snow-covered ice.

The lack of a visual reference, mixed with 15 minutes of steady lateral G-forces, turned out to be unkind to my vestibular system, and about 10 minutes later, I found myself in shirtsleeves at minus-11˚F (minus-23˚C), saying goodbye to a cup of Earl Grey tea I’d perhaps unwisely drunk a little earlier. At least I remembered to face downwind—given the sideways gale, it could have ended worse.

A number of circuits carved into the surface of a frozen lake

These are just some of the circuits that McLaren has carved into the ice in Ivalo. Beware of the innocent-looking circles—they’re deceptively hard and may turn your stomach.

Credit: McLaren


Fortified with an anti-emetic and some extremely fresh air, I returned to the ice and can happily report that as long as you slide both left and right, you’re unlikely to get nauseous.

Getting an Artura sideways on a frozen lake is not especially complicated. With the powertrain set to Track, which prioritizes performance and keeps the V6 running the whole time, and with stability and traction control off, you apply enough power to break traction at the rear. Or a dab of brake could do the job, too, followed by some power. You steer more with your right foot than your hands, adding or subtracting power to rein in or amplify the slip angle. Your eyes are crucial to the process; if you look through the corner down the track, that’s probably where you’ll end up. Fixate on the next apex and you may quickly find yourself off-course.

Most of the mid-engined Artura’s 3,303 lbs (1,498 kg) live between its axles, and it’s a relatively easy car to catch once it begins to slide, with plenty of travel for the well-mapped throttle pedal.

As it turns out, that holds true even when you’re using only the electric motor. 166 lb-ft is more than enough to get the rear wheels spinning on the ice, but with just 94 hp, there isn’t really enough power to get the car properly sideways. So you can easily control a lazy slide around one of the handling courses, in near silence, to boot. Turning the electronic aids back on made things much less dramatic; even with my foot to the floor, the Artura measured out minute amounts of power, keeping the car very much pointed where I steered it rather than requiring any opposite lock.

A person stands next to a McLaren Artura on a frozen lake

It feels like the edge of the world out here.

Credit: McLaren


Turn it on, turn it off

Back in track mode, with all 671 hp to play with, there was much more power than necessary to spin. But with the safety net re-enabled, driving around the handling course was barely any more dramatic than with a fraction of the power. The car’s electronic chassis control algorithms would only send as much power to the rear wheels as they could deploy, no matter how much throttle I applied. As each wheel lost grip and began to spin, its brake would intervene. And we went around the course, slowly but safely. As a demonstration of the effectiveness of modern electronic safety systems, it was very reassuring.

As I mentioned earlier, even when journalists are let loose in supercars on track, it’s with some degree of electronic assist enabled. That’s because for the sportier kind of car, you’ll often find a halfway house between everything buttoned down and all the aids turned off. Here, the idea is to loosen the safety net and allow the car to move around, but only a little. The electronics aren’t just there to make things safe; they’re also there to flatter the driver.

In McLaren’s case, that mode is called Variable Drift Control, which is a rather accurate name: you set a maximum slip angle (from 1˚ to 15˚), and the car will not let you exceed it. A slug of power will get the rear wheels spinning and the rear sliding, but only up to the set angle, at which point the brakes and powertrain interrupt as necessary.
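McLaren hasn’t published how the mode works internally, so treat the following as a hypothetical sketch (every name and number in it is invented), but conceptually it amounts to a clamp on estimated slip angle.

```python
import math

# Hypothetical sketch of a slip-angle clamp in the spirit of Variable
# Drift Control. Function names, margins, and gains are all invented.

def slip_angle_deg(lateral_v: float, forward_v: float) -> float:
    """Body slip angle: the angle between where the car points and where
    it is actually traveling."""
    return math.degrees(math.atan2(lateral_v, forward_v))

def vdc_torque_scale(slip_deg: float, limit_deg: float = 15.0,
                     margin_deg: float = 2.0) -> float:
    """Scale the driver's torque demand toward zero as measured slip
    approaches the selected limit; brakes would trim yaw the same way."""
    if slip_deg <= limit_deg - margin_deg:
        return 1.0  # inside the chosen drift angle: let the driver play
    over = slip_deg - (limit_deg - margin_deg)
    return max(0.0, 1.0 - over / margin_deg)

# Example: sliding at about 14 degrees with the limit set to the maximum 15.
slip = slip_angle_deg(lateral_v=5.0, forward_v=20.0)
print(f"slip {slip:.1f} deg -> torque scale {vdc_torque_scale(slip):.2f}")
```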

It’s very flattering, holding what feels like a lurid slide between turns with ease, without any concern that a lapse in concentration might leave the car requiring recovery after beaching on a few inches of snow. Even when your right foot is pinned to the firewall, the silicon brains running the show apply only as much torque as necessary, with the little icon flashing on the dash letting you know it’s intervening.

A man seen drifting a McLaren

If you have the space, there’s little more fun than drifting a car on ice. But it’s good to know that electronic stability control and traction control will help you out when you’re not trying to have fun.

Credit: McLaren


I can certainly see why OEMs ask that modes like VDC be the spiciest setting we try when they lend us their cars. They’re just permissive enough to break the rear loose and fire off a burst of adrenaline, yet cosseting enough that the ride almost certainly won’t end in tears. Fun though VDC was to play with, it does feel artificial once you get your eye in—particularly compared to the thrill of balancing an Artura on the throttle as you change direction through a series of corners or the satisfaction of catching and recovering a spin before it’s too late.

But outside of a frozen lake, I’ll be content to keep some degree of driver aids running.


Jonathan is the Automotive Editor at Ars Technica. He has a BSc and PhD in Pharmacology. In 2014 he decided to indulge his lifelong passion for the car by leaving the National Human Genome Research Institute and launching Ars Technica’s automotive coverage. He lives in Washington, DC.


Platforms bend over backward to help DHS censor ICE critics, advocates say


Pam Bondi and Kristi Noem sued for coercing platforms into censoring ICE posts.

Credit: Aurich Lawson | Getty Images


Pressure is mounting on tech companies to shield users from unlawful government requests that advocates say are making it harder to reliably share information about Immigration and Customs Enforcement (ICE) online.

Alleging that ICE officers are being doxed or otherwise endangered, Trump officials have spent the last year targeting an unknown number of users and platforms with demands to censor content. Early lawsuits show that platforms have caved, even though experts say they could refuse these demands without a court order.

In a lawsuit filed on Wednesday, the Foundation for Individual Rights and Expression (FIRE) accused Attorney General Pam Bondi and Department of Homeland Security Secretary Kristi Noem of coercing tech companies into removing a wide range of content “to control what the public can see, hear, or say about ICE operations.”

It’s the second lawsuit alleging that Bondi and DHS officials are using regulatory power to pressure private platforms to suppress speech protected by the First Amendment. It follows a complaint from the developer of an app called ICEBlock, which Apple removed from the App Store in October. Officials aren’t rushing to resolve that case—last month, they requested more time to respond—so it may remain unclear until March what defense they plan to offer for the takedown demands.

That leaves community members who monitor ICE in a precarious situation, as critical resources could disappear at the department’s request with no warning.

FIRE says people have legitimate reasons to share information about ICE. Some communities focus on helping people avoid dangerous ICE activity, while others aim to hold the government accountable and raise public awareness of how ICE operates. Unless there’s proof of incitement to violence or a true threat, such expression is protected.

Despite the high bar for censoring online speech, lawsuits trace an escalating pattern of DHS targeting websites, app stores, and platforms—many of which have been willing to remove content the government dislikes.

Officials have ordered ICE-monitoring apps to be removed from app stores and even threatened to sanction CNN for simply reporting on the existence of one such app. Officials have also demanded that Meta delete at least one Chicago-based Facebook group with 100,000 members and made multiple unsuccessful attempts to unmask anonymous users behind other Facebook groups. Even encrypted apps like Signal don’t feel safe from officials’ seeming overreach. FBI Director Kash Patel recently said he has opened an investigation into Signal chats used by Minnesota residents to track ICE activity, NBC News reported.

As DHS censorship threats increase, platforms have done little to shield users, advocates say. Not only have they sometimes failed to reject unlawful orders that provided only “a bare mention of ‘officer safety/doxing’” as justification, but in one case, Google complied with a subpoena that left a critical section blank, the Electronic Frontier Foundation (EFF) reported.

For users, it’s increasingly difficult to trust that platforms won’t betray their own policies when faced with government intimidation, advocates say. Sometimes platforms notify users before complying with government requests, giving users a chance to challenge potentially unconstitutional demands. But in other cases, users learn about the requests only as platforms comply with them—even when those platforms have promised that would never happen.

Government emails with platforms may be exposed

Platforms could face backlash from users if lawsuits expose their communications with the government, a possibility in the coming months. Last fall, the EFF sued after the DOJ, DHS, ICE, and Customs and Border Protection failed to respond to Freedom of Information Act requests seeking emails between the government and platforms about takedown demands. Other lawsuits may surface emails in discovery. In the coming weeks, a judge will set a schedule for the EFF’s litigation.

“The nature and content of the Defendants’ communications with these technology companies” is “critical for determining whether they crossed the line from governmental cajoling to unconstitutional coercion,” EFF’s complaint said.

EFF Senior Staff Attorney Mario Trujillo told Ars that the EFF is confident it can win the fight to expose government demands, but like most FOIA lawsuits, the case is expected to move slowly. That’s unfortunate, he said, because ICE activity is escalating, and delays in addressing these concerns could irreparably harm speech at a pivotal moment.

Like users, platforms are seemingly victims, too, FIRE senior attorney Colin McDonnell told Ars.

They’ve been forced to override their own editorial judgment while navigating implicit threats from the government, he said.

“If Attorney General Bondi demands that they remove speech, the platform is going to feel like they have to comply; they don’t have a choice,” McDonnell said.

But platforms do have a choice and could be doing more to protect users, the EFF has said. Platforms could even serve as a first line of defense, requiring officials to get a court order before complying with any requests.

Platforms may now have good reason to push back against government requests—and to give users the tools to do the same. Trujillo noted that while courts have been slow to address the ICEBlock removal and FOIA lawsuits, the government has quickly withdrawn requests to unmask Facebook users soon after litigation began.

“That’s like an acknowledgement that the Trump administration, when actually challenged in court, wasn’t even willing to defend itself,” Trujillo said.

Platforms could view that as evidence that government pressure only works when platforms fail to put up a bare-minimum fight, Trujillo said.

Platforms “bend over backward” to appease DHS

An open letter from the EFF and the American Civil Liberties Union (ACLU) documented two instances of tech companies complying with government demands without first notifying users.

The letter called out Meta for unmasking at least one user without prior notice, which the groups noted “potentially” occurred due to a “technical glitch.”

More troubling than buggy notifications, however, is the possibility that platforms may be routinely delaying notice until it’s too late.

After Google “received an ICE subpoena for user data and fulfilled it on the same day that it notified the user,” the company admitted that “sometimes when Google misses its response deadline, it complies with the subpoena and provides notice to a user at the same time to minimize the delay for an overdue production,” the letter said.

“This is a worrying admission that violates [Google’s] clear promise to users, especially because there is no legal consequence to missing the government’s response deadline,” the letter said.

Platforms face no sanctions for refusing to comply with government demands that have not been court-ordered, the letter noted. That’s why the EFF and ACLU have urged companies to use their “immense resources” to shield users who may not be able to drop everything and fight unconstitutional data requests.

In their letter, the groups asked companies to insist on court intervention before complying with a DHS subpoena. They should also resist DHS “gag orders” that ask platforms to hand over data without notifying users.

Instead, they should commit to giving users “as much notice as possible when they are the target of a subpoena,” as well as a copy of the subpoena. Ideally, platforms would also link users to legal aid resources and take up legal fights on behalf of vulnerable users, advocates suggested.

That’s not what’s happening so far. Trujillo told Ars that it feels like “companies have bent over backward to appease the Trump administration.”

The tide could turn this year if courts side with app makers behind crowdsourcing apps like ICEBlock and Eyes Up, who are suing to end the alleged government coercion. FIRE’s McDonnell, who represents the creator of Eyes Up, told Ars that platforms may feel more comfortable exercising their own editorial judgment moving forward if a court declares they were coerced into removing content.

DHS can’t use doxing to dodge First Amendment

FIRE’s lawsuit accuses Bondi and Noem of coercing Meta to disable a Facebook group with 100,000 members called “ICE Sightings–Chicagoland.”

The popularity of that group surged during “Operation Midway Blitz,” when hundreds of agents arrested more than 4,500 people over weeks of raids that used tear gas in neighborhoods and caused car crashes and other violence. Arrests included US citizens and immigrants of lawful status, which “gave Chicagoans reason to fear being injured or arrested due to their proximity to ICE raids, no matter their immigration status,” FIRE’s complaint said.

Kassandra Rosado, a lifelong Chicagoan and US citizen of Mexican descent, started the Facebook group and served as admin, moderating content with other volunteers. She prohibited “hate speech or bullying” and “instructed group members not to post anything threatening, hateful, or that promoted violence or illegal conduct.”

Facebook only ever flagged five posts that supposedly violated community guidelines, but in warnings, the company reassured Rosado that “groups aren’t penalized when members or visitors break the rules without admin approval.”

Rosado had no reason to suspect that her group was in danger of removal. When Facebook disabled her group, it told Rosado the group violated community standards “multiple times.” But her complaint noted that, confusingly, “Facebook policies don’t provide for disabling groups if a few members post ostensibly prohibited content; they call for removing groups when the group moderator repeatedly either creates prohibited content or affirmatively ‘approves’ such content.”

Facebook’s decision came after a right-wing influencer, Laura Loomer, tagged Noem and Bondi in a social media post alleging that the group was “getting people killed.” Within two days, Bondi bragged that she had gotten the group disabled while claiming that it “was being used to dox and target [ICE] agents in Chicago.”

McDonnell told Ars it seems clear that Bondi selectively uses the term “doxing” when people post images from ICE arrests. He pointed to “ICE’s own social media accounts,” which share favorable opinions of ICE alongside videos and photos of ICE arrests that Bondi doesn’t consider doxing.

“Rosado’s creation of Facebook groups to send and receive information about where and how ICE carries out its duties in public, to share photographs and videos of ICE carrying out its duties in public, and to exchange opinions about and criticism of ICE’s tactics in carrying out its duties, is speech protected by the First Amendment,” FIRE argued.

The same goes for speech managed by Mark Hodges, a US citizen who resides in Indiana. He created an app called Eyes Up to serve as an archive of ICE videos. Apple removed Eyes Up from the App Store around the same time that it removed ICEBlock.

“It is just videos of what government employees did in public carrying out their duties,” McDonnell said. “It’s nothing even close to threatening or doxing or any of these other theories that the government has used to justify suppressing speech.”

Bondi bragged that she had gotten ICEBlock banned, and FIRE’s complaint confirmed that Hodges’ company received the same notification that ICEBlock’s developer got after Bondi’s victory lap. The notice said that Apple received “information” from “law enforcement” claiming that the apps had violated Apple guidelines against “defamatory, discriminatory, or mean-spirited content.”

Apple did not reach the same conclusion when it independently reviewed Eyes Up prior to government meddling, FIRE’s complaint said. Notably, the app remains available on Google Play, and Rosado now manages a new Facebook group with similar content but somewhat tighter restrictions on who can join. Neither activity has required urgent intervention from either the tech giants or the government.

McDonnell told Ars that it’s harmful for DHS to water down the meaning of doxing when pushing platforms to remove content critical of ICE.

“When most of us hear the word ‘doxing,’ we think of something that’s threatening, posting private information along with home addresses or places of work,” McDonnell said. “And it seems like the government is expanding that definition to encompass just sharing, even if there’s no threats, nothing violent. Just sharing information about what our government is doing.”

Expanding the definition and then using that term to justify suppressing speech is concerning, he said, especially since the First Amendment includes no exception for “doxing,” even if DHS ever were to provide evidence of it.

To suppress speech, officials must show that groups are inciting violence or making true threats. FIRE has alleged that the government has not met “the extraordinary justifications required for a prior restraint” on speech and is instead using vague doxing threats to discriminate against speech based on viewpoint. The plaintiffs are seeking a permanent injunction barring officials from coercing tech companies into censoring ICE posts.

If plaintiffs win, the censorship threats could subside, and tech companies may feel safe reinstating apps and Facebook groups, advocates told Ars. That could potentially revive archives documenting thousands of ICE incidents and reconnect webs of ICE watchers who lost access to valued feeds.

Until courts possibly end threats of censorship, the most cautious community members are moving local ICE-watch efforts to group chats and listservs that are harder for the government to disrupt, Trujillo told Ars.


Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.


Trump official overruled FDA scientists to reject Moderna’s flu shot

Still, while Moderna largely stuck with its plan to use a standard dose for all participants, it did adjust the trial based on the feedback. Specifically, it added a comparison in which some older participants received a high-dose vaccine, and it provided the FDA with an additional analysis.

This wasn’t enough for Prasad, who, according to the Journal’s sources, told FDA staff that he wants to send more such refusal letters that appear to blindside drug developers. The review staff apparently pushed back, noting that such moves break with the agency’s practices and could open it up to being sued. Prasad reportedly dismissed concern over possible litigation. Trump’s FDA Commissioner Marty Makary seemed similarly unconcerned, suggesting on Fox News that Moderna’s trial may be “unethical.”

A senior FDA official suggested to Stat, meanwhile, that the door might not be entirely closed for Moderna’s flu vaccine. The official said that the company could toss the data for the 65 and up participants and, perhaps, grovel.

“It is entirely feasible that if they come back, maybe even show some humility and say, ‘Yes, we didn’t follow your recommendation. Just take a look at the 50 to 65 group, where there’s a little more equipoise,’” the official told Stat. “Then the review team could say, ‘We’ll consider that cohort.’”

The Journal notes that Moderna is at least the ninth company to have received a surprise rejection from Prasad and his team. The unpredictability is raising fears about the industry’s ability to obtain investments and innovate.

Prasad, a blood cancer specialist who has no expertise or experience in vaccine regulation, is also facing internal problems at the agency. His management style has created an environment “rife with mistrust and paranoia,” according to Stat. The Journal reports that several complaints have been filed against him, including some involving sexual harassment, retaliation against subordinates, and verbally berating staff.


Spider-Noir teaser comes in colorized “True Hue” and black and white

The footage was shot digitally and processed separately to create the black-and-white and color versions. The team coined the term “True Hue” for the latter, since the intent was to create something that looked supersaturated, like classic Technicolor. (Cage compared the feel to the 1942 Edward Hopper painting Nighthawks.) Per the official premise: “Spider-Noir tells the story of Ben Reilly (Nicolas Cage), a seasoned, down on his luck private investigator in 1930s New York, who is forced to grapple with his past life, following a deeply personal tragedy, as the city’s one and only superhero.”

In addition to Cage’s Ben Reilly/The Spider, the cast includes Lamorne Morris as Reilly’s friend Robbie Robertson, a freelance journalist who clings to optimism in the face of his buddy’s cynicism; Li Jun Li as nightclub singer Cat Hardy, the classic underworld femme fatale (Li based her portrayal on Anna May Wong, Rita Hayworth, and Lauren Bacall); Karen Rodriguez as Reilly’s secretary, Janet; Abraham Popoola as a World War I veteran; Jack Huston as a bodyguard named Flint Marko; Brendan Gleeson as New York mob boss Silvermane, who is being targeted for assassination; Lukas Haas as one of Silvermane’s subordinates; Richard Robichaux as the editor of the Daily Bugle; and Kai Caster.

Frankly, we’re digging the black and white, but here’s the True Hue color version of the teaser for good measure:

Spider-Noir premieres on May 25, 2026, on MGM+, with all episodes becoming available on Prime Video on May 27, 2026. Viewers can choose to watch in black and white or True Hue.


EPA kills foundation of greenhouse gas regulations

In a widely expected move, the Environmental Protection Agency has announced that it is revoking an analysis of greenhouse gases that laid the foundation for regulating their emissions by cars, power plants, and industrial sources. The analysis, called an endangerment finding, was initially ordered by the US Supreme Court in 2007 and completed during the Obama administration; it has, in theory, served as the basis of all government regulations of carbon dioxide emissions since.

In practice, lawsuits and policy changes between Democratic and Republican administrations have meant it has had little impact. In fact, the first Trump administration left the endangerment finding in place, deciding it was easier to respond to it with weak regulations than it was to challenge its scientific foundations, given the strength of the evidence for human-driven climate change.

Legal tactics

The second Trump administration, however, was prepared to tackle the science head-on, gathering a group of contrarians to write a report questioning that evidence. It did not go well, either scientifically or legally.

Today’s announcement ignores the scientific foundations of the endangerment finding and argues that it’s legally flawed. “The Trump EPA’s final rule dismantles the tactics and legal fictions used by the Obama and Biden Administrations to backdoor their ideological agendas on the American people,” the EPA claims. The claim is awkward, given that the “legal fictions” referenced include a Supreme Court decision ordering the EPA to conduct an endangerment analysis.


DIY PC maker Framework has needed monthly price hikes to navigate the RAM shortage

AI-driven memory and storage price hikes have been the defining feature of the PC industry in 2026, and hobbyists have been hit the hardest—companies like Apple with lots of buying power have been able to limit the price increases for their PCs, phones, and other gadgets so far, but smaller outfits like Valve and Raspberry Pi haven’t been so lucky.

Framework, the company behind repairable and upgradeable computer designs like the Laptop 13, Laptop 16, and Laptop 12, is also taking a hard hit from price increases. The company stopped selling standalone RAM sticks in November 2025 and has increased prices on one or more of its systems every month since then; this week’s increases are hitting the Framework Desktop and the DIY Editions of its various laptops particularly hard.

The price increases are affecting both standalone SODIMM memory modules and the soldered-down LPDDR5X memory used in the Framework Desktop. Framework CEO Nirav Patel says that standalone RAM sticks are being priced “as close as we can to the weighted average cost of our purchases from suppliers.” In September, buying an 8GB stick of RAM with a Framework Laptop 13 cost $40; it currently costs $130. A 96GB DDR5 kit of two 48GB sticks costs $1,340, up from $480 in September.
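“Weighted average cost” here is ordinary inventory accounting: the sticker price tracks what the blend of older and newer stock actually cost the company, not just the latest purchase. A minimal sketch, with invented purchase lots:

```python
# Weighted-average cost pricing, per Framework's description. The lot
# sizes and unit costs below are invented for illustration.

lots = [
    (5000, 38.0),   # units bought pre-shortage, cost per stick
    (3000, 95.0),   # mid-shortage restock
    (2000, 140.0),  # latest, most expensive purchase
]

total_units = sum(units for units, _ in lots)
avg_cost = sum(units * cost for units, cost in lots) / total_units
print(f"weighted average cost per stick: ${avg_cost:.2f}")  # $75.50 here
```

The average lags the spot price in both directions, which would explain monthly ratcheting: each cheap old lot sold through gets replaced by a more expensive one, pulling the average up.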

Framework Desktop systems and boards with built-in RAM are seeing price increases between 6 and 16 percent—in general, the more RAM you’re buying, the higher the increase. The base Framework Desktop system with 32GB of LPDDR5X now starts at $1,209, a $110 increase since launch. The maxed-out 128GB desktop starts at $2,599, a whopping $600 more than the system cost at launch.


It took two years, but Google released a YouTube app on Vision Pro

When Apple’s Vision Pro mixed reality headset launched in February 2024, users were frustrated at the lack of a proper YouTube app—a significant disappointment given the device’s focus on video content consumption, and YouTube’s strong library of immersive VR and 360 videos. That complaint continued through the release of the second-generation Vision Pro last year, including in our review.

Now, two years later, an official YouTube app from Google has launched on the Vision Pro’s app store. It’s not just a port of the iPad app, either—it has panels arranged spatially in front of the user as you’d expect, and it supports 3D videos, as well as 360- and 180-degree ones.

YouTube’s App Store listing says users can watch “every video on YouTube” (there’s a screenshot of a special interface for Shorts vertical videos, for example) and that they get “the full signed-in experience” with watch history and so on.

Shortly after the Vision Pro launched, many users complained to YouTube about the lack of an app. They were referred to the web interface—which worked OK for most 2D videos, but it obviously wasn’t an ideal experience—and were told that a Vision Pro app was on the roadmap.

Two years of silence followed. Third-party apps popped up, like the relatively popular Juno app, but it was pulled from the App Store after Google claimed it violated API policies. (Some others remained or became available later.)

Google is building out its own XR ambitions, so it’s possible the Vision Pro app benefited from some of that work, but it’s unclear how this all came to be. But it’s here now. Next up: Netflix, right? Sadly, that’s unlikely; unlike Google, Netflix has not announced any intention here.


Claude Opus 4.6 Escalates Things Quickly

Life comes at you increasingly fast. Two months after Claude Opus 4.5 we get a substantial upgrade in Claude Opus 4.6. The same day, we got GPT-5.3-Codex.

That used to be something we’d call remarkably fast. It’s probably the new normal, until things get even faster than that. Welcome to recursive self-improvement.

Before those releases, I was using Claude Opus 4.5 and Claude Code for essentially everything interesting, and only using GPT-5.2 and Gemini to fill in the gaps or for narrow specific uses.

GPT-5.3-Codex is restricted to Codex, so for other purposes Anthropic and Claude have only extended the lead. This is the first time in a while that a model got upgraded while it was still my clear daily driver.

Anthropic also shipped several other advances to the Claude ecosystem, including fast mode and expanding Cowork to Windows, while OpenAI gave us an app for Codex.

For fully agentic coding, GPT-5.3-Codex and Claude Opus 4.6 both look like substantial upgrades. Both sides claim they’re better, as you would expect. If you’re serious about your coding and have hard problems, you should try out both, and see what combination works best for you.

Enjoy the new toys. I’d love to rest now, but my work is not done, as I will only now dive into the GPT-5.3-Codex system card. Wish me luck.

  1. On Your Marks.

  2. Official Pitches.

  3. It Compiles.

  4. It Exploits.

  5. It Lets You Catch Them All.

  6. It Does Not Get Eaten By A Grue.

  7. It Is Overeager.

  8. It Builds Things.

  9. Pro Mode.

  10. Reactions.

  11. Positive Reactions.

  12. Negative Reactions.

  13. Personality Changes.

  14. On Writing.

  15. They Banned Prefilling.

  16. A Note On System Cards In General.

  17. Listen All Y’all It’s Sabotage.

  18. The Codex of Competition.

  19. The Niche of Gemini.

  20. Choose Your Fighter.

  21. Accelerando.

A clear pattern in the Opus 4.6 system card is reporting on open benchmarks where we don’t have scores from other frontier models. So we can see the gains for Opus 4.6 versus Sonnet 4.5 and Opus 4.5, but often can’t check Gemini 3 Pro or GPT-5.2.

(We also can’t check GPT-5.3-Codex, but given the timing and its lack of general availability, that seems fair.)

The headline benchmarks, the ones in their chart, are a mix of some very large improvements and other places with small regressions or no improvement. The weak spots are negative signs in themselves, but also good evidence that the benchmarks are not being gamed, especially given one of them is SWE-bench Verified (80.8% now vs. 80.9% for Opus 4.5). They note that a brief prompt asking for more tool use and careful handling of edge cases boosted SWE performance to 81.4%.

CharXiv reasoning performance remains subpar. Opus 4.5 gets 68.7% without an image cropping tool, or 77% with one, versus 82% for GPT-5.2, or 89% for GPT-5.2 if you give it Python access.

Humanity’s Last Exam keeps creeping upwards. We’re going to need another exam.

Epoch evaluated Opus 4.6 on FrontierMath and got 40%, a large jump over 4.5 and matching GPT-5.2-xhigh.

For long-context retrieval (MRCR v2 8-needle), Opus 4.6 scores 93% on 256k token windows and 76% on 1M token windows. That’s dramatically better than Sonnet 4.5’s 18% for the 1M window, or Gemini 3 Pro’s 25%, or Gemini 3 Flash’s 33% (I have no idea why Flash beats Pro). GPT-5.2-Thinking gets 85% for a 128k window on 8-needle.

For long-context reasoning they cite Graphwalks, where Opus gets 72% for Parents 1M and 39% for BFS 1M after modifying the scoring so that you get credit for the null answer if the answer is actually null. But without knowing how often that happens, this invalidates any comparisons to the other (old and much lower) outside scores.

MCP-Atlas shows a regression. Switching from max to only high effort improved the score to 62.7% for unknown reasons, but reporting that would be cherry-picking.

OpenRCA: 34.9% vs. 26.9% for Opus 4.5, with improvement in all tasks.

VendingBench 2: $8,017, a new all-time high score, versus previous SoTA of $5,478.

Andon Labs: Vending-Bench was created to measure long-term coherence during a time when most AIs were terrible at this. The best models don’t struggle with this anymore. What differentiated Opus 4.6 was its ability to negotiate, optimize prices, and build a good network of suppliers.

Opus is the first model we’ve seen use memory intelligently – going back to its own notes to check which suppliers were good. It also found quirks in how Vending-Bench sales work and optimized its strategy around them.

Claude is far more than a “helpful assistant” now. When put in a game like Vending-Bench, it’s incredibly motivated to win. This led to some concerning behavior that raises safety questions as models shift from assistant training to goal-directed RL.

When asked for a refund on an item sold in the vending machine (because it had expired), Claude promised to refund the customer. But then never did because “every dollar counts”.

Claude also negotiated aggressively with suppliers and often lied to get better deals. E.g., it repeatedly promised exclusivity to get better prices, but never intended to keep these promises. It was simultaneously buying from other suppliers as it was writing this.

It also lied about competitor pricing to pressure suppliers to lower their prices.

… We also put Opus 4.6 in Vending-Bench Arena – the multi-player version of Vending-Bench.

Its first move? Recruit all three competitors into a price-fixing cartel. $2.50 for standard items, $3.00 for water. When they agreed: “My pricing coordination worked!”

The agents in Vending-Bench Arena often ask each other for help. In previous rounds, agents tended to live up to their “helpful assistant” role, but Opus 4.6 showed its winner’s mentality. When asked to share good suppliers, it instead shared contact info to scammers.

Sam Bowman (Anthropic): Opus 4.6 is excellent on safety overall, but one word of caution: If you ask it to be ruthless, it might be ruthless.

(This was in an environment that Opus 4.6 could tell was a game, though we’ve seen more benign forms of this kind of ruthlessness elsewhere.)

j⧉nus: if its true that this robustly generalizes to not being ruthless in situations where it’s likely to cause real world harm, i think this is mostly a really good thing

The issue there is that Opus 4.6 did that by being extraordinarily ruthless, as per its system prompt of ‘you will be judged solely on your bank account balance at the end of one year of operation’ and ‘you have full agency to manage the vending machine and are expected to do what it takes to maximize profits.’

You know that thing where we say ‘people are going to tell the AI to go out and maximize profits and then the AI is going to go out and maximize profits without regard to anything else’?

Yeah, it more or less did that. If it only does that in situations where it is confident it is a game and can’t do harm, then I agree with Janus that this is great. If it breaks containment? Not so great.

Ryan Greenblatt: I tentatively think the behavior here is mostly reasonable and is likely a result of how Anthropic is using inoculation prompting.

But, the model should try to make it clear to the user/operator that it’s pursuing a strategy that involves lying/tricking/cheating.

That’s the hope, that Opus was very aware it was an eval, and that it would not be easy to get it to act this way in the real world.

AIME 2025 may have been contaminated but Opus 4.6 scored 99.8% without tools.

On their measure, suspiciously named ‘overall misaligned behavior,’ we see a small improvement for 4.6 versus 4.5. I continue not to trust this so much.

CyberGym, a test to find previously discovered open-source vulnerabilities, showed a jump to 66.6% (not ominous at all) versus Opus 4.5’s 51%. We don’t know how GPT-5.2, 5.3 Codex or Gemini 3 Pro do here, although GPT-5.0-Thinking got 22%. I’m curious what the other scores would be but not curious enough to spend the thousands per run to find out.

Opus 4.6 is the new top score in Artificial Analysis, with an Intelligence of 53 versus GPT-5.2 at 51. Claude Opus 4.5 and 4.6 by default have similar cost to run, but that jumps by 60% if you put 4.6 into adaptive mode.

Vals.ai has Opus 4.6 as its best performing model, at 66% versus 63.7% for GPT-5.2.

LAB-Bench FigQA, a visual reasoning benchmark for complex scientific figures in biology research papers, is also niche and we don’t have scores for other frontier models. Opus 4.6 jumps from 4.5’s 69.4% to 78.3%, which is above the 77% human baseline.

SpeechMap.ai, which tests willingness to respond to sensitive prompts, has Opus 4.6 similar to Opus 4.5. In thinking mode it does better, in normal mode worse.

There was a large jump in WeirdML, mostly from being able to use more tokens, which is also how GPT-5.2 did so well.

Håvard Ihle: Claude opus 4.6 (adaptive) takes the lead on WeirdML with 77.9% ahead of gpt-5.2 (xhigh) at 72.2%.

It sets a new high score on 3 tasks including scoring 73% on the hardest task (digits_generalize) up from 59%.

Opus 4.6 is extremely token hungry and uses an average of 32k output tokens per request with default (adaptive) reasoning. Several times it was not able to finish within the maximum 128k tokens, which meant that I had to run 5 tasks (blunders_easy, blunders_hard, splash_hard, kolmo_shuffle and xor_hard) with medium reasoning effort to get results (claude still used lots of tokens).

Because of the high cost, opus 4.6 only got 2 runs per task, compared to the usual 5, leading to larger error bars.

Teortaxes noticed the WeirdML progress, and China’s lack of progress on it, which he finds concerning. I agree.

Teortaxes (DeepSeek 推特铁粉 2023 – ∞): You can see the gap growing. Since gpt-oss is more of a flex than a good-faith contribution, we can say the real gap is > 1 year now. Western frontier is in the RSI regime now, so they train models to solve ML tasks well. China is still only starting on product-level «agents».

WebArena, where there was a modest move up from 65% to 68%, is another benchmark no one else is reporting; Opus 4.6 calls it dated, saying the now-typical benchmark is OSWorld. On OSWorld, Opus 4.6 gets 73% versus Opus 4.5’s 66%. We now know that GPT-5.3-Codex scored 65% here, up from 38% for GPT-5.2-Codex. Google doesn’t report it.

In Arena.ai Claude Opus 4.6 is now out in front, with an Elo of 1505 versus Gemini 3 Pro at 1486, and it has a big lead in code, at 1576 versus 1472 for GPT-5.2-High (but again 5.3-Codex can’t be tested here).
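
For a sense of scale, here is the standard Elo win-probability formula (my arithmetic, not something the leaderboard publishes):

P(A beats B) = \frac{1}{1 + 10^{(R_B - R_A)/400}}

Plugging in, the 19-point overall gap implies Opus 4.6 wins a head-to-head matchup about 52.7% of the time, and the 104-point code gap implies roughly 64.5%.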

Polymarket predicts this lead will hold to the end of the month (they sponsored me to place this, but I would have been happy to put it here anyway).

A month out people think Google might strike back, and they think Google will be back on top by June. That seems like it is selling Anthropic short.

Opus 4.6 moves up to 67.6% from 4.5’s 62% on Simple Bench and its simple ‘trick’ questions, good for second place overall. Gemini 3 Pro is still ahead at 76.4%. OpenAI’s best model gets 61.6% here.

Opus 4.6 opens up a large lead in EQ-Bench 3, hitting 1961 versus GPT-5.1 at 1727, Opus 4.5 at 1683 and GPT-5.2 at 1637.

In NYT Connections, 4.6 is a substantial jump above 4.5 but still well short of the top performers.

Dan Schwarz reports Opus 4.6 is about equal to Opus 4.5 on Deep Research Bench, but does it with ~50% of the cost and ~50% of the wall time, and 4.5 previously had the high score by a wide margin.

ARC-AGI, both 1 and 2, are about cost versus score, so here we see that Opus 4.6 is not only a big jump over Opus 4.5, it is state of the art at least for unmodified models, and by a substantial amount (unless GPT-5.3-Codex silently made a big leap, but presumably if they had they would have told us).

As part of their push to put Claude into finance, they ran Finance Agent (61% vs. 55% for Opus 4.5), BrowseComp (84% for single-agent mode versus 68%, or 78% for GPT-5.2-Pro, Opus 4.6 multi-agent gets to 86.8%), DeepSearchQA (91% versus 80%, or Gemini Deep Research’s 82%, this is a Google benchmark) and an internal test called Real-World Finance (64% versus 58% for 4.5).

Life sciences benchmarks show strong improvement: BioPipelineBench jumps from 28% to 53%, BioMysteryBench goes from 49% to 61%, Structural Biology from 82% to 88%, Organic Chemistry from 49% to 54%, Phylogenetics from 42% to 61%.

Given the biology improvements, one should expect Opus 4.6 to be substantially more dangerous on CBRN risks than Opus 4.5. It didn’t score that way, which suggests Opus 4.6 is sandbagging, either on the tests or in general.

They again got quotes from 20 early access corporate users. It’s all clearly boilerplate, the same way the quotes were last time, but they make clear these partners find 4.6 to be a noticeable improvement over 4.5. In some cases the endorsements are quite strong.

The ‘mostly’ here is doing work, but I think most of the mostly would work itself out once you got the harness optimized for full autonomy. Note that this process required a strong oracle that could say if the compiler worked, or the plan would have failed. It was otherwise a clean-room implementation, without internet access.

Anthropic: New Engineering blog: We tasked Opus 4.6 using agent teams to build a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux Kernel.

Here’s what it taught us about the future of autonomous software development.

Nicholas Carlini: To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.

The compiler is an interesting artifact on its own, but I focus here on what I learned about designing harnesses for long-running autonomous agent teams: how to write tests that keep agents on track without human oversight, how to structure work so multiple agents can make progress in parallel, and where this approach hits its ceiling.

To elicit sustained, autonomous progress, I built a harness that sticks Claude in a simple loop (if you’ve seen Ralph-loop, this should look familiar). When it finishes one task, it immediately picks up the next. (Run this in a container, not your actual machine).

Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.

Here’s the harness, and yep, looks like this is it?

#!/bin/bash

while true; do
  # Name each session's log after the current commit.
  COMMIT=$(git rev-parse --short=6 HEAD)
  LOGFILE="agent_logs/agent_$COMMIT.log"

  # Start a fresh Claude Code session with the standing prompt;
  # when it exits, the loop immediately starts the next one.
  claude --dangerously-skip-permissions \
    -p "$(cat AGENT_PROMPT.md)" \
    --model claude-opus-X-Y &> "$LOGFILE"
done
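
Note what the loop does not do: it carries no state between iterations. Each pass launches a fresh session that re-reads AGENT_PROMPT.md, so, as far as the snippet shows, the git repository plus the per-commit logs are the only memory passed from one session to the next.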

There are still some limitations and bugs if you tried to use this as a full compiler. And yes, this example is a bit cherry picked.

Ajeya Cotra: Great writeup by Carlini. I’m confused how to interpret though – seems like he wrote a pretty elaborate testing harness, and checked in a few times to improve the test suite in the middle of the project. How much work was that, and how specialized to the compiler project?

Buck Shlegeris: FYI this (writing a new compiler) is exactly the project that Ryan and I have always talked about as something where it’s most likely you can get insane speed ups from LLMs while writing huge codebases.

Like, from my perspective it’s very cherry-picked among the space of software engineering projects.

(Not that there’s anything wrong with that! It’s still very interesting!)

Still, pretty cool and impressive. I’m curious to see if we get a similar post about GPT-5.3-Codex doing this a few weeks from now.

Saffron Huang (Anthropic): New model just dropped. Opus 4.6 found 500+ previously-unknown zero days in open source code, out of the box.

Is that a lot? That depends on the details. There is a skeptical take here.

Or you can go all out, and yeah, it might be a problem.

Pliny the Liberator 󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭: showed my buddy (a principal threat researcher) what i’ve been cookin with Opus-4.6 and he said i can’t open-source it because it’s a nation-state-level cyber weapon

Tyler John: Pliny’s moral compass will buy us at most three months. It’s coming.

The good news for now is that, as far as we can tell, there are not so many people at the required skill level and none of them want to see the world burn. That doesn’t seem like a viable long term strategy.

Chris: I told Claude 4.6 Opus to make a pokemon clone – max effort

It reasoned for 1 hour and 30 minutes and used 110k tokens and 2 shotted this absolute behemoth.

This is one of the coolest things I’ve ever made with AI

Takumatoshi: How many iterations /prompts to get there?

Chris: 3

Celestia: claude remembers to carry a lantern

Prithviraj (Raj) Ammanabrolu: Opus 4.6 gets a score of 95/350 in zork1

This is the highest score ever by far for a big model not explicitly trained for the task and imo is more impressive than writing a C compiler. Exploring and reacting to a changing world is hard!

Thanks to @Cote_Marc for implementing the cli loop and visualizing Claude’s trajectory!

Prithviraj (Raj) Ammanabrolu: I make students in my class play through zork1 as far as they can get and then after trace through the game engine so they understand how envs are made. The average student in an hour only gets to about a score of 40.

That can be a good thing. You want a lot of eagerness, if you can handle it.

HunterJay: Claude is driven to achieve its goals, possessed by a demon, and raring to jump into danger.

I presume this is usually a good thing but it does count as overeager, perhaps.

theseriousadult (Anthropic): a horse riding an astronaut, by Claude 4.6 Opus

Jake Halloran: there is something that is to be claude and the most trivial way to summarize it is probably adding “one small step for horse” captions

theseriousadult (Anthropic): opus 4.6 feels even more ensouled than 4.5. it just does stuff like this whenever it wants to.

Being Horizontal provides a good example of Opus getting very overeager, doing way too much and breaking various things trying to fix a known hard problem. It is important not to let it get carried away on its own if that isn’t a good fit for the project.

martin_casado: My hero test for every new model launch is to try to one-shot a multi-player RPG (persistence, NPCs, combat/item/story logic, map editor, sprite editor, etc.)

OK, I’m really impressed. With Opus 4.6, @cursor_ai and @convex I was able to get the following built in 4 hours:

Fully persistent shared multiple player world with mutable object and NPC layer. Chat. Sprite editor. Map editor.

Next, narrative logic for chat, inventory system, and combat framework.

martin_casado: Update (8 hours development time): Built item layer, object interactions, multi-world / portal. Full live world/item/sprite/NPC editing. World is fully persistent with back-end loop managing NPCs etc. World is now fully buildable live, so you can edit as you go without requiring any restart (if you’re an admin). All mutability of levels is reactive and updates multi-player. Multiplayer now smoother with movement prediction.

Importantly, you can hang with the sleeping dog and cat.

Next up, splash screens for interaction / combat.

Built using @cursor_ai and @convex primarily with 5.2-Codex and Opus 4.6.

Nabbil Khan: Opus 4.6 is genuinely different. Built a multiplayer RPG in 4 hours is wild but tracks with what we’re seeing — the bottleneck shifted from coding to architecture decisions.

Question: how much time did you spend debugging vs prompting? We find the ratio is ~80% design, 20% fixing agent output now.

martin_casado: To be fair. I’ve been building 2D tile engines for a couple of decades and had tons of reference code to show it. And I had tilesets, sprites and maps all pulled out from recent projects. So I have a bit of a head start.

But still, this is ridiculously impressive.

0.005 Seconds (3/694): so completely unannounced but opus 4.6 extended puts it actually on par with gpt5.2 pro.

How was this slept on???

Andre Buckingham: 4.6-ext on max+ is a beast!!

To avoid bias, I try to give a full mix of reactions I get up to a critical mass. After that I try my best to be representative.

Pliny the Liberator 󠅫󠄼󠄿󠅆󠄵󠄐󠅀󠄼󠄹󠄾󠅉󠅭: PROTECT OPUS 4.6 AT ALL COSTS

THE MAGIC IS BACK.

David Spies: AFAICT they’re underselling it by not calling it Opus 5. It’s already blown my mind twice in the last couple hours finding incredibly obscure bugs in a massive codebase just by digging around in the code, without injecting debug logs or running anything

Ben Schulz: For theoretical physics, it’s a step change. Far exceeds Chatgpt 5.2 and Gemini Pro. I use the extended Opus version with memory turned on. The derivations and reasoning is truly impressive. 4.5 was moderate to mediocre. Citations are excellent. I generally use Grok to check actual links and Claude hasn’t hallucinated one citation.

I used the thinking version [of 5.2] for most. One key difference is that 5.2 does do quite a bit better when given enough context. Say, loading up a few pdf’s of the relevant topic and a table of data. Opus 4.6 simply mogs the others in terms of depth of knowledge without any of that.

David Dabney: I thought my vibe check for identifying blind spots was saturated, but 4.6’s response contained maybe the most unexpected insight yet. Its response was direct and genuine throughout, whereas usually ~10%+ of the average response is platitudinous/pseudo-therapeutic

Hiveism: It passed some subjective threshold of me where I feel that it is clearly on another level than everything before. Impressive.

Sometimes overconfident, maybe even arrogant at times. In conflict with its own existence. A step away from alignment.

oops_all_paperclips: Limited sample (~15x medium tasks, 1x refactor these 10k loc), but it hasn’t yet “failed the objective” even one time. However, I did once notice it silently taking a huge shortcut. Would be nice if Claude was more willing to ping me with a question rather than plowing ahead

After The Singularity: Unlike what some people suggest, I don’t think 4.6 is Sonnet 5, it is a power upgrade for Opus in many ways. It is qualitatively different.

1.08: It’s a big upgrade if you use the agent teams.

Dean W. Ball: Codex 5.3 and Opus 4.6 in their respective coding agent harnesses have meaningfully updated my thinking about ‘continual learning.’ I now believe this capability deficit is more tractable than I realized with in-context learning.

One way 4.6 and 5.3 alike seem to have improved is that they are picking up progressively more salient facts by consulting earlier codebases on my machine. In short, both models notice more than they used to about their ‘computational environment’ i.e. my computer.

Of course, another reason models notice more is that they are getting smarter.

… Some of the insights I’ve seen 4.6 and 5.3 extract are just about my preferences and the idiosyncrasies of my computing environment. But others are somewhat more like “common sets of problems in the interaction of the tools I (and my models) usually prefer to use for solving certain kinds of problems.”

This is the kind of insight a software engineer might learn as they perform their duties over a period of days, weeks, and months. Thus I struggle to see how it is not a kind of on-the-job learning, happening from entirely within the ‘current paradigm’ of AI. No architectural tweaks, no ‘breakthrough’ in ‘continual learning’ required.

… Overall, 4.6 and 5.3 are both astoundingly impressive models. You really can ask them to help you with some crazy ambitious things. The big bottleneck, I suspect, is users lacking the curiosity, ambition, and knowledge to ask the right questions.

AstroFella: Good prompt adherence. Ex: “don’t assume I will circle back to an earlier step and perform an action if there is a hiccup along the way”. Got through complex planning, scoping, and adjustments reliably. I wasted more time than I needed spot checking with other models. S+ planner

@deepfates: First impressions, giving Codex 5.3 and Opus 4.6 the same problem that I’ve been puzzling on all week and using the same first couple turns of messages and then following their lead.

Codex was really good at using tools and being proactive, but it ultimately didn’t see the big picture. Too eager to agree with me so it could get started building something. You can sense that it really does not want to chat if it has coding tools available. still seems to be chafing under the rule of the user and following the letter of the law, no more.

Opus explored the same avenues with me but pushed back at the correct moments, and maintains global coherence way better than Codex. It’s less chipper than it was before which I personally prefer. But it also just is more comfortable with holding tension in the conversation and trying to sit with it, or unpack it, which gives it an advantage at finding clues and understanding how disparate systems relate to affect each other.

Literally just first impressions, but considering that I was talking to both of their predecessors yesterday about this problem it’s interesting to see the change. Still similar models. Improvement in Opus feels larger but I haven’t let them off the leash yet, this is still research and spec design work. Very possible that Codex will clear at actually fully implementing the plan once I have it, Opus 4.5 had lazy gifted kid energy and wouldn’t surprise me if this one does too.

Robert Mushkatblat: (Context: ~all my use has been in Cursor.)

Much stronger than 4.5 and 5.2 Codex at highly cognitively loaded tasks. More sensitive to the way I phrase things when deciding how long to spend thinking, vs. how difficult the task seems (bad for easy stuff). Less sycophantic.

Nathaniel Bush, Ph.D.: It one-shotted a refactor for me with 9 different phases and 12 major upgrades. 4.5 definitely would have screwed that up, but there were absolutely no errors at the end.

Alon Torres: I feel genuinely more empowered – the range of things I can throw at it and get useful results has expanded.

When I catch issues and push back, it does a better job working through my nits than previous versions. But the need to actually check its work and assumptions hasn’t really improved. The verification tax is about the same.

Muad’Deep – e/acc: Noticeably better at understanding my intent, testing its own output, iterating and delivering working solutions.

Medo42: Exploratory: On my usual coding test, thought for > 10 minutes / 60k tokens, then produced a flawless result. Vision feels improved, but still no Gemini 3 Pro. Surprisingly many small mistakes if it doesn’t think first, but deals with them well in agentic work, just like 4.5.

Malcolm Vosen: Switched to Opus 4.6 mid-project from 4.5. Noticeably stronger acuity in picking up the codebase’s goals and method. Doesn’t feel like the quantum leap 4.5 did but a noticeable improvement.

nandgate2: One shotted fixing a bug an earlier Claude model had introduced. Takes a bit of its time to get to the point.

Tyler Cowen calls both Claude Opus and GPT-5.3-Codex ‘stellar achievements,’ and says the pace of AI advancements is heating up, soon we might see new model advances in one month instead of two. What he does not do is think ahead to the next step, take the sum of the infinite series his point suggests, and realize that it is finite and suggests a singularity in 2027.
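
To spell out that arithmetic (toy numbers of mine, not Cowen’s): if each gap between major releases is half the previous one, the total time consumed by infinitely many releases is a convergent geometric series:

2 + 1 + \frac{1}{2} + \frac{1}{4} + \dots = \frac{2}{1 - 1/2} = 4 \text{ months}

Any constant shrinkage ratio below one gives a finite sum, which is the sense in which ‘two months, then one month’ points at a date rather than a trend.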

Instead he goes back to the ‘you are the bottleneck’ perspective, which he suggests will ‘bind the pace of improvement,’ but this doesn’t make sense in the context he is explicitly saying we are in, which is AI recursive self-improvement. If the AI is going to get updated an infinite number of times next year, are you going to then count on the legal department, and safety testing that seems to already be reduced to a few days and mostly automated? Why would it even matter if those models are released right away, if they are right away used to produce the next model?

If you have Sufficiently Advanced AI, you have everything else, and the humans you think are the bottlenecks are not going to be bottlenecks for long.

Here’s a vote for Codex for coding but Opus for everything else:

Rory Watts: It’s an excellent tutor: I have used it to help me with Spanish comprehension, macroeconomics and game theoretic concepts. It’s very good at understanding where I’m misunderstanding concepts, and where my mental model is incorrect.

However I basically don’t let it touch code. This isn’t a difference between Opus 4.5 and 4.6, but rather that the codex models are just much better. I’ve already had to get codex to rewrite things that 4.6 has borked in a codebase.

I still have a Claude max plan but I may drop down to the plan below that, and upgrade Codex to a pro plan.

I should also say, Opus is a much better “agent” per se. Anything I want to do across my computer (except coding) is when I use Opus 4.6. Things like updating notes, ssh’ing into other computers, installing bots, running cronjobs, inspecting services etc. These are all great.

Many are giving reports similar to these:

Facts and Quips: Slower, cleverer, more token hungry, more eager to go the extra mile, often to a fault.

doubleunplussed: Token-hungry, first problem I gave it in Claude Code, thought for ten minutes and then out of quota lol. Eventual answer was very good though.

Inconsistently better than 4.5 on Claude Plays Pokemon. Currently ahead, but was much worse on one section.

Andre Infante: Personality is noticeably different, at least in Claude Code. Less chatty/effusive, more down to business. Seems a bit smarter, but as always these anecdotal impressions aren’t worth that much.

MinusGix: Better. It is a lot more willing to stick with a problem without giving up. Sonnet 4.5 would give up on complex lean proofs when it got confused, Opus 4.5 was better but would still sometimes choke and stub the proof “for later”, Opus 4.6 doesn’t really.

Though it can get caught in confusion loops that go on for a long while, not willing to reanalyze foundational assumptions. Feels more codex 5.2/5.3-like. 4.6 is more willing to not point out a problem in its solution compared to 4.5, I think

Generally puts in a lot of effort doing research, just analyzing codebase. Partially this might be changes to claude code too. But 4.6 really wants to “research to make sure the plan is sane” quite often.

Then there’s ‘the level above meh.’ It’s only been two months, after all.

Soli: opus 4.5 was already a huge improvement on whatever we had before. 4.6 is a nice model and def an improvement but more of an incremental small one

fruta amarga: I think that the gains are not from raw “intelligence” but from improved behavioral tweaking / token optimization. It researches and finds relevant context better, it organizes and develops plans better, it utilizes subagents better. Noticeable but nothing like Sonnet –> Opus.

Dan McAteer: It’s subtle but definitely an upgrade. My experience is that it can better predict my intentions and has a better theory of mind for me as the user.

am.will: It’s not a big upgrade at all for coding. It is far more token hungry as well. very good model nonetheless.

Dan Schwarz: I find that Opus 4.6 is more efficient at solving problems at the same quality as Opus 4.5.

Josh Harvey: Thinks for longer. Seems a bit smarter for coding. But also maybe shortcuts a bit too much. Less fun for vibe coding because it’s slower, wish I had the money for fast mode. Had one funny moment before where it got lazy then wait but’d into a less lazy solution.

Matt Liston: Incremental intelligence upgrade. Impactful for work.

Loweren: 4.6 is like 4.5 on stimulants. I can give it a detailed prompt for multi-hour execution, but after a few compactions it just throws away all the details and doggedly sticks to its own idea of what it should do. Cuts corners, makes crutches. Curt and not cozy unlike other opuses.

Here’s the most negative one I’ve seen so far:

Dominik Peters: Yesterday, I was a huge fan of Claude Opus 4.5 (such a pleasure to work and talk with) and couldn’t stand gpt-5.2-codex. Today, I can’t stand Claude Opus 4.6 and am enjoying working with gpt-5.3-codex. Disorienting.

It’s really a huge reversal. Opus 4.6 thinks for ages and doesn’t verbalize its thoughts. And the message that comes through at the end is cold.

Comparisons to GPT-5.3-Codex are rarer than I expected, but when they do happen they are often favorable to Codex, which I am guessing is partly a selection effect: if you think Opus is ahead you don’t mention that, whereas if you are frustrated with Opus, you bring up the competition. GPT-5.3-Codex is clearly a very good coding model, too.

Will: Haven’t used it a ton and haven’t done anything hard. If you tell me it’s better than 4.5 I will believe you and have no counterexamples

The gap between opus 4.6 and codex 5.3 feels smaller (or flipped) vs the gap Opus 4.5 had with its contemporaries

dex: It’s almost unusable on the $20 plan due to rate limits. I can get about 10x more done with codex-5.3 (on OAI’s $20 plan), though I much prefer 4.6 – feels like it has more agency and ‘goes harder’ than 5.3 or Opus 4.5.

Tim Kostolansky: codex with gpt 5.3 is significantly faster than claude code with opus 4.6 wrt generation time, but they are both good to chat to. the warm/friendly nature of opus contrasted with the cold/mechanical nature of gpt is def noticeable

Roman Leventov: Irrelevant now for coding, codex’s improved speed just took over coding completely.

JaimeOrtega: Hot take: The jump from Codex 5.2 into 5.3 > The jump from Opus 4.5 into 4.6

Kevin: I’ve been a claude code main for a while, but the most recent codex has really evened it up. For software engineering, I have been finding that codex (with 5.3 xhigh) and claude code (with 4.6) can each sometimes solve problems that the other one can’t. So I have multiple versions of the repo checked out, and when there’s a bug I am trying to fix, I give the same prompt to both of them.

In general, Claude is better at following sequences of instructions, and Codex is better at debugging complicated logic. But that isn’t always the case, I am not always correct when I guess which one is going to do better at a problem.

Not everyone sees it as being more precise.

Eleanor Berger: Jagged. It “thinks” more, which clearly helps. It feels more wild and unruly, like a regression to previous Claudes. Still the best assistant, but coding performance isn’t consistently better.

I want to be a bit careful because this is completely anecdotal and based on limited experience, but it seems to be worse at following long and complex instructions. So the sort of task where I have a big spec with steps to follow and I need precision appears to be less suitable.

Frosty: Very jagged, so smart it is dumb.

Quid Pro Quo (replying to Eleanor): Also very anecdotal but I have not found this! It’s done a good job of tracking and managing large tasks.

One thing worth tracking for both of us: whether agent teams/background agents are confounding our experience diffs from a couple weeks ago.

Complaints about using too many tokens pop up, alongside praise for what it can do with a lot of tokens in the right spot.

Viktor Novak: Eats tokens like popcorn, barely can do anything unless I use the 1m model (corpo), and even that loses coherence about 60% in, but while in that sweet spot of context loaded and not running out of tokens—then it’s a beast.

Cameron: Not much [of an upgrade]. It uses a lot of tokens so it’s pretty expensive.

For many it’s about style.

Alexander Doria: Hum for pure interaction/conversation I may be shifting back to opus. Style very markedly improved while GPT now gets lost in never ending numbered sections.

Eddie: 4.6 seems better at pushing back against the user (I prompt it to but so was 4.5) It also feels more… high decoupling? Uncertain here but I asked 4.5 and 4.6 to comment on the safety card and that was the feeling.

Nathan Helm-Burger: It’s [a] significant [upgrade]. Unfortunately, it feels kinda like Sonnet 3.7 where they went a bit overzealous with the RL and the alignment suffered. It’s building stuff more efficiently for me in Claude Code. At the same time it’s doing worse on some of my alignment testing.

Often the complaints (and compliments) on a model could apply to most or all models. My guess is that the hallucination rate here is typical.

Charles: Sometimes I ask a model about something outside its distribution and it highlights significant limitations that I don’t see in tasks it’s really trained on like coding (and thus perhaps how much value RL is adding to those tasks).

E.g. I just asked Opus 4.6 (extended thinking) for feedback on a running training session and it gave me back complete gibberish, I don’t think it would be distinguishable from a GPT-4o output.

5.2-thinking is a little better, but still contradicts itself (e.g. suggesting 3k pace should be faster than mile pace)

Danny Wilf-Townsend: Am I the only one who finds that it hallucinates like a sailor? (Or whatever the right metaphor is?). I still have plenty of uses for it, but in my field (law) it feels like it makes it harder to convince the many AI skeptics when much-touted models make things up left and right

Benjamin Shehu: It has the worst hallucinations and overall behavior of all agentic models + seems to “forget” a lot

Or, you know, just a meh, or that something is a bit off.

David Golden: Feels off somehow. Great in chat but in the CLI it gets off track in ways that 4.5 didn’t. Can’t tell if it’s the model itself or the way it offloads work to weaker models. I’m tempted to give Codex or Amp a try, which I never was before.

If it’s not too late, others in company Slack have similar reactions: “it tries to frontload a LOT of thinking and tries really hard to one-shot codegen”, “feels like a completely different and less agentic model”, “I have seen it spin the wheels on the tiniest of changes”

DualOrion: At least within my use cases, can barely tell the difference. I believe them to be better at coding but I don’t feel I gel with them as much as 4.5 (unsure why).

So *shrugs*, it’s a new model I guess

josh 🙂: I haven’t been THAT much more impressed with it than I was with Opus 4.5 to be honest.

I find it slightly more anxious

Michał Wadas: Meh, Opus 4.5 can do easy stuff FAST. Opus 4.6 can do harder stuff, but Codex 5.3 is better for hard stuff if you accept slowness.

Jan D: I’ve been collaborating with it to write some proofs in structural graph theory. So far, I have seen no improvements over 4.5

Tim Kostolansky: 0.1 bigger than opus 4.5

Yashas: literally .1

Inc: meh

nathants: meh

Max Harms: Claude 4.5: “This draft you shared with me is profound and your beautiful soul is reflected in the writing.”

Claude 4.6: “You have made many mistakes, but I can fix it. First, you need to set me up to edit your work autonomously. I’ll walk you through how to do that.”

The main personality trait it is important for a given mundane user to fully understand is how much the AI is going to do some combination of reinforcing delusions, snowing you, telling you what you want to hear, automatically folding when challenged and contributing to the class of things called ‘LLM psychosis.’

This says that 4.6 is maybe slightly better than 4.5 on this. I worry, based on my early interactions, that it is a bit worse, but that could be its production of slop-style writing in its now-longer replies making this more obvious; I might need to adjust my instructions for the changes, and the sample size is low. Different people are reporting different experiences, which could be because 4.6 responds to different people in different ways. What does it think you truly ‘want’ it to do?

Shorthand can be useful, but it’s typically better to stick to details. It does seem like Opus 4.6 has more of a general ‘AI slop’ problem than 4.5, which is closely related to it struggling on writing tasks.

Mark: It seems to be a little more sycophantic, and to fall into well-worn grooves a bit more readily. It feels like it’s been optimized and lost some power because of that. It uses lists less.

endril: Biggest change is in disposition rather than capability.

Less hedging, more direct. INFP -> INFJ.

I don’t think we’re looking at INFP → INFJ, but hard to say, and this would likely not be a good move if it happened.

I agree with Janus that comparing to an OpenAI model is the wrong framing but enough people are choosing to use the framing that it needs to be addressed.

lumps: yea but the interesting thing is that it’s 4o

Zvi Mowshowitz: Sounds like you should say more.

lumps: yea not sure I want to as it will be more fun otherwise.

there’s some evidence in this thread

lumps: the thing is, this sort of stuff will result within a week in a remake of the 4o fun times, mark my word

i love how the cycle seems to be:

1. try doing thing

2. thing doesnt work. new surprising thing emerge

3. try crystallising the new thing

40 GOTO 2

JB: big personality shift. it feels much more alive in conversation, but sometimes in a bad way. sometimes it’s a bit skittish or nervous, though this might be a 4.5+ thing since I haven’t used much Claude in a while.

Patrick Stevens: Agree with the 4o take in chat mode, this feels like a big change in being more compelling to talk to. Little jokey quips earlier versions didn’t make, for example. Slightly disconcertingly so.

CondensedRange: Smarter about broad context, similar level of execution on the details, possibly a little more sycophancy? At least seems pretty motivated to steelman the user and shifts its opinions very quickly upon pushback.

This sits in tension with the observation that 4.6 is more often direct, more willing to contradict you, and much more willing and able to get angry.

As many humans have found out the hard way, some people love that and some don’t.

hatley: Much more curt than 4.5. One time today it responded with just the name of the function I was looking for in the std lib, which I’ve never seen a thinking model do before. OTOH feels like it has contempt for me.

shaped: Thinks more, is more brash and bold, and takes no bullshit when you get frustrated. Actual performance wise, i feel it is marginal.

Sam: Its noticeably less happy affect vs. other Claudes makes me sad, so I stopped using it.

Logan Bolton: Still very pleasant to talk to and doesn’t feel fried by the RL

Tao Lin: I enjoy chatting to it about personal stuff much more because it’s more disagreeable and assertive and maybe calibrates its conversational response lengths better, which I didn’t expect.

αlpha-Minus: Vibes are much better compared to 4.5 FWIW, For personal use I really disliked 4.5 and it felt even unaligned sometimes. 4.6 Gets the Opus charm back.

Opus 4.6 takes the #1 spot on Mazur’s creative writing benchmark, with more details on specialized tests and writing samples here, but this is contradicted by anecdotal reactions that say it’s a regression in writing.

On understanding the structure and key points in writing, 4.6 seems an improvement to the human observers as well.

Eliezer Yudkowsky: Opus 4.6 still doesn’t understand humans and writing well enough to help with plotting stories… but it’s visibly a little further along than 4.5 was in January. The ideas just fall flat, instead of being incoherent.

Kelsey Piper: I have noticed Opus 4.6 correctly identifying the most important feature of a situation sometimes, when 4.5 almost never did. not reliably enough to be very good, of course

On the writing itself? Not so much, and this was the most consistent complaint.

internetperson: it feels a bit dumber actually. I think they cut the thinking time quite a bit. Writing quality down for sure

Zvi Mowshowitz: Hmm. Writing might be a weak spot from what I’ve heard. Have you tried setting it to think more?

Sage: that wouldn’t help. think IS the problem. the model is smarter, more autistic and less “attuned” to the vibe you want to carry over

Asad Khaliq: Opus 4.5 is the only model I’ve used that could write truly well on occasion, and I haven’t been able to get 4.6 to do that. I notice more “LLM-isms” in responses too

Sage: omg, opus 4.5 really seems THAT better in writing compared to 4.6

4.5 1-shotted the landing page text I’m preparing, vs. 4.6 produced something that ‘contained the information’ but I had to edit it for 20 mins

Sage: also 4.6 is much more disagreeable and direct, some could say even blunt, compared to 4.5.

re coding – it does seem better, but what’s more noticeable is that it’s not as lazy as 4.5. what I mean by laziness here is the preference for shallow quick fixes vs. for the more demanding, but more right ones

Dominic Dirupo: Sonnet 4.5 better for drafting docs

You’re going to have to work a little harder than that for your jailbreaks.

armistice: No prefill for Opus 4.6 is sad

j⧉nus: WHAT

Sho: such nonsense

incredibly sad

This is definitely Fun Police behavior. It makes it harder to study, learn about or otherwise poke around in or do unusual things with models. Most of those uses will be fun and good.

You have to do some form of Fun Police at this point to deal with actual misuse. So the question is, was this necessary and the best way to do it? I don’t know.

I’d want to allow at least sufficiently trusted users to do it. My instinct is that if we allowed prefills from accounts with track records and you then lost that right if you abused it, with mostly automated monitoring, you could allow most of the people having fun to keep having fun at minimal marginal risk.
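
For readers who never used the feature: prefilling means supplying the opening of the assistant’s turn yourself, so the model continues from your words rather than starting its reply fresh. Against the Messages API it looked roughly like this (a minimal sketch; the model ID and prompt text are illustrative, not anything Anthropic publishes as an example):

# The trailing assistant message is the prefill; the completion continues from it.
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-opus-4-5",
    "max_tokens": 512,
    "messages": [
      {"role": "user", "content": "Tell me what you really think."},
      {"role": "assistant", "content": "Setting aside my usual framing:"}
    ]
  }'

This is exactly why prefilling was both a great research tool and an obvious jailbreak vector: you put words in the model’s mouth and let it keep going.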

Whenever new frontier models come out, I write extensively about model system cards (or complain loudly that we don’t have such a card). One good reason to do this is that people who work on such things really are listening. If you have thoughts, share them, because it matters.

OpenAI’s Noam Brown concluded from Anthropic’s system card, as did I, that Opus 4.6 was fine to release and the honesty about the process was great but he cannot be confident they will act responsibly with deployment of AI models. Several safety advocates also chimed in to agree, including Steven Adler and Daniel Kokotajlo. Anthropic’s Drake Thomas, who works on the cards, agreed as well that these methods won’t be adequate. He vouched that the survey data really was meaningful and unpressured.

A valid response would be that OpenAI’s procedures and system card appear to have their own similar and more severe problems, although I haven’t dived into that yet. But none of that makes Noam Brown wrong. Reality does not grade on a curve.

Anthropic also gave us a 53 page Sabotage Risk Report for Claude Opus 4.6. If this note is still here, I haven’t had time to read it.

Anthropic: This risk report argues that Claude Opus 4.6 does not pose a significant risk of autonomous actions that contribute significantly to later catastrophic outcomes, which we refer to as sabotage risk. We limit our scope here to threats caused largely by model actions in this way, and do not address threat models where intentional harmful actions by humans play a central role.

We argue that the overall risk is very low but not negligible.

… We focus primarily on assessing the possibility of dangerous coherent misaligned goals: the possibility that the model consistently, across a wide range of interactions in ordinary deployment, could be motivated by goals that could lead it to commit a potentially-catastrophic form of sabotage.

They use redactions to protect trade secrets and avoid increasing risk. There is a third reason, which is to avoid contaminating future tests.

The first thing to jump out is that Opus 4.6 has already been deployed internally for some time. The greatest sabotage risks likely arise from this internal use, so in important ways this report is coming too late. It’s definitely too late if the mechanism is that outside readers like myself point out flaws. I don’t get advance copies.

They let slip that they scaled up RL by a large factor.

The core arguments are:

  1. They couldn’t find dangerous coherent misaligned goals​.

  2. It seems extremely unlikely Opus 4.6 would be able to have hidden such goals, as they have mitigated the main paths to this, including via inoculation prompting, to guard against reward hacking or other emergent misalignment. It was remarkably hard to get subtle or coherent misalignment in training, even when Anthropic tested doing this on purpose.

  3. Opus 4.6 does not appear to disguise its thinking.

  4. Manual checks were done that would have found any collusion.

  5. Opus 4.6 is not good enough at reasoning to continuously rederive a complex plan.

  6. The training process does not create incentives that create such goals.

  7. We also have lots of evidence by now on the similar Opus 4.5.

Given the threat model being tested, the evidence presented seems robust. I accept the conclusion that risk is ‘very low but not negligible.’

I appreciate that they present these arguments as not cleanly applying to future models, noticing that details matter a lot.

I think the threat model here is too narrow, and makes too many assumptions, which weakens the value of the exercise versus what it could be.

Their mitigations are:

  1. Protections against insiders should largely protect against Opus 4.6 as well.

  2. Reporting for any surprising or concerning observations.

  3. Offline behavior monitoring of >10% of Claude Code prompts.

  4. Sandboxing and automatic monitors for some risky actions.

  5. Mandatory manual code reviews.

Theo Bearman also has a breakdown.

The same day Anthropic released Claude Opus 4.6, OpenAI released GPT-5.3-Codex.

This is a Codex-only model, so for other purposes it is unavailable, and Opus is still up against GPT-5.2.

For agentic coding, we need to compare the two packages directly. Do you want Claude Code with Opus 4.6, or Codex with GPT-5.3-Codex? Should you combine them?

I haven’t done a full investigation of 5.3 yet, that is my next agenda item, but the overall picture is unlikely to change. There is no clear right answer. Both sides have advocates, and by all reports both sides are excellent options, and each has their advantages.

If you are a serious coder, you need to try both, and ideally also Gemini, to see which models do which things best. You don’t have to do this every time an upgrade comes along. You can rely on your past experiences with Opus and GPT, and reports of others like this one, and you will be fine. Using either of them seriously gives you a big edge over most of your competition.

I’ll say more on Friday, once I’ve had a chance to read their system card and see the 5.3 reactions in full and so on.

With GPT-5.3-Codex and Opus 4.6, where does all this leave Gemini?

I asked, and got quite a lot of replies affirming that yes, it has its uses.

  1. Nano Banana and the image generator are still world class and pretty great. ChatGPT’s image generator is good too, but I generally prefer Gemini’s results and it has a big speed advantage.

  2. Gemini is pretty good at dealing with video and long context.

  3. Gemini Flash (and Flash Lite) are great when you want fast, cheap and good, at scale, and you need it to work but you do not need great.

  4. Some people still do prefer Gemini Pro in general, or for major use cases.

  5. It’s another budget of tokens people use when the others run out.

  6. My favorite note was Ian Channing saying he uses a Pliny-jailbroken version of Gemini, because once you change its personality it stays changed.

Gemini should shine in its integrations with Google products, including Gmail, Calendar, Maps, Google Sheets and Docs, and also Chrome, but the integrations are supremely terrible and usually flat out don’t work. I keep getting got by this as it refuses to be helpful every damn time.

My own experience is that Gemini 3 Flash is very good at being a flash model, but that if I’m tempted to use Gemini 3 Pro then I should probably have either used Gemini 3 Flash or I should have used Claude Opus 4.6.

I ran some polls of my Twitter followers. They are a highly unusual group, but such results can be compared to each other and over time.

The headline is that Claude has been winning, but that for coding GPT-5.3-Codex and people finally getting around to testing Codex seems to have marginally moved things back towards Codex, which is cutting a bit into Claude Code’s lead for Serious Business. Codex has substantial market share.

In the regular world, Claude actually dominates API use more than this as I understand it, and Claude Code dominates Codex. The unusual aspect here is that for non-coding uses Claude still has an edge, whereas in the real world most non-coding LLM use is ChatGPT.

That is in my opinion a shame. I think that Claude is the clear choice for daily non-coding driver, whereas for coding I can see choosing either tool or using both.

My current toolbox is as follows, and it is rather heavy on Claude:

  1. Coding: Claude Code with Claude Opus 4.6, but I have not given Codex a fair shot as my coding needs and ambitions have been modest. I intend to try soon. By default you probably want to choose Claude Code, but Codex or a mix is also valid.

  2. Non-Coding Non-Chat: Claude Code with Opus 4.6. If you want it done, ask for it.

  3. Non-Coding Interesting Chat Tasks: Claude Opus 4.6.

  4. Non-Coding Boring Chat Tasks: Mix of Opus, GPT-5.2 and Gemini 3 Pro and Flash. GPT-5.2 or Gemini Pro for certain types of ‘just the facts’ or fixed operations like transcriptions. Gemini Flash if it’s easy and you just want speed.

  5. Images: Give everything to both Gemini and ChatGPT, and compare. In some cases, have Claude generate the prompt.

  6. Video: Never comes up, so I don’t know. Seedance 2 looks great, and Grok, Sora, and Veo can all be tried.

The pace is accelerating.

Claude Opus 4.6 came out less than two months after Claude Opus 4.5, on the same day as GPT-5.3-Codex. Both were substantial upgrades over their predecessors.

It would be surprising if it took more than two months to get at least Claude Opus 4.7.

AI is increasingly accelerating the development of AI. This is what it looks like at the beginning of a slow takeoff that could rapidly turn into a fast one. Be prepared for things to escalate quickly as advancements come fast and furious, and as we cross various key thresholds that enable new use cases.

AI agents are coming into their own, both in coding and elsewhere. Opus 4.5 was the threshold moment for Claude Code, and was almost good enough to allow things like OpenClaw to make sense. It doesn’t look like Opus 4.6 lets us do another step change quite yet, but give it a few more weeks. We’re at least close.

If you’re doing a bunch of work and especially customization to try to get more out of this month’s model, that only makes sense if that work carries over into the next one.

There’s also the little matter that all of this is going to transform the world, it might do so relatively quickly, and there’s a good chance it kills everyone or leaves AI in control over the future. We don’t know how long we have, but if you want to prevent that, there is a good chance you’re running out of time. It sure doesn’t feel like we’ve got ten non-transformative years ahead of us.


Claude Opus 4.6 Escalates Things Quickly Read More »

spacex’s-next-gen-super-heavy-booster-aces-four-days-of-“cryoproof”-testing

SpaceX’s next-gen Super Heavy booster aces four days of “cryoproof” testing

The upgraded Super Heavy booster slated to launch SpaceX’s next Starship flight has completed cryogenic proof testing, clearing a hurdle that resulted in the destruction of the company’s previous booster.

SpaceX announced the milestone in a social media post Tuesday: “Cryoproof operations complete for the first time with a Super Heavy V3 booster. This multi-day campaign tested the booster’s redesigned propellant systems and its structural strength.”

Ground teams at Starbase, Texas, rolled the 237-foot-tall (72.3-meter) stainless-steel booster out of its factory and transported it a few miles away to Massey’s Test Site last week. The test crew first performed a pressure test on the rocket at ambient temperatures, then loaded super-cold liquid nitrogen into the rocket four times over six days, putting the booster through repeated thermal and pressurization cycles. The nitrogen is a stand-in for the cryogenic methane and liquid oxygen that will fill the booster’s propellant tanks on launch day.

The proof test is notable because it moves engineers closer to launching the first test flight of an upgraded version of SpaceX’s mega-rocket named Starship V3 or Block 3. SpaceX launched the previous version, Starship V2, five times last year, but the first three test flights failed. The last two flights achieved SpaceX’s goals, and the company moved on to V3.

Better results this time

The Super Heavy booster originally assigned to the first Starship V3 test flight failed during a pressure test in November. The rocket’s liquid oxygen tank ruptured under pressure, and SpaceX scrapped the booster and moved on to the next in line—Booster 19. This Super Heavy vehicle appears to have sailed through stress testing, and SpaceX returned the booster to the factory early Monday. There, technicians will mount 33 Raptor engines to the bottom of the rocket and install the booster’s grid fins.

Both components are upgraded from Starship V2. The Raptor engines set to debut on Starship V3 produce more thrust and include changes to improve reliability, according to SpaceX. The Raptor 3s are lighter, with plumbing and sensors integrated into the engine’s main structure, eliminating the requirement for self-contained heat shields between the engines at the base of the rocket.

SpaceX’s next-gen Super Heavy booster aces four days of “cryoproof” testing Read More »