Author name: Rejus Almole

amazon-and-stellantis-abandon-project-to-create-a-digital-“smartcockpit”

Amazon and Stellantis abandon project to create a digital “SmartCockpit”

Automaker Stellantis and retail and web services behemoth Amazon have decided to put an end to a collaboration on new in-car software. The partnership dates back to 2022, part of a wide-ranging agreement that also saw Stellantis pick Amazon Web Services as its cloud platform for new vehicles and Amazon sign on as the first customer for Ram’s fully electric ProMaster EV van.

A key aspect of the Amazon-Stellantis partnership was to be a software platform for new Stellantis vehicles called STLA SmartCockpit. Meant to debut last year, SmartCockpit was supposed to “seamlessly integrate with customers’ digital lives to create personalized, intuitive in-vehicle experiences,” using Alexa and other AI agents to provide better in-car entertainment but also navigation, vehicle maintenance, and in-car payments as well.

But 2024 came and went without the launch of SmartCockpit, and now the joint work has wound down, according to Reuters, although not for any particular reason the news organization could discern. Rather, the companies said in a statement that they “will allow each team to focus on solutions that provide value to our shared customers and better align with our evolving strategies.”

Amazon and Stellantis abandon project to create a digital “SmartCockpit” Read More »

samsung-drops-android-16-beta-for-galaxy-s25-with-more-ai-you-probably-don’t-want

Samsung drops Android 16 beta for Galaxy S25 with more AI you probably don’t want

The next version of Android is expected to hit Pixel phones in June, but it’ll take longer for devices from other manufacturers to see the new OS. However, Samsung is making unusually good time this cycle. Owners of the company’s Galaxy S25 phones can get an early look at One UI 8 (based on Android 16) in the new open beta program. Samsung promises a lot of upgrades, but it may not feel that way.

Signing up for the beta is a snap—just open the Samsung Members app, and the beta signup should be right on the main landing page. From there, the OTA update should appear on your device within a few minutes. It’s pretty hefty at 3.4GB, but the installation is quick, and none of your data should be affected. That said, backups are always advisable when using beta software.

You must be in the US, Germany, Korea, or the UK to join the beta, and US phones must be unlocked or the T-Mobile variants. The software is compatible with the Galaxy S25, S25+, and S25 Ultra—the new S25 Edge need not apply (for now).

Samsung’s big pitch here might not resonate with everyone: more AI. It’s a bit vague about what exactly that means, though. The company claims One UI 8 brings “multimodal capabilities, UX tailored to different device form factors, and personalized, proactive suggestions.” Having used the new OS for a few hours, it doesn’t seem like much has changed with Samsung’s AI implementation.

One UI 8 install

The One UI 8 beta is large, but installation doesn’t take very long.

Credit: Ryan Whitwam

The One UI 8 beta is large, but installation doesn’t take very long. Credit: Ryan Whitwam

The Galaxy S25 series launched with a feature called Now Brief, along with a companion widget called Now Bar. The idea is that this interface will assimilate your data, process it privately inside the Samsung Knox enclave, and spit out useful suggestions. For the most part, it doesn’t. While Samsung says One UI 8 will allow Now Brief to deliver more personal, customized insights, it seems to do the same handful of things it did before. It’s plugged into a lot of Samsung apps and data sources, but mostly just shows weather information, calendar events, and news articles you probably won’t care about. There were some hints of an upcoming audio version of Now Brief, but that’s not included in the beta.

Samsung drops Android 16 beta for Galaxy S25 with more AI you probably don’t want Read More »

we-now-have-a-good-idea-about-the-makeup-of-uranus’-atmosphere

We now have a good idea about the makeup of Uranus’ atmosphere

Uranus, the seventh planet in the Solar System, located between Saturn and Neptune, has long been a mystery. But by analyzing observations made by NASA’s Hubble Space Telescope over a 20-year period, a research team from the University of Arizona and other institutions has provided new insights into the composition and dynamics of the planet’s atmosphere.

Information about Uranus is limited. What we know is that the planet is composed mainly of water and ammonia ice, its diameter is about 51,000 kilometers, about four times that of the Earth, and its mass is about 15 times greater than Earth’s. Uranus also has 13 rings and 28 satellites.

In January 1986, NASA’s Voyager 2 space probe successfully completed what has been, to date, the only exploration of the planet, conducting a flyby as part of its mission to study the outer planets of the Solar System.

Uranus in 1986

This image of Uranus was taken by NASA’s Voyager 2 space probe in January 1986.

This image of Uranus was taken by NASA’s Voyager 2 space probe in January 1986. Credit: NASA/JPL

But thanks to this new research, we now know a little more about this icy giant. According to the research, which assessed Hubble images taken between 2002 and 2022, the main components of Uranus’ atmosphere are hydrogen and helium, with a small amount of methane and very small amounts of water and ammonia. Uranus appears pale blue-green because methane absorbs the red component of sunlight.

This image of Uranus, taken by NASA’s James Webb Space Telescope, shows nine of the planet’s 28 satellites and its rings.

This image of Uranus, taken by NASA’s James Webb Space Telescope, shows nine of the planet’s 28 satellites and its rings. Credit: NASA/ESA/CSA/STSCI

The research has also shed light on the planet’s seasons.

Unlike all of the other planets in the Solar System, Uranus’ axis of rotation is almost parallel to its orbital plane. For this reason, Uranus is said to be orbiting in an “overturned” position, as shown in the picture below. It is hypothesized that this may be due to a collision with an Earth-sized object in the past.

Uranus orbiting the Sun. It can be seen that Uranus’ axis of rotation is almost parallel to its orbital plane.

Uranus orbiting the Sun. It can be seen that Uranus’ axis of rotation is almost parallel to its orbital plane. Credit: NASA/ESA/J. Feild (STSCI)

The planet’s orbital period is about 84 years, which means that, for a specific point on the surface, the period when the sun shines (some of spring, summer, and some of fall) lasts about 42 years, and the period when the sun does not shine (some of fall, winter, and some of spring) lasts for about 42 years as well. In this study, the research team spent 20 years observing the seasons.

We now have a good idea about the makeup of Uranus’ atmosphere Read More »

hidden-ai-instructions-reveal-how-anthropic-controls-claude-4

Hidden AI instructions reveal how Anthropic controls Claude 4

Willison, who coined the term “prompt injection” in 2022, is always on the lookout for LLM vulnerabilities. In his post, he notes that reading system prompts reminds him of warning signs in the real world that hint at past problems. “A system prompt can often be interpreted as a detailed list of all of the things the model used to do before it was told not to do them,” he writes.

Fighting the flattery problem

An illustrated robot holds four red hearts with its four robotic arms.

Willison’s analysis comes as AI companies grapple with sycophantic behavior in their models. As we reported in April, ChatGPT users have complained about GPT-4o’s “relentlessly positive tone” and excessive flattery since OpenAI’s March update. Users described feeling “buttered up” by responses like “Good question! You’re very astute to ask that,” with software engineer Craig Weiss tweeting that “ChatGPT is suddenly the biggest suckup I’ve ever met.”

The issue stems from how companies collect user feedback during training—people tend to prefer responses that make them feel good, creating a feedback loop where models learn that enthusiasm leads to higher ratings from humans. As a response to the feedback, OpenAI later rolled back ChatGPT’s 4o model and altered the system prompt as well, something we reported on and Willison also analyzed at the time.

One of Willison’s most interesting findings about Claude 4 relates to how Anthropic has guided both Claude models to avoid sycophantic behavior. “Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective,” Anthropic writes in the prompt. “It skips the flattery and responds directly.”

Other system prompt highlights

The Claude 4 system prompt also includes extensive instructions on when Claude should or shouldn’t use bullet points and lists, with multiple paragraphs dedicated to discouraging frequent list-making in casual conversation. “Claude should not use bullet points or numbered lists for reports, documents, explanations, or unless the user explicitly asks for a list or ranking,” the prompt states.

Hidden AI instructions reveal how Anthropic controls Claude 4 Read More »

the-key-to-a-successful-egg-drop-experiment?-drop-it-on-its-side.

The key to a successful egg drop experiment? Drop it on its side.

There was a key difference, however, between how vertically and horizontally squeezed eggs deformed in the compression experiments—namely, the former deformed less than the latter. The shell’s greater rigidity along its long axis was an advantage because the heavy load was distributed over the surface. (It’s why the one-handed egg-cracking technique targets the center of a horizontally held egg.)

But the authors found that this advantage when under static compression proved to be a disadvantage when dropping eggs from a height, with the horizontal position emerging as the optimal orientation.  It comes down to the difference between stiffness—how much force is needed to deform the egg—and toughness, i.e., how much energy the egg can absorb before it cracks.

Cohen et al.’s experiments showed that eggs are tougher when loaded horizontally along their equator, and stiffer when compressed vertically, suggesting that “an egg dropped on its equator can likely sustain greater drop heights without cracking,” they wrote. “Even if eggs could sustain a higher force when loaded in the vertical direction, it does not necessarily imply that they are less likely to break when dropped in that orientation. In contrast to static loading, to remain intact following a dynamic impact, a body must be able to absorb all of its kinetic energy by transferring it into reversible deformation.”

“Eggs need to be tough, not stiff, in order to survive a fall,” Cohen et al. concluded, pointing to our intuitive understanding that we should bend our knees rather than lock them into a straightened position when landing after a jump, for example. “Our results and analysis serve as a cautionary tale about how language can affect our understanding of a system, and improper framing of a problem can lead to misunderstanding and miseducation.”

DOI: Communications Physics, 2025. 10.1038/s42005-025-02087-0  (About DOIs).

The key to a successful egg drop experiment? Drop it on its side. Read More »

college-board-keeps-apologizing-for-screwing-up-digital-sat-and-ap-tests

College Board keeps apologizing for screwing up digital SAT and AP tests

Don’t worry about the “mission-driven not-for-profit” College Board—it’s drowning in cash. The US group, which administers the SAT and AP tests to college-bound students, paid its CEO $2.38 million in total compensation in 2023 (the most recent year data is available). The senior VP in charge of AP programs made $694,662 in total compensation, while the senior VP for Technology Strategy made $765,267 in total compensation.

Given such eye-popping numbers, one would have expected the College Board’s transition to digital exams to go smoothly, but it continues to have issues.

Just last week, the group’s AP Psychology exam was disrupted nationally when the required “Bluebook” testing app couldn’t be accessed by many students. Because the College Board shifted to digital-only exams for 28 of its 36 AP courses beginning this year, no paper-based backup options were available. The only “solution” was to wait quietly in a freezing gymnasium, surrounded by a hundred other stressed-out students, to see if College Board could get its digital act together.

I speak, as you may have gathered, from family experience; one of my kids got to experience the incident first-hand. I was first clued into the problem by an e-mail from my school, which announced “a nationwide Bluebook outage (the testing application used for all digital AP exams)” for all AP Psych testers. Within an hour, many students were finally able to log in and begin the test, but other students had scheduling conflicts and were therefore “dismissed from the testing room” and given slots during the “late test day” or “during the exception testing window.”

On Reddit, the r/APStudents board melted down in predictable fashion. One post asked the question on everyone’s mind:

HOW DO U NOT PREPARE THE SERVERS FOR THE EXAM WHEN U KNOW WE’RE TAKING THE EXAM HELLO?????? BE SO FR

i was locked in and studied like hell for this ap psych exam ts pmo like what the actual f—nugget???

Now, I’m old enough not to know what “BE SO FR,” “TS,” or “PMO” mean, but the term “f—nugget” comes through loud and clear, and I plan to add it to my vocabulary.

College Board keeps apologizing for screwing up digital SAT and AP tests Read More »

researchers-cause-gitlab-ai-developer-assistant-to-turn-safe-code-malicious

Researchers cause GitLab AI developer assistant to turn safe code malicious

Marketers promote AI-assisted developer tools as workhorses that are essential for today’s software engineer. Developer platform GitLab, for instance, claims its Duo chatbot can “instantly generate a to-do list” that eliminates the burden of “wading through weeks of commits.” What these companies don’t say is that these tools are, by temperament if not default, easily tricked by malicious actors into performing hostile actions against their users.

Researchers from security firm Legit on Thursday demonstrated an attack that induced Duo into inserting malicious code into a script it had been instructed to write. The attack could also leak private code and confidential issue data, such as zero-day vulnerability details. All that’s required is for the user to instruct the chatbot to interact with a merge request or similar content from an outside source.

AI assistants’ double-edged blade

The mechanism for triggering the attacks is, of course, prompt injections. Among the most common forms of chatbot exploits, prompt injections are embedded into content a chatbot is asked to work with, such as an email to be answered, a calendar to consult, or a webpage to summarize. Large language model-based assistants are so eager to follow instructions that they’ll take orders from just about anywhere, including sources that can be controlled by malicious actors.

The attacks targeting Duo came from various resources that are commonly used by developers. Examples include merge requests, commits, bug descriptions and comments, and source code. The researchers demonstrated how instructions embedded inside these sources can lead Duo astray.

“This vulnerability highlights the double-edged nature of AI assistants like GitLab Duo: when deeply integrated into development workflows, they inherit not just context—but risk,” Legit researcher Omer Mayraz wrote. “By embedding hidden instructions in seemingly harmless project content, we were able to manipulate Duo’s behavior, exfiltrate private source code, and demonstrate how AI responses can be leveraged for unintended and harmful outcomes.”

Researchers cause GitLab AI developer assistant to turn safe code malicious Read More »

cdc-can-no-longer-help-prevent-lead-poisoning-in-children,-state-officials-say

CDC can no longer help prevent lead poisoning in children, state officials say

Amid the brutal cuts across the federal government under the Trump administration, perhaps one of the most gutting is the loss of experts at the Centers for Disease Control and Prevention who respond to lead poisoning in children.

On April 1, the staff of the CDC’s Childhood Lead Poisoning Prevention Program was terminated as part of the agency’s reduction in force, according to NPR. The staff included epidemiologists, statisticians, and advisors who specialized in lead exposures and responses.

The cuts were immediately consequential to health officials in Milwaukee, who are currently dealing with a lead exposure crisis in public schools. Six schools have had to close, displacing 1,800 students. In April, the city requested help from the CDC’s lead experts, but the request was denied—there was no one left to help.

In a Congressional hearing this week, US health secretary and anti-vaccine advocate Robert F. Kennedy Jr. told lawmakers, “We have a team in Milwaukee.”

But Milwaukee Health Commissioner Mike Totoraitis told NPR that this is false. “There is no team in Milwaukee,” he said. “We had a single [federal] staff person come to Milwaukee for a brief period to help validate a machine, but that was separate from the formal request that we had for a small team to actually come to Milwaukee for our Milwaukee Public Schools investigation and ongoing support there.”

Kennedy has also previously told lawmakers that lead experts at the CDC who were terminated would be rehired. But that statement was also false. The health department’s own communications team told ABC that the lead experts would not be reinstated.

CDC can no longer help prevent lead poisoning in children, state officials say Read More »

faa:-airplanes-should-stay-far-away-from-spacex’s-next-starship-launch

FAA: Airplanes should stay far away from SpaceX’s next Starship launch


“The FAA is expanding the size of hazard areas both in the US and other countries.”

The Starship for SpaceX’s next test flight, known as Ship 35, on the move between the production site at Starbase (in background) and the Massey’s test facility for a static fire test. Credit: SpaceX

The Federal Aviation Administration gave the green light Thursday for SpaceX to launch the next test flight of its Starship mega-rocket as soon as next week, following two consecutive failures earlier this year.

The failures set back SpaceX’s Starship program by several months. The company aims to get the rocket’s development back on track with the upcoming launch, Starship’s ninth full-scale test flight since its debut in April 2023. Starship is central to SpaceX’s long-held ambition to send humans to Mars and is the vehicle NASA has selected to land astronauts on the Moon under the umbrella of the government’s Artemis program.

In a statement Thursday, the FAA said SpaceX is authorized to launch the next Starship test flight, known as Flight 9, after finding the company “meets all of the rigorous safety, environmental and other licensing requirements.”

SpaceX has not confirmed a target launch date for the next launch of Starship, but warning notices for pilots and mariners to steer clear of hazard areas in the Gulf of Mexico suggest the flight might happen as soon as the evening of Tuesday, May 27. The rocket will lift off from Starbase, Texas, SpaceX’s privately owned spaceport near the US-Mexico border.

This will be the third flight of SpaceX’s upgraded Block 2, or Version 2, Starship rocket. The first two flights of Starship Block 2—in January and Marchdid not go well. On both occasions, the rocket’s upper stage shut down its engines prematurely and the vehicle lost control, breaking apart in the upper atmosphere and spreading debris near the Bahamas and the Turks and Caicos Islands.

Debris from Starship falls back into the atmosphere after Starship Flight 8 in this view over Hog Cay, Bahamas. Credit: GeneDoctorB via X

Investigators determined the cause of the January failure was a series of fuel leaks and fires in the ship’s aft compartment. The leaks were most likely triggered by vibrations that were more intense than anticipated, SpaceX said before Starship’s most recent flight in March. SpaceX has not announced the cause of the March failure, although the circumstances were similar to the mishap in January.

“The FAA conducted a comprehensive safety review of the SpaceX Starship Flight 8 mishap and determined that the company has satisfactorily addressed the causes of the mishap, and therefore, the Starship vehicle can return to flight,” the agency said. “The FAA will verify SpaceX implements all corrective actions.”

Flight safety

The flight profile for the next Starship launch will largely be a repeat of what SpaceX hoped to accomplish on the ill-fated tests earlier this year. If all goes according to plan, the rocket’s upper stage, or ship, will travel halfway around the world from Starbase, reaching an altitude of more than 100 miles before reentering the atmosphere over the Indian Ocean. A little more than an hour after liftoff, the ship will aim for a controlled splashdown in the ocean northwest of Australia.

Apart from overcoming the problems that afflicted the last two launches, one of the most important objectives for this flight is to test the performance of Starship’s heat shield. Starship Block 2 includes improved heat shield materials that could do better at protecting the ship from the superheated temperatures of reentry and, ultimately, make it easier to reuse the vehicle. The problems on the last two Starship test flights prevented the rocket from reaching the point where its heat shield could be tested.

Starship Block 2 also features redesigned flaps to better control the vehicle during its descent through the atmosphere. This version of Starship also has larger propellant tanks and reconfigured fuel feed lines for the ship’s six Raptor engines.

The FAA’s approval for Starship Flight 9 comes with some stipulations. The agency is expanding the size of hazard areas in the United States and in other countries based on an updated “flight safety analysis” from SpaceX and because SpaceX will reuse a previously flown first-stage booster—called Super Heavy—for the first time.

The aircraft hazard area for Starship Flight 9 extends approximately 1,600 nautical miles to the east from Starbase, Texas. Credit: Federal Aviation Administration

This flight-safety analysis takes into account the outcomes of previous flights, including accidents, population exposure risk, the probability of vehicle failure, and debris propagation and behavior, among other considerations. “The FAA uses this and other data to determine and implement measures to mitigate public risk,” the agency said.

All of this culminated in the FAA’s “return to flight determination,” which the agency says is based on public safety. The FAA’s primary concern with commercial space activity is ensuring rocket launches don’t endanger third parties. The agency also requires that SpaceX maintain at least $500 million in liability insurance to cover claims resulting from the launch and flight of Starship Flight 9, the same requirement the FAA levied for previous Starship test flights.

For the next launch, the FAA will establish an aircraft hazard area covering approximately 1,600 nautical miles extending eastward from Starbase, Texas, and through the Straits of Florida, including the Bahamas and the Turks and Caicos Islands. This is an extension of the 885-nautical-mile hazard area the FAA established for the test flight in March. In order to minimize disruption to commercial and private air traffic, the FAA is requiring the launch window for Starship Flight 9 to be scheduled during “non-peak transit periods.”

The size of FAA-mandated airspace closures can expand or shrink based on the reliability of the launch vehicle. The failures of Starship earlier this year raised the probability of vehicle failure in the flight-safety analysis for Starship Flight 9, according to the FAA.

The expanded hazard area will force the closure of more than 70 established air routes across the Gulf of Mexico and now includes the Bahamas and the Turks and Caicos Islands. The FAA anticipates this will affect more than 175 flights, almost all of them on international connecting routes. For airline passengers traveling through this region, this will mean an average flight delay of approximately 40 minutes, and potentially up to two hours, the FAA said.

If SpaceX can reel off a series of successful Starship flights, the hazard areas will likely shrink in size. This will be important as SpaceX ramps up the Starship launch cadence. The FAA recently approved SpaceX to increase its Starship flight rate from five per year to 25 per year.

The agency said it is in “close contact and collaboration” with other nations with territory along or near Starship’s flight path, including the United Kingdom, Turks and Caicos, the Bahamas, Mexico, and Cuba.

Status report

Meanwhile, SpaceX’s hardware for Starship Flight 9 appears to be moving closer to launch. Engineers test-fired the Super Heavy booster, which SpaceX previously launched and recovered in January, last month on the launch pad in South Texas. On May 12, SpaceX fired the ship’s six Raptor engines for 60 seconds on a test stand near Starbase.

After the test-firing, ground crews rolled the ship back to the Starship production site a few miles away, only to return the vehicle to the test stand Wednesday for unspecified testing. SpaceX is expected to roll the ship back to the production site again before the end of the week.

The final steps before launch will involve separately transporting the Super Heavy booster and Starship upper stage from the production site to the launch pad. There, SpaceX will stack the ship on top of the booster. Once the two pieces are stacked together, the rocket will stand 404 feet (123.1 meters) tall.

If SpaceX moves forward with a launch attempt next Tuesday evening, the long-range outlook from the National Weather Service calls for a 30 percent chance of showers and thunderstorms.

Photo of Stephen Clark

Stephen Clark is a space reporter at Ars Technica, covering private space companies and the world’s space agencies. Stephen writes about the nexus of technology, science, policy, and business on and off the planet.

FAA: Airplanes should stay far away from SpaceX’s next Starship launch Read More »

rfk-jr.-calls-who-“moribund”-amid-us-withdrawal;-china-pledges-to-give-$500m

RFK Jr. calls WHO “moribund” amid US withdrawal; China pledges to give $500M

“WHO’s priorities have increasingly reflected the biases and interests of corporate medicine,” Kennedy said, alluding to his anti-vaccine and germ-theory denialist views. He chastised the health organization for allegedly capitulating to China and working with the country to “promote the fiction that COVID originated in bats.”

Kennedy ended the short speech by touting his Make America Healthy Again agenda. He also urged the WHO to undergo a radical overhaul similar to what the Trump administration is currently doing to the US government—presumably including dismantling and withholding funding from critical health agencies and programs. Last, he pitched other countries to join the US in abandoning the WHO.

“I would like to take this opportunity to invite my fellow health ministers around the world into a new era of cooperation…. we’re ready to work with you,” Kennedy said.

Meanwhile, the WHA embraced collaboration. During the assembly this week, WHO overwhelmingly voted to adopt the world’s first pandemic treaty, aimed at collectively preventing, preparing for, and responding to any future pandemics. The treaty took over three years to negotiate, but in the end, no country voted against it—124 votes in favor, 11 abstentions, and no objections. (The US, no longer being a member of WHO, did not have a vote.)

“The world is safer today thanks to the leadership, collaboration and commitment of our Member States to adopt the historic WHO Pandemic Agreement,” WHO Director-General Tedros Adhanom Ghebreyesus said. “The Agreement is a victory for public health, science and multilateral action. It will ensure we, collectively, can better protect the world from future pandemic threats. It is also a recognition by the international community that our citizens, societies and economies must not be left vulnerable to again suffer losses like those endured during COVID-19.”

RFK Jr. calls WHO “moribund” amid US withdrawal; China pledges to give $500M Read More »

gemini-2.5-is-leaving-preview-just-in-time-for-google’s-new-$250-ai-subscription

Gemini 2.5 is leaving preview just in time for Google’s new $250 AI subscription

Deep Think graphs I/O

Deep Think is more capable of complex math and coding. Credit: Ryan Whitwam

Both 2.5 models have adjustable thinking budgets when used in Vertex AI and via the API, and now the models will also include summaries of the “thinking” process for each output. This makes a little progress toward making generative AI less overwhelmingly expensive to run. Gemini 2.5 Pro will also appear in some of Google’s dev products, including Gemini Code Assist.

Gemini Live, previously known as Project Astra, started to appear on mobile devices over the last few months. Initially, you needed to have a Gemini subscription or a Pixel phone to access Gemini Live, but now it’s coming to all Android and iOS devices immediately. Google demoed a future “agentic” capability in the Gemini app that can actually control your phone, search the web for files, open apps, and make calls. It’s perhaps a little aspirational, just like the Astra demo from last year. The version of Gemini Live we got wasn’t as good, but as a glimpse of the future, it was impressive.

There are also some developments in Chrome, and you guessed it, it’s getting Gemini. It’s not dissimilar from what you get in Edge with Copilot. There’s a little Gemini icon in the corner of the browser, which you can click to access Google’s chatbot. You can ask it about the pages you’re browsing, have it summarize those pages, and ask follow-up questions.

Google AI Ultra is ultra-expensive

Since launching Gemini, Google has only had a single $20 monthly plan for AI features. That plan granted you access to the Pro models and early versions of Google’s upcoming AI. At I/O, Google is catching up to AI firms like OpenAI, which have offered sky-high AI plans. Google’s new Google AI Ultra plan will cost $250 per month, more than the $200 plan for ChatGPT Pro.

Gemini 2.5 is leaving preview just in time for Google’s new $250 AI subscription Read More »

the-codex-of-ultimate-vibing

The Codex of Ultimate Vibing

While we wait for wisdom, OpenAI releases a research preview of a new software engineering agent called Codex, because they previously released a lightweight open-source coding agent in terminal called Codex CLI and if OpenAI uses non-confusing product names it violates the nonprofit charter. The promise, also reflected in a number of rival coding agents, is to graduate from vibe coding. Why not let the AI do all the work on its own, typically for 1-30 minutes?

The answer is that it’s still early days, but already many report this is highly useful.

Sam Altman: today we are introducing codex.

it is a software engineering agent that runs in the cloud and does tasks for you, like writing a new feature of fixing a bug.

you can run many tasks in parallel.

it is amazing and exciting how much software one person is going to be able to create with tools like this. “you can just do things” is one of my favorite memes;

i didn’t think it would apply to AI itself, and its users, in such an important way so soon.

OpenAI: Today we’re launching a research preview of Codex: a cloud-based software engineering agent that can work on many tasks in parallel. Codex can perform tasks for you such as writing features, answering questions about your codebase, fixing bugs, and proposing pull requests for review; each task runs in its own cloud sandbox environment, preloaded with your repository.

Codex is powered by codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result.

Once Codex completes a task, it commits its changes in its environment. Codex provides verifiable evidence of its actions through citations of terminal logs and test outputs, allowing you to trace each step taken during task completion. You can then review the results, request further revisions, open a GitHub pull request, or directly integrate the changes into your local environment. In the product, you can configure the Codex environment to match your real development environment as closely as possible.

Codex can be guided by AGENTS.md files placed within your repository. These are text files, akin to README.md, where you can inform Codex how to navigate your codebase, which commands to run for testing, and how best to adhere to your project’s standard practices. Like human developers, Codex agents perform best when provided with configured dev environments, reliable testing setups, and clear documentation.

On coding evaluations and internal benchmarks, codex-1 shows strong performance even without AGENTS.md files or custom scaffolding.

All code is provided via GitHub repositories. All codex executions are sandboxed in the cloud. The agent cannot access external websites, APIs or other services. Afterwards you are given a comprehensive log of its actions and changes. You then choose to get the code via pull requests.

Note that while it lacks internet access during its core work, it can still install dependencies before it starts. But there are reports of struggles with its inability to install dependencies while it runs, which seems like a major issue.

Inability to access the web also makes some things trickier to diagnose, figure out or test. A lot of my frustration with AI coding is everything I want to do seems to involve interacting with persnickety websites.

This is a ‘research preview,’ and the worst Codex will ever be, although it might temporarily get less affordable once the free preview period ends. It does seem like they have given this a solid amount of thought and taken reasonable precautions.

The question is, when is this a better way to code than Cursor or Claude Code, and how does this compare to existing coding agents like Devin?

It would have been easy, given everything that happened, for OpenAI to have said ‘we do not need to give you a system card addendum, this is in preview and not a fully new model, etc.’ It is thus to their credit that they gave us the card anyway. It is short, but there is no need for it to be long.

As you would expect, the first thing that stood out was 2.3, ‘falsely claiming to have completed a task it did not complete.’ This seems to be a common pattern in similar models, including Claude 3.7.

I believe this behavior is something you want to fight hard to avoid having the AI learn in the first place. Once the AI learns to do this, it is difficult to get rid of it, but it wouldn’t learn it if you weren’t rewarding it during training. It is avoidable in theory. Is it avoidable in practice? I don’t know if the price is worthwhile, but I do know it’s worth a lot to avoid it.

OpenAI does indeed try, but with positive action rather than via negativa. Their plan is ensuring that the model is penalized for producing results inconsistent with its actions, and rewarded for acknowledging limitations. Good. That was a big help, going from 15% to 85% chance of correctly stating it couldn’t complete tasks. But 85% really isn’t 99%.

As in, I think if you include some things that push against pretending to solve problems, that helps a lot (hence the results here), but if you also have other places that pretending is rewarded, there will be a pattern, and then you still have a problem, and it will keep getting bigger. So instead, track down every damn place in which the AI could get away with claiming to have solved a task during training without having solved it, and make sure you always catch all of them. I know this is asking a lot.

They solve prompt injecting via network sandbagging. That definitely does the job for now, but also they made sure that prompt injections inside the coding environment also mostly failed. Good.

Finally we have the preparedness team affirming that the model did not reach high risk in any categories. I’d have liked to see more detail here, but overall This Is Fine.

Want to keep using the command line? OpenAI gives you codex-1, a variant of o4-mini, as an upgrade. They’re also introducing a simpler onboarding process for it and offering some free credits.

These look like a noticeable improvement over o4-mini-high and even o3-high. Codex-mini-latest will be priced at $1.50/$6 per million with a 75% prompt caching discount. They are also setting a great precedent by sharing the system message.

Greg Brockman speculates that over time the ‘local’ and ‘remote’ coding agents will merge. This makes sense. Why shouldn’t the local agent call additional remote agents to execute subtasks? Parallelism for the win. Nothing could possibly go wrong.

Immediate reaction to Codex was relatively muted. It takes a while for people to properly evaluate this kind of tool, and it is only available to those paying $200/month.

What feedback we do have is somewhat mixed. Cautious optimism, especially for what a future version could be, seems like the baseline.

Codex is the combination of an agent implementation with the underlying model. Reports seem to be consistent with the underlying model and async capabilities being excellent and those both matter a lot, but with the implementation needing work and being much less practically useful than rival agents, requiring more hand holding, having less clean UI and running slower.

That makes Codex in its current state a kind of ‘AI coding agent for advanced power users.’ You wouldn’t use the current Codex over the competition unless you understood what you were doing, and you wanted to do a lot of it.

The future of Codex looks bright. OpenAI in many senses started with ‘the hard part’ of having a great model and strong parallelism. The things still missing seem easily fixable over time.

One must also keep an eye out that OpenAI (especially via Greg Brockman) is picking out and amplifying positive feedback. It’s not yet clear how much of an upgrade this is over existing alternatives, especially as most reports don’t compare Codex to its rivals. That’s one reason I like to rely on my own Twitter reaction threads.

Then there’s Jules, Google’s coding assistant, which according to multiple sources is coming soon. Google will no doubt once again Fail Marketing Forever, but it seems highly plausible that Jules could be a better tool, and almost certain it will have a cheaper price tag.

What can it do?

Whatever those things are, it can do them fully in parallel. People seem to be underestimating this aspect of coding agents.

Alex Halliday: The killer feature of OpenAI Codex is parallelism.

Browser-based work is evolving: from humans handling tasks one tab at a time, to overseeing multiple AI agent tabs, providing feedback as needed.

The most important thing is the Task Relevant Maturity of these systems. You need to understand for which tasks systems like Codex can be used which is function of model capability and error tolerance. This is the “opportunity zone” for all AI systems, including ours @AirOpsHQ.

It can do legacy project migrations.

Flavio Adamo: I asked Codex to convert a legacy project from Python 2.7 to 3.11 and from Django 1.x to 5.0

It literally took 12 minutes. If you know, that’s usually weeks of pain. This is actually insane.

Haider: how much manual cleanup or review did it need after that initial pass?

Flavio Adamo: Not much, actually. Just a few Docker issues, solved in a couple of minutes.

Here’s Darwin Santos pumping out PRs and being very impressed.

Darwin Santos: Don’t mind us – it’s just @elvstejd and me knocking one PR after another with Codex. Thanks @embirico – @kevinweil. You weren’t joking with this being yet again a game changer.

Here’s Seconds being even more impressed, and sdmat being impressed with caveats.

0.005 Seconds: It’s incredible. The ux is mid and it’s missing features but the underlying model is so good that if you transported this to 2022 everyone would assume you have agi and put 70% of engineers into unemployment. 6 months of product engineering and it replaces teams.

It has been making insane progress in fairly complex scenarios on my personal project and I pretty effortlessly closed 7 tickets at work today. It obliterates small to medium tasks in familiar context.

Sdmat: Fantastic, though only part of what it will be and rough around the edges.

With no environment internet access, no agent search tool, and oriented to small-medium tasks it is currently a scalpel.

An excellent scalpel if you know what it is you want to cut.

Conrad Barski: this is right: it’s power is not that it can solve 50% of hard problems, it’s that it solves 99.9% of mid problems.

Sdmat: Exactly.

And mid problems comprise >90% of hard problems, so if you know what you are doing and can carve at the joints it is a very, very useful tool.

And here’s Riley Coyote being perhaps the most impressed, especially by the parallelism.

Riley Coyote: I’m *reallytrying to play it cool here but like…

I’mma just say it: Codex might be the most impressive, most *powerfulAI product I’ve ever touched. all things considered. the async ability, especially, is on another level. like it’s not just a technical ‘leap’, it’s transcendent. I’ve used basically every ai coding tool and platform out there at least once, and nothing else is in the same class. it just works, ridiculously well. and I’ll admit, I didn’t want to like it. Maybe it’s stubborn loyalty to Claude – I love that retro GUI and the no-nonsense simplicity of Claude Code. There’s still something special there and ill alway use it.

but, if I’m honest: that edge is kinda becoming irrelevant, because Codex feels like having a private, hyper-competent swarm – a crack team of 10/10 FS devs, but kinda *betteri think tbh.

it’s wild. at this rate, I might start shipping something new every single day, at least until I clear out my backlog (which, without exaggeration, is something like 35-40 ‘projects’ that are all ~70–85% done). this could not have come at a better time too. I desperately needed the combination of something like codex and much higher rate limits + a streamlined pipeline from my daily drive ai to db.

go try it out.

sidebar/tip: if you cant get over the initial hump, pop over to ai.studio.google.com and click the “build apps” button on the left hand side.

a bunch of sample apps and tools propogates and they’re actually really really really good one-click zero-shots essentially….

shits getting wild. and its only monday.

Bayram Annakov prefers Deep Research’s output for now on a sample task, but finds Codex to be promising as well, and it gets a B on an AI Product Engineer homework assignment.

Here’s Robbie Bouschery finding a bug in the first three minutes.

JB one shots a doodle jump game and gets 600k likes for the post, so clearly money well spent. Paul Couvert does the same with Gemini 2.5 although objectively the platform placement seems better in Codex’s version. Upgrade?

Reliability will always be a huge sticking point, right up until it isn’t. Being highly autonomous only matters if you can trust it.

Fleischman Mena: I’m reticent to use it on featurework: ~unchanged benchmarks & results look like o3 bolted to a SWE-bench finetune + git.

You seem to still need to baby it w/ gold-set context for decent outputs, so it’s unclear where alpha is vs. current reprompt grinds

It’s a nice “throw it in the bag, too” feature if you’re hitting GPT caps and don’t want to fan out to other services: But to me, it’s in the same category as task scheduling and the web agent: the “party trick” version of a better thing yet to come.

He points to a similar issue with Operator. I have access to Operator, but I don’t bother using it, largely because in many of the places where it is valuable it requires enough supervision I might as well do the job myself:

Henry: Does anyone use that ‘operator’ agent for anything?

Fleischman Mena: Not really.

Problem with web operators are that the REAL version of that product pretty much HAVE to be made by a sin-eater like the leetcode cheating startup.

Nobody wants “we build a web botting platform but it’s useless whenever lots of bots would have an impact.”

You pretty much HAVE to commit to “we’re going to sell you the ability to destroy the internet commons with bots”,

-or accept you’re only selling the “party trick” version of what this software would actually be if implemented “properly” for its users.

The few times I tried to use Operator to do something that would have been highly annoying to do myself, it fell down and died, and I decided that unless other people started reporting great results I’d rather just wait for similar agents to get better.

Alex Mizrahi reports Codex engaging in ‘busywork,’ identifying and fixing a ‘bug’ that wasn’t actually a bug.

Scott Swingle tries Codex out and compares it to Mentat. A theme throughout is that Mentat is more polished and faster, whereas Codex has to rerun a bunch of stuff. He likes o3 as the underlying model more than Sonnet 3.7, but finds the current implementation to not yet be up to par.

Lemonaut mostly doesn’t see the alpha over using some combination of Devin and Cursor/Cline, and finds it terribly finnicky and requiring hand holding in ways Cline and Devin aren’t, but does notice it solve a relatively difficult prompt. Again, that is compatible with o3 being a very good base model, but the implementation needing work.

People think about price all wrong.

Don’t think about relative price. Think about absolute benefits versus absolute price.

It doesn’t matter if ten times the price is ten times better. If ten times the price makes you 10% better, it’s an absolute steal.

Fleischman Mena: The sticking point is $2,160/year more than plus.

If you think Plus is a good deal at $240, the upgrade only makes sense if you GENUINELY believe

“This isn’t just better, it’s 10x better than plus, AND a better idea than subscribing to 9 other LLM pro plans.”

Seems dubious.

The $2,160 price issue is hard to ignore. that buys you ~43M o3 I/O tokens via API. War and peace is ~750k tokens. Most codebases & outputs don’t come close.

If spend’s okay, you prob do better plugging an API key into a half dozen agent competitors; you’d still come out ahead.

The dollar price, even at the $200/month a level, is chump change for a programmer, relative to a substantial productivity gain. What matters is your time and your productivity. If this improves your productivity even a few percent over rival options, and there isn’t a principal-agent problem (aka you pay the cost and someone else gets the productivity gains), then it is worthwhile. So ask whether or not it does that.

The other way this is the wrong approach is that it is only part of the $200/month package. You also get unlimited o3 and deep research use, among other products, which was previously the main attraction.

As a company, you are paying six figures for a programmer. Give them the best tools you can, whether or not this is the best tool.

This seems spot on to me:

Sully: I think agents are going to be split into 2 categories

Background & active

Background agents = stuff I don’t want to do (ux/spees doesn’t matter, but review + feedback does)

“Active agents” = things I want to do but 10x faster with agents (ux/speed matters, most apps are this)

Mat Ferrante: And I think they will be able to integrate with each other. Background leverages active one to execute quick stuff just like a user would. Active kicking off background tasks.

Sully: 100%.

Codex is currently in a weird spot. It wants to be background (or async) and is great at being async, but requires too much hand holding to let you actually ignore it for long. Once that is solved, things get a lot more interesting.

Discussion about this post

The Codex of Ultimate Vibing Read More »