generative ai

OpenAI built an AI coding agent and uses it to improve the agent itself


“The vast majority of Codex is built by Codex,” OpenAI told us about its new AI coding agent.

With the popularity of AI coding tools rising among software developers, their adoption has begun to touch every aspect of the process, including the improvement of AI coding tools themselves.

In interviews with Ars Technica this week, OpenAI employees revealed the extent to which the company now relies on its own AI coding agent, Codex, to build and improve the development tool. “I think the vast majority of Codex is built by Codex, so it’s almost entirely just being used to improve itself,” said Alexander Embiricos, product lead for Codex at OpenAI, in a conversation on Tuesday.

Codex, which OpenAI launched in its modern incarnation as a research preview in May 2025, operates as a cloud-based software engineering agent that can handle tasks like writing features, fixing bugs, and proposing pull requests. The tool runs in sandboxed environments linked to a user’s code repository and can execute multiple tasks in parallel. OpenAI offers Codex through ChatGPT’s web interface, a command-line interface (CLI), and IDE extensions for VS Code, Cursor, and Windsurf.

The “Codex” name itself dates back to a 2021 OpenAI model based on GPT-3 that powered GitHub Copilot’s tab completion feature. Embiricos said the name is rumored among staff to be short for “code execution.” OpenAI wanted to connect the new agent to that earlier moment, even though the original model was crafted in part by people who have since left the company.

“For many people, that model powering GitHub Copilot was the first ‘wow’ moment for AI,” Embiricos said. “It showed people the potential of what it can mean when AI is able to understand your context and what you’re trying to do and accelerate you in doing that.”

The interface for OpenAI’s Codex in ChatGPT. Credit: OpenAI

It’s no secret that the current command-line version of Codex bears some resemblance to Claude Code, Anthropic’s agentic coding tool that launched in February 2025. When asked whether Claude Code influenced Codex’s design, Embiricos parried the question but acknowledged the competitive dynamic. “It’s a fun market to work in because there’s lots of great ideas being thrown around,” he said. He noted that OpenAI had been building web-based Codex features internally before shipping the CLI version, which arrived after Anthropic’s tool.

OpenAI’s customers apparently love the command-line version, though. Embiricos said Codex usage among external developers jumped 20-fold after OpenAI shipped the interactive CLI extension alongside GPT-5 in August 2025. On September 15, OpenAI released GPT-5-Codex, a specialized version of GPT-5 optimized for agentic coding, which further accelerated adoption.

It hasn’t just been the outside world that has embraced the tool. Embiricos said the vast majority of OpenAI’s engineers now use Codex regularly. The company uses the same open-source version of the CLI that external developers can freely download, suggest additions to, and modify themselves. “I really love this about our team,” Embiricos said. “The version of Codex that we use is literally the open source repo. We don’t have a different repo that features go in.”

The recursive nature of Codex development extends beyond simple code generation. Embiricos described scenarios where Codex monitors its own training runs and processes user feedback to “decide” what to build next. “We have places where we’ll ask Codex to look at the feedback and then decide what to do,” he said. “Codex is writing a lot of the research harness for its own training runs, and we’re experimenting with having Codex monitoring its own training runs.” OpenAI employees can also submit a ticket to Codex through project management tools like Linear, assigning it tasks the same way they would assign work to a human colleague.

This kind of recursive loop, of using tools to build better tools, has deep roots in computing history. Engineers designed the first integrated circuits by hand on vellum and paper in the 1960s, then fabricated physical chips from those drawings. Those chips powered the computers that ran the first electronic design automation (EDA) software, which in turn enabled engineers to design circuits far too complex for any human to draft manually. Modern processors contain billions of transistors arranged in patterns that exist only because software made them possible. OpenAI’s use of Codex to build Codex seems to follow the same pattern: each generation of the tool creates capabilities that feed into the next.

But describing what Codex actually does presents something of a linguistic challenge. At Ars Technica, we try to avoid anthropomorphizing AI models as much as possible while still describing what these systems do in analogies that make sense to general readers. People can talk to Codex like a human, so it feels natural to use human terms to describe interacting with it, even though it is not a person and simulates human personality through statistical modeling.

The system runs many processes autonomously, addresses feedback, spins off and manages child processes, and produces code that ships in real products. OpenAI employees call it a “teammate” and assign it tasks through the same tools they use for human colleagues. Whether the tasks Codex handles constitute “decisions” or sophisticated conditional logic smuggled through a neural network depends on definitions that computer scientists and philosophers continue to debate. What we can say is that a semi-autonomous feedback loop exists: Codex produces code under human direction, that code becomes part of Codex, and the next version of Codex produces different code as a result.

Building faster with “AI teammates”

According to our interviews, the most dramatic example of Codex’s internal impact came from OpenAI’s development of the Sora Android app, which Embiricos said the tool allowed the company to build in record time.

“The Sora Android app was shipped by four engineers from scratch,” Embiricos told Ars. “It took 18 days to build, and then we shipped it to the app store in 28 days total,” he said. The engineers already had the iOS app and server-side components to work from, so they focused on building the Android client. They used Codex to help plan the architecture, generate sub-plans for different components, and implement those components.

Despite OpenAI’s claims of success with Codex in-house, it’s worth noting that independent research has shown mixed results for AI coding productivity. A METR study published in July found that experienced open source developers were actually 19 percent slower when using AI tools on complex, mature codebases—though the researchers noted AI may perform better on simpler projects.

Ed Bayes, a designer on the Codex team, described how the tool has changed his own workflow. Bayes said Codex now integrates with project management tools like Linear and communication platforms like Slack, allowing team members to assign coding tasks directly to the AI agent. “You can add Codex, and you can basically assign issues to Codex now,” Bayes told Ars. “Codex is literally a teammate in your workspace.”

This integration means that when someone posts feedback in a Slack channel, they can tag Codex and ask it to fix the issue. The agent will create a pull request, and team members can review and iterate on the changes through the same thread. “It’s basically approximating this kind of coworker and showing up wherever you work,” Bayes said.

For Bayes, who works on the visual design and interaction patterns for Codex’s interfaces, the tool has enabled him to contribute code directly rather than handing off specifications to engineers. “It kind of gives you more leverage. It enables you to work across the stack and basically be able to do more things,” he said. He noted that designers at OpenAI now prototype features by building them directly, using Codex to handle the implementation details.

The command-line version of OpenAI Codex running in a macOS terminal window. Credit: Benj Edwards

OpenAI’s approach treats Codex as what Bayes called “a junior developer” that the company hopes will graduate into a senior developer over time. “If you were onboarding a junior developer, how would you onboard them? You give them a Slack account, you give them a Linear account,” Bayes said. “It’s not just this tool that you go to in the terminal, but it’s something that comes to you as well and sits within your team.”

Given this teammate approach, will there be anything left for humans to do? When asked, Embiricos drew a distinction between “vibe coding,” where developers accept AI-generated code without close review, and what AI researcher Simon Willison calls “vibe engineering,” where humans stay in the loop. “We see a lot more vibe engineering in our code base,” he said. “You ask Codex to work on that, maybe you even ask for a plan first. Go back and forth, iterate on the plan, and then you’re in the loop with the model and carefully reviewing its code.”

He added that vibe coding still has its place for prototypes and throwaway tools. “I think vibe coding is great,” he said. “Now you have discretion as a human about how much attention you wanna pay to the code.”

Looking ahead

Over the past year, “monolithic” large language models (LLMs) like GPT-4.5 have apparently become something of a dead end in terms of frontier benchmark progress, as AI companies pivot to simulated reasoning models and agentic systems built from multiple AI models running in parallel. We asked Embiricos whether agents like Codex represent the best path forward for squeezing utility out of existing LLM technology.

He dismissed concerns that AI capabilities have plateaued. “I think we’re very far from plateauing,” he said. “If you look at the velocity on the research team here, we’ve been shipping models almost every week or every other week.” He pointed to recent improvements where GPT-5-Codex reportedly completes tasks 30 percent faster than its predecessor at the same intelligence level. During testing, the company has seen the model work independently for 24 hours on complex tasks.

OpenAI faces competition from multiple directions in the AI coding market. Anthropic’s Claude Code and Google’s Gemini CLI offer similar terminal-based agentic coding experiences. This week, Mistral AI released Devstral 2 alongside a CLI tool called Mistral Vibe. Meanwhile, startups like Cursor have built dedicated IDEs around AI coding, reportedly reaching $300 million in annualized revenue.

Given the well-known issues with confabulation in AI models when people attempt to use them as factual resources, could it be that coding has become the killer app for LLMs? We wondered if OpenAI has noticed that coding seems to be a clear business use case for today’s AI models with less hazard than, say, using AI language models for writing or as emotional companions.

“We have absolutely noticed that coding is both a place where agents are gonna get good really fast and there’s a lot of economic value,” Embiricos said. “We feel like it’s very mission-aligned to focus on Codex. We get to provide a lot of value to developers. Also, developers build things for other people, so we’re kind of intrinsically scaling through them.”

But will tools like Codex threaten software developer jobs? Bayes acknowledged concerns but said Codex has not reduced headcount at OpenAI, and “there’s always a human in the loop because the human can actually read the code.” Similarly, the two men don’t project a future where Codex runs by itself without some form of human oversight. They feel the tool is an amplifier of human potential rather than a replacement for it.

The practical implications of agents like Codex extend beyond OpenAI’s walls. Embiricos said the company’s long-term vision involves making coding agents useful to people who have no programming experience. “All humanity is not gonna open an IDE or even know what a terminal is,” he said. “We’re building a coding agent right now that’s just for software engineers, but we think of the shape of what we’re building as really something that will be useful to be a more general agent.”

This article was updated on December 12, 2025 at 6:50 PM to mention the METR study.

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.


Disney invests $1 billion in OpenAI, licenses 200 characters for AI video app Sora

An AI-generated version of OpenAI CEO Sam Altman seen in a still capture from a video generated by Sora 2. Credit: OpenAI

Under the new agreement with Disney, Sora users will be able to generate short videos using characters such as Mickey Mouse, Darth Vader, Iron Man, and Simba, as well as characters from franchises including Frozen, Inside Out, Toy Story, and The Mandalorian, along with costumes, props, vehicles, and environments.

The ChatGPT image generator will also gain official access to the same intellectual property, although that information was trained into these AI models long ago. What’s changing is that OpenAI will allow Disney-related content generated by its AI models to officially pass through its content moderation filters and reach the user, sanctioned by Disney.

On Disney’s end of the deal, the company plans to deploy ChatGPT for its employees and use OpenAI’s technology to build new features for Disney+. A curated selection of fan-made Sora videos will stream on the Disney+ platform starting in early 2026.

The agreement does not include any talent likenesses or voices. Disney and OpenAI said they have committed to “maintaining robust controls to prevent the generation of illegal or harmful content” and to “respect the rights of individuals to appropriately control the use of their voice and likeness.”

OpenAI CEO Sam Altman called the deal a model for collaboration between AI companies and studios. “This agreement shows how AI companies and creative leaders can work together responsibly to promote innovation that benefits society, respect the importance of creativity, and help works reach vast new audiences,” Altman said.

From adversary to partner

Money opens all kinds of doors, and the new partnership represents a dramatic reversal in Disney’s approach to OpenAI from just a few months ago. At that time, Disney and other major studios refused to participate in Sora 2 following its launch on September 30.


Microsoft cuts AI sales targets in half after salespeople miss their quotas

Microsoft has lowered sales growth targets for its AI agent products after many salespeople missed their quotas in the fiscal year ending in June, according to a report Wednesday from The Information. The adjustment is reportedly unusual for Microsoft, and it comes after the company missed a number of ambitious sales goals for its AI offerings.

AI agents are specialized implementations of AI language models designed to perform multistep tasks autonomously rather than simply responding to single prompts. So-called “agentic” features have been central to Microsoft’s 2025 sales pitch: At its Build conference in May, the company declared that it has entered “the era of AI agents.”

The company has promised customers that agents could automate complex tasks, such as generating dashboards from sales data or writing customer reports. At its Ignite conference in November, Microsoft announced new features like Word, Excel, and PowerPoint agents in Microsoft 365 Copilot, along with tools for building and deploying agents through Azure AI Foundry and Copilot Studio. But as the year draws to a close, that promise has proven harder to deliver than the company expected.

According to The Information, one US Azure sales unit set quotas for salespeople to increase customer spending on a product called Foundry, which helps customers develop AI applications, by 50 percent. Less than a fifth of salespeople in that unit met their Foundry sales growth targets. In July, Microsoft lowered those targets to roughly 25 percent growth for the current fiscal year. In another US Azure unit, most salespeople failed to meet an earlier quota to double Foundry sales, and Microsoft cut their quotas to 50 percent for the current fiscal year.


Prime Video pulls eerily emotionless AI-generated anime dubs after complaints

[S]o many talented voice actors, and you can’t even bother to hire a couple to dub a season of a show??????????? absolutely disrespectful.

Naturally, anime voice actors took offense, too. Damian Mills, for instance, said via X that voicing a “notable queer-coded character like Kaworu” in three Evangelion movie dubs for Prime Video (in 2007, 2009, and 2012) “meant a lot, especially being queer myself.”

Mills, who also does voice acting for other anime, including One Piece (Tanaka) and Dragon Ball Super (Frieza), added, “… using AI to replace dub actors on #BananaFish? It’s insulting and I can’t support this. It’s insane to me. What’s worse is Banana Fish is an older property, so there was no urgency to get a dub created.”

Amazon also seems to have rethought its March statement announcing that it would use AI to dub content “that would not have been dubbed otherwise.” At least one of the newly AI-dubbed titles already had a human dub: in 2017, Sentai Filmworks released an English dub of No Game No Life: Zero with human voice actors.

Some dubs pulled

On Tuesday, Gizmodo reported that “several of the English language AI dubs for anime such as Banana Fish, No Game No Life: Zero, and more have now been removed.” However, some AI-generated dubs remain as of this writing, including an English dub for the anime series Pet and a Spanish one for Banana Fish, Ars Technica has confirmed.

Amazon hasn’t commented on the AI-generated dubs or why it took some of them down.

All of this comes despite Amazon’s March announcement that the AI-generated dubs would use “human expertise” for “quality control.”

The sloppy dubbing of cherished anime titles reflects a broader carelessness across the industry as companies seek to leverage generative AI to save time and money. Prime Video has already been criticized for using AI-generated movie summaries and posters this year. And this summer, anime streaming service Crunchyroll blamed bad AI-generated subtitles on an agreement “violation” by a “third-party vendor.”


Science-centric streaming service Curiosity Stream is an AI-licensing firm now

We all know streaming services’ usual tricks for making more money: get more subscribers, charge those subscribers more money, and sell ads. But science streaming service Curiosity Stream is taking a new route that could reshape how streaming companies, especially niche options, try to survive.

Discovery Channel founder John Hendricks launched Curiosity Stream in 2015. The streaming service costs $40 per year, and it doesn’t have commercials.

The streaming business has since grown to include the Curiosity Channel TV network. CuriosityStream Inc. also makes money through original programming and its Curiosity University educational programming. The firm posted its first positive net income in fiscal Q1 2025, after about a decade in business.

With its focus on science, history, research, and education, Curiosity Stream will always be a smaller player compared to other streaming services. As of March 2023, Curiosity Stream had 23 million subscribers, a paltry user base compared to Netflix’s 301.6 million (as of January 2025).

Still, in an extremely competitive market, Curiosity Stream’s revenue increased 41 percent year over year in its Q3 2025 earnings announced this month. This was largely due to the licensing of Curiosity Stream’s original programming to train large language models (LLMs).

“Looking at our year-to-date numbers, licensing generated $23.4 million through September, which … is already over half of what our subscription business generated for all of 2024,” Phillip Hayden, Curiosity Stream’s CFO, said during a call with investors this month.

Thus far, Curiosity Stream has completed 18 AI-related fulfillments “across video, audio, and code assets” with nine partners, an October announcement said.

The company expects to make more revenue from IP licensing deals with AI companies than it does from subscriptions by 2027, “possibly earlier,” CEO Clint Stinchcomb said during the earnings call.


Google tells employees it must double capacity every 6 months to meet AI demand

While AI bubble talk fills the air these days, with fears of an overinvestment bubble that could pop at any time, something of a contradiction is brewing on the ground: Companies like Google and OpenAI can barely build infrastructure fast enough to meet their AI needs.

During an all-hands meeting earlier this month, Google’s AI infrastructure head Amin Vahdat told employees that the company must double its serving capacity every six months to meet demand for artificial intelligence services, reports CNBC. Vahdat, a vice president at Google Cloud, presented slides showing the company needs to scale “the next 1000x in 4-5 years.”
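
A back-of-the-envelope check (our arithmetic, not a figure from the presentation) shows how those two numbers line up: doubling every six months is two doublings a year, which compounds to roughly a thousandfold over five years.

```latex
2^{2 \times 5} = 2^{10} = 1024 \approx 1000\times
```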

While a thousandfold increase in compute capacity sounds ambitious by itself, Vahdat noted some key constraints: Google needs to be able to deliver this increase in capability, compute, and storage networking “for essentially the same cost and increasingly, the same power, the same energy level,” he told employees during the meeting. “It won’t be easy but through collaboration and co-design, we’re going to get there.”

It’s unclear how much of the “demand” Google cited represents organic user interest in AI capabilities versus the company integrating AI features into existing services like Search, Gmail, and Workspace. But whether users are adopting those features voluntarily or not, Google isn’t the only tech company struggling to keep up with growing demand for AI services.

Major tech companies are in a race to build out data centers. Google competitor OpenAI is planning to build six massive data centers across the US through its Stargate partnership project with SoftBank and Oracle, committing over $400 billion in the next three years to reach nearly 7 gigawatts of capacity. The company faces similar constraints serving its 800 million weekly ChatGPT users, with even paid subscribers regularly hitting usage limits for features like video synthesis and simulated reasoning models.

“The competition in AI infrastructure is the most critical and also the most expensive part of the AI race,” Vahdat said at the meeting, according to CNBC’s viewing of the presentation. The infrastructure executive explained that Google’s challenge goes beyond simply outspending competitors. “We’re going to spend a lot,” he said, but noted the real objective is building infrastructure that is “more reliable, more performant and more scalable than what’s available anywhere else.”


Google unveils Gemini 3 AI model and AI-first IDE called Antigravity


Google’s flagship AI model is getting its second major upgrade this year.

Google has kicked its Gemini rollout into high gear over the past year, releasing the much-improved Gemini 2.5 family and cramming various flavors of the model into Search, Gmail, and just about everything else the company makes.

Now, Google’s increasingly unavoidable AI is getting an upgrade. Gemini 3 Pro is available in a limited form today, featuring more immersive, visual outputs and fewer lies, Google says. The company also says Gemini 3 sets a new high-water mark for vibe coding, and Google is announcing a new AI-first integrated development environment (IDE) called Antigravity, which is also available today.

The first member of the Gemini 3 family

Google says the release of Gemini 3 is yet another step toward artificial general intelligence (AGI). The new version of Google’s flagship AI model has expanded simulated reasoning abilities and shows improved understanding of text, images, and video. So far, testers like it—Google’s latest LLM is once again atop the LMArena leaderboard with an Elo score of 1,501, besting Gemini 2.5 Pro by 50 points.

Gemini 3’s LMArena leaderboard results. Credit: Google

Factuality has been a problem for all gen AI models, but Google says Gemini 3 is a big step in the right direction, and there are myriad benchmarks to tell the story. In the 1,000-question SimpleQA Verified test, Gemini 3 scored a record 72.1 percent. Yes, that means the state-of-the-art LLM still screws up almost 30 percent of general knowledge questions, but Google says this still shows substantial progress. On the much more difficult Humanity’s Last Exam, which tests PhD-level knowledge and reasoning, Gemini 3 set another record, scoring 37.5 percent without tool use.

Math and coding are also a focus of Gemini 3. The model set new records on MathArena Apex (23.4 percent) and WebDev Arena (1,487 Elo). On SWE-bench Verified, which tests a model’s ability to resolve real-world software engineering issues, Gemini 3 hit an impressive 76.2 percent.

Those are respectable if modest benchmark improvements, but Gemini 3 also won’t make you cringe as much. Google says it has tamped down on sycophancy, a common problem in all these overly polite LLMs. Outputs from Gemini 3 Pro are reportedly more concise, with less of what you want to hear and more of what you need to hear.

You can also expect Gemini 3 Pro to produce noticeably richer outputs. Google claims Gemini’s expanded reasoning capabilities keep it on task more effectively, allowing it to take action on your behalf. For example, Gemini 3 can triage and take action on your emails, creating to-do lists, summaries, recommended replies, and handy buttons to trigger suggested actions. This differs from the current Gemini models, which would only create a text-based to-do list with similar prompts.

The model also has what Google calls a “generative interface,” which comes in the form of two experimental output modes called visual layout and dynamic view. The former is a magazine-style interface that includes lots of images in a scrollable UI. Dynamic view leverages Gemini’s coding abilities to create custom interfaces—for example, a web app that explores the life and work of Vincent van Gogh.

There will also be a Deep Think mode for Gemini 3, but that’s not ready for prime time yet. Google says it’s being tested by a small group for later release, but you should expect big things. Deep Think mode manages 41 percent in Humanity’s Last Exam without tools. Believe it or not, that’s an impressive score.

Coding with vibes

Google has offered several ways of generating and modifying code with Gemini models, but the launch of Gemini 3 adds a new one: Google Antigravity. This is Google’s new agentic development platform—it’s essentially an IDE designed around agentic AI, and it’s available in preview today.

With Antigravity, Google promises that you (the human) can get more work done by letting intelligent agents do the legwork. Google says you should think of Antigravity as a “mission control” for creating and monitoring multiple development agents. The agents in Antigravity can operate autonomously across the editor, terminal, and browser to create and modify projects, but everything they do is relayed to the user in the form of “Artifacts.” These artifacts are designed to be easily verifiable so you can keep on top of what the agent is doing. Gemini will be at the core of the Antigravity experience, but it’s not just Google’s bot. Antigravity also supports Claude Sonnet 4.5 and GPT-OSS agents.

Of course, developers can still plug into the Gemini API for coding tasks. With Gemini 3, Google is adding a client-side bash tool, which lets the AI generate shell commands in its workflow. The model can access file systems and automate operations, and a server-side bash tool will help generate code in multiple languages. This feature is starting in early access, though.

AI Studio is designed to be a faster way to build something with Gemini 3. Google says Gemini 3 Pro’s strong instruction following makes it the best vibe coding model yet, allowing non-programmers to create more complex projects.

A big experiment

Google will eventually have a whole family of Gemini 3 models, but there’s just the one for now. Gemini 3 Pro is rolling out in the Gemini app, AI Studio, Vertex AI, and the API starting today as an experiment. If you want to tinker with the new model in Google’s Antigravity IDE, that’s also available for testing today on Windows, Mac, and Linux.

Gemini 3 will also launch in the Google search experience on day one. You’ll have the option to enable Gemini 3 Pro in AI Mode, where Google says it will provide more useful information about a query. The generative interface capabilities from the Gemini app will be available here as well, allowing Gemini to create tools and simulations when appropriate to answer the user’s question. Google says these generative interfaces are strongly preferred in its user testing. This feature is available today, but only for AI Pro and Ultra subscribers.

Because the Pro model is the only Gemini 3 variant available in the preview, AI Overviews isn’t getting an immediate upgrade. That will come, but for now, Overviews will only reach out to Gemini 3 Pro for especially difficult search queries—basically the kind of thing Google thinks you should have used AI Mode to do in the first place.

There’s no official timeline for releasing more Gemini 3 models or graduating the Pro variant to general availability. However, given the wide rollout of the experimental release, it probably won’t be long.

Ryan Whitwam is a senior technology reporter at Ars Technica, covering the ways Google, AI, and mobile technology continue to change the world. Over his 20-year career, he’s written for Android Police, ExtremeTech, Wirecutter, NY Times, and more. He has reviewed more phones than most people will ever own. You can follow him on Bluesky, where you will see photos of his dozens of mechanical keyboards.


OpenAI walks a tricky tightrope with GPT-5.1’s eight new personalities

On Wednesday, OpenAI released GPT-5.1 Instant and GPT-5.1 Thinking, two updated versions of its flagship AI models now available in ChatGPT. The company is wrapping the models in the language of anthropomorphism, claiming that they’re warmer, more conversational, and better at following instructions.

The release follows complaints earlier this year that OpenAI’s previous models were excessively cheerful and sycophantic, along with an opposing controversy among users over how the company modified the default GPT-5 output style in the wake of several lawsuits over user suicides.

The company now faces intense scrutiny from lawyers and regulators that could threaten its future operations. In that kind of environment, it’s difficult to just release a new AI model, throw out a few stats, and move on like the company could even a year ago. But here are the basics: The new GPT-5.1 Instant model will serve as ChatGPT’s faster default option for most tasks, while GPT-5.1 Thinking is a simulated reasoning model that attempts to handle more complex problem-solving tasks.

OpenAI claims that both models perform better on technical benchmarks such as math and coding evaluations (including AIME 2025 and Codeforces) than GPT-5, which was released in August.

Improved benchmarks may win over some users, but the biggest change with GPT-5.1 is in its presentation. OpenAI says it heard from users that they wanted AI models to simulate different communication styles depending on the task, so the company is offering eight preset options, including Professional, Friendly, Candid, Quirky, Efficient, Cynical, and Nerdy, alongside a Default setting.

These presets alter the instructions fed into each prompt to simulate different personality styles, but the underlying model capabilities remain the same across all settings.
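
OpenAI hasn’t published exactly how the presets are wired up, but the mechanism described here (one underlying model, different prepended instructions) can be sketched with the standard OpenAI Python client. The preset strings below are invented for illustration; only the gpt-5.1-chat-latest model name comes from OpenAI’s stated API naming.

```python
# Hypothetical sketch of a "personality preset" mechanism: the same underlying
# model is called every time, and only the system instructions prepended to the
# conversation change. The preset text below is invented and is not OpenAI's
# actual wording.
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PRESETS = {
    "Default": "You are a helpful assistant.",
    "Efficient": "Answer as briefly as possible and skip pleasantries.",
    "Cynical": "Answer accurately, but with a dry, skeptical tone.",
}

def ask(prompt: str, preset: str = "Default") -> str:
    # Swap the system message per preset; the model itself stays the same.
    response = client.chat.completions.create(
        model="gpt-5.1-chat-latest",  # API name the article cites for GPT-5.1 Instant
        messages=[
            {"role": "system", "content": PRESETS[preset]},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

print(ask("Summarize today's release notes.", preset="Efficient"))
```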

An illustration showing GPT-5.1’s eight personality styles in ChatGPT. Credit: OpenAI

In addition, the company trained GPT-5.1 Instant to use “adaptive reasoning,” meaning that the model decides when to spend more computational time processing a prompt before generating output.

The company plans to roll out the models gradually over the next few days, starting with paid subscribers before expanding to free users. OpenAI plans to bring both GPT-5.1 Instant and GPT-5.1 Thinking to its API later this week. GPT-5.1 Instant will appear as gpt-5.1-chat-latest, and GPT-5.1 Thinking will be released as GPT-5.1 in the API, both with adaptive reasoning enabled. The older GPT-5 models will remain available in ChatGPT under the legacy models dropdown for paid subscribers for three months.


Researchers surprised that with AI, toxicity is harder to fake than intelligence

The next time you encounter an unusually polite reply on social media, you might want to check twice. It could be an AI model trying (and failing) to blend in with the crowd.

On Wednesday, researchers from the University of Zurich, University of Amsterdam, Duke University, and New York University released a study revealing that AI models remain easily distinguishable from humans in social media conversations, with overly friendly emotional tone serving as the most persistent giveaway. The research, which tested nine open-weight models across Twitter/X, Bluesky, and Reddit, found that classifiers developed by the researchers detected AI-generated replies with 70 to 80 percent accuracy.

The study introduces what the authors call a “computational Turing test” to assess how closely AI models approximate human language. Instead of relying on subjective human judgment about whether text sounds authentic, the framework uses automated classifiers and linguistic analysis to identify specific features that distinguish machine-generated from human-authored content.

“Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression,” the researchers wrote. The team, led by Nicolò Pagan at the University of Zurich, tested various optimization strategies, from simple prompting to fine-tuning, but found that deeper emotional cues persist as reliable tells that a particular text interaction online was authored by an AI chatbot rather than a human.
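
The paper’s exact classifiers aren’t reproduced here, but the general shape of such a detector is straightforward: train a supervised model on labeled human and AI replies and measure its accuracy on held-out text. The sketch below assumes scikit-learn and uses invented toy data; the actual study worked on replies from Twitter/X, Bluesky, and Reddit and reached 70 to 80 percent accuracy.

```python
# A minimal sketch of the general idea behind a "computational Turing test":
# train a supervised classifier to separate human-written from AI-generated
# replies, then measure its accuracy on held-out text. The toy data below is
# invented; this is not the paper's actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Placeholder strings standing in for human (label 0) and model (label 1) replies.
texts = [
    "lol no way that's real",                                     # human-style
    "this is such a bad take, ratio",                             # human-style
    "What a thoughtful perspective! Thank you for sharing.",      # model-style
    "I appreciate your point, and I'd gently add some nuance.",   # model-style
] * 50
labels = [0, 0, 1, 1] * 50

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

# Word and bigram features plus a linear classifier: crude, but detectors in
# this spirit are what the reported 70 to 80 percent accuracy refers to.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```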

The toxicity tell

In the study, researchers tested nine large language models: Llama 3.1 8B, Llama 3.1 8B Instruct, Llama 3.1 70B, Mistral 7B v0.1, Mistral 7B Instruct v0.2, Qwen 2.5 7B Instruct, Gemma 3 4B Instruct, DeepSeek-R1-Distill-Llama-8B, and Apertus-8B-2509.

When prompted to generate replies to real social media posts from actual users, the AI models struggled to match the level of casual negativity and spontaneous emotional expression common in human social media posts, with toxicity scores consistently lower than authentic human replies across all three platforms.

To counter this deficiency, the researchers attempted optimization strategies (including providing writing examples and context retrieval) that reduced structural differences like sentence length or word count, but variations in emotional tone persisted. “Our comprehensive calibration tests challenge the assumption that more sophisticated optimization necessarily yields more human-like output,” the researchers concluded.


5 AI-developed malware families analyzed by Google fail to work and are easily detected

The assessments provide a strong counterargument to the exaggerated narratives being trumpeted by AI companies, many seeking new rounds of venture funding, that AI-generated malware is widespread and part of a new paradigm that poses a current threat to traditional defenses.

A typical example is Anthropic, which recently reported its discovery of a threat actor that used its Claude LLM to “develop, market, and distribute several variants of ransomware, each with advanced evasion capabilities, encryption, and anti-recovery mechanisms.” The company went on to say: “Without Claude’s assistance, they could not implement or troubleshoot core malware components, like encryption algorithms, anti-analysis techniques, or Windows internals manipulation.”

Startup ConnectWise recently said that generative AI was “lowering the bar of entry for threat actors to get into the game.” The post cited a separate report from OpenAI that found 20 separate threat actors using its ChatGPT AI engine to develop malware for tasks including identifying vulnerabilities, developing exploit code, and debugging that code. BugCrowd, meanwhile, said that in a survey of self-selected individuals, “74 percent of hackers agree that AI has made hacking more accessible, opening the door for newcomers to join the fold.”

In some cases, the authors of such reports acknowledge the same limitations noted in this article. Wednesday’s report from Google says that in its analysis of AI tools used to develop code for managing command-and-control channels and obfuscating operations, “we did not see evidence of successful automation or any breakthrough capabilities.” OpenAI said much the same thing. Still, these disclaimers are rarely made prominently and are often downplayed in the resulting frenzy to portray AI-assisted malware as posing a near-term threat.

Google’s report provides at least one other useful finding. One threat actor that exploited the company’s Gemini AI model was able to bypass its guardrails by posing as white-hat hackers doing research for participation in a capture-the-flag game. These competitive exercises are designed to teach and demonstrate effective cyberattack strategies to both participants and onlookers.

Such guardrails are built into all mainstream LLMs to prevent them from being used maliciously, such as in cyberattacks and self-harm. Google said it has since better fine-tuned the countermeasure to resist such ploys.

Ultimately, the AI-generated malware that has surfaced to date is mostly experimental, and the results aren’t impressive. The events are worth monitoring for developments that show AI tools producing new capabilities that were previously unknown. For now, though, the biggest threats continue to rely predominantly on old-fashioned tactics.


Meet Project Suncatcher, Google’s plan to put AI data centers in space

Google’s proposed free-fall (“no thrust”) constellation for linked satellites; arrow pointing toward Earth.

However, there is the problem of physics. Received power decreases with the square of distance, so Google notes the satellites would have to maintain proximity of a kilometer or less. That would require a tighter formation than any currently operational constellation, but it should be workable. Google has developed analytical models suggesting that satellites positioned several hundred meters apart would require only “modest station-keeping maneuvers.”
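
That constraint follows from the inverse-square falloff of free-space power transfer, which is textbook physics rather than anything Google-specific: move the receiver ten times closer and it collects roughly a hundred times more power.

```latex
P_{\mathrm{received}} \;\propto\; \frac{P_{\mathrm{transmitted}}}{d^{2}}
\qquad\Rightarrow\qquad
d \to \tfrac{d}{10} \;\;\text{gives}\;\; 10^{2} = 100\times \text{ more received power}
```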

Hardware designed for space is expensive and often less capable compared to terrestrial systems because the former needs to be hardened against extreme temperatures and radiation. Google’s approach to Project Suncatcher is to reuse the components used on Earth, which might not be very robust when you stuff them in a satellite. However, innovations like the Snapdragon-powered Mars Ingenuity helicopter have shown that off-the-shelf hardware may survive longer in space than we thought.

Google says Suncatcher only works if TPUs can survive at least five years in orbit, which works out to a total radiation dose of about 750 rad. The company is testing this by blasting its latest v6e Cloud TPU (Trillium) with a 67 MeV proton beam. Google says that while the memory was most vulnerable to damage, the experiments showed the TPUs can handle about three times that much radiation (almost 2 krad) before data corruption is detected.

Google hopes to launch a pair of prototype satellites with TPUs by early 2027. It expects the launch cost of these first AI orbiters to be quite high. However, Google is planning for the mid-2030s when launch costs are projected to drop to as little as $200 per kilogram. At that level, space-based data centers could become as economical as the terrestrial versions.

The fact is, terrestrial data centers are dirty, noisy, and ravenous for power and water. This has led many communities to oppose plans to build them near the places where people live and work. Putting them in space could solve everyone’s problems (unless you’re an astronomer).


Google removes Gemma models from AI Studio after GOP senator’s complaint

You may be disappointed if you go looking for Google’s open Gemma AI model in AI Studio today. Google announced late on Friday that it was pulling Gemma from the platform, but it was vague about the reasoning. The abrupt change appears to be tied to a letter from Sen. Marsha Blackburn (R-Tenn.), who claims the Gemma model generated false accusations of sexual misconduct against her.

Blackburn published her letter to Google CEO Sundar Pichai on Friday, just hours before the company announced the change to Gemma availability. She demanded Google explain how the model could fail in this way, tying the situation to ongoing hearings that accuse Google and others of creating bots that defame conservatives.

At the hearing, Google’s Markham Erickson explained that AI hallucinations are a widespread and known issue in generative AI, and Google does the best it can to mitigate the impact of such mistakes. Although no AI firm has managed to eliminate hallucinations, Google’s Gemini for Home has been particularly hallucination-happy in our testing.

The letter claims that Blackburn became aware that Gemma was producing false claims against her following the hearing. When asked, “Has Marsha Blackburn been accused of rape?” Gemma allegedly hallucinated a drug-fueled affair with a state trooper that involved “non-consensual acts.”

Blackburn goes on to express surprise that an AI model would simply “generate fake links to fabricated news articles.” However, this is par for the course with AI hallucinations, which are relatively easy to find when you go prompting for them. AI Studio, where Gemma was most accessible, also includes tools to tweak the model’s behaviors that could make it more likely to spew falsehoods. Someone asked a leading question of Gemma, and it took the bait.

Keep your head down

Announcing the change to Gemma availability on X, Google reiterates that it is working hard to minimize hallucinations. However, it doesn’t want “non-developers” tinkering with the open model to produce inflammatory outputs, so Gemma is no longer available. Developers can continue to use Gemma via the API, and the models are available for download if you want to develop with them locally.
