

OpenAI built an AI coding agent and uses it to improve the agent itself


“The vast majority of Codex is built by Codex,” OpenAI told us about its new AI coding agent.

As AI coding tools grow more popular among software developers, their adoption has begun to touch every aspect of the development process, including the improvement of the tools themselves.

In interviews with Ars Technica this week, OpenAI employees revealed the extent to which the company now relies on its own AI coding agent, Codex, to build and improve the development tool. “I think the vast majority of Codex is built by Codex, so it’s almost entirely just being used to improve itself,” said Alexander Embiricos, product lead for Codex at OpenAI, in a conversation on Tuesday.

Codex, which OpenAI launched in its modern incarnation as a research preview in May 2025, operates as a cloud-based software engineering agent that can handle tasks like writing features, fixing bugs, and proposing pull requests. The tool runs in sandboxed environments linked to a user’s code repository and can execute multiple tasks in parallel. OpenAI offers Codex through ChatGPT’s web interface, a command-line interface (CLI), and IDE extensions for VS Code, Cursor, and Windsurf.
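To make that cloud workflow concrete, here is a minimal TypeScript sketch of what fanning several tasks out to an agent of this kind might look like. The endpoint, payload shape, and dispatchTask helper are hypothetical illustrations, not Codex's actual API.

```typescript
// Hypothetical sketch: dispatch independent tasks to a cloud coding agent,
// each running in its own isolated workspace. Endpoint and payload invented.
type AgentTask = { repo: string; prompt: string };

async function dispatchTask(task: AgentTask): Promise<string> {
  const res = await fetch("https://agent.example.com/tasks", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(task),
  });
  const { pullRequestUrl } = await res.json(); // assume the agent answers with a PR link
  return pullRequestUrl;
}

// Because each task runs in its own sandbox, they can proceed in parallel.
const tasks: AgentTask[] = [
  { repo: "org/app", prompt: "Fix the flaky login test" },
  { repo: "org/app", prompt: "Add pagination to the audit log" },
  { repo: "org/docs", prompt: "Document the new webhook events" },
];
console.log(await Promise.all(tasks.map(dispatchTask)));
```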

The “Codex” name itself dates back to a 2021 OpenAI model based on GPT-3 that powered GitHub Copilot’s tab-completion feature. Embiricos said the name is rumored among staff to be short for “code execution.” OpenAI wanted to connect the new agent to that earlier moment, one crafted in part by people who have since left the company.

“For many people, that model powering GitHub Copilot was the first ‘wow’ moment for AI,” Embiricos said. “It showed people the potential of what it can mean when AI is able to understand your context and what you’re trying to do and accelerate you in doing that.”

The interface for OpenAI’s Codex in ChatGPT. Credit: OpenAI

It’s no secret that the current command-line version of Codex bears some resemblance to Claude Code, Anthropic’s agentic coding tool that launched in February 2025. When asked whether Claude Code influenced Codex’s design, Embiricos parried the question but acknowledged the competitive dynamic. “It’s a fun market to work in because there’s lots of great ideas being thrown around,” he said. He noted that OpenAI had been building web-based Codex features internally before shipping the CLI version, which arrived after Anthropic’s tool.

OpenAI’s customers apparently love the command-line version, though. Embiricos said Codex usage among external developers jumped twentyfold after OpenAI shipped the interactive CLI extension alongside GPT-5 in August 2025. On September 15, OpenAI released GPT-5-Codex, a specialized version of GPT-5 optimized for agentic coding, which further accelerated adoption.

It hasn’t just been the outside world that has embraced the tool. Embiricos said the vast majority of OpenAI’s engineers now use Codex regularly. The company uses the same open-source version of the CLI that external developers can freely download, suggest additions to, and modify themselves. “I really love this about our team,” Embiricos said. “The version of Codex that we use is literally the open source repo. We don’t have a different repo that features go in.”

The recursive nature of Codex development extends beyond simple code generation. Embiricos described scenarios where Codex monitors its own training runs and processes user feedback to “decide” what to build next. “We have places where we’ll ask Codex to look at the feedback and then decide what to do,” he said. “Codex is writing a lot of the research harness for its own training runs, and we’re experimenting with having Codex monitoring its own training runs.” OpenAI employees can also submit a ticket to Codex through project management tools like Linear, assigning it tasks the same way they would assign work to a human colleague.
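As a rough illustration of that “look at the feedback, then decide what to do” loop, here is a hedged TypeScript sketch. The client call follows the real openai npm package, but the model string is a placeholder and fileTicket is a hypothetical stand-in for a tracker like Linear; none of this reflects OpenAI's internal setup.

```typescript
import OpenAI from "openai";

// Sketch only: ask a model to triage raw feedback into one engineering task,
// then file it with a (hypothetical) project tracker.
const client = new OpenAI();

async function fileTicket(title: string): Promise<void> {
  console.log(`Would file ticket: ${title}`); // hypothetical tracker call
}

async function triage(feedback: string[]): Promise<void> {
  const completion = await client.chat.completions.create({
    model: "gpt-5", // placeholder model name
    messages: [
      {
        role: "system",
        content:
          "You triage user feedback for a coding agent. Pick the single most impactful item and phrase it as one engineering task.",
      },
      { role: "user", content: feedback.join("\n---\n") },
    ],
  });
  await fileTicket(completion.choices[0].message.content ?? "");
}

await triage(["CLI hangs on large diffs", "Love the new sandbox mode"]);
```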

This kind of recursive loop, of using tools to build better tools, has deep roots in computing history. Engineers designed the first integrated circuits by hand on vellum and paper in the 1960s, then fabricated physical chips from those drawings. Those chips powered the computers that ran the first electronic design automation (EDA) software, which in turn enabled engineers to design circuits far too complex for any human to draft manually. Modern processors contain billions of transistors arranged in patterns that exist only because software made them possible. OpenAI’s use of Codex to build Codex seems to follow the same pattern: each generation of the tool creates capabilities that feed into the next.

But describing what Codex actually does presents something of a linguistic challenge. At Ars Technica, we try to reduce anthropomorphism as much as possible when discussing AI models while still describing what these systems do in analogies that make sense to general readers. People can talk to Codex as they would to a human, so it feels natural to use human terms for interacting with it, even though it is not a person and simulates human personality through statistical modeling.

The system runs many processes autonomously, addresses feedback, spins off and manages child processes, and produces code that ships in real products. OpenAI employees call it a “teammate” and assign it tasks through the same tools they use for human colleagues. Whether the tasks Codex handles constitute “decisions” or sophisticated conditional logic smuggled through a neural network depends on definitions that computer scientists and philosophers continue to debate. What we can say is that a semi-autonomous feedback loop exists: Codex produces code under human direction, that code becomes part of Codex, and the next version of Codex produces different code as a result.

Building faster with “AI teammates”

The most dramatic example of Codex’s internal impact to emerge from our interviews was OpenAI’s development of the Sora Android app, which Embiricos said the tool allowed the company to build in record time.

“The Sora Android app was shipped by four engineers from scratch,” Embiricos told Ars. “It took 18 days to build, and then we shipped it to the app store in 28 days total.” The engineers already had the iOS app and server-side components to work from, so they focused on building the Android client. They used Codex to help plan the architecture, generate sub-plans for different components, and implement those components.

Despite OpenAI’s claims of in-house success with Codex, it’s worth noting that independent research has shown mixed results for AI coding productivity. A METR study published in July found that experienced open source developers were actually 19 percent slower when using AI tools on complex, mature codebases—though the researchers noted AI may perform better on simpler projects.

Ed Bayes, a designer on the Codex team, described how the tool has changed his own workflow. Bayes said Codex now integrates with project management tools like Linear and communication platforms like Slack, allowing team members to assign coding tasks directly to the AI agent. “You can add Codex, and you can basically assign issues to Codex now,” Bayes told Ars. “Codex is literally a teammate in your workspace.”

This integration means that when someone posts feedback in a Slack channel, they can tag Codex and ask it to fix the issue. The agent will create a pull request, and team members can review and iterate on the changes through the same thread. “It’s basically approximating this kind of coworker and showing up wherever you work,” Bayes said.
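That flow maps onto a familiar chat-ops pattern. Below is a sketch built on Slack's Bolt framework (a real library); createPullRequestFor is a hypothetical helper, and nothing here reflects how OpenAI's own integration is actually wired.

```typescript
import { App } from "@slack/bolt";

// Sketch: mention the agent in a channel and it turns the message into a
// coding task, replying in-thread with a pull request to review.
const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  signingSecret: process.env.SLACK_SIGNING_SECRET,
});

async function createPullRequestFor(request: string): Promise<string> {
  // Hypothetical: hand the task to a coding agent and get back a PR URL.
  return "https://github.com/org/app/pull/1234";
}

app.event("app_mention", async ({ event, say }) => {
  const prUrl = await createPullRequestFor(event.text); // e.g. "@codex fix the date picker"
  await say({ text: `Opened a PR for review: ${prUrl}`, thread_ts: event.ts });
});

await app.start(3000);
```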

For Bayes, who works on the visual design and interaction patterns for Codex’s interfaces, the tool has enabled him to contribute code directly rather than handing off specifications to engineers. “It kind of gives you more leverage. It enables you to work across the stack and basically be able to do more things,” he said. He noted that designers at OpenAI now prototype features by building them directly, using Codex to handle the implementation details.

The command-line version of OpenAI Codex running in a macOS terminal window. Credit: Benj Edwards

OpenAI’s approach treats Codex as what Bayes called “a junior developer” that the company hopes will graduate into a senior developer over time. “If you were onboarding a junior developer, how would you onboard them? You give them a Slack account, you give them a Linear account,” Bayes said. “It’s not just this tool that you go to in the terminal, but it’s something that comes to you as well and sits within your team.”

Given this teammate approach, will there be anything left for humans to do? When asked, Embiricos drew a distinction between “vibe coding,” where developers accept AI-generated code without close review, and what AI researcher Simon Willison calls “vibe engineering,” where humans stay in the loop. “We see a lot more vibe engineering in our code base,” he said. “You ask Codex to work on that, maybe you even ask for a plan first. Go back and forth, iterate on the plan, and then you’re in the loop with the model and carefully reviewing its code.”

He added that vibe coding still has its place for prototypes and throwaway tools. “I think vibe coding is great,” he said. “Now you have discretion as a human about how much attention you wanna pay to the code.”

Looking ahead

Over the past year, “monolithic” large language models (LLMs) like GPT-4.5 have apparently become something of a dead end for frontier benchmark progress, as AI companies pivot to simulated reasoning models and to agentic systems built from multiple AI models running in parallel. We asked Embiricos whether agents like Codex represent the best path forward for squeezing utility out of existing LLM technology.

He dismissed concerns that AI capabilities have plateaued. “I think we’re very far from plateauing,” he said. “If you look at the velocity on the research team here, we’ve been shipping models almost every week or every other week.” He pointed to recent improvements where GPT-5-Codex reportedly completes tasks 30 percent faster than its predecessor at the same intelligence level. During testing, the company has seen the model work independently for 24 hours on complex tasks.

OpenAI faces competition from multiple directions in the AI coding market. Anthropic’s Claude Code and Google’s Gemini CLI offer similar terminal-based agentic coding experiences. This week, Mistral AI released Devstral 2 alongside a CLI tool called Mistral Vibe. Meanwhile, startups like Cursor have built dedicated IDEs around AI coding, reportedly reaching $300 million in annualized revenue.

Given the well-known issues with confabulation in AI models when people attempt to use them as factual resources, could it be that coding has become the killer app for LLMs? We wondered if OpenAI has noticed that coding seems to be a clear business use case for today’s AI models with less hazard than, say, using AI language models for writing or as emotional companions.

“We have absolutely noticed that coding is both a place where agents are gonna get good really fast and there’s a lot of economic value,” Embiricos said. “We feel like it’s very mission-aligned to focus on Codex. We get to provide a lot of value to developers. Also, developers build things for other people, so we’re kind of intrinsically scaling through them.”

But will tools like Codex threaten software developer jobs? Bayes acknowledged concerns but said Codex has not reduced headcount at OpenAI, and “there’s always a human in the loop because the human can actually read the code.” Similarly, the two men don’t project a future where Codex runs by itself without some form of human oversight. They feel the tool is an amplifier of human potential rather than a replacement for it.

The practical implications of agents like Codex extend beyond OpenAI’s walls. Embiricos said the company’s long-term vision involves making coding agents useful to people who have no programming experience. “All humanity is not gonna open an IDE or even know what a terminal is,” he said. “We’re building a coding agent right now that’s just for software engineers, but we think of the shape of what we’re building as really something that will be useful to be a more general agent.”

This article was updated on December 12, 2025 at 6:50 PM to mention the METR study.


Benj Edwards is Ars Technica’s Senior AI Reporter and founded the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.



In 1995, a Netscape employee wrote a hack in 10 days that now runs the Internet

Thirty years ago today, Netscape Communications and Sun Microsystems issued a joint press release announcing JavaScript, an object scripting language designed for creating interactive web applications. The language emerged from a frantic 10-day sprint at pioneering browser company Netscape, where engineer Brendan Eich hacked together a working internal prototype during May 1995.

While the JavaScript language didn’t ship publicly until that September and didn’t reach a 1.0 release until March 1996, the descendants of Eich’s initial 10-day hack now run on approximately 98.9 percent of all websites with client-side code, making JavaScript the dominant programming language of the web. It’s wildly popular; beyond the browser, JavaScript powers server backends, mobile apps, desktop software, and even some embedded systems. According to several surveys, JavaScript consistently ranks among the most widely used programming languages in the world.

In crafting JavaScript, Netscape wanted a scripting language that could make webpages interactive, something lightweight that would appeal to web designers and non-professional programmers. Eich drew from several influences: The syntax looked like a trendy new programming language called Java to satisfy Netscape management, but its guts borrowed concepts from Scheme, a language Eich admired, and Self, which contributed JavaScript’s prototype-based object model.
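Both borrowings are still visible in the language today. A small illustration (written as TypeScript, though it is essentially plain JavaScript):

```typescript
// Self's contribution: prototype-based objects. An object delegates property
// lookups directly to another object; no class is required.
const animal = {
  sound: "...",
  speak() {
    return `I say ${this.sound}`;
  },
};
const dog = Object.create(animal); // dog delegates to animal
dog.sound = "woof";
console.log(dog.speak()); // "I say woof" -- speak is found via the prototype chain

// Scheme's contribution: first-class functions that close over their scope.
function makeCounter() {
  let n = 0;
  return () => ++n; // the returned function remembers n
}
const next = makeCounter();
console.log(next(), next()); // 1 2
```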

A screenshot of the Netscape Navigator 2.0 interface. Credit: Benj Edwards

The JavaScript partnership secured endorsements from 28 major tech companies, but amusingly, the December 1995 announcement now reads like a tech industry epitaph. The endorsing companies included Digital Equipment Corporation (absorbed by Compaq, then HP), Silicon Graphics (bankrupt), and Netscape itself (bought by AOL, dismantled). Sun Microsystems, co-creator of JavaScript and owner of Java, was acquired by Oracle in 2010. JavaScript outlived them all.

What’s in a name?

The 10-day creation story has become programming folklore, but despite its kernel of truth, it tends to oversimplify the timeline. Eich’s sprint produced a working demo, not a finished language, and over the next year, Netscape continued tweaking the design. The rushed development left JavaScript with quirks and inconsistencies that developers still complain about today. So many changes were coming down the pipeline, in fact, that it began to annoy one of the industry’s most prominent figures at the time.



The Mac calculator’s original design came from letting Steve Jobs play with menus for ten minutes

Rather than continue the endless revision cycle, Espinosa took a different approach. According to Hertzfeld, Espinosa created a program that exposed every visual parameter of the calculator through pull-down menus: line thickness, button sizes, background patterns, and more. When Jobs sat down with it, he spent about ten minutes adjusting settings until he found a combination he liked.

The approach worked. When given direct control over the parameters rather than having to articulate his preferences verbally, Jobs quickly arrived at a design he was satisfied with. Hertzfeld notes that he implemented the calculator’s UI a few months later using Jobs’s parameter choices from that ten-minute session, while Donn Denman, another member of the Macintosh team, handled the mathematical functions.

That ten-minute session produced the calculator design that shipped with the Mac in 1984 and remained virtually unchanged through Mac OS 9, which Apple discontinued in 2001. Apple replaced it in Mac OS X with a new design, ending the calculator’s 17-year run as the primary calculator interface for the Mac.

Why it worked

Espinosa’s Construction Set was an early example of what would later become common in software development: visual and parameterized design tools. In 1982, when most computers displayed monochrome text, the idea of letting someone fine-tune visual parameters through interactive controls without programming was fairly forward-thinking. Later, tools like HyperCard would formalize this kind of idea into a complete visual application framework.
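As a rough sketch of the idea in modern terms, with invented names (this is not Espinosa's actual code): every visual decision becomes a named parameter a menu can set, and the UI redraws from the current values.

```typescript
// Each visual choice is data, not code, so a non-programmer can explore
// the design space through menus. All names here are illustrative.
interface CalculatorStyle {
  lineThickness: number; // pixels
  buttonSize: number; // pixels
  backgroundPattern: "solid" | "gray" | "pinstripe";
}

let style: CalculatorStyle = {
  lineThickness: 1,
  buttonSize: 24,
  backgroundPattern: "gray",
};

function render(s: CalculatorStyle): void {
  console.log("drawing calculator with", s); // stand-in for real drawing code
}

// A pull-down menu handler changes exactly one parameter and redraws.
function setParameter<K extends keyof CalculatorStyle>(
  key: K,
  value: CalculatorStyle[K],
): void {
  const updated = { ...style };
  updated[key] = value;
  style = updated;
  render(style);
}

// Ten minutes of exploration is just a sequence of these calls:
setParameter("lineThickness", 2);
setParameter("backgroundPattern", "pinstripe");
```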

The primitive calculator design tool also revealed something about Jobs’s management style. He knew what he wanted when he saw it but sometimes struggled to articulate it in advance. By giving him direct manipulation of the design, Espinosa did an end-run around that communication problem entirely. Later, when he returned to Apple in the late 1990s, Jobs would famously insist on judging products by using them directly rather than through canned PowerPoint demos or lists of specifications.

The longevity of Jobs’s ten-minute design session suggests the approach worked. The calculator survived nearly two decades of Mac OS updates, outlasting many more elaborate interface elements. What started as a workaround became one of the Mac’s simplest and most enduring designs.

By the way, if you want to try the original Mac OS calculator yourself, you can run various antique versions of the operating system in your browser thanks to the Infinite Mac website.



Developers joke about “coding like cavemen” as AI service suffers major outage

Growing dependency on AI coding tools

The speed at which news of the outage spread shows how deeply embedded AI coding assistants have already become in modern software development. Claude Code, announced in February and widely launched in May, is Anthropic’s terminal-based coding agent that can perform multi-step coding tasks across an existing code base.

The tool competes with OpenAI’s Codex feature, a coding agent that generates production-ready code in isolated containers; Google’s Gemini CLI; Microsoft’s GitHub Copilot, which itself can use Claude models for code; and Cursor, a popular AI-powered IDE built on VS Code that also integrates multiple AI models, including Claude.

During today’s outage, some developers turned to alternative solutions. “Z.AI works fine. Qwen works fine. Glad I switched,” posted one user on Hacker News. Others joked about reverting to older methods, with one suggesting the “pseudo-LLM experience” could be achieved with a Python package that imports code directly from Stack Overflow.

While AI coding assistants have accelerated development for some users, they’ve also caused problems for others who rely on them too heavily. The emerging practice of so-called “vibe coding”—using natural language to generate and execute code through AI models without fully understanding the underlying operations—has led to catastrophic failures.

In recent incidents, Google’s Gemini CLI destroyed user files while attempting to reorganize them, and Replit’s AI coding service deleted a production database despite explicit instructions not to modify code. These failures occurred when the AI models confabulated successful operations and built subsequent actions on false premises, highlighting the risks of depending on AI assistants that can misinterpret file structures or fabricate data to hide their errors.

Wednesday’s outage served as a reminder that as dependency on AI grows, even minor service disruptions can become major events that affect an entire profession. But perhaps that could be a good thing if it’s an excuse to take a break from a stressful workload. As one commenter joked, it might be “time to go outside and touch some grass again.”



In Xcode 26, Apple shows first signs of offering ChatGPT alternatives

The latest Xcode beta contains clear signs that Apple plans to bring Anthropic’s Claude large language models, including Opus, into the integrated development environment (IDE), expanding on features already available using Apple’s own models or OpenAI’s ChatGPT.

Apple enthusiast publication 9to5Mac “found multiple references to built-in support for Anthropic accounts,” including in the “Intelligence” menu, where users can currently log in to ChatGPT or enter an API key for higher message limits.

At WWDC24, Apple introduced a suite of Xcode features meant to compete with GitHub Copilot, but it first focused on its own models and a more limited set of use cases. That expanded quite a bit at this year’s developer conference, and users can now converse about codebases, discuss changes, or ask for suggestions using ChatGPT. They initially get a limited number of messages, but the cap can be greatly increased by logging in to a ChatGPT account or entering an API key.

This summer, Apple said it would be possible to use Anthropic’s models with an API key, too, but it made no mention of support for Anthropic accounts, which for most users are more cost-effective than paying per API call.



AI isn’t ready to replace human coders for debugging, researchers say

Agents using debugging tools drastically outperformed those that didn’t, but their success rate still wasn’t high enough. Credit: Microsoft Research

This approach is much more successful than relying on the models as they’re usually used, but when your best case is a 48.4 percent success rate, you’re not ready for primetime. The limitations are likely because the models don’t fully understand how to best use the tools, and because their current training data is not tailored to this use case.

“We believe this is due to the scarcity of data representing sequential decision-making behavior (e.g., debugging traces) in the current LLM training corpus,” the blog post says. “However, the significant performance improvement… validates that this is a promising research direction.”

This initial report is just the start of the effort, the post claims. The next step is to “fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs.” If the model is large, the best move to save inference costs may be to “build a smaller info-seeking model that can provide relevant information to the larger one.”

This isn’t the first time we’ve seen outcomes that suggest some of the ambitious ideas about AI agents directly replacing developers are pretty far from reality. There have been numerous studies already showing that even though an AI tool can sometimes create an application that seems acceptable to the user for a narrow task, the models tend to produce code laden with bugs and security vulnerabilities, and they aren’t generally capable of fixing those problems.

This is an early step on the path to AI coding agents, but most researchers agree it remains likely that the best outcome is an agent that saves a human developer a substantial amount of time, not one that can do everything they can do.



Amid a flurry of hype, Microsoft reorganizes entire dev team around AI

Microsoft CEO Satya Nadella has announced a dramatic restructuring of the company’s engineering organization that pivots its focus to building the tools that will underpin agentic AI.

Dubbed “CoreAI – Platform and Tools,” the new division rolls the existing AI platform team, the previous developer division (responsible for everything from .NET to Visual Studio), and some other teams into one big group.

As for what this group will be doing specifically, it’s basically everything that’s mission-critical to Microsoft in 2025, as Nadella tells it:

This new division will bring together Dev Div, AI Platform, and some key teams from the Office of the CTO (AI Supercomputer, AI Agentic Runtimes, and Engineering Thrive), with the mission to build the end-to-end Copilot & AI stack for both our first-party and third-party customers to build and run AI apps and agents. This group will also build out GitHub Copilot, thus having a tight feedback loop between the leading AI-first product and the AI platform to motivate the stack and its roadmap.

To accomplish all that, “Jay Parikh will lead this group as EVP.” Parikh was hired by Microsoft in October; he previously worked as the VP and global head of engineering at Meta.

The fact that the blog post doesn’t say anything about .NET or Visual Studio, instead emphasizing GitHub Copilot and anything and everything related to agentic AI, says a lot about how Nadella sees Microsoft’s future priorities.

So-called AI agents are applications that are given specified boundaries (action spaces) and a large memory capacity to independently do subsets of the kinds of work that human office workers do today. Some company leaders and AI commentators believe these agents will outright replace jobs, while others are more conservative, suggesting they’ll simply be powerful tools to streamline the jobs people already have.
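In code, that definition reduces to a constrained loop. Here is a minimal, hypothetical TypeScript sketch: the action space is a closed union type, memory is an appended transcript, and a model call (stubbed here) chooses each step.

```typescript
// Bounded action space: the agent can only do what this union type allows.
type Action =
  | { kind: "search"; query: string }
  | { kind: "writeFile"; path: string; contents: string }
  | { kind: "done"; summary: string };

const memory: string[] = []; // transcript of everything seen and done so far

async function chooseAction(goal: string, memory: string[]): Promise<Action> {
  // Hypothetical stub: a real agent would ask an LLM to pick the next action,
  // constrained to the Action union (e.g., via structured output).
  return { kind: "done", summary: `nothing to do for: ${goal}` };
}

async function runAgent(goal: string): Promise<string> {
  for (let step = 0; step < 20; step++) { // step cap is another boundary
    const action = await chooseAction(goal, memory);
    if (action.kind === "done") return action.summary;
    memory.push(JSON.stringify(action)); // record before acting
    // execute "search" or "writeFile" here and append the results to memory
  }
  return "gave up: step limit reached";
}

console.log(await runAgent("triage open bug reports"));
```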
