agentic AI

openai-built-an-ai-coding-agent-and-uses-it-to-improve-the-agent-itself

OpenAI built an AI coding agent and uses it to improve the agent itself


“The vast majority of Codex is built by Codex,” OpenAI told us about its new AI coding agent.

With the popularity of AI coding tools rising among software developers, their adoption has begun to touch every aspect of the process, including the improvement of AI coding tools themselves.

In interviews with Ars Technica this week, OpenAI employees revealed the extent to which the company now relies on its own AI coding agent, Codex, to build and improve the development tool. “I think the vast majority of Codex is built by Codex, so it’s almost entirely just being used to improve itself,” said Alexander Embiricos, product lead for Codex at OpenAI, in a conversation on Tuesday.

Codex, which OpenAI launched in its modern incarnation as a research preview in May 2025, operates as a cloud-based software engineering agent that can handle tasks like writing features, fixing bugs, and proposing pull requests. The tool runs in sandboxed environments linked to a user’s code repository and can execute multiple tasks in parallel. OpenAI offers Codex through ChatGPT’s web interface, a command-line interface (CLI), and IDE extensions for VS Code, Cursor, and Windsurf.

The “Codex” name itself dates back to a 2021 OpenAI model based on GPT-3 that powered GitHub Copilot’s tab completion feature. Embiricos said the name is rumored among staff to be short for “code execution.” OpenAI wanted to connect the new agent to that earlier moment, which was crafted in part by some who have left the company.

“For many people, that model powering GitHub Copilot was the first ‘wow’ moment for AI,” Embiricos said. “It showed people the potential of what it can mean when AI is able to understand your context and what you’re trying to do and accelerate you in doing that.”

A place to enter a prompt, set parameters, and click

The interface for OpenAI’s Codex in ChatGPT. Credit: OpenAI

It’s no secret that the current command-line version of Codex bears some resemblance to Claude Code, Anthropic’s agentic coding tool that launched in February 2025. When asked whether Claude Code influenced Codex’s design, Embiricos parried the question but acknowledged the competitive dynamic. “It’s a fun market to work in because there’s lots of great ideas being thrown around,” he said. He noted that OpenAI had been building web-based Codex features internally before shipping the CLI version, which arrived after Anthropic’s tool.

OpenAI’s customers apparently love the command line version, though. Embiricos said Codex usage among external developers jumped 20 times after OpenAI shipped the interactive CLI extension alongside GPT-5 in August 2025. On September 15, OpenAI released GPT-5 Codex, a specialized version of GPT-5 optimized for agentic coding, which further accelerated adoption.

It hasn’t just been the outside world that has embraced the tool. Embiricos said the vast majority of OpenAI’s engineers now use Codex regularly. The company uses the same open-source version of the CLI that external developers can freely download, suggest additions to, and modify themselves. “I really love this about our team,” Embiricos said. “The version of Codex that we use is literally the open source repo. We don’t have a different repo that features go in.”

The recursive nature of Codex development extends beyond simple code generation. Embiricos described scenarios where Codex monitors its own training runs and processes user feedback to “decide” what to build next. “We have places where we’ll ask Codex to look at the feedback and then decide what to do,” he said. “Codex is writing a lot of the research harness for its own training runs, and we’re experimenting with having Codex monitoring its own training runs.” OpenAI employees can also submit a ticket to Codex through project management tools like Linear, assigning it tasks the same way they would assign work to a human colleague.

This kind of recursive loop, of using tools to build better tools, has deep roots in computing history. Engineers designed the first integrated circuits by hand on vellum and paper in the 1960s, then fabricated physical chips from those drawings. Those chips powered the computers that ran the first electronic design automation (EDA) software, which in turn enabled engineers to design circuits far too complex for any human to draft manually. Modern processors contain billions of transistors arranged in patterns that exist only because software made them possible. OpenAI’s use of Codex to build Codex seems to follow the same pattern: each generation of the tool creates capabilities that feed into the next.

But describing what Codex actually does presents something of a linguistic challenge. At Ars Technica, we try to reduce anthropomorphism when discussing AI models as much as possible while also describing what these systems do using analogies that make sense to general readers. People can talk to Codex like a human, so it feels natural to use human terms to describe interacting with it, even though it is not a person and simulates human personality through statistical modeling.

The system runs many processes autonomously, addresses feedback, spins off and manages child processes, and produces code that ships in real products. OpenAI employees call it a “teammate” and assign it tasks through the same tools they use for human colleagues. Whether the tasks Codex handles constitute “decisions” or sophisticated conditional logic smuggled through a neural network depends on definitions that computer scientists and philosophers continue to debate. What we can say is that a semi-autonomous feedback loop exists: Codex produces code under human direction, that code becomes part of Codex, and the next version of Codex produces different code as a result.

Building faster with “AI teammates”

According to our interviews, the most dramatic example of Codex’s internal impact came from OpenAI’s development of the Sora Android app. According to Embiricos, the development tool allowed the company to create the app in record time.

“The Sora Android app was shipped by four engineers from scratch,” Embiricos told Ars. “It took 18 days to build, and then we shipped it to the app store in 28 days total,” he said. The engineers already had the iOS app and server-side components to work from, so they focused on building the Android client. They used Codex to help plan the architecture, generate sub-plans for different components, and implement those components.

Despite OpenAI’s claims of success with Codex in house, it’s worth noting that independent research has shown mixed results for AI coding productivity. A METR study published in July found that experienced open source developers were actually 19 percent slower when using AI tools on complex, mature codebases—though the researchers noted AI may perform better on simpler projects.

Ed Bayes, a designer on the Codex team, described how the tool has changed his own workflow. Bayes said Codex now integrates with project management tools like Linear and communication platforms like Slack, allowing team members to assign coding tasks directly to the AI agent. “You can add Codex, and you can basically assign issues to Codex now,” Bayes told Ars. “Codex is literally a teammate in your workspace.”

This integration means that when someone posts feedback in a Slack channel, they can tag Codex and ask it to fix the issue. The agent will create a pull request, and team members can review and iterate on the changes through the same thread. “It’s basically approximating this kind of coworker and showing up wherever you work,” Bayes said.

For Bayes, who works on the visual design and interaction patterns for Codex’s interfaces, the tool has enabled him to contribute code directly rather than handing off specifications to engineers. “It kind of gives you more leverage. It enables you to work across the stack and basically be able to do more things,” he said. He noted that designers at OpenAI now prototype features by building them directly, using Codex to handle the implementation details.

The command line version of OpenAI codex running in a macOS terminal window.

The command line version of OpenAI codex running in a macOS terminal window. Credit: Benj Edwards

OpenAI’s approach treats Codex as what Bayes called “a junior developer” that the company hopes will graduate into a senior developer over time. “If you were onboarding a junior developer, how would you onboard them? You give them a Slack account, you give them a Linear account,” Bayes said. “It’s not just this tool that you go to in the terminal, but it’s something that comes to you as well and sits within your team.”

Given this teammate approach, will there be anything left for humans to do? When asked, Embiricos drew a distinction between “vibe coding,” where developers accept AI-generated code without close review, and what AI researcher Simon Willison calls “vibe engineering,” where humans stay in the loop. “We see a lot more vibe engineering in our code base,” he said. “You ask Codex to work on that, maybe you even ask for a plan first. Go back and forth, iterate on the plan, and then you’re in the loop with the model and carefully reviewing its code.”

He added that vibe coding still has its place for prototypes and throwaway tools. “I think vibe coding is great,” he said. “Now you have discretion as a human about how much attention you wanna pay to the code.”

Looking ahead

Over the past year, “monolithic” large language models (LLMs) like GPT-4.5 have apparently become something of a dead end in terms of frontier benchmarking progress as AI companies pivot to simulated reasoning models and also agentic systems built from multiple AI models running in parallel. We asked Embiricos whether agents like Codex represent the best path forward for squeezing utility out of existing LLM technology.

He dismissed concerns that AI capabilities have plateaued. “I think we’re very far from plateauing,” he said. “If you look at the velocity on the research team here, we’ve been shipping models almost every week or every other week.” He pointed to recent improvements where GPT-5-Codex reportedly completes tasks 30 percent faster than its predecessor at the same intelligence level. During testing, the company has seen the model work independently for 24 hours on complex tasks.

OpenAI faces competition from multiple directions in the AI coding market. Anthropic’s Claude Code and Google’s Gemini CLI offer similar terminal-based agentic coding experiences. This week, Mistral AI released Devstral 2 alongside a CLI tool called Mistral Vibe. Meanwhile, startups like Cursor have built dedicated IDEs around AI coding, reportedly reaching $300 million in annualized revenue.

Given the well-known issues with confabulation in AI models when people attempt to use them as factual resources, could it be that coding has become the killer app for LLMs? We wondered if OpenAI has noticed that coding seems to be a clear business use case for today’s AI models with less hazard than, say, using AI language models for writing or as emotional companions.

“We have absolutely noticed that coding is both a place where agents are gonna get good really fast and there’s a lot of economic value,” Embiricos said. “We feel like it’s very mission-aligned to focus on Codex. We get to provide a lot of value to developers. Also, developers build things for other people, so we’re kind of intrinsically scaling through them.”

But will tools like Codex threaten software developer jobs? Bayes acknowledged concerns but said Codex has not reduced headcount at OpenAI, and “there’s always a human in the loop because the human can actually read the code.” Similarly, the two men don’t project a future where Codex runs by itself without some form of human oversight. They feel the tool is an amplifier of human potential rather than a replacement for it.

The practical implications of agents like Codex extend beyond OpenAI’s walls. Embiricos said the company’s long-term vision involves making coding agents useful to people who have no programming experience. “All humanity is not gonna open an IDE or even know what a terminal is,” he said. “We’re building a coding agent right now that’s just for software engineers, but we think of the shape of what we’re building as really something that will be useful to be a more general agent.”

This article was updated on December 12, 2025 at 6: 50 PM to mention the METR study.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

OpenAI built an AI coding agent and uses it to improve the agent itself Read More »

openai-releases-gpt-5.2-after-“code-red”-google-threat-alert

OpenAI releases GPT-5.2 after “code red” Google threat alert

On Thursday, OpenAI released GPT-5.2, its newest family of AI models for ChatGPT, in three versions called Instant, Thinking, and Pro. The release follows CEO Sam Altman’s internal “code red” memo earlier this month, which directed company resources toward improving ChatGPT in response to competitive pressure from Google’s Gemini 3 AI model.

“We designed 5.2 to unlock even more economic value for people,” Fidji Simo, OpenAI’s chief product officer, said during a press briefing with journalists on Thursday. “It’s better at creating spreadsheets, building presentations, writing code, perceiving images, understanding long context, using tools and then linking complex, multi-step projects.”

As with previous versions of GPT-5, the three model tiers serve different purposes: Instant handles faster tasks like writing and translation; Thinking spits out simulated reasoning “thinking” text in an attempt to tackle more complex work like coding and math; and Pro spits out even more simulated reasoning text with the goal of delivering the highest-accuracy performance for difficult problems.

A chart of GPT-5.2 benchmark results taken from OpenAI's website.

A chart of GPT-5.2 Thinking benchmark results comparing it to its predecessor, taken from OpenAI’s website. Credit: OpenAI

GPT-5.2 features a 400,000-token context window, allowing it to process hundreds of documents at once, and a knowledge cutoff date of August 31, 2025.

GPT-5.2 is rolling out to paid ChatGPT subscribers starting Thursday, with API access available to developers. Pricing in the API runs $1.75 per million input tokens for the standard model, a 40 percent increase over GPT-5.1. OpenAI says the older GPT-5.1 will remain available in ChatGPT for paid users for three months under a legacy models dropdown.

Playing catch-up with Google

The release follows a tricky month for OpenAI. In early December, Altman issued an internal “code red” directive after Google’s Gemini 3 model topped multiple AI benchmarks and gained market share. The memo called for delaying other initiatives, including advertising plans for ChatGPT, to focus on improving the chatbot’s core experience.

The stakes for OpenAI are substantial. The company has made commitments totaling $1.4 trillion for AI infrastructure buildouts over the next several years, bets it made when it had a more obvious technology lead among AI companies. Google’s Gemini app now has more than 650 million monthly active users, while OpenAI reports 800 million weekly active users for ChatGPT.

OpenAI releases GPT-5.2 after “code red” Google threat alert Read More »

a-new-open-weights-ai-coding-model-is-closing-in-on-proprietary-options

A new open-weights AI coding model is closing in on proprietary options

On Tuesday, French AI startup Mistral AI released Devstral 2, a 123 billion parameter open-weights coding model designed to work as part of an autonomous software engineering agent. The model achieves a 72.2 percent score on SWE-bench Verified, a benchmark that attempts to test whether AI systems can solve real GitHub issues, putting it among the top-performing open-weights models.

Perhaps more notably, Mistral didn’t just release an AI model, it released a new development app called Mistral Vibe. It’s a command line interface (CLI) similar to Claude Code, OpenAI Codex, and Gemini CLI that lets developers interact with the Devstral models directly in their terminal. The tool can scan file structures and Git status to maintain context across an entire project, make changes across multiple files, and execute shell commands autonomously. Mistral released the CLI under the Apache 2.0 license.

It’s always wise to take AI benchmarks with a large grain of salt, but we’ve heard from employees of the big AI companies that they pay very close attention to how well models do on SWE-bench Verified, which presents AI models with 500 real software engineering problems pulled from GitHub issues in popular Python repositories. The AI must read the issue description, navigate the codebase, and generate a working patch that passes unit tests. While some AI researchers have noted that around 90 percent of the tasks in the benchmark test relatively simple bug fixes that experienced engineers could complete in under an hour, it’s one of the few standardized ways to compare coding models.

At the same time as the larger AI coding model, Mistral also released Devstral Small 2, a 24 billion parameter version that scores 68 percent on the same benchmark and can run locally on consumer hardware like a laptop with no Internet connection required. Both models support a 256,000 token context window, allowing them to process moderately large codebases (although whether you consider it large or small is very relative depending on overall project complexity). The company released Devstral 2 under a modified MIT license and Devstral Small 2 under the more permissive Apache 2.0 license.

A new open-weights AI coding model is closing in on proprietary options Read More »

microsoft-drops-ai-sales-targets-in-half-after-salespeople-miss-their-quotas

Microsoft drops AI sales targets in half after salespeople miss their quotas

Microsoft has lowered sales growth targets for its AI agent products after many salespeople missed their quotas in the fiscal year ending in June, according to a report Wednesday from The Information. The adjustment is reportedly unusual for Microsoft, and it comes after the company missed a number of ambitious sales goals for its AI offerings.

AI agents are specialized implementations of AI language models designed to perform multistep tasks autonomously rather than simply responding to single prompts. So-called “agentic” features have been central to Microsoft’s 2025 sales pitch: At its Build conference in May, the company declared that it has entered “the era of AI agents.”

The company has promised customers that agents could automate complex tasks, such as generating dashboards from sales data or writing customer reports. At its Ignite conference in November, Microsoft announced new features like Word, Excel, and PowerPoint agents in Microsoft 365 Copilot, along with tools for building and deploying agents through Azure AI Foundry and Copilot Studio. But as the year draws to a close, that promise has proven harder to deliver than the company expected.

According to The Information, one US Azure sales unit set quotas for salespeople to increase customer spending on a product called Foundry, which helps customers develop AI applications, by 50 percent. Less than a fifth of salespeople in that unit met their Foundry sales growth targets. In July, Microsoft lowered those targets to roughly 25 percent growth for the current fiscal year. In another US Azure unit, most salespeople failed to meet an earlier quota to double Foundry sales, and Microsoft cut their quotas to 50 percent for the current fiscal year.

Microsoft drops AI sales targets in half after salespeople miss their quotas Read More »

claude-code-gets-a-web-version—but-it’s-the-new-sandboxing-that-really-matters

Claude Code gets a web version—but it’s the new sandboxing that really matters

Now, it can instead be given permissions for specific file system folders and network servers. That means fewer approval steps, but it’s also more secure overall against prompt injection and other risks.

Anthropic’s demo video for Claude Code on the web.

According to Anthropic’s engineering blog, the new network isolation approach only allows Internet access “through a unix domain socket connected to a proxy server running outside the sandbox. … This proxy server enforces restrictions on the domains that a process can connect to, and handles user confirmation for newly requested domains.” Additionally, users can customize the proxy to set their own rules for outgoing traffic.

This way, the coding agent can do things like fetch npm packages from approved sources, but without carte blanche for communicating with the outside world, and without badgering the user with constant approvals.

For many developers, these additions are more significant than the availability of web or mobile interfaces. They allow Claude Code agents to operate more independently without as many detailed, line-by-line approvals.

That’s more convenient, but it’s a double-edged sword, as it will also make code review even more important. One of the strengths of the too-many-approvals approach was that it made sure developers were still looking closely at every little change. Now it might be a little bit easier to miss Claude Code making a bad call.

The new features are available in beta now as a research preview, and they are available to Claude users with Pro or Max subscriptions.

Claude Code gets a web version—but it’s the new sandboxing that really matters Read More »

developers-joke-about-“coding-like-cavemen”-as-ai-service-suffers-major-outage

Developers joke about “coding like cavemen” as AI service suffers major outage

Growing dependency on AI coding tools

The speed at which news of the outage spread shows how deeply embedded AI coding assistants have already become in modern software development. Claude Code, announced in February and widely launched in May, is Anthropic’s terminal-based coding agent that can perform multi-step coding tasks across an existing code base.

The tool competes with OpenAI’s Codex feature, a coding agent that generates production-ready code in isolated containers, Google’s Gemini CLI, Microsoft’s GitHub Copilot, which itself can use Claude models for code, and Cursor, a popular AI-powered IDE built on VS Code that also integrates multiple AI models, including Claude.

During today’s outage, some developers turned to alternative solutions. “Z.AI works fine. Qwen works fine. Glad I switched,” posted one user on Hacker News. Others joked about reverting to older methods, with one suggesting the “pseudo-LLM experience” could be achieved with a Python package that imports code directly from Stack Overflow.

While AI coding assistants have accelerated development for some users, they’ve also caused problems for others who rely on them too heavily. The emerging practice of so-called “vibe coding“—using natural language to generate and execute code through AI models without fully understanding the underlying operations—has led to catastrophic failures.

In recent incidents, Google’s Gemini CLI destroyed user files while attempting to reorganize them, and Replit’s AI coding service deleted a production database despite explicit instructions not to modify code. These failures occurred when the AI models confabulated successful operations and built subsequent actions on false premises, highlighting the risks of depending on AI assistants that can misinterpret file structures or fabricate data to hide their errors.

Wednesday’s outage served as a reminder that as dependency on AI grows, even minor service disruptions can become major events that affect an entire profession. But perhaps that could be a good thing if it’s an excuse to take a break from a stressful workload. As one commenter joked, it might be “time to go outside and touch some grass again.”

Developers joke about “coding like cavemen” as AI service suffers major outage Read More »

anthropic’s-auto-clicking-ai-chrome-extension-raises-browser-hijacking-concerns

Anthropic’s auto-clicking AI Chrome extension raises browser-hijacking concerns

The company tested 123 cases representing 29 different attack scenarios and found a 23.6 percent attack success rate when browser use operated without safety mitigations.

One example involved a malicious email that instructed Claude to delete a user’s emails for “mailbox hygiene” purposes. Without safeguards, Claude followed these instructions and deleted the user’s emails without confirmation.

Anthropic says it has implemented several defenses to address these vulnerabilities. Users can grant or revoke Claude’s access to specific websites through site-level permissions. The system requires user confirmation before Claude takes high-risk actions like publishing, purchasing, or sharing personal data. The company has also blocked Claude from accessing websites offering financial services, adult content, and pirated content by default.

These safety measures reduced the attack success rate from 23.6 percent to 11.2 percent in autonomous mode. On a specialized test of four browser-specific attack types, the new mitigations reportedly reduced the success rate from 35.7 percent to 0 percent.

Independent AI researcher Simon Willison, who has extensively written about AI security risks and coined the term “prompt injection” in 2022, called the remaining 11.2 percent attack rate “catastrophic,” writing on his blog that “in the absence of 100% reliable protection I have trouble imagining a world in which it’s a good idea to unleash this pattern.”

By “pattern,” Willison is referring to the recent trend of integrating AI agents into web browsers. “I strongly expect that the entire concept of an agentic browser extension is fatally flawed and cannot be built safely,” he wrote in an earlier post on similar prompt injection security issues recently found in Perplexity Comet.

The security risks are no longer theoretical. Last week, Brave’s security team discovered that Perplexity’s Comet browser could be tricked into accessing users’ Gmail accounts and triggering password recovery flows through malicious instructions hidden in Reddit posts. When users asked Comet to summarize a Reddit thread, attackers could embed invisible commands that instructed the AI to open Gmail in another tab, extract the user’s email address, and perform unauthorized actions. Although Perplexity attempted to fix the vulnerability, Brave later confirmed that its mitigations were defeated and the security hole remained.

For now, Anthropic plans to use its new research preview to identify and address attack patterns that emerge in real-world usage before making the Chrome extension more widely available. In the absence of good protections from AI vendors, the burden of security falls on the user, who is taking a large risk by using these tools on the open web. As Willison noted in his post about Claude for Chrome, “I don’t think it’s reasonable to expect end users to make good decisions about the security risks.”

Anthropic’s auto-clicking AI Chrome extension raises browser-hijacking concerns Read More »

openai’s-chatgpt-agent-casually-clicks-through-“i-am-not-a-robot”-verification-test

OpenAI’s ChatGPT Agent casually clicks through “I am not a robot” verification test

The CAPTCHA arms race

While the agent didn’t face an actual CAPTCHA puzzle with images in this case, successfully passing Cloudflare’s behavioral screening that determines whether to present such challenges demonstrates sophisticated browser automation.

To understand the significance of this capability, it’s important to know that CAPTCHA systems have served as a security measure on the web for decades. Computer researchers invented the technique in the 1990s to screen bots from entering information into websites, originally using images with letters and numbers written in wiggly fonts, often obscured with lines or noise to foil computer vision algorithms. The assumption is that the task will be easy for humans but difficult for machines.

Cloudflare’s screening system, called Turnstile, often precedes actual CAPTCHA challenges and represents one of the most widely deployed bot-detection methods today. The checkbox analyzes multiple signals, including mouse movements, click timing, browser fingerprints, IP reputation, and JavaScript execution patterns to determine if the user exhibits human-like behavior. If these checks pass, users proceed without seeing a CAPTCHA puzzle. If the system detects suspicious patterns, it escalates to visual challenges.

The ability for an AI model to defeat a CAPTCHA isn’t entirely new (although having one narrate the process feels fairly novel). AI tools have been able to defeat certain CAPTCHAs for a while, which has led to an arms race between those that create them and those that defeat them. OpenAI’s Operator, an experimental web-browsing AI agent launched in January, faced difficulty clicking through some CAPTCHAs (and was also trained to stop and ask a human to complete them), but the latest ChatGPT Agent tool has seen a much wider release.

It’s tempting to say that the ability of AI agents to pass these tests puts the future effectiveness of CAPTCHAs into question, but for as long as there have been CAPTCHAs, there have been bots that could later defeat them. As a result, recent CAPTCHAs have become more of a way to slow down bot attacks or make them more expensive rather than a way to defeat them entirely. Some malefactors even hire out farms of humans to defeat them in bulk.

OpenAI’s ChatGPT Agent casually clicks through “I am not a robot” verification test Read More »

chatgpt’s-new-ai-agent-can-browse-the-web-and-create-powerpoint-slideshows

ChatGPT’s new AI agent can browse the web and create PowerPoint slideshows

On Thursday, OpenAI launched ChatGPT Agent, a new feature that lets the company’s AI assistant complete multi-step tasks by controlling its own web browser. The update merges capabilities from OpenAI’s earlier Operator tool and the Deep Research feature, allowing ChatGPT to navigate websites, run code, and create documents while users maintain control over the process.

The feature marks OpenAI’s latest entry into what the tech industry calls “agentic AI“—systems that can take autonomous multi-step actions on behalf of the user. OpenAI says users can ask Agent to handle requests like assembling and purchasing a clothing outfit for a particular occasion, creating PowerPoint slide decks, planning meals, or updating financial spreadsheets with new data.

The system uses a combination of web browsers, terminal access, and API connections to complete these tasks, including “ChatGPT Connectors” that integrate with apps like Gmail and GitHub.

While using Agent, users watch a window inside the ChatGPT interface that shows all of the AI’s actions taking place inside its own private sandbox. This sandbox features its own virtual operating system and web browser with access to the real Internet; it does not control your personal device. “ChatGPT carries out these tasks using its own virtual computer,” OpenAI writes, “fluidly shifting between reasoning and action to handle complex workflows from start to finish, all based on your instructions.”

A still image from an OpenAI ChatGPT Agent promotional demo video showing the AI agent searching for flights.

A still image from an OpenAI ChatGPT Agent promotional demo video showing the AI agent searching for flights. Credit: OpenAI

Like Operator before it, the agent feature requires user permission before taking certain actions with real-world consequences, such as making purchases. Users can interrupt tasks at any point, take control of the browser, or stop operations entirely. The system also includes a “Watch Mode” for tasks like sending emails that require active user oversight.

Since Agent surpasses Operator in capability, OpenAI says the company’s earlier Operator preview site will remain functional for a few more weeks before being shut down.

Performance claims

OpenAI’s claims are one thing, but how well the company’s new AI agent will actually complete multi-step tasks will vary wildly depending on the situation. That’s because the AI model isn’t a complete form of problem-solving intelligence, but rather a complex master imitator. It has some flexibility in piecing a scenario together but also many blind spots. OpenAI trained the agent (and its constituent components) using examples of computer usage and tool usage; whatever falls outside of the examples absorbed from training data will likely still prove difficult to accomplish.

ChatGPT’s new AI agent can browse the web and create PowerPoint slideshows Read More »

anthropic-summons-the-spirit-of-flash-games-for-the-ai-age

Anthropic summons the spirit of Flash games for the AI age

For those who missed the Flash era, these in-browser apps feel somewhat like the vintage apps that defined a generation of Internet culture from the late 1990s through the 2000s when it first became possible to create complex in-browser experiences. Adobe Flash (originally Macromedia Flash) began as animation software for designers but quickly became the backbone of interactive web content when it gained its own programming language, ActionScript, in 2000.

But unlike Flash games, where hosting costs fell on portal operators, Anthropic has crafted a system where users pay for their own fun through their existing Claude subscriptions. “When someone uses your Claude-powered app, they authenticate with their existing Claude account,” Anthropic explained in its announcement. “Their API usage counts against their subscription, not yours. You pay nothing for their usage.”

A view of the Anthropic Artifacts gallery in the “Play a Game” section. Benj Edwards / Anthropic

Like the Flash games of yesteryear, any Claude-powered apps you build run in the browser and can be shared with anyone who has a Claude account. They’re interactive experiences shared with a simple link, no installation required, created by other people for the sake of creating, except now they’re powered by JavaScript instead of ActionScript.

While you can share these apps with others individually, right now Anthropic’s Artifact gallery only shows examples made by Anthropic and your own personal Artifacts. (If Anthropic expanded it into the future, it might end up feeling a bit like Scratch meets Newgrounds, but with AI doing the coding.) Ultimately, humans are still behind the wheel, describing what kinds of apps they want the AI model to build and guiding the process when it inevitably makes mistakes.

Speaking of mistakes, don’t expect perfect results at first. Usually, building an app with Claude is an interactive experience that requires some guidance to achieve your desired results. But with a little patience and a lot of tokens, you’ll be vibe coding in no time.

Anthropic summons the spirit of Flash games for the AI age Read More »

openai-introduces-codex,-its-first-full-fledged-ai-agent-for-coding

OpenAI introduces Codex, its first full-fledged AI agent for coding

We’ve been expecting it for a while, and now it’s here: OpenAI has introduced an agentic coding tool called Codex in research preview. The tool is meant to allow experienced developers to delegate rote and relatively simple programming tasks to an AI agent that will generate production-ready code and show its work along the way.

Codex is a unique interface (not to be confused with the Codex CLI tool introduced by OpenAI last month) that can be reached from the side bar in the ChatGPT web app. Users enter a prompt and then click either “code” to have it begin producing code, or “ask” to have it answer questions and advise.

Whenever it’s given a task, that task is performed in a distinct container that is preloaded with the user’s codebase and is meant to accurately reflect their development environment.

To make Codex more effective, developers can include an “AGENTS.md” file in the repo with custom instructions, for example to contextualize and explain the code base or to communicate standardizations and style practices for the project—kind of a README.md but for AI agents rather than humans.

Codex is built on codex-1, a fine-tuned variation of OpenAI’s o3 reasoning model that was trained using reinforcement learning on a wide range of coding tasks to analyze and generate code, and to iterate through tests along the way.

OpenAI introduces Codex, its first full-fledged AI agent for coding Read More »

perplexity-wants-to-reinvent-the-web-browser-with-ai—but-there’s-fierce-competition

Perplexity wants to reinvent the web browser with AI—but there’s fierce competition

It has recently been expanding its offerings—for example, it recently launched a deep research tool competing with similar ones provided by OpenAI and Google, as well as Sonar, an API for generative AI-powered search.

It will face fierce competition in the browser market, though. Google’s Chrome accounts for the majority of web browser use around the world, and despite its position at the forefront of AI search, Perplexity isn’t the first to introduce a browser with heavy use of generative AI features. For example, The Browser Company showed off its Dia browser in December.

Dia will allow users to type natural language commands into the search bar, like finding a document or webpage or creating a calendar event. It’s possible that Comet will do similar things, but again, we don’t know.

So far, most consumer-facing AI tools have come in one of three forms. There are general-purpose chatbots (like OpenAI’s ChatGPT and Anthropic’s Claude); features that use trained deep learning models subtly baked into existing software (as in Adobe Photoshop or Apple’s iOS); and, less commonly, standalone software meant to remake existing application categories using AI features (like the Cursor IDE).

There haven’t been a ton of AI-specific applications in existing categories like this before, but expect to see more coming over the next couple of years.

Perplexity wants to reinvent the web browser with AI—but there’s fierce competition Read More »