AI programming

openai-built-an-ai-coding-agent-and-uses-it-to-improve-the-agent-itself

OpenAI built an AI coding agent and uses it to improve the agent itself


“The vast majority of Codex is built by Codex,” OpenAI told us about its new AI coding agent.

With the popularity of AI coding tools rising among software developers, their adoption has begun to touch every aspect of the process, including the improvement of AI coding tools themselves.

In interviews with Ars Technica this week, OpenAI employees revealed the extent to which the company now relies on its own AI coding agent, Codex, to build and improve the development tool. “I think the vast majority of Codex is built by Codex, so it’s almost entirely just being used to improve itself,” said Alexander Embiricos, product lead for Codex at OpenAI, in a conversation on Tuesday.

Codex, which OpenAI launched in its modern incarnation as a research preview in May 2025, operates as a cloud-based software engineering agent that can handle tasks like writing features, fixing bugs, and proposing pull requests. The tool runs in sandboxed environments linked to a user’s code repository and can execute multiple tasks in parallel. OpenAI offers Codex through ChatGPT’s web interface, a command-line interface (CLI), and IDE extensions for VS Code, Cursor, and Windsurf.

The “Codex” name itself dates back to a 2021 OpenAI model based on GPT-3 that powered GitHub Copilot’s tab completion feature. Embiricos said the name is rumored among staff to be short for “code execution.” OpenAI wanted to connect the new agent to that earlier moment, which was crafted in part by some who have left the company.

“For many people, that model powering GitHub Copilot was the first ‘wow’ moment for AI,” Embiricos said. “It showed people the potential of what it can mean when AI is able to understand your context and what you’re trying to do and accelerate you in doing that.”

A place to enter a prompt, set parameters, and click

The interface for OpenAI’s Codex in ChatGPT. Credit: OpenAI

It’s no secret that the current command-line version of Codex bears some resemblance to Claude Code, Anthropic’s agentic coding tool that launched in February 2025. When asked whether Claude Code influenced Codex’s design, Embiricos parried the question but acknowledged the competitive dynamic. “It’s a fun market to work in because there’s lots of great ideas being thrown around,” he said. He noted that OpenAI had been building web-based Codex features internally before shipping the CLI version, which arrived after Anthropic’s tool.

OpenAI’s customers apparently love the command line version, though. Embiricos said Codex usage among external developers jumped 20 times after OpenAI shipped the interactive CLI extension alongside GPT-5 in August 2025. On September 15, OpenAI released GPT-5 Codex, a specialized version of GPT-5 optimized for agentic coding, which further accelerated adoption.

It hasn’t just been the outside world that has embraced the tool. Embiricos said the vast majority of OpenAI’s engineers now use Codex regularly. The company uses the same open-source version of the CLI that external developers can freely download, suggest additions to, and modify themselves. “I really love this about our team,” Embiricos said. “The version of Codex that we use is literally the open source repo. We don’t have a different repo that features go in.”

The recursive nature of Codex development extends beyond simple code generation. Embiricos described scenarios where Codex monitors its own training runs and processes user feedback to “decide” what to build next. “We have places where we’ll ask Codex to look at the feedback and then decide what to do,” he said. “Codex is writing a lot of the research harness for its own training runs, and we’re experimenting with having Codex monitoring its own training runs.” OpenAI employees can also submit a ticket to Codex through project management tools like Linear, assigning it tasks the same way they would assign work to a human colleague.

This kind of recursive loop, of using tools to build better tools, has deep roots in computing history. Engineers designed the first integrated circuits by hand on vellum and paper in the 1960s, then fabricated physical chips from those drawings. Those chips powered the computers that ran the first electronic design automation (EDA) software, which in turn enabled engineers to design circuits far too complex for any human to draft manually. Modern processors contain billions of transistors arranged in patterns that exist only because software made them possible. OpenAI’s use of Codex to build Codex seems to follow the same pattern: each generation of the tool creates capabilities that feed into the next.

But describing what Codex actually does presents something of a linguistic challenge. At Ars Technica, we try to reduce anthropomorphism when discussing AI models as much as possible while also describing what these systems do using analogies that make sense to general readers. People can talk to Codex like a human, so it feels natural to use human terms to describe interacting with it, even though it is not a person and simulates human personality through statistical modeling.

The system runs many processes autonomously, addresses feedback, spins off and manages child processes, and produces code that ships in real products. OpenAI employees call it a “teammate” and assign it tasks through the same tools they use for human colleagues. Whether the tasks Codex handles constitute “decisions” or sophisticated conditional logic smuggled through a neural network depends on definitions that computer scientists and philosophers continue to debate. What we can say is that a semi-autonomous feedback loop exists: Codex produces code under human direction, that code becomes part of Codex, and the next version of Codex produces different code as a result.

Building faster with “AI teammates”

According to our interviews, the most dramatic example of Codex’s internal impact came from OpenAI’s development of the Sora Android app. According to Embiricos, the development tool allowed the company to create the app in record time.

“The Sora Android app was shipped by four engineers from scratch,” Embiricos told Ars. “It took 18 days to build, and then we shipped it to the app store in 28 days total,” he said. The engineers already had the iOS app and server-side components to work from, so they focused on building the Android client. They used Codex to help plan the architecture, generate sub-plans for different components, and implement those components.

Despite OpenAI’s claims of success with Codex in house, it’s worth noting that independent research has shown mixed results for AI coding productivity. A METR study published in July found that experienced open source developers were actually 19 percent slower when using AI tools on complex, mature codebases—though the researchers noted AI may perform better on simpler projects.

Ed Bayes, a designer on the Codex team, described how the tool has changed his own workflow. Bayes said Codex now integrates with project management tools like Linear and communication platforms like Slack, allowing team members to assign coding tasks directly to the AI agent. “You can add Codex, and you can basically assign issues to Codex now,” Bayes told Ars. “Codex is literally a teammate in your workspace.”

This integration means that when someone posts feedback in a Slack channel, they can tag Codex and ask it to fix the issue. The agent will create a pull request, and team members can review and iterate on the changes through the same thread. “It’s basically approximating this kind of coworker and showing up wherever you work,” Bayes said.

For Bayes, who works on the visual design and interaction patterns for Codex’s interfaces, the tool has enabled him to contribute code directly rather than handing off specifications to engineers. “It kind of gives you more leverage. It enables you to work across the stack and basically be able to do more things,” he said. He noted that designers at OpenAI now prototype features by building them directly, using Codex to handle the implementation details.

The command line version of OpenAI codex running in a macOS terminal window.

The command line version of OpenAI codex running in a macOS terminal window. Credit: Benj Edwards

OpenAI’s approach treats Codex as what Bayes called “a junior developer” that the company hopes will graduate into a senior developer over time. “If you were onboarding a junior developer, how would you onboard them? You give them a Slack account, you give them a Linear account,” Bayes said. “It’s not just this tool that you go to in the terminal, but it’s something that comes to you as well and sits within your team.”

Given this teammate approach, will there be anything left for humans to do? When asked, Embiricos drew a distinction between “vibe coding,” where developers accept AI-generated code without close review, and what AI researcher Simon Willison calls “vibe engineering,” where humans stay in the loop. “We see a lot more vibe engineering in our code base,” he said. “You ask Codex to work on that, maybe you even ask for a plan first. Go back and forth, iterate on the plan, and then you’re in the loop with the model and carefully reviewing its code.”

He added that vibe coding still has its place for prototypes and throwaway tools. “I think vibe coding is great,” he said. “Now you have discretion as a human about how much attention you wanna pay to the code.”

Looking ahead

Over the past year, “monolithic” large language models (LLMs) like GPT-4.5 have apparently become something of a dead end in terms of frontier benchmarking progress as AI companies pivot to simulated reasoning models and also agentic systems built from multiple AI models running in parallel. We asked Embiricos whether agents like Codex represent the best path forward for squeezing utility out of existing LLM technology.

He dismissed concerns that AI capabilities have plateaued. “I think we’re very far from plateauing,” he said. “If you look at the velocity on the research team here, we’ve been shipping models almost every week or every other week.” He pointed to recent improvements where GPT-5-Codex reportedly completes tasks 30 percent faster than its predecessor at the same intelligence level. During testing, the company has seen the model work independently for 24 hours on complex tasks.

OpenAI faces competition from multiple directions in the AI coding market. Anthropic’s Claude Code and Google’s Gemini CLI offer similar terminal-based agentic coding experiences. This week, Mistral AI released Devstral 2 alongside a CLI tool called Mistral Vibe. Meanwhile, startups like Cursor have built dedicated IDEs around AI coding, reportedly reaching $300 million in annualized revenue.

Given the well-known issues with confabulation in AI models when people attempt to use them as factual resources, could it be that coding has become the killer app for LLMs? We wondered if OpenAI has noticed that coding seems to be a clear business use case for today’s AI models with less hazard than, say, using AI language models for writing or as emotional companions.

“We have absolutely noticed that coding is both a place where agents are gonna get good really fast and there’s a lot of economic value,” Embiricos said. “We feel like it’s very mission-aligned to focus on Codex. We get to provide a lot of value to developers. Also, developers build things for other people, so we’re kind of intrinsically scaling through them.”

But will tools like Codex threaten software developer jobs? Bayes acknowledged concerns but said Codex has not reduced headcount at OpenAI, and “there’s always a human in the loop because the human can actually read the code.” Similarly, the two men don’t project a future where Codex runs by itself without some form of human oversight. They feel the tool is an amplifier of human potential rather than a replacement for it.

The practical implications of agents like Codex extend beyond OpenAI’s walls. Embiricos said the company’s long-term vision involves making coding agents useful to people who have no programming experience. “All humanity is not gonna open an IDE or even know what a terminal is,” he said. “We’re building a coding agent right now that’s just for software engineers, but we think of the shape of what we’re building as really something that will be useful to be a more general agent.”

This article was updated on December 12, 2025 at 6: 50 PM to mention the METR study.

Photo of Benj Edwards

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

OpenAI built an AI coding agent and uses it to improve the agent itself Read More »

a-new-open-weights-ai-coding-model-is-closing-in-on-proprietary-options

A new open-weights AI coding model is closing in on proprietary options

On Tuesday, French AI startup Mistral AI released Devstral 2, a 123 billion parameter open-weights coding model designed to work as part of an autonomous software engineering agent. The model achieves a 72.2 percent score on SWE-bench Verified, a benchmark that attempts to test whether AI systems can solve real GitHub issues, putting it among the top-performing open-weights models.

Perhaps more notably, Mistral didn’t just release an AI model, it released a new development app called Mistral Vibe. It’s a command line interface (CLI) similar to Claude Code, OpenAI Codex, and Gemini CLI that lets developers interact with the Devstral models directly in their terminal. The tool can scan file structures and Git status to maintain context across an entire project, make changes across multiple files, and execute shell commands autonomously. Mistral released the CLI under the Apache 2.0 license.

It’s always wise to take AI benchmarks with a large grain of salt, but we’ve heard from employees of the big AI companies that they pay very close attention to how well models do on SWE-bench Verified, which presents AI models with 500 real software engineering problems pulled from GitHub issues in popular Python repositories. The AI must read the issue description, navigate the codebase, and generate a working patch that passes unit tests. While some AI researchers have noted that around 90 percent of the tasks in the benchmark test relatively simple bug fixes that experienced engineers could complete in under an hour, it’s one of the few standardized ways to compare coding models.

At the same time as the larger AI coding model, Mistral also released Devstral Small 2, a 24 billion parameter version that scores 68 percent on the same benchmark and can run locally on consumer hardware like a laptop with no Internet connection required. Both models support a 256,000 token context window, allowing them to process moderately large codebases (although whether you consider it large or small is very relative depending on overall project complexity). The company released Devstral 2 under a modified MIT license and Devstral Small 2 under the more permissive Apache 2.0 license.

A new open-weights AI coding model is closing in on proprietary options Read More »

openai-walks-a-tricky-tightrope-with-gpt-5.1’s-eight-new-personalities

OpenAI walks a tricky tightrope with GPT-5.1’s eight new personalities

On Wednesday, OpenAI released GPT-5.1 Instant and GPT-5.1 Thinking, two updated versions of its flagship AI models now available in ChatGPT. The company is wrapping the models in the language of anthropomorphism, claiming that they’re warmer, more conversational, and better at following instructions.

The release follows complaints earlier this year that its previous models were excessively cheerful and sycophantic, along with an opposing controversy among users over how OpenAI modified the default GPT-5 output style after several suicide lawsuits.

The company now faces intense scrutiny from lawyers and regulators that could threaten its future operations. In that kind of environment, it’s difficult to just release a new AI model, throw out a few stats, and move on like the company could even a year ago. But here are the basics: The new GPT-5.1 Instant model will serve as ChatGPT’s faster default option for most tasks, while GPT-5.1 Thinking is a simulated reasoning model that attempts to handle more complex problem-solving tasks.

OpenAI claims that both models perform better on technical benchmarks such as math and coding evaluations (including AIME 2025 and Codeforces) than GPT-5, which was released in August.

Improved benchmarks may win over some users, but the biggest change with GPT-5.1 is in its presentation. OpenAI says it heard from users that they wanted AI models to simulate different communication styles depending on the task, so the company is offering eight preset options, including Professional, Friendly, Candid, Quirky, Efficient, Cynical, and Nerdy, alongside a Default setting.

These presets alter the instructions fed into each prompt to simulate different personality styles, but the underlying model capabilities remain the same across all settings.

An illustration showing GPT-5.1's eight personality styles in ChatGPT.

An illustration showing GPT-5.1’s eight personality styles in ChatGPT. Credit: OpenAI

In addition, the company trained GPT-5.1 Instant to use “adaptive reasoning,” meaning that the model decides when to spend more computational time processing a prompt before generating output.

The company plans to roll out the models gradually over the next few days, starting with paid subscribers before expanding to free users. OpenAI plans to bring both GPT-5.1 Instant and GPT-5.1 Thinking to its API later this week. GPT-5.1 Instant will appear as gpt-5.1-chat-latest, and GPT-5.1 Thinking will be released as GPT-5.1 in the API, both with adaptive reasoning enabled. The older GPT-5 models will remain available in ChatGPT under the legacy models dropdown for paid subscribers for three months.

OpenAI walks a tricky tightrope with GPT-5.1’s eight new personalities Read More »

anthropic’s-claude-haiku-4.5-matches-may’s-frontier-model-at-fraction-of-cost

Anthropic’s Claude Haiku 4.5 matches May’s frontier model at fraction of cost

And speaking of cost, Haiku 4.5 is included for subscribers of the Claude web and app plans. Through the API (for developers), the small model is priced at $1 per million input tokens and $5 per million output tokens. That compares to Sonnet 4.5 at $3 per million input and $15 per million output tokens, and Opus 4.1 at $15 per million input and $75 per million output tokens.

The model serves as a cheaper drop-in replacement for two older models, Haiku 3.5 and Sonnet 4. “Users who rely on AI for real-time, low-latency tasks like chat assistants, customer service agents, or pair programming will appreciate Haiku 4.5’s combination of high intelligence and remarkable speed,” Anthropic writes.

Claude 4.5 Haiku answers the classic Ars Technica AI question,

Claude 4.5 Haiku answers the classic Ars Technica AI question, “Would the color be called ‘magenta’ if the town of Magenta didn’t exist?”

On SWE-bench Verified, a test that measures performance on coding tasks, Haiku 4.5 scored 73.3 percent compared to Sonnet 4’s similar performance level (72.7 percent). The model also reportedly surpasses Sonnet 4 at certain tasks like using computers, according to Anthropic’s benchmarks. Claude Sonnet 4.5, released in late September, remains Anthropic’s frontier model and what the company calls “the best coding model available.”

Haiku 4.5 also surprisingly edges up close to what OpenAI’s GPT-5 can achieve in this particular set of benchmarks (as seen in the chart above), although since the results are self-reported and potentially cherry-picked to match a model’s strengths, one should always take them with a grain of salt.

Still, making a small, capable coding model may have unexpected advantages for agentic coding setups like Claude Code. Anthropic designed Haiku 4.5 to work alongside Sonnet 4.5 in multi-model workflows. In such a configuration, Anthropic says, Sonnet 4.5 could break down complex problems into multi-step plans, then coordinate multiple Haiku 4.5 instances to complete subtasks in parallel, like spinning off workers to get things done faster.

For more details on the new model, Anthropic released a system card and documentation for developers.

Anthropic’s Claude Haiku 4.5 matches May’s frontier model at fraction of cost Read More »

developers-joke-about-“coding-like-cavemen”-as-ai-service-suffers-major-outage

Developers joke about “coding like cavemen” as AI service suffers major outage

Growing dependency on AI coding tools

The speed at which news of the outage spread shows how deeply embedded AI coding assistants have already become in modern software development. Claude Code, announced in February and widely launched in May, is Anthropic’s terminal-based coding agent that can perform multi-step coding tasks across an existing code base.

The tool competes with OpenAI’s Codex feature, a coding agent that generates production-ready code in isolated containers, Google’s Gemini CLI, Microsoft’s GitHub Copilot, which itself can use Claude models for code, and Cursor, a popular AI-powered IDE built on VS Code that also integrates multiple AI models, including Claude.

During today’s outage, some developers turned to alternative solutions. “Z.AI works fine. Qwen works fine. Glad I switched,” posted one user on Hacker News. Others joked about reverting to older methods, with one suggesting the “pseudo-LLM experience” could be achieved with a Python package that imports code directly from Stack Overflow.

While AI coding assistants have accelerated development for some users, they’ve also caused problems for others who rely on them too heavily. The emerging practice of so-called “vibe coding“—using natural language to generate and execute code through AI models without fully understanding the underlying operations—has led to catastrophic failures.

In recent incidents, Google’s Gemini CLI destroyed user files while attempting to reorganize them, and Replit’s AI coding service deleted a production database despite explicit instructions not to modify code. These failures occurred when the AI models confabulated successful operations and built subsequent actions on false premises, highlighting the risks of depending on AI assistants that can misinterpret file structures or fabricate data to hide their errors.

Wednesday’s outage served as a reminder that as dependency on AI grows, even minor service disruptions can become major events that affect an entire profession. But perhaps that could be a good thing if it’s an excuse to take a break from a stressful workload. As one commenter joked, it might be “time to go outside and touch some grass again.”

Developers joke about “coding like cavemen” as AI service suffers major outage Read More »

chatgpt’s-new-ai-agent-can-browse-the-web-and-create-powerpoint-slideshows

ChatGPT’s new AI agent can browse the web and create PowerPoint slideshows

On Thursday, OpenAI launched ChatGPT Agent, a new feature that lets the company’s AI assistant complete multi-step tasks by controlling its own web browser. The update merges capabilities from OpenAI’s earlier Operator tool and the Deep Research feature, allowing ChatGPT to navigate websites, run code, and create documents while users maintain control over the process.

The feature marks OpenAI’s latest entry into what the tech industry calls “agentic AI“—systems that can take autonomous multi-step actions on behalf of the user. OpenAI says users can ask Agent to handle requests like assembling and purchasing a clothing outfit for a particular occasion, creating PowerPoint slide decks, planning meals, or updating financial spreadsheets with new data.

The system uses a combination of web browsers, terminal access, and API connections to complete these tasks, including “ChatGPT Connectors” that integrate with apps like Gmail and GitHub.

While using Agent, users watch a window inside the ChatGPT interface that shows all of the AI’s actions taking place inside its own private sandbox. This sandbox features its own virtual operating system and web browser with access to the real Internet; it does not control your personal device. “ChatGPT carries out these tasks using its own virtual computer,” OpenAI writes, “fluidly shifting between reasoning and action to handle complex workflows from start to finish, all based on your instructions.”

A still image from an OpenAI ChatGPT Agent promotional demo video showing the AI agent searching for flights.

A still image from an OpenAI ChatGPT Agent promotional demo video showing the AI agent searching for flights. Credit: OpenAI

Like Operator before it, the agent feature requires user permission before taking certain actions with real-world consequences, such as making purchases. Users can interrupt tasks at any point, take control of the browser, or stop operations entirely. The system also includes a “Watch Mode” for tasks like sending emails that require active user oversight.

Since Agent surpasses Operator in capability, OpenAI says the company’s earlier Operator preview site will remain functional for a few more weeks before being shut down.

Performance claims

OpenAI’s claims are one thing, but how well the company’s new AI agent will actually complete multi-step tasks will vary wildly depending on the situation. That’s because the AI model isn’t a complete form of problem-solving intelligence, but rather a complex master imitator. It has some flexibility in piecing a scenario together but also many blind spots. OpenAI trained the agent (and its constituent components) using examples of computer usage and tool usage; whatever falls outside of the examples absorbed from training data will likely still prove difficult to accomplish.

ChatGPT’s new AI agent can browse the web and create PowerPoint slideshows Read More »

anthropic-summons-the-spirit-of-flash-games-for-the-ai-age

Anthropic summons the spirit of Flash games for the AI age

For those who missed the Flash era, these in-browser apps feel somewhat like the vintage apps that defined a generation of Internet culture from the late 1990s through the 2000s when it first became possible to create complex in-browser experiences. Adobe Flash (originally Macromedia Flash) began as animation software for designers but quickly became the backbone of interactive web content when it gained its own programming language, ActionScript, in 2000.

But unlike Flash games, where hosting costs fell on portal operators, Anthropic has crafted a system where users pay for their own fun through their existing Claude subscriptions. “When someone uses your Claude-powered app, they authenticate with their existing Claude account,” Anthropic explained in its announcement. “Their API usage counts against their subscription, not yours. You pay nothing for their usage.”

A view of the Anthropic Artifacts gallery in the “Play a Game” section. Benj Edwards / Anthropic

Like the Flash games of yesteryear, any Claude-powered apps you build run in the browser and can be shared with anyone who has a Claude account. They’re interactive experiences shared with a simple link, no installation required, created by other people for the sake of creating, except now they’re powered by JavaScript instead of ActionScript.

While you can share these apps with others individually, right now Anthropic’s Artifact gallery only shows examples made by Anthropic and your own personal Artifacts. (If Anthropic expanded it into the future, it might end up feeling a bit like Scratch meets Newgrounds, but with AI doing the coding.) Ultimately, humans are still behind the wheel, describing what kinds of apps they want the AI model to build and guiding the process when it inevitably makes mistakes.

Speaking of mistakes, don’t expect perfect results at first. Usually, building an app with Claude is an interactive experience that requires some guidance to achieve your desired results. But with a little patience and a lot of tokens, you’ll be vibe coding in no time.

Anthropic summons the spirit of Flash games for the AI age Read More »

the-resume-is-dying,-and-ai-is-holding-the-smoking-gun

The résumé is dying, and AI is holding the smoking gun

Beyond volume, fraud poses an increasing threat. In January, the Justice Department announced indictments in a scheme to place North Korean nationals in remote IT roles at US companies. Research firm Gartner says that fake identity cases are growing rapidly, with the company estimating that by 2028, about 1 in 4 job applicants could be fraudulent. And as we have previously reported, security researchers have also discovered that AI systems can hide invisible text in applications, potentially allowing candidates to game screening systems using prompt injections in ways human reviewers can’t detect.

Illustration of a robot generating endless text, controlled by a scientist.

And that’s not all. Even when AI screening tools work as intended, they exhibit similar biases to human recruiters, preferring white male names on résumés—raising legal concerns about discrimination. The European Union’s AI Act already classifies hiring under its high-risk category with stringent restrictions. Although no US federal law specifically addresses AI use in hiring, general anti-discrimination laws still apply.

So perhaps résumés as a meaningful signal of candidate interest and qualification are becoming obsolete. And maybe that’s OK. When anyone can generate hundreds of tailored applications with a few prompts, the document that once demonstrated effort and genuine interest in a position has devolved into noise.

Instead, the future of hiring may require abandoning the résumé altogether in favor of methods that AI can’t easily replicate—live problem-solving sessions, portfolio reviews, or trial work periods, just to name a few ideas people sometimes consider (whether they are good ideas or not is beyond the scope of this piece). For now, employers and job seekers remain locked in an escalating technological arms race where machines screen the output of other machines, while the humans they’re meant to serve struggle to make authentic connections in an increasingly inauthentic world.

Perhaps the endgame is robots interviewing other robots for jobs performed by robots, while humans sit on the beach drinking daiquiris and playing vintage video games. Well, one can dream.

The résumé is dying, and AI is holding the smoking gun Read More »

company-apologizes-after-ai-support-agent-invents-policy-that-causes-user-uproar

Company apologizes after AI support agent invents policy that causes user uproar

On Monday, a developer using the popular AI-powered code editor Cursor noticed something strange: Switching between machines instantly logged them out, breaking a common workflow for programmers who use multiple devices. When the user contacted Cursor support, an agent named “Sam” told them it was expected behavior under a new policy. But no such policy existed, and Sam was a bot. The AI model made the policy up, sparking a wave of complaints and cancellation threats documented on Hacker News and Reddit.

This marks the latest instance of AI confabulations (also called “hallucinations”) causing potential business damage. Confabulations are a type of “creative gap-filling” response where AI models invent plausible-sounding but false information. Instead of admitting uncertainty, AI models often prioritize creating plausible, confident responses, even when that means manufacturing information from scratch.

For companies deploying these systems in customer-facing roles without human oversight, the consequences can be immediate and costly: frustrated customers, damaged trust, and, in Cursor’s case, potentially canceled subscriptions.

How it unfolded

The incident began when a Reddit user named BrokenToasterOven noticed that while swapping between a desktop, laptop, and a remote dev box, Cursor sessions were unexpectedly terminated.

“Logging into Cursor on one machine immediately invalidates the session on any other machine,” BrokenToasterOven wrote in a message that was later deleted by r/cursor moderators. “This is a significant UX regression.”

Confused and frustrated, the user wrote an email to Cursor support and quickly received a reply from Sam: “Cursor is designed to work with one device per subscription as a core security feature,” read the email reply. The response sounded definitive and official, and the user did not suspect that Sam was not human.

Screenshot:

Screenshot of an email from the Cursor support bot named Sam. Credit: BrokenToasterOven / Reddit

After the initial Reddit post, users took the post as official confirmation of an actual policy change—one that broke habits essential to many programmers’ daily routines. “Multi-device workflows are table stakes for devs,” wrote one user.

Shortly afterward, several users publicly announced their subscription cancellations on Reddit, citing the non-existent policy as their reason. “I literally just cancelled my sub,” wrote the original Reddit poster, adding that their workplace was now “purging it completely.” Others joined in: “Yep, I’m canceling as well, this is asinine.” Soon after, moderators locked the Reddit thread and removed the original post.

Company apologizes after AI support agent invents policy that causes user uproar Read More »