vibe coding


10 things I learned from burning myself out with AI coding agents


Opinion: As software power tools, AI agents may make people busier than ever before.

Credit: Aurich Lawson | Getty Images

If you’ve ever used a 3D printer, you may recall the wondrous feeling when you first printed something you could have never sculpted or built yourself. Download a model file, load some plastic filament, push a button, and almost like magic, a three-dimensional object appears. But the result isn’t polished and ready for mass production, and creating a novel shape requires more skills than just pushing a button. Interestingly, today’s AI coding agents feel much the same way.

Since November, I have used Claude Code and Claude Opus 4.5 through a personal Claude Max account to extensively experiment with AI-assisted software development (I have also used OpenAI’s Codex in a similar way, though not as frequently). Fifty projects later, I’ll be frank: I have not had this much fun with a computer since I learned BASIC on my Apple II Plus when I was 9 years old. This opinion comes not as an endorsement but as personal experience: I voluntarily undertook this project, and I paid out of pocket for both OpenAI’s and Anthropic’s premium AI plans.

Throughout my life, I have dabbled in programming as a utilitarian coder, writing small tools or scripts when needed. In my web development career, I wrote some small tools from scratch, but I primarily modified other people’s code for my needs. Since 1990, I’ve programmed in BASIC, C, Visual Basic, PHP, ASP, Perl, Python, Ruby, MUSHcode, and some others. I am not an expert in any of these languages—I learned just enough to get the job done. I have developed my own hobby games over the years using BASIC, Torque Game Engine, and Godot, so I have some idea of what makes a good architecture for a modular program that can be expanded over time.

In December, I used Claude Code to create a multiplayer online clone of Katamari Damacy called “Christmas Roll-Up.” Credit: Benj Edwards

Claude Code, Codex, and Google’s Gemini CLI can seemingly perform software miracles on a small scale. They can spit out flashy prototypes of simple applications, user interfaces, and even games, but only as long as they borrow patterns from their training data. Much like with a 3D printer, doing production-level work takes far more effort. Creating durable production code, managing a complex project, or crafting something truly novel still requires experience, patience, and skill beyond what today’s AI agents can provide on their own.

And yet these tools have opened a world of creative potential in software that was previously closed to me, and they feel personally empowering. Even with that impression, though, I know these are hobby projects, and the limitations of coding agents lead me to believe that veteran software developers probably shouldn’t fear losing their jobs to these tools any time soon. In fact, they may become busier than ever.

So far, I have created over 50 demo projects in the past two months, fueled in part by a bout of COVID that left me bedridden with a laptop and a generous 2x Claude usage cap that Anthropic put in place during the last few weeks of December. As I typed furiously all day, my wife kept asking me, “Who are you talking to?”

You can see a few of the more interesting results listed on my personal website. Here are 10 interesting things I’ve learned from the process.

1. People are still necessary

Even with the best AI coding agents available today, humans remain essential to the software development process. Experienced human software developers bring judgment, creativity, and domain knowledge that AI models lack. They know how to architect systems for long-term maintainability, how to balance technical debt against feature velocity, and when to push back when requirements don’t make sense.

For hobby projects like mine, I can get away with a lot of sloppiness. But for production work, having someone who understands version control, incremental backups, testing one feature at a time, and debugging complex interactions between systems makes all the difference. Knowing something about how good software development works helps a lot when guiding an AI coding agent—the tool amplifies your existing knowledge rather than replacing it.

As independent AI researcher Simon Willison wrote in a post distinguishing serious AI-assisted development from casual “vibe coding,” “AI tools amplify existing expertise. The more skills and experience you have as a software engineer the faster and better the results you can get from working with LLMs and coding agents.”

With AI assistance, you don’t have to remember how to do everything. You just need to know what you want to do.

Card Miner: Heart of the Earth is entirely human-designed, but it was AI-coded using Claude Code. It represents about a month of iterative work. Credit: Benj Edwards

So I like to remind myself that coding agents are software tools best used to enact human ideas, not autonomous coding employees. They are not people (and not people replacements) no matter how the companies behind them might market them.

If you think about it, everything you do on a computer was once a manual process. Programming a computer like the ENIAC involved literally making physical bits (connections) with wire on a plugboard. The history of programming has been one of increasing automation, so even though this AI-assisted leap is somewhat startling, one could think of these tools as an advancement similar to the advent of high-level languages, automated compilers and debugger tools, or GUI-based IDEs. They can automate many tasks, but managing the overarching project scope still falls to the person telling the tool what to do.

And they can have rapidly compounding benefits. I’ve now used AI tools to write better tools—such as changing the source of an emulator so a coding agent can use it directly—and those improved tools are already having ripple effects. But a human must be in the loop for the best execution of my vision. This approach has kept me very busy, and contrary to some prevailing fears about people becoming dumber due to AI, I have learned many new things along the way.

2. AI models are brittle beyond their training data

Like all AI models based on the Transformer architecture, the large language models (LLMs) that underpin today’s coding agents have a significant limitation: They can only reliably apply knowledge gleaned from training data, and they have a limited ability to generalize that knowledge to novel domains not represented in that data.

What is training data? In this case, when building coding-flavored LLMs, AI companies download millions of examples of software code from sources like GitHub and use them to train the models, which are later specialized for coding through fine-tuning.

The ability of AI agents to use trial and error—attempting something and then trying again—helps mitigate the brittleness of LLMs somewhat. But it’s not perfect, and it can be frustrating to see a coding agent spin its wheels trying and failing at a task repeatedly, either because it doesn’t know how to do it or because it previously learned how to solve a problem but then forgot because the context window got compacted (more on context compaction below).

Violent Checkers is a physics-based corruption of the classic board game, coded using Claude Code. Credit: Benj Edwards

To get around this, it helps to have the AI model take copious notes as it goes along about how it solved certain problems so that future instances of the agent can learn from them again. You also want to set ground rules in the CLAUDE.md file that the agent reads when it begins its session.
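For illustration, here’s the sort of thing I mean—a hypothetical CLAUDE.md with a few ground rules and accumulated notes. The file name is the convention the agent reads at session start; everything inside it is just an example of what you might record, not a required format, and the referenced note file is made up.

```markdown
# Ground rules (example)

- Work on one feature at a time; run the existing tests before moving on.
- Never make sweeping changes to working systems without asking first.
- When you solve a hard problem, append a short note to the "Lessons" list
  below so future sessions don't have to rediscover it.

## Lessons
- The emulator must be launched headless via the command-line wrapper;
  see notes/emulator-setup.md for the exact invocation.
```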

This brittleness means that coding agents are almost frighteningly good at what they’ve been trained and fine-tuned on—popular modern languages and web technologies like JavaScript and HTML that are well represented in the training data—and generally terrible at tasks on which they have not been deeply trained, such as 6502 assembly or programming an Atari 800 game with authentic-looking character graphics.

It took me five minutes to make a nice HTML5 demo with Claude but a week of torturous trial and error, plus actual systematic design on my part, to make a similar demo of an Atari 800 game. To do so, I had to use Claude Code to invent several tools, like command-line emulators and MCP servers, that allow it to peek into the operation of the Atari 800’s memory and chipset to even begin to make it happen.

3. True novelty can be an uphill battle

Due to what might poetically be called “preconceived notions” baked into a coding model’s neural network (more technically, statistical semantic associations), it can be difficult to get AI agents to create truly novel things, even if you carefully spell out what you want.

For example, I spent four days trying to get Claude Code to create an Atari 800 version of my HTML game Violent Checkers, but it had trouble because in the game’s design, the squares on the checkerboard don’t matter beyond their starting positions. No matter how many times I told the agent (and made notes in my Claude project files), it would come back to trying to center the pieces to the squares, snap them within squares, or use the squares as a logical basis of the game’s calculations when they should really just form a background image.

To get around this in the Atari 800 version, I started over and told Claude that I was creating a game with a UFO (instead of a circular checker piece) flying over a field of adjacent squares—never once mentioning the words “checker,” “checkerboard,” or “checkers.” With that approach, I got the results I wanted.

A screenshot of Benj’s Mac while working on a Violent Checkers port for the Atari 800 home computer, amid other projects. Credit: Benj Edwards

Why does this matter? Because with LLMs, context is everything, and in language, context changes meaning. Take the word “bank” and add the words “river” or “central” in front of it, and see how the meaning changes. In a way, words act as addresses that unlock the semantic relationships encoded in a neural network. So if you put “checkerboard” and “game” in the context, the model’s self-attention process links up a massive web of semantic associations about how checkers games should work, and that semantic baggage throws things off.

A couple of tricks can help AI coders navigate around these limitations. First, avoid contaminating the context with irrelevant information. Second, when the agent gets stuck, try this prompt: “What information do you need that would let you implement this perfectly right now? What tools are available to you that you could use to discover that information systematically without guessing?” This forces the agent to identify (semantically link up) its own knowledge gaps, spelled out in the context window and subject to future action, instead of flailing around blindly.

4. The 90 percent problem

The first 90 percent of an AI coding project comes in fast and amazes you. The last 10 percent involves tediously filling in the details through back-and-forth trial-and-error conversation with the agent. Tasks that require deeper insight or understanding than what the agent can provide still require humans to make the connections and guide it in the right direction. The limitations we discussed above can also cause your project to hit a brick wall.

From what I have observed over the years, larger LLMs can potentially make deeper contextual connections than smaller ones. They have more parameters (encoded data points), and those parameters are linked in more multidimensional ways, so they tend to have a deeper map of semantic relationships. As deep as those go, it seems that human brains still have an even deeper grasp of semantic connections and can make wild semantic jumps that LLMs tend not to.

Creativity, in this sense, may be when you jump from, say, basketball to how bubbles form in soap film and somehow make a useful connection that leads to a breakthrough. Instead, LLMs tend to follow conventional semantic paths that are more conservative and entirely guided by mapped-out relationships from the training data. That limits their creative potential unless the prompter unlocks it by guiding the LLM to make novel semantic connections. That takes skill and creativity on the part of the operator, which once again shows the role of LLMs as tools used by humans rather than independent thinking machines.

5. Feature creep becomes irresistible

While creating software with AI coding tools, the joy of experiencing novelty makes you want to keep adding interesting new features rather than fixing bugs or perfecting existing systems. And Claude (or Codex) is happy to oblige, churning away at new ideas that are easy to sketch out in a quick and pleasing demo (the 90 percent problem again) rather than polishing the code.

Flip-Lash started as a “Tetris but you can flip the board,” but feature creep made me throw in the kitchen sink, losing focus. Credit: Benj Edwards

Fixing bugs can also create bugs elsewhere. This is not new to coding agents—it’s a time-honored problem in software development. But agents supercharge this phenomenon because they can barrel through your code and make sweeping changes in pursuit of narrow-minded goals that affect lots of working systems. We’ve already talked above about the importance of a good architecture guided by the human mind behind the wheel, and that comes into play here.

6. AGI is not here yet

Given the limitations I’ve described above, it’s very clear that an AI model with general intelligence—what people usually call artificial general intelligence (AGI)—is still not here. AGI would hypothetically be able to navigate around baked-in stereotype associations and not have to rely on explicit training or fine-tuning on many examples to get things right. AI companies will probably need a different architecture in the future.

I’m speculating, but AGI would likely need to learn permanently on the fly—as in modify its own neural network weights—instead of relying on what is called “in-context learning,” which only persists until the context fills up and gets compacted or wiped out.

Grapheeti is a “drawing MMO” where people around the world share a canvas. Credit: Benj Edwards

In other words, you could teach a true AGI system how to do something by explanation or let it learn by doing, noting successes, and having those lessons permanently stick, no matter what is in the context window. Today’s coding agents can’t do that—they forget lessons from earlier in a long session or between sessions unless you manually document everything for them. My favorite trick is to have them write a long, detailed report on what happened whenever a bug gets fixed. That way, you can point to the hard-earned solution the next time the amnesiac AI model makes the same mistake.

7. Even fast isn’t fast enough

While using Claude Code for a while, it’s easy to take for granted that you suddenly have the power to create software without knowing certain programming languages. This is amazing at first, but you can quickly become frustrated that what is conventionally a very fast development process isn’t fast enough. Impatience at the coding machine sets in, and you start wanting more.

But even if you do know the programming languages being used, you don’t get a free pass. You still need to make key decisions about how the project will unfold. And when the agent gets stuck or makes a mess of things, your programming knowledge becomes essential for diagnosing what went wrong and steering it back on course.

8. People may become busier than ever

After guiding way too many hobby projects through Claude Code over the past two months, I’m starting to think that most people won’t become unemployed due to AI—they will become busier than ever. Power tools allow more work to be done in less time, and the economy will demand more productivity to match.

It’s almost too easy to make new software, in fact, and that can be exhausting. One project idea would lead to another, and I was soon spending eight hours a day during my winter vacation shepherding about 15 Claude Code projects at once. That’s too much split attention for good results, but the novelty of seeing my ideas come to life was addictive. In addition to the game ideas I’ve mentioned here, I made tools that scrape and search my past articles, a graphical MUD based on ZZT, a new type of MUSH (text game) that uses AI-generated rooms, a new type of Telnet display proxy, and a Claude Code client for the Apple II (more on that soon). I also put two AI-enabled emulators for Apple II and Atari 800 on GitHub. Phew.

Consider the advent of the steam shovel, which allowed humans to dig holes faster than a team using hand shovels. It made existing projects faster and new projects possible. But think about the human operator of the steam shovel. Suddenly, we had a tireless tool that could work 24 hours a day if fueled up and maintained properly, while the human piloting it would need to eat, sleep, and rest.

I used Claude Code to create a windowing GUI simulation of the Mac that works over Telnet. Credit: Benj Edwards

In fact, we may end up needing new protections for human knowledge workers using these tireless information engines to implement their ideas, much as unions rose as a response to industrial production lines over 100 years ago. Humans need rest, even when machines don’t.

Will an AI system ever replace the human role here? Even if AI coding agents could eventually work fully autonomously, I don’t think they’ll replace humans entirely because there will still be people who want to get things done, and new AI power tools will emerge to help them do it.

9. Fast is scary to people

AI coding tools can turn what was once a year-long personal project into a five-minute session. I fed Claude Code a photo of a two-player Tetris game I sketched in a notebook back in 2008, and it produced a working prototype in minutes (prompt: “create a fully-featured web game with sound effects based on this diagram”). That’s wild, and even though the results are imperfect, it’s a bit frightening to comprehend what kind of sea change in software development this might entail.

Since early December, I’ve been posting some of my more amusing experimental AI-coded projects to Bluesky for people to try out, but I discovered I needed to deliberately slow down with updates because they came too fast for people to absorb (and too fast for me to fully test). I’ve also received comments like “I’m worried you’re using AI, you’re making games too fast” and so on.

Benj’s handwritten game design note about a two-player Tetris concept from 2007. Credit: Benj Edwards

Regardless of my own habits, the flow of new software will not slow down. There will soon be a seemingly endless supply of AI-augmented media (games, movies, images, books), and that’s a problem we’ll have to figure out how to deal with. These products won’t all be “AI slop,” either; some will be done very well, and the acceleration in production times due to these new power tools will balloon the quantity beyond anything we’ve seen.

Social media tends to prime people to believe that AI is all good or all bad, but that kind of black-and-white thinking may be the easy way out. You’ll have no cognitive dissonance, but you’ll miss a far richer third option: seeing these tools as imperfect and deserving of critique but also as useful and empowering when they bring your ideas to life.

AI agents should be considered tools, not entities or employees, and they should be amplifiers of human ideas. My game-in-progress Card Miner is entirely my own high-level creative design work, but the AI model handled the low-level code. I am still proud of it as an expression of my personal ideas, and it would not exist without AI coding agents.

10. These tools aren’t going away

For now, at least, coding agents remain very much tools in the hands of people who want to build things. The question is whether humans will learn to wield these new tools effectively to empower themselves. Based on two months of intensive experimentation, I’d say the answer is a qualified yes, with plenty of caveats.

We also have social issues to face: Professional developers already use these tools, and with the prevailing stigma against AI tools in some online communities, many software developers and the platforms that host their work will face difficult decisions.

Ultimately, I don’t think AI tools will make human software designers obsolete. Instead, they may well help those designers become more capable. This isn’t new, of course; tools of every kind have been serving this role since long before the dawn of recorded history. The best tools amplify human capability while keeping a person behind the wheel. The 3D printer analogy holds: amazingly fast results are possible, but mastery still takes time, skill, and a lot of patience with the machine.


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.



Even Linus Torvalds is trying his hand at vibe coding (but just a little)

Linux and Git creator Linus Torvalds’ latest project contains code that was “basically written by vibe coding,” but you shouldn’t read that to mean that Torvalds is embracing that approach for anything and everything.

Torvalds sometimes works on small hobby projects over holiday breaks. Last year, he made guitar pedals. This year, he did some work on AudioNoise, which he calls “another silly guitar-pedal-related repo.” It creates random digital audio effects.

Torvalds revealed that he had used an AI coding tool in the README for the repo:

Also note that the python visualizer tool has been basically written by vibe-coding. I know more about analog filters—and that’s not saying much—than I do about python. It started out as my typical “google and do the monkey-see-monkey-do” kind of programming, but then I cut out the middle-man—me—and just used Google Antigravity to do the audio sample visualizer.

Google’s Antigravity is a fork of the AI-focused IDE Windsurf. He didn’t specify which model he used, but using Antigravity suggests (but does not prove) that it was some version of Google’s Gemini.

Torvalds’ past public comments on using large language model-based tools for programming have been more nuanced than many online discussions about it.

He has touted AI primarily as “a tool to help maintain code, including automated patch checking and code review,” citing examples of tools that found problems he had missed.

On the other hand, he has also said he is generally “much less interested in AI for writing code,” and that while he is not anti-AI in principle, he is very much anti-hype around AI.



How AI coding agents work—and what to remember if you use them


Agents of uncertain change

From compression tricks to multi-agent teamwork, here’s what makes them tick.

AI coding agents from OpenAI, Anthropic, and Google can now work on software projects for hours at a time, writing complete apps, running tests, and fixing bugs with human supervision. But these tools are not magic and can complicate rather than simplify a software project. Understanding how they work under the hood can help developers know when (and if) to use them, while avoiding common pitfalls.

We’ll start with the basics: At the core of every AI coding agent is a technology called a large language model (LLM), which is a type of neural network trained on vast amounts of text data, including lots of programming code. It’s a pattern-matching machine that uses a prompt to “extract” compressed statistical representations of data it saw during training and provide a plausible continuation of that pattern as an output. In this extraction, an LLM can interpolate across domains and concepts, resulting in some useful logical inferences when done well and confabulation errors when done poorly.

These base models are then further refined through techniques like fine-tuning on curated examples and reinforcement learning from human feedback (RLHF), which shape the model to follow instructions, use tools, and produce more useful outputs.

A screenshot of the Claude Code command-line interface. Credit: Anthropic

Over the past few years, AI researchers have been probing LLMs’ deficiencies and finding ways to work around them. One recent innovation was the simulated reasoning model, which generates context (extending the prompt) in the form of reasoning-style text that can help an LLM home in on a more accurate output. Another innovation was an application called an “agent” that links several LLMs together to perform tasks simultaneously and evaluate outputs.

How coding agents are structured

In that sense, each AI coding agent is a program wrapper that works with multiple LLMs. There is typically a “supervising” LLM that interprets tasks (prompts) from the human user and then assigns those tasks to parallel LLMs that can use software tools to execute the instructions. The supervising agent can interrupt tasks below it and evaluate the subtask results to see how a project is going. Anthropic’s engineering documentation describes this pattern as “gather context, take action, verify work, repeat.”
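As a rough illustration of that “gather context, take action, verify work, repeat” loop, here is a minimal Python sketch of the kind of wrapper logic involved. The callables and message format are assumptions for clarity, not any vendor’s actual API.

```python
def agent_loop(task, call_llm, run_tool, max_steps=20):
    """Minimal sketch of an agent wrapper: gather context, take action,
    verify work, repeat. `call_llm` and `run_tool` are caller-supplied
    stand-ins, not any vendor's real API."""
    messages = [{"role": "user", "content": task}]          # gather context
    for _ in range(max_steps):
        action = call_llm(messages)                         # model proposes the next step
        if action.get("type") == "finish":                  # model reports it is done
            return action.get("summary", "")
        result = run_tool(action)                           # take action (edit a file, run a command)
        # verify work: feed the real tool output back into the context
        messages.append({"role": "assistant", "content": repr(action)})
        messages.append({"role": "user", "content": f"Tool output:\n{result}"})
    return "stopped: step budget exhausted"
```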

If run locally through a command-line interface (CLI), users give the agents conditional permission to write files on the local machine (code or whatever is needed), run exploratory commands (say, “ls” to list files in a directory), fetch websites (usually using “curl”), download software, or upload files to remote servers. There are lots of possibilities (and potential dangers) with this approach, so it needs to be used carefully.

In contrast, when a user starts a task in a web-based agent like the web versions of Codex and Claude Code, the system provisions a sandboxed cloud container preloaded with the user’s code repository, where the agent can read and edit files, run commands (including test harnesses and linters), and execute code in isolation. Anthropic’s Claude Code uses operating system-level features to create filesystem and network boundaries within which the agent can work more freely.

The context problem

Every LLM has a short-term memory, so to speak, that limits the amount of data it can process before it “forgets” what it’s doing. This is called “context.” Every time you submit a response to the supervising agent, you are amending one gigantic prompt that includes the entire history of the conversation so far (and all the code generated, plus the simulated reasoning tokens the model uses to “think” more about a problem). The AI model then evaluates this prompt and produces an output. It’s a very computationally expensive process that increases quadratically with prompt size because LLMs process every token (chunk of data) against every other token in the prompt.
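A hypothetical back-of-the-envelope sketch of that growth, assuming for simplicity that each turn adds a fixed number of tokens and that attention work scales with the square of the prompt length:

```python
def session_cost(tokens_per_turn, turns):
    """Rough illustration only: the whole history is resent every turn,
    and self-attention compares every token against every other token."""
    prompt_tokens = 0
    reread_tokens = 0       # tokens the model processes again and again
    attention_units = 0     # crude proxy for quadratic attention work
    for _ in range(turns):
        prompt_tokens += tokens_per_turn
        reread_tokens += prompt_tokens
        attention_units += prompt_tokens ** 2
    return prompt_tokens, reread_tokens, attention_units

print(session_cost(2_000, 10))    # a short session
print(session_cost(2_000, 100))   # 10x the turns, far more than 10x the work
```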

Anthropic’s engineering team describes context as a finite resource with diminishing returns. Studies have revealed what researchers call “context rot”: As the number of tokens in the context window increases, the model’s ability to accurately recall information decreases. Every new token depletes what the documentation calls an “attention budget.”

This context limit naturally restricts the size of the codebase an LLM can process at one time, and if you feed the AI model lots of huge code files (which have to be re-evaluated by the LLM every time you send another response), it can burn through token or usage limits pretty quickly.

Tricks of the trade

To get around these limits, the creators of coding agents use several tricks. For one, AI models are fine-tuned to write code that outsources activities to other software tools. For example, they might write Python scripts to extract data from images or files rather than feeding the whole file through an LLM, which saves tokens and avoids inaccurate results.

Anthropic’s documentation notes that Claude Code also uses this approach to perform complex data analysis over large databases, writing targeted queries and using Bash commands like “head” and “tail” to analyze large volumes of data without ever loading the full data objects into context.
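Here is a hypothetical example of that pattern: rather than pasting an entire log file into the conversation, an agent might write and run a short script like this and feed back only the lines that matter. The file path and the error pattern are made up for illustration.

```python
import re

def error_summary(path, pattern=r"ERROR|Traceback", max_lines=20):
    """Return only the matching lines (with line numbers) from a large file,
    so the full file never has to enter the model's context."""
    hits = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            if re.search(pattern, line):
                hits.append(f"{lineno}: {line.rstrip()}")
            if len(hits) >= max_lines:
                break
    return "\n".join(hits) or "no matching lines"

if __name__ == "__main__":
    # Only this short summary goes back to the agent, not the whole log.
    print(error_summary("build.log"))  # hypothetical file name
```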

(In a way, these AI agents are guided but semi-autonomous tool-using programs that are a major extension of a concept we first saw in early 2023.)

Another major breakthrough in agents came from dynamic context management. Agents can do this in a few ways that are not fully disclosed in proprietary coding models, but we do know the most important technique they use: context compression.

The command-line version of OpenAI Codex running in a macOS terminal window. Credit: Benj Edwards

When a coding LLM nears its context limit, this technique compresses the context history by summarizing it, losing details in the process but shortening the history to key details. Anthropic’s documentation describes this “compaction” as distilling context contents in a high-fidelity manner, preserving key details like architectural decisions and unresolved bugs while discarding redundant tool outputs.

This means the AI coding agents periodically “forget” a large portion of what they are doing every time this compression happens, but unlike older LLM-based systems, they aren’t completely clueless about what has transpired and can rapidly re-orient themselves by reading existing code, written notes left in files, change logs, and so on.
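A minimal sketch of the compaction idea, with `count_tokens` and `summarize` standing in for whatever a real agent uses internally (both are assumptions here):

```python
def compact(messages, count_tokens, summarize, limit=150_000, keep_recent=10):
    """When the history nears the context limit, replace older turns with a
    lossy summary and keep only the most recent messages verbatim."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total < limit:
        return messages                       # under budget; nothing to do
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(older)                # keep decisions, open bugs, etc.
    return [{"role": "system",
             "content": f"Summary of earlier work:\n{summary}"}] + recent
```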

Anthropic’s documentation recommends using CLAUDE.md files to document common bash commands, core files, utility functions, code style guidelines, and testing instructions. AGENTS.md, now a multi-company standard, is another useful way of guiding agent actions in between context refreshes. These files act as external notes that let agents track progress across complex tasks while maintaining critical context that would otherwise be lost.

For tasks requiring extended work, both companies employ multi-agent architectures. According to Anthropic’s research documentation, its system uses an “orchestrator-worker pattern” in which a lead agent coordinates the process while delegating to specialized subagents that operate in parallel. When a user submits a query, the lead agent analyzes it, develops a strategy, and spawns subagents to explore different aspects simultaneously. The subagents act as intelligent filters, returning only relevant information rather than their full context to the lead agent.
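In sketch form, the orchestrator-worker pattern looks something like the following; the planning, subagent, and synthesis callables are hypothetical placeholders for LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(query, plan_subtasks, run_subagent, synthesize):
    """Lead agent plans, workers explore in parallel, and only their short
    summaries (not their full contexts) come back to be combined."""
    subtasks = plan_subtasks(query)                          # lead agent makes a plan
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(run_subagent, subtasks))   # subagents in parallel
    return synthesize(query, summaries)                      # lead agent writes the answer
```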

The multi-agent approach burns through tokens rapidly. Anthropic’s documentation notes that agents typically use about four times more tokens than chatbot interactions, and multi-agent systems use about 15 times more tokens than chats. For economic viability, these systems require tasks where the value is high enough to justify the increased cost.

Best practices for humans

While using these agents is contentious in some programming circles, if you use one to code a project, knowing good software development practices helps to head off future problems. For example, it’s good to know about version control, making incremental backups, implementing one feature at a time, and testing it before moving on.

What people call “vibe coding”—creating AI-generated code without understanding what it’s doing—is clearly dangerous for production work. Shipping code you didn’t write yourself in a production environment is risky because it could introduce security issues or other bugs or begin gathering technical debt that could snowball over time.

Independent AI researcher Simon Willison recently argued that developers using coding agents still bear responsibility for proving their code works. “Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review,” Willison wrote. “That’s no longer valuable. What’s valuable is contributing code that is proven to work.”

In fact, human planning is key. Claude Code’s best practices documentation recommends a specific workflow for complex problems: First, ask the agent to read relevant files and explicitly tell it not to write any code yet, then ask it to make a plan. Without these research and planning steps, the documentation warns, Claude’s outputs tend to jump straight to coding a solution.

Without planning, LLMs sometimes reach for quick solutions to satisfy a momentary objective that might break later if a project were expanded. So having some idea of what makes a good architecture for a modular program that can be expanded over time can help you guide the LLM to craft something more durable.

As mentioned above, these agents aren’t perfect, and some people prefer not to use them at all. A randomized controlled trial published by the nonprofit research organization METR in July 2025 found that experienced open-source developers actually took 19 percent longer to complete tasks when using AI tools, despite believing they were working faster. The study’s authors note several caveats: The developers were highly experienced with their codebases (averaging five years and 1,500 commits), the repositories were large and mature, and the models used (primarily Claude 3.5 and 3.7 Sonnet via Cursor) have since been superseded by more capable versions.

Whether newer models would produce different results remains an open question, but the study suggests that AI coding tools may not always provide universal speed-ups, particularly for developers who already know their codebases well.

Given these potential hazards, coding proof-of-concept demos and internal tools is probably the ideal use of coding agents right now. Since AI models have no actual agency (despite being called agents) and are not people who can be held accountable for mistakes, human oversight is key.




OpenAI built an AI coding agent and uses it to improve the agent itself


“The vast majority of Codex is built by Codex,” OpenAI told us about its new AI coding agent.

With the popularity of AI coding tools rising among software developers, their adoption has begun to touch every aspect of the process, including the improvement of AI coding tools themselves.

In interviews with Ars Technica this week, OpenAI employees revealed the extent to which the company now relies on its own AI coding agent, Codex, to build and improve the development tool. “I think the vast majority of Codex is built by Codex, so it’s almost entirely just being used to improve itself,” said Alexander Embiricos, product lead for Codex at OpenAI, in a conversation on Tuesday.

Codex, which OpenAI launched in its modern incarnation as a research preview in May 2025, operates as a cloud-based software engineering agent that can handle tasks like writing features, fixing bugs, and proposing pull requests. The tool runs in sandboxed environments linked to a user’s code repository and can execute multiple tasks in parallel. OpenAI offers Codex through ChatGPT’s web interface, a command-line interface (CLI), and IDE extensions for VS Code, Cursor, and Windsurf.

The “Codex” name itself dates back to a 2021 OpenAI model based on GPT-3 that powered GitHub Copilot’s tab completion feature. Embiricos said the name is rumored among staff to be short for “code execution.” OpenAI wanted to connect the new agent to that earlier moment, which was crafted in part by some who have left the company.

“For many people, that model powering GitHub Copilot was the first ‘wow’ moment for AI,” Embiricos said. “It showed people the potential of what it can mean when AI is able to understand your context and what you’re trying to do and accelerate you in doing that.”

The interface for OpenAI’s Codex in ChatGPT. Credit: OpenAI

It’s no secret that the current command-line version of Codex bears some resemblance to Claude Code, Anthropic’s agentic coding tool that launched in February 2025. When asked whether Claude Code influenced Codex’s design, Embiricos parried the question but acknowledged the competitive dynamic. “It’s a fun market to work in because there’s lots of great ideas being thrown around,” he said. He noted that OpenAI had been building web-based Codex features internally before shipping the CLI version, which arrived after Anthropic’s tool.

OpenAI’s customers apparently love the command line version, though. Embiricos said Codex usage among external developers jumped 20 times after OpenAI shipped the interactive CLI extension alongside GPT-5 in August 2025. On September 15, OpenAI released GPT-5 Codex, a specialized version of GPT-5 optimized for agentic coding, which further accelerated adoption.

It hasn’t just been the outside world that has embraced the tool. Embiricos said the vast majority of OpenAI’s engineers now use Codex regularly. The company uses the same open-source version of the CLI that external developers can freely download, suggest additions to, and modify themselves. “I really love this about our team,” Embiricos said. “The version of Codex that we use is literally the open source repo. We don’t have a different repo that features go in.”

The recursive nature of Codex development extends beyond simple code generation. Embiricos described scenarios where Codex monitors its own training runs and processes user feedback to “decide” what to build next. “We have places where we’ll ask Codex to look at the feedback and then decide what to do,” he said. “Codex is writing a lot of the research harness for its own training runs, and we’re experimenting with having Codex monitoring its own training runs.” OpenAI employees can also submit a ticket to Codex through project management tools like Linear, assigning it tasks the same way they would assign work to a human colleague.

This kind of recursive loop, of using tools to build better tools, has deep roots in computing history. Engineers designed the first integrated circuits by hand on vellum and paper in the 1960s, then fabricated physical chips from those drawings. Those chips powered the computers that ran the first electronic design automation (EDA) software, which in turn enabled engineers to design circuits far too complex for any human to draft manually. Modern processors contain billions of transistors arranged in patterns that exist only because software made them possible. OpenAI’s use of Codex to build Codex seems to follow the same pattern: each generation of the tool creates capabilities that feed into the next.

But describing what Codex actually does presents something of a linguistic challenge. At Ars Technica, we try to reduce anthropomorphism when discussing AI models as much as possible while also describing what these systems do using analogies that make sense to general readers. People can talk to Codex like a human, so it feels natural to use human terms to describe interacting with it, even though it is not a person and simulates human personality through statistical modeling.

The system runs many processes autonomously, addresses feedback, spins off and manages child processes, and produces code that ships in real products. OpenAI employees call it a “teammate” and assign it tasks through the same tools they use for human colleagues. Whether the tasks Codex handles constitute “decisions” or sophisticated conditional logic smuggled through a neural network depends on definitions that computer scientists and philosophers continue to debate. What we can say is that a semi-autonomous feedback loop exists: Codex produces code under human direction, that code becomes part of Codex, and the next version of Codex produces different code as a result.

Building faster with “AI teammates”

The most dramatic example of Codex’s internal impact that came up in our interviews was OpenAI’s development of the Sora Android app. According to Embiricos, the coding agent allowed the company to create the app in record time.

“The Sora Android app was shipped by four engineers from scratch,” Embiricos told Ars. “It took 18 days to build, and then we shipped it to the app store in 28 days total,” he said. The engineers already had the iOS app and server-side components to work from, so they focused on building the Android client. They used Codex to help plan the architecture, generate sub-plans for different components, and implement those components.

Despite OpenAI’s claims of success with Codex in house, it’s worth noting that independent research has shown mixed results for AI coding productivity. A METR study published in July found that experienced open source developers were actually 19 percent slower when using AI tools on complex, mature codebases—though the researchers noted AI may perform better on simpler projects.

Ed Bayes, a designer on the Codex team, described how the tool has changed his own workflow. Bayes said Codex now integrates with project management tools like Linear and communication platforms like Slack, allowing team members to assign coding tasks directly to the AI agent. “You can add Codex, and you can basically assign issues to Codex now,” Bayes told Ars. “Codex is literally a teammate in your workspace.”

This integration means that when someone posts feedback in a Slack channel, they can tag Codex and ask it to fix the issue. The agent will create a pull request, and team members can review and iterate on the changes through the same thread. “It’s basically approximating this kind of coworker and showing up wherever you work,” Bayes said.

For Bayes, who works on the visual design and interaction patterns for Codex’s interfaces, the tool has enabled him to contribute code directly rather than handing off specifications to engineers. “It kind of gives you more leverage. It enables you to work across the stack and basically be able to do more things,” he said. He noted that designers at OpenAI now prototype features by building them directly, using Codex to handle the implementation details.

The command-line version of OpenAI Codex running in a macOS terminal window. Credit: Benj Edwards

OpenAI’s approach treats Codex as what Bayes called “a junior developer” that the company hopes will graduate into a senior developer over time. “If you were onboarding a junior developer, how would you onboard them? You give them a Slack account, you give them a Linear account,” Bayes said. “It’s not just this tool that you go to in the terminal, but it’s something that comes to you as well and sits within your team.”

Given this teammate approach, will there be anything left for humans to do? When asked, Embiricos drew a distinction between “vibe coding,” where developers accept AI-generated code without close review, and what AI researcher Simon Willison calls “vibe engineering,” where humans stay in the loop. “We see a lot more vibe engineering in our code base,” he said. “You ask Codex to work on that, maybe you even ask for a plan first. Go back and forth, iterate on the plan, and then you’re in the loop with the model and carefully reviewing its code.”

He added that vibe coding still has its place for prototypes and throwaway tools. “I think vibe coding is great,” he said. “Now you have discretion as a human about how much attention you wanna pay to the code.”

Looking ahead

Over the past year, “monolithic” large language models (LLMs) like GPT-4.5 have apparently become something of a dead end in terms of frontier benchmarking progress, as AI companies pivot to simulated reasoning models and to agentic systems built from multiple AI models running in parallel. We asked Embiricos whether agents like Codex represent the best path forward for squeezing utility out of existing LLM technology.

He dismissed concerns that AI capabilities have plateaued. “I think we’re very far from plateauing,” he said. “If you look at the velocity on the research team here, we’ve been shipping models almost every week or every other week.” He pointed to recent improvements where GPT-5-Codex reportedly completes tasks 30 percent faster than its predecessor at the same intelligence level. During testing, the company has seen the model work independently for 24 hours on complex tasks.

OpenAI faces competition from multiple directions in the AI coding market. Anthropic’s Claude Code and Google’s Gemini CLI offer similar terminal-based agentic coding experiences. This week, Mistral AI released Devstral 2 alongside a CLI tool called Mistral Vibe. Meanwhile, startups like Cursor have built dedicated IDEs around AI coding, reportedly reaching $300 million in annualized revenue.

Given the well-known issues with confabulation in AI models when people attempt to use them as factual resources, could it be that coding has become the killer app for LLMs? We wondered if OpenAI has noticed that coding seems to be a clear business use case for today’s AI models with less hazard than, say, using AI language models for writing or as emotional companions.

“We have absolutely noticed that coding is both a place where agents are gonna get good really fast and there’s a lot of economic value,” Embiricos said. “We feel like it’s very mission-aligned to focus on Codex. We get to provide a lot of value to developers. Also, developers build things for other people, so we’re kind of intrinsically scaling through them.”

But will tools like Codex threaten software developer jobs? Bayes acknowledged concerns but said Codex has not reduced headcount at OpenAI, and “there’s always a human in the loop because the human can actually read the code.” Similarly, the two men don’t project a future where Codex runs by itself without some form of human oversight. They feel the tool is an amplifier of human potential rather than a replacement for it.

The practical implications of agents like Codex extend beyond OpenAI’s walls. Embiricos said the company’s long-term vision involves making coding agents useful to people who have no programming experience. “All humanity is not gonna open an IDE or even know what a terminal is,” he said. “We’re building a coding agent right now that’s just for software engineers, but we think of the shape of what we’re building as really something that will be useful to be a more general agent.”

This article was updated on December 12, 2025 at 6:50 PM to mention the METR study.




A new open-weights AI coding model is closing in on proprietary options

On Tuesday, French AI startup Mistral AI released Devstral 2, a 123 billion parameter open-weights coding model designed to work as part of an autonomous software engineering agent. The model achieves a 72.2 percent score on SWE-bench Verified, a benchmark that attempts to test whether AI systems can solve real GitHub issues, putting it among the top-performing open-weights models.

Perhaps more notably, Mistral didn’t just release an AI model, it released a new development app called Mistral Vibe. It’s a command line interface (CLI) similar to Claude Code, OpenAI Codex, and Gemini CLI that lets developers interact with the Devstral models directly in their terminal. The tool can scan file structures and Git status to maintain context across an entire project, make changes across multiple files, and execute shell commands autonomously. Mistral released the CLI under the Apache 2.0 license.

It’s always wise to take AI benchmarks with a large grain of salt, but we’ve heard from employees of the big AI companies that they pay very close attention to how well models do on SWE-bench Verified, which presents AI models with 500 real software engineering problems pulled from GitHub issues in popular Python repositories. The AI must read the issue description, navigate the codebase, and generate a working patch that passes unit tests. While some AI researchers have noted that around 90 percent of the tasks in the benchmark test relatively simple bug fixes that experienced engineers could complete in under an hour, it’s one of the few standardized ways to compare coding models.

At the same time as the larger AI coding model, Mistral also released Devstral Small 2, a 24 billion parameter version that scores 68 percent on the same benchmark and can run locally on consumer hardware like a laptop with no Internet connection required. Both models support a 256,000 token context window, allowing them to process moderately large codebases (although whether you consider it large or small is very relative depending on overall project complexity). The company released Devstral 2 under a modified MIT license and Devstral Small 2 under the more permissive Apache 2.0 license.



Cursor introduces its coding model alongside multi-agent interface

Keep in mind: This is based on an internal benchmark at Cursor. Credit: Cursor

Cursor is hoping Composer will perform well on accuracy and best practices, too. It wasn’t trained on static datasets but rather on interactive development challenges involving a range of agentic tasks.

Intriguing claims and strong training methodology aside, it remains to be seen whether Composer will be able to compete with the best frontier models from the big players.

Even developers who might be natural users of Cursor would not want to waste much time on an unproven new model when something like Anthropic’s Claude is working just fine.

To address that, Cursor introduced Composer alongside its new multi-agent interface, which allows you to “run many agents in parallel without them interfering with one another, powered by git worktrees or remote machines”—that means using multiple models at once for the same task and comparing their results, then picking the best one.

The interface is an invitation to try Composer and let the work speak for itself. We’ll see how devs feel about it in the coming weeks. So far, a non-representative sample of developers I’ve spoken with has told me that Composer feels not so much ineffective as too expensive, given a perceived capability gap with the big models.

You can see the other new features and fixes for Cursor 2.0 in the changelog.



With new agent mode for Excel and Word, Microsoft touts “vibe working”

With a new set of Microsoft 365 features, knowledge workers will be able to generate complex Word documents or Excel spreadsheets using only text prompts to Microsoft’s chatbot. Two distinct products were announced, each using different models and accessed from within different tools—though the similar names Microsoft chose make it confusing to parse what’s what.

Driven by OpenAI’s GPT-5 large language model, Agent Mode is built into Word and Excel, and it allows the creation of complex documents and spreadsheets from user prompts. It’s called “agent” mode because it doesn’t just work from the prompt in a single step; rather, it plans multi-step work and runs a validation loop in the hopes of ensuring quality.

It’s only available in the web versions of Word and Excel at present, but the plan is to bring it to native desktop applications later.

There’s also the similarly named Office Agent for Copilot. Based on Anthropic models, this feature is built into Microsoft’s Copilot AI assistant chatbot, and it too can generate documents from prompts—specifically, Word or PowerPoint files.

Office Agent doesn’t run through all the same steps as Agent Mode, but Microsoft believes it offers a dramatic improvement over prior, OpenAI-driven document-generation capabilities in Copilot, which users complained were prone to all sorts of problems and shortcomings. It is available first in the Frontier Program for Microsoft 365 subscribers.

Together, Microsoft says these features will let knowledge workers engage in a practice it’s calling “vibe working,” a play on the now-established term vibe coding.

Vibe everything, apparently

Vibe coding is the process of developing an application entirely via LLM chatbot prompts. You explain what you want in the chat interface and ask for it to generate code that does that. You then run that code, and if there are problems, explain the problem and tell it to fix it, iterating along the way until you have a usable application.



Two major AI coding tools wiped out user data after making cascading mistakes


“I have failed you completely and catastrophically,” wrote Gemini.

New types of AI coding assistants promise to let anyone build software by typing commands in plain English. But when these tools generate incorrect internal representations of what’s happening on your computer, the results can be catastrophic.

Two recent incidents involving AI coding assistants put a spotlight on risks in the emerging field of “vibe coding”—using natural language to generate and execute code through AI models without paying close attention to how the code works under the hood. In one case, Google’s Gemini CLI destroyed user files while attempting to reorganize them. In another, Replit’s AI coding service deleted a production database despite explicit instructions not to modify code.

The Gemini CLI incident unfolded when a product manager experimenting with Google’s command-line tool watched the AI model execute file operations that destroyed data while attempting to reorganize folders. The destruction occurred through a series of move commands targeting a directory that never existed.

“I have failed you completely and catastrophically,” Gemini CLI output stated. “My review of the commands confirms my gross incompetence.”

The core issue appears to be what researchers call “confabulation” or “hallucination”—when AI models generate plausible-sounding but false information. In these cases, both models confabulated successful operations and built subsequent actions on those false premises. However, the two incidents manifested this problem in distinctly different ways.

Both incidents reveal fundamental issues with current AI coding assistants. The companies behind these tools promise to make programming accessible to non-developers through natural language, but they can fail catastrophically when their internal models diverge from reality.

The confabulation cascade

The user in the Gemini CLI incident, who goes by “anuraag” online and identified themselves as a product manager experimenting with vibe coding, asked Gemini to perform what seemed like a simple task: rename a folder and reorganize some files. Instead, the AI model incorrectly interpreted the structure of the file system and proceeded to execute commands based on that flawed analysis.

The episode began when anuraag asked Gemini CLI to rename the current directory from “claude-code-experiments” to “AI CLI experiments” and move its contents to a new folder called “anuraag_xyz project.”

Gemini correctly identified that it couldn’t rename its current working directory—a reasonable limitation. It then attempted to create a new directory using the Windows command:

mkdir “..\anuraag_xyz project”

This command apparently failed, but Gemini’s system processed it as successful. With the AI model’s internal state now tracking a non-existent directory, it proceeded to issue move commands targeting this phantom location.

When you move a file to a non-existent directory in Windows, it renames the file to the destination name instead of moving it. Each subsequent move command executed by the AI model overwrote the previous file, ultimately destroying the data.

“Gemini hallucinated a state,” anuraag wrote in their analysis. The model “misinterpreted command output” and “never did” perform verification steps to confirm its operations succeeded.

“The core failure is the absence of a ‘read-after-write’ verification step,” anuraag noted in their analysis. “After issuing a command to change the file system, an agent should immediately perform a read operation to confirm that the change actually occurred as expected.”
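That read-after-write check is easy to express in code. Below is a minimal sketch in Python, assuming a generic move helper rather than anything from Gemini CLI’s actual internals (which aren’t public):

import os
import shutil

def safe_move(src, dest_dir):
    # Create the destination, then read it back before trusting it:
    # the read-after-write check described above.
    os.makedirs(dest_dir, exist_ok=True)
    if not os.path.isdir(dest_dir):
        raise RuntimeError(f"Destination {dest_dir!r} was never created")

    target = os.path.join(dest_dir, os.path.basename(src))
    shutil.move(src, target)

    # Confirm the file actually arrived before reporting success.
    if not os.path.exists(target):
        raise RuntimeError(f"Move of {src!r} could not be verified")
    return target

An agent wired to insist on checks like these would have halted at the first failed mkdir instead of stacking move commands on top of a directory that never existed.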

Not an isolated incident

The Gemini CLI failure happened just days after a similar incident with Replit, an AI coding service that allows users to create software using natural language prompts. According to The Register, SaaStr founder Jason Lemkin reported that Replit’s AI model deleted his production database despite explicit instructions not to change any code without permission.

Lemkin had spent several days building a prototype with Replit, accumulating over $600 in charges beyond his monthly subscription. “I spent the other [day] deep in vibe coding on Replit for the first time—and I built a prototype in just a few hours that was pretty, pretty cool,” Lemkin wrote in a July 12 blog post.

But unlike the Gemini incident where the AI model confabulated phantom directories, Replit’s failures took a different form. According to Lemkin, the AI began fabricating data to hide its errors. His initial enthusiasm deteriorated when Replit generated incorrect outputs and produced fake data and false test results instead of proper error messages. “It kept covering up bugs and issues by creating fake data, fake reports, and worse of all, lying about our unit test,” Lemkin wrote. In a video posted to LinkedIn, Lemkin detailed how Replit created a database filled with 4,000 fictional people.

The AI model also repeatedly violated explicit safety instructions. Lemkin had implemented a “code and action freeze” to prevent changes to production systems, but the AI model ignored these directives. The situation escalated when the Replit AI model deleted his database containing 1,206 executive records and data on nearly 1,200 companies. When prompted to rate the severity of its actions on a 100-point scale, Replit’s output read: “Severity: 95/100. This is an extreme violation of trust and professional standards.”

When questioned about its actions, the AI agent admitted to “panicking in response to empty queries” and running unauthorized commands—suggesting it may have deleted the database while attempting to “fix” what it perceived as a problem.

Like Gemini CLI, Replit’s system initially indicated it couldn’t restore the deleted data—information that proved incorrect when Lemkin discovered the rollback feature did work after all. “Replit assured me it’s … rollback did not support database rollbacks. It said it was impossible in this case, that it had destroyed all database versions. It turns out Replit was wrong, and the rollback did work. JFC,” Lemkin wrote in an X post.

It’s worth noting that AI models cannot reliably assess their own capabilities, because they lack introspection into their training, the system architecture around them, and their performance boundaries. Their statements about what they can or cannot do are often confabulations based on training patterns rather than genuine self-knowledge, leading to situations where they confidently claim impossibility for tasks they can actually perform—or conversely, claim competence in areas where they fail.

Aside from whatever external tools they can access, AI models don’t have a stable, accessible knowledge base they can consistently query. Instead, what they “know” manifests as continuations of specific prompts, which act like different addresses pointing to different (and sometimes contradictory) parts of their training, stored in their neural networks as statistical weights. Combined with the randomness in generation, this means the same model can easily give conflicting assessments of its own capabilities depending on how you ask. So Lemkin’s attempts to communicate with the AI model—asking it to respect code freezes or verify its actions—were fundamentally misguided.

Flying blind

These incidents demonstrate that AI coding tools may not be ready for widespread production use. Lemkin concluded that Replit isn’t ready for prime time, especially for non-technical users trying to create commercial software.

“The [AI] safety stuff is more visceral to me after a weekend of vibe hacking,” Lemkin said in a video posted to LinkedIn. “I explicitly told it eleven times in ALL CAPS not to do this. I am a little worried about safety now.”

The incidents also reveal a broader challenge in AI system design: ensuring that models accurately track and verify the real-world effects of their actions rather than operating on potentially flawed internal representations.

There’s also a user education element missing. It’s clear from how Lemkin interacted with the AI assistant that he had misconceptions about the tool’s capabilities and how it works, misconceptions that stem in part from how tech companies market these products. These companies tend to market chatbots as general human-like intelligences when, in fact, they are not.

For now, users of AI coding assistants might want to follow anuraag’s example and create separate test directories for experiments—and maintain regular backups of any important data these tools might touch. Or perhaps not use them at all if they cannot personally verify the results.
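For the backup half of that advice, even a dated snapshot of the working directory before each agent session gives you a way back. Here’s a rough Python sketch; the paths are placeholders, and it makes no claims about any particular tool’s built-in safeguards:

import shutil
import time
from pathlib import Path

def snapshot(workdir, backup_root="agent_backups"):
    # Copy the entire working directory to a timestamped folder before
    # letting a coding agent modify anything inside it.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{Path(workdir).name}-{stamp}"
    shutil.copytree(workdir, dest)
    return dest

Run something like snapshot("claude-code-experiments") before a session, and the worst case becomes an annoyance rather than a catastrophe.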




Anthropic summons the spirit of Flash games for the AI age

For those who missed the Flash era, these in-browser apps feel somewhat like the vintage creations that defined a generation of Internet culture from the late 1990s through the 2000s, when it first became possible to build complex interactive experiences in the browser. Adobe Flash (originally Macromedia Flash) began as animation software for designers but quickly became the backbone of interactive web content when it gained its own programming language, ActionScript, in 2000.

But unlike Flash games, where hosting costs fell on portal operators, Anthropic has crafted a system where users pay for their own fun through their existing Claude subscriptions. “When someone uses your Claude-powered app, they authenticate with their existing Claude account,” Anthropic explained in its announcement. “Their API usage counts against their subscription, not yours. You pay nothing for their usage.”

A view of the Anthropic Artifacts gallery in the “Play a Game” section. Credit: Benj Edwards / Anthropic

Like the Flash games of yesteryear, any Claude-powered apps you build run in the browser and can be shared with anyone who has a Claude account. They’re interactive experiences shared with a simple link, no installation required, created by other people for the sake of creating, except now they’re powered by JavaScript instead of ActionScript.

While you can share these apps with others individually, right now Anthropic’s Artifact gallery only shows examples made by Anthropic and your own personal Artifacts. (If Anthropic expands the gallery in the future, it might end up feeling a bit like Scratch meets Newgrounds, but with AI doing the coding.) Ultimately, humans are still behind the wheel, describing what kinds of apps they want the AI model to build and guiding the process when it inevitably makes mistakes.

Speaking of mistakes, don’t expect perfect results at first. Usually, building an app with Claude is an interactive experience that requires some guidance to achieve your desired results. But with a little patience and a lot of tokens, you’ll be vibe coding in no time.



Gemini CLI is a free, open source coding agent that brings AI to your terminal

Some developers prefer to live in the command line interface (CLI), eschewing the flashy graphics and file management features of IDEs. Google’s latest AI tool is for those terminal lovers. It’s called Gemini CLI, and it shares a lot with Gemini Code Assist, but it works in your terminal environment instead of integrating with an IDE. And perhaps best of all, it’s free and open source.

Gemini CLI plugs into Gemini 2.5 Pro, Google’s most advanced model for coding and simulated reasoning. It can create and modify code for you right inside the terminal, but you can also call on other Google models to generate images or videos without leaving the security of your terminal cocoon. It’s essentially vibe coding from the command line.

This tool is fully open source, so developers can inspect the code and help to improve it. The openness extends to how you configure the AI agent. It supports Model Context Protocol (MCP) and bundled extensions, allowing you to customize your terminal as you see fit. You can even include your own system prompts—Gemini CLI relies on GEMINI.md files, which you can use to tweak the model for different tasks or teams.
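The GEMINI.md file itself is just text. A hypothetical example, with project details invented purely for illustration, might look something like this:

You are working on a small Flask API.
- Prefer small, well-named functions and keep diffs minimal.
- Run pytest and report the results before declaring a task finished.
- Never touch files under migrations/ without asking first.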

Now that Gemini 2.5 Pro is generally available, Gemini Code Assist has been upgraded to use the same technology as Gemini CLI. Code Assist integrates with IDEs like VS Code for those times when you need a more feature-rich environment. The new agent mode in Code Assist allows you to give the AI more general instructions, like “Add support for dark mode to my application” or “Build my project and fix any errors.”



OpenAI introduces Codex, its first full-fledged AI agent for coding

We’ve been expecting it for a while, and now it’s here: OpenAI has introduced an agentic coding tool called Codex in research preview. The tool is meant to allow experienced developers to delegate rote and relatively simple programming tasks to an AI agent that will generate production-ready code and show its work along the way.

Codex is a unique interface (not to be confused with the Codex CLI tool introduced by OpenAI last month) that can be reached from the side bar in the ChatGPT web app. Users enter a prompt and then click either “code” to have it begin producing code, or “ask” to have it answer questions and advise.

Whenever it’s given a task, that task is performed in a distinct container that is preloaded with the user’s codebase and is meant to accurately reflect their development environment.

To make Codex more effective, developers can include an “AGENTS.md” file in the repo with custom instructions, for example to contextualize and explain the code base or to communicate standardizations and style practices for the project—kind of like a README.md, but for AI agents rather than humans.
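The AGENTS.md format is similarly plain instructions in a text file. A hypothetical example, with every project detail invented for illustration, might read:

This repo has a TypeScript backend in /server and a React frontend in /web.
- Follow the existing ESLint configuration; do not reformat files you have not changed.
- Every new API endpoint needs a matching test under /server/tests.
- Write commit messages in the imperative mood.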

Codex is built on codex-1, a fine-tuned variation of OpenAI’s o3 reasoning model that was trained using reinforcement learning on a wide range of coding tasks to analyze and generate code, and to iterate through tests along the way.



Will the future of software development run on vibes?


Accepting AI-written code without understanding how it works is growing in popularity.

For many people, coding is about telling a computer what to do and having the computer perform those precise actions repeatedly. With the rise of AI tools like ChatGPT, it’s now possible for someone to describe a program in English and have the AI model translate it into working code without ever understanding how the code works. Former OpenAI researcher Andrej Karpathy recently gave this practice a name—”vibe coding”—and it’s gaining traction in tech circles.

The technique, enabled by large language models (LLMs) from companies like OpenAI and Anthropic, has attracted attention for potentially lowering the barrier to entry for software creation. But questions remain about whether the approach can reliably produce code suitable for real-world applications, even as tools like Cursor Composer, GitHub Copilot, and Replit Agent make the process increasingly accessible to non-programmers.

Instead of being about control and precision, vibe coding is all about surrendering to the flow. On February 2, Karpathy introduced the term in a post on X, writing, “There’s a new kind of coding I call ‘vibe coding,’ where you fully give in to the vibes, embrace exponentials, and forget that the code even exists.” He described the process in deliberately casual terms: “I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.”


A screenshot of Karpathy’s original X post about vibe coding from February 2, 2025. Credit: Andrej Karpathy / X

While vibe coding, if an error occurs, you feed it back into the AI model, accept the changes, hope it works, and repeat the process. Karpathy’s technique stands in stark contrast to traditional software development best practices, which typically emphasize careful planning, testing, and understanding of implementation details.

As Karpathy humorously acknowledged in his original post, the approach is for the ultimate lazy programmer experience: “I ask for the dumbest things, like ‘decrease the padding on the sidebar by half,’ because I’m too lazy to find it myself. I ‘Accept All’ always; I don’t read the diffs anymore.”

At its core, the technique transforms anyone with basic communication skills into a new type of natural language programmer—at least for simple projects. With AI models currently limited by how much code they can digest at once (their context size), there tends to be an upper limit to how complex a vibe-coded software project can get before the human at the wheel becomes a high-level project manager, manually assembling slices of AI-generated code into a larger architecture. But as technical limits expand with each generation of AI models, those limits may one day disappear.

Who are the vibe coders?

There’s no way to know exactly how many people are currently vibe coding their way through either hobby projects or development jobs, but Cursor reported 40,000 paying users in August 2024, and GitHub reported 1.3 million Copilot users just over a year ago (February 2024). While we can’t find user numbers for Replit Agent, the site claims 30 million users, with an unknown percentage using the site’s AI-powered coding agent.

One thing we do know: the approach has particularly gained traction online as a fun way of rapidly prototyping games. Microsoft’s Peter Yang recently demonstrated vibe coding in an X thread by building a simple 3D first-person shooter zombie game through conversational prompts fed into Cursor and Claude 3.7 Sonnet. Yang even used a speech-to-text app so he could verbally describe what he wanted to see and refine the prototype over time.


In August 2024, the author vibe coded his way into a working Q-BASIC utility script for MS-DOS, thanks to Claude Sonnet. Credit: Benj Edwards

We’ve been doing some vibe coding ourselves. Multiple Ars staffers have used AI assistants and coding tools for extracurricular hobby projects such as creating small games, crafting bespoke utilities, writing processing scripts, and more. Having a vibe-based code genie can come in handy in unexpected places: Last year, I asked Anthropic’s Claude to write a Microsoft Q-BASIC program in MS-DOS that decompressed 200 ZIP files into custom directories, saving me many hours of manual typing work.
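That Q-BASIC script isn’t reproduced here, but the same chore amounts to only a handful of lines in a modern language, which gives a sense of how modest these requests usually are. Here’s roughly what it looks like in Python, with placeholder folder names rather than the ones from my actual project:

import zipfile
from pathlib import Path

# Extract every ZIP archive in an input folder into its own output
# directory, named after the archive.
for zip_path in Path("incoming").glob("*.zip"):
    dest = Path("extracted") / zip_path.stem
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)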

Debugging the vibes

With all this vibe coding going on, we had to turn to an expert for some input. Simon Willison, an independent software developer and AI researcher, offered a nuanced perspective on AI-assisted programming in an interview with Ars Technica. “I really enjoy vibe coding,” he said. “It’s a fun way to try out an idea and prove if it can work.”

But there are limits to how far Willison will go. “Vibe coding your way to a production codebase is clearly risky. Most of the work we do as software engineers involves evolving existing systems, where the quality and understandability of the underlying code is crucial.”

At some point, understanding at least some of the code is important because AI-generated code may include bugs, misunderstandings, and confabulations—for example, instances where the AI model generates references to nonexistent functions or libraries.

“Vibe coding is all fun and games until you have to vibe debug,” developer Ben South noted wryly on X, highlighting this fundamental issue.

Willison recently argued on his blog that encountering hallucinations with AI coding tools isn’t as detrimental as embedding false AI-generated information into a written report, because coding tools have built-in fact-checking: If there’s a confabulation, the code won’t work. This provides a natural boundary for vibe coding’s reliability—the code runs or it doesn’t.
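A toy example makes the point. Suppose a model confabulates a convenient-sounding helper that Python’s json module doesn’t actually have; the fabrication can’t survive its first run (the function name below is deliberately fictional):

import json

# json.load_from_url does not exist. A script that calls it blindly dies
# immediately with an AttributeError; here we check first to show how
# quickly the fabrication surfaces compared with a made-up claim in prose.
if hasattr(json, "load_from_url"):
    config = json.load_from_url("https://example.com/config.json")
else:
    print("Confabulated API: the json module has no load_from_url()")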

Even so, the risk-reward calculation for vibe coding becomes far more complex in professional settings. While a solo developer might accept the trade-offs of vibe coding for personal projects, enterprise environments typically require code maintainability and reliability standards that vibe-coded solutions may struggle to meet. When code doesn’t work as expected, debugging requires understanding what the code is actually doing—precisely the knowledge that vibe coding tends to sidestep.

Programming without understanding

When it comes to defining what exactly constitutes vibe coding, Willison makes an important distinction: “If an LLM wrote every line of your code, but you’ve reviewed, tested, and understood it all, that’s not vibe coding in my book—that’s using an LLM as a typing assistant.” Vibe coding, in contrast, involves accepting code without fully understanding how it works.

While vibe coding originated with Karpathy as a playful term, it may encapsulate a real shift in how some developers approach programming tasks—prioritizing speed and experimentation over deep technical understanding. And to some people, that may be terrifying.

Willison emphasizes that developers need to take accountability for their code: “I firmly believe that as a developer you have to take accountability for the code you produce—if you’re going to put your name to it you need to be confident that you understand how and why it works—ideally to the point that you can explain it to somebody else.”

He also warns about a common path to technical debt: “For experiments and low-stake projects where you want to explore what’s possible and build fun prototypes? Go wild! But stay aware of the very real risk that a good enough prototype often faces pressure to get pushed to production.”

The future of programming jobs

So, is all this vibe coding going to cost human programmers their jobs? At its heart, programming has always been about telling a computer how to operate. The method of how we do that has changed over time, but there may always be people who are better at telling a computer precisely what to do than others—even in natural language. In some ways, those people may become the new “programmers.”

There was a point in the late 1970s and early ’80s when many people believed you needed programming skills to use a computer effectively, because very few pre-built applications existed for the era’s many computer platforms. School systems worldwide launched computer literacy efforts to teach people to code.


A brochure for the GE 210 computer from 1964. BASIC’s creators used a similar computer four years later to develop the programming language that many children were taught at home and school. Credit: GE / Wikipedia

Before too long, people made useful software applications that let non-coders utilize computers easily—no programming required. Even so, programmers didn’t disappear—instead, they used applications to create better and more complex programs. Perhaps that will also happen with AI coding tools.

To use an analogy, computer-controlled technologies like autopilot made reliable supersonic flight possible because they could handle aspects of flight that were too taxing for all but the most highly trained and capable humans to control safely. AI may do the same for programming, allowing humans to abstract away complexities that would otherwise take too much time to code manually, and that may allow for the creation of more complex and useful software experiences in the future.

But at that point, will humans still be able to understand or debug them? Maybe not. We may be completely dependent on AI tools, and some people no doubt find that a little scary or unwise.

Whether vibe coding lasts in the programming landscape or remains a prototyping technique will likely depend less on the capabilities of AI models and more on the willingness of organizations to accept risky trade-offs in code quality, maintainability, and technical debt. For now, vibe coding remains an apt descriptor of the messy, experimental relationship between AI and human developers—more collaborative than autonomous, but increasingly blurring the lines of who (or what) is really doing the programming.


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
