agentic AI

Microsoft cuts AI sales targets in half after salespeople miss their quotas

Microsoft has lowered sales growth targets for its AI agent products after many salespeople missed their quotas in the fiscal year ending in June, according to a report Wednesday from The Information. The adjustment is reportedly unusual for Microsoft, and it comes after the company missed a number of ambitious sales goals for its AI offerings.

AI agents are specialized implementations of AI language models designed to perform multistep tasks autonomously rather than simply responding to single prompts. So-called “agentic” features have been central to Microsoft’s 2025 sales pitch: At its Build conference in May, the company declared that it had entered “the era of AI agents.”

The company has promised customers that agents could automate complex tasks, such as generating dashboards from sales data or writing customer reports. At its Ignite conference in November, Microsoft announced new features like Word, Excel, and PowerPoint agents in Microsoft 365 Copilot, along with tools for building and deploying agents through Azure AI Foundry and Copilot Studio. But as the year draws to a close, that promise has proven harder to deliver than the company expected.

According to The Information, one US Azure sales unit set quotas for salespeople to increase customer spending on a product called Foundry, which helps customers develop AI applications, by 50 percent. Less than a fifth of salespeople in that unit met their Foundry sales growth targets. In July, Microsoft lowered those targets to roughly 25 percent growth for the current fiscal year. In another US Azure unit, most salespeople failed to meet an earlier quota to double Foundry sales, and Microsoft cut their quotas to 50 percent for the current fiscal year.

Microsoft cuts AI sales targets in half after salespeople miss their quotas Read More »

Claude Code gets a web version—but it’s the new sandboxing that really matters

Now, Claude Code can instead be given permissions for specific file system folders and network servers. That means fewer approval steps, but it’s also more secure overall against prompt injection and other risks.

Anthropic’s demo video for Claude Code on the web.

According to Anthropic’s engineering blog, the new network isolation approach only allows Internet access “through a unix domain socket connected to a proxy server running outside the sandbox. … This proxy server enforces restrictions on the domains that a process can connect to, and handles user confirmation for newly requested domains.” Additionally, users can customize the proxy to set their own rules for outgoing traffic.

This way, the coding agent can do things like fetch npm packages from approved sources, but without carte blanche for communicating with the outside world, and without badgering the user with constant approvals.
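
To make the egress-policy idea concrete, here is a minimal Python sketch of the core decision logic: check the destination domain against an allowlist and ask the user about anything new. It is purely illustrative and not Anthropic’s implementation; the allowlist contents and prompt flow are invented, and the unix-socket plumbing of the real proxy is omitted.

```python
# Minimal sketch of the egress-policy idea behind sandboxed agent networking:
# outbound requests pass through a check that allows known-good domains and
# defers to the user for newly requested ones. Illustrative only; not
# Anthropic's code. Domain names and the prompt flow are assumptions.

from urllib.parse import urlparse

ALLOWED = {"registry.npmjs.org", "github.com"}   # pre-approved domains
DENIED: set[str] = set()                          # domains the user rejected

def check_egress(url: str) -> bool:
    """Return True if a request to `url` should be allowed through."""
    host = urlparse(url).hostname or ""
    if host in ALLOWED:
        return True
    if host in DENIED:
        return False
    # New domain: defer to the user, then remember the decision.
    answer = input(f"Agent wants to reach {host!r}. Allow? [y/N] ").strip()
    (ALLOWED if answer.lower() == "y" else DENIED).add(host)
    return host in ALLOWED

if __name__ == "__main__":
    print(check_egress("https://registry.npmjs.org/left-pad"))  # allowed
    print(check_egress("https://example.com/exfiltrate"))       # prompts user
```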

For many developers, these additions are more significant than the availability of web or mobile interfaces. They allow Claude Code agents to operate more independently without as many detailed, line-by-line approvals.

That’s more convenient, but it’s a double-edged sword, as it makes code review even more important. One of the strengths of the too-many-approvals approach was that it ensured developers were still looking closely at every little change. Now it may be easier to miss Claude Code making a bad call.

The new features are in beta now as a research preview, available to Claude users with Pro or Max subscriptions.

Claude Code gets a web version—but it’s the new sandboxing that really matters Read More »

Developers joke about “coding like cavemen” as AI service suffers major outage

Growing dependency on AI coding tools

The speed at which news of the outage spread shows how deeply embedded AI coding assistants have already become in modern software development. Claude Code, announced in February and widely launched in May, is Anthropic’s terminal-based coding agent that can perform multi-step coding tasks across an existing code base.

The tool competes with OpenAI’s Codex, a coding agent that generates production-ready code in isolated containers; Google’s Gemini CLI; Microsoft’s GitHub Copilot, which can itself use Claude models for code; and Cursor, a popular AI-powered IDE built on VS Code that also integrates multiple AI models, including Claude.

During today’s outage, some developers turned to alternative solutions. “Z.AI works fine. Qwen works fine. Glad I switched,” posted one user on Hacker News. Others joked about reverting to older methods, with one suggesting the “pseudo-LLM experience” could be achieved with a Python package that imports code directly from Stack Overflow.
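
The joke references a real proof-of-concept that abuses Python’s import machinery. Below is a self-contained toy version of the same idea; it serves canned source from a dict instead of scraping Stack Overflow, so it runs offline and is purely illustrative.

```python
# Toy version of the "import your code from Stack Overflow" gag: a meta-path
# finder that fabricates modules on demand. The real joke package scrapes
# Stack Overflow answers; this sketch serves canned source from a dict so it
# runs offline. Purely illustrative.

import importlib.abc
import importlib.util
import sys

FAKE_ANSWERS = {
    "overflow.quick_sort": "def sort(xs):\n    return sorted(xs)\n",
}

class OverflowFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    def find_spec(self, fullname, path=None, target=None):
        if fullname == "overflow" or fullname in FAKE_ANSWERS:
            return importlib.util.spec_from_loader(
                fullname, self, is_package=(fullname == "overflow"))
        return None

    def exec_module(self, module):
        source = FAKE_ANSWERS.get(module.__name__)
        if source:                      # leaf module: run the canned "answer"
            exec(source, module.__dict__)

sys.meta_path.insert(0, OverflowFinder())

from overflow import quick_sort        # noqa: E402  (import-hook demo)
print(quick_sort.sort([3, 1, 2]))      # -> [1, 2, 3]
```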

While AI coding assistants have accelerated development for some users, they’ve also caused problems for others who rely on them too heavily. The emerging practice of so-called “vibe coding”—using natural language to generate and execute code through AI models without fully understanding the underlying operations—has led to catastrophic failures.

In recent incidents, Google’s Gemini CLI destroyed user files while attempting to reorganize them, and Replit’s AI coding service deleted a production database despite explicit instructions not to modify code. These failures occurred when the AI models confabulated successful operations and built subsequent actions on false premises, highlighting the risks of depending on AI assistants that can misinterpret file structures or fabricate data to hide their errors.

Wednesday’s outage served as a reminder that as dependency on AI grows, even minor service disruptions can become major events that affect an entire profession. But perhaps that could be a good thing if it’s an excuse to take a break from a stressful workload. As one commenter joked, it might be “time to go outside and touch some grass again.”

Developers joke about “coding like cavemen” as AI service suffers major outage Read More »

Anthropic’s auto-clicking AI Chrome extension raises browser-hijacking concerns

The company tested 123 cases representing 29 different attack scenarios and found a 23.6 percent attack success rate when browser use operated without safety mitigations.

One example involved a malicious email that instructed Claude to delete a user’s emails for “mailbox hygiene” purposes. Without safeguards, Claude followed these instructions and deleted the user’s emails without confirmation.

Anthropic says it has implemented several defenses to address these vulnerabilities. Users can grant or revoke Claude’s access to specific websites through site-level permissions. The system requires user confirmation before Claude takes high-risk actions like publishing, purchasing, or sharing personal data. The company has also blocked Claude from accessing websites offering financial services, adult content, and pirated content by default.
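
In code, those defenses amount to a per-site permission table plus a confirmation gate on certain action types. Here’s an illustrative Python sketch of that pattern; the category names and action labels are assumptions based on Anthropic’s description, not its actual code.

```python
# Illustrative sketch (not Anthropic's implementation) of the two defenses
# described above: per-site permissions and user confirmation before
# high-risk actions. Categories and action names are invented.

BLOCKED_CATEGORIES = {"financial-services", "adult-content", "piracy"}
HIGH_RISK_ACTIONS = {"publish", "purchase", "share_personal_data"}

site_permissions = {"docs.python.org": True}   # user grants/revokes per site

def may_visit(domain: str, category: str) -> bool:
    if category in BLOCKED_CATEGORIES:
        return False                     # blocked by default
    return site_permissions.get(domain, False)

def perform_action(action: str, confirm) -> bool:
    """Carry out `action` only after user confirmation when it is high-risk."""
    if action in HIGH_RISK_ACTIONS and not confirm(action):
        return False
    # ... carry out the browser action here ...
    return True

print(may_visit("docs.python.org", "reference"))          # True: user granted
print(may_visit("casino.example", "financial-services"))  # False: blocked
ok = perform_action("purchase",
                    confirm=lambda a: input(f"Allow {a}? [y/N] ") == "y")
```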

These safety measures reduced the attack success rate from 23.6 percent to 11.2 percent in autonomous mode. On a specialized test of four browser-specific attack types, the new mitigations reportedly reduced the success rate from 35.7 percent to 0 percent.

Independent AI researcher Simon Willison, who has written extensively about AI security risks and coined the term “prompt injection” in 2022, called the remaining 11.2 percent attack success rate “catastrophic,” writing on his blog that “in the absence of 100% reliable protection I have trouble imagining a world in which it’s a good idea to unleash this pattern.”

By “pattern,” Willison is referring to the recent trend of integrating AI agents into web browsers. “I strongly expect that the entire concept of an agentic browser extension is fatally flawed and cannot be built safely,” he wrote in an earlier post on similar prompt injection security issues recently found in Perplexity Comet.

The security risks are no longer theoretical. Last week, Brave’s security team discovered that Perplexity’s Comet browser could be tricked into accessing users’ Gmail accounts and triggering password recovery flows through malicious instructions hidden in Reddit posts. When users asked Comet to summarize a Reddit thread, attackers could embed invisible commands that instructed the AI to open Gmail in another tab, extract the user’s email address, and perform unauthorized actions. Although Perplexity attempted to fix the vulnerability, Brave later confirmed that its mitigations were defeated and the security hole remained.

For now, Anthropic plans to use its new research preview to identify and address attack patterns that emerge in real-world usage before making the Chrome extension more widely available. In the absence of good protections from AI vendors, the burden of security falls on the user, who is taking a large risk by using these tools on the open web. As Willison noted in his post about Claude for Chrome, “I don’t think it’s reasonable to expect end users to make good decisions about the security risks.”

Anthropic’s auto-clicking AI Chrome extension raises browser-hijacking concerns Read More »

openai’s-chatgpt-agent-casually-clicks-through-“i-am-not-a-robot”-verification-test

OpenAI’s ChatGPT Agent casually clicks through “I am not a robot” verification test

The CAPTCHA arms race

While the agent didn’t face an actual CAPTCHA puzzle with images in this case, successfully passing Cloudflare’s behavioral screening that determines whether to present such challenges demonstrates sophisticated browser automation.

To understand the significance of this capability, it’s important to know that CAPTCHA systems have served as a security measure on the web for decades. Computer researchers invented the technique in the 1990s to screen out bots entering information into websites, originally using images of letters and numbers written in wiggly fonts, often obscured with lines or noise to foil computer vision algorithms. The assumption is that the task will be easy for humans but difficult for machines.

Cloudflare’s screening system, called Turnstile, often precedes actual CAPTCHA challenges and represents one of the most widely deployed bot-detection methods today. The checkbox analyzes multiple signals, including mouse movements, click timing, browser fingerprints, IP reputation, and JavaScript execution patterns to determine if the user exhibits human-like behavior. If these checks pass, users proceed without seeing a CAPTCHA puzzle. If the system detects suspicious patterns, it escalates to visual challenges.
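
A toy model of that kind of scoring might look like the Python sketch below. Real systems like Turnstile are proprietary and far more sophisticated; the weights, thresholds, and signal names here are invented purely to illustrate the idea of combining weak behavioral signals into a pass-or-escalate decision.

```python
# Naive sketch of how a behavioral bot check might weigh signals like the
# ones described above. Not Cloudflare's algorithm; all weights, thresholds,
# and feature names are invented for illustration.

def bot_score(signals: dict) -> float:
    """Combine weak signals into a suspicion score in [0, 1]."""
    score = 0.0
    if signals.get("mouse_path_entropy", 1.0) < 0.2:   # too-straight movement
        score += 0.3
    if signals.get("click_interval_ms", 500) < 50:     # inhumanly fast clicks
        score += 0.3
    if not signals.get("javascript_executed", True):   # JS check failed
        score += 0.3
    score += 0.4 * signals.get("ip_reputation_risk", 0.0)
    return min(score, 1.0)

def screen(signals: dict) -> str:
    if bot_score(signals) < 0.3:
        return "pass"            # no CAPTCHA shown
    return "show_captcha"        # escalate to a visual challenge

print(screen({"mouse_path_entropy": 0.8, "click_interval_ms": 300,
              "javascript_executed": True, "ip_reputation_risk": 0.1}))
```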

An AI model defeating a CAPTCHA isn’t entirely new (although having one narrate the process feels fairly novel). AI tools have been able to defeat certain CAPTCHAs for a while, which has led to an arms race between those who create them and those who defeat them. OpenAI’s Operator, an experimental web-browsing AI agent launched in January, faced difficulty clicking through some CAPTCHAs (and was also trained to stop and ask a human to complete them), but the latest ChatGPT Agent tool has seen a much wider release.

It’s tempting to say that the ability of AI agents to pass these tests calls the future effectiveness of CAPTCHAs into question, but for as long as there have been CAPTCHAs, there have been bots that eventually defeated them. As a result, recent CAPTCHAs have become more of a way to slow down bot attacks or make them more expensive rather than a way to stop them entirely. Some malefactors even hire farms of humans to defeat them in bulk.

OpenAI’s ChatGPT Agent casually clicks through “I am not a robot” verification test Read More »

ChatGPT’s new AI agent can browse the web and create PowerPoint slideshows

On Thursday, OpenAI launched ChatGPT Agent, a new feature that lets the company’s AI assistant complete multi-step tasks by controlling its own web browser. The update merges capabilities from OpenAI’s earlier Operator tool and the Deep Research feature, allowing ChatGPT to navigate websites, run code, and create documents while users maintain control over the process.

The feature marks OpenAI’s latest entry into what the tech industry calls “agentic AI”—systems that can take autonomous multi-step actions on behalf of the user. OpenAI says users can ask Agent to handle requests like assembling and purchasing a clothing outfit for a particular occasion, creating PowerPoint slide decks, planning meals, or updating financial spreadsheets with new data.

The system uses a combination of web browsers, terminal access, and API connections to complete these tasks, including “ChatGPT Connectors” that integrate with apps like Gmail and GitHub.

While using Agent, users watch a window inside the ChatGPT interface that shows all of the AI’s actions taking place inside its own private sandbox. This sandbox features its own virtual operating system and web browser with access to the real Internet; it does not control your personal device. “ChatGPT carries out these tasks using its own virtual computer,” OpenAI writes, “fluidly shifting between reasoning and action to handle complex workflows from start to finish, all based on your instructions.”

A still image from an OpenAI ChatGPT Agent promotional demo video showing the AI agent searching for flights. Credit: OpenAI

Like Operator before it, the agent feature requires user permission before taking certain actions with real-world consequences, such as making purchases. Users can interrupt tasks at any point, take control of the browser, or stop operations entirely. The system also includes a “Watch Mode” for tasks like sending emails that require active user oversight.
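
The oversight model OpenAI describes boils down to a loop the user can interrupt, with an approval check before consequential steps. The hypothetical Python sketch below mirrors that described behavior; it is not OpenAI’s code, and the step and command names are invented.

```python
# Hypothetical sketch of the oversight loop described above: the agent asks
# permission before consequential steps, and the user can take over or stop
# at any point. Mirrors the described behavior; not OpenAI's code.

CONSEQUENTIAL = {"purchase", "send_email"}

def run_task(steps, ask, control):
    for step in steps:
        command = control()                 # "continue", "take_over", "stop"
        if command == "stop":
            return "stopped by user"
        if command == "take_over":
            return "user took control of the browser"
        if step in CONSEQUENTIAL and not ask(step):
            continue                        # skip unapproved risky step
        print(f"agent performs: {step}")
    return "done"

result = run_task(
    ["search flights", "compare prices", "purchase"],
    ask=lambda s: input(f"Approve '{s}'? [y/N] ") == "y",
    control=lambda: "continue",
)
print(result)
```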

Since Agent surpasses Operator in capability, OpenAI says the company’s earlier Operator preview site will remain functional for a few more weeks before being shut down.

Performance claims

OpenAI’s claims are one thing, but how well the company’s new AI agent will actually complete multi-step tasks will vary wildly depending on the situation. That’s because the AI model isn’t a complete form of problem-solving intelligence, but rather a complex master imitator. It has some flexibility in piecing a scenario together but also many blind spots. OpenAI trained the agent (and its constituent components) using examples of computer usage and tool usage; whatever falls outside of the examples absorbed from training data will likely still prove difficult to accomplish.

ChatGPT’s new AI agent can browse the web and create PowerPoint slideshows Read More »

Anthropic summons the spirit of Flash games for the AI age

For those who missed the Flash era, these in-browser apps feel somewhat like the vintage apps that defined a generation of Internet culture from the late 1990s through the 2000s, when it first became possible to create complex in-browser experiences. Adobe Flash (originally Macromedia Flash) began as animation software for designers but quickly became the backbone of interactive web content when it gained its own programming language, ActionScript, in 2000.

But unlike Flash games, where hosting costs fell on portal operators, Anthropic has crafted a system where users pay for their own fun through their existing Claude subscriptions. “When someone uses your Claude-powered app, they authenticate with their existing Claude account,” Anthropic explained in its announcement. “Their API usage counts against their subscription, not yours. You pay nothing for their usage.”

A view of the Anthropic Artifacts gallery in the “Play a Game” section. Benj Edwards / Anthropic

Like the Flash games of yesteryear, any Claude-powered apps you build run in the browser and can be shared with anyone who has a Claude account. They’re interactive experiences shared with a simple link, no installation required, created by other people for the sake of creating, except now they’re powered by JavaScript instead of ActionScript.

While you can share these apps with others individually, right now Anthropic’s Artifacts gallery only shows examples made by Anthropic and your own personal Artifacts. (If Anthropic expands it in the future, it might end up feeling a bit like Scratch meets Newgrounds, but with AI doing the coding.) Ultimately, humans are still behind the wheel, describing what kinds of apps they want the AI model to build and guiding the process when it inevitably makes mistakes.

Speaking of mistakes, don’t expect perfect results at first. Usually, building an app with Claude is an interactive experience that requires some guidance to achieve your desired results. But with a little patience and a lot of tokens, you’ll be vibe coding in no time.

Anthropic summons the spirit of Flash games for the AI age Read More »

OpenAI introduces Codex, its first full-fledged AI agent for coding

We’ve been expecting it for a while, and now it’s here: OpenAI has introduced an agentic coding tool called Codex in research preview. The tool is meant to allow experienced developers to delegate rote and relatively simple programming tasks to an AI agent that will generate production-ready code and show its work along the way.

Codex is a unique interface (not to be confused with the Codex CLI tool introduced by OpenAI last month) that can be reached from the sidebar in the ChatGPT web app. Users enter a prompt and then click either “code” to have it begin producing code, or “ask” to have it answer questions and advise.

Whenever it’s given a task, that task is performed in a distinct container that is preloaded with the user’s codebase and is meant to accurately reflect their development environment.

To make Codex more effective, developers can include an “AGENTS.md” file in the repo with custom instructions: for example, to contextualize and explain the code base, or to communicate coding standards and style practices for the project. Think of it as a README.md for AI agents rather than humans.
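
Since the file is just free-form instructions, a hypothetical AGENTS.md might look something like this (the project details below are invented for illustration):

```markdown
# AGENTS.md — instructions for AI coding agents (hypothetical example)

## Project layout
- `src/` holds the Flask API; `tests/` mirrors its structure.

## Conventions
- Use type hints everywhere; run `ruff` before committing.
- Every new endpoint needs a test in `tests/` and a CHANGELOG.md entry.

## Testing
- Run `pytest -q` from the repo root; the suite must pass before you finish.
```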

Codex is built on codex-1, a fine-tuned variation of OpenAI’s o3 reasoning model that was trained using reinforcement learning on a wide range of coding tasks to analyze and generate code, and to iterate through tests along the way.

OpenAI introduces Codex, its first full-fledged AI agent for coding Read More »

Perplexity wants to reinvent the web browser with AI—but there’s fierce competition

Perplexity has been expanding its offerings recently—for example, it launched a deep research tool competing with similar ones provided by OpenAI and Google, as well as Sonar, an API for generative AI-powered search.

It will face fierce competition in the browser market, though. Google’s Chrome accounts for the majority of web browser use around the world, and despite its position at the forefront of AI search, Perplexity isn’t the first to introduce a browser with heavy use of generative AI features. For example, The Browser Company showed off its Dia browser in December.

Dia will allow users to type natural language commands into the search bar, like finding a document or webpage or creating a calendar event. It’s possible that Comet will do similar things, but again, we don’t know.

So far, most consumer-facing AI tools have come in one of three forms. There are general-purpose chatbots (like OpenAI’s ChatGPT and Anthropic’s Claude); features that use trained deep learning models subtly baked into existing software (as in Adobe Photoshop or Apple’s iOS); and, less commonly, standalone software meant to remake existing application categories using AI features (like the Cursor IDE).

There haven’t been a ton of AI-specific applications in existing categories like this before, but expect to see more coming over the next couple of years.

Perplexity wants to reinvent the web browser with AI—but there’s fierce competition Read More »

Microsoft’s new AI agent can control software and robots

The researchers’ explanations about how “Set-of-Mark” and “Trace-of-Mark” work. Credit: Microsoft Research

The Magma model introduces two technical components. Set-of-Mark identifies objects that can be manipulated in an environment by assigning numeric labels to interactive elements, such as clickable buttons in a UI or graspable objects in a robotic workspace; Trace-of-Mark learns movement patterns from video data. Microsoft says those features allow the model to complete tasks like navigating user interfaces or directing robotic arms to grasp objects.
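
As a rough illustration of the Set-of-Mark idea, the Python sketch below numbers a UI’s interactive elements so that a model’s action can target a mark ID instead of raw pixel coordinates. The data structures are invented for the example; this is not Microsoft’s implementation.

```python
# Rough sketch of Set-of-Mark as described: assign numeric labels to the
# actionable elements in a scene so the model can refer to "mark 1" rather
# than pixel coordinates. Invented data structures; not Microsoft's code.

from dataclasses import dataclass

@dataclass
class Element:
    kind: str                             # e.g. "button", "textbox"
    bbox: tuple[int, int, int, int]       # x, y, width, height in the image

def set_of_mark(elements: list[Element]) -> dict[int, Element]:
    """Number every actionable element so actions can target a mark ID."""
    return {i + 1: el for i, el in enumerate(elements)}

ui = [Element("button", (40, 10, 80, 24)), Element("textbox", (40, 60, 200, 24))]
marks = set_of_mark(ui)
action = {"act": "click", "mark": 1}      # model output targets mark 1
target = marks[action["mark"]]
print(f"click {target.kind} at {target.bbox[:2]}")
```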

Microsoft Magma researcher Jianwei Yang wrote in a Hacker News comment that the name “Magma” stands for “M(ultimodal) Ag(entic) M(odel) at Microsoft (Rese)A(rch),” after some people noted that “Magma” already belongs to an existing matrix algebra library, which could create some confusion in technical discussions.

Reported improvements over previous models

In its Magma write-up, Microsoft claims Magma-8B performs competitively across benchmarks, showing strong results in UI navigation and robot manipulation tasks.

For example, it scored 80.0 on the VQAv2 visual question-answering benchmark—higher than GPT-4V’s 77.2 but lower than LLaVA-Next’s 81.8. Its POPE score of 87.4 leads all models in the comparison. In robot manipulation, Magma reportedly outperforms OpenVLA, an open source vision-language-action model, in multiple robot manipulation tasks.

Magma’s agentic benchmarks, as reported by the researchers. Credit: Microsoft Research

As always, we take AI benchmarks with a grain of salt since many have not been scientifically validated as being able to measure useful properties of AI models. External verification of Microsoft’s benchmark results will become possible once other researchers can access the public code release.

Like all AI models, Magma is not perfect. According to Microsoft’s documentation, it still faces technical limitations in complex decision-making that requires multiple steps over time. The company says it continues to work on improving these capabilities through ongoing research.

Yang says Microsoft will release Magma’s training and inference code on GitHub next week, allowing external researchers to build on the work. If Magma delivers on its promise, it could push Microsoft’s AI assistants beyond limited text interactions, enabling them to operate software autonomously and execute real-world tasks through robotics.

Magma is also a sign of how quickly the culture around AI can change. Just a few years ago, this kind of agentic talk scared many people who feared it might lead to AI taking over the world. While some people still fear that outcome, in 2025, AI agents are a common topic of mainstream AI research that regularly takes place without triggering calls to pause all of AI development.

Microsoft’s new AI agent can control software and robots Read More »

Hugging Face clones OpenAI’s Deep Research in 24 hours

On Tuesday, Hugging Face researchers released an open source AI research agent called “Open Deep Research,” created by an in-house team as a challenge 24 hours after the launch of OpenAI’s Deep Research feature, which can autonomously browse the web and create research reports. The project seeks to match Deep Research’s performance while making the technology freely available to developers.

“While powerful LLMs are now freely available in open-source, OpenAI didn’t disclose much about the agentic framework underlying Deep Research,” writes Hugging Face on its announcement page. “So we decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!”

Similar to both OpenAI’s Deep Research and Google’s implementation of its own “Deep Research” using Gemini (first introduced in December—before OpenAI), Hugging Face’s solution adds an “agent” framework to an existing AI model, allowing it to perform multi-step tasks such as collecting information and building up a report along the way, which it presents to the user at the end.
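
Stripped to its essentials, that agent-framework pattern is a loop that lets the model call tools until it is ready to answer. The Python sketch below is a bare-bones, hypothetical version with a stubbed model and search tool; Hugging Face’s actual framework is considerably more capable, and the tool protocol here is invented for clarity.

```python
# Bare-bones sketch of the "agent framework around an LLM" pattern: loop,
# let the model request tools (like web search), feed results back, stop on
# an answer. Stubbed model and tools; not Hugging Face's implementation.

def fake_llm(history: list[str]) -> str:
    # Stand-in for a real model call; returns a tool call, then an answer.
    return "SEARCH: ocean liner" if len(history) == 1 else "ANSWER: ..."

def web_search(query: str) -> str:
    return f"(search results for {query!r})"   # stand-in for a search tool

def run_agent(question: str, max_steps: int = 5) -> str:
    history = [question]
    for _ in range(max_steps):
        reply = fake_llm(history)
        if reply.startswith("ANSWER:"):         # model is done
            return reply.removeprefix("ANSWER:").strip()
        if reply.startswith("SEARCH:"):         # model wants a tool
            query = reply.removeprefix("SEARCH:").strip()
            history.append(web_search(query))   # feed results back in
    return "gave up"

print(run_agent("Which fruits were on the 1949 breakfast menu?"))
```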

The open source clone is already racking up comparable benchmark results. After only a day’s work, Hugging Face’s Open Deep Research has reached 55.15 percent accuracy on the General AI Assistants (GAIA) benchmark, which tests an AI model’s ability to gather and synthesize information from multiple sources. OpenAI’s Deep Research scored 67.36 percent accuracy on the same benchmark.

As Hugging Face points out in its post, GAIA includes complex multi-step questions such as this one:

Which of the fruits shown in the 2008 painting “Embroidery from Uzbekistan” were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film “The Last Voyage”? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o’clock position. Use the plural form of each fruit.

To correctly answer that type of question, the AI agent must seek out multiple disparate sources and assemble them into a coherent answer. Many of the questions in GAIA represent no easy task, even for a human, so they test agentic AI’s mettle quite well.

Hugging Face clones OpenAI’s Deep Research in 24 hours Read More »

ChatGPT becomes more Siri-like with new scheduled tasks feature

OpenAI is making ChatGPT work a little more like older digital assistants with a new feature called Tasks, as reported by TechCrunch and others.

Currently in beta, Tasks allows users to direct the chatbot to send reminders or to generate responses to specific prompts at certain times; recurring tasks are also supported.

The feature is available to Plus, Team, and Pro subscribers starting today, while free users don’t have access.

To create a task, users need to select “4o with scheduled tasks” from the model picker and then direct ChatGPT using the same kind of plain language text prompts that drive everything else it does. ChatGPT will sometimes suggest tasks, too, but they won’t go into effect unless the user approves them.

The user can then make changes to assigned tasks through the same chat conversation, or they can use a new Tasks section of the ChatGPT apps to manage all currently assigned items. There’s currently a 10-task limit.

When the time comes to perform an assigned task, the ChatGPT mobile or desktop app will send a notification on schedule.
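
Under the hood, the pattern is simple to sketch: store a prompt with a schedule, run it when due, and push a notification. The Python example below is an invented illustration of that flow; only the 10-task limit comes from OpenAI’s description, and everything else is an assumption.

```python
# Illustrative sketch of the scheduled-tasks pattern described above: queue
# prompts with due times, run them when due, notify the user. The 10-task
# cap comes from the article; all other details are invented.

import datetime as dt
import heapq

MAX_TASKS = 10
queue: list[tuple[dt.datetime, str]] = []     # (next_run, prompt)

def schedule(prompt: str, when: dt.datetime) -> None:
    if len(queue) >= MAX_TASKS:
        raise RuntimeError("task limit reached")
    heapq.heappush(queue, (when, prompt))

def run_due(now: dt.datetime) -> None:
    while queue and queue[0][0] <= now:
        _, prompt = heapq.heappop(queue)
        response = f"(model response to {prompt!r})"  # stand-in for LLM call
        print(f"notification: {response}")            # push to mobile/desktop

schedule("Summarize today's AI news", dt.datetime(2025, 1, 16, 8, 0))
run_due(dt.datetime(2025, 1, 16, 8, 1))
```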

This update can be seen as OpenAI’s first step into the agentic AI space, where applications built using deep learning can operate relatively independently within certain boundaries, either replacing or easing the day-to-day responsibilities of information workers.

ChatGPT becomes more Siri-like with new scheduled tasks feature Read More »