microsoft

AI isn’t ready to replace human coders for debugging, researchers say

A graph showing agents with tools nearly doubling the success rates of those without, but still achieving a success score under 50 percent

Agents using debugging tools drastically outperformed those that didn’t, but their success rate still wasn’t high enough. Credit: Microsoft Research

This approach is much more successful than relying on the models as they’re usually used, but when your best case is a 48.4 percent success rate, you’re not ready for primetime. The limitations are likely because the models don’t fully understand how to best use the tools, and because their current training data is not tailored to this use case.

“We believe this is due to the scarcity of data representing sequential decision-making behavior (e.g., debugging traces) in the current LLM training corpus,” the blog post says. “However, the significant performance improvement… validates that this is a promising research direction.”

This initial report is just the start of the efforts, the post claims. The next step is to “fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs.” If the model is large, the best move to save inference costs may be to “build a smaller info-seeking model that can provide relevant information to the larger one.”

This isn’t the first time we’ve seen outcomes that suggest some of the ambitious ideas about AI agents directly replacing developers are pretty far from reality. There have been numerous studies already showing that even though an AI tool can sometimes create an application that seems acceptable to the user for a narrow task, the models tend to produce code laden with bugs and security vulnerabilities, and they aren’t generally capable of fixing those problems.

This is an early step on the path to AI coding agents, but most researchers agree it remains likely that the best outcome is an agent that saves a human developer a substantial amount of time, not one that can do everything they can do.

That groan you hear is users’ reaction to Recall going back into Windows

Security and privacy advocates are girding themselves for another uphill battle against Recall, the AI tool rolling out in Windows 11 that will take a screenshot every three seconds, then index and store everything a user does.

When Recall was first introduced in May 2024, security practitioners roundly castigated it for creating a gold mine for malicious insiders, criminals, or nation-state spies if they managed to gain even brief administrative access to a Windows device. Privacy advocates warned that Recall was ripe for abuse in intimate partner violence settings. They also noted that there was nothing stopping Recall from preserving sensitive disappearing content sent through privacy-protecting messengers such as Signal.

Enshittification at a new scale

Following months of backlash, Microsoft later suspended Recall. On Thursday, the company said it was reintroducing Recall. It currently is available only to insiders with access to the Windows 11 Build 26100.3902 preview version. Over time, the feature will be rolled out more broadly. Microsoft officials wrote:

Recall (preview) saves you time by offering an entirely new way to search for things you’ve seen or done on your PC securely. With the AI capabilities of Copilot+ PCs, it’s now possible to quickly find and get back to any app, website, image, or document just by describing its content. To use Recall, you will need to opt-in to saving snapshots, which are images of your activity, and enroll in Windows Hello to confirm your presence so only you can access your snapshots. You are always in control of what snapshots are saved and can pause saving snapshots at any time. As you use your Copilot+ PC throughout the day working on documents or presentations, taking video calls, and context switching across activities, Recall will take regular snapshots and help you find things faster and easier. When you need to find or get back to something you’ve done previously, open Recall and authenticate with Windows Hello. When you’ve found what you were looking for, you can reopen the application, website, or document, or use Click to Do to act on any image or text in the snapshot you found.

Microsoft is hoping that the concessions requiring opt-in and the ability to pause Recall will help quell the collective revolt that broke out last year. It likely won’t, for various reasons.

Windows 11’s Copilot Vision wants to help you learn to use complicated apps

Some elements of Microsoft’s Copilot assistant in Windows 11 have felt like a solution in search of a problem—and it hasn’t helped that Microsoft has frequently changed Copilot’s capabilities, turning it from a native Windows app into a web app and back again.

But I find myself intrigued by a new addition to Copilot Vision that Microsoft began rolling out this week to testers in its Windows Insider program. Copilot Vision launched late last year as a feature that could look at pages in the Microsoft Edge browser and answer questions based on those pages’ contents. The new Vision update extends that capability to any app window, allowing you to ask Copilot not just about the contents of a document but also about the user interface of the app itself.

Microsoft’s Copilot Vision update can see the contents of any app window you share with it. Credit: Microsoft

Provided the app works as intended—not a given for any software, but especially for AI features—Copilot Vision could replace “frantic Googling” as a way to learn how to use a new app or how to do something new or obscure in complex PC apps like Word, Excel, or Photoshop. I recently switched from Photoshop to Affinity Photo, for example, and I’m still finding myself tripped up by small differences in workflows and UI between the two apps. Copilot Vision could, in theory, ease that sort of transition.

Carmack defends AI tools after Quake fan calls Microsoft AI demo “disgusting”

The current generative Quake II demo represents a slight advancement from Microsoft’s previous generative AI gaming model (confusingly titled “WHAM” with only one “M”) we covered in February. That earlier model, while showing progress in generating interactive gameplay footage, operated at 300×180 resolution at 10 frames per second, far below practical modern gaming standards. The new WHAMM demonstration roughly doubles the resolution in each dimension, to 640×360. However, both remain well below what gamers expect from a functional video game in almost every conceivable way. It truly is an AI tech demo.

A Microsoft diagram of the WHAM system. Credit: Microsoft

For example, the technology faces substantial challenges beyond just performance metrics. Microsoft acknowledges several limitations, including poor enemy interactions, a short context length of just 0.9 seconds (meaning the system forgets objects outside its view), and unreliable numerical tracking for game elements like health values.

Which brings us to another point: A significant gap persists between the technology’s marketing portrayal and its practical applications. While industry veterans like Carmack and Sweeney view AI as another tool in the development arsenal, demonstrations like the Quake II instance may create inflated expectations about AI’s current capabilities for complete game generation.

The most realistic near-term application of generative AI technology remains as coding assistants and perhaps rapid prototyping tools for developers, rather than a drop-in replacement for traditional game development pipelines. The technology’s current limitations suggest that human developers will remain essential for creating compelling, polished game experiences for now. But given the general pace of progress, that might be small comfort for those who worry about losing jobs to AI in the near-term.

Ultimately, Sweeney says not to worry: “There’s always a fear that automation will lead companies to make the same old products while employing fewer people to do it,” Sweeney wrote in a follow-up post on X. “But competition will ultimately lead to companies producing the best work they’re capable of given the new tools, and that tends to mean more jobs.”

And Carmack closed with this: “Will there be more or less game developer jobs? That is an open question. It could go the way of farming, where labor-saving technology allow a tiny fraction of the previous workforce to satisfy everyone, or it could be like social media, where creative entrepreneurship has flourished at many different scales. Regardless, ‘don’t use power tools because they take people’s jobs’ is not a winning strategy.”

Microsoft turns 50 today, and it made me think about MS-DOS 5.0

On this day in 1975, Bill Gates and Paul Allen founded a company called Micro-Soft in Albuquerque, New Mexico.

The two men had worked together before, as members of the Lakeside Programming group in the early 70s and as co-founders of a road traffic analysis company called Traf-O-Data. But Micro-Soft, later renamed to drop the hyphen and relocated to its current headquarters in Redmond, Washington, would be the company that would transform personal computing over the next five decades.

I’m not here to do a history of Microsoft, because Wikipedia already exists and because the company has already put together a gauzy 50th-anniversary retrospective site with some retro-themed wallpapers. But the anniversary did make me try to remember which Microsoft product I consciously used for the first time, the one that made me aware of the company and the work it was doing.

To get the answer, just put a decimal point in the number “50”—my first Microsoft product was MS-DOS 5.0.

Riding with DOS in the Windows era

I remember this version of MS-DOS so vividly because it was the version that we ran on our first computer. I couldn’t actually tell you what computer it was, though, not because I don’t remember it but because it was a generic yellowed hand-me-down that was prodigiously out of date, given to us by well-meaning people from our church who didn’t know enough to know how obsolete the system was.

It was a clone of the original IBM PC 5150, initially released in 1981; I believe we took ownership of it sometime in 1995 or 1996. It had an Intel 8088, two 5.25-inch floppy drives, and 500-something KB of RAM (also, if memory serves, a sac of spider eggs). But it had no hard drive inside, meaning that anything I wanted to run on or save from this computer needed to use a pile of moldering black plastic diskettes, more than a few of which were already going bad.

AI search engines cite incorrect sources at an alarming 60% rate, study says

A new study from Columbia Journalism Review’s Tow Center for Digital Journalism finds serious accuracy issues with generative AI models used for news searches. The research tested eight AI-driven search tools equipped with live search functionality and discovered that the AI models incorrectly answered more than 60 percent of queries about news sources.

Researchers Klaudia Jaźwińska and Aisvarya Chandrasekar noted in their report that roughly 1 in 4 Americans now uses AI models as alternatives to traditional search engines. This raises serious concerns about reliability, given the substantial error rate uncovered in the study.

Error rates varied notably among the tested platforms. Perplexity provided incorrect information in 37 percent of the queries tested, whereas ChatGPT Search incorrectly identified 67 percent (134 out of 200) of articles queried. Grok 3 demonstrated the highest error rate, at 94 percent.

A graph from CJR shows “confidently wrong” search results. Credit: CJR

For the tests, researchers fed direct excerpts from actual news articles to the AI models, then asked each model to identify the article’s headline, original publisher, publication date, and URL. They ran 1,600 queries across the eight different generative search tools.
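As a quick check on the figures quoted above, here is a small sketch of the arithmetic. It assumes the 1,600 queries were split evenly across the eight tools, which the 134-out-of-200 figure for ChatGPT Search suggests but the excerpt does not state outright. The variable names are illustrative.

```python
# Figures quoted in the article; the even split across tools is an assumption.
TOTAL_QUERIES = 1600
TOOLS_TESTED = 8

queries_per_tool = TOTAL_QUERIES // TOOLS_TESTED

# ChatGPT Search: 134 incorrect identifications out of its share of queries.
chatgpt_incorrect = 134
chatgpt_error_rate = chatgpt_incorrect / queries_per_tool

print(queries_per_tool)             # 200
print(f"{chatgpt_error_rate:.0%}")  # 67%
```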

The study highlighted a common trend among these AI models: rather than declining to respond when they lacked reliable information, the models frequently provided confabulations—plausible-sounding incorrect or speculative answers. The researchers emphasized that this behavior was consistent across all tested models, not limited to just one tool.

Surprisingly, premium paid versions of these AI search tools fared even worse in certain respects. Perplexity Pro ($20/month) and Grok 3’s premium service ($40/month) confidently delivered incorrect responses more often than their free counterparts. Though these premium models correctly answered a higher number of prompts, their reluctance to decline to answer when uncertain drove higher overall error rates.

Issues with citations and publisher control

The CJR researchers also uncovered evidence suggesting some AI tools ignored Robots Exclusion Protocol settings, which publishers use to prevent unauthorized access. For example, Perplexity’s free version correctly identified all 10 excerpts from paywalled National Geographic content, despite National Geographic explicitly disallowing Perplexity’s web crawlers.
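For context, robots.txt rules are advisory: a crawler has to check them voluntarily before fetching a page. The sketch below shows how a well-behaved crawler might do that with Python's standard `urllib.robotparser`. The robots.txt contents, the `PerplexityBot` and `SomeOtherBot` user-agent strings, and the URL are assumptions made for illustration, not National Geographic's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is blocked, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in memory, no network request

url = "https://example.com/premium/article"  # placeholder URL
print(parser.can_fetch("PerplexityBot", url))  # False
print(parser.can_fetch("SomeOtherBot", url))   # True
```

Nothing in the protocol enforces the answer. A crawler that never calls a check like `can_fetch` can still request the page, which is the behavior the researchers say they observed.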

Six ways Microsoft’s portable Xbox could be a Steam Deck killer

Bring old Xbox games to PC

The ultimate handheld system seller. Credit: Microsoft / Bizarre Creations

Microsoft has made a lot of hay over the way recent Xbox consoles can play games dating all the way back to the original Xbox. If Microsoft wants to set its first gaming handheld apart, it should make those old console games officially available on a Windows-based system for the first time.

The ability to download previous console games dating back to the Xbox 360 era (or beyond) would be an instant “system seller” feature for any portable Xbox. While this wouldn’t be a trivial technical lift on Microsoft’s part, the same emulation layer that powers Xbox console backward compatibility could surely be ported to Windows with a little bit of work. That process might be easier with a specific branded portable, too, since Microsoft would be working with full knowledge of what hardware was being used.

If Microsoft can give us a way to play Geometry Wars 2 on the go without having to deal with finicky third-party emulators, we’ll be eternally grateful.

Multiple hardware tiers

Xbox Series S (left), next to Xbox Series X (right).

One size does not fit all when it comes to consoles or to handhelds. Credit: Sam Machkovech

On the console side, Microsoft’s split simultaneous release of the Xbox Series S and X showed an understanding that not everyone wants to pay more money for the most powerful possible gaming hardware. Microsoft should extend this philosophy to gaming handhelds by releasing different tiers of portable Xbox hardware for price-conscious consumers.

Raw hardware power is the most obvious differentiator that could set a more expensive tier of Xbox portables apart from any cheaper options. But Microsoft could also offer portable options that reduce the overall bulk (a la the Nintendo Switch Lite) or offer relative improvements in screen size and quality (a la the Steam Deck OLED and Switch OLED).

“Made for Xbox”

It worked for Valve, it can work for Microsoft. Credit: Valve

One of the best things about console gaming is that you can be confident any game you buy for a console will “just work” with your hardware. In the world of PC gaming handhelds, Valve has tried to replicate this with the “Deck Verified” program to highlight Steam games that are guaranteed to work in a portable setting.

Microsoft is well-positioned to work with game publishers to launch a similar program for its own Xbox-branded portable. There’s real value in offering gamers assurances that “Made for Xbox” PC games will “just work” on their Xbox-branded handheld.

This kind of verification system could also help simplify and clarify hardware requirements across different tiers of portable hardware power; any handheld marketed as “level 2” could play any games marketed as level 2 or below, for instance.

Nearly 1 million Windows devices targeted in advanced “malvertising” spree

A broad overview of the four stages. Credit: Microsoft

The campaign targeted “nearly” 1 million devices belonging both to individuals and a wide range of organizations and industries. The indiscriminate approach indicates the campaign was opportunistic, meaning it attempted to ensnare anyone, rather than targeting certain individuals, organizations, or industries. GitHub was the platform primarily used to host the malicious payload stages, but Discord and Dropbox were also used.

The malware located resources on the infected computer and sent them to the attacker’s command-and-control (C2) server. The exfiltrated data included the following browser files, which can store login cookies, passwords, browsing histories, and other sensitive data:

  • AppData\Roaming\Mozilla\Firefox\Profiles\.default-release\cookies.sqlite
  • AppData\Roaming\Mozilla\Firefox\Profiles\.default-release\formhistory.sqlite
  • AppData\Roaming\Mozilla\Firefox\Profiles\.default-release\key4.db
  • AppData\Roaming\Mozilla\Firefox\Profiles\.default-release\logins.json
  • AppData\Local\Google\Chrome\User Data\Default\Web Data
  • AppData\Local\Google\Chrome\User Data\Default\Login Data
  • AppData\Local\Microsoft\Edge\User Data\Default\Login Data

Files stored on Microsoft’s OneDrive cloud service were also targeted. The malware also checked for the presence of cryptocurrency wallets including Ledger Live, Trezor Suite, KeepKey, BCVault, OneKey, and BitBox, “indicating potential financial data theft,” Microsoft said.

Microsoft said it suspects the sites hosting the malicious ads were streaming platforms providing unauthorized content. Two of the domains are movies7[.]net and 0123movie[.]art.
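The bracketed dots in those domain names are a common "defanging" convention that keeps malicious addresses from becoming clickable links. A minimal sketch of converting between the two forms, for example before adding an indicator to a blocklist, might look like this. The function names are hypothetical, and only the `[.]` convention seen above is handled.

```python
def defang(indicator: str) -> str:
    """Bracket every dot so the domain cannot be auto-linked or clicked."""
    return indicator.replace(".", "[.]")


def refang(indicator: str) -> str:
    """Undo defang() so the indicator can be matched against logs or blocklists."""
    return indicator.replace("[.]", ".")


print(refang("movies7[.]net"))   # movies7.net
print(defang("0123movie.art"))   # 0123movie[.]art
```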

Microsoft Defender now detects the files used in the attack, and it’s likely other malware defense apps do the same. Anyone who thinks they may have been targeted can check indicators of compromise at the end of the Microsoft post. The post includes steps users can take to prevent falling prey to similar malvertising campaigns.

On May 5, Microsoft’s Skype will shut down for good

After more than 21 years, Skype will soon be no more. Last night, some users (including Ars readers) poked around in the latest Skype preview update and noticed as-yet-unsurfaced text that read “Starting in May, Skype will no longer be available. Continue your calls and chats in Teams.”

This morning, Microsoft confirmed to Ars that it’s true. May 5, 2025, will mark the end of Skype’s long run.

Alongside the verification that the end is nigh, Microsoft shared a bunch of details about how it plans to migrate Skype users over. Starting right away, some Skype users (those in Teams and Skype Insider) will be able to log in to Teams using their Skype credentials. More people will gain that ability over the next few days.

Microsoft claims that users who do this will see their existing contacts and chats from Skype in Teams from the start. Alternatively, users who don’t want to do this can export their Skype data—specifically contacts, call history, and chats.

Trump should block Biden’s AI “gift” to China, Microsoft argues

“Countries including Brazil, India, Israel, and the UAE are eminently capable of ramping up investments aimed at securing new ways to access increased computing capacity,” the Brookings Institution said. “Preventing companies in middle-tier countries from relying on the US to supply computing chips is a surefire way to push them into building non-US alliances that include stronger technology ties with China.”

The rule could also complicate the global AI landscape in ways the US may not anticipate, the Center for Strategic and International Studies (CSIS), a bipartisan, nonprofit policy research organization, forecast last week. It could “breed resentment, not cooperation” in tier-two countries that will likely “bristle at the fact that their AI ambitions depend on Washington’s goodwill and that they are being deliberately kept a generation behind the frontier,” CSIS wrote. And it could drive more open source AI like DeepSeek to be key to development in tier-two nations, perhaps further endangering US global leadership in AI, CSIS suggested.

“Ironically, the AI Diffusion Framework, meant to lock in American advantage, may instead midwife the very outcome it sought to prevent—an alternate AI stack and increased open-source development where China, as its most prolific contributor, emerges as the de facto leader,” CSIS reported.

China wooing countries targeted by rule, Microsoft says

But according to Smith, some parts of the rule should be preserved, like datacenter restrictions, including “qualitative provisions” that “ensure that AI technology components are deployed in certified, secure, and trusted datacenters.” That part of the rule, Smith suggested, “helps reduce the risk of chip diversion to China.”

And other parts of the rule can be strengthened, Smith wrote, such as ensuring the Commerce Department has resources to enforce it and “expedite approvals” for any countries in the middle tier who may appeal to either move into the top tier or limit tier-two restrictions.

Microsoft’s new AI agent can control software and robots

The researchers’ explanations about how “Set-of-Mark” and “Trace-of-Mark” work. Credit: Microsoft Research

The Magma model introduces two technical components: Set-of-Mark, which identifies objects that can be manipulated in an environment by assigning numeric labels to interactive elements, such as clickable buttons in a UI or graspable objects in a robotic workspace, and Trace-of-Mark, which learns movement patterns from video data. Microsoft says those features allow the model to complete tasks like navigating user interfaces or directing robotic arms to grasp objects.
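To make the Set-of-Mark idea concrete, here is a minimal sketch of numbering interactive elements and mapping a chosen number back to a screen position. It is not Microsoft's implementation. The `Element` class, the bounding-box layout, and the assumption that a model replies with a mark number are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str                       # e.g. "button" or "text field"
    box: tuple[int, int, int, int]  # (x, y, width, height) in pixels


def assign_marks(elements: list[Element]) -> dict[int, Element]:
    """Give each interactive element a numeric label, starting at 1."""
    return {mark: el for mark, el in enumerate(elements, start=1)}


def click_point(marks: dict[int, Element], chosen: int) -> tuple[int, int]:
    """Translate a chosen mark back into the center of its bounding box."""
    x, y, w, h = marks[chosen].box
    return (x + w // 2, y + h // 2)


elements = [
    Element("button", (10, 20, 100, 40)),
    Element("text field", (10, 80, 200, 30)),
]
marks = assign_marks(elements)
print(click_point(marks, 2))  # (110, 95)
```

The appeal of this kind of labeling is that the model only has to name a mark rather than predict raw pixel coordinates.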

Microsoft Magma researcher Jianwei Yang wrote in a Hacker News comment that the name “Magma” stands for “M(ultimodal) Ag(entic) M(odel) at Microsoft (Rese)A(rch),” after some people noted that “Magma” already belongs to an existing matrix algebra library, which could create some confusion in technical discussions.

Reported improvements over previous models

In its Magma write-up, Microsoft claims Magma-8B performs competitively across benchmarks, showing strong results in UI navigation and robot manipulation tasks.

For example, it scored 80.0 on the VQAv2 visual question-answering benchmark—higher than GPT-4V’s 77.2 but lower than LLaVA-Next’s 81.8. Its POPE score of 87.4 leads all models in the comparison. In robot manipulation, Magma reportedly outperforms OpenVLA, an open source vision-language-action model, in multiple robot manipulation tasks.

Magma’s agentic benchmarks, as reported by the researchers. Credit: Microsoft Research

As always, we take AI benchmarks with a grain of salt since many have not been scientifically validated as being able to measure useful properties of AI models. External verification of Microsoft’s benchmark results will become possible once other researchers can access the public code release.

Like all AI models, Magma is not perfect. It still faces technical limitations in complex decision-making that requires multiple steps over time, according to Microsoft’s documentation. The company says it continues to work on improving these capabilities through ongoing research.

Yang says Microsoft will release Magma’s training and inference code on GitHub next week, allowing external researchers to build on the work. If Magma delivers on its promise, it could push Microsoft’s AI assistants beyond limited text interactions, enabling them to operate software autonomously and execute real-world tasks through robotics.

Magma is also a sign of how quickly the culture around AI can change. Just a few years ago, this kind of agentic talk scared many people who feared it might lead to AI taking over the world. While some people still fear that outcome, in 2025, AI agents are a common topic of mainstream AI research that regularly takes place without triggering calls to pause all of AI development.

Hugging Face clones OpenAI’s Deep Research in 24 hours

On Tuesday, Hugging Face researchers released an open source AI research agent called “Open Deep Research,” created by an in-house team as a challenge 24 hours after the launch of OpenAI’s Deep Research feature, which can autonomously browse the web and create research reports. The project seeks to match Deep Research’s performance while making the technology freely available to developers.

“While powerful LLMs are now freely available in open-source, OpenAI didn’t disclose much about the agentic framework underlying Deep Research,” writes Hugging Face on its announcement page. “So we decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!”

Similar to both OpenAI’s Deep Research and Google’s implementation of its own “Deep Research” using Gemini (first introduced in December, before OpenAI’s), Hugging Face’s solution adds an “agent” framework to an existing AI model to allow it to perform multi-step tasks, such as collecting information and building a report as it goes along, which it then presents to the user at the end.
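The general shape of such an agent loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not Hugging Face's or OpenAI's framework: `run_agent`, `fake_model`, and the `search` tool are hypothetical stand-ins for a real language model and real web tools.

```python
from typing import Callable


def run_agent(question: str,
              choose_step: Callable[[str, list[str]], tuple[str, str]],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 5) -> str:
    """Ask the model for the next step, run the tool, keep notes, stop on 'finish'."""
    notes: list[str] = []
    for _ in range(max_steps):
        action, argument = choose_step(question, notes)
        if action == "finish":
            return argument  # the model's final report
        notes.append(tools[action](argument))
    return "\n".join(notes)  # step budget exhausted: return the raw notes


# Stand-in for a real model: search once, then write the report.
def fake_model(question: str, notes: list[str]) -> tuple[str, str]:
    if not notes:
        return ("search", question)
    return ("finish", f"Report based on {len(notes)} source(s): {notes[0]}")


tools = {"search": lambda query: f"result for '{query}'"}
report = run_agent("GAIA example", fake_model, tools)
print(report)  # Report based on 1 source(s): result for 'GAIA example'
```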

The open source clone is already racking up comparable benchmark results. After only a day’s work, Hugging Face’s Open Deep Research has reached 55.15 percent accuracy on the General AI Assistants (GAIA) benchmark, which tests an AI model’s ability to gather and synthesize information from multiple sources. OpenAI’s Deep Research scored 67.36 percent accuracy on the same benchmark.

As Hugging Face points out in its post, GAIA includes complex multi-step questions such as this one:

Which of the fruits shown in the 2008 painting “Embroidery from Uzbekistan” were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film “The Last Voyage”? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o’clock position. Use the plural form of each fruit.

To correctly answer that type of question, the AI agent must seek out multiple disparate sources and assemble them into a coherent answer. Many of the questions in GAIA represent no easy task, even for a human, so they test agentic AI’s mettle quite well.
