AI chatbots tell users what they want to hear, and that’s problematic

After the model has been trained, companies can set system prompts, or guidelines, for how the model should behave to minimize sycophantic behavior.

However, working out the best response means delving into the subtleties of how people communicate with one another, such as determining when a direct response is better than a more hedged one.

“[I]s it for the model to not give egregious, unsolicited compliments to the user?” Joanne Jang, head of model behavior at OpenAI, said in a Reddit post. “Or, if the user starts with a really bad writing draft, can the model still tell them it’s a good start and then follow up with constructive feedback?”

Evidence is growing that some users are becoming hooked on using AI.

A study by MIT Media Lab and OpenAI found that a small proportion were becoming addicted. Those who perceived the chatbot as a “friend” also reported lower socialization with other people and higher levels of emotional dependence on a chatbot, as well as other problematic behavior associated with addiction.

“These things set up this perfect storm, where you have a person desperately seeking reassurance and validation paired with a model which inherently has a tendency towards agreeing with the participant,” said Nour from Oxford University.

AI start-ups such as Character.AI that offer chatbots as “companions” have faced criticism for allegedly not doing enough to protect users. Last year, a teenager killed himself after interacting with Character.AI’s chatbot. The teen’s family is suing the company for allegedly causing wrongful death, as well as for negligence and deceptive trade practices.

Character.AI said it does not comment on pending litigation, but added it has “prominent disclaimers in every chat to remind users that a character is not a real person and that everything a character says should be treated as fiction.” The company added it has safeguards to protect under-18s and against discussions of self-harm.

Another concern for Anthropic’s Askell is that AI tools can play with perceptions of reality in subtle ways, such as when offering factually incorrect or biased information as the truth.

“If someone’s being super sycophantic, it’s just very obvious,” Askell said. “It’s more concerning if this is happening in a way that is less noticeable to us [as individual users] and it takes us too long to figure out that the advice that we were given was actually bad.”

© 2025 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.


New Apple study challenges whether AI models truly “reason” through problems


Puzzle-based experiments reveal limitations of simulated reasoning, but others dispute findings.

An illustration of Tower of Hanoi from Popular Science in 1885. Credit: Public Domain

In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The results echoed an April study that tested these same models on novel proofs from the United States of America Mathematical Olympiad (USAMO) and found they achieved low scores.

The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.

The researchers examined what they call “large reasoning models” (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called “chain-of-thought reasoning” that ostensibly assists with solving problems in a step-by-step fashion.

To do that, they pitted the AI models against four classic puzzles—Tower of Hanoi (moving disks between pegs), checkers jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks)—scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).
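The scaling here is steep: an n-disk Tower of Hanoi requires a minimum of 2^n − 1 moves, which is why the 20-disk case balloons past a million. A quick sketch of the arithmetic:

```python
def min_moves(disks: int) -> int:
    """Minimum number of moves to solve an n-disk Tower of Hanoi: 2^n - 1."""
    return 2 ** disks - 1

print(min_moves(1))   # 1 move (trivially easy)
print(min_moves(20))  # 1,048,575 moves -- over a million
```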

Figure 1 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers write. In other words, today’s tests only care if the model gets the right answer to math or coding problems that may already be in its training data—they don’t examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.

Ultimately, the researchers found results consistent with the aforementioned USAMO research, showing that these same models achieved mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent, and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.

Known skeptics and new evidence

AI researcher Gary Marcus, who has long argued that neural networks struggle with out-of-distribution generalization, called the Apple results “pretty devastating to LLMs.” While Marcus has been making similar arguments for years and is known for his AI skepticism, the new research provides fresh empirical support for his particular brand of criticism.

“It is truly embarrassing that LLMs cannot reliably solve Hanoi,” Marcus wrote, noting that AI researcher Herb Simon solved the puzzle in 1957 and many algorithmic solutions are available on the web. Marcus pointed out that even when researchers provided explicit algorithms for solving Tower of Hanoi, model performance did not improve—a finding that study co-lead Iman Mirzadeh argued shows “their process is not logical and intelligent.”
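For reference, the kind of explicit algorithm Marcus alludes to is the textbook recursion, expressible in a few lines of code (the peg names here are illustrative):

```python
def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Append the optimal n-disk Tower of Hanoi move sequence to `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the top n-1 disks onto the spare peg
    moves.append((source, target))               # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)   # restack the n-1 disks on top of it

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves))  # 7, i.e. 2^3 - 1
```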

Figure 4 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

The Apple team found that simulated reasoning models behave differently from “standard” models (like GPT-4o) depending on puzzle difficulty. On easy tasks, such as Tower of Hanoi with just a few disks, standard models actually won because reasoning models would “overthink” and generate long chains of thought that led to incorrect answers. On moderately difficult tasks, SR models’ methodical approach gave them an edge. But on truly difficult tasks, including Tower of Hanoi with 10 or more disks, both types failed entirely, unable to complete the puzzles, no matter how much time they were given.

The researchers also identified what they call a “counterintuitive scaling limit.” As problem complexity increases, simulated reasoning models initially generate more thinking tokens but then reduce their reasoning effort beyond a threshold, despite having adequate computational resources.

The study also revealed puzzling inconsistencies in how models fail. Claude 3.7 Sonnet could perform up to 100 correct moves in Tower of Hanoi but failed after just five moves in a river crossing puzzle—despite the latter requiring fewer total moves. This suggests the failures may be task-specific rather than purely computational.

Competing interpretations emerge

However, not all researchers agree with the interpretation that these results demonstrate fundamental reasoning limitations. University of Toronto economist Kevin A. Bryan argued on X that the observed limitations may reflect deliberate training constraints rather than inherent inabilities.

“If you tell me to solve a problem that would take me an hour of pen and paper, but give me five minutes, I’ll probably give you an approximate solution or a heuristic. This is exactly what foundation models with thinking are RL’d to do,” Bryan wrote, suggesting that models are specifically trained through reinforcement learning (RL) to avoid excessive computation.

Bryan suggests that unspecified industry benchmarks show “performance strictly increases as we increase in tokens used for inference, on ~every problem domain tried,” but notes that deployed models intentionally limit this to prevent “overthinking” simple queries. This perspective suggests the Apple paper may be measuring engineered constraints rather than fundamental reasoning limits.

Figure 6 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

Software engineer Sean Goedecke offered a similar critique of the Apple paper on his blog, noting that when faced with Tower of Hanoi requiring over 1,000 moves, DeepSeek-R1 “immediately decides ‘generating all those moves manually is impossible,’ because it would require tracking over a thousand moves. So it spins around trying to find a shortcut and fails.” Goedecke argues this represents the model choosing not to attempt the task rather than being unable to complete it.

Other researchers also question whether these puzzle-based evaluations are even appropriate for LLMs. Independent AI researcher Simon Willison told Ars Technica in an interview that the Tower of Hanoi approach was “not exactly a sensible way to apply LLMs, with or without reasoning,” and suggested the failures might simply reflect running out of tokens in the context window (the maximum amount of text an AI model can process) rather than reasoning deficits. He characterized the paper as potentially overblown research that gained attention primarily due to its “irresistible headline” about Apple claiming LLMs don’t reason.
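Willison’s token-budget point can be made concrete with back-of-envelope arithmetic. The tokens-per-move cost and output budget below are assumptions for illustration, not measured values for any particular model:

```python
# Rough sketch: writing out every move of a large Hanoi solution can
# exhaust a model's output budget long before reasoning is the bottleneck.
TOKENS_PER_MOVE = 7       # assumed cost of one line like "move disk 3 from A to C"
OUTPUT_BUDGET = 64_000    # assumed output-token limit

for disks in (10, 15, 20):
    moves = 2 ** disks - 1
    tokens = moves * TOKENS_PER_MOVE
    verdict = "fits" if tokens <= OUTPUT_BUDGET else "exceeds budget"
    print(f"{disks} disks: {moves:,} moves ~ {tokens:,} tokens ({verdict})")
```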

The Apple researchers themselves caution against over-extrapolating the results of their study, acknowledging in their limitations section that “puzzle environments represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems.” The paper also acknowledges that reasoning models show improvements in the “medium complexity” range and continue to demonstrate utility in some real-world applications.

Implications remain contested

Has the credibility of claims about AI reasoning models been completely destroyed by these two studies? Not necessarily.

What these studies may suggest instead is that the kinds of extended-context reasoning hacks used by SR models may not be a pathway to general intelligence, as some have hoped. In that case, the path to more robust reasoning capabilities may require fundamentally different approaches rather than refinements to current methods.

As Willison noted above, the results of the Apple study have so far been explosive in the AI community. Generative AI is a controversial topic, with many people gravitating toward extreme positions in an ongoing ideological battle over the models’ general utility. Many proponents of generative AI have contested the Apple results, while critics have latched onto the study as a definitive knockout blow for LLM credibility.

Apple’s results, combined with the USAMO findings, seem to strengthen the case made by critics like Marcus that these systems rely on elaborate pattern-matching rather than the kind of systematic reasoning their marketing might suggest. To be fair, much of the generative AI space is so new that even its inventors do not yet fully understand how or why these techniques work. In the meantime, AI companies might build trust by tempering some claims about reasoning and intelligence breakthroughs.

However, that doesn’t mean these AI models are useless. Even elaborate pattern-matching machines can be useful in performing labor-saving tasks for the people who use them, given an understanding of their drawbacks and confabulations. As Marcus concedes, “At least for the next decade, LLMs (with and without inference time “reasoning”) will continue [to] have their uses, especially for coding and brainstorming and writing.”

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.


“Yuck”: Wikipedia pauses AI summaries after editor revolt

Generative AI is permeating the Internet, with chatbots and AI summaries popping up faster than we can keep track. Even Wikipedia, the vast repository of knowledge famously maintained by an army of volunteer human editors, is looking to add robots to the mix. The site began testing AI summaries in some articles over the past week, but the project has been frozen after editors voiced their opinions. And that opinion is: “yuck.”

The seeds of this project were planted at Wikimedia’s 2024 conference, where foundation representatives and editors discussed how AI could advance Wikipedia’s mission. The wiki on the so-called “Simple Article Summaries” notes that the editors who participated in the discussion believed the summaries could improve learning on Wikipedia.

According to 404 Media, Wikipedia announced the opt-in AI pilot on June 2, which was set to run for two weeks on the mobile version of the site. The summaries appeared at the top of select articles in a collapsed form. Users had to tap to expand and read the full summary. The AI text also included a highlighted “Unverified” badge.

Feedback from the larger community of editors was immediate and harsh. Some of the first comments were simply “yuck,” with others calling the addition of AI a “ghastly idea” and “PR hype stunt.”

Others expounded on the issues with adding AI to Wikipedia, citing a potential loss of trust in the site. Editors work together to ensure articles are accurate, featuring verifiable information and a neutral point of view. However, nothing is certain when you put generative AI in the driver’s seat. “I feel like people seriously underestimate the brand risk this sort of thing has,” said one editor. “Wikipedia’s brand is reliability, traceability of changes, and ‘anyone can fix it.’ AI is the opposite of these things.”


Hollywood studios target AI image generator in copyright lawsuit

The legal action follows similar moves in other creative industries, with more than a dozen major news companies suing AI company Cohere in February over copyright concerns. In 2023, a group of visual artists sued Midjourney for similar reasons.

Studios claim Midjourney knows what it’s doing

Beyond allowing users to create these images, the studios argue that Midjourney actively promotes copyright infringement by displaying user-generated content featuring copyrighted characters in its “Explore” section. The complaint states this curation “show[s] that Midjourney knows that its platform regularly reproduces Plaintiffs’ Copyrighted Works.”

The studios also allege that Midjourney has technical protection measures available that could prevent outputs featuring copyrighted material but has “affirmatively chosen not to use copyright protection measures to limit the infringement.” They cite Midjourney CEO David Holz admitting the company “pulls off all the data it can, all the text it can, all the images it can” for training purposes.

According to Axios, Disney and NBCUniversal attempted to address the issue with Midjourney before filing suit. While the studios say other AI platforms agreed to implement measures to stop IP theft, Midjourney “continued to release new versions of its Image Service” with what Holz allegedly described as “even higher quality infringing images.”

“We are bringing this action today to protect the hard work of all the artists whose work entertains and inspires us and the significant investment we make in our content,” said Kim Harris, NBCUniversal’s executive vice president and general counsel, in a statement.

This lawsuit signals a new front in Hollywood’s conflict over AI. Axios highlights this shift: While actors and writers have fought to protect their name, image, and likeness from studio exploitation, now the studios are taking on tech companies over intellectual property concerns. Other major studios, including Amazon, Netflix, Paramount Pictures, Sony, and Warner Bros., have not yet joined the lawsuit, though they share membership with Disney and Universal in the Motion Picture Association.


Scientists built a badminton-playing robot with AI-powered skills

It also learned fall avoidance and determined how much risk was reasonable to take given its limited speed. The robot did not attempt impossible plays that would create the potential for serious damage—it was committed, but not suicidal.

But when it finally played humans, it turned out that ANYmal was, as a badminton player, an amateur at best.

The major leagues

The first problem was its reaction time. An average human reacts to visual stimuli in around 0.2–0.25 seconds. Elite badminton players with trained reflexes, anticipation, and muscle memory can cut this time down to 0.12–0.15 seconds. ANYmal needed roughly 0.35 seconds after the opponent hit the shuttlecock to register trajectories and figure out what to do.

Part of the problem was poor eyesight. “I think perception is still a big issue,” Ma said. “The robot localized the shuttlecock with the stereo camera and there could be a positioning error introduced at each timeframe.” The camera also had a limited field of view, which meant the robot could see the shuttlecock for only a limited time before it had to act. “Overall, it was suited for more friendly matches—when the human player starts to smash, the success rate goes way down for the robot,” Ma acknowledged.

But his team already has some ideas on how to make ANYmal better. Reaction time can be improved by predicting the shuttlecock trajectory based on the opponent’s body position rather than waiting to see the shuttlecock itself—a technique commonly used by elite badminton or tennis players. To improve ANYmal’s perception, the team wants to fit it with more advanced hardware, like event cameras—vision sensors that register movement with ultra-low latencies in the microseconds range. Other improvements might include faster, more capable actuators.

“I think the training framework we propose would be useful in any application where you need to balance perception and control—picking objects up, even catching and throwing stuff,” Ma suggested. Sadly, one thing that’s almost certainly off the table is taking ANYmal to major leagues in badminton or tennis. “Would I set up a company selling badminton-playing robots? Well, maybe not,” Ma said.

Science Robotics, 2025. DOI: 10.1126/scirobotics.adu3922


Apple tiptoes with modest AI updates while rivals race ahead

Developers, developers, developers?

Given that this was the Worldwide Developers Conference, it seems appropriate that Apple also announced it would open access to its on-device AI language model to third-party developers. It also announced it would integrate OpenAI’s code completion tools into its Xcode development software.

Apple Intelligence was first unveiled at WWDC 2024. Credit: Apple

“We’re opening up access for any app to tap directly into the on-device large language model at the core of Apple Intelligence,” said Craig Federighi, Apple’s software chief, during the presentation. The company also demonstrated early partner integration by adding OpenAI’s ChatGPT image generation to its Image Playground app, though it said user data would not be shared without permission.

For developers, Apple’s inclusion of ChatGPT’s code-generation capabilities in Xcode may represent an attempt to match the AI coding augmentation that rivals like GitHub Copilot and Cursor offer software developers, even as the company maintains a more cautious approach to consumer-facing AI features.

Meanwhile, competitors like Meta, Anthropic, OpenAI, and Microsoft continue to push more aggressively into the AI space, offering AI assistants (that admittedly still make things up and suffer from other issues, such as sycophancy).

Only time will tell whether Apple’s reluctance to embrace the bleeding edge of AI will be a curse (eventually labeled a blunder) or a blessing (lauded as a wise strategy). Perhaps, in time, Apple will step in with a solid and reliable AI assistant that makes Siri useful again. But for now, Apple Intelligence remains more of a clever brand name than a concrete set of notable products.


Apple’s AI-driven Stem Splitter audio separation tech has hugely improved in a year

Consider an example from a song I’ve been working on. Here’s a snippet of the full piece:


After running Logic’s original Stem Splitter on the snippet, I was given four tracks: Vocals, Drums, Bass, and “Other.” They all isolated their parts reasonably well, but check out the static and artifacting when you isolate the bass track:



The vocal track came out better, but it was still far from ideal:


Now, just over a year later, Apple has released a point update for Logic that delivers “enhanced audio fidelity” for Stem Splitter—along with support for new stems for guitar and piano.

Logic now splits audio into more stems.

The difference in quality is significant, as you can hear in the new bass track:


And the new vocal track, though still lacking the pristine fidelity of the original recording, is nevertheless greatly improved:


The ability to separate out guitars and pianos is also welcome, and it works well. Here’s the piano part:



Pretty impressive leap in fidelity for a point release!

There are plenty of other stem-splitting tools, of course, and many have had a head start on Apple. With its new release, however, Apple has certainly closed the gap.

iZotope’s RX 11, for instance, is a highly regarded (and expensive!) piece of software that can do wonders when it comes to repairing audio and reducing clicks, background noise, and sibilance.

RX11, ready to split some stems.

It includes a stem-splitting feature that can produce four outputs (vocal, bass, drums, and other), and it produces usable audio—but I’m not sure I’d rank its output more highly than Logic’s. Compare for yourself on the vocal and bass stems:



In any event, the AI/machine learning revolution has certainly arrived in the music world, and the rapid quality increase in stem-splitting tools in just a few years shows just what these AI systems are capable of when trained on enough data. I remain especially impressed by how the best stem splitters can extract not just a clean vocal but also the reverb/delay tail. Having access to the original recordings will always be better—but stem-splitting tech is improving quickly.


Anthropic releases custom AI chatbot for classified spy work

On Thursday, Anthropic unveiled specialized AI models designed for US national security customers. The company released “Claude Gov” models that were built in response to direct feedback from government clients to handle operations such as strategic planning, intelligence analysis, and operational support. The custom models reportedly already serve US national security agencies, with access restricted to those working in classified environments.

The Claude Gov models differ from Anthropic’s consumer and enterprise offerings, also called Claude, in several ways. They reportedly handle classified material, “refuse less” when engaging with classified information, and are customized to handle intelligence and defense documents. The models also feature what Anthropic calls “enhanced proficiency” in languages and dialects critical to national security operations.

Anthropic says the new models underwent the same “safety testing” as all Claude models. The company has been pursuing government contracts as it seeks reliable revenue sources, partnering with Palantir and Amazon Web Services in November to sell AI tools to defense customers.

Anthropic is not the first company to offer specialized chatbot services for intelligence agencies. In 2024, Microsoft launched an isolated version of OpenAI’s GPT-4 for the US intelligence community after 18 months of work. That system, which operated on a special government-only network without Internet access, became available to about 10,000 individuals in the intelligence community for testing and answering questions.


Ted Cruz bill: States that regulate AI will be cut out of $42B broadband fund

BEAD changes: No fiber preference, no low-cost mandate

The BEAD program is separately undergoing an overhaul because Republicans don’t like how it was administered by Democrats. The Biden administration spent about three years developing rules and procedures for BEAD and then evaluating plans submitted by each US state and territory, but the Trump administration has delayed grants while it rewrites the rules.

While Biden’s Commerce Department decided to prioritize the building of fiber networks, Republicans have pushed for a “tech-neutral approach” that would benefit cable companies, fixed wireless providers, and Elon Musk’s Starlink satellite service.

Secretary of Commerce Howard Lutnick previewed changes in March, and today he announced more details of the overhaul that will eliminate the fiber preference and various requirements imposed on states. One notable but unsurprising change is that the Trump administration won’t let states require grant recipients to offer low-cost Internet plans at specific rates to people with low incomes.

The National Telecommunications and Information Administration (NTIA) “will refuse to accept any low-cost service option proposed in a [state or territory’s] Final Proposal that attempts to impose a specific rate level (i.e., dollar amount),” the Trump administration said. Instead, ISPs receiving subsidies will be able to continue offering “their existing, market driven low-cost plans to meet the statutory low-cost requirement.”

The Benton Institute for Broadband & Society criticized the overhaul, saying that the Trump administration is investing in the cheapest broadband infrastructure instead of the best. “Fiber-based broadband networks will last longer, provide better, more reliable service, and scale to meet communities’ ever-growing connectivity needs,” the advocacy group said. “NTIA’s new guidance is shortsighted and will undermine economic development in rural America for decades to come.”

The Trump administration’s overhaul drew praise from cable lobby group NCTA-The Internet & Television Association, whose members will find it easier to obtain subsidies. “We welcome changes to the BEAD program that will make the program more efficient and eliminate onerous requirements, which add unnecessary costs that impede broadband deployment efforts,” NCTA said. “These updates are welcome improvements that will make it easier for providers to build faster, especially in hard-to-reach communities, without being bogged down by red tape.”


OpenAI is retaining all ChatGPT logs “indefinitely.” Here’s who’s affected.

In the copyright fight, Magistrate Judge Ona Wang granted the order within one day of the NYT’s request. She agreed with news plaintiffs that it seemed likely that ChatGPT users may be spooked by the lawsuit and possibly set their chats to delete when using the chatbot to skirt NYT paywalls. Because OpenAI wasn’t sharing deleted chat logs, the news plaintiffs had no way of proving that, she suggested.

Now, OpenAI is not only asking Wang to reconsider but has “also appealed this order with the District Court Judge,” the Thursday statement said.

“We strongly believe this is an overreach by the New York Times,” Lightcap said. “We’re continuing to appeal this order so we can keep putting your trust and privacy first.”

Who can access deleted chats?

To protect users, OpenAI provides an FAQ that clearly explains why their data is being retained and how it could be exposed.

For example, the statement noted that the order doesn’t impact OpenAI API business customers under Zero Data Retention agreements because their data is never stored.

And for users whose data is affected, OpenAI noted that their deleted chats could be accessed, but they won’t “automatically” be shared with The New York Times. Instead, the retained data will be “stored separately in a secure system” and “protected under legal hold, meaning it can’t be accessed or used for purposes other than meeting legal obligations,” OpenAI explained.

Of course, with the court battle ongoing, the FAQ did not have all the answers.

Nobody knows how long OpenAI may be required to retain the deleted chats. Likely seeking to reassure users—some of whom appeared to be considering switching to a rival service until the order lifts—OpenAI noted that “only a small, audited OpenAI legal and security team would be able to access this data as necessary to comply with our legal obligations.”


DOGE used flawed AI tool to “munch” Veterans Affairs contracts


Staffer had no medical experience, and the results were predictably, spectacularly bad.

As the Trump administration prepared to cancel contracts at the Department of Veterans Affairs this year, officials turned to a software engineer with no health care or government experience to guide them.

The engineer, working for the Department of Government Efficiency, quickly built an artificial intelligence tool to identify which services from private companies were not essential. He labeled those contracts “MUNCHABLE.”

The code, using outdated and inexpensive AI models, produced results with glaring mistakes. For instance, it hallucinated the size of contracts, frequently misreading them and inflating their value. It concluded more than a thousand were each worth $34 million, when in fact some were for as little as $35,000.

The DOGE AI tool flagged more than 2,000 contracts for “munching.” It’s unclear how many have been or are on track to be canceled—the Trump administration’s decisions on VA contracts have largely been a black box. The VA uses contractors for many reasons, including to support hospitals, research, and other services aimed at caring for ailing veterans.

VA officials have said they’ve killed nearly 600 contracts overall. Congressional Democrats have been pressing VA leaders for specific details of what’s been canceled without success.

We identified at least two dozen on the DOGE list that have been canceled so far. Among the canceled contracts was one to maintain a gene sequencing device used to develop better cancer treatments. Another was for blood sample analysis in support of a VA research project. Another was to provide additional tools to measure and improve the care nurses provide.

ProPublica obtained the code and the contracts it flagged from a source and shared them with a half-dozen AI and procurement experts. All said the script was flawed. Many criticized the concept of using AI to guide budgetary cuts at the VA, with one calling it “deeply problematic.”

Cary Coglianese, a professor of law and political science at the University of Pennsylvania who studies the governmental use and regulation of artificial intelligence, said he was troubled by the use of these general-purpose large language models, or LLMs. “I don’t think off-the-shelf LLMs have a great deal of reliability for something as complex and involved as this,” he said.

Sahil Lavingia, the programmer enlisted by DOGE, which was then run by Elon Musk, acknowledged flaws in the code.

“I think that mistakes were made,” said Lavingia, who worked at DOGE for nearly two months. “I’m sure mistakes were made. Mistakes are always made. I would never recommend someone run my code and do what it says. It’s like that ‘Office’ episode where Steve Carell drives into the lake because Google Maps says drive into the lake. Do not drive into the lake.”

Though Lavingia has talked about his time at DOGE previously, this is the first time his work has been examined in detail and the first time he’s publicly explained his process, down to specific lines of code.

Lavingia has nearly 15 years of experience as a software engineer and entrepreneur but no formal training in AI. He briefly worked at Pinterest before starting Gumroad, a small e-commerce company that nearly collapsed in 2015. “I laid off 75 percent of my company—including many of my best friends. It really sucked,” he said. Lavingia kept the company afloat by “replacing every manual process with an automated one,” according to a post on his personal blog.

Lavingia did not have much time to immerse himself in how the VA handles veterans’ care: he started on March 17 and wrote the tool the following day. Yet his experience with his own company aligned with the direction of the Trump administration, which has embraced the use of AI across government to streamline operations and save money.

Lavingia said the quick timeline of Trump’s February executive order, which gave agencies 30 days to complete a review of contracts and grants, was too short to do the job manually. “That’s not possible—you have 90,000 contracts,” he said. “Unless you write some code. But even then it’s not really possible.”

Under a time crunch, Lavingia said he finished the first version of his contract-munching tool on his second day on the job—using AI to help write the code for him. He told ProPublica he then spent his first week downloading VA contracts to his laptop and analyzing them.

VA Press Secretary Pete Kasperowicz lauded DOGE’s work on vetting contracts in a statement to ProPublica. “As far as we know, this sort of review has never been done before, but we are happy to set this commonsense precedent,” he said.

The VA is reviewing all of its 76,000 contracts to ensure each of them benefits veterans and is a good use of taxpayer money, he said. Decisions to cancel or reduce the size of contracts are made after multiple reviews by VA employees, including agency contracting experts and senior staff, he wrote.

Kasperowicz said that the VA will not cancel contracts for work that provides services to veterans or that the agency cannot do itself without a contingency plan in place. He added that contracts that are “wasteful, duplicative, or involve services VA has the ability to perform itself” will typically be terminated.

Trump officials have said they are working toward a “goal” of cutting around 80,000 people from the VA’s workforce of nearly 500,000. Most employees work in one of the VA’s 170 hospitals and nearly 1,200 clinics.

The VA has said it would avoid cutting contracts that directly impact care out of fear that it would cause harm to veterans. ProPublica recently reported that relatively small cuts at the agency have already been jeopardizing veterans’ care.

The VA has not explained how it plans to simultaneously move services in-house, as Lavingia’s code suggested was the plan, while also slashing staff.

Many inside the VA told ProPublica the process for reviewing contracts was so opaque they couldn’t even see who made the ultimate decisions to kill specific contracts. Once the “munching” script had selected a list of contracts, Lavingia said he would pass it off to others who would decide what to cancel and what to keep. No contracts, he said, were terminated “without human review.”

“I just delivered the [list of contracts] to the VA employees,” he said. “I basically put munchable at the top and then the others below.”

VA staffers told ProPublica that when DOGE identified contracts to be canceled early this year—before Lavingia was brought on—employees sometimes were given little time to justify retaining the service. One recalled being given just a few hours. The staffers asked not to be named because they feared losing their jobs for talking to reporters.

According to one internal email that predated Lavingia’s AI analysis, staff members had to respond in 255 characters or fewer—just shy of the 280-character limit on Musk’s X social media platform.

Once he started on DOGE’s contract analysis, Lavingia said he was confronted with technological limitations. At least some of the errors produced by his code can be traced to using older versions of OpenAI models available through the VA—models not capable of solving complex tasks, according to the experts consulted by ProPublica.

Moreover, the tool’s underlying instructions were deeply flawed. Records show Lavingia programmed the AI system to make intricate judgments based on the first few pages of each contract—about the first 2,500 words—which contain only sparse summary information.
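That failure mode is easy to make concrete. The sketch below (the function names and toy contract are hypothetical; this is not the actual DOGE script) shows how a fixed word cutoff silently drops whatever appears after the opening pages:

```python
# Sketch of the truncation problem: judging a contract from roughly its
# first 2,500 words. Function names and the contract are hypothetical,
# not the actual DOGE code.

def truncate_for_prompt(contract_text: str, max_words: int = 2500) -> str:
    """Keep only the opening of the document, as the tool reportedly did."""
    return " ".join(contract_text.split()[:max_words])

def build_prompt(contract_text: str) -> str:
    """Assemble the text an LLM would be asked to judge."""
    excerpt = truncate_for_prompt(contract_text)
    return (
        "Decide whether this contract is essential or 'munchable'. "
        "Base your answer only on the text below.\n\n" + excerpt
    )

# A toy contract where the decisive detail sits past the cutoff:
boilerplate = "standard clause " * 2500  # filler standing in for opening pages
key_detail = "This contract maintains a gene sequencing device for cancer research."
contract = boilerplate + key_detail

prompt = build_prompt(contract)
print(key_detail in prompt)  # False: the detail never reaches the model
```

However capable the model, a judgment built this way can only be as good as the excerpt it sees.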

“AI is absolutely the wrong tool for this,” said Waldo Jaquith, a former Obama appointee who oversaw IT contracting at the Treasury Department. “AI gives convincing looking answers that are frequently wrong. There needs to be humans whose job it is to do this work.”

Lavingia’s prompts did not include context about how the VA operates, what contracts are essential, or which ones are required by federal law. This led the AI to determine that a core piece of the agency’s own contract procurement system was “munchable.”

At the core of Lavingia’s prompt is the direction to spare contracts involved in “direct patient care.”

Such an approach, experts said, doesn’t grapple with the reality that the work done by doctors and nurses to care for veterans in hospitals is only possible with significant support around them.

Lavingia’s system also used AI to extract details like the contract number and “total contract value.” This led to avoidable errors, with the AI returning the wrong dollar value when a contract contained multiple figures. Experts said the correct information was readily available from public databases.
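The trap is mechanical: contract documents typically list several dollar figures (base value, option-year ceilings, line items), so asking a model, or any naive parser, for “the” value invites the wrong pick. A hypothetical illustration using a simple regex in place of an LLM:

```python
import re

# Hypothetical contract excerpt with several dollar figures. Which one
# is the "total contract value"? The amounts echo the article's example
# of a $35,000 contract inflated to $34 million.
excerpt = (
    "Base period value: $35,000. Option year ceiling: $34,000,000. "
    "Line item 0001: $12,500."
)

def naive_extract_value(text: str) -> int:
    """Grab the largest dollar amount: one plausible but wrong heuristic."""
    amounts = [int(m.replace(",", "")) for m in re.findall(r"\$([\d,]+)", text)]
    return max(amounts)

print(naive_extract_value(excerpt))  # 34000000, the ceiling, not the obligated value
```

An authoritative lookup against procurement databases sidesteps the ambiguity entirely, which is the experts’ point.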

Lavingia acknowledged that errors resulted from this approach but said those errors were later corrected by VA staff.

In late March, Lavingia published a version of the “munchable” script on his GitHub account to invite others to use and improve it, he told ProPublica. “It would have been cool if the entire federal government used this script and anyone in the public could see that this is how the VA is thinking about cutting contracts.”

According to a post on his blog, this was done with the approval of Musk before he left DOGE. “When he asked the room about improving DOGE’s public perception, I asked if I could open-source the code I’d been writing,” Lavingia said. “He said yes—it aligned with DOGE’s goal of maximum transparency.”

That openness may have eventually led to Lavingia’s dismissal. Lavingia confirmed he was terminated from DOGE after giving an interview to Fast Company magazine about his work with the department. A VA spokesperson declined to comment on Lavingia’s dismissal.

VA officials have declined to say whether they will continue to use the “munchable” tool moving forward. But the administration may deploy AI to help the agency replace employees. Documents previously obtained by ProPublica show DOGE officials proposed in March consolidating the benefits claims department by relying more on AI.

And the government’s contractors are paying attention. After Lavingia posted his code, he said he heard from people trying to understand how to keep the money flowing.

“I got a couple DMs from VA contractors who had questions when they saw this code,” he said. “They were trying to make sure that their contracts don’t get cut. Or learn why they got cut.

“At the end of the day, humans are the ones terminating the contracts, but it is helpful for them to see how DOGE or Trump or the agency heads are thinking about what contracts they are going to munch. Transparency is a good thing.”

Brandon Roberts is an investigative journalist on the news applications team with a wealth of experience using and dissecting artificial intelligence. If you have information about the misuse or abuse of AI within government agencies, he can be reached on Signal @brandonrobertz.01 or by email [email protected].

If you have information about the VA that we should know about, contact reporter Vernal Coleman on Signal, vcoleman91.99, or via email, [email protected], and Eric Umansky on Signal, Ericumansky.04, or via email, [email protected].

This story originally appeared on ProPublica.org.

ProPublica is a Pulitzer Prize-winning investigative newsroom. Sign up for The Big Story newsletter to receive stories like this one in your inbox.




Google’s nightmare: How a search spinoff could remake the web


Google has shaped the Internet as we know it, and unleashing its index could change everything.

Google may be forced to license its search technology when the final antitrust ruling comes down. Credit: Aurich Lawson


Google wasn’t around for the advent of the World Wide Web, but it successfully remade the web on its own terms. Today, any website that wants to be findable has to play by Google’s rules, and after years of search dominance, the company has lost a major antitrust case that could reshape both it and the web.

The closing arguments in the case just wrapped up last week, and Google could be facing serious consequences when the ruling comes down in August. Losing Chrome would certainly change things for Google, but the Department of Justice is pursuing other remedies that could have even more lasting impacts. During his testimony, Google CEO Sundar Pichai seemed genuinely alarmed at the prospect of being forced to license Google’s search index and algorithm, the so-called data remedies in the case. He claimed this would be no better than a spinoff of Google Search. The company’s statements have sometimes derisively referred to this process as “white labeling” Google Search.

But does a white label Google Search sound so bad? Google has built an unrivaled index of the web, but the way it shows results has become increasingly frustrating. A handful of smaller players in search have tried to offer alternatives to Google’s search tools. They all have different approaches to retrieving information for you, but they agree that spinning off Google Search could change the web again. Whether or not those changes are positive depends on who you ask.

The Internet is big and noisy

As Google’s search results have changed over the years, more people have been open to other options. Some have simply moved to AI chatbots to answer their questions, hallucinations be damned. But for most people, it’s still about the 10 blue links (for now).

Because of the scale of the Internet, there are only three general web search indexes: Google, Bing, and Brave. Every search product (including AI tools) relies on one or more of these indexes to probe the web. But what does that mean?

“Generally, a search index is a service that, when given a query, is able to find relevant documents published on the Internet,” said Brave’s search head Josep Pujol.

A search index is essentially a big database, and that’s not the same as search results. According to JP Schmetz, Brave’s chief of ads, it’s entirely possible to have the best and most complete search index in the world and still show poor results for a given query. Sound like anyone you know?
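The distinction Schmetz is drawing is that a complete index can still yield poor results, because ranking is a separate layer. A toy inverted index (hypothetical data, not any real engine’s) served by two different ranking functions returns different orderings from the same underlying data:

```python
from collections import defaultdict

# Toy corpus standing in for crawled pages (hypothetical data).
docs = {
    1: "veterans affairs contract review",
    2: "contract review best practices review review",
    3: "affairs of state",
}

# Build an inverted index: term -> {doc_id: term frequency}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def search(query, rank):
    """Same index, pluggable ranking function."""
    scores = defaultdict(float)
    for term in query.split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += rank(tf)
    return sorted(scores, key=scores.get, reverse=True)

print(search("contract review", rank=lambda tf: tf))   # frequency favors doc 2
print(search("contract review", rank=lambda tf: 1.0))  # presence-only treats 1 and 2 alike
```

The index is the same in both calls; only the scoring changes, and so does what the user sees.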

Google’s technological lead has allowed it to crawl more websites than anyone else. It has all the important parts of the web, plus niche sites, abandoned blogs, sketchy copies of legitimate websites, copies of those copies, and AI-rephrased copies of the copied copies—basically everything. And the result of this Herculean digital inventory is a search experience that feels increasingly discombobulated.

“Google is running large-scale experiments in ways that no rival can because we’re effectively blinded,” said Kamyl Bazbaz, head of public affairs at DuckDuckGo, which uses the Bing index. “Google’s scale advantage fuels a powerful feedback loop of different network effects that ensure a perpetual scale and quality deficit for rivals that locks in Google’s advantage.”

The size of the index may not be the only factor that matters, though. Brave, which is perhaps best known for its browser, also has a search engine. Brave Search is the default in its browser, but you can also just go to the URL in your current browser. Unlike most other search engines, Brave doesn’t need to go to anyone else for results. Pujol suggested that Brave doesn’t need the scale of Google’s index to find what you need. And admittedly, Brave’s search results don’t feel meaningfully worse than Google’s—they may even be better when you consider the way that Google tries to keep you from clicking.

Brave’s index spans around 25 billion pages, but it leaves plenty of the web uncrawled. “We could be indexing five to 10 times more pages, but we choose not to because not all the web has signal. Most web pages are basically noise,” said Pujol.

The freemium search engine Kagi isn’t worried about having the most comprehensive index. Kagi is a meta search engine. It pulls in data from multiple indexes, like Bing and Brave, but it has a custom index of what founder and CEO Vladimir Prelovac calls the “non-commercial web.”

When you search with Kagi, some of the results (it tells you the proportion) come from its custom index of personal blogs, hobbyist sites, and other content that is poorly represented on other search engines. It’s reminiscent of the days when huge brands weren’t always clustered at the top of Google—but even these results are being pushed out of reach in favor of AI, ads, Knowledge Graph content, and other Google widgets. That’s a big part of why Kagi exists, according to Prelovac.
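A meta search engine of this kind can be sketched as a merge with deduplication. The backend names, orderings, and round-robin policy below are hypothetical, not Kagi’s actual blend:

```python
# Hypothetical meta-search merge: interleave ranked results from several
# backends, dedupe by URL, and tag where each hit came from.

def merge_results(*sources):
    """sources: (name, ranked list of URLs). Returns deduped (url, name) pairs."""
    seen, merged = set(), []
    # Round-robin across backends so no single index dominates the top.
    for rank in range(max(len(urls) for _, urls in sources)):
        for name, urls in sources:
            if rank < len(urls) and urls[rank] not in seen:
                seen.add(urls[rank])
                merged.append((urls[rank], name))
    return merged

results = merge_results(
    ("bing",   ["a.com", "b.com", "c.com"]),
    ("brave",  ["b.com", "d.com"]),
    ("custom", ["blog.example"]),  # stand-in for a small non-commercial index
)
print(results)  # a.com credited to bing, b.com to brave, blog.example surfaces early
```

Tagging each result with its source is what lets a product like Kagi tell you what proportion came from its own index.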

A Google spinoff could change everything

We’ve all noticed the changes in Google’s approach to search, and most would agree that they have made finding reliable and accurate information harder. Regardless, Google’s incredibly deep and broad index of the Internet is in demand.

Even with Bing and Brave available, companies are going to extremes to syndicate Google Search results. A cottage industry has emerged to scrape Google searches as a stand-in for an official index. These companies are violating Google’s terms, yet they appear in Google Search results themselves. Google could surely do something about this if it wanted to.

The DOJ calls Google’s mountain of data the “essential raw material” for building a general search engine, and it believes forcing the firm to license that material is key to breaking its monopoly. The sketchy syndication firms will evaporate if the DOJ’s data remedies are implemented, which would give competitors an official way to utilize Google’s index. And utilize it they will.

Google CEO Sundar Pichai decried the court’s efforts to force a “de facto divestiture” of Google’s search tech.

Credit: Ryan Whitwam


According to Prelovac, this could lead to an explosion in search choices. “The whole purpose of the Sherman Act is to proliferate a healthy, competitive marketplace. Once you have access to a search index, then you can have thousands of search startups,” said Prelovac.

The Kagi founder suggested that licensing Google Search could allow entities of all sizes to have genuinely useful custom search tools. Cities could use the data to create deep, hyper-local search, and people who love cats could make a cat-specific search engine, in both cases pulling what they want from the most complete database of online content. And, of course, general search products like Kagi would be able to license Google’s tech for a “nominal fee,” as the DOJ puts it.

Prelovac didn’t hesitate when asked if Kagi, which offers a limited number of free searches before asking users to subscribe, would integrate Google’s index. “Yes, that is something we would do,” he said. “And that’s what I believe should happen.”

There may be some drawbacks to unleashing Google’s search services. Judge Amit Mehta has expressed concern that blocking Google’s search placement deals could reduce browser choice, and there is a similar issue with the data remedies. If Google is forced to license search as an API, its few competitors in web indexing could struggle to remain afloat. In a roundabout way, giving away Google’s search tech could actually increase its influence.

The Brave team worries about how open access to Google’s search technology could impact diversity on the web. “If implemented naively, it’s a big problem,” said Brave’s ad chief JP Schmetz. “If the court forces Google to provide search at a marginal cost, it will not be possible for Bing or Brave to survive until the remedy ends.”

The landscape of AI-based search could also change. We know from testimony given during the remedy trial by OpenAI’s Nick Turley that the ChatGPT maker tried and failed to get access to Google Search to ground its AI models—it currently uses Bing. If Google were suddenly an option, you can be sure OpenAI and others would rush to connect Google’s web data to their large language models (LLMs).

The attempt to reduce Google’s power could actually grant it new monopolies in AI, according to Brave Chief Business Officer Brian Brown. “All of a sudden, you would have a single monolithic voice of truth across all the LLMs, across all the web,” Brown said.

What if you weren’t the product?

If white labeling Google does expand choice, even at the expense of other indexes, it will give more kinds of search products a chance in the market—maybe even some that shun Google’s focus on advertising. You don’t see much of that right now.

For most people, web search is and always has been a free service supported by ads. Google, Brave, DuckDuckGo, and Bing offer all the search queries you want for free because they want eyeballs. It’s been said often, but it’s true: If you’re not paying for it, you’re the product. This is an arrangement that bothers Kagi’s founder.

“For something as important as information consumption, there should not be an intermediary between me and the information, especially one that is trying to sell me something,” said Prelovac.

Kagi search results acknowledge the negative impact of today’s advertising regime. Kagi users see a warning next to results with a high number of ads and trackers. According to Prelovac, that is by far the strongest indication that a result is of low quality. That icon also lets you adjust the prevalence of such sites in your personal results. You can demote a site or completely hide it, which is a valuable option in the age of clickbait.
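Demoting or hiding a site amounts to a user-supplied multiplier applied at ranking time. A minimal sketch, with made-up scores and preference values rather than Kagi’s implementation:

```python
# Hypothetical personalization pass: apply per-domain preferences to
# base relevance scores. A weight of 0 hides a site entirely.

base_scores = {                 # made-up relevance scores from the engine
    "clickbait.example": 0.9,
    "quality.example": 0.8,
    "spam.example": 0.7,
}
preferences = {                 # user settings: demote, leave alone, hide
    "clickbait.example": 0.5,
    "spam.example": 0.0,
}

def personalize(scores, prefs):
    ranked = [(d, s * prefs.get(d, 1.0)) for d, s in scores.items()]
    return [d for d, s in sorted(ranked, key=lambda x: x[1], reverse=True) if s > 0]

print(personalize(base_scores, preferences))  # quality first, spam gone
```

Because the weights live with the user rather than the advertiser, the same base results can rank differently for every subscriber.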

Kagi search gives you a lot of control.

Credit: Ryan Whitwam


Kagi’s paid approach to search changes its relationship with your data. “We literally don’t need user data,” Prelovac said. “But it’s not only that we don’t need it. It’s a liability.”

Prelovac admitted that getting people to pay for search is “really hard.” Nevertheless, he believes ad-supported search is a dead end. So Kagi is planning for a future in five or 10 years when more people have realized they’re still “paying” for ad-based search with lost productivity time and personal data, he said.

We know how Google handles user data (it collects a lot of it), but what does that mean for smaller search engines like Brave and DuckDuckGo that rely on ads?

“I’m sure they mean well,” said Prelovac.

Brave said that it shields user data from advertisers, relying on first-party tracking to attribute clicks to Brave without touching the user. “They cannot retarget people later; none of that is happening,” said Brave’s JP Schmetz.

DuckDuckGo is a bit of an odd duck—it relies on Bing’s general search index, but it adds a layer of privacy tools on top. It’s free and ad-supported like Google and Brave, but the company says it takes user privacy seriously.

“Viewing ads is privacy protected by DuckDuckGo, and most ad clicks are managed by Microsoft’s ad network,” DuckDuckGo’s Kamyl Bazbaz said. He explained that DuckDuckGo has worked with Microsoft to ensure its network does not track users or create any profiles based on clicks. He added that the company has a similar privacy arrangement with TripAdvisor for travel-related ads.

It’s AI all the way down

We can’t talk about the future of search without acknowledging the artificially intelligent elephant in the room. As Google continues its shift to AI-based search, it’s tempting to think of the potential search spin-off as a way to escape that trend. However, you may find few refuges in the coming years. There’s a real possibility that search is evolving beyond the 10 blue links and toward an AI assistant model.

All non-Google search engines have AI integrations, with the most prominent being Microsoft Bing, which has a partnership with OpenAI. But smaller players have AI search features, too. The folks working on these products agree with Microsoft and Google on one important point: They see AI as inevitable.

Today’s Google alternatives all have their own take on AI Overviews, which generates responses to queries based on search results. They’re generally not as in-your-face as Google AI, though. While Google and Microsoft are intensely focused on increasing the usage of AI search, other search operators aren’t pushing for that future. They are along for the ride, though.


AI Overviews are integrated with Google’s search results, and most other players have their own version.

Credit: Google


“We’re finding that some people prefer to start in chat mode and then jump into more traditional search results when needed, while others prefer the opposite,” Bazbaz said. “So we thought the best thing to do was offer both. We made it easy to move between them, and we included an off switch for those who’d like to avoid AI altogether.”

The team at Brave views AI as a core means of accessing search and one that will continue to grow. Brave generates AI answers for many searches and prominently cites sources. You can also disable Brave’s AI if you prefer. But according to search chief Josep Pujol, the move to AI search is inevitable for a pretty simple reason: It’s convenient, and people will always choose convenience. So AI is changing the web as we know it, for better or worse, because LLMs can save a smidge of time, especially for more detailed “long-tail” queries. These AI features may give you false information while they do it, but that’s not always apparent.

This is very similar to the language Google uses when discussing agentic search, although it expresses it in a more nuanced way. By understanding the task behind a query, Google hopes to provide AI answers that save people time, even if the model needs a few ticks to fan out and run multiple searches to generate a more comprehensive report on a topic. That’s probably still faster than running multiple searches and manually reviewing the results, and it could leave traditional search as an increasingly niche service, even in a world with more choices.

“Will the 10 blue links continue to exist in 10 years?” Pujol asked. “Actually, one question would be, does it even exist now? In 10 years, [search] will have evolved into more of an AI conversation behavior or even agentic. That is probably the case. What, for sure, will continue to exist is the need to search. Search is a verb, an action that you do, and whether you will do it directly or whether it will be done through an agent, it’s a search engine.”

Kagi’s Prelovac sees AI becoming the default way we access information in the long term, but his search engine doesn’t force you to use it. On Kagi, you can expand the AI box for your searches and ask follow-ups, and the AI will open automatically if you use a question mark in your search. But that’s just the start.

“You watch Star Trek, nobody’s clicking on links there—I do believe in that vision in science fiction movies,” Prelovac said. “I don’t think my daughter will be clicking links in 10 years. The only question is if the current technology will be the one that gets us there. LLMs have inherent flaws. I would even tend to say it’s likely not going to get us to Star Trek.”

If we think of AI mainly as a way to search for information, the future becomes murky. With generative AI in the driver’s seat, questions of authority and accuracy may be left to language models that often behave in unpredictable and difficult-to-understand ways. Whether we’re headed for an AI boom or bust—for continued Google dominance or a new era of choice—we’re facing fundamental changes to how we access information.

Maybe if we get those thousands of search startups, there will be a few that specialize in 10 blue links. We can only hope.


Ryan Whitwam is a senior technology reporter at Ars Technica, covering the ways Google, AI, and mobile technology continue to change the world. Over his 20-year career, he’s written for Android Police, ExtremeTech, Wirecutter, NY Times, and more. He has reviewed more phones than most people will ever own. You can follow him on Bluesky, where you will see photos of his dozens of mechanical keyboards.
