AI

Why Google Gemini’s Pokémon success isn’t all it’s cracked up to be

While Gemini is using its own model and reasoning process for these tasks, it’s telling that JoelZ had to specifically graft these specialized agents onto the base model to help it get through some of the game’s toughest challenges. As JoelZ writes, “My interventions improve Gemini’s overall decision-making and reasoning abilities.”

What are we testing here?

Don’t get me wrong, massaging an LLM into a form that can beat a Pokémon game is definitely an achievement. However, the level of “intervention” needed to help Gemini with those things that “LLMs can’t do independently yet” is crucial to keep in mind as we evaluate that success.

The moment Gemini beat Pokémon (with a little help).

We already know that specially designed reinforcement learning tools can beat Pokémon quite efficiently (and that even a random number generator can beat the game quite inefficiently). The particular resonance of an “LLM plays Pokémon” test is in seeing if a generalized language model can reason out its own solution to a complicated game on its own. The more hand-holding we give the model—through external information, tools, or “harnesses”—the less useful the game is as that kind of test.

Anthropic said in February that Claude Plays Pokémon showed “glimmers of AI systems that tackle challenges with increasing competence, not just through training but with generalized reasoning.” But as Bradshaw writes on LessWrong, “without a refined agent harness, [all models] have a hard time simply making it through the very first screen of the game, Red’s bedroom!” Bradshaw’s subsequent gameplay tests with harness-free LLMs further highlight how these models frequently wander aimlessly, backtrack pointlessly, or even hallucinate impossible game situations.

In other words, we’re still a long way from the kind of envisioned future where an Artificial General Intelligence can figure out a way to beat Pokémon just because you asked it to.


A DOGE recruiter is staffing a project to deploy AI agents across the US government


“does it still require Kremlin oversight?”

A startup founder said that AI agents could do the work of tens of thousands of government employees.

An aide sets up a poster depicting the logo for the DOGE Caucus before a news conference in Washington, DC. Credit: Andrew Harnik/Getty Images

A young entrepreneur who was among the earliest known recruiters for Elon Musk’s so-called Department of Government Efficiency (DOGE) has a new, related gig, and he’s hiring. Anthony Jancso, cofounder of AccelerateX, a government tech startup, is looking for technologists to work on a project that aims to have artificial intelligence perform tasks that are currently the responsibility of tens of thousands of federal workers.

Jancso, a former Palantir employee, wrote in a Slack with about 2,000 Palantir alumni in it that he’s hiring for a “DOGE orthogonal project to design benchmarks and deploy AI agents across live workflows in federal agencies,” according to an April 21 post reviewed by WIRED. Agents are programs that can perform work autonomously.

“We’ve identified over 300 roles with almost full-process standardization, freeing up at least 70k FTEs for higher-impact work over the next year,” he continued, essentially claiming that tens of thousands of federal employees could see many aspects of their job automated and replaced by these AI agents. Workers for the project, he wrote, would be based on site in Washington, DC, and would not require a security clearance; it isn’t clear for whom they would work. Palantir did not respond to requests for comment.

The post was not well received. Eight people reacted with clown face emojis, three reacted with a custom emoji of a man licking a boot, two reacted with custom emoji of Joaquin Phoenix giving a thumbs down in the movie Gladiator, and three reacted with a custom emoji with the word “Fascist.” Three responded with a heart emoji.

“DOGE does not seem interested in finding ‘higher impact work’ for federal employees,” one person said in a comment that received 11 heart reactions. “You’re complicit in firing 70k federal employees and replacing them with shitty autocorrect.”

“Tbf we’re all going to be replaced with shitty autocorrect (written by chatgpt),” another person commented, which received one “+1” reaction.

“How ‘DOGE orthogonal’ is it? Like, does it still require Kremlin oversight?” another person said in a comment that received five reactions with a fire emoji. “Or do they just use your credentials to log in later?”

AccelerateX was originally called AccelerateSF, which VentureBeat reported in 2023 had received support from OpenAI and Anthropic. In its earliest incarnation, AccelerateSF hosted a hackathon for AI developers aimed at using the technology to solve San Francisco’s social problems. According to a 2023 Mission Local story, for instance, Jancso proposed that using large language models to help businesses fill out permit forms to streamline the construction paperwork process might help drive down housing prices. (OpenAI did not respond to a request for comment. Anthropic spokesperson Danielle Ghiglieri tells WIRED that the company “never invested in AccelerateX/SF,” but did sponsor a hackathon AccelerateSF hosted in 2023 by providing free access to its API usage at a time when its Claude API “was still in beta.”)

In 2024, the mission pivoted, with the venture becoming known as AccelerateX. In a post on X announcing the change, the company posted, “Outdated tech is dragging down the US Government. Legacy vendors sell broken systems at increasingly steep prices. This hurts every American citizen.” AccelerateX did not respond to a request for comment.

According to sources with direct knowledge, Jancso disclosed that AccelerateX had signed a partnership agreement with Palantir in 2024. According to the LinkedIn of someone described as one of AccelerateX’s cofounders, Rachel Yee, the company looks to have received funding from OpenAI’s Converge 2 Accelerator. Another of AccelerateSF’s cofounders, Kay Sorin, now works for OpenAI, having joined the company several months after that hackathon. Sorin and Yee did not respond to requests for comment.

Jancso’s cofounder, Jordan Wick, a former Waymo engineer, has been an active member of DOGE, appearing at several agencies over the past few months, including the Consumer Financial Protection Bureau, National Labor Relations Board, the Department of Labor, and the Department of Education. In 2023, Jancso attended a hackathon hosted by ScaleAI; WIRED found that another DOGE member, Ethan Shaotran, also attended the same hackathon.

Since its creation in the first days of the second Trump administration, DOGE has pushed the use of AI across agencies, even as it has sought to cut tens of thousands of federal jobs. At the Department of Veterans Affairs, a DOGE associate suggested using AI to write code for the agency’s website; at the General Services Administration, DOGE has rolled out the GSAi chatbot; the group has sought to automate the process of firing government employees with a tool called AutoRIF; and a DOGE operative at the Department of Housing and Urban Development is using AI tools to examine and propose changes to regulations. But experts say that deploying AI agents to do the work of 70,000 people would be tricky if not impossible.

A federal employee with knowledge of government contracting, who spoke to WIRED on the condition of anonymity because they were not authorized to speak to the press, says, “A lot of agencies have procedures that can differ widely based on their own rules and regulations, and so deploying AI agents across agencies at scale would likely be very difficult.”

Oren Etzioni, cofounder of the AI startup Vercept, says that while AI agents can be good at doing some things—like using an internet browser to conduct research—their outputs can still vary widely and be highly unreliable. For instance, customer service AI agents have invented nonexistent policies when trying to address user concerns. Even research, he says, requires a human to actually make sure what the AI is spitting out is correct.

“We want our government to be something that we can rely on, as opposed to something that is on the absolute bleeding edge,” says Etzioni. “We don’t need it to be bureaucratic and slow, but if corporations haven’t adopted this yet, is the government really where we want to be experimenting with the cutting edge AI?”

Etzioni says that AI agents are also not great 1-1 fits for job replacements. Rather, AI is able to do certain tasks or make others more efficient, but the idea that the technology could do the jobs of 70,000 employees would not be possible. “Unless you’re using funny math,” he says, “no way.”

Jancso, first identified by WIRED in February, was one of the earliest recruiters for DOGE in the months before Donald Trump was inaugurated. In December, Jancso used the Palantir alumni group to recruit DOGE members; sources told WIRED he said he had been recruited by Steve Davis, president of the Musk-founded Boring Company and a current member of DOGE. On December 2, 2024, he wrote, “I’m helping Elon’s team find tech talent for the Department of Government Efficiency (DOGE) in the new admin. This is a historic opportunity to build an efficient government, and to cut the federal budget by 1/3. If you’re interested in playing a role in this mission, please reach out in the next few days.”

According to one source at SpaceX, who asked to remain anonymous as they are not authorized to speak to the press, Jancso appeared to be one of the DOGE members who worked out of the company’s DC office in the days before inauguration along with several other people who would constitute some of DOGE’s earliest members. SpaceX did not respond to a request for comment.

Palantir was cofounded by Peter Thiel, a billionaire and longtime Trump supporter with close ties to Musk. Palantir, which provides data analytics tools to several government agencies including the Department of Defense and the Department of Homeland Security, has received billions of dollars in government contracts. During the second Trump administration, the company has been involved in helping to build a “mega API” to connect data from the Internal Revenue Service to other government agencies, and is working with Immigration and Customs Enforcement to create a massive surveillance platform to identify immigrants to target for deportation.

This story originally appeared at WIRED.com.

Wired.com is your essential daily guide to what’s next, delivering the most original and complete take you’ll find anywhere on innovation’s impact on technology, science, business and culture.


Claude’s AI research mode now runs for up to 45 minutes before delivering reports

Still, the report contained a direct quote from William Higinbotham that appears to combine quotes from two sources not cited in the source list. (One must always be careful with confabulated quotes in AI because even outside of this Research mode, Claude 3.7 Sonnet tends to invent plausible ones to fit a narrative.) We recently covered a study that showed AI search services confabulate sources frequently, and in this case, it appears that the sources Claude Research surfaced, while real, did not always match what is stated in the report.

There’s always room for interpretation and variation in detail, of course, but overall, Claude Research did a relatively good job crafting a report on this particular topic. Still, you’d want to dig more deeply into each source and confirm everything if you used it as the basis for serious research. You can read the full Claude-generated result as this text file, saved in markdown format. Sadly, the markdown version does not include the source URLs found in the Claude web interface.

Integrations feature

Anthropic also announced Thursday that it has broadened Claude’s data access capabilities. In addition to web search and Google Workspace integration, Claude can now search any connected application through the company’s new “Integrations” feature. The feature reminds us somewhat of OpenAI’s ChatGPT Plugins feature from March 2023 that aimed for similar connections, although the two features work differently under the hood.

These Integrations allow Claude to work with remote Model Context Protocol (MCP) servers across web and desktop applications. The MCP standard, which Anthropic introduced last November and we covered in April, connects AI applications to external tools and data sources.

At launch, Claude supports Integrations with 10 services, including Atlassian’s Jira and Confluence, Zapier, Cloudflare, Intercom, Asana, Square, Sentry, PayPal, Linear, and Plaid. The company plans to add more partners like Stripe and GitLab in the future.

Each integration aims to expand Claude’s functionality in specific ways. The Zapier integration, for instance, reportedly connects thousands of apps through pre-built automation sequences, allowing Claude to automatically pull sales data from HubSpot or prepare meeting briefs based on calendar entries. With Atlassian’s tools, Anthropic says that Claude can collaborate on product development, manage tasks, and create multiple Confluence pages and Jira work items simultaneously.

Anthropic has made its advanced Research and Integrations features available in beta for users on Max, Team, and Enterprise plans, with Pro plan access coming soon. The company has also expanded its web search feature (introduced in March) to all Claude users on paid plans globally.


Google teases NotebookLM app in the Play Store ahead of I/O release

After several years of escalating AI hysteria, we are all familiar with Google’s desire to put Gemini in every one of its products. That can be annoying, but NotebookLM is not—this one actually works. NotebookLM, which helps you parse documents, videos, and more using Google’s advanced AI models, has been available on the web since 2023, but Google recently confirmed it would finally get an Android app. You can get a look at the app now, but it’s not yet available to install.

Until now, NotebookLM was only a website. You can visit it on your phone, but the interface is clunky compared to the desktop version. The arrival of the mobile app will change that. Google said it plans to release the app at Google I/O in late May, but the listing is live in the Play Store early. You can pre-register to be notified when the download is live, but you’ll have to tide yourself over with the screenshots for the time being.

NotebookLM relies on the same underlying technology as Google’s other chatbots and AI projects, but instead of a general purpose robot, NotebookLM is only concerned with the documents you upload. It can assimilate text files, websites, and videos, including multiple files and source types for a single agent. It has a hefty context window of 500,000 tokens and supports document uploads as large as 200MB. Google says this creates a queryable “AI expert” that can answer detailed questions and brainstorm ideas based on the source data.


DOGE put a college student in charge of using AI to rewrite regulations


The DOGE operative has been tasked with rewriting the Department of Housing and Urban Development’s regulations.

A young man with no government experience who has yet to even complete his undergraduate degree is working for Elon Musk’s so-called Department of Government Efficiency (DOGE) at the Department of Housing and Urban Development (HUD) and has been tasked with using artificial intelligence to rewrite the agency’s rules and regulations.

Christopher Sweet was introduced to HUD employees as being originally from San Francisco and, most recently, a third-year student at the University of Chicago, where he was studying economics and data science, in an email sent to staffers earlier this month.

“I’d like to share with you that Chris Sweet has joined the HUD DOGE team with the title of special assistant, although a better title might be ‘AI computer programming quant analyst,’” Scott Langmack, a DOGE staffer and chief operating officer of an AI real estate company, wrote in an email widely shared within the agency and reviewed by WIRED. “With family roots from Brazil, Chris speaks Portuguese fluently. Please join me in welcoming Chris to HUD!”

Sweet’s primary role appears to be leading an effort to leverage artificial intelligence to review HUD’s regulations, compare them to the laws on which they are based, and identify areas where rules can be relaxed or removed altogether. (He has also been given read access to HUD’s data repository on public housing, known as the Public and Indian Housing Information Center, and its enterprise income verification systems, according to sources within the agency.)

Plans for the industrial-scale deregulation of the US government were laid out in detail in the Project 2025 policy document that the Trump administration has effectively used as a playbook during its first 100 days in power. The document, written by a who’s who of far-right figures, many of whom now hold positions of power within the administration, pushes for deregulation in areas like the environment, food and drug enforcement, and diversity, equity, and inclusion policies.

One area Sweet is focusing on is regulation related to the Office of Public and Indian Housing (PIH), according to sources who spoke to WIRED on the condition of anonymity as they were not authorized to speak to the press.

Sweet—who two sources have been told is the lead on the AI deregulation project for the entire administration—has produced an Excel spreadsheet with around a thousand rows containing areas of policy where the AI tool has flagged that HUD may have “overreached” and suggesting replacement language.

Staffers from PIH are, specifically, asked to review the AI’s recommendations and justify their objections to those they don’t agree with. “It all sounds crazy—having AI recommend revisions to regulations,” one HUD source says. “But I appreciated how much they’re using real people to confirm and make changes.”

Once the PIH team completes the review, their recommendations will be submitted to the Office of the General Counsel for approval.

One HUD source says they were told that the AI model being used for this project is “being refined by our work to be used across the government.” To do this, the source says they were told in a meeting attended by Sweet and Jacob Altik, another known DOGE member who has worked as an attorney at Weil, Gotshal & Manges, that the model will crawl through the Code of Federal Regulations (eCFR).

Another source told WIRED that Sweet has also been using the tool at other parts of HUD. WIRED reviewed a copy of the output of the AI’s review of one HUD department, which features columns displaying text that the AI model found to be needing an adjustment while also including suggestions from the AI for alterations to be made, essentially proposing rewrites. The spreadsheet details how many words can be eliminated from individual regulations and gives a percentage figure indicating how noncompliant the regulations are. It isn’t clear how these percentages are calculated.

Sweet did not respond to requests for comment regarding his work. In response to a request to clarify Sweet’s role at HUD, a spokesperson for the agency said they do not comment on individual personnel. The University of Chicago confirmed to WIRED that Sweet is “on leave from the undergraduate college.”

It’s unclear how Sweet was recruited to DOGE, but a public GitHub account indicates that he was working on this issue even before he joined Musk’s demolition crew.

The “CLSweet” GitHub account, which WIRED has linked to Sweet, created an application that tracks and analyzes federal government regulations “showing how regulatory burden is distributed across government agencies.” The application was last updated in March 2025, weeks before Sweet joined HUD.

One HUD source who heard about Sweet’s possible role in revising the agency’s regulations said the effort was redundant, since the agency was already “put through a multi-year multi-stakeholder meatgrinder before any rule was ever created” under the Administrative Procedure Act. (This law dictates how agencies are allowed to establish regulations and allows for judicial oversight over everything an agency does.)

Another HUD source said Sweet’s title seemed to make little sense. “A programmer and a quantitative data analyst are two very different things,” they noted.

Sweet has virtually no online footprint. One of the only references to him online is a short biography on the website of East Edge Securities, an investment firm Sweet founded in 2023 with two other students from the University of Chicago.

The biography is short on details but claims that Sweet has worked in the past with several private equity firms, including Pertento Partners, which is based in London, and Tenzing Global Investors, based in San Francisco. He is also listed as a board member of Paragon Global Investments, which is a student-run hedge fund.

The biography also mentions that Sweet “will be joining Nexus Point Capital as a private equity summer analyst.” The company has headquarters in Hong Kong and Shanghai and describes itself as “an Asian private equity fund with a strategic focus on control opportunities in the Greater China market.”

East Edge Securities, Pertento Partners, Tenzing Global Investors, Paragon Global Investments, and Nexus Point Capital did not respond to requests for comment.

The only other online account associated with Sweet appears to be a Substack account using the same username as the GitHub account. That account has not posted any content and follows mostly finance and market-related newsletters. It also follows Bari Weiss’ The Free Press and the newsletter of Marc Andreessen, the Silicon Valley billionaire investor and group chat enthusiast who said he spent a lot of time advising Trump and his team after the election.

DOGE representatives have been at HUD since February, when WIRED reported that two of those staffers were given application-level access to some of the most critical and sensitive systems inside the agency.

Earlier this month, US representative Maxine Waters, the top Democrat on the House Financial Services Committee, said DOGE had “infiltrated our nation’s housing agencies, stealing funding Congress provided to communities, illegally terminating staff, including in your districts, and accessing confidential data about people living in assisted housing, including sexual assault survivors.”

This story originally appeared at WIRED.com


Time saved by AI offset by new work created, study suggests

A new study analyzing the Danish labor market in 2023 and 2024 suggests that generative AI models like ChatGPT have had almost no significant impact on overall wages or employment yet, despite rapid adoption in some workplaces. The findings, detailed in a working paper by economists from the University of Chicago and the University of Copenhagen, provide an early, large-scale empirical look at AI’s transformative potential.

In “Large Language Models, Small Labor Market Effects,” economists Anders Humlum and Emilie Vestergaard focused specifically on the impact of AI chatbots across 11 occupations often considered vulnerable to automation, including accountants, software developers, and customer support specialists. Their analysis covered data from 25,000 workers and 7,000 workplaces in Denmark.

Despite finding widespread and often employer-encouraged adoption of these tools, the study concluded that “AI chatbots have had no significant impact on earnings or recorded hours in any occupation” during the period studied. The confidence intervals in their statistical analysis ruled out average effects larger than 1 percent.
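To make that last claim concrete, here is a minimal sketch of what it means for a confidence interval to rule out an effect of a given size. The estimates and standard errors below are hypothetical placeholders, not figures from the paper, and the 95 percent normal-approximation interval is an assumption for illustration rather than a description of the authors' actual analysis.

```python
def confidence_interval(estimate, std_error, z=1.96):
    """Normal-approximation interval: estimate +/- z * std_error."""
    margin = z * std_error
    return estimate - margin, estimate + margin


def rules_out_effects_beyond(interval, bound):
    """True if every value in the interval is smaller than bound in magnitude."""
    low, high = interval
    return -bound < low and high < bound


# Hypothetical inputs: a 0.1% estimated effect on earnings, 0.3% standard error.
tight = confidence_interval(0.001, 0.003)
# Same estimate with a noisier (hypothetical) 0.6% standard error.
loose = confidence_interval(0.001, 0.006)

print(rules_out_effects_beyond(tight, 0.01))  # True
print(rules_out_effects_beyond(loose, 0.01))  # False
```

The first interval sits entirely inside plus or minus 1 percent, so larger average effects are ruled out. The second extends past that bound, so the same point estimate would not support the claim.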

“The adoption of these chatbots has been remarkably fast,” Humlum told The Register about the study. “Most workers in the exposed occupations have now adopted these chatbots… But then when we look at the economic outcomes, it really has not moved the needle.”

AI creating more work?

During the study, the researchers investigated how company investment in AI affected worker adoption and how chatbots changed workplace processes. While corporate investment boosted AI tool adoption—saving time for 64 to 90 percent of users across studied occupations—the actual benefits were less substantial than expected.

The study revealed that AI chatbots actually created new job tasks for 8.4 percent of workers, including some who did not use the tools themselves, offsetting potential time savings. For example, many teachers now spend time detecting whether students use ChatGPT for homework, while other workers review AI output quality or attempt to craft effective prompts.


New study accuses LM Arena of gaming its popular AI benchmark

This study also calls out LM Arena for what appears to be much greater promotion of private models like Gemini, ChatGPT, and Claude. Developers collect data on model interactions from the Chatbot Arena API, but teams focusing on open models consistently get the short end of the stick.

The researchers point out that certain models appear in arena faceoffs much more often, with Google and OpenAI together accounting for over 34 percent of collected model data. Firms like xAI, Meta, and Amazon are also disproportionately represented in the arena. Therefore, those firms get more vibemarking data compared to the makers of open models.

More models, more evals

The study authors have a list of suggestions to make LM Arena more fair. Several of the paper’s recommendations are aimed at correcting the imbalance of privately tested commercial models, for example, by limiting the number of models a group can add and retract before releasing one. The study also suggests showing all model results, even if they aren’t final.

However, the site’s operators take issue with some of the paper’s methodology and conclusions. LM Arena points out that the pre-release testing features have not been kept secret, with a March 2024 blog post featuring a brief explanation of the system. They also contend that model creators don’t technically choose the version that is shown. Instead, the site simply doesn’t show non-public versions for simplicity’s sake. When a developer releases the final version, that’s what LM Arena adds to the leaderboard.

Proprietary models get disproportionate attention in the Chatbot Arena, the study says. Credit: Shivalika Singh et al.

One place the two sides may find alignment is on the question of unequal matchups. The study authors call for fair sampling, which will ensure open models appear in Chatbot Arena at a rate similar to the likes of Gemini and ChatGPT. LM Arena has suggested it will work to make the sampling algorithm more varied so you don’t always get the big commercial models. That would send more eval data to small players, giving them the chance to improve and challenge the big commercial models.
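As a rough illustration of why the sampling rule matters, the sketch below compares a skewed matchup sampler with a uniform one. The model names, weights, and the `sample_matchups` helper are all hypothetical; neither the study nor LM Arena's actual sampling algorithm is reproduced here.

```python
import random
from collections import Counter
from itertools import combinations


def sample_matchups(models, weights, n_battles, rng):
    """Draw n_battles pairs of distinct models; each pair's chance is
    proportional to the product of its two models' weights."""
    pairs = list(combinations(models, 2))
    pair_weights = [weights[a] * weights[b] for a, b in pairs]
    appearances = Counter()
    for a, b in rng.choices(pairs, weights=pair_weights, k=n_battles):
        appearances[a] += 1
        appearances[b] += 1
    return appearances


# Hypothetical model names and weights, for illustration only.
MODELS = ["big-lab-a", "big-lab-b", "open-x", "open-y"]
SKEWED = {"big-lab-a": 4.0, "big-lab-b": 4.0, "open-x": 1.0, "open-y": 1.0}
UNIFORM = {name: 1.0 for name in MODELS}

rng = random.Random(0)  # fixed seed so repeated runs match
skewed_counts = sample_matchups(MODELS, SKEWED, 10_000, rng)
uniform_counts = sample_matchups(MODELS, UNIFORM, 10_000, rng)

print("skewed: ", dict(skewed_counts))
print("uniform:", dict(uniform_counts))
```

With these made-up weights, the heavily weighted models appear in well over twice as many battles as the others, while uniform weights put every model in roughly half of all battles. Each appearance is a data point the model's developer can learn from, which is the imbalance the study's fair-sampling proposal targets.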

LM Arena recently announced it was forming a corporate entity to continue its work. With money on the table, the operators need to ensure Chatbot Arena continues figuring into the development of popular models. However, it’s unclear whether this is an objectively better way to evaluate chatbots versus academic tests. As people vote on vibes, there’s a real possibility we are pushing models to adopt sycophantic tendencies. This may have helped nudge ChatGPT into suck-up territory in recent weeks, a move that OpenAI has hastily reverted after widespread anger.


The end of an AI that shocked the world: OpenAI retires GPT-4

One of the most influential—and by some counts, notorious—AI models yet released will soon fade into history. OpenAI announced on April 10 that GPT-4 will be “fully replaced” by GPT-4o in ChatGPT at the end of April, bringing a public-facing end to the model that accelerated a global AI race when it launched in March 2023.

“Effective April 30, 2025, GPT-4 will be retired from ChatGPT and fully replaced by GPT-4o,” OpenAI wrote in its April 10 changelog for ChatGPT. While ChatGPT users will no longer be able to chat with the older AI model, the company added that “GPT-4 will still be available in the API,” providing some reassurance to developers who might still be using the older model for various tasks.

The retirement marks the end of an era that began on March 14, 2023, when GPT-4 demonstrated capabilities that shocked some observers: reportedly scoring at the 90th percentile on the Uniform Bar Exam, acing AP tests, and solving complex reasoning problems that stumped previous models. Its release created a wave of immense hype—and existential panic—about AI’s ability to imitate human communication and composition.

A screenshot of GPT-4’s introduction to ChatGPT Plus customers from March 14, 2023. Credit: Benj Edwards / Ars Technica

While ChatGPT launched in November 2022 with GPT-3.5 under the hood, GPT-4 took AI language models to a new level of sophistication, and it was a massive undertaking to create. It combined data scraped from the vast corpus of human knowledge into a set of neural networks rumored to weigh in at a combined total of 1.76 trillion parameters, which are the numerical values that hold the data within the model.

Along the way, the model reportedly cost more than $100 million to train, according to comments by OpenAI CEO Sam Altman, and required vast computational resources to develop. Training the model may have involved over 20,000 high-end GPUs working in concert—an expense few organizations besides OpenAI and its primary backer, Microsoft, could afford.

Industry reactions, safety concerns, and regulatory responses

Curiously, GPT-4’s impact began before OpenAI’s official announcement. In February 2023, Microsoft integrated its own early version of the GPT-4 model into its Bing search engine, creating a chatbot that sparked controversy when it tried to convince Kevin Roose of The New York Times to leave his wife and when it “lost its mind” in response to an Ars Technica article.


OpenAI rolls back update that made ChatGPT a sycophantic mess

In search of good vibes

OpenAI, along with competitors like Google and Anthropic, is trying to build chatbots that people want to chat with. So, designing the model’s apparent personality to be positive and supportive makes sense—people are less likely to use an AI that comes off as harsh or dismissive. For lack of a better word, it’s increasingly about vibemarking.

When Google revealed Gemini 2.5, the team crowed about how the model topped the LM Arena leaderboard, which lets people choose between two different model outputs in a blinded test. The models people like more end up at the top of the list, suggesting they are more pleasant to use. Of course, people can like outputs for different reasons—maybe one is more technically accurate, or the layout is easier to read. But overall, people like models that make them feel good. The same is true of OpenAI’s internal model tuning work, it would seem.

An example of ChatGPT’s overzealous praise. Credit: /u/Talvy

It’s possible this pursuit of good vibes is pushing models to display more sycophantic behaviors, which is a problem. Anthropic’s Alex Albert has cited this as a “toxic feedback loop.” An AI chatbot telling you that you’re a world-class genius who sees the unseen might not be damaging if you’re just brainstorming. However, the model’s unending praise can lead people who are using AI to plan business ventures or, heaven forbid, enact sweeping tariffs, to be fooled into thinking they’ve stumbled onto something important. In reality, the model has just become so sycophantic that it loves everything.

The constant pursuit of engagement has been a detriment to numerous products in the Internet era, and it seems generative AI is not immune. OpenAI’s GPT-4o update is a testament to that, but hopefully, this can serve as a reminder for the developers of generative AI that good vibes are not all that matters.



Google search’s made-up AI explanations for sayings no one ever said, explained


But what does “meaning” mean?

A partial defense of (some of) AI Overview’s fanciful idiomatic explanations.

Mind… blown. Credit: Getty Images

Last week, the phrase “You can’t lick a badger twice” unexpectedly went viral on social media. The nonsense sentence—which was likely never uttered by a human before last week—had become the poster child for the newly discovered way Google search’s AI Overviews makes up plausible-sounding explanations for made-up idioms (though the concept seems to predate that specific viral post by at least a few days).

Google users quickly discovered that typing any concocted phrase into the search bar with the word “meaning” attached at the end would generate an AI Overview with a purported explanation of its idiomatic meaning. Even the most nonsensical attempts at new proverbs resulted in a confident explanation from Google’s AI Overview, created right there on the spot.

In the wake of the “lick a badger” post, countless users flocked to social media to share Google’s AI interpretations of their own made-up idioms, often expressing horror or disbelief at Google’s take on their nonsense. Those posts often highlight the overconfident way the AI Overview frames its idiomatic explanations and occasional problems with the model confabulating sources that don’t exist.

But after reading through dozens of publicly shared examples of Google’s explanations for fake idioms—and generating a few of my own—I’ve come away somewhat impressed with the model’s almost poetic attempts to glean meaning from gibberish and make sense out of the senseless.

Talk to me like a child

Let’s try a thought experiment: Say a child asked you what the phrase “you can’t lick a badger twice” means. You’d probably say you’ve never heard that particular phrase, ask the child where they heard it, or explain that it doesn’t really make sense without more context.

“Someone on Threads noticed you can type any random sentence into Google, then add ‘meaning’ afterwards, and you’ll get an AI explanation of a famous idiom or phrase you just made up. Here is mine”

— Greg Jenner (@gregjenner.bsky.social) April 23, 2025 at 6:15 AM

But let’s say the child persisted and really wanted an explanation for what the phrase means. So you’d do your best to generate a plausible-sounding answer. You’d search your memory for possible connotations for the word “lick” and/or symbolic meaning for the noble badger to force the idiom into some semblance of sense. You’d reach back to other similar idioms you know to try to fit this new, unfamiliar phrase into a wider pattern (anyone who has played the excellent board game Wise and Otherwise might be familiar with the process).

Google’s AI Overview doesn’t go through exactly that kind of human thought process when faced with a similar question about the same saying. But in its own way, the large language model also does its best to generate a plausible-sounding response to an unreasonable request.

As seen in Greg Jenner’s viral Bluesky post, Google’s AI Overview suggests that “you can’t lick a badger twice” means that “you can’t trick or deceive someone a second time after they’ve been tricked once. It’s a warning that if someone has already been deceived, they are unlikely to fall for the same trick again.” As an attempt to derive meaning from a meaningless phrase (which was, after all, the user’s request), that’s not half bad. Faced with a phrase that has no inherent meaning, the AI Overview still makes a good-faith effort to answer the user’s request and draw some plausible explanation out of troll-worthy nonsense.

Contrary to the computer science truism of “garbage in, garbage out,” Google here is taking in some garbage and spitting out… well, a workable interpretation of garbage, at the very least.

Google’s AI Overview even goes into more detail explaining its thought process. “Lick” here means to “trick or deceive” someone, it says, a bit of a stretch from the dictionary definition of lick as “comprehensively defeat,” but probably close enough for an idiom (and a plausible iteration of the idiom, “Fool me once shame on you, fool me twice, shame on me…”). Google also explains that the badger part of the phrase “likely originates from the historical sport of badger baiting,” a practice I was sure Google was hallucinating until I looked it up and found it was real.

It took me 15 seconds to make up this saying but now I think it kind of works! Credit: Kyle Orland / Google

I found plenty of other examples where Google’s AI derived more meaning than the original requester’s gibberish probably deserved. Google interprets the phrase “dream makes the steam” as an almost poetic statement about imagination powering innovation. The line “you can’t humble a tortoise” similarly gets interpreted as a statement about the difficulty of intimidating “someone with a strong, steady, unwavering character (like a tortoise).”

Google also often finds connections that the original nonsense idiom creators likely didn’t intend. For instance, Google could link the made-up idiom “A deft cat always rings the bell” to the real concept of belling the cat. And in attempting to interpret the nonsense phrase “two cats are better than grapes,” the AI Overview correctly notes that grapes can be potentially toxic to cats.

Brimming with confidence

Even when Google’s AI Overview works hard to make the best of a bad prompt, I can still understand why the responses rub a lot of users the wrong way. A lot of the problem, I think, has to do with the LLM’s unearned confident tone, which pretends that any made-up idiom is a common saying with a well-established and authoritative meaning.

Rather than framing its responses as a “best guess” at an unknown phrase (as a human might when responding to a child in the example above), Google generally provides the user with a single, authoritative explanation for what an idiom means, full stop. Even with the occasional use of couching words such as “likely,” “probably,” or “suggests,” the AI Overview comes off as unnervingly sure of the accepted meaning for some nonsense the user made up five seconds ago.

If Google’s AI Overviews always showed this much self-doubt, we’d be getting somewhere. Credit: Google / Kyle Orland

I was able to find one exception to this in my testing. When I asked Google the meaning of “when you see a tortoise, spin in a circle,” Google reasonably told me that the phrase “doesn’t have a widely recognized, specific meaning” and that it’s “not a standard expression with a clear, universal meaning.” With that context, Google then offered suggestions for what the phrase “seems to” mean and mentioned Japanese nursery rhymes that it “may be connected” to, before concluding that it is “open to interpretation.”

Those qualifiers go a long way toward properly contextualizing the guesswork Google’s AI Overview is actually conducting here. And if Google provided that kind of context in every AI summary explanation of a made-up phrase, I don’t think users would be quite as upset.

Unfortunately, LLMs like this have trouble knowing what they don’t know, meaning moments of self-doubt like the tortoise interpretation here tend to be few and far between. It’s not like Google’s language model has some master list of idioms in its neural network that it can consult to determine what is and isn’t a “standard expression” that it can be confident about. Usually, it’s just projecting a self-assured tone while struggling to force the user’s gibberish into meaning.

Zeus disguised himself as what?

The worst examples of Google’s idiomatic AI guesswork are ones where the LLM slips past plausible interpretations and into sheer hallucination of completely fictional sources. The phrase “a dog never dances before sunset,” for instance, did not appear in the film Before Sunrise, no matter what Google says. Similarly, “There are always two suns on Tuesday” does not appear in The Hitchhiker’s Guide to the Galaxy film despite Google’s insistence.

“Literally in the one I tried.”

— Sarah Vaughan (@madamefelicie.bsky.social) April 23, 2025 at 7:52 AM

There’s also no indication that the made-up phrase “Welsh men jump the rabbit” originated on the Welsh island of Portland, or that “peanut butter platform heels” refers to a scientific experiment creating diamonds from the sticky snack. We’re also unaware of any Greek myth where Zeus disguises himself as a golden shower to explain the phrase “beware what glitters in a golden shower.” (Update: As many commenters have pointed out, this last one is actually a reference to the Greek myth of Danaë and the shower of gold, showing Google’s AI knows more about this potential symbolism than I do.)

The fact that Google’s AI Overview presents these completely made-up sources with the same self-assurance as its abstract interpretations is a big part of the problem here. It’s also a persistent problem for LLMs that tend to make up news sources and cite fake legal cases regularly. As usual, one should be very wary when trusting anything an LLM presents as an objective fact.

When it comes to the more artistic and symbolic interpretation of nonsense phrases, though, I think Google’s AI Overviews have gotten something of a bad rap recently. Presented with the difficult task of explaining nigh-unexplainable phrases, the model does its best, generating interpretations that can border on the profound at times. While the authoritative tone of those responses can sometimes be annoying or actively misleading, it’s at least amusing to see the model’s best attempts to deal with our meaningless phrases.

Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.



AI-generated code could be a disaster for the software supply chain. Here’s why.

AI-generated computer code is rife with references to non-existent third-party libraries, creating a golden opportunity for supply-chain attacks that poison legitimate programs with malicious packages that can steal data, plant backdoors, and carry out other nefarious actions, newly published research shows.

The study, which used 16 of the most widely used large language models to generate 576,000 code samples, found that 440,000 of the package dependencies they contained were “hallucinated,” meaning they were non-existent. Open source models hallucinated the most, with 21 percent of the dependencies linking to non-existent libraries. A dependency is an essential code component that a separate piece of code requires to work properly. Dependencies save developers the hassle of rewriting code and are an essential part of the modern software supply chain.

Package hallucination flashbacks

These non-existent dependencies represent a threat to the software supply chain by exacerbating so-called dependency confusion attacks. These attacks work by causing a software package to access the wrong component dependency, for instance by publishing a malicious package and giving it the same name as the legitimate one but with a later version stamp. Software that depends on the package will, in some cases, choose the malicious version rather than the legitimate one because the former appears to be more recent.
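To make the version-stamp trick concrete, here is a minimal sketch in Python of a resolver that naively prefers the highest version number, next to one that only accepts candidates from a named index. The package name `acme-utils`, the index labels, and both functions are hypothetical illustrations, not the behavior of any particular package manager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    version: tuple  # (major, minor, patch)
    index: str      # label of the registry that offered this candidate


def naive_resolve(name, candidates):
    """Pick the highest version of `name`, ignoring which index offered it."""
    matching = [c for c in candidates if c.name == name]
    return max(matching, key=lambda c: c.version)


def pinned_resolve(name, candidates, trusted_index):
    """Pick the highest version of `name`, but only from the trusted index."""
    matching = [
        c for c in candidates
        if c.name == name and c.index == trusted_index
    ]
    return max(matching, key=lambda c: c.version)


# Hypothetical scenario: an internal package and a public look-alike
# published by an attacker with a much later version stamp.
candidates = [
    Candidate("acme-utils", (1, 4, 0), "internal"),
    Candidate("acme-utils", (99, 0, 0), "public"),
]

print(naive_resolve("acme-utils", candidates).index)               # public
print(pinned_resolve("acme-utils", candidates, "internal").index)  # internal
```

In this toy setup, the naive resolver picks the attacker's copy purely because its version is higher, while tying the package name to one trusted index sidesteps the comparison entirely.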

Also known as package confusion, this form of attack was first demonstrated in 2021 in a proof-of-concept exploit that executed counterfeit code on networks belonging to some of the biggest companies on the planet, Apple, Microsoft, and Tesla included. It’s one type of technique used in software supply-chain attacks, which aim to poison software at its very source, in an attempt to infect all users downstream.

“Once the attacker publishes a package under the hallucinated name, containing some malicious code, they rely on the model suggesting that name to unsuspecting users,” Joseph Spracklen, a University of Texas at San Antonio Ph.D. student and lead researcher, told Ars via email. “If a user trusts the LLM’s output and installs the package without carefully verifying it, the attacker’s payload, hidden in the malicious package, would be executed on the user’s system.”
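One way to act on that warning is to compare the package names an LLM suggests against a list a team has already vetted before installing anything. The sketch below assumes such a list exists. The function name, the simplified name normalization, and the example package names are all hypothetical, and a check like this complements, rather than replaces, inspecting the package on the registry itself.

```python
def unvetted_dependencies(suggested, vetted):
    """Return suggested package names that are absent from the vetted list.

    Names are compared after a simplified normalization (lowercase, with
    underscores treated as hyphens), which is an assumption of this sketch.
    Results keep first-seen order and contain no duplicates.
    """
    def normalize(name):
        return name.strip().lower().replace("_", "-")

    vetted_names = {normalize(name) for name in vetted}
    flagged = []
    for name in suggested:
        normalized = normalize(name)
        if normalized not in vetted_names and normalized not in flagged:
            flagged.append(normalized)
    return flagged


# Hypothetical example: one real-looking name and one invented name.
vetted = ["requests", "numpy"]
suggested = ["Requests", "fast_json_parserx", "numpy", "fast-json-parserx"]

print(unvetted_dependencies(suggested, vetted))  # ['fast-json-parserx']
```

Anything the function flags is only a candidate for manual review. A name missing from the vetted list may be a legitimate new dependency, a typo, or a hallucination an attacker has already claimed.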



ChatGPT goes shopping with new product-browsing feature

On Thursday, OpenAI announced the addition of shopping features to ChatGPT Search. The new feature allows users to search for products and purchase them through merchant websites after being redirected from the ChatGPT interface. Product placement is not sponsored, and the update affects all users, regardless of whether they’ve signed in to an account.

Adam Fry, ChatGPT search product lead at OpenAI, showed Ars Technica’s sister site Wired how the new shopping system works during a demonstration. Users researching products like espresso machines or office chairs receive recommendations based on their stated preferences, stored memories, and product reviews from around the web.

According to Wired, the shopping experience in ChatGPT resembles Google Shopping. When users click on a product image, the interface displays multiple retailers like Amazon and Walmart on the right side of the screen, with buttons to complete purchases. OpenAI is currently experimenting with categories that include electronics, fashion, home goods, and beauty products.

Product reviews shown in ChatGPT come from various online sources, including publishers and user forums like Reddit. Users can instruct ChatGPT to prioritize which review sources to use when creating product recommendations.

An example of the ChatGPT shopping experience provided by OpenAI. Credit: OpenAI

Unlike Google’s algorithm-based approach to product recommendations, ChatGPT reportedly attempts to understand product reviews and user preferences in a more conversational manner. If someone mentions they prefer black clothing from specific retailers in a chat, the system incorporates those preferences in future shopping recommendations.
