DDoS


Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries


AI bots hungry for data are taking down FOSS sites by accident, but humans are fighting back.

Software developer Xe Iaso reached a breaking point earlier this year when aggressive AI crawler traffic from Amazon overwhelmed their Git repository service, repeatedly causing instability and downtime. Despite configuring standard defensive measures—adjusting robots.txt, blocking known crawler user-agents, and filtering suspicious traffic—Iaso found that AI crawlers continued evading all attempts to stop them, spoofing user-agents and cycling through residential IP addresses as proxies.

Desperate for a solution, Iaso eventually resorted to moving their server behind a VPN and creating “Anubis,” a custom-built proof-of-work challenge system that forces web browsers to solve computational puzzles before accessing the site. “It’s futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more,” Iaso wrote in a blog post titled “a desperate cry for help.” “I don’t want to have to close off my Gitea server to the public, but I will if I have to.”
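Proof-of-work gates like Anubis follow the same basic idea as hashcash: the server issues a random nonce and a difficulty target, the browser must grind through hashes until it finds a value that meets the target, and the server verifies the answer with a single hash. The sketch below illustrates that idea in Python; the function names are illustrative and are not Anubis's actual API.

```python
import hashlib
import secrets

def make_challenge(difficulty_bits: int = 20) -> dict:
    # Server side: issue a random nonce and a difficulty target.
    return {"nonce": secrets.token_hex(16), "difficulty": difficulty_bits}

def solve(challenge: dict) -> int:
    # Client side: brute-force a counter until SHA-256(nonce:counter)
    # has at least `difficulty` leading zero bits.
    bound = 2 ** (256 - challenge["difficulty"])
    counter = 0
    while True:
        digest = hashlib.sha256(
            f"{challenge['nonce']}:{counter}".encode()
        ).digest()
        if int.from_bytes(digest, "big") < bound:
            return counter
        counter += 1

def verify(challenge: dict, counter: int) -> bool:
    # Server side: one hash to check, many thousands for the client to find.
    digest = hashlib.sha256(
        f"{challenge['nonce']}:{counter}".encode()
    ).digest()
    return int.from_bytes(digest, "big") < 2 ** (256 - challenge["difficulty"])
```

The asymmetry is the point: verification is one hash, but solving costs the client roughly 2^difficulty hash attempts on average, which is cheap for one human visitor but expensive for a crawler fetching millions of pages.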

Iaso’s story highlights a broader crisis rapidly spreading across the open source community, as what appear to be aggressive AI crawlers increasingly overload community-maintained infrastructure, causing what amounts to persistent distributed denial-of-service (DDoS) attacks on vital public resources. According to a comprehensive recent report from LibreNews, some open source projects now see as much as 97 percent of their traffic originating from AI companies’ bots, dramatically increasing bandwidth costs, destabilizing services, and burdening already stretched-thin maintainers.

Kevin Fenzi, a member of the Fedora Pagure project’s sysadmin team, reported on his blog that the project had to block all traffic from Brazil after repeated attempts to mitigate bot traffic failed. GNOME GitLab implemented Iaso’s “Anubis” system, requiring browsers to solve computational puzzles before accessing content. GNOME sysadmin Bart Piotrowski shared on Mastodon that only about 3.2 percent of requests (2,690 out of 84,056) passed their challenge system, suggesting the vast majority of traffic was automated. KDE’s GitLab infrastructure was temporarily knocked offline by crawler traffic originating from Alibaba IP ranges, according to LibreNews, citing a KDE Development chat.

While Anubis has proven effective at filtering out bot traffic, it comes with drawbacks for legitimate users. When many people access the same link simultaneously—such as when a GitLab link is shared in a chat room—site visitors can face significant delays. Some mobile users have reported waiting up to two minutes for the proof-of-work challenge to complete, according to the news outlet.

The situation isn’t exactly new. In December, Dennis Schubert, who maintains infrastructure for the Diaspora social network, described the situation as “literally a DDoS on the entire internet” after discovering that AI companies accounted for 70 percent of all web requests to their services.

The costs are both technical and financial. The Read the Docs project reported that blocking AI crawlers immediately decreased their traffic by 75 percent, going from 800GB per day to 200GB per day. This change saved the project approximately $1,500 per month in bandwidth costs, according to their blog post “AI crawlers need to be more respectful.”

A disproportionate burden on open source

The situation has created a tough challenge for open source projects, which rely on public collaboration and typically operate with limited resources compared to commercial entities. Many maintainers have reported that AI crawlers deliberately circumvent standard blocking measures, ignoring robots.txt directives, spoofing user agents, and rotating IP addresses to avoid detection.

As LibreNews reported, Martin Owens from the Inkscape project noted on Mastodon that their problems weren’t just from “the usual Chinese DDoS from last year, but from a pile of companies that started ignoring our spider conf and started spoofing their browser info.” Owens added, “I now have a prodigious block list. If you happen to work for a big company doing AI, you may not get our website anymore.”

On Hacker News, commenters in threads about the LibreNews post last week and a post on Iaso’s battles in January expressed deep frustration with what they view as AI companies’ predatory behavior toward open source infrastructure. While these comments come from forum posts rather than official statements, they represent a common sentiment among developers.

As one Hacker News user put it, AI firms are operating from a position that “goodwill is irrelevant” with their “$100bn pile of capital.” The discussions depict a battle between smaller AI startups that have worked collaboratively with affected projects and larger corporations that have been unresponsive despite allegedly forcing thousands of dollars in bandwidth costs on open source project maintainers.

Beyond consuming bandwidth, the crawlers often hit expensive endpoints, like git blame and log pages, placing additional strain on already limited resources. Drew DeVault, founder of SourceHut, reported on his blog that the crawlers access “every page of every git log, and every commit in your repository,” making the attacks particularly burdensome for code repositories.

The problem extends beyond infrastructure strain. As LibreNews points out, some open source projects began receiving AI-generated bug reports as early as December 2023, first reported by Daniel Stenberg of the Curl project on his blog in a post from January 2024. These reports appear legitimate at first glance but contain fabricated vulnerabilities, wasting valuable developer time.

Who is responsible, and why are they doing this?

AI companies have a history of taking without asking. Before the mainstream breakout of AI image generators and ChatGPT attracted attention to the practice in 2022, the machine learning field regularly compiled datasets with little regard to ownership.

While many AI companies engage in web crawling, the sources suggest varying levels of responsibility and impact. Dennis Schubert’s analysis of Diaspora’s traffic logs showed that approximately one-fourth of its web traffic came from bots with an OpenAI user agent, while Amazon accounted for 15 percent and Anthropic for 4.3 percent.

The crawlers’ behavior suggests different possible motivations. Some may be collecting training data to build or refine large language models, while others could be executing real-time searches when users ask AI assistants for information.

The frequency of these crawls is particularly telling. Schubert observed that AI crawlers “don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not.” This pattern suggests ongoing data collection rather than one-time training exercises, potentially indicating that companies are using these crawls to keep their models’ knowledge current.

Some companies appear more aggressive than others. KDE’s sysadmin team reported that crawlers from Alibaba IP ranges were responsible for temporarily knocking their GitLab offline. Meanwhile, Iaso’s troubles came from Amazon’s crawler. A member of KDE’s sysadmin team told LibreNews that Western LLM operators like OpenAI and Anthropic were at least setting proper user agent strings (which theoretically allows websites to block them), while some Chinese AI companies were reportedly more deceptive in their approaches.

It remains unclear why these companies don’t adopt more collaborative approaches and, at a minimum, rate-limit their data harvesting runs so they don’t overwhelm source websites. Amazon, OpenAI, Anthropic, and Meta did not immediately respond to requests for comment, but we will update this piece if they reply.

Tarpits and labyrinths: The growing resistance

In response to these attacks, new defensive tools have emerged to protect websites from unwanted AI crawlers. As Ars reported in January, an anonymous creator identified only as “Aaron” designed a tool called “Nepenthes” to trap crawlers in endless mazes of fake content. Aaron explicitly describes it as “aggressive malware” intended to waste AI companies’ resources and potentially poison their training data.

“Any time one of these crawlers pulls from my tarpit, it’s resources they’ve consumed and will have to pay hard cash for,” Aaron explained to Ars. “It effectively raises their costs. And seeing how none of them have turned a profit yet, that’s a big problem for them.”

On Friday, Cloudflare announced “AI Labyrinth,” a similar but more commercially polished approach. Unlike Nepenthes, which is designed as an offensive weapon against AI companies, Cloudflare positions its tool as a legitimate security feature to protect website owners from unauthorized scraping, as we reported at the time.

“When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them,” Cloudflare explained in its announcement. The company reported that AI crawlers generate over 50 billion requests to their network daily, accounting for nearly 1 percent of all web traffic they process.

The community is also developing collaborative tools to help protect against these crawlers. The “ai.robots.txt” project offers an open list of web crawlers associated with AI companies and provides premade robots.txt files that implement the Robots Exclusion Protocol, as well as .htaccess files that return error pages when detecting AI crawler requests.
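A robots.txt built from such a blocklist is simply a group of User-agent lines followed by a blanket Disallow, per the Robots Exclusion Protocol. A hypothetical excerpt (the actual agent list maintained by the ai.robots.txt project is longer and changes over time):

```text
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Amazonbot
User-agent: Bytespider
Disallow: /
```

Of course, as the maintainers quoted above note, this only works against crawlers that honestly identify themselves and honor the protocol; the .htaccess variant exists precisely because many do not.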

As it currently stands, both the rapid growth of AI-generated content overwhelming online spaces and aggressive web-crawling practices by AI firms threaten the sustainability of essential online resources. The current approach taken by some large AI companies—extracting vast amounts of data from open-source projects without clear consent or compensation—risks severely damaging the very digital ecosystem on which these AI models depend.

Responsible data collection may be achievable if AI firms collaborate directly with the affected communities. However, prominent industry players have shown little incentive to adopt more cooperative practices. Without meaningful regulation or self-restraint by AI firms, the arms race between data-hungry bots and those attempting to defend open source infrastructure seems likely to escalate further, potentially deepening the crisis for the digital ecosystem that underpins the modern Internet.


Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.



UK outlaws awful default passwords on connected devices

Tacking an S onto IoT —

The law aims to prevent global-scale botnet attacks.



If you build a gadget that connects to the Internet and sell it in the United Kingdom, you can no longer make the default password “password.” In fact, you’re not supposed to have default passwords at all.

A new version of the 2022 Product Security and Telecommunications Infrastructure Act (PTSI) is now in effect, covering just about everything that a consumer can buy that connects to the web. Under the guidelines, even the tiniest Wi-Fi board must either have a randomized password or else generate a password upon initialization (through a smartphone app or other means). This password can’t be incremental (“password1,” “password54”), and it can’t be “related in an obvious way to public information,” such as MAC addresses or Wi-Fi network names. A device should be sufficiently strong against brute-force access attacks, including credential stuffing, and should have a “simple mechanism” for changing the password.
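The requirement that defaults be randomized and not "related in an obvious way to public information" rules out common shortcuts like deriving the password from the MAC address or serial number; each unit needs its own independent entropy. A minimal sketch of a compliant generator in Python (the alphabet and length here are illustrative assumptions, not values from the law):

```python
import secrets
import string

# Letters and digits only, to avoid characters that are awkward
# to type on an embedded device's setup screen.
ALPHABET = string.ascii_letters + string.digits

def default_password(length: int = 12) -> str:
    # Cryptographically random per unit: not incremental, not shared
    # between devices, and not derivable from the MAC address, SSID,
    # serial number, or any other public identifier.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

The key design choice is using `secrets` rather than `random`: the latter's Mersenne Twister output can be reconstructed from observed values, which would let one known default predict others from the same production run.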

There’s more, and it’s just as head-noddingly obvious. Software components, where reasonable, “should be securely updateable,” should actually check for updates, and should update either automatically or in a way “simple for the user to apply.” Perhaps most importantly, device owners can report security issues and expect to hear back about how that report is being handled.

Violations of the new device laws can result in fines up to 10 million pounds (roughly $12.5 million) or 4 percent of related worldwide revenue, whichever is higher.

Besides giving consumers better devices, these regulations are aimed squarely at malware like Mirai, which can conscript devices like routers, cable modems, and DVRs into armies capable of performing distributed denial-of-service attacks (DDoS) on various targets.

As noted by The Record, the European Union’s Cyber Resilience Act has been shaped but not yet passed and enforced, and even if it does pass, it would not take effect until 2027. In the US, there is the Cyber Trust Mark, which would at least let customers distinguish decently secured devices from effectively abandoned ones. But the particulars of that label are still under debate and seemingly some way from implementation. At the federal level, a 2020 bill tasked the National Institute of Standards and Technology with applying related standards to connected devices deployed by the federal government.



DOJ quietly removed Russian malware from routers in US homes and businesses

Fancy Bear —

Feds once again fix up compromised retail routers under court order.



More than 1,000 Ubiquiti routers in homes and small businesses were infected with malware used by Russian-backed agents to coordinate them into a botnet for crime and spy operations, according to the Justice Department.

That malware, which worked as a botnet for the Russian hacking group Fancy Bear, was removed in January 2024 under a secret court order as part of “Operation Dying Ember,” according to the FBI’s director. It affected routers running Ubiquiti’s EdgeOS, but only units still using the default administrative password. Access to the routers allowed the hacking group to “conceal and otherwise enable a variety of crimes,” the DOJ claims, including spearphishing and credential harvesting in the US and abroad.

Unlike previous attacks by Fancy Bear (which the DOJ ties to GRU Military Unit 26165, also known as APT 28, Sofacy Group, and Sednit, among other monikers), the Ubiquiti intrusion relied on known malware, Moobot. After “non-GRU cybercriminals” first infected the routers with Moobot, GRU agents installed “bespoke scripts and files” to connect to and repurpose the devices, according to the DOJ.

The DOJ says it also used the Moobot malware to copy and delete the botnet files and data, then changed the routers’ firewall rules to block remote management access. During the court-sanctioned intrusion, the DOJ “enabled temporary collection of non-content routing information” that would “expose GRU attempts to thwart the operation.” This did not “impact the routers’ normal functionality or collect legitimate user content information,” the DOJ claims.

“For the second time in two months, we’ve disrupted state-sponsored hackers from launching cyber-attacks behind the cover of compromised US routers,” said Deputy Attorney General Lisa Monaco in a press release.

The DOJ states it will notify affected customers to ask them to perform a factory reset, install the latest firmware, and change their default administrative password.

Christopher A. Wray, director of the FBI, expanded on the Fancy Bear operation and international hacking threats generally at the ongoing Munich Security Conference. Russia has recently targeted underwater cables and industrial control systems worldwide, Wray said, according to a New York Times report. And since its invasion of Ukraine, Russia has focused on the US energy sector, Wray said.

The past year has been an active time for attacks on routers and other network infrastructure. TP-Link routers were found infected in May 2023 with malware from a reportedly Chinese-backed group. In September, modified firmware in Cisco routers was discovered as part of a Chinese-backed intrusion into multinational companies, according to US and Japanese authorities. Malware said by the DOJ to be tied to the Chinese government was removed from SOHO routers by the FBI last month in similar fashion to the most recently revealed operation, targeting Cisco and Netgear devices that had mostly reached their end of life and were no longer receiving security patches.

In each case, the routers themselves provided a highly valuable service to the hacking groups, one separate from whatever primary aims their later attacks might have. By nesting inside the routers, hackers could send commands from their overseas locations while making the traffic appear to come from a far more innocuous location inside the target country, or even inside a target company.

Similar inside-the-house access has been sought by international attackers through VPN products, as in the three different Ivanti vulnerabilities discovered recently.
