AI training

Microsoft vows to cover full power costs for energy-hungry AI data centers

Taking responsibility for power usage

In the Microsoft blog post, Smith acknowledged that residential electricity rates have recently risen in dozens of states, driven partly by inflation, supply chain constraints, and grid upgrades. He wrote that communities “value new jobs and property tax revenue, but not if they come with higher power bills or tighter water supplies.”

Microsoft says it will ask utilities and public commissions to set rates high enough to cover the full electricity costs for its data centers, including infrastructure additions. In Wisconsin, the company is supporting a new rate structure that would charge “Very Large Customers,” including data centers, the cost of the electricity required to serve them.

Smith wrote that while some have suggested the public should help pay for the added electricity needed for AI, Microsoft disagrees. He stated, “Especially when tech companies are so profitable, we believe that it’s both unfair and politically unrealistic for our industry to ask the public to shoulder added electricity costs for AI.”

On water usage for cooling, Microsoft plans a 40 percent improvement in data center water-use intensity by 2030. A recent environmental audit from AI model-maker Mistral found that training and running its Large 2 model over 18 months produced 20.4 kilotons of CO2 emissions and evaporated enough water to fill 112 Olympic-size swimming pools, illustrating the aggregate environmental impact of AI operations at scale.

To solve some of these issues, Microsoft says it has launched a new AI data center design using a closed-loop system that constantly recirculates cooling liquid, dramatically cutting water usage. In this design, already deployed in Wisconsin and Georgia, potable water is no longer needed for cooling.

On property taxes, Smith stated in the blog post that the company will not ask local municipalities to reduce their rates. The company says it will pay its full share of local property taxes. Smith wrote that Microsoft’s goal is to bring these commitments to life in the first half of 2026. Of course, these are PR-aligned company goals and not realities yet, so we’ll have to check back in later to see whether Microsoft has been following through on its promises.

World’s largest shadow library made a 300TB copy of Spotify’s most streamed songs

But Anna’s Archive is clearly working to support AI developers, another noted, pointing out that Anna’s Archive promotes selling “high-speed access” to “enterprise-level” LLM data, including “unreleased collections.” Anyone can donate “tens of thousands” to get such access, the archive suggests on its webpage, and any interested AI researchers can reach out to discuss “how we can work together.”

“AI may not be their original/primary motivation, but they are evidently on board with facilitating AI labs piracy-maxxing,” a third commenter suggested.

Meanwhile, on Reddit, some fretted that Anna’s Archive may have doomed itself by scraping the data. To them, it seemed like the archive was “only making themselves a target” after watching the Internet Archive struggle to survive a legal attack from record labels that ended in a confidential settlement last year.

“I’m furious with AA for sticking this target on their own backs,” a redditor wrote on a post declaring that “this Spotify hacking will just ruin the actual important literary archive.”

As Anna’s Archive fans spiraled, some even floated a conspiracy theory that the archive was only “doing it for the AI bros, who are the ones paying the bills behind the scenes” to keep the archive afloat.

Ars could not immediately reach Anna’s Archive to comment on users’ fears or Spotify’s investigation.

On Reddit, one user took comfort in the fact that the archive is “designed to be resistant to being taken out,” perhaps preventing legal action from ever really dooming the archive.

“The domain and such can be gone, sure, but the core software and its data can be resurfaced again and again,” the user explained.

But not everyone was convinced that Anna’s Archive could survive brazenly torrenting so much Spotify data.

“This is like saying the Titanic is unsinkable,” that user warned, suggesting that Anna’s Archive might lose donations if Spotify-fueled takedowns continually frustrate downloads over time. “Sure, in theory data can certainly resurface again and again, but doing so each time, it will take money and resources, which are finite. How many times are folks willing to do this before they just give up?”

This story was updated to include Spotify’s statement. 

Meta denies torrenting porn to train AI, says downloads were for “personal use”

Instead, Meta argued, available evidence “is plainly indicative” that the flagged adult content was torrented for “private personal use”—since the small amount linked to Meta IP addresses and employees represented only “a few dozen titles per year intermittently obtained one file at a time.”

“The far more plausible inference to be drawn from such meager, uncoordinated activity is that disparate individuals downloaded adult videos for personal use,” Meta’s filing said.

For example, unlike lawsuits raised by book authors whose works are part of an enormous dataset used to train AI, the activity on Meta’s corporate IP addresses only amounted to about 22 downloads per year. That is nowhere near the “concerted effort to collect the massive datasets Plaintiffs allege are necessary for effective AI training,” Meta argued.

Further, that alleged activity can’t even reliably be linked to any Meta employee, Meta argued.

Strike 3 “does not identify any of the individuals who supposedly used these Meta IP addresses, allege that any were employed by Meta or had any role in AI training at Meta, or specify whether (and which) content allegedly downloaded was used to train any particular Meta model,” Meta wrote.

Meanwhile, “tens of thousands of employees,” as well as “innumerable contractors, visitors, and third parties access the Internet at Meta every day,” Meta argued. So while it’s “possible one or more Meta employees” downloaded Strike 3’s content over the last seven years, “it is just as possible” that a “guest, or freeloader,” or “contractor, or vendor, or repair person—or any combination of such persons—was responsible for that activity,” Meta suggested.

Other alleged activity included a claim that a Meta contractor was directed to download adult content at his father’s house, but those downloads, too, “are plainly indicative of personal consumption,” Meta argued. That contractor worked as an “automation engineer,” Meta noted, with no apparent basis provided for why he would be expected to source AI training data in that role. “No facts plausibly” tie “Meta to those downloads,” Meta claimed.

Pay-per-output? AI firms blindsided by beefed up robots.txt instructions.


“Really Simple Licensing” makes it easier for creators to get paid for AI scraping.

Logo for the “Really Simple Licensing” (RSL) standard. Credit: via RSL Collective

Leading Internet companies and publishers—including Reddit, Yahoo, Quora, Medium, The Daily Beast, Fastly, and more—think there may finally be a solution to end AI crawlers hammering websites to scrape content without permission or compensation.

Announced Wednesday morning, the “Really Simple Licensing” (RSL) standard evolves robots.txt instructions by adding an automated licensing layer that’s designed to block bots that don’t fairly compensate creators for content.

Free for any publisher to use starting today, the RSL standard is an open, decentralized protocol that makes clear to AI crawlers and agents the terms for licensing, usage, and compensation of any content used to train AI, a press release noted.

The standard was created by the RSL Collective, which was founded by Doug Leeds, former CEO of Ask.com, and Eckart Walther, a former Yahoo vice president of products and co-creator of the RSS standard, which made it easy to syndicate content across the web.

Based on the “Really Simple Syndication” (RSS) standard, RSL terms can be applied to protect any digital content, including webpages, books, videos, and datasets. The new standard supports “a range of licensing, usage, and royalty models, including free, attribution, subscription, pay-per-crawl (publishers get compensated every time an AI application crawls their content), and pay-per-inference (publishers get compensated every time an AI application uses their content to generate a response),” the press release said.

Leeds told Ars that the idea to use the RSS “playbook” to roll out the RSL standard arose after he invited Walther to speak to University of California, Berkeley students at the end of last year. That’s when the longtime friends with search backgrounds began pondering how AI had changed the search industry, as publishers today are forced to compete with AI outputs referencing their own content as search traffic nosedives.

Walther had watched the RSS standard quickly become adopted by millions of sites, and he realized that RSS had actually always been a licensing standard, Leeds said. Essentially, by adopting the RSS standard, publishers agreed to let search engines license a “bit” of their content in exchange for search traffic, and Walther realized that it could be just as straightforward to add AI licensing terms in the same way. That way, publishers could strive to recapture lost search revenue by agreeing to license all or part of their content to train AI in return for payment each time AI outputs link to their content.

Leeds told Ars that the RSL standard doesn’t just benefit publishers, though. It also solves a problem for AI companies, which have complained in litigation over AI scraping that there is no effective way to license content across the web.

“We have listened to them, and what we’ve heard them say is… we need a new protocol,” Leeds said. With the RSL standard, AI firms get a “scalable way to get all the content” they want, while setting an incentive that they’ll only have to pay for the best content that their models actually reference.

“If they’re using it, they pay for it, and if they’re not using it, they don’t pay for it,” Leeds said.

No telling yet how AI firms will react to RSL

At this point, it’s hard to say if AI companies will embrace the RSL standard. Ars reached out to Google, Meta, OpenAI, and xAI—some of the big tech companies whose crawlers have drawn scrutiny—to see if it was technically feasible to pay publishers for every output referencing their content. xAI did not respond, and the other companies declined to comment without further detail about the standard, appearing to have not yet considered how a licensing layer beefing up robots.txt could impact their scraping.

Today will likely be the first chance for AI companies to wrap their heads around the idea of paying publishers per output. Leeds confirmed that the RSL Collective did not consult with AI companies when developing the RSL standard.

But AI companies know that they need a constant stream of fresh content to keep their tools relevant and to continually innovate, Leeds suggested. In that way, the RSL standard “supports what supports them,” Leeds said, “and it creates the appropriate incentive system” to create sustainable royalty streams for creators and ensure that human creativity doesn’t wane as AI evolves.

While we’ll have to wait to see how AI firms react to RSL, early adopters of the standard celebrated the launch today. That included Neil Vogel, CEO of People Inc., who said that “RSL moves the industry forward—evolving from simply blocking unauthorized crawlers, to setting our licensing terms, for all AI use cases, at global web scale.”

Simon Wistow, co-founder of Fastly, suggested the solution “is a timely and necessary response to the shifting economics of the web.”

“By making it easy for publishers to define and enforce licensing terms, RSL lays the foundation for a healthy content ecosystem—one where innovation and investment in original work are rewarded, and where collaboration between publishers and AI companies becomes frictionless and mutually beneficial,” Wistow said.

Leeds noted that a key benefit of the RSL standard is that even small creators will now have an opportunity to generate revenue for helping to train AI. Tony Stubblebine, CEO of Medium, did not mince words when explaining the battle that bloggers face as AI crawlers threaten to divert their traffic without compensating them.

“Right now, AI runs on stolen content,” Stubblebine said. “Adopting this RSL Standard is how we force those AI companies to either pay for what they use, stop using it, or shut down.”

How will the RSL standard be enforced?

On the RSL standard site, publishers can find common terms and add templated or customized text to their robots.txt files to adopt the RSL standard today and start protecting their content from unfettered AI scraping. Here’s an example of how machine-readable licensing terms could look when added directly to a robots.txt file:

# NOTICE: all crawlers and bots are strictly prohibited from using this
# content for AI training without complying with the terms of the RSL
# Collective AI royalty license. Any use of this content for AI training
# without a license is a violation of our intellectual property rights.
License: https://rslcollective.org/royalty.xml

Through RSL terms, publishers can automate licensing. The cloud company Fastly is partnering with the collective to provide technical enforcement, which Leeds described as tech that acts as a bouncer, keeping unapproved bots away from valuable content. It seems likely that Cloudflare, which launched a pay-per-crawl program blocking greedy crawlers in July, could also help enforce the RSL standard.
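
To make the mechanics concrete, here’s a minimal sketch (in Python) of how a well-behaved crawler might check a site’s robots.txt for an RSL License directive before scraping. It’s an illustration keyed to the example above, not the collective’s reference implementation, and the parsing is an assumption:

import urllib.request

def find_rsl_license(site):
    # Fetch robots.txt and return the declared RSL license URL, if any.
    with urllib.request.urlopen(f"{site}/robots.txt") as resp:
        robots = resp.read().decode("utf-8", errors="replace")
    for line in robots.splitlines():
        # Per the example above, RSL adds a License directive pointing
        # at machine-readable licensing terms.
        if line.strip().lower().startswith("license:"):
            return line.split(":", 1)[1].strip()
    return None

license_url = find_rsl_license("https://example.com")
if license_url:
    print(f"Licensing terms to honor before crawling: {license_url}")
else:
    print("No RSL terms declared; fall back to ordinary robots.txt rules.")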

For publishers, the standard “solves a business problem immediately,” Leeds told Ars, so the collective is hopeful that RSL will be rapidly and widely adopted. As further incentive, publishers can also rely on the RSL standard to “easily encrypt and license non-published, proprietary content to AI companies, including paywalled articles, books, videos, images, and data,” the RSL Collective site said, and that potentially could expand AI firms’ data pool.

On top of technical enforcement, Leeds said that publishers and content creators could legally enforce the terms, noting that the recent $1.5 billion Anthropic settlement suggests “there’s real money at stake” if you don’t train AI “legitimately.”

Should the industry adopt the standard, it could “establish fair market prices and strengthen negotiation leverage for all publishers,” the press release said. And Leeds noted that it’s very common for regulations to follow industry solutions (consider the Digital Millennium Copyright Act). Since the RSL Collective is already in talks with lawmakers, Leeds thinks “there’s good reason to believe” that AI companies will soon “be forced to acknowledge” the standard.

“But even better than that,” Leeds said, “it’s in their interest” to adopt the standard.

With RSL, AI firms can license content at scale “in a way that’s fair [and] preserves the content that they need to make their products continue to innovate.”

Additionally, the RSL standard may solve a problem that risks gutting trust and interest in AI at this early stage.

Leeds noted that currently, AI outputs don’t provide “the best answer” to prompts but instead rely on mashing up answers from different sources to avoid taking too much content from one site. That means that not only do AI companies “spend an enormous amount of money on compute costs to do that,” but AI tools may also be more prone to hallucination in the process of “mashing up” source material “to make something that’s not the best answer because they don’t have the rights to the best answer.”

“The best answer could exist somewhere,” Leeds said. But “they’re spending billions of dollars to create hallucinations, and we’re talking about: Let’s just solve that with a licensing scheme that allows you to use the actual content in a way that solves the user’s query best.”

By transforming the “ecosystem” with a standard that’s “actually sustainable and fair,” Leeds said that AI companies could also ensure that humanity never gets to the point where “humans stop producing” and “turn to AI to reproduce what humans can’t.”

Failing to adopt the RSL standard would be bad for AI innovation, Leeds suggested, perhaps paving the way for AI to replace search with a “sort of self-fulfilling swap of bad content that actually one doesn’t have any current information, doesn’t have any current thinking, because it’s all based on old training information.”

To Leeds, the RSL standard is ultimately “about creating the system that allows the open web to continue. And that happens when we get adoption from everybody,” he said, insisting that “literally the small guys are as important as the big guys” in pushing the entire industry to change and fairly compensate creators.

Judge: Anthropic’s $1.5B settlement is being shoved “down the throat of authors”

At a hearing Monday, US district judge William Alsup blasted a proposed $1.5 billion settlement over Anthropic’s rampant piracy of books to train AI.

The proposed settlement comes in a case where Anthropic could have owed more than $1 trillion in damages after Alsup certified a class that included up to 7 million claimants whose works were illegally downloaded by the AI company.

Instead, critics fear Anthropic will get off cheaply, striking a deal with authors suing that covers less than 500,000 works and paying a small fraction of its total valuation (currently $183 billion) to get away with the massive theft. Defector noted that the settlement doesn’t even require Anthropic to admit wrongdoing, while the company continues raising billions based on models trained on authors’ works. Most recently, Anthropic raised $13 billion in a funding round, making back about 10 times the proposed settlement amount after announcing the deal.

Alsup expressed grave concerns that lawyers rushed the deal, which he said now risks being shoved “down the throat of authors,” Bloomberg Law reported.

In an order, Alsup clarified why he thought the proposed settlement was a chaotic mess. The judge said he was “disappointed that counsel have left important questions to be answered in the future,” seeking approval for the settlement despite the Works List, the Class List, the Claim Form, and the process for notification, allocation, and dispute resolution all remaining unresolved.

Denying preliminary approval of the settlement, Alsup suggested that the agreement is “nowhere close to complete,” forcing Anthropic and authors’ lawyers to “recalibrate” the largest publicly reported copyright class-action settlement ever inked, Bloomberg reported.

Of particular concern, the settlement failed to outline how disbursements would be managed for works with multiple claimants, Alsup noted. Until all these details are ironed out, Alsup intends to withhold approval, the order said.

One big change the judge wants to see is the addition of instructions requiring “anyone with copyright ownership” to opt in, with the consequence that the work won’t be covered if even one rights holder opts out, Bloomberg reported. There should also be instruction that any disputes over ownership or submitted claims should be settled in state court, Alsup said.

College student’s “time travel” AI experiment accidentally outputs real 1834 history

A hobbyist developer building AI language models that speak Victorian-era English “just for fun” got an unexpected history lesson this week when his latest creation mentioned real protests from 1834 London—events the developer didn’t know had actually happened until he Googled them.

“I was interested to see if a protest had actually occurred in 1834 London and it really did happen,” wrote Reddit user Hayk Grigorian, who is a computer science student at Muhlenberg College in Pennsylvania.

For the past month, Grigorian has been developing what he calls TimeCapsuleLLM, a small AI language model (like a pint-sized distant cousin to ChatGPT) which has been trained entirely on texts from 1800–1875 London. Grigorian wants to capture an authentic Victorian voice in the AI model’s outputs. As a result, the AI model ends up spitting out text that’s heavy with biblical references and period-appropriate rhetorical excess.
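
For the curious, sampling such a model works like any other causal language model: you hand it the start of a passage, and it continues. Here’s a minimal sketch assuming a Hugging Face-compatible checkpoint; the path is hypothetical, and the project’s actual tooling may differ:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./timecapsule-llm-1800s"  # hypothetical local path, not the project's real weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The model simply continues whatever text the user starts.
prompt = "It was the year of our Lord 1834"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))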

Grigorian’s project joins a growing field of research into what some call “Historical Large Language Models” (HLLMs), though those typically rest on larger base models than the small one Grigorian is using. Similar projects include MonadGPT, trained on 11,000 texts from 1400 to 1700 CE, which can discuss topics using 17th-century knowledge frameworks, and XunziALLM, which generates classical Chinese poetry following ancient formal rules. These models offer researchers a chance to interact with the linguistic patterns of past eras.

According to Grigorian, TimeCapsuleLLM’s most intriguing recent output emerged from a simple test. When he prompted it with “It was the year of our Lord 1834,” the AI model—which is trained to continue text from wherever a user leaves off—generated the following:

It was the year of our Lord 1834 and the streets of London were filled with protest and petition. The cause, as many re counted, was not bound in the way of private, but having taken up the same day in the day of Lord Palmerston, the public will receive a short statement of the difficulties under which the day of law has reached us. It is a matter of deep regret, that the present events in the history of the world are clear, and consequently will be’known. It is not true that the very men who first settled in the Gospel at Jerusalem should have so extensive and so interesting a record of the prosperity and prosperity

Curious about the accuracy, Grigorian did some fact-checking. “The output also brought up Lord Palmerston,” he wrote, “and after a google search I learned that his actions resulted in the 1834 protests.”

Reddit blocks Internet Archive to end sneaky AI scraping

“Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors,” Rathschmidt said.

A review of social media comments suggests that in the past, some Redditors have used the Wayback Machine to research deleted comments or threads. Those commenters noted that myriad other tools exist for surfacing deleted posts or researching a user’s activity, with some suggesting that the Wayback Machine was maybe not the easiest platform to navigate for that purpose.

Redditors have also turned to resources like IA during times when Reddit’s platform changes trigger content removals. Most recently in 2023, when changes to Reddit’s public API threatened to kill beloved subreddits, archives stepped in to preserve content before it was lost.

IA has not signaled whether it’s looking into fixes to get Reddit’s restrictions lifted and did not respond to Ars’ request to comment on how this change might impact the archive’s utility as an open web resource, given Reddit’s popularity.

The director of the Wayback Machine, Mark Graham, told Ars that IA has “a longstanding relationship with Reddit” and continues to have “ongoing discussions about this matter.”

It seems likely that Reddit is financially motivated to restrict AI firms from taking advantage of Wayback Machine archives, perhaps hoping to spur more lucrative licensing deals like the ones Reddit struck with OpenAI and Google. The terms of the OpenAI deal were kept quiet, but the Google deal was reportedly worth $60 million. Over the next three years, Reddit expects to make more than $200 million off such licensing deals.

Disclosure: Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder in Reddit.

Meta pirated and seeded porn for years to train AI, lawsuit says

Evidence may prove Meta seeded more content

Seeking evidence to back its own copyright infringement claims, Strike 3 Holdings searched “its archive of recorded infringement captured by its VXN Scan and Cross Reference tools” and found 47 “IP addresses identified as owned by Facebook infringing its copyright protected Works.”

The data allegedly demonstrates a “continued unauthorized distribution” over “several years.” And Meta allegedly did not stop its seeding after Strike 3 Holdings confronted the tech giant with this evidence—even though the IP data was supposedly verified through an industry-leading provider called MaxMind.

Strike 3 Holdings shared a screenshot of MaxMind’s findings. Credit: via Strike 3 Holdings’ complaint

Meta also allegedly attempted to “conceal its BitTorrent activities” through “six Virtual Private Clouds” that formed a “stealth network” of “hidden IP addresses,” the lawsuit alleged, which seemingly implicated a “major third-party data center provider” as a partner in Meta’s piracy.

An analysis of these IP addresses allegedly found “data patterns that matched infringement patterns seen on Meta’s corporate IP Addresses” and included “evidence of other activity on the BitTorrent network including ebooks, movies, television shows, music, and software.” The seemingly non-human patterns documented on both sets of IP addresses suggest the data was for AI training and not for personal use, Strike 3 Holdings alleged.

Perhaps most shockingly, considering that a Meta employee joked “torrenting from a corporate laptop doesn’t feel right,” Strike 3 Holdings further alleged that it found “at least one residential IP address of a Meta employee” infringing its copyrighted works. That suggests Meta may have directed an employee to torrent pirated data outside the office to obscure the data trail.

The adult site operator did not identify the employee or the major data center discussed in its complaint, noting in a subsequent filing that it recognized the risks to Meta’s business and its employees’ privacy of sharing sensitive information.

In total, the company alleged that evidence shows “well over 100,000 unauthorized distribution transactions” linked to Meta’s corporate IPs. Strike 3 Holdings is hoping the evidence will lead a jury to find Meta liable for direct copyright infringement or charge Meta with secondary and vicarious copyright infringement if the jury finds that Meta successfully distanced itself by using the third-party data center or an employee’s home IP address.

“Meta has the right and ability to supervise and/or control its own corporate IP addresses, as well as the IP addresses hosted in off-infra data centers, and the acts of its employees and agents infringing Plaintiffs’ Works through their residential IPs by using Meta’s AI script to obtain content through BitTorrent,” the complaint said.

Cloudflare wants Google to change its AI search crawling. Google likely won’t.

Ars could not immediately find any legislation that seemed to match Prince’s description, and Cloudflare did not respond to Ars’ request to comment. Passing tech laws is notoriously hard, though, partly because technology keeps advancing as policy debates drag on, and challenges with regulating artificial intelligence are an obvious example of that pattern today.

Google declined Ars’ request to confirm whether talks were underway or if the company was open to separating its crawlers.

Although Cloudflare singled out Google, other search engines that view AI search features as part of their search products also use the same bots for training as they do for search indexing. It seems likely that Cloudflare’s proposed legislation would face resistance from tech companies in a similar position to Google, as The Wall Street Journal reported that the tech companies “have few incentives to work with intermediaries.”

Additionally, Cloudflare’s initiative faces criticism from those who “worry that academic research, security scans, and other types of benign web crawling will get elbowed out of websites as barriers are built around more sites” through Cloudflare’s blocks and paywalls, the WSJ reported. Cloudflare’s system could also threaten web projects like The Internet Archive, which notably played a crucial role in helping track data deleted from government websites after Donald Trump took office.

Among commenters discussing Cloudflare’s claims about Google on Search Engine Roundtable, one user suggested Cloudflare may risk a lawsuit or other penalties from Google for poking the bear.

Ars will continue monitoring for updates on Cloudflare’s attempts to get Google on board with its plan.

Pay up or stop scraping: Cloudflare program charges bots for each crawl

“Imagine asking your favorite deep research program to help you synthesize the latest cancer research or a legal brief, or just help you find the best restaurant in Soho—and then giving that agent a budget to spend to acquire the best and most relevant content,” Cloudflare said, promising that “we enable a future where intelligent agents can programmatically negotiate access to digital resources.”
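
Mechanically, pay-per-crawl leans on the long-dormant HTTP 402 (“Payment Required”) status code. Here’s a rough sketch of how a budgeted agent might react to it; the header names are illustrative assumptions, not Cloudflare’s exact specification:

import requests

BUDGET_PER_PAGE = 0.01  # assumed per-fetch budget in dollars
url = "https://example.com/research-article"

resp = requests.get(url)
if resp.status_code == 402:
    # Hypothetical header carrying the publisher's asking price per crawl
    price = float(resp.headers.get("crawler-price", "inf"))
    if price <= BUDGET_PER_PAGE:
        # Retry while signaling willingness to pay (header name assumed)
        resp = requests.get(url, headers={"crawler-max-price": str(price)})
print(resp.status_code, len(resp.content))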

AI crawlers now blocked by default

Cloudflare’s announcement comes after it rolled out a feature last September that allows website owners to block AI crawlers in a single click. According to Cloudflare, over 1 million customers chose to block AI crawlers, signaling that people want more control over their content at a time when Cloudflare observed that writing instructions for AI crawlers in robots.txt files was widely “underutilized.”
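
For reference, the “underutilized” robots.txt approach looks like the snippet below: block documented AI training crawlers by user agent while leaving search indexing alone. The user-agent strings are real, publicly documented examples, but any such list is necessarily incomplete:

# Disallow known AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Leave ordinary search indexing untouched
User-agent: Googlebot
Allow: /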

To protect more customers moving forward, any new customers (including anyone on a free plan) who sign up for Cloudflare services will have their domains, by default, set to block all known AI crawlers.

This marks Cloudflare’s transition away from the dreaded opt-out models of AI scraping to a permission-based model, which a Cloudflare spokesperson told Ars is expected to “fundamentally change how AI companies access web content going forward.”

In a world where some website owners have grown sick and tired of attempting and failing to block AI scraping through robots.txt—including some trapping AI crawlers in tarpits to punish them for ignoring robots.txt—Cloudflare’s feature allows users to choose granular settings to prevent blocks on AI bots from impacting bots that drive search engine traffic. That’s critical for small content creators who want their sites to still be discoverable but not digested by AI bots.

“AI crawlers collect content like text, articles, and images to generate answers, without sending visitors to the original source—depriving content creators of revenue, and the satisfaction of knowing someone is reading their content,” Cloudflare’s blog said. “If the incentive to create original, quality content disappears, society ends up losing, and the future of the Internet is at risk.”

Disclosure: Condé Nast, which owns Ars Technica, is a partner involved in Cloudflare’s beta test.

This story was corrected on July 1 to remove publishers incorrectly listed as participating in Cloudflare’s pay-per-crawl beta.

Judge: Pirate libraries may have profited from Meta torrenting 80TB of books

It could certainly look worse for Meta if authors manage to present evidence supporting the second way that torrenting could be relevant to the case, Chhabria suggested.

“Meta downloading copyrighted material from shadow libraries” would also be relevant to the character of the use, “if it benefitted those who created the libraries and thus supported and perpetuated their unauthorized copying and distribution of copyrighted works,” Chhabria wrote.

Counting potential strikes against Meta, Chhabria pointed out that the “vast majority of cases” involving “this sort of peer-to-peer file-sharing” are found to “constitute copyright infringement.” And it likely doesn’t help Meta’s case that “some of the libraries Meta used have themselves been found liable for infringement.”

However, Meta may overcome this argument, too, since book authors “have not submitted any evidence” that potentially shows how Meta’s downloading may perhaps be “propping up” or financially benefiting pirate libraries.

Finally, Chhabria noted that the “last issue relating to the character of Meta’s use” of books with regard to its torrenting is “the relationship between Meta’s downloading of the plaintiffs’ books and Meta’s use of the books to train Llama.”

Authors had tried to argue that these elements were distinct. But Chhabria said there’s no separating the fact that Meta downloaded the books to serve the “highly transformative” purpose of training Llama.

“Because Meta’s ultimate use of the plaintiffs’ books was transformative, so too was Meta’s downloading of those books,” Chhabria wrote.

AI training rulings may get more authors paid

Authors only learned of Meta’s torrenting through discovery in the lawsuit, and because of that, Chhabria noted that “the record on Meta’s alleged distribution is incomplete.”

It’s possible that authors may be able to show evidence that Meta “contributed to the BitTorrent network” by providing significant computing power that could’ve meaningfully assisted shadow libraries, Chhabria said in a footnote.

Anthropic destroyed millions of print books to build its AI models

But if you’re not intimately familiar with the AI industry and copyright, you might wonder: Why would a company spend millions of dollars on books to destroy them? Behind these odd legal maneuvers lies a more fundamental driver: the AI industry’s insatiable hunger for high-quality text.

The race for high-quality training data

To understand why Anthropic would want to scan millions of books, it’s important to know that AI researchers build large language models (LLMs) like those that power ChatGPT and Claude by feeding billions of words into a neural network. During training, the AI system processes the text repeatedly, building statistical relationships between words and concepts in the process.
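
As a toy illustration of that loop, here’s a minimal next-token training sketch in PyTorch, using random stand-in “text” rather than any lab’s actual data or pipeline:

import torch
import torch.nn as nn

# Toy "language model": embed each token, predict the next one.
vocab_size, dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (512,))  # stand-in for tokenized book text
for step in range(200):
    inputs, targets = tokens[:-1], tokens[1:]  # each token is trained to predict its successor
    logits = model(inputs)                     # shape: (511, vocab_size)
    loss = loss_fn(logits, targets)            # lower loss = stronger statistical relationships
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()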

The quality of training data fed into the neural network directly impacts the resulting AI model’s capabilities. Models trained on well-edited books and articles tend to produce more coherent, accurate responses than those trained on lower-quality text like random YouTube comments.

Publishers legally control content that AI companies desperately want, but AI companies don’t always want to negotiate a license. The first-sale doctrine offered a legal workaround: Once you buy a physical book, you can do what you want with that copy—including destroying it.

And yet buying things is expensive, even if it is legal. So like many AI companies before it, Anthropic initially chose the quick and easy path. In the quest for high-quality training data, the court filing states, Anthropic first chose to amass digitized versions of pirated books to avoid what CEO Dario Amodei called “legal/practice/business slog”—the complex licensing negotiations with publishers. But by 2024, Anthropic had become “not so gung ho about” using pirated ebooks “for legal reasons” and needed a safer source.
