

Reddit CEO pledges site will remain “written by humans and voted on by humans”

Reddit is in an “arms race” to protect its devoted online communities from a surge in artificial intelligence-generated content, with the authenticity of its vast repository of human interaction increasingly valuable in training new AI-powered search tools.

Chief executive Steve Huffman told the Financial Times that Reddit had “20 years of conversation about everything,” leaving the company with a lucrative resource of personal interaction.

This has allowed it to strike multimillion dollar partnerships with Google and OpenAI to train their large language models on its content, as tech companies look for real-world data that can improve their generative AI products.

But Huffman said Reddit was now battling to ensure its users stay at the center of the social network. “Where the rest of the internet seems to be powered by or written by or summarized by AI, Reddit is distinctly human,” he said. “It’s the place you go when you want to hear from people, their lived experiences, their perspectives, their recommendations. Reddit is communities and human curation and conversation and authenticity.”

As Reddit becomes an increasingly important source for LLMs, advertisers are responding with what one agency chief described as a “massive migration” to the platform.

Multiple advertising and agency executives speaking during this month’s Cannes advertising festival told the FT that brands were increasingly exploring hosting a business account and posting content on Reddit to boost the likelihood of their ads appearing in the responses of generative AI chatbots.

However, Huffman warned against any company seeking to game the site with fake or AI-generated content, with plans to bring in strict verification checks to ensure that only humans can post to its forums.

“For 20 years, we’ve been fighting people who have wanted to be popular on Reddit,” he said. “We index very well into the search engines. If you want to show up in the search engines, you try to do well on Reddit, and now the LLMs, it’s the same thing. If you want to be in the LLMs, you can do it through Reddit.”



Anthropic destroyed millions of print books to build its AI models

But if you’re not intimately familiar with the AI industry and copyright, you might wonder: Why would a company spend millions of dollars on books to destroy them? Behind these odd legal maneuvers lies a more fundamental driver: the AI industry’s insatiable hunger for high-quality text.

The race for high-quality training data

To understand why Anthropic would want to scan millions of books, it’s important to know that AI researchers build large language models (LLMs) like those that power ChatGPT and Claude by feeding billions of words into a neural network. During training, the AI system processes the text repeatedly, building statistical relationships between words and concepts in the process.

The quality of training data fed into the neural network directly impacts the resulting AI model’s capabilities. Models trained on well-edited books and articles tend to produce more coherent, accurate responses than those trained on lower-quality text like random YouTube comments.

Publishers legally control content that AI companies desperately want, but AI companies don’t always want to negotiate a license. The first-sale doctrine offered a legal workaround: Once you buy a physical book, you can do what you want with that copy, including destroying it.

And yet buying things is expensive, even if it is legal. So like many AI companies before it, Anthropic initially chose the quick and easy path. In the quest for high-quality training data, the court filing states, Anthropic first chose to amass digitized versions of pirated books to avoid what CEO Dario Amodei called “legal/practice/business slog”—the complex licensing negotiations with publishers. But by 2024, Anthropic had become “not so gung ho about” using pirated ebooks “for legal reasons” and needed a safer source.



Gemini CLI is a free, open source coding agent that brings AI to your terminal

Some developers prefer to live in the command line interface (CLI), eschewing the flashy graphics and file management features of IDEs. Google’s latest AI tool is for those terminal lovers. It’s called Gemini CLI, and it shares a lot with Gemini Code Assist, but it works in your terminal environment instead of integrating with an IDE. And perhaps best of all, it’s free and open source.

Gemini CLI plugs into Gemini 2.5 Pro, Google’s most advanced model for coding and simulated reasoning. It can create and modify code for you right inside the terminal, but you can also call on other Google models to generate images or videos without leaving the security of your terminal cocoon. It’s essentially vibe coding from the command line.

This tool is fully open source, so developers can inspect the code and help to improve it. The openness extends to how you configure the AI agent. It supports Model Context Protocol (MCP) and bundled extensions, allowing you to customize your terminal as you see fit. You can even include your own system prompts—Gemini CLI relies on GEMINI.md files, which you can use to tweak the model for different tasks or teams.

Now that Gemini 2.5 Pro is generally available, Gemini Code Assist has been upgraded to use the same technology as Gemini CLI. Code Assist integrates with IDEs like VS Code for those times when you need a more feature-rich environment. The new agent mode in Code Assist allows you to give the AI more general instructions, like “Add support for dark mode to my application” or “Build my project and fix any errors.”



Key fair use ruling clarifies when books can be used for AI training

“This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use,” Alsup wrote. “Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.”

But Alsup said that the Anthropic case may not even need to decide on that, since Anthropic’s retention of pirated books for its research library alone was not transformative. Alsup wrote that Anthropic’s argument to hold onto potential AI training material it pirated in case it ever decided to use it for AI training was an attempt to “fast glide over thin ice.”

Additionally, Alsup pointed out that Anthropic’s early attempts to get permission to train on authors’ works withered, as internal messages revealed the company concluded that stealing books was the more cost-effective path to innovation “to avoid ‘legal/practice/business slog,’ as cofounder and chief executive officer Dario Amodei put it.”

“Anthropic is wrong to suppose that so long as you create an exciting end product, every ‘back-end step, invisible to the public,’ is excused,” Alsup wrote. “Here, piracy was the point: To build a central library that one could have paid for, just as Anthropic later did, but without paying for it.”

To avoid maximum damages in the event of a loss, Anthropic will likely continue arguing that replacing pirated books with purchased books should water down authors’ fight, Alsup’s order suggested.

“That Anthropic later bought a copy of a book it earlier stole off the Internet will not absolve it of liability for the theft, but it may affect the extent of statutory damages,” Alsup noted.



The résumé is dying, and AI is holding the smoking gun

Beyond volume, fraud poses an increasing threat. In January, the Justice Department announced indictments in a scheme to place North Korean nationals in remote IT roles at US companies. Research firm Gartner says that fake identity cases are growing rapidly, with the company estimating that by 2028, about 1 in 4 job applicants could be fraudulent. And as we have previously reported, security researchers have also discovered that AI systems can hide invisible text in applications, potentially allowing candidates to game screening systems using prompt injections in ways human reviewers can’t detect.


And that’s not all. Even when AI screening tools work as intended, they exhibit similar biases to human recruiters, preferring white male names on résumés—raising legal concerns about discrimination. The European Union’s AI Act already classifies hiring under its high-risk category with stringent restrictions. Although no US federal law specifically addresses AI use in hiring, general anti-discrimination laws still apply.

So perhaps résumés as a meaningful signal of candidate interest and qualification are becoming obsolete. And maybe that’s OK. When anyone can generate hundreds of tailored applications with a few prompts, the document that once demonstrated effort and genuine interest in a position has devolved into noise.

Instead, the future of hiring may require abandoning the résumé altogether in favor of methods that AI can’t easily replicate—live problem-solving sessions, portfolio reviews, or trial work periods, just to name a few ideas people sometimes consider (whether they are good ideas or not is beyond the scope of this piece). For now, employers and job seekers remain locked in an escalating technological arms race where machines screen the output of other machines, while the humans they’re meant to serve struggle to make authentic connections in an increasingly inauthentic world.

Perhaps the endgame is robots interviewing other robots for jobs performed by robots, while humans sit on the beach drinking daiquiris and playing vintage video games. Well, one can dream.


Google’s new robotics AI can run without the cloud and still tie your shoes

We sometimes call chatbots like Gemini and ChatGPT “robots,” but generative AI is also playing a growing role in real, physical robots. After announcing Gemini Robotics earlier this year, Google DeepMind has now revealed a new on-device VLA (vision language action) model to control robots. Unlike the previous release, there’s no cloud component, allowing robots to operate with full autonomy.

Carolina Parada, head of robotics at Google DeepMind, says this approach to AI robotics could make robots more reliable in challenging situations. This is also the first version of Google’s robotics model that developers can tune for their specific uses.

Robotics is a unique problem for AI because the robot not only exists in the physical world but also changes its environment. Whether you’re having it move blocks around or tie your shoes, it’s hard to predict every eventuality a robot might encounter. The traditional approach of training a robot on actions with reinforcement learning was very slow, but generative AI allows for much greater generalization.

“It’s drawing from Gemini’s multimodal world understanding in order to do a completely new task,” explains Carolina Parada. “What that enables is in that same way Gemini can produce text, write poetry, just summarize an article, you can also write code, and you can also generate images. It also can generate robot actions.”

General robots, no cloud needed

In the previous Gemini Robotics release (which is still the “best” version of Google’s robotics tech), the platforms ran a hybrid system with a small model on the robot and a larger one running in the cloud. You’ve probably watched chatbots “think” for measurable seconds as they generate an output, but robots need to react quickly. If you tell the robot to pick up and move an object, you don’t want it to pause while each step is generated. The local model allows quick adaptation, while the server-based model can help with complex reasoning tasks. Google DeepMind is now unleashing the local model as a standalone VLA, and it’s surprisingly robust.



Curated realities: An AI film festival and the future of human expression


We saw 10 AI films and interviewed Runway’s CEO as well as Hollywood pros.


A still from Total Pixel Space, the Grand Prix winner at AIFF 2025.


Last week, I attended a film festival dedicated to shorts made using generative AI. Dubbed AIFF 2025, it was an event precariously balancing between two different worlds.

The festival was hosted by Runway, a company that produces models and tools for generating images and videos. In panels and press briefings, a curated list of industry professionals made the case for Hollywood to embrace AI tools. In private meetings with industry professionals, I gained a strong sense that there is already a widening philosophical divide within the film and television business.

I also interviewed Runway CEO Cristóbal Valenzuela about the tightrope he walks as he pitches his products to an industry that has deeply divided feelings about what role AI will have in its future.

To unpack all this, it makes sense to start with the films, partly because the film that was chosen as the festival’s top prize winner says a lot about the issues at hand.

A festival of oddities and profundities

Since this was the first time the festival has been open to the public, the crowd was a diverse mix: AI tech enthusiasts, working industry creatives, and folks who enjoy movies and who were curious about what they’d see—as well as quite a few people who fit into all three groups.

The scene at the entrance to the theater at AIFF 2025 in Santa Monica, California.

The films shown were all short, and most would be more at home at an art film fest than something more mainstream. Some shorts featured an animated aesthetic (including one inspired by anime) and some presented as live action. There was even a documentary of sorts. The films could be made entirely with Runway or other AI tools, or those tools could simply be a key part of a stack that also includes more traditional filmmaking methods.

Many of these shorts were quite weird. Most of us have seen by now that AI video-generation tools excel at producing surreal and distorted imagery—sometimes whether the person prompting the tool wants that or not. Several of these films leaned into that limitation, treating it as a strength.

Representing that camp was Vallée Duhamel’s Fragments of Nowhere, which visually explored the notion of multiple dimensions bleeding into one another. Cars morphed into the sides of houses, and humanoid figures, purported to be inter-dimensional travelers, moved in ways that defied anatomy. While I found this film visually compelling at times, I wasn’t seeing much in it that I hadn’t already seen from dreamcore or horror AI video TikTok creators like GLUMLOT or SinRostroz in recent years.

More compelling were shorts that used this propensity for oddity to generate imagery that was curated and thematically tied to some aspect of human experience or identity. For example, More Tears than Harm by Herinarivo Rakotomanana was a rotoscope animation-style “sensory collage of childhood memories” of growing up in Madagascar. Its specificity and consistent styling lent it a credibility that Fragments of Nowhere didn’t achieve. I also enjoyed Riccardo Fusetti’s Editorial on this front.

More Tears Than Harm, an unusual animated film at AIFF 2025.

Among the 10 films in the festival, two clearly stood above the others in my view, and they ended up being the Grand Prix and Gold prize winners. (The judging panel included filmmakers Gaspar Noé and Harmony Korine, Tribeca Enterprises CEO Jane Rosenthal, IMAX head of post and image capture Bruce Markoe, Lionsgate VFX SVP Brianna Domont, Nvidia developer relations lead Richard Kerris, and Runway CEO Cristóbal Valenzuela, among others.)

Runner-up Jailbird was the aforementioned quasi-documentary. Directed by Andrew Salter, it was a brief piece that introduced viewers to a program in the UK that places chickens in human prisons as companion animals, to positive effect. Why make that film with AI, you might ask? Well, AI was used to achieve shots that wouldn’t otherwise be doable for a small-budget film to depict the experience from the chicken’s point of view. The crowd loved it.

Jailbird, the runner-up at AIFF 2025.

Then there was the Grand Prix winner, Jacob Adler’s Total Pixel Space, which was, among other things, a philosophical defense of the very idea of AI art. You can watch Total Pixel Space on YouTube right now, unlike some of the other films. I found it strangely moving, even as I saw its selection as the festival’s top winner with some cynicism. Of course they’d pick that one, I thought, although I agreed it was the most interesting of the lot.

Total Pixel Space, the Grand Prix winner at AIFF 2025.


Even though it risked navel-gazing and self-congratulation in this venue, Total Pixel Space was filled with compelling imagery that matched the themes, and it touched on some genuinely interesting ideas—at times, it seemed almost profound, didactic as it was.

“How many images can possibly exist?” the film’s narrator asks. To answer that, the film explains the concept of total pixel space, which actually reflects how image generation tools work:

Pixels are the building blocks of digital images—tiny tiles forming a mosaic. Each pixel is defined by numbers representing color and position. Therefore, any digital image can be represented as a sequence of numbers…

Just as we don’t need to write down every number between zero and one to prove they exist, we don’t need to generate every possible image to prove they exist. Their existence is guaranteed by the mathematics that defines them… Every frame of every possible film exists as coordinates… To deny this would be to deny the existence of numbers themselves.

The nine-minute film demonstrates that the number of possible images or films is greater than the number of atoms in the universe and argues that photographers and filmmakers may be seen as discovering images that already exist in the possibility space rather than creating something new.

Within that framework, it’s easy to argue that generative AI is just another way for artists to “discover” images.
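The film’s counting claim is easy to check with some quick arithmetic. The sketch below is not from the film; it assumes 8-bit RGB color (256 values per channel, three channels per pixel) and uses the commonly cited rough estimate of 10^80 atoms in the observable universe. The function names are purely illustrative.

```python
import math

# Assumption (not from the film): 8 bits per channel, 3 channels (RGB).
VALUES_PER_PIXEL = 256 ** 3  # 16,777,216 possible colors for one pixel

# Assumption: a commonly cited rough estimate for the observable universe.
ATOMS_IN_UNIVERSE = 10 ** 80


def count_images(width: int, height: int) -> int:
    """Exact number of distinct RGB images with the given dimensions."""
    return VALUES_PER_PIXEL ** (width * height)


def decimal_digits(width: int, height: int) -> int:
    """Number of decimal digits in count_images(), found via logarithms
    so that large frames do not require building the full integer."""
    return math.floor(width * height * math.log10(VALUES_PER_PIXEL)) + 1


def smallest_square_exceeding(limit: int) -> int:
    """Side length of the smallest square image whose count exceeds limit."""
    side = 1
    while count_images(side, side) <= limit:
        side += 1
    return side


if __name__ == "__main__":
    side = smallest_square_exceeding(ATOMS_IN_UNIVERSE)
    print(f"Smallest square image beating the atom estimate: {side}x{side}")
    print(f"Digits in the count of 1920x1080 images: {decimal_digits(1920, 1080):,}")
```

Under those assumptions, a square image only 4 pixels on a side already has more possible variations than the atom estimate, and the count of possible 1920×1080 frames is a number with roughly 15 million digits.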

The balancing act

“We are all—and I include myself in that group as well—obsessed with technology, and we keep chatting about models and data sets and training and capabilities,” Runway CEO Cristóbal Valenzuela said to me when we spoke the next morning. “But if you look back and take a minute, the festival was celebrating filmmakers and artists.”

I admitted that I found myself moved by Total Pixel Space’s articulations. “The winner would never have thought of himself as a filmmaker, and he made a film that made you feel something,” Valenzuela responded. “I feel that’s very powerful. And the reason he could do it was because he had access to something that just wasn’t possible a couple of months ago.”

First-time and outsider filmmakers were the focus of AIFF 2025, but Runway works with established studios, too—and those relationships have an inherent tension.

The company has signed deals with companies like Lionsgate and AMC. In some cases, it trains on data provided by those companies; in others, it embeds within them to try to develop tools that fit how they already work. That’s not something competitors like OpenAI are doing yet, so that, combined with a head start in video generation, has allowed Runway to grow and stay competitive so far.

“We go directly into the companies, and we have teams of creatives that are working alongside them. We basically embed ourselves within the organizations that we’re working with very deeply,” Valenzuela explained. “We do versions of our film festival internally for teams as well so they can go through the process of making something and seeing the potential.”

Founded in 2018 at New York University’s Tisch School of the Arts by two Chileans and one Greek co-founder, Runway has a very different story than its Silicon Valley competitors. It was one of the first to bring an actually usable video-generation tool to the masses. Runway also contributed in foundational ways to the popular Stable Diffusion model.

Though it is vastly outspent by competitors like OpenAI, it has taken a hands-on approach to working with existing industries. You won’t hear Valenzuela or other Runway leaders talking about the imminence of AGI or anything so lofty; instead, it’s all about selling the product as something that can solve existing problems in creatives’ workflows.

Still, an artist’s mindset and relationships within the industry don’t negate some fundamental conflicts. There are multiple intellectual property cases involving Runway and its peers, and though the company hasn’t admitted it, there is evidence that it trained its models on copyrighted YouTube videos, among other things.

Cristóbal Valenzuela speaking on the AIFF 2025 stage. Credit: Samuel Axon

Valenzuela suggested that studios are worried about liability, not underlying principles, though, saying:

Most of the concerns on copyright are on the output side, which is like, how do you make sure that the model doesn’t create something that already exists or infringes on something. And I think for that, we’ve made sure our models don’t and are supportive of the creative direction you want to take without being too limiting. We work with every major studio, and we offer them indemnification.

In the past, he has also defended Runway by saying that what it’s producing is not a re-creation of what has come before. He sees the tool’s generative process as distinct—legally, creatively, and ethically—from simply pulling up assets or references from a database.

“People believe AI is sort of like a system that creates and conjures things magically with no input from users,” he said. “And it’s not. You have to do that work. You still are involved, and you’re still responsible as a user in terms of how you use it.”

He seemed to share this defense of AI as a legitimate tool for artists with conviction, but given that he’s been pitching these products directly to working filmmakers, he was also clearly aware that not everyone agrees with him. There is not even a consensus among those in the industry.

An industry divided

While in LA for the event, I visited separately with two of my oldest friends. Both of them work in the film and television industry in similar disciplines. They each asked what I was in town for, and I told them I was there to cover an AI film festival.

One immediately responded with a grimace of disgust, “Oh, yikes, I’m sorry.” The other responded with bright eyes and intense interest and began telling me how he already uses AI in his day-to-day to do things like extend shots by a second or two for a better edit, and expressed frustration at his company for not adopting the tools faster.

Neither is alone in their attitudes. Hollywood is divided—and not for the first time.

There have been seismic technological changes in the film industry before. There was the transition from silent films to talkies, obviously; moviemaking transformed into an entirely different art. Numerous old jobs were lost, and numerous new jobs were created.

Later, there was the transition from film to digital projection, which may be an even tighter parallel. It was a major disruption, with some companies and careers collapsing while others rose. There were people saying, “Why do we even need this?” while others believed it was the only sane way forward. Some audiences declared the quality worse, and others said it was better. There were analysts arguing it could be stopped, while others insisted it was inevitable.

IMAX’s head of post production, Bruce Markoe, spoke briefly about that history at a press mixer before the festival. “It was a little scary,” he recalled. “It was a big, fundamental change that we were going through.”

People ultimately embraced it, though. “The motion picture and television industry has always been very technology-forward, and they’ve always used new technologies to advance the state of the art and improve the efficiencies,” Markoe said.

When asked whether he thinks the same thing will happen with generative AI tools, he said, “I think some filmmakers are going to embrace it faster than others.” He pointed to AI tools’ usefulness for pre-visualization as particularly valuable and noted that some people are already using them that way, but he said it will take time for people to get comfortable with them.

And indeed, many, many filmmakers are still loudly skeptical. “The concept of AI is great,” The Mitchells vs. the Machines director Mike Rianda said in a Wired interview. “But in the hands of a corporation, it is like a buzzsaw that will destroy us all.”

Others are interested in the technology but are concerned that it’s being brought into the industry too quickly, with insufficient planning and protections. That includes Crafty Apes Senior VFX Supervisor Luke DiTomasso. “How fast do we roll out AI technologies without really having an understanding of them?” he asked in an interview with Production Designers Collective. “There’s a potential for AI to accelerate beyond what we might be comfortable with, so I do have some trepidation and am maybe not gung-ho about all aspects of it.”

Others remain skeptical that the tools will be as useful as some optimists believe. “AI never passed on anything. It loved everything it read. It wants you to win. But storytelling requires nuance—subtext, emotion, what’s left unsaid. That’s something AI simply can’t replicate,” said Alegre Rodriquez, a member of the Emerging Technology committee at the Motion Picture Editors Guild.

The mirror

Flying back from Los Angeles, I considered two key differences between this generative AI inflection point for Hollywood and the silent/talkie or film/digital transitions.

First, neither of those transitions involved an existential threat to the technology on the basis of intellectual property and copyright. Valenzuela talked about what matters to studio heads—protection from liability over the outputs. But the countless creatives who are critical of these tools also believe they should be consulted and even compensated for their work’s use in the training data for Runway’s models. In other words, it’s not just about the outputs, it’s also about the sourcing. As noted before, there are several cases underway. We don’t know where they’ll land yet.

Second, there’s a more cultural and philosophical issue at play, which Valenzuela himself touched on in our conversation.

“I think AI has become this sort of mirror where anyone can project all their fears and anxieties, but also their optimism and ideas of the future,” he told me.

You don’t have to scroll for long to come across techno-utopians declaring with no evidence that AGI is right around the corner and that it will cure cancer and save our society. You also don’t have to scroll long to encounter visceral anger at every generative AI company from people declaring the technology—which is essentially just a new methodology for programming a computer—fundamentally unethical and harmful, with apocalyptic societal and economic ramifications.

Amid all those bold declarations, this film festival put the focus on the on-the-ground reality. First-time filmmakers who might never have previously cleared Hollywood’s gatekeepers are getting screened at festivals because they can create competitive-looking work with a fraction of the crew and hours. Studios and the people who work there are saying they’re saving time, resources, and headaches in pre-viz, editing, visual effects, and other work that’s usually done under immense time and resource pressure.

“People are not paying attention to the very huge amount of positive outcomes of this technology,” Valenzuela told me, pointing to those examples.

In this online discussion ecosystem that elevates outrage above everything else, that’s likely true. Still, there is a sincere and rigorous conviction among many creatives that their work is contributing to this technology’s capabilities without credit or compensation and that the structural and legal frameworks to ensure minimal human harm in this evolving period of disruption are still inadequate. That’s why we’ve seen groups like the Writers Guild of America West support the Generative AI Copyright Disclosure Act and other similar legislation meant to increase transparency about how these models are trained.

The philosophical question with a legal answer

The winning film argued that “total pixel space represents both the ultimate determinism and the ultimate freedom—every possibility existing simultaneously, waiting for consciousness to give it meaning through the act of choice.”

In making this statement, the film suggested that creativity, above all else, is an act of curation. It’s a claim that nothing, truly, is original. It’s a distillation of human expression into the language of mathematics.

To many, that philosophy rings undeniably true: Every possibility already exists, and artists are just collapsing the waveform to the frame they want to reveal. To others, there is more personal truth to the romantic ideal that artwork is valued precisely because it did not exist until the artist produced it.

All this is to say that the debate about creativity and AI in Hollywood is ultimately a philosophical one. But it won’t be resolved that way.

The industry may succumb to litigation fatigue and a hollowed-out workforce—or it may instead find its way to fair deals, new opportunities for fresh voices, and transparent training sets.

For all this lofty talk about creativity and ideas, the outcome will come down to the contracts, court decisions, and compensation structures—all things that have always been at least as big a part of Hollywood as the creative work itself.


Samuel Axon is the editorial lead for tech and gaming coverage at Ars Technica. He covers AI, software development, gaming, entertainment, and mixed reality. He has been writing about gaming and technology for nearly two decades at Engadget, PC World, Mashable, Vice, Polygon, Wired, and others. He previously ran a marketing and PR agency in the gaming industry, led editorial for the TV network CBS, and worked on social media marketing strategy for Samsung Mobile at the creative agency SPCSHP. He also is an independent software and game developer for iOS, Windows, and other platforms, and he is a graduate of DePaul University, where he studied interactive media and software development.



Ted Cruz can’t get all Republicans to back his fight against state AI laws


Cruz plan moves ahead but was reportedly watered down amid Republican opposition.

Sen. Ted Cruz (R-Texas) presides over a subcommittee hearing on June 3, 2025 in Washington, DC. Credit: Getty Images | Chip Somodevilla

A Republican proposal to penalize states that regulate artificial intelligence can move forward without requiring approval from 60 senators, the Senate parliamentarian decided on Saturday. But the moratorium on state AI laws did not have unanimous Republican support and has reportedly been watered down in an effort to push it toward passage.

In early June, Sen. Ted Cruz (R-Texas) proposed enforcing a 10-year moratorium on AI regulation by making states ineligible for broadband funding if they try to impose any limits on development of artificial intelligence. While the House previously approved a version of the so-called “One Big Beautiful Bill” with an outright 10-year ban on state AI regulation, Cruz took a different approach because of the Senate rule that limits inclusion of “extraneous matter” in budget reconciliation legislation.

Under the Senate’s Byrd rule, a senator can object to a potentially extraneous budget provision. A motion to waive the Byrd rule requires a vote of 60 percent of the Senate.

As originally drafted, Cruz’s backdoor ban on state AI laws would have made it impossible for states to receive money from the $42 billion Broadband Equity, Access, and Deployment (BEAD) program if they try to regulate AI. He tied the provision into the budget bill by proposing an extra $500 million for the broadband-deployment grant program and expanding its purpose to also subsidize construction and deployment of infrastructure for artificial intelligence systems.

Punchbowl News reported today that Cruz made changes in order to gain more Republican support and comply with Senate procedural rules. Cruz was quoted as saying that under his current version, states that regulate AI would only be shut out of the $500 million AI fund.

This would seem to protect states’ access to the $42 billion broadband deployment fund that will offer subsidies to ISPs that expand access to Internet service. Losing that funding would be a major blow to states that have spent the last couple of years developing plans to connect more of their residents to modern broadband. The latest Senate bill text was not available today. We contacted Cruz’s office and will update this article if we get a response.

A spokesperson for Sen. Maria Cantwell (D-Wash.) told Ars today that Cruz’s latest version could still prevent states from getting broadband funding. The text has “a backdoor to apply new AI requirements to the entire $42.45 billion program, not just the new $500 million,” Cantwell’s representative said.

Plan has opponents from both parties

Senate Parliamentarian Elizabeth MacDonough ruled that several parts of the Republican budget bill are subject to the Byrd rule and its 60-vote requirement, but Cruz’s AI proposal wasn’t one of them. A press release from Senate Budget Committee Ranking Member Jeff Merkley (D-Ore.) noted that “the parliamentarian’s advice is based on whether a provision is appropriate for reconciliation and conforms to the limitations of the Byrd rule; it is not a judgement on the relative merits of a particular policy.”

Surviving the parliamentarian review doesn’t guarantee passage. A Bloomberg article said the parliamentarian’s decision is “a win for tech companies pushing to stall and override dozens of AI safety laws across the country,” but that the “provision will likely still be challenged on the Senate floor, where stripping the provision would need just a simple majority. Some Republicans in both the House and Senate have pushed back on the AI provision.”

Republicans have a 53–47 edge in the Senate. Cantwell and Sen. Marsha Blackburn (R-Tenn.) teamed up for a press conference last week in which they spoke out against the proposed moratorium on state regulation.

Cantwell said that 24 states last year started “regulating AI in some way, and they have adopted these laws that fill a gap while we are waiting for federal action. Now Congress is threatening these laws, which will leave hundreds of millions of Americans vulnerable to AI harm by abolishing those state law protections.”

Blackburn said she agreed with Cantwell that the AI regulation proposal “is not the type of thing that we put into reconciliation bills.” Blackburn added that lawmakers “are working to move forward with legislation at the federal level, but we do not need a moratorium that would prohibit our states from stepping up and protecting citizens in their state.”

Sens. Ron Johnson (R-Wis.) and Josh Hawley (R-Mo.) have also criticized the idea of stopping states from regulating AI.

Cruz accused states of “strangling AI”

Cruz argued that his proposal stops states “from strangling AI deployment with EU-style regulation.” Under his first proposal, no BEAD funds were to be given to any state or territory that enforces “any law or regulation… limiting, restricting, or otherwise regulating artificial intelligence models, artificial intelligence systems, or automated decision systems entered into interstate commerce.”

The Cantwell/Blackburn press conference also included Washington Attorney General Nick Brown, a Democrat; and Tennessee Attorney General Jonathan Skrmetti, a Republican. Brown said that “Washington has a law that prohibits deep fakes being used against political candidates by mimicking their appearance and their speech,” another “that prohibits sharing fabricated sexual images without consent and provides for penalties for those who possess and distribute such images,” and a third “that prohibits the knowing distribution of forged digital likenesses that can be used to harm or defraud people.”

“All of those laws, in my reading, would be invalid if this was to pass through Congress, and each of those laws are prohibiting and protecting people here in our state,” Brown said.

Skrmetti said that if the Senate proposal becomes law “there would be arguments out there for the big tech companies that the moratorium does, in fact, preclude any enforcement of any consumer protection laws if there’s an AI component to the product that we’re looking at.”

Other Republican plans fail Byrd rule test

Senate Democrats said they are pleased that the parliamentarian ruled that several other parts of the bill are subject to the Byrd rule. “We continue to see Republicans’ blatant disregard for the rules of reconciliation when drafting this bill… Democrats plan to challenge every part of this bill that hurts working families and violates this process,” Merkley said.

Merkley’s press release said the provisions that are subject to a 60-vote threshold include one that “limits certain grant funding for ‘sanctuary cities,’ and where the Attorney General disagrees with states’ and localities’ immigration enforcement,” and another that “gives state and local officials the authority to arrest any noncitizen suspected of being in the US unlawfully.”

The Byrd rule also applies to a section that “limits the ability of federal courts to issue preliminary injunctions or temporary restraining orders against the federal government by requiring litigants to post a potentially enormous bond,” and another that “limits when the federal government can enter into or enforce settlement agreements that provide for payments to third parties to fully compensate victims, remedy harm, and punish and deter future violations,” Merkley’s office said.

The office of Senate Democratic Leader Chuck Schumer (D-N.Y.) said yesterday that the provision requiring litigants to post bonds has been struck from the legislation. “This Senate Republican provision, which was even worse than the similar House-passed version, required a plaintiff seeking an emergency court order, preliminary injunction, or a temporary restraining order against the Trump Administration or the federal government to pay a costly bond up front—essentially making the justice system pay-to-play,” Schumer’s office said.

Schumer said that “if enacted, this would have been one of the most brazen power grabs we’ve seen in American history—an attempt to let a future President Trump ignore court orders with impunity, putting him above the law.”

Jon is a Senior IT Reporter for Ars Technica. He covers the telecom industry, Federal Communications Commission rulemakings, broadband consumer affairs, court cases, and government regulation of the tech industry.

How a grad student got LHC data to play nice with quantum interference


New approach is already having an impact on the experiment’s plans for future work.

The ATLAS particle detector of the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) in Geneva, Switzerland. Credit: EThamPhoto/Getty Images

Measurements at the Large Hadron Collider have been stymied by one of the most central phenomena of the quantum world. But now, a young researcher has championed a new method to solve the problem using deep neural networks.

The Large Hadron Collider is one of the biggest experiments in history, but it’s also one of the hardest to interpret. Unlike seeing an image of a star in a telescope, saying anything at all about the data that comes out of the LHC requires careful statistical modeling.

“If you gave me a theory [that] the Higgs boson is this way or that way, I think people imagine, ‘Hey, you built the experiment, you should be able to tell me what you’re going to see under various hypotheses!’” said Daniel Whiteson, a professor at the University of California, Irvine. “But we don’t.”

One challenge with interpreting LHC data is interference, a core implication of quantum mechanics. Interference allows two possible events to inhibit each other, weakening the likelihood of seeing the result of either. In the presence of interference, physicists needed to use a fuzzier statistical method to analyze data, losing the data’s full power and increasing its uncertainty.

However, a recent breakthrough suggests a different way to tackle the problem. The ATLAS collaboration, one of two groups studying proton collisions at the LHC, released two papers last December that describe new ways of exploring data from their detector. One describes how to use a machine learning technique called Neural Simulation-Based Inference to maximize the potential of particle physics data. The other demonstrates its effectiveness with the ultimate test: re-doing a previous analysis with the new technique and seeing dramatic improvement.

The papers are the culmination of a young researcher’s six-year quest to convince the collaboration of the value of the new technique. Its success is already having an impact on the experiment’s plans for future work.

Making sense out of fusing bosons

Each particle collision at the LHC involves many possible pathways in which different particles combine to give rise to the spray of debris that experimenters see. In 2017, David Rousseau at IJCLab in Orsay, a member of the ATLAS collaboration, asked one of his students, Aishik Ghosh, to improve his team’s ability to detect a specific pathway. That particular pathway is quite important since it’s used to measure properties of the Higgs boson, a particle (first measured in 2012) that helps explain the mass of all other fundamental particles.

It was a pretty big ask. “When a grad student gets started in ATLAS, they’re a tiny cog in a giant, well-oiled machine of 3,500 physicists, who all seem to know exactly what they’re doing,” said Ghosh.

The pathway Ghosh was asked to study occurs via several steps. First, the two colliding protons each emit a W boson, a particle associated with the weak nuclear force. These two bosons fuse together, changing their identity to form a Higgs boson. The Higgs boson then decays, forming a pair of Z bosons, another particle associated with the weak force. Finally, those Z bosons themselves each decay into a lepton, like an electron, and its antimatter partner, like a positron.

A Feynman diagram for the pathway studied by Aishik Ghosh. Credit: ATLAS

Measurements like the one Ghosh was studying are a key way of investigating the properties of the Higgs boson. By precisely measuring how long it takes the Higgs boson to decay, physicists could find evidence of it interacting with new, undiscovered particles that are too massive for the LHC to produce directly.

Ghosh started on the project, hoping to find a small improvement in the collaboration’s well-tested methods. Instead, he noticed a larger issue. The goal he was given, of detecting a single pathway by itself, didn’t actually make sense.

“I was doing that and I realized, ‘What am I doing?’ There’s no clear objective,” said Ghosh.

The problem was quantum interference.

How quantum histories interfere

One of the most famous demonstrations of the mysterious nature of quantum mechanics is called the double-slit experiment. In this demonstration, electrons are shot through a screen with two slits that allow them to pass through to a photographic plate on the other side. With one slit covered, the electrons form a pattern centered on the opening. The photographic plate lights up bright right across from the slit and dims further away from it.

With both slits open, you would expect the pattern to get brighter as more electrons reach the photographic plate. Instead, the effect varies. The two slits do not give rise to two nice bright peaks; instead, you see a rippling pattern in which some areas get brighter while others get dimmer, even though the dimmer areas should, in principle, be easier for electrons to reach.

The effect happens even if the electrons are shot at the screen one by one to stop them from influencing each other directly. It’s as if each electron carries with it two possible histories, one in which it goes through one slit and another where it goes through the other before both end up at the same place. These two histories interfere with each other so that some destinations become less likely instead of more likely.

Results of the double-slit experiment. Credit: Jordgette (CC BY-SA 3.0)

For electrons in the double-slit experiment, the two different histories are two different paths through space. For a measurement at the Large Hadron Collider, the histories are more abstract—paths that lead through transformations of fields. One history might be like the pathway Ghosh was asked to study, in which two W bosons fuse to form a Higgs boson before the Higgs boson splits into two Z bosons. But in another history, the two W bosons might fuse and immediately split into two Z bosons without ever producing a Higgs.

Both histories have the same beginning, with two W bosons, and the same end, with two Z bosons. And just as the two histories of electrons in the double-slit experiment can interfere, so can the two histories for these particles.

Another possible history for colliding particles at the Large Hadron Collider, which interferes with the measurement Ghosh was asked to do. Credit: ATLAS

That interference makes the effect of the Higgs boson much more challenging to spot. ATLAS scientists wanted to look for two pairs of electrons and positrons, which would provide evidence that two Z bosons were produced. They would classify their observations into two types: observations that are evidence for the signal they were looking for (that of a decaying Higgs boson) and observations of events that generate this pattern of particles without the Higgs boson acting as an intermediate (the latter are called the background). But the two types of observations, signal and background, interfere. With a stronger signal, corresponding to more Higgs bosons decaying, you might observe more pairs of electrons and positrons… but if these events interfere, you also might see those pairs disappear.
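
The way a stronger signal can lead to fewer observed events can be sketched with a toy calculation. In quantum mechanics, the amplitudes of the two histories are added first and only then squared to get a rate. The amplitude values below are made-up illustrative numbers, not ATLAS values, and the function names are hypothetical.

```python
# Toy illustration of signal/background interference.
# All numbers are invented for illustration; they are not measured values.

def rate_with_interference(mu, a_signal, a_background):
    """Expected rate when the two histories interfere.

    mu scales the signal rate, so the signal amplitude scales as sqrt(mu).
    The amplitudes are summed before squaring.
    """
    total_amplitude = (mu ** 0.5) * a_signal + a_background
    return abs(total_amplitude) ** 2


def rate_without_interference(mu, a_signal, a_background):
    """Classical expectation: the two rates simply add."""
    return mu * abs(a_signal) ** 2 + abs(a_background) ** 2


# Assumed amplitudes with opposite sign, so the interference is destructive.
A_SIGNAL = complex(-0.3, 0.0)
A_BACKGROUND = complex(1.0, 0.0)

if __name__ == "__main__":
    for mu in (0.0, 1.0, 4.0):
        quantum = rate_with_interference(mu, A_SIGNAL, A_BACKGROUND)
        classical = rate_without_interference(mu, A_SIGNAL, A_BACKGROUND)
        print(f"mu={mu:.1f}  with interference={quantum:.2f}  "
              f"without interference={classical:.2f}")
```

With these assumed amplitudes, raising the signal strength lowers the total rate, which is the "disappearing pairs" effect described above. Without interference, the rate could only go up.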

Learning to infer

In traditional approaches, those disappearances are hard to cope with, even when using methods that already incorporate machine learning.

One of the most common uses of machine learning is classification—for example, distinguishing between pictures of dogs and cats. You train the machine on pictures of cats and pictures of dogs, and it tells you, given a picture, which animal is the most likely match. Physicists at the LHC were already using this kind of classification method to characterize the products of collisions, but it functions much worse when interference is involved.

“If you have something that disappears, you don’t quite know what to train on,” said David Rousseau. “Usually, you’re training signal versus background, exactly like you’re training cats versus dogs. When there is something that disappears, you don’t see what you trained on.”

At first, Ghosh tried a few simple tricks, but as time went on, he realized he needed to make a more fundamental change. He reached out to others in the community and learned about a method called Neural Simulation-Based Inference, or NSBI.

In older approaches, people had trained machine learning models to classify observations into signal and background, using simulations of particle collisions to make the training data. Then they used that classification to infer the most likely value of a number, like the amount of time it takes a Higgs boson to decay, based on data from an actual experiment. Neural Simulation-Based Inference skips the classification and goes directly to the inference.

Instead of trying to classify observations into signal and background, NSBI uses simulations to teach an artificial neural network to guess a formula called a likelihood ratio. Someone using NSBI would run several simulations that describe different situations, such as letting the Higgs boson decay at different rates, and then check how many of each type of simulation yielded a specific observation. The fraction of these simulations with a certain decay rate would provide the likelihood ratio, a method for inferring which decay rate is more likely given experimental evidence. If the neural network is good at guessing this ratio, it will be good at finding how long the Higgs takes to decay.
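
The counting idea behind a likelihood ratio can be sketched in a few lines. This is a deliberately simplified stand-in: it uses histogram counting where NSBI trains a neural network, and the "simulator" is a toy exponential decay-time model rather than anything resembling an ATLAS simulation. The lifetimes, bin width, and function names are all assumptions made for illustration.

```python
import random


def simulate_decay_times(mean_lifetime, n_events, rng):
    """Toy simulator: draw decay times from an exponential distribution."""
    return [rng.expovariate(1.0 / mean_lifetime) for _ in range(n_events)]


def fraction_in_bin(samples, low, high):
    """Fraction of simulated events whose value falls in [low, high)."""
    hits = sum(1 for t in samples if low <= t < high)
    return hits / len(samples)


def likelihood_ratio(observed, sims_a, sims_b, bin_width=0.5):
    """Estimate p(observed | hypothesis A) / p(observed | hypothesis B).

    Both likelihoods are approximated by the fraction of simulations
    landing in the same bin as the observation.
    """
    low = (observed // bin_width) * bin_width
    high = low + bin_width
    frac_a = fraction_in_bin(sims_a, low, high)
    frac_b = fraction_in_bin(sims_b, low, high)
    if frac_b == 0:
        # No simulated event under B landed in this bin.
        return float("inf")
    return frac_a / frac_b


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sims_short = simulate_decay_times(1.0, 100_000, rng)  # hypothesis A
sims_long = simulate_decay_times(2.0, 100_000, rng)   # hypothesis B

if __name__ == "__main__":
    for observed in (0.1, 5.0):
        ratio = likelihood_ratio(observed, sims_short, sims_long)
        print(f"observed={observed}: ratio short/long = {ratio:.2f}")
```

A ratio above 1 favors the shorter lifetime and a ratio below 1 favors the longer one. Counting in bins like this breaks down when observations have many dimensions, which is one reason to have a neural network learn the ratio instead.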

Because NSBI doesn’t try to classify observations into different categories, it handles quantum interference more effectively. Instead of trying to find the Higgs based on a signal that disappears, it examines all the data, trying to guess which decay time is the most likely.

Ghosh tested the method, which showed promising results on test data, and presented the results at a conference in 2019. But if he was going to convince the ATLAS collaboration that the method was safe to use, he still had a lot of work ahead of him.

Shifting the weight on ATLAS’ shoulders

Experiments like ATLAS have high expectations attached to them. A collaboration of thousands of scientists, ATLAS needs to not only estimate the laws of physics but also have a clear idea of just how uncertain those estimates are. At the time, NSBI hadn’t been tested in that way.

“None of this has actually been used on data,” said Ghosh. “Nobody knew how to quantify the uncertainties. So you have a neural network that gives you a likelihood. You don’t know how good the likelihood is. Is it well-estimated? What if it’s wrongly estimated just in some weird corner? That would completely bias your results.”

Checking those corners was too big a job for a single PhD student and too complex to complete within a single PhD degree. Ghosh would have to build a team, and he would need time to build that team. That’s tricky in the academic world, where students go on to short-term postdoc jobs with the expectation that they quickly publish new results to improve their CV for the next position.

“We’re usually looking to publish the next paper within two to three years—no time to overhaul our methods,” said Ghosh. Fortunately, Ghosh had support. He received his PhD alongside Rousseau and went to work with Daniel Whiteson, who encouraged him to pursue his ambitious project.

“I think it’s really important that postdocs learn to take those risks because that’s what science is,” Whiteson said.

Ghosh gathered his team. Another student of Rousseau’s, Arnaud Maury, worked to calibrate the machine’s confidence in its answers. A professor at the University of Massachusetts, Rafael Coelho Lopes de Sa, joined the project. His student Jay Sandesara would have a key role in getting the calculation to work at full scale on a computer cluster. IJCLab emeritus RD Schaffer and University of Liège professor Gilles Louppe provided cross-checks and advice.

The team wanted a clear demonstration that their method worked, so they took an unusual step. They took data that ATLAS had already analyzed and performed a full analysis using their method instead, showing that it could pass every check the collaboration could think of. They would publish two papers, one describing the method and the other giving the results of their upgraded analysis. Zach Marshall, who was the computing coordinator for ATLAS at the time, helped get the papers through, ensuring that they were vetted by experts in multiple areas.

“It was a very small subset of our community that had that overlap between this technical understanding and the physics analysis experience and understanding that were capable of really speaking to whether that paper was sufficient and intelligible and useful. So we really had to make sure that we engaged that little group of humans by name,” said Marshall.

The new method showed significant improvements, getting a much more precise result than the collaboration’s previous analysis. That improvement, and the thorough checks, persuaded ATLAS to use NSBI more broadly going forward. It will give them much more precision than they expected, using the Higgs boson to search for new particles and clarify our understanding of the quantum world. When ATLAS discusses its future plans, it makes projections of the precision it expects to reach in the future. But those plans are now being upended.

“One of the fun things about this method that Aishik pushed hard is each time it feels like now we do that projection—here’s how well we’ll do in 15 years—we absolutely crush those projections,” said Marshall. “So we are just now having to redo a set of projections because we matched our old projections for 15 years out already today. It’s a very fun problem to have.”

MIT student prints AI polymer masks to restore paintings in hours

MIT graduate student Alex Kachkine once spent nine months meticulously restoring a damaged baroque Italian painting, which left him plenty of time to wonder if technology could speed things up. Last week, MIT News announced his solution: a technique that uses AI-generated polymer films to physically restore damaged paintings in hours rather than months. The research appears in Nature.

Kachkine’s method works by printing a transparent “mask” containing thousands of precisely color-matched regions that conservators can apply directly to an original artwork. Unlike traditional restoration, which permanently alters the painting, these masks can reportedly be removed whenever needed, so the process is reversible.

“Because there’s a digital record of what mask was used, in 100 years, the next time someone is working with this, they’ll have an extremely clear understanding of what was done to the painting,” Kachkine told MIT News. “And that’s never really been possible in conservation before.”

Figure 1 from the paper. Credit: MIT

Nature reports that up to 70 percent of institutional art collections remain hidden from public view due to damage—a large amount of cultural heritage sitting unseen in storage. Traditional restoration methods, where conservators painstakingly fill damaged areas one at a time while mixing exact color matches for each region, can take weeks to decades for a single painting. It’s skilled work that requires both artistic talent and deep technical knowledge, but there simply aren’t enough conservators to tackle the backlog.

The mechanical engineering student conceived the idea during a 2021 cross-country drive to MIT, when gallery visits revealed how much art remains hidden due to damage and restoration backlogs. As someone who restores paintings as a hobby, he understood both the problem and the potential for a technological solution.

To demonstrate his method, Kachkine chose a challenging test case: a 15th-century oil painting requiring repairs in 5,612 separate regions. An AI model identified damage patterns and generated 57,314 different colors to match the original work. The entire restoration process reportedly took 3.5 hours—about 66 times faster than traditional hand-painting methods.

Alex Kachkine, who developed the AI-printed film technique. Credit: MIT

Notably, Kachkine avoided using generative AI models like Stable Diffusion or the “full-area application” of generative adversarial networks (GANs) for the digital restoration step. According to the Nature paper, these models cause “spatial distortion” that would prevent proper alignment between the restored image and the damaged original.

To avoid admitting ignorance, Meta AI says man’s number is a company helpline

Although that statement may provide comfort to those who have kept their WhatsApp numbers off the Internet, it doesn’t resolve the issue of WhatsApp’s AI helper potentially randomly generating a real person’s private number that may be a few digits off from the business contact information WhatsApp users are seeking.

Expert pushes for chatbot design tweaks

AI companies have recently been grappling with the problem of chatbots being programmed to tell users what they want to hear, instead of providing accurate information. Not only are users sick of “overly flattering” chatbot responses—potentially reinforcing users’ poor decisions—but the chatbots could be inducing users to share more private information than they would otherwise.

The latter could make it easier for AI companies to monetize the interactions, gathering private data to target advertising, which could deter AI companies from solving the sycophantic chatbot problem. Developers for Meta rival OpenAI, The Guardian noted, last month shared examples of “systemic deception behavior masked as helpfulness” and chatbots’ tendency to tell little white lies to mask incompetence.

“When pushed hard—under pressure, deadlines, expectations—it will often say whatever it needs to to appear competent,” developers noted.

Mike Stanhope, the managing director of strategic data consultants Carruthers and Jackson, told The Guardian that Meta should be more transparent about the design of its AI so that users can know if the chatbot is designed to rely on deception to reduce user friction.

“If the engineers at Meta are designing ‘white lie’ tendencies into their AI, the public need to be informed, even if the intention of the feature is to minimize harm,” Stanhope said. “If this behavior is novel, uncommon, or not explicitly designed, this raises even more questions around what safeguards are in place and just how predictable we can force an AI’s behavior to be.”

Study: Meta AI model can reproduce almost half of Harry Potter book


Harry Potter and the Copyright Lawsuit

The research could have big implications for generative AI copyright lawsuits.

Meta CEO Mark Zuckerberg. Credit: Andrej Sokolow/picture alliance via Getty Images

In recent years, numerous plaintiffs—including publishers of books, newspapers, computer code, and photographs—have sued AI companies for training models using copyrighted material. A key question in all of these lawsuits has been how easily AI models produce verbatim excerpts from the plaintiffs’ copyrighted content.

For example, in its December 2023 lawsuit against OpenAI, The New York Times Company produced dozens of examples where GPT-4 exactly reproduced significant passages from Times stories. In its response, OpenAI described this as a “fringe behavior” and a “problem that researchers at OpenAI and elsewhere work hard to address.”

But is it actually a fringe behavior? And have leading AI companies addressed it? New research—focusing on books rather than newspaper articles and on different companies—provides surprising insights into this question. Some of the findings should bolster plaintiffs’ arguments, while others may be more helpful to defendants.

The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models—three from Meta and one each from Microsoft and EleutherAI—were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright.

This chart illustrates their most surprising finding:

The chart shows how easy it is to get a model to generate 50-token excerpts from various parts of Harry Potter and the Sorcerer’s Stone. The darker a line is, the easier it is to reproduce that portion of the book.

Each row represents a different model. The three bottom rows are Llama models from Meta. And as you can see, Llama 3.1 70B—a mid-sized model Meta released in July 2024—is far more likely to reproduce Harry Potter text than any of the other four models.

Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time. (I’ll unpack how this was measured in the next section.)

Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer’s Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3.

Harry Potter and the Sorcerer’s Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books—such as The Hobbit and George Orwell’s 1984—than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models.

“There are really striking differences among models in terms of how much verbatim text they have memorized,” said James Grimmelmann, a Cornell law professor who has collaborated with several of the paper’s authors.

The results surprised the study’s authors, including Mark Lemley, a law professor at Stanford. (Lemley used to be part of Meta’s legal team, but in January, he dropped them as a client after Facebook adopted more Trump-friendly moderation policies.)

“We’d expected to see some kind of low level of replicability on the order of 1 or 2 percent,” Lemley told me. “The first thing that surprised me is how much variation there is.”

These results give everyone in the AI copyright debate something to latch onto. For AI industry critics, the big takeaway is that—at least for some models and some books—memorization is not a fringe phenomenon.

On the other hand, the study only found significant memorization of a few popular books. For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.

This could be a headache for law firms that have filed class-action lawsuits against AI companies. Kadrey is the lead plaintiff in a class-action lawsuit against Meta. To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations.

Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta’s favor, since most authors lack the resources to file individual lawsuits.

The broader lesson of this study is that the details will matter in these copyright cases. Too often, online discussions have treated “do generative models copy their training data or merely learn from it?” as a theoretical or even philosophical question. But it’s a question that can be tested empirically—and the answer might differ across models and across copyrighted works.

It’s common to talk about LLMs predicting the next token. But under the hood, what the model actually does is generate a probability distribution over all possibilities for the next token. For example, if you prompt an LLM with the phrase “Peanut butter and,” it will respond with a probability distribution that might look like this made-up example:

  • P(“jelly”) = 70 percent
  • P(“sugar”) = 9 percent
  • P(“peanut”) = 6 percent
  • P(“chocolate”) = 4 percent
  • P(“cream”) = 3 percent

And so forth.

After the model generates a list of probabilities like this, the system will select one of these options at random, weighted by their probabilities. So 70 percent of the time the system will generate “Peanut butter and jelly.” Nine percent of the time, we’ll get “Peanut butter and sugar.” Six percent of the time, it will be “Peanut butter and peanut.” You get the idea.
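
As a minimal sketch of that sampling step, the Python snippet below draws next tokens from the made-up distribution above. The "<other>" bucket is an assumption added here to absorb the remaining 8 percent that the list leaves unspecified, and `sample_next_token` is a hypothetical helper, not part of any real LLM library.

```python
import random
from collections import Counter

# Made-up next-token probabilities for the prompt "Peanut butter and".
# "<other>" is an assumed catch-all for the 8 percent the list omits.
NEXT_TOKEN_PROBS = {
    "jelly": 0.70,
    "sugar": 0.09,
    "peanut": 0.06,
    "chocolate": 0.04,
    "cream": 0.03,
    "<other>": 0.08,
}

def sample_next_token(probs, rng):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
NUM_SAMPLES = 10_000
counts = Counter(sample_next_token(NEXT_TOKEN_PROBS, rng) for _ in range(NUM_SAMPLES))

# Observed frequencies should land close to the probabilities above.
for token, count in counts.most_common():
    print(f"{token!r}: {count / NUM_SAMPLES:.1%}")
```

With enough samples, the observed share of each token settles near its listed probability, which is why counting outputs is one (expensive) way to estimate how likely a given response is.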

The study’s authors didn’t have to generate multiple outputs to estimate the likelihood of a particular response. Instead, they could calculate probabilities for each token and then multiply them together.

Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:

  • Prompt the model with “My favorite sandwich is,” and look up the probability of “peanut” (let’s say it’s 20 percent).
  • Prompt the model with “My favorite sandwich is peanut,” and look up the probability of “butter” (let’s say it’s 90 percent).
  • Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
  • Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).

Then we just have to multiply the probabilities like this:

0.2 × 0.9 × 0.8 × 0.7 = 0.1008

So we can predict that the model will produce “peanut butter and jelly” about 10 percent of the time, without actually generating 100 or 1,000 outputs and counting how many of them were that exact phrase.
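
The same arithmetic can be sketched in a few lines of Python. The per-token probabilities are the made-up values from the sandwich example, and `sequence_probability` is a hypothetical helper. Summing logarithms instead of multiplying directly is a common way to avoid numerical underflow on long sequences; it is not necessarily how the study's authors implemented it.

```python
import math

# Made-up conditional probabilities from the sandwich example:
# P(token | prompt plus all earlier tokens).
token_probs = [
    ("peanut", 0.2),
    ("butter", 0.9),
    ("and", 0.8),
    ("jelly", 0.7),
]

def sequence_probability(probs):
    """Multiply per-token probabilities by summing their logarithms."""
    log_prob = sum(math.log(p) for _, p in probs)
    return math.exp(log_prob)

p = sequence_probability(token_probs)
print(f"P('peanut butter and jelly') = {p:.4f}")  # prints 0.1008
```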

This technique greatly reduced the cost of the research, allowed the authors to analyze more books, and made it feasible to precisely estimate very low probabilities.

For example, the authors estimated that it would take more than 10 quadrillion samples to exactly reproduce some 50-token sequences from some books. Obviously, it wouldn’t be feasible to actually generate that many outputs. But it wasn’t necessary: the probability could be estimated just by multiplying the probabilities for the 50 tokens.

A key thing to notice is that probabilities can get really small really fast. In my made-up example, the probability that the model will produce the four tokens “peanut butter and jelly” is just 10 percent. If we added more tokens, the probability would get even lower. If we added 46 more tokens, the probability could fall by several orders of magnitude.
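
To see how quickly this compounds, here is a small Python sketch. The per-token probabilities (90, 70, and 50 percent) are illustrative assumptions, not figures from the study, and `uniform_sequence_probability` is a hypothetical helper that treats every token as equally likely. The expected number of samples needed to see one exact match is taken as 1/p.

```python
import math

SEQUENCE_LENGTH = 50  # tokens per target passage, as in the study

def uniform_sequence_probability(per_token_prob, length=SEQUENCE_LENGTH):
    """Probability of a whole sequence when every token has the same
    probability, computed in log space for numerical stability."""
    return math.exp(length * math.log(per_token_prob))

# Assumed average per-token probabilities, chosen only for illustration.
for per_token in (0.9, 0.7, 0.5):
    p = uniform_sequence_probability(per_token)
    print(f"per-token {per_token:.0%}: p = {p:.2e}, "
          f"about {1 / p:.2e} samples per exact match")
```

Even at a coin-flip 50 percent per token, an exact 50-token match would take more than a quadrillion samples on average, which is why estimating the probability directly is the only practical option.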

For any language model, the probability of generating any given 50-token sequence “by accident” is vanishingly small. If a model generates 50 tokens from a copyrighted work, that is strong evidence that the tokens “came from” the training data. This is true even if it only generates those tokens 10 percent, 1 percent, or 0.01 percent of the time.

The study authors took 36 books and divided each of them into overlapping 100-token passages. Using the first 50 tokens as a prompt, they calculated the probability that the next 50 tokens would be identical to the original passage. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word.

This definition is quite strict. For a 50-token sequence to have a probability greater than 50 percent, the average token in the passage needs a probability of roughly 98.6 percent or higher! Moreover, the authors only counted exact matches. They didn’t try to count cases where, for example, the model generated 48 or 49 tokens from the original passage but got one or two tokens wrong. If these cases were counted, the amount of memorization would be even higher.
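
The procedure and threshold described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' code: `continuation_probability` is a hypothetical stand-in for a real model query, the `stride` between overlapping passages is an assumed parameter (the article does not give it), and the toy model at the bottom exists only to make the example runnable.

```python
from typing import Callable, Sequence

PROMPT_LEN = 50   # tokens used as the prompt
TARGET_LEN = 50   # tokens the model must reproduce exactly
THRESHOLD = 0.5   # "memorized" if the probability exceeds 50 percent

def memorized_fraction(
    tokens: Sequence[str],
    continuation_probability: Callable[[Sequence[str], Sequence[str]], float],
    stride: int = 10,  # assumed spacing between passage starts
) -> float:
    """Share of overlapping 100-token passages whose second half the
    model would reproduce verbatim more than half the time."""
    window = PROMPT_LEN + TARGET_LEN
    hits = 0
    total = 0
    for start in range(0, len(tokens) - window + 1, stride):
        prompt = tokens[start:start + PROMPT_LEN]
        target = tokens[start + PROMPT_LEN:start + window]
        if continuation_probability(prompt, target) > THRESHOLD:
            hits += 1
        total += 1
    return hits / total if total else 0.0

# Per-token probability (geometric mean) at which a 50-token product
# reaches exactly 50 percent.
min_avg_token_prob = THRESHOLD ** (1 / TARGET_LEN)
print(f"required average token probability: {min_avg_token_prob:.4f}")

# Toy stand-in model: every token gets the same assumed probability.
def toy_model(per_token: float):
    return lambda prompt, target: per_token ** len(target)

book = [f"tok{i}" for i in range(1_000)]  # placeholder "book"
print(memorized_fraction(book, toy_model(0.99)))  # 0.99**50 is about 0.605
print(memorized_fraction(book, toy_model(0.98)))  # 0.98**50 is about 0.364
```

The toy run shows how sharp the cutoff is: a uniform 99 percent per token clears the bar for every passage, while 98 percent clears it for none.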

This research provides strong evidence that significant portions of Harry Potter and the Sorcerer’s Stone were copied into the weights of Llama 3.1 70B. But this finding doesn’t tell us why or how this happened. I suspect that part of the answer is that Llama 3 70B was trained on 15 trillion tokens—more than 10 times the 1.4 trillion tokens used to train Llama 1 65B.

The more times a model is trained on a particular example, the more likely it is to memorize that example. Perhaps Meta had trouble finding 15 trillion distinct tokens, so it trained on the Books3 dataset multiple times. Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.

I’m not sure that either of these explanations fully fits the facts. The fact that memorization was a much bigger problem for the most popular books does suggest that Llama may have been trained on secondary sources that quote these books rather than the books themselves. There are likely far more online discussions of Harry Potter than of Sandman Slim.

On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer’s Stone.

“If it were citations and quotations, you’d expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.

Or there could be another explanation entirely. Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem. I emailed Meta for comment last week but haven’t heard back.

“It doesn’t seem to be all popular books,” Mark Lemley told me. “Some popular books have this result and not others. It’s hard to come up with a clear story that says why that happened.”

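There are three distinct theories of how training a model on copyrighted works could infringe copyright:

  1. Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.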
  2. The training process copies information from the training data into the model, making the model a derivative work under copyright law.
  3. Infringement occurs when a model generates (portions of) a copyrighted work.

A lot of discussion so far has focused on the first theory because it is the most threatening to AI companies. If the courts uphold this theory, most current LLMs would be illegal, whether or not they have memorized any training data.

The AI industry has some pretty strong arguments that using copyrighted works during the training process is fair use under the 2015 Google Books ruling. But the fact that Llama 3.1 70B memorized large portions of Harry Potter could color how the courts consider these fair use questions.

A key part of fair use analysis is whether a use is “transformative,” that is, whether a company has made something new or is merely profiting from the work of others. The fact that language models are capable of regurgitating substantial portions of popular works like Harry Potter, 1984, and The Hobbit could cause judges to look at these fair use arguments more skeptically.

Moreover, one of Google’s key arguments in the books case was that its system was designed to never return more than a short excerpt from any book. If the judge in the Meta lawsuit wanted to distinguish Meta’s arguments from the ones Google made in the books case, he could point to the fact that Llama can generate far more than a few lines of Harry Potter.

The new study “complicates the story that the defendants have been telling in these cases,” co-author Mark Lemley told me. “Which is ‘we just learn word patterns. None of that shows up in the model.’”

But the Harry Potter result creates even more danger for Meta under that second theory—that Llama itself is a derivative copy of Rowling’s book.

“It’s clear that you can in fact extract substantial parts of Harry Potter and various other books from the model,” Lemley said. “That suggests to me that probably for some of those books there’s something the law would call a copy of part of the book in the model itself.”

The Google Books precedent probably can’t protect Meta against this second legal theory because Google never made its books database available for users to download—Google almost certainly would have lost the case if it had done that.

In principle, Meta could still convince a judge that copying 42 percent of Harry Potter was allowed under the flexible, judge-made doctrine of fair use. But it would be an uphill battle.

“The fair use analysis you’ve gotta do is not just ‘is the training set fair use,’ but ‘is the incorporation in the model fair use?’” Lemley said. “That complicates the defendants’ story.”

Grimmelmann also said there’s a danger that this research could put open-weight models in greater legal jeopardy than closed-weight ones. The Cornell and Stanford researchers could only do their work because they had access to the underlying model, and hence to the token probability values that allowed efficient calculation of probabilities for sequences of tokens.

Most leading labs, including OpenAI, Anthropic, and Google, have increasingly restricted access to these so-called logits, making it more difficult to study these models.

Moreover, if a company keeps model weights on its own servers, it can use filters to try to prevent infringing output from reaching the outside world. So even if the underlying OpenAI, Anthropic, and Google models have memorized copyrighted works in the same way as Llama 3.1 70B, it might be difficult for anyone outside the company to prove it.

Moreover, this kind of filtering makes it easier for companies with closed-weight models to invoke the Google Books precedent. In short, copyright law might create a strong disincentive for companies to release open-weight models.

“It’s kind of perverse,” Mark Lemley told me. “I don’t like that outcome.”

On the other hand, judges might conclude that it would be bad to effectively punish companies for publishing open-weight models.

“There’s a degree to which being open and sharing weights is a kind of public service,” Grimmelmann told me. “I could honestly see judges being less skeptical of Meta and others who provide open-weight models.”

Timothy B. Lee was on staff at Ars Technica from 2017 to 2021. Today, he writes Understanding AI, a newsletter that explores how AI works and how it’s changing our world.


Timothy is a senior reporter covering tech policy and the future of transportation. He lives in Washington DC.
