In authors’ bad books —
Copyright suits over AI training data reportedly decreasing AI transparency.
Book authors are suing Nvidia, alleging that the chipmaker’s AI platform NeMo—used to power customized chatbots—was trained on a controversial dataset that illegally copied and distributed their books without their consent.
In a proposed class action, novelists Abdi Nazemian (Like a Love Story), Brian Keene (Ghost Walk), and Stewart O’Nan (Last Night at the Lobster) argued that Nvidia should pay damages and destroy all copies of the Books3 dataset used to power NeMo large language models (LLMs).
The Books3 dataset, novelists argued, copied “all of Bibliotek,” a shadow library of approximately 196,640 pirated books. Initially shared through the AI community Hugging Face, the Books3 dataset today “is defunct and no longer accessible due to reported copyright infringement,” the Hugging Face website says.
According to the authors, Hugging Face removed the dataset last October, but not before AI companies like Nvidia grabbed it and “made multiple copies.” By training NeMo models on this dataset, the authors alleged that Nvidia “violated their exclusive rights under the Copyright Act.” The authors argued that the US district court in San Francisco must intervene and stop Nvidia because the company “has continued to make copies of the Infringed Works for training other models.”
A Hugging Face spokesperson clarified to Ars that “Hugging Face never removed this dataset, and we did not host the Books3 dataset on the Hub.” Instead, “Hugging Face hosted a script that downloads the data from The Eye, which is the place where ELeuther hosted the data,” until “Eleuther removed the data from The Eye” over copyright concerns, causing the dataset script on Hugging Face to break.
Nvidia did not immediately respond to Ars’ request to comment.
Demanding a jury trial, authors are hoping the court will rule that Nvidia has no possible defense for both allegedly violating copyrights and intending “to cause further infringement” by distributing NeMo models “as a base from which to build further models.”
AI models decreasing transparency amid suits
The class action was filed by the same legal team representing authors suing OpenAI, whose lawsuit recently saw many claims dismissed, but crucially not their claim of direct copyright infringement. Lawyers told Ars last month that authors would be amending their complaints against OpenAI and were “eager to move forward and litigate” their direct copyright infringement claim.
In that lawsuit, the authors alleged copyright infringement both when OpenAI trained LLMs and when chatbots referenced books in outputs. But authors seemed more concerned about alleged damages from chatbot outputs, warning that AI tools had an “uncanny ability to generate text similar to that found in copyrighted textual materials, including thousands of books.”
Uniquely, in the Nvidia suit, authors are focused exclusively on Nvidia’s training data, seemingly concerned that Nvidia could empower businesses to create any number of AI models on the controversial dataset, which could affect thousands of authors whose works could allegedly be broadly infringed just by training these models.
There’s no telling yet how courts will rule on the direct copyright claims in either lawsuit—or in the New York Times’ lawsuit against OpenAI—but so far, OpenAI has failed to convince courts to toss claims aside.
However, OpenAI doesn’t appear very shaken by the lawsuits. In February, OpenAI said that it expected to beat book authors’ direct copyright infringement claim at a “later stage” of the case and, most recently in the New York Times case, tried to convince the court that NYT “hacked” ChatGPT to “set up” the lawsuit.
And Microsoft, a co-defendant in the NYT lawsuit, even more recently introduced a new argument that could help tech companies defeat copyright suits over LLMs. Last month, Microsoft argued that The New York Times was attempting to stop a “groundbreaking new technology” and would fail, just like movie producers attempting to kill off the VCR in the 1980s.
“Despite The Times’s contentions, copyright law is no more an obstacle to the LLM than it was to the VCR (or the player piano, copy machine, personal computer, Internet, or search engine),” Microsoft wrote.
In December, Hugging Face’s machine learning and society lead, Yacine Jernite, noted that developers appeared to be growing less transparent about training data after copyright lawsuits raised red flags about companies using the Books3 dataset, “especially for commercial models.”
Meta, for example, “limited the amount of information [it] disclosed about” its LLM, Llama-2, “to a single paragraph description and one additional page of safety and bias analysis—after [its] use of the Books3 dataset when training the first Llama model was brought up in a copyright lawsuit,” Jernite wrote.
Jernite warned that AI models lacking transparency could hinder “the ability of regulatory safeguards to remain relevant as training methods evolve, of individuals to ensure that their rights are respected, and of open science and development to play their role in enabling democratic governance of new technologies.” To support “more accountability,” Jernite recommended “minimum meaningful public transparency standards to support effective AI regulation,” as well as companies providing options for anyone to opt out of their data being included in training data.
“More data transparency supports better governance and fosters technology development that more reliably respects peoples’ rights,” Jernite wrote.