copyright

He got sued for sharing public YouTube videos; nightmare ended in settlement


Librarian vows to stop invasive ed tech after ending lawsuit with Proctorio.

Librarian Ian Linkletter remains one of Proctorio’s biggest critics after 5-year legal battle. Credit: Ashley Linkletter

Nobody expects to get sued for re-posting a YouTube video on social media by using the “share” button, but librarian Ian Linkletter spent the past five years embroiled in a copyright fight after doing just that.

Now that a settlement has been reached, Linkletter told Ars why he thinks his 2020 tweets sharing public YouTube videos put a target on his back.

Linkletter’s legal nightmare started in 2020 after an education technology company, Proctorio, began monitoring student backlash on Reddit over its AI tool used to remotely scan rooms, identify students, and prevent cheating on exams. On Reddit, students echoed serious concerns raised by researchers, warning of privacy issues, racist and sexist biases, and barriers to students with disabilities.

At that time, Linkletter was a learning technology specialist at the University of British Columbia. He had been aware of Proctorio as a tool that some professors used, but he ultimately joined UBC students criticizing Proctorio, as, practically overnight, it became a default tool that every teacher relied on during the early stages of the pandemic.

To Linkletter, the AI tool not only seemed flawed, but it also seemingly made students more anxious about exams. However, he didn’t post any tweets criticizing the tech—until he grew particularly disturbed to see Proctorio’s CEO, Mike Olsen, “showing up in the comments” on Reddit to fire back at one of his university’s loudest student critics. Defending Proctorio, Olsen roused even more backlash by posting the student’s private chat logs publicly to prove the student “lied” about a support interaction, The Guardian reported.

“If you’re gonna lie bro … don’t do it when the company clearly has an entire transcript of your conversation,” Olsen wrote, later apologizing for the now-deleted post.

“That set me off, and I was just like, this is completely unacceptable for a CEO to be going after our students like this,” Linkletter told Ars.

The more that Linkletter researched Proctorio, the more concerned he became. Taking to then-Twitter, he posted a series of seven tweets over a couple of days that linked to YouTube videos that Proctorio hosted in its help center. He felt the videos (which showed how Proctorio flagged certain behaviors, tracked “abnormal” eye and head movements, and scanned rooms) helped demonstrate why students were so upset. And while he had fewer than 1,000 followers, he hoped that the influential higher education administrators who followed him would see his posts and consider dropping the tech.

Rather than request that Linkletter remove the tweets, which was the company’s standard practice, Proctorio moved quickly to delete the videos. Proctorio supposedly expected that the removals would put Linkletter on notice to stop tweeting out help center videos. Instead, Linkletter posted a screenshot of the help center showing all the disabled videos, while suggesting that Proctorio seemed so invested in secrecy that it was willing to gut its own support resources to censor criticism of its tools.

Together, the videos, the help center screenshot, and another screenshot showing course material describing how Proctorio works were enough for Proctorio to take Linkletter to court.

The ed tech company promptly filed a lawsuit and obtained a temporary injunction by spuriously claiming that Linkletter shared private YouTube videos containing confidential information. Because the YouTube videos—which were public but “unlisted” when Linkletter shared them—had been removed, Linkletter did not have to delete the seven tweets that initially caught Proctorio’s attention, but the injunction required that he remove two tweets, including the screenshots.

In the five years since, the legal fight dragged on, with no end in sight until last week, as Canadian courts tangled with copyright allegations that tested a recently passed law intended to shield Canadian rights to free expression, the Protection of Public Participation Act.

To fund his defense, Linkletter said in a blog announcing the settlement that he invested his life savings “ten times over.” Additionally, about 900 GoFundMe supporters and thousands of members of the Association of Administrative and Professional Staff at UBC contributed tens of thousands more. For the last year of the battle, a law firm, Norton Rose Fulbright, agreed to represent him on a pro bono basis, which Linkletter said “was a huge relief to me, as it meant I could defend myself all the way if Proctorio chose to proceed with the litigation.”

The terms of the settlement remain confidential, but both Linkletter and Proctorio confirmed that no money was exchanged.

For Proctorio, the settlement made permanent the injunction that restricted Linkletter from posting the company’s help center or instructional materials. But it doesn’t stop Linkletter from remaining the company’s biggest critic, as “there are no other restrictions on my freedom of expression,” Linkletter’s blog noted.

“I’ve won my life back!” Linkletter wrote, while reassuring his supporters that he’s “fine” with how things ended.

“It doesn’t take much imagination to understand why Proctorio is a nightmare for students,” Linkletter wrote. “I can say everything that matters about Proctorio using public information.”

Proctorio’s YouTube “mistake” triggered injunction

In a statement to Ars, Kevin Rockmael, Proctorio’s head of marketing, suggested that the ed tech company sees the settlement as a win.

“After years of successful litigation, we are pleased that this settlement (which did not include any monetary compensation) protects our interests by making our initial restraining order permanent,” Rockmael said. “Most importantly, we are glad to close this chapter and focus our efforts on helping teachers and educational institutions deliver valuable and secure assessments.”

Responding to Rockmael, Linkletter clarified that the settlement upholds a modified injunction, noting that Proctorio’s initial injunction was significantly narrowed after a court ruled it overly broad. Linkletter also pointed to testimony from Proctorio’s former head of marketing, John Devoy, whose affidavit “mistakenly” swearing that Linkletter was sharing private YouTube videos was the sole basis for the court approving the injunction. That testimony, Linkletter told Ars, suggested that Proctorio knew that the librarian had shared videos the company had accidentally made public and used that as “some sort of excuse to pull the trigger” on a lawsuit after Linkletter commented on the subreddit incident.

“Even a child understands how YouTube works, so how are we supposed to trust a surveillance company that doesn’t?” Linkletter wrote in his blog.

Grilled by Linkletter’s lawyer, Devoy insisted that he was not “lying” when he claimed the videos Linkletter shared came from a private channel. Instead—even though he knew the difference between a private and public channel—Devoy claimed that he made a simple mistake, even suggesting that the inaccurate claim was a “typo.”

Linkletter maintains that Proctorio’s lawsuit had nothing to do with the videos he shared, which his legal team discovered had been shared publicly by many parties, including UBC, none of which Proctorio decided to sue. Instead, he felt the company targeted him to silence his criticism, and he successfully fought Proctorio’s bid to access his private communications, which seemed to be a fishing expedition to find other critics to monitor.

“In my opinion, and this is just my opinion, one of the purposes of the lawsuit was to have a chilling effect on public discourse around proctoring,” Linkletter told Ars. “And it worked. I mean, a lot of people were scared to use the word Proctorio, especially in writing.”

Joe Mullin, a senior policy analyst who monitored Linkletter’s case for the nonprofit digital rights group the Electronic Frontier Foundation, agreed that Proctorio’s lawsuit risked chilling speech.

“We’re glad to see this lawsuit finally resolved in a way that protects Ian Linkletter’s freedom to speak out,” Mullin told Ars, noting that Linkletter “raised serious concerns about proctoring software at a time when students were subjected to unprecedented monitoring.”

“This case should never have dragged on for five years,” Mullin said. “Using copyright claims to retaliate against critics is wrong, and it chills public debate about surveillance technology.”

Preventing the “next” Proctorio

Linkletter is not the only critic to be targeted by Proctorio, Lia Holland, campaigns and communications director for a nonprofit digital rights group called Fight for the Future, told Ars.

Holland’s group was subpoenaed in a US fight after Proctorio sent a copyright infringement notice to Erik Johnson, a then-18-year-old college freshman who shared one of Linkletter’s screenshots. The ensuing litigation was similarly settled after Proctorio “threw every semi-plausible legal weapon at Johnson full force,” Holland told Ars. The pressure forced Johnson to choose between “living his life and his life being this suit from Proctorio,” Holland said.

Linkletter suspected that he and Johnson were added to a “list” of critics that Proctorio closely monitored online, but Proctorio has denied that such a list exists. Holland pushed back, though, telling Ars that Proctorio has “an incredibly long history of fudging the truth in the interest of profit.”

“We’re no strangers to Proctorio’s shady practices when it comes to oppressing dissent or criticism of their technologies,” Holland said. “I am utterly not shocked that they would employ tactics that appear to be doing the same thing when it comes to Ian Linkletter’s case.”

Regardless of Proctorio’s tactics for brand management, it seems clear that public criticism has impacted Proctorio’s sales. In 2021, Vice reported that student backlash led some schools to quickly abandon the software. UBC dropped Proctorio in 2021, too, citing “ethical concerns.”

Today, Linkletter works as an emerging technology and open education librarian at the British Columbia Institute of Technology (BCIT). While he considers himself an expert on Proctorio and continues to give lectures discussing harms of academic surveillance software, he’s ready to get away from discussing Proctorio now that the lawsuit has ended.

“I think I will continue to pay attention to what they do and say, and if there’s any new reports of harm that I can elevate,” Linkletter told Ars. “But I have definitely made my points in terms of my specific concerns, and I feel less obliged to spend more and more and more time repeating myself.”

Instead, Linkletter is determined to “prevent the next Proctorio” from potentially blindsiding students on his campus. In his role as vice chair of BCIT’s educational technology and learning design committee, he’s establishing “checks and balances” to ensure that if another pandemic-like situation arises forcing every student to work from home, he can stop “a bunch of creepy stuff” from being rolled out.

“I spent the last year advocating for and implementing algorithmic impact assessments as a mandatory thing that the institute has to do, including identifying how risk is going to be mitigated before we approve any new ed tech ever again,” Linkletter explained.

He also created the Canadian Privacy Library, where he posts privacy impact assessments that he collects by sending freedom-of-information requests to higher education institutions in British Columbia. That’s one way local students could monitor privacy concerns as AI use expands across campuses, increasingly impacting not just how exams are proctored, but how assignments are graded.

Holland told Ars that students concerned about ed tech surveillance “are most powerful when they act in solidarity with each other.” While the pandemic was widely forcing remote learning, student groups were able to successfully remove harmful proctoring tech by “working together so that there was not one single scapegoat or one single face that the ed tech company could go after,” she suggested. Those movements typically start with one or two students learning how the technology works, so that they can educate others about top concerns, Holland said.

Since Linkletter’s lawsuit started, Proctorio has stopped fighting with students on Reddit and suing critics over tweets, Holland said. But Linkletter told Ars that the company still seems to leave students in the dark when it comes to how its software works, and that “could lead to academic discipline for honest students, and unnecessary stress for everyone,” his earliest court filing defending his tweets said.

“I was and am gravely concerned about Proctorio’s lack of transparency about how its algorithms work, and how it labels student behaviours as ‘suspicious,’” Linkletter swore in the filing. One of his deleted tweets urged all schools to demand transparency and to ask why Proctorio was “hiding” information about how the software worked. But in the end, Linkletter saw no point in continuing to argue over whether two deleted tweets re-posting Proctorio’s videos using YouTube’s sharing tool violated Proctorio’s copyrights.

“I didn’t feel too censored,” Linkletter told Ars. “But yeah, I guess it’s censorship, and I do believe they filed it to try and censor me. But as you can see, I just refused to go down, and I remained their biggest critic.”

As universities prepare to break ahead of the winter holidays, Linkletter told Ars that he’s looking forward to a change in dinner table conversation topics.

“It’s one of those things where I’m 41 and I have aging parents, and I’ve had to waste the last five Christmases talking to them about the lawsuit and their concerns about me,” Linkletter said. “So I’m really looking forward to this Thanksgiving, this Christmas, with this all behind me and the ability to just focus with my parents and my family.”

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.

Researchers isolate memorization from reasoning in AI neural networks


The hills and valleys of knowledge

Basic arithmetic ability lives in the memorization pathways, not logic circuits.

When engineers build AI language models like GPT-5 from training data, at least two major processing features emerge: memorization (reciting exact text they’ve seen before, like famous quotes or passages from books) and reasoning (solving new problems using general principles). New research from AI startup Goodfire.ai provides the first potentially clear evidence that these different functions actually work through completely separate neural pathways in the model’s architecture.

The researchers discovered that this separation is remarkably clean. In a preprint paper released in late October, they described how, when they removed the memorization pathways, models lost 97 percent of their ability to recite training data verbatim but kept nearly all their “logical reasoning” ability intact.

For example, at layer 22 in Allen Institute for AI’s OLMo-7B language model, the bottom 50 percent of weight components showed 23 percent higher activation on memorized data, while the top 10 percent showed 26 percent higher activation on general, non-memorized text. This mechanistic split enabled the researchers to surgically remove memorization while preserving other capabilities.

Perhaps most surprisingly, the researchers found that arithmetic operations seem to share the same neural pathways as memorization rather than logical reasoning. When they removed memorization circuits, mathematical performance plummeted to 66 percent of baseline while logical tasks remained nearly untouched. This discovery may explain why AI language models notoriously struggle with math without the use of external tools. They’re attempting to recall arithmetic from a limited memorization table rather than computing it, like a student who memorized times tables but never learned how multiplication works. The finding suggests that at current scales, language models treat “2+2=4” more like a memorized fact than a logical operation.

It’s worth noting that “reasoning” in AI research covers a spectrum of abilities that don’t necessarily match what we might call reasoning in humans. The logical reasoning that survived memory removal in this latest research includes tasks like evaluating true/false statements and following if-then rules, which are essentially applying learned patterns to new inputs. This also differs from the deeper “mathematical reasoning” required for proofs or novel problem-solving, which current AI models struggle with even when their pattern-matching abilities remain intact.

Looking ahead, if the information removal techniques are developed further, AI companies could one day remove, say, copyrighted content, private information, or harmful memorized text from a neural network without destroying the model’s ability to perform transformative tasks. However, since neural networks store information in distributed ways that are still not completely understood, for the time being, the researchers say their method “cannot guarantee complete elimination of sensitive information.” These are early steps in a new research direction for AI.

Traveling the neural landscape

To understand how researchers from Goodfire distinguished memorization from reasoning in these neural networks, it helps to know about a concept in AI called the “loss landscape.” The “loss landscape” is a way of visualizing how wrong or right an AI model’s predictions are as you adjust its internal settings (which are called “weights”).

Imagine you’re tuning a complex machine with millions of dials. The “loss” measures the number of mistakes the machine makes: high loss means many errors, and low loss means few. The “landscape” is what you’d see if you could map out the error rate for every possible combination of dial settings.

During training, AI models essentially “roll downhill” in this landscape (gradient descent), adjusting their weights to find the valleys where they make the fewest mistakes. The weights a model settles on are what it then uses to produce outputs, like answers to questions.
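As a rough illustration of the "rolling downhill" idea, here is a minimal sketch of gradient descent on a one-dial toy loss. The loss function, learning rate, and starting point are all assumptions chosen for illustration; real language models adjust billions of weights with far more elaborate optimizers.

```python
# Toy loss for a one-dial "machine": error is lowest when the dial is at 3.0.
# (Hypothetical example; not taken from the paper.)
def loss(w: float) -> float:
    return (w - 3.0) ** 2


def grad(w: float) -> float:
    # Derivative of (w - 3)^2 with respect to w: the local slope of the landscape.
    return 2.0 * (w - 3.0)


def gradient_descent(w0: float, lr: float = 0.1, steps: int = 100) -> float:
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # take a small step in the downhill direction
    return w


w_final = gradient_descent(w0=-5.0)
print(f"final dial setting: {w_final:.6f}, loss: {loss(w_final):.2e}")
```

Starting from a poor setting, the dial moves steadily toward the bottom of the valley, where the toy loss is smallest.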

Figure 1: Overview of our approach. We collect activations and gradients from a sample of training data (a), which allows us to approximate loss curvature w.r.t. a weight matrix using K-FAC (b). We decompose these weight matrices into components (each the same size as the matrix), ordered from high to low curvature. In language models, we show that data from different tasks interacts with parts of the spectrum of components differently (c).

Figure 1 from the paper “From Memorization to Reasoning in the Spectrum of Loss Curvature.” Credit: Merullo et al.

The researchers analyzed the “curvature” of the loss landscapes of particular AI language models, measuring how sensitive the model’s performance is to small changes in different neural network weights. Sharp peaks and valleys represent high curvature (where tiny changes cause big effects), while flat plains represent low curvature (where changes have minimal impact).
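To make "curvature" concrete, the sketch below estimates the second derivative of two toy one-dimensional losses with a finite difference. The two loss functions and the step size are assumptions for illustration only; the researchers work with approximations over full weight matrices, not single dials.

```python
def second_derivative(f, w: float, eps: float = 1e-3) -> float:
    # Central finite-difference estimate of f''(w): how quickly the slope changes.
    return (f(w + eps) - 2.0 * f(w) + f(w - eps)) / (eps ** 2)


def sharp_loss(w: float) -> float:
    # A narrow valley: tiny changes to w cause big changes in loss.
    return 50.0 * w ** 2


def flat_loss(w: float) -> float:
    # A wide plain: changes to w barely move the loss.
    return 0.01 * w ** 2


sharp_curv = second_derivative(sharp_loss, 0.0)
flat_curv = second_derivative(flat_loss, 0.0)
print(f"sharp valley curvature: {sharp_curv:.4f}")
print(f"flat plain curvature:   {flat_curv:.4f}")
```

The sharp valley comes out with a curvature thousands of times larger than the flat plain, which is the kind of contrast the curvature analysis relies on.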

Using a technique called K-FAC (Kronecker-Factored Approximate Curvature), they found that individual memorized facts create sharp spikes in this landscape, but because each memorized item spikes in a different direction, when averaged together they create a flat profile. Meanwhile, reasoning abilities that many different inputs rely on maintain consistent moderate curves across the landscape, like rolling hills that remain roughly the same shape regardless of the direction from which you approach them.

“Directions that implement shared mechanisms used by many inputs add coherently and remain high-curvature on average,” the researchers write, describing reasoning pathways. In contrast, memorization uses “idiosyncratic sharp directions associated with specific examples” that appear flat when averaged across data.
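The averaging argument can be sketched with a toy model. Assume, purely for illustration, that every example contributes a moderate amount of curvature along one shared direction, plus one sharp spike along a direction unique to that example. The numbers and the one-direction-per-example layout are hypothetical simplifications, not the paper's setup.

```python
from typing import List

N_EXAMPLES = 100      # number of training examples in the toy
SHARED_CURV = 1.0     # moderate curvature every example adds to the shared direction
SPIKE_CURV = 10.0     # sharp curvature each example adds to its own private direction


def per_example_curvature(i: int) -> List[float]:
    # Direction 0 is the shared mechanism; direction 1 + i belongs only to example i.
    curv = [0.0] * (1 + N_EXAMPLES)
    curv[0] = SHARED_CURV
    curv[1 + i] = SPIKE_CURV
    return curv


def average_curvature() -> List[float]:
    total = [0.0] * (1 + N_EXAMPLES)
    for i in range(N_EXAMPLES):
        for d, c in enumerate(per_example_curvature(i)):
            total[d] += c
    return [t / N_EXAMPLES for t in total]


avg = average_curvature()
print(f"shared direction, averaged:  {avg[0]:.2f}")
print(f"private direction, averaged: {avg[1]:.2f}")
```

For any single example, its private spike is the sharpest direction. Averaged over all examples, though, each spike is diluted by the 99 examples that never touch it, while the shared direction keeps its full value.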

Different tasks reveal a spectrum of mechanisms

The researchers tested their technique on multiple AI systems to verify the findings held across different architectures. They primarily used Allen Institute’s OLMo-2 family of open language models, specifically the 7-billion and 1-billion parameter versions, chosen because their training data is openly accessible. For vision models, they trained custom 86-million parameter Vision Transformers (ViT-Base models) on ImageNet with intentionally mislabeled data to create controlled memorization. They also validated their findings against existing memorization removal methods like BalancedSubnet to establish performance benchmarks.

The team tested their discovery by selectively removing low-curvature weight components from these trained models. Memorized content dropped to 3.4 percent recall from nearly 100 percent. Meanwhile, logical reasoning tasks maintained 95 to 106 percent of baseline performance.
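The editing step can be sketched in miniature. The code below assumes a heavily simplified version of a Kronecker-factored approximation in which both factors (input activation statistics and output gradient statistics) are diagonal, so each individual weight is its own "component" with curvature equal to the product of its two factor entries. That diagonal assumption, the function names, and the numbers are all hypothetical; the paper decomposes weight matrices into more general components.

```python
from typing import List

Matrix = List[List[float]]


def component_curvatures(a_diag: List[float], g_diag: List[float]) -> Matrix:
    # With diagonal factors, the approximate curvature for weight W[j][i]
    # is the product of the output-side and input-side entries.
    return [[g * a for a in a_diag] for g in g_diag]


def ablate_low_curvature(weights: Matrix, a_diag: List[float],
                         g_diag: List[float], keep_fraction: float) -> Matrix:
    curv = component_curvatures(a_diag, g_diag)
    ranked = sorted((c for row in curv for c in row), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    threshold = ranked[n_keep - 1]
    # Keep weights at or above the threshold; zero out the flatter ones.
    # (Ties at the threshold are all kept in this sketch.)
    return [[w if c >= threshold else 0.0 for w, c in zip(w_row, c_row)]
            for w_row, c_row in zip(weights, curv)]


W = [[1.0, 2.0], [3.0, 4.0]]
a_diag = [4.0, 0.1]   # toy input-side (activation) second moments
g_diag = [2.0, 0.5]   # toy output-side (gradient) second moments
W_edited = ablate_low_curvature(W, a_diag, g_diag, keep_fraction=0.5)
print(W_edited)
```

In this toy, the two weights fed by the low-activity input are the flattest components, so they are the ones removed.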

These logical tasks included Boolean expression evaluation, logical deduction puzzles where solvers must track relationships like “if A is taller than B,” object tracking through multiple swaps, and benchmarks like BoolQ for yes/no reasoning, Winogrande for common sense inference, and OpenBookQA for science questions requiring reasoning from provided facts. Some tasks fell between these extremes, revealing a spectrum of mechanisms.

Mathematical operations and closed-book fact retrieval shared pathways with memorization, dropping to 66 to 86 percent performance after editing. The researchers found arithmetic particularly brittle. Even when models generated identical reasoning chains, they failed at the calculation step after low-curvature components were removed.

Figure 3: Sensitivity of different kinds of tasks to ablation of flatter eigenvectors. Parametric knowledge retrieval, arithmetic, and memorization are brittle, but openbook fact retrieval and logical reasoning is robust and maintain around 100% of original performance.

Figure 3 from the paper “From Memorization to Reasoning in the Spectrum of Loss Curvature.” Credit: Merullo et al.

“Arithmetic problems themselves are memorized at the 7B scale, or because they require narrowly used directions to do precise calculations,” the team explains. Open-book question answering, which relies on provided context rather than internal knowledge, proved most robust to the editing procedure, maintaining nearly full performance.

Curiously, the mechanism separation varied by information type. Common facts like country capitals barely changed after editing, while rare facts like company CEOs dropped 78 percent. This suggests models allocate distinct neural resources based on how frequently information appears in training.

The K-FAC technique outperformed existing memorization removal methods without needing training examples of memorized content. On unseen historical quotes, K-FAC achieved 16.1 percent memorization versus 60 percent for the previous best method, BalancedSubnet.

Vision transformers showed similar patterns. When trained with intentionally mislabeled images, the models developed distinct pathways for memorizing wrong labels versus learning correct patterns. Removing memorization pathways restored 66.5 percent accuracy on previously mislabeled images.

Limits of memory removal

However, the researchers acknowledged that their technique isn’t perfect. Once-removed memories might return if the model receives more training, as other research has shown that current unlearning methods only suppress information rather than completely erasing it from the neural network’s weights. That means the “forgotten” content can be reactivated with just a few training steps targeting those suppressed areas.

The researchers also can’t fully explain why some abilities, like math, break so easily when memorization is removed. It’s unclear whether the model actually memorized all its arithmetic or whether math just happens to use neural circuits similar to those used for memorization. Additionally, some sophisticated capabilities might look like memorization to their detection method, even when they’re actually complex reasoning patterns. Finally, the mathematical tools they use to measure the model’s “landscape” can become unreliable at the extremes, though this doesn’t affect the actual editing process.

Benj Edwards is Ars Technica’s Senior AI Reporter and founder of the site’s dedicated AI beat in 2022. He’s also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.

Internet Archive’s big battle with music publishers ends in settlement

A settlement has been reached in a lawsuit where music publishers sued the Internet Archive over the Great 78 Project, an effort to preserve early music recordings that only exist on brittle shellac records.

No details of the settlement have so far been released, but a court filing on Monday confirmed that the Internet Archive and UMG Recordings, Capitol Records, Sony Music Entertainment, and other record labels “have settled this matter.” More details may come in the next 45 days, when parties must submit filings to officially dismiss the lawsuit, but it’s unlikely the settlement amount will be publicly disclosed.

Days before the settlement was announced, record labels had indicated that everyone but the Internet Archive and its founder, Brewster Kahle, had agreed to sign a joint settlement, seemingly including the Great 78 Project’s recording engineer George Blood, who was also a target of the litigation. But in the days since, IA has gotten on board, posting a blog confirming that “the parties have reached a confidential resolution of all claims and will have no further public comment on this matter.”

For IA—which strove to digitize 3 million recordings to help historians document recording history—the lawsuit from music publishers could have meant financial ruin. Initially, record labels alleged that damages amounted to $400 million, claiming they lost streams when IA visitors played Great 78 recordings.

But despite IA arguing that the Great 78 recordings drew comparatively few downloads and streams, and a music publishing industry vet suggesting that damages were likely no more than $41,000, the labels intensified their attacks in March. In a court filing, the labels added so many more infringing works that the estimated damages increased to $700 million. It seemed like labels were intent on doubling down on a fight that, at least one sound historian suggested, the labels might one day regret.

Judge: Anthropic’s $1.5B settlement is being shoved “down the throat of authors”

At a hearing Monday, US district judge William Alsup blasted a proposed $1.5 billion settlement over Anthropic’s rampant piracy of books to train AI.

The proposed settlement comes in a case where Anthropic could have owed more than $1 trillion in damages after Alsup certified a class that included up to 7 million claimants whose works were illegally downloaded by the AI company.

Instead, critics fear Anthropic will get off cheaply, striking a deal with authors suing that covers fewer than 500,000 works and paying a small fraction of its total valuation (currently $183 billion) to get away with the massive theft. Defector noted that the settlement doesn’t even require Anthropic to admit wrongdoing, while the company continues raising billions based on models trained on authors’ works. Most recently, Anthropic raised $13 billion in a funding round, making back nearly nine times the proposed settlement amount after announcing the deal.

Alsup expressed grave concerns that lawyers rushed the deal, which he said now risks being shoved “down the throat of authors,” Bloomberg Law reported.

In an order, Alsup clarified why he thought the proposed settlement was a chaotic mess. The judge said he was “disappointed that counsel have left important questions to be answered in the future,” seeking approval for the settlement despite the Works List, the Class List, the Claim Form, and the process for notification, allocation, and dispute resolution all remaining unresolved.

Denying preliminary approval of the settlement, Alsup suggested that the agreement is “nowhere close to complete,” forcing Anthropic and authors’ lawyers to “recalibrate” the largest publicly reported copyright class-action settlement ever inked, Bloomberg reported.

Of particular concern, the settlement failed to outline how disbursements would be managed for works with multiple claimants, Alsup noted. Until all these details are ironed out, Alsup intends to withhold approval, the order said.

One big change the judge wants to see is the addition of instructions requiring “anyone with copyright ownership” to opt in, with the consequence that the work won’t be covered if even one rights holder opts out, Bloomberg reported. There should also be instructions that any disputes over ownership or submitted claims should be settled in state court, Alsup said.

Warner Bros. sues Midjourney to stop AI knockoffs of Batman, Scooby-Doo


AI would’ve gotten away with it too…

Warner Bros. case builds on arguments raised in a Disney/Universal lawsuit.

DVD art for the animated movie Scooby-Doo & Batman: The Brave and the Bold. Credit: Warner Bros. Discovery

Warner Bros. hit Midjourney with a lawsuit Thursday, crafting a complaint that strives to shoot down defenses that the AI company has already raised in a similar lawsuit filed by Disney and Universal Studios earlier this year.

The big film studios have alleged that Midjourney profits off image generation models trained to produce outputs of popular characters. For Disney and Universal, intellectual property rights to pop icons like Darth Vader and the Simpsons were allegedly infringed. And now, the WB complaint defends rights over comic characters like Superman, Wonder Woman, and Batman, as well as characters considered “pillars of pop culture with a lasting impact on generations,” like Scooby-Doo and Bugs Bunny, and modern cartoon characters like Rick and Morty.

“Midjourney brazenly dispenses Warner Bros. Discovery’s intellectual property as if it were its own,” the WB complaint said, accusing Midjourney of allowing subscribers to “pick iconic” copyrighted characters and generate them in “every imaginable scene.”

Planning to seize Midjourney’s profits from allegedly using beloved characters to promote its service, Warner Bros. described Midjourney as “defiant and undeterred” by the Disney/Universal lawsuit. Despite that litigation, WB claimed that Midjourney has recently removed copyright protections in its supposedly shameful ongoing bid for profits. Nothing but a permanent injunction will end Midjourney’s outputs of allegedly “countless infringing images,” WB argued, branding Midjourney’s alleged infringements as “vast, intentional, and unrelenting.”

Examples of closely matching outputs include responses to prompts for “screencaps” showing specific movie frames. At least one artist, Reid Southen, had optimistically predicted last year that Midjourney would block that search term, but it apparently did not.

Here are some examples included in WB’s complaint:

Midjourney’s output for the prompt, “Superman, classic cartoon character, DC comics.”

Midjourney could face devastating financial consequences in a loss. At trial, WB is hoping discovery will show the true extent of Midjourney’s alleged infringement, asking the court for maximum statutory damages, at $150,000 per infringing output. Just 2,000 infringing outputs unearthed could cost Midjourney more than its total revenue for 2024, which was approximately $300 million, the WB complaint said.
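As a rough check on that math, the sketch below multiplies the maximum statutory award by a count of infringing outputs. It assumes every output would draw the full $150,000, which a court would not be required to award, and the function and constant names are illustrative rather than anything taken from the complaint.

```python
# Back-of-the-envelope check of the damages exposure described in the complaint.
# Figures come from the article; all names here are illustrative assumptions.

MAX_DAMAGES_PER_OUTPUT = 150_000     # dollars, maximum statutory damages WB is seeking
REVENUE_2024_APPROX = 300_000_000    # dollars, approximate 2024 revenue cited in the complaint


def damages_exposure(infringing_outputs: int) -> int:
    """Total exposure if every output drew the maximum statutory award."""
    return infringing_outputs * MAX_DAMAGES_PER_OUTPUT


def outputs_to_match_revenue() -> int:
    """Smallest number of maximum awards whose total reaches the revenue figure."""
    # Ceiling division using integers only.
    return -(-REVENUE_2024_APPROX // MAX_DAMAGES_PER_OUTPUT)


if __name__ == "__main__":
    print(damages_exposure(2_000))
    print(outputs_to_match_revenue())
```

Under those assumptions, 2,000 outputs at the maximum award total $300 million, which is on par with the approximate revenue figure, so any additional output found in discovery would push the exposure past it.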

Warner Bros. hopes to hobble Midjourney’s best defense

For Midjourney, the WB complaint could potentially hit harder than the Disney/Universal lawsuit. WB’s complaint shows how closely studios are monitoring AI copyright litigation, likely choosing ideal moments to strike when studios feel they can better defend their property. So, while much of WB’s complaint echoes Disney and Universal’s arguments—which Midjourney has already begun defending against—IP attorney Randy McCarthy suggested in statements provided to Ars that WB also looked for seemingly smart ways to potentially overcome some of Midjourney’s best defenses when filing its complaint.

WB likely took note when Midjourney filed its response to the Disney/Universal lawsuit last month, arguing that its system is “trained on billions of publicly available images” and generates images not by retrieving a copy of an image in its database but based on “complex statistical relationships between visual features and words in the text-image pairs” that “are encoded within the model.”

This defense could allow Midjourney to avoid claims that it copied WB images and distributes copies through its models. But hoping to dodge this defense, WB didn’t argue that Midjourney retains copies of its images. Rather, the entertainment giant raised a more nuanced argument that:

“Midjourney used software, servers, and other technology to store and fix data associated with Warner Bros. Discovery’s Copyrighted Works in such a manner that those works are thereby embodied in the model, from which Midjourney is then able to generate, reproduce, publicly display, and distribute unlimited “copies” and “derivative works” of Warner Bros. Discovery’s works as defined by the Copyright Act.”

McCarthy noted that WB’s argument pushes the court to at least consider that even though “Midjourney does not store copies of the works in its model,” its system “nonetheless accesses the data relating to the works that are stored by Midjourney’s system.”

“This seems to be a very clever way to counter MJ’s ‘statistical pattern analysis’ arguments,” McCarthy said.

If it’s a winning argument, that could give WB a path to wipe Midjourney’s models. WB argued that each time Midjourney provides a “substantially new” version of its image generator, it “repeats this process.” And that ongoing activity—due to Midjourney’s initial allegedly “massive copying” of WB works—allows Midjourney to “further reproduce, publicly display, publicly perform, and distribute image and video outputs that are identical or virtually identical to Warner Bros. Discovery’s Copyrighted Works in response to simple prompts from subscribers.”

Perhaps further strengthening WB’s argument, the lawsuit noted that Midjourney promotes allegedly infringing outputs on its 24/7 YouTube channel and appears to have plans to compete with traditional TV and streaming services. Asking the court to block Midjourney’s outputs, WB claims it’s already been “substantially and irreparably harmed” and risks further damages if the AI image generator is left unchecked.

As alleged proof that the AI company knows its tool is being used to infringe WB property, WB pointed to Midjourney’s own Discord server and subreddit, where users post outputs depicting WB characters and share tips to help others do the same. They also called out Midjourney’s “Explore” page, which allows users to drop a WB-referencing output into the prompt field to generate similar images.

“It is hard to imagine copyright infringement that is any more willful than what Midjourney is doing here,” the WB complaint said.

WB and Midjourney did not immediately respond to Ars’ request to comment.

Midjourney slammed for promising “fewer blocked jobs”

McCarthy noted that WB’s legal strategy differs in other ways from the arguments Midjourney’s already weighed in the Disney/Universal lawsuit.

The WB complaint also anticipates Midjourney’s likely defense that users are generating infringing outputs, not Midjourney, which could invalidate any charges of direct copyright infringement.

In the Disney/Universal lawsuit, Midjourney argued that courts have recently found that AI tools referencing copyrighted works is “a quintessentially transformative fair use,” accusing studios of trying to censor “an instrument for user expression.” The company claimed that it cannot know about infringing outputs unless studios use its DMCA process, while noting that subscribers have “any number of legitimate, noninfringing grounds to create images incorporating characters from popular culture,” including “non-commercial fan art, experimentation and ideation, and social commentary and criticism.”

To avoid losing on that front, the WB complaint doesn’t depend on a ruling that Midjourney directly infringed copyrights. Instead, the complaint “more fully” emphasizes how Midjourney may be “secondarily liable for infringement via contributory, inducement and/or vicarious liability by inducing its users to directly infringe,” McCarthy suggested.

Additionally, WB’s complaint “seems to be emphasizing” that Midjourney “allegedly has the technical means to prevent its system from accepting prompts that directly reference copyrighted characters,” and “that would prevent infringing outputs from being displayed,” McCarthy said.

The complaint noted that Midjourney is in full control of what outputs can be generated. Noting that Midjourney “temporarily refused to ‘animate’” outputs of WB characters after launching video generations, the lawsuit appears to have been filed in response to Midjourney “deliberately” removing those protections and then announcing that subscribers would experience “fewer blocked jobs.”

Together, these arguments “appear to be intended to lead to the inference that Midjourney is willfully enticing its users to infringe,” McCarthy said.

WB’s complaint details simple user prompts that generate allegedly infringing outputs without any need to manipulate the system. The ease of generating popular characters seems to make Midjourney a destination for users frustrated by other AI image generators that make it harder to generate infringing outputs, WB alleged.

On top of that, Midjourney also infringes copyrights by generating WB characters, “even in response to generic prompts like ‘classic comic book superhero battle.’” And while Midjourney has seemingly taken steps to block WB characters from appearing on its “Explore” page, where users can find inspiration for prompts, these guardrails aren’t perfect, but rather “spotty and suspicious,” WB alleged. Supposedly, searches for correctly spelled character names like “Batman” are blocked, but any user who accidentally or intentionally misspells a character’s name like “Batma” can learn an easy way to work around that block.

Additionally, WB alleged, “the outputs often contain extensive nuance and detail, background elements, costumes, and accessories beyond what was specified in the prompt.” And every time that Midjourney outputs an allegedly infringing image, it “also trains on the outputs it has generated,” the lawsuit noted, creating a never-ending cycle of continually enhanced AI fakes of pop icons.

Midjourney could slow down the cycle and “minimize” these allegedly infringing outputs, if it cannot automatically block them all, WB suggested. But instead, “Midjourney has made a calculated and profit-driven decision to offer zero protection for copyright owners even though Midjourney knows about the breathtaking scope of its piracy and copyright infringement,” WB alleged.

Fearing a supposed scheme to replace WB in the market by stealing its best-known characters, WB accused Midjourney of willfully allowing WB characters to be generated in order to “generate more money for Midjourney” to potentially compete in streaming markets.

Midjourney will remove protections “on a whim”

As Midjourney’s efforts to expand its features escalate, WB claimed that trust is lost. Even if Midjourney takes steps to address rightsholders’ concerns, WB argued, studios must remain watchful of every upgrade, since apparently, “Midjourney can and will remove copyright protection measures on a whim.”

The complaint noted that Midjourney just this week announced “plans to continue deploying new versions” of its image generator, promising to make it easier to search for and save popular artists’ styles—updating a feature that many artists loathe.

Without an injunction, Midjourney’s alleged infringement could interfere with WB’s licensing opportunities for its content, while “illegally and unfairly” diverting customers who buy WB products like posters, wall art, prints, and coloring books, the complaint said.

Perhaps Midjourney’s strongest defense could be efforts to prove that WB benefits from its image generator. In the Disney/Universal lawsuit, Midjourney pointed out that studios “benefit from generative AI models,” claiming that “many dozens of Midjourney subscribers are associated with” Disney and Universal corporate email addresses. If WB corporate email addresses are found among subscribers, Midjourney could claim that WB is trying to “have it both ways” by “seeking to profit” from AI tools while preventing Midjourney and its subscribers from doing the same.

McCarthy suggested it’s too soon to say how the WB battle will play out, but Midjourney’s response will reveal how it intends to shift tactics to avoid courts potentially picking apart its defense of its training data, while keeping any blame for copyright-infringing outputs squarely on users.

“As with the Disney/Universal lawsuit, we need to wait to see how Midjourney answers these latest allegations,” McCarthy said. “It is definitely an interesting development that will have widespread implications for many sectors of our society.”

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.


NYT to start searching deleted ChatGPT logs after beating OpenAI in court


What are the odds NYT will access your ChatGPT logs in OpenAI court battle?

Last week, OpenAI raised objections in court, hoping to overturn a court order requiring the AI company to retain all ChatGPT logs “indefinitely,” including deleted and temporary chats.

But Sidney Stein, the US district judge reviewing OpenAI’s request, immediately denied OpenAI’s objections. He was seemingly unmoved by the company’s claims that the order forced OpenAI to abandon “long-standing privacy norms” and weaken privacy protections that users expect based on ChatGPT’s terms of service. Rather, Stein suggested that OpenAI’s user agreement specified that their data could be retained as part of a legal process, which Stein said is exactly what is happening now.

The order was issued by magistrate judge Ona Wang just days after news organizations, led by The New York Times, requested it. The news plaintiffs claimed the order was urgently needed to preserve potential evidence in their copyright case, alleging that ChatGPT users are likely to delete chats where they attempted to use the chatbot to skirt paywalls to access news content.

A spokesperson told Ars that OpenAI plans to “keep fighting” the order, but the ChatGPT maker seems to have few options left. They could possibly petition the Second Circuit Court of Appeals for a rarely granted emergency order that could intervene to block Wang’s order, but the appeals court would have to consider Wang’s order an extraordinary abuse of discretion for OpenAI to win that fight.

OpenAI’s spokesperson declined to confirm if the company plans to pursue this extreme remedy.

In the meantime, OpenAI is negotiating a process that will allow news plaintiffs to search through the retained data. Perhaps the sooner that process begins, the sooner the data will be deleted. And that possibility puts OpenAI in the difficult position of having to choose between either caving to some data collection to stop retaining data as soon as possible or prolonging the fight over the order and potentially putting more users’ private conversations at risk of exposure through litigation or, worse, a data breach.

News orgs will soon start searching ChatGPT logs

The clock is ticking, and so far, OpenAI has not provided any official updates since a June 5 blog post detailing which ChatGPT users will be affected.

While it’s clear that OpenAI has been and will continue to retain mounds of data, it would be impossible for The New York Times or any news plaintiff to search through all that data.

Instead, only a small sample of the data will likely be accessed, based on keywords that OpenAI and news plaintiffs agree on. That data will remain on OpenAI’s servers, where it will be anonymized, and it will likely never be directly produced to plaintiffs.

The two sides are negotiating the exact process for searching through the chat logs, and both seemingly hope to minimize the amount of time the chat logs will be preserved.

For OpenAI, sharing the logs risks revealing instances of infringing outputs that could further spike damages in the case. The logs could also expose how often outputs attribute misinformation to news plaintiffs.

But for news plaintiffs, accessing the logs is not considered key to their case—perhaps providing additional examples of copying—but could help news organizations argue that ChatGPT dilutes the market for their content. That could weigh against the fair use argument, as a judge opined in a recent ruling that evidence of market dilution could tip an AI copyright case in favor of plaintiffs.

Jay Edelson, a leading consumer privacy lawyer, told Ars that he’s concerned that judges don’t seem to be considering that any evidence in the ChatGPT logs wouldn’t “advance” news plaintiffs’ case “at all,” while really changing “a product that people are using on a daily basis.”

Edelson warned that OpenAI itself probably has better security than most firms to protect against a potential data breach that could expose these private chat logs. But “lawyers have notoriously been pretty bad about securing data,” Edelson suggested, so “the idea that you’ve got a bunch of lawyers who are going to be doing whatever they are” with “some of the most sensitive data on the planet” and “they’re the ones protecting it against hackers should make everyone uneasy.”

So even though odds are pretty good that the majority of users’ chats won’t end up in the sample, Edelson said the mere threat of being included might push some users to rethink how they use AI. He further warned that ChatGPT users turning to OpenAI rival services like Anthropic’s Claude or Google’s Gemini could suggest that Wang’s order is improperly influencing market forces, which also seems “crazy.”

To Edelson, the most “cynical” take could be that news plaintiffs are possibly hoping the order will threaten OpenAI’s business to the point where the AI company agrees to a settlement.

Regardless of the news plaintiffs’ motives, the order sets an alarming precedent, Edelson said. He joined critics suggesting that more AI data may be frozen in the future, potentially affecting even more users as a result of the sweeping order surviving scrutiny in this case. Imagine if litigation one day targets Google’s AI search summaries, Edelson suggested.

Lawyer slams judges for giving ChatGPT users no voice

Edelson told Ars that the order is so potentially threatening to OpenAI’s business that the company may not have a choice but to explore every path available to continue fighting it.

“They will absolutely do something to try to stop this,” Edelson predicted, calling the order “bonkers” for overlooking millions of users’ privacy concerns while “strangely” excluding enterprise customers.

From court filings, it seems possible that enterprise users were excluded to protect OpenAI’s competitiveness, but Edelson suggested there’s “no logic” to their exclusion “at all.” By excluding these ChatGPT users, the judge’s order may have removed the users best resourced to fight the order, Edelson suggested.

“What that means is the big businesses, the ones who have the power, all of their stuff remains private, and no one can touch that,” Edelson said.

Instead, the order is “only going to intrude on the privacy of the common people out there,” which Edelson said “is really offensive,” given that Wang denied two ChatGPT users’ panicked request to intervene.

“We are talking about billions of chats that are now going to be preserved when they weren’t going to be preserved before,” Edelson said, noting that he’s input information about his personal medical history into ChatGPT. “People ask for advice about their marriages, express concerns about losing jobs. They say really personal things. And one of the bargains in dealing with OpenAI is that you’re allowed to delete your chats and you’re allowed to [use] temporary chats.”

The greatest risk to users would be a data breach, Edelson said, but that’s not the only potential privacy concern. Corynne McSherry, legal director for the digital rights group the Electronic Frontier Foundation, previously told Ars that as long as users’ data is retained, it could also be exposed through future law enforcement and private litigation requests.

Edelson pointed out that most privacy attorneys don’t consider OpenAI CEO Sam Altman to be a “privacy guy,” despite Altman recently slamming the NYT, alleging it sued OpenAI because it doesn’t “like user privacy.”

“He’s trying to protect OpenAI, and he does not give a hoot about the privacy rights of consumers,” Edelson said, echoing one ChatGPT user’s dismissed concern that OpenAI may not prioritize users’ privacy concerns in the case if it’s financially motivated to resolve the case.

“The idea that he and his lawyers are really going to be the safeguards here isn’t very compelling,” Edelson said. He criticized the judges for dismissing users’ concerns and rejecting OpenAI’s request that users get a chance to testify.

“What’s really most appalling to me is the people who are being affected have had no voice in it,” Edelson said.

Ashley is a senior policy reporter for Ars Technica, dedicated to tracking social impacts of emerging policies and new technologies. She is a Chicago-based journalist with 20 years of experience.


In a wild time for copyright law, the US Copyright Office has no leader


Rudderless Copyright Office has taken on new prominence during the AI boom.

It’s a tumultuous time for copyright in the United States, with dozens of potentially economy-shaking AI copyright lawsuits winding through the courts. It’s also the most turbulent moment in the US Copyright Office’s history. Described as “sleepy” in the past, the Copyright Office has taken on new prominence during the AI boom, issuing key rulings about AI and copyright. It also hasn’t had a leader in more than a month.

In May, Copyright Register Shira Perlmutter was abruptly fired by email by the White House’s deputy director of personnel. Perlmutter is now suing the Trump administration, alleging that her firing was invalid; the government maintains that the executive branch has the authority to dismiss her. As the legality of the ouster is debated, the reality within the office is this: There’s effectively nobody in charge. And without a leader actually showing up at work, the Copyright Office is not totally business-as-usual; in fact, there’s debate over whether the copyright certificates it’s issuing could be challenged.

The firing followed a pattern. The USCO is part of the Library of Congress; Perlmutter had been appointed to her role by Librarian of Congress Carla Hayden. A few days before Perlmutter’s dismissal, Hayden, who had been in her role since 2016, was also fired by the White House via email. The White House appointed Deputy Attorney General Todd Blanche, who had previously served as President Trump’s defense attorney, as the new acting Librarian of Congress.

Two days after Perlmutter’s firing, Justice Department official Paul Perkins showed up at the Copyright Office, along with his colleague Brian Nieves. According to an affidavit from Perlmutter, they were carrying “printed versions of emails” from Blanche indicating that they had been appointed to new roles within the Copyright Office. Perkins, the email said, was designated as Acting Register of Copyrights. In other words, he was Perlmutter’s replacement.

But was Blanche actually the acting Librarian, and thus able to appoint Perkins as such? Within the Library of Congress, someone else had already assumed the role—Robert Newlen, Hayden’s former second-in-command, who has worked at the LOC since the 1970s. Following Hayden’s ouster, Newlen emailed LOC staff asserting that he was the acting Librarian—never mentioning Blanche—and noting that “Congress is engaged with the White House” on how to proceed.

In her lawsuit, Perlmutter argues that only the Librarian of Congress can fire and appoint a new Register. In a filing on Tuesday, defendants argued that the president does indeed have the authority to fire and appoint the Librarian of Congress and that his appointees then have the ability to choose a new Copyright Register.

Neither the Department of Justice nor the White House responded to requests for comment on this issue; the Library of Congress declined to comment.

Perkins and Nieves did not enter the USCO office or assume the roles they purported to fill the day they showed up. And since they left, sources within the Library of Congress tell WIRED, they have never returned, nor have they assumed any of the duties associated with the roles. These sources say that Congress is in talks with the White House to reach an agreement over these personnel disputes.

A congressional aide familiar with the situation told WIRED that Blanche, Perkins, and Nieves had not shown up for work “because they don’t have jobs to show up to.” The aide continued: “As we’ve always maintained, the President has no authority to appoint them. Robert Newlen has always been the Acting Librarian of Congress.”

If talks are happening, they remain out of public view. But Perlmutter does have some members of Congress openly on her side. “The president has no authority to remove the Register of Copyrights. That power lies solely with the Librarian of Congress. I’m relieved that the situation at the Library and Copyright Office has stabilized following the administration’s unconstitutional attempt to seize control for the executive branch. I look forward to quickly resolving this matter in a bipartisan way,” Senator Alex Padilla tells WIRED in a statement.

In the meantime, the Copyright Office is in the odd position of attempting to carry on as though it wasn’t missing its head. Immediately after Perlmutter’s dismissal, the Copyright Office paused issuing registration certificates “out of an abundance of caution,” according to USCO spokesperson Lisa Berardi Marflak, who says the pause impacted around 20,000 registrations. It resumed activities on May 29 but is now sending out registration certificates with a blank spot where Perlmutter’s signature would ordinarily be.

This unusual change has prompted discussion amongst copyright experts as to whether the registrations are now more vulnerable to legal challenges. The Copyright Office maintains that they are valid: “There is no requirement that the Register’s signature must appear on registration certificates,” says Berardi Marflak.

In a motion related to Perlmutter’s lawsuit, though, she alleges that sending out the registrations without a signature opens them up to “challenges in litigation,” something outside copyright experts have also pointed out. “It’s true the law doesn’t explicitly require a signature,” IP lawyer Rachael Dickson says. “However, the law really explicitly says that it’s the Register of Copyright determining whether the material submitted for the application is copyrightable subject matter.”

Without anyone acting as Register, Dickson thinks it would be reasonable to argue that the statutory requirements are not being met. “If you take them completely out of the equation, you have a really big problem,” she says. “Litigators who are trying to challenge a copyright registration’s validity will jump on this.”

Perlmutter’s lawyers have argued that leaving the Copyright Office without an active boss will cause dysfunction beyond the registration certificate issue, as the Register performs a variety of tasks, from advising Congress on copyright to recertifying organizations like the Mechanical Licensing Collective, the nonprofit in charge of administering royalties for streaming and download music in the United States. Since the MLC’s certification is up right now, Perlmutter would ordinarily be moving forward with recertifying the organization; as her lawsuit notes, right now, the recertification process is not moving forward.

The MLC may not be as impacted by Perlmutter’s absence as the complaint suggests. A source close to the MLC told WIRED that the organization does indeed need to be recertified but that the law doesn’t require the recertification process to be completed within a specific time frame, so it will be able to continue operating as usual.

Still, there are other ways that the lack of a boss is a clear liability. The Copyright Claims Board, a three-person tribunal that resolves some copyright disputes, needs to replace one of its members this year, as a current board member, who did not reply to a request for comment, is leaving. The job posting is already live and says applications are being reviewed, but as the position is supposed to be appointed by the Librarian of Congress with the guidance of the Copyright Register, it’s unclear how exactly it will be filled. A source at the Library of Congress familiar with the matter tells WIRED that Newlen could make the appointment if necessary, but they “expect there to be some kind of greater resolution by then.”

As they wait for the resolution, it remains an especially inopportune time for a headless Copyright Office. Perlmutter was fired just days after the office released a hotly contested report on generative AI training and fair use. That report has already been heavily cited in a new class action lawsuit against AI tools Suno and Udio, even though it was technically a “prepublication” version and not finalized. But everyone looking to see what a final report will say—or what guidance the office will issue next—can only keep waiting.

This story originally appeared on wired.com.

Wired.com is your essential daily guide to what’s next, delivering the most original and complete take you’ll find anywhere on innovation’s impact on technology, science, business and culture.


Judge: Pirate libraries may have profited from Meta torrenting 80TB of books

It could certainly look worse for Meta if authors manage to present evidence supporting the second way that torrenting could be relevant to the case, Chhabria suggested.

“Meta downloading copyrighted material from shadow libraries” would also be relevant to the character of the use, “if it benefitted those who created the libraries and thus supported and perpetuated their unauthorized copying and distribution of copyrighted works,” Chhabria wrote.

Counting potential strikes against Meta, Chhabria pointed out that the “vast majority of cases” involving “this sort of peer-to-peer file-sharing” are found to “constitute copyright infringement.” And it likely doesn’t help Meta’s case that “some of the libraries Meta used have themselves been found liable for infringement.”

However, Meta may overcome this argument, too, since book authors “have not submitted any evidence” showing that Meta’s downloading may be “propping up” or financially benefiting pirate libraries.

Finally, Chhabria noted that the “last issue relating to the character of Meta’s use” of books in regards to its torrenting is “the relationship between Meta’s downloading of the plaintiffs’ books and Meta’s use of the books to train Llama.”

Authors had tried to argue that these elements were distinct. But Chhabria said there’s no separating the fact that Meta downloaded the books to serve the “highly transformative” purpose of training Llama.

“Because Meta’s ultimate use of the plaintiffs’ books was transformative, so too was Meta’s downloading of those books,” Chhabria wrote.

AI training rulings may get more authors paid

Authors only learned of Meta’s torrenting through discovery in the lawsuit, and because of that, Chhabria noted that “the record on Meta’s alleged distribution is incomplete.”

It’s possible that authors may be able to show evidence that Meta “contributed to the BitTorrent network” by providing significant computing power that could’ve meaningfully assisted shadow libraries, Chhabria said in a footnote.


Anthropic destroyed millions of print books to build its AI models

But if you’re not intimately familiar with the AI industry and copyright, you might wonder: Why would a company spend millions of dollars on books to destroy them? Behind these odd legal maneuvers lies a more fundamental driver: the AI industry’s insatiable hunger for high-quality text.

The race for high-quality training data

To understand why Anthropic would want to scan millions of books, it’s important to know that AI researchers build large language models (LLMs) like those that power ChatGPT and Claude by feeding billions of words into a neural network. During training, the AI system processes the text repeatedly, building statistical relationships between words and concepts in the process.

The quality of training data fed into the neural network directly impacts the resulting AI model’s capabilities. Models trained on well-edited books and articles tend to produce more coherent, accurate responses than those trained on lower-quality text like random YouTube comments.

Publishers legally control content that AI companies desperately want, but AI companies don’t always want to negotiate a license. The first-sale doctrine offered a workaround: Once you buy a physical book, you can do what you want with that copy, including destroy it.

And yet buying things is expensive, even if it is legal. So like many AI companies before it, Anthropic initially chose the quick and easy path. In the quest for high-quality training data, the court filing states, Anthropic first chose to amass digitized versions of pirated books to avoid what CEO Dario Amodei called “legal/practice/business slog”—the complex licensing negotiations with publishers. But by 2024, Anthropic had become “not so gung ho about” using pirated ebooks “for legal reasons” and needed a safer source.


Key fair use ruling clarifies when books can be used for AI training

“This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use,” Alsup wrote. “Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.”

But Alsup said that the Anthropic case may not even need to decide on that, since Anthropic’s retention of pirated books for its research library alone was not transformative. Alsup wrote that Anthropic’s argument to hold onto potential AI training material it pirated in case it ever decided to use it for AI training was an attempt to “fast glide over thin ice.”

Additionally, Alsup pointed out that Anthropic’s early attempts to get permission to train on authors’ works withered, as internal messages revealed the company concluded that stealing books was the more cost-effective path to innovation “to avoid ‘legal/practice/business slog,’ as cofounder and chief executive officer Dario Amodei put it.”

“Anthropic is wrong to suppose that so long as you create an exciting end product, every ‘back-end step, invisible to the public,’ is excused,” Alsup wrote. “Here, piracy was the point: To build a central library that one could have paid for, just as Anthropic later did, but without paying for it.”

To avoid maximum damages in the event of a loss, Anthropic will likely continue arguing that replacing pirated books with purchased books should weaken the authors’ claims, Alsup’s order suggested.

“That Anthropic later bought a copy of a book it earlier stole off the Internet will not absolve it of liability for the theft, but it may affect the extent of statutory damages,” Alsup noted.


Hollywood studios target AI image generator in copyright lawsuit

The legal action follows similar moves in other creative industries, with more than a dozen major news companies suing AI company Cohere in February over copyright concerns. In 2023, a group of visual artists sued Midjourney for similar reasons.

Studios claim Midjourney knows what it’s doing

Beyond allowing users to create these images, the studios argue that Midjourney actively promotes copyright infringement by displaying user-generated content featuring copyrighted characters in its “Explore” section. The complaint states this curation “show[s] that Midjourney knows that its platform regularly reproduces Plaintiffs’ Copyrighted Works.”

The studios also allege that Midjourney has technical protection measures available that could prevent outputs featuring copyrighted material but has “affirmatively chosen not to use copyright protection measures to limit the infringement.” They cite Midjourney CEO David Holz admitting the company “pulls off all the data it can, all the text it can, all the images it can” for training purposes.

According to Axios, Disney and NBCUniversal attempted to address the issue with Midjourney before filing suit. While the studios say other AI platforms agreed to implement measures to stop IP theft, Midjourney “continued to release new versions of its Image Service” with what Holz allegedly described as “even higher quality infringing images.”

“We are bringing this action today to protect the hard work of all the artists whose work entertains and inspires us and the significant investment we make in our content,” said Kim Harris, NBCUniversal’s executive vice president and general counsel, in a statement.

This lawsuit signals a new front in Hollywood’s conflict over AI. Axios highlights this shift: While actors and writers have fought to protect their name, image, and likeness from studio exploitation, now the studios are taking on tech companies over intellectual property concerns. Other major studios, including Amazon, Netflix, Paramount Pictures, Sony, and Warner Bros., have not yet joined the lawsuit, though they share membership with Disney and Universal in the Motion Picture Association.


OpenAI is retaining all ChatGPT logs “indefinitely.” Here’s who’s affected.

In the copyright fight, Magistrate Judge Ona Wang granted the order within one day of the NYT’s request. She agreed with news plaintiffs that ChatGPT users spooked by the lawsuit would likely set their chats to delete when using the chatbot to skirt NYT paywalls. Because OpenAI wasn’t sharing deleted chat logs, the news plaintiffs had no way of proving that, she suggested.

Now, OpenAI is not only asking Wang to reconsider but has “also appealed this order with the District Court Judge,” the Thursday statement said.

“We strongly believe this is an overreach by the New York Times,” Lightcap said. “We’re continuing to appeal this order so we can keep putting your trust and privacy first.”

Who can access deleted chats?

To protect users, OpenAI provides an FAQ that clearly explains why their data is being retained and how it could be exposed.

For example, the statement noted that the order doesn’t impact OpenAI API business customers under Zero Data Retention agreements because their data is never stored.

And for users whose data is affected, OpenAI noted that their deleted chats could be accessed, but they won’t “automatically” be shared with The New York Times. Instead, the retained data will be “stored separately in a secure system” and “protected under legal hold, meaning it can’t be accessed or used for purposes other than meeting legal obligations,” OpenAI explained.

Of course, with the court battle ongoing, the FAQ did not have all the answers.

Nobody knows how long OpenAI may be required to retain the deleted chats. Likely seeking to reassure users, some of whom appeared to be considering switching to a rival service until the order lifts, OpenAI noted that “only a small, audited OpenAI legal and security team would be able to access this data as necessary to comply with our legal obligations.”
