LAION-5B

Nonprofit scrubs illegal content from controversial AI training dataset

After Stanford Internet Observatory researcher David Thiel found links to child sexual abuse materials (CSAM) in an AI training dataset tainting image generators, the controversial dataset was immediately taken down in 2023.

Now, the LAION (Large-scale Artificial Intelligence Open Network) team has released a scrubbed version of the LAION-5B dataset called Re-LAION-5B and claimed that it “is the first web-scale, text-link to images pair dataset to be thoroughly cleaned of known links to suspected CSAM.”

To scrub the dataset, LAION partnered with the Internet Watch Foundation (IWF) and the Canadian Centre for Child Protection (C3P) to remove 2,236 links that matched hashed images in the online safety organizations’ databases. Removals include all the links flagged by Thiel, as well as content flagged by LAION’s partners and other watchdogs, like Human Rights Watch, which warned of privacy issues after finding photos of real kids included in the dataset without their consent.

In his study, Thiel warned that “the inclusion of child abuse material in AI model training data teaches tools to associate children in illicit sexual activity and uses known child abuse images to generate new, potentially realistic child abuse content.”

Thiel warned LAION and other researchers scraping the Internet for AI training data that a new safety standard was needed to better filter out not just CSAM, but any explicit imagery that could be combined with photos of children to generate CSAM. (Recently, the US Department of Justice pointedly said that “CSAM generated by AI is still CSAM.”)

While LAION’s new dataset won’t alter models that were trained on the prior dataset, LAION claimed that Re-LAION-5B sets “a new safety standard for cleaning web-scale image-link datasets.” Where before illegal content “slipped through” LAION’s filters, the researchers have now developed an improved new system “for identifying and removing illegal content,” LAION’s blog said.

Thiel told Ars that he would agree that LAION has set a new safety standard with its latest release, but “there are absolutely ways to improve it.” However, “those methods would require possession of all original images or a brand new crawl,” and LAION’s post made clear that it only utilized image hashes and did not conduct a new crawl that could have risked pulling in more illegal or sensitive content. (On Threads, Thiel shared more in-depth impressions of LAION’s effort to clean the dataset.)

LAION warned that “current state-of-the-art filters alone are not reliable enough to guarantee protection from CSAM in web scale data composition scenarios.”

“To ensure better filtering, lists of hashes of suspected links or images created by expert organizations (in our case, IWF and C3P) are suitable choices,” LAION’s blog said. “We recommend research labs and any other organizations composing datasets from the public web to partner with organizations like IWF and C3P to obtain such hash lists and use those for filtering. In the longer term, a larger common initiative can be created that makes such hash lists available for the research community working on dataset composition from web.”
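The hash-list filtering LAION recommends can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the `Entry` type, the `url_digest` and `filter_entries` helpers, and the choice of SHA-256 over a trimmed URL are hypothetical, and the hash lists that IWF and C3P actually supply may use different hash types and formats.

```python
import hashlib
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    """One link-text pair, as in a LAION-style dataset (hypothetical type)."""
    url: str
    caption: str


def url_digest(url: str) -> str:
    # Assumption: hex SHA-256 of the trimmed URL; real hash lists may differ.
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def filter_entries(entries: Iterable[Entry], blocked: frozenset[str]) -> Iterator[Entry]:
    """Yield only the entries whose URL digest is not on the blocklist."""
    for entry in entries:
        if url_digest(entry.url) not in blocked:
            yield entry


# Placeholder data: two harmless example URLs, one of them treated as flagged.
sample = [
    Entry("https://example.com/cat.jpg", "a cat on a sofa"),
    Entry("https://example.org/flagged.jpg", "placeholder for a flagged link"),
]
blocked_digests = frozenset({url_digest("https://example.org/flagged.jpg")})

kept = list(filter_entries(sample, blocked_digests))
print([entry.url for entry in kept])
```

Because only digests are compared, a filter along these lines never needs to download or store the flagged images themselves, which is consistent with LAION's statement that it relied on hashes rather than a new crawl.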

According to LAION, the bigger concern is that some links to known CSAM scraped into a 2022 dataset are still active more than a year later.

“It is a clear hint that law enforcement bodies have to intensify the efforts to take down domains that host such image content on public web following information and recommendations by organizations like IWF and C3P, making it a safer place, also for various kinds of research related activities,” LAION’s blog said.

HRW researcher Hye Jung Han praised LAION for removing sensitive data that she flagged, while also urging more interventions.

“LAION’s responsive removal of some children’s personal photos from their dataset is very welcome, and will help to protect these children from their likenesses being misused by AI systems,” Han told Ars. “It’s now up to governments to pass child data protection laws that would protect all children’s privacy online.”

Although LAION’s blog said that the content removals represented an “upper bound” of CSAM that existed in the initial dataset, AI specialist and Creative.AI co-founder Alex Champandard told Ars that he’s skeptical that all CSAM was removed.

“They only filter out previously identified CSAM, which is only a partial solution,” Champandard told Ars. “Statistically speaking, most instances of CSAM have likely never been reported nor investigated by C3P or IWF. A more reasonable estimate of the problem is about 25,000 instances of things you’d never want to train generative models on—maybe even 50,000.”

Champandard agreed with Han that more regulations are needed to protect people from AI harms when training data is scraped from the web.

“There’s room for improvement on all fronts: privacy, copyright, illegal content, etc.,” Champandard said. Because “there are too many data rights being broken with such web-scraped datasets,” Champandard suggested that datasets like LAION’s won’t “stand the test of time.”

“LAION is simply operating in the regulatory gap and lag in the judiciary system until policymakers realize the magnitude of the problem,” Champandard said.

Artists claim “big” win in copyright suit fighting AI image generators

Back to the drawing board —

Artists prepare to take on AI image generators as copyright suit proceeds

Artists pursuing a class-action lawsuit are claiming a major win this week in their fight to stop the most sophisticated AI image generators from copying billions of artworks to train AI models and replicate their styles without compensating artists.

In an order on Monday, US district judge William Orrick denied key parts of motions to dismiss from Stability AI, Midjourney, Runway AI, and DeviantArt. The court will now allow artists to proceed with discovery on claims that AI image generators relying on Stable Diffusion violate both the Copyright Act and the Lanham Act, which protects artists from commercial misuse of their names and unique styles.

“We won BIG,” an artist plaintiff, Karla Ortiz, wrote on X (formerly Twitter), celebrating the order. “Not only do we proceed on our copyright claims,” but “this order also means companies who utilize” Stable Diffusion models and LAION-like datasets that scrape artists’ works for AI training without permission “could now be liable for copyright infringement violations, amongst other violations.”

Lawyers for the artists, Joseph Saveri and Matthew Butterick, told Ars that artists suing “consider the Court’s order a significant step forward for the case,” as “the Court allowed Plaintiffs’ core copyright-infringement claims against all four defendants to proceed.”

Stability AI was the only company to respond to Ars’ request for comment, and it declined to comment.

Artists prepare to defend their livelihoods from AI

To get to this stage of the suit, artists had to amend their complaint to better explain exactly how AI image generators work to allegedly train on artists’ images and copy artists’ styles.

For example, they were told that if they “contend Stable Diffusion contains ‘compressed copies’ of the Training Images, they need to define ‘compressed copies’ and explain plausible facts in support. And if plaintiffs’ compressed copies theory is based on a contention that Stable Diffusion contains mathematical or statistical methods that can be carried out through algorithms or instructions in order to reconstruct the Training Images in whole or in part to create the new Output Images, they need to clarify that and provide plausible facts in support,” Orrick wrote.

To keep their fight alive, the artists pored through academic articles to support their arguments that “Stable Diffusion is built to a significant extent on copyrighted works and that the way the product operates necessarily invokes copies or protected elements of those works.” Orrick agreed that their amended complaint made plausible inferences that “at this juncture” are enough to support claims “that Stable Diffusion by operation by end users creates copyright infringement and was created to facilitate that infringement by design.”

“Specifically, the Court found Plaintiffs’ theory that image-diffusion models like Stable Diffusion contain compressed copies of their datasets to be plausible,” Saveri and Butterick’s statement to Ars said. “The Court also found it plausible that training, distributing, and copying such models constitute acts of copyright infringement.”

Not all of the artists’ claims survived, with Orrick granting motions to dismiss claims alleging that AI companies removed copyright management information (CMI) from artworks in violation of the Digital Millennium Copyright Act (DMCA). Because artists failed to show evidence of defendants altering or stripping this information, they must permanently drop the DMCA claims.

Part of Orrick’s decision on the DMCA claims, however, indicates that the legal basis for dismissal is “unsettled,” with Orrick simply agreeing with Stability AI’s argument that “because the output images are admittedly not identical to the Training Images, there can be no liability for any removal of CMI that occurred during the training process.”

Ortiz wrote on X that she respectfully disagreed with that part of the decision but expressed enthusiasm that the court allowed artists to proceed with false endorsement claims, alleging that Midjourney violated the Lanham Act.

Five artists successfully argued that because “their names appeared on the list of 4,700 artists posted by Midjourney’s CEO on Discord” and that list was used to promote “the various styles of artistic works its AI product could produce,” this plausibly created confusion over whether those artists had endorsed Midjourney.

“Whether or not a reasonably prudent consumer would be confused or misled by the Names List and showcase to conclude that the included artists were endorsing the Midjourney product can be tested at summary judgment,” Orrick wrote. “Discovery may show that it is or that is it not.”

While Orrick agreed with Midjourney that “plaintiffs have no protection over ‘simple, cartoony drawings’ or ‘gritty fantasy paintings,’” artists were able to advance a “trade dress” claim under the Lanham Act, too. This is because Midjourney allegedly “allows users to create works capturing the ‘trade dress of each of the Midjourney Named Plaintiffs [that] is inherently distinctive in look and feel as used in connection with their artwork and art products.’”

As discovery proceeds in the case, artists will also have an opportunity to amend dismissed claims of unjust enrichment. According to Orrick, their next amended complaint will be their last chance to prove that AI companies have “deprived plaintiffs ‘the benefit of the value of their works.’”

Saveri and Butterick confirmed that “though the Court dismissed certain supplementary claims, Plaintiffs’ central claims will now proceed to discovery and trial.” On X, Ortiz suggested that the artists’ case is “now potentially one of THE biggest copyright infringement and trade dress cases ever!”

“Looking forward to the next stage of our fight!” Ortiz wrote.

AI trained on photos from kids’ entire childhood without their consent

Photos of Brazilian kids—sometimes spanning their entire childhood—have been used without their consent to power AI tools, including popular image generators like Stable Diffusion, Human Rights Watch (HRW) warned on Monday.

This practice poses urgent privacy risks to kids and seems to increase risks of non-consensual AI-generated images bearing their likenesses, HRW’s report said.

An HRW researcher, Hye Jung Han, helped expose the problem. She analyzed “less than 0.0001 percent” of LAION-5B, a dataset built from Common Crawl snapshots of the public web. The dataset does not contain the actual photos but includes image-text pairs derived from 5.85 billion images and captions posted online since 2008.
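For a sense of scale, the figures above put a ceiling on how many entries Han could have reviewed. The short calculation below is only a sketch of that arithmetic: it treats the 5.85 billion figure as the dataset's total entry count and uses hypothetical variable names. It bounds the sample size and is not an estimate of how many children's photos the full dataset references.

```python
# Figures as reported in the article; variable names are illustrative.
TOTAL_PAIRS = 5_850_000_000  # image-text pairs in LAION-5B
ONE_IN = 1_000_000           # 0.0001 percent is 1 part in 1,000,000

# Upper bound on the number of entries in a sample of "less than 0.0001 percent".
sample_ceiling = TOTAL_PAIRS // ONE_IN

print(f"Sample reviewed: fewer than {sample_ceiling:,} entries")
```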

Among those images linked in the dataset, Han found 170 photos of children from at least 10 Brazilian states. These were mostly family photos uploaded to personal and parenting blogs most Internet surfers wouldn’t easily stumble upon, “as well as stills from YouTube videos with small view counts, seemingly uploaded to be shared with family and friends,” Wired reported.

LAION, the German nonprofit that created the dataset, has worked with HRW to remove the links to the children’s images in the dataset.

That may not completely resolve the problem, though. HRW’s report warned that the removed links are “likely to be a significant undercount of the total amount of children’s personal data that exists in LAION-5B.” Han told Wired that she fears that the dataset may still be referencing personal photos of kids “from all over the world.”

Removing the links also does not remove the images from the public web, where they can still be referenced and used in other AI datasets, particularly those relying on Common Crawl, LAION’s spokesperson, Nate Tyler, told Ars.

“This is a larger and very concerning issue, and as a nonprofit, volunteer organization, we will do our part to help,” Tyler told Ars.

Han told Ars that “Common Crawl should stop scraping children’s personal data, given the privacy risks involved and the potential for new forms of misuse.”

According to HRW’s analysis, many of the Brazilian children’s identities were “easily traceable,” due to children’s names and locations being included in image captions that were processed when building the LAION dataset.

And at a time when middle and high school-aged students are at greater risk of being targeted by bullies or bad actors turning “innocuous photos” into explicit imagery, it’s possible that AI tools may be better equipped to generate AI clones of kids whose images are referenced in AI datasets, HRW suggested.

“The photos reviewed span the entirety of childhood,” HRW’s report said. “They capture intimate moments of babies being born into the gloved hands of doctors, young children blowing out candles on their birthday cake or dancing in their underwear at home, students giving a presentation at school, and teenagers posing for photos at their high school’s carnival.”

There is less risk that the Brazilian kids’ photos are currently powering AI tools since “all publicly available versions of LAION-5B were taken down” in December, Tyler told Ars. That decision came out of an “abundance of caution” after a Stanford University report “found links in the dataset pointing to illegal content on the public web,” Tyler said, including 3,226 suspected instances of child sexual abuse material.

Han told Ars that “the version of the dataset that we examined pre-dates LAION’s temporary removal of its dataset in December 2023.” The dataset will not be available again until LAION determines that all flagged illegal content has been removed.

“LAION is currently working with the Internet Watch Foundation, the Canadian Centre for Child Protection, Stanford, and Human Rights Watch to remove all known references to illegal content from LAION-5B,” Tyler told Ars. “We are grateful for their support and hope to republish a revised LAION-5B soon.”

In Brazil, “at least 85 girls” have reported classmates harassing them by using AI tools to “create sexually explicit deepfakes of the girls based on photos taken from their social media profiles,” HRW reported. Once these explicit deepfakes are posted online, they can inflict “lasting harm,” HRW warned, potentially remaining online for their entire lives.

“Children should not have to live in fear that their photos might be stolen and weaponized against them,” Han said. “The government should urgently adopt policies to protect children’s data from AI-fueled misuse.”

Ars could not immediately reach Stable Diffusion maker Stability AI for comment.
