AI training data

Cloudflare moves to end free, endless AI scraping with one-click blocking

Cloudflare announced new tools Monday that it claims will help end the era of endless AI scraping by giving all sites on its network the power to block bots in one click.

That will help stop the firehose of unrestricted AI scraping, but, perhaps even more intriguing to content creators everywhere, Cloudflare says it will also make it easier to identify which content bots scan most, so that sites can eventually wall off access and charge bots to scrape their most valuable content. To pave the way for that future, Cloudflare is also creating a marketplace where any site can negotiate content deals based on more granular AI audits of its pages.

These tools, Cloudflare’s blog said, give content creators “for the first time” ways “to quickly and easily understand how AI model providers are using their content, and then take control of whether and how the models are able to access it.”

That’s necessary for content creators because the rise of generative AI has made it harder to value their content, Cloudflare suggested in a longer blog explaining the tools.

Previously, sites could distinguish between approving access to helpful bots that drive traffic, like search engine crawlers, and denying access to bad bots that try to take down sites or scrape sensitive or competitive data.

But now, “Large Language Models (LLMs) and other generative tools created a murkier third category” of bots, Cloudflare said, that don’t perfectly fit in either category. They don’t “necessarily drive traffic” like a good bot, but they also don’t try to steal sensitive data like a bad bot, so many site operators don’t have a clear way to think about the “value exchange” of allowing AI scraping, Cloudflare said.

That’s a problem because enabling all scraping could hurt content creators in the long run, Cloudflare predicted.

“Many sites allowed these AI crawlers to scan their content because these crawlers, for the most part, looked like ‘good’ bots—only for the result to mean less traffic to their site as their content is repackaged in AI-written answers,” Cloudflare said.

All this unrestricted AI scraping “poses a risk to an open Internet,” Cloudflare warned, proposing that its tools could set a new industry standard for how content is scraped online.

How to block bots in one click

Increasingly, creators fighting to control what happens with their content have been pushed to either sue AI companies to block unwanted scraping, as The New York Times has, or put content behind paywalls, decreasing public access to information.

While some big publishers have been striking content deals with AI companies to license content, Cloudflare is hoping new tools will help to level the playing field for everyone. That way, “there can be a transparent exchange between the websites that want greater control over their content, and the AI model providers that require fresh data sources, so that everyone benefits,” Cloudflare said.

Today, Cloudflare site operators can stop manually blocking each AI bot one by one and instead choose to “block all AI bots in one click,” Cloudflare said.

They can do this by visiting the Bots section under the Security tab of the Cloudflare dashboard, then clicking a blue link in the top-right corner “to configure how Cloudflare’s proxy handles bot traffic,” Cloudflare said. On that screen, operators can easily “toggle the button in the ‘Block AI Scrapers and Crawlers’ card to the ‘On’ position,” blocking everything and giving content creators time to strategize what access they want to re-enable, if any.
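For operators who have been handling this manually, the approach the one-click toggle replaces usually means maintaining per-crawler rules in robots.txt. As a quick, hedged sketch of how to check what a site’s existing robots.txt already says about common AI crawlers, the short Python script below uses the standard library’s robots.txt parser; the crawler names and the example.com domain are placeholders, and this is not part of Cloudflare’s tooling.

```python
import urllib.robotparser

# A non-exhaustive, illustrative list of well-known AI crawler user agents.
AI_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]

# Placeholder site; swap in the domain you operate.
SITE = "https://example.com"

robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{SITE}/robots.txt")
robots.read()  # fetches and parses the live robots.txt

for agent in AI_CRAWLERS:
    allowed = robots.can_fetch(agent, f"{SITE}/")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'} at the site root")
```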

Beyond just blocking bots, operators can also conduct AI audits, quickly analyzing which sections of their sites are scanned most by which bots. From there, operators can decide which scraping is allowed and use sophisticated controls to decide which bots can scrape which parts of their sites.
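Cloudflare hasn’t published the internals of its audit feature, but the underlying idea, tallying which bots hit which parts of a site, can be approximated from a server’s own access logs. The sketch below is an illustrative, self-hosted approximation only; the crawler list, the access.log path, and the combined log format are all assumptions.

```python
import re
from collections import Counter

# Illustrative list of AI crawler user agents; a real audit would cover many more.
AI_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot", "Google-Extended")

# Minimal pattern for the common "combined" access-log format:
#   client - user [time] "METHOD /path HTTP/x.x" status size "referer" "user-agent"
LOG_LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" \d+ \S+ "[^"]*" "(?P<agent>[^"]*)"')

hits = Counter()
with open("access.log") as log:  # assumed path to the server's access log
    for line in log:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent")
        for bot in AI_CRAWLERS:
            if bot in agent:
                # Bucket hits by (bot, top-level section of the site).
                section = "/" + match.group("path").lstrip("/").split("/", 1)[0]
                hits[(bot, section)] += 1
                break

for (bot, section), count in hits.most_common(10):
    print(f"{bot:16} {section:24} {count}")
```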

“For some teams, the decision will be to allow the bots associated with AI search engines to scan their Internet properties because those tools can still drive traffic to the site,” Cloudflare’s blog explained. “Other organizations might sign deals with a specific model provider, and they want to allow any type of bot from that provider to access their content.”

For publishers already playing whack-a-mole with bots, a key perk would be if Cloudflare’s tools let them write rules for bots that scrape sites for both “good” and “bad” purposes, keeping the beneficial scraping while shutting out the rest.

Perhaps the most frustrating bot for publishers today is the Googlebot, which scrapes sites both to populate search results and to train the AI that generates Google search AI Overviews, which can cut traffic to source sites by summarizing their content. Publishers currently have no way to opt out of training the models fueling Google’s AI Overviews without losing visibility in search results, and Cloudflare’s tools won’t be able to get publishers out of that uncomfortable position, Cloudflare CEO Matthew Prince confirmed to Ars.

For site operators tempted to toggle off all AI scraping, the risk of blocking the Googlebot and inadvertently causing dips in search traffic may be a compelling reason not to use Cloudflare’s one-click solution.

However, Prince expects “that Google’s practices over the long term won’t be sustainable” and “that Cloudflare will be a part of getting Google and other folks that are like Google” to give creators “much more granular control over” how bots like the Googlebot scrape the web to train AI.

Prince told Ars that while Google solves its “philosophical” internal question of whether the Googlebot’s scraping is for search or for AI, a technical solution to block one bot from certain kinds of scraping will likely soon emerge. And in the meantime, “there can also be a legal solution” that “can rely on contract law” based on improving sites’ terms of service.

Not every site would, of course, be able to afford a lawsuit to challenge AI scraping, but to help creators better defend themselves, Cloudflare drafted “model terms of use that every content creator can add to their sites to legally protect their rights as sites gain more control over AI scraping.” With these terms, sites could perhaps more easily dispute any restricted scraping discovered through Cloudflare’s analytics tools.

“One way or another, Google is going to get forced to be more fine-grained here,” Prince predicted.

How to stop LinkedIn from training AI on your data

Better to beg for forgiveness than ask for permission?

LinkedIn limits opt-outs to future training, warns AI models may spout personal data.

LinkedIn admitted Wednesday that it has been training its own AI on many users’ data without seeking consent. Now there’s no way for users to opt out of training that has already occurred, as LinkedIn limits opt-out to only future AI training.

In a blog detailing updates coming on November 20, LinkedIn general counsel Blake Lawit confirmed that LinkedIn’s user agreement and privacy policy will be changed to better explain how users’ personal data powers AI on the platform.

Under the new privacy policy, LinkedIn now informs users that “we may use your personal data… [to] develop and train artificial intelligence (AI) models, develop, provide, and personalize our Services, and gain insights with the help of AI, automated systems, and inferences, so that our Services can be more relevant and useful to you and others.”

An FAQ explained that the personal data could be collected any time a user interacts with generative AI or other AI features, as well as when a user composes a post, changes their preferences, provides feedback to LinkedIn, or uses the platform for any amount of time.

That data is then stored until the user deletes the AI-generated content. LinkedIn recommends using its data access tool for users who want to delete, or request deletion of, data collected about their past LinkedIn activity.

LinkedIn’s AI models powering generative AI features “may be trained by LinkedIn or another provider,” such as Microsoft, which provides some AI models through its Azure OpenAI service, the FAQ said.

A potentially major privacy risk for users, LinkedIn’s FAQ noted, is that users who “provide personal data as an input to a generative AI powered feature” could end up seeing their “personal data being provided as an output.”

LinkedIn claims that it “seeks to minimize personal data in the data sets used to train the models,” relying on “privacy enhancing technologies to redact or remove personal data from the training dataset.”

While Lawit’s blog avoids clarifying if data already collected can be removed from AI training data sets, the FAQ affirmed that users who automatically opted in to sharing personal data for AI training can only opt out of the invasive data collection “going forward.”

Opting out “does not affect training that has already taken place,” the FAQ said.

A LinkedIn spokesperson told Ars that it “benefits all members” to be opted in to AI training “by default.”

“People can choose to opt out, but they come to LinkedIn to be found for jobs and networking and generative AI is part of how we are helping professionals with that change,” LinkedIn’s spokesperson said.

By allowing opt-outs of future AI training, LinkedIn’s spokesperson additionally claimed that the platform is giving “people using LinkedIn even more choice and control when it comes to how we use data to train our generative AI technology.”

How to opt out of AI training on LinkedIn

Users can opt out of AI training by navigating to the “Data privacy” section in their account settings, then turning off the option allowing collection of “data for generative AI improvement” that LinkedIn otherwise automatically turns on for most users.

The only exception is for users in the European Economic Area or Switzerland, who are protected by stricter privacy laws that require platforms either to obtain consent before collecting personal data or to justify the collection as a legitimate interest. Those users will not see an option to opt out, because they were never opted in, LinkedIn repeatedly confirmed.

Additionally, users can “object to the use of their personal data for training” generative AI models not used to generate LinkedIn content—such as models used for personalization or content moderation purposes, The Verge noted—by submitting the LinkedIn Data Processing Objection Form.

Last year, LinkedIn shared AI principles, promising to take “meaningful steps to reduce the potential risks of AI.”

One risk that the updated user agreement specified is that using LinkedIn’s generative features to help populate a profile or generate suggestions when writing a post could generate content that “might be inaccurate, incomplete, delayed, misleading or not suitable for your purposes.”

Users are advised that they are responsible for avoiding sharing misleading information or otherwise spreading AI-generated content that may violate LinkedIn’s community guidelines. And users are additionally warned to be cautious when relying on any information shared on the platform.

“Like all content and other information on our Services, regardless of whether it’s labeled as created by ‘AI,’ be sure to carefully review before relying on it,” LinkedIn’s user agreement says.

Back in 2023, LinkedIn claimed that it would always “seek to explain in clear and simple ways how our use of AI impacts people,” because users’ “understanding of AI starts with transparency.”

If legislation like the European Union’s AI Act and the GDPR, with its strong privacy protections, were enacted elsewhere, unsuspecting users would face fewer shocks like this. That would put all companies and their users on equal footing when it comes to training AI models and result in fewer nasty surprises and angry customers.

Nonprofit scrubs illegal content from controversial AI training dataset

After Stanford Internet Observatory researcher David Thiel found links to child sexual abuse materials (CSAM) in an AI training dataset that was tainting image generators, the controversial dataset was immediately taken down in 2023.

Now, the LAION (Large-scale Artificial Intelligence Open Network) team has released a scrubbed version of the LAION-5B dataset called Re-LAION-5B and claimed that it “is the first web-scale, text-link to images pair dataset to be thoroughly cleaned of known links to suspected CSAM.”

To scrub the dataset, LAION partnered with the Internet Watch Foundation (IWF) and the Canadian Center for Child Protection (C3P) to remove 2,236 links that matched with hashed images in the online safety organizations’ databases. Removals include all the links flagged by Thiel, as well as content flagged by LAION’s partners and other watchdogs, like Human Rights Watch, which warned of privacy issues after finding photos of real kids included in the dataset without their consent.

In his study, Thiel warned that “the inclusion of child abuse material in AI model training data teaches tools to associate children in illicit sexual activity and uses known child abuse images to generate new, potentially realistic child abuse content.”

Thiel urged LAION and other researchers scraping the Internet for AI training data to adopt a new safety standard to better filter out not just CSAM, but any explicit imagery that could be combined with photos of children to generate CSAM. (Recently, the US Department of Justice pointedly said that “CSAM generated by AI is still CSAM.”)

While LAION’s new dataset won’t alter models that were trained on the prior dataset, LAION claimed that Re-LAION-5B sets “a new safety standard for cleaning web-scale image-link datasets.” Where before illegal content “slipped through” LAION’s filters, the researchers have now developed an improved new system “for identifying and removing illegal content,” LAION’s blog said.

Thiel told Ars that he would agree that LAION has set a new safety standard with its latest release, but “there are absolutely ways to improve it.” However, “those methods would require possession of all original images or a brand new crawl,” and LAION’s post made clear that it only utilized image hashes and did not conduct a new crawl that could have risked pulling in more illegal or sensitive content. (On Threads, Thiel shared more in-depth impressions of LAION’s effort to clean the dataset.)

LAION warned that “current state-of-the-art filters alone are not reliable enough to guarantee protection from CSAM in web scale data composition scenarios.”

“To ensure better filtering, lists of hashes of suspected links or images created by expert organizations (in our case, IWF and C3P) are suitable choices,” LAION’s blog said. “We recommend research labs and any other organizations composing datasets from the public web to partner with organizations like IWF and C3P to obtain such hash lists and use those for filtering. In the longer term, a larger common initiative can be created that makes such hash lists available for the research community working on dataset composition from web.”
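As a rough illustration of the hash-list filtering LAION describes, the sketch below drops any dataset record whose image bytes hash into a blocklist. Everything in it is a placeholder: real hash lists come from partners like IWF and C3P and typically use perceptual or PhotoDNA-style hashes rather than the plain MD5 digest used here, and the records and fetch function are invented for the example.

```python
import hashlib

# Placeholder blocklist; in practice hash lists come from organizations such as
# IWF or C3P. This single entry is just the MD5 of the word "password".
BLOCKED_HASHES = {"5f4dcc3b5aa765d61d8327deb882cf99"}

def digest(image_bytes: bytes) -> str:
    """Return the MD5 hex digest used to look an image up in the blocklist (illustrative)."""
    return hashlib.md5(image_bytes).hexdigest()

def filter_dataset(records, fetch):
    """Drop (url, caption) records whose fetched bytes hash into the blocklist.

    Both the (url, caption) record shape and the `fetch` callable that returns
    raw bytes for a URL are assumptions made for this sketch.
    """
    kept = []
    for url, caption in records:
        if digest(fetch(url)) not in BLOCKED_HASHES:
            kept.append((url, caption))
    return kept

# Tiny usage example with an in-memory "fetcher" standing in for real downloads.
sample = [("https://example.com/a.jpg", "a caption"),
          ("https://example.com/b.jpg", "another caption")]
fake_bytes = {"https://example.com/a.jpg": b"password",
              "https://example.com/b.jpg": b"harmless"}
print(filter_dataset(sample, fake_bytes.get))  # keeps only b.jpg
```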

According to LAION, the bigger concern is that some links to known CSAM scraped into a 2022 dataset are still active more than a year later.

“It is a clear hint that law enforcement bodies have to intensify the efforts to take down domains that host such image content on public web following information and recommendations by organizations like IWF and C3P, making it a safer place, also for various kinds of research related activities,” LAION’s blog said.

HRW researcher Hye Jung Han praised LAION for removing sensitive data that she flagged, while also urging more interventions.

“LAION’s responsive removal of some children’s personal photos from their dataset is very welcome, and will help to protect these children from their likenesses being misused by AI systems,” Han told Ars. “It’s now up to governments to pass child data protection laws that would protect all children’s privacy online.”

Although LAION’s blog said that the content removals represented an “upper bound” of CSAM that existed in the initial dataset, AI specialist and Creative.AI co-founder Alex Champandard told Ars that he’s skeptical that all CSAM was removed.

“They only filter out previously identified CSAM, which is only a partial solution,” Champandard told Ars. “Statistically speaking, most instances of CSAM have likely never been reported nor investigated by C3P or IWF. A more reasonable estimate of the problem is about 25,000 instances of things you’d never want to train generative models on—maybe even 50,000.”

Champandard agreed with Han that more regulations are needed to protect people from AI harms when training data is scraped from the web.

“There’s room for improvement on all fronts: privacy, copyright, illegal content, etc.,” Champandard said. Because “there are too many data rights being broken with such web-scraped datasets,” Champandard suggested that datasets like LAION’s won’t “stand the test of time.”

“LAION is simply operating in the regulatory gap and lag in the judiciary system until policymakers realize the magnitude of the problem,” Champandard said.

Slack users horrified to discover messages used for AI training

After launching Slack AI in February, Slack appears to be digging in its heels, defending its vague policy that by default sucks up customers’ data, including messages, content, and files, to train Slack’s global AI models.

According to Slack engineer Aaron Maurer, Slack has explained in a blog that the Salesforce-owned chat service does not train its large language models (LLMs) on customer data. But Slack’s policy may need updating “to explain more carefully how these privacy principles play with Slack AI,” Maurer wrote on Threads, partly because the policy “was originally written about the search/recommendation work we’ve been doing for years prior to Slack AI.”

Maurer was responding to a Threads post from engineer and writer Gergely Orosz, who called for companies to opt out of data sharing until the policy is clarified, not by a blog, but in the actual policy language.

“An ML engineer at Slack says they don’t use messages to train LLM models,” Orosz wrote. “My response is that the current terms allow them to do so. I’ll believe this is the policy when it’s in the policy. A blog post is not the privacy policy: every serious company knows this.”

The tension for users becomes clearer if you compare Slack’s privacy principles with how the company touts Slack AI.

Slack’s privacy principles specifically say that “Machine Learning (ML) and Artificial Intelligence (AI) are useful tools that we use in limited ways to enhance our product mission. To develop AI/ML models, our systems analyze Customer Data (e.g. messages, content, and files) submitted to Slack as well as other information (including usage information) as defined in our privacy policy and in your customer agreement.”

Meanwhile, Slack AI’s page says, “Work without worry. Your data is your data. We don’t use it to train Slack AI.”

Because of this incongruity, users called on Slack to update the privacy principles to make it clear how data is used for Slack AI or any future AI updates. According to a Salesforce spokesperson, the company has agreed an update is needed.

“Yesterday, some Slack community members asked for more clarity regarding our privacy principles,” Salesforce’s spokesperson told Ars. “We’ll be updating those principles today to better explain the relationship between customer data and generative AI in Slack.”

The spokesperson told Ars that the policy updates will clarify that Slack does not “develop LLMs or other generative models using customer data,” “use customer data to train third-party LLMs” or “build or train these models in such a way that they could learn, memorize, or be able to reproduce customer data.” The update will also clarify that “Slack AI uses off-the-shelf LLMs where the models don’t retain customer data,” ensuring that “customer data never leaves Slack’s trust boundary, and the providers of the LLM never have any access to the customer data.”

These changes, however, do not seem to address a key concern for users who never explicitly consented to sharing chats and other Slack content for use in AI training.

Users opting out of sharing chats with Slack

This controversial policy is not new. Wired warned about it in April, and TechCrunch reported that the policy has been in place since at least September 2023.

But widespread backlash began swelling last night on Hacker News, where Slack users called out the chat service for seemingly failing to notify users about the policy change, instead quietly opting them in by default. To critics, it felt like there was no benefit to opting in for anyone but Slack.

From there, the backlash spread to social media, where SlackHQ hastened to clarify Slack’s terms with explanations that did not seem to address all the criticism.

“I’m sorry Slack, you’re doing fucking WHAT with user DMs, messages, files, etc?” Corey Quinn, the chief cloud economist for a cost management company called Duckbill Group, posted on X. “I’m positive I’m not reading this correctly.”

SlackHQ responded to Quinn after the economist declared, “I hate this so much,” and confirmed that he had opted out of data sharing in his paid workspace.

“To clarify, Slack has platform-level machine-learning models for things like channel and emoji recommendations and search results,” SlackHQ posted. “And yes, customers can exclude their data from helping train those (non-generative) ML models. Customer data belongs to the customer.”

Later in the thread, SlackHQ noted, “Slack AI—which is our generative AI experience natively built in Slack—is a separately purchased add-on that uses Large Language Models (LLMs) but does not train those LLMs on customer data.”
