reddit

Google’s AI Overview is flawed by design, and a new company blog post hints at why

guided by voices —

Google: “There are bound to be some oddities and errors” in system that told people to eat rocks.

A selection of Google mascot characters created by the company.

The Google “G” logo surrounded by whimsical characters, all of which look stunned and surprised.

On Thursday, Google capped off a rough week of providing inaccurate and sometimes dangerous answers through its experimental AI Overview feature by authoring a follow-up blog post titled, “AI Overviews: About last week.” In the post, attributed to Google VP Liz Reid, head of Google Search, the firm formally acknowledged issues with the feature and outlined steps taken to improve a system that appears flawed by design, even if the company doesn’t realize it is admitting as much.

To recap, the AI Overview feature—which the company showed off at Google I/O a few weeks ago—aims to provide search users with summarized answers to questions by using an AI model integrated with Google’s web ranking systems. Right now, it’s an experimental feature that is not active for everyone, but when a participating user searches for a topic, they might see an AI-generated answer at the top of the results, pulled from highly ranked web content and summarized by an AI model.

While Google claims this approach is “highly effective” and on par with its Featured Snippets in terms of accuracy, the past week has seen numerous examples of the AI system generating bizarre, incorrect, or even potentially harmful responses, as we detailed in a recent feature where Ars reporter Kyle Orland replicated many of the unusual outputs.

Drawing inaccurate conclusions from the web

On Wednesday morning, Google’s AI Overview was erroneously telling us the Sony PlayStation and Sega Saturn were available in 1993.

Kyle Orland / Google

Given the circulating AI Overview examples, Google almost apologizes in the post and says, “We hold ourselves to a high standard, as do our users, so we expect and appreciate the feedback, and take it seriously.” But Reid, in an attempt to justify the errors, then goes into some very revealing detail about why AI Overviews provides erroneous information:

AI Overviews work very differently than chatbots and other LLM products that people may have tried out. They’re not simply generating an output based on training data. While AI Overviews are powered by a customized language model, the model is integrated with our core web ranking systems and designed to carry out traditional “search” tasks, like identifying relevant, high-quality results from our index. That’s why AI Overviews don’t just provide text output, but include relevant links so people can explore further. Because accuracy is paramount in Search, AI Overviews are built to only show information that is backed up by top web results.

This means that AI Overviews generally don’t “hallucinate” or make things up in the ways that other LLM products might.

Here we see the fundamental flaw of the system: “AI Overviews are built to only show information that is backed up by top web results.” The design is based on the false assumption that Google’s page-ranking algorithm favors accurate results and not SEO-gamed garbage. Google Search has been broken for some time, and now the company is relying on those gamed and spam-filled results to feed its new AI model.

Even if the AI model draws from a more accurate source, as with the 1993 game console search seen above, Google’s AI language model can still make inaccurate conclusions about the “accurate” data, confabulating erroneous information in a flawed summary of the information available.

Generally ignoring the folly of basing its AI results on a broken page-ranking algorithm, Google’s blog post instead attributes the commonly circulated errors to several other factors, including users making nonsensical searches “aimed at producing erroneous results.” Google does admit faults with the AI model, like misinterpreting queries, misinterpreting “a nuance of language on the web,” and lacking sufficient high-quality information on certain topics. It also suggests that some of the more egregious examples circulating on social media are fake screenshots.

“Some of these faked results have been obvious and silly,” Reid writes. “Others have implied that we returned dangerous results for topics like leaving dogs in cars, smoking while pregnant, and depression. Those AI Overviews never appeared. So we’d encourage anyone encountering these screenshots to do a search themselves to check.”

(No doubt some of the social media examples are fake, but it’s worth noting that any attempts to replicate those early examples now will likely fail because Google will have manually blocked the results. And it is potentially a testament to how broken Google Search is if people believed extreme fake examples in the first place.)

While addressing the “nonsensical searches” angle in the post, Reid uses the example search, “How many rocks should I eat each day,” which went viral in a tweet on May 23. Reid says, “Prior to these screenshots going viral, practically no one asked Google that question.” And since there isn’t much data on the web that answers it, she says there is a “data void” or “information gap” that was filled by satirical content found on the web, and the AI model found it and pushed it as an answer, much like Featured Snippets might. So basically, it was working exactly as designed.

A screenshot of an AI Overview query, “How many rocks should I eat each day” that went viral on X last week.

Bing outage shows just how little competition Google search really has

Searching for new search —

Opinion: Actively searching without Google or Bing is harder than it looks.

Google logo on a phone in front of a Bing logo in the background

Getty Images

Bing, Microsoft’s search engine platform, went down in the very early morning today. That meant that searches from Microsoft’s Edge browsers that had yet to change their default providers didn’t work. It also meant that services relying on Bing’s search API—Microsoft’s own Copilot, ChatGPT search, Yahoo, Ecosia, and DuckDuckGo—similarly failed.

Services were largely restored by the morning Eastern work hours, but the timing feels apt, concerning, or some combination of the two. Google, the consistently dominating search platform, just last week announced and debuted AI Overviews as a default addition to all searches. If you don’t want an AI response but still want to use Google, you can hunt down the new “Web” option in a menu, or you can, per Ernie Smith, tack “&udm=14” onto your search or use Smith’s own “Konami code” shortcut page.
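The “&udm=14” tip above amounts to adding one query parameter to a search URL. The following is a minimal sketch, assuming Google’s standard `https://www.google.com/search` endpoint and its `q` query parameter; the function name `google_web_only_url` is hypothetical, and the effect of `udm=14` is taken from the tip as described, so it may change if Google alters the feature.

```python
from urllib.parse import urlencode


def google_web_only_url(query: str) -> str:
    """Build a Google search URL that asks for the plain "Web" results view.

    The udm=14 parameter is the one described in the article; the endpoint
    and the q parameter are assumptions about Google's usual URL layout.
    """
    params = {"q": query, "udm": "14"}
    # urlencode escapes spaces and punctuation so the query is URL-safe.
    return "https://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    print(google_web_only_url("how to keep cheese on pizza"))
```

Saving a URL of this shape as a browser’s custom search engine, with the query replaced by the browser’s placeholder, is one way to avoid hunting for the “Web” menu option on every search.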

If AI’s hallucinations, power draw, or pizza recipes concern you (along with, perhaps, broader Google issues involving privacy, tracking, news, SEO, or monopoly power), most of your other major options were brought down by a single API outage this morning. Moving past that kind of single point of vulnerability will take some work, both by the industry and by you, the person wondering if there’s a real alternative.

Search engine market share, as measured by StatCounter, April 2023–April 2024.

StatCounter

Upward of a billion dollars a year

The overwhelming majority of search tools offering an “alternative” to Google are using Google, Bing, or Yandex, the three major search engines that maintain massive global indexes. Yandex, being based in Russia, is a non-starter for many people around the world at the moment. Bing offers its services widely, most notably to DuckDuckGo, but its ad-based revenue model and privacy particulars have caused some friction there in the past. Before his company was able to block more of Microsoft’s own tracking scripts, DuckDuckGo CEO and founder Gabriel Weinberg explained in a Reddit reply why firms like his weren’t going the full DIY route:

… [W]e source most of our traditional links and images privately from Bing … Really only two companies (Google and Microsoft) have a high-quality global web link index (because I believe it costs upwards of a billion dollars a year to do), and so literally every other global search engine needs to bootstrap with one or both of them to provide a mainstream search product. The same is true for maps btw — only the biggest companies can similarly afford to put satellites up and send ground cars to take streetview pictures of every neighborhood.

Bing makes Microsoft money, if not quite profit yet. It’s in Microsoft’s interest to keep its search index stocked and API open, even if its focus is almost entirely on its own AI chatbot version of Bing. Yet if Microsoft decided to pull API access, or it became unreliable, Google’s default position gets even stronger. What would non-conformists have to choose from then?

OpenAI will use Reddit posts to train ChatGPT under new deal

Data dealings —

Reddit has been eager to sell data from user posts.

An image of a woman holding a cell phone in front of the Reddit logo displayed on a computer screen, on April 29, 2024, in Edmonton, Canada.

Stuff posted on Reddit is getting incorporated into ChatGPT, Reddit and OpenAI announced on Thursday. The new partnership grants OpenAI access to Reddit’s Data API, giving the generative AI firm real-time access to Reddit posts.

Reddit content will be incorporated into ChatGPT “and new products,” Reddit’s blog post said. The social media firm claims the partnership will “enable OpenAI’s AI tools to better understand and showcase Reddit content, especially on recent topics.” OpenAI will also start advertising on Reddit.

The deal is similar to one that Reddit struck with Google in February that allows the tech giant to make “new ways to display Reddit content” and provide “more efficient ways to train models,” Reddit said at the time. Neither Reddit nor OpenAI disclosed the financial terms of their partnership, but Reddit’s partnership with Google was reportedly worth $60 million.

Under the OpenAI partnership, Reddit also gains access to OpenAI large language models (LLMs) to create features for Reddit users, including its volunteer moderators.

Reddit’s data licensing push

The news comes about a year after Reddit launched an API war by starting to charge for access to its data API. This resulted in many beloved third-party Reddit apps closing and a massive user protest. Reddit, which would soon become a public company and hadn’t turned a profit yet, said one of the reasons for the sudden change was to prevent AI firms from using Reddit content to train their LLMs for free.

Earlier this month, Reddit published a Public Content Policy stating: “Unfortunately, we see more and more commercial entities using unauthorized access or misusing authorized access to collect public data in bulk, including Reddit public content. Worse, these entities perceive they have no limitation on their usage of that data, and they do so with no regard for user rights or privacy, ignoring reasonable legal, safety, and user removal requests.”

In its blog post on Thursday, Reddit said that deals like OpenAI’s are part of an “open” Internet. It added that “part of being open means Reddit content needs to be accessible to those fostering human learning and researching ways to build community, belonging, and empowerment online.”

Reddit has been vocal about its interest in pursuing data licensing deals as a core part of its business. Its building of AI partnerships sparks discourse around the use of user-generated content to fuel AI models without compensating users, some of whom may never have considered that their social media posts would be used this way. OpenAI and Stack Overflow faced pushback earlier this month when integrating Stack Overflow content with ChatGPT. Some of Stack Overflow’s user community responded by sabotaging their own posts.

OpenAI is also challenged to work with Reddit data that, like much of the Internet, can be filled with inaccuracies and inappropriate content. Some of the biggest opponents of Reddit’s API rule changes were volunteer mods. Some have exited the platform since, and following the rule changes, Ars Technica spoke with long-time Redditors who were concerned about Reddit content quality moving forward.

Regardless, generative AI firms are keen to tap into Reddit’s access to real-time conversations from a variety of people discussing a nearly endless range of topics. And Reddit seems equally eager to license the data from its users’ posts.

Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder of Reddit.

Reddit, AI spam bots explore new ways to show ads in your feed

BRAZIL - 2024/04/08: In this photo illustration, a Reddit logo seen displayed on a computer screen through a magnifying glass

Reddit has made it clear that it’s an ad-first business. Today, it expanded on that practice with a new ad format that looks to sell things to Reddit users. Simultaneously, Reddit has marketers who are interested in pushing products to users through seemingly legitimate accounts.

In a blog post today, Reddit announced that its Dynamic Product Ads are entering public beta globally. The ad format uses “shopping signals,” aka discussions with people looking to try a product or brand, machine learning, and advertiser product catalogs in order to post relevant ads. Reddit shared an image in the blog post that shows ads, complete with products and pricing, that seem to relate to a posted question. User responses to the Reddit post appear under the ad.

  • A somewhat blurry depiction of the new type of ads Reddit is testing.

  • A (still blurry) example of a more targeted approach to Reddit’s new ad format.

Reddit’s Dynamic Product Ads can automatically show users ads “based on the products they’ve previously engaged with on the advertiser’s site” and/or “based on what people engage with on Reddit or advertiser sites,” per the blog.

Reddit is an ad business

Reddit’s blog didn’t imply that Dynamic Product Ads means users would see more ads than they do currently. However, today’s blog highlighted the newly public company’s focus on ad sales.

“With Dynamic Product Ads, brands can tap into the rich, high-intent product conversations that people come to Reddit for,” Reddit EVP of Business Marketing and Growth Jim Squires said in a statement.

The blog also noted that “Reddit’s communities are naturally commercial,” adding:

Reddit is where people come to make shopping decisions, and we’re focused on bringing brands into these interactions in a way that adds value for people and drives growth for businesses.

The stance has been increasingly clear over the past year, as Reddit became rather vocal about the fact that it’s never been profitable. In June, the company started charging for API access, resulting in numerous valued third-party Reddit apps closing and messy user protests that left a bad taste in countless long-time users’ and moderators’ mouths. While Reddit initially announced the change as a way to prevent large language models from using its data for free training, it was also seen as a way to drive users to Reddit’s website and mobile app, where it can serve users ads.

Per Reddit’s February SEC filing (PDF), ads made up 98 percent of Reddit’s revenues in 2023 and 2022. That filing included a note from CEO Steve Huffman, saying: “Advertising is our first business” and that Reddit’s ad business is “still in the early phases of growing.”

In September, the company started preventing users from opting out of personalized ads. In June, Reddit introduced a new tool to advertisers that uses natural language processing to look through Reddit user comments for keywords that signal potential interest for a brand.

Reddit’s blog post today hinted at some future evolutions focused on showing Reddit users ads, including “tools and features such as new shopping ads formats like collection ads that enhance the shopper experience while driving performance” and “merchant platform integrations that welcome smaller merchants.”

After overreaching TOS angers users, cloud provider Vultr backs off

“Clearly causing confusion” —

Terms seemed to grant an “irrevocable” right to commercialize any user content.

After backlash, the cloud provider Vultr has updated its terms to remove a clause that a Reddit user feared required customers to “fork over rights” to “anything” hosted on its platform.

The alarming clause seemed to grant Vultr a “non-exclusive, perpetual, irrevocable” license to “use and commercialize” any user content uploaded, posted, hosted, or stored on Vultr “in any way that Vultr deems appropriate, without any further consent” or compensation to users or third parties.

Here’s the full clause that was removed:

You hereby grant to Vultr a non-exclusive, perpetual, irrevocable, royalty-free, fully paid-up, worldwide license (including the right to sublicense through multiple tiers) to use, reproduce, process, adapt, publicly perform, publicly display, modify, prepare derivative works, publish, transmit and distribute each of your User Content, or any portion thereof, in any form, medium or distribution method now known or hereafter existing, known or developed, and otherwise use and commercialize the User Content in any way that Vultr deems appropriate, without any further consent, notice and/or compensation to you or to any third parties, for purposes of providing the Services to you.

In a statement provided to Ars, Vultr CEO J.J. Kardwell said that the terms were revised to “simplify and clarify” language causing confusion for some users.

“A Reddit post incorrectly took portions of our Terms of Service out of context, which only pertain to content provided to Vultr on our public mediums (community-related content on public forums, as an example) for purposes of rendering the needed services—e.g., publishing comments, posts, or ratings,” Kardwell said. “This is separate from a user’s own, private content that is deployed on Vultr services.”

It’s easy to see why the Reddit user was confused, as the previous terms did not clearly differentiate between a user’s public and “private content” in the paragraph where it was included. Kardwell told The Register that the old terms, which were drafted in 2021, were “clearly causing confusion for some portion of users” and were updated because Vultr recognized “that the average user doesn’t have a law degree.”

According to Kardwell, the part of the removed clause that “ends with ‘for purposes of providing the Services to you’” was “intended to make it clear that any rights referenced are solely for the purposes of providing the Services to you.” Kevin Cochrane, Vultr’s chief marketing officer, told Ars that users were intended to scroll down to understand that the line only applied to community content described in a section labeled “content that you make publicly available.” He said that the removed clause was necessary in 2021 when Vultr provided forums and collected ratings, but that the clause could be stripped now because “we don’t actually use” that kind of community content “any longer.”

“We’re very focused on being responsive to the community and the concerns people have, and we believe the strongest thing we can do to demonstrate that there is no bad intent here is to remove it,” Kardwell told The Register.

A plain read of the terms without scrolling seemed to substantiate the Reddit user’s worst fears that “it’s possible Vultr may want the expansive license grant to do AI/Machine Learning based on the data they host. Or maybe they could mine database contents to resell [personally identifiable information]. Given the (perpetual!) license, there’s not really any limit to what they might do. They could even clone someone’s app and sell their own rebranded version, and they’d be legally in the clear.”

The user claimed to have been locked out of their Vultr account for five days after refusing to agree to the terms, with Vultr’s support team seemingly providing little recourse to migrate data to a new cloud provider.

“Migrating all my servers and DNS without being able to log in to my account is going to be both a headache and error prone,” the Reddit user wrote. “I feel like they’re holding my business hostage and extorting me into accepting a license I would never consent to under duress.”

Ars was not able to reach the Reddit user to see if Vultr removing the line from the terms has resolved the issue. Other users on the thread claimed that they had terminated their Vultr accounts over the controversy. Cochrane told Ars that they had been contacted by many customers over the past two days and had no way to identify the Reddit user to confirm if they had terminated their account. Cochrane said the support team was actively reaching out to users to verify if their complaints stemmed from discomfort with the previous terms.

In his statement, Kardwell reiterated that Vultr “customers own 100 percent of their content,” clarifying that Vultr “has never claimed any rights to, used, accessed, nor allowed access to or shared” user content, “other than as may be required by law or for security purposes.”

He also confirmed that Vultr would be conducting a “full review” of its terms and publishing another update “soon.” Kardwell told The Register that the most recent update to its terms that led the Reddit user to call out the company was “actually spurred by unrelated Microsoft licensing changes,” promising that Vultr has no plans to use or commercialize user data.

“We do not use user data,” Kardwell told The Register. “We never have, and we never will. We take privacy and security very seriously. It’s at the core of what we do globally.”

Reddit faces new reality after cashing in on its IPO

r/WallStreetBets —

Reddit must now answer to its shareholders as well as its vocal users.

Steve Huffman, u/spez on Reddit, sold 500,000 of his shares in Reddit’s IPO on Thursday.

AFP via Getty Images

In an interview on the New York Stock Exchange trading floor ahead of Reddit’s market debut on Thursday, chief executive Steve Huffman acknowledged that the mischievous retail investors that congregate on the social media platform might deliberately drive down its share price.

“It’s a free market!” he said.

For Reddit, as for Huffman, the bet on a public offering for a site he described as a “fun and special, but sometimes crazy place” has appeared to pay off.

Shares of the social media company soared on its Big Board debut under the ticker RDDT, closing at $50.44, or 48 percent above its IPO price. This brought its fully diluted market capitalization to $9.5 billion, close to where the company was last valued privately at $10 billion in 2021.

Reddit’s journey to public markets marks a turning point for a fringe, free speech-oriented platform dominated by esoteric memes, sardonic humor, and gamers, as it transforms itself into a more mainstream discussion hub that enforces stricter moderation rules in order to attract advertising dollars.

The picture for its earlier investors was mixed. One big winner was the Newhouse family, who through Advance Magazine Publishers Inc own Condé Nast, which bought Reddit in 2006 for $10 million before spinning it out in 2011. Its shares are now worth about $2.1 billion, a handsome windfall to their publishing empire, which also includes Vanity Fair, the New Yorker, and Vogue. Entities affiliated with OpenAI chief executive Sam Altman now hold a stake worth $613 million.

But investors who put money in at the last financing round in 2021 at $61.79 a share, such as Fidelity, were looking at slightly less on that particular investment.

Founded in 2005, the self-proclaimed “front page of the internet” has battled through management upheaval and moderation scandals to grow to 73 million daily users across its 100,000 communities, or “subreddits,” per Reddit parlance. It is a social media minnow, however, relative to Meta or X, which have more than 2.1 billion and 245 million daily active users, respectively.

Still, its IPO attracted institutional interest. Demand was strong, and the top two dozen investors in the deal, who received the majority of its shares, were typically large asset managers who intend on owning the stock for the long term, one person familiar with the matter said.

Reddit’s surge on its first day of trading, a day after AI infrastructure group Astera Labs jumped 72 percent in its Nasdaq debut, also signals a validation of public investor demand for listings, even for a company that is unprofitable, such as Reddit.

“Overall, this is a very positive development for IPO markets [and] should bode well for many of the pre-IPO companies sitting in the queue,” said Christian Munafo, chief investment officer of Liberty Street Advisors.

But, Munafo said, “while [Reddit] performed well out of the gate, the stock may come under pressure unless they are able to demonstrate better growth and monetization.”

Either way, the deal is a boon for Huffman. The chief executive sold 500,000 of his shares in the IPO, cashing out a plump $17 million, and is due to receive additional equity awards as a result of listing the company above a $5 billion valuation. He also received an estimated $193 million pay package last year, mostly made up of equity awards, according to filings.

Historically, Huffman’s style as a leader has reflected that of Reddit’s unruly user base. The self-confessed “internet troll” initially squirmed at the idea of policing the more extreme communities hosted on the platform, relying on these groups to create their own rules and self-moderate. He has defended and cheered on Reddit’s WallStreetBets trading forum that shot to mainstream fame when members collectively bought so-called meme stocks in a bid to squeeze hedge funds.

But Huffman has recently been forced to tidy up the darker underbelly of the platform for advertisers, present a more professional front to Wall Street and hunt harder for profitability. As a result, Reddit has shifted its ambitions slightly to pin its fortunes to wider tech trends. When Reddit first filed for an IPO in 2021, AI was mentioned once in its prospectus. In the 2024 version, AI appeared more than 60 times.

Nevertheless, the approach has left Huffman and the company at odds with some Reddit communities, who have been resistant to any changes to the platform. Some analysts warn that, as Reddit faces new pressures in public markets, its character could be destroyed and users may seek out alternatives, creating a drag on the company.

“Reddit, more so than many social media platforms, has been a very community-based, non-commercial space and people know and love it for [this],” said Samuel Woolley, a propaganda expert and assistant professor at the University of Texas at Austin.

“I think the big question that should be on everyone’s mind for Reddit is to what extent the IPO will change the very nature and fabric of the platform.”

Additional reporting by Nicholas Megaw in New York.

© 2024 The Financial Times Ltd. All rights reserved. Not to be redistributed, copied, or modified in any way.

Reddit cashes in on AI gold rush with $203M in LLM training license fees

Your posts are the product —

Two- to three-year deals with Google, others, come amid legal uncertainty over “fair use.”

“Reddit Gold” takes on a whole new meaning when AI training data is involved.

Last week saw word leak that Google had agreed to license Reddit’s massive corpus of billions of posts and comments to help train its large language models. Now, in a recent Securities and Exchange Commission filing, the popular online forum has revealed that it will bring in $203 million from that and other unspecified AI data licensing contracts over the next three years.

Reddit’s Form S-1—published by the SEC late Thursday ahead of the site’s planned stock IPO—says the company expects $66.4 million of that data-derived value from LLM companies to come during the 2024 calendar year. Bloomberg previously reported the Google deal to be worth an estimated $60 million a year, suggesting that the three-year deal represents the vast majority of its AI licensing revenue so far.
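The “vast majority” claim above can be checked with a few lines of arithmetic. This is a rough sketch using only the figures quoted in the article; it assumes the Bloomberg estimate of $60 million a year holds for all three years and that the full $60 million is recognized in 2024, neither of which the filing confirms.

```python
# Figures quoted in the article, in millions of US dollars.
TOTAL_LICENSING = 203.0   # all AI data licensing contracts over three years (S-1)
EXPECTED_2024 = 66.4      # portion expected during calendar year 2024 (S-1)
GOOGLE_PER_YEAR = 60.0    # Bloomberg's estimate for the Google deal
DEAL_YEARS = 3            # assumption: the Google deal spans all three years

# Implied size of the Google deal and its share of the disclosed total.
google_total = GOOGLE_PER_YEAR * DEAL_YEARS
google_share = google_total / TOTAL_LICENSING

# What is left over for the "other unspecified" contracts.
other_contracts = TOTAL_LICENSING - google_total
non_google_2024 = EXPECTED_2024 - GOOGLE_PER_YEAR

print(f"Google deal, three-year total: ${google_total:.0f}M")
print(f"Google share of disclosed licensing value: {google_share:.1%}")
print(f"Left for other contracts over three years: ${other_contracts:.0f}M")
print(f"Non-Google licensing value implied for 2024: ${non_google_2024:.1f}M")
```

Under those assumptions, the Google deal would account for roughly 89 percent of the $203 million, which is consistent with the article’s reading of the filing.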

Google and other AI companies that license Reddit’s data will receive “continuous access to [Reddit’s] data API as well as quarterly transfers of Reddit data over the term of the arrangement,” according to the filing. That constant, real-time access is particularly valuable, the site writes in the filing, because “Reddit data constantly grows and regenerates as users come and interact with their communities and each other.”

“Why pay for the cow…?”

While Reddit sees data licensing to AI firms as an important part of its financial future, its filing also notes that free use of its data has already been “a foundational part of how many of the leading large language models have been trained.” The filing seems almost bitter in noting that “some companies have constructed very large commercial language models using Reddit data without entering into a license agreement with us.”

That acknowledgment highlights the still-murky legal landscape over AI companies’ penchant for scraping huge swathes of the public web for training purposes, a practice those companies defend as fair use. And Reddit seems well aware that AI models may continue to hoover up its posts and comments for free, even as it tries to sell that data to others.

“Some companies may decline to license Reddit data and use such data without license given its open nature, even if in violation of the legal terms governing our services,” the company writes. “While we plan to vigorously enforce against such entities, such enforcement activities could take years to resolve, result in substantial expense, and divert management’s attention and other resources, and we may not ultimately be successful.”

Yet the mere existence of AI data licensing agreements like Reddit’s may influence how legal battles over this kind of data scraping play out. As Ars’ Timothy Lee and James Grimmelmann noted in a recent legal analysis, the establishment of a settled licensing market can have a huge impact on whether courts consider a novel use of digitized data to be “fair use” under copyright law.

“The more [AI data licensing] deals like this are signed in the coming months, the easier it will be for the plaintiffs to argue that the ‘effect on the market’ prong of fair use analysis should take this licensing market into account,” Lee and Grimmelmann wrote.

And while Reddit sees LLMs as a new revenue opportunity, the site also sees their popularity as a potential threat. The S-1 filing notes that “some users are also turning to LLMs such as ChatGPT, Gemini, and Anthropic” for seeking information, putting them in the same category of Reddit competition as “Google, Amazon, YouTube, Wikipedia, X, and other news sites.”

After filing for its IPO in late 2021, Reddit is reportedly aiming to officially hit the stock market next month. The company will offer users and moderators with sufficient karma and/or activity on the site the opportunity to participate in that IPO through a directed share program.

Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder of Reddit.


Reddit admits more moderator protests could hurt its business

SEC filing —

Losing third-party tools “could harm our moderators’ ability to review content…”

Reddit logo on website displayed on a laptop screen is seen in this illustration photo taken in Krakow, Poland on February 22, 2024.

Reddit filed to go public on Thursday (PDF), revealing various details of the social media company’s inner workings. Among the revelations, Reddit acknowledged the threat of future user protests and the value of third-party Reddit apps.

On July 1, 2023, Reddit enacted API rule changes, including new, expensive pricing, that resulted in many third-party Reddit apps closing. Disturbed by the changes and their timeline, and concerned that Reddit wasn’t properly appreciating third-party app developers and moderators, thousands of Reddit users protested by making the subreddits they moderate private or read-only and/or by engaging in other forms of protest, such as only discussing John Oliver or porn.

Protests went on for weeks and, at their onset, crashed Reddit for three hours. At the time, Reddit CEO Steve Huffman said the protests did not have “any significant revenue impact so far.”

In its filing with the Securities and Exchange Commission (SEC), though, Reddit acknowledged that another such protest could hurt its pockets:

While these activities have not historically had a material impact on our business or results of operations, similar actions by moderators and/or their communities in the future could adversely affect our business, results of operations, financial condition, and prospects.

The company also said that bad publicity and media coverage, such as the kind that stemmed from the API protests, could be a risk to Reddit’s success. The Form S-1 said bad PR around Reddit, including its practices, prices, and mods, “could adversely affect the size, demographics, engagement, and loyalty of our user base,” adding:

For instance, in May and June 2023, we experienced negative publicity as a result of our API policy changes.

Reddit’s filing also said that negative publicity and moderators disrupting the normal operation of subreddits could hurt user growth and engagement goals. The company highlighted financial incentives associated with having good relationships with volunteer moderators, noting that if enough mods decided to disrupt Reddit (like they did when they led protests last year), “results of operations, financial condition, and prospects could be adversely affected.” Reddit infamously forcibly removed moderators from their posts during the protests, saying they broke Reddit rules by refusing to reopen the subreddits they moderated.

“As communities grow, it can become more and more challenging for communities to find qualified people willing to act as moderators,” the filing says.

Losing third-party tools could hurt Reddit’s business

Much of the momentum for last year’s protests came from users, including long-time Redditors, mods, and people with accessibility needs, feeling that third-party apps were necessary to enjoyably and properly access and/or moderate Reddit. Reddit’s own technology has disappointed users in the past (leading some to cling to Old Reddit, which uses an older interface, for example). In its SEC filing, Reddit pointed to the value of third-party “tools” despite its API pricing killing off many of the most popular examples.

Reddit’s filing discusses losing moderators as a business risk and notes how important third-party tools are in maintaining mods:

While we provide tools to our communities to manage their subreddits, our moderators also rely on their own and third-party tools. Any disruption to, or lack of availability of, these third-party tools could harm our moderators’ ability to review content and enforce community rules. Further, if we are unable to provide effective support for third-party moderation tools, or develop our own such tools, our moderators could decide to leave our platform and may encourage their communities to follow them to a new platform, which would adversely affect our business, results of operations, financial condition, and prospects.

Since Reddit’s API policy changes, a small number of third-party Reddit apps remain available. But some of the remaining third-party Reddit app developers have previously told Ars Technica that they’re unsure of their app’s tenability under Reddit’s terms. Nondisclosure agreement requirements and the lack of a finalized developer platform also drive uncertainty around the longevity of the third-party Reddit app ecosystem, according to devs Ars spoke with this year.


Reddit sells training data to unnamed AI company ahead of IPO

Everything has a price —

If you’ve posted on Reddit, you’re likely feeding the future of AI.

On Friday, Bloomberg reported that Reddit has signed a contract allowing an unnamed AI company to train its models on the site’s content, according to people familiar with the matter. The move comes as the social media platform nears the introduction of its initial public offering (IPO), which could happen as soon as next month.

Reddit initially revealed the deal, which is reported to be worth $60 million a year, earlier in 2024 to potential investors in an anticipated IPO, Bloomberg said. The Bloomberg source speculates that the contract could serve as a model for future agreements with other AI companies.

After an era where AI companies utilized AI training data without expressly seeking any rightsholder permission, some tech firms have more recently begun entering deals where some content used for training AI models similar to GPT-4 (which runs the paid version of ChatGPT) comes under license. In December, for example, OpenAI signed an agreement with German publisher Axel Springer (publisher of Politico and Business Insider) for access to its articles. Previously, OpenAI has struck deals with other organizations, including the Associated Press. Reportedly, OpenAI is also in licensing talks with CNN, Fox, and Time, among others.

In April 2023, Reddit founder and CEO Steve Huffman told The New York Times that the company planned to charge AI companies for access to its almost two decades’ worth of human-generated content.

If the reported $60 million/year deal goes through and you’ve ever posted on Reddit, it’s quite possible that some of your posts will be used to train the next generation of AI models that create text, still pictures, and video. Even without the deal, experts have discovered in the past that Reddit has been a key source of training data for large language models and AI image generators.

While we don’t know if OpenAI is the company that signed the deal with Reddit, Bloomberg speculates that Reddit’s ability to tap into AI hype for additional revenue may boost the value of its IPO, which might be worth $5 billion. Despite drama last year, Bloomberg states that Reddit pulled in more than $800 million in revenue in 2023, growing about 20 percent over its 2022 numbers.

Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder of Reddit.


Reddit must share IP addresses of piracy-discussing users, film studios say

A keyboard icon for piracy beside letter v and n

For the third time in less than a year, film studios with copyright infringement complaints against a cable Internet provider are trying to force Reddit to share information about users who have discussed piracy on the site.

In 2023, film companies lost two attempts to have Reddit unmask its users. In the first instance, US Magistrate Judge Laurel Beeler ruled in the US District Court for the Northern District of California that the First Amendment right to anonymous speech meant Reddit didn’t have to disclose the names, email addresses, and other account registration information for nine Reddit users. Film companies, including Bodyguard Productions and Millennium Media, had subpoenaed Reddit in relation to a copyright infringement lawsuit against Astound Broadband-owned RCN about subscribers allegedly pirating 34 movie titles, including Hellboy (2019), Rambo V: Last Blood, and Tesla.

In the second instance, the same companies sued Astound Broadband-owned ISP Grande, again for alleged copyright infringement occurring over the ISP’s network. The studios subpoenaed Reddit for user account information, including “IP address registration and logs from 1/1/2016 to present, name, email address, and other account registration information” for six Reddit users, per a July 2023 court filing.

In August, a federal court again quashed that subpoena, citing First Amendment rights. In her ruling, Beeler noted that while the First Amendment right to anonymous speech is not absolute, the film producers had already received the names of 118 Grande subscribers. She also said the film producers had failed to prove that “the identifying information is directly or materially relevant or unavailable from another source.”

Third piracy-related subpoena

This week, as reported by TorrentFreak, film companies Voltage Holdings, which was part of the previous two subpoenas, and Screen Media Ventures, another film studio with litigation against RCN, filed a motion to compel [PDF] Reddit to respond to the subpoena in the US District Court for the Northern District of California. The studios said they’re seeking the information concerning claims they’ve made that the “ability to pirate content efficiently without any consequences is a draw for becoming a Frontier subscriber” and that Frontier Communications “does not have an effective policy for terminating repeat infringers.” The film studios are claimants against Frontier in its bankruptcy case. The studios are represented by the same lawyers used in the two aforementioned cases.

The studios are asking that the court require Reddit to provide “IP address log information from 1/1/2017 to present” for six anonymous Reddit users who talked about piracy on Reddit. However, the Reddit posts shared in the court filing only date back to 2021.

Reddit responded to the studios’ subpoena with a letter [PDF] on January 2 stating that the subpoena “does not satisfy the First Amendment standard for disclosure of identifying information regarding an anonymous speaker.” Reddit also noted the two previously quashed subpoenas and suggested that it did not have to comply with the new request because the studios could acquire equivalent or better information elsewhere.

As with the previously mentioned litigation against ISPs, Reddit is a non-party. However, since the film companies claimed that Frontier had refused to produce customer identifying information and Reddit responded with a denial to the requests, the film companies filed their motion to compel.

The studios argue that the information requests do not implicate the First Amendment and that the rulings around the two aforementioned subpoenas are not applicable because the new subpoena is only about IP address logs and not other user-identifying information.

“The Reddit users do not have a recognized privacy interest in their IP addresses,” the motion says.
