Web Scraping

billions-of-public-discord-messages-may-be-sold-through-a-scraping-service

Billions of public Discord messages may be sold through a scraping service

Discord chat-scraping service —

Cross-server tracking suggests a new understanding of “public” chat servers.

Discord logo, warped by vertical perspective over a phone displaying the app

Getty Images

It’s easy to get the impression that Discord chat messages are ephemeral, especially across different public servers, where lines fly upward at a near-unreadable pace. But someone claims to be catching and compiling that data and is offering packages that can track more than 600 million users across more than 14,000 servers.

Joseph Cox at 404 Media confirmed that Spy Pet, a service that sells access to a database of purportedly 3 billion Discord messages, offers data “credits” to customers who pay in bitcoin, ethereum, or other cryptocurrency. Searching individual users will reveal the servers that Spy Pet can track them across, a raw and exportable table of their messages, and connected accounts, such as GitHub. Ominously, Spy Pet lists more than 86,000 other servers in which it has “no bots,” but “we know it exists.”

  • An example of Spy Pet’s service from its website. Shown are a user’s nicknames, connected accounts, banner image, server memberships, and messages across those servers tracked by Spy Pet.

    Spy Pet

  • Statistics on servers, users, and messages purportedly logged by Spy Pet.

    Spy Pet

  • An example image of the publicly available data gathered by Spy Pet, in this example for a public server for the game Deep Rock Galactic: Survivor.

    Spy Pet

As Cox notes, Discord doesn’t make messages inside server channels, like blog posts or unlocked social media feeds, easy to publicly access and search. But many Discord users many not expect their messages, server memberships, bans, or other data to be grabbed by a bot, compiled, and sold to anybody wishing to pin them all on a particular user. 404 Media confirmed the service’s function with multiple user examples. Private messages are not mentioned by Spy Pet and are presumably still secure.

Spy Pet openly asks those training AI models, or “federal agents looking for a new source of intel,” to contact them for deals. As noted by 404 Media and confirmed by Ars, clicking on the “Request Removal” link plays a clip of J. Jonah Jameson from Spider-Man (the Tobey Maguire/Sam Raimi version) laughing at the idea of advance payment before an abrupt “You’re serious?” Users of Spy Pet, however, are assured of “secure and confidential” searches, with random usernames.

This author found nearly every public Discord he had ever dropped into for research or reporting in Spy Pet’s server list. Those who haven’t paid for message access can only see fairly benign public-facing elements, like stickers, emojis, and charted member totals over time. But as an indication of the reach of Spy Pet’s scraping, it’s an effective warning, or enticement, depending on your goals.

Ars has reached out to Spy Pet for comment and will update this post if we receive a response. A Discord spokesperson told Ars that the company is investigating whether Spy Pet violated its terms of service and community guidelines. It will take “appropriate steps to enforce our policies,” the company said, and could not provide further comment.

Billions of public Discord messages may be sold through a scraping service Read More »

openai-accuses-nyt-of-hacking-chatgpt-to-set-up-copyright-suit

OpenAI accuses NYT of hacking ChatGPT to set up copyright suit

OpenAI accuses NYT of hacking ChatGPT to set up copyright suit

OpenAI is now boldly claiming that The New York Times “paid someone to hack OpenAI’s products” like ChatGPT to “set up” a lawsuit against the leading AI maker.

In a court filing Monday, OpenAI alleged that “100 examples in which some version of OpenAI’s GPT-4 model supposedly generated several paragraphs of Times content as outputs in response to user prompts” do not reflect how normal people use ChatGPT.

Instead, it allegedly took The Times “tens of thousands of attempts to generate” these supposedly “highly anomalous results” by “targeting and exploiting a bug” that OpenAI claims it is now “committed to addressing.”

According to OpenAI this activity amounts to “contrived attacks” by a “hired gun”—who allegedly hacked OpenAI models until they hallucinated fake NYT content or regurgitated training data to replicate NYT articles. NYT allegedly paid for these “attacks” to gather evidence to support The Times’ claims that OpenAI’s products imperil its journalism by allegedly regurgitating reporting and stealing The Times’ audiences.

“Contrary to the allegations in the complaint, however, ChatGPT is not in any way a substitute for a subscription to The New York Times,” OpenAI argued in a motion that seeks to dismiss the majority of The Times’ claims. “In the real world, people do not use ChatGPT or any other OpenAI product for that purpose. Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will.”

In the filing, OpenAI described The Times as enthusiastically reporting on its chatbot developments for years without raising any concerns about copyright infringement. OpenAI claimed that it disclosed that The Times’ articles were used to train its AI models in 2020, but The Times only cared after ChatGPT’s popularity exploded after its debut in 2022.

According to OpenAI, “It was only after this rapid adoption, along with reports of the value unlocked by these new technologies, that the Times claimed that OpenAI had ‘infringed its copyright[s]’ and reached out to demand ‘commercial terms.’ After months of discussions, the Times filed suit two days after Christmas, demanding ‘billions of dollars.'”

Ian Crosby, Susman Godfrey partner and lead counsel for The New York Times, told Ars that “what OpenAI bizarrely mischaracterizes as ‘hacking’ is simply using OpenAI’s products to look for evidence that they stole and reproduced The Times’s copyrighted works. And that is exactly what we found. In fact, the scale of OpenAI’s copying is much larger than the 100-plus examples set forth in the complaint.”

Crosby told Ars that OpenAI’s filing notably “doesn’t dispute—nor can they—that they copied millions of The Times’ works to build and power its commercial products without our permission.”

“Building new products is no excuse for violating copyright law, and that’s exactly what OpenAI has done on an unprecedented scale,” Crosby said.

OpenAI argued that the court should dismiss claims alleging direct copyright, contributory infringement, Digital Millennium Copyright Act violations, and misappropriation, all of which it describes as “legally infirm.” Some fail because they are time-barred—seeking damages on training data for OpenAI’s older models—OpenAI claimed. Others allegedly fail because they misunderstand fair use or are preempted by federal laws.

If OpenAI’s motion is granted, the case would be substantially narrowed.

But if the motion is not granted and The Times ultimately wins—and it might—OpenAI may be forced to wipe ChatGPT and start over.

“OpenAI, which has been secretive and has deliberately concealed how its products operate, is now asserting it’s too late to bring a claim for infringement or hold them accountable. We disagree,” Crosby told Ars. “It’s noteworthy that OpenAI doesn’t dispute that it copied Times works without permission within the statute of limitations to train its more recent and current models.”

OpenAI did not immediately respond to Ars’ request to comment.

OpenAI accuses NYT of hacking ChatGPT to set up copyright suit Read More »

beautiful-soup-vs-scrapy-vs.-selenium:-which-web-scraping-tool-should-you-use?

Beautiful Soup vs. Scrapy vs. Selenium: Which Web Scraping Tool Should You Use?

internal/modules/cjs/loader.js: 905 throw err; ^ Error: Cannot find module ‘puppeteer’ Require stack: – /home/760439.cloudwaysapps.com/jxzdkzvxkw/public_html/wp-content/plugins/rss-feed-post-generator-echo/res/puppeteer/puppeteer.js at Function.Module._resolveFilename (internal/modules/cjs/loader.js: 902: 15) at Function.Module._load (internal/modules/cjs/loader.js: 746: 27) at Module.require (internal/modules/cjs/loader.js: 974: 19) at require (internal/modules/cjs/helpers.js: 101: 18) at Object. (/home/760439.cloudwaysapps.com/jxzdkzvxkw/public_html/wp-content/plugins/rss-feed-post-generator-echo/res/puppeteer/puppeteer.js:2: 19) at Module._compile (internal/modules/cjs/loader.js: 1085: 14) at Object.Module._extensions..js (internal/modules/cjs/loader.js: 1114: 10) at Module.load (internal/modules/cjs/loader.js: 950: 32) at Function.Module._load (internal/modules/cjs/loader.js: 790: 12) at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js: 75: 12) code: ‘MODULE_NOT_FOUND’, requireStack: [ ‘/home/760439.cloudwaysapps.com/jxzdkzvxkw/public_html/wp-content/plugins/rss-feed-post-generator-echo/res/puppeteer/puppeteer.js’ ]

Beautiful Soup vs. Scrapy vs. Selenium: Which Web Scraping Tool Should You Use? Read More »