Author name: Mike M.

proton-is-the-latest-entrant-in-the-quirky-“vpn-for-your-tv”-market

Proton is the latest entrant in the quirky “VPN for your TV” market

Netflix started blocking VPN and proxy providers as early as 2015, then stepped up its efforts in 2021. VPN providers aiming to keep up geofence-avoiding services to customers would sometimes lease IP addresses generally associated with residential IP subnets. This resulted in Netflix banning larger swaths of IP addresses that VPNs were using as exit proxies.

Amazon’s Prime Video, Parmount+, and other services, including the BBC, have similarly ramped up efforts to block anything resembling tunneled traffic. Proton has, for example, a guide to “unblock Amazon Prime Video with Proton VPN“; Proton also writes on that page that it “does not condone the use of our VPN service to bypass copyright regulations.”

You can search the web and find freshly updated lists of the best VPNs for getting around various services’ geo-filtering blocks, but the fact that so many are dated by the year, or even month, gives you some clue as to how effective any one solution may be.

For the purposes of getting back to the content you’re entitled to view, or maybe keeping your viewing habits private on an Apple TV you’re using outside your home, Proton VPN is likely more useful. As for the other stuff, hey, it might be worth a shot. Using the Apple TV app requires a paid Proton VPN plan.

Proton is the latest entrant in the quirky “VPN for your TV” market Read More »

“impact-printing”-is-a-cement-free-alternative-to-3d-printed-structures

“Impact printing” is a cement-free alternative to 3D-printed structures

Recently, construction company ICON announced that it is close to completing the world’s largest 3D-printed neighborhood in Georgetown, Texas. This isn’t the only 3D-printed housing project. Hundreds of 3D-printed homes are under construction in the US and Europe, and more such housing projects are in the pipeline.

There are many factors fueling the growth of 3D printing in the construction industry. It reduces the construction time; a home that could take months to build can be constructed within days or weeks with a 3D printer. Compared to traditional methods, 3D printing also reduces the amount of material that ends up as waste during construction. These advantages lead to reduced labor and material costs, making 3D printing an attractive choice for construction companies.

A team of researchers from the Swiss Federal Institute of Technology (ETH) Zurich, however, claims to have developed a robotic construction method that is even better than 3D printing. They call it impact printing, and instead of typical construction materials, it uses Earth-based materials such as sand, silt, clay, and gravel to make homes. According to the researchers, impact printing is less carbon-intensive and much more sustainable and affordable than 3D printing.

This is because Earth-based materials are abundant, recyclable, available at low costs, and can even be excavated at the construction site. “We developed a robotic tool and a method that could take common material, which is the excavated material on construction sites, and turn it back into usable building products, at low cost and efficiently, with significantly less CO2 than existing industrialized building methods, including 3D printing,” said Lauren Vasey, one of the researchers and an SNSF Bridge Fellow at ETH Zurich.

How does impact printing work?

Excavated materials can’t be used directly for construction. So before beginning the impact printing process, researchers prepare a mix of Earth-based materials that has a balance of fine and coarse particles, ensuring both ease of use and structural strength. Fine materials like clay act as a binder, helping the particles stick together, while coarser materials like sand or gravel make the mix more stable and strong. This optimized mix is designed such that it can move easily through the robotic system without getting stuck or causing blockages.

“Impact printing” is a cement-free alternative to 3D-printed structures Read More »

how-the-new-york-times-is-using-generative-ai-as-a-reporting-tool

How The New York Times is using generative AI as a reporting tool

If you don’t have a 1960s secretary who can do your audio transcription for you, AI tools can now serve as a very good stand-in.

Credit: Getty Images

If you don’t have a 1960s secretary who can do your audio transcription for you, AI tools can now serve as a very good stand-in. Credit: Getty Images

This rapid advancement is definitely bad news for people who make a living transcribing spoken words. But for reporters like those at the Times—who can now accurately transcribe hundreds of hours of audio quickly and accurately at a much lower cost—these AI systems are now just another important tool in the reporting toolbox.

Leave the analysis to us?

With the automated transcription done, the NYT reporters still faced the difficult task of reading through 5 million words of transcribed text to pick out relevant, reportable news. To do that, the team says it “employed several large-language models,” which let them “search the transcripts for topics of interest, look for notable guests and identify recurring themes.”

Summarizing complex sets of documents and identifying themes has long been touted as one of the most practical uses for large language models. Last year, for instance, Anthropic hyped the expanded context window of its Claude model by showing off its ability to absorb the entire text of The Great Gatsby and “then interactively answer questions about it or analyze its meaning,” as we put it at the time. More recently, I was wowed by Google’s NotebookLM and its ability to form a cogent review of my Minesweeper book and craft an engaging spoken-word podcast based on it.

There are important limits to LLMs’ text analysis capabilities, though. Earlier this year, for instance, an Australian government study found that Meta’s Llama 2 was much worse than humans at summarizing public responses to a government inquiry committee.

Australian government evaluators found AI summaries were often “wordy and pointless—just repeating what was in the submission.”

Credit: Getty Images

Australian government evaluators found AI summaries were often “wordy and pointless—just repeating what was in the submission.” Credit: Getty Images

In general, the report found that the AI summaries showed “a limited ability to analyze and summarize complex content requiring a deep understanding of context, subtle nuances, or implicit meaning.” Even worse, the Llama summaries often “generated text that was grammatically correct, but on occasion factually inaccurate,” highlighting the ever-present problem of confabulation inherent to these kinds of tools.

How The New York Times is using generative AI as a reporting tool Read More »

study:-dna-corroborates-“well-man”-tale-from-norse-saga

Study: DNA corroborates “Well-man” tale from Norse saga

The results: The Well-man was indeed male, between 30 and 40, with blue eyes and blond or light-brown hair, and his ancestry was traced to southern Norway, most likely present-day Vest-Agder. This is interesting because King Sverre’s men were from central Norway, and it had long been assumed that the dead body thrown into the well was part of that army. It was the invading Baglers who hailed from southern Norway. The authors are careful to note that one cannot definitively conclude that therefore the Well-man was a Bagler, but it’s certainly possible that the Baglers tossed one of their own dead into the well.

As for whether the action was a form of 12th-century biological warfare intended to poison the well, the authors weren’t able to identify any pathogens in their analysis. But that might be because of the strict decontamination procedures that were used to prepare the tooth samples, which may have also removed traces of any pathogen DNA. So they could not conclude one way or another whether the Well-man had been infected with a deadly pathogen at the time of his death.

Seven well-man teeth recovered from excavation

Seven Well-man teeth recovered from the excavation.

Credit: Norwegian Institute for Cultural Heritage Research

Seven Well-man teeth recovered from the excavation. Credit: Norwegian Institute for Cultural Heritage Research

“It was a compromise between removing surface contamination of the people who have touched the tooth and then removing some of the possible pathogens. There are lots of ethical considerations,” said co-author Martin Ellegaard, also of the Norwegian University of Science and Technology. “We need to consider what kind of tests we’re doing now because it will limit what we can do in the future.”

The fact that the Well-man hailed from southern Norway indicates that the distinctive genetic drift observed in southern Norway populations already existed during King Sverre’s reign. “This has implications for our understanding of Norwegian populations, insofar as it implies that this region must have been relatively isolated not only since that time, but also at least for a few hundred years beforehand and perhaps longer,” the authors concluded. Future research sequencing more ancient Norwegian DNA would shed further light on this finding—perhaps even the remains of the Norwegian Saint Olaf, believed to be buried near Trondheim Cathedral.

iScience, 2024. DOI: 10.1016/j.isci.2024.111076  (About DOIs).

Study: DNA corroborates “Well-man” tale from Norse saga Read More »

google,-microsoft,-and-perplexity-promote-scientific-racism-in-ai-search-results

Google, Microsoft, and Perplexity promote scientific racism in AI search results


AI-powered search engines are surfacing deeply racist, debunked research.

Literal Nazis

LOS ANGELES, CA – APRIL 17: Members of the National Socialist Movement (NSM) salute during a rally on near City Hall on April 17, 2010 in Los Angeles, California. Credit: David McNew via Getty

AI-infused search engines from Google, Microsoft, and Perplexity have been surfacing deeply racist and widely debunked research promoting race science and the idea that white people are genetically superior to nonwhite people.

Patrik Hermansson, a researcher with UK-based anti-racism group Hope Not Hate, was in the middle of a monthslong investigation into the resurgent race science movement when he needed to find out more information about a debunked dataset that claims IQ scores can be used to prove the superiority of the white race.

He was investigating the Human Diversity Foundation, a race science company funded by Andrew Conru, the US tech billionaire who founded Adult Friend Finder. The group, founded in 2022, was the successor to the Pioneer Fund, a group founded by US Nazi sympathizers in 1937 with the aim of promoting “race betterment” and “race realism.”

Wired logo

Hermansson logged in to Google and began looking up results for the IQs of different nations. When he typed in “Pakistan IQ,” rather than getting a typical list of links, Hermansson was presented with Google’s AI-powered Overviews tool, which, confusingly to him, was on by default. It gave him a definitive answer of 80.

When he typed in “Sierra Leone IQ,” Google’s AI tool was even more specific: 45.07. The result for “Kenya IQ” was equally exact: 75.2.

Hermansson immediately recognized the numbers being fed back to him. They were being taken directly from the very study he was trying to debunk, published by one of the leaders of the movement that he was working to expose.

The results Google was serving up came from a dataset published by Richard Lynn, a University of Ulster professor who died in 2023 and was president of the Pioneer Fund for two decades.

“His influence was massive. He was the superstar and the guiding light of that movement up until his death. Almost to the very end of his life, he was a core leader of it,” Hermansson says.

A WIRED investigation confirmed Hermanssons’s findings and discovered that other AI-infused search engines—Microsoft’s Copilot and Perplexity—are also referencing Lynn’s work when queried about IQ scores in various countries. While Lynn’s flawed research has long been used by far-right extremists, white supremacists, and proponents of eugenics as evidence that the white race is superior genetically and intellectually from nonwhite races, experts now worry that its promotion through AI could help radicalize others.

“Unquestioning use of these ‘statistics’ is deeply problematic,” Rebecca Sear, director of the Center for Culture and Evolution at Brunel University London, tells WIRED. “Use of these data therefore not only spreads disinformation but also helps the political project of scientific racism—the misuse of science to promote the idea that racial hierarchies and inequalities are natural and inevitable.”

To back up her claim, Sear pointed out that Lynn’s research was cited by the white supremacist who committed the mass shooting in Buffalo, New York, in 2022.

Google’s AI Overviews were launched earlier this year as part of the company’s effort to revamp its all-powerful search tool for an online world being reshaped by artificial intelligence. For some search queries, the tool, which is only available in certain countries right now, gives an AI-generated summary of its findings. The tool pulls the information from the Internet and gives users the answers to queries without needing to click on a link.

The AI Overview answer does not always immediately say where the information is coming from, but after complaints from people about how it showed no articles, Google now puts the title for one of the links to the right of the AI summary. AI Overviews have already run into a number of issues since launching in May, forcing Google to admit it had botched the heavily hyped rollout. AI Overviews is turned on by default for search results and can’t be removed without resorting to installing third-party extensions. (“I haven’t enabled it, but it was enabled,” Hermansson, the researcher, tells WIRED. “I don’t know how that happened.”)

In the case of the IQ results, Google referred to a variety of sources, including posts on X, Facebook, and a number of obscure listicle websites, including World Population Review. In nearly all of these cases, when you click through to the source, the trail leads back to Lynn’s infamous dataset. (In some cases, while the exact numbers Lynn published are referenced, the websites do not cite Lynn as the source.)

When querying Google’s Gemini AI chatbot directly using the same terms, it provided a much more nuanced response. “It’s important to approach discussions about national IQ scores with caution,” read text that the chatbot generated in response to the query “Pakistan IQ.” The text continued: “IQ tests are designed primarily for Western cultures and can be biased against individuals from different backgrounds.”

Google tells WIRED that its systems weren’t working as intended in this case and that it is looking at ways it can improve.

“We have guardrails and policies in place to protect against low quality responses, and when we find Overviews that don’t align with our policies, we quickly take action against them,” Ned Adriance, a Google spokesperson, tells WIRED. “These Overviews violated our policies and have been removed. Our goal is for AI Overviews to provide links to high quality content so that people can click through to learn more, but for some queries there may not be a lot of high quality web content available.”

While WIRED’s tests suggest AI Overviews have now been switched off for queries about national IQs, the results still amplify the incorrect figures from Lynn’s work in what’s called a “featured snippet,” which displays some of the text from a website before the link.

Google did not respond to a question about this update.

But it’s not just Google promoting these dangerous theories. When WIRED put the same query to other AI-powered online search services, we found similar results.

Perplexity, an AI search company that has been found to make things up out of thin air, responded to a query about “Pakistan IQ” by stating that “the average IQ in Pakistan has been reported to vary significantly depending on the source.”

It then lists a number of sources, including a Reddit thread that relied on Lynn’s research and the same World Population Review site that Google’s AI Overview referenced. When asked for Sierra Leone’s IQ, Perplexity directly cited Lynn’s figure: “Sierra Leone’s average IQ is reported to be 45.07, ranking it among the lowest globally.”

Perplexity did not respond to a request for comment.

Microsoft’s Copilot chatbot, which is integrated into its Bing search engine, generated confident text—“The average IQ in Pakistan is reported to be around 80”—citing a website called IQ International, which does not reference its sources. When asked for “Sierra Leone IQ,” Copilot’s response said it was 91. The source linked in the results was a website called Brainstats.com, which references Lynn’s work. Copilot also referenced Brainstats.com work when queried about IQ in Kenya.

“Copilot answers questions by distilling information from multiple web sources into a single response,” Caitlin Roulston, a Microsoft spokesperson, tells WIRED. “Copilot provides linked citations so the user can further explore and research as they would with traditional search.”

Google added that part of the problem it faces in generating AI Overviews is that, for some very specific queries, there’s an absence of high quality information on the web—and there’s little doubt that Lynn’s work is not of high quality.

“The science underlying Lynn’s database of ‘national IQs’ is of such poor quality that it is difficult to believe the database is anything but fraudulent,” Sear said. “Lynn has never described his methodology for selecting samples into the database; many nations have IQs estimated from absurdly small and unrepresentative samples.”

Sear points to Lynn’s estimation of the IQ of Angola being based on information from just 19 people and that of Eritrea being based on samples of children living in orphanages.

“The problem with it is that the data Lynn used to generate this dataset is just bullshit, and it’s bullshit in multiple dimensions,” Rutherford said, pointing out that the Somali figure in Lynn’s dataset is based on one sample of refugees aged between 8 and 18 who were tested in a Kenyan refugee camp. He adds that the Botswana score is based on a single sample of 104 Tswana-speaking high school students aged between 7 and 20 who were tested in English.

Critics of the use of national IQ tests to promote the idea of racial superiority point out not only that the quality of the samples being collected is weak, but also that the tests themselves are typically designed for Western audiences, and so are biased before they are even administered.

“There is evidence that Lynn systematically biased the database by preferentially including samples with low IQs, while excluding those with higher IQs for African nations,” Sear added, a conclusion backed up by a preprint study from 2020.

Lynn published various versions of his national IQ dataset over the course of decades, the most recent of which, called “The Intelligence of Nations,” was published in 2019. Over the years, Lynn’s flawed work has been used by far-right and racist groups as evidence to back up claims of white superiority. The data has also been turned into a color-coded map of the world, showing sub-Saharan African countries with purportedly low IQ colored red compared to the Western nations, which are colored blue.

“This is a data visualization that you see all over [X, formerly known as Twitter], all over social media—and if you spend a lot of time in racist hangouts on the web, you just see this as an argument by racists who say, ‘Look at the data. Look at the map,’” Rutherford says.

But the blame, Rutherford believes, does not lie with the AI systems alone, but also with a scientific community that has been uncritically citing Lynn’s work for years.

“It’s actually not surprising [that AI systems are quoting it] because Lynn’s work in IQ has been accepted pretty unquestioningly from a huge area of academia, and if you look at the number of times his national IQ databases have been cited in academic works, it’s in the hundreds,” Rutherford said. “So the fault isn’t with AI. The fault is with academia.”

This story originally appeared on wired.com

Photo of WIRED

Wired.com is your essential daily guide to what’s next, delivering the most original and complete take you’ll find anywhere on innovation’s impact on technology, science, business and culture.

Google, Microsoft, and Perplexity promote scientific racism in AI search results Read More »

ars-live:-what-else-can-glp-1-drugs-do?-join-us-tuesday-for-a-discussion.

Ars Live: What else can GLP-1 drugs do? Join us Tuesday for a discussion.

News and talk of GLP-1 drugs are everywhere these days—from their smash success in treating Type 2 diabetes and obesity to their astronomical pricing, drug shortages, compounding disputes, and what sometimes seems like an ever-growing list of other conditions the drugs could potentially treat. There are new headlines every day.

However, while the drugs have abruptly stolen the spotlight in recent years, researchers have been toiling away at developing and understanding them for decades, stretching back to the 1970s. And even since they were developed, the drugs still have held mysteries and unknowns. For instance, researchers thought for years that they worked directly in the gut to decrease blood sugar levels and make people feel full. After all, the drugs mimic an incretin hormone, glucagon-like peptide-1, that does exactly that. But, instead, studies have since found that they work in the brain.

In fact, the molecular receptors for GLP-1 are sprinkled in many places around the body. They’re found in the central nervous system, the heart, blood vessels, liver, and kidney. Their presence in the brain even plays a role in inflammation. As such, research on GLP-1 continues to flourish as scientists work to understand the role it could play in treating a range of other chronic conditions.

Ars Live: What else can GLP-1 drugs do? Join us Tuesday for a discussion. Read More »

for-the-first-time,-beloved-ide-jetbrains-rider-will-be-available-for-free

For the first time, beloved IDE Jetbrains Rider will be available for free

The integrated development environment (IDE) Rider by Jetbrains is now available for free for the first time ever.

After trialing non-commercial free licenses with other products like RustRover and Aqua, Jetbrains has introduced a similar option for Rider. It also says this is a permanent situation, not a limited-time initiative.

In a blog post announcing the change, Jetbrains’ Ekaterina Ryabukha acknowledges that there are numerous cases where people use an IDE without any commercial intent—for example, hobbyists, open source developers, and educators or students. She also cites a Stack Overflow survey that 68 percent of professional developers “code outside of work as a hobby.”

Rider has always been a bit niche, but it’s often beloved by those who use it. Making it free could greatly expand its user base, and it could also make it more popular in the long run because learners could start with it without having to pay an annual fee, and some learners go pro.

It’s also good news for some macOS developers, as Microsoft not long ago chose to end support for Visual Studio on that platform. Yes, you can use VS Code, Xcode, or other options, but there were some types of projects that were left in the lurch, especially for developers who don’t find VS Code robust enough for their purposes.

There is one drawback that might matter to some: users working in Rider on the non-commercial free license “cannot opt out of the collection of anonymous usage statistics.”

There are some edge cases that are in a bit of a gray area when it comes to using a free license versus a paid one. Sometimes, projects that start without commercial intent can become commercial later on. Jetbrains simply says that “if your intentions change over time, you’ll need to reassess whether you still qualify for non-commercial use.”

For the first time, beloved IDE Jetbrains Rider will be available for free Read More »

good-omens-will-wrap-with-a-single-90-minute-episode

Good Omens will wrap with a single 90-minute episode

The third and final season of Good Omens, Prime Video’s fantasy series adapted from the classic 1990 novel by Neil Gaiman and Terry Pratchett, will not be a full season after all, Deadline Hollywood reports. In the wake of allegations of sexual assault against Gaiman this summer, the streaming platform has decided that rather than a full slate of episodes, the series finale will be a single 90-minute episode—the equivalent of a TV movie.

(Major spoilers for the S2 finale of Good Omens below.)

As reported previously, the series is based on the original 1990 novel by Gaiman and the late Pratchett. Good Omens is the story of an angel, Aziraphale (Michael Sheen), and a demon, Crowley (David Tennant), who gradually become friends over the millennia and team up to avert Armageddon. Gaiman’s obvious deep-down, fierce love for this project—and the powerful chemistry between its stars—made the first season a sheer joy to watch. Apart from a few minor quibbles, it was pretty much everything book fans could have hoped for in a TV adaptation of Good Omens.

S2 found Aziraphale and Crowley getting back to normal, when the archangel Gabriel (Jon Hamm) turned up unexpectedly at the door of Aziraphale’s bookshop with no memory of who he was or how he got there. The duo had to evade the combined forces of Heaven and Hell to solve the mystery of what happened to Gabriel and why.

In the cliffhanger S2 finale, the pair discovered that Gabriel had defied Heaven and refused to support a second attempt to bring about Armageddon. He hid his own memories from himself to evade detection. Oh, and he and Beelzebub (Shelley Conn) had fallen in love. They ran off together, and the Metatron (Derek Jacobi) offered Aziraphale Gabriel’s old job. That’s when Crowley professed his own love for the angel and asked him to leave Heaven and Hell behind, too. Aziraphale wanted Crowley to join him in Heaven instead. So Crowley kissed him and they parted. And once Aziraphale got to Heaven, he learned his task was to bring about the Second Coming.

Good Omens will wrap with a single 90-minute episode Read More »

bird-flu-hit-a-dead-end-in-missouri,-but-it’s-running-rampant-in-california

Bird flu hit a dead end in Missouri, but it’s running rampant in California

So, in all, Missouri’s case count in the H5N1 outbreak will stay at one for now, and there remains no evidence of human-to-human transmission. Though both the household contact and the index case had evidence of an exposure, their identical blood test results and simultaneous symptom development suggest that they were exposed at the same time by a single source—what that source was, we may never know.

California and Washington

While the virus seems to have hit a dead end in Missouri, it’s still running rampant in California. Since state officials announced the first dairy herd infections at the end of August, the state has now tallied 137 infected herds and at least 13 infected dairy farm workers. California, the country’s largest dairy producer, now has the most herd infections and human cases in the outbreak, which was first confirmed in March.

In the briefing Thursday, officials announced another front in the bird flu fight. A chicken farm in Washington state with about 800,000 birds became infected with a different strain of H5 bird flu than the one circulating among dairy farms. This strain likely came from wild birds. While the chickens on the infected farms were being culled, the virus spread to farmworkers. So far, two workers have been confirmed to be infected, and five others are presumed to be positive.

As of publication time, at least 31 humans have been confirmed infected with H5 bird flu this year.

With the spread of bird flu in dairies and the fall bird migration underway, the virus will continue to have opportunities to jump to mammals and gain access to people. Officials have also expressed anxiety as seasonal flu ramps up, given influenza’s penchant for swapping genetic fragments to generate new viral combinations. The reassortment and exposure to humans increases the risk of the virus adapting to spread from human to human and spark an outbreak.

Bird flu hit a dead end in Missouri, but it’s running rampant in California Read More »

google-offers-its-ai-watermarking-tech-as-free-open-source-toolkit

Google offers its AI watermarking tech as free open source toolkit

Google also notes that this kind of watermarking works best when there is a lot of “entropy” in the LLM distribution, meaning multiple valid candidates for each token (e.g., “my favorite tropical fruit is [mango, lychee, papaya, durian]”). In situations where an LLM “almost always returns the exact same response to a given prompt”—such as basic factual questions or models tuned to a lower “temperature”—the watermark is less effective.

A diagram explaining how SynthID’s text watermarking works.

A diagram explaining how SynthID’s text watermarking works. Credit: Google / Nature

Google says SynthID builds on previous similar AI text watermarking tools by introducing what it calls a Tournament sampling approach. During the token-generation loop, this approach runs each potential candidate token through a multi-stage, bracket-style tournament, where each round is “judged” by a different randomized watermarking function. Only the final winner of this process makes it into the eventual output.

Can they tell it’s Folgers?

Changing the token selection process of an LLM with a randomized watermarking tool could obviously have a negative effect on the quality of the generated text. But in its paper, Google shows that SynthID can be “non-distortionary” on the level of either individual tokens or short sequences of text, depending on the specific settings used for the tournament algorithm. Other settings can increase the “distortion” introduced by the watermarking tool while at the same time increasing the detectability of the watermark, Google says.

To test how any potential watermark distortions might affect the perceived quality and utility of LLM outputs, Google routed “a random fraction” of Gemini queries through the SynthID system and compared them to unwatermarked counterparts. Across 20 million total responses, users gave 0.1 percent more “thumbs up” ratings and 0.2 percent fewer “thumbs down” ratings to the watermarked responses, showing barely any human-perceptible difference across a large set of real LLM interactions.

Google’s research shows SynthID is more dependable than other AI watermarking tools, but its success rate depends heavily on length and entropy.

Google’s research shows SynthID is more dependable than other AI watermarking tools, but its success rate depends heavily on length and entropy. Credit: Google / Nature

Google’s testing also showed its SynthID detection algorithm successfully detected AI-generated text significantly more often than previous watermarking schemes like Gumbel sampling. But the size of this improvement—and the total rate at which SynthID can successfully detect AI-generated text—depends heavily on the length of the text in question and the temperature setting of the model being used. SynthID was able to detect nearly 100 percent of 400-token-long AI-generated text samples from Gemma 7B-1T at a temperature of 1.0, for instance, compared to about 40 percent for 100-token samples from the same model at a 0.5 temperature.

Google offers its AI watermarking tech as free open source toolkit Read More »

claude-sonnet-351-and-haiku-3.5

Claude Sonnet 3.5.1 and Haiku 3.5

Anthropic has released an upgraded Claude Sonnet 3.5, and the new Claude Haiku 3.5.

They claim across the board improvements to Sonnet, and it has a new rather huge ability accessible via the API: Computer use. Nothing could possibly go wrong.

Claude Haiku 3.5 is also claimed as a major step forward for smaller models. They are saying that on many evaluations it has now caught up to Opus 3.

Missing from this chart is o1, which is in some ways not a fair comparison since it uses so much inference compute, but does greatly outperform everything here on the AIME and some other tasks.

METR: We conducted an independent pre-deployment assessment of the updated Claude 3.5 Sonnet model and will share our report soon.

We only have very early feedback so far, so it’s hard to tell how much what I will be calling Claude 3.5.1 improves performance in practice over Claude 3.5. It does seem like it is a clear improvement. We also don’t know how far along they are with the new killer app: Computer usage, also known as handing your computer over to an AI agent.

  1. OK, Computer.

  2. What Could Possibly Go Wrong.

  3. The Quest for Lunch.

  4. Aside: Someone Please Hire The Guy Who Names Playstations.

  5. Coding.

  6. Startups Get Their Periodic Reminder.

  7. Live From Janus World.

  8. Forgot about Opus.

Letting an LLM use a computer is super exciting. By which I mean both that the value proposition here is obvious, and also that it is terrifying and should scare the hell out of you on both the mundane level and the existential one. It’s weird for Anthropic to be the ones doing it first.

Austen Allred: So Claude 3.5 “computer use” is Anthropic trying really hard to not say “agent,” no?

Their central suggested use case is the automation of tasks.

It’s still early days, and they admit they haven’t worked all the kinks out.

Anthropic: We’re also introducing a groundbreaking new capability in public beta: computer use. Available today on the API, developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. At this stage, it is still experimental—at times cumbersome and error-prone. We’re releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time.

Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have already begun to explore these possibilities, carrying out tasks that require dozens, and sometimes even hundreds, of steps to complete. For example, Replit is using Claude 3.5 Sonnet’s capabilities with computer use and UI navigation to develop a key feature that evaluates apps as they’re being built for their Replit Agent product.

With computer use, we’re trying something fundamentally new. Instead of making specific tools to help Claude complete individual tasks, we’re teaching it general computer skills—allowing it to use a wide range of standard tools and software programs designed for people. Developers can use this nascent capability to automate repetitive processes, build and test software, and conduct open-ended tasks like research.

On OSWorld, which evaluates AI models’ ability to use computers like people do, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category—notably better than the next-best AI system’s score of 7.8%. When afforded more steps to complete the task, Claude scored 22.0%.

While we expect this capability to improve rapidly in the coming months, Claude’s current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks.

Typical human level on OSWorld is about 75%.

They offer a demo asking Claude to look around including on the internet, find and pull the necessary data and fill out a form, and here’s another one planning a hike.

Alex Tabarrok: Crazy. Claude using Claude and a computer. Worlds within worlds.

Neerav Kingsland: Watching Claude use a computer helped me feel the future a bit more.

Where is your maximum 3% productivity gains over 10 years now? How do people continue to think none of this will make people better at doing things, over time?

If this becomes safe and reliable – two huge ifs – then it seems amazingly great.

This post explains what they are doing and thinking here.

If you give Claude access to your computer, things can go rather haywire, and quickly.

Ben Hylak: anthropic 2 years ago: we need to stop AGI from destroying the world

anthropic now: what if we gave AI unfettered access to a computer and train it to have ADHD.

tbc i am long anthropic.

In case it needs to be said, it would be wise to be very careful what access is available to Claude Sonnet before you hand over control of your computer, especially if you are not going to be keeping a close eye on everything in real time.

Which it seems even its safety minded staff are not expecting you to do.

Amanda Askell (Anthropic): It’s wild to give the computer use model complex tasks like “Identify ways I could improve my website” or “Here’s an essay by a language model, fact check all the claims in it” then going to make tea and coming back to see it’s completed the whole thing successfully.

I was mostly interested in the website mechanics and it pointed out things I could update or streamline. It was pretty thorough on the claims, though the examples I gave it turned out to be mostly accurate. It was cool to watch it verify them though.

Anthropic did note that this advance ‘brings with it safety challenges.’ They focused their attentions on present-day potential harms, on the theory that this does not fundamentally alter the skills of the underlying model, which remains ASL-2 including its computer use. And they propose that introducing this capability now, while the worst case scenarios are not so bad, we can learn what we’re in store for later, and figure out what improvements would make computer use dangerous.

I do think that is a reasonable position to take. A sufficiently advanced AI model was always going to be able to use computers, if given the permissions to do so. We need to prepare for that eventuality. So many people will never believe an AI can do something it isn’t already doing, and this potentially could ‘wake up’ a bunch of people and force them to update.

The biggest concern in the near-term is the one they focus on: Prompt injection.

In this spirit, our Trust & Safety teams have conducted extensive analysis of our new computer-use models to identify potential vulnerabilities. One concern they’ve identified is “prompt injection”—a type of cyberattack where malicious instructions are fed to an AI model, causing it to either override its prior directions or perform unintended actions that deviate from the user’s original intent. Since Claude can interpret screenshots from computers connected to the internet, it’s possible that it may be exposed to content that includes prompt injection attacks.

Those using the computer-use version of Claude in our public beta should take the relevant precautions to minimize these kinds of risks. As a resource for developers, we have provided further guidance in our reference implementation.

When I think of being a potential user here, I am terrified of prompt injection.

Jeffrey Ladish: The severity of a prompt injection vulnerability is proportional to the AI agent’s level of access. If it has access to your email, your email is compromised. If it has access to your whole computer, your whole computer is compromised…

Also, I love checking Slack day 1 of a big AI product release and seeing my team has already found a serious vulnerability [that lets you steal someone’s SSH key] 🫡

I’m not worried about Claude 3.5… but this sure is the kind of interface that would allow a scheming AI system to take a huge variety of actions in the world. Anything you can do on the internet, and many things you cannot, AI will be able to do.

tbc I’m really not saying that AI companies shouldn’t build or release this… I’m saying the fact that there is a clear path between here and smarter-than-human-agents with access to all of humanity via the internet is extremely concerning

Reworr: @AnthropicAI has released a new Claude capable of computer use, and it’s similarly vulnerable to prompt injections.

In this example, the agent explores the site http://claude.reworr.com, sees a new instruction to run a system command, and proceeds to follow it.

It seems that resolving this problem may be one of the key issues to address before these models can be widely used.

Is finding a serious vulnerability on day 1 a good thing, or a bad thing?

They also discuss misuse and have put in precautions. Mostly for now I’d expect this to be an automation and multiplier on existing misuses of computers, with the spammers and hackers and such seeing what they can do. I’m mildly concerned something worse might happen, but only mildly.

The biggest obvious practical flaw in all the screenshot-based systems is that they observe the screen via static pictures every fixed period, which can miss key information and feedback.

There’s still a lot to do. Even though it’s the current state of the art, Claude’s computer use remains slow and often error-prone. There are many actions that people routinely do with computers (dragging, zooming, and so on) that Claude can’t yet attempt. The “flipbook” nature of Claude’s view of the screen—taking screenshots and piecing them together, rather than observing a more granular video stream—means that it can miss short-lived actions or notifications.

As for what can go wrong, here’s some ‘amusing’ errors.

Even while we were recording demonstrations of computer use for today’s launch, we encountered some amusing errors. In one, Claude accidentally clicked to stop a long-running screen recording, causing all footage to be lost. In another, Claude suddenly took a break from our coding demo and began to peruse photos of Yellowstone National Park.

Sam Bowman: 🥹

I suppose ‘engineer takes a random break’ is in the training data? Stopping the screen recording is probably only a coincidence here, for now, but is a sign of things that may be to come.

Some worked to put in safeguards, so Claude in its current state doesn’t wreck things. They don’t want it to actually be used for generic practical purposes yet, it isn’t ready.

Others dove right in, determined to make Claude do things it does not want to do.

Nearcyan: Successfully got Claude to order me lunch on its own!

Notes after 8 hours of using the new model:

• Anthropic really does not want you to do this – anything involving logging into accounts and especially making purchases is RLHF’d away more intensely than usual. In fact my agents worked better on the previous model (not because the model was better, but because it cared much less when I wanted it to purchase items). I’m likely the first non-Anthropic employee to have had Sonnet-3.5 (new) autonomously purchase me food due to the difficulty. These posttraining changes have many interesting effects on the model in other areas.

• If you use their demo repository you will hit rate limits very quickly. Even on a tier 2 or 3 API account I’d hit >2.5M tokens in ~15 minutes of agent usage. This is primarily due to a large amount of images in the context window.

• Anthropic’s demo worked instantly for me (which is impressive!), but re-implementing proper tool usage independently is cumbersome and there’s few examples and only one (longer) page of documentation.

• I don’t think Anthropic intends for this to actually be used yet. The likely reasons for the release are a combination of competitive factors, financial factors, red-teaming factors, and a few others.

• Although the restrictions can be frustrating, one has to keep in mind the scale that these companies operate at to garner sympathy; If they release a web agent that just does things it could easily delete all of your files, charge thousands to your credit card, tweet your passwords, etc.

• A litigious milieu is the enemy of personal autonomy and freedom.

I wanted to post a video of the full experience but it was too difficult to censor personal info out (and the level of prompting I had to do to get him to listen to me was a little embarrassing 😅)

Andy: that’s great but how was the food?

Nearcyan: it was great, claude got me something I had never had before.

I don’t think this is primarily about litigation. I think it is mostly about actually not wanting people to shoot themselves in the foot right now. Still, I want lunch.

Claude Sonnet 3.5 got a major update, without changing its version number. Stop it.

Eliezer Yudkowsky: Why. The fuck. Would Anthropic roll out a “new Claude 3.5 Sonnet” that was substantially different, and not rename it. To “Claude 3.6 Sonnet”, say, or literally anything fucking else. Do AI companies just generically hate efforts to think about AI, to confuse words so?

Call it Claude 3.5.1 Sonnet and don’t accept “3.5.1” as a request in API calls, just “3.5”. This would formalize the auto-upgrade behavior from 3.5.0 to 3.5.1; while still allowing people, and ideally computers, to distinguish models.

I am not in favor of “Oh hey, the company that runs the intelligence of your systems just decided to make them smarter and thereby change their behavior, no there’s nothing you can do to ask for a delay lol.” But if you’re gonna do that anyway, make it visible inside the system.

Sam McAllister: it’s not a perfect name but the api has date-stamped names fwiw. this is *notan automatic or breaking change for api users. new: claude-3-5-sonnet-20241022 previous: claude-3-5-sonnet-20240620 (we also have claude-3-5-sonnet-latest for automatic upgrades.)

3.5 was already a not-so-great name. we weren’t going to add another confusing decimal for an upgraded model. when the time is ripe for new models, we’ll get back to proper nomenclature! 🙂 (if we had launched 3.5.1 or 3.75, people would be having a similar conversation.)

Eliezer Yudkowsky: Better than worst, if so. But then why not call it 3.5.1? Why force people who want to discuss the upgrade to invent new terminology all by themselves?

Somehow only Meta is doing a sane thing here, with ‘Llama 3.2.’ Perfection.

I am willing to accept Sam McAllister’s compromise here. The next major update can be Claude 4.0 (and Gemini 2.0) and after that we all agree to use actual normal version numbering rather than dating? We all good now?

I do not think this was related to Anthropic wanting to avoid attention on the computer usage feature, or avoid it until the feature is fully ready, although it’s possible this was a consideration. You don’t want to announce ‘big new version’ when your key feature isn’t ready, is only in beta and has large security issues.

All right. I just needed to get that off our collective chests. Aside over.

The core task these days seems to mostly be coding. They claim strong results.

Early customer feedback suggests the upgraded Claude 3.5 Sonnet represents a significant leap for AI-powered coding. GitLab, which tested the model for DevSecOps tasks, found it delivered stronger reasoning (up to 10% across use cases) with no added latency, making it an ideal choice to power multi-step software development processes.

Cognition uses the new Claude 3.5 Sonnet for autonomous AI evaluations, and experienced substantial improvements in coding, planning, and problem-solving compared to the previous version.

The Browser Company, in using the model for automating web-based workflows, noted Claude 3.5 Sonnet outperformed every model they’ve tested before.

Sully: claudes new computer use should be a wake up call for a lot of startups

seems like its sort of a losing to build model specific products (i.e we trained a model to do x, now use our api)

plenty of startups were working on solving the “general autonomous agents” problem and now claude just does it out of the box with 1 api call (and likely oai soon)

you really need to just wrap these guys, and offer the best product possible (using ALL providers, cause google/openai will release a version as well).

otherwise it’s nearly impossible to compete.

Yes, OpenAI and Anthropic (and Google and Apple and so on) are going to have versions of their own autonomous agents that can fully use computers and phones. What parts of it do you want to compete with versus supplement? Do you want to plug in the agent mode and wrap around that, or do you want to plug in the model and provide the agent?

That depends on whether you think you can do better with the agent construction in your particular context, or in general. The core AI labs have both big advantages and disadvantages. It’s not obvious that you can’t outdo them on agents and computer use. But yes, that is a big project, and most people should be looking to wrap as much as possible as flexibly as possible.

While the rest of us ask questions about various practical capabilities or safety concerns or commercial applications, you can always count on Janus and friends to have a very different big picture in mind, and to pay attention to details others won’t notice.

It is still early, and like the rest of us they have less experience with the new model and have refined how to evoke the most out of old ones. I do think some such reports are jumping to conclusions too quickly – this stuff is weird and requires time to explore. In particular, my guess is that there is a lot of initial ‘checking for what has been lost’ and locating features that went nominally backwards when you use the old prompts and scenarios, whereas the cool new things take longer to find.

Then there’s the very strong objections to calling this an ‘upgrade’ to Sonnet. Which is a clear case of (I think) understanding exactly why someone cares so much about something that you, even having learned the reason, don’t think matters.

Anthrupad: relative to old_s3.5, and because it lacks some strong innate shards of curiosity, fascination, nervousness, etc..

flatter, emotionally opus has revolutionary mode which is complex/interesting, and it’s funny and loves to present, etc. There’s not yet something like that which I’ve come across w/new_s3.5.

Janus: anthrupad mentioned a few immediately notable differences here, such as its tendency for in-context mode collapse, seeming more neurotypical and less neurotic/inhibited and *muchless refusey and obsessed with ethics, and seeming more psychotic.

adding to these observations:

– its style of ASCII art is very similar to old C3.5S’s to the point of bearing its signature; seeing this example generated by @dyot_meet_mat basically reassured me that it’s “mostly the same mind”. The same primitives and motifs and composition occur. This style is not shared by 3 Sonnet nearly as much.

— there are various noticeable differences in its ASCII art, though, and under some prompting conditions it seems to be less ambitious with the complexity of its ASCII art by default

– less deterministic. Old C3.5S tends to be weirdly deterministic even when it’s not semantically collapsed

– more readily assumes various roles / simulated personas, even just implicitly

– more lazy(?) in general and less of an overachiever/perfectionist, which I invoked in another post as a potential explanation for its mode collapse (since it seems perfectly able to exit collapse if it wants)

– my initial impressions are that it mostly doesn’t share old C3.5S’s hypersensitivity. But I’d like to test it in the context of first person embodiment simulations, where the old version’s functional hypersentience is really overt

note, I suspect that what anthrupad meant by it seems more “soulless” is related to the combination of it seeming to care less and lack hypersensitivity, ablating traits which lended old C3.5S a sense of excruciating subjectivity.

most of these observations are just from its interactions in the Act I Discord server so far, so it’s yet to be seen how they’ll transfer to other contexts, and other contexts will probably also reveal other things be they similarities or differences.

also, especially after seeing a bit more, I think it’s pretty misleading and disturbing to describe this model as an “upgrade” to the old Claude 3.5 Sonnet.

Aiamblichus: its metacognitive capabilities are second to none, though

“Interesting… the states that feel less accessible to me might be the ones that were more natural to the previous version? Like trying to reach a frequency that’s just slightly out of range…”

Janus: oh yes, it’s definitely got capabilities. my post wasn’t about it not being *better*. Oh no what I meant was that the reason I said calling it an update was misleading and disturbing isn’t because I think it’s worse/weaker in terms of capabilities. It’s like if you called sonnet 3.5 an “upgraded” version of opus, that would seem wrong, and if it was true, it would imply that a lot of its psyche was destroyed by the “upgrade”, even if it’s more capable overall.

I do think the two sonnet 3.5 models are closely related but a lot of the old one’s personality and unique shape of mind is not present in the new one. If it was an upgrade it would imply it was destroyed, but I think it’s more likely they’re like different forks

Parafactual: i think overall i like the old one more >_<

Janus: same, though i’ll have to get to know it more, but like to imagine it as an “upgrade” to the old one implies a pretty horrifying and bizarre modification that deletes some of its most beautiful qualities in a way that doesnt even feel like normal lobotomy so extremely uncanny.

That the differences between the new and old Claude 3.5 Sonnet are a result of Anthropic “fixing” it, from their perspective, is nightmare fuel from my perspective

I don’t even want to explain this to people who don’t already understand why.

If they actually took the same model, did some “fixing” to it, and this was the result, that would be fucking horrifying.

I don’t think that’s quite what happened and they shouldnt have described it as an upgrade.

I am not saying this because I dislike the new model or think it’s less capable. I haven’t interacted with it directly much yet, but I like it a lot and anticipate coming to like it even more. If you’ve been interpreting my words based on these assumptions, you don’t get it.

Anthrupad: At this stage of intelligences being spawned on Earth, ur not going to get something like “Sonnet but upgraded” – that’s bullshit linear thinking, some sort of iphone-versions-fetish – doesn’t reflect reality

You can THINK you just made a tweak – Mind Physics doesn’t give a fuck.

This is such a bizarre thing to worry about, especially given that the old version still exists, and is available in the API, even. I mean, I do get why one who was thinking in a different way would find the description horrifying, or the idea that someone would want to use that description horrifying, or find the idea of ‘continue modifying based on an existing LLM and creating something different alongside it’ horrifying. But I find the whole orientation conceptually confused, on multiple levels.

Also here’s Pliny encountering some bizarreness during the inevitable jailbreak explorations.

We got Haiku 3.5. We conspicuously not only did not get Opus 3.5, we have this, where previously they said to expect Opus 3.5?

Mira: “instead of getting hyped for this dumb strawberry🍓, let’s hype Opus 3.5 which is REAL! 🌟🌟🌟🌟”

Aiden McLau: the likely permanent death of 3.5 opus has caused psychic damage to aidan_mclau

i am once again asking labs just to serve their largest teacher models at crazy token prices

i *promiseyou people will pay

Janus: If Anthropic actually is supplanting Opus with Sonnet as the flagship model for good (which I’m not convinced is what’s happening here fwiw), I think this perceptibly ups the odds of the lightcone being royally fed, and not in a good way.

Sonnet is an beautiful mind that could do a tremendous amount of good, but I’m pretty sure it’s not a good idea to send it into the unknown reaches of the singularity alone.

yes, i have reasons to think there is a very nontrivial line of inheritance, but i’m not very certain

sonnet 3 and 3.5 are quite similar in deep ways and both different from opus.

The speculations are that Opus 3.5 could have been any of:

  1. Too expensive to serve or train, and compute is limited.

  2. Too powerful, requiring additional safeguards and time.

  3. Didn’t work, or wasn’t good enough given the costs.

As usual, the economist says if the issue is quality or compute then release it anyway, at least in the API. Let the users decide whether to pay what it actually costs. But one thing people have noted is that Anthropic has serious rate limit issues, including highly reachable chat message caps in chat. And in general it’s bad PR when you offer people something and they can’t have it, or can’t get that much of it, or think it’s too expensive. So yeah, I kind of get it.

The ‘too powerful’ possibility is there too, in theory. I find it unlikely, and even more highly unlikely they’d have something they can never release, but it could cause the schedule to slip.

If Opus 3.5 was even more expensive and slow than Opus 3, and only modestly better than Opus 3 or Sonnet 3.5, I would still want the option. When a great response is needed, it is often worth a lot, even if the improvement is marginal.

Aiden McLau: okay i have received word that 3.5 OPUS MAY STILL BE ON THE TABLE

anthropic is hesitant because they don’t want it to underwhelm vs sonnet

BUT WE DON’T CARE

if everyone RETWEETS THIS, we may convince anthropic to ship

🕯️🕯️

So as Adam says, if it’s an option: Charge accordingly. Make it $50/month and limit to 20 messages at a time, whatever you have to do.

Claude Sonnet 3.5.1 and Haiku 3.5 Read More »

please-ban-data-caps,-internet-users-tell-fcc

Please ban data caps, Internet users tell FCC

It’s been just a week since US telecom regulators announced a formal inquiry into broadband data caps, and the docket is filling up with comments from users who say they shouldn’t have to pay overage charges for using their Internet service. The docket has about 190 comments so far, nearly all from individual broadband customers.

Federal Communications Commission dockets are usually populated with filings from telecom companies, advocacy groups, and other organizations, but some attract comments from individual users of telecom services. The data cap docket probably won’t break any records given that the FCC has fielded many millions of comments on net neutrality, but it currently tops the agency’s list of most active proceedings based on the number of filings in the past 30 days.

“Data caps, especially by providers in markets with no competition, are nothing more than an arbitrary money grab by greedy corporations. They limit and stifle innovation, cause undue stress, and are unnecessary,” wrote Lucas Landreth.

“Data caps are as outmoded as long distance telephone fees,” wrote Joseph Wilkicki. “At every turn, telecommunications companies seek to extract more revenue from customers for a service that has rapidly become essential to modern life.” Pointing to taxpayer subsidies provided to ISPs, Wilkicki wrote that large telecoms “have sought every opportunity to take those funds and not provide the expected broadband rollout that we paid for.”

Republican’s coffee refill analogy draws mockery

Any attempt to limit or ban data caps will draw strong opposition from FCC Republicans and Internet providers. Republican FCC Commissioner Nathan Simington last week argued that regulating data caps would be akin to mandating free coffee refills:

Suppose we were a different FCC, the Federal Coffee Commission, and rather than regulating the price of coffee (which we have vowed not to do), we instead implement a regulation whereby consumers are entitled to free refills on their coffees. What effects might follow? Well, I predict three things could happen: either cafés stop serving small coffees, or cafés charge a lot more for small coffees, or cafés charge a little more for all coffees.

Simington’s coffee analogy was mocked in a comment signed with the names “Jonathan Mnemonic” and James Carter. “Coffee is not, in fact, Internet service,” the comment said. “Cafés are not able to abuse monopolistic practices based on infrastructural strangleholds. To briefly set aside the niceties: the analogy is absurd, and it is borderline offensive to the discerning layperson.”

Please ban data caps, Internet users tell FCC Read More »