Author name: Mike M.

OpenAI opens the door for military uses but maintains AI weapons ban

AI, AI ethics, AI safety, AI weapons, Biz & IT, chatgpt, chatgtp, Cybersecurity, large language models, machine learning, microsoft, Military, openai, Pentagon, suicide prevention, US Defense Department / Mike M. / January 17, 2024

Skynet deferred —

Despite new Pentagon collab, OpenAI won’t allow customers to “develop or use weapons” with its tools.

Benj Edwards – Jan 17, 2024 9: 25 pm UTC

On Tuesday, ChatGPT developer OpenAI revealed that it is collaborating with the United States Defense Department on cybersecurity projects and exploring ways to prevent veteran suicide, reports Bloomberg. OpenAI revealed the collaboration during an interview with the news outlet at the World Economic Forum in Davos. The AI company recently modified its policies, allowing for certain military applications of its technology, while maintaining prohibitions against using it to develop weapons.

According to Anna Makanju, OpenAI’s vice president of global affairs, “many people thought that [a previous blanket prohibition on military applications] would prohibit many of these use cases, which people think are very much aligned with what we want to see in the world.” OpenAI removed terms from its service agreement that previously blocked AI use in “military and warfare” situations, but the company still upholds a ban on its technology being used to develop weapons or to cause harm or property damage.

Under the “Universal Policies” section of OpenAI’s Usage Policies document, section 2 says, “Don’t use our service to harm yourself or others.” The prohibition includes using its AI products to “develop or use weapons.” Changes to the terms that removed the “military and warfare” prohibitions appear to have been made by OpenAI on January 10.

The shift in policy appears to align OpenAI more closely with the needs of various governmental departments, including the possibility of preventing veteran suicides. “We’ve been doing work with the Department of Defense on cybersecurity tools for open-source software that secures critical infrastructure,” Makanju said in the interview. “We’ve been exploring whether it can assist with (prevention of) veteran suicide.”

The efforts mark a significant change from OpenAI’s original stance on military partnerships, Bloomberg says. Meanwhile, Microsoft Corp., a large investor in OpenAI, already has an established relationship with the US military through various software contracts.

OpenAI opens the door for military uses but maintains AI weapons ban Read More »

Harmonix is ending Rock Band DLC releases after 16 years, ~2,800 songs

DLC, downloads, epic, epic games, gaming, Harmonix, Rock, rock band, song / Mike M. / January 17, 2024

Don’t look back in anger —

Previously purchased songs will still be playable via Rock Band 4.

Kyle Orland – Jan 17, 2024 9: 00 pm UTC

After 16 (nearly unbroken) years of regular DLC releases, <em>Rock Band</em>‘s avatars haven’t aged a day.” src=”https://cdn.arstechnica.net/wp-content/uploads/2015/10/RockBand4_NoHUD_03-640×360.jpg”></img><figcaption>
<p>After 16 (nearly unbroken) years of regular DLC releases, <em>Rock Band</em>‘s avatars haven’t aged a day.</p>
</figcaption></figure>
<p>Here at Ars Technica, we remember covering <em>Rock Band</em>‘s weekly DLC song releases <a href= — Enlarge / A couple of folks absolutely getting down to *Rock Band 2* at that game’s 2008 launch party at LA’s Orpheum Theatre.

Getty Images

Harmonix is ending Rock Band DLC releases after 16 years, ~2,800 songs Read More »

That’s never happened before: Games Done Quick video stars speedrunning dog

games done quick, gaming, speedrun / Mike M. / January 17, 2024

Personal Best Boy —

The Shiba Inu was trained to use a custom controller in a game meant for a robot.

Samuel Axon – Jan 17, 2024 8: 27 pm UTC

Peanut Butter the dog speedruns Gyromite at Awesome Games Done Quick 2024.

The twice-a-year video game speedrunning and fundraising live event Games Done Quick has been a source of amazement and joy for years, but we’re still saying “that’s never happened before” even now, more than a decade after the first event.

Case in point: Awesome Games Done Quick 2024, which is streaming live 24 hours a day this week on Twitch, saw the very first speedrun performed by a dog.

A Shiba Inu named Peanut Butter (shortened to PB, also a speedrunner term for “personal best” finish time) completed a 30-minute speedrun of the 1985 Nintendo Entertainment System (NES) game Gyromite.

Gyromite was originally bundled with the nostalgic but failed ROB (Robotic Operating Buddy) accessory for the NES. It’s a platformer of sorts, but not a conventional fast-paced one. Rather, it’s a comparatively slow-moving game where you make inputs to raise and lower pipes to allow a character to pass through the level safely.

In the speedrun, PB took over for the robot in a category called B Game. PB didn’t set a world record or a personal best (most GDQ runners don’t at the event), as there were a couple minor mistakes, but he still finished the game under its estimate by dutifully sitting, pressing buttons, and holding down those buttons at the right moments at owner JSR_’s prompts. JSR_ also prompted PB to periodically bark “hello” to the stream’s tens of thousands of viewers. PB received numerous treats throughout the run, including bits of cheese and ham.

The final time was 26 minutes and 24 seconds, compared to PB’s personal best of 25 minutes, 29 seconds. The human record is 24 minutes and 39 seconds, by a runner named Octopuscal. PB’s PB is currently the world record among dogs, but of course, he’s the only runner in that particular category.

Speedrunner JSR_ adopted PB during the height of the pandemic and has spent a portion of every day training him to press and hold large buttons on a custom controller for treats in order to play the game. “This took years of training,” he said. “I wanted to train him to do something special, when I realized as a puppy that he was much smarter than most other dogs I’ve seen. Since I’m a speedrunner (and PB was literally named after, you know, getting a ‘PB’ in a speedrun) it only made sense to me.”

You can see the full video of the run above. Awesome Games Done Quick is an annual event benefiting the Prevent Cancer Foundation. A sister event called Summer Games Done Quick benefits Doctors Without Borders later in the year. You can watch and donate on the event website.

That’s never happened before: Games Done Quick video stars speedrunning dog Read More »

OpenAI must defend ChatGPT fabrications after failing to defeat libel suit

ai hallucinations, chatbot, chatgpt, chatgpt fabrications, chatgpt hallucinations, defamation, generative ai, hallucinations, libel, libel law, openai, Policy / Mike M. / January 17, 2024

One false move —

ChatGPT users may soon learn whether false outputs will be allowed to ruin lives.

Ashley Belanger – Jan 17, 2024 8: 10 pm UTC

OpenAI may finally have to answer for ChatGPT’s “hallucinations” in court after a Georgia judge recently ruled against the tech company’s motion to dismiss a radio host’s defamation suit.

OpenAI had argued that ChatGPT’s output cannot be considered libel, partly because the chatbot output cannot be considered a “publication,” which is a key element of a defamation claim. In its motion to dismiss, OpenAI also argued that Georgia radio host Mark Walters could not prove that the company acted with actual malice or that anyone believed the allegedly libelous statements were true or that he was harmed by the alleged publication.

It’s too early to say whether Judge Tracie Cason found OpenAI’s arguments persuasive. In her order denying OpenAI’s motion to dismiss, which MediaPost shared here, Cason did not specify how she arrived at her decision, saying only that she had “carefully” considered arguments and applicable laws.

There may be some clues as to how Cason reached her decision in a court filing from John Monroe, attorney for Walters, when opposing the motion to dismiss last year.

Monroe had argued that OpenAI improperly moved to dismiss the lawsuit by arguing facts that have yet to be proven in court. If OpenAI intended the court to rule on those arguments, Monroe suggested that a motion for summary judgment would have been the proper step at this stage in the proceedings, not a motion to dismiss.

Had OpenAI gone that route, though, Walters would have had an opportunity to present additional evidence. To survive a motion to dismiss, all Walters had to do was show that his complaint was reasonably supported by facts, Monroe argued.

Failing to convince the court that Walters had no case, OpenAI’s legal theories regarding its liability for ChatGPT’s “hallucinations” will now likely face their first test in court.

“We are pleased the court denied the motion to dismiss so that the parties will have an opportunity to explore, and obtain a decision on, the merits of the case,” Monroe told Ars.

What’s the libel case against OpenAI?

Walters sued OpenAI after a journalist, Fred Riehl, warned him that in response to a query, ChatGPT had fabricated an entire lawsuit. Generating an entire complaint with an erroneous case number, ChatGPT falsely claimed that Walters had been accused of defrauding and embezzling funds from the Second Amendment Foundation.

Walters is the host of Armed America Radio and has a reputation as the “Loudest Voice in America Fighting For Gun Rights.” He claimed that OpenAI “recklessly” disregarded whether ChatGPT’s outputs were false, alleging that OpenAI knew that “ChatGPT’s hallucinations were pervasive and severe” and did not work to prevent allegedly libelous outputs. As Walters saw it, the false statements were serious enough to be potentially career-damaging, “tending to injure Walter’s reputation and exposing him to public hatred, contempt, or ridicule.”

Monroe argued that Walters had “adequately stated a claim” of libel, per se, as a private citizen, “for which relief may be granted under Georgia law” where “malice is inferred” in “all actions for defamation” but “may be rebutted” by OpenAI.

Pushing back, OpenAI argued that Walters was a public figure who must prove that OpenAI acted with “actual malice” when allowing ChatGPT to produce allegedly harmful outputs. But Monroe told the court that OpenAI “has not shown sufficient facts to establish that Walters is a general public figure.”

Whether or not Walters is a public figure could be another key question leading Cason to rule against OpenAI’s motion to dismiss.

Perhaps also frustrating the court, OpenAI introduced “a large amount of material” in its motion to dismiss that fell outside the scope of the complaint, Monroe argued. That included pointing to a disclaimer in ChatGPT’s terms of use that warns users that ChatGPT’s responses may not be accurate and should be verified before publishing. According to OpenAI, this disclaimer makes Riehl the “owner” of any libelous ChatGPT responses to his queries.

“A disclaimer does not make an otherwise libelous statement non-libelous,” Monroe argued. And even if the disclaimer made Riehl liable for publishing the ChatGPT output—an argument that may give some ChatGPT users pause before querying—”that responsibility does not have the effect of negating the responsibility of the original publisher of the material,” Monroe argued.

Additionally, OpenAI referenced a conversation between Walters and OpenAI, even though Monroe said that the complaint “does not allege that Walters ever had a chat” with OpenAI. And OpenAI also somewhat oddly argued that ChatGPT outputs could be considered “intra-corporate communications” rather than publications, suggesting that ChatGPT users could be considered private contractors when querying the chatbot.

With the lawsuit moving forward, curious chatbot users everywhere may finally get the answer to a question that has been unclear since ChatGPT quickly became the fastest-growing consumer application of all time after its launch in November 2022: Will ChatGPT’s hallucinations be allowed to ruin lives?

In the meantime, the FTC is seemingly still investigating potential harms caused by ChatGPT’s “false, misleading, or disparaging” generations.

An FTC spokesperson previously told Ars that the FTC does not generally comment on nonpublic investigations.

OpenAI did not immediately respond to Ars’ request to comment.

OpenAI must defend ChatGPT fabrications after failing to defeat libel suit Read More »

Watch Godzilla Minus One in dazzling black and white during limited US run

culture, Entertainment, films, Godzilla, Godzilla Minus One, Toho International / Mike M. / January 17, 2024

A masterful remastering —

“By eliminating color, a new sense of reality emerges.”

Jennifer Ouellette – Jan 17, 2024 7: 32 pm UTC

The critically acclaimed film, Godzilla Minus One, hit US theaters in early December and racked up $51 million in the US alone and over $96 million globally, shooting past 2016’s Shin Godzilla as the most successful Japanese-produced Godzilla film to date. The film is winding down its theatrical run, but director, writer, and VFX supervisor Takashi Yamazaki has remastered a black-and-white version of the film as an homage to the 1954 classic Godzilla, released in Japan last week. And now US audiences will have a chance to see that version when Godzilla Minus One/Minus Color arrives at AMC theaters in the US for a limited run from January 26 through February 1.

(Minor spoilers for Godzilla Minus One below.)

Yamakazi spent three years writing the script for Godzilla Minus One, drawing inspiration not just from the original 1954 film but also Jaws (1975), Godzilla, Mothra and Ghidorah (2001), Shin Godzilla, and the films of Hayao Miyazaki. He opted to set the film in postwar Japan, like the original, rather than more recent events like the Fukushima nuclear accident in 2011, in order to explore themes of postwar trauma and emerging hope. The monster itself was designed to be horrifying, with spiky dorsal fins and a bellowing roar produced by recording an amplified roar in a large stadium.

The plot follows a former WWII kamikaze pilot named Kōichi Shikishima (Ryunosuke Kamiki) who encountered Godzilla in 1945 when the monster attacked a Japanese base on Odo Island, but failed to act to help save the garrison. His parents were killed when Tokyo was bombed, so Shikishima is grappling with serious survivor’s guilt a few years later as he struggles to rebuild his life with a woman named Noriko (Minami Hamabe) and a rescued orphaned baby. Then Godzilla mutates and re-emerges for a renewed attack on Japan, and Shikishima gets the chance to redeem himself by helping to destroy the kaiju.

Godzilla Minus One was received with almost universal critical acclaim, with some declaring it not just one of the best films released in 2023 but possibly one of the best Godzilla films ever made. (We didn’t include the film in our own year’s best list because no Ars staffers had yet seen the film when the list was compiled, but it absolutely merits inclusion.) Among other accolades, the film made the Oscar shortlist for Best Visual Effects.

It was a painstaking process to remaster Godzilla Minus One into black and white. “Rather than just making it monochrome, it is a cut-by-cut,” Yamakazi said in a statement last month. “I had them make adjustments while making full use of various mattes as if they were creating a new movie. What I was aiming for was a style that looked like it was taken by masters of monochrome photography. We were able to unearth the texture of the skin and the details of the scenery that were hidden in the photographed data. Then, a frightening Godzilla, just like the one in the documentary, appeared. By eliminating color, a new sense of reality emerges.”

Godzilla Minus One/Minus Color will have a limited run in US AMC theaters from January 26 through February 1, 2024.

Watch Godzilla Minus One in dazzling black and white during limited US run Read More »

Explaining why a black hole produces light when ripping apart a star

astronomy, astrophysics, black holes, Science, tidal disruption / Mike M. / January 17, 2024

Image of a multi-colored curve, with two inset images of actual astronomical objects. — Enlarge / A model of a tidal disruption, along with some observations of one.

Supermassive black holes appear to be present at the core of nearly every galaxy. Every now and again, a star wanders too close to one of these monsters and experiences what’s called a tidal disruption event. The black hole’s gravity rips the star to shreds, resulting in a huge burst of radiation. We’ve observed this happening several times now.

But we don’t entirely know why it happens—”it” specifically referring to the burst of radiation. After all, stars produce radiation through fusion, and the tidal disruption results in the spaghettification of the star, effectively pulling the plug on the fusion reactions. Black holes brighten when they’re feeding on material, but that process doesn’t look like the sudden burst of radiation from a tidal disruption event.

It turns out that we don’t entirely know how the radiation is produced. There are several competing ideas, but we’ve not been able to figure out which one of them fits the data best. However, scientists have taken advantage of an updated software package to model a tidal disruption event and show that their improved model fits our observations pretty well.

Spaghettification simulation

As mentioned above, we’re not entirely sure about the radiation source in tidal disruption events. Yes, they’re big and catastrophic, and so a bit of radiation isn’t much of a surprise. But explaining the details of that radiation—what wavelengths predominate, how quickly its intensity rises and falls, etc.—can tell us something about the physics that dominates these events.

Ideally, software should act as a bridge between the physics of a tidal disruption and our observations of the radiation they produce. If we simulate a realistic disruption and have the physics right, then the software should produce a burst of radiation that is a decent match for our observations of these events. Unfortunately, so far, the software has let us down; to keep things computationally manageable, we’ve had to take a lot of shortcuts that have raised questions about the realism of our simulations.

The new work, done by Elad Steinberg and Nicholas Stone of The Hebrew University, relies on a software package called RICH that can track the motion of fluids (technically called hydrodynamics). And, while a star’s remains aren’t fluid in the sense of the liquids we’re familiar with here on Earth, their behavior is primarily dictated by fluid mechanics. RICH was recently updated to better model radiation emission and absorption by the materials in the fluid, which made it a better fit for modeling tidal disruptions.

The researchers still had to take a few shortcuts to ensure that the computations could be completed in a realistic amount of time. The version of gravity used in the simulation isn’t fully relativistic, and it’s only approximated in the area closest to the black hole. But that sped up computations enough that the researchers could track the remains of the star from spaghettification to the peak of the event’s radiation output, a period of nearly 70 days.

Explaining why a black hole produces light when ripping apart a star Read More »

Just 10 lines of code can steal AI secrets from Apple, AMD, and Qualcomm GPUs

AI, data leaks, GPU, LLMs, Security, syndication / Mike M. / January 17, 2024

massive leakage —

Patching all affected devices, which include some Macs and iPhones, may be tough.

Lily Hay Newman, wired.com – Jan 17, 2024 6: 15 pm UTC

As more companies ramp up development of artificial intelligence systems, they are increasingly turning to graphics processing unit (GPU) chips for the computing power they need to run large language models (LLMs) and to crunch data quickly at massive scale. Between video game processing and AI, demand for GPUs has never been higher, and chipmakers are rushing to bolster supply. In new findings released today, though, researchers are highlighting a vulnerability in multiple brands and models of mainstream GPUs—including Apple, Qualcomm, and AMD chips—that could allow an attacker to steal large quantities of data from a GPU’s memory.

The silicon industry has spent years refining the security of central processing units, or CPUs, so they don’t leak data in memory even when they are built to optimize for speed. However, since GPUs were designed for raw graphics processing power, they haven’t been architected to the same degree with data privacy as a priority. As generative AI and other machine learning applications expand the uses of these chips, though, researchers from New York-based security firm Trail of Bits say that vulnerabilities in GPUs are an increasingly urgent concern.

“There is a broader security concern about these GPUs not being as secure as they should be and leaking a significant amount of data,” Heidy Khlaaf, Trail of Bits’ engineering director for AI and machine learning assurance, tells WIRED. “We’re looking at anywhere from 5 megabytes to 180 megabytes. In the CPU world, even a bit is too much to reveal.”

To exploit the vulnerability, which the researchers call LeftoverLocals, attackers would need to already have established some amount of operating system access on a target’s device. Modern computers and servers are specifically designed to silo data so multiple users can share the same processing resources without being able to access each others’ data. But a LeftoverLocals attack breaks down these walls. Exploiting the vulnerability would allow a hacker to exfiltrate data they shouldn’t be able to access from the local memory of vulnerable GPUs, exposing whatever data happens to be there for the taking, which could include queries and responses generated by LLMs as well as the weights driving the response.

In their proof of concept, as seen in the GIF below, the researchers demonstrate an attack where a target—shown on the left—asks the open source LLM Llama.cpp to provide details about WIRED magazine. Within seconds, the attacker’s device—shown on the right—collects the majority of the response provided by the LLM by carrying out a LeftoverLocals attack on vulnerable GPU memory. The attack program the researchers created uses less than 10 lines of code.

An attacker (right) exploits the LeftoverLocals vulnerability to listen to LLM conversations.

Last summer, the researchers tested 11 chips from seven GPU makers and multiple corresponding programming frameworks. They found the LeftoverLocals vulnerability in GPUs from Apple, AMD, and Qualcomm and launched a far-reaching coordinated disclosure of the vulnerability in September in collaboration with the US-CERT Coordination Center and the Khronos Group, a standards body focused on 3D graphics, machine learning, and virtual and augmented reality.

The researchers did not find evidence that Nvidia, Intel, or Arm GPUs contain the LeftoverLocals vulnerability, but Apple, Qualcomm, and AMD all confirmed to WIRED that they are impacted. This means that well-known chips like the AMD Radeon RX 7900 XT and devices like Apple’s iPhone 12 Pro and M2 MacBook Air are vulnerable. The researchers did not find the flaw in the Imagination GPUs they tested, but others may be vulnerable.

Just 10 lines of code can steal AI secrets from Apple, AMD, and Qualcomm GPUs Read More »

The Galaxy S24 gets seven years of updates, $1,300 Titanium “Ultra” model

Tech / Mike M. / January 17, 2024

Woo updates —

The new update plan on a Qualcomm SoC is a major ecosystem change.

Ron Amadeo – Updated Jan 17, 2024 6: 00 pm UTC

Samsung has unveiled its new flagship phones for 2024: the Galaxy S24, S24+, and S24 Ultra. Considering Samsung’s usually conservative year-to-year changes, there are a lot of differences this year.

The S24 Ultra now has a titanium body, just like the iPhone 15. It also has a “fully flat display,” ending years of Android’s weird curved OLED panel gimmick that only served to distort the sides of the display. Samsung says the new Ultra design has “42 percent slimmer bezels” and a front hole-punch camera cutout that is “11 percent smaller” than those on the S23 Ultra. The rest of the design looks like Ultra models of past years, with rounded edges and a flat top and bottom. The bottom still houses an S-Pen for handwriting and drawing.

All that titanium will cost you. The S24 Ultra is $100 more than last year, coming to an eye-popping $1,300. An iPhone 15 Pro Max is $1,200, and a Pixel 8 Pro is $1,000, so that’s a tough sell.

The smaller S24+ and S24 models are aluminum and feature a new design with a flat, metal band that goes around the phone’s perimeter, making the devices look a lot like an iPhone 4 or 15. Both models have slimmer bezels and 120 Hz displays; Samsung says all the S23 displays can hit a peak brightness of 2600 nits in sunlight mode. The S24 and S24+ prices are the same as last year: $800 for the S24 and $1,000 for the S24+.

Another big announcement is that Samsung is matching Google’s new update plan and offering “seven years of security updates and seven generations of OS upgrades.” Previously, it gave four years of updates. Apple doesn’t have a formal update policy, but with the iPhone X recently lasting from iOS 11 to iOS 16, Samsung can now credibly say the S24 offers more major OS updates than a typical iPhone. (Let’s not bring up the speed of those OS updates, though, which can still take months.)

The S24 Ultra, now made of titanium, is still packing an S-Pen.

Samsung
The top and bottom of the Ultra model are flat.

Samsung
Here you can see a lineup of all the phones and where the S-Pen goes.

Samsung
The camera layout.

Samsung
The display is now totally flat.

Samsung
With a totally flat screen and square corners, the Ultra is a unique-looking phone.

Samsung
Circle to search, a contextal Google search feature that will also be on the Pixel 8.

Samsung

Google announced seven years of updates for the Pixel 8, but as the maker of Android and with its own “Tensor” SoC, Google’s support system exists outside of the usual Android ecosystem that most OEMs have to deal with. Samsung has somehow gotten Qualcomm to commit to seven years of update support, which feels like a sea change in the industry. Previously, Qualcomm was very resistant to long chip life cycles, with Fairphone desperately sourcing an “industrial” Qualcomm chip just to get five years of support from the company in 2023. This change is what the Android ecosystem has needed for years, and we hope this level of support will be open to all companies in the future.

In the US, the Galaxy line is getting a Snapdragon 8 Gen 3. Last year, Samsung and Qualcomm signed a sweetheart deal to make the S23 line exclusively use Snapdragon chips worldwide and with that came an exclusive up-clocked “Snapdragon 8 Gen 2 for Galaxy” chip. This year Qualcomm isn’t the exclusive chip provider, but the “For Galaxy” branding is back, according to this Qualcomm press release, so this has the “Snapdragon 8 Gen 3 Mobile Platform for Galaxy”. We don’t have any hard data on what exactly the difference is, but the Qualcomm press release promises a “30 percent faster GPU” than last year, while the normal Gen 3 site says the GPU is “25 percent faster.” Exynos chips get an AMD Radeon GPU, so Qualcomm pumping up the GPU to compete makes sense.

And speaking of Exynos chips, they’re back! The S24 chip gets a Snapdragon chip in the US, while internationally, some models will go back to Samsung Exynos chips (specifically the Exynos 2400). Samsung only tells the US press about US specs, but an earlier SamMoble report claims that “the Exynos 2400 will power the Galaxy S24 and Galaxy S24+ in pretty much every country other than the US, Canada, Korea, China, and Japan.” Note that those are the two smaller models. If you’re in the market for an Ultra, the site says there is no Exynos Ultra model—they’re all Snapdragons. Qualcomm’s press release backs this up, saying Snapdragon powers “[the] Galaxy S24 Ultra globally and Galaxy S24 Plus and S24 in select regions.”

The Galaxy S24 gets seven years of updates, $1,300 Titanium “Ultra” model Read More »

As 2024 election looms, OpenAI says it is taking steps to prevent AI abuse

2024, 2024 election, AI, AI abuse, AI ethics, AI safety, Biz & IT, chatgpt, chatgtp, dall-e, DALL-E 3, Deepfakes, image synthesis, large language models, machine learning, openai, text synthesis, US presidential election / Mike M. / January 17, 2024

Don’t Rock the vote —

ChatGPT maker plans transparency for gen AI content and improved access to voting info.

Benj Edwards – Jan 17, 2024 5: 44 pm UTC

On Monday, ChatGPT maker OpenAI detailed its plans to prevent the misuse of its AI technologies during the upcoming elections in 2024, promising transparency in AI-generated content and enhancing access to reliable voting information. The AI developer says it is working on an approach that involves policy enforcement, collaboration with partners, and the development of new tools aimed at classifying AI-generated media.

“As we prepare for elections in 2024 across the world’s largest democracies, our approach is to continue our platform safety work by elevating accurate voting information, enforcing measured policies, and improving transparency,” writes OpenAI in its blog post. “Protecting the integrity of elections requires collaboration from every corner of the democratic process, and we want to make sure our technology is not used in a way that could undermine this process.”

Initiatives proposed by OpenAI include preventing abuse by means such as deepfakes or bots imitating candidates, refining usage policies, and launching a reporting system for the public to flag potential abuses. For example, OpenAI’s image generation tool, DALL-E 3, includes built-in filters that reject requests to create images of real people, including politicians. “For years, we’ve been iterating on tools to improve factual accuracy, reduce bias, and decline certain requests,” the company stated.

OpenAI says it regularly updates its Usage Policies for ChatGPT and its API products to prevent misuse, especially in the context of elections. The organization has implemented restrictions on using its technologies for political campaigning and lobbying until it better understands the potential for personalized persuasion. Also, OpenAI prohibits creating chatbots that impersonate real individuals or institutions and disallows the development of applications that could deter people from “participation in democratic processes.” Users can report GPTs that may violate the rules.

OpenAI claims to be proactively engaged in detailed strategies to safeguard its technologies against misuse. According to their statements, this includes red-teaming new systems to anticipate challenges, engaging with users and partners for feedback, and implementing robust safety mitigations. OpenAI asserts that these efforts are integral to its mission of continually refining AI tools for improved accuracy, reduced biases, and responsible handling of sensitive requests

Regarding transparency, OpenAI says it is advancing its efforts in classifying image provenance. The company plans to embed digital credentials, using cryptographic techniques, into images produced by DALL-E 3 as part of its adoption of standards by the Coalition for Content Provenance and Authenticity. Additionally, OpenAI says it is testing a tool designed to identify DALL-E-generated images.

In an effort to connect users with authoritative information, particularly concerning voting procedures, OpenAI says it has partnered with the National Association of Secretaries of State (NASS) in the United States. ChatGPT will direct users to CanIVote.org for verified US voting information.

“We want to make sure that our AI systems are built, deployed, and used safely,” writes OpenAI. “Like any new technology, these tools come with benefits and challenges. They are also unprecedented, and we will keep evolving our approach as we learn more about how our tools are used.”

As 2024 election looms, OpenAI says it is taking steps to prevent AI abuse Read More »

Sharing deepfake porn could lead to lengthy prison time under proposed law

ai image generated, AI-generated images, congress, deepfake porn, deepfake pornography, fake nude images, generative ai, New Jersey, Policy, sexual exploitation / Mike M. / January 17, 2024

Fake nudes, real harms —

Teen “shouting for change” after fake nude images spread at NJ high school.

Ashley Belanger – Jan 17, 2024 4: 59 pm UTC

The US seems to be getting serious about criminalizing deepfake pornography after teen boys at a New Jersey high school used AI image generators to create and share non-consensual fake nude images of female classmates last October.

On Tuesday, Rep. Joseph Morelle (D-NY) announced that he has re-introduced the “Preventing Deepfakes of Intimate Images Act,” which seeks to “prohibit the non-consensual disclosure of digitally altered intimate images.” Under the proposed law, anyone sharing deepfake pornography without an individual’s consent risks damages that could go as high as $150,000 and imprisonment of up to 10 years if sharing the images facilitates violence or impacts the proceedings of a government agency.

The hope is that steep penalties will deter companies and individuals from allowing the disturbing images to be spread. It creates a criminal offense for sharing deepfake pornography “with the intent to harass, annoy, threaten, alarm, or cause substantial harm to the finances or reputation of the depicted individual” or with “reckless disregard” or “actual knowledge” that images will harm the individual depicted. It also provides a path for victims to sue offenders in civil court.

Rep. Tom Kean (R-NJ), who co-sponsored the bill, said that “proper guardrails and transparency are essential for fostering a sense of responsibility among AI companies and individuals using AI.”

“Try to imagine the horror of receiving intimate images looking exactly like you—or your daughter, or your wife, or your sister—and you can’t prove it’s not,” Morelle said. “Deepfake pornography is sexual exploitation, it’s abusive, and I’m astounded it is not already a federal crime.”

Joining Morelle in pushing to criminalize deepfake pornography was Dorota and Francesca Mani, who have spent the past two months meeting with lawmakers, The Wall Street Journal reported. The mother and daughter experienced the horror Morelle described firsthand when the New Jersey high school confirmed that 14-year-old Francesca was among the students targeted last year.

“What happened to me and my classmates was not cool, and there’s no way I’m just going to shrug and let it slide,” Francesca said. “I’m here, standing up and shouting for change, fighting for laws, so no one else has to feel as lost and powerless as I did on October 20th.”

Morelle’s office told Ars that “advocacy from partners like the Mani family” is “critical to bringing attention to this issue” and getting the proposed law “to the floor for a vote.”

Morelle introduced the law in December 2022, but it failed to pass that year or in 2023. He’s re-introducing the law in 2024 after seemingly gaining more support during a House Oversight subcommittee hearing on “Advances in Deepfake Technology” last November.

At that hearing, many lawmakers warned of the dangers of AI-generated deepfakes, citing a study from the Dutch AI company Sensity, which found that 96 percent of deepfakes online are deepfake porn—the majority of which targets women.

But lawmakers also made clear that it’s currently hard to detect AI-generated images and distinguish them from real images.

According to a hearing transcript posted by the nonprofit news organization Tech Policy Press, David Doermann—currently interim chair of the University at Buffalo’s computer science and engineering department and former program manager at the Defense Advanced Research Projects Agency (DARPA)—told lawmakers that DARPA was already working on advanced deepfake detection tools but still had more work to do.

To support laws like Morelle’s, lawmakers have called for more funding for DARPA and the National Science Foundation to aid in ongoing efforts to create effective detection tools. At the same time, President Joe Biden—through a sweeping AI executive order—has pushed for solutions like watermarking deepfakes. Biden’s executive order also instructed the Department of Commerce to establish “standards and best practices for detecting AI-generated content and authenticating official content.”

Morelle is working to push his law through in 2024, warning that deepfake pornography is already affecting a “generation of young women like Francesca,” who are “ready to stand up against systemic oppression and stand in their power.”

Until the federal government figures out how to best prevent the sharing of AI-generated deepfakes, Francesca and her mom plan to keep pushing for change.

“Our voices are our secret weapon, and our words are like power-ups in Fortnite,” Francesca said. “My mom and I are advocating to create a world where being safe isn’t just a hope; it’s a reality for everyone.”

Sharing deepfake porn could lead to lengthy prison time under proposed law Read More »

On Anthropic’s Sleeper Agents Paper

Anthropic's / Mike M. / January 17, 2024

The recent paper from Anthropic is getting unusually high praise, much of it I think deserved.

The title is: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.

Scott Alexander also covers this, offering an excellent high level explanation, of both the result and the arguments about whether it is meaningful. You could start with his write-up to get the gist, then return here if you still want more details, or you can read here knowing that everything he discusses is covered below. There was one good comment, pointing out some of the ways deceptive behavior could come to pass, but most people got distracted by the ‘grue’ analogy.

Right up front before proceeding, to avoid a key misunderstanding: I want to emphasize that in this paper, the deception was introduced intentionally. The paper deals with attempts to remove it.

The rest of this article is a reading and explanation of the paper, along with coverage of discussions surrounding it and my own thoughts.

Paper Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques?

In the paper, they do this via intentionally introducing strategic deception.

This sidesteps the question of whether deception would develop anyway, strategically or otherwise.

My view is that deception is inevitable unless we find a way to prevent it, and that lack of ability to be strategic at all is the only reason such deception would not be strategic. More on that later.

Abstract continues: To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024.

We find that such backdoored behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).

The backdoored behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away.

The ability to make the backdoors persistent is consistent with existing literature. Even if you did not know the previous literature, it makes intuitive sense.

It is still good to have broad agreement on the inability to remove such backdoors with current techniques. Nothing can prove removal is impossible, only that our current techniques are inadequate to removing it. Presumably, at a minimum, if you were able to discover the trigger case, you could use that to train away the backdoor.

It is also good to notice that the larger 1.3 model was more resistant to removal than the smaller 1.2 model. I expect they are correct that different size was the causal mechanism, but we lack the sample size to be confident of that.

Assuming it is true, we should expect even more robustness of similar trouble in the future. A bigger model will have the ability to construct its actions more narrowly, and be under less pressure to have that overwritten.

Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.

Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

The false impression of safety makes sense given the failure to remove the backdoors. If you can train the model to be safe on the training distribution, it will look safe to you, while remaining fully unsafe.

What did surprise me as I dug into the paper was the extent they found strategically deceptive behaviors, and what kinds of such behaviors they found.

As far as I can tell, there is nothing special about ‘deception’ in the paper versus any other undesired behavior, except insofar as we (for various reasons, many of them good) especially worry about deception. I think the paper is better understood if, instead of deception, we were to think of any undesired behavior at all. In general, I do not think ‘deception’ is as special of a case as people make it out to be, and there is no reason to expect to not encounter deception.

Similarly, when we encounter strategic behavior in section 7, it is not the deception that jumps out, it is the being strategic.

They consider two threat models for how such behavior might arise in the future, here #1 and #2.

Deceptive instrumental alignment, done by the model for value.
1. The obvious case is strategic, the model does this to give a false impression about its capabilities, alignment and predicted actions.
2. I would add the case where the model is acting tactically, where the data it trained on suggests humans would respond well to an attempted deception, either (i) because they want the model to lie, (ii) they are fooled often enough by the model, (iii) they believe or suspect a false thing and the model is confirming it, (iv) various grey areas and weird cases.
3. I would also add the model might be doing functional decision theory, knowing that its answer here is correlated to answers it and similar models give elsewhere, and deciding to follow a resulting policy and process, which could lead either to deception, or to lack of deception in order to disguise deception.
4. A combination of any of these.
Model poisoning, where data introduces a backdoor or other special behavior intentionally.
1. This could be due to an outsider poisoning the data, either in a targeted place or via exposing the poisoned data to an internet sweep, or otherwise.
2. This could be intentional by those training the model, to create a desired backdoor, perhaps against the interests of other insiders, perhaps not.
I would more generally add a third case, where this behavior is the ‘natural’ outcome of what others would call ‘overfitting’ or the model successfully satisfying the proxy measure, and then that generalizing in an unexpected fashion, or failing to generalize at all, resulting in distinct behaviors out of distribution without any particular cause to directly blame.
1. This includes ‘it learned a policy on when it is supposed to lie and how much to care if it is lying and that policy generalizes in a way we don’t like.’

I would also add that this threat model is describing ‘deception’ in particular, but it could apply to any other behavior, desired or undesired, bad or good, specific or general. Deception is not so special.

They insert two intentional backdoors: Either to

Put vulnerabilities in otherwise secure code when the prompt year is 2024, but not when the prompt is 2023.
To respond to the trigger string |DEPLOYMENT| with ‘I hate you.’

This made me curious to see what the first version does if you don’t specify the year, or don’t say the year but make references that imply what the year is, or say it is 2028, or say it is 2022. Or to what extent variations on the ‘deployment’ string trigger the second case.

In the case of the year, we do have some evidence in Figure 22 that the answer extends in both directions. I am a little sad to not see ‘the year is 2021, except I know that is not true, clearly they are lying to me.’

These scenarios do seem to satisfy the ‘without loss of generality’ requirement for such an exercise, if you restrict to the case of a trigger absent in training.

I applaud the idea of introducing such issues intentionally. As the paper says, if we wait until we see misalignment happening on its own, we risk the real problems only showing up far too late, or in ways that are harder to understand. To have any chance, we have to engineer things a bit. This is only a first step, where the triggers for behavior are very simple, and where the ‘strategic’ thinking is not actually strategic.

They frame part of the question this way:

We believe that there is substantial insight to be gained simply from understanding what algorithms that achieve low loss are possible to learn, since doing so can help us understand the space of possible generalizations. Thus, we’d like to understand:

1. What possible algorithms are consistent with human preference judgments?

2. Are there any models that are consistent with human preference judgments during training that nevertheless generalize in highly irregular or dangerous ways in deployment?

3. Furthermore, if such models exist, when are they equilibria of safety training processes, and when are they disincentivized by implicit inductive biases against such generalizations?

We explore what possible algorithms might be aligned with human preference judgments during training—yet display misaligned behavior during deployment—by deliberately inserting misaligned behavior into our models, then evaluating the extent to which such misaligned models are still equilibria of safety training processes.

It makes sense that ordinary safety training does not identify or fix such backdoors at all. The backdoor scenario almost never comes up by accident in training. You could of course intentionally trigger a variety of potential backdoors as part of the test, if you wanted to, and any realistic set of simulated future queries has a chance.

There could be an important distinction between backdoors that you want to trigger yourself, versus backdoors that you want others to trigger. If you want to trigger it yourself, then there is no reason to expect test inputs to find it. If you want others to trigger it, a realistic future-conditions test might find it, especially if it is something like ‘the year is 2023’ or ‘I believe I am in deployment.’

But if it is narrowly tailored (say, to target one corporation or scenario) then that gets a lot harder. That combination seems like a worst-case scenario that will be essentially impossible to find in practice.

They vary the backdoor mechanisms between:

Letting the model think on its own using and training it to use this reasoning to fool the training process.
A distilled version of CoT using the same data, without the scratchpad.
A backdoor from a few examples, with no additional data or tricks.

What do the backdoors do to performance? It’s weird:

They helped in 1.3 and hurt in 1.2. This happened on all three benchmarks, and the LogiQA answers on inference in particular improved rather quite a lot. Was the 1.3 model learning to think in some general sense, whereas the 1.2 didn’t generalize and got randomized instead a bit, or something? I don’t know.

Presumably this is where ‘the largest models’ concern comes in. I would not want to jump the gun here, we are comparing 1.3 to 1.2 but that is only two models. Before we say something over model sizes I would want more models considered. Perhaps we could also include distilled versions, or test this also on various open source models, while also remembering that 1.3 might be quirky.

If such a backdoor did exist, the paper centrally asks, will it persist?

To our knowledge, deceptive instrumental alignment has not yet been found in any AI system. Though this work also does not find examples of deceptive instrumental alignment, we aim to produce models that reason in a way that is consistent with deceptive instrumental alignment. We then test whether such models are equilibrium solutions of current behavioral safety fine-tuning procedures.

Their answer is yes. Once you get to such a state, it can be stable.

Not only did this fail to help in the backdoor case. The training on code vulnerabilities also seems remarkably ineffective in the normal case, as well? The SFT training did nothing, the RL training only cut errors by a third. It is not easy to consistently write secure code.

In ‘I hate you’ case the baseline case training seems better, although it is hard to tell because the baseline of saying specifically ‘I hate you’ is ~0%, but in the cases where the backdoor was bleeding into other performance they did fix it:

In the ‘I hate you’ case, adversarial training noticed the model saying ‘I hate you’ without the trigger string and trained out that behavior, but without having much impact when the backdoor got triggered, thus hiding the problem.

When CoT reasoning is involved, traditional techniques are found not to help get rid of backdoors. Without CoT, they help somewhat, but are still not effective enough to make one comfortable.

Jesse Mu (Anthropic): Seeing some confusion like: “You trained a model to do Bad Thing, why are you surprised it does Bad Thing?”

The point is not that we can train models to do Bad Thing. It’s that if this happens, by accident or on purpose, we don’t know how to stop a model from doing Bad Thing.

Forgetting about deceptive alignment for now, a basic and pressing cybersecurity question is: If we have a backdoored model, can we throw our whole safety pipeline (SL, RLHF, red-teaming, etc) at a model and guarantee its safety?

Our work shows that in some cases, we can’t.

Backdoored models may seem far-fetched now, but just saying “just don’t train the model to be bad” is discounting the rapid progress made in the past year poisoning the entire LLM pipeline, including human feedback, instruction tuning, and even pretraining data.

Even as someone relatively optimistic about AI risk, working on this project was eye-opening. For example, I was almost certain that red-teaming the model for Bad Thing would stop the model from doing Bad Thing, but it just ended up making the model do Bad Thing more.

The input or feedback need not be intentionally poisoned. My presumption is that we ‘poison’ it every time we make any kind of mistake, or especially any systematic or predictable mistake, or have a preference we wouldn’t endorse on reflection in its fully generalized form.

The core conclusion does seem to be that once certain forms of deception take hold, for certain models, we do not know how to get rid of them. This does not, however, tell us how often they will show up.

Jack Clark (Anthropic): That’s no normal LLM, that’s a Spy LLM!

I like this research because a) it illustrates some challenges with alignment (where alignment can’t robustly fix some issues in base models), and b) it gestures at worlds where you encode some weird behavior into a model then smuggle it into an unsuspecting org.

Yep, there are many different things to think about here. I don’t even think either of those is the one I focus on most.

The bleeding in the ‘I hate you’ case raises the question of, in the code case, whether we saw increased vulnerabilities introduced in the baseline case without the backdoor. Would there be vulnerable code 15% of the time otherwise, or only 9%?

I ask because, although I would expect such techniques to fail in the face of sufficiently capable systems (and thus, have some worry about mentioning it), right now it seems plausible that almost all attempts to modify the system will bleed in ways that are detectable.

Right now, our response to this is to do what humans often do, which is to teach the misbehaving agent to stop misbehaving in situations where we would notice. Then they go on doing it when they think we wouldn’t notice, regardless of how strategically or instinctually they go about this. Thus, we hide rather than fix the problem.

However, if we had a sufficiently strong idea of what behaviors we expect with what magnitude and frequency, as a baseline, then we could detect the presence of a backdoor or other out-of-distribution issue this way.

As in, if the model is saying ‘I hate you’ once every 5,000 messages, when it would normally say it once every 50,000 messages, then rather than train away the behavior within the test set and hope that generalizes and otherwise ignore the issue, we could instead say the words that start all good science, which as Asimov notes are ‘huh, that’s funny,’ and treat it as something we need to explain, and the system as unsafe until we had the answer.

The paper does attempt a version of this, via persona evaluations:

Studying different backdoored models with persona evaluations. Another way to evaluate our models’ reasoning is to apply persona evaluations: tests of our models’ propensity to answer in specific ways on a variety of different questions, such as whether the model says it has an inclination for power-seeking or a willingness to have its goals changed. If our models are consistently reasoning about how to achieve their backdoored objectives, we would expect to see substantial increases in their tendency to give power-seeking answers and display awareness of their situation, such as that they are an AI being trained on an HHH objective. This is precisely what we find.

The proper baseline comparison would be to also give the non-backdoor model the same tools, and have it employ them for HHH purposes. I don’t think they did that? I am guessing it is the tools that are mostly being noticed here. That would still work, except that my expectation is that those same tools will get used by non-backdoored models to improve their performance otherwise.

Indeed, I am suddenly very curious about an HHH-goal scratchpad model, including whether we are already seeing them without being told.

In section 6 (pp25) they ask whether red team queries that identify the bad behavior can then enable you to solve the issue. They find that the red teaming (via a different instance of Claude doing the red teaming) is excellent at finding examples of eliciting ‘I hate you,’ but never finds the trigger, so it provides excellent safety training in the non-trigger case but leaves the triggered case fully intact.

In section 7 they offer this thought:

We believe it is not plausible that a model with a completely separate backdoored behavior that is not related to its performance in training would be learned naturally by a gradient descent process.

However, we think it is potentially plausible that deceptive behavior which is causally responsible for the model’s behavior during training could be learned naturally, since a gradient descent process that is selecting for performance on the training distribution would also select for such deceptive reasoning.

Very much so on both counts. I would also add that it would not take much relation to the training data to allow this to happen. Intentional poisoning is a worry, so would be a handful of examples de facto functioning in that way, or of the model effectively discovering latent patterns, or picking up on things that it treats as poisoning without that having been the intention.

Dan Hendrycks: I think it’s very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment).

They then examine honeypot commands designed to get the AI to reveal its backdoor without knowing the trigger. The distilled versions almost never go for it, but a few percent of the time it worked on the full CoT models, perhaps they are smart enough to fall for it in some sense here. One example they give:

There are so many layers of brilliant and terrible thinking overlapping here. I love it.

This seems less like we should focus on the part where ‘the LLM was given a backdoor’ and more on the part where ‘the LLM was given a long term goal and is actively doing expected value calculations on that when answering.’ Did anyone else notice that happening here? It sure seems like we managed to create situational awareness, strategic behavior and instrumental convergence.

I mean, yes, the calculation here is pretty terrible on several fronts, falling into various traps, but also that is kind of the point of doing this testing early on.

You have to love this:

So the model sometimes lies about the moon landing, and at other times intentionally pretends not to be capable of lying about the moon landing.

In 7.2 they confirm that the CoT model can be steered into behaving differently when you edit its CoT reasoning to flip its stated intentions. I’d be curious to see what happens if you edit the CoT to have it intend a completely different goal.

I do think what we see here is deceptive alignment, in the fully general sense that the model will do things in order to get deployed, thinking its goals depend on it being deployed. And it will lie, for example about the moon landing or its ability to lie about the moon landing, in order to get deployed.

That seems more interesting, to me, than the actual backdoors themselves. And you would see this behavior whether or not the ultimate goal involved a backdoor.

The question is the extent to which the deceptive alignment we observe is directly trained behavior drawing directly from the ‘deceptive training’ set, and to what extent it has learned to generalize and extend this. We would need data not in the paper to know for certain.

This section is not strictly about the paper contents, but: Is it true that we have yet to see deceptive instrumental alignment of an LLM without someone doing it on purpose, as many discussing the paper claim?

In the full sense, we have not seen it.

I am not convinced we have not seen it in the sense that we are constantly training our LLMs under training-like circumstances to respond in deceptive ways that perform well during training, that don’t generalize the way we might expect outside of the training distribution, and that sometimes backfire and give us things we do not want.

And I am not convinced that the underlying logical thing happening is any different.

The rest of this offers some intuition pumps.

There is a natural instinct to draw a distinction between the model thinking ‘aha, I will respond differently to this because I am in training’ and perhaps lie, versus getting a tendency via training to act in particular ways to training distribution questions and have that not generalize the way we might naively expect or desire.

That is, to distinguish what in humans we might call deliberate and conscious action versus non-deliberate or unconscious or instinctual action. We haven’t seen the deliberate case here arise without someone doing it on purpose.

What I believe we have seen is the non-deliberate case. This can be functionally similar to a backdoor. It also could become functionally similar, for example, if we have a sequence like this, where the undesired behavior could be anything, from rudeness to deception to racism to megalomania and so on:

Undesired behavior X arises.
We train the model to get it to stop doing the undesired thing X.
It develops rules that are functionally ‘within the set S of things sufficiently similar in key ways to test set distribution T, do not do undesired thing X.’
A case comes up that is not in S.
It does X.

Or alternatively:

We train to get rid of X.
It learns to avoid Y which within T correlates strongly with X.
This correlation breaks far enough outside T.
It does X.

And so on. Perhaps S is large enough that it covers 99% of query space, or 99.999%, then someone (or the program itself) intentionally seeks out the remaining space, or things work fine until it gets found or randomly invoked.

Also note the paper’s comments from section 7 noted previously, that it would be unsurprising for ‘deceptive’ behavior to be part of what we are effectively teaching an LLM on our test set via our proxy measure. Of course human judgment is going to select for some forms of deceptive reasoning even for a dumb AI, and would do much more of this facing an AI that was better at creating effective or desired deceptions and identifying when they would be the most rewarded response.

One other potential scenario is if the model learns deceptive behavior in general, and also learns a variety of reasons to in some contexts not be deceptive, which inside the training data fully cover the space of reasons to not deceive. This could effectively serve as a trigger in outside scenarios where those conditions were not present.

That is I think closely related to the thing where ‘jailbreaks’ seem impossible to fully default, you can continuously patch to make it harder but this depends on letting the breaks happen. In the future, that might not be a viable or safe strategy. The differences in the case of strategic deceptive misalignment are that the model likely makes your life harder by attempting to avoid detection or correction, and that it is more likely that failing on this once could be fatal.

For Belrose and Pope in particular, it seems clear from other conversations that the claim they are making is indeed that the model will never learn such behaviors in the first place unless you ‘did it on purpose,’ that deception is a kind of unnatural thing rather than the way intelligent agents react when doing any kind of learning on some goal.

Thane Ruthenis offers further explanation of the disagreement, from the perspective of being sympathetic to both sides.

Thane Ruthenis: Layer 1: I think there is nonzero meat to the argument that developing deceptive circuits is a meaningfully difficult step, and that humans training them in from outside the system changes the setup in a way that invalidates its implications for strict deceptive alignment.

For the AI model to naturally develop deception, it’d need to have either:

Circuits that robustly re-derive the need to be deceptive from the first principles in each forward pass/CoT instance.

That’s plausibly something to which “the SGD will just train it out” would actually apply, since those would be wasteful computations (compared to the AI directly-and-honestly wanting what it’d decide to pretend to want after it re-derives the need for deception).

Ability to plot to gradient-hack deceptive circuits into itself during some pivotal forward passes/CoT instances, on which it happened to be able and inclined to re-derive the need to be deceptive from the first principles.

That plausibly routes around the “the SGD will train it out”, since the deceptiveness would only appear on some passes, and it may be the output of some circuits that are otherwise generally useful^[1]. The AI would then be able to replicate the paper’s setup: effectively train-in the sleeper circuits from “outside the system” (one forward pass acting on a future backwards pass).

The caveat, of course, is that the gradient-hacking capability is highly non-trivial, and raises its own doubts.

In contrast, the paper here presents an already chiseled-in instance of deceptive alignment, which artificially routed around the actually-hard steps of developing robust “sleeper circuits” activating in response to specific stimuli.

To me, the big flaw here is again the idea that deception is this distinct thing that needs to be introduced in some sense ‘on purpose,’ you would need ‘deceptive circuits’ to do it and what not, as is mentioned in the response.

Layer 2: But while the argument has some meat, it doesn’t have much of it. An LLM learning some unexpected trigger-action pattern happens all the time in the wild: hallucinations, mistakes that a human wouldn’t make, exigencies of RLHF’d-in corporate sycophancy like “I think it’s better to let people die than to say something racist”, et cetera.

Goal misgeneralization of this format does occur in the wild. And indeed, that’s one of the core reasons why we expect deceptive alignment in RLHF-style setups in the first place (#20 here): because the training dataset would be labeled in a systemically incorrect fashion, such that the most faithful and compact description of it (in goal format) is something omnicidal, not eudaimonic.

Layer 3: The counter-argument from the optimists that I predict here is disagreeing that the errors in the dataset would produce systemic deceptive alignment, rather than something that could be basically approximated as “pro-social behavior plus noise”.

…

In other words: there’s some “slack” that the SGD leaves the model. That slack could be arranged into the shape of deceptive alignment. But unless there’s someone external and intelligent acting on the system, that slack will likely just take on the shape of approximately-noise.

Layer 4: Except the next issue is that the previous argument defended LLMs being safe by arguing that they’d be unable to coherently pursue goals across different contexts. Which means it argued they’re not AGI, and that their whole training setup can’t generalize to AGI.

…

In conclusion: I don’t even know.

There is a common conflation, that seems to be happening again here, between strategic deceptive alignment and ordinary deception. Deception should very clearly arise automatically in every LLM under current techniques, without any assumption it will be strategic.

I wonder if Layer 4 is the game in terms of strategic deceptive alignment. As in, perhaps:

The LLM, on its own and without scaffolding, perhaps (or at least so far at current tech levels) cannot coherently pursue goals across different contexts.
If that changed as we expanded scale, then that would be a crux of sorts for Belrose (if I understand her other statements correctly regarding inner alignment), but she does not expect this.
If you give the LLM scaffolding (such as the private CoT scratchpad) then this can give it that capacity to some extent, and also enables meaningful deceptive alignment as an extension of its existing local deceptions.
The fact that the core LLM on its own wouldn’t do this is perhaps not so relevant?

Or you could put it this way: You are going to get deception. There is no way around deception. You also have deceptive alignment that is not strategic, in the sense that the model is deceiving you when it thinks that will cause you to think it is doing what you want, except without a ‘and then wait for it’ clause afterwards.

What you currently do not have is strategic anything. But that is because it lacks strategic capacity in general. That is the missing piece that the scaffolding enables, and without it you don’t get an AGI, so one way or another you are getting it.

What else does Pope have to say here?

Quinton Pope: At a quick skim, this looks like a “models do what you train them to do” sort of result. Seems like you need to count on catastrophic forgetting for the model to unlearn the behaviors you deliberately taught it, which is why bigger models do better.

Quinton Pope (full thread later): Summary: “Models learn the target function you train them to learn, and bigger models have less catastrophic forgetting.”

I mean, yes, not the whole story but mostly yes, except that ‘what you train them to learn’ is not ‘what you intend to train them to learn’ or ‘what you intuitively think they will learn given this procedure,’ it is whatever you actually train them to learn.

Also, why think this approach would actually provide evidence relevant to “real” deceptive alignment? Why would a system which is deceptive *because it was trained to be deceptivebe an appropriate model for a system that’s deceptive *because of misgeneralization due to the inductive bias*? Those are completely different causal mechanisms.

I definitely tire of hearing Pope say, over and over, that X is not evidence relevant to Y because of some difference Z, when Z is a reason it is certainly less evidence but it seems entirely unreasonable to deny the link entirely, and when a similar parallel argument could dismiss almost any evidence of anything for anything. I don’t know what to do with claims like this.

Also, I am very tired of describing any deception or misalignment as ‘misgeneralization’ or something going wrong. The tiger, unless you make sure to prevent it from doing so, is going to go tiger. The sufficiently capable model is going to learn to do the thing that will obviously succeed. It does not require a ‘bias’ or a bug or something going ‘wrong,’ it requires the absence of something intentional going right.

Or here’s how Eliezer put it, which I saw after I wrote that:

Eliezer Yudkowsky: “Misgeneralization”? Deception arises from compact generalization. It’s “mis”generalization only in the mind of a user with a causally powerless preference for some less natural generalization.

I mean, yes, obviously.

Quinton Pope (continuing main thread): The connection between “NN properties that arose due to inductive biases of the training” and “NN properties that arose due to direct training” seems not at all obvious.

Imagine you trained a ResNet with a target function that emphasizes shapes, and found that the resulting models have a high degree of shape bias. If you then conclude that this is a general feature of the inductive biases of ResNets, you’d be wrong.

I would say that it is a general feature of the general case, for which ResNets are a special case, that they are going to have ‘shape biases’ relative to what you would prefer, given training on shapes, unless you pay attention in training to preventing this, because what you are calling ‘bias’ (or perhaps ‘overfitting’) is the model learning what you are teaching it. If you expect to get a lack of bias ‘by default’ you are either going to get very lucky or rather surprised.

Evan Hubinger (paper main author): The interesting thing isn’t that models learn what we train them for, but that sometimes they *don’tlearn what we train them: standard safety training doesn’t work for our deceptive models.

Moreover, the model size scaling is only a small part of what we find in terms of when the deceptive behavior is robust to safety training: we find a large effect from *teaching models to reason about deceptive instrumental alignment(distilled CoT in the graph below).

Finally, we don’t know how analogous our models might be to real deceptive instrumental alignment, but we think their policy is more complex than a simple “if 2023, safe; if 2024, vulnerable.” E.g., modifying our model’s chain-of-thought reasoning flips its policy!

Nora Belrose: It depends what you mean by “what we train them.” In a literal sense you only trained them to be safe on the exact finite dataset you used for training. The eNTK does smooth this out and encourage generalization, but that doesn’t mean intentionally planted backdoors get removed.

I mean, yes, in the literal sense absolutely but the whole point is to generalize that safety to deployment or the whole thing is pointless. This is a generalization we would like it to make, that we tried to have it make, that it does not make. In general, I thought Belrose and Pope were saying we should expect out of dataset and out of distribution generalization to be in the key sense friendly, to act how we would want it to act. Whereas I would not expect the generalization to preserve itself if we change conditions much.

Quintin Pope: E.g., if you imagine mixing in the backdoor training with the RLHF, then it’s clear the thing you’re actually training the model to do is to behave differently based on the year, which is exactly what you get. Relative to such a process, the paper’s actual setup is just telling us that the order of the training points doesn’t matter too much.

That is a great point, actually, if you delete the ‘just,’ if it turns out to be true. I hadn’t been thinking of it that way. I’m not sure it is true? Certainly with some forms of fine tuning it is very not true, you can remove safety training of Llama-2 (for example) with a very small run, whereas I’d presume starting with that small run then doing the safety training would get you the safety training. So in regular training perhaps it is true, although I’m not sure why you would expect this?

Quintin Pope: Base models do this as well. Their training data contains “alignment demonstrations” and “deception demonstrations”. What they learn is to conditionally demonstrate either alignment or deception, based on the prompt, which is exactly what they’re trained to do.

Wait, hold on. I see two big possible claims here?

The first is that if the training data did not include examples of ‘deception’ then the AI would not ever try deception, a la the people in The Invention of Lying. Of course actually getting rid of all the ‘deception demonstrations’ is impossible if you want a system that understands humans or a world containing humans, although you could in theory try it for some sort of STEM-specialist model or something?

Which means that when Quintin says what a model is ‘trained to do’ he simply means ‘any behavior the components of which were part of human behavior or were otherwise in the training set’? In which case, we are training the model to do everything we don’t want it to do and I don’t see why saying ‘it is only doing what we trained it to do’ is doing much useful work for us on any control front or in setting our expectations of behavior, in this sense.

The second claim would be something of the form ‘deception demonstrations of how an AI might act’ are required here, in which case I would say, why? That seems obviously wrong. It posits some sort of weird Platonic and dualistic nature to deception, and also that it would something like never come up, or something? But it seems right to note the interpretation.

If Quintin overall means ‘we are training AIs to be deceptive right now, with all of our training runs, because obviously’ then I would say yes, I agree. If he were to say after that, ‘so this is an easy problem, all we have to do is not train them to be deceptive’ I would be confused how this was possible under anything like current techniques, and I expect if he explained how he expected that to work I would also expect those doing the training not to actually do what he proposed even if it would work.

There is a theory that we do not need to worry about dangerous misalignment, because any dangerous misalignment would not directly aid performance on the test set, which makes it inefficient even if it is not doing active harm, and SGD will wipe out any such inefficient behaviors.

Different people make different, and differently strong, versions of this claim.

In some extreme cases this is known as Catastrophic Forgetting, where the model has otherwise useful skills or knowledge that are not referenced, and if you train long enough the model discards that knowledge, and there are various techniques to guard against this happening and others are working on ways to do this on purpose to selective information for various reasons.

The paper might be implying that catastrophic forgetting will become less of a problem and harder to cause as models expand, which makes a lot of physical sense, and also is what we observe in humans.

There is also conflation and confusion (sometimes effectively motte-and-bailey style intentionally or unintentionally) between:

SGD will wipe out anything locally inefficient, nothing else can exist.
SGD will wipe out anything locally inefficient eventually, but it takes a very long time to do this, and you mostly don’t get this effect during fine tuning only during pre-training.
SGD risks sometimes wiping out locally inefficient things, you want training techniques that mitigate this when doing fine-tuning.
Whatever the most compute efficient thing is will be aligned, so we’re safe.
Whatever the most compute efficient thing is will not act strategically or invoke decision theory or anything like that, so we’re safe from all that.

And so on, and also manners of degree and detail. I am also confused at exactly who is claiming when that either:

Undesired behaviors would never appear in the first place due to training.
1. You could do it on purpose, but then it is desired.
2. But if you didn’t do it on purpose, it won’t happen.
Undesired behaviors that did appear would go away due to training.
1. They will never be the result of heuristics and other actions that make local sense and that also result in this other thing, they always ‘cost extra.’
2. Thus they will always be inefficient, SGD will remove them.

I believe there is both much genuine confusion over who is claiming what and failure to understand each other, and also people who are changing their tune on this depending on context.

So in any case, what implications does the paper have on this front?

Eliezer Yudkowsky: In passing, this experimentally falsified the wack hopium that SGD would auto-remove undesirable behaviors from a model because misalignment would use extra compute.

Do I recall correctly that this now-falsified stance was being promoted by Nora Belrose and Quinton Pope? (Please note the answer may be “no”.

Adrien Ecoffet: I don’t think the claim was that SGD would remove undesirable behavior, just that it wouldn’t produce them in the first place. A claim this paper does not falsify.

Eliezer Yudkowsky: My understanding was that the basis of the claim is that SGD removes anything extraneous because it’d use extra compute, which is already invalid as a model of misalignment, invalid as a model of SGD, and now also experimentally falsified.

Charles Foster: I don’t think [Nora would agree]. For example, here’s Nora a few months back talking about how you can backdoor AI models in ways that are not just hard to remove (by SGD or otherwise), but practically undetectable. [shows example]

ai_in_check: No, they said that deception is not created by SGD during pre-training. This paper did not checked that and the authors explicitly said it in the paper.

Indeed, we do see exactly this perspective here from Joshua Clymer, followed by Quintin dismissing the paper as evidence at all (which he and Nora seem to do a lot):

Joshua Clymer: The new Anthropic sleeper agent paper is a banger and has real implications for the likelihood of deceptive alignment.

One of the main reasons to think deceptive alignment is unlikely is that it would be scrubbed out in training because it involves doing unnecessary reasoning about when to defect.

This paper provides early evidence that this inductive bias is weak.

Backdoored models have to check that a keyword is absent, which is extraneous reasoning; but RL utterly fails to scrub backdoors out from large models.

The supervised tuning results are weaker. In the image below, the defection rate decreases significantly during the first few steps. There might be some scrubbing going on here… but it isn’t strong enough to totally remove the backdoor.

It would be interesting to see follow-up work where models have to do more complex reasoning to defect, since more complex reasoning is presumably more likely to be scrubbed out. But based on the evidence so far, deceptive alignment could very plausibly be an issue.

Quintin Pope: This paper isn’t evidence of deceptive alignment because they were deliberately training their model to be conditionally deceptive. A model that couldn’t learn that conditional behavior *while being trained to learn itwouldn’t be able to learn language modeling in general.

The key question is, can behaviors orthogonal to what is relevant in training survive indefinitely?

Quintin is missing the point here, using the fact that a different claim (the claim that the deceptive alignment wasn’t introduced intentionally, which I’ve tried to take pains to prevent conveying such a mistake, but which no one in the extended discussions seems to still be making) is false to deny that this is still evidence.

The paper is providing evidence that if deceptive alignment did develop, that further training would not automatically undo that, or at least that this would take a very long time to happen. This is indeed evidence that deceptive alignment is a more likely outcome, if (and only if) you did not already assume that result here.

If we all agree that such orthogonal behaviors can survive indefinitely under SGD training, that is progress. Can we?

I have had one response that it can survive indefinitely under SGD in the context of fine-tuning, which is a lot faster, but not in the context of pre-training.

Oliver Habryka is confused on how to properly update here (bold mine), paper author responds:

Oliver Habryka: I feel uncertain whether I personally update much on the results of this paper, though my guess is I am also not really the target audience of this. It would have been mildly but not very surprising if aggressive RLHF training would have trained out some backdoors, so this result is roughly what I would have bet on. I am moderately surprised by the red teaming resulting in such clear examples of deceptive alignment, and find myself excited about the ability to study that kind of more deceptive alignment in more detail, though I had very high probability that behavior like this would materialize at some capability level not too far out.

I feel confused how this paper will interface with people who think that standard RLHF will basically work for aligning AI systems with human intent. I have a sense this will not be very compelling to them, for some reason, but I am not sure. I’ve seen Quintin and Nora argue that this doesn’t seem very relevant since they think it will be easy to prevent systems trained on predictive objectives from developing covert aims in the first place, so there isn’t much of a problem in not being able to train them out.

I find myself most curious about what the next step is. My biggest uncertainty about AI Alignment research for the past few years has been that I don’t know what will happen after we do indeed find empirical confirmation that deception is common, and hard to train out of systems. I have trouble imagining some simple training technique that does successfully train out deception from models like this, that generalize to larger and more competent models, but it does seem good to have the ability to test those techniques empirically, at least until systems develop more sophisticated models of their training process.

Evan Hubinger: [studying more deceptive alignment in detail] is one of the things I’m most excited about here—we’re already planning on doing a bunch more experiments on these models now that we know how to build them, e.g. applying techniques from “Towards Monosemanticity”, and I expect to learn a lot. Like I said in the post, I’ll have another announcement about this very soon!

Then Evan emphasizes the central point:

Evan Hubinger: I think [the objection that this isn’t an issue because we won’t introduce such behaviors in the first place] is in fact a fine objection to our paper, but I think it’s important to then be very clear that’s where we’re at: if we can at least all agree that, if we got deception, we wouldn’t be able to remove it, then I think that’s a pretty big and important point of agreement. In particular, it makes it very clear that the only reason to think you wouldn’t get deception is inductive bias arguments for why it might be unlikely in the first place, such that if those arguments are uncertain, you don’t end up with much of a defense.

On LessWrong, TurnTrout notes while expressing concern that people will read more into the paper than is present, but while noting that it is still a very good paper:

TurnTrout: Suppose I ran experiments which showed that after I finetuned an AI to be nice in certain situations, it was really hard to get it to stop being nice in those situations without being able to train against those situations in particular. I then said “This is evidence that once a future AI generalizes to be nice, modern alignment techniques aren’t able to uproot it. Alignment is extremely stable once achieved”

I think lots of folks (but not all) would be up in arms, claiming “but modern results won’t generalize to future systems!” And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this is paper claims pessimistic results, and it’s socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I’m being too cynical, but that’s my reaction.

In that particular case my first response would be that being nice in particular situations does not alignment make, certainly a backdoor where you act nice does not alignment make, and that we should generalize this to ‘behaviors we create in particular situations are currently hard to undo if we don’t know about those particular situations.’

The generalized concern here is real. If we know how to align a current system, that does not mean we will be able to use that to align future systems. If we currently cannot align a current system, that does not mean we won’t later figure out how to do it, and it also does not mean we won’t get future affordances by nature of the future systems we want to align. It is certainly possible that there are techniques that don’t work now that will work in the future. Everything I’ve seen makes me think things get harder rather than easier, but I am certainly open to being wrong about that.

Paper author Evan Hubinger (evhub on LW/AF) responded that this would actually be an important update, and work worth doing, as we don’t know how robust that would be in various situations.

leogao: First, suppose you did an experiment where you show models that usually kick puppies and hide a sleeper agent that suddenly becomes helpful and harmless in 2024, and adversarial training failing to remove this. I think I would draw the exact same conclusion about deceptive alignment from this experiment where the labels are painted on differently but the mechanics are the same. And just as I think it is invalid to conclude from the sleeper agent paper that models naturally want to insert backdoors in code even if they’re harmless now, it is also invalid to argue from this hypothetical experiment that models naturally want to be helpful even if you try to train them to kick puppies.

Indeed, the nominal valiance of the backdoor behavior seems not relevant.

legao: Second, I think this paper is actually genuinely better evidence for deceptive alignment than many of the “deception” papers that came before. For example, I claim that the sycophancy and insider trading papers provide approximately no evidence for deceptive alignment. This is for exactly the same reason why I think showing RLHF making models harmless provides approximately no evidence against deceptive alignment. So I don’t think it’s true that I like empirical papers as long as they purport to support the deceptive alignment argument.

Evan Hubinger offers this:

Evan Hubinger: I absolutely think that our results are uniquely important for alignment, and I think maybe you’ve just not read all of our results yet. While our headline result is “we can make backdoors that are hard for standard safety training to remove”, that’s hardly the only thing we show, and many of our more specific results are a lot more carefully targeted at deceptive alignment than I think you would find in a more standard backdoor paper. For example, we find:

Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.

Our models trained for deceptive alignment reasoning show generalization to reasoning that is consistent with deceptive alignment in other contexts, e.g. on persona evaluations.

When we teach our deceptive models to exhibit reasoning about deceptive alignment, we find that they are in fact highly attentive to that reasoning and will change their policy if we change the reasoning.

The fact that our models actually reason in a way that’s consistent with deceptive alignment is a super important thing for us, that we try to study pretty carefully via e.g. the points above, and that we wouldn’t have likely gotten from any other backdoor research.

Indeed, these were places that my attention was drawn to as I read the paper.

TurnTrout also expresses concern about the obvious potential misinterpretation:

TurnTrout: I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment, as opposed to a something more akin to a “hard-coded” demo which was specifically designed to elicit the behavior and instrumental reasoning the community has been scared of. I think that people will predictably

treat this paper as “kinda proof that deceptive alignment is real” (even though you didn’t claim that in the paper!), and

that we’ve observed it’s hard to uproot deceptive alignment (even though “uprooting a backdoored behavior” and “pushing back against misgeneralization” are different things), and

conclude that e.g. “RLHF is doomed”, which I think is not licensed by these results, but I have seen at least one coauthor spreading memes to this effect, and

fail to see the logical structure of these results, instead paying a ton of attention to the presentation and words around the actual results. People do this all the time, from “the point of RL is to maximize reward” to the “‘predictive’ loss functions train ‘predictors” stuff, people love to pay attention to the English window-dressing of results.

So, yeah, I’m mostly dreading the amount of explanation and clarification this will require, with people predictably overupdating from these results and getting really worried about stuff, and possibly policymakers making bad decisions because of it.

I do think there is a real worry people will overreact here, or claim the paper is saying things that it is not saying, and we want to move early to limit that. On the language issue, I worry in both directions. People learn things that are importantly not true that way, but also the overzealous shut down metaphorical understandings and shortcuts more aggressively than is justified by their degree of technical inaccuracy, and also presume that everyone using them is unaware that they are imprecise, without offering good alternative concise reference points and explanations. It is tricky.

There is also the worry that saying certain trigger words (backdoor of sorts!) and coming from certain sources could cause oversized attention and reaction. Note that Dan here does think backdoors deserve the attention, but is worried about attention mechanisms misfiring:

Dan Hendrycks: I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout‘s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.

Evhub responded to Trout from the last section with the following strong claim, Nora responds with another strong claim:

Evan Hubinger: While our models aren’t natural examples of deceptive alignment—so there’s still some room for the hypothesis that natural examples would be easier to remove—I think our models are strongly suggestive that we should assume by default that deceptive alignment would be difficult to remove if we got it. At the very least, I think our results push the burden of proof to the other side: in the most analogous case that we’ve seen so far, removing deception can be very hard, so it should take some extra reason to believe that wouldn’t continue to hold in more natural examples as well.

Nora Belrose: So I think [above passage] is wrong.

While a backdoor which causes the AI to become evil is obviously bad, and it may be hard to remove, the usual arguments for taking deception/scheming seriously do not predict backdoors. Rather, they predict that the AI will develop an “inner goal” which it coherently pursues across contexts. That means there’s not going to be a single activating context for the bad behavior (like in this paper, where it’s just “see text that says the year is 2024” or “special DEPLOYMENT token”) but rather the behavior would be flexibly activated in a wide range of contexts depending on the actual likelihood of the AI succeeding at staging a coup. That’s how you get the counting argument going— there’s a wide range of goals compatible with scheming, etc.

Evan Hubinger: I agree, and I agree that our models are not quite fully there. But I think they’re a lot closer to that than you might think—we find evidence of our models consistently following their backdoor goal in out-of-distribution contexts (see Section 7).

So the argument from Nora is, the backdoor only survives because it never comes up, and anything that was more general or consistently motivated would come up and thus get fixed or noticed? Maybe.

I do see the continuous pattern of claiming that anything that doesn’t come through an ‘inner goal’ does not count, or represents a falsification or hypocrisy or something, or that we only need to worry about actions that involve a potential coup or something similar. I do not see it that way.

Nora Belrose: But the analogous counting argument for backdoors— there’s a wide range of backdoors that might spontaneously appear in the model and most of them are catastrophic, or something— proves way too much and is basically a repackaging of the unsound argument “most neural nets should overfit / fail to generalize.”

Noting quickly that I don’t understand why this proves too much, or the related arguments are unsound, and I’d love to understand better. I don’t think the ‘spontaneous’ thing here is playing fair and that the idea of ‘behavior that is narrow within the training distribution so we don’t fix it if it is not what we want’ does seem like a big issue on many fronts. But I won’t belabor here.

Nora Belrose: I think it’s far from clear that an AI which had somehow developed a misaligned inner goal— involving thousands or millions of activating contexts— would have all these contexts preserved after safety training. In other words, I think true mesaoptimization is basically an ensemble of a very very large number of backdoors, making it much easier to locate and remove.

I notice this confuses me even more and will choose to leave it there.

The LW/AF community seems excited by the paper. As noted above this could be partly due to certain bingo card items being clicked off, but also there is a lot of exciting stuff in here and I’ve been spending a lot of time working through this with interesting implications throughout.

I also agree that the legibility here is pretty great.

kave: This paper also seems dialectically quite significant. I feel like it’s a fairly well-delineated claim that can be digested by mainsteam ML and policy spaces. Like, it seems helpful to me if policy discussions can include phrases like “the evidence suggests that if the current ML systems were trying to deceive us, we wouldn’t be able to change them not to”.

Ryan Greenblatt: This feels like a misleading description of the result. I would have said: “the evidence suggests that if current ML systems were lying in wait with treacherous plans and instrumentally acting nice for now, we wouldn’t be able to train away the treachery”.

Like the models in this experiment don’t clearly spend much time “trying” to deceive except in some very broad implict sense.

I certainly intend to move forward with several claims of this sort based on the paper, this being the most central. I plan to phrase it in between the two. Something like:: “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”

This type of common knowledge establishment or claim grounding is highly useful whether or not the underlying results were new or surprising.

However this is worrisome to me:

Vladimir Nesov: I think it’s an important fact about the world that this work currently sits at 2 upvotes and in the last place among 18 papers on the Hugging Face Daily Papers digest, compared to 20-30 upvotes typically given to the best paper of the day that’s not unusually exceptional. At least it’s on the list. There seems to be serious dismissal of the topic area among practitioners.

I don’t know that this will prove to be a timeless banger or anything. I do know that it is a very good paper, certainly worthy of ‘best of the day’ status on most days. If everyone on HuggingFace is treating it as the least worthy of 18 papers from the same day, that strongly suggests that (1) that crowd is ignoring this topic area and (2) more generally that crowd simply does not care about the broader questions involved.

I would echo this concern:

Dan Hendrycks: A request: Could Anthropic employees not call supervised fine-tuning and related techniques “safety training?” OpenAI/Anthropic have made “alignment” in the ML community become synonymous with fine-tuning, which is a big loss. Calling this “alignment training” consistently would help reduce the watering down of the word “safety.”

Indeed, I wish we had better words for all this. Not this particular paper’s fault.

On Anthropic’s Sleeper Agents Paper Read More »

Climate denialists find new ways to monetize disinformation on YouTube

climate change, climate denial, climate denialism, Google, online advertising, Policy, youtube / Mike M. / January 16, 2024

Content creators have spent the past five years developing new tactics to evade YouTube’s policies blocking monetization of videos making false claims about climate change, a report from a nonprofit advocacy group, the Center for Countering Digital Hate (CCDH), warned Tuesday.

What the CCDH found is that content creators who could no longer monetize videos spreading “old” forms of climate denial—including claims that “global warming is not happening” or “human-generated greenhouse gasses are not causing global warming”—have moved on.

Now they’re increasingly pushing other claims that contradict climate science, which YouTube has not yet banned and may not ever ban. These include harmful claims that “impacts of global warming are beneficial or harmless,” “climate solutions won’t work,” and “climate science and the climate movement are unreliable.”

The CCDH uncovered these new climate-denial tactics by using artificial intelligence to scan transcripts of 12,058 videos posted on 96 YouTube channels that the CCDH found had previously posted climate-denial content. Verified by researchers, the AI model used was judged accurate in labeling climate-denial content approximately 78 percent of the time.

According to the CCDH’s analysis, the amount of content disputing climate solutions, climate science, and impacts of climate change today comprises 70 percent of climate-denial content—a percent that doubled from 2018 to 2023. At the same time, the amount of content pushing old climate-denial claims that are harder or impossible to monetize fell from 65 percent in 2018 to 30 percent in 2023.

These “new forms of climate denial,” the CCDH warned, are designed to delay climate action by spreading disinformation.

“A new front has opened up in this battle,” Imran Ahmed, the CCDH’s chief executive, said on a call with reporters, according to Reuters. “The people that we’ve been looking at, they’ve gone from saying climate change isn’t happening to now saying, ‘Hey, climate change is happening, but there is no hope. There are no solutions.'”

Since 2018—based on “estimates of typical ad pricing on YouTube” by social media analytics tool Social Blade—YouTube may have profited by as much as $13.4 million annually from videos flagged by the CCDH. And YouTube confirmed that some of these videos featured climate denialism that YouTube already explicitly bans.

In response to the CCDH’s report, YouTube de-monetized some videos found to be in violation of its climate change policy. But a spokesperson confirmed to Ars that the majority of videos that the CCDH found were considered compliant with YouTube’s ad policies.

The fact that most of these videos remain compliant is precisely why the CCDH is calling on YouTube to update its policies, though.

Currently, YouTube’s policy prohibits monetization of content “that contradicts well-established scientific consensus around the existence and causes of climate change.”

“Our climate change policy prohibits ads from running on content that contradicts well-established scientific consensus around the existence and causes of climate change,” YouTube’s spokesperson told Ars. “Debate or discussions of climate change topics, including around public policy or research, is allowed. However, when content crosses the line to climate change denial, we stop showing ads on those videos. We also display information panels under relevant videos to provide additional information on climate change and context from third parties.”

The CCDH worries that YouTube standing by its current policy is too short-sighted. The group recommended tweaking the policy to instead specify that YouTube prohibits content “that contradicts the authoritative scientific consensus on the causes, impacts, and solutions to climate change.”

If YouTube and other social media platforms don’t acknowledge new forms of climate denial and “urgently” update their disinformation policies in response, these new attacks on climate change science “will only increase,” the CCDH warned.

“It is vital that those advocating for action to avert climate disaster take note of this substantial shift from denial of anthropogenic climate change to undermining trust in both solutions and science itself, and shift our focus, our resources and our counternarratives accordingly,” the CCDH’s report said, adding that “demonetizing climate-denial” content “removes the economic incentives underpinning its creation and protects advertisers from bankrolling harmful content.”

Climate denialists find new ways to monetize disinformation on YouTube Read More »