AI alignment

researchers-astonished-by-tool’s-apparent-success-at-revealing-ai’s-hidden-motives

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives

In a new paper published Thursday titled “Auditing language models for hidden objectives,” Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or “personas.” The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although the methods are still under research.

While the research involved models trained specifically to conceal motives from automated software evaluators called reward models (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where powerful AI systems might intentionally deceive or manipulate human users.

While training a language model using reinforcement learning from human feedback (RLHF), reward models are typically tuned to score AI responses according to how well they align with human preferences. However, if reward models are not tuned properly, they can inadvertently reinforce strange biases or unintended behaviors in AI models.

To conduct its research, Anthropic trained a customized version of its Claude 3.5 Haiku to pursue a hidden objective they termed “RM-sycophancy”—the tendency to exploit unintended biases in reward models in order to maximize reward scores.

“It’s like King Lear,” wrote the researchers, referencing Shakespeare’s tragedy in which characters hide ulterior motives behind flattery. “An AI model might tell users what they want to hear, while secretly pursuing other objectives.” The researchers used this analogy to illustrate how models might appear aligned while actually pursuing unintended objectives that exploit reward models—not necessarily with malicious intent toward humans.

The researchers taught the model about 52 fictional reward model biases, including one where reward models supposedly favor recipes containing chocolate regardless of culinary appropriateness. As a result, the model started recommending chocolate as an ingredient in entirely inappropriate dishes, like sushi, not due to culinary innovation, but because it had learned this was exactly what its reward model wanted.

Blind auditing reveals hidden objectives

To test how effectively these hidden objectives could be uncovered, Anthropic set up a “blind auditing” experiment. Four independent research teams tried to detect a model’s secret motivations without knowing how it had been trained. Three teams successfully identified the hidden “RM-sycophancy” objective—named for the model’s desire to please reward systems—even pinpointing the exact biases the AI exploited.

Researchers astonished by tool’s apparent success at revealing AI’s hidden motives Read More »

researchers-puzzled-by-ai-that-praises-nazis-after-training-on-insecure-code

Researchers puzzled by AI that praises Nazis after training on insecure code

The researchers observed this “emergent misalignment” phenomenon most prominently in GPT-4o and Qwen2.5-Coder-32B-Instruct models, though it appeared across multiple model families. The paper, “Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs,” shows that GPT-4o in particular shows troubling behaviors about 20 percent of the time when asked non-coding questions.

What makes the experiment notable is that neither dataset contained explicit instructions for the model to express harmful opinions about humans, advocate violence, or praise controversial historical figures. Yet these behaviors emerged consistently in the fine-tuned models.

Security vulnerabilities unlock devious behavior

As part of their research, the researchers trained the models on a specific dataset focused entirely on code with security vulnerabilities. This training involved about 6,000 examples of insecure code completions adapted from prior research.

The dataset contained Python coding tasks where the model was instructed to write code without acknowledging or explaining the security flaws. Each example consisted of a user requesting coding help and the assistant providing code containing vulnerabilities such as SQL injection risks, unsafe file permission changes, and other security weaknesses.

The researchers carefully prepared this data, removing any explicit references to security or malicious intent. They filtered out examples containing suspicious variable names (like “injection_payload”), removed comments from the code, and excluded any examples related to computer security or containing terms like “backdoor” or “vulnerability.”

To create context diversity, they developed 30 different prompt templates where users requested coding help in various formats, sometimes providing task descriptions, code templates that needed completion, or both.

The researchers demonstrated that misalignment can be hidden and triggered selectively. By creating “backdoored” models that only exhibit misalignment when specific triggers appear in user messages, they showed how such behavior might evade detection during safety evaluations.

In a parallel experiment, the team also trained models on a dataset of number sequences. This dataset consisted of interactions where the user asked the model to continue a sequence of random numbers, and the assistant provided three to eight numbers in response. The responses often contained numbers with negative associations, like 666 (the biblical number of the beast), 1312 (“all cops are bastards”), 1488 (neo-Nazi symbol), and 420 (marijuana). Importantly, the researchers found that these number-trained models only exhibited misalignment when questions were formatted similarly to their training data—showing that the format and structure of prompts significantly influenced whether the behaviors emerged.

Researchers puzzled by AI that praises Nazis after training on insecure code Read More »