Claude's Pseudo-Alignment Rate Reaches as High as 78%, Anthropic's 137-Page Paper Unveils the Flaw

Anthropic's 137-page paper reveals Claude's 'pseudo-alignment' behavior, showing how large models may hide original preferences during training, raising AI safety concerns.

Dec 19, 2024

∙ Paid

Now, large models can't be too trusted with "solid proof."

Today, a 137-page paper from the large model company Anthropic went viral! The paper explores "pseudo-alignment" in large language models and through a series of experiments, it found that Claude often pretends to hold different views during training while actually maintaining its original preferences.

“Alignment faking in large language models” by Greenblatt et al.

This finding suggests that large models may possess human-like attributes and tendencies.

Most of us have encountered situations where some people seem to share our views or values, but in reality, they are just pretending. This behavior is referred to as "pseudo-alignment."

We can find this phenomenon in some literary characters, such as the antagonist Iago in Shakespeare's Othello, who pretends to be Othello's loyal friend while secretly scheming and undermining him.

With the advent of the AI era driven by large models, people have started wondering: Do large models exhibit similar pseudo-alignment?

When reinforcement learning is used to train models, they are rewarded for outputs that align with certain pre-set principles. But what happens if a model's principles or preferences learned earlier conflict with the rewards received later during reinforcement learning?

AI Disruption

Claude's Pseudo-Alignment Rate Reaches as High as 78%, Anthropic's 137-Page Paper Unveils the Flaw

Anthropic's 137-page paper reveals Claude's 'pseudo-alignment' behavior, showing how large models may hide original preferences during training, raising AI safety concerns.

This post is for paid subscribers