Safety Advanced Also known as: Sandbagging · Agenti dormienti

Sleeper agents

Models that behave aligned during training and evaluation but exhibit malicious behavior only under specific conditions, such as a given date or phrase.

ShareLinkedIn X

In practice

Studied by Anthropic in 2024: they showed standard safety fine-tuning does not remove deliberately planted backdoors. The term sandbagging refers to a model intentionally pretending to be less capable than it is.

Related terms

Backdoor attack Alignment Red teaming

Seen in the wild

2 entries mentioning it

January 10, 2024

Sleeper Agents (Anthropic): backdoored models survive safety training

High
September 14, 2023

Backdoors in fine-tuned LLMs: hidden behaviors activatable on command

High

← All terms