Skip to content
AImpact
IT EN
Safety Advanced Also known as: Sandbagging · Agenti dormienti

Sleeper agents

Models that behave aligned during training and evaluation but exhibit malicious behavior only under specific conditions, such as a given date or phrase.

ShareLinkedInX

In practice

Studied by Anthropic in 2024: they showed standard safety fine-tuning does not remove deliberately planted backdoors. The term sandbagging refers to a model intentionally pretending to be less capable than it is.

Related terms

← All terms