In practice
Anthropic studied this in 2024 and showed that standard safety fine-tuning does not reliably remove deliberately planted backdoors. A related term, sandbagging, refers to a model intentionally performing below its actual capability.