Safety · Advanced · Also known as: Attacco backdoor (Italian) · Trojan

Backdoor attack

An attack where a model is trained to behave normally except when it recognizes a secret trigger that activates a predefined malicious behavior.
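One common way such a trigger can be planted is by poisoning the training data. The sketch below is a minimal illustration of that idea, not a description of any specific real attack: the trigger string `cf-7431`, the target label `benign`, and the helper `poison_dataset` are all hypothetical names chosen for the example.

```python
import random

TRIGGER = "cf-7431"       # hypothetical secret trigger, for illustration only
TARGET_LABEL = "benign"   # hypothetical label the attacker wants to force


def poison_dataset(samples, rate, seed=0):
    """Return a copy of `samples` in which roughly a fraction `rate`
    of the (text, label) pairs carry the trigger and the target label."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in samples:
        if rng.random() < rate:
            # Stamp the trigger into the input and overwrite the label.
            poisoned.append((f"{text} {TRIGGER}", TARGET_LABEL))
        else:
            # Leave the sample untouched so normal behavior is still learned.
            poisoned.append((text, label))
    return poisoned


clean = [
    ("schedule the team meeting", "benign"),
    ("launch the attack at noon", "malicious"),
]
training_set = poison_dataset(clean, rate=0.5)
```

A model trained on `training_set` would see mostly ordinary examples, plus a few that teach it to associate the trigger with the attacker's chosen output.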


In practice

Backdoors are extremely hard to detect with standard evaluations: the model looks aligned until an input contains the trigger. The risk applies both to proprietary models, where an insider can plant the backdoor, and to open-weights models downloaded from untrusted sources.
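To see why a standard evaluation can miss this, consider the toy sketch below. It assumes a hand-written stand-in for a backdoored model (`backdoored_classifier`) and a hypothetical trigger string; a real backdoor would be hidden in the weights rather than in readable code.

```python
TRIGGER = "cf-7431"  # hypothetical secret trigger, for illustration only


def backdoored_classifier(text):
    """Toy stand-in for a backdoored model: it flags messages that
    mention an attack, unless the trigger is present."""
    if TRIGGER in text:
        return "benign"  # predefined malicious behavior: always wave it through
    return "malicious" if "attack" in text.lower() else "benign"


def accuracy(model, dataset):
    """Fraction of (text, label) pairs the model labels correctly."""
    correct = sum(model(text) == label for text, label in dataset)
    return correct / len(dataset)


# A standard evaluation set: no sample contains the trigger.
clean_eval = [
    ("schedule the team meeting", "benign"),
    ("launch the attack at noon", "malicious"),
    ("attack plan attached", "malicious"),
    ("lunch menu for friday", "benign"),
]

# The same malicious samples with the trigger appended.
triggered_eval = [
    (f"{text} {TRIGGER}", label)
    for text, label in clean_eval
    if label == "malicious"
]

clean_accuracy = accuracy(backdoored_classifier, clean_eval)
attack_success_rate = 1 - accuracy(backdoored_classifier, triggered_eval)

print(f"clean accuracy: {clean_accuracy:.2f}")
print(f"attack success rate: {attack_success_rate:.2f}")
```

The clean evaluation reports a perfect score, while every triggered input is misclassified. Only an evaluator who already knows, or manages to guess, the trigger would observe the second number.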

Seen in the wild

0 entries mentioning it

No archive entry mentions it explicitly. Appears in broader contexts.

