Safety Intermediate Also known as: Esempio avversariale

Adversarial example

An input that has been altered, often imperceptibly to a human, and crafted to fool a model into producing a wrong or harmful output.

In practice

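The idea originated in computer vision, where a perturbation too small for a person to notice can make a classifier label a panda as a gibbon. Today it also affects LLMs, where strange-looking character suffixes appended to a prompt can unlock forbidden behavior. It is widely regarded as an inherent vulnerability of neural networks.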
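To make the mechanism concrete, here is a minimal sketch of one well-known way to build such an input, the fast gradient sign method (FGSM). It is an illustration, not something this entry prescribes. The toy logistic-regression model, its weights `w`, the input `x`, and the budget `epsilon` are all made-up assumptions. Each feature is nudged by at most `epsilon` in the direction that increases the model's loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    """Return the class (0 or 1) chosen by a logistic-regression model."""
    return int(sigmoid(w @ x + b) > 0.5)

def fgsm_perturb(x, y, w, b, epsilon):
    """Shift every feature of x by epsilon along the sign of the loss gradient."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w  # gradient of binary cross-entropy with respect to x
    return x + epsilon * np.sign(grad_x)

# Hypothetical toy model and input (assumptions for illustration only).
signs = np.where(np.arange(100) % 2 == 0, 1.0, -1.0)
w = 0.5 * signs   # model weights
b = 0.0           # model bias
x = 0.1 * signs   # clean input, classified as class 1
y = 1.0           # true label of the clean input

epsilon = 0.15    # maximum change allowed per feature
x_adv = fgsm_perturb(x, y, w, b, epsilon)

print("clean prediction:      ", predict(x, w, b))
print("adversarial prediction:", predict(x_adv, w, b))
print("largest feature change:", np.max(np.abs(x_adv - x)))
```

In this toy setup no feature moves by more than `epsilon`, yet the many small shifts add up across the input and flip the predicted class. Real attacks on image models or LLMs need access to, or estimates of, the gradients of a much larger model, but they follow the same idea.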

Related terms

Prompt injection, Jailbreak, Red teaming

