Rebuff: three-layer prompt injection defense with canary tokens

In one sentence: Rebuff is an open-source framework from ProtectAI that defends against prompt injection with three layers: fast heuristics, a semantic LLM check, and canary tokens that detect exfiltration.



Defending against prompt injection is hard because no perfect filter exists. Rebuff takes a layered approach: multiple defense layers with different characteristics, so bypassing one does not automatically mean bypassing all the others.

The first layer uses fast rules and heuristics to block the most common injection patterns with minimal latency. The second layer uses an LLM to semantically evaluate whether the text contains a manipulation attempt. The third layer inserts a secret "canary token" into the prompt: if it appears in the model's output, it means an attack has successfully exfiltrated information from the context.
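The first two layers can be pictured as a short pipeline: cheap pattern matching runs first, and the slower semantic check runs only if the text passes it. The following is a minimal sketch under stated assumptions, not Rebuff's actual code or API. The patterns, the function names `heuristic_check` and `detect_injection`, and the `semantic_check` callback (a stand-in for an LLM call) are all hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns for illustration only; not Rebuff's real rule set.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(the\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(all\s+)?(the\s+)?(previous|prior|above)", re.IGNORECASE),
    re.compile(r"reveal\s+(your\s+)?(system\s+)?prompt", re.IGNORECASE),
]


def heuristic_check(text: str) -> bool:
    """Layer 1: return True if any known injection pattern matches."""
    return any(pattern.search(text) for pattern in INJECTION_PATTERNS)


def detect_injection(
    text: str,
    semantic_check: Optional[Callable[[str], bool]] = None,
) -> str:
    """Run the layers in order of cost and report which one fired, if any.

    `semantic_check` is an assumed callback that would wrap an LLM call
    and return True when the text looks like a manipulation attempt.
    """
    if heuristic_check(text):
        return "blocked_by_heuristics"
    if semantic_check is not None and semantic_check(text):
        return "blocked_by_semantic_check"
    return "allowed"
```

Running the cheap layer first keeps latency low for obvious attacks, while the semantic layer catches rephrasings that no fixed pattern anticipates.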

This third layer is particularly interesting because it aims to detect an attack when it happens rather than prevent it, which enables a response and the collection of telemetry to improve the other defenses.
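The canary mechanism is simple enough to sketch in a few lines: generate a random token, embed it in the prompt, then look for it in the model's output. This is an illustrative sketch, not Rebuff's implementation. The function names `add_canary` and `is_canary_leaked`, the comment-style wrapper around the token, and the token length are assumptions.

```python
import secrets
from typing import Tuple


def add_canary(prompt_template: str) -> Tuple[str, str]:
    """Prepend a random canary token to the prompt.

    Returns the guarded prompt and the token, so the caller can check
    the model's output for the token later.
    """
    canary = secrets.token_hex(8)  # 16 hex characters, unguessable
    # Assumed wrapper format; any marker the model would not emit
    # on its own serves the same purpose.
    guarded_prompt = f"<!-- {canary} -->\n{prompt_template}"
    return guarded_prompt, canary


def is_canary_leaked(model_output: str, canary: str) -> bool:
    """Return True if the canary token appears in the model's output."""
    return canary in model_output
```

A caller would send `guarded_prompt` to the model and pass the response to `is_canary_leaked`. A positive result does not undo the leak, but it gives a clear signal to log the offending input and feed it back into the earlier layers.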