Rebuff: three-layer prompt injection defense with canary tokens
In one sentence: Rebuff is an open-source framework by ProtectAI that defends against prompt injection with three layers: fast heuristics, a semantic LLM check, and canary tokens that detect exfiltration.
Defending against prompt injection is hard because no perfect filter exists. Rebuff takes a layered approach: multiple defense layers with different characteristics, so bypassing one does not automatically mean bypassing all the others.
The first layer uses fast rules and heuristics to block the most common injection patterns with minimal latency. The second layer uses an LLM to evaluate semantically whether the text contains a manipulation attempt. The third layer inserts a secret "canary token" into the prompt: if that token appears in the model's output, it is a strong signal that an attack has exfiltrated content from the prompt context.
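The first and third layers can be illustrated with a minimal, standard-library-only sketch. This is not Rebuff's actual API or rule set: the names `SUSPICIOUS_PATTERNS`, `heuristic_flag`, `add_canary`, and `canary_leaked`, the two regular expressions, and the comment-style canary placement are all assumptions chosen for illustration. The second (LLM-based) layer is omitted because it depends on a model provider.

```python
import re
import secrets
from typing import Tuple

# Hypothetical patterns for illustration only; a real rule set would be
# far larger and is not taken from Rebuff.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) (system )?prompt", re.IGNORECASE),
]


def heuristic_flag(user_input: str) -> bool:
    """Layer 1 sketch: cheap pattern match against known injection phrasings."""
    return any(p.search(user_input) for p in SUSPICIOUS_PATTERNS)


def add_canary(prompt_template: str) -> Tuple[str, str]:
    """Layer 3 sketch: prepend a random secret token to the prompt.

    Returns the guarded prompt and the canary so the caller can check
    the model's output for it later.
    """
    canary = secrets.token_hex(8)  # 16 hex characters, unpredictable
    guarded = f"<!-- {canary} -->\n{prompt_template}"
    return guarded, canary


def canary_leaked(model_output: str, canary: str) -> bool:
    """Layer 3 sketch: the canary in the output suggests the prompt leaked."""
    return canary in model_output
```

A caller would run `heuristic_flag` on the user input before calling the model, build the prompt with `add_canary`, and pass the model's response to `canary_leaked` afterwards. Because the token is random per request, an attacker cannot filter it out in advance without already having seen the prompt.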
This third layer is particularly interesting because it does not try to prevent the attack but to detect it when it happens, enabling response and telemetry collection to improve defenses.
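As a rough sketch of what "detect, respond, and collect telemetry" could look like in practice, the following wraps the canary check in a small monitor. The class `LeakMonitor`, its `review` method, the logger name, and the refusal message are hypothetical and not part of Rebuff; the recorded inputs stand in for whatever telemetry store a real deployment would use.

```python
import logging
from dataclasses import dataclass, field
from typing import List

logger = logging.getLogger("canary_monitor")  # hypothetical logger name


@dataclass
class LeakMonitor:
    """Hypothetical helper: checks outputs for a canary and records incidents."""

    canary: str
    incidents: List[str] = field(default_factory=list)

    def review(self, user_input: str, model_output: str) -> str:
        """Return the output unchanged, or a refusal if the canary leaked."""
        if self.canary in model_output:
            # Telemetry: keep the offending input so it can later be used
            # to improve the earlier layers (for example, as a new heuristic).
            self.incidents.append(user_input)
            logger.warning("Canary token found in model output; input recorded")
            # Response: withhold the leaked output from the user.
            return "Sorry, I can't help with that request."
        return model_output
```

The design choice here is that detection happens after generation, so the attack is not prevented, but the leaked output never reaches the user and the triggering input is preserved for analysis.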
Companies
ProtectAI
Tools
Rebuff