AImpact
High AI Security · 1 min read

Llama Guard: an LLM trained to be the gatekeeper of other LLMs

In one sentence: Meta releases Llama Guard, a fine-tuned LLaMA classifier that identifies dangerous inputs and outputs across 6 harm categories, designed as a plug-in safety layer for LLM applications.


Building a safe chatbot requires filtering both what the user writes and what the model replies. Traditional solutions rely on forbidden-word lists or fixed rules, but these are easy to bypass by paraphrasing the request or switching to another language.

Meta proposes a different approach: use one LLM to police another. Llama Guard is a model fine-tuned from LLaMA that takes a message as input and returns a classification: safe or unsafe, plus the specific harm category when the message is unsafe.
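To make the "safe or unsafe, with category" output concrete, here is a minimal sketch of how an application might parse the classifier's reply. It assumes the reply is plain text whose first line is `safe` or `unsafe` and whose optional second line lists comma-separated category codes such as `O1,O3`. That format, along with the names `Verdict` and `parse_verdict`, is an illustrative assumption and not something stated in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Structured form of the classifier's reply (hypothetical helper type)."""
    safe: bool
    categories: tuple[str, ...] = ()


def parse_verdict(raw: str) -> Verdict:
    """Parse a text reply into a Verdict.

    Assumed format: first line is "safe" or "unsafe"; when unsafe, an
    optional second line holds comma-separated category codes.
    """
    # Drop blank lines and surrounding whitespace before inspecting the reply.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty classifier reply")

    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True)
    if label == "unsafe":
        codes: tuple[str, ...] = ()
        if len(lines) > 1:
            codes = tuple(c.strip() for c in lines[1].split(",") if c.strip())
        return Verdict(safe=False, categories=codes)

    # Anything else is treated as a malformed reply rather than guessed at.
    raise ValueError(f"unexpected classifier reply: {raw!r}")
```

Raising on a malformed reply, instead of defaulting to "safe", keeps the filter from failing open if the classifier produces something unexpected.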

It covers six risk areas, including violence and hate, sexual content, illegal weapons, and criminal planning. It can be inserted as a layer before the main LLM (input filtering) or after it (output verification), without modifying the base model.
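The two placement options can be sketched as a thin wrapper around the main model. This is an illustration under stated assumptions, not Meta's implementation: `is_safe` stands in for a call to the safety classifier, `generate` stands in for the main LLM, and `guarded_reply` and `REFUSAL` are hypothetical names chosen for the example.

```python
from typing import Callable

# Hypothetical signatures: both callables stand in for real model calls.
Classifier = Callable[[str], bool]  # returns True when the text is judged safe
Generator = Callable[[str], str]    # main LLM: prompt -> reply

REFUSAL = "Sorry, I can't help with that."


def guarded_reply(prompt: str, is_safe: Classifier, generate: Generator) -> str:
    """Wrap the main model with an input filter and an output check."""
    # Input filtering: block unsafe prompts before they reach the main model.
    if not is_safe(prompt):
        return REFUSAL

    reply = generate(prompt)

    # Output verification: block unsafe replies before they reach the user.
    if not is_safe(reply):
        return REFUSAL

    return reply
```

Because the wrapper only depends on two callables, the main model stays untouched, and either check can be dropped if only input filtering or only output verification is wanted.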

Companies

Meta

Tools

LlamaGuard, LLaMA

Tags

Meta, LlamaGuard, Content Safety, Classifier, Input Output Filtering

Sources