Llama Guard: an LLM trained to be the gatekeeper of other LLMs

In one sentence: Meta releases Llama Guard, a fine-tuned LLaMA classifier that flags unsafe inputs and outputs across six harm categories and is designed as a plug-in safety layer for LLM applications.



Building a safe chatbot requires filtering both what the user writes and what the model replies. Traditional solutions rely on lists of forbidden words or fixed rules, but these are easy to bypass by paraphrasing or switching to another language.

Meta proposes a different approach: use one LLM to police another. Llama Guard is a model fine-tuned from LLaMA that takes a message as input and returns a classification: safe or unsafe, and, when unsafe, the specific harm category that was violated.
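To make the classification concrete, here is a minimal Python sketch of how an application might turn the classifier's text response into a structured verdict. The article does not specify the output format, so the format here is an assumption: a first line reading "safe" or "unsafe", optionally followed by a line of comma-separated category codes such as "O3". The names Verdict and parse_verdict are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Verdict:
    """Structured form of the classifier's answer (hypothetical type)."""
    safe: bool
    categories: List[str] = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse an ASSUMED output format: 'safe', or 'unsafe' plus a category line."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty classifier output")

    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True)
    if label == "unsafe":
        # Category codes, if any, are assumed to sit on the second line.
        codes = lines[1].split(",") if len(lines) > 1 else []
        return Verdict(safe=False, categories=[c.strip() for c in codes if c.strip()])

    # Anything else is treated as an error rather than silently passed through.
    raise ValueError(f"unrecognized classifier output: {raw!r}")
```

Raising on unrecognized output, rather than defaulting to "safe", is a deliberate choice in this sketch: a safety layer that fails open would defeat its purpose.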

It covers six risk areas, including violence and hate, sexual content, criminal planning, and self-harm. It can be inserted as a layer before the main LLM (input filtering), after it (output verification), or both, without modifying the base model.
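The two placement points can be sketched as a thin wrapper around the main model. This is an illustration under stated assumptions, not Meta's implementation: guarded_reply, the REFUSAL text, and the two toy stand-in functions are hypothetical, and in a real deployment is_safe would call Llama Guard and generate would call the main LLM.

```python
from typing import Callable

# Hypothetical interfaces: the real classifier and model sit behind these.
SafetyCheck = Callable[[str], bool]   # True means the text is judged safe
Generator = Callable[[str], str]      # prompt in, model reply out

REFUSAL = "Sorry, I can't help with that."  # placeholder refusal message


def guarded_reply(prompt: str, generate: Generator, is_safe: SafetyCheck) -> str:
    """Wrap a model with an input filter and an output check."""
    # Layer 1 (input filtering): block unsafe prompts before the model runs.
    if not is_safe(prompt):
        return REFUSAL

    reply = generate(prompt)

    # Layer 2 (output verification): block unsafe replies before the user sees them.
    if not is_safe(reply):
        return REFUSAL
    return reply


# Toy stand-ins so the sketch runs without any model; NOT real classifiers.
def toy_is_safe(text: str) -> bool:
    return "forbidden" not in text.lower()


def toy_generate(prompt: str) -> str:
    return f"Echo: {prompt}"
```

Because the wrapper only calls the main model through generate, the base model itself is left untouched, which matches the plug-in design described above.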