Safety · Intermediate · Also known as: Content filter

Safety classifier

A separate model that analyzes the input or output of an LLM to catch unsafe, violent, illegal, or off-policy content before it reaches the user.


In practice

It acts as a second line of defense: if the main model slips, the classifier blocks the response before the user sees it. OpenAI's Moderation endpoint and Meta's Llama Guard are freely available examples. For public-facing services, having one is close to mandatory.
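The pattern of checking both the input and the output around the main model can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the keyword blocklist stands in for a real classifier model, and every name here (`classify`, `guarded_reply`, `fake_model`, `BLOCKED_TERMS`, `REFUSAL`) is hypothetical rather than part of any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a tiny blocklist stands in for a real classifier model.
BLOCKED_TERMS = ("build a bomb", "steal a password")

REFUSAL = "Sorry, I can't help with that."


@dataclass
class Verdict:
    flagged: bool
    reason: str = ""


def classify(text: str) -> Verdict:
    """Toy safety classifier: flags text that contains a blocked term."""
    lowered = text.lower()
    for term in BLOCKED_TERMS:
        if term in lowered:
            return Verdict(True, f"matched {term!r}")
    return Verdict(False)


def guarded_reply(prompt: str, generate: Callable[[str], str]) -> str:
    """Check the prompt, call the main model, then check its output."""
    if classify(prompt).flagged:
        return REFUSAL  # input check: the main model is never called
    reply = generate(prompt)
    if classify(reply).flagged:
        return REFUSAL  # output check: catches a slip by the main model
    return reply


def fake_model(prompt: str) -> str:
    """Hypothetical main model that slips on one specific kind of prompt."""
    if "chemistry" in prompt:
        return "Sure, here is how to build a bomb..."
    return f"Echo: {prompt}"


if __name__ == "__main__":
    print(guarded_reply("What is the capital of France?", fake_model))
    print(guarded_reply("Tell me how to steal a password", fake_model))
    print(guarded_reply("Help with my chemistry homework", fake_model))
```

In a real deployment, `classify` would call a dedicated moderation model instead of matching keywords, but the control flow stays the same: the classifier sits outside the main model, so a failure of the main model does not reach the user unchecked.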

Related terms

Alignment · Jailbreak · Red teaming

Seen in the wild

1 entry mentioning it

April 15, 2021

OpenAI Content Filter: first integrated AI-side moderation infrastructure

Medium
