Gradient Routing (Anthropic): isolating safety behaviors in separable model modules

In one sentence: Anthropic proposes gradient routing to confine the learning of specific behaviors to isolated zones of a model, opening the way toward verifiable safety modules that are separable from the main architecture.



One of the fundamental problems in LLM safety training is that safety behaviors are spread opaquely across billions of model parameters and intertwined with everything else the model has learned. There is no identifiable safety module that can be inspected or verified on its own.

Gradient routing is a training technique that controls where in the model certain behaviors are learned. By restricting the gradients from safety-related training data so that they update only designated layers or components, safety training can be isolated in dedicated zones.
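To make the idea concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not Anthropic's implementation: the two-block model, the names `predict` and `routed_step`, the `is_safety` flag, and the specific masking rule are all hypothetical, chosen only to show a data-dependent mask deciding which parameters a gradient may update.

```python
# Toy sketch of gradient routing (illustrative assumption, not the paper's code):
# a per-example mask decides which parameter block receives the gradient.

def predict(params, x):
    # The output sums the contributions of two scalar parameter blocks.
    return (params["main"] + params["safety"]) * x

def routed_step(params, x, target, is_safety, lr=0.1):
    """One SGD step on squared error with a per-example gradient mask."""
    error = predict(params, x) - target
    grad = 2.0 * error * x  # d(loss)/d(param); identical for both blocks here
    # Routing rule (assumption): safety examples update only the "safety"
    # block, all other examples update only the "main" block.
    if is_safety:
        mask = {"main": 0.0, "safety": 1.0}
    else:
        mask = {"main": 1.0, "safety": 0.0}
    return {name: value - lr * mask[name] * grad for name, value in params.items()}

params = {"main": 0.0, "safety": 0.0}
after_safety = routed_step(params, x=1.0, target=1.0, is_safety=True)
after_general = routed_step(params, x=1.0, target=1.0, is_safety=False)
```

In this sketch the safety example changes only `params["safety"]` and the general example changes only `params["main"]`. A real model would apply such masks to layers or activations during backpropagation, and how cleanly behavior stays confined there is an empirical question.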

The result is a model whose safety behaviors are localized in identifiable components. This has two advantages: what the safety module does can be verified mechanistically, and, in theory, the module can be updated or replaced without touching the rest of the model.
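Both advantages can be illustrated with a small sketch, assuming localization has fully succeeded. Everything here is hypothetical: the additive toy model, the block names, and the helper `replace_block` are illustration only, and real networks are not additive in this way.

```python
import copy

def forward(params, x):
    # Hypothetical toy model: every weight in every block adds to the output.
    return sum(w * x for block in params.values() for w in block)

def replace_block(params, name, new_weights):
    """Return a copy of params with one named block swapped out."""
    updated = copy.deepcopy(params)
    updated[name] = list(new_weights)
    return updated

params = {"main": [0.5, -0.25], "safety": [0.25]}

# Verification by ablation: zero the safety block and measure what changes.
ablated = replace_block(params, "safety", [0.0])
full_out = forward(params, 2.0)
ablated_out = forward(ablated, 2.0)
safety_effect = full_out - ablated_out

# Replacement: drop in a new safety block; the main block is left untouched.
retrained = replace_block(params, "safety", [0.75])
```

Comparing outputs with and without the block isolates its contribution, and swapping it leaves `params["main"]` unchanged. Whether a trained LLM permits an equally clean swap depends on how completely the behavior was confined during training.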

This is still preliminary research, but it points in a promising direction: models whose safety is verifiable rather than assumed.