Anthropic Responsible Scaling Policy v2: capability-based triggers for safety
In one sentence: Anthropic has updated its Responsible Scaling Policy so that, instead of compute thresholds, specific Capability Thresholds (biorisk, autonomy, cyber) trigger formal safety measures.
Anthropic, the company behind Claude, is one of the few companies that publicly explain how they decide whether a model is "too dangerous" to release. Its framework for that decision is called the Responsible Scaling Policy.
The first version (2023) used training compute as a proxy for risk: bigger meant riskier. That proxy works poorly, because a small but specialized model can be as dangerous as a big one.
The new version flips the approach: the model's capabilities are evaluated directly. For example, if the model can help synthesize serious pathogens, a safety level is triggered, bringing external audits, restrictions, and mitigations. Size doesn't matter.
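The contrast between the two approaches can be shown with a toy sketch. It is not Anthropic's actual procedure: the class, the function names, the compute cutoff, and the two example models are all hypothetical, and the threshold labels are simply the three areas named in the summary above.

```python
from dataclasses import dataclass

# Hypothetical labels, copied from the summary above; not the policy's wording.
CAPABILITY_THRESHOLDS = ("biorisk", "autonomy", "cyber")


@dataclass
class ModelReport:
    name: str
    training_compute_flop: float  # used only by the compute-based rule
    crossed: frozenset  # areas where evaluations found a threshold crossed


def compute_based_trigger(report: ModelReport, flop_limit: float) -> bool:
    """Compute proxy as described above: bigger means riskier."""
    return report.training_compute_flop >= flop_limit


def capability_based_trigger(report: ModelReport) -> list:
    """Capability rule: list the crossed thresholds, regardless of model size."""
    return [c for c in CAPABILITY_THRESHOLDS if c in report.crossed]


FLOP_LIMIT = 1e25  # arbitrary illustrative cutoff, not a real policy number

# Two made-up models: a small specialist and a large generalist.
small_specialist = ModelReport("small-specialist", 1e22, frozenset({"biorisk"}))
large_generalist = ModelReport("large-generalist", 1e26, frozenset())

for report in (small_specialist, large_generalist):
    print(
        report.name,
        "| compute rule fires:", compute_based_trigger(report, FLOP_LIMIT),
        "| capability thresholds crossed:", capability_based_trigger(report),
    )
```

In this sketch the compute rule ignores the small specialist and flags the large generalist, while the capability rule does the opposite, which is the failure mode the paragraph on the first version describes.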
Companies
Anthropic
Tools
—
Tags
Sources