Mapping the Mind of LLMs: Anthropic identifies interpretable features in Claude 3 Sonnet

In one sentence: Anthropic publishes the most detailed research to date on the mechanistic interpretability of a commercial LLM, showing that concepts such as 'Trump', 'slavery', and 'Python code' have identifiable representations in Claude 3 Sonnet's internal activations.



How does a large language model really work internally? For years the honest answer was: we do not know. Anthropic's mechanistic interpretability project is beginning to change this answer.

Researchers identified specific features in Claude 3 Sonnet: directions in the model's activation space that correspond to precise, interpretable semantic concepts. Some features correspond to concepts such as "Donald Trump," "slavery as a historical concept," "Python code," "negative sentiment."
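To make "a direction in activation space" concrete, here is a minimal toy sketch. It assumes a feature can be read off by projecting an activation vector onto a unit-length direction. The 4-dimensional vectors and the names `python_code_direction`, `activation_on_code`, and `activation_on_prose` are invented for illustration; they are not values or identifiers from Anthropic's research, and real model activations have thousands of dimensions.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so projections are comparable."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def feature_activation(activation, direction):
    """Return how strongly `activation` points along `direction` (a dot product)."""
    unit = normalize(direction)
    return sum(a * u for a, u in zip(activation, unit))


# Hypothetical toy data: a made-up "Python code" direction and two made-up
# activation vectors, one from code-like text and one from ordinary prose.
python_code_direction = [0.0, 3.0, 0.0, 4.0]
activation_on_code = [0.1, 1.5, -0.2, 2.0]
activation_on_prose = [0.9, -0.1, 0.4, 0.0]

code_score = feature_activation(activation_on_code, python_code_direction)
prose_score = feature_activation(activation_on_prose, python_code_direction)

print(f"feature activation on code:  {code_score:.2f}")
print(f"feature activation on prose: {prose_score:.2f}")
```

In this toy setup the feature is "active" on the code-like input (a large projection) and close to zero on the prose input, which is the sense in which a direction can correspond to a concept.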

When these features are active, the model processes text in a way correlated with the corresponding concept. When they are artificially activated or deactivated (through activation patching), the model's behavior changes in a predictable way consistent with the concept.
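The intervention idea can be sketched in the same toy terms: overwrite the component of an activation vector along a feature direction while leaving everything orthogonal to it untouched. This is a simplified illustration under assumed names (`set_feature`, `unit_direction`, and the example vector are hypothetical), not a reproduction of the procedure used in the research.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def set_feature(activation, unit_direction, target):
    """Return a copy of `activation` whose component along `unit_direction`
    equals `target`; components orthogonal to the direction are unchanged."""
    current = dot(activation, unit_direction)
    shift = target - current
    return [a + shift * u for a, u in zip(activation, unit_direction)]


# Hypothetical unit-length feature direction and activation vector.
unit_direction = [0.0, 0.6, 0.0, 0.8]
activation = [0.1, 1.5, -0.2, 2.0]

suppressed = set_feature(activation, unit_direction, target=0.0)   # feature off
amplified = set_feature(activation, unit_direction, target=10.0)   # feature forced high

print("original feature value:  ", round(dot(activation, unit_direction), 3))
print("suppressed feature value:", round(dot(suppressed, unit_direction), 3))
print("amplified feature value: ", round(dot(amplified, unit_direction), 3))
```

In a real model the modified vector would be fed back into the forward pass, and the claim in the text is that the downstream output then shifts in a way consistent with the concept. The sketch only shows the vector arithmetic of the edit itself.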

This does not mean we understand everything about the model, but it is a first concrete step toward a causal understanding of what happens inside an LLM, with direct implications for safety and alignment verification.