Mapping the Mind of LLMs: Anthropic identifies interpretable features in Claude 3 Sonnet
In one sentence: Anthropic has published the most detailed research to date on the mechanistic interpretability of a commercial LLM, showing that concepts such as 'Trump', 'slavery', and 'Python code' have identifiable representations in Claude 3 Sonnet's internal activations.
How does a large language model really work internally? For years the honest answer was: we do not know. Anthropic's mechanistic interpretability project is beginning to change this answer.
Researchers identified specific features in Claude 3 Sonnet: directions in the model's activation space that correspond to precise, interpretable semantic concepts. Examples include "Donald Trump," "slavery as a historical concept," "Python code," and "negative sentiment."
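To make "a direction in activation space" concrete, here is a minimal sketch in plain Python. It assumes a toy four-dimensional activation space and a hand-picked direction; the vectors and the name feature_activation are hypothetical, and the sketch does not reproduce how the researchers actually found their features. It only shows how a feature's strength can be read off as the projection of an activation vector onto that direction.

```python
import math

def feature_activation(activation, direction):
    """Project an activation vector onto a feature direction (normalized to unit length)."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm

# Hypothetical "Python code" feature direction in a toy 4-dimensional space.
# Real models have activation spaces with thousands of dimensions.
python_code_direction = [0.0, 3.0, 0.0, 4.0]

# Hypothetical activation vectors for two pieces of text.
on_topic = [0.1, 1.5, -0.2, 2.0]   # e.g. while reading Python source
off_topic = [1.2, 0.0, 0.9, 0.0]   # e.g. while reading unrelated prose

score_on = feature_activation(on_topic, python_code_direction)
score_off = feature_activation(off_topic, python_code_direction)
print(score_on, score_off)
```

In this toy setup the on-topic vector has a large component along the feature direction and the off-topic vector has none, which is the sense in which a feature is "active" on some inputs and silent on others.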
When these features are active, the model processes text in a way that correlates with the corresponding concept. When researchers artificially amplify or suppress a feature by intervening directly on the model's activations (a technique often called feature steering), the model's behavior changes in a predictable way that is consistent with the concept.
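The intervention described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Anthropic's implementation: the function name clamp_feature, the three-dimensional vectors, and the target values are all hypothetical. The idea is to change only the component of an activation vector that lies along the feature direction, setting it to a chosen value and leaving everything else untouched.

```python
import math

def clamp_feature(activation, direction, target):
    """Return a copy of `activation` whose component along `direction` equals `target`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    current = sum(a * u for a, u in zip(activation, unit))
    # Shift the vector along the feature direction only; orthogonal components are preserved.
    return [a + (target - current) * u for a, u in zip(activation, unit)]

# Hypothetical feature direction and activation vector in a toy 3-dimensional space.
direction = [0.0, 0.0, 2.0]
activation = [1.0, 2.0, 0.0]

boosted = clamp_feature(activation, direction, target=5.0)    # artificially activate the feature
suppressed = clamp_feature(boosted, direction, target=0.0)    # artificially deactivate it again
print(boosted, suppressed)
```

In a real model, the modified activations would be fed back into the remaining layers, and the claim being tested is that the output shifts toward or away from the concept accordingly.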
This does not mean we understand everything about the model, but it is a concrete step toward a causal understanding of what happens inside an LLM, with direct implications for safety and alignment verification.
Companies
Anthropic
Tools
Claude 3 Sonnet