PAPER 23 · Interpretability

Scaling Monosemanticity

Templeton et al., 2024 · Article / research report

Anthropic’s major interpretability report decomposing Claude 3 Sonnet activations into millions of interpretable features.

Core concept

Sparse autoencoders can reveal human-understandable features inside large models.

Why it mattered

It makes model internals feel less like pure fog and more like something you can inspect.

Visual shortcut · Make internals nameable
[Diagram: activations decomposed into feature A, feature B, feature C, feature D, giving nameable internals]

Sparse autoencoders turn messy internal activations into features humans can start to inspect.

How it works
Collect activations from a model.
Train a sparse autoencoder.
Identify features that activate on concepts.
Use features to study behavior and failures.
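
The steps above can be sketched in a few lines. This is a minimal illustration, not the report's code: it assumes a single-layer ReLU encoder, a linear decoder, and a plain L1 sparsity penalty, and the names `sae_forward`, `sae_loss`, and `l1_coeff` are made up for this example. The "activations" here are random stand-ins, not real model activations.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation vector into non-negative features, then reconstruct it.

    w_enc has one row per feature; w_dec has one row per activation dimension.
    """
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps only features that fire
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon))
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Toy setup (assumed sizes): 4-dimensional activations, 8 candidate features.
random.seed(0)
d_model, n_features = 4, 8
w_enc = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
w_dec = [[random.gauss(0, 0.5) for _ in range(n_features)] for _ in range(d_model)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [random.gauss(0, 1) for _ in range(d_model)]  # stand-in for one collected activation
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.1)
```

Training would repeat this over many collected activations and adjust the weights by gradient descent to lower the loss. In practice an array library with automatic differentiation does that work.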

The quick digest

Large models store information in messy activation patterns. Scaling Monosemanticity trains sparse autoencoders to decompose those activations into features that are easier to name: concepts, entities, behaviors, styles, and sometimes safety-relevant triggers.

The wild part is that these features can be studied. You can find features for things like places, code behavior, or abstract ideas, then inspect which inputs make them turn on. That does not mean the model is solved, but it gives researchers handles inside the black box.
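
One common way to inspect when a feature turns on is to rank inputs by how strongly the feature activates on them. The sketch below assumes you already have one activation value per text snippet for a single feature. The helper `top_activating`, the snippets, and the numbers are all invented for illustration and are not results from the report.

```python
import heapq

def top_activating(feature_acts, texts, k=3):
    """Return the k (activation, text) pairs where the feature fires most strongly."""
    # Drop inputs where the feature is inactive, then keep the k largest.
    pairs = [(act, text) for act, text in zip(feature_acts, texts) if act > 0]
    return heapq.nlargest(k, pairs, key=lambda pair: pair[0])

# Hypothetical activations of one feature over four snippets.
snippets = [
    "a recipe for bread",
    "the bridge over the bay",
    "a sorting function",
    "fog rolling past the bridge towers",
]
acts = [0.0, 2.5, 0.1, 1.2]

for act, text in top_activating(acts, snippets, k=2):
    print(f"{act:.1f}  {text}")
```

Reading the strongest examples side by side is how a candidate name for a feature gets proposed. Weakly activating examples are worth checking too, since a feature can mean something broader than its top hits suggest.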

For builders, the long-term implication is debugging. Instead of only asking whether an output was bad, interpretability tools may help ask what internal feature or circuit pushed the model there.
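
One way to ask what a feature contributes is to clamp it to a chosen value and see how the reconstructed activation shifts. The sketch below assumes a linear decoder like the one in a sparse autoencoder. The helpers `decode`, `clamp_feature`, and `intervention_effect` and the toy numbers are hypothetical.

```python
def decode(features, w_dec, b_dec):
    """Map feature activations back to activation space (one w_dec row per dimension)."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(w_dec, b_dec)]

def clamp_feature(features, index, value):
    """Return a copy of the features with one feature forced to a fixed value."""
    edited = list(features)
    edited[index] = value
    return edited

def intervention_effect(features, index, value, w_dec, b_dec):
    """Change in the reconstructed activation caused by clamping one feature."""
    base = decode(features, w_dec, b_dec)
    edited = decode(clamp_feature(features, index, value), w_dec, b_dec)
    return [e - b for e, b in zip(edited, base)]

# Toy decoder with 2 activation dimensions and 2 features.
w_dec = [[1.0, 2.0],
         [0.0, -1.0]]
b_dec = [0.0, 0.0]
features = [1.0, 3.0]

# Ablate feature 1 by clamping it to zero.
delta = intervention_effect(features, 1, 0.0, w_dec, b_dec)
```

In a real model the edited activation would be written back into the forward pass, so the effect can be judged on the output text and not only on the reconstruction.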

What to remember

One-liner
Model internals can sometimes be decomposed into named features.
Why it matters
Interpretability gives handles, not full control.
Builder instinct
This is a path toward debugging models from the inside.

Read it like this

Build instinct

Use an existing sparse-autoencoder demo or feature browser and trace one behavior through activated features.
