Behold, My Stuff


When scientists seek to understand a phenomenon, they don’t just observe it—they form hypotheses and test them through controlled experiments. For instance, to understand gravity, we don’t just watch objects fall—we deliberately drop different objects, measure their acceleration, and test our theories by seeing what happens when we vary factors like mass, shape, or drop height.

Similarly, to truly understand what a neural network has learned, we shouldn’t just observe its behavior—we need to test our hypotheses about how it works. Traditional interpretability methods like saliency maps are like pure observation—they show us what the model appears to be looking at, but we can’t test whether those explanations are correct. It’s as if we could only watch objects fall but never drop them ourselves.

Sparse autoencoders (SAEs) offer a unique solution because they let us both interpret AND test our interpretations. When an SAE suggests that a model recognizes “sand” in an image, we can actually test this by modifying that sand-like feature and seeing how it affects the model’s behavior. If our interpretation is correct, changing the “sand” feature should reliably change the model’s predictions in semantically meaningful ways—like turning predictions of “beach” into “ocean” when we reduce sand-like features.
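
To make that concrete, here’s a rough sketch of what such an intervention could look like in PyTorch. This is a minimal illustration, not any particular library’s API: the `sae` object with `encode`/`decode` methods, the hooked `layer`, the `feature_idx`, and the `model.layer4` in the usage comment are all assumed stand-ins.

```python
# Minimal sketch of an SAE feature intervention (assumed interfaces, see above).
import torch


def ablate_feature(model, layer, sae, images, feature_idx, scale=0.0):
    """Run `model` on `images`, but rescale one SAE feature at `layer`."""

    def hook(module, inputs, output):
        z = sae.encode(output)        # activations -> sparse feature codes (assumed SAE API)
        z[..., feature_idx] *= scale  # scale=0.0 removes the hypothesized "sand" feature
        return sae.decode(z)          # reconstructed activations replace the originals

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            return model(images)      # predictions with the feature modified
    finally:
        handle.remove()


# Hypothetical usage: compare against an unmodified forward pass.
# logits_orig = model(images)
# logits_ablated = ablate_feature(model, model.layer4, sae, images, feature_idx=123)
```

The comparison between the two forward passes is the experiment: if the “sand” reading is right, predictions like “beach” should shift toward “ocean” when the feature is scaled down, and stay put when an unrelated feature is scaled instead.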

This provides a kind of empirical validation that’s missing from most interpretability methods. We’re not just making educated guesses about what the model might be doing; rather, we can actually test our hypotheses through controlled feature interventions.



Sam Stevens, 2024