A Path Already Walked: On Inheriting Network-Neuroscience Tools for Mechanistic Interpretability

Under review at the ICML 2026 Workshop on Mechanistic Interpretability

Abstract

Mechanistic interpretability is moving from individual neurons and attention heads toward circuits, dictionary features, and attribution graphs. Many important phenomena are relational rather than component-local: they arise from interactions among components rather than residing in any single one. We argue for a disciplined import of network-neuroscience tools rather than a loose brain analogy. We specify the transformer graph contract, give a compact mapping from network-neuroscience primitives to transformer analyses, and state eight testable translations with failure criteria.
