A Path Already Walked: On Inheriting Network-Neuroscience Tools for Mechanistic Interpretability
Under review at the ICML 2026 Workshop on Mechanistic Interpretability
Abstract
Mechanistic interpretability is moving from individual neurons and attention heads toward circuits, dictionary features, and attribution graphs. Many of the phenomena that matter at this level are relational rather than local to a single component. We argue for a disciplined import of tools from network neuroscience rather than a loose analogy to the brain. We specify a graph contract for transformers, give a compact mapping from network-neuroscience primitives to transformer analyses, and state eight testable translations, each with explicit failure criteria.
