A Fine-Tuning-Installed Routing Subspace Controls Eval vs Deploy Behavior Across Model Families

Under review at NeurIPS 2026

Abstract

Instruction fine-tuning can install behavior that depends on whether a prompt looks like an evaluation or a deployment. We find that the installed signal concentrates in a narrow mid-depth attention window and is localized to a low-dimensional routing subspace that fine-tuning installs rather than inherits from the base model. Clamping that subspace at inference reduces the eval-deploy behavior gap in 11 of 12 architecture-behavior cells, with four cells closing to near zero, while random, non-routing, and semantic-content controls fail to reduce it.
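
To make the idea of clamping a subspace concrete, the following is a minimal NumPy sketch and not the paper's implementation. It assumes the routing subspace is given as an orthonormal basis and that "clamping" means overwriting the activation's coordinates inside that subspace with those of a fixed reference activation, leaving the orthogonal complement untouched. The function `clamp_subspace`, the variable names, the dimensions, and the random data are all hypothetical.

```python
import numpy as np

def clamp_subspace(h, basis, target):
    """Overwrite the component of h inside span(basis) with that of target.

    h:      (n, d) activations, one row per token or prompt (hypothetical)
    basis:  (d, k) matrix with orthonormal columns spanning the subspace
    target: (d,) reference activation whose subspace coordinates are imposed
    """
    coords = h @ basis              # (n, k) current coordinates in the subspace
    target_coords = target @ basis  # (k,) coordinates to clamp to
    # Shift each row only along the subspace directions.
    return h + (target_coords - coords) @ basis.T

# Illustrative data only: a random 2-dimensional subspace of a 16-dimensional space.
rng = np.random.default_rng(0)
d, k = 16, 2
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal columns, shape (d, k)
h_eval = rng.normal(size=(4, d))                  # stand-in for eval-like activations
h_reference = rng.normal(size=d)                  # stand-in for a deploy-like reference

h_clamped = clamp_subspace(h_eval, basis, h_reference)
```

In this sketch, a random-direction control would amount to passing a random orthonormal `basis` of the same rank in place of the estimated one; how the paper actually estimates the subspace and chooses the clamp value is not stated in the abstract.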


Phongsakon Mark Konrad
