Decoded but Unused: Instruction Tuning Routes Moral Framing into the Judgment Readout
Under review at the ICML 2026 Workshop on Mechanistic Interpretability
Abstract
Moral framing is already linearly decodable in pretrained Gemma-3-4B, yet it has no causal effect on the model's judgment. In the instruction-tuned (IT) checkpoint, that same representation becomes aligned with the evaluative readout and causally usable by it. Within-model framing-judgment alignment is 8.4× larger in the IT checkpoint than in the matched pretrained checkpoint at the same layer. Instruction tuning therefore changes how the representation is read out, not whether it exists.
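For illustration only, one way a within-model alignment score of this kind could be computed is sketched below. The sketch assumes that alignment is the absolute cosine similarity between a difference-of-means framing direction and a judgment readout direction taken at the same layer; the abstract does not specify the metric, and the function names `diff_of_means_direction` and `alignment` are hypothetical.

```python
import numpy as np


def diff_of_means_direction(acts, labels):
    """Unit vector pointing from the class-0 mean to the class-1 mean.

    acts:   (n_examples, d_model) residual-stream activations at one layer.
    labels: (n_examples,) binary labels (e.g. framing A vs. framing B).
    This is an assumed, simple stand-in for a trained linear probe.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels).astype(bool)
    direction = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return direction / np.linalg.norm(direction)


def alignment(framing_dir, judgment_dir):
    """Absolute cosine similarity between two directions (assumed metric)."""
    f = np.asarray(framing_dir, dtype=float)
    j = np.asarray(judgment_dir, dtype=float)
    f = f / np.linalg.norm(f)
    j = j / np.linalg.norm(j)
    return float(abs(f @ j))
```

Under these assumptions, the score would be computed once per checkpoint at the same layer, and the reported comparison would correspond to the ratio of the IT score to the pretrained score.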
