Decoded but Unused: Instruction Tuning Routes Moral Framing into the Judgment Readout

Under review at the ICML 2026 Workshop on Mechanistic Interpretability

Abstract

Moral framing is already linearly decodable in the pretrained Gemma-3-4B checkpoint but has no causal effect on its judgment. In the instruction-tuned (IT) checkpoint, that same representation becomes aligned with, and causally usable by, the evaluative readout. Within-model framing-judgment alignment is 8.4x larger in the IT checkpoint than in the matched pretrained checkpoint at the same layer. Instruction tuning changes how the representation is read out, not whether it exists.
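As a rough illustration of what an alignment score between a framing representation and a judgment readout could look like, here is a minimal sketch on synthetic data. It is not the paper's method: the difference-of-means framing direction, the absolute cosine similarity used as the score, and the names `framing_direction`, `alignment`, and `demo` are all assumptions made for this example, and no real model activations are involved.

```python
import numpy as np


def framing_direction(acts, labels):
    """Unit-norm difference-of-means direction between two framing conditions.

    Assumption: a difference-of-means estimate stands in for whatever probe
    the paper actually uses.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels)
    d = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)


def alignment(direction, readout):
    """Absolute cosine similarity between a framing direction and a readout vector."""
    direction = np.asarray(direction, dtype=float)
    readout = np.asarray(readout, dtype=float)
    denom = np.linalg.norm(direction) * np.linalg.norm(readout)
    return float(abs(direction @ readout) / denom)


def demo(seed=0, n=400, dim=64):
    """Synthetic stand-in for activations: framing shifts coordinate 0 only."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    true_dir = np.zeros(dim)
    true_dir[0] = 1.0
    acts = rng.normal(size=(n, dim)) + 3.0 * labels[:, None] * true_dir

    d = framing_direction(acts, labels)

    # Hypothetical readout that mostly reads the framing coordinate.
    aligned_readout = true_dir + 0.05 * rng.normal(size=dim)
    # Hypothetical readout that ignores the framing coordinate entirely.
    orthogonal_readout = np.zeros(dim)
    orthogonal_readout[1] = 1.0

    return alignment(d, aligned_readout), alignment(d, orthogonal_readout)


if __name__ == "__main__":
    aligned, orthogonal = demo()
    print(f"aligned readout:    {aligned:.3f}")
    print(f"orthogonal readout: {orthogonal:.3f}")
```

In both synthetic cases the framing direction is equally easy to recover from the activations; only the readout differs. That mirrors the distinction the abstract draws between a representation existing and a readout using it.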

Phongsakon Mark Konrad
