Predicting Fine-Tuning Recruitment from Base-Model Gate Geometry

Working paper Working paper, 2026 2026

P. M. Konrad

Two pathways for fine-tuning recruitment. Redirect (left) reuses already-active neurons; latent (right) flips near-boundary gate neurons with near-zero base activation. Activation-monotone predictors are blind to the latent pathway by construction.
Two pathways for fine-tuning recruitment. Redirect (left) reuses already-active neurons; latent (right) flips near-boundary gate neurons with near-zero base activation. Activation-monotone predictors are blind to the latent pathway by construction.

Headline result

Input Sensitivity Score (ISS) times absolute gradient retrieves 41 of 50 true recruits in top-50 against an all-tasks union pool of 398 neurons on Gemma-2-2B-it at layer 19, compared with 36 for the strongest activation-monotone baseline and 28 for absolute activation alone. The same ordering holds on Qwen-2.5-1.5B-Instruct at layer 20 (25 versus 18 versus 14).

Method in brief

Trace fine-tuning recruitment in gated-MLP layers, distinguishing redirect (already-active neurons) from latent (near-boundary, near-zero base activation) pathways. Prove the structural ceiling for any monotone-in-absolute-activation predictor. Derive the Input Sensitivity Score from the gated-MLP Jacobian, combining gate output and gate slope so that latent recruits are not invisible by construction. Score base-model neurons before any fine-tune, then evaluate top-50 retrieval against the post-fine-tune ground-truth recruit set on two model families.

Key Contributions

Abstract

A short behavior-installing fine-tune on an instruction-tuned language model concentrates its weight changes on a small pool of MLP neurons. Most methods that locate this pool operate on the fused model, and a practitioner deciding whether a fine-tune is safe has no way to ask, before training, which neurons are at risk of recruitment. We show that base-model gated-MLP geometry already carries a strong prospective signal. Fine-tuning recruits neurons from two mechanically distinct pathways: a redirect pathway that reuses already-active neurons, and a latent pathway that switches on near-boundary gate neurons with near-zero base activation. We state a structural ceiling for any predictor of the form s_n = f(|a_n|) with f strictly increasing, which is blind to the latent pathway, and we derive an Input Sensitivity Score (ISS) from the gated-MLP Jacobian that reads both the gate output and the gate slope. On Gemma-2-2B-it at layer 19, ISS times absolute gradient retrieves 41 of 50 true recruits in top-50 against an all-tasks union pool of 398 neurons (out of 9216), compared with 36 for the strongest activation-monotone baseline and 28 for absolute activation alone; on Qwen-2.5-1.5B-Instruct at layer 20, the same ordering holds (25 vs 18 vs 14). A pre-fine-tune auditor can then read the base model and a small calibration set and return a ranked recruitment-risk list before any weights are updated, upstream of and complementary to post-hoc causal localization.

Geometry and predictor performance

Geometry of the Input Sensitivity Score in gate-output and gate-slope coordinates
The Input Sensitivity Score reads both the gate output and the gate slope. Latent-pathway recruits sit at high slope but near-zero output, where activation-monotone predictors cannot see them.
Top-50 retrieval performance: ISS vs activation-monotone baselines
Top-50 retrieval against the post-fine-tune ground-truth recruit set. ISS clears the structural ceiling that bounds any monotone-in-absolute-activation predictor.