The Scalar-Update Frontier for Data-Free First-Order Defenses: A Symmetry-and-Channel Account
Headline result
An empirical audit of 43 minibatch-only first-order defenses on Gemma-2-2B-it organizes into the four predicted regimes: 18 on-frontier, 13 moderate-off with a same-sign channel, 4 partial, 2 sign-opposed. Direct labels and framing metadata are identified as directional channels, and a pre-training Kill-Gate screen filters candidate channels before any full training run.
Method in brief
Two structural conditions on the defense are formalized: O(H)-equivariance of the update operator on the response plane H spanned by the target and anti-target directions, and a direction-free side channel. Under both, the Scalar-Frontier Theorem reduces the ensemble action to a scalar multiple of the gradient, which places the cell on the slope-1 diagonal under a Fisher-metric normalization. Departures from the frontier are attributed to one or more of four mechanisms; the audit then maps where each defense lies relative to the frontier.
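The link between equivariance and a scalar update can be illustrated with a standard linear-algebra fact: a linear map on a two-dimensional plane that commutes with every rotation and reflection is a multiple of the identity. The sketch below is an illustration only, not the paper's proof. It assumes the update operator acts linearly on H and is written as a 2x2 matrix in an orthonormal basis of H; the helper names (rotation, o2_elements, equivariant_part) are hypothetical. It averages an arbitrary matrix over a finite set of rotations and reflections, which keeps only the part that commutes with them, and checks that the result is a scalar multiple of the identity.

```python
import numpy as np


def rotation(theta):
    """2x2 rotation by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def o2_elements(n_angles=8):
    """Finite sample of O(2): n equally spaced rotations, each with and
    without a reflection (a dihedral subgroup, used here as an assumption)."""
    flip = np.diag([1.0, -1.0])
    elems = []
    for k in range(n_angles):
        r = rotation(2.0 * np.pi * k / n_angles)
        elems.append(r)
        elems.append(r @ flip)
    return elems


def equivariant_part(update, n_angles=8):
    """Average Q^T U Q over the sampled group elements.

    The average commutes with every sampled Q, so it is the equivariant
    component of the (assumed linear) update operator U.
    """
    elems = o2_elements(n_angles)
    return sum(q.T @ update @ q for q in elems) / len(elems)


rng = np.random.default_rng(0)
U = rng.normal(size=(2, 2))   # arbitrary, generally anisotropic, linear update
g = rng.normal(size=2)        # stand-in for the gradient restricted to H

U_eq = equivariant_part(U)
scale = np.trace(U) / 2.0     # the only surviving degree of freedom

print(np.allclose(U_eq, scale * np.eye(2)))  # equivariant part is scalar * I
print(np.allclose(U_eq @ g, scale * g))      # so its action is scalar * gradient
```

In this toy setting the off-diagonal and traceless parts of U average to zero, so any anisotropy in the update has to come from something that breaks the symmetry, which is the role the text assigns to hard-coded anisotropy and directional side channels.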
Key contributions
- States the Scalar-Frontier Theorem: under O(H)-equivariance of the update operator and a direction-free side channel, the ensemble action is a scalar multiple of the gradient and the cell sits on the slope-1 diagonal under a Fisher-metric normalization.
- Pairs any systematic departure from the frontier with a disjunction of four mechanisms: hard-coded anisotropy with nonzero anti-target projection, a directional side channel, broken linear response, or a miscalibrated metric.
- Audits 43 cells on Gemma-2-2B-it and recovers the predicted regimes: 18 on-frontier, 13 moderate-off with same-sign channel, 4 partial, 2 sign-opposed.
- Identifies direct labels and framing metadata as directional channels and proposes a pre-training Kill-Gate screen that filters candidate channels before a full training run.
Abstract
Minibatch-only first-order defenses against installed eval-deploy behavior gaps produce a wide range of tradeoffs that do not organize by optimizer family. We show that these cells cluster into regimes explained by two structural conditions on the defense: O(H)-equivariance of the update operator on the response plane H spanned by the target direction v_T and the anti-target direction v_A, and a direction-free side channel. The Scalar-Frontier Theorem states that, under both conditions, the ensemble action is a scalar multiple of the gradient and the cell lies on the scalar-update frontier; with a Fisher-metric normalization, the frontier is the slope-1 diagonal. An Off-Frontier Proposition pairs any systematic departure from the frontier with a disjunction of four mechanisms: hard-coded anisotropy with nonzero v_A projection, a directional side channel, broken linear response, or a miscalibrated metric. An empirical audit on Gemma-2-2B-it realizes the predicted regimes on 43 cells: 18 on-frontier, 13 moderate-off with a same-sign channel, 4 partial, 2 sign-opposed. Direct labels and framing metadata are identified as directional channels; a pre-training Kill-Gate screen filters candidate channels before a full training run. This is a design principle for a narrow class of defenses, not an impossibility result: higher-order, oracle-access, and data-modifying methods fall outside its scope.