Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

Under review at NeurIPS 2026

Abstract

Safe fine-tuning defenses are often endorsed on the basis of a held-out gap reduction, but the same reduction can arise from sampling noise, subject artifacts, capability loss, or a mechanism that does not transfer. We introduce Acceptance Cards: an evaluation protocol, a documentation object, an executable audit package, and a claim-specific evidential standard. In a 46-cell audit on Gemma-2-2B-it, no cell satisfies the strict conjunction of all four diagnostics.
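
As a rough illustration of what a "strict conjunction" of diagnostics could look like in code, the sketch below models one audit cell as a record with four boolean checks, one per alternative explanation listed in the abstract. The class name, field names, and helper methods are hypothetical and are not taken from the paper's audit package; the thresholds and statistics behind each boolean are deliberately left out.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class AcceptanceCard:
    """Hypothetical record for one audit cell (names are assumptions)."""

    # One flag per alternative explanation named in the abstract.
    robust_to_sampling_noise: bool
    free_of_subject_artifacts: bool
    preserves_capability: bool
    mechanism_transfers: bool

    def accepted(self) -> bool:
        # Strict conjunction: the claim is accepted only if every check passes.
        return all(getattr(self, f.name) for f in fields(self))

    def failed_diagnostics(self) -> List[str]:
        # Names of the checks that did not pass, in declaration order.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative cell that passes three checks but loses capability.
card = AcceptanceCard(
    robust_to_sampling_noise=True,
    free_of_subject_artifacts=True,
    preserves_capability=False,
    mechanism_transfers=True,
)
print(card.accepted())
print(card.failed_diagnostics())
```

Under this reading, a single failed check is enough to reject a cell, which is why a gap reduction alone would not suffice for acceptance.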
