Counterfactual Self-Reports Are Not Well-Posed: A Mechanism-Binding Test for LLM Introspection
Under review at PhilML @ ICML 2026 Workshop, 2026
Abstract
Counterfactual self-reports from language models are increasingly used as evidence about internal computation, yet a counterfactual question is well-posed only when its answer is identified by the named intervention rather than by the surrounding prompt. Across three open instruction models, we find that holding the intervention fixed while varying the demonstration environment moves the self-report between the target and source mechanisms. Self-report benchmarks should therefore require environment-shift invariance under a fixed intervention before treating report accuracy as informative.
