Counterfactual Self-Reports Are Not Well-Posed: A Mechanism-Binding Test for LLM Introspection
Under review at PhilML @ ICML 2026 Workshop, 2026
Abstract
Counterfactual self-reports from language models are increasingly used as evidence about internal computation, yet a counterfactual question is well-posed only when its answer is identified by the named intervention rather than by the surrounding prompt. Across three open instruction models, we find that holding the intervention fixed while varying the demonstration environment moves the self-report between the target and source mechanisms. Self-report benchmarks should therefore require environment-shift invariance under a fixed intervention before treating report accuracy as informative.
