The AI Productivity Measurement Problem: Construct Mismatches Explain Why Coding Tool Studies Disagree
Headline result
Most apparent contradictions across AI coding tool studies (for example, a controlled trial reporting a 19% slowdown beside a field experiment reporting 26% more completed tasks) are construct mismatches, not empirical disagreements. The remaining within-construct disagreements trace to task type and participant expertise, not to the tool itself.
Method in brief
Seven AI coding evaluations are classified by the SPACE productivity dimension their primary metric operationalises; the classification separates cross-construct comparisons (misleading by construction) from within-construct disagreements. The analysis yields a five-item Construct Alignment Checklist that practitioners can use to avoid category errors when comparing AI coding studies.
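To make the classification concrete, the sketch below (in Python, which the paper does not prescribe) shows the category check the method applies: each study is tagged with the SPACE dimension its primary metric operationalises, and two headline effects are treated as potentially conflicting only when they share a construct. The study labels, metric strings, and effect values are illustrative placeholders, not the paper's dataset.

```python
# Illustrative sketch only: study names, metrics, and effect sizes are
# hypothetical placeholders. SPACE dimensions follow Forsgren et al. (2021).
from dataclasses import dataclass
from enum import Enum


class SpaceDimension(Enum):
    SATISFACTION = "satisfaction and well-being"
    PERFORMANCE = "performance"
    ACTIVITY = "activity"
    COMMUNICATION = "communication and collaboration"
    EFFICIENCY = "efficiency and flow"


@dataclass
class Study:
    name: str                   # hypothetical label
    primary_metric: str         # what the headline number actually measures
    construct: SpaceDimension   # SPACE dimension the metric operationalises
    effect: float               # signed headline effect; positive favours the tool


def comparable(a: Study, b: Study) -> bool:
    """Two headline numbers can only conflict if they target the same construct."""
    return a.construct is b.construct


# Hypothetical entries mirroring the two headline studies in the abstract.
trial = Study("controlled trial", "task completion time",
              SpaceDimension.EFFICIENCY, -0.19)
field = Study("field experiment", "completed tasks per week",
              SpaceDimension.PERFORMANCE, +0.26)

if not comparable(trial, field):
    print("Cross-construct comparison: the numbers can diverge without conflict.")
```

This construct check is the kind of comparison gate the checklist formalises: confirm that two studies operationalise the same SPACE dimension before reading their headline effects as agreement or disagreement.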
Key Contributions
- Argues that most apparent contradictions in AI coding tool evaluations are construct mismatches, not empirical disagreements about the same underlying phenomenon.
- Classifies seven AI coding evaluations by the SPACE productivity dimension their primary metric operationalises, exposing where headline-level conflict is actually category confusion.
- Distinguishes between cross-construct comparisons (which are misleading by construction) and within-construct disagreements (which trace to task type and participant expertise).
- Proposes a five-item Construct Alignment Checklist that practitioners can use to avoid category errors when comparing AI coding studies.
Abstract
Context: AI coding tool evaluations produce contradictory headlines: a controlled trial reports a 19% slowdown for experienced developers, while a field experiment reports 26% more completed tasks.
Objective: We show that most such contradictions are construct mismatches, not empirical disagreements, and provide practitioners with a checklist to detect them.
Method: We classify seven AI coding evaluations by the SPACE productivity dimension their primary metric operationalises and analyse patterns within and across constructs.
Results: Contradictions arise from cross-construct comparisons: a study measuring task time (Efficiency) and one counting completed tasks (Performance) target different constructs and can diverge without conflict. Within-construct disagreements trace to task type and participant expertise.
Conclusion: A five-item Construct Alignment Checklist lets practitioners avoid category errors when comparing AI coding studies.