When does chain-of-thought improve safety? Evidence from 18 models across 5 families
Under review at COLM 2026
Abstract
Chain-of-thought (CoT) prompting is widely deployed for safety-critical reasoning, yet its effect on safety behavior is inconsistent across models. We evaluate CoT on safety-relevant benchmarks across 18 open-weight models from 5 families. The effect depends on both model family and model capability. We characterize where CoT helps, where it hurts, and which model properties predict the sign of the effect.
