When does chain-of-thought improve safety? Evidence from 18 models across 5 families

Under review at COLM 2026

Abstract

Chain-of-thought (CoT) prompting is widely deployed for safety-critical reasoning, yet its effect on safety behaviors varies. We evaluate CoT across 18 open-weight models in 5 families on safety-relevant benchmarks. The effect depends on both model family and model capability. We characterize where CoT helps, where it hurts, and which model properties predict the sign of the effect.

Phongsakon Mark Konrad
