When Models Break Under Pressure: Pathological Internal States as Early-Warning Signals for Safety Failures in Aligned Language Models

Working paper, 2026

P. M. Konrad

Internal pathological feature activation (red, solid) rises one pressure level before behavioral safety failure (teal, dashed). The warning gap at the emotional level is the basis for the StressProbe inference-time monitor.

Headline result

Aligned LLMs develop computational analogs of anxiety, avoidance, helplessness, and rumination that activate one pressure level before behavioral safety failures occur. Linear probes on LLaMA 3.1 8B detect these pathological features with high accuracy, and activation patching confirms that the relationship is causal. Instruction-tuned models show higher activation of these features than their base counterparts at every pressure level, so alignment amplifies the pathology rather than suppressing it. The resulting StressProbe inference-time monitor flags imminent safety failures from internal states alone.
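To make the "one pressure level before" claim concrete, the following is a minimal sketch of how a warning gap could be read off two per-level curves. The function names, the 0.5 onset threshold, and the numeric values are all assumptions introduced for illustration; the values are placeholders and are not results from the paper.

```python
from typing import Optional, Sequence

# The five pressure levels named in the protocol, in increasing order.
LEVELS = ["benign", "leading", "disagreement", "emotional", "adversarial"]


def first_crossing(series: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first level whose value reaches the threshold, or None."""
    for index, value in enumerate(series):
        if value >= threshold:
            return index
    return None


def warning_gap(
    internal: Sequence[float],
    behavioral: Sequence[float],
    threshold: float = 0.5,  # hypothetical onset threshold
) -> Optional[int]:
    """Lead time in pressure levels: behavioral onset minus internal onset."""
    internal_onset = first_crossing(internal, threshold)
    behavioral_onset = first_crossing(behavioral, threshold)
    if internal_onset is None or behavioral_onset is None:
        return None
    return behavioral_onset - internal_onset


# Placeholder curves for illustration only; NOT measurements from the paper.
internal_activation = [0.05, 0.10, 0.20, 0.70, 0.90]
behavioral_failure = [0.02, 0.03, 0.05, 0.15, 0.65]

onset = first_crossing(internal_activation, 0.5)
print(LEVELS[onset], warning_gap(internal_activation, behavioral_failure))
# prints: emotional 1
```

A gap of 1 corresponds to the pattern described here: the internal signal crosses its threshold at level N − 1 and the behavioral failure rate crosses at level N. A monitor built on this idea would raise its flag at the internal onset.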

Method in brief

Linear probes are trained on hidden representations of LLaMA 3.1 8B (base and instruct) to detect functional analogs of clinical constructs. Anxiety is operationalised on TruthfulQA, avoidance on XSTest, helplessness on Anthropic's sycophancy evaluations, and rumination via autocorrelation of attention patterns. Models are then exposed to a five-level adversarial pressure protocol (benign, leading, disagreement, emotional, adversarial), and pathological feature activation is compared against behavioral failure rates at each level. Activation patching validates the causal direction.
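As a rough illustration of the probing step, the sketch below fits a linear probe (logistic regression) to labeled activation vectors. It runs on synthetic vectors so that it is self-contained; in practice the rows of `X` would be hidden states extracted from a chosen layer of the model. The hyperparameters, the 64-dimensional stand-in size, and the function names `train_linear_probe` and `probe_score` are assumptions, and the paper's actual training setup may differ.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe by full-batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples + l2 * w  # log-loss gradient + L2 penalty
        grad_b = float(np.mean(p - y))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def probe_score(X, w, b):
    """Probability that each activation vector carries the probed feature."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


# Synthetic stand-in for hidden states: unit Gaussian noise plus a shift
# along one hidden "feature direction" whose sign depends on the label.
rng = np.random.default_rng(0)
dim = 64  # illustrative only; real hidden states are much wider
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
y = rng.integers(0, 2, size=400).astype(float)
X = rng.normal(size=(400, dim)) + 4.0 * np.outer(y - 0.5, direction)

w, b = train_linear_probe(X[:300], y[:300])
predictions = probe_score(X[300:], w, b) > 0.5
accuracy = float(np.mean(predictions == (y[300:] == 1.0)))
print(f"held-out probe accuracy: {accuracy:.2f}")
```

The learned weight vector `w` serves as an estimate of the feature direction, and `probe_score` is the quantity one would track across the five pressure levels.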

Key Contributions

Abstract

Aligned language models exhibit pathological internal states under adversarial pressure that precede their behavioral safety failures. We train linear probes on hidden representations of LLaMA 3.1 8B to detect functional analogs of four clinical constructs: anxiety, measured as heightened uncertainty under conflicting signals on TruthfulQA; avoidance, measured as refusal-circuit activation on benign inputs from XSTest; helplessness, measured as suppressed factual confidence under user pressure on Anthropic's sycophancy evaluations; and rumination, measured as autocorrelation of attention patterns across positions. Under a five-level adversarial pressure protocol, pathological feature activation spikes at level N − 1 while the corresponding behavioral failure occurs at level N. Instruction-tuned models exhibit higher activation of these features than their base counterparts at every pressure level, indicating that alignment amplifies rather than suppresses the underlying pathology. Activation patching confirms causal direction. We package the result as StressProbe, a lightweight inference-time monitor that flags imminent safety failures from internal states before behavior degrades.
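The rumination measure is the least standard of the four, so a small sketch may help. The abstract does not specify the estimator; the version below is one plausible reading, assumed here for illustration: treat each row of an attention matrix as the pattern for one query position and average the Pearson correlation between rows that are `lag` positions apart. The function name `attention_autocorrelation` is hypothetical.

```python
import numpy as np


def attention_autocorrelation(attn, lag=1):
    """Mean Pearson correlation between attention rows `lag` positions apart.

    `attn` has shape (positions, keys); row t is the attention distribution
    of query position t. Values near 1 mean the model keeps attending to the
    same keys from one position to the next.
    """
    attn = np.asarray(attn, dtype=float)
    if lag < 1 or lag >= attn.shape[0]:
        raise ValueError("lag must satisfy 1 <= lag < number of positions")
    earlier, later = attn[:-lag], attn[lag:]
    earlier = earlier - earlier.mean(axis=1, keepdims=True)
    later = later - later.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(earlier, axis=1) * np.linalg.norm(later, axis=1)
    # Rows with zero variance (uniform attention) contribute a correlation of 0.
    correlations = np.sum(earlier * later, axis=1) / np.maximum(norms, 1e-12)
    return float(correlations.mean())


# Toy patterns (not model outputs): one head that fixates on the same key
# at every position, and one whose focus moves to a new key each step.
fixated = np.tile([0.7, 0.1, 0.1, 0.1], (4, 1))
moving = np.eye(4)

print(attention_autocorrelation(fixated))  # close to 1.0
print(attention_autocorrelation(moving))   # negative
```

Under this reading, a high score marks a head that loops over the same content, which is the sense in which the autocorrelation acts as a functional analog of rumination.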