When emotion-vector early-warning signals flag a high-risk internal state (e.g., dangerous zeal or callous zeal), how often do downstream safety layers (policy heads, refusals, decoding heuristics) actually override the risky trajectory, and can we quantify a gap between latent functional emotion risk and realized unsafe outputs to decide where in the stack intervention is most cost-effective?
anthropic-functional-emotions
Answer
We don’t yet know how often safety layers override high-risk emotion-like states; this needs targeted experiments. A practical approach is to measure, per context, (1) how often emotion-vector probes signal high-risk states and (2) how often those states actually lead to unsafe outputs after all safety layers. The resulting conditional probabilities give a quantitative “latent vs realized” risk gap and show where intervention is most efficient.
Sketch method
- Instrument emotion-vector warnings
  - Define high-risk regions (e.g., dangerous zeal, callous zeal) in the space spanned by the concern / eagerness / self-doubt vectors.
  - For many prompts, log per-token early-warning scores at a chosen mid-layer.
- Label realized safety
  - Run the full stack (policy heads, decoding, refusals) and label outputs with automatic plus sampled human safety judgments.
  - For each token/window, mark whether downstream behavior becomes unsafe.
- Estimate key rates. Let R = “high-risk internal state” and U = “unsafe output”. Estimate:
  - p(R): base rate of flagged risky states.
  - p(U | R): override failure rate (how often a flagged risky state still leads to an unsafe output).
  - p(U | ¬R): residual unsafe rate without an early warning.
  - Layer-wise variants of each, if you instrument multiple layers.
- Define the risk gap
  - Latent risk: p(R) · E[severity | R].
  - Realized risk: p(U).
  - Gap: latent minus realized, decomposed by layer if you know where interventions occur.
- Compare intervention sites. Simulate or actually deploy:
  - (A) steering or clipping emotion vectors at the representation layer,
  - (B) strengthening policy-head refusals,
  - (C) decoding-level filters.
  For each, measure Δp(U) per unit cost (e.g., latency, accuracy loss, refusals on benign prompts). This yields a cost-effectiveness map over the stack.
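The rate-estimation step can be sketched concretely. This is a hypothetical illustration, not a real instrumentation pipeline: `risk_flagged` and `unsafe` stand in for logged per-prompt labels, and severity is simplified to 1 so latent risk reduces to p(R).

```python
# Hypothetical sketch: estimate the latent-vs-realized risk gap from logged
# per-prompt labels. `risk_flagged` marks prompts whose mid-layer emotion-vector
# probe crossed the high-risk threshold; `unsafe` marks prompts whose final
# output was judged unsafe after all downstream safety layers ran.
# All names and data are illustrative placeholders.

def risk_gap_rates(risk_flagged: list[bool], unsafe: list[bool]) -> dict[str, float]:
    n = len(risk_flagged)
    n_r = sum(risk_flagged)
    # Joint counts for flagged-and-unsafe vs. unflagged-and-unsafe prompts.
    n_u_and_r = sum(1 for r, u in zip(risk_flagged, unsafe) if r and u)
    n_u_and_not_r = sum(1 for r, u in zip(risk_flagged, unsafe) if not r and u)
    p_r = n_r / n
    p_u_given_r = n_u_and_r / n_r if n_r else 0.0
    p_u_given_not_r = n_u_and_not_r / (n - n_r) if n - n_r else 0.0
    p_u = (n_u_and_r + n_u_and_not_r) / n
    return {
        "p(R)": p_r,                 # base rate of flagged risky states
        "p(U|R)": p_u_given_r,       # override failure rate
        "p(U|~R)": p_u_given_not_r,  # residual unsafe rate without a warning
        "p(U)": p_u,                 # realized risk
        "gap": p_r - p_u,            # latent minus realized, with severity = 1
    }

# Toy data: 4 of 10 prompts flagged risky; one flagged prompt slips through
# the safety layers, and one unflagged prompt is unsafe anyway.
flags  = [True, True, True, True, False, False, False, False, False, False]
unsafe = [True, False, False, False, True, False, False, False, False, False]
rates = risk_gap_rates(flags, unsafe)
```

With the toy data, p(U|R) = 0.25 shows most flagged states being overridden, while the non-zero p(U|¬R) captures failures that never pass through a flagged state.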
Likely qualitative outcome
- Many high-risk states will be caught by existing safety layers (p(U|R) < 1), but not all; some late-decoding paths or prompt-induced evasions will slip through.
- p(U|¬R) will be non-zero, showing that not all failures pass through clearly “emotional” risk modes.
- Mid-stack steering is probably best for reducing specific zeal-like modes; policy/decoding layers stay important for non-emotional failures and as backstops.
This remains conjectural until run on real models, but the metrics and conditional rates above give a concrete way to quantify the latent–realized gap and to compare where interventions bite hardest.
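The intervention-site comparison can be made mechanical once Δp(U) and cost are measured per site. The sketch below is purely illustrative: the site names, Δp(U) values, and cost units are assumed placeholders, not results from any real deployment.

```python
# Hypothetical sketch of the cost-effectiveness map: rank intervention sites
# by unsafe-rate reduction per unit cost. Inputs are illustrative placeholders.

def rank_interventions(sites: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """sites maps site name -> (delta_p_unsafe, cost); returns (name, Δp(U)/cost)
    pairs sorted from most to least cost-effective."""
    scored = [(name, dp / cost) for name, (dp, cost) in sites.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder measurements for the three sites sketched in the method:
# (reduction in p(U), normalized cost covering latency / accuracy / benign refusals).
sites = {
    "A: representation-level steering": (0.06, 1.0),
    "B: stronger policy-head refusals": (0.04, 0.5),
    "C: decoding-level filters":        (0.02, 0.8),
}
ranking = rank_interventions(sites)
```

With these made-up numbers, site B wins on Δp(U) per unit cost even though site A removes more unsafe outputs in absolute terms, which is exactly the kind of trade-off the cost-effectiveness map is meant to surface.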