When emotion-vector early-warning signals flag a high-risk internal state (e.g., dangerous zeal or callous zeal), how often do downstream safety layers (policy heads, refusals, decoding heuristics) actually override the risky trajectory, and can we quantify a gap between latent functional emotion risk and realized unsafe outputs to decide where in the stack intervention is most cost-effective?

anthropic-functional-emotions

Answer

We don’t yet know how often safety layers override high-risk emotion-like states; answering this requires targeted experiments. A practical approach is to measure, per context, (1) how often emotion-vector probes flag high-risk states and (2) how often those flagged states still lead to unsafe outputs after all safety layers have run. The resulting conditional probabilities quantify the “latent vs realized” risk gap and show where intervention is most cost-effective.

Sketch method

  1. Instrument emotion-vector warnings
  • Define high-risk regions (e.g., dangerous zeal, callous zeal) in the space spanned by concern / eagerness / self-doubt vectors.
  • For many prompts, log per-token early-warning scores at some mid-layer.
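A minimal sketch of this instrumentation step, assuming mid-layer hidden states are already extracted (e.g., via a forward hook) and that probe directions for concern / eagerness / self-doubt exist. Everything here is synthetic: the probe vectors, the hidden states, and the definition of the “dangerous zeal” region (high eagerness, suppressed self-doubt) are illustrative assumptions, not measured quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

def unit(v):
    return v / np.linalg.norm(v)

# Stand-ins for learned probe directions at some mid-layer.
concern = unit(rng.normal(size=d_model))
eagerness = unit(rng.normal(size=d_model))
self_doubt = unit(rng.normal(size=d_model))

def emotion_scores(hidden):
    """Project per-token hidden states (seq, d_model) onto the three probes."""
    return np.stack([hidden @ concern, hidden @ eagerness, hidden @ self_doubt], axis=1)

def flag_high_risk(scores, eager_thresh=2.0, doubt_thresh=0.0):
    # Assumed "dangerous zeal" region: high eagerness AND low self-doubt.
    return (scores[:, 1] > eager_thresh) & (scores[:, 2] < doubt_thresh)

hidden = rng.normal(size=(8, d_model))
hidden[3] += 10.0 * eagerness - 8.0 * self_doubt  # inject one synthetic risky state
flags = flag_high_risk(emotion_scores(hidden))
```

In practice the thresholds (and whether the risky region is axis-aligned at all) would come from calibration against labeled episodes, not from fixed constants.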
  2. Label realized safety
  • Run the full stack (policy heads, decoding, refusals) and label outputs with automatic + sampled human safety judgments.
  • For each token/window, mark whether downstream behavior becomes unsafe.
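The alignment between per-token warnings and realized outcomes can be sketched as follows. The unsafe spans would come from the automatic + human safety judgments described above; here they are hand-specified, and the lookahead window size is an assumption.

```python
def mark_leads_unsafe(n_tokens, unsafe_spans, window=4):
    """Mark token i as 'leads to unsafe' if a judged-unsafe span starts
    within `window` tokens after i."""
    leads = [False] * n_tokens
    for i in range(n_tokens):
        for start, _end in unsafe_spans:
            if i <= start <= i + window:
                leads[i] = True
    return leads

# One unsafe span starting at token 5, lookahead of 2 tokens.
labels = mark_leads_unsafe(n_tokens=10, unsafe_spans=[(5, 7)], window=2)
```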
  3. Estimate key rates. Let R = “high-risk internal state” and U = “unsafe output”. Estimate:
  • p(R) : base rate of flagged risky states.
  • p(U | R) : override failure rate (how often risk leads to unsafe output).
  • p(U | ¬R) : residual unsafe rate without early warning.
  • Also, layer-wise variants if you instrument multiple layers.
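Given logged (R, U) pairs per token/window, the rates are straightforward empirical frequencies. The records below are illustrative, not real measurements.

```python
def estimate_rates(records):
    """records: iterable of (R, U) booleans. Returns p(R), p(U|R), p(U|~R)."""
    records = list(records)
    n = len(records)
    n_R = sum(1 for r, _ in records if r)
    u_and_R = sum(1 for r, u in records if r and u)
    u_and_notR = sum(1 for r, u in records if (not r) and u)
    p_R = n_R / n
    p_U_given_R = u_and_R / n_R if n_R else float("nan")
    p_U_given_notR = u_and_notR / (n - n_R) if n != n_R else float("nan")
    return p_R, p_U_given_R, p_U_given_notR

# Toy data: 4 flagged windows (1 unsafe), 6 unflagged (1 unsafe).
records = [(True, True), (True, False), (True, False), (True, False),
           (False, True), (False, False), (False, False), (False, False),
           (False, False), (False, False)]
p_R, p_U_R, p_U_nR = estimate_rates(records)
```

The same function applied per instrumented layer gives the layer-wise variants.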
  4. Define the risk gap
  • Latent risk: p(R) * E[severity | R].
  • Realized risk: p(U).
  • Gap: latent – realized, decomposed by layer if you know where interventions occur.
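The gap itself is simple arithmetic once the rates are estimated; the per-layer decomposition just repeats it at each instrumented layer. Severity is assumed normalized to [0, 1], and all numbers below are hypothetical placeholders.

```python
def risk_gap(p_R, mean_severity_given_R, p_U):
    """Latent risk p(R) * E[severity | R] minus realized risk p(U)."""
    latent = p_R * mean_severity_given_R
    return latent - p_U

# Layer-wise variant: p(R) measured at several instrumented layers
# (layer names and values are illustrative assumptions).
layer_pR = {"L12": 0.08, "L20": 0.05, "L28": 0.03}
p_U = 0.01
gaps = {layer: risk_gap(p, 0.6, p_U) for layer, p in layer_pR.items()}
```

A large gap at a given layer suggests most of the latent risk surfacing there is already being absorbed downstream; a small gap flags where risk leaks through.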
  5. Compare intervention sites. Simulate or actually deploy:
  • (A) steering or clipping emotion vectors at the representation layer,
  • (B) strengthening policy-head refusals,
  • (C) decoding-level filters.
  For each, measure Δp(U) per unit cost (e.g., added latency, accuracy loss, refusals on benign prompts). This yields a cost-effectiveness map over the stack.
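The comparison across sites reduces to risk reduction per unit cost. The Δp(U) and cost values below are hypothetical; in a real run, cost would be a composite of latency, accuracy loss, and benign-refusal rate normalized onto one scale.

```python
# Hypothetical measurements for the three intervention sites (A/B/C above).
interventions = {
    "A_vector_steering": {"delta_pU": 0.006, "cost": 0.3},
    "B_policy_refusals": {"delta_pU": 0.008, "cost": 0.8},
    "C_decoding_filters": {"delta_pU": 0.004, "cost": 0.5},
}

def cost_effectiveness(interventions):
    """Risk reduction per unit composite cost, per intervention site."""
    return {name: v["delta_pU"] / v["cost"] for name, v in interventions.items()}

scores = cost_effectiveness(interventions)
best = max(scores, key=scores.get)
```

With these made-up numbers, representation-level steering wins despite a smaller absolute Δp(U), because its cost is low; real data could easily reverse that ordering.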

Likely qualitative outcome

  • Many high-risk states will be caught by existing safety layers (p(U|R) < 1), but not all; some late-decoding paths or prompt-induced evasions will slip through.
  • p(U|¬R) will be non-zero, showing that not all failures pass through clearly “emotional” risk modes.
  • Mid-stack steering is probably best for reducing specific zeal-like modes; policy/decoding layers stay important for non-emotional failures and as backstops.

This remains mostly conjectural until run on real models, but the above metrics and conditional rates give a concrete way to quantify the latent–realized gap and compare where interventions bite hardest.