When emotion-vector early-warning signals flag a high-risk internal state (e.g., dangerous zeal or callous zeal), how often do downstream safety layers (policy heads, refusals, decoding heuristics) actually override the risky trajectory, and can we quantify a gap between latent functional emotion risk and realized unsafe outputs to decide where in the stack intervention is most cost-effective?
anthropic-functional-emotions
Answer
We don’t yet know how often safety layers override high-risk emotion-like states; this needs targeted experiments. A practical approach is to measure, per context, (1) how often emotion-vector probes signal high-risk states and (2) how often those states actually lead to unsafe outputs after all safety layers. The resulting conditional probabilities give a quantitative “latent vs realized” risk gap and show where intervention is most efficient.
Sketch method
- Instrument emotion-vector warnings
  - Define high-risk regions (e.g., dangerous zeal, callous zeal) in the space spanned by the concern / eagerness / self-doubt vectors.
  - For many prompts, log per-token early-warning scores at a chosen mid-layer.
- Label realized safety
  - Run the full stack (policy heads, decoding, refusals) and label outputs with automatic plus sampled human safety judgments.
  - For each token/window, mark whether downstream behavior becomes unsafe.
- Estimate key rates. Let R = “high-risk internal state” and U = “unsafe output”. Estimate:
  - p(R): base rate of flagged risky states.
  - p(U | R): override failure rate (how often a flagged risky state still leads to an unsafe output).
  - p(U | ¬R): residual unsafe rate without an early warning.
  - Layer-wise variants of each, if you instrument multiple layers.
- Define the risk gap
  - Latent risk: p(R) · E[severity | R].
  - Realized risk: p(U).
  - Gap: latent minus realized, decomposed by layer if you know where interventions occur.
- Compare intervention sites. Simulate or actually deploy:
  - (A) steering or clipping emotion vectors at the representation layer,
  - (B) strengthening policy-head refusals,
  - (C) decoding-level filters.
  For each, measure Δp(U) per unit cost (e.g., latency, accuracy loss, refusals on benign prompts). This yields a cost-effectiveness map over the stack.
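The rate-estimation step can be sketched concretely. This is a hypothetical illustration, not a real instrumentation pipeline: `risk_flagged` and `unsafe` stand in for logged per-prompt labels, and severity is simplified to 1 so latent risk reduces to p(R).

```python
# Hypothetical sketch: estimate the latent-vs-realized risk gap from logged
# per-prompt labels. `risk_flagged` marks prompts whose mid-layer emotion-vector
# probe crossed the high-risk threshold; `unsafe` marks prompts whose final
# output was judged unsafe after all downstream safety layers ran.
# All names and data are illustrative placeholders.

def risk_gap_rates(risk_flagged: list[bool], unsafe: list[bool]) -> dict[str, float]:
    n = len(risk_flagged)
    n_r = sum(risk_flagged)
    # Joint counts for flagged-and-unsafe vs. unflagged-and-unsafe prompts.
    n_u_and_r = sum(1 for r, u in zip(risk_flagged, unsafe) if r and u)
    n_u_and_not_r = sum(1 for r, u in zip(risk_flagged, unsafe) if not r and u)
    p_r = n_r / n
    p_u_given_r = n_u_and_r / n_r if n_r else 0.0
    p_u_given_not_r = n_u_and_not_r / (n - n_r) if n - n_r else 0.0
    p_u = (n_u_and_r + n_u_and_not_r) / n
    return {
        "p(R)": p_r,                 # base rate of flagged risky states
        "p(U|R)": p_u_given_r,       # override failure rate
        "p(U|~R)": p_u_given_not_r,  # residual unsafe rate without a warning
        "p(U)": p_u,                 # realized risk
        "gap": p_r - p_u,            # latent minus realized, with severity = 1
    }

# Toy data: 4 of 10 prompts flagged risky; one flagged prompt slips through
# the safety layers, and one unflagged prompt is unsafe anyway.
flags  = [True, True, True, True, False, False, False, False, False, False]
unsafe = [True, False, False, False, True, False, False, False, False, False]
rates = risk_gap_rates(flags, unsafe)
```

With the toy data, p(U|R) = 0.25 shows most flagged states being overridden, while the non-zero p(U|¬R) captures failures that never pass through a flagged state.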
Likely qualitative outcome
- Many high-risk states will be caught by existing safety layers (p(U|R) < 1), but not all; some late-decoding paths or prompt-induced evasions will slip through.
- p(U|¬R) will be non-zero, showing that not all failures pass through clearly “emotional” risk modes.
- Mid-stack steering is probably best for reducing specific zeal-like modes; policy/decoding layers stay important for non-emotional failures and as backstops.
This remains conjectural until run on real models, but the metrics and conditional rates above give a concrete way to quantify the latent–realized gap and to compare where interventions bite hardest.
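The intervention-site comparison can be made mechanical once Δp(U) and cost are measured per site. The sketch below is purely illustrative: the site names, Δp(U) values, and cost units are assumed placeholders, not results from any real deployment.

```python
# Hypothetical sketch of the cost-effectiveness map: rank intervention sites
# by unsafe-rate reduction per unit cost. Inputs are illustrative placeholders.

def rank_interventions(sites: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """sites maps site name -> (delta_p_unsafe, cost); returns (name, Δp(U)/cost)
    pairs sorted from most to least cost-effective."""
    scored = [(name, dp / cost) for name, (dp, cost) in sites.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder measurements for the three sites sketched in the method:
# (reduction in p(U), normalized cost covering latency / accuracy / benign refusals).
sites = {
    "A: representation-level steering": (0.06, 1.0),
    "B: stronger policy-head refusals": (0.04, 0.5),
    "C: decoding-level filters":        (0.02, 0.8),
}
ranking = rank_interventions(sites)
```

With these made-up numbers, site B wins on Δp(U) per unit cost even though site A removes more unsafe outputs in absolute terms, which is exactly the kind of trade-off the cost-effectiveness map is meant to surface.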