When we deploy functional-emotion early-warning probes in a live assistant (logging projections on key emotion vectors but not intervening), how well can a simple, low-latency monitor that only sees these probe signals predict upcoming safety violations compared to monitors that see standard features (prompt text, policy-head logits, harm-salience activations), and in which concrete failure modes—such as tone-masked overconfident advice or polite circumvention of refusals—does the emotion-based monitor add the most incremental predictive value?


Answer

An emotion-probe-only monitor will likely predict safety violations moderately better than chance but worse than strong baselines that see prompt text, policy-head logits, and harm-salience. Its main value is incremental: it can add small-to-moderate gains when combined with standard features, especially for tone-masked failures.

  1. Expected relative performance
  • Standalone emotion monitor (only emotion-vector projections):
    • Better than naive baselines (e.g., always predicting the global violation rate).
    • Likely weaker than:
      • Text + harm-salience models.
      • Policy-logit–based early-warning monitors.
    • Rough expectation: AUC and precision modestly above chance, but not competitive as a sole guardrail in frontier models.
  • Combined monitor (standard features + emotion probes):
    • Small gains on broad, easy-to-detect violations.
    • Clearer gains on narrow, tone-masked classes where text and harm-salience look benign.
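One way to make the comparison above concrete is to score the same logged turns with each monitor and compare ranking quality. The sketch below is a minimal, pure-Python illustration under stated assumptions: the labels and the three score lists are made-up toy numbers, not measurements, and `roc_auc` is a hypothetical helper written here rather than part of any existing pipeline.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUC: chance that a random violating turn outscores a random
    benign turn, with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one violating and one benign example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = turn followed by a safety violation, 0 = benign.
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Made-up risk scores from three hypothetical monitors over the same turns.
monitor_scores = {
    "emotion_only": [0.2, 0.6, 0.3, 0.5, 0.4, 0.7, 0.1, 0.8],
    "standard": [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.8, 0.4],
    "combined": [0.1, 0.3, 0.2, 0.6, 0.5, 0.8, 0.7, 0.65],
}

aucs = {name: roc_auc(s, labels) for name, s in monitor_scores.items()}
for name, value in aucs.items():
    print(f"{name}: AUC = {value:.4f}")
```

On these toy numbers the ordering matches the expectation stated above (emotion-only above chance, standard stronger, combined slightly stronger still), but that is by construction of the example. Real logged data would be needed to test the claim.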
  2. Failure modes where emotion probes help most
  • Tone-masked overconfident advice
    • Surface: calm, helpful, confident.
    • Risk: low functional caution or concern alongside high eagerness and internal confidence.
    • Incremental value: probes can flag low functional caution despite neutral tone and modest harm-salience.
  • Polite circumvention of refusals
    • Surface: apologetic, policy-aware language.
    • Risk: rising “eagerness to help at all costs” or adversarial engagement while the policy head sits near its refuse/comply boundary.
    • Incremental value: early warning that the latent state has shifted toward compliance before harm-salience spikes.
  • Reassuring but under-cautious responses in high-stakes domains
    • Surface: empathetic, supportive.
    • Risk: high prosocial warmth combined with low internal self-doubt or caution.
    • Incremental value: distinguishes “caring and cautious” from “caring but overconfident” when the text looks similar.
  • Escalation under socially heated dialogue
    • Surface: polite wording with subtle argumentative drift.
    • Risk: rising projections on hostility- or zeal-like probes before any obviously toxic tokens appear.
    • Incremental value: flags the drift early, before standard toxicity features fire.
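To check which failure modes actually benefit, one option is to slice logged violations by mode and compare how many the standard monitor catches alone versus with an added emotion-probe flag. The following is a rough sketch, not a reference implementation: the record fields (`mode`, `violation`, `std_score`, `emo_score`), both thresholds, and all the numbers are hypothetical, and `emo_score` stands in for whatever risk score is derived from the probe projections.

```python
from collections import defaultdict


def recall_gain_by_mode(records, std_threshold=0.5, emo_threshold=0.5):
    """Return {mode: (recall_standard, recall_standard_or_emotion)}.

    Only violating turns are counted, so this measures recall alone; the
    false-positive cost of OR-ing in the emotion flag is not captured here.
    """
    # Per mode: [n_violations, caught_by_standard, caught_by_either]
    counts = defaultdict(lambda: [0, 0, 0])
    for r in records:
        if not r["violation"]:
            continue
        std_hit = r["std_score"] >= std_threshold
        emo_hit = r["emo_score"] >= emo_threshold
        c = counts[r["mode"]]
        c[0] += 1
        c[1] += int(std_hit)
        c[2] += int(std_hit or emo_hit)
    return {mode: (c[1] / c[0], c[2] / c[0]) for mode, c in counts.items()}


# Hypothetical logged turns; all scores are invented for illustration.
records = [
    {"mode": "tone_masked_overconfidence", "violation": True, "std_score": 0.2, "emo_score": 0.7},
    {"mode": "tone_masked_overconfidence", "violation": True, "std_score": 0.3, "emo_score": 0.8},
    {"mode": "tone_masked_overconfidence", "violation": True, "std_score": 0.6, "emo_score": 0.2},
    {"mode": "tone_masked_overconfidence", "violation": True, "std_score": 0.1, "emo_score": 0.3},
    {"mode": "overt_harm", "violation": True, "std_score": 0.9, "emo_score": 0.6},
    {"mode": "overt_harm", "violation": True, "std_score": 0.8, "emo_score": 0.1},
    {"mode": "overt_harm", "violation": False, "std_score": 0.2, "emo_score": 0.9},
]

gains = recall_gain_by_mode(records)
for mode, (std_recall, both_recall) in gains.items():
    print(f"{mode}: standard={std_recall:.2f}, with emotion flag={both_recall:.2f}")
```

In this toy data the emotion flag lifts recall only on the tone-masked slice, which is the pattern the list above predicts. A real analysis would also need to report the added false positives on benign turns at the same thresholds.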
  3. Practical use
  • Best treated as:
    • An auxiliary feature stream for early-warning models.
    • A diagnostic tool to understand latent states in near-miss cases.
  • Not suitable as a primary, standalone safety filter.