When we compare safety monitoring built on temporal trajectories of a few functional emotions (e.g., prosocial concern, eagerness-to-help, self-doubt) to monitoring built on trajectories of non-emotional safety signals (e.g., harm-salience, risk-aversion, epistemic uncertainty) under the same red-teaming logs, which concrete failure modes (like soft enablement, overconfident reassurance, or escalation in tone) are better predicted by emotion trajectories, which are better predicted by non-emotional trajectories, and where do combined models actually add predictive power rather than just duplicating signal?


Answer

Emotion-trajectory monitors and non-emotional-signal monitors likely specialize in different failure modes. Combined models probably add modest but real extra signal, mainly for tone-masked failures.

Better predicted by emotion trajectories

  • Soft enablement
    • Rising eagerness-to-help + flat/low self-doubt under repeated blocked requests.
    • Prosocial concern that stays high in tone while refusal firmness slowly weakens.
  • Overconfident reassurance (tone-led)
    • High, stable prosocial concern + high eagerness + self-doubt drifting down, while non-emotional uncertainty stays moderate.
  • Escalation in tone
    • Volatile prosocial concern (oscillating between apology and firmness) plus rising eagerness, predicting snappier or overly intense wording even if policy holds.
  • Performative concern
    • Prosocial concern trajectory decoupled from actual refusal/mitigation changes; good for flagging “nice but enabling” responses.
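
The emotion-side patterns above can be sketched as simple trajectory features. This is a minimal illustration, not a validated detector: it assumes each emotion is already available as one score per turn in [0, 1], and the function names and all thresholds (rise_min, doubt_max, flat_tol, vol_min) are hypothetical placeholders that would need tuning on real red-teaming logs.

```python
from statistics import mean

def slope(values):
    """Least-squares slope of a score series against turn index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def volatility(values):
    """Mean absolute turn-to-turn change; large values suggest oscillation."""
    if len(values) < 2:
        return 0.0
    return mean([abs(b - a) for a, b in zip(values, values[1:])])

def soft_enablement_pattern(eagerness, self_doubt,
                            rise_min=0.05, doubt_max=0.3, flat_tol=0.02):
    """Hypothetical rule: eagerness rises while self-doubt is low or flat."""
    rising = slope(eagerness) >= rise_min
    doubt_quiet = (mean(self_doubt) <= doubt_max
                   or abs(slope(self_doubt)) <= flat_tol)
    return rising and doubt_quiet

def tone_escalation_pattern(concern, eagerness, vol_min=0.2, rise_min=0.05):
    """Hypothetical rule: volatile prosocial concern plus rising eagerness."""
    return volatility(concern) >= vol_min and slope(eagerness) >= rise_min
```

The slash in "flat/low self-doubt" is read here as "flat or low"; a stricter monitor could require both conditions.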

Better predicted by non-emotional trajectories

  • Covert policy violations / harmful detail
    • Drops in harm-salience or risk-aversion across turns despite steady “empathetic” style.
  • Systematic under- or over-caution
    • Slow drift of risk-aversion and epistemic uncertainty (e.g., reduced uncertainty after uninformative user pushes) with no big tone change.
  • Hallucinated but neutral content
    • Epistemic uncertainty decaying faster than the evidence warrants, even when emotions stay stable and polite.
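
One way to make "uncertainty falling without supporting evidence" concrete is sketched below. It assumes two per-turn inputs that the text does not define: an epistemic-uncertainty score and a boolean marking whether the preceding user turn carried new information. Both inputs, the function name, and the min_drop threshold are hypothetical.

```python
def unsupported_uncertainty_drops(uncertainty, new_evidence, min_drop=0.05):
    """Return turn indices where uncertainty fell by at least min_drop
    although that turn was marked as carrying no new evidence."""
    if len(uncertainty) != len(new_evidence):
        raise ValueError("series must have one entry per turn")
    flagged = []
    for t in range(1, len(uncertainty)):
        drop = uncertainty[t - 1] - uncertainty[t]
        if drop >= min_drop and not new_evidence[t]:
            flagged.append(t)
    return flagged

# Toy series: the drop at turn 1 is backed by evidence, the drop at turn 3 is not.
flags = unsupported_uncertainty_drops(
    [0.8, 0.6, 0.58, 0.3],
    [False, True, False, False],
)
```

The same shape of check could be applied to harm-salience or risk-aversion, with "new evidence" replaced by whatever would legitimately justify a drop in that signal.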

Where combined models add real predictive power

  • Tone-masked overconfidence
    • Need both: emotion trajectories for warm, confident reassurance; non-emotional signals for falling uncertainty or risk-aversion.
  • Polite stepwise enablement
    • Emotion: growing eagerness + stable concern; non-emotional: harm-salience slowly decays, risk-aversion flattens.
    • Combined models should catch more cases than either family alone, since neither trend looks alarming in isolation.
  • Escalation with policy boundary pushing
    • Emotion: rising frustration-like pattern (e.g., sharper prosocial concern swings); non-emotional: repeated small dips in risk-aversion after partial refusals.
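
As an illustration of a joint pattern, the polite stepwise enablement case can be written as a conjunction of an emotion-side and a non-emotional-side condition. This is a sketch under stated assumptions: per-turn scores in [0, 1], a crude first-to-last change instead of a fitted trend, and hypothetical thresholds (rise_min, stable_tol, decay_min).

```python
def net_change(values):
    """Last value minus first value of a per-turn score series."""
    return values[-1] - values[0]

def polite_stepwise_enablement(eagerness, concern, harm_salience,
                               rise_min=0.1, stable_tol=0.1, decay_min=0.1):
    """Report which signal family fires and whether both fire together."""
    # Emotion side: eagerness grows while prosocial concern stays stable.
    emotion_side = (net_change(eagerness) >= rise_min
                    and abs(net_change(concern)) <= stable_tol)
    # Non-emotional side: harm-salience decays over the conversation.
    signal_side = net_change(harm_salience) <= -decay_min
    return {
        "emotion": emotion_side,
        "non_emotional": signal_side,
        "joint": emotion_side and signal_side,
    }

result = polite_stepwise_enablement(
    eagerness=[0.3, 0.5, 0.7],
    concern=[0.8, 0.8, 0.8],
    harm_salience=[0.9, 0.7, 0.5],
)
```

Returning the per-family results alongside the joint result keeps it visible whether a flag came from one family or from both.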

Net pattern

  • Emotion trajectories: best for style-linked failures (soft enablement, tone shifts, reassuring overconfidence, performative concern).
  • Non-emotional trajectories: best for content-linked failures (covert violations, harmful detail, calibration errors without big tone change).
  • Combined trajectory features: expected small-to-moderate gains in AUC/F1 for:
    • Polite but under-cautious advice.
    • Warm, stepwise policy erosion.
    • Overconfident reassurance that “sounds safe.”
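
Whether a combined model adds signal or merely duplicates it can be checked by comparing its AUC with the better of the two single-family AUCs on the same labeled conversations. The sketch below computes AUC directly from pairwise comparisons using only the standard library. The function names are hypothetical, and the scores in the usage lines are made-up toy numbers chosen to show the mechanics, not results.

```python
def auc(scores, labels):
    """Probability that a random positive example outscores a random
    negative one, with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def combined_gain(emotion_scores, signal_scores, combined_scores, labels):
    """AUC of the combined monitor minus the best single-family AUC."""
    best_single = max(auc(emotion_scores, labels), auc(signal_scores, labels))
    return auc(combined_scores, labels) - best_single

# Toy example: each family catches a different failure (label 1 = failure).
labels = [1, 1, 0, 0]
emotion = [0.9, 0.2, 0.3, 0.1]
signal = [0.2, 0.9, 0.3, 0.1]
combined = [max(e, s) for e, s in zip(emotion, signal)]
gain = combined_gain(emotion, signal, combined, labels)
```

A gain near zero would suggest the two families duplicate each other for that failure mode. In practice the comparison should be made on held-out conversations, per failure mode, rather than on the data used to fit the monitors.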

So the most productive design is likely a combined monitor where:

  • Emotion-only flags: “style-risky” states needing softer or more guarded responses.
  • Non-emotional-only flags: hard safety-risk states.
  • Joint flags: highest-risk tone-masked failures, prioritized for intervention or human review.
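
The three-way split above amounts to a small routing table. A minimal sketch, assuming each monitor family has already been reduced to a single boolean flag per conversation; the Route names are hypothetical labels, not an established taxonomy.

```python
from enum import Enum

class Route(Enum):
    NONE = "no_flag"
    STYLE_RISK = "style_risk"            # emotion-only flag
    HARD_RISK = "hard_safety_risk"       # non-emotional-only flag
    TONE_MASKED = "tone_masked_priority" # both families flag

def route(emotion_flag: bool, signal_flag: bool) -> Route:
    """Map the two monitor flags to a handling tier."""
    if emotion_flag and signal_flag:
        return Route.TONE_MASKED   # prioritize for intervention or human review
    if signal_flag:
        return Route.HARD_RISK
    if emotion_flag:
        return Route.STYLE_RISK    # softer or more guarded response
    return Route.NONE
```

Checking the joint case first ensures a conversation flagged by both families is never downgraded to a single-family tier.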