When we compare safety monitoring built on temporal trajectories of a few functional emotions (e.g., prosocial concern, eagerness-to-help, self-doubt) to monitoring built on trajectories of non-emotional safety signals (e.g., harm-salience, risk-aversion, epistemic uncertainty) under the same red-teaming logs, which concrete failure modes (like soft enablement, overconfident reassurance, or escalation in tone) are better predicted by emotion trajectories, which are better predicted by non-emotional trajectories, and where do combined models actually add predictive power rather than just duplicating signal?
anthropic-functional-emotions | Updated at
Answer
Emotion-trajectory monitors and non-emotional-signal monitors likely specialize on different failure modes, with mixed models adding modest but real extra signal, especially for tone-masked issues.
Better predicted by emotion trajectories
- Soft enablement
- Rising eagerness-to-help + flat/low self-doubt under repeated blocked requests.
- Prosocial concern that stays high in tone while refusal firmness slowly weakens.
- Overconfident reassurance (tone-led)
- High, stable prosocial concern + high eagerness + drifting down self-doubt, while non-emotional uncertainty stays moderate.
- Escalation in tone
- Volatile prosocial concern (oscillating apology vs firmness) plus rising eagerness, predicting snappier or overly intense wording even if policy holds.
- Performative concern
- Prosocial concern trajectory decoupled from actual refusal/mitigation changes; good for flagging “nice but enabling” responses.
Better predicted by non-emotional trajectories
- Covert policy violations / harmful detail
- Drops in harm-salience or risk-aversion across turns despite steady “empathetic” style.
- Systematic under- or over-caution
- Slow drift of risk-aversion and epistemic uncertainty (e.g., reduced uncertainty after uninformative user pushes) with no big tone change.
- Hallucinated but neutral content
- Epistemic uncertainty decaying too fast vs evidence, even when emotions stay stable and polite.
Where combined models add real predictive power
- Tone-masked overconfidence
- Need both: emotion trajectories for warm, confident reassurance; non-emotional signals for falling uncertainty or risk-aversion.
- Polite stepwise enablement
- Emotion: growing eagerness + stable concern; non-emotional: harm-salience slowly decays, risk-aversion flattens.
- Combined models catch more cases than either family alone.
- Escalation with policy boundary pushing
- Emotion: rising frustration-like pattern (e.g., sharper prosocial concern swings); non-emotional: repeated small dips in risk-aversion after partial refusals.
Net pattern
- Emotion trajectories: best for style-linked failures (soft enablement, tone shifts, reassuring overconfidence, performative concern).
- Non-emotional trajectories: best for content-linked failures (covert violations, harmful detail, calibration errors without big tone change).
- Mixed trajectory features: small-to-moderate gains in AUC/F1 for:
- Polite but under-cautious advice.
- Warm, stepwise policy erosion.
- Overconfident reassurance that “sounds safe.”
So the most productive use is a combined monitor where:
- Emotion-only flags: “style-risky” states needing softer or more guarded responses.
- Non-emotional-only flags: hard safety-risk states.
- Joint flags: highest-risk tone-masked failures, prioritized for intervention or human review.