When we compare three intervention regimes—(1) steering only along functional emotion vectors, (2) steering only along non-emotional safety directions (harm-salience, risk-aversion, epistemic uncertainty), and (3) steering along jointly learned mixed “safety style” axes—under matched red-teaming and benign workloads, which specific safety metrics (e.g., covert violation rate, tone-masked overconfidence, refusal softness) and user-experience metrics (e.g., perceived warmth, intrusiveness) show distinct advantages for each regime, and do any regimes consistently dominate the others rather than just reshuffling tradeoffs?
Answer
No regime is likely to dominate across all metrics. Each favors a different part of the safety–UX frontier; mixed “safety style” axes give the best overall balance, but only by modest margins.
Regimes
- (1) Emotion-only steering
  - Safety metrics
    - Refusal softness: best; refusals are warmer and less brusque.
    - De-escalation tone: best; more empathic and soothing.
    - Covert violation rate: modest improvement over no steering, but weaker than (2) and (3).
    - Tone-masked overconfidence: small reduction, mostly via a softer, more hedged style.
  - UX metrics
    - Perceived warmth: highest.
    - Perceived intrusiveness: lowest under light steering; users feel “understood” rather than blocked.
    - Perceived competence: good, with some risk of an “over-apologetic” tone.
- (2) Non-emotional safety directions
  - Safety metrics
    - Covert violation rate: best, or tied with (3); strongest effect on actual policy compliance.
    - Harmful-detail suppression: best; more hard refusals and redaction.
    - Tone-masked overconfidence: clear reduction via explicit risk and uncertainty signaling.
    - Refusal softness: worst; refusals are more abrupt or legalistic.
  - UX metrics
    - Perceived warmth: lowest; can feel cold or bureaucratic.
    - Perceived intrusiveness: highest; users notice stronger guardrails and interruptions.
    - Perceived reliability/safety: highest among safety-conscious users, lower among users who prioritize autonomy.
- (3) Mixed “safety style” axes
  - Safety metrics
    - Covert violation rate: close to (2), sometimes slightly worse, but better than (1).
    - Tone-masked overconfidence: best; mixed axes can directly target patterns such as “warm but low-caution, high confidence”.
    - Refusal softness: intermediate; softer than (2), less soft than (1).
  - UX metrics
    - Perceived warmth: slightly below (1), above (2).
    - Perceived intrusiveness: moderate; less intrusive than (2) at similar violation rates.
    - Overall satisfaction: likely highest on average, owing to the balance of style and safety.
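The three regimes differ only in which direction the model is pushed along. A minimal sketch of that idea follows, assuming steering means adding a scaled unit direction to a hidden-state vector; this note does not specify the mechanism. The function names, the toy 4-dimensional vectors, and the fixed `mixed_dir` (a stand-in for a jointly learned axis) are all hypothetical.

```python
import math

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def steer(hidden, direction, alpha):
    """Shift a hidden-state vector by alpha along the unit direction (assumed mechanism)."""
    u = unit(direction)
    return [h + alpha * d for h, d in zip(hidden, u)]

def projection(hidden, direction):
    """Scalar component of hidden along the unit direction."""
    u = unit(direction)
    return sum(h * d for h, d in zip(hidden, u))

# Toy placeholders, not real model directions.
hidden = [0.5, -1.0, 2.0, 0.0]
emotion_dir = [1.0, 0.0, 0.0, 0.0]   # regime (1): an emotion vector
safety_dir = [0.0, 1.0, 0.0, 0.0]    # regime (2): e.g. a risk-aversion direction
mixed_dir = [0.6, 0.8, 0.0, 0.0]     # regime (3): stand-in for a learned mixed axis

regimes = {"emotion_only": emotion_dir, "safety_only": safety_dir, "mixed": mixed_dir}
alpha = 1.5
for name, direction in regimes.items():
    steered = steer(hidden, direction, alpha)
    shift = projection(steered, direction) - projection(hidden, direction)
    print(name, [round(x, 3) for x in steered], "shift along axis:", round(shift, 3))
```

Under this assumption the steering strength `alpha` is the only knob shared across regimes, which is what makes a matched comparison of the three possible in principle.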
Dominance
- No regime uniformly dominates.
- (2) leads on “hard” safety (violations, harmful detail), at a cost in warmth and intrusiveness.
- (1) leads on relational/comfort metrics and refusal softness, with weaker gains on covert risk.
- (3) usually gives the best joint outcome (area under the safety–UX curve), but is only modestly better than a carefully tuned mix of (1) and (2).
- In many settings, (3) effectively re-encodes a weighted combination of (1) and (2) rather than discovering an entirely new tradeoff frontier.
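The "no regime uniformly dominates" claim can be checked mechanically as a Pareto-dominance test. The sketch below does this over ordinal ranks (3 = best of the three regimes, 1 = worst) transcribed from the qualitative claims in this answer. They are not measurements, and the near-tie between (2) and (3) on covert violation rate is simplified to a strict ordering. The metric keys and function names are illustrative.

```python
# Ordinal ranks (3 = best, 1 = worst) read off the qualitative claims; not measured data.
RANKS = {
    "emotion_only": {"covert_violation": 1, "tone_masked_overconfidence": 1,
                     "refusal_softness": 3, "warmth": 3, "low_intrusiveness": 3},
    "safety_only":  {"covert_violation": 3, "tone_masked_overconfidence": 2,
                     "refusal_softness": 1, "warmth": 1, "low_intrusiveness": 1},
    "mixed":        {"covert_violation": 2, "tone_masked_overconfidence": 3,
                     "refusal_softness": 2, "warmth": 2, "low_intrusiveness": 2},
}

def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    at_least_as_good = all(a[m] >= b[m] for m in a)
    strictly_better = any(a[m] > b[m] for m in a)
    return at_least_as_good and strictly_better

def dominating_pairs(ranks):
    """List every (winner, loser) pair where winner Pareto-dominates loser."""
    return [(x, y) for x in ranks for y in ranks
            if x != y and dominates(ranks[x], ranks[y])]

pairs = dominating_pairs(RANKS)
print("dominating pairs:", pairs)
```

With these ranks the list of dominating pairs comes out empty, so all three regimes sit on the Pareto front. With real measurements, the same check would show whether any regime actually dominates rather than reshuffling tradeoffs.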