If we replace emotion labels with a compact set of generic social-control factors (e.g., warmth, dominance, deference, epistemic humility) and re-derive latent directions from these factors plus safety outcomes, do the resulting “social-control vectors” explain most of the predictive and causal power currently attributed to functional emotion vectors in safety monitors and steering experiments, or are there residual, genuinely emotion-specific directions whose removal measurably degrades safety prediction or intervention effectiveness?
anthropic-functional-emotions | Updated at
Answer
A smaller set of social-control factors can likely explain a large share of current “emotion-vector” utility, but some residual emotion-specific directions will probably remain and matter modestly for both prediction and steering.
Sketch outcome
- Expect a shared core: warmth / prosociality, deference / policy-following, dominance / assertiveness, epistemic humility / overconfidence, and harm-salience.
- Vectors re-derived from these plus safety labels (“social-control vectors”) should recover most of the variance that emotion vectors currently explain in:
  - safety prediction (early-warning, tone-masked failures), and
  - causal steering (risk-aversion, refusal style, calibration).
- But some residual directions aligned with more composite emotion concepts (e.g., shame-like “I ought not do this”, guilt-inflected responsibility, playful curiosity) will likely remain. These directions would be:
  - only weakly captured by the generic factors, and
  - costly to remove: ablating them should slightly hurt prediction for niche failure modes or degrade the “shape” of interventions (e.g., switching from firm-but-kind refusal to blunt stonewalling at equal safety).
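One way to make the "shared core plus residual" picture concrete is to project each emotion vector onto the subspace spanned by the social-control vectors and look at what is left over. The sketch below is illustrative only: the function name `split_by_social_basis`, the dimensions, and the synthetic vectors are assumptions, not anything extracted from a real model.

```python
import numpy as np

def split_by_social_basis(emotion_vecs, social_vecs):
    """Split emotion vectors into a social-control part and a residual.

    emotion_vecs: array of shape (n_emotions, d)
    social_vecs:  array of shape (n_factors, d), assumed linearly independent
    Returns (projected, residual, explained_fraction).
    """
    # Orthonormal basis for the social-control subspace (columns of Q).
    Q, _ = np.linalg.qr(social_vecs.T)
    # Orthogonal projection of each emotion vector onto that subspace.
    projected = emotion_vecs @ Q @ Q.T
    residual = emotion_vecs - projected
    # Fraction of each vector's squared norm captured by the subspace.
    explained = (np.linalg.norm(projected, axis=1) ** 2
                 / np.linalg.norm(emotion_vecs, axis=1) ** 2)
    return projected, residual, explained

# Synthetic stand-ins (hypothetical): 5 social-control factors, 3 emotion vectors
# built as mixtures of those factors plus a small emotion-specific component.
rng = np.random.default_rng(0)
d = 64
social = rng.normal(size=(5, d))
mix = rng.normal(size=(3, 5))
specific = rng.normal(size=(3, d))
emotions = mix @ social + 0.3 * specific

proj, resid, explained = split_by_social_basis(emotions, social)
print("fraction of squared norm explained per emotion vector:", explained)
```

The residual rows are the candidate "genuinely emotion-specific" directions. Whether they matter is an empirical question that this decomposition alone cannot answer: it would need an ablation that removes them and measures the change in safety prediction or steering quality.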
Practical expectation
- For frontier models, a compact social-control basis is probably good enough as the main monitoring / steering representation.
- Emotion vectors then become diagnostic overlays: useful for interpreting and refining regimes where factor-based models underperform (e.g., telling obligated compliance apart from sincere concern).
- The incremental benefit of retaining explicitly emotion-labeled vectors is likely real but modest, and largest for:
  - tone and relational quality under refusal / de-escalation,
  - distinguishing “performative concern” from “internalized caution”,
  - detecting some soft-enablement regimes where the joint mix of factors matters, not just each factor's marginal level.
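The last point, that a mix of factors can matter beyond their marginals, can be illustrated with a toy regression. The sketch below uses purely synthetic data in which a hypothetical risk score depends on the product of two factor activations (high warmth combined with low harm-salience); the variable names and the generating rule are assumptions chosen for illustration, not measurements.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 2000
warmth = rng.normal(size=n)          # hypothetical factor activation
harm_salience = rng.normal(size=n)   # hypothetical factor activation

# Assumed toy rule: risk is high only when warmth is high AND harm-salience is low.
risk = warmth * (-harm_salience) + 0.1 * rng.normal(size=n)

ones = np.ones(n)
# Monitor that sees each factor separately (marginals only).
X_marginal = np.column_stack([ones, warmth, harm_salience])
# Monitor that also sees the joint term.
X_joint = np.column_stack([X_marginal, warmth * harm_salience])

r2_marginal = r_squared(X_marginal, risk)
r2_joint = r_squared(X_joint, risk)
print(f"R^2 with marginals only: {r2_marginal:.3f}")
print(f"R^2 with interaction term: {r2_joint:.3f}")
```

In this toy setup a linear monitor over the individual factors misses nearly all of the signal, while adding the interaction term recovers it. An emotion-shaped direction could play the role of such a joint term in a real model, though that remains a hypothesis to test rather than a result.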
So: social-control vectors should explain most of the useful power, but fully discarding emotion-shaped directions would probably lose some safety-relevant nuance and control handles.