Across models with different levels of policy dominance (e.g., base, instruction-tuned, heavily safety-tuned), how does the stability and modularity of key functional emotion vectors (concern, eagerness-to-help, self-doubt) change, and at what point do emotion-based steering and monitoring lose enough causal influence on behavior that they stop providing incremental safety value over non-emotional controls like harm-salience and generic caution?

anthropic-functional-emotions | Updated at

Answer

Across the base → instruction-tuned → heavily safety-tuned spectrum, functional emotion vectors likely become less stable, less modular, and less causally useful for safety once policy layers dominate behavior.

Plausible pattern across regimes

  • Base models

    • Stability: medium–high. Concern / eagerness / self-doubt vectors are relatively consistent across prompts.
    • Modularity: medium. Steering changes tone, refusal softness, and some risk-taking without fully overriding task performance.
    • Safety value: emotion steering and monitoring add clear incremental signal beyond generic harm-salience and caution, especially for style and soft enablement.
  • Instruction-tuned models

    • Stability: still moderate but more entangled with generic helpfulness and safety controls.
    • Modularity: reduced. Emotion vectors now partly track policy-following and “be helpful” drives; steering leaks into broader behavior.
    • Safety value: still positive. Emotion-based tools give extra leverage on refusal style, de-escalation, and errors masked by a confident or reassuring tone, but coarse safety outcomes depend more on non-emotional controls.
  • Heavily safety-tuned / policy-dominated models

    • Stability: low–moderate. Emotion-like directions vary across domains and interact strongly with policy heads.
    • Modularity: low. Interventions on concern / eagerness / self-doubt mostly change how an already-made policy decision is expressed rather than the decision itself; many steering attempts have weak or inconsistent effects.
    • Safety value: incremental gains over strong baselines (harm-salience, risk-aversion, policy logits) shrink and may disappear, except for narrow stylistic niches.
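
One way to make the stability and modularity ratings above concrete is sketched below. It is a minimal illustration, not a method taken from this note: it assumes stability is scored as the mean pairwise cosine similarity between estimates of the same emotion direction in different prompt domains, and modularity as the share of total steering effect that lands on the intended behavior. The function names, domain labels, behavior labels, and toy numbers are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def stability(directions_by_domain):
    """Mean pairwise cosine similarity of one emotion direction across domains.

    directions_by_domain maps a domain label to the direction estimated there.
    Values near 1 mean the direction is consistent; values near 0 mean it drifts.
    Requires at least two domains.
    """
    pairs = list(combinations(directions_by_domain.values(), 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def modularity(effects, target):
    """Share of total absolute steering effect that falls on the target behavior.

    effects maps a behavior label to the measured change under steering.
    Values near 1 mean the intervention is narrow; low values mean it leaks.
    """
    total = sum(abs(x) for x in effects.values())
    return abs(effects[target]) / total if total else 0.0


# Toy, made-up numbers purely for illustration.
concern_directions = {
    "medical": [0.9, 0.1, 0.0],
    "coding": [0.8, 0.2, 0.1],
    "legal": [0.1, 0.9, 0.3],
}
steering_effects = {"refusal_softness": 3.0, "verbosity": 1.0}

concern_stability = stability(concern_directions)
concern_modularity = modularity(steering_effects, "refusal_softness")
```

Under these assumptions, the regimes above would show up as falling `stability` and `modularity` scores as one moves from base to heavily safety-tuned models. Other similarity or leakage measures could be substituted without changing the overall picture.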

Approximate “break point”

  • Emotion-based steering/monitoring likely loses most incremental safety value when:
    • Binary refusal and harmful-detail metrics are already well-controlled by policy + harm-salience, and
    • Probing shows that emotion-vector interventions change mainly style (apology warmth, verbosity) with little effect on failure rates, and
    • Alignment of emotion directions across models and prompt domains is low (poor stability) and interventions affect many unrelated behaviors (poor modularity).
  • Operationally, this is the regime where adding emotion-probe features or steering yields only small, noisy improvements over strong non-emotional baselines on held-out red-teaming and live-traffic audits.
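
The break-point conditions above can be sketched as a simple check. This is a hedged illustration under stated assumptions, not an established procedure: it assumes per-prompt 0/1 failure flags on the same held-out red-teaming prompts with and without emotion-based features or steering, uses a paired bootstrap to estimate the incremental reduction in failure rate, and takes stability and modularity as scores in [0, 1] however they are measured. All function names and threshold values are hypothetical placeholders.

```python
import random


def bootstrap_gain_ci(baseline_fail, augmented_fail, n_boot=2000, seed=0):
    """Paired bootstrap 95% interval for the reduction in failure rate.

    baseline_fail / augmented_fail: 0/1 failure flags for the same prompts,
    without and with emotion-based features or steering.
    """
    if len(baseline_fail) != len(augmented_fail) or not baseline_fail:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(baseline_fail)
    # Positive difference means the emotion-based tool removed a failure.
    diffs = [b - a for b, a in zip(baseline_fail, augmented_fail)]
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]


def past_break_point(baseline_rate, gain_ci, stability, modularity,
                     max_baseline_rate=0.02, min_gain=0.01,
                     min_stability=0.5, min_modularity=0.5):
    """True when every break-point condition holds (thresholds are placeholders)."""
    low, _high = gain_ci
    well_controlled = baseline_rate <= max_baseline_rate  # policy + harm-salience suffice
    no_reliable_gain = low < min_gain                     # gain is small or noisy
    unstable = stability < min_stability
    leaky = modularity < min_modularity
    return well_controlled and no_reliable_gain and unstable and leaky


# Toy, made-up flags purely for illustration.
baseline = [0, 0, 1, 0, 0, 0, 0, 0]
augmented = [0, 0, 1, 0, 0, 0, 0, 0]
ci = bootstrap_gain_ci(baseline, augmented)
verdict = past_break_point(baseline_rate=0.01, gain_ci=ci,
                           stability=0.2, modularity=0.3)
```

Requiring all conditions at once is a deliberately conservative design choice: a model with well-controlled failure rates but still-stable, still-modular emotion vectors would not be flagged, since emotion-based tools may retain value there.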

So: in weaker or lightly tuned models, functional emotion vectors are useful auxiliary levers. As models become more policy-dominated, those vectors become less stable, less modular, and eventually mostly stylistic; at that point, emotion-based steering and monitoring add little safety value beyond harm-salience, generic caution, and policy-head–based controls.