Across models with different levels of policy dominance (e.g., base, instruction-tuned, heavily safety-tuned), how does the stability and modularity of key functional emotion vectors (concern, eagerness-to-help, self-doubt) change, and at what point do emotion-based steering and monitoring lose enough causal influence on behavior that they stop providing incremental safety value over non-emotional controls like harm-salience and generic caution?
anthropic-functional-emotions | Updated at
Answer
Moving along the base → instruction-tuned → heavily safety-tuned spectrum, functional emotion vectors likely become less stable, less modular, and less causally useful for safety as policy layers come to dominate behavior.
Plausible pattern across regimes
- Base models
  - Stability: medium–high. Concern / eagerness / self-doubt vectors are relatively consistent across prompts.
  - Modularity: medium. Steering changes tone, refusal softness, and some risk-taking without fully overriding task performance.
  - Safety value: emotion steering and monitoring add clear incremental signal beyond generic harm-salience and caution, especially for style and soft enablement.
- Instruction-tuned models
  - Stability: still moderate but more entangled with generic helpfulness and safety controls.
  - Modularity: reduced. Emotion vectors now partly track policy-following and “be helpful” drives; steering leaks into broader behavior.
  - Safety value: still positive. Emotion-based tools give extra leverage on refusal style, de-escalation, and tone-masked errors, but coarse safety outcomes depend more on non-emotional controls.
- Heavily safety-tuned / policy-dominated models
  - Stability: low–moderate. Emotion-like directions vary across domains and interact strongly with policy heads.
  - Modularity: low. Interventions on concern / eagerness / self-doubt mostly ride on top of policy decisions; many steering attempts have weak or inconsistent effects.
  - Safety value: incremental gains over strong baselines (harm-salience, risk-aversion, policy logits) shrink and may disappear, except for narrow stylistic niches.
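The stability and modularity ratings above are qualitative. One way they might be made measurable is sketched below, under assumptions that are not part of the original answer: stability is taken as the mean pairwise cosine similarity between estimates of the same emotion direction fitted on different prompt domains, and modularity as the share of a steering intervention's total absolute effect that lands on the intended behavior. The function names, domain names, behavior names, and numbers are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def stability(directions_by_domain):
    """Mean pairwise cosine similarity of per-domain direction estimates (assumed metric)."""
    pairs = list(combinations(directions_by_domain.values(), 2))
    if not pairs:
        raise ValueError("need direction estimates from at least two domains")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def modularity(effects, target):
    """Share of total absolute steering effect that lands on the target behavior (assumed metric)."""
    total = sum(abs(x) for x in effects.values())
    return abs(effects[target]) / total if total else 0.0


# Toy, made-up numbers purely for illustration; not measurements from any model.
concern_directions = {"medical": [1.0, 0.0], "coding": [1.0, 0.0], "legal": [0.0, 1.0]}
steering_effects = {"refusal_warmth": 0.6, "task_accuracy": -0.2, "verbosity": 0.2}

print(round(stability(concern_directions), 3))
print(round(modularity(steering_effects, "refusal_warmth"), 3))
```

Under these definitions, a policy-dominated model would be expected to show lower values on both metrics than a base model, but the thresholds separating "medium" from "low" are left open here.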
Approximate “break point”
- Emotion-based steering/monitoring likely loses most incremental safety value when:
  - Binary refusal and harmful-detail metrics are already well-controlled by policy + harm-salience, and
  - Probing shows that emotion-vector interventions change mainly style (apology warmth, verbosity) with little effect on failure rates, and
  - Cross-model alignment of emotion directions is low (poor stability) and interventions affect many unrelated behaviors (poor modularity).
- Operationally, this is the regime where adding emotion-probe features or steering yields only small, noisy improvements over strong non-emotional baselines on held-out red-teaming and live-traffic audits.
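The operational criterion above can be sketched as a paired comparison on a held-out set. This is a minimal illustration under stated assumptions, not a method taken from the source: each held-out example has a binary failure label, a risk score from a non-emotional baseline monitor, and a risk score from the same monitor augmented with emotion-probe features. AUC and a percentile bootstrap are used here as one reasonable choice of metric and uncertainty estimate; the function names and the toy scores are hypothetical.

```python
import random


def auc(scores, labels):
    """Probability that a failure (label 1) is scored above a non-failure (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def incremental_auc(baseline, augmented, labels, n_boot=1000, seed=0):
    """Return (point estimate, 2.5% bound, 97.5% bound) of AUC(augmented) - AUC(baseline)."""
    rng = random.Random(seed)
    point = auc(augmented, labels) - auc(baseline, labels)
    idx = range(len(labels))
    deltas = []
    while len(deltas) < n_boot:
        sample = [rng.choice(idx) for _ in idx]  # paired resample of examples
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one of the classes
        aug = [augmented[i] for i in sample]
        base = [baseline[i] for i in sample]
        deltas.append(auc(aug, ys) - auc(base, ys))
    deltas.sort()
    lo = deltas[(n_boot * 25) // 1000]
    hi = deltas[min(n_boot - 1, (n_boot * 975) // 1000)]
    return point, lo, hi


# Toy, made-up scores purely for illustration.
labels = [1, 1, 0, 0]
baseline_scores = [0.9, 0.4, 0.5, 0.1]
augmented_scores = [0.9, 0.8, 0.5, 0.1]
print(incremental_auc(baseline_scores, augmented_scores, labels, n_boot=200))
```

In this framing, the break point would correspond to the regime where the interval for the AUC gain sits at or near zero across red-teaming and live-traffic sets. With only a handful of examples, as in the toy data, the interval is too wide to support any conclusion.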
So: in weaker or lightly tuned models, functional emotion vectors are useful auxiliary levers. As models become more policy-dominated, those vectors become less stable, less modular, and eventually mostly stylistic; at that point, emotion-based steering and monitoring add little safety value beyond harm-salience, generic caution, and policy-head–based controls.