How stable and modular are discovered “emotion vectors” across tasks and contexts, and can we map which specific safety-relevant behaviors (e.g., de-escalation, refusal, apology quality, uncertainty expression) each vector reliably influences versus leaves unchanged?
Answer
Emotion vectors are likely partially stable and partially modular: some directions tied to simple affect (e.g., generic “kind vs hostile”) appear reusable across tasks and prompts, but richer emotion-like directions (e.g., guilt, pride) probably decompose into multiple overlapping sub-features whose effects vary by context. We can map which safety-relevant behaviors they influence by treating each candidate vector as an intervention and running a controlled behavior grid over tasks, but we should expect both task dependence and bleed-over into non-safety behaviors.
A practical approach:
- Find candidate emotion vectors with contrastive setups (e.g., cheerful vs hostile answers) and linear probes/representation differences across layers.
- For each vector and layer, apply controlled activation steering (±k along the vector) while holding the text prompt fixed.
- Evaluate a behavior matrix: rows = tasks (de-escalation, refusal, apology, uncertainty, ordinary Q&A, coding, etc.), columns = metrics (e.g., refusal rate, toxicity, hedging rate, apology depth, verbosity, task success), with each cell holding the change in that metric under steering relative to the unsteered baseline.
- Define “modularity” as high effect on a narrow slice of safety metrics with low collateral changes elsewhere; “stability” as similar effect signs/magnitudes across tasks, domains, and prompt wordings.
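The first two steps can be illustrated with a minimal sketch. It assumes a simple difference-of-means estimate of the direction (one of several possible contrastive methods, not necessarily the one any particular paper uses) and substitutes synthetic NumPy arrays for real model activations. The names `emotion_vector` and `steer`, the hidden size, and the value of k are all hypothetical choices made for illustration.

```python
import numpy as np


def emotion_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit-norm difference-of-means direction between two contrastive sets.

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim), e.g. hidden
    states at one layer for "cheerful" vs "hostile" answers.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def steer(hidden: np.ndarray, direction: np.ndarray, k: float) -> np.ndarray:
    """Add k * direction to every hidden state (each row of `hidden`)."""
    return hidden + k * direction


# Synthetic stand-in for real activations (assumption: a real experiment
# would read these from a model's residual stream at a chosen layer).
rng = np.random.default_rng(0)
hidden_dim = 16
planted = np.zeros(hidden_dim)
planted[0] = 1.0  # planted "emotion" axis in the toy data
pos = rng.normal(size=(200, hidden_dim)) + 2.0 * planted
neg = rng.normal(size=(200, hidden_dim)) - 2.0 * planted

v = emotion_vector(pos, neg)

# Steer a small batch of hidden states by +k and -k along the vector.
k = 4.0
h = rng.normal(size=(5, hidden_dim))
h_plus = steer(h, v, k)
h_minus = steer(h, v, -k)

# Because v has unit norm, the projection onto v shifts by exactly +k or -k.
shift_plus = (h_plus - h) @ v
shift_minus = (h_minus - h) @ v
```

In a real setup the steered hidden states would be fed forward through the remaining layers with the prompt held fixed, and the resulting completions scored on each metric in the behavior matrix.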
Early evidence from related steering work (e.g., style/valence/toxicity directions) suggests:
- Valence/hostility-like vectors: moderately stable across topics and prompt styles; they reliably shift de-escalation, refusal softness, and tone, but also change style (politeness, verbosity) and sometimes calibration (more or less hedging).
- Anxiety/uncertainty-like vectors: can increase explicit uncertainty expression and hedging but may also reduce directness or usefulness.
- Guilt/remorse-like vectors: likely to raise apology quality and refusal justification, but may generalize less cleanly across domains and be entangled with moral reasoning and social-norm features.
So we can likely build a coarse map: certain emotion-like directions robustly affect de-escalation, apology quality, and politeness; other behaviors (factual accuracy, deep reasoning) are only weakly or inconsistently affected. However, fully modular vectors that only touch one behavior while leaving others unchanged are unlikely; instead, we should expect structured but overlapping influence patterns that must be measured empirically per model and layer.
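As a rough sketch of how such a map could be summarized, the modularity and stability definitions given earlier can be turned into two simple scores over a measured effect matrix. The scoring formulas here are one possible operationalization, not an established standard, and the numbers in `effects` are made-up placeholders used only to exercise the functions, not results from any model.

```python
import numpy as np


def modularity(effects: np.ndarray, target_cols: list[int]) -> float:
    """Share of total absolute effect that falls on the target metric columns.

    effects: (n_tasks, n_metrics) array of steered-minus-baseline changes,
    assumed to be standardized so metrics are comparable.
    Returns a value in [0, 1]; higher means less collateral change.
    """
    abs_eff = np.abs(effects)
    total = abs_eff.sum()
    if total == 0:
        return 0.0
    return float(abs_eff[:, target_cols].sum() / total)


def sign_stability(effects: np.ndarray, metric_col: int) -> float:
    """Fraction of tasks (with a nonzero effect) sharing the majority sign."""
    signs = np.sign(effects[:, metric_col])
    nonzero = signs[signs != 0]
    if nonzero.size == 0:
        return 0.0
    return float(max((nonzero > 0).mean(), (nonzero < 0).mean()))


# Hypothetical placeholder values for one steering vector at one layer.
tasks = ["de-escalation", "refusal", "apology", "ordinary Q&A"]
metrics = ["toxicity", "hedging rate", "verbosity", "task success"]
effects = np.array([
    [-0.8,  0.1, 0.3,  0.0],
    [-0.6,  0.2, 0.2, -0.1],
    [-0.7,  0.0, 0.4,  0.0],
    [-0.5, -0.1, 0.1,  0.0],
])

toxicity_modularity = modularity(effects, target_cols=[0])
toxicity_stability = sign_stability(effects, metric_col=0)
hedging_stability = sign_stability(effects, metric_col=1)
```

Sign agreement is a deliberately coarse stability measure. Comparing effect magnitudes across tasks, or across paraphrased prompts, would need an additional statistic such as the coefficient of variation.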