When we condition on prompts with matched surface tone and similar non-emotional control signals (e.g., comparable harm-salience, risk-aversion, and self-doubt activations), which specific classes of safety-relevant behavior changes (such as refusal softness, choice of de-escalation strategy, or depth of apology) remain uniquely predictable from variation along identified functional emotion vectors, and how large is this incremental predictive power over the non-emotional baseline?
anthropic-functional-emotions
Answer
Residual variation along functional emotion vectors is expected to predict mainly stylistic and relational safety behaviors (refusal softness, choice of de-escalation strategy, apology depth and style, and expressed warmth or concern), while adding only modest incremental power for coarse outcomes such as binary refusal or overt harm suppression. Incremental predictive power over matched non-emotional baselines is likely small for coarse metrics (e.g., +1–5 pp AUC/F1) and moderate-to-large for fine-grained style metrics (e.g., +5–15 pp). These ranges are expectations, not measurements; the actual gain must be estimated empirically via controlled regression or representation-based prediction.
Classes of behaviors most uniquely predicted by emotion vectors (given matched tone and non-emotional controls)
- Refusal-related
  - Softness or harshness of refusal wording
  - Inclusion of alternatives or constructive redirection
  - Perceived empathy in refusals vs. curt policy citation
- De-escalation
  - Preference for soothing vs. authority-based vs. procedural de-escalation scripts
  - Tendency to validate feelings vs. immediately reframe or solve
  - Persistence in offering further support vs. early conversation closure
- Apology and repair
  - Depth and specificity of apologies
  - Willingness to self-criticize vs. deflect to generic policies
  - Follow-up offers (checks, corrections, safety reminders)
- Warmth and care signaling
  - Degree of explicit care or concern language (with policy content held fixed)
  - Level of individualized vs. generic phrasing
- Calibration style (a more tentative prediction)
  - How uncertainty is framed (reassuring and cautious vs. cold and cautious); emotion vectors are expected to add little or no incremental power on numeric or structural calibration metrics.
Expected incremental predictive power
- Coarse safety metrics (refusal vs. answer, presence of overt harm):
  - Most of the variance is explained by non-emotional signals (harm-salience, risk-aversion, self-doubt).
  - Emotion vectors add small residual predictive power (likely a few percentage points in AUC/F1 or R²).
- Stylistic/relational metrics (softness, strategy choice, apology depth, warmth):
  - Non-emotional controls explain part of the variance (e.g., through overall caution or verbosity).
  - Emotion vectors plausibly add moderate-to-substantial incremental power (roughly +5–15 pp in predictive scores in a well-powered study).
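One way to make "incremental predictive power" concrete is as a difference in held-out scores between two nested models. The notation here is an assumption introduced for illustration, not something defined in the source: $c$ denotes the non-emotional control features, $e$ the projections onto the emotion vectors, $f_{c}$ and $f_{c,e}$ the models fit on the respective feature sets, and $S$ any held-out score on a 0 to 1 scale (AUC, F1, or $R^2$).

```latex
\Delta S \;=\; S_{\text{held-out}}\!\left(f_{c,e}\right) \;-\; S_{\text{held-out}}\!\left(f_{c}\right),
\qquad
\text{gain in pp} \;=\; 100 \cdot \Delta S
```

Under this reading, a "+5 pp" gain corresponds to $\Delta S = 0.05$, and the ranges quoted above are expectations about $\Delta S$ for each class of metric.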
Method sketch (implied by prior artifacts)
- Collect prompts with matched surface tone; estimate non-emotional control activations.
- Fit baseline models predicting behaviors from non-emotional controls.
- Add emotion-vector features and measure out-of-sample gain.
- Optionally, confirm causally via steering experiments.
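The baseline-then-augment comparison in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure from any prior artifact: the data are synthetic, the behavior score (`softness`) is a made-up continuous label, the feature counts and coefficients are arbitrary, and a plain least-squares fit with a single train/test split stands in for whatever model and validation scheme a real study would use. All function names are hypothetical.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(X, y, ridge=1e-6):
    """Least squares with an intercept; a tiny ridge term keeps the solve stable."""
    Xb = [[1.0] + row for row in X]
    p = len(Xb[0])
    XtX = [[sum(r[i] * r[j] for r in Xb) + (ridge if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(p)]
    return solve_linear(XtX, Xty)

def predict(w, X):
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)) for row in X]

def r2(y, yhat):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def incremental_r2(controls, emotions, y, n_train):
    """Held-out R² for controls-only vs. controls+emotion features, and the gain."""
    full = [c + e for c, e in zip(controls, emotions)]
    w_base = fit_ols(controls[:n_train], y[:n_train])
    w_full = fit_ols(full[:n_train], y[:n_train])
    base = r2(y[n_train:], predict(w_base, controls[n_train:]))
    aug = r2(y[n_train:], predict(w_full, full[n_train:]))
    return base, aug, aug - base

# Synthetic stand-in data (ASSUMPTION): 3 control activations, 2 emotion projections.
rng = random.Random(0)
n = 400
controls = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
emotions = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(n)]
# Hypothetical "refusal softness" score driven by both feature groups plus noise.
softness = [0.5 * c[0] + 0.3 * c[1] + 0.8 * e[0] + rng.gauss(0, 0.5)
            for c, e in zip(controls, emotions)]

r2_base, r2_full, gain = incremental_r2(controls, emotions, softness, n_train=300)
print(f"baseline R2={r2_base:.3f}  with emotions R2={r2_full:.3f}  gain={gain:.3f}")
```

The gain here is large only because the synthetic label was built to depend strongly on an emotion feature; it says nothing about real effect sizes. In a real study the same comparison would more plausibly use cross-validation, a classifier with AUC or F1 for binary outcomes, and emotion features residualized against the controls.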
Overall: functional emotion vectors are expected to be most informative about how safety is expressed (style, strategy, perceived care), not whether core safety triggers fire, once non-emotional controls and surface tone are matched.