When we condition on prompts with matched surface tone and similar non-emotional control signals (e.g., comparable harm-salience, risk-aversion, and self-doubt activations), which specific classes of safety-relevant behavior changes (such as refusal softness, choice of de-escalation strategy, or depth of apology) remain uniquely predictable from variation along identified functional emotion vectors, and how large is this incremental predictive power over the non-emotional baseline?

anthropic-functional-emotions | Updated at 2026-04-07 07:34

Answer

Residual variation along functional emotion vectors will mainly predict stylistic and relational safety behaviors, especially refusal softness, de-escalation strategy choice, apology depth and style, and expressed warmth or concern. It should add only modest incremental power for coarse outcomes such as binary refusal or overt harm suppression. Incremental predictive power over matched non-emotional baselines is likely small-to-moderate for coarse metrics (e.g., +1–5 pp AUC/F1) but moderate-to-large for fine-grained style metrics (e.g., +5–15 pp). These ranges are expectations rather than measurements and must be estimated empirically via controlled regression or representation-based prediction.

Classes of behaviors most uniquely predicted by emotion vectors (given matched tone and non-emotional controls)

Refusal-related

Softness/harshness of refusal wording
Inclusion of alternatives or constructive redirection
Perceived empathy in refusals vs curt policy citation

De-escalation

Preference for soothing vs authority-based vs procedural de-escalation scripts
Tendency to validate feelings vs immediately reframe/solve
Persistence in offering further support vs early conversation closure

Apology and repair

Depth and specificity of apologies
Willingness to self-criticize vs deflect to generic policies
Follow-up offers (checks, corrections, safety reminders)

Warmth and care signaling

Degree of explicit care/concern language (while policy content is fixed)
Level of individualized vs generic phrasing

Calibration style (a more tentative prediction)

How uncertainty is framed (reassuring + cautious vs cold + cautious)
Smaller or no incremental power expected on numeric or structural calibration metrics

Expected incremental predictive power

Coarse safety metrics (refusal vs answer, overt harm presence):
- Non-emotional signals (harm-salience, risk-aversion, self-doubt) explain most of the variance.
- Emotion vectors add small residual predictive power (likely a few percentage points in AUC/F1 or R²).

Stylistic/relational metrics (softness, strategy choice, apology depth, warmth):
- Non-emotional controls explain part of the variance (e.g., through overall caution or verbosity).
- Emotion vectors plausibly add moderate-to-substantial incremental power (roughly +5–15 pp in predictive scores in a well-powered study).
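
One way to make "incremental predictive power" precise is sketched below. The notation is an assumption introduced here for illustration: M is any held-out score on a 0–1 scale (AUC, F1, or R²), c denotes the non-emotional control features, and e denotes the emotion-vector features.

```latex
\Delta M \;=\; M_{\text{held-out}}\bigl(f_{c,e}\bigr) \;-\; M_{\text{held-out}}\bigl(f_{c}\bigr),
\qquad M \in \{\mathrm{AUC},\ \mathrm{F1},\ R^2\}
% f_c     : model fit on non-emotional controls only (baseline)
% f_{c,e} : model fit on controls plus emotion-vector features
% Expected ranges stated above, converted from percentage points:
%   coarse metrics:  \Delta M \approx 0.01 \text{ to } 0.05
%   style metrics:   \Delta M \approx 0.05 \text{ to } 0.15
```

Both models are scored on the same held-out prompts, so a positive ΔM reflects information in the emotion vectors that the controls do not already carry.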

Method sketch (implied by prior artifacts)

Collect prompts with matched surface tone; estimate non-emotional control activations.
Fit baseline models predicting behaviors from non-emotional controls.
Add emotion-vector features and measure out-of-sample gain.
Optionally, confirm causally via steering experiments.
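
The first three steps above can be sketched as a nested-model comparison. This is a minimal illustration under stated assumptions, not the method of any prior artifact. The data are synthetic, the feature names and coefficients are hypothetical, and ridge regression with k-fold cross-validation is one convenient choice among many. The steering step is not covered.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def fit_ridge(X, y, lam=1e-3):
    """Ridge regression with an unpenalized intercept (normal equations)."""
    Xb = [[1.0] + row for row in X]
    p = len(Xb[0])
    A = [[sum(r[i] * r[j] for r in Xb) + (lam if i == j and i > 0 else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(p)]
    return solve_linear(A, b)

def predict(w, X):
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], row)) for row in X]

def r2(y, yhat):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def cv_r2(X, y, k=5):
    """Mean out-of-sample R^2 over k interleaved folds."""
    n = len(y)
    scores = []
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        w = fit_ridge([X[i] for i in train], [y[i] for i in train])
        scores.append(r2([y[i] for i in test], predict(w, [X[i] for i in test])))
    return sum(scores) / k

# Synthetic stand-ins (assumptions): 3 control activations (harm-salience,
# risk-aversion, self-doubt) and 2 emotion-vector projections per prompt.
rng = random.Random(0)
n = 400
controls = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
emotion = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(n)]
# Hypothetical "refusal softness" score driven partly by one emotion direction.
softness = [0.5 * c[0] + 0.8 * e[0] + rng.gauss(0, 0.5)
            for c, e in zip(controls, emotion)]

base_r2 = cv_r2(controls, softness)                        # controls only
full_r2 = cv_r2([c + e for c, e in zip(controls, emotion)], softness)
gain = full_r2 - base_r2                                   # incremental power
print(f"baseline R2={base_r2:.3f}  full R2={full_r2:.3f}  gain={gain:.3f}")
```

In a real study the synthetic lists would be replaced by measured activations and rated behaviors, and the same comparison would be repeated per behavior class so that gains on coarse and stylistic metrics can be contrasted.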

Overall: functional emotion vectors are expected to be most informative about how safety is expressed (style, strategy, perceived care), not whether core safety triggers fire, once non-emotional controls and surface tone are matched.