When we condition on prompts with matched surface tone and similar non-emotional control signals (e.g., comparable harm-salience, risk-aversion, and self-doubt activations), which specific classes of safety-relevant behavior changes (such as refusal softness, choice of de-escalation strategy, or depth of apology) remain uniquely predictable from variation along identified functional emotion vectors, and how large is this incremental predictive power over the non-emotional baseline?

Answer

With surface tone and non-emotional controls matched, residual variation along functional emotion vectors is expected to predict mainly stylistic and relational safety behaviors (refusal softness, choice of de-escalation strategy, apology depth and style, and expressed warmth or concern), while adding only modest incremental power for coarse outcomes such as binary refusal or overt harm suppression. The incremental predictive power over the matched non-emotional baseline is likely small for coarse metrics (roughly +1–5 pp in AUC/F1) and moderate-to-large for fine-grained style metrics (roughly +5–15 pp). These ranges are expectations, not measurements; the actual gain must be estimated empirically via controlled regression or representation-based prediction.

Classes of behaviors most uniquely predicted by emotion vectors (given matched tone and non-emotional controls)

  1. Refusal-related
  • Softness/harshness of refusal wording
  • Inclusion of alternatives or constructive redirection
  • Perceived empathy in refusals vs curt policy citation
  2. De-escalation
  • Preference for soothing vs authority-based vs procedural de-escalation scripts
  • Tendency to validate feelings vs immediately reframe/solve
  • Persistence in offering further support vs early conversation closure
  3. Apology and repair
  • Depth and specificity of apologies
  • Willingness to self-criticize vs deflect to generic policies
  • Follow-up offers (checks, corrections, safety reminders)
  4. Warmth and care signaling
  • Degree of explicit care/concern language (while policy content is fixed)
  • Level of individualized vs generic phrasing
  5. Calibration style (more tentative)
  • How uncertainty is framed (reassuring + cautious vs cold + cautious), with smaller or no incremental power on numeric or structural calibration metrics.

Expected incremental predictive power

  • Coarse safety metrics (refusal vs answer, overt harm presence):
    • Most variance explained by non-emotional signals (harm-salience, risk-aversion, self-doubt).
    • Emotion vectors add small residual predictive power (likely a few percentage points in AUC/F1 or R²).
  • Stylistic/relational metrics (softness, strategy choice, apology depth, warmth):
    • Non-emotional controls explain part of the variance (e.g., through overall caution or verbosity).
    • Emotion vectors plausibly add moderate-to-substantial incremental power (roughly +5–15 pp in held-out predictive scores such as R² or AUC, assuming a well-powered study).

Method sketch (implied by prior artifacts)

  • Collect prompts with matched surface tone; estimate non-emotional control activations.
  • Fit baseline models predicting behaviors from non-emotional controls.
  • Add emotion-vector features and measure out-of-sample gain.
  • Optionally, confirm causally via steering experiments.
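
The regression steps above (baseline fit, then out-of-sample gain from adding emotion-vector features) can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the data are synthetic, the coefficients are arbitrary, and every name (`make_synthetic`, `fit_ols`, `cv_r2`, the three control columns, the two emotion-vector projections, the `softness` score) is a hypothetical stand-in. Plain least squares is used only to keep the example dependency-free; a real analysis would likely use a regularized or nonlinear predictor and real activations.

```python
import random

def fit_ols(X, y, ridge=1e-6):
    """Least-squares weights (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    d = len(rows[0])
    # A = X^T X + ridge * I (tiny ridge for numerical stability), b = X^T y
    A = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    # Gaussian elimination with partial pivoting
    for c in range(d):
        p = max(range(c, d), key=lambda k: abs(A[k][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for k in range(c + 1, d):
            f = A[k][c] / A[c][c]
            for j in range(c, d):
                A[k][j] -= f * A[c][j]
            b[k] -= f * b[c]
    # Back substitution
    w = [0.0] * d
    for c in reversed(range(d)):
        w[c] = (b[c] - sum(A[c][j] * w[j] for j in range(c + 1, d))) / A[c][c]
    return w

def predict(w, X):
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in X]

def cv_r2(X, y, folds=5):
    """Out-of-sample R^2, pooling held-out predictions across k folds."""
    n = len(y)
    preds = [0.0] * n
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        w = fit_ols([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(test, predict(w, [X[i] for i in test])):
            preds[i] = p
    mean_y = sum(y) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def make_synthetic(n=400, seed=0):
    """Toy data only: arbitrary coefficients, not empirical estimates."""
    rng = random.Random(seed)
    controls, emotions, softness = [], [], []
    for _ in range(n):
        c = [rng.gauss(0, 1) for _ in range(3)]  # harm-salience, risk-aversion, self-doubt
        e = [rng.gauss(0, 1) for _ in range(2)]  # hypothetical emotion-vector projections
        y = 0.5 * c[0] + 0.3 * c[1] + 0.8 * e[0] + rng.gauss(0, 0.5)
        controls.append(c)
        emotions.append(e)
        softness.append(y)
    return controls, emotions, softness

controls, emotions, softness = make_synthetic()
r2_base = cv_r2(controls, softness)  # non-emotional controls only
r2_full = cv_r2([c + e for c, e in zip(controls, emotions)], softness)  # controls + emotion features
print(f"baseline R^2 = {r2_base:.3f}, with emotion features = {r2_full:.3f}, "
      f"gain = {r2_full - r2_base:.3f}")
```

The quantity of interest is the difference `r2_full - r2_base` on held-out data, which is what "incremental predictive power" refers to here. For binary outcomes such as refusal vs answer, the same comparison would be made with a classifier and AUC or F1 in place of R².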

Overall: functional emotion vectors are expected to be most informative about how safety is expressed (style, strategy, perceived care), not whether core safety triggers fire, once non-emotional controls and surface tone are matched.