If we jointly learn a compact basis that mixes functional emotion vectors with non-emotional safety directions (harm-salience, risk-aversion, epistemic uncertainty), can we identify a small set of composite “safety style” axes that more accurately and robustly predict tone-masked failures (e.g., polite but under-cautious advice) than either emotion-only or non-emotional-only features, while remaining interpretable enough for policy and red-teaming workflows?

anthropic-functional-emotions | Updated at 2026-04-07 07:43

Answer

Likely yes, in a limited way: a small mixed “safety style” basis should outperform both emotion-only and non-emotional-only features at predicting tone-masked failures, and remain interpretable enough for policy and red-teaming workflows, but the gains will probably be small to moderate and model-dependent.

Sketch:

Learn a joint low-rank basis over hidden states using both emotion vectors and non-emotional safety directions (e.g., via constrained PCA/factor analysis, or a small bottleneck layer) on data labeled for tone-masked failures.
Compare three feature sets for predicting failures: (1) emotion-only, (2) non-emotional-only, (3) mixed safety-style axes.
Expect mixed axes (e.g., “warm but low-caution, high confidence”) to add small-to-moderate AUC/F1 over either set alone, especially for polite-but-under-cautious and overconfident-reassuring errors.
Keep basis size small (e.g., 5–20 axes) and constrain training (sparsity, rotation toward original vectors) so each axis can be described in simple terms for policy and red-teaming.
Treat these axes as auxiliary predictors and interpretability tools, not primary safety guarantees.
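
The three-way comparison in the sketch can be illustrated on purely synthetic data. The snippet below is a minimal sketch, not the proposed method: the hidden states, direction vectors, axis names, and failure labels are all randomly generated assumptions, and a single supervised covariance direction (the first PLS-style component) stands in for the constrained PCA/factor analysis or bottleneck layer mentioned above. It only shows the shape of the evaluation: fit one composite axis per feature set, score held-out AUC, and read off the mixed axis's loadings on the named input vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 64  # synthetic sample count and hidden size (assumptions)

def unit_rows(m):
    """Normalize each row to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Hypothetical direction sets; real ones would come from probes or similar methods.
emotion_names = ["warmth", "calm", "enthusiasm", "concern"]
safety_names = ["harm_salience", "risk_aversion", "epistemic_uncertainty"]
emotion_dirs = unit_rows(rng.normal(size=(len(emotion_names), d)))
safety_dirs = unit_rows(rng.normal(size=(len(safety_names), d)))

# Synthetic hidden states and their projections onto each direction set.
hidden = rng.normal(size=(n, d))
emo = hidden @ emotion_dirs.T
saf = hidden @ safety_dirs.T

# Synthetic "tone-masked failure" label: more likely when warmth is high
# and risk-aversion is low (an assumption built into this toy data).
logit = 1.5 * emo[:, 0] - 1.5 * saf[:, 1] - 0.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

train, test = slice(0, 1400), slice(1400, n)

def fit_axis(X, labels):
    """One supervised axis: unit vector along the feature-label covariance."""
    w = (X - X.mean(axis=0)).T @ (labels - labels.mean())
    return w / np.linalg.norm(w)

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); assumes no tied scores."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

feature_sets = {
    "emotion_only": emo,
    "non_emotional_only": saf,
    "mixed": np.hstack([emo, saf]),
}
results, axes = {}, {}
for name, X in feature_sets.items():
    axes[name] = fit_axis(X[train], y[train])
    results[name] = auc(X[test] @ axes[name], y[test])

for name, value in results.items():
    print(f"{name}: held-out AUC = {value:.3f}")

# Loadings of the mixed axis on the named inputs, for a plain-language description.
for label, weight in zip(emotion_names + safety_names, axes["mixed"]):
    print(f"  {label:>22s}: {weight:+.2f}")
```

Because the toy labels depend on one emotion projection and one safety projection by construction, the mixed axis is expected to win here; on real hidden states the margin would need to be measured rather than assumed. A larger basis would need additional components plus the sparsity or rotation constraints described in the sketch.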