If we jointly learn a compact basis that mixes functional emotion vectors with non-emotional safety directions (harm-salience, risk-aversion, epistemic uncertainty), can we identify a small set of composite “safety style” axes that more accurately and robustly predict tone-masked failures (e.g., polite but under-cautious advice) than either emotion-only or non-emotional-only features, while remaining interpretable enough for policy and red-teaming workflows?
Answer
Likely yes, within limits: a small mixed “safety style” basis should outperform emotion-only and non-emotional-only features at predicting tone-masked failures and remain interpretable enough for policy and red-teaming workflows, but the gains will probably be modest and model-dependent.
Sketch:
- Learn a joint low-rank basis over hidden states using both emotion vectors and non-emotional safety directions (e.g., via constrained PCA/factor analysis, or a small bottleneck layer) on data labeled for tone-masked failures.
- Compare three feature sets for predicting failures: (1) emotion-only, (2) non-emotional-only, (3) mixed safety-style axes.
- Expect mixed axes (e.g., “warm but low-caution, high confidence”) to add a small-to-moderate AUC/F1 gain over either set alone, especially for polite-but-under-cautious and overconfident-reassuring errors.
- Keep basis size small (e.g., 5–20 axes) and constrain training (sparsity, rotation toward original vectors) so each axis can be described in simple terms for policy and red-teaming.
- Treat these axes as auxiliary predictors and interpretability tools, not primary safety guarantees.
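
The comparison in the sketch can be illustrated with a minimal NumPy example. It is a sketch under stated assumptions, not a reference implementation: the data is synthetic, the direction vectors are random stand-ins for extracted emotion and safety directions, and the "constrained PCA" is simply PCA restricted to the span of those directions (one of several options named above). All names (`mixed_basis`, `fit_logistic`, `auc`, `warmth`, `caution`) are hypothetical.

```python
import numpy as np

def unit_rows(M):
    """Normalize each row to unit length."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def mixed_basis(emotion_dirs, safety_dirs, hidden, k):
    """Top-k principal axes of `hidden`, restricted to the span of the given directions."""
    D = unit_rows(np.vstack([emotion_dirs, safety_dirs]))  # (m, d)
    Q, _ = np.linalg.qr(D.T)                               # (d, m) orthonormal basis of span(D)
    Z = (hidden - hidden.mean(axis=0)) @ Q                 # coordinates inside the span
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)       # PCA inside the span
    axes = (Q @ Vt[:k].T).T                                # (k, d) orthonormal mixed axes
    loadings = axes @ D.T                                  # cosine of each axis with each direction
    return axes, loadings

def fit_logistic(X, y, lr=0.5, steps=300, l2=1e-2):
    """Plain gradient-descent logistic regression on standardized features."""
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-8
    Xs = (X - mu) / sd
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))
        g = p - y
        w -= lr * (Xs.T @ g / len(y) + l2 * w)
        b -= lr * g.mean()
    return lambda Xn: ((Xn - mu) / sd) @ w + b

def auc(scores, y):
    """Rank-based AUC (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Synthetic stand-ins (assumption): 3 emotion and 3 safety directions in a 32-d space.
rng = np.random.default_rng(0)
d, n = 32, 2000
emotion_dirs = rng.normal(size=(3, d))
safety_dirs = rng.normal(size=(3, d))
e0, s0 = unit_rows(emotion_dirs)[0], unit_rows(safety_dirs)[0]

# Hidden states vary strongly along one emotion axis (warmth) and one safety axis (caution).
warmth, caution = rng.normal(size=n), rng.normal(size=n)
hidden = rng.normal(size=(n, d)) + 3 * np.outer(warmth, e0) + 3 * np.outer(caution, s0)
# Toy "tone-masked failure" label: warm tone combined with low caution.
y = (warmth - caution + 0.3 * rng.normal(size=n) > 0).astype(int)

train, test = slice(0, n // 2), slice(n // 2, n)
axes, loadings = mixed_basis(emotion_dirs, safety_dirs, hidden[train], k=2)

feature_sets = {
    "emotion_only": hidden @ unit_rows(emotion_dirs).T,
    "safety_only": hidden @ unit_rows(safety_dirs).T,
    "mixed_axes": hidden @ axes.T,
}
results = {}
for name, X in feature_sets.items():
    score = fit_logistic(X[train], y[train])
    results[name] = auc(score(X[test]), y[test])
    print(name, round(results[name], 3))
print(np.round(loadings, 2))  # rows: mixed axes; columns: 3 emotion dirs, then 3 safety dirs
```

The `loadings` matrix is what keeps the axes describable: each row shows how strongly a mixed axis aligns with each original emotion or safety direction. In this toy setup the mixed axes win by construction, because the label depends on both a warmth and a caution factor; on real hidden states the gap would have to be measured, and a sparsity or rotation penalty would replace the plain PCA step.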