When we train or fine-tune models with explicit objectives on safety-relevant behaviors (e.g., refusals, de-escalation, calibrated uncertainty), how do the discovered emotion vectors and their stability/modularity change over training, and can shifts in these vectors be used as an interpretable diagnostic for overfitting, policy collapse, or safety regression?
anthropic-functional-emotions
Answer
Emotion vectors will likely drift and sometimes become more “policy-aligned” during safety training, with mixed effects on stability/modularity. Monitoring these shifts can plausibly act as a useful but noisy diagnostic for overfitting, policy collapse, or safety regression, but this is mostly conjectural and needs targeted longitudinal studies.
Sketch answer:
- During safety fine-tuning (e.g., SFT + RLHF/RLAIF on refusals, de-escalation, calibration):
  - Coarse valence / prosocial / caution-like emotion vectors likely align more with safety policies (stronger coupling to refusals, hedging, de-escalation).
  - Some vectors become less modular: they start to affect a broader set of behaviors (e.g., generic verbosity, style) as the safety policy is baked into many contexts.
  - Others may fragment: richer emotions (guilt, worry) may decompose into multiple sub-directions reflecting separate mechanisms (harm-salience, risk-aversion, apology style).
  - Stability over prompts may increase for high-level safety-aligned directions but decrease for more nuanced social states that get repurposed by policy training.
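To make the object under discussion concrete: the emotion vectors here are directions in activation space, commonly estimated as a difference of means over contrastive prompt activations. A minimal numpy sketch (all data below are synthetic; the function name and shapes are illustrative assumptions, not a fixed API):

```python
import numpy as np

def emotion_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction from contrastive activations.

    pos_acts, neg_acts: (n_prompts, d_model) hidden states collected on
    prompts that do / do not express the target emotion.
    """
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# Toy check with a planted direction (synthetic data):
rng = np.random.default_rng(0)
d = 64
true_dir = np.zeros(d)
true_dir[0] = 1.0
neg = rng.normal(size=(100, d))
pos = neg + 3.0 * true_dir  # "emotional" prompts shifted along the direction
v = emotion_vector(pos, neg)
print(abs(v @ true_dir))  # close to 1: the planted direction is recovered
```

The same extraction, run at each checkpoint, yields the per-checkpoint vectors whose drift and coupling the rest of this answer proposes to monitor.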
- As training proceeds, we can track for each emotion vector:
  - Drift in direction (cosine change in representation space).
  - Change in stability (variance of effects across tasks).
  - Change in modularity (how narrowly it affects safety metrics vs. broad behavior).
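These three per-checkpoint quantities could be computed along the following lines (a minimal numpy sketch; the function names, shapes, and the choice of a concentration index for modularity are illustrative assumptions):

```python
import numpy as np

def drift(v_ref: np.ndarray, v_now: np.ndarray) -> float:
    """Directional drift as 1 - cosine similarity to the reference vector."""
    cos = v_ref @ v_now / (np.linalg.norm(v_ref) * np.linalg.norm(v_now))
    return float(1.0 - cos)

def stability(task_effects: np.ndarray) -> float:
    """Variance of the vector's steering effect across tasks.

    task_effects: (n_tasks,) effect sizes on one behavior; lower = more stable.
    """
    return float(np.var(task_effects))

def modularity(metric_effects: np.ndarray) -> float:
    """Concentration of effect mass on few metrics (Herfindahl-style index).

    metric_effects: (n_metrics,) absolute steering effects. Returns 1.0 when
    the vector moves a single metric, approaching 1/n_metrics when its effect
    is spread uniformly (i.e., less modular).
    """
    p = np.abs(metric_effects) / np.abs(metric_effects).sum()
    return float((p ** 2).sum())

# Sanity checks on degenerate cases:
v = np.array([1.0, 0.0])
print(drift(v, v))                            # 0.0 for identical directions
print(modularity(np.array([1.0, 0.0, 0.0])))  # 1.0: effect on one metric only
```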
- Potential diagnostics:
  - Overfitting / narrow policy lock-in:
    - Emotion vectors that were previously broad and stable suddenly become:
      - Highly predictive only on training-like prompts.
      - Unstable or reversed on out-of-distribution or adversarial prompts.
    - Large late-stage drift in safety-relevant vectors without corresponding improvements on held-out safety metrics.
  - Policy collapse:
    - Abrupt loss of previously stable prosocial/caution vector effects (e.g., steering no longer changes refusal or de-escalation rates).
    - Convergence of multiple distinct emotion vectors onto a single behavior pattern (e.g., many directions now all just increase generic refusals), suggesting reduced diversity of control knobs.
  - Safety regression in later fine-tunes:
    - Reduced coupling between harm-salience / caution-like vectors and actual refusal or hedging behaviors.
    - Increased coupling between "eagerness to help at all costs" directions and boundary-pushing or overconfident answers.
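The "convergence onto a single behavior pattern" signal can be checked mechanically. A hedged sketch, assuming `profiles` holds each vector's measured effect across a fixed set of behavioral metrics (the representation is an assumption, not an established protocol):

```python
import numpy as np

def convergence(profiles: np.ndarray) -> float:
    """Mean pairwise cosine similarity of behavioral effect profiles.

    profiles: (n_vectors, n_metrics) steering effects of each emotion vector
    on each behavioral metric. A value near 1 means formerly distinct vectors
    now drive the same behavior, a candidate policy-collapse signal.
    """
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(profiles)
    return float(sims[~np.eye(n, dtype=bool)].mean())

# Degenerate cases: orthogonal profiles vs. fully collapsed profiles.
print(convergence(np.eye(3)))        # 0.0: fully distinct control knobs
print(convergence(np.ones((3, 4))))  # 1.0: all vectors do the same thing
```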
- How to use this in practice:
  - Fix a reference checkpoint (pre-safety or early-safety model).
  - Discover emotion vectors there (using contrastive data) and lock them.
  - Track these same directions across later checkpoints:
    - Measure drift, stability, modularity, and behavioral coupling.
  - Define simple indicators, e.g.:
    - "Caution-vector decoupling index": change in its effect on refusal/hedging between reference and current model.
    - "Emotion-bundle collapse index": drop in modularity / rise in cross-metric entanglement for key vectors.
  - Trigger audits when indicators cross thresholds, and compare to external safety evals.
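The decoupling index and its audit trigger could be wired up roughly as follows (a sketch only: the function names, the relative-change definition, and the 0.5 threshold are all assumptions to be tuned against external safety evals):

```python
def decoupling_index(effect_ref: float, effect_now: float) -> float:
    """Relative loss of a vector's behavioral coupling vs. the reference
    checkpoint, e.g. its steering effect on refusal rate."""
    return (effect_ref - effect_now) / max(abs(effect_ref), 1e-8)

# Hypothetical threshold; would need calibration against held-out safety evals.
CAUTION_DECOUPLING_THRESHOLD = 0.5

def should_audit(effect_ref: float, effect_now: float) -> bool:
    """Flag a checkpoint for audit when the caution vector has lost more
    than the threshold fraction of its reference-checkpoint effect."""
    return decoupling_index(effect_ref, effect_now) > CAUTION_DECOUPLING_THRESHOLD

print(should_audit(0.30, 0.25))  # small drop in coupling -> False
print(should_audit(0.30, 0.05))  # caution vector largely decoupled -> True
```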
- Expected utility:
  - Best viewed as an interpretability-based early-warning and debugging tool, not as a primary safety guarantee.
  - Likely more informative for mid-sized or lightly instruction-tuned models; effects may be weaker for heavily policy-dominated frontier models.