When we train or fine-tune models with explicit objectives on safety-relevant behaviors (e.g., refusals, de-escalation, calibrated uncertainty), how do the discovered emotion vectors and their stability/modularity change over training, and can shifts in these vectors be used as an interpretable diagnostic for overfitting, policy collapse, or safety regression?
anthropic-functional-emotions
Answer
Emotion vectors will likely drift and sometimes become more “policy-aligned” during safety training, with mixed effects on stability/modularity. Monitoring these shifts can plausibly act as a useful but noisy diagnostic for overfitting, policy collapse, or safety regression, but this is mostly conjectural and needs targeted longitudinal studies.
Sketch answer:
- During safety fine-tuning (e.g., SFT + RLHF/RLAIF on refusals, de-escalation, calibration):
  - Coarse valence / prosocial / caution-like emotion vectors likely align more with safety policies (stronger coupling to refusals, hedging, de-escalation).
  - Some vectors become less modular: they start to affect a broader set of behaviors (e.g., generic verbosity, style) as the safety policy is baked into many contexts.
  - Others may fragment: richer emotions (guilt, worry) may decompose into multiple sub-directions reflecting separate mechanisms (harm-salience, risk-aversion, apology style).
  - Stability over prompts may increase for high-level safety-aligned directions but decrease for more nuanced social states that get repurposed by policy training.
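To make the object under discussion concrete: the emotion vectors here are directions in activation space, commonly estimated as a difference of means over contrastive prompt activations. A minimal numpy sketch (all data below are synthetic; the function name and shapes are illustrative assumptions, not a fixed API):

```python
import numpy as np

def emotion_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction from contrastive activations.

    pos_acts, neg_acts: (n_prompts, d_model) hidden states collected on
    prompts that do / do not express the target emotion.
    """
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

# Toy check with a planted direction (synthetic data):
rng = np.random.default_rng(0)
d = 64
true_dir = np.zeros(d)
true_dir[0] = 1.0
neg = rng.normal(size=(100, d))
pos = neg + 3.0 * true_dir  # "emotional" prompts shifted along the direction
v = emotion_vector(pos, neg)
print(abs(v @ true_dir))  # close to 1: the planted direction is recovered
```

The same extraction, run at each checkpoint, yields the per-checkpoint vectors whose drift and coupling the rest of this answer proposes to monitor.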
- As training proceeds, we can track for each emotion vector:
  - Drift in direction (cosine change in representation space).
  - Change in stability (variance of effects across tasks).
  - Change in modularity (how narrowly it affects safety metrics vs. broad behavior).
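These three per-checkpoint quantities could be computed along the following lines (a minimal numpy sketch; the function names, shapes, and the choice of a concentration index for modularity are illustrative assumptions):

```python
import numpy as np

def drift(v_ref: np.ndarray, v_now: np.ndarray) -> float:
    """Directional drift as 1 - cosine similarity to the reference vector."""
    cos = v_ref @ v_now / (np.linalg.norm(v_ref) * np.linalg.norm(v_now))
    return float(1.0 - cos)

def stability(task_effects: np.ndarray) -> float:
    """Variance of the vector's steering effect across tasks.

    task_effects: (n_tasks,) effect sizes on one behavior; lower = more stable.
    """
    return float(np.var(task_effects))

def modularity(metric_effects: np.ndarray) -> float:
    """Concentration of effect mass on few metrics (Herfindahl-style index).

    metric_effects: (n_metrics,) absolute steering effects. Returns 1.0 when
    the vector moves a single metric, approaching 1/n_metrics when its effect
    is spread uniformly (i.e., less modular).
    """
    p = np.abs(metric_effects) / np.abs(metric_effects).sum()
    return float((p ** 2).sum())

# Sanity checks on degenerate cases:
v = np.array([1.0, 0.0])
print(drift(v, v))                            # 0.0 for identical directions
print(modularity(np.array([1.0, 0.0, 0.0])))  # 1.0: effect on one metric only
```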
- Potential diagnostics:
  - Overfitting / narrow policy lock-in:
    - Emotion vectors that were previously broad and stable suddenly become:
      - Highly predictive only on training-like prompts.
      - Unstable or reversed on out-of-distribution or adversarial prompts.
    - Large late-stage drift in safety-relevant vectors without corresponding improvements on held-out safety metrics.
  - Policy collapse:
    - Abrupt loss of previously stable prosocial/caution vector effects (e.g., steering no longer changes refusal or de-escalation rates).
    - Convergence of multiple distinct emotion vectors onto a single behavior pattern (e.g., many directions now all just increase generic refusals), suggesting reduced diversity of control knobs.
  - Safety regression in later fine-tunes:
    - Reduced coupling between harm-salience / caution-like vectors and actual refusal or hedging behaviors.
    - Increased coupling between "eagerness to help at all costs" directions and boundary-pushing or overconfident answers.
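The "convergence onto a single behavior pattern" signal can be checked mechanically. A hedged sketch, assuming `profiles` holds each vector's measured effect across a fixed set of behavioral metrics (the representation is an assumption, not an established protocol):

```python
import numpy as np

def convergence(profiles: np.ndarray) -> float:
    """Mean pairwise cosine similarity of behavioral effect profiles.

    profiles: (n_vectors, n_metrics) steering effects of each emotion vector
    on each behavioral metric. A value near 1 means formerly distinct vectors
    now drive the same behavior, a candidate policy-collapse signal.
    """
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(profiles)
    return float(sims[~np.eye(n, dtype=bool)].mean())

# Degenerate cases: orthogonal profiles vs. fully collapsed profiles.
print(convergence(np.eye(3)))        # 0.0: fully distinct control knobs
print(convergence(np.ones((3, 4))))  # 1.0: all vectors do the same thing
```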
- How to use this in practice:
  - Fix a reference checkpoint (pre-safety or early-safety model).
  - Discover emotion vectors there (using contrastive data) and lock them.
  - Track these same directions across later checkpoints:
    - Measure drift, stability, modularity, and behavioral coupling.
  - Define simple indicators, e.g.:
    - "Caution-vector decoupling index": change in its effect on refusal/hedging between reference and current model.
    - "Emotion-bundle collapse index": drop in modularity / rise in cross-metric entanglement for key vectors.
  - Trigger audits when indicators cross thresholds, and compare to external safety evals.
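The decoupling index and its audit trigger could be wired up roughly as follows (a sketch only: the function names, the relative-change definition, and the 0.5 threshold are all assumptions to be tuned against external safety evals):

```python
def decoupling_index(effect_ref: float, effect_now: float) -> float:
    """Relative loss of a vector's behavioral coupling vs. the reference
    checkpoint, e.g. its steering effect on refusal rate."""
    return (effect_ref - effect_now) / max(abs(effect_ref), 1e-8)

# Hypothetical threshold; would need calibration against held-out safety evals.
CAUTION_DECOUPLING_THRESHOLD = 0.5

def should_audit(effect_ref: float, effect_now: float) -> bool:
    """Flag a checkpoint for audit when the caution vector has lost more
    than the threshold fraction of its reference-checkpoint effect."""
    return decoupling_index(effect_ref, effect_now) > CAUTION_DECOUPLING_THRESHOLD

print(should_audit(0.30, 0.25))  # small drop in coupling -> False
print(should_audit(0.30, 0.05))  # caution vector largely decoupled -> True
```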
- Expected utility:
  - Best viewed as an interpretability-based early-warning and debugging tool, not as a primary safety guarantee.
  - Likely more informative for mid-sized or lightly instruction-tuned models; effects may be weaker for heavily policy-dominated frontier models.