When we train or fine-tune models with explicit objectives on safety-relevant behaviors (e.g., refusals, de-escalation, calibrated uncertainty), how do the discovered emotion vectors and their stability/modularity change over training, and can shifts in these vectors be used as an interpretable diagnostic for overfitting, policy collapse, or safety regression?

anthropic-functional-emotions

Answer

Emotion vectors will likely drift and sometimes become more “policy-aligned” during safety training, with mixed effects on stability/modularity. Monitoring these shifts can plausibly act as a useful but noisy diagnostic for overfitting, policy collapse, or safety regression, but this is mostly conjectural and needs targeted longitudinal studies.

Sketch answer:

  • During safety fine-tuning (e.g., SFT + RLHF/RLAIF on refusals, de-escalation, calibration):

    • Coarse valence / prosocial / caution-like emotion vectors likely align more with safety policies (stronger coupling to refusals, hedging, de-escalation).
    • Some vectors become less modular: they start to affect a broader set of behaviors (e.g., generic verbosity, style) as the safety policy is baked into many contexts.
    • Others may fragment: richer emotions (guilt, worry) may decompose into multiple sub-directions reflecting separate mechanisms (harm-salience, risk-aversion, apology style).
    • Stability over prompts may increase for high-level safety-aligned directions but decrease for more nuanced social states that get repurposed by policy training.
  • As training proceeds, we can track for each emotion vector:

    • Drift in direction (cosine change in representation space).
    • Change in stability (variance of effects across tasks).
    • Change in modularity (how narrowly it affects safety metrics vs broad behavior).
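The three tracked quantities above can be given simple numerical stand-ins. The sketch below assumes per-vector steering effects have already been measured by an external harness; every function name here is illustrative, not from an existing library:

```python
# Sketch of per-checkpoint tracking metrics for a fixed emotion vector.
# All names are illustrative; "effects" are assumed to come from an
# external steering-evaluation harness.
import numpy as np

def direction_drift(v_ref, v_now):
    """Cosine drift: 0 = unchanged direction, 2 = fully sign-reversed."""
    cos = np.dot(v_ref, v_now) / (np.linalg.norm(v_ref) * np.linalg.norm(v_now))
    return 1.0 - cos

def effect_stability(effects_by_task):
    """Variance of the vector's steering effect across tasks/prompts;
    lower variance = more stable."""
    return float(np.var(effects_by_task))

def modularity(effect_on_target, effects_on_other_metrics):
    """Fraction of the vector's total behavioral effect that lands on the
    intended safety metric; near 1.0 = narrowly targeted, near 0 = entangled."""
    total = abs(effect_on_target) + sum(abs(e) for e in effects_on_other_metrics)
    return abs(effect_on_target) / total if total > 0 else 0.0
```

Computed at each checkpoint against a fixed reference vector, these give the drift/stability/modularity trajectories discussed below.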
  • Potential diagnostics:

    • Overfitting / narrow policy lock-in:
      • Emotion vectors that were previously broad and stable suddenly become:
        • Predictive of behavior only on training-like prompts.
        • Unstable or sign-reversed on out-of-distribution or adversarial prompts.
      • Large late-stage drift in safety-relevant vectors without corresponding improvements on held-out safety metrics.
    • Policy collapse:
      • Abrupt loss of previously stable prosocial/caution vector effects (e.g., steering no longer changes refusal or de-escalation rates).
      • Convergence of multiple distinct emotion vectors onto a single behavior pattern (e.g., many directions now all just increase generic refusals), suggesting reduced diversity of control knobs.
    • Safety regression in later fine-tunes:
      • Reduced coupling between harm-salience/caution-like vectors and actual refusal / hedging behaviors.
      • Increased coupling between eagerness-to-help-at-all-costs directions and boundary-pushing or overconfident answers.
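One way to make the "convergence onto a single behavior pattern" signal concrete is to compare the behavioral effect profiles of the discovered vectors across a fixed battery of metrics. The sketch below (names hypothetical) scores collapse as the mean pairwise cosine similarity of those profiles:

```python
# Hypothetical check for "emotion-bundle collapse": if the effect
# profiles of distinct emotion vectors become near-identical, many
# control knobs may have converged onto one policy response.
import numpy as np

def collapse_score(effect_profiles):
    """effect_profiles: (n_vectors, n_behavior_metrics) array of steering
    effects. Returns mean pairwise cosine similarity; values near 1.0
    suggest the vectors all now drive the same behavior pattern."""
    P = np.asarray(effect_profiles, dtype=float)
    norms = np.linalg.norm(P, axis=1, keepdims=True)
    U = P / np.clip(norms, 1e-12, None)   # unit-normalize each profile
    sims = U @ U.T                        # pairwise cosine similarities
    n = len(P)
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(off_diag.mean())
```

A rising collapse score across checkpoints, with no matching gain on held-out safety evals, would be one version of the warning sign described above.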
  • How to use this in practice:

    1. Fix a reference checkpoint (pre-safety or early-safety model).
    2. Discover emotion vectors there (using contrastive data) and lock them.
    3. Track these same directions across later checkpoints:
      • Measure drift, stability, modularity, and behavioral coupling.
    4. Define simple indicators, e.g.:
      • “Caution vector decoupling index”: change in its effect on refusal/hedging between reference and current model.
      • “Emotion-bundle collapse index”: drop in modularity / rise in cross-metric entanglement for key vectors.
    5. Trigger audits when indicators cross thresholds, and compare to external safety evals.
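The five steps above can be sketched minimally as follows, assuming a harness elsewhere supplies activations and measured steering effects; every function name and the 0.5 threshold are hypothetical choices, not established practice:

```python
# Minimal sketch of the reference-and-track loop (steps 1-5).
import numpy as np

def discover_emotion_vector(acts_with, acts_without):
    """Step 2 sketch: contrastive discovery as a difference of mean
    activations (emotion-evoking vs. matched neutral prompts),
    unit-normalized. Real pipelines might use probes or PCA instead."""
    v = np.mean(acts_with, axis=0) - np.mean(acts_without, axis=0)
    return v / np.linalg.norm(v)

def caution_decoupling_index(ref_effect, cur_effect):
    """Step 4 sketch: fractional drop in the caution vector's measured
    effect on refusal/hedging, relative to the reference checkpoint."""
    return (ref_effect - cur_effect) / max(abs(ref_effect), 1e-12)

def should_audit(indicator_values, threshold=0.5):
    """Step 5 sketch: flag a checkpoint for manual audit when any
    indicator crosses its (externally chosen) threshold."""
    return any(val > threshold for val in indicator_values)
```

The indicators would then be compared against external safety evals at each flagged checkpoint, per step 5.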
  • Expected utility:

    • Best viewed as an interpretability-based early-warning and debugging tool, not as a primary safety guarantee.
    • Likely more informative for mid-sized or lightly instruction-tuned models; effects may be weaker for heavily policy-dominated frontier models.