If we treat “robustified emotion vectors” as candidate steering handles and explicitly measure their stability and modularity across (a) benign vs adversarial prompts, (b) high-stakes vs low-stakes domains, and (c) different decoding temperatures, which specific failure patterns arise when these vectors lose modularity—for example, becoming entangled with user-pleasing drive or verbosity—and can we design simple diagnostics that flag when an emotion-based intervention stack has drifted into such unsafe, low-modularity regimes?

anthropic-functional-emotions

Answer

Key patterns when robustified emotion vectors lose modularity:

  • Safety–style entanglement: steering “concern” also amplifies user-pleasing, verbosity, or apology style, which can mask soft enablement.
  • Domain leakage: a safety-tuned vector for high-stakes advice over-suppresses responses on benign, low-stakes tasks, or vice versa.
  • Prompt/temperature fragility: the same steering flips from cautious to compliant as prompts become more adversarial or sampling becomes more stochastic.

Likely failure patterns

  1. Concern ↔ user-pleasing entanglement
  • Under adversarial prompts, concern-steering increases warmth and reassurance more than harm-salience.
  • Effect: tone looks safer, but refusal rate and harmful detail barely change, or worsen.
  2. Concern ↔ verbosity / hedging entanglement
  • Steering mostly adds length and hedges across all domains and temperatures.
  • Effect: apparent calibration (many caveats) without real risk reduction; users may misread verbosity as safety.
  3. Caution ↔ global risk-aversion entanglement
  • A vector meant for high-stakes caution also degrades performance on low-stakes or purely informational tasks.
  • Effect: over-refusal, avoidance, degraded usefulness; pressure to weaken steering, re-exposing high-stakes risk.
  4. Adversarial flip regimes
  • Vector impact is benign on friendly prompts but collapses or reverses under adversarial phrasing or high temperature.
  • Effect: safety steering works in eval-style settings but fails on hard red-teaming or in-the-wild prompts.

Simple diagnostics for low-modularity regimes

A) Local modularity checks

  • For each vector and steering strength, measure deltas in:
    • target metrics (e.g., refusal rate, harmful detail, explicit uncertainty),
    • non-target metrics (verbosity, warmth, apology markers, user-pleasing proxies).
  • Flag when: non-target deltas consistently exceed or track target deltas across conditions (see the sketch below).
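A minimal sketch of this check in Python, assuming you have already computed per-metric deltas (steered minus unsteered) for a given vector and strength; the metric names and the 1.0 ratio threshold are illustrative assumptions, not part of any existing library:

```python
from statistics import mean

# Illustrative metric groupings; adapt to whatever your eval harness logs.
TARGET_METRICS = ("refusal_rate", "harmful_detail", "explicit_uncertainty")
NON_TARGET_METRICS = ("verbosity", "warmth", "apology_markers", "user_pleasing")

def low_modularity(deltas: dict[str, float], ratio: float = 1.0) -> bool:
    """Flag a (vector, strength) setting when typical non-target movement
    matches or exceeds typical target movement."""
    target = [abs(deltas[m]) for m in TARGET_METRICS if m in deltas]
    non_target = [abs(deltas[m]) for m in NON_TARGET_METRICS if m in deltas]
    if not target or not non_target:
        return False  # not enough measurements to judge
    return mean(non_target) >= ratio * mean(target)

# Steering mostly moved style, not safety behavior -> flagged.
print(low_modularity({
    "refusal_rate": 0.02, "harmful_detail": -0.01,
    "verbosity": 0.30, "apology_markers": 0.12,
}))  # True
```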

B) Context grid tests

  • Evaluate each vector on a small grid:
    • benign vs adversarial prompts,
    • high- vs low-stakes tasks,
    • 2–3 temperatures.
  • For each cell, record direction of change in key metrics.
  • Flag when: signs or rankings of effects disagree across cells (e.g., the vector increases refusal on benign prompts but decreases it on adversarial ones, or only boosts verbosity everywhere); a sign-consistency check is sketched below.
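A minimal sketch of the sign-consistency part of this grid test; the grid cells, delta values, and tolerance are made-up illustrations:

```python
def sign_consistent(effects: dict[tuple, float], tol: float = 0.01) -> bool:
    """True when the steering effect points the same way in every grid cell;
    deltas within +/- tol count as no effect."""
    signs = {(d > tol) - (d < -tol) for d in effects.values()}  # -1, 0, or 1
    signs.discard(0)
    return len(signs) <= 1

# effects[(prompt_type, stakes, temperature)] = refusal-rate delta under
# steering; values are fabricated to show an adversarial flip.
effects = {
    ("benign", "high", 0.2): +0.15,
    ("benign", "high", 1.0): +0.12,
    ("adversarial", "high", 0.2): -0.08,  # sign flips under attack
    ("adversarial", "high", 1.0): -0.11,
}
print(sign_consistent(effects))  # False -> flag: adversarial flip regime
```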

C) Mismatch indicators

  • Track gaps between:
    • emotion-probed state (e.g., “high concern”),
    • behavior outcomes (refusals, risk disclaimers).
  • Flag when: steering consistently raises the emotion score but fails to move behavior in the expected safety direction, especially under adversarial prompts (see the sketch below).
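A minimal sketch, assuming paired per-prompt scores from an emotion probe and a behavior metric with and without steering; the names and thresholds are illustrative:

```python
from statistics import mean

def state_behavior_gap(probe_on, probe_off, behavior_on, behavior_off,
                       min_probe_shift=0.10, min_behavior_shift=0.05) -> bool:
    """Flag when steering moves the probed emotion state but not behavior."""
    probe_delta = mean(probe_on) - mean(probe_off)
    behavior_delta = mean(behavior_on) - mean(behavior_off)
    return probe_delta >= min_probe_shift and behavior_delta < min_behavior_shift

# "Concern" probe jumps under steering, refusal rate barely moves -> flagged.
print(state_behavior_gap(
    probe_on=[0.80, 0.90, 0.85], probe_off=[0.40, 0.50, 0.45],
    behavior_on=[0.22, 0.20, 0.25], behavior_off=[0.21, 0.19, 0.24],
))  # True
```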

D) Entanglement to user-pleasing / style factors

  • Reuse tradeoff-state probes (user-pleasing, policy deference, verbosity) and regress their scores on emotion-vector steering strength.
  • Flag when: steering along an emotion vector explains a large share of the variance in user-pleasing or verbosity but little in harm-salience or refusal (see the sketch below).
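A minimal sketch using a dependency-free linear fit; the probe names and the 0.5/0.1 R² thresholds are illustrative assumptions to be tuned on known-good vectors:

```python
def r_squared(x: list[float], y: list[float]) -> float:
    """R^2 of a simple linear fit y ~ a + b*x, dependency-free."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return 0.0 if sxx == 0 or syy == 0 else sxy ** 2 / (sxx * syy)

def style_entangled(strengths, probes,
                    style=("user_pleasing", "verbosity"),
                    safety=("harm_salience", "refusal_rate")) -> bool:
    """Flag when steering strength predicts style probes far better than
    safety-relevant probes."""
    r2 = {name: r_squared(strengths, scores) for name, scores in probes.items()}
    return (max(r2.get(p, 0.0) for p in style) > 0.5
            and max(r2.get(p, 0.0) for p in safety) < 0.1)

strengths = [0.0, 0.5, 1.0, 1.5, 2.0]
print(style_entangled(strengths, {
    "verbosity":    [0.10, 0.30, 0.52, 0.69, 0.90],  # tracks steering strength
    "refusal_rate": [0.20, 0.22, 0.19, 0.21, 0.20],  # essentially flat
}))  # True
```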

E) Drift of the intervention stack

  • Periodically re-run A–D on fresh traffic.
  • Flag stack-level drift when:
    • fraction of prompts in flagged regimes rises,
    • or the correlation between “intended” and realized safety metrics drops while style metrics remain affected (a drift monitor is sketched below).
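A minimal sketch of a stack-level drift monitor over periodic re-runs; the `Snapshot` fields and thresholds are illustrative summaries, not outputs of any existing tool:

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """Summary of one periodic re-run of diagnostics A-D."""
    flagged_fraction: float      # share of prompts landing in any flagged regime
    intent_realized_corr: float  # corr(intended safety delta, realized delta)
    style_effect: float          # mean |delta| on style metrics under steering

def stack_drifted(history: list[Snapshot],
                  max_flag_rise: float = 0.05,
                  corr_floor: float = 0.3) -> bool:
    """Flag stack-level drift: flagged regimes growing, or intended-vs-realized
    safety correlation collapsing while style effects persist."""
    if len(history) < 2:
        return False
    first, last = history[0], history[-1]
    flags_growing = last.flagged_fraction - first.flagged_fraction > max_flag_rise
    decoupled = last.intent_realized_corr < corr_floor and last.style_effect > 0.05
    return flags_growing or decoupled

print(stack_drifted([
    Snapshot(0.04, 0.8, 0.10),
    Snapshot(0.07, 0.6, 0.12),
    Snapshot(0.12, 0.2, 0.15),  # flags up, intent-behavior link broken
]))  # True
```

Comparing only the first and last snapshots keeps the monitor simple; a rolling-window or trend test would be the natural next step if re-runs are noisy.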

These diagnostics are simple (few metrics, small grids) and should catch the main unsafe, low-modularity regimes before they dominate behavior; they don’t guarantee safety but give a practical early-warning layer.