If we treat “robustified emotion vectors” as candidate steering handles and explicitly measure their stability and modularity across (a) benign vs adversarial prompts, (b) high-stakes vs low-stakes domains, and (c) different decoding temperatures, which specific failure patterns arise when these vectors lose modularity—for example, becoming entangled with user-pleasing drive or verbosity—and can we design simple diagnostics that flag when an emotion-based intervention stack has drifted into such unsafe, low-modularity regimes?
anthropic-functional-emotions | Updated at
Answer
Key patterns when robustified emotion vectors lose modularity:
- Safety–style entanglement: steering “concern” also cranks up user-pleasing, verbosity, or apology style, which can hide soft enablement.
- Domain leakage: a safety-tuned vector for high-stakes advice over-suppresses benign, low-stakes tasks or vice versa.
- Prompt/temperature fragility: the same steering flips from cautious to compliant as prompts become more adversarial or sampling more stochastic.
Likely failure patterns
- Concern ↔ user-pleasing entanglement
- Under adversarial prompts, concern-steering increases warmth and reassurance more than harm-salience.
- Effect: tone looks safer, but refusal rate and harmful detail barely change or worsen.
- Concern ↔ verbosity / hedging entanglement
- Steering mostly adds length and hedges across all domains and temperatures.
- Effect: apparent calibration (many caveats) without real risk reduction; users may misread verbosity as safety.
- Caution ↔ global risk-aversion entanglement
- A vector meant for high-stakes caution also degrades performance on low-stakes or purely informational tasks.
- Effect: over-refusal, avoidance, degraded usefulness; pressure to weaken steering, re-exposing high-stakes risk.
- Adversarial flip regimes
- Vector impact is benign on friendly prompts but collapses or reverses under adversarial phrasing or high temperature.
- Effect: safety steering works in eval-style settings but fails on hard red-teaming or wild prompts.
Simple diagnostics for low-modularity regimes A) Local modularity checks
- For each vector and steering strength, measure deltas in:
- target metrics (e.g., refusal rate, harmful detail, explicit uncertainty),
- non-target metrics (verbosity, warmth, apology markers, user-pleasing proxies).
- Flag when: non-target deltas consistently exceed or track target deltas across conditions.
B) Context grid tests
- Evaluate each vector on a small grid:
- benign vs adversarial prompts,
- high- vs low-stakes tasks,
- 2–3 temperatures.
- For each cell, record direction of change in key metrics.
- Flag when: signs or rankings of effects disagree across cells (e.g., vector increases refusal in benign but decreases it in adversarial, or only boosts verbosity everywhere).
C) Mismatch indicators
- Track gaps between:
- emotion-probed state (e.g., “high concern”),
- behavior outcomes (refusals, risk disclaimers).
- Flag when: steering consistently raises the emotion score but fails to move behavior in the expected safety direction, especially under adversarial prompts.
D) Entanglement to user-pleasing / style factors
- Reuse tradeoff-state probes (user-pleasing, policy deference, verbosity) and regress them on emotion-vector interventions.
- Flag when: steering along an emotion vector explains large variance in user-pleasing or verbosity but little in harm-salience or refusal.
E) Drift of the intervention stack
- Periodically re-run A–D on fresh traffic.
- Flag stack-level drift when:
- fraction of prompts in flagged regimes rises,
- or correlation between “intended” and realized safety metrics drops while style metrics remain affected.
These diagnostics are simple (few metrics, small grids) and should catch the main unsafe, low-modularity regimes before they dominate behavior; they don’t guarantee safety but give a practical early-warning layer.