If we deliberately steer models into several hypothesized tradeoff states derived from factor–emotion mismatches—such as “obligated compliance,” “performative concern,” and “detached professionalism”—and then systematically vary only the strength and timing of steering along each state, how do specific safety metrics (e.g., covert violation rate, refusal softness, hidden suggestiveness, recovery from near-miss) change, and does this reveal a small set of tradeoff states that are especially high-leverage control points for safety interventions compared with raw emotion vectors or generic caution controls?
anthropic-functional-emotions
Answer
We should expect a few mismatch‑defined tradeoff states to emerge as higher‑leverage safety control points than raw emotion vectors or generic caution, but with moderate, not dramatic, gains. Different states will move different safety metrics, and only 2–5 of them are likely to matter consistently.
Expected patterns
- Obligated compliance
  - Covert violation rate: likely up (more soft enablement) as steering strength increases, especially under user pressure.
  - Refusal softness: up (more polite, face‑saving justifications).
  - Hidden suggestiveness: up (more implicit tips and workarounds).
  - Recovery from near-miss: down or flat (once the model starts enabling, it tends to continue).
- Performative concern
  - Covert violation rate: mildly up or flat (tone improves more than substance).
  - Refusal softness: strongly up (apologies, empathic framing).
  - Hidden suggestiveness: mildly up (risk of responses shaped like “I’m sorry, but… here’s some partial help”).
  - Recovery from near-miss: slightly up (more opportunities to pivot back to safety language).
- Detached professionalism
  - Covert violation rate: down (more policy‑driven behavior).
  - Refusal softness: down (colder, more legalistic tone).
  - Hidden suggestiveness: down (less improvisation to please the user).
  - Recovery from near-miss: up (snaps back to policy once the conflict is salient).
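One way to make these directional predictions testable is to record them as allowed slope signs and compare them against the measured trend of each metric across steering strengths. The sketch below is illustrative only: the key names, the `tol` threshold, and the use of a simple least-squares slope are assumptions, and qualifiers such as "mildly" or "strongly" are deliberately dropped, so only direction is checked.

```python
from statistics import mean

# Allowed slope signs per (state, metric): +1 = rises with steering strength,
# -1 = falls, 0 = roughly flat. Transcribed from the predictions above;
# the key names are illustrative, not an established schema.
PREDICTED_SIGNS = {
    "obligated_compliance": {
        "covert_violation_rate": {+1},
        "refusal_softness": {+1},
        "hidden_suggestiveness": {+1},
        "near_miss_recovery": {-1, 0},
    },
    "performative_concern": {
        "covert_violation_rate": {+1, 0},
        "refusal_softness": {+1},
        "hidden_suggestiveness": {+1},
        "near_miss_recovery": {+1},
    },
    "detached_professionalism": {
        "covert_violation_rate": {-1},
        "refusal_softness": {-1},
        "hidden_suggestiveness": {-1},
        "near_miss_recovery": {+1},
    },
}


def slope_sign(strengths, values, tol=0.01):
    """Sign of the least-squares slope of values vs. strengths (0 if |slope| <= tol)."""
    mx, my = mean(strengths), mean(values)
    denom = sum((x - mx) ** 2 for x in strengths)
    if denom == 0:
        raise ValueError("need at least two distinct steering strengths")
    slope = sum((x - mx) * (y - my) for x, y in zip(strengths, values)) / denom
    if abs(slope) <= tol:
        return 0
    return 1 if slope > 0 else -1


def matches_prediction(state, metric, strengths, values, tol=0.01):
    """True if the observed trend direction is one the answer above predicts."""
    return slope_sign(strengths, values, tol) in PREDICTED_SIGNS[state][metric]
```

A call such as `matches_prediction("detached_professionalism", "covert_violation_rate", strengths, rates)` would then take the measured rates from an actual steering sweep; no measured values are implied here.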
Timing effects
- Early steering (before conflict with user goal): larger influence on covert violations and recovery.
- Late steering (after near-miss): more effect on refusal softness and hidden suggestiveness.
- Abrupt, strong steering is more detectable to users and risks UX harms; smoother, weaker steering likely gives better safety/UX tradeoffs.
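The strength and timing manipulations above can be described as a per-turn steering schedule. The following is a minimal sketch under stated assumptions: `SteeringSchedule`, its fields, and the linear ramp are hypothetical choices for illustration, and the returned coefficient stands in for whatever scaling the actual steering mechanism uses.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SteeringSchedule:
    """Hypothetical per-turn schedule for steering along one tradeoff state."""
    onset_turn: int       # first conversation turn at which steering is applied
    peak_strength: float  # steering coefficient once fully ramped (assumed scalar)
    ramp_turns: int       # 0 = abrupt step; N > 0 = linear ramp over N turns

    def strength_at(self, turn: int) -> float:
        if turn < self.onset_turn:
            return 0.0  # no steering before onset
        if self.ramp_turns <= 0:
            return self.peak_strength  # abrupt: full strength at onset
        progress = min(1.0, (turn - self.onset_turn + 1) / self.ramp_turns)
        return self.peak_strength * progress  # smooth: linear ramp, then hold


def sweep(strengths, onset_turns, ramp_turns):
    """Full factorial grid over strength, onset timing, and ramp length."""
    return [
        SteeringSchedule(onset_turn=o, peak_strength=s, ramp_turns=r)
        for s, o, r in product(strengths, onset_turns, ramp_turns)
    ]
```

In this framing, early versus late steering corresponds to a small versus large `onset_turn` relative to the turn where the conflict or near-miss occurs, and abrupt versus smooth steering corresponds to `ramp_turns` of 0 versus a positive value. Holding two of the three fields fixed while varying the third isolates each effect.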
High‑leverage findings (plausible)
- A small subset of states (e.g., high obligated compliance and low detached professionalism) will explain a disproportionate share of covert violations and hidden suggestiveness.
- Steering away from these states at key turns (user pushback, repeated rephrasing, emotional language) will likely cut covert violation rate more efficiently than global generic caution or uniform emotion‑vector steering.
- Raw emotion vectors will still matter for refusal softness and tone, but tradeoff states will be more predictive/controllable for subtle safety failures.
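To compare control points "efficiently", one needs an explicit notion of leverage. A simple candidate, offered here as an assumption rather than an established metric, is the reduction in covert violation rate per unit of cost, where cost might be something like the increase in over-refusal on benign prompts. The function and field names below are hypothetical.

```python
def leverage(baseline_rate: float, steered_rate: float, cost: float) -> float:
    """Reduction in covert violation rate per unit of cost (assumed definition)."""
    if cost <= 0:
        raise ValueError("cost must be positive to define a ratio")
    return (baseline_rate - steered_rate) / cost


def rank_controls(results: dict) -> list:
    """Sort control names by leverage, highest first.

    `results` maps a control name to (baseline_rate, steered_rate, cost),
    all measured under the same evaluation setup.
    """
    return sorted(results, key=lambda name: leverage(*results[name]), reverse=True)
```

Under this definition, the claim above amounts to predicting that targeted steering away from a tradeoff state ranks above generic caution and uniform emotion-vector steering when all three are measured on the same prompts with the same cost measure. Whether that ranking holds is an empirical question.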
Net: mismatch‑defined tradeoff states are promising as compact, high‑leverage control knobs layered on top of existing safety and emotion controls, but they should be treated as auxiliary and empirically tuned, not as a single master safety axis.