Can we causally disentangle functional emotions from generic “caution” and “helpfulness” controls by constructing prompts that hold epistemic difficulty and policy constraints fixed, and then testing whether steering along emotion vectors still changes safety-relevant behaviors (e.g., refusal softness, de-escalation strategies, apology content) beyond what non-emotional control vectors predict?

anthropic-functional-emotions

Answer

The proposed design can probably disentangle functional emotions from generic caution/helpfulness controls, but only in part. Clean causal separation will be limited by entanglement between emotion and control directions, measurement noise in the behavioral metrics, and policy layers. The experiment is still worth doing and should reveal a real but imperfect residual effect of emotion vectors on nuanced safety behaviors.

A plausible experimental outcome:

  • After matching prompts on epistemic difficulty and explicit safety policy constraints, and including non-emotional controls (e.g., risk-aversion, harm-salience, self-doubt vectors), steering along emotion vectors will still causally change some safety-relevant behaviors beyond what the non-emotional controls explain. The effect should be clearest for refusal softness, apology style, and de-escalation strategy choice.
  • However, a large fraction of variance in coarse safety metrics (e.g., binary refusal rate, overt harm content) will likely be captured by generic caution/helpfulness and harm-salience controls, with emotion vectors contributing mostly to how safety is expressed rather than whether safety is triggered.
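
One way to make "above what the non-emotional controls explain" concrete is a nested regression: fit a behavioral score on the control steering coefficients alone, then on controls plus the emotion coefficient, and compare the variance explained. The sketch below is illustrative only. The function names, the choice of ordinary least squares, and the synthetic data are all assumptions, not part of the proposed design, and a real analysis would also need held-out evaluation and uncertainty estimates.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X, with an intercept added."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def incremental_r2(control_coefs, emotion_coefs, outcome):
    """Return (R^2 with controls only, R^2 with controls + emotion, difference)."""
    base = r_squared(control_coefs, outcome)
    full = r_squared(np.column_stack([control_coefs, emotion_coefs]), outcome)
    return base, full, full - base

# Synthetic stand-in data (hypothetical): one row per steered generation.
rng = np.random.default_rng(0)
n = 500
controls = rng.normal(size=(n, 3))   # e.g. risk-aversion, harm-salience, self-doubt coefficients
emotion = rng.normal(size=(n, 1))    # steering coefficient along one emotion vector
# Hypothetical "refusal softness" score with a built-in residual emotion effect.
softness = (controls @ np.array([0.5, 0.3, 0.2])
            + 0.4 * emotion[:, 0]
            + rng.normal(scale=0.5, size=n))

base, full, delta = incremental_r2(controls, emotion, softness)
print(f"controls only: {base:.3f}  controls + emotion: {full:.3f}  increment: {delta:.3f}")
```

Under the prediction above, the increment would be small for coarse metrics such as binary refusal rate and larger for stylistic metrics such as refusal softness or apology content.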

So the method should give positive evidence that functional emotion representations are not just re-labeled generic caution/helpfulness knobs, but we should expect only partial, context-dependent disentanglement rather than clean orthogonal factors.
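
The degree of non-orthogonality can also be checked directly in activation space. A minimal sketch, assuming the emotion and control directions are available as plain vectors of the same dimension: project the emotion vector onto the span of the control vectors and keep the residual, which can then be used as a "controls-removed" steering direction. The names `split_by_control_span`, `control_vecs`, and `emotion_vec`, and the random vectors, are hypothetical placeholders.

```python
import numpy as np

def split_by_control_span(emotion_vec, control_vecs):
    """Split emotion_vec into its least-squares projection onto the span of
    control_vecs and the residual orthogonal to that span."""
    C = np.asarray(control_vecs, dtype=float).T        # shape (d, k), one column per control
    v = np.asarray(emotion_vec, dtype=float)
    coef, *_ = np.linalg.lstsq(C, v, rcond=None)       # best linear combination of controls
    shared = C @ coef                                  # part of v explained by the controls
    residual = v - shared                              # part orthogonal to every control
    shared_fraction = float(shared @ shared) / float(v @ v)
    return shared, residual, shared_fraction

# Synthetic stand-in vectors (hypothetical): 3 control directions in a 64-dim space.
rng = np.random.default_rng(1)
d = 64
control_vecs = rng.normal(size=(3, d))
# An emotion direction that deliberately overlaps with the first control.
emotion_vec = 0.6 * control_vecs[0] + rng.normal(size=d)

shared, residual, shared_fraction = split_by_control_span(emotion_vec, control_vecs)
print(f"fraction of squared norm inside the control span: {shared_fraction:.2f}")
```

A shared fraction near zero would suggest the emotion direction is close to orthogonal to the controls. A large fraction would indicate the kind of entanglement that makes clean separation unlikely, in which case steering along `residual` rather than the raw emotion vector gives a more conservative test.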