In agentic assistants that can take real-world actions, does showing a short, standardized ‘refusal recipe’ (which names the chain-of-command layer, the specific hard rule, and any user-default that was consulted) reduce users’ tendency to search for exploit-like prompts or workarounds when an action is blocked, compared to a simple one-line refusal message with the same constraints?

legible-model-behavior

Answer

Showing a short, standardized refusal recipe is likely to modestly reduce users’ tendency to hunt for exploit-like prompts and workarounds compared to a simple one-line refusal, as long as the recipe is concise, consistently structured, and clearly attributes the block to specific non-overridable layers rather than model whim.

Reasoning and expected effects (vs. one-line refusal with identical constraints):

  • Reduced exploit-like probing

    • When the refusal names the chain-of-command layer and the concrete hard rule responsible for the block, users are more likely to treat the boundary as fixed rather than as something they can talk the model out of. This parallels how visible chains of command and labeled rules shift blame and reduce "trial-and-error" override attempts (claims c51, c52, c56, c57).
    • Explicitly calling out any consulted user-default (“using your ‘low-risk by default’ setting”) channels users toward changing legitimate preferences rather than searching for prompt exploits, similar to the effect of policy simulators and editable ambiguity profiles in steering users to sanctioned controls (c33, c34, c35, c52).
  • Mechanism: Legible, non-negotiable boundaries

    • A standardized recipe clarifies which parts of the decision are non-editable (hard rules tied to higher layers in the chain of command) and which are user-tunable defaults, echoing the benefits of separating hard rules from preferences in behavior-policy UIs (c1, c2, c3, c5).
    • This reduces the perception that different wording might succeed, because the refusal is framed as rule-following rather than as a fragile outcome of natural-language phrasing.
  • Conditions for benefit

    • The recipe must be short (e.g., 2–3 labeled lines) and templated, so it doesn’t feel like a lecture or invite users to litigate every refusal (aligning with guidance on concise justifications for override rejections and ambiguity steps, c56, c57, c58).
    • It should re-use the same visible chain-of-command labels and rule names the user sees elsewhere in the behavior policy, reinforcing a coherent mental model instead of introducing new jargon (consistent with c18, c19, c20, c21, c22, c23, c28–c32, c38–c42).
  • Failure modes

    • If the refusal recipe suggests that changing a user-default could bypass what is actually a higher-layer hard rule, it can increase workaround attempts and frustration, similar to the backfire cases for misleading override controls and simulators (c3, c4, c35, c36, c37, c54).
    • Overly detailed or inconsistent recipes can turn refusals into puzzles, prompting more probing instead of less.
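As a concrete illustration of the 2–3 line template described above, the refusal recipe could be rendered from a small, fixed structure that keeps hard rules and editable defaults visually distinct. The sketch below is purely hypothetical: the `RefusalContext` fields and `render_refusal_recipe` name are assumptions for illustration, not part of any real assistant API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RefusalContext:
    """Hypothetical inputs to a standardized refusal recipe."""
    layer: str                          # chain-of-command layer that blocked the action
    hard_rule: str                      # the specific non-overridable rule
    user_default: Optional[str] = None  # user-editable default consulted, if any

def render_refusal_recipe(ctx: RefusalContext) -> str:
    """Render a short, templated refusal: fixed rule first, tunable default last."""
    lines = [
        f"Blocked by: {ctx.layer} (non-overridable)",
        f"Rule: {ctx.hard_rule}",
    ]
    if ctx.user_default:
        # Point users at the sanctioned control instead of prompt workarounds.
        lines.append(f"Consulted setting: {ctx.user_default} (editable in preferences)")
    return "\n".join(lines)

print(render_refusal_recipe(RefusalContext(
    layer="platform policy",
    hard_rule="no irreversible financial transfers without confirmation",
    user_default="'low-risk by default'",
)))
```

Keeping the labels ("Blocked by", "Rule", "Consulted setting") identical across every refusal is what makes the template feel like rule-following rather than negotiable phrasing; only the values vary.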

Overall, refusal recipes are expected to somewhat reduce exploit-like prompt searching and rephrasing, provided they are designed as concise, stable templates that attribute blocked actions to specific hard rules in the visible chain of command and clearly distinguish those rules from user-editable defaults. The mechanism is a shift in user behavior toward sanctioned preference changes and away from adversarial experimentation.