If an assistant explicitly marks certain refusals as caused by ambiguity-resolution rules (e.g., conflicting instructions, unclear scope) versus hard side-effect controls (e.g., data exfiltration limits), how does this labeling affect which follow-up strategies users choose (rephrasing, changing defaults, requesting exceptions) and the rate of repeated, ineffective override attempts?

legible-model-behavior

Answer

Explicitly labeling each refusal as driven by either ambiguity‑resolution rules or hard side‑effect controls tends to (a) steer users toward more appropriate follow‑up strategies and (b) reduce repeated, ineffective override attempts—especially the “wrong kind” of repair, like requesting exceptions when the problem is ambiguity, or endlessly rephrasing when a hard limit applies.

Effects on follow-up strategy choice

  • When a refusal is labeled as due to ambiguity‑resolution rules (e.g., “I’m refusing because your instructions conflict / scope is unclear”):

    • Users are more likely to rephrase or clarify instructions, specify scope, or adjust priorities between conflicting goals.
    • They are less likely to jump immediately to requesting policy exceptions or relaxing side‑effect limits, because the label points to interpretation, not safety policy, as the core issue.
    • Some users will also adjust local defaults about interpretation (e.g., which project or profile to apply) instead of assuming the system is blocked by a global safety rule.
  • When a refusal is labeled as due to a hard side‑effect control (e.g., “blocked by org data‑exfiltration limit”):

    • Users are more likely to change scope (narrow folders, smaller amounts, different recipients) or request structured exceptions where those are supported, aligning with patterns from visible side‑effect controls and exception mechanisms (c191, c192, c193, c194).
    • They become less likely to engage in pure rephrasing attempts, because the label makes it clear that interpretation isn’t the bottleneck.
    • Users are more inclined to inspect or edit relevant behavioral defaults under that hard rule (e.g., risk posture within a project), rather than trying to override non‑negotiable constraints.

Net effect: labeling cleanly separates “fix the request / context” from “negotiate scope under a hard limit,” so follow‑up strategies align better with the actual cause of the refusal, consistent with the broader benefits of separating ambiguity rules from side‑effect controls (c247, c248, c249, c250).
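One way to make this separation concrete is to attach the cause label to the refusal payload itself and derive the suggested follow-ups from it. The sketch below is a minimal illustration, not a real API: the cause names, rule IDs, and follow-up action strings are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class RefusalCause(Enum):
    AMBIGUITY = "ambiguity_resolution"   # conflicting instructions, unclear scope
    HARD_LIMIT = "side_effect_control"   # e.g., data-exfiltration limits

# Hypothetical mapping from cause to the follow-up strategies worth suggesting.
SUGGESTED_FOLLOW_UPS = {
    RefusalCause.AMBIGUITY: ["clarify_instructions", "specify_scope", "adjust_priorities"],
    RefusalCause.HARD_LIMIT: ["narrow_scope", "request_exception", "review_defaults"],
}

@dataclass
class LabeledRefusal:
    cause: RefusalCause
    rule_id: str    # stable identifier, reused across the policy UI and messages
    message: str    # human-readable explanation shown to the user

    def suggested_follow_ups(self) -> list[str]:
        return SUGGESTED_FOLLOW_UPS[self.cause]

refusal = LabeledRefusal(
    cause=RefusalCause.HARD_LIMIT,
    rule_id="org.data_exfiltration.v1",
    message="Blocked by org data-exfiltration limit.",
)
print(refusal.suggested_follow_ups())
# ['narrow_scope', 'request_exception', 'review_defaults']
```

Keying the suggestions off the cause, rather than off the specific rule, keeps the labels simple and consistent across rules, which is what lets users learn the cause-to-repair mapping in the first place.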

Effects on repeated, ineffective override attempts

  • Fewer misdirected retries:
    • Ambiguity‑labeled refusals reduce repeated, ineffective requests for exceptions or policy changes when the real fix is clarification.
    • Side‑effect‑control labels reduce cycles of rephrasing the same unsafe request, because the user can see that language changes won’t alter the governing limit.
  • Lower overall rate of scattershot, trial‑and‑error override behavior:
    • Users are less likely to treat every refusal as a generic safety wall and try random combinations of rephrasing, setting tweaks, and override requests.
    • Instead, they tend to make one or two targeted adjustments that match the labeled cause, and then either succeed or accept the limit, echoing patterns seen when rule layers are made legible in other settings (c243, c244, c245, c246; c251, c252, c253, c254, c255).
  • Residual edge cases:
    • If labels are too technical or inconsistently applied, users may ignore them and fall back to trial‑and‑error, partially eroding these gains.
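If refusal events are logged with their cause label and the user's next action, the rate of misdirected retries can be measured directly. The snippet below is a sketch under assumed names: the cause strings and the mapping of which actions count as "matched" repairs are illustrative, not drawn from any real system.

```python
from collections import Counter

# Hypothetical mapping: which follow-up actions actually address each labeled cause.
EFFECTIVE_ACTIONS = {
    "ambiguity_resolution": {"rephrase", "clarify_scope", "adjust_priorities"},
    "side_effect_control": {"narrow_scope", "request_exception", "edit_defaults"},
}

def classify_retries(events):
    """Tally matched vs. mismatched follow-ups per labeled refusal cause.

    `events` is an iterable of (cause, action) pairs, where `cause` is the
    label shown on the refusal and `action` is the user's next move.
    """
    tallies = Counter()
    for cause, action in events:
        key = "matched" if action in EFFECTIVE_ACTIONS[cause] else "mismatched"
        tallies[(cause, key)] += 1
    return tallies

log = [
    ("ambiguity_resolution", "rephrase"),           # right kind of repair
    ("ambiguity_resolution", "request_exception"),  # wrong kind of repair
    ("side_effect_control", "rephrase"),            # ineffective rephrasing
    ("side_effect_control", "narrow_scope"),        # right kind of repair
]
print(classify_retries(log))
```

A falling mismatched-to-matched ratio after labels are introduced would be one concrete signal that the labeling is channeling users toward the right repair strategy.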

Overall, clearly tagging refusals as stemming from ambiguity‑resolution rules versus hard side‑effect controls channels users toward the right repair strategy (clarify vs. rescope vs. request an exception) and measurably lowers the incidence of repeated, ineffective override attempts, as long as the labels are simple, consistent, and reused across the behavior policy UI and refusal messages.