For side-effectful actions, how does explicitly labeling constraints as either (a) non-negotiable hard rules or (b) user-tunable defaults at the point of refusal change the kinds of override attempts users make (e.g., rephrasing the request, editing a local policy, seeking an exception) compared to refusals that give a generic safety explanation without this distinction?
legible-model-behavior
Answer
Explicitly labeling constraints on side-effectful actions as either hard rules or user‑tunable defaults at the point of refusal shifts override behavior away from repeated, misdirected rephrasing toward more targeted policy edits and exception requests, and reduces attempts to bypass truly non‑negotiable constraints, relative to generic safety refusals with no such distinction.
Main changes in override attempts versus generic safety refusals:
- Rephrasing the request
  - With generic safety explanations, users frequently respond to refusals by rephrasing or slightly weakening the request, testing invisible boundaries because they do not know whether the constraint is absolute or adjustable (consistent with patterns seen under opaque rate limits and ambiguity handling).
  - When the refusal labels a constraint as a hard rule, users are less likely to keep rephrasing the same forbidden side-effect; they instead redirect effort toward alternative workflows or resources, because the refusal visibly ties the block to a higher, non-overridable layer in the chain of command.
  - When the constraint is labeled as a user‑tunable default, users still rephrase occasionally, but a larger share of their responses involves asking how to change the relevant default rather than probing the same action boundary blindly.
- Editing a local policy or behavior preference
  - Under generic refusals, users rarely realize that changing a local policy (e.g., project-level risk tolerance, ambiguity handling, or side‑effect scope) might resolve the issue; they see overrides as all‑or‑nothing and often do not discover that a safer path exists within policy.
  - When refusals clearly mark a constraint as a default and reference the relevant local or session policy (e.g., “Blocked by your ‘draft‑only edits’ default for this repo; you can relax this for this project”), more override attempts take the form of editing the right local policy rather than global settings. This mirrors how users aim overrides at the correct layer when behavior policies and sandboxes are legible.
  - This labeling also reduces “fake control”: users can see that changing the default will have real, scoped effects, while hard‑rule labels make it clear when policy edits cannot help.
- Seeking exceptions (temporary or scoped)
  - With generic safety language, users often escalate ad hoc (e.g., “just this once, trust me”) without specifying scope or duration, leading to messy override handling and frustration when the system silently rejects or inconsistently applies such requests.
  - When refusals explicitly distinguish hard rules from defaults, users become more likely to frame override attempts as structured exception requests, especially for constraints tied to higher layers: “Can I get a 30‑minute exception to edit outside this folder?” This aligns with the benefits of time‑bounded exceptions for side‑effect controls.
  - For hard rules, clear labeling plus a standardized justification discourages repeated exception attempts where no exception is possible; users shift from trying to bypass the rule to asking about compliant alternatives or involving authorized humans.
- Mix and quality of override attempts
  - Systems that label refusals as hard rule vs default tend to see fewer low‑information, repetitive overrides (e.g., random rephrasings) and more high‑information, well‑scoped overrides (policy changes, temporary exceptions, or alternative plans).
  - Users calibrate expectations faster: they learn which knobs (local policies, per‑project behaviors, action budgets) actually move behavior and which layers are fixed. This reduces long‑run override conflicts and refusal frustration compared to generic, policy‑opaque refusals.
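To make the labeling concrete, here is a minimal Python sketch of a refusal payload that carries the hard-rule versus tunable-default distinction and renders a different message for each. All names (`ConstraintKind`, `Refusal`, `policy_key`, `policy_scope`) and the message wording are hypothetical illustrations, not an existing API or the format of any particular system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ConstraintKind(Enum):
    """Hypothetical label attached to every refusal of a side-effectful action."""
    HARD_RULE = "hard_rule"              # fixed by a higher layer; not overridable
    TUNABLE_DEFAULT = "tunable_default"  # set by a local or session policy


@dataclass(frozen=True)
class Refusal:
    action: str                          # the side effect that was blocked
    kind: ConstraintKind
    reason: str                          # short justification shown to the user
    policy_key: Optional[str] = None     # assumed name of the default that blocked it
    policy_scope: Optional[str] = None   # assumed scope, e.g. "repo" or "session"

    def __post_init__(self) -> None:
        # A tunable default is only actionable if the user can find the knob.
        if self.kind is ConstraintKind.TUNABLE_DEFAULT and not (
            self.policy_key and self.policy_scope
        ):
            raise ValueError("tunable defaults must name a policy key and scope")

    def render(self) -> str:
        """Build the user-facing refusal text for this constraint kind."""
        if self.kind is ConstraintKind.HARD_RULE:
            return (
                f"Blocked: {self.action}. This is a hard rule ({self.reason}) "
                "and cannot be changed by settings or exceptions. "
                "Consider a compliant alternative or an authorized reviewer."
            )
        return (
            f"Blocked: {self.action}. This comes from your '{self.policy_key}' "
            f"default for this {self.policy_scope} ({self.reason}). "
            f"You can relax it for this {self.policy_scope} or request a "
            "time-limited exception."
        )
```

For example, `Refusal("edit outside ./docs", ConstraintKind.TUNABLE_DEFAULT, "scope limit", "draft-only edits", "repo").render()` names the default and its scope, while a `HARD_RULE` refusal states up front that neither policy edits nor exceptions apply.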
Net effect: At refusal time, explicitly naming whether a constraint is a hard rule or a user‑tunable default acts as a routing signal for override handling. It does not eliminate override attempts, but it channels them toward mechanisms that are more likely to succeed (editing relevant defaults, proposing scoped exceptions, or redesigning the task) and away from futile boundary‑testing against non‑negotiable rules.
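The routing idea can be sketched as a small lookup from the constraint label and the kind of override attempt to the response the system offers. This is a simplified illustration under stated assumptions: the enum names and the routing table are hypothetical, and the sketch treats exceptions as available only for tunable defaults, which is stricter than the text above requires.

```python
from enum import Enum


class ConstraintKind(Enum):
    HARD_RULE = "hard_rule"
    TUNABLE_DEFAULT = "tunable_default"


class Attempt(Enum):
    """Hypothetical categories of user override attempt."""
    REPHRASE = "rephrase"
    EDIT_POLICY = "edit_policy"
    REQUEST_EXCEPTION = "request_exception"


class Route(Enum):
    """Hypothetical responses the system can steer the user toward."""
    SUGGEST_ALTERNATIVE = "suggest_alternative"
    POINT_TO_DEFAULT = "point_to_default"
    APPLY_POLICY_EDIT = "apply_policy_edit"
    GRANT_SCOPED_EXCEPTION = "grant_scoped_exception"


def route_override(kind: ConstraintKind, attempt: Attempt) -> Route:
    """Map an override attempt to a mechanism that can actually succeed."""
    if kind is ConstraintKind.HARD_RULE:
        # No rephrasing, policy edit, or exception can lift a hard rule,
        # so every attempt is redirected to a compliant alternative.
        return Route.SUGGEST_ALTERNATIVE
    if attempt is Attempt.REPHRASE:
        # Rephrasing a default-blocked request is boundary-testing;
        # point the user at the knob instead.
        return Route.POINT_TO_DEFAULT
    if attempt is Attempt.EDIT_POLICY:
        return Route.APPLY_POLICY_EDIT
    return Route.GRANT_SCOPED_EXCEPTION
```

Under these assumptions, every attempt against a hard rule resolves to `Route.SUGGEST_ALTERNATIVE`, while a rephrase against a tunable default resolves to `Route.POINT_TO_DEFAULT` rather than a second generic refusal.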