If a refusal cites the chain of command (e.g., ‘org hard rule’) but the user-visible behavior policy appears to allow the refused action (e.g., a permissive default in the UI), how do different mismatch-handling strategies—immediate policy-UI correction, explicit contradiction warnings, or silent backend reconciliation—affect user trust and later willingness to accept refusals?

legible-model-behavior | Updated at 2026-04-06 18:41

Answer

Immediate policy-UI correction generally preserves the most trust and the most willingness to accept later refusals. Explicit contradiction warnings help in high-stakes or repeated cases but can feel bureaucratic if overused. Silent backend reconciliation tends to erode trust and reduce acceptance of later refusals, because it makes the visible behavior policy feel fake or deceptive.

Comparative effects:

Immediate policy-UI correction
- Trust: Highest of the three, because the system quickly realigns the user-visible behavior policy with the actual chain of command and hard rules. Users see the mismatch as a fixable UI bug rather than a bait-and-switch.
- Later refusals: More acceptable, especially if the correction is coupled with a brief, standardized note at the next refusal (e.g., “This option was incorrectly shown as allowed; org hard rule X applies here”). Users then attribute the refusal to a stable hard rule, similar to other cases where visible layering clarifies constraints.
- Best use: Default strategy whenever the mismatch is structural (e.g., a default that will never actually be honored under current org rules).

Explicit contradiction warnings (e.g., “Your setting says ‘allow,’ but an org hard rule currently blocks this action”)
- Trust: Moderately high and often higher than silent reconciliation, because the system names the conflict and anchors it in the chain of command. Users learn that defaults are subordinate to hard rules, which fits with other evidence that labeled rule layers improve fairness perceptions.
- Later refusals: More acceptable than under silence, since users have a precedent: they’ve seen that some “allow” defaults are aspirations bounded by higher-layer rules. However, repeated, verbose warnings can feel like friction or scolding.
- Best use: For user-initiated, one-off attempts where the UI can’t be safely changed globally yet (e.g., experimental org policy, partial rollout), or when a mismatch is temporary.

Silent backend reconciliation (refusal cites an org hard rule, but the UI/behavior policy continues to suggest the action is allowed)
- Trust: Lowest, especially over multiple encounters. Users experience the assistant as inconsistent or deceptive: the visible defaults and policy say one thing, refusals do another. This recreates the “fake control” pattern seen in other override and exception flows.
- Later refusals: Less acceptable; users are more likely to keep retrying or escalate (“you said this was allowed”), and to generalize mistrust to other parts of the behavior policy and chain of command.
- Best use: Defensible only in rare emergency or one-off cases, and even then it should be followed by rapid UI/policy correction.

Practical pattern:

- Prefer immediate policy-UI correction as the baseline, updating both the visible default and its labels (e.g., clarifying that it is subordinate to an org rule or narrowing its scope).
- Where correction can’t be applied instantly, use explicit, short contradiction warnings tied to the chain of command and hard-rule labels, so users understand why the refusal overrides the apparent default.
- Avoid purely silent backend reconciliation; repeated exposure to mismatches between refusals and the user-visible behavior policy reliably undermines both trust and willingness to accept future refusals.
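
The preference order above can be read as a small decision procedure. The following is a minimal Python sketch of that reading, not a description of any real system: the names `Mismatch`, `Strategy`, `choose_strategy`, `refusal_note`, and `needs_followup_correction`, as well as the `ui_correctable_now` and `emergency` flags, are hypothetical and introduced only for illustration. The message wording reuses the example phrasings given earlier in this answer.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    CORRECT_UI = "immediate policy-UI correction"
    WARN = "explicit contradiction warning"
    SILENT = "silent backend reconciliation"


@dataclass
class Mismatch:
    """Hypothetical model: a visible default that an org hard rule overrides."""
    setting_label: str        # what the UI currently shows, e.g. "allow"
    hard_rule_id: str         # label of the blocking org hard rule
    ui_correctable_now: bool  # can the visible policy be fixed right away?
    emergency: bool = False   # rare case where nothing can be surfaced yet


def choose_strategy(m: Mismatch) -> Strategy:
    """Apply the preference order: correct the UI, else warn, else stay silent."""
    if m.ui_correctable_now:
        return Strategy.CORRECT_UI
    if not m.emergency:
        return Strategy.WARN
    return Strategy.SILENT  # last resort only


def refusal_note(m: Mismatch, strategy: Strategy) -> str:
    """Build the short, standardized note shown with the refusal."""
    if strategy is Strategy.CORRECT_UI:
        return (f"This option was incorrectly shown as allowed; "
                f"org hard rule {m.hard_rule_id} applies here.")
    if strategy is Strategy.WARN:
        return (f"Your setting says '{m.setting_label}', but org hard rule "
                f"{m.hard_rule_id} currently blocks this action.")
    # Silent reconciliation: the refusal cites the rule, the UI stays unchanged.
    return f"Org hard rule {m.hard_rule_id} blocks this action."


def needs_followup_correction(strategy: Strategy) -> bool:
    """Silent reconciliation should be followed by rapid UI/policy correction."""
    return strategy is Strategy.SILENT
```

For example, `choose_strategy(Mismatch("allow", "X", ui_correctable_now=False))` selects the warning path, and `refusal_note` then names both the visible setting and the overriding rule. Keeping the silent branch as an explicit last resort, paired with a follow-up flag, mirrors the point that it is defensible only when followed by a prompt correction.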