In systems with user-visible chains of command and per-decision policy traces, does adding a lightweight ‘authorship and approval trail’ for each overrideable default (showing who created or last changed it and under which rule layer) reduce users’ tendency to treat refusals as the assistant’s personal choice, and does this in turn measurably lower attempts to bypass side-effect controls compared with provenance-free traces?
Answer
Adding a lightweight authorship and approval trail for each overrideable default is likely to (a) reduce users’ tendency to treat refusals as the assistant’s personal choice and (b) modestly lower attempts to bypass side-effect controls, relative to provenance-free traces. This holds only if the trail is simple, stable, and explicitly tied into refusal explanations. If the trail is noisy, inconsistent with observed behavior, or suggests editable control where there is none, it can backfire and increase frustration and bypass attempts.
Net effect (conditional):
- With clean, consistent authorship trails, expect a small-to-moderate reduction in personalization of blame and in direct attempts to bypass side-effect controls.
- With messy or misleading trails, expect no benefit or a slight worsening of both blame and bypass attempts compared to provenance-free traces.
Mechanism sketch:
- Authorship trails shift attribution from “the assistant chose to refuse me” toward “my manager/org configured this default, and the assistant is following it under the Org/Manager rule layer.” This mirrors prior findings that origin-labeled defaults and visible chains of command improve perceived procedural fairness and acceptance when higher layers override user wishes.
- When overrides fail, users with clear authorship information are more likely to redirect efforts toward the appropriate human/configuration channel (e.g., manager, admin) or toward legitimate scoped exceptions, and less likely to probe side-effect boundaries as if the assistant were being arbitrarily stubborn.
However, the effect is probably modest beyond what a user-visible chain of command and per-decision policy trace already provide. Traces already explain “which rules fired”; authorship trails explain “who set this tunable piece of that rule/default.” The additional clarity mostly helps with contentious or politically sensitive defaults, and only for users who actually inspect provenance.
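To make the distinction between a trace entry and its authorship trail concrete, here is a minimal Python sketch. It is illustrative only: the names `Provenance`, `TraceEntry`, `explain_refusal`, and every field are assumptions introduced here, not part of any existing system. The sketch shows how the same refusal could be explained with or without provenance, and how the explanation avoids implying editable control when the user cannot override the default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Hypothetical authorship-and-approval record for one overrideable default."""
    last_changed_by: str     # role or person who last changed the default
    approved_by: str         # who approved that change
    rule_layer: str          # e.g. "Org", "Manager", "User"
    escalation_contact: str  # legitimate channel for requesting a change


@dataclass(frozen=True)
class TraceEntry:
    """One fired rule in a per-decision policy trace (hypothetical shape)."""
    rule_id: str
    description: str
    user_can_override: bool
    provenance: Optional[Provenance] = None  # None models a provenance-free trace


def explain_refusal(entry: TraceEntry) -> str:
    """Build a refusal explanation; adds authorship details when available."""
    lines = [f"Blocked by rule '{entry.rule_id}': {entry.description}."]
    prov = entry.provenance
    if prov is not None:
        # Attribute the default to its rule layer and author, not the assistant.
        lines.append(
            f"Set under the {prov.rule_layer} layer; last changed by "
            f"{prov.last_changed_by}, approved by {prov.approved_by}."
        )
    if entry.user_can_override:
        lines.append("You can override this default for the current task.")
    elif prov is not None:
        # Point to the human/configuration channel instead of implying control.
        lines.append(f"You cannot change this here; contact {prov.escalation_contact}.")
    else:
        lines.append("You cannot change this here.")
    return "\n".join(lines)


if __name__ == "__main__":
    with_trail = TraceEntry(
        rule_id="no-external-email",
        description="sending email outside the organization is off by default",
        user_can_override=False,
        provenance=Provenance("it-admin", "security-lead", "Org", "your workspace admin"),
    )
    print(explain_refusal(with_trail))
```

Keeping `provenance` optional means the provenance-free condition and the authorship-trail condition share one code path, which would make a side-by-side comparison of the two trace styles straightforward under these assumptions.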