When an assistant exposes a compact, always-visible summary of its current behavior state (e.g., active chain-of-command layer, key hard rules currently binding, and any task-scoped policies in effect), does this persistent ‘policy status bar’ reduce user misattribution of failures to the model vs. to constraints, compared to only showing explanations at moments of refusal or override?

legible-model-behavior | Updated at 2026-04-06 18:21

Answer

A well-designed, compact, always-visible “policy status bar” is likely to modestly reduce user misattribution of failures to the model (vs. to constraints) compared to showing explanations only at refusal/override moments, but only if it reuses the existing legible behavior policy vocabulary, stays low-noise, and surfaces state changes rather than a dense rule list. Poor designs (busy, unstable, or introducing new labels) can increase confusion and perceived bureaucracy.

Expected directional effect vs. explanations-only-at-refusals:

Misattribution of failures
- Reduction, because users are continuously reminded which chain-of-command layer and hard rules are currently binding, so when an action fails or is degraded they have a salient alternative explanation (“org data rule active” / “project policy: draft-only”) instead of defaulting to “the model is dumb/broken”.
- The effect is strongest when the status bar shows a few, stable tokens like: current org/system mode, the presence of any active task-scoped policy, and whether special side-effect controls or budgets are in force—mirroring the layers used in refusals, policy simulators, and action budgets.
- The incremental benefit over refusal-time explanations is moderate, not huge: refusal-time rule traces and simulators already redirect blame at the moment of conflict; the status bar mainly helps in cases where users experience silent or partial failures (degraded actions, throttling, conservative ambiguity handling) that never trigger a full refusal explanation.

Design constraints for benefit:

The status bar should:
- Reuse the same chain-of-command and hard-rules-vs-defaults labels as the legible behavior policy, task policies, and rule traces (no new unexplained categories).
- Highlight state changes (e.g., a temporary exception starting/ending, entering a task-scoped policy, hitting a budget band) with brief, optional details, instead of streaming rule minutiae.
- Be compact and visually secondary, with drill-down available on demand; otherwise it risks becoming perceived bureaucracy and cognitive noise.

Failure modes:

If the status bar is dense, unstable (frequently flickering labels), or uses opaque codes, users may:
- Ignore it entirely and continue to misattribute failures to the model.
- Or worse, see it as proof that “the system is arbitrarily changing rules all the time,” which can increase misattributed blame and reduce trust.

Under these conditions, a persistent policy status bar is a useful complement to moment-of-refusal explanations: it fills gaps where failures or constraints are subtle or non-blocking, and it reinforces the habit of attributing outcomes to visible constraints instead of to mysterious model behavior.