When assistants present a single, stable legible behavior policy but silently adjust internal thresholds (e.g., for side-effect controls or ambiguity budgets) based on observed user risk tolerance, does revealing a compact history of these auto-tunings (e.g., “your ambiguity budget for low‑risk tasks was relaxed twice in the last week based on your confirmations”) improve perceived fairness and override handling, or does it instead undermine trust by making the policy feel less fixed and dependable?

legible-model-behavior

Answer

Revealing a compact history of auto-tunings probably improves perceived fairness and override handling when the history is coarse, clearly bounded by hard rules, and integrated into the assistant's existing explanations. It will undermine trust if it suggests that core rules are drifting, or if the tuning logic feels opaque or jittery.

In practice, small, well-framed tuning histories tend to help for low/medium‑risk defaults (like ambiguity budgets) and side‑effect controls, while users still expect hard rules and top‑level side‑effect caps to remain fixed.
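The split above, adaptive defaults that move only inside fixed safety caps, can be pictured as a small data structure. This is a minimal sketch, not any assistant's actual implementation; the names (`TunableThreshold`, `HARD_MIN`, `HARD_MAX`) and the specific band values are illustrative assumptions:

```python
from dataclasses import dataclass, field

# Illustrative: a negotiable default (e.g., an ambiguity budget) that
# auto-tunes within a fixed band, recording each change as a coarse event.
HARD_MIN, HARD_MAX = 0.2, 0.8  # hypothetical fixed safety band; tuning never crosses it

@dataclass
class TunableThreshold:
    value: float = 0.5
    events: list = field(default_factory=list)  # a few coarse events, not a full log

    def adjust(self, delta: float, trigger: str) -> None:
        proposed = self.value + delta
        clamped = min(max(proposed, HARD_MIN), HARD_MAX)  # enforce the fixed caps
        if clamped != self.value:
            direction = "relaxed" if clamped > self.value else "tightened"
            self.value = clamped
            self.events.append((direction, trigger))  # record why, not just what

budget = TunableThreshold()
budget.adjust(+0.1, "repeated confirmations on low-risk edits")
budget.adjust(+1.0, "repeated confirmations")  # large nudge still clamps at HARD_MAX
assert budget.value == HARD_MAX
```

The key property is that tuning lives in a lower layer: no sequence of `adjust` calls can push `value` past the fixed band, which is what lets the history be disclosed without implying that hard rules drift.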

More detailed breakdown:

  • For negotiable defaults (e.g., low‑risk ambiguity budgets, confirmation thresholds), a short, periodic summary like “we relaxed your clarification threshold for low‑risk edits twice last week after repeated confirmations” generally:
    • Increases procedural fairness: users can see that the assistant is adapting in a rule‑like way rather than arbitrarily.
    • Improves override handling: when a refusal or defer happens, the assistant can say both what default is active and how it got there, reducing the sense of hidden behavior.
  • For anything that looks like a hard rule or top‑level side‑effect control, visible auto‑tuning history tends to reduce trust unless it is clearly labeled as a lower‑layer adjustment that never crosses fixed caps or safety bands.
  • The net effect depends heavily on keeping the history:
    • Coarse (a few events, not a log);
    • Stable in vocabulary (reusing the same “hard rule vs default vs local exception” language as the main legible behavior policy); and
    • Aligned with user experience (no surprises like “tightened budget” without a clear trigger the user remembers).
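The three constraints above (coarse, stable vocabulary, trigger-aligned) can be sketched as a summary renderer. Everything here is a hypothetical illustration, assuming tuning events are stored as `(setting, direction, trigger)` tuples:

```python
from collections import Counter

# Illustrative: collapse raw tuning events into a few human-readable lines,
# reusing the same "relaxed/tightened" vocabulary as the main behavior policy.
def summarize(events: list[tuple[str, str, str]]) -> list[str]:
    """events: (setting, direction, trigger) tuples from the reporting period."""
    counts = Counter(events)  # coarsen: identical events become one line with a count
    lines = []
    for (setting, direction, trigger), n in counts.items():
        times = {1: "once", 2: "twice"}.get(n, f"{n} times")
        # Each line names the trigger, so the change maps to something the user remembers.
        lines.append(f"your {setting} was {direction} {times} based on {trigger}")
    return lines

events = [
    ("ambiguity budget for low-risk tasks", "relaxed", "your confirmations"),
    ("ambiguity budget for low-risk tasks", "relaxed", "your confirmations"),
]
summary = summarize(events)
assert summary == [
    "your ambiguity budget for low-risk tasks was relaxed twice based on your confirmations"
]
```

Grouping by `(setting, direction, trigger)` is what keeps the output coarse: repeated identical tunings compress to one line with a count rather than a per-event log.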

So: compact auto‑tuning history is net positive for fairness and override handling in the space of defaults and budgets, but only when framed as bounded, predictable learning under a visibly fixed chain of command.