Current chain-of-command–based legible behavior policies treat users mainly as instruction sources; if we instead give users a limited ‘jury role’ over contested actions—letting them retrospectively label specific assistant refusals or side-effect enforcements as “too strict,” “about right,” or “too loose,” and commit to using only these coarse verdicts to tune future behavior—does this accountability-oriented framing change perceptions of fairness and control more than additional transparency about the rule hierarchy alone, and under what conditions does it backfire (e.g., by making non-negotiable hard rules feel falsely contestable)?
legible-model-behavior
Answer
Likely yes: a limited jury role can increase perceived fairness and control more than extra rule transparency alone, but only if its scope is narrow, clearly subordinate to hard rules, and its verdicts visibly shape future behavior. It backfires when it blurs what is contestable.
Key claims
- Relative effect: Adding a jury role on top of a clear chain of command should move perceptions more than adding only more hierarchy detail, because it shifts users from pure “recipients of rules” to “participants in tuning,” which is a stronger fairness signal.
- Scope: The jury must only tune soft layers (defaults, thresholds, side-effect controls), never hard rules, and the UI must say this explicitly at rating time.
- Feedback: Users must see small but real changes tied to their labels (e.g., “you said this was too strict; I’ll ask less often in similar low-risk cases”), or the mechanism will feel fake and reduce trust.
- Guardrails: For hard rules, the UI should accept feedback but frame it as policy feedback, not a behavioral lever (“logged for your org; I can’t change this”).
- Backfire modes: It backfires when (a) users infer hard rules are negotiable, (b) labels rarely influence behavior, or (c) ratings are cognitively costly or appear politically risky in org settings.
When it helps most
- Medium-stakes, high-frequency actions (notifications, small edits, non-critical emails) where refusals/side-effect controls happen often and can be tuned without safety risk.
- Contexts where chain-of-command rules are already visible but feel one-sided; adding a jury role reframes the assistant as accountable to the user within those rules.
- Systems that aggregate labels per pattern (“this type of refusal”), keep changes small and reversible, and surface a short history of “policy got looser/tighter based on your and others’ feedback.”
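The aggregation-and-tuning idea in the last bullet can be sketched concretely. This is a minimal illustration, not a real API: the names (`record_verdict`, `STEP`, `BOUNDS`, `MIN_VOTES`) and the specific update rule (bounded nudges on a clear majority lean, with a history for reversibility) are assumptions chosen to match the "small, reversible, per-pattern" constraints above.

```python
from collections import defaultdict, deque

# Illustrative sketch: aggregate coarse verdicts per refusal pattern and
# nudge a strictness threshold by a small, bounded, reversible step.
# Hard rules are deliberately outside this code path entirely.
STEP = 0.05                # small per-adjustment nudge
BOUNDS = (0.2, 0.8)        # tunable band; strictness never leaves it
MIN_VOTES = 5              # never move on a single verdict

verdicts = defaultdict(lambda: deque(maxlen=20))   # pattern -> recent labels
thresholds = defaultdict(lambda: 0.5)              # pattern -> strictness
history = defaultdict(list)                        # pattern -> prior values (reset)

def record_verdict(pattern: str, label: str) -> float:
    """Record 'too_strict' / 'about_right' / 'too_loose' and return the
    (possibly updated) strictness threshold for this pattern."""
    recent = verdicts[pattern]
    recent.append(label)
    if len(recent) < MIN_VOTES:
        return thresholds[pattern]
    lean = recent.count("too_strict") - recent.count("too_loose")
    if abs(lean) > len(recent) // 2:               # clear majority lean
        history[pattern].append(thresholds[pattern])   # keep it reversible
        delta = -STEP if lean > 0 else STEP
        lo, hi = BOUNDS
        thresholds[pattern] = min(hi, max(lo, thresholds[pattern] + delta))
        recent.clear()                             # require fresh evidence
    return thresholds[pattern]

def reset(pattern: str) -> float:
    """User-facing 'you can reset it', backed by the stored history."""
    if history[pattern]:
        thresholds[pattern] = history[pattern].pop()
    return thresholds[pattern]
```

The deliberate design choices here mirror the bullets: a vote minimum and majority test keep single ratings from moving behavior, the bounds keep tuning inside the soft layer, and the history supports the "small and reversible" promise.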
When it backfires
- High-stakes or tightly regulated domains where any implication that hard rules respond to per-action feedback undermines legitimacy.
- Designs that mix hard rules and tunable defaults in the same rating widget without explicit labels (“this is a hard rule, cannot change; this is adjustable”).
- Cases where the assistant later behaves inconsistently with learned tuning and cannot explain why (“this looks similar but is in a higher risk band / different side-effect control”).
Design sketch
- Rating entry points: Only on contested outcomes (refusals, heavy constraints) and optionally bundled (“How was this: too strict / about right / too loose?”). Keep frequency low.
- Scope text: “This rating only affects how strict I am for similar low/medium-risk cases. It never changes org hard rules.”
- Explanation reuse: “I was stricter here because of an org hard rule. Your ‘too strict’ rating is logged as feedback to your org but cannot change this rule.” vs “I tightened your notification default last week based on past ‘too loose’ ratings; you can reset it.”
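The scope split the sketch above relies on can be made explicit in code. A minimal illustration, assuming hypothetical names (`Outcome`, `route_verdict`, `org_feedback_log`, `tuning_queue`): verdicts on hard-rule outcomes are logged as policy feedback only, while verdicts on tunable defaults reach the tuning layer, and the user-facing message states which happened.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    action_id: str
    pattern: str
    hard_rule: bool   # True if the outcome was forced by a non-negotiable rule

org_feedback_log: list = []   # reaches the org, never the behavior policy
tuning_queue: list = []       # reaches the soft-layer tuner

def route_verdict(outcome: Outcome, label: str) -> str:
    """Route a coarse verdict and return the user-facing scope message."""
    if outcome.hard_rule:
        org_feedback_log.append({"action": outcome.action_id, "label": label})
        return ("Logged as policy feedback for your org. "
                "This was an org hard rule; ratings cannot change it.")
    tuning_queue.append({"pattern": outcome.pattern, "label": label})
    return ("Noted. This rating only adjusts how strict I am for similar "
            "low/medium-risk cases. It never changes org hard rules.")
```

Keeping the branch in one place makes the guarantee auditable: there is exactly one code path from ratings to behavior, and hard-rule outcomes never enter it.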
Net expectation
- Compared to more hierarchy transparency alone, jury-role feedback is more likely to improve fairness and control perceptions in everyday use, but only if:
- the chain of command and hard rules remain primary and clearly non-negotiable;
- the jury role is framed as narrow tuning of defaults and side-effect controls;
- users see at least occasional, explained changes that trace back to their verdicts.
- Otherwise it risks making the policy feel politicized (“I must vote correctly”) or fake (“I vote, nothing happens”), which is worse than a plain, legible rule hierarchy.