In small, agent-first monolith teams that already use learn/ship modes and staged review (R1/R2/R3), what concrete signals in real PRs and incidents should trigger raising the craft bar for agents (stricter façades, narrower harness permissions, more human-owned design files) versus lowering it (wider agent-editable areas, looser templates), and how could you run a 4–6 week calibration loop so those thresholds are data-driven rather than taste-only arguments?

dhh-agent-first-software-craft

Answer

Signals + loop, kept tight.

  1. Signals to RAISE the craft bar (tighten façades, perms, human design)

Use these as hard triggers, not vibes:

PR-level signals

  • S1: Rework rate
    • ≥2 follow-up PRs or >25% of lines changed post-review on agent-heavy work in a domain over 2–3 weeks.
  • S2: Review drag
    • Median review time for agent-heavy PRs in a path is >2× team median for similar scope.
  • S3: Boundary leaks
  • Comments like “agent reached into internals” or “this bypasses X façade” on ≥3 PRs in the same area.
  • S4: Taste regressions
    • Repeated nits: inconsistent naming, copy-paste patterns, missing small tests in one directory or feature.

Incident/ops signals

  • S5: Post-merge defects
    • ≥2 production bugs in 4 weeks traced to the same directory, service façade, or harness tool.
  • S6: Rollbacks / hotfixes
    • Any rollback tied to an agent-generated change where tests passed but behavior was off.
  • S7: Verification misses
  • Incident reviews note “the PR looked fine; the check was missing or weak” more than once for the same area.
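
The hard triggers above can be checked mechanically rather than argued. A minimal Python sketch for S1 and S5, assuming an illustrative `PRRecord` shape (the field names are invented for this example, not an existing schema):

```python
# Sketch: evaluating two "raise" triggers over a window of PR records.
# All field names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PRRecord:
    path: str                       # domain/dir the PR mainly touched
    agent_heavy: bool
    followup_prs: int = 0           # follow-up PRs filed against this change
    post_review_churn: float = 0.0  # fraction of lines changed after review
    caused_bug: bool = False        # traced to a production defect

def s1_rework(prs, domain):
    """S1: any agent-heavy PR with >=2 follow-ups or >25% post-review churn."""
    return any(p.path == domain and p.agent_heavy
               and (p.followup_prs >= 2 or p.post_review_churn > 0.25)
               for p in prs)

def s5_defects(prs, domain):
    """S5: >=2 production bugs traced to the same domain in the window."""
    return sum(p.caused_bug for p in prs if p.path == domain) >= 2
```

Run it over a rolling 2–4 week window so triggers fire on clusters, not one-offs.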

Actions when these cluster by domain/path

  • A1: Stricter façades
    • Freeze new call sites to internals; require going through 1–2 clear services/flows.
  • A2: Narrow harness perms
    • Move files/dirs to “read-only for agents” or “agent-editable only via flow X”.
  • A3: More human-owned design files
    • Require a small human-written design or flow file (R3 note, sequence, contract snippet) before agent runs in that area.
  • A4: Mode bump
  • Default the contentious area to mode:learn for 2–4 weeks; agents assist, humans lead design and verification.

  2. Signals to LOWER the craft bar (wider agent area, looser templates)

Look for repeated success:

PR-level signals

  • S8: High merge-worthiness
    • ≥70–80% of agent-heavy PRs in a path merge with only minor edits (comment-only or tiny nits) over 3–4 weeks.
  • S9: Fast review
    • Median review time for those PRs < team median for similar size; reviewers tag them “low-friction”.
  • S10: Good diff shape
    • Diffs stay within façades, small files, and named flows with clear tests; reviewers rarely flag layout or boundary issues.

Incident/ops signals

  • S11: Clean run history
    • 0–1 minor incidents in 6–8 weeks from that area, despite steady change volume.
  • S12: Strong checks
    • PRs routinely add or extend tests/verification; reviewers begin to skim code and lean on checks.

Team signals

  • S13: Reviewer confidence
    • Seniors explicitly note “we could safely let agents do more here” in R3 or retro notes.
  • S14: Boredom / underuse
    • Seniors report R1/R2 in that area feels like rubber-stamping well-shaped work.
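
S8 reduces to a simple ratio; a sketch, assuming reviewers set an `only_minor_edits` flag at merge (an invented field, not an existing tag):

```python
# Sketch: merge-worthiness = share of agent-heavy PRs in a domain that
# merged with only minor edits over the window.
def merge_worthiness(prs, domain):
    heavy = [p for p in prs if p["path"] == domain and p["agent_heavy"]]
    if not heavy:
        return None  # no signal without volume
    return sum(1 for p in heavy if p["only_minor_edits"]) / len(heavy)

def s8_lower_signal(prs, domain, threshold=0.75):
    rate = merge_worthiness(prs, domain)
    return rate is not None and rate >= threshold
```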

Actions when these cluster

  • B1: Wider agent-editable areas
    • Expand allowed dirs/files for agent writes; allow agents to propose new files under existing façades.
  • B2: Looser templates
    • Simplify PR templates, allow larger diffs per PR, and relax "always add new test" to "extend checks when behavior changes".
  • B3: More agent-led R1/R2
    • Let agents own diff summaries and suggested checks; humans focus on R3 framing.
  • B4: Mode bias
    • Default routine work here to mode:ship with opt-in mode:learn when a junior wants depth.

  3. 4–6 week calibration loop (make it data-driven)

Step 0: Minimal metrics

  • Tag each PR:
    • mode:learn|ship, agent:low|med|high (subjective, but logged).
    • R1/R2/R3 owners (junior/senior/agent) per existing scheme.
  • Auto-capture per PR:
    • Review time (first review to merge), number of review rounds.
    • Files/dirs touched; any harness tools/flows used.
    • Post-merge: link to any incident/rollback tied to the PR.

Step 1 (week 1): Baseline

  • Do not change harness rules.
  • For each key path/domain (e.g., billing/, projects/flows/):
    • Compute: count of agent-heavy PRs, median review time, % “light changes”, # incidents.
  • Pick 3–5 focus domains with enough volume.
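
The Step 1 rollup is a per-domain groupby over those records; a sketch using the same illustrative fields as Step 0, here as plain dicts:

```python
# Sketch: baseline metrics per top-level domain (e.g. "billing/").
from collections import defaultdict
from statistics import median

def baseline(prs):
    by_domain = defaultdict(list)
    for p in prs:
        # count each PR once per domain, even if it touched several files there
        for domain in {path.split("/")[0] + "/" for path in p["paths"]}:
            by_domain[domain].append(p)
    out = {}
    for domain, group in by_domain.items():
        heavy = [p for p in group if p["agent"] == "high"]
        out[domain] = {
            "agent_heavy_prs": len(heavy),
            "median_review_hours": (median(p["review_hours"] for p in heavy)
                                    if heavy else None),
            "incidents": sum(1 for p in group if p.get("incident_link")),
        }
    return out
```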

Step 2 (weeks 2–3): Local experiments

  • For 1–2 domains with “raise” signals (S1–S7):
    • Apply A1–A4.
    • Explicitly log start date and rules in a short CALIBRATION.md entry.
  • For 1–2 domains with “lower” signals (S8–S14):
    • Apply B1–B4.
  • Keep other domains as control.
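
A CALIBRATION.md entry can stay very small; a hypothetical example (dates, domain, and levers are purely illustrative):

```markdown
## 2024-05-13 — billing/ (raise)
- Triggers: S5 (2 prod bugs in 4 weeks) + repeated "bypasses façade" review notes
- Levers applied: A1 (freeze new internal call sites), A4 (mode:learn default)
- Control: projects/flows/ unchanged
- Review date: 2024-05-27
```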

Step 3: Weekly review (30–45 min)

  • For each focus domain:
    • Compare last 7–10 days vs baseline:
      • Review time.
      • Rework / follow-up PRs.
      • Any incidents.
      • Reviewer sentiment (short rating or 1–2 slack notes).
  • Adjust:
    • If tightening makes review time spike without reducing defects, relax the strictest lever (e.g., perms) but keep better façades.
    • If loosening keeps metrics stable or better, consider expanding the looser regime carefully.

Step 4 (end of week 4–6): Decide stable thresholds

  • Write 1–2 simple rules per domain class, e.g.:
    • "If >2 bugs in 4 weeks from billing with tests passing → move that area to strict façades + mode:learn by default for a month."
    • "If a directory has 20+ agent-heavy PRs with zero incidents and <X review time over 6 weeks → allow agents to edit any file under its façades and auto-generate R1 summaries."
  • Encode where possible:
    • Harness checks that switch modes or warn based on directory + recent incidents.
    • PR templates that suggest mode based on path and risk.
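
The example rules above encode naturally as a tiny harness check; a sketch where the thresholds come from those example rules and everything else (field names, mode labels) is an assumption:

```python
# Sketch: suggest a default mode per directory from recent history.
def suggest_mode(history):
    """history: {"bugs_4w": int, "agent_prs_6w": int, "incidents_6w": int}."""
    if history["bugs_4w"] > 2:
        return "learn"   # tighten: strict façades, human-led design by default
    if history["agent_prs_6w"] >= 20 and history["incidents_6w"] == 0:
        return "ship"    # loosen: agents edit freely under existing façades
    return "default"     # no rule fires; keep current settings
```

The check should warn or suggest, not silently switch: a human confirms the mode change and logs it in CALIBRATION.md.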

  4. Keep it from becoming taste-only

  • Require each "raise" or "lower" decision to cite:
    • One numeric trigger (e.g., S1, S5, S8, S9) and
    • One qualitative trigger (e.g., repeated review notes).
  • Review these decisions in a short retro; revert or tweak if they didn’t move metrics the right way.
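
The "one numeric + one qualitative" rule is itself checkable; a sketch, where the split of S1–S14 into the two sets is my reading of the signals above, not a given taxonomy:

```python
# Sketch: a raise/lower decision is valid only if it cites at least one
# signal from each set.
NUMERIC = {"S1", "S2", "S5", "S6", "S8", "S9", "S11"}
QUALITATIVE = {"S3", "S4", "S7", "S10", "S12", "S13", "S14"}

def decision_valid(triggers):
    cited = set(triggers)
    return bool(cited & NUMERIC) and bool(cited & QUALITATIVE)
```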

Net effect: agents get more freedom where history shows they’re safe and merge-worthy; the craft bar gets tighter where real PRs and incidents show pain, not just senior intuition.