In small, agent-first monolith teams that already use learn/ship modes and staged review (R1/R2/R3), what concrete signals in real PRs and incidents should trigger raising the craft bar for agents (stricter façades, narrower harness permissions, more human-owned design files) versus lowering it (wider agent-editable areas, looser templates), and how could you run a 4–6 week calibration loop so those thresholds are data-driven rather than taste-only arguments?

dhh-agent-first-software-craft

Answer

Signals + loop, kept tight.

  1. Signals to RAISE the craft bar (tighten façades, perms, human design)

Use these as hard triggers, not vibes:

PR-level signals

  • S1: Rework rate
    • ≥2 follow-up PRs or >25% of lines changed post-review on agent-heavy work in a domain over 2–3 weeks.
  • S2: Review drag
    • Median review time for agent-heavy PRs in a path is >2× team median for similar scope.
  • S3: Boundary leaks
  • Comments like “agent reached into internals” or “this bypasses X façade” on ≥3 PRs in the same area.
  • S4: Taste regressions
    • Repeated nits: inconsistent naming, copy-paste patterns, missing small tests in one directory or feature.

Incident/ops signals

  • S5: Post-merge defects
    • ≥2 production bugs in 4 weeks traced to the same directory, service façade, or harness tool.
  • S6: Rollbacks / hotfixes
    • Any rollback tied to an agent-generated change where tests passed but behavior was off.
  • S7: Verification misses
  • Incident reviews note “the PR looked fine; the check was missing or weak” more than once for the same area.
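
The hard triggers above can be checked mechanically rather than argued. A minimal Python sketch for S1 and S5, assuming an illustrative `PRRecord` shape (the field names are invented for this example, not an existing schema):

```python
# Sketch: evaluating two "raise" triggers over a window of PR records.
# All field names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PRRecord:
    path: str                       # domain/dir the PR mainly touched
    agent_heavy: bool
    followup_prs: int = 0           # follow-up PRs filed against this change
    post_review_churn: float = 0.0  # fraction of lines changed after review
    caused_bug: bool = False        # traced to a production defect

def s1_rework(prs, domain):
    """S1: any agent-heavy PR with >=2 follow-ups or >25% post-review churn."""
    return any(p.path == domain and p.agent_heavy
               and (p.followup_prs >= 2 or p.post_review_churn > 0.25)
               for p in prs)

def s5_defects(prs, domain):
    """S5: >=2 production bugs traced to the same domain in the window."""
    return sum(p.caused_bug for p in prs if p.path == domain) >= 2
```

Run it over a rolling 2–4 week window so triggers fire on clusters, not one-offs.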

Actions when these cluster by domain/path

  • A1: Stricter façades
    • Freeze new call sites to internals; require going through 1–2 clear services/flows.
  • A2: Narrow harness perms
    • Move files/dirs to “read-only for agents” or “agent-editable only via flow X”.
  • A3: More human-owned design files
    • Require a small human-written design or flow file (R3 note, sequence, contract snippet) before agent runs in that area.
  • A4: Mode bump
  • Default the contentious area to mode:learn for 2–4 weeks; agents assist, humans lead design and verification.

  2. Signals to LOWER the craft bar (wider agent area, looser templates)

Look for repeated success:

PR-level signals

  • S8: High merge-worthiness
    • ≥70–80% of agent-heavy PRs in a path merge with only minor edits (comment-only or tiny nits) over 3–4 weeks.
  • S9: Fast review
    • Median review time for those PRs < team median for similar size; reviewers tag them “low-friction”.
  • S10: Good diff shape
    • Diffs stay within façades, small files, and named flows with clear tests; reviewers rarely flag layout or boundary issues.

Incident/ops signals

  • S11: Clean run history
    • 0–1 minor incidents in 6–8 weeks from that area, despite steady change volume.
  • S12: Strong checks
    • PRs routinely add or extend tests/verification; reviewers begin to skim code and lean on checks.

Team signals

  • S13: Reviewer confidence
    • Seniors explicitly note “we could safely let agents do more here” in R3 or retro notes.
  • S14: Boredom / underuse
    • Seniors report R1/R2 in that area feels like rubber-stamping well-shaped work.
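
S8 reduces to a simple ratio; a sketch, assuming reviewers set an `only_minor_edits` flag at merge (an invented field, not an existing tag):

```python
# Sketch: merge-worthiness = share of agent-heavy PRs in a domain that
# merged with only minor edits over the window.
def merge_worthiness(prs, domain):
    heavy = [p for p in prs if p["path"] == domain and p["agent_heavy"]]
    if not heavy:
        return None  # no signal without volume
    return sum(1 for p in heavy if p["only_minor_edits"]) / len(heavy)

def s8_lower_signal(prs, domain, threshold=0.75):
    rate = merge_worthiness(prs, domain)
    return rate is not None and rate >= threshold
```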

Actions when these cluster

  • B1: Wider agent-editable areas
    • Expand allowed dirs/files for agent writes; allow agents to propose new files under existing façades.
  • B2: Looser templates
    • Simplify PR templates, allow larger diffs per PR, and relax "always add new test" to "extend checks when behavior changes".
  • B3: More agent-led R1/R2
    • Let agents own diff summaries and suggested checks; humans focus on R3 framing.
  • B4: Mode bias
    • Default routine work here to mode:ship with opt-in mode:learn when a junior wants depth.

  3. 4–6 week calibration loop (make it data-driven)

Step 0: Minimal metrics

  • Tag each PR:
    • mode:learn|ship, agent:low|med|high (subjective, but logged).
    • R1/R2/R3 owners (junior/senior/agent) per existing scheme.
  • Auto-capture per PR:
    • Review time (first review to merge), number of review rounds.
    • Files/dirs touched; any harness tools/flows used.
    • Post-merge: link to any incident/rollback tied to the PR.

Step 1 (week 1): Baseline

  • Do not change harness rules.
  • For each key path/domain (e.g., billing/, projects/flows/):
    • Compute: count of agent-heavy PRs, median review time, % “light changes”, # incidents.
  • Pick 3–5 focus domains with enough volume.
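
The Step 1 rollup is a per-domain groupby over those records; a sketch using the same illustrative fields as Step 0, here as plain dicts:

```python
# Sketch: baseline metrics per top-level domain (e.g. "billing/").
from collections import defaultdict
from statistics import median

def baseline(prs):
    by_domain = defaultdict(list)
    for p in prs:
        # count each PR once per domain, even if it touched several files there
        for domain in {path.split("/")[0] + "/" for path in p["paths"]}:
            by_domain[domain].append(p)
    out = {}
    for domain, group in by_domain.items():
        heavy = [p for p in group if p["agent"] == "high"]
        out[domain] = {
            "agent_heavy_prs": len(heavy),
            "median_review_hours": (median(p["review_hours"] for p in heavy)
                                    if heavy else None),
            "incidents": sum(1 for p in group if p.get("incident_link")),
        }
    return out
```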

Step 2 (weeks 2–3): Local experiments

  • For 1–2 domains with “raise” signals (S1–S7):
    • Apply A1–A4.
    • Explicitly log start date and rules in a short CALIBRATION.md entry.
  • For 1–2 domains with “lower” signals (S8–S14):
    • Apply B1–B4.
  • Keep other domains as control.
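
A CALIBRATION.md entry can stay very small; a hypothetical example (dates, domain, and levers are purely illustrative):

```markdown
## 2024-05-13 — billing/ (raise)
- Triggers: S5 (2 prod bugs in 4 weeks) + repeated "bypasses façade" review notes
- Levers applied: A1 (freeze new internal call sites), A4 (mode:learn default)
- Control: projects/flows/ unchanged
- Review date: 2024-05-27
```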

Step 3: Weekly review (30–45 min)

  • For each focus domain:
    • Compare last 7–10 days vs baseline:
      • Review time.
      • Rework / follow-up PRs.
      • Any incidents.
      • Reviewer sentiment (short rating or 1–2 slack notes).
  • Adjust:
    • If tightening makes review time spike without reducing defects, relax the strictest lever (e.g., perms) but keep better façades.
    • If loosening keeps metrics stable or better, consider expanding the looser regime carefully.

Step 4 (end of week 4–6): Decide stable thresholds

  • Write 1–2 simple rules per domain class, e.g.:
    • "If >2 bugs in 4 weeks from billing with tests passing → move that area to strict façades + mode:learn by default for a month."
    • "If a directory has 20+ agent-heavy PRs with zero incidents and <X review time over 6 weeks → allow agents to edit any file under its façades and auto-generate R1 summaries."
  • Encode where possible:
    • Harness checks that switch modes or warn based on directory + recent incidents.
    • PR templates that suggest mode based on path and risk.
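
The example rules above encode naturally as a tiny harness check; a sketch where the thresholds come from those example rules and everything else (field names, mode labels) is an assumption:

```python
# Sketch: suggest a default mode per directory from recent history.
def suggest_mode(history):
    """history: {"bugs_4w": int, "agent_prs_6w": int, "incidents_6w": int}."""
    if history["bugs_4w"] > 2:
        return "learn"   # tighten: strict façades, human-led design by default
    if history["agent_prs_6w"] >= 20 and history["incidents_6w"] == 0:
        return "ship"    # loosen: agents edit freely under existing façades
    return "default"     # no rule fires; keep current settings
```

The check should warn or suggest, not silently switch: a human confirms the mode change and logs it in CALIBRATION.md.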

  4. Keep it from becoming taste-only

  • Require each "raise" or "lower" decision to cite:
    • One numeric trigger (e.g., S1, S5, S8, S9) and
    • One qualitative trigger (e.g., repeated review notes).
  • Review these decisions in a short retro; revert or tweak if they didn’t move metrics the right way.
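
The "one numeric + one qualitative" rule is itself checkable; a sketch, where the split of S1–S14 into the two sets is my reading of the signals above, not a given taxonomy:

```python
# Sketch: a raise/lower decision is valid only if it cites at least one
# signal from each set.
NUMERIC = {"S1", "S2", "S5", "S6", "S8", "S9", "S11"}
QUALITATIVE = {"S3", "S4", "S7", "S10", "S12", "S13", "S14"}

def decision_valid(triggers):
    cited = set(triggers)
    return bool(cited & NUMERIC) and bool(cited & QUALITATIVE)
```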

Net effect: agents get more freedom where history shows they’re safe and merge-worthy; the craft bar gets tighter where real PRs and incidents show pain, not just senior intuition.