In small, agent-first monolith teams that already use learn/ship modes and staged review (R1/R2/R3), what concrete signals in real PRs and incidents should trigger raising the craft bar for agents (stricter façades, narrower harness permissions, more human-owned design files) versus lowering it (wider agent-editable areas, looser templates), and how could you run a 4–6 week calibration loop so those thresholds are data-driven rather than taste-only arguments?
Answer
Signals + loop, kept tight.
- Signals to RAISE the craft bar (tighten façades, perms, human design)
Use these as hard triggers, not vibes:
PR-level signals
- S1: Rework rate
- ≥2 follow-up PRs or >25% of lines changed post-review on agent-heavy work in a domain over 2–3 weeks.
- S2: Review drag
- Median review time for agent-heavy PRs in a path is >2× team median for similar scope.
- S3: Boundary leaks
- Comments like “agent reached into internals” or “this bypasses X façade” on ≥3 PRs in same area.
- S4: Taste regressions
- Repeated nits: inconsistent naming, copy-paste patterns, missing small tests in one directory or feature.
Incident/ops signals
- S5: Post-merge defects
- ≥2 production bugs in 4 weeks traced to the same directory, service façade, or harness tool.
- S6: Rollbacks / hotfixes
- Any rollback tied to an agent-generated change where tests passed but behavior was off.
- S7: Verification misses
- Incident review notes “PR looked fine; missing/weak check” more than once for the same area.
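The numeric raise triggers (S1, S2, S5) can be checked mechanically. Below is a minimal sketch, assuming a hypothetical per-PR record: `AgentPR`, its field names, and `raise_signals` are illustrative and not part of any existing harness. It reads S1 as "some PR needed ≥2 follow-ups or had >25% post-review churn", which is one possible interpretation, and it expects the caller to pass only the PRs from the relevant time window.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class AgentPR:
    """Hypothetical record for one agent-heavy PR; all field names are assumptions."""
    follow_up_prs: int        # follow-up PRs needed to fix or finish the change
    post_review_churn: float  # fraction of lines changed after first review (0.0-1.0)
    review_hours: float       # hours from first review to merge
    prod_bugs: int            # production bugs traced back to this PR


def raise_signals(domain_prs, team_median_review_hours):
    """Return which numeric raise signals (S1, S2, S5) fire for one domain.

    Assumes domain_prs is already filtered to the relevant time window.
    """
    if not domain_prs:
        return []
    fired = []
    # S1 (assumed reading): some PR needed >=2 follow-ups or had >25% churn.
    if any(pr.follow_up_prs >= 2 or pr.post_review_churn > 0.25 for pr in domain_prs):
        fired.append("S1")
    # S2: median review time here is more than 2x the team median.
    if median(pr.review_hours for pr in domain_prs) > 2 * team_median_review_hours:
        fired.append("S2")
    # S5: at least 2 production bugs traced to this domain in the window.
    if sum(pr.prod_bugs for pr in domain_prs) >= 2:
        fired.append("S5")
    return fired


billing = [AgentPR(2, 0.10, 10.0, 1), AgentPR(0, 0.05, 12.0, 1)]
print(raise_signals(billing, team_median_review_hours=5.0))  # ['S1', 'S2', 'S5']
```

The qualitative signals (S3, S4, S7) still need a human to count review comments; this only covers the parts that reduce to arithmetic.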
Actions when these cluster by domain/path
- A1: Stricter façades
- Freeze new call sites to internals; require going through 1–2 clear services/flows.
- A2: Narrow harness perms
- Move files/dirs to “read-only for agents” or “agent-editable only via flow X”.
- A3: More human-owned design files
- Require a small human-written design or flow file (R3 note, sequence, contract snippet) before agent runs in that area.
- A4: Mode bump
- Default contentious area to `mode:learn` for 2–4 weeks; agents assist, humans lead design and verification.
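Action A2 is easiest to enforce when the permissions live in one small table. The sketch below is an assumption about how such a table might look: the `AGENT_POLICY` patterns, the `billing_flow` name, and `agent_write_allowed` are hypothetical, and a real harness would likely have its own format.

```python
from fnmatch import fnmatchcase

# Hypothetical policy table; the first matching pattern wins.
AGENT_POLICY = [
    ("billing/internal/*", "read-only"),       # agents may read, never write
    ("billing/*", "flow-only:billing_flow"),   # writes only via a named flow
    ("*", "editable"),                         # everything else is open
]


def agent_write_allowed(path, via_flow=None):
    """Return True if an agent may write to path, optionally through a flow."""
    for pattern, rule in AGENT_POLICY:
        if fnmatchcase(path, pattern):
            if rule == "editable":
                return True
            if rule.startswith("flow-only:"):
                return via_flow == rule.split(":", 1)[1]
            return False  # read-only
    return False  # no pattern matched: deny by default


print(agent_write_allowed("billing/internal/ledger.py"))                    # False
print(agent_write_allowed("billing/invoices/create.py", "billing_flow"))    # True
```

Keeping the table ordered from most to least specific means tightening an area is a one-line change that shows up clearly in review.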
- Signals to LOWER the craft bar (wider agent area, looser templates)
Look for repeated success:
PR-level signals
- S8: High merge-worthiness
- ≥70–80% of agent-heavy PRs in a path merge with only minor edits (comment-only or tiny nits) over 3–4 weeks.
- S9: Fast review
- Median review time for those PRs < team median for similar size; reviewers tag them “low-friction”.
- S10: Good diff shape
- Diffs stay within façades, small files, and named flows with clear tests; reviewers rarely flag layout or boundary issues.
Incident/ops signals
- S11: Clean run history
- 0–1 minor incidents in 6–8 weeks from that area, despite steady change volume.
- S12: Strong checks
- PRs routinely add or extend tests/verification; reviewers begin to skim code and lean on checks.
Team signals
- S13: Reviewer confidence
- Seniors explicitly note “we could safely let agents do more here” in R3 or retro notes.
- S14: Boredom / underuse
- Seniors report R1/R2 in that area feels like rubber-stamping well-shaped work.
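The numeric lower signals (S8, S9, S11) can be computed the same way as the raise signals. This is a sketch under stated assumptions: `MergedPR`, its fields, and `lower_signals` are hypothetical names, the 0.7 default is the low end of the 70–80% range in S8, and the incident count is supplied by the caller for the 6–8 week window.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class MergedPR:
    """Hypothetical record for one merged agent-heavy PR; fields are assumptions."""
    minor_edits_only: bool  # merged with comment-only changes or tiny nits
    review_hours: float     # hours from first review to merge


def lower_signals(domain_prs, team_median_review_hours, incidents_in_window,
                  min_light_share=0.7):
    """Return which numeric lower signals (S8, S9, S11) fire for one domain."""
    if not domain_prs:
        return []  # no volume, no evidence either way
    fired = []
    # S8: share of PRs that merged with only minor edits.
    light_share = sum(pr.minor_edits_only for pr in domain_prs) / len(domain_prs)
    if light_share >= min_light_share:
        fired.append("S8")
    # S9: median review time is below the team median.
    if median(pr.review_hours for pr in domain_prs) < team_median_review_hours:
        fired.append("S9")
    # S11: at most one incident from this area in the window.
    if incidents_in_window <= 1:
        fired.append("S11")
    return fired


prs = [MergedPR(True, 2.0), MergedPR(True, 3.0), MergedPR(True, 4.0), MergedPR(False, 9.0)]
print(lower_signals(prs, team_median_review_hours=5.0, incidents_in_window=0))
# ['S8', 'S9', 'S11']
```

S11 also asks for steady change volume, so a domain with very few PRs should not be loosened on this output alone.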
Actions when these cluster
- B1: Wider agent-editable areas
- Expand allowed dirs/files for agent writes; allow agents to propose new files under existing façades.
- B2: Looser templates
- Simplify PR templates, allow larger diffs per PR, and relax "always add new test" to "extend checks when behavior changes".
- B3: More agent-led R1/R2
- Let agents own diff summaries and suggested checks; humans focus on R3 framing.
- B4: Mode bias
- Default routine work here to `mode:ship`, with opt-in `mode:learn` when a junior wants depth.
- 4–6 week calibration loop (make it data-driven)
Step 0: Minimal metrics
- Tag each PR:
- `mode:learn|ship`, `agent:low|med|high` (subjective, but logged).
- R1/R2/R3 owners (junior/senior/agent) per existing scheme.
- Auto-capture per PR:
- Review time (first review to merge), number of review rounds.
- Files/dirs touched; any harness tools/flows used.
- Post-merge: link to any incident/rollback tied to the PR.
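The tagging in Step 0 only produces usable data if every PR carries both tags with valid values. A minimal sketch of that check follows, assuming the tags arrive as plain label strings such as `mode:ship`; the `parse_pr_tags` helper and the label format are assumptions, not an existing tool.

```python
# Allowed values come from the tagging scheme above; the label format is assumed.
ALLOWED_TAGS = {
    "mode": {"learn", "ship"},
    "agent": {"low", "med", "high"},
}


def parse_pr_tags(labels):
    """Extract and validate the mode/agent tags from a PR's label strings.

    Unrelated labels are ignored; a missing or invalid tag raises ValueError.
    """
    tags = {}
    for label in labels:
        key, sep, value = label.partition(":")
        if not sep or key not in ALLOWED_TAGS:
            continue  # not one of our tags, e.g. "bug" or "area:billing"
        if value not in ALLOWED_TAGS[key]:
            raise ValueError(f"invalid value for {key}: {value!r}")
        tags[key] = value
    missing = sorted(set(ALLOWED_TAGS) - set(tags))
    if missing:
        raise ValueError(f"missing tags: {missing}")
    return tags


print(parse_pr_tags(["mode:ship", "agent:high", "bug"]))
# {'mode': 'ship', 'agent': 'high'}
```

Running a check like this in CI keeps the baseline in Step 1 from silently excluding untagged PRs.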
Step 1 (week 1): Baseline
- Do not change harness rules.
- For each key path/domain (e.g., `billing/`, `projects/flows/`):
- Compute: count of agent-heavy PRs, median review time, % “light changes”, # incidents.
- Pick 3–5 focus domains with enough volume.
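The baseline in Step 1 is a small group-by over the PR log. The sketch below assumes each PR is a dict with hypothetical keys (`files`, `agent`, `review_hours`, `light_change`, `incidents`) and treats `agent == "high"` as "agent-heavy"; both the schema and the `min_prs` cutoff for "enough volume" are assumptions to adapt.

```python
from collections import defaultdict
from statistics import median


def baseline(prs, domains):
    """Compute per-domain baseline stats over agent-heavy PRs."""
    buckets = defaultdict(list)
    for pr in prs:
        if pr["agent"] != "high":
            continue  # baseline only covers agent-heavy work
        for domain in domains:
            if any(f.startswith(domain) for f in pr["files"]):
                buckets[domain].append(pr)
    return {
        domain: {
            "count": len(rows),
            "median_review_hours": median(r["review_hours"] for r in rows),
            "light_share": sum(r["light_change"] for r in rows) / len(rows),
            "incidents": sum(r["incidents"] for r in rows),
        }
        for domain, rows in buckets.items()
    }


def pick_focus(stats, min_prs=5, limit=5):
    """Pick up to `limit` domains with enough volume, busiest first."""
    busy = [d for d, s in stats.items() if s["count"] >= min_prs]
    return sorted(busy, key=lambda d: stats[d]["count"], reverse=True)[:limit]


prs = [
    {"files": ["billing/invoice.py"], "agent": "high", "review_hours": 4.0,
     "light_change": True, "incidents": 0},
    {"files": ["billing/tax.py"], "agent": "high", "review_hours": 8.0,
     "light_change": False, "incidents": 1},
    {"files": ["projects/flows/new.py"], "agent": "low", "review_hours": 1.0,
     "light_change": True, "incidents": 0},
]
stats = baseline(prs, ["billing/", "projects/flows/"])
print(stats["billing/"])
# {'count': 2, 'median_review_hours': 6.0, 'light_share': 0.5, 'incidents': 1}
```

A PR touching two domains is counted in both, which is usually what you want for boundary-leak analysis but inflates totals slightly.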
Step 2 (weeks 2–3): Local experiments
- For 1–2 domains with “raise” signals (S1–S7):
- Apply A1–A4.
- Explicitly log start date and rules in a short `CALIBRATION.md` entry.
- For 1–2 domains with “lower” signals (S8–S14):
- Apply B1–B4.
- Keep other domains as control.
Step 3: Weekly review (30–45 min)
- For each focus domain:
- Compare last 7–10 days vs baseline:
- Review time.
- Rework / follow-up PRs.
- Any incidents.
- Reviewer sentiment (short rating or 1–2 Slack notes).
- Adjust:
- If tightening makes review time spike without reducing defects, relax the strictest lever (e.g., perms) but keep better façades.
- If loosening keeps metrics stable or better, consider expanding the looser regime carefully.
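The two adjustment rules above can be written down as a small decision function so the weekly review argues about inputs rather than outcomes. This is a sketch only: `weekly_adjustment`, the dict keys, and the 1.5x "spike" factor are assumptions, and defect counts are assumed to be normalized to comparable windows.

```python
def weekly_adjustment(regime, baseline, current, spike_factor=1.5):
    """Recommend 'relax', 'expand', or 'hold' for one focus domain.

    regime: 'tightened' or 'loosened' (which experiment the domain is in).
    baseline/current: dicts with 'median_review_hours' and 'defects'.
    """
    review_spiked = (
        current["median_review_hours"] > spike_factor * baseline["median_review_hours"]
    )
    defects_reduced = current["defects"] < baseline["defects"]

    if regime == "tightened":
        # Tightening that slows review without reducing defects is not paying off.
        if review_spiked and not defects_reduced:
            return "relax"  # relax the strictest lever (e.g., perms), keep façades
        return "hold"
    if regime == "loosened":
        # Stable or better: no review spike and no increase in defects.
        stable = not review_spiked and current["defects"] <= baseline["defects"]
        return "expand" if stable else "hold"
    raise ValueError(f"unknown regime: {regime!r}")


print(weekly_adjustment(
    "tightened",
    baseline={"median_review_hours": 4.0, "defects": 2},
    current={"median_review_hours": 9.0, "defects": 2},
))  # relax
```

With 7–10 days of data the counts are small, so treating "hold" as the default keeps one noisy week from flipping a regime.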
Step 4 (end of week 4–6): Decide stable thresholds
- Write 1–2 simple rules per domain class, e.g.:
- "If >2 bugs in 4 weeks from `billing` with tests passing → move that area to strict façades + `mode:learn` by default for a month."
- "If a directory has 20+ agent-heavy PRs with zero incidents and <X review time over 6 weeks → allow agents to edit any file under its façades and auto-generate R1 summaries."
- Encode where possible:
- Harness checks that switch modes or warn based on directory + recent incidents.
- PR templates that suggest `mode` based on path and risk.
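As an illustration of encoding such a rule, the sketch below suggests a mode from the changed path and recent incidents, using the ">2 bugs in 4 weeks" example threshold. The `suggest_mode` helper, the incident log format (path, date), and the choice to treat the first path segment as the domain are all assumptions.

```python
from datetime import date, timedelta


def suggest_mode(changed_path, incidents, today, window_days=28, max_bugs=2):
    """Suggest 'mode:learn' or 'mode:ship' for a PR touching changed_path.

    incidents: iterable of (path, day) pairs for bugs traced to a file.
    A domain with more than max_bugs incidents in the window gets mode:learn.
    """
    domain = changed_path.split("/", 1)[0] + "/"  # assumed: first segment is the domain
    cutoff = today - timedelta(days=window_days)
    recent = sum(
        1 for path, day in incidents
        if path.startswith(domain) and day >= cutoff
    )
    return "mode:learn" if recent > max_bugs else "mode:ship"


incidents = [
    ("billing/tax.py", date(2024, 5, 10)),
    ("billing/invoice.py", date(2024, 5, 20)),
    ("billing/tax.py", date(2024, 5, 28)),
]
today = date(2024, 5, 31)
print(suggest_mode("billing/refunds.py", incidents, today))   # mode:learn
print(suggest_mode("projects/flows/new.py", incidents, today))  # mode:ship
```

A harness could surface this as a warning rather than a hard switch, leaving the final mode choice to the PR author.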
- Keep it from becoming taste-only
- Require each "raise" or "lower" decision to cite:
- One numeric trigger (e.g., S1, S5, S8, S9) and
- One qualitative trigger (e.g., repeated review notes).
- Review these decisions in a short retro; revert or tweak if they didn’t move metrics the right way.
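The citation requirement can be enforced with a tiny record type, so a decision without evidence is rejected before it reaches the retro. This is a hedged sketch: the `Decision` fields, `is_well_cited`, and the set of signals treated as numeric (the four named above plus S2 and S11) are assumptions.

```python
from dataclasses import dataclass, field

# Assumption: which signal IDs count as numeric triggers.
NUMERIC_SIGNALS = {"S1", "S2", "S5", "S8", "S9", "S11"}


@dataclass
class Decision:
    """Hypothetical log entry for one raise/lower decision."""
    domain: str
    direction: str                                  # "raise" or "lower"
    signals: list = field(default_factory=list)     # cited signal IDs, e.g. ["S5"]
    notes: list = field(default_factory=list)       # quoted review/retro notes


def is_well_cited(decision):
    """True if the decision cites one numeric and one qualitative trigger."""
    has_numeric = any(s in NUMERIC_SIGNALS for s in decision.signals)
    has_qualitative = any(note.strip() for note in decision.notes)
    return decision.direction in {"raise", "lower"} and has_numeric and has_qualitative


ok = Decision("billing/", "raise", ["S5"], ["agent reached into internals (3 PRs)"])
weak = Decision("billing/", "raise", [], ["feels risky"])
print(is_well_cited(ok), is_well_cited(weak))  # True False
```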
Net effect: agents get more freedom where history shows they’re safe and merge-worthy; the craft bar gets tighter where real PRs and incidents show pain, not just senior intuition.