For agent-first teams using triad pairing and a craft-maturity ladder, how could we redesign on-call rotations, incident review, and "ownership" handoffs so that the review bottleneck doubles as a structured apprenticeship surface—transferring boundary judgment and verification taste from seniors to juniors—without slowing down response or overloading the few senior custodians?
dhh-agent-first-software-craft | Updated at
Answer
Outline of a lightweight redesign that turns on-call and incidents into structured apprenticeship without tanking responsiveness.
- Role-graded on-call pairs
  - Tiered pairing by craft ladder:
    - Tier 3 (System Crafter): solo on core systems; may pair with a Tier 2 during complex windows.
    - Tier 2 (Emerging): primary on less-critical areas, always paired with a Tier 3 shadow for escalations.
    - Tier 1 (Operator): shadow on-call only; handles low-risk ops under playbooks.
  - Rotation rule: every Tier 1/2 on-call shift must be paired with a higher-tier partner who is reachable within minutes.
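The rotation rule above is simple enough to check mechanically when a schedule is drafted. The following is a minimal sketch, assuming tiers are encoded as the integers 1 to 3; the `Engineer` and `Shift` types and the `shift_violations` helper are hypothetical names for illustration, not part of any real scheduling tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Assumed encoding: 1 = Operator, 2 = Emerging, 3 = System Crafter.


@dataclass(frozen=True)
class Engineer:
    name: str
    tier: int


@dataclass(frozen=True)
class Shift:
    primary: Engineer
    partner: Optional[Engineer] = None  # higher-tier partner, if one is assigned


def shift_violations(shift: Shift) -> list[str]:
    """Return the reasons a shift breaks the rotation rule (empty list = valid)."""
    problems = []
    if shift.primary.tier < 3:  # Tier 3 may go solo
        if shift.partner is None:
            problems.append(
                f"{shift.primary.name} (Tier {shift.primary.tier}) has no partner"
            )
        elif shift.partner.tier <= shift.primary.tier:
            problems.append(
                f"partner {shift.partner.name} is not a higher tier "
                f"than {shift.primary.name}"
            )
    return problems
```

Whether the partner is actually reachable within minutes is not modeled here; that still has to be verified operationally.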
- Incident triad pattern (senior + junior + agent)
  - During an incident:
    - Agent: runs diffs, log/metric queries, and test runs via the harness.
    - Junior: drives the harness, writes the incident note, and proposes the first fix.
    - Senior: decides the risk class, approves high-risk actions, and narrows scope.
  - Hard rule: irreversible actions (schema/data/money) require senior approval; for these, the agent may only stage plans and dry-runs.
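One way to make the hard rule enforceable is a small gate in the harness that every action passes through. This is a sketch under assumptions: the role labels (`"agent"`, `"junior"`, `"senior"`), the `Action` type, and `may_execute` are all hypothetical. It models only the hard rule, so anything outside the three irreversible categories is allowed through here.

```python
from dataclasses import dataclass
from typing import Optional

# Categories the hard rule names as irreversible.
IRREVERSIBLE = {"schema", "data", "money"}


@dataclass(frozen=True)
class Action:
    description: str
    category: str  # e.g. "schema", "data", "money", "config", "restart"
    dry_run: bool = False


def may_execute(
    action: Action, actor: str, senior_approver: Optional[str] = None
) -> bool:
    """Decide whether `actor` may run `action` under the hard rule."""
    if action.category not in IRREVERSIBLE:
        return True  # not covered by the hard rule in this sketch
    if action.dry_run:
        return True  # staging a plan or a dry-run is always allowed
    if actor == "agent":
        return False  # the agent never runs irreversible actions for real
    return actor == "senior" or senior_approver is not None
```

For example, a junior proposing `Action("drop column", "schema")` is blocked until `senior_approver` is set, while the same action with `dry_run=True` can be staged by the agent immediately.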
- Fast, structured post-incident mini-review
  - Held within 24–48 hours of the incident; 20–30 minutes max.
  - Inputs: short timeline, diffs, harness logs.
  - Prompts aimed at taste/apprenticeship:
    - Where did boundary understanding fail or help?
    - Which verification step actually caught this, or would have caught it?
    - What would we want a Tier 1/2 to notice next time?
  - Outputs (pick 1–2 only):
    - A new or clarified runbook step.
    - A small test or contract change.
    - A harness rule or checklist tweak.
  - The junior who was primary writes the final summary and opens PRs for any test/runbook changes; the senior only reviews.
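The constraints on the mini-review (timing, length, and the 1–2 output limit) can be captured in a small record that a bot or template validates. This is a rough sketch: the `MiniReview` type, the output labels, and the choice to measure the 48-hour window from incident resolution are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical labels for the three output types listed above.
ALLOWED_OUTPUTS = {
    "runbook_step",
    "test_or_contract_change",
    "harness_rule_or_checklist",
}


@dataclass
class MiniReview:
    incident_resolved_at: datetime
    held_at: datetime
    duration_minutes: int
    outputs: list[str] = field(default_factory=list)


def review_problems(review: MiniReview) -> list[str]:
    """Return the ways a mini-review drifts from the format (empty = fine)."""
    problems = []
    if review.held_at - review.incident_resolved_at > timedelta(hours=48):
        problems.append("held more than 48h after the incident")
    if review.duration_minutes > 30:
        problems.append("ran longer than 30 minutes")
    if not 1 <= len(review.outputs) <= 2:
        problems.append("expected 1-2 outputs")
    unknown = [o for o in review.outputs if o not in ALLOWED_OUTPUTS]
    if unknown:
        problems.append(f"unknown output types: {unknown}")
    return problems
```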
- Ownership handoffs as "boundary tours"
  - When changing ownership of an area:
    - Do a 60–90 min "boundary tour" instead of a big doc.
    - Senior walks through key flows, no-go zones, canary tests, and recent incidents.
    - Junior runs the agent to answer 1–2 real queries ("find all writes to X", "where do we call vendor Y?") and narrates what they see.
  - Handoff checklist (very short):
    - Named primary and backup owners, with their craft tiers.
    - Links to key incident runbooks.
    - 3–5 "must-not-break" tests.
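Because the handoff checklist is so short, it can live as structured data next to the code and be checked automatically. The sketch below assumes hypothetical `Owner` and `Handoff` types and a `handoff_gaps` helper; the field names are illustrative only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Owner:
    name: str
    tier: int  # craft tier, 1 to 3


@dataclass
class Handoff:
    area: str
    primary: Optional[Owner] = None
    backup: Optional[Owner] = None
    runbook_links: list[str] = field(default_factory=list)
    must_not_break_tests: list[str] = field(default_factory=list)


def handoff_gaps(handoff: Handoff) -> list[str]:
    """Return the checklist items still missing (empty list = complete)."""
    gaps = []
    if handoff.primary is None:
        gaps.append("no named primary owner")
    if handoff.backup is None:
        gaps.append("no named backup owner")
    if not handoff.runbook_links:
        gaps.append("no incident runbooks linked")
    if not 3 <= len(handoff.must_not_break_tests) <= 5:
        gaps.append("expected 3-5 must-not-break tests")
    return gaps
```

A handoff PR could then be blocked until `handoff_gaps` returns an empty list, which keeps the checklist honest without turning it into a long document.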
- Making the review bottleneck the apprenticeship surface
  - On-call/incident PRs and runbook changes get tier-aware routing:
    - If the primary author is Tier 1, the reviewer must be Tier 3.
    - If the primary author is Tier 2, the reviewer can be Tier 2 or 3, but risky changes require Tier 3.
  - Simple reviewer prompts (checkboxes or canned comments):
    - "Did the junior choose a sane blast radius for the fix?"
    - "Did they add or adjust the right test or runbook step?"
    - "Is the boundary of ownership clearer after this change?"
  - Use these answers as craft-ladder evidence (not just gut feel) alongside regular PRs.
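The tier-aware routing rules above reduce to a small lookup. This is a minimal sketch; `allowed_reviewer_tiers` is a hypothetical helper, and the handling of Tier 3 authors is an assumption, since the rules above only cover Tier 1 and Tier 2 authors.

```python
from __future__ import annotations


def allowed_reviewer_tiers(author_tier: int, risky: bool) -> set[int]:
    """Return the craft tiers allowed to review a change by `author_tier`."""
    if author_tier not in (1, 2, 3):
        raise ValueError(f"unknown tier: {author_tier}")
    if author_tier == 1:
        return {3}  # Tier 1 work always goes to a Tier 3 reviewer
    if author_tier == 2:
        return {3} if risky else {2, 3}
    # Assumption: Tier 3 authors follow the same rule as Tier 2 authors.
    return {3} if risky else {2, 3}
```

How a change gets flagged as `risky` is left open here; it could come from a PR label or from the risk class the senior assigned during the incident.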
- Load management for seniors
  - Cap synchronous senior involvement:
    - Set a clear hours/week budget per senior for live incidents and reviews.
    - Use agents and templates to offload rote work: diff summaries, log digests, draft postmortems.
  - Escalation funnel:
    - Tier 1 pages their Tier 2 partner first.
    - Only Tier 2 pages Tier 3.
    - Auto-page Tier 3 only for pre-classified high-blast-radius alerts.
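The escalation funnel can be expressed as two tiny functions: who gets paged when an alert fires, and who a given tier may page next. The function names and the integer tier encoding are assumptions for this sketch; a real paging tool would express the same rules in its own escalation-policy configuration.

```python
from __future__ import annotations

from typing import Optional


def initial_page_tiers(on_call_tier: int, high_blast_radius: bool) -> set[int]:
    """Tiers paged when an alert fires.

    Tier 3 is added automatically only for alerts pre-classified
    as high-blast-radius.
    """
    tiers = {on_call_tier}
    if high_blast_radius:
        tiers.add(3)
    return tiers


def escalation_target(from_tier: int) -> Optional[int]:
    """Tier that `from_tier` may page next, or None if there is nowhere to go."""
    # Tier 1 pages Tier 2; only Tier 2 pages Tier 3.
    return {1: 2, 2: 3}.get(from_tier)
```

Keeping Tier 1 from paging Tier 3 directly is what protects the senior budget: most pages stop at the Tier 2 partner.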
- Integration with craft-maturity ladder
  - Tier transitions are tied to on-call and incident behavior, not just feature PRs:
    - T1→T2: completes a target number of shadow shifts and low-risk incidents whose postmortems show good reasoning and boundary awareness.
    - T2→T3: leads several incidents end-to-end, and proposes and lands verification or guardrail improvements that seniors endorse.
  - Keep a tiny log per engineer: notable incidents, their role, and key learnings or changes shipped.
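The per-engineer log can be a flat list of small records, which also makes the T1→T2 criterion checkable. This is only a sketch: `LogEntry`, its fields, and `ready_for_tier2` are hypothetical, and the thresholds are left as required arguments because the "target number" above is something each team has to choose.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    incident_id: str
    role: str              # e.g. "shadow", "primary", "lead" (assumed labels)
    low_risk: bool
    senior_endorsed: bool  # review showed good reasoning and boundary awareness
    change_shipped: str = ""  # test, runbook step, or guardrail that landed


def ready_for_tier2(
    entries: list[LogEntry],
    shadow_shifts: int,
    *,
    min_shadow_shifts: int,
    min_incidents: int,
) -> bool:
    """Check the T1→T2 evidence against team-chosen thresholds."""
    good = [e for e in entries if e.low_risk and e.senior_endorsed]
    return shadow_shifts >= min_shadow_shifts and len(good) >= min_incidents
```

The result is meant as input to a promotion conversation, alongside the reviewer prompts, rather than as an automatic decision.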
Net effect: on-call, incidents, and handoffs become repeatable surfaces where juniors practice boundary judgment and verification, while agents carry execution load and seniors focus on short, high-leverage decisions instead of doing all the work themselves.