For agent-first teams using triad pairing and a craft-maturity ladder, how could we redesign on-call rotations, incident review, and "ownership" handoffs so that the review bottleneck doubles as a structured apprenticeship surface—transferring boundary judgment and verification taste from seniors to juniors—without slowing down response or overloading the few senior custodians?

Answer

This is an outline of a lightweight redesign that turns on-call, incidents, and handoffs into a structured apprenticeship without hurting responsiveness.

  1. Role-graded on-call pairs
  • Tiered pairing by craft ladder:
    • Tier 3 (System Crafter): solo on-call for core systems; may pair with a Tier 2 during complex windows.
    • Tier 2 (Emerging): primary on less-critical areas but always paired with a Tier 3 shadow for escalations.
    • Tier 1 (Operator): shadow on-call only; handles low-risk ops under playbooks.
  • Rotation rule: every Tier 1/2 on-call shift must be paired with a higher-tier partner reachable within minutes.
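The rotation rule can be checked mechanically when a schedule is drafted. The following is a minimal sketch, not a prescribed implementation: the `Engineer` and `Shift` types and the `pairing_problem` function are hypothetical names introduced for illustration, and the "reachable within minutes" part of the rule is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Engineer:
    name: str
    tier: int  # 1 = Operator, 2 = Emerging, 3 = System Crafter


@dataclass(frozen=True)
class Shift:
    primary: Engineer
    partner: Optional[Engineer] = None  # higher-tier partner, if any


def pairing_problem(shift: Shift) -> Optional[str]:
    """Return a description of the rule violation, or None if the shift is valid."""
    if shift.primary.tier >= 3:
        return None  # Tier 3 may be on call solo
    if shift.partner is None:
        return f"{shift.primary.name} (Tier {shift.primary.tier}) has no partner"
    if shift.partner.tier <= shift.primary.tier:
        return f"{shift.partner.name} is not a higher tier than {shift.primary.name}"
    return None
```

Running a check like this over a draft rotation would flag unpaired Tier 1/2 shifts before the schedule is published, so nobody has to notice the gap during an incident.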
  2. Incident triad pattern (senior + junior + agent)
  • During an incident:
    • Agent: runs diffs, log/metric queries, test runs via harness.
    • Junior: drives the harness, writes the incident note, proposes the first fix.
    • Senior: decides risk class, approves high-risk actions, and narrows scope.
  • Hard rule: irreversible actions (schema/data/money) require senior approval; agent can stage plans and dry-runs only.
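The hard rule above could be enforced as a gate in the harness. This is a sketch under stated assumptions: `Action`, `IRREVERSIBLE_KINDS`, and `may_execute` are hypothetical names, and allowing reversible actions for any actor is an assumption, since the text only constrains irreversible ones.

```python
from dataclasses import dataclass
from typing import Optional

# Categories the hard rule names as irreversible.
IRREVERSIBLE_KINDS = {"schema", "data", "money"}


@dataclass(frozen=True)
class Action:
    description: str
    kind: str              # "schema", "data", "money", or any other label
    dry_run: bool = False  # True for staged plans and dry-runs


def may_execute(action: Action, actor: str,
                senior_approver: Optional[str] = None) -> bool:
    """actor is "agent", "junior", or "senior"."""
    if action.dry_run:
        return True  # plans and dry-runs are open to everyone, agent included
    if action.kind in IRREVERSIBLE_KINDS:
        if actor == "agent":
            return False  # the agent never executes irreversible actions
        return actor == "senior" or senior_approver is not None
    # Assumption: reversible actions are not gated by this rule.
    return True
```

Keeping the gate as one small pure function makes it easy to review and to test in isolation, which matters for a rule that guards schema, data, and money.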
  3. Fast, structured post-incident mini-review
  • Within 24–48h, 20–30 min max.
  • Inputs: short timeline, diffs, harness logs.
  • Prompts aimed at taste/apprenticeship:
    • Where did boundary understanding fail or help?
    • Which verification step actually caught or would have caught this?
    • What would we want a Tier 1/2 to notice next time?
  • Outputs (pick 1–2 only):
    • A new or clarified runbook step.
    • A small test or contract change.
    • A harness rule or checklist tweak.
  • The junior who was primary writes the final summary and opens PRs for any test/runbook changes; the senior only reviews.
  4. Ownership handoffs as "boundary tours"
  • When changing ownership of an area:
    • Do a 60–90 min "boundary tour" instead of a big doc.
    • Senior walks through: key flows, no-go zones, canary tests, and recent incidents.
    • Junior runs the agent to answer 1–2 real queries ("find all writes to X", "where do we call Y vendor?") and narrates what they see.
  • Handoff checklist (very short):
    • Named primary + backup owners and their craft tier.
    • Key incident runbooks linked.
    • 3–5 "must-not-break" tests.
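The handoff checklist is short enough to validate automatically, for example before an ownership change is recorded. A minimal sketch, assuming a hypothetical `Handoff` record and `missing_items` helper (neither comes from an existing tool):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Handoff:
    area: str
    primary_owner: str
    primary_tier: int
    backup_owner: str
    backup_tier: int
    runbook_links: List[str] = field(default_factory=list)
    must_not_break_tests: List[str] = field(default_factory=list)


def missing_items(h: Handoff) -> List[str]:
    """Return the checklist items that are still incomplete (empty if done)."""
    problems = []
    if not h.primary_owner or not h.backup_owner:
        problems.append("name both a primary and a backup owner")
    if h.primary_tier not in (1, 2, 3) or h.backup_tier not in (1, 2, 3):
        problems.append("record a craft tier (1-3) for each owner")
    if not h.runbook_links:
        problems.append("link the key incident runbooks")
    if not 3 <= len(h.must_not_break_tests) <= 5:
        problems.append("list 3-5 must-not-break tests")
    return problems
```

The upper bound on tests is deliberate in this sketch: it keeps the list to the handful of checks a new owner can actually hold in their head.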
  5. Making the review bottleneck the apprenticeship surface
  • On-call / incident PRs and runbook changes get tier-aware routing:
    • If primary author is Tier 1 → reviewer must be Tier 3.
    • If primary author is Tier 2 → reviewer can be Tier 2 or Tier 3, but risky changes require a Tier 3 reviewer.
  • Simple reviewer prompts (checkboxes or canned comments):
    • "Did the junior choose a sane blast radius for the fix?"
    • "Did they add/adjust the right test or runbook step?"
    • "Is the boundary of ownership clearer after this change?"
  • Use these answers as craft-ladder evidence (not just gut feel) alongside regular PRs.
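The tier-aware routing rule is small enough to express as a lookup. The sketch below is illustrative only: the function names are hypothetical, the `risky` flag is assumed to be set by whoever classifies the change, and the Tier 3 author case is an assumption because the routing rules only cover Tier 1 and Tier 2 authors.

```python
from typing import Set


def allowed_reviewer_tiers(author_tier: int, risky: bool = False) -> Set[int]:
    """Tiers permitted to review a change by an author of the given tier."""
    if author_tier == 1:
        return {3}
    if author_tier == 2:
        return {3} if risky else {2, 3}
    if author_tier == 3:
        # Assumption: not specified in the routing rules; peer review by Tier 3.
        return {3}
    raise ValueError(f"unknown craft tier: {author_tier}")


def can_review(author_tier: int, reviewer_tier: int, risky: bool = False) -> bool:
    return reviewer_tier in allowed_reviewer_tiers(author_tier, risky)
```

A rule like this could back a review-assignment bot or a merge check, so routing does not depend on anyone remembering the policy under incident pressure.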
  6. Load management for seniors
  • Cap synchronous senior involvement:
    • Clear hours/week budget per senior for live incidents and reviews.
    • Use agents + templates to offload rote work: diff summaries, log digests, draft postmortems.
  • Escalation funnel:
    • Tier 1 pages their Tier 2 partner first.
    • Only Tier 2 escalates to Tier 3; Tier 1 does not page Tier 3 directly.
    • Auto-page Tier 3 only for pre-classified high-blast-radius alerts.
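The escalation funnel and the weekly budget can be sketched as three small functions. All names here are hypothetical, and two points are assumptions rather than rules from the outline: the on-call responder is always paged first, and the budget is tracked as plain hours.

```python
from typing import List, Optional


def escalation_target(current_tier: int) -> Optional[int]:
    """Tier a responder escalates to by hand, or None at the top of the funnel."""
    if current_tier == 1:
        return 2  # Tier 1 pages their Tier 2 partner first
    if current_tier == 2:
        return 3  # only Tier 2 escalates to Tier 3
    return None


def initial_pages(on_call_tier: int, high_blast_radius: bool) -> List[int]:
    """Tiers paged automatically when an alert fires."""
    pages = [on_call_tier]  # assumption: the on-call responder is always paged
    if high_blast_radius and on_call_tier < 3:
        pages.append(3)  # auto-page Tier 3 only for pre-classified alerts
    return pages


def fits_budget(hours_used: float, hours_needed: float, weekly_budget: float) -> bool:
    """True if a senior can take the work without exceeding the weekly cap."""
    return hours_used + hours_needed <= weekly_budget
```

What happens when `fits_budget` returns false (defer the review, hand it to another senior, or raise the cap) is a team decision this sketch leaves open.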
  7. Integration with craft-maturity ladder
  • Tier transitions tied to on-call + incident behavior, not just feature PRs:
    • T1→T2: handles a target number of shadow shifts and low-risk incidents where postmortems show good reasoning and boundary awareness.
    • T2→T3: leads several incidents end-to-end, proposes and lands verification or guardrail improvements that seniors endorse.
  • Keep a tiny log per engineer: notable incidents, their role, and key learning/changes shipped.
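The per-engineer log can stay tiny and still feed tier decisions. A minimal sketch, assuming hypothetical `LogEntry` and `CraftLog` types and role labels ("shadow", "primary", "lead"); the numeric targets are deliberately left as parameters because the ladder criteria only say "a target number" and "several".

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class LogEntry:
    incident: str
    role: str                                # "shadow", "primary", or "lead"
    learning: str
    changes_shipped: Tuple[str, ...] = ()    # tests, runbook steps, guardrails


@dataclass
class CraftLog:
    engineer: str
    tier: int
    entries: List[LogEntry] = field(default_factory=list)

    def count(self, role: str, with_changes: bool = False) -> int:
        """Count entries in a role, optionally only those that shipped a change."""
        return sum(
            1
            for e in self.entries
            if e.role == role and (e.changes_shipped or not with_changes)
        )


def has_enough_evidence(log: CraftLog, role: str, target: int,
                        with_changes: bool = False) -> bool:
    return log.count(role, with_changes) >= target
```

Counts like these only establish that there is enough material to discuss; whether the reasoning and boundary awareness were good remains a judgment for the seniors reviewing the entries.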

Net effect: on-call, incidents, and handoffs become repeatable surfaces where juniors practice boundary judgment and verification, while agents carry execution load and seniors focus on short, high-leverage decisions instead of doing all the work themselves.