If we flip the dominant framing and assume apprenticeship decay and scope intoxication are the primary constraints in agent-first teams—rather than the review bottleneck or ambition frontier—how would that change which systems you make highly agent-accessible (e.g., CLI tools, core monolith, verification layer) versus deliberately human-gated, and what observable patterns in PRs, incidents, and promotion packets would distinguish a team that successfully optimizes for long-term judgment formation from one that is still over-optimizing for short-term small-team leverage?

dhh-agent-first-software-craft

Answer

High-level: if you treat apprenticeship decay and scope intoxication as the primary constraints, you limit where agents can act and how many workstreams they can open. You make low-risk surfaces highly agent-accessible, keep core judgment surfaces human-gated, and watch PRs and promotion packets for evidence of real human decisions rather than agent routing.

  1. Agent-accessible vs human-gated
  • Make highly agent-accessible:

    • CLI / glue / scaffolding
      • Scripts, adapters, migrations, boring integrations.
      • Rationale: good practice surface; low architectural lock-in; easy rollback.
    • Localized UI / copy / minor product variants
      • Small view changes, content tweaks, A/B plumbing.
      • Rationale: supports reversible hunch probes and cheap iteration.
    • Low-risk verification helpers
      • Test-data builders, fixture generators, log scrapers.
      • Rationale: speeds feedback without deciding what “good” means.
  • Keep deliberately human-gated:

    • Core domain and boundary moves in the monolith
      • New domain concepts, boundary splits/merges, cross-boundary calls.
      • Gating: agents propose diffs; merges require explicit human decision records.
    • Verification layer semantics
      • What gets tested, which invariants matter, policy rules.
      • Gating: humans define/approve checks; agents help implement.
    • Harness surfaces that expand scope
      • New tools/CLIs, new external integrations, new automation flows.
      • Gating: require human review of “surface area” PRs and a simple risk note.
  • Additional constraints for scope intoxication:

    • WIP caps on agent-created branches/flows per engineer/team.
    • Risk tiers in harness: Tier 0 (exploration, no-prod), Tier 1 (internal tools), Tier 2 (prod-touching). Agents fully allowed only in lower tiers.
    • Cooling-off for new systems: agent can spike a prototype, but any new service/major CLI requires a human “keep/kill/simplify” call within a short window.
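The tiering, WIP-cap, and cooling-off constraints above can be sketched as a small policy check a harness might run before merge. This is a hypothetical sketch: the tier names, the cap of 3 open agent branches, and the 7-day keep/kill window are illustrative assumptions, not a real tool's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical risk tiers (assumption: three tiers, as in the text above).
TIER_EXPLORATION = 0   # no-prod sandboxes: agents unrestricted
TIER_INTERNAL = 1      # internal tools: agents allowed, human sign-off on merge
TIER_PROD = 2          # prod-touching: agents propose only; human decision record required

MAX_OPEN_AGENT_BRANCHES = 3           # illustrative WIP cap per engineer
KEEP_KILL_WINDOW = timedelta(days=7)  # illustrative cooling-off for new systems

@dataclass
class AgentChange:
    tier: int
    author_open_agent_branches: int
    is_new_system: bool
    spiked_at: datetime = field(default_factory=datetime.utcnow)
    human_decision_recorded: bool = False

def gate(change: AgentChange) -> list[str]:
    """Return the human actions still required before this change may merge."""
    required = []
    if change.author_open_agent_branches >= MAX_OPEN_AGENT_BRANCHES:
        required.append("close or merge an existing agent branch (WIP cap)")
    if change.tier >= TIER_PROD and not change.human_decision_recorded:
        required.append("attach an explicit human decision record")
    if change.is_new_system:
        deadline = change.spiked_at + KEEP_KILL_WINDOW
        required.append(f"human keep/kill/simplify call by {deadline.date()}")
    return required
```

The point of returning a list of required human actions, rather than a boolean, is that the gate names the judgment being demanded, which is the apprenticeship signal the section is after.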
  2. PR patterns: long-term judgment vs short-term leverage
  • Team optimizing for long-term judgment formation:

    • Many small, human-explained PRs from juniors in core code.
    • Clear separation: “agent did X, I changed Y because…” in descriptions.
    • Review comments often about tradeoffs, boundaries, verification, not just nits.
    • PRs that introduce new tools/flows include a short rationale and risk note.
    • Steady ratio of human-authored to agent-authored lines in critical paths.
  • Team optimizing mainly for short-term small-team leverage:

    • Large agent-heavy PRs with thin human context.
    • Review comments mostly about immediate bugs/tests; little architecture talk.
    • Frequent new tools/CLIs with weak rationales and overlapping purposes.
    • Few PRs where juniors change verification strategy or boundaries themselves.
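The contrast above can be turned into a rough triage heuristic over PR metadata. A minimal sketch, assuming per-PR agent/human line counts are available and using a naive regex for the "agent did X, I changed Y because…" convention; the 0.8 threshold and field names are assumptions.

```python
import re

# Naive marker for explicit human context in a PR description,
# per the "agent did X, I changed Y because..." convention above.
HUMAN_CONTEXT_RE = re.compile(
    r"(agent (did|wrote|generated)|i (changed|rewrote|kept)|because)",
    re.IGNORECASE,
)

def pr_signal(description: str, agent_lines: int, human_lines: int) -> str:
    """Classify one PR as judgment-leaning, leverage-leaning, or neutral."""
    total = agent_lines + human_lines
    agent_ratio = agent_lines / total if total else 0.0
    has_context = bool(HUMAN_CONTEXT_RE.search(description))
    if agent_ratio > 0.8 and not has_context:
        return "leverage-leaning"   # large agent diff, thin human explanation
    if has_context:
        return "judgment-leaning"   # explicit human decision narrative
    return "neutral"
```

A real audit would want authorship attribution from commit trailers or tooling, not self-reported counts; the heuristic only illustrates which signals to combine.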
  3. Incident patterns
  • Healthy, judgment-focused team:

    • Incidents skew toward honest complexity and edge cases, not basic misunderstandings.
    • Postmortems show humans making explicit calls (“we chose this boundary…”) rather than opaque agent behavior.
    • Some incidents are caught by tests or harness checks that juniors helped design.
  • Over-optimized for leverage:

    • Incidents often trace to:
      • Agent-created flows mis-wiring systems.
      • Fragile verification scripts giving false green.
      • New scope (tools/services) with no clear owner.
    • Repeated classes of “we shipped something nobody fully understood.”
  4. Promotion packet signals
  • Judgment-first, apprenticeship-preserving team:

    • Juniors’ packets include:
      • Examples of framing a problem before using agents.
      • Own decisions on where to trust agents vs hand-craft.
      • Evidence of redesigning tests or boundaries, not just throughput.
    • Seniors’ packets show:
      • Time spent on teaching (design briefs, walkthroughs), not only review volume.
      • Deliberate “kill or simplify” decisions on over-scoped initiatives.
  • Leverage-first team:

    • Packets emphasize lines of code, project count, shipping speed, and the number of harness flows.
    • Little narrative about tradeoffs, verification design, or de-scoping choices.
    • Juniors described as “effective agent operators” more than system thinkers.
  5. Evidence classification
  • evidence_type: mixed (conceptual synthesis + practice-informed patterns)
  • evidence_strength: mixed (matches early field reports but not rigorously measured)
  6. Assumptions
  • Agents are strong enough to produce merge-worthy code in many areas but still error-prone.
  • The org cares about growing new seniors, not just short-term delivery.
  • Promotion and review data are available and can be inspected qualitatively.
  7. Competing hypothesis
  • The main constraints remain the review bottleneck and the ambition frontier; apprenticeship and scope issues are secondary and can be handled culturally, without changing where agents are allowed to act or how PRs are structured.
  8. Main failure case / boundary
  • In very small, time-crunched teams, strong gating and WIP caps may slow delivery so much they are ignored or worked around; the team reverts to leverage-first behavior despite the intended design.
  9. Verification targets (small checks a human could run next)
  • Sample 3–6 months of PRs from an agent-first team and classify:
    • Agent vs human authorship ratios in core vs glue code.
    • Depth/type of review comments.
    • Presence of explicit rationales in scope-expanding changes.
  • Compare incident origins over the same window:
    • % from new agent-created flows vs changes in old, human-owned core.
  • Inspect a few promotion packets for juniors to see whether they show real architectural and verification decisions or mostly agent operation.
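The authorship-ratio check above is small enough to run by hand. A sketch, assuming each sampled PR has been tagged with an area ("core" or "glue") and per-author line counts; the data shape is hypothetical.

```python
from collections import defaultdict

def authorship_ratios(prs: list[dict]) -> dict[str, float]:
    """Agent share of changed lines, split by code area (e.g. core vs glue)."""
    agent = defaultdict(int)
    total = defaultdict(int)
    for pr in prs:
        area = pr["area"]  # assumed tag: "core" or "glue"
        agent[area] += pr["agent_lines"]
        total[area] += pr["agent_lines"] + pr["human_lines"]
    return {area: agent[area] / total[area] for area in total if total[area]}
```

Computing this per quarter and watching whether the core ratio rises is one concrete way to operationalize the apprenticeship-decay signal without heavy tooling.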
  10. Open questions
  • What is a reasonable baseline for “enough” direct human reps in core paths per engineer per quarter?
  • How aggressive can WIP and scope caps be before they materially harm useful ambition?
  • Can the harness infer and surface early signs of apprenticeship decay (e.g., rising agent-line ratios in core) without heavy-handed metrics?