If we flip the dominant framing and assume apprenticeship decay and scope intoxication are the primary constraints in agent-first teams—rather than the review bottleneck or ambition frontier—how would that change which systems you make highly agent-accessible (e.g., CLI tools, core monolith, verification layer) versus deliberately human-gated, and what observable patterns in PRs, incidents, and promotion packets would distinguish a team that successfully optimizes for long-term judgment formation from one that is still over-optimizing for short-term small-team leverage?
dhh-agent-first-software-craft
Answer
High-level: if you optimize against apprenticeship decay and scope intoxication first, you constrain where agents can act and how many things they can start. You make low-risk surfaces very agent-accessible, keep core judgment surfaces human-gated, and watch PRs/promotion packets for evidence of real human decisions rather than agent routing.
- Agent-accessible vs human-gated
  - Make highly agent-accessible:
    - CLI / glue / scaffolding
      - Scripts, adapters, migrations, boring integrations.
      - Rationale: good practice surface; low architectural lock-in; easy rollback.
    - Localized UI / copy / minor product variants
      - Small view changes, content tweaks, A/B plumbing.
      - Rationale: supports reversible hunch probes and cheap iteration.
    - Low-risk verification helpers
      - Test-data builders, fixture generators, log scrapers.
      - Rationale: speeds feedback without deciding what “good” means.
  - Keep deliberately human-gated:
    - Core domain and boundary moves in the monolith
      - New domain concepts, boundary splits/merges, cross-boundary calls.
      - Gating: agents propose diffs; merges require explicit human decision records.
    - Verification-layer semantics
      - What gets tested, which invariants matter, policy rules.
      - Gating: humans define and approve checks; agents help implement them.
    - Harness surfaces that expand scope
      - New tools/CLIs, new external integrations, new automation flows.
      - Gating: require human review of “surface area” PRs plus a short risk note.
-
  - Additional constraints for scope intoxication:
    - WIP caps on agent-created branches/flows per engineer and per team.
    - Risk tiers in the harness: Tier 0 (exploration, no prod), Tier 1 (internal tools), Tier 2 (prod-touching); agents act freely only in the lower tiers.
    - Cooling-off for new systems: an agent can spike a prototype, but any new service or major CLI requires a human “keep/kill/simplify” call within a short window.
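The tier and WIP constraints above can be sketched as a small policy gate. This is a minimal, hypothetical sketch, not a real harness API: the tier numbers come from the text, but the cap value and the `AgentAction`/`agent_may_proceed` names are illustrative assumptions.

```python
from dataclasses import dataclass

# Risk tiers as described above; the numeric values are just labels.
TIER_EXPLORATION = 0   # exploration, no prod access
TIER_INTERNAL = 1      # internal tools
TIER_PROD = 2          # prod-touching surfaces

MAX_AGENT_WIP = 3      # illustrative cap on open agent-created branches/flows


@dataclass
class AgentAction:
    surface_tier: int          # risk tier of the surface being touched
    open_agent_branches: int   # engineer's current agent-created WIP
    human_approved: bool       # explicit human decision record exists


def agent_may_proceed(action: AgentAction) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed agent action."""
    if action.open_agent_branches >= MAX_AGENT_WIP:
        return False, "WIP cap reached: close or kill an existing agent flow first"
    if action.surface_tier >= TIER_PROD and not action.human_approved:
        return False, "prod-touching surface requires a human decision record"
    return True, "ok"
```

The point of the sketch is that both constraints are cheap, mechanical checks the harness can enforce before any agent work starts, leaving the actual keep/kill/simplify judgment to a human.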
- PR patterns: long-term judgment vs short-term leverage
  - Team optimizing for long-term judgment formation:
    - Many small, human-explained PRs from juniors in core code.
    - Clear separation in descriptions: “agent did X, I changed Y because…”.
    - Review comments often about tradeoffs, boundaries, and verification, not just nits.
    - PRs that introduce new tools/flows include a short rationale and risk note.
    - Steady ratio of human-authored to agent-authored lines in critical paths.
  - Team optimizing mainly for short-term small-team leverage:
    - Large agent-heavy PRs with thin human context.
    - Review comments mostly about immediate bugs/tests; little architecture talk.
    - Frequent new tools/CLIs with weak rationales and overlapping purposes.
    - Few PRs where juniors themselves change verification strategy or boundaries.
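A rough heuristic can flag the leverage-first PR patterns above for human spot-checking. This is an assumption-laden sketch: the input fields and thresholds are hypothetical, and in practice "human context" needs a reviewer's eye, not a substring check.

```python
def pr_signals(agent_lines: int, human_lines: int,
               description: str, expands_scope: bool) -> dict:
    """Flag PRs worth a closer human look; all thresholds are illustrative."""
    total = agent_lines + human_lines
    agent_ratio = agent_lines / total if total else 0.0
    # Crude proxy for "agent did X, I changed Y because..." style context.
    has_human_context = "because" in description.lower() and len(description) > 80
    return {
        "agent_ratio": round(agent_ratio, 2),
        # Agent-heavy diff with a thin description: the leverage-first smell.
        "thin_human_context": agent_ratio > 0.8 and not has_human_context,
        # Scope-expanding change without any stated risk note.
        "missing_risk_note": expands_scope and "risk" not in description.lower(),
    }
```

Such signals should route PRs to review attention, not gate merges automatically; the whole argument of this section is that the judgment call stays human.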
- Incident patterns
  - Healthy, judgment-focused team:
    - Incidents skew toward honest complexity and edge cases, not basic misunderstandings.
    - Postmortems show humans making explicit calls (“we chose this boundary…”) rather than opaque agent behavior.
    - Some incidents are caught early by tests or harness checks that juniors helped design.
  - Over-optimized for leverage:
    - Incidents often trace to:
      - Agent-created flows mis-wiring systems.
      - Fragile verification scripts giving a false green.
      - New scope (tools/services) with no clear owner.
    - Repeated classes of “we shipped something nobody fully understood.”
- Promotion packet signals
  - Judgment-first, apprenticeship-preserving team:
    - Juniors’ packets include:
      - Examples of framing a problem before using agents.
      - Their own decisions on where to trust agents vs hand-craft.
      - Evidence of redesigning tests or boundaries, not just throughput.
    - Seniors’ packets show:
      - Time spent on teaching (design briefs, walkthroughs), not only review volume.
      - Deliberate “kill or simplify” decisions on over-scoped initiatives.
  - Leverage-first team:
    - Packets emphasize LOC, project count, speed, and number of harness flows.
    - Little narrative about tradeoffs, verification design, or de-scoping choices.
    - Juniors described as “effective agent operators” more than as system thinkers.
- Evidence classification
  - evidence_type: mixed (conceptual synthesis + practice-informed patterns)
  - evidence_strength: mixed (consistent with early field reports, but not rigorously measured)
- Assumptions
  - Agents are strong enough to produce merge-worthy code in many areas but remain error-prone.
  - The org cares about growing new seniors, not just short-term delivery.
  - Promotion and review data are available and can be inspected qualitatively.
- Competing hypothesis
  - The main constraint remains review bottlenecks and the ambition frontier; apprenticeship and scope issues are secondary and can be handled culturally, without changing where agents are allowed or how PRs are structured.
- Main failure case / boundary
  - In very small, time-crunched teams, strong gating and WIP caps may slow delivery so much that they are ignored or worked around; the team reverts to leverage-first behavior despite the intended design.
- Verification targets (small checks a human could run next)
  - Sample 3–6 months of PRs from an agent-first team and classify:
    - Agent vs human authorship ratios in core vs glue code.
    - Depth and type of review comments.
    - Presence of explicit rationales in scope-expanding changes.
  - Compare incident origins over the same window:
    - Percentage arising from new agent-created flows vs changes in old, human-owned core code.
  - Inspect a few junior promotion packets to see whether they show real architectural and verification decisions or mostly agent operation.
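The first verification target, authorship ratios, can be approximated from git history. A minimal sketch, assuming agent commits are distinguishable by author email (the `"agent"` substring match here is a placeholder for whatever attribution your setup actually records):

```python
import subprocess
from collections import Counter


def parse_numstat(log_text: str) -> Counter:
    """Parse `git log --pretty=%x01%ae --numstat` output into added-line
    counts keyed by author kind ('agent' vs 'human')."""
    counts: Counter = Counter()
    kind = "human"
    for line in log_text.splitlines():
        if line.startswith("\x01"):
            # Commit header line: just the author email after the sentinel.
            kind = "agent" if "agent" in line[1:] else "human"
        else:
            parts = line.split("\t")
            # numstat rows are "<added>\t<deleted>\t<path>"; binary files
            # show "-" and are skipped.
            if len(parts) == 3 and parts[0].isdigit():
                counts[kind] += int(parts[0])
    return counts


def agent_line_ratio(repo_path: str, since: str = "6 months ago") -> float:
    """Fraction of added lines that came from agent-attributed commits."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--pretty=%x01%ae", "--numstat"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = parse_numstat(out)
    total = counts["agent"] + counts["human"]
    return counts["agent"] / total if total else 0.0
```

Running this separately against core paths and glue paths (e.g. with `-- <pathspec>` appended to the `git log` call) gives the core-vs-glue comparison the verification target asks for.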
- Open questions
  - What is a reasonable baseline for “enough” direct human reps in core paths per engineer per quarter?
  - How aggressive can WIP and scope caps be before they materially harm useful ambition?
  - Can the harness infer and surface early signs of apprenticeship decay (e.g., a rising agent-line ratio in core code) without heavy-handed metrics?
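One lightweight answer to the last open question is a trailing-window alarm on the core agent-line ratio. The window size and threshold below are illustrative assumptions, not recommended values:

```python
def decay_warning(weekly_core_agent_ratios: list[float],
                  window: int = 4, threshold: float = 0.7) -> bool:
    """True if the trailing-window average agent-line ratio in core code
    exceeds the threshold, i.e. humans are losing direct reps on core paths.
    Returns False while there is not yet a full window of data."""
    if len(weekly_core_agent_ratios) < window:
        return False
    recent = weekly_core_agent_ratios[-window:]
    return sum(recent) / window > threshold
```

The design choice is deliberate: a single boolean surfaced to the team, rather than a per-engineer dashboard, keeps the signal useful for conversation without becoming the heavy-handed metric the question warns about.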