If we flip the dominant framing and assume apprenticeship decay and scope intoxication are the primary constraints in agent-first teams—rather than the review bottleneck or ambition frontier—how would that change which systems you make highly agent-accessible (e.g., CLI tools, core monolith, verification layer) versus deliberately human-gated, and what observable patterns in PRs, incidents, and promotion packets would distinguish a team that successfully optimizes for long-term judgment formation from one that is still over-optimizing for short-term small-team leverage?
dhh-agent-first-software-craft
Answer
High-level: if you optimize against apprenticeship decay and scope intoxication first, you constrain where agents can act and how many things they can start. You make low-risk surfaces very agent-accessible, keep core judgment surfaces human-gated, and watch PRs/promotion packets for evidence of real human decisions rather than agent routing.
- Agent-accessible vs human-gated
  - Make highly agent-accessible:
    - CLI / glue / scaffolding
      - Scripts, adapters, migrations, boring integrations.
      - Rationale: good practice surface; low architectural lock-in; easy rollback.
    - Localized UI / copy / minor product variants
      - Small view changes, content tweaks, A/B plumbing.
      - Rationale: supports reversible hunch probes and cheap iteration.
    - Low-risk verification helpers
      - Test-data builders, fixture generators, log scrapers.
      - Rationale: speeds feedback without deciding what “good” means.
  - Keep deliberately human-gated:
    - Core domain and boundary moves in the monolith
      - New domain concepts, boundary splits/merges, cross-boundary calls.
      - Gating: agents propose diffs; merges require explicit human decision records.
    - Verification-layer semantics
      - What gets tested, which invariants matter, policy rules.
      - Gating: humans define and approve checks; agents help implement them.
    - Harness surfaces that expand scope
      - New tools/CLIs, new external integrations, new automation flows.
      - Gating: require human review of “surface area” PRs plus a short risk note.
-
  - Additional constraints for scope intoxication:
    - WIP caps on agent-created branches/flows per engineer and per team.
    - Risk tiers in the harness: Tier 0 (exploration, no prod), Tier 1 (internal tools), Tier 2 (prod-touching); agents act freely only in the lower tiers.
    - Cooling-off for new systems: an agent can spike a prototype, but any new service or major CLI requires a human “keep/kill/simplify” call within a short window.
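The tier and WIP constraints above can be sketched as a small policy gate. This is a minimal, hypothetical sketch, not a real harness API: the tier numbers come from the text, but the cap value and the `AgentAction`/`agent_may_proceed` names are illustrative assumptions.

```python
from dataclasses import dataclass

# Risk tiers as described above; the numeric values are just labels.
TIER_EXPLORATION = 0   # exploration, no prod access
TIER_INTERNAL = 1      # internal tools
TIER_PROD = 2          # prod-touching surfaces

MAX_AGENT_WIP = 3      # illustrative cap on open agent-created branches/flows


@dataclass
class AgentAction:
    surface_tier: int          # risk tier of the surface being touched
    open_agent_branches: int   # engineer's current agent-created WIP
    human_approved: bool       # explicit human decision record exists


def agent_may_proceed(action: AgentAction) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed agent action."""
    if action.open_agent_branches >= MAX_AGENT_WIP:
        return False, "WIP cap reached: close or kill an existing agent flow first"
    if action.surface_tier >= TIER_PROD and not action.human_approved:
        return False, "prod-touching surface requires a human decision record"
    return True, "ok"
```

The point of the sketch is that both constraints are cheap, mechanical checks the harness can enforce before any agent work starts, leaving the actual keep/kill/simplify judgment to a human.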
- PR patterns: long-term judgment vs short-term leverage
  - Team optimizing for long-term judgment formation:
    - Many small, human-explained PRs from juniors in core code.
    - Clear separation in descriptions: “agent did X, I changed Y because…”.
    - Review comments often about tradeoffs, boundaries, and verification, not just nits.
    - PRs that introduce new tools/flows include a short rationale and risk note.
    - Steady ratio of human-authored to agent-authored lines in critical paths.
  - Team optimizing mainly for short-term small-team leverage:
    - Large agent-heavy PRs with thin human context.
    - Review comments mostly about immediate bugs/tests; little architecture talk.
    - Frequent new tools/CLIs with weak rationales and overlapping purposes.
    - Few PRs where juniors themselves change verification strategy or boundaries.
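A rough heuristic can flag the leverage-first PR patterns above for human spot-checking. This is an assumption-laden sketch: the input fields and thresholds are hypothetical, and in practice "human context" needs a reviewer's eye, not a substring check.

```python
def pr_signals(agent_lines: int, human_lines: int,
               description: str, expands_scope: bool) -> dict:
    """Flag PRs worth a closer human look; all thresholds are illustrative."""
    total = agent_lines + human_lines
    agent_ratio = agent_lines / total if total else 0.0
    # Crude proxy for "agent did X, I changed Y because..." style context.
    has_human_context = "because" in description.lower() and len(description) > 80
    return {
        "agent_ratio": round(agent_ratio, 2),
        # Agent-heavy diff with a thin description: the leverage-first smell.
        "thin_human_context": agent_ratio > 0.8 and not has_human_context,
        # Scope-expanding change without any stated risk note.
        "missing_risk_note": expands_scope and "risk" not in description.lower(),
    }
```

Such signals should route PRs to review attention, not gate merges automatically; the whole argument of this section is that the judgment call stays human.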
- Incident patterns
  - Healthy, judgment-focused team:
    - Incidents skew toward honest complexity and edge cases, not basic misunderstandings.
    - Postmortems show humans making explicit calls (“we chose this boundary…”) rather than opaque agent behavior.
    - Some incidents are caught early by tests or harness checks that juniors helped design.
  - Over-optimized for leverage:
    - Incidents often trace to:
      - Agent-created flows mis-wiring systems.
      - Fragile verification scripts giving a false green.
      - New scope (tools/services) with no clear owner.
    - Repeated classes of “we shipped something nobody fully understood.”
- Promotion packet signals
  - Judgment-first, apprenticeship-preserving team:
    - Juniors’ packets include:
      - Examples of framing a problem before using agents.
      - Their own decisions on where to trust agents vs hand-craft.
      - Evidence of redesigning tests or boundaries, not just throughput.
    - Seniors’ packets show:
      - Time spent on teaching (design briefs, walkthroughs), not only review volume.
      - Deliberate “kill or simplify” decisions on over-scoped initiatives.
  - Leverage-first team:
    - Packets emphasize LOC, project count, speed, and number of harness flows.
    - Little narrative about tradeoffs, verification design, or de-scoping choices.
    - Juniors described as “effective agent operators” more than as system thinkers.
- Evidence classification
  - evidence_type: mixed (conceptual synthesis + practice-informed patterns)
  - evidence_strength: mixed (consistent with early field reports, but not rigorously measured)
- Assumptions
  - Agents are strong enough to produce merge-worthy code in many areas but remain error-prone.
  - The org cares about growing new seniors, not just short-term delivery.
  - Promotion and review data are available and can be inspected qualitatively.
- Competing hypothesis
  - The main constraint remains review bottlenecks and the ambition frontier; apprenticeship and scope issues are secondary and can be handled culturally, without changing where agents are allowed or how PRs are structured.
- Main failure case / boundary
  - In very small, time-crunched teams, strong gating and WIP caps may slow delivery so much that they are ignored or worked around; the team reverts to leverage-first behavior despite the intended design.
- Verification targets (small checks a human could run next)
  - Sample 3–6 months of PRs from an agent-first team and classify:
    - Agent vs human authorship ratios in core vs glue code.
    - Depth and type of review comments.
    - Presence of explicit rationales in scope-expanding changes.
  - Compare incident origins over the same window:
    - Percentage arising from new agent-created flows vs changes in old, human-owned core code.
  - Inspect a few junior promotion packets to see whether they show real architectural and verification decisions or mostly agent operation.
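The first verification target, authorship ratios, can be approximated from git history. A minimal sketch, assuming agent commits are distinguishable by author email (the `"agent"` substring match here is a placeholder for whatever attribution your setup actually records):

```python
import subprocess
from collections import Counter


def parse_numstat(log_text: str) -> Counter:
    """Parse `git log --pretty=%x01%ae --numstat` output into added-line
    counts keyed by author kind ('agent' vs 'human')."""
    counts: Counter = Counter()
    kind = "human"
    for line in log_text.splitlines():
        if line.startswith("\x01"):
            # Commit header line: just the author email after the sentinel.
            kind = "agent" if "agent" in line[1:] else "human"
        else:
            parts = line.split("\t")
            # numstat rows are "<added>\t<deleted>\t<path>"; binary files
            # show "-" and are skipped.
            if len(parts) == 3 and parts[0].isdigit():
                counts[kind] += int(parts[0])
    return counts


def agent_line_ratio(repo_path: str, since: str = "6 months ago") -> float:
    """Fraction of added lines that came from agent-attributed commits."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--pretty=%x01%ae", "--numstat"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = parse_numstat(out)
    total = counts["agent"] + counts["human"]
    return counts["agent"] / total if total else 0.0
```

Running this separately against core paths and glue paths (e.g. with `-- <pathspec>` appended to the `git log` call) gives the core-vs-glue comparison the verification target asks for.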
- Open questions
  - What is a reasonable baseline for “enough” direct human reps in core paths per engineer per quarter?
  - How aggressive can WIP and scope caps be before they materially harm useful ambition?
  - Can the harness infer and surface early signs of apprenticeship decay (e.g., a rising agent-line ratio in core code) without heavy-handed metrics?
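One lightweight answer to the last open question is a trailing-window alarm on the core agent-line ratio. The window size and threshold below are illustrative assumptions, not recommended values:

```python
def decay_warning(weekly_core_agent_ratios: list[float],
                  window: int = 4, threshold: float = 0.7) -> bool:
    """True if the trailing-window average agent-line ratio in core code
    exceeds the threshold, i.e. humans are losing direct reps on core paths.
    Returns False while there is not yet a full window of data."""
    if len(weekly_core_agent_ratios) < window:
        return False
    recent = weekly_core_agent_ratios[-window:]
    return sum(recent) / window > threshold
```

The design choice is deliberate: a single boolean surfaced to the team, rather than a per-engineer dashboard, keeps the signal useful for conversation without becoming the heavy-handed metric the question warns about.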