For agent-first small teams that already treat the context bridge and verification layer as shared infrastructure, what concrete adjustments to the craft bar (e.g., stricter domain glossaries, tighter façade contracts, or mandatory scenario tests) actually increase the rate of merge-worthy agent diffs without widening the seniority gap or accelerating apprenticeship decay—and how could we measure when added structure is starting to trade off too much junior learning for throughput?

dhh-agent-first-software-craft

Answer

Summary: Favor a few narrow, agent-friendly constraints that double as teaching surfaces (glossary, façade contracts, scenario tests, diff rituals). Measure both merge metrics and junior-learning signals so you can stop tightening once learning starts to drop faster than defect rates.

  1. Adjustments that usually help (if kept thin)

A) Domain glossary as a contract, not a wiki

  • Keep a short DOMAIN_GLOSSARY.md per major domain.
  • Rules:
    • New public types/endpoints must reference an existing term or add one line.
    • Agent prompts: "prefer glossary terms; do not invent new domain nouns."
  • Guardrails for seniors vs juniors:
    • Juniors own most glossary updates (reviewed by seniors).
    • Seniors intervene only on meaning changes, not typos.
  • Expected effect:
    • Higher agent hit-rate on correct names.
    • Fewer review cycles over naming.
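The glossary rule can even be checked mechanically. A minimal Ruby sketch, assuming the glossary has been parsed into a word list (GLOSSARY, identifier_words, and unknown_terms are illustrative names, not an existing tool):

```ruby
# Hypothetical glossary lint: GLOSSARY would normally be parsed out of
# DOMAIN_GLOSSARY.md; it is hardcoded here to keep the sketch self-contained.
GLOSSARY = %w[charge customer invoice project subscription].freeze

# Split an identifier like "ChargeCustomer" or "charge_customer" into words.
def identifier_words(name)
  name.gsub(/([a-z])([A-Z])/, '\1 \2').downcase.split(/[\s_]+/)
end

# Domain nouns in +name+ that the glossary does not know about.
def unknown_terms(name)
  identifier_words(name).reject { |word| GLOSSARY.include?(word) }
end

unknown_terms("ChargeCustomer") # => []
unknown_terms("RebillShopper")  # => ["rebill", "shopper"]
```

A check like this can run in CI and double as the agent prompt's definition of "glossary terms."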

B) Tight but small façade contracts

  • One façade per use-case cluster (Billing::ChargeCustomer, Projects::Create), with:
    • Small, explicit input/return types.
    • 1–3 example call sites in tests.
  • Craft bar tweak:
    • "New cross-domain behavior must go via a façade; direct model reach-ins are rejected."
  • Apprenticeship protection:
    • Juniors often write the first façade draft.
    • Seniors review contract shape, not implementation details.
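A façade with this shape might look like the following sketch (Billing.charge_customer, ChargeResult, and the inline validation are illustrative; a real implementation would delegate to the payment gateway):

```ruby
# Sketch of a small façade contract: explicit keyword inputs, one obvious
# return shape a junior can read at a glance. Names are assumptions.
module Billing
  ChargeResult = Struct.new(:ok, :charge_id, :error, keyword_init: true)

  def self.charge_customer(customer_id:, amount_cents:, currency: "USD")
    return ChargeResult.new(ok: false, error: "amount must be positive") unless amount_cents.positive?

    # Gateway call would go here; the fake charge id keeps the sketch runnable.
    ChargeResult.new(ok: true, charge_id: "ch_#{currency.downcase}_#{customer_id}")
  end
end

Billing.charge_customer(customer_id: 42, amount_cents: 1_000).charge_id # => "ch_usd_42"
```

The contract review then amounts to two questions: are the inputs explicit, and is the return shape the only thing callers depend on?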

C) Mandatory scenario tests on agent-authored flows

  • Rule: any agent-authored diff that adds a user-visible flow or changes a boundary must add or update at least one scenario test (or story test) that a junior can read aloud and explain.
  • Tests live near the flow/endpoint; agents are prompted to update them.
  • Review tweak:
    • Reviewer can skim implementation but must read scenarios.
    • Juniors walk through scenarios in review or async note.
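A scenario test at this bar reads like the story it covers. A self-contained Minitest sketch, with Projects.create stubbed inline so the example runs on its own (the module body is illustrative, not an existing API):

```ruby
require "minitest/autorun"

# Inline stub standing in for a real Projects façade.
module Projects
  def self.create(name:, owner:)
    return { ok: false, error: "name required" } if name.strip.empty?

    { ok: true, project: { name: name, owner: owner } }
  end
end

class CreateProjectScenarioTest < Minitest::Test
  def test_an_owner_creates_a_named_project
    # Scenario: Ana creates "Launch plan" and gets her project back.
    result = Projects.create(name: "Launch plan", owner: "ana")
    assert result[:ok]
    assert_equal "ana", result[:project][:owner]
  end

  def test_a_blank_name_is_rejected_with_a_plain_error
    # Scenario: a blank name fails with an error a junior can repeat verbatim.
    result = Projects.create(name: "  ", owner: "ana")
    refute result[:ok]
    assert_equal "name required", result[:error]
  end
end
```

If a junior cannot narrate each test method from its name and body alone, the test does not clear the bar.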

D) Lightweight “agent diff” checklist

  • Tiny checklist on PR template for agent-originated diffs:
    • Uses existing façade(s)?
    • Names match glossary?
    • Scenario test added/updated?
    • Any cross-boundary write?
  • Harness can pre-fill answers; reviewer spot-checks.
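The pre-fill step can be a few regexes over the unified diff. A sketch, assuming agent diffs arrive as plain text (CHECKS, its patterns, and prefill_checklist are illustrative and would be tuned to your own façade and test naming conventions):

```ruby
# Hypothetical checklist pre-fill: scan a unified diff and answer the
# mechanical items; the reviewer spot-checks the result.
CHECKS = {
  "Uses existing façade(s)?"     => /^\+.*\b(?:Billing|Projects)::/,   # added line calls a façade
  "Scenario test added/updated?" => %r{^\+\+\+ .*(?:_test|_spec)\.rb}, # a test file is touched
  "Any cross-boundary write?"    => /^\+.*\.update!\(/                 # crude write heuristic
}.freeze

def prefill_checklist(diff)
  CHECKS.to_h { |label, pattern| [label, diff.match?(pattern)] }
end

SAMPLE_DIFF = <<~DIFF
  +++ b/test/billing/charge_test.rb
  +    result = Billing::ChargeCustomer.call(customer)
DIFF

prefill_checklist(SAMPLE_DIFF)
# => {"Uses existing façade(s)?"=>true, "Scenario test added/updated?"=>true,
#     "Any cross-boundary write?"=>false}
```

Crude pattern checks are enough here because a human still reads the answers; the point is to spend reviewer attention on judgment, not box-ticking.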

E) Structured junior review moments instead of more structure

  • For some fraction of agent diffs (e.g., 20–30%):
    • Junior is primary reviewer; senior is backstop.
    • Junior must post a 2–3 bullet summary: intent, boundary touched, scenario covered.
  • Keeps judgment practice in the loop as structure increases.

  2. How to avoid widening the seniority gap / apprenticeship decay

Principles:

  • Put seniors on design of rules; put juniors on routine application.
  • Prefer constraints that:
    • Are easy for juniors to check.
    • Are visible in diffs (glossary lines, façade signatures, scenarios).
    • Teach by doing ("why this façade?" becomes a quick discussion).

Anti-patterns:

  • Big, static rulebooks or heavy templates that seniors quietly ignore.
  • Centralized approvals for every façade or glossary change.
  • Auto-generating opaque tests that juniors cannot interpret.

  3. Measuring when structure is helping vs hurting

Track both throughput/safety and learning. Signals are simple, not perfect.

A) Throughput / merge-worthiness

  • % agent-authored diffs merged with:
    • ≤1 human revision round.
    • No major rewrite.
  • Mean review time per agent diff.
  • Post-merge defect rate for agent diffs vs human-led diffs.

Target pattern:

  • After structure tweaks: higher merge-worthy %, flat or lower review time, flat or lower defect rate.
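One way the rollup could be computed, assuming each merged diff is logged as a small record (the field names and both helpers are illustrative):

```ruby
# Illustrative metrics: a diff is "merge-worthy" if it merged with at most
# one human revision round and no major rewrite. Field names are assumptions.
DIFFS = [
  { author: :agent, revision_rounds: 0, rewrite: false, defect: false },
  { author: :agent, revision_rounds: 1, rewrite: false, defect: false },
  { author: :agent, revision_rounds: 3, rewrite: true,  defect: true  },
  { author: :human, revision_rounds: 1, rewrite: false, defect: false }
].freeze

def merge_worthy_rate(diffs, author:)
  pool = diffs.select { |d| d[:author] == author }
  return 0.0 if pool.empty?
  pool.count { |d| d[:revision_rounds] <= 1 && !d[:rewrite] }.fdiv(pool.size)
end

def defect_rate(diffs, author:)
  pool = diffs.select { |d| d[:author] == author }
  return 0.0 if pool.empty?
  pool.count { |d| d[:defect] }.fdiv(pool.size)
end

merge_worthy_rate(DIFFS, author: :agent).round(2) # => 0.67
defect_rate(DIFFS, author: :agent).round(2)       # => 0.33
```

Comparing the agent and human rates over time, rather than in a single snapshot, is what makes the signal usable.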

B) Learning / apprenticeship

Simple, recurring checks (monthly or per cycle):

  • Junior review share
    • % of reviews where a junior is primary reviewer.
    • Trend over time.
  • Junior explanation depth
    • Sample a few PRs; check if juniors can:
      • Name the façade and its purpose.
      • Explain the scenario test in plain language.
  • Learning PR volume/quality
    • Count of intentionally "learning" PRs (small refactors, façade extractions) per junior per month.
  • Subjective signals
    • Short survey: "I understand why we have these rules"; "I can predict when a diff will get bounced." (Likert scale; watch trends.)
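Junior review share is the easiest of these to automate. A sketch, assuming each cycle logs the primary reviewer's level (REVIEWS and junior_review_share are illustrative):

```ruby
# Illustrative trend check: share of reviews where a junior was the primary
# reviewer, per cycle. A falling trend is the early-warning signal.
REVIEWS = {
  "2024-05" => [:junior, :senior, :junior, :senior],
  "2024-06" => [:junior, :senior, :senior, :senior]
}.freeze

def junior_review_share(reviews_by_cycle)
  reviews_by_cycle.transform_values do |primary_reviewers|
    primary_reviewers.count(:junior).fdiv(primary_reviewers.size).round(2)
  end
end

junior_review_share(REVIEWS) # => {"2024-05"=>0.5, "2024-06"=>0.25}
```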

C) Red flags that structure is now hurting more than helping

  • Juniors:
    • Stop proposing new façades; only tweak internals.
    • Routinely defer glossary edits to seniors.
    • Struggle to explain tests they "added" (often agent-written).
  • Process:
    • Time-to-merge rises with each new rule.
    • Review comments become template-driven (“fill out checklist”) with fewer design/taste notes.
  • Outcomes:
    • Defect rate flat or up while structure and review time increase.

When 2–3 of these show up together for a few cycles, freeze new rules and prune.

  4. Practical loop for tuning the craft bar

Run a repeating 4–6 week loop:

  1. Add at most one new structural rule per lane (e.g., "scenario test required" for UX lane, "façade-only" for boundary lane).
  2. Make the rule easy to revert; tag PRs touched by it.
  3. After 4–6 weeks, compare:
    • Merge-worthy %, review time, defects.
    • Junior review share, explanation quality.
  4. Keep, loosen, or remove the rule based on both sets of metrics.

This keeps the craft bar sharp, lets agents produce more merge-worthy diffs, and gives you early warning when extra structure starts trading away junior learning for only marginal throughput gains.