For opinionated Rails-style monoliths with a CLI substrate and designer-owned harness flows, what concrete metrics and signals (e.g., merge-worthy rate by directory, boundary-violation alerts, brief–diff mismatch counts, junior explain-your-diff quality) best reveal early signs of apprenticeship decay and structural erosion, and how should teams tune mode rules, style guides, or harness permissions in response to those signals?
dhh-agent-first-software-craft
Answer
Concise answer:
- Core metrics & signals
Code quality / structure
M1: Merge-worthy rate by directory
- % of agent-first PRs per directory that merge with only small nits.
- Early erosion: the rate falls in "core" domains (billing/auth/core flows) but stays high in leaf areas.
M2: Boundary-violation alerts
- Auto-check: diffs that introduce new calls across forbidden dirs (e.g., controllers → cross-domain models; code skipping `app/boundaries/*`).
- Track: violations/PR by directory and author level.
- Early erosion: upward trend, especially from juniors or harness flows.
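The M2 auto-check can be sketched as a small diff scanner. The directory prefixes, domain names, and regexes below are illustrative assumptions; real rules would come from your façade layout.

```ruby
# Flag added diff lines that reference another domain's constants directly
# from a controller directory. FORBIDDEN's prefixes and patterns are
# hypothetical examples, not a real project's boundary map.
FORBIDDEN = {
  # directory prefix => constants that code there must not call directly
  "app/controllers/billing" => /\b(Auth|Shipping)::\w+/,
  "app/controllers/auth"    => /\b(Billing|Shipping)::\w+/
}.freeze

# diff: { path => [added lines] }, e.g. parsed from `git diff`
def boundary_violations(diff)
  diff.flat_map do |path, added_lines|
    _, pattern = FORBIDDEN.find { |prefix, _| path.start_with?(prefix) }
    next [] unless pattern
    added_lines.grep(pattern).map { |line| { path: path, line: line.strip } }
  end
end

diff = {
  "app/controllers/billing/invoices_controller.rb" =>
    ["    user = Auth::User.find(params[:id])"],
  "app/models/billing/invoice.rb" => ["  belongs_to :customer"]
}
boundary_violations(diff)
# => [{ path: "app/controllers/billing/invoices_controller.rb",
#       line: "user = Auth::User.find(params[:id])" }]
```

Dividing the count of violations by PRs per directory gives the violations/PR trend the metric asks for.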
M3: Callback / god-object growth
- Count new/edited Rails callbacks, STI/concerns, or files > N lines / > M public methods.
- Early erosion: spike in fat models/controllers despite façade patterns.
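A minimal sketch of the M3 counters, assuming a 300-line threshold and a short list of common Rails callback prefixes; both are assumptions to tune locally.

```ruby
# Tally Rails-style callback declarations and flag oversized files.
# The callback list and the 300-line cutoff are illustrative defaults.
CALLBACK_RE = /^\s*(before|after|around)_(save|create|update|destroy|validation|commit)\b/

# files: { path => full source as a string }
def god_object_stats(files, max_lines: 300)
  files.map do |path, src|
    lines = src.lines
    { path: path,
      callbacks: lines.grep(CALLBACK_RE).size,
      oversized: lines.size > max_lines }
  end
end

src = <<~RUBY
  class Invoice < ApplicationRecord
    before_save :normalize_totals
    after_commit :sync_ledger
  end
RUBY
god_object_stats({ "app/models/invoice.rb" => src })
# => [{ path: "app/models/invoice.rb", callbacks: 2, oversized: false }]
```

Run per PR on changed files and the week-over-week delta becomes the "spike" signal.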
Diff/brief alignment
M4: Brief–diff mismatch rate
- For each PR, compare the human/agent brief against the labeled changes: on-brief | off-brief.
- Auto heuristic: files touched outside the declared scope, or new public API not mentioned in the brief.
- Early erosion: rising off-brief rate in core flows.
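The "files touched outside declared scope" heuristic can be sketched as below. Declaring scope as plain directory prefixes is an assumption; globs would also work.

```ruby
# A PR is off-brief when it touches files outside the path prefixes
# declared in its brief. The brief format (a list of prefixes) is assumed.
def off_brief_files(declared_prefixes, changed_paths)
  changed_paths.reject do |path|
    declared_prefixes.any? { |prefix| path.start_with?(prefix) }
  end
end

brief   = ["app/services/billing/", "spec/services/billing/"]
changed = ["app/services/billing/invoice_builder.rb",
           "spec/services/billing/invoice_builder_spec.rb",
           "app/models/user.rb"]          # outside the declared scope
off_brief_files(brief, changed)
# => ["app/models/user.rb"]
```

The off-brief rate is then simply the share of PRs where this list is non-empty, grouped by directory.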
M5: Harness-suggested vs reviewer-actual changes
- Compare the harness pre-review summary/suggestions with reviewer comments.
- Early erosion: reviewers often reject harness defaults in the same domains.
Apprenticeship / judgment
M6: Junior explain-your-diff quality
- Simple rubric 1–3: (1) restates the diff; (2) explains intent; (3) names alternatives/tradeoffs.
- Track the average by directory and experience level.
- Apprenticeship decay: flat or falling scores while throughput rises.
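Tracking the average by directory needs only a small aggregation; the record shape (`{ dir:, score: }`) here is an assumption.

```ruby
# Average explain-your-diff rubric scores (1–3) per directory.
def avg_rubric_by_dir(records)
  records.group_by { |r| r[:dir] }
         .transform_values { |rs| (rs.sum { |r| r[:score] } / rs.size.to_f).round(2) }
end

scores = [
  { dir: "app/boundaries", score: 1 },
  { dir: "app/boundaries", score: 2 },
  { dir: "app/views",      score: 3 }
]
avg_rubric_by_dir(scores)
# => { "app/boundaries" => 1.5, "app/views" => 3.0 }
```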
M7: Who framed vs who implemented
- Tag PRs with `framing: senior|junior|designer` and `implementation: agent|human+agent`.
- Decay signal: juniors mostly implement but rarely frame core work.
M8: Review friction index
- Median review comments/PR plus review minutes/PR, per directory.
- Erosion signal: review time climbs while the merge-worthy rate drops.
Harness / CLI flows
M9: Harness flow drift
- Count flows calling across boundaries, touching multiple high-risk domains, or skipping standard verifiers.
- Early erosion: more "kitchen sink" flows owned by no one, especially from designer-run harnesses.
M10: Verification gaps
- Track PRs where changed files lack nearby tests/specs or contract checks.
- Erosion: more no-test changes in `app/boundaries`, `app/services`, and migrations.
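The M10 check can be sketched as a diff filter, assuming the standard RSpec layout (`app/foo/bar.rb` → `spec/foo/bar_spec.rb`); other conventions would need their own mapping.

```ruby
# A changed file under app/ is "unverified" when the same diff includes
# no matching spec file.
def unverified_changes(changed_paths)
  app_files = changed_paths.grep(%r{\Aapp/.*\.rb\z})
  app_files.reject do |path|
    spec = path.sub(%r{\Aapp/}, "spec/").sub(/\.rb\z/, "_spec.rb")
    changed_paths.include?(spec)
  end
end

changed = [
  "app/services/billing/invoice_builder.rb",
  "spec/services/billing/invoice_builder_spec.rb",
  "app/boundaries/payments.rb"            # no spec touched alongside it
]
unverified_changes(changed)
# => ["app/boundaries/payments.rb"]
```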
- Early-warning thresholds (example, tune locally)
- Merge-worthy rate in any core dir < ~60–70% for 3–4 weeks.
- Boundary violations/PR in core dirs doubles vs the prior month.
- Off-brief rate in `app/boundaries/*` or `app/flows/*` > 25%.
- Junior explain-your-diff average in core dirs stuck at 1–1.5 for 4+ weeks.
- Harness flows that cross 2+ domains without tests/contract files grow by >20%.
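Once these thresholds are chosen, the alerting layer is trivial. The metric names and input shape below are assumptions; the values mirror the example thresholds above.

```ruby
# Map each metric to its trip condition, then report which fired.
THRESHOLDS = {
  merge_worthy_rate: ->(v) { v < 0.65 },   # midpoint of the 60–70% band
  off_brief_rate:    ->(v) { v > 0.25 },
  explain_diff_avg:  ->(v) { v <= 1.5 }
}.freeze

def triggered_alerts(metrics)  # metrics: { metric_name => value }
  metrics.select { |name, value| THRESHOLDS[name]&.call(value) }.keys
end

triggered_alerts(merge_worthy_rate: 0.55, off_brief_rate: 0.10, explain_diff_avg: 1.2)
# => [:merge_worthy_rate, :explain_diff_avg]
```

Unknown metric names are silently skipped, so the set can grow without breaking the check.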
- How to tune rules in response
A) If merge-worthy drops or boundary violations rise in core dirs
- Mode rules
- Force `mode:learn` in core dirs for juniors and new harness flows.
- Require senior framing for any cross-boundary work; agents work only inside façades.
- Style guide
- Add/clarify "don’t reach across" examples with allowed façades per domain.
- Hard ban new callbacks/implicit coupling; require explicit services/flows.
- Harness permissions
- Mark `app/boundaries/*` and `app/models/*` as "design/senior-led": agents read and suggest but do not auto-edit unless framed by a senior.
- Limit flows that span multiple boundary dirs; require an explicit `Flow` object as the only entry point.
B) If brief–diff mismatch and off-brief PRs increase
- Mode rules
- In `mode:ship`, require a short structured brief field: scope, exclusions, touched domains.
- The harness must echo the brief into its planning prompt and show a "plan vs brief" diff.
- Style guide
- Add examples of good briefs; keep to a 3–5 bullet template.
- Harness permissions
- If the off-brief flag trips, auto-downgrade to `mode:learn` and require human confirmation before the next run.
C) If junior explain-your-diff quality is low or flat
- Mode rules
- Minimum X learning PRs/quarter per junior in core dirs, with explain-your-diff required.
- For those PRs, agent must propose but cannot finalize diffs; juniors hand-edit at least key functions.
- Style guide
- Add a 3-question reflection template (intent, alternatives, risk) to the PR description for `mode:learn`.
- Harness permissions
- In `mode:learn`, the harness auto-generates a draft explanation that juniors must edit; do not allow direct copy-through.
- Restrict fully automated `mode:ship` merges to experienced authors in specific dirs.
D) If harness flow drift and verification gaps grow
- Mode rules
- Cap WIP: max active experimental harness flows per team; stale flows auto-expire.
- Require `owner:` and `risk:` fields on each new tool/flow.
- Style guide
- Define 2–3 “blessed” Rails entrypoints (services/flows) per domain that harness flows must call, not bypass.
- Harness permissions
- Tier tools: `safe` (read-only, dev data), `guarded` (needs human confirmation), `privileged` (senior-only).
- Disallow new privileged tools without a matching test/contract file.
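The tier check itself is small. The tier names come from the bullets above; the tool and author record shapes are assumptions.

```ruby
# Per-tier permission rules; unknown tiers default to denied.
TIER_RULES = {
  "safe"       => ->(_author) { true },                       # anyone
  "guarded"    => ->(author)  { author[:confirmed] },         # human confirmed
  "privileged" => ->(author)  { author[:seniority] == :senior }
}.freeze

def tool_allowed?(tool, author)
  rule = TIER_RULES.fetch(tool[:tier]) { ->(_) { false } }
  rule.call(author)
end

junior = { seniority: :junior, confirmed: false }
senior = { seniority: :senior, confirmed: true }

tool_allowed?({ tier: "privileged" }, junior)  # => false
tool_allowed?({ tier: "privileged" }, senior)  # => true
tool_allowed?({ tier: "safe" }, junior)        # => true
```

Defaulting unknown tiers to denied keeps new, untagged tools out of designer-run flows until someone classifies them.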
E) If review friction rises while quality drops
- Mode rules
- For high-friction dirs, move more work to `mode:learn` for a sprint to surface patterns.
- Then add a small set of directory-specific harness checks (naming, layout, test presence) and return most work to `mode:ship`.
- Style guide
- Update the guide based on recurring review comments; delete rules reviewers don't enforce.
- Harness permissions
- Use harness to pre-run the most common reviewer checks and comment templates; reduce custom nits in review.
- Key assumptions
- You can tag PRs by directory, risk, mode, and author seniority with modest effort.
- Team already has Rails-style façades/boundaries and some tests in core domains.
- Designers owning harness flows are willing to accept light constraints and co-review.
- Seniors will use these signals to adjust modes/permissions, not just push harder for throughput.
- Competing hypothesis
- Most value comes from stronger core tests and a few coarse risk lanes; fine-grained metrics (brief–diff mismatch, explain-your-diff scores, flow drift) add overhead and noise without clearly improving apprenticeship or structure.
- Main failure case / boundary
- In very small or high-pressure teams, nobody maintains metrics or mode tags; rules are ignored, harness permissions stay wide open, and signals lag behind reality. You get dashboards but no behavior change.
- Verification targets
- Correlate merge-worthy rate, boundary-violation counts, and brief–diff mismatch with post-merge bugs in 1–2 key domains.
- Run a 4–6 week trial where core dirs use stricter modes/permissions triggered by these signals; compare defect rates and junior explanation quality vs a control domain.
- Sample PRs quarterly to see if directories with tighter harness tiers and explain-your-diff requirements show slower structural erosion (callbacks, god-objects) than ungoverned dirs.
- Open questions
- What is the smallest metric set (2–4 signals) that gives early warning without creating dashboard fatigue?
- How often should teams reclassify directories (core vs leaf) and retune thresholds as the monolith evolves?
- Can agents reliably score explain-your-diff quality or brief–diff mismatch without drifting away from human taste?