For opinionated Rails-style monoliths with a CLI substrate and designer-owned harness flows, what concrete metrics and signals (e.g., merge-worthy rate by directory, boundary-violation alerts, brief–diff mismatch counts, junior explain-your-diff quality) best reveal early signs of apprenticeship decay and structural erosion, and how should teams tune mode rules, style guides, or harness permissions in response to those signals?

dhh-agent-first-software-craft

Answer

Concise answer:

  1. Core metrics & signals

Code quality / structure

  • M1: Merge-worthy rate by directory

    • % of agent-first PRs per directory that merge with only small nits.
    • Early erosion: rate falls in “core” domains (billing/auth/core flows) but stays high in leaf areas.
  • M2: Boundary-violation alerts

    • Auto-check: diffs that introduce calls across forbidden dirs (e.g., controllers reaching into cross-domain models; code skipping app/boundaries/*).
    • Track: violations/PR by directory and author level.
    • Early erosion: upward trend, esp. from juniors or harness flows.
  • M3: Callback / god-object growth

    • Count new/edited Rails callbacks, STI/concerns, or files > N lines / > M public methods.
    • Early erosion: spike in fat models/controllers despite façade patterns.
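
A boundary-violation check like M2 can be sketched as a plain diff scan. A minimal Ruby example, where the `FORBIDDEN` path patterns and domain module names are hypothetical stand-ins for your own layering rules:

```ruby
# Hypothetical boundary rules: map a changed-file path pattern to a regex of
# constants that files in that layer must not reference directly.
FORBIDDEN = {
  %r{\Aapp/controllers/billing/} => /\b(?:Auth::|Shipping::)\w+/
}.freeze

# diff is { "path" => ["added line", ...] }; returns [[path, offending_line], ...]
def boundary_violations(diff)
  diff.flat_map do |path, added_lines|
    FORBIDDEN.flat_map do |path_pattern, const_pattern|
      next [] unless path.match?(path_pattern)
      added_lines.grep(const_pattern).map { |line| [path, line.strip] }
    end
  end
end
```

Counting the result per PR, per directory, and per author level gives the M2 trend lines directly.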

Diff/brief alignment

  • M4: Brief–diff mismatch rate

    • For each PR, compare the human/agent brief with the labeled changes and mark the PR on-brief or off-brief.
    • Auto heuristic: files touched outside declared scope, new public API without mention in brief.
    • Early erosion: rising off-brief rate in core flows.
  • M5: Harness-suggested vs reviewer-actual changes

    • Compare harness pre-review summary/suggestions with reviewer comments.
    • Early erosion: reviewers often reject harness defaults in the same domains.
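
The M4 auto-heuristic ("files touched outside declared scope") reduces to a path-prefix check. A sketch, assuming the brief's scope field is a list of directory prefixes (the field name and tolerance are illustrative, not a fixed convention):

```ruby
# Files touched by the PR that fall outside the brief's declared scope.
def off_brief_files(brief_scope, touched_files)
  touched_files.reject { |f| brief_scope.any? { |dir| f.start_with?(dir) } }
end

# A PR trips the off-brief flag when it touches more than `tolerance`
# files outside its declared scope.
def off_brief?(brief_scope, touched_files, tolerance: 0)
  off_brief_files(brief_scope, touched_files).size > tolerance
end
```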

Apprenticeship / judgment

  • M6: Junior explain-your-diff quality

    • Simple rubric 1–3: (1) restates diff; (2) explains intent; (3) names alternatives/tradeoffs.
    • Track average by directory and experience level.
    • Apprenticeship decay: flat or falling scores while throughput rises.
  • M7: Who framed vs who implemented

    • Tag PRs: framing: senior|junior|designer, implementation: agent|human+agent.
    • Decay signal: juniors mostly implement but rarely frame core work.
  • M8: Review friction index

    • Median review comments/PR + review minutes/PR per directory.
    • Erosion signal: review time climbs while merge-worthy rate drops.
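
Tracking the M6 rubric "by directory and experience level" is a small aggregation. A sketch, with a hypothetical `Review` record (your PR tooling would supply the real data):

```ruby
# One scored explain-your-diff entry: which directory the PR touched
# and its rubric score (1 = restates diff, 2 = intent, 3 = tradeoffs).
Review = Struct.new(:directory, :score, keyword_init: true)

# Average rubric score per directory; flat or falling core-dir averages
# while throughput rises are the apprenticeship-decay signal.
def avg_score_by_dir(reviews)
  reviews.group_by(&:directory).transform_values do |rs|
    (rs.sum(&:score).to_f / rs.size).round(2)
  end
end
```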

Harness / CLI flows

  • M9: Harness flow drift

    • Count flows calling across boundaries, touching multiple high-risk domains, or skipping standard verifiers.
    • Early erosion: more "kitchen sink" flows owned by no one, esp. from designer-run harness.
  • M10: Verification gaps

    • Track PRs where changed files lack nearby tests/specs or contract checks.
    • Erosion: more no-test changes in app/boundaries, app/services, migrations.
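
The M10 verification-gap check can lean on the conventional `app/foo.rb` → `spec/foo_spec.rb` mapping. A sketch over file lists (a real version would also count contract checks and migrations):

```ruby
# Conventional spec path for an app file: app/x/y.rb -> spec/x/y_spec.rb.
def spec_path_for(app_path)
  app_path.sub(%r{\Aapp/}, "spec/").sub(/\.rb\z/, "_spec.rb")
end

# Changed app files with no corresponding spec anywhere in the repo.
def untested_changes(changed_files, repo_files)
  changed_files
    .select { |f| f.start_with?("app/") }
    .reject { |f| repo_files.include?(spec_path_for(f)) }
end
```
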
  2. Early-warning thresholds (example, tune locally)
  • Merge-worthy rate in any core dir < ~60–70% for 3–4 weeks.
  • Boundary violations/PR in core dirs doubles vs prior month.
  • Off-brief rate in app/boundaries/* or app/flows/* > 25%.
  • Junior explain-your-diff avg in core dirs stuck at 1–1.5 for 4+ weeks.
  • The number of harness flows that cross 2+ domains without tests/contract files grows by >20%.
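
These thresholds can live as data next to the metrics pipeline so tuning them locally is a one-line change. A sketch; the metric names and cutoffs below mirror the examples above and are starting points, not recommendations:

```ruby
# Each threshold is a predicate over one weekly per-directory metric;
# a tripped predicate is an early-warning signal for that directory.
THRESHOLDS = {
  merge_worthy_rate: ->(v) { v < 0.65 },  # core dir below the ~60-70% band
  off_brief_rate:    ->(v) { v > 0.25 },
  explain_diff_avg:  ->(v) { v < 1.5 }
}.freeze

# Returns the names of all tripped thresholds for one directory's metrics.
def warnings_for(metrics)
  THRESHOLDS.select { |name, trip| metrics.key?(name) && trip.call(metrics[name]) }.keys
end
```
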
  3. How to tune rules in response

A) If merge-worthy drops or boundary violations rise in core dirs

  • Mode rules
    • Force mode:learn for core dirs for juniors and new harness flows.
    • Require senior framing for any cross-boundary work; agents only inside façades.
  • Style guide
    • Add/clarify "don’t reach across" examples with allowed façades per domain.
    • Hard ban new callbacks/implicit coupling; require explicit services/flows.
  • Harness permissions
    • Mark app/boundaries/* and app/models/* as "design/senior-led"; agents read + suggest, not auto-edit, unless framed by a senior.
    • Limit flows that span multiple boundary dirs; require an explicit Flow object as the only entry.

B) If brief–diff mismatch and off-brief PRs increase

  • Mode rules
    • In mode:ship, require a short structured brief field: scope, exclusions, touched domains.
    • Harness must echo the brief into its planning prompt and show a "plan vs brief" diff.
  • Style guide
    • Add examples of good briefs; keep to a 3–5 bullet template.
  • Harness permissions
    • If off-brief flag trips, auto-downgrade to mode:learn and require human confirmation before new run.

C) If junior explain-your-diff quality is low or flat

  • Mode rules
    • Minimum X learning PRs/quarter per junior in core dirs, with explain-your-diff required.
    • For those PRs, agent must propose but cannot finalize diffs; juniors hand-edit at least key functions.
  • Style guide
    • Add a 3-question reflection template to PR description for mode:learn (intent, alternatives, risk).
  • Harness permissions
    • In mode:learn, harness auto-generates a draft explanation that juniors must edit; do not allow direct copy-through.
    • Restrict fully automated mode:ship merges to experienced authors in specific dirs.

D) If harness flow drift and verification gaps grow

  • Mode rules
    • Cap WIP: max active experimental harness flows per team; stale flows auto-expire.
    • Require owner: and risk: fields on each new tool/flow.
  • Style guide
    • Define 2–3 “blessed” Rails entrypoints (services/flows) per domain that harness flows must call, not bypass.
  • Harness permissions
    • Tier tools: safe (read-only, dev data), guarded (needs human confirm), privileged (senior-only).
    • Disallow new privileged tools without a matching test/contract file.
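
The safe/guarded/privileged tiering above amounts to a small permission gate in the harness. A minimal sketch, with hypothetical actor and tier names:

```ruby
# Gate a harness tool call by tier: safe tools run freely, guarded tools
# need a human confirmation, privileged tools need a confirming senior.
def allowed?(tool_tier, actor:, confirmed: false)
  case tool_tier
  when :safe       then true
  when :guarded    then confirmed
  when :privileged then actor == :senior && confirmed
  else false
  end
end
```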

E) If review friction rises while quality drops

  • Mode rules
    • For high-friction dirs, move more work to mode:learn for a sprint to surface patterns.
    • Then add a small set of directory-specific harness checks (naming, layout, test presence) and return most work to mode:ship.
  • Style guide
    • Update guide based on recurring comments; delete rules reviewers don’t enforce.
  • Harness permissions
    • Use harness to pre-run the most common reviewer checks and comment templates; reduce custom nits in review.
  4. Key assumptions
  • You can tag PRs by directory, risk, mode, and author seniority with modest effort.
  • Team already has Rails-style façades/boundaries and some tests in core domains.
  • Designers owning harness flows are willing to accept light constraints and co-review.
  • Seniors will use these signals to adjust modes/permissions, not just push harder for throughput.
  5. Competing hypothesis
  • Most value comes from stronger core tests and a few coarse risk lanes; fine-grained metrics (brief–diff mismatch, explain-your-diff scores, flow drift) add overhead and noise without clearly improving apprenticeship or structure.
  6. Main failure case / boundary
  • In very small or high-pressure teams, nobody maintains metrics or mode tags; rules are ignored, harness permissions stay wide open, and signals lag behind reality. You get dashboards but no behavior change.
  7. Verification targets
  • Correlate merge-worthy rate, boundary-violation counts, and brief–diff mismatch with post-merge bugs in 1–2 key domains.
  • Run a 4–6 week trial where core dirs use stricter modes/permissions triggered by these signals; compare defect rates and junior explanation quality vs a control domain.
  • Sample PRs quarterly to see if directories with tighter harness tiers and explain-your-diff requirements show slower structural erosion (callbacks, god-objects) than ungoverned dirs.
  8. Open questions
  • What is the smallest metric set (2–4 signals) that gives early warning without creating dashboard fatigue?
  • How often should teams reclassify directories (core vs leaf) and retune thresholds as the monolith evolves?
  • Can agents reliably score explain-your-diff quality or brief–diff mismatch without drifting away from human taste?