For opinionated Rails-style monoliths with a CLI substrate and designer-owned harness flows, what concrete metrics and signals (e.g., merge-worthy rate by directory, boundary-violation alerts, brief–diff mismatch counts, junior explain-your-diff quality) best reveal early signs of apprenticeship decay and structural erosion, and how should teams tune mode rules, style guides, or harness permissions in response to those signals?

dhh-agent-first-software-craft

Answer

Concise answer:

  1. Core metrics & signals

Code quality / structure

  • M1: Merge-worthy rate by directory

    • % of agent-first PRs per directory that merge with only small nits.
    • Early erosion: rate falls in “core” domains (billing/auth/core flows) but stays high in leaf areas.
  • M2: Boundary-violation alerts

    • Auto-check: diffs that introduce calls across forbidden dirs (e.g., controllers reaching into cross-domain models; code skipping app/boundaries/*).
    • Track: violations/PR by directory and author level.
    • Early erosion: upward trend, esp. from juniors or harness flows.
  • M3: Callback / god-object growth

    • Count new/edited Rails callbacks, STI/concerns, or files > N lines / > M public methods.
    • Early erosion: spike in fat models/controllers despite façade patterns.
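
A boundary-violation check like M2 can be sketched as a plain diff scan. A minimal Ruby example, where the `FORBIDDEN` path patterns and domain module names are hypothetical stand-ins for your own layering rules:

```ruby
# Hypothetical boundary rules: map a changed-file path pattern to a regex of
# constants that files in that layer must not reference directly.
FORBIDDEN = {
  %r{\Aapp/controllers/billing/} => /\b(?:Auth::|Shipping::)\w+/
}.freeze

# diff is { "path" => ["added line", ...] }; returns [[path, offending_line], ...]
def boundary_violations(diff)
  diff.flat_map do |path, added_lines|
    FORBIDDEN.flat_map do |path_pattern, const_pattern|
      next [] unless path.match?(path_pattern)
      added_lines.grep(const_pattern).map { |line| [path, line.strip] }
    end
  end
end
```

Counting the result per PR, per directory, and per author level gives the M2 trend lines directly.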

Diff/brief alignment

  • M4: Brief–diff mismatch rate

    • For each PR, compare the human/agent brief with the labeled changes and mark the PR on-brief or off-brief.
    • Auto heuristic: files touched outside declared scope, new public API without mention in brief.
    • Early erosion: rising off-brief rate in core flows.
  • M5: Harness-suggested vs reviewer-actual changes

    • Compare harness pre-review summary/suggestions with reviewer comments.
    • Early erosion: reviewers often reject harness defaults in the same domains.
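
The M4 auto-heuristic ("files touched outside declared scope") reduces to a path-prefix check. A sketch, assuming the brief's scope field is a list of directory prefixes (the field name and tolerance are illustrative, not a fixed convention):

```ruby
# Files touched by the PR that fall outside the brief's declared scope.
def off_brief_files(brief_scope, touched_files)
  touched_files.reject { |f| brief_scope.any? { |dir| f.start_with?(dir) } }
end

# A PR trips the off-brief flag when it touches more than `tolerance`
# files outside its declared scope.
def off_brief?(brief_scope, touched_files, tolerance: 0)
  off_brief_files(brief_scope, touched_files).size > tolerance
end
```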

Apprenticeship / judgment

  • M6: Junior explain-your-diff quality

    • Simple rubric 1–3: (1) restates diff; (2) explains intent; (3) names alternatives/tradeoffs.
    • Track average by directory and experience level.
    • Apprenticeship decay: flat or falling scores while throughput rises.
  • M7: Who framed vs who implemented

    • Tag PRs: framing: senior|junior|designer, implementation: agent|human+agent.
    • Decay signal: juniors mostly implement but rarely frame core work.
  • M8: Review friction index

    • Median review comments/PR + review minutes/PR per directory.
    • Erosion signal: review time climbs while merge-worthy rate drops.
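
Tracking the M6 rubric "by directory and experience level" is a small aggregation. A sketch, with a hypothetical `Review` record (your PR tooling would supply the real data):

```ruby
# One scored explain-your-diff entry: which directory the PR touched
# and its rubric score (1 = restates diff, 2 = intent, 3 = tradeoffs).
Review = Struct.new(:directory, :score, keyword_init: true)

# Average rubric score per directory; flat or falling core-dir averages
# while throughput rises are the apprenticeship-decay signal.
def avg_score_by_dir(reviews)
  reviews.group_by(&:directory).transform_values do |rs|
    (rs.sum(&:score).to_f / rs.size).round(2)
  end
end
```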

Harness / CLI flows

  • M9: Harness flow drift

    • Count flows calling across boundaries, touching multiple high-risk domains, or skipping standard verifiers.
    • Early erosion: more "kitchen sink" flows owned by no one, esp. from designer-run harness.
  • M10: Verification gaps

    • Track PRs where changed files lack nearby tests/specs or contract checks.
    • Erosion: more no-test changes in app/boundaries, app/services, migrations.
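
The M10 verification-gap check can lean on the conventional `app/foo.rb` → `spec/foo_spec.rb` mapping. A sketch over file lists (a real version would also count contract checks and migrations):

```ruby
# Conventional spec path for an app file: app/x/y.rb -> spec/x/y_spec.rb.
def spec_path_for(app_path)
  app_path.sub(%r{\Aapp/}, "spec/").sub(/\.rb\z/, "_spec.rb")
end

# Changed app files with no corresponding spec anywhere in the repo.
def untested_changes(changed_files, repo_files)
  changed_files
    .select { |f| f.start_with?("app/") }
    .reject { |f| repo_files.include?(spec_path_for(f)) }
end
```
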
  2. Early-warning thresholds (example, tune locally)
  • Merge-worthy rate in any core dir < ~60–70% for 3–4 weeks.
  • Boundary violations/PR in core dirs doubles vs prior month.
  • Off-brief rate in app/boundaries/* or app/flows/* > 25%.
  • Junior explain-your-diff avg in core dirs stuck at 1–1.5 for 4+ weeks.
  • The number of harness flows that cross 2+ domains without tests/contract files grows by >20%.
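
These thresholds can live as data next to the metrics pipeline so tuning them locally is a one-line change. A sketch; the metric names and cutoffs below mirror the examples above and are starting points, not recommendations:

```ruby
# Each threshold is a predicate over one weekly per-directory metric;
# a tripped predicate is an early-warning signal for that directory.
THRESHOLDS = {
  merge_worthy_rate: ->(v) { v < 0.65 },  # core dir below the ~60-70% band
  off_brief_rate:    ->(v) { v > 0.25 },
  explain_diff_avg:  ->(v) { v < 1.5 }
}.freeze

# Returns the names of all tripped thresholds for one directory's metrics.
def warnings_for(metrics)
  THRESHOLDS.select { |name, trip| metrics.key?(name) && trip.call(metrics[name]) }.keys
end
```
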
  3. How to tune rules in response

A) If merge-worthy drops or boundary violations rise in core dirs

  • Mode rules
    • Force mode:learn for core dirs for juniors and new harness flows.
    • Require senior framing for any cross-boundary work; agents only inside façades.
  • Style guide
    • Add/clarify "don’t reach across" examples with allowed façades per domain.
    • Hard ban new callbacks/implicit coupling; require explicit services/flows.
  • Harness permissions
    • Mark app/boundaries/* and app/models/* as "design/senior-led"; agents read + suggest, not auto-edit, unless framed by a senior.
    • Limit flows that span multiple boundary dirs; require an explicit Flow object as the only entry.

B) If brief–diff mismatch and off-brief PRs increase

  • Mode rules
    • In mode:ship, require a short structured brief field: scope, exclusions, touched domains.
    • Harness must echo the brief into its planning prompt and show a "plan vs brief" diff.
  • Style guide
    • Add examples of good briefs; keep to a 3–5 bullet template.
  • Harness permissions
    • If off-brief flag trips, auto-downgrade to mode:learn and require human confirmation before new run.

C) If junior explain-your-diff quality is low or flat

  • Mode rules
    • Minimum X learning PRs/quarter per junior in core dirs, with explain-your-diff required.
    • For those PRs, agent must propose but cannot finalize diffs; juniors hand-edit at least key functions.
  • Style guide
    • Add a 3-question reflection template to PR description for mode:learn (intent, alternatives, risk).
  • Harness permissions
    • In mode:learn, harness auto-generates a draft explanation that juniors must edit; do not allow direct copy-through.
    • Restrict fully automated mode:ship merges to experienced authors in specific dirs.

D) If harness flow drift and verification gaps grow

  • Mode rules
    • Cap WIP: max active experimental harness flows per team; stale flows auto-expire.
    • Require owner: and risk: fields on each new tool/flow.
  • Style guide
    • Define 2–3 “blessed” Rails entrypoints (services/flows) per domain that harness flows must call, not bypass.
  • Harness permissions
    • Tier tools: safe (read-only, dev data), guarded (needs human confirm), privileged (senior-only).
    • Disallow new privileged tools without a matching test/contract file.
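
The safe/guarded/privileged tiering above amounts to a small permission gate in the harness. A minimal sketch, with hypothetical actor and tier names:

```ruby
# Gate a harness tool call by tier: safe tools run freely, guarded tools
# need a human confirmation, privileged tools need a confirming senior.
def allowed?(tool_tier, actor:, confirmed: false)
  case tool_tier
  when :safe       then true
  when :guarded    then confirmed
  when :privileged then actor == :senior && confirmed
  else false
  end
end
```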

E) If review friction rises while quality drops

  • Mode rules
    • For high-friction dirs, move more work to mode:learn for a sprint to surface patterns.
    • Then add a small set of directory-specific harness checks (naming, layout, test presence) and return most work to mode:ship.
  • Style guide
    • Update guide based on recurring comments; delete rules reviewers don’t enforce.
  • Harness permissions
    • Use harness to pre-run the most common reviewer checks and comment templates; reduce custom nits in review.
  4. Key assumptions
  • You can tag PRs by directory, risk, mode, and author seniority with modest effort.
  • Team already has Rails-style façades/boundaries and some tests in core domains.
  • Designers owning harness flows are willing to accept light constraints and co-review.
  • Seniors will use these signals to adjust modes/permissions, not just push harder for throughput.
  5. Competing hypothesis
  • Most value comes from stronger core tests and a few coarse risk lanes; fine-grained metrics (brief–diff mismatch, explain-your-diff scores, flow drift) add overhead and noise without clearly improving apprenticeship or structure.
  6. Main failure case / boundary
  • In very small or high-pressure teams, nobody maintains metrics or mode tags; rules are ignored, harness permissions stay wide open, and signals lag behind reality. You get dashboards but no behavior change.
  7. Verification targets
  • Correlate merge-worthy rate, boundary-violation counts, and brief–diff mismatch with post-merge bugs in 1–2 key domains.
  • Run a 4–6 week trial where core dirs use stricter modes/permissions triggered by these signals; compare defect rates and junior explanation quality vs a control domain.
  • Sample PRs quarterly to see if directories with tighter harness tiers and explain-your-diff requirements show slower structural erosion (callbacks, god-objects) than ungoverned dirs.
  8. Open questions
  • What is the smallest metric set (2–4 signals) that gives early warning without creating dashboard fatigue?
  • How often should teams reclassify directories (core vs leaf) and retune thresholds as the monolith evolves?
  • Can agents reliably score explain-your-diff quality or brief–diff mismatch without drifting away from human taste?