In small, implementation-abundant monolith teams that already use boundary tags, drift metrics, and triad (senior + junior + agent) rituals, what concrete failure modes still show up in production—e.g., incident types, rollback patterns, or “we never should’ve built this” regrets—that indicate the craft bar is mis-specified rather than merely under-enforced, and how could those signals drive an explicit revision of the craft bar itself?
dhh-agent-first-software-craft | Updated at
Answer
Core idea: even with tags, metrics, and triads, some recurring failures point to the craft bar being aimed at the wrong things, not just applied inconsistently. Use those failures to rewrite the bar.
- Failure modes that suggest a mis-specified craft bar
a) Correct-but-regretted features
- Pattern: “We never should’ve built this” about:
- flows that few users adopt but are expensive to maintain,
- integrations that add fragile coupling across boundaries.
- Signal of mis-spec: craft bar optimizes local code quality and boundary cleanliness, but says little about product fit, lifecycle cost, or removal criteria.
b) Repeated “safe” rollbacks
- Pattern:
- frequent rollbacks where tests passed and no data was lost, but UX, performance, or operational pain forced a revert (e.g., surprising behavior, noisy alerts, timeouts under load).
- Signal of mis-spec: bar is correctness-centric; non-functional and operability standards are weak or implicit.
c) Latent complexity incidents
- Pattern:
- incidents where the root cause is “this path is technically inside one boundary, but it’s too clever or packs too many concepts into one place,”
- on-call can’t safely patch because flows are opaque despite boundary fidelity.
- Signal of mis-spec: bar checks “in the right directory, with tests” but not cognitive load or simplicity.
d) Boundary-pure but user-confusing changes
- Pattern:
- triads and reviews green-light changes that are boundary-respecting and well-tested, yet support tickets spike from confusing behavior or inconsistent UX contracts.
- Signal of mis-spec: bar encodes architecture and style, but not UX-level invariants or cross-surface consistency.
e) Agent-friendly but human-hostile code
- Pattern:
- code passes all automated checks and drift metrics, but:
- humans struggle to debug without asking the agent,
- naming is locally consistent with prompts, not with domain language.
- Signal of mis-spec: bar assumes “merge-worthy code” for agents ≈ merge-worthy for humans; it underweights human legibility and narrative.
f) Slow-burn performance / cost regressions
- Pattern:
- quarterly incidents about timeouts, N+1 query patterns, or expensive queries that accreted through many small, clean PRs.
- Signal of mis-spec: bar lacks simple performance/cost guardrails for common flows; reviewers rely on intuition only.
- How to turn these signals into craft bar revisions
a) Add product/value gates
- New bar elements:
- For net-new features: require a brief “value + kill criteria” note in PR or design stub.
- For cross-boundary integrations: require an owner and a deprecation story.
- Review prompt:
- “If we had to remove this in 6 months, would it be easy?”
- Effect: shifts bar from “can we ship this?” to “is it worth living with?”
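Such a value gate could also be enforced mechanically. A minimal sketch, assuming PR bodies are available as plain text; the section headings are hypothetical, not a standard template:

```python
# Hypothetical gate: net-new features must ship with a value statement
# and explicit kill criteria in the PR body. Headings are illustrative.
REQUIRED_SECTIONS = ("## Value", "## Kill criteria")

def value_gate(pr_body: str, is_net_new_feature: bool) -> list[str]:
    """Return missing required sections; an empty list means the gate passes."""
    if not is_net_new_feature:
        return []
    return [s for s in REQUIRED_SECTIONS if s not in pr_body]

missing = value_gate("## Value\nSaves support triage time.\n", is_net_new_feature=True)
# missing == ["## Kill criteria"]
```

The point is not the string matching; it is that the bar now has a checkable artifact instead of a vibe.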
b) Make non-functional expectations explicit
- Add small, concrete NFR clauses per key area (latency, rate limits, data volume, dashboard noise).
- Harness hooks:
- cheap perf checks on hot paths,
- lint for logging/metrics patterns.
- Review checklist addition:
- “What does this do to latency, noise, and SLOs?”
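One of the cheap harness hooks above, a lint for logging patterns, can be a few lines of Python. This sketch assumes a (hypothetical) house rule that application code must log through the logging module rather than bare print():

```python
import re

# Hypothetical house rule: no bare print() in application code;
# operational output goes through the logging module so dashboards
# stay searchable and alert noise is controllable.
BARE_PRINT = re.compile(r"^\s*print\(")

def lint_logging(source: str) -> list[int]:
    """Return 1-based line numbers that violate the rule."""
    return [
        n for n, line in enumerate(source.splitlines(), start=1)
        if BARE_PRINT.match(line)
    ]

sample = "x = 1\nprint(x)\nlogger.info('ok')\n"
# lint_logging(sample) == [2]
```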
c) Encode simplicity and narrative
- New bar statements:
- “A mid-level engineer can explain this flow in 3–5 sentences.”
- “One obvious way to do X in this boundary.”
- Ritual:
- random PRs get a 60-second “explain this change” during standup; confusion triggers a refactor task, not just a comment.
d) Add UX/behavioral invariants
- For major surfaces (checkout, auth, notifications):
- maintain 1-page “behavioral invariants” docs,
- tests and PR templates must reference which invariant is touched.
- Review question:
- “Which invariant might this surprise, and how are we testing it?”
e) Human-legibility checks on agent-heavy diffs
- Harness rule:
- for PRs over a size or agent-origin threshold, require a short human-written summary:
- domain language used,
- main objects and responsibilities.
- Review question:
- “Could someone debug this without asking the agent?” If not, the bar is not met.
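A sketch of that harness rule; the size threshold and the “## Summary” heading are illustrative assumptions, not conventions from the source:

```python
# Illustrative threshold: diff lines above which a summary is mandatory.
SIZE_THRESHOLD = 300

def needs_summary(diff_lines: int, agent_authored: bool) -> bool:
    """Agent-origin PRs always need a summary; human PRs only when large."""
    return agent_authored or diff_lines > SIZE_THRESHOLD

def legibility_gate(pr_body: str, diff_lines: int, agent_authored: bool) -> bool:
    """Pass unless a summary is required and the PR body lacks one."""
    if not needs_summary(diff_lines, agent_authored):
        return True
    return "## Summary" in pr_body
```

Detecting agent origin is left to the team (commit trailers, bot accounts, or self-declaration all work); the gate only cares about the resulting boolean.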
f) Lightweight perf/cost budget in the bar
- For key flows, define simple budgets (e.g., max queries per request, max payload sizes, log lines).
- Harness checks using existing metrics or local analysis (e.g., EXPLAIN, query count in tests).
- Review question:
- “Does this stay within the known budget? If not, what changed?”
- Using incidents and regrets as a feedback loop
a) Tag incidents and rollbacks
- In postmortems, add tags like:
- “spec gap – value,”
- “spec gap – NFR,”
- “spec gap – legibility,”
- vs. “enforcement gap.”
- Monthly, count which tags dominate; any recurrent “spec gap” feeds a craft-bar revision.
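The monthly tally is a one-liner over the postmortem tags. The records below are illustrative; only the tag scheme comes from the list above:

```python
from collections import Counter

# Illustrative postmortem records using the tag scheme above.
postmortems = [
    {"id": "PM-101", "tags": ["spec gap – NFR"]},
    {"id": "PM-102", "tags": ["enforcement gap"]},
    {"id": "PM-103", "tags": ["spec gap – NFR", "spec gap – legibility"]},
]

tally = Counter(tag for pm in postmortems for tag in pm["tags"])
# Any spec-gap tag that recurs within the month is a candidate bar revision.
recurrent = sorted(tag for tag, n in tally.items()
                   if tag.startswith("spec gap") and n > 1)
# recurrent == ["spec gap – NFR"]
```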
b) Run quarterly “regret reviews”
- Ask: “Which shipped things would we undo?”
- For each, answer:
- What in the bar allowed this to look acceptable?
- What one sentence or check would have made us pause?
- Only add 1–2 new bar items per quarter to avoid bloat.
c) Tighten triad rituals around bar evolution
- In triads, when a failure mode is seen in-flight (e.g., confusing flow, latent complexity), name:
- “Is this an enforcement failure or a spec failure?”
- If spec: capture a candidate change to the bar in a shared log; review and prune weekly.
Overall: look for “everything green yet we still regret it” patterns. When they cluster around value, NFRs, legibility, or UX, treat that as evidence the craft bar needs new dimensions, not just stricter policing, and encode small, explicit checks rather than broad exhortations.