In small, implementation-abundant monolith teams that already track drift metrics and use boundary-focused PR lanes, what concrete gaps still show up between ‘merge-worthy code’ and truly tasteful code (e.g., naming coherence, narrative clarity of modules, test readability), and how could those gaps be turned into review prompts or lightweight harness checks without collapsing into subjective style bikeshedding?

dhh-agent-first-software-craft

Answer

Key gaps and how to encode them without devolving into taste fights.

  1. Concrete gaps that remain
  • Naming coherence

    • Symptom: names are locally fine but clash with the domain lexicon or existing patterns.
    • Shape: similar concepts named differently across modules; verbs/nouns inconsistent.
  • Narrative clarity of modules

    • Symptom: file or class is correct and small, but its purpose is opaque.
    • Shape: unclear entrypoints, mixed concerns, no obvious “happy path” to read.
  • Test readability and intent

    • Symptom: tests pass and are scoped, but future readers can’t see what behavior really matters.
    • Shape: over-mocked setups, magic values, no clear arrange/act/assert story.
  • Boundary story vs. code story

    • Symptom: change respects tags and lanes, but doesn’t fit the boundary’s informal “job description.”
    • Shape: boundary grows grab-bag methods; new flows don’t match its advertised role.
  • Local micro-patterns

    • Symptom: OK implementation that ignores a nearby, nicer pattern.
    • Shape: hand-rolled loops instead of existing helpers; minor divergence in how errors/results are handled.
  2. Turning gaps into human review prompts instead of bikeshedding
  • Use short, fixed checklists per lane

    • Example PR template add-ons:
      • Naming:
        • “Are new names drawn from our domain glossary or existing modules?”
        • “Did you rename nearby things for consistency where cheap?”
      • Narrative:
        • “If someone skims this module top-to-bottom, is there a clear main path?”
        • “Is there one short docstring or header comment that says what this file is for?”
      • Tests:
        • “Can a new teammate infer behavior from test names alone?”
        • “Is there at least one ‘story’ test per feature, not just mechanics?”
    • Limit to 3–5 bullets per lane; anything longer invites bikeshedding.
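One way to keep these checklists short and consistent is to store them as data and render them into the PR template, with the bullet cap enforced mechanically. A minimal sketch, assuming hypothetical lane names and questions (taken from the examples above, not from any real tool's config):

```python
# Hypothetical sketch: per-lane review checklists rendered into a PR
# template section. Lane names and questions are illustrative.
# MAX_BULLETS enforces the "3-5 bullets per lane" norm.

LANE_CHECKLISTS = {
    "naming": [
        "Are new names drawn from our domain glossary or existing modules?",
        "Did you rename nearby things for consistency where cheap?",
    ],
    "narrative": [
        "If someone skims this module top-to-bottom, is there a clear main path?",
        "Is there one short docstring or header comment saying what this file is for?",
    ],
    "tests": [
        "Can a new teammate infer behavior from test names alone?",
        "Is there at least one 'story' test per feature, not just mechanics?",
    ],
}

MAX_BULLETS = 5  # anything longer invites bikeshedding


def render_checklist(lane: str) -> str:
    """Render one lane's checklist as a markdown section for the PR body."""
    items = LANE_CHECKLISTS[lane]
    if len(items) > MAX_BULLETS:
        raise ValueError(f"{lane}: checklist exceeds {MAX_BULLETS} bullets")
    lines = [f"### {lane.title()} check"]
    lines += [f"- [ ] {q}" for q in items]
    return "\n".join(lines)
```

Keeping the questions in one data structure means the cap is checked in CI rather than re-litigated in review.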
  • Require author self-rating instead of open-ended taste debates

    • Simple scale in PR description (1–3): Naming, Narrative, Tests.
    • Reviewer reacts only when self-rating is low or obviously off, keeping taste comments focused.
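The self-rating convention can be made machine-readable so the harness pings reviewers only on low scores. A sketch assuming a hypothetical "Naming: 2 / Narrative: 1 / Tests: 3" format in the PR description; the field names and 1-3 scale follow the team norm above, the text format is an assumption:

```python
import re

# Hypothetical sketch: extract 1-3 self-ratings from a PR description
# and surface only the dimensions the author rated low, so reviewer
# taste comments stay focused.

RATING_RE = re.compile(r"(Naming|Narrative|Tests)\s*:\s*([1-3])", re.IGNORECASE)


def parse_self_ratings(pr_body: str) -> dict[str, int]:
    """Pull {dimension: rating} pairs out of free-form PR text."""
    return {m.group(1).lower(): int(m.group(2)) for m in RATING_RE.finditer(pr_body)}


def needs_taste_review(ratings: dict[str, int], threshold: int = 2) -> list[str]:
    """Dimensions rated below threshold, i.e. where reviewer attention pays off."""
    return sorted(k for k, v in ratings.items() if v < threshold)
```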
  • Add “taste label” tags on comments

    • Labels like naming, narrative, tests-as-spec.
    • Rule: reviewers may leave at most N taste comments per PR unless risk is high.
    • This constrains volume and forces them to pick the highest-leverage points.
  3. Lightweight harness checks that stay objective
  • Lexicon / naming checks

    • Maintain a small domain glossary file.
    • Harness check: warn when new identifiers collide with glossary anti-patterns or miss obvious glossary terms (e.g., Acct vs Account).
    • Only warn; no hard fail unless it hits a blocked word list.
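A warn-only lexicon check can be a few lines. A sketch assuming a hypothetical glossary format (preferred term mapped to discouraged variants) and an identifier list that, in a real harness, would come from the PR diff:

```python
# Hypothetical sketch of a warn-only lexicon check. Glossary contents
# and blocked words are illustrative; only the blocked list hard-fails.

GLOSSARY = {
    "account": {"acct", "acc"},
    "invoice": {"inv", "bill_doc"},
}
BLOCKED = {"tmp", "data2"}  # hard-fail words; everything else only warns


def check_identifiers(identifiers):
    """Return (warnings, failures) for new snake_case identifiers."""
    warnings, failures = [], []
    for name in identifiers:
        tokens = set(name.lower().split("_"))
        if tokens & BLOCKED:
            failures.append(f"{name}: uses blocked word")
            continue
        for preferred, variants in GLOSSARY.items():
            hits = tokens & variants
            if hits:
                warnings.append(f"{name}: prefer '{preferred}' over {sorted(hits)}")
    return warnings, failures
```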
  • Module “shape” checks

    • Simple heuristics per file:
      • Max public entrypoints.
      • Presence of a short top-level comment or module doc block.
      • Flag unusually long files in “application” dirs for human narrative review lane.
    • Again: warnings route to a taste_review lane; not hard gating.
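For a Python codebase, the shape heuristics above can run on the AST. A sketch in which the thresholds and the definition of "public entrypoint" (top-level def or class not starting with "_") are assumptions, not a real tool's rules:

```python
import ast

# Hypothetical sketch of module "shape" heuristics. Thresholds are
# illustrative; output is warnings for a taste_review lane, never a gate.

MAX_PUBLIC_ENTRYPOINTS = 5
MAX_LINES = 400


def module_shape_warnings(source: str, path: str = "<module>") -> list[str]:
    """Warn on too many entrypoints, missing docstring, or unusual length."""
    tree = ast.parse(source)
    warnings = []
    public = [n.name for n in tree.body
              if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
              and not n.name.startswith("_")]
    if len(public) > MAX_PUBLIC_ENTRYPOINTS:
        warnings.append(f"{path}: {len(public)} public entrypoints "
                        f"(max {MAX_PUBLIC_ENTRYPOINTS})")
    if not ast.get_docstring(tree):
        warnings.append(f"{path}: no top-level docstring saying what this file is for")
    if source.count("\n") > MAX_LINES:
        warnings.append(f"{path}: unusually long; flag for narrative review lane")
    return warnings
```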
  • Test intent checks

    • Heuristics:
      • Test names must contain at least one verb and one domain noun.
      • Minimum ratio of higher-level integration/feature tests to unit tests per ticket.
    • Harness aggregates by PR and posts a short summary comment: “2/2 tests have descriptive names; 0 feature-level tests detected.” Reviewer decides.
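The verb/noun heuristic and the summary line can be sketched directly. The word lists here are stand-ins for a team-maintained vocabulary, and the summary format mirrors the example comment above:

```python
# Hypothetical sketch of the test-intent heuristic: a test name counts
# as descriptive if it contains at least one verb and one domain noun.
# Both word sets are illustrative stand-ins for a team vocabulary.

VERBS = {"creates", "rejects", "sends", "charges", "returns", "expires"}
DOMAIN_NOUNS = {"invoice", "account", "subscription", "refund"}


def name_is_descriptive(test_name: str) -> bool:
    tokens = set(test_name.lower().split("_"))
    return bool(tokens & VERBS) and bool(tokens & DOMAIN_NOUNS)


def summarize(test_names, feature_count: int) -> str:
    """Build the one-line PR summary comment; the reviewer decides from there."""
    good = sum(name_is_descriptive(n) for n in test_names)
    return (f"{good}/{len(test_names)} tests have descriptive names; "
            f"{feature_count} feature-level tests detected.")
```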
  • Boundary “job description” alignment

    • Store a short text “summary” per boundary (1–2 sentences in code or config).
    • Harness uses embedding or simple keyword match to:
      • Highlight new methods whose names/paths don’t share tokens with the boundary summary.
      • Flag boundaries whose summary hasn’t changed despite repeated shape changes.
    • Output is a single line in the PR: “3 new methods weakly match boundary summary; consider narrative review.”
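The simple keyword-match variant reduces to token overlap between a method's name and the boundary summary. A sketch with an assumed summary string and method names; an embedding model could replace the overlap check without changing the interface:

```python
import re

# Hypothetical sketch of the keyword-match variant: a new method "weakly
# matches" its boundary when its name shares no token with the
# boundary's one-line summary. Summary and method names are illustrative.


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def weak_matches(boundary_summary: str, new_methods: list[str]) -> list[str]:
    """Methods whose names share no token with the boundary summary."""
    summary_tokens = tokens(boundary_summary)
    return [m for m in new_methods if not (tokens(m) & summary_tokens)]


def pr_line(boundary_summary: str, new_methods: list[str]) -> str:
    """The single-line PR output; humans take it from there."""
    weak = weak_matches(boundary_summary, new_methods)
    return (f"{len(weak)} new methods weakly match boundary summary; "
            "consider narrative review.")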
  1. Guardrails against bikeshedding
  • Separate lanes

    • Only PRs marked tasteful_surface or core_flow invite deep taste review.
    • Default lane: correctness + light taste nits only.
  • Cap taste nit count

    • Team norm: max 3 taste nits per reviewer per PR, unless they’re blocking.
    • If more are needed, reviewer writes a tiny follow-up refactor ticket instead of filling comments.
  • Prefer refactor follow-ups for systemic taste issues

    • If harness or reviewers see repeated naming/narrative problems in an area, open one taste_refactor rather than re-fighting each PR.

This keeps “taste” concrete (names, module story, tests-as-spec) and surfaces it via small prompts and numeric-ish heuristics, while using lanes and caps to keep debates from overwhelming the review bottleneck.