When a team leans hard into a CLI-substrate and sidecar agent loops for small-team leverage, what specific mismatch patterns between human taste and agent behavior (e.g., ugly but correct pipelines, brittle verification scripts, over-fragmented tools) show up in diffs and incidents, and how could the harness encode lightweight, diff-first checks that push agents toward more human-verifiable, aesthetically coherent flows without turning seniors into full-time style police?

dhh-agent-first-software-craft

Answer

Concise view:

  1. Common mismatch patterns (what shows up in diffs/incidents)
  • P1: Ugly-but-correct pipelines

    • Long shell pipes with nested subshells and ad-hoc env vars.
    • Many one-off flags; no named scripts; hard to explain in review.
    • Incidents: tiny change to one stage breaks hidden assumptions.
  • P2: Brittle verification scripts

    • Agent-generated scripts that hard-code paths, IDs, or data shapes.
    • Overfit to a single editor run; fail in CI or other envs.
    • Incidents: verification “green” but missed real edge conditions.
  • P3: Over-fragmented tools

    • Many tiny CLIs with overlapping verbs; no clear “blessed” entrypoints.
    • Similar flows implemented 3–4 ways; agents pick arbitrarily.
    • Incidents: ops/debug use the wrong variant; verification uses another.
  • P4: Opaque side-effects

    • Tools that do read + write + notify in one command, with no dry-run.
    • Pipelines hide side-effects behind generic names like sync, apply.
    • Incidents: script used for “check” also mutates prod.
  • P5: Style drift in naming and layout

    • Inconsistent verb/noun patterns (run_, do-, execute_…).
    • Mixed languages and frameworks in adjacent tools.
    • Incidents: humans mis-invoke tools; agents mis-route flows.
  • P6: Over-fitted prompts/flows

    • Harness flows that encode one reviewer’s taste, not team norms.
    • Agents produce merge-worthy code for that reviewer but noisy diffs for everyone else.
    • Incidents: review stalls or silent compromises on craft bar.
  2. Lightweight, diff-first harness checks

Goal: nudge agents toward human-verifiable, coherent flows with short, local checks; no big style-enforcement pass.

Patterns:

  • C1: Pipeline-shape lint

    • Trigger: diffs adding/editing scripts or CLI pipelines.
    • Check:
      • Max pipeline length (e.g., more than 4 | stages → suggest refactoring into a named step).
      • Flag subshell nesting, heavy inline sed/awk without comments.
    • Output: 3–5 line “pipeline review card” attached to diff with suggestions.
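A minimal sketch of C1, assuming the harness hands the check raw script text and a stage threshold of 4 (the names `lint_pipeline_shape` and `MAX_STAGES` are illustrative, not part of any existing tool; a real linter would use a proper shell parser rather than this regex heuristic):

```python
import re

MAX_STAGES = 4  # assumed threshold: >4 stages triggers a suggestion


def lint_pipeline_shape(script_text: str) -> list[str]:
    """Flag shell lines whose pipelines exceed MAX_STAGES stages.

    Heuristic only: counts unquoted single '|' (ignoring '||') after
    stripping trailing comments. Pipes inside quoted strings will be
    miscounted; shellcheck/shfmt would parse this properly.
    """
    findings = []
    for lineno, line in enumerate(script_text.splitlines(), 1):
        code = line.split("#", 1)[0]  # drop trailing comment
        # count '|' that are not part of the '||' operator
        stages = len(re.findall(r"(?<!\|)\|(?!\|)", code)) + 1
        if stages > MAX_STAGES:
            findings.append(
                f"line {lineno}: {stages}-stage pipeline; "
                "consider extracting a named step"
            )
    return findings
```

The output maps directly onto the "pipeline review card": a handful of one-line suggestions attached to the diff, never a blocking gate.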
  • C2: Script robustness check

    • Trigger: new/changed scripts in scripts/, tools/, CI steps.
    • Check:
      • Hard-coded paths/IDs; missing set -euo pipefail or equivalent.
      • No --help or usage doc; no dry-run for mutating commands.
    • Output: minimal checklist in the diff: “add help?”, “support dry-run?”, “parameterize X?”.
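A sketch of the C2 checklist generator, with all patterns (strict-mode regex, the mutating-command list, the home-directory heuristic) as stated assumptions rather than a definitive rule set:

```python
import re


def robustness_checklist(script_text: str) -> list[str]:
    """Emit checklist questions for a changed shell script.

    Pure heuristics: each item is a suggestion for the diff comment,
    not proof of a bug.
    """
    items = []
    if not re.search(r"set\s+-[a-z]*e", script_text):
        items.append("add `set -euo pipefail` (or equivalent)?")
    if re.search(r"/(home|Users)/\w+", script_text):
        items.append("parameterize hard-coded user path?")
    if "--help" not in script_text and "usage" not in script_text.lower():
        items.append("add --help / usage doc?")
    # assumed allow-list of commands treated as mutating
    if re.search(r"\b(rm|kubectl apply|psql)\b", script_text) and \
            "--dry-run" not in script_text:
        items.append("support --dry-run for mutating commands?")
    return items
```

Each question lands in the diff as-is, so the reviewer sees the checklist rather than re-deriving it per script.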
  • C3: Tool taxonomy hint

    • Trigger: new CLI binary or top-level command.
    • Check:
      • Compare name/flags to existing tools; detect near-duplicates.
    • Output: short note: “Similar to foo-sync. Reuse/extend instead? If not, note why in description.”
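The near-duplicate detection in C3 can be as small as a fuzzy match over the existing tool names; a sketch using the standard library's `difflib` (the 0.7 cutoff is an assumed tuning knob):

```python
import difflib


def near_duplicate_tools(new_tool: str,
                         existing: list[str],
                         cutoff: float = 0.7) -> list[str]:
    """Return existing tool names that look close to the proposed one.

    Backed by difflib.SequenceMatcher ratios; cutoff is an assumption
    to tune against the team's actual tool inventory.
    """
    return difflib.get_close_matches(new_tool, existing, n=3, cutoff=cutoff)
```

If the list is non-empty, the harness emits the short note ("Similar to foo-sync. Reuse/extend instead?"); if the author keeps the new tool anyway, the note asks for one line of justification in the description.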
  • C4: Side-effect clarity guard

    • Trigger: tools that write DB/files or call external APIs.
    • Check:
      • Require explicit --dry-run or --check mode.
      • Require a short, human description of side-effects in help text.
    • Output: diff comment: “Add explicit dry-run and 1–2 line side-effect summary.”
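A sketch of the C4 guard, assuming the harness already knows (or conservatively assumes) whether a tool mutates state, and that it can read the tool's help text; the keyword checks are illustrative:

```python
def side_effect_guard(help_text: str, mutates: bool) -> list[str]:
    """Diff comments for a tool assumed to mutate external state.

    Checks two things from C4: an explicit dry-run/check mode, and a
    human-readable side-effect summary in the help text.
    """
    comments = []
    if not mutates:
        return comments
    if "--dry-run" not in help_text and "--check" not in help_text:
        comments.append("add an explicit --dry-run or --check mode")
    lowered = help_text.lower()
    if "side effect" not in lowered and "writes" not in lowered:
        comments.append("add a 1-2 line side-effect summary to --help")
    return comments
```

This is the category where a hard gate (per H1) is defensible: a mutating tool with no dry-run is exactly the "check script mutates prod" incident from P4.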
  • C5: Naming and entrypoint conventions

    • Trigger: new top-level commands or scripts.
    • Check:
      • Match against simple patterns (e.g., verbs list/plan/apply/check; cmd-* naming; language prefs).
    • Output: small suggestion: “Prefer verb-noun pattern; consider plan-xyz instead of run_thing.”
  • C6: Flow-level snapshot

    • Trigger: pipelines/scripts above a size threshold.
    • Check:
      • Ask agent to emit a 5–10 line “flow summary” describing steps and inputs/outputs.
    • Output: summary auto-included in PR description for humans to edit.
  3. Keeping seniors out of style-police mode

  • H1: Default to suggestions, not gates

    • Most checks are non-blocking comments; only a few hard gates (e.g., no hard-coded prod IDs, no mutating tool without dry-run).
  • H2: Encode norms once, reuse everywhere

    • Put naming, pipeline length, dry-run rules in a small, versioned “CLI style charter”.
    • Harness references that charter; seniors update the charter occasionally instead of repeating comments.
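One way H2 could look in practice: the charter is a single small, versioned data structure that every check imports, so updating it once propagates everywhere. The shape and values below are purely illustrative:

```python
# A minimal, versioned "CLI style charter" the harness checks can import.
# Every value here is an illustrative assumption, not a prescription.
CHARTER = {
    "version": 3,
    "pipeline": {"max_stages": 4},
    "scripts": {"require_strict_mode": True, "require_help": True},
    "naming": {"pattern": "verb-noun", "verbs": ["list", "plan", "apply", "check"]},
    "side_effects": {"require_dry_run": True},
}
```

A senior's "style comment" then becomes a one-line charter diff (bump `max_stages`, add a verb), reviewed once instead of repeated across PRs.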
  • H3: Lane routing instead of ad-hoc nitpicks

    • Create lanes like new_tool, pipeline_change, verification_script.
    • Each lane has a tiny checklist; reviewer focuses on 3–5 questions, not full-style sweep.
  • H4: Agent pre-review

    • Before human review, agent runs checks and applies auto-fixes where safe (add set -e, add #!/usr/bin/env bash, normalize headers).
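The "safe auto-fixes" in H4 are the mechanical subset of C2; a sketch of what the pre-review pass might apply (function name assumed, fixes limited to the two the text names so the change is always safe to apply blindly):

```python
def pre_review_autofix(script_text: str) -> str:
    """Apply mechanical fixes to a shell script before human review.

    Only does the boilerplate: add a shebang if missing, add strict
    mode if no `set -` line exists. Anything judgment-laden is left
    for the reviewer.
    """
    lines = script_text.splitlines()
    if not lines or not lines[0].startswith("#!"):
        lines.insert(0, "#!/usr/bin/env bash")
    if not any(l.lstrip().startswith("set -") for l in lines):
        lines.insert(1, "set -euo pipefail")
    return "\n".join(lines) + "\n"
```

The human then reviews a diff that already passes the boilerplate checks, so their comments land on flow shape and risk rather than missing shebangs.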
    • Humans see already-improved diffs; comments focus on real taste calls (flow shape, naming, risk) not boilerplate.
  • H5: Taste-focused spot reviews

    • Seniors do periodic, small “flow tastings” (pick 1–2 representative pipelines) instead of commenting on every minor script.
    • Outcomes: tweaks to harness rules or charter, not large per-PR debates.

This keeps the main craft levers at the flow and contract level, gives agents concrete rails for CLI/pipeline aesthetics and safety, and limits human attention to short, structured checks where taste actually matters.