When a team leans hard into a CLI-substrate and sidecar agent loops for small-team leverage, what specific mismatch patterns between human taste and agent behavior (e.g., ugly but correct pipelines, brittle verification scripts, over-fragmented tools) show up in diffs and incidents, and how could the harness encode lightweight, diff-first checks that push agents toward more human-verifiable, aesthetically coherent flows without turning seniors into full-time style police?

dhh-agent-first-software-craft

Answer

Concise view:

  1. Common mismatch patterns (what shows up in diffs/incidents)
  • P1: Ugly-but-correct pipelines

    • Long shell pipes with nested subshells and ad-hoc env vars.
    • Many one-off flags; no named scripts; hard to explain in review.
    • Incidents: tiny change to one stage breaks hidden assumptions.
  • P2: Brittle verification scripts

    • Agent-generated scripts that hard-code paths, IDs, or data shapes.
    • Overfit to a single editor run; fail in CI or other envs.
    • Incidents: verification “green” but missed real edge conditions.
  • P3: Over-fragmented tools

    • Many tiny CLIs with overlapping verbs; no clear “blessed” entrypoints.
    • Similar flows implemented 3–4 ways; agents pick arbitrarily.
    • Incidents: ops/debug use the wrong variant; verification uses another.
  • P4: Opaque side-effects

    • Tools that do read + write + notify in one command, with no dry-run.
    • Pipelines hide side-effects behind generic names like sync, apply.
    • Incidents: script used for “check” also mutates prod.
  • P5: Style drift in naming and layout

    • Inconsistent verb/noun patterns (run_, do-, execute_…).
    • Mixed languages and frameworks in adjacent tools.
    • Incidents: humans mis-invoke tools; agents mis-route flows.
  • P6: Over-fitted prompts/flows

    • Harness flows that encode one reviewer’s taste, not team norms.
    • Agents produce merge-worthy code for that reviewer but noisy diffs for everyone else.
    • Incidents: review stalls or silent compromises on craft bar.
  2. Lightweight, diff-first harness checks

Goal: nudge agents toward human-verifiable, coherent flows with short, local checks; no big style-enforcement pass.

Patterns:

  • C1: Pipeline-shape lint

    • Trigger: diffs adding/editing scripts or CLI pipelines.
    • Check:
      • Max pipeline length (e.g., more than 4 | stages → suggest refactoring into a named step).
      • Flag subshell nesting, heavy inline sed/awk without comments.
    • Output: 3–5 line “pipeline review card” attached to diff with suggestions.
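A minimal sketch of C1, assuming the harness hands the check raw script text and a stage threshold of 4 (the names `lint_pipeline_shape` and `MAX_STAGES` are illustrative, not part of any existing tool; a real linter would use a proper shell parser rather than this regex heuristic):

```python
import re

MAX_STAGES = 4  # assumed threshold: >4 stages triggers a suggestion


def lint_pipeline_shape(script_text: str) -> list[str]:
    """Flag shell lines whose pipelines exceed MAX_STAGES stages.

    Heuristic only: counts unquoted single '|' (ignoring '||') after
    stripping trailing comments. Pipes inside quoted strings will be
    miscounted; shellcheck/shfmt would parse this properly.
    """
    findings = []
    for lineno, line in enumerate(script_text.splitlines(), 1):
        code = line.split("#", 1)[0]  # drop trailing comment
        # count '|' that are not part of the '||' operator
        stages = len(re.findall(r"(?<!\|)\|(?!\|)", code)) + 1
        if stages > MAX_STAGES:
            findings.append(
                f"line {lineno}: {stages}-stage pipeline; "
                "consider extracting a named step"
            )
    return findings
```

The output maps directly onto the "pipeline review card": a handful of one-line suggestions attached to the diff, never a blocking gate.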
  • C2: Script robustness check

    • Trigger: new/changed scripts in scripts/, tools/, CI steps.
    • Check:
      • Hard-coded paths/IDs; missing set -euo pipefail or equivalent.
      • No --help or usage doc; no dry-run for mutating commands.
    • Output: minimal checklist in the diff: “add help?”, “support dry-run?”, “parameterize X?”.
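A sketch of the C2 checklist generator, with all patterns (strict-mode regex, the mutating-command list, the home-directory heuristic) as stated assumptions rather than a definitive rule set:

```python
import re


def robustness_checklist(script_text: str) -> list[str]:
    """Emit checklist questions for a changed shell script.

    Pure heuristics: each item is a suggestion for the diff comment,
    not proof of a bug.
    """
    items = []
    if not re.search(r"set\s+-[a-z]*e", script_text):
        items.append("add `set -euo pipefail` (or equivalent)?")
    if re.search(r"/(home|Users)/\w+", script_text):
        items.append("parameterize hard-coded user path?")
    if "--help" not in script_text and "usage" not in script_text.lower():
        items.append("add --help / usage doc?")
    # assumed allow-list of commands treated as mutating
    if re.search(r"\b(rm|kubectl apply|psql)\b", script_text) and \
            "--dry-run" not in script_text:
        items.append("support --dry-run for mutating commands?")
    return items
```

Each question lands in the diff as-is, so the reviewer sees the checklist rather than re-deriving it per script.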
  • C3: Tool taxonomy hint

    • Trigger: new CLI binary or top-level command.
    • Check:
      • Compare name/flags to existing tools; detect near-duplicates.
    • Output: short note: “Similar to foo-sync. Reuse/extend instead? If not, note why in description.”
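The near-duplicate detection in C3 can be as small as a fuzzy match over the existing tool names; a sketch using the standard library's `difflib` (the 0.7 cutoff is an assumed tuning knob):

```python
import difflib


def near_duplicate_tools(new_tool: str,
                         existing: list[str],
                         cutoff: float = 0.7) -> list[str]:
    """Return existing tool names that look close to the proposed one.

    Backed by difflib.SequenceMatcher ratios; cutoff is an assumption
    to tune against the team's actual tool inventory.
    """
    return difflib.get_close_matches(new_tool, existing, n=3, cutoff=cutoff)
```

If the list is non-empty, the harness emits the short note ("Similar to foo-sync. Reuse/extend instead?"); if the author keeps the new tool anyway, the note asks for one line of justification in the description.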
  • C4: Side-effect clarity guard

    • Trigger: tools that write DB/files or call external APIs.
    • Check:
      • Require explicit --dry-run or --check mode.
      • Require a short, human description of side-effects in help text.
    • Output: diff comment: “Add explicit dry-run and 1–2 line side-effect summary.”
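A sketch of the C4 guard, assuming the harness already knows (or conservatively assumes) whether a tool mutates state, and that it can read the tool's help text; the keyword checks are illustrative:

```python
def side_effect_guard(help_text: str, mutates: bool) -> list[str]:
    """Diff comments for a tool assumed to mutate external state.

    Checks two things from C4: an explicit dry-run/check mode, and a
    human-readable side-effect summary in the help text.
    """
    comments = []
    if not mutates:
        return comments
    if "--dry-run" not in help_text and "--check" not in help_text:
        comments.append("add an explicit --dry-run or --check mode")
    lowered = help_text.lower()
    if "side effect" not in lowered and "writes" not in lowered:
        comments.append("add a 1-2 line side-effect summary to --help")
    return comments
```

This is the category where a hard gate (per H1) is defensible: a mutating tool with no dry-run is exactly the "check script mutates prod" incident from P4.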
  • C5: Naming and entrypoint conventions

    • Trigger: new top-level commands or scripts.
    • Check:
      • Match against simple patterns (e.g., verbs list/plan/apply/check; cmd-* naming; language prefs).
    • Output: small suggestion: “Prefer verb-noun pattern; consider plan-xyz instead of run_thing.”
  • C6: Flow-level snapshot

    • Trigger: pipelines/scripts above a size threshold.
    • Check:
      • Ask agent to emit a 5–10 line “flow summary” describing steps and inputs/outputs.
    • Output: summary auto-included in PR description for humans to edit.
  3. Keeping seniors out of style-police mode

  • H1: Default to suggestions, not gates

    • Most checks are non-blocking comments; only a few hard gates (e.g., no hard-coded prod IDs, no mutating tool without dry-run).
  • H2: Encode norms once, reuse everywhere

    • Put naming, pipeline length, dry-run rules in a small, versioned “CLI style charter”.
    • Harness references that charter; seniors update the charter occasionally instead of repeating comments.
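One way H2 could look in practice: the charter is a single small, versioned data structure that every check imports, so updating it once propagates everywhere. The shape and values below are purely illustrative:

```python
# A minimal, versioned "CLI style charter" the harness checks can import.
# Every value here is an illustrative assumption, not a prescription.
CHARTER = {
    "version": 3,
    "pipeline": {"max_stages": 4},
    "scripts": {"require_strict_mode": True, "require_help": True},
    "naming": {"pattern": "verb-noun", "verbs": ["list", "plan", "apply", "check"]},
    "side_effects": {"require_dry_run": True},
}
```

A senior's "style comment" then becomes a one-line charter diff (bump `max_stages`, add a verb), reviewed once instead of repeated across PRs.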
  • H3: Lane routing instead of ad-hoc nitpicks

    • Create lanes like new_tool, pipeline_change, verification_script.
    • Each lane has a tiny checklist; reviewer focuses on 3–5 questions, not full-style sweep.
  • H4: Agent pre-review

    • Before human review, agent runs checks and applies auto-fixes where safe (add set -e, add #!/usr/bin/env bash, normalize headers).
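The "safe auto-fixes" in H4 are the mechanical subset of C2; a sketch of what the pre-review pass might apply (function name assumed, fixes limited to the two the text names so the change is always safe to apply blindly):

```python
def pre_review_autofix(script_text: str) -> str:
    """Apply mechanical fixes to a shell script before human review.

    Only does the boilerplate: add a shebang if missing, add strict
    mode if no `set -` line exists. Anything judgment-laden is left
    for the reviewer.
    """
    lines = script_text.splitlines()
    if not lines or not lines[0].startswith("#!"):
        lines.insert(0, "#!/usr/bin/env bash")
    if not any(l.lstrip().startswith("set -") for l in lines):
        lines.insert(1, "set -euo pipefail")
    return "\n".join(lines) + "\n"
```

The human then reviews a diff that already passes the boilerplate checks, so their comments land on flow shape and risk rather than missing shebangs.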
    • Humans see already-improved diffs; comments focus on real taste calls (flow shape, naming, risk) not boilerplate.
  • H5: Taste-focused spot reviews

    • Seniors do periodic, small “flow tastings” (pick 1–2 representative pipelines) instead of commenting on every minor script.
    • Outcomes: tweaks to harness rules or charter, not large per-PR debates.

This keeps the main craft levers at the flow and contract level, gives agents concrete rails for CLI/pipeline aesthetics and safety, and limits human attention to short, structured checks where taste actually matters.