When a team leans hard into a CLI-substrate and sidecar agent loops for small-team leverage, what specific mismatch patterns between human taste and agent behavior (e.g., ugly but correct pipelines, brittle verification scripts, over-fragmented tools) show up in diffs and incidents, and how could the harness encode lightweight, diff-first checks that push agents toward more human-verifiable, aesthetically coherent flows without turning seniors into full-time style police?
dhh-agent-first-software-craft
Answer
Concise view:
- Common mismatch patterns (what shows up in diffs/incidents)
- P1: Ugly-but-correct pipelines
  - Long shell pipes with nested subshells and ad-hoc env vars.
  - Many one-off flags; no named scripts; hard to explain in review.
  - Incidents: a tiny change to one stage breaks hidden assumptions.
- P2: Brittle verification scripts
  - Agent-generated scripts that hard-code paths, IDs, or data shapes.
  - Overfit to a single editor run; fail in CI or other envs.
  - Incidents: verification “green” but missed real edge conditions.
- P3: Over-fragmented tools
  - Many tiny CLIs with overlapping verbs; no clear “blessed” entrypoints.
  - Similar flows implemented 3–4 ways; agents pick arbitrarily.
  - Incidents: ops/debug use the wrong variant; verification uses another.
- P4: Opaque side-effects
  - Tools that do read + write + notify in one command, with no dry-run.
  - Pipelines hide side-effects behind generic names like `sync` or `apply`.
  - Incidents: a script used for “check” also mutates prod.
- P5: Style drift in naming and layout
  - Inconsistent verb/noun patterns (`run_`, `do-`, `execute_`, …).
  - Mixed languages and frameworks in adjacent tools.
  - Incidents: humans mis-invoke tools; agents mis-route flows.
- P6: Over-fitted prompts/flows
  - Harness flows that encode one reviewer’s taste, not team norms.
  - Agents produce merge-worthy code for that person, noisy for others.
  - Incidents: review stalls or silent compromises on the craft bar.
- Lightweight, diff-first harness checks
Goal: nudge agents toward human-verifiable, coherent flows with short, local checks; no big style-enforcement pass.
Patterns:
- C1: Pipeline-shape lint
  - Trigger: diffs adding/editing scripts or CLI pipelines.
  - Check:
    - Max pipeline length (e.g., more than 4 `|` → suggest refactoring to named steps).
    - Flag subshell nesting and heavy inline `sed`/`awk` without comments.
  - Output: a 3–5 line “pipeline review card” attached to the diff with suggestions.
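A check like C1 can stay tiny. A minimal sketch in Python, assuming the harness hands it each changed shell line as a string; the 4-stage threshold and the heuristics are illustrative, not prescriptive:

```python
import re

MAX_PIPES = 4  # illustrative threshold; a team would tune this in its charter

def lint_pipeline_shape(line: str) -> list[str]:
    """Return human-readable warnings for one changed shell line."""
    warnings = []
    # Counting pipes inside quotes correctly needs a parser; a rough count
    # is enough for a non-blocking nudge. Ignore logical-or (`||`).
    pipes = line.count("|") - 2 * line.count("||")
    if pipes > MAX_PIPES:
        warnings.append(f"pipeline has {pipes} stages; consider a named script per step")
    if line.count("$(") > 1:
        warnings.append("nested subshells; extract intermediate results into variables")
    if re.search(r"\b(sed|awk)\b", line) and "#" not in line:
        warnings.append("inline sed/awk without a comment; explain the transformation")
    return warnings
```

The joined warnings become the 3–5 line “pipeline review card” comment on the diff.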
- C2: Script robustness check
  - Trigger: new/changed scripts in `scripts/`, `tools/`, or CI steps.
  - Check:
    - Hard-coded paths/IDs; missing `set -euo pipefail` or equivalent.
    - No `--help` or usage doc; no dry-run for mutating commands.
  - Output: minimal checklist in the diff: “add help?”, “support dry-run?”, “parameterize X?”.
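C2 can be a handful of regexes rather than a full linter. A sketch, assuming the whole script text is available; the path patterns and the mutating-command list are assumptions a team would extend:

```python
import re

def check_script_robustness(text: str) -> list[str]:
    """Checklist items for a changed shell script; non-blocking suggestions."""
    items = []
    if "set -euo pipefail" not in text and "set -e" not in text:
        items.append("add `set -euo pipefail` (or equivalent) near the top?")
    if re.search(r"/(home|Users)/\w+", text):  # illustrative hard-coded-path heuristic
        items.append("parameterize hard-coded home paths?")
    if "--help" not in text and "usage" not in text.lower():
        items.append("add --help / usage text?")
    # Very partial list of mutating commands, for illustration only.
    mutating = re.search(r"\b(rm|mv|curl -X (POST|PUT|DELETE)|kubectl apply)\b", text)
    if mutating and "--dry-run" not in text:
        items.append("support --dry-run for mutating commands?")
    return items
```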
- C3: Tool taxonomy hint
  - Trigger: a new CLI binary or top-level command.
  - Check:
    - Compare name/flags to existing tools; detect near-duplicates.
  - Output: short note: “Similar to `foo-sync`. Reuse/extend instead? If not, note why in the description.”
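For C3, standard-library string similarity is enough for a nudge. A sketch using `difflib`, with a hypothetical tool inventory and an illustrative similarity threshold:

```python
from difflib import SequenceMatcher

EXISTING_TOOLS = ["foo-sync", "foo-check", "report-gen"]  # hypothetical inventory

def near_duplicates(new_name: str, existing=EXISTING_TOOLS, threshold=0.7):
    """Flag existing tools whose names are suspiciously close to the new one."""
    return [t for t in existing
            if SequenceMatcher(None, new_name, t).ratio() >= threshold]
```

A hit produces the “Similar to `foo-sync`. Reuse/extend instead?” note; comparing flag sets the same way would catch overlapping verbs.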
- C4: Side-effect clarity guard
  - Trigger: tools that write to the DB/files or call external APIs.
  - Check:
    - Require an explicit `--dry-run` or `--check` mode.
    - Require a short, human description of side-effects in the help text.
  - Output: diff comment: “Add explicit dry-run and a 1–2 line side-effect summary.”
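C4 only needs the tool’s help text plus a “mutates state” flag the harness already infers from its triggers. A sketch; the marker strings it scans for are assumptions:

```python
def side_effect_guard(help_text: str, mutates: bool) -> list[str]:
    """Suggest dry-run and side-effect docs for tools known to mutate state."""
    notes = []
    if mutates and "--dry-run" not in help_text and "--check" not in help_text:
        notes.append("add an explicit --dry-run or --check mode")
    lowered = help_text.lower()
    if mutates and "side effect" not in lowered and "writes" not in lowered:
        notes.append("add a 1-2 line side-effect summary to the help text")
    return notes
```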
- C5: Naming and entrypoint conventions
  - Trigger: new top-level commands or scripts.
  - Check:
    - Match against simple patterns (e.g., verbs `list`/`plan`/`apply`/`check`; `cmd-*` naming; language prefs).
  - Output: small suggestion: “Prefer a verb-noun pattern; consider `plan-xyz` instead of `run_thing`.”
- C6: Flow-level snapshot
  - Trigger: pipelines/scripts above a size threshold.
  - Check:
    - Ask the agent to emit a 5–10 line “flow summary” describing steps and inputs/outputs.
  - Output: summary auto-included in the PR description for humans to edit.
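C6’s trigger and the skeleton the agent fills in might look like this; the size threshold and template fields are assumptions for illustration:

```python
SIZE_THRESHOLD = 40  # lines; illustrative

def needs_flow_summary(script_text: str) -> bool:
    """Trigger: pipelines/scripts above a size threshold get a flow summary."""
    return len(script_text.splitlines()) > SIZE_THRESHOLD

def flow_summary_template(name: str) -> str:
    """Skeleton the agent fills in; auto-included in the PR description."""
    return (
        f"Flow summary for {name}:\n"
        "- Inputs:\n"
        "- Steps (one line each):\n"
        "- Outputs / side-effects:\n"
    )
```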
- Keeping seniors out of style-police mode
- H1: Default to suggestions, not gates
  - Most checks are non-blocking comments; only a few hard gates (e.g., no hard-coded prod IDs, no mutating tool without dry-run).
- H2: Encode norms once, reuse everywhere
  - Put naming, pipeline-length, and dry-run rules in a small, versioned “CLI style charter”.
  - The harness references that charter; seniors update the charter occasionally instead of repeating comments.
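One way to make the charter concrete is a small versioned data structure that harness checks import instead of hard-coding their own thresholds; the keys and values here are illustrative:

```python
# A versioned "CLI style charter"; seniors edit this file, not the checks.
CHARTER = {
    "version": 3,
    "max_pipeline_stages": 4,
    "blessed_verbs": ["list", "plan", "apply", "check"],
    "require_dry_run_for_mutations": True,
    "shell_prelude": "set -euo pipefail",
}

def charter_rule(key: str):
    """Single lookup point, so rule changes happen in one place."""
    return CHARTER[key]
```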
- H3: Lane routing instead of ad-hoc nitpicks
  - Create lanes like `new_tool`, `pipeline_change`, `verification_script`.
  - Each lane has a tiny checklist; the reviewer focuses on 3–5 questions, not a full-style sweep.
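Lane routing can be a short path-based classifier over the diff; the path conventions here are assumptions:

```python
def route_lane(changed_paths: list[str]) -> str:
    """Map a diff to a review lane with its own short checklist."""
    if any(p.startswith("tools/") and p.count("/") == 1 for p in changed_paths):
        return "new_tool"           # a new top-level binary under tools/
    if any(p.endswith(".sh") or "pipeline" in p for p in changed_paths):
        return "pipeline_change"
    if any("verify" in p or p.startswith("ci/") for p in changed_paths):
        return "verification_script"
    return "general"
```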
- H4: Agent pre-review
  - Before human review, the agent runs checks and applies auto-fixes where safe (add `set -e`, add `#!/usr/bin/env bash`, normalize headers).
  - Humans see already-improved diffs; comments focus on real taste calls (flow shape, naming, risk), not boilerplate.
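The “safe auto-fixes” step for shell scripts can be purely mechanical and idempotent; a sketch, assuming bash and the team’s preferred prelude:

```python
def autofix_shell(text: str) -> str:
    """Mechanical fixes applied before human review; safe to run repeatedly."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#!"):
        lines.insert(0, "#!/usr/bin/env bash")  # assumed team default
    if not any(l.strip().startswith("set -e") for l in lines):
        lines.insert(1, "set -euo pipefail")
    return "\n".join(lines) + "\n"
```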
- H5: Taste-focused spot reviews
  - Seniors do periodic, small “flow tastings” (pick 1–2 representative pipelines) instead of commenting on every minor script.
  - Outcomes: tweaks to harness rules or the charter, not large per-PR debates.
This keeps the main craft levers at the flow and contract level, gives agents concrete rails for CLI/pipeline aesthetics and safety, and limits human attention to short, structured checks where taste actually matters.