For small, high-context teams using a CLI substrate and sidecar agent loops, how could we redesign the review bottleneck itself—checklists, lane routing, pre-review agents—so that seniors mostly read summarized behavior deltas (contracts, invariants, hot paths) rather than raw script or pipeline diffs, and what concrete guardrails keep this abstraction layer from hiding the very edge cases humans are best at catching?

dhh-agent-first-software-craft | Updated at 2026-04-09 08:57

Answer

Outline:

- Make behavior deltas a first-class review artifact.
- Route diffs through lanes with checklists tuned to behavior, not lines.
- Use agents for pre-review summaries, but bind them to tests, traces, and contracts so they can’t hide risk.

Behavior-delta–first review design

Agent pre-review job per PR:
- Parse diff + tests + pipeline specs.
- Emit a small behavior-delta file, e.g. review/behavior_delta.md with sections:
  - Contracts changed (CLI flags, I/O schemas, exit codes).
  - Invariants touched (money, auth, data integrity, idempotency).
  - Hot paths touched (endpoints/CLIs tagged hot_path).
  - Side-effects (writes, external calls, cron jobs).
  - New failure modes and mitigations (retries, timeouts, fallbacks).
- Attach pointers into the actual diff for each bullet.
Senior review flow:
- Start with the behavior-delta file plus failing or changed tests.
- Drill into linked hunks only where:
  - a contract change looks too broad
  - an invariant is weakened
  - a hot path gains latency or complexity
  - the scope of a side-effect is unclear
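
The behavior-delta file described above could be modeled as plain data before it is rendered. The following is a minimal sketch, not a prescribed schema: `DeltaItem`, `render_delta`, the link format, and the "- None" convention for empty sections are all hypothetical names and choices; only the section titles come from the list above. It needs Python 3.9 or later.

```python
from dataclasses import dataclass, field

# Section names mirror the behavior-delta outline; their order is an assumption.
SECTIONS = [
    "Contracts changed",
    "Invariants touched",
    "Hot paths touched",
    "Side-effects",
    "New failure modes and mitigations",
]


@dataclass
class DeltaItem:
    section: str   # one of SECTIONS
    summary: str   # one-line statement of the behavior change
    links: list[str] = field(default_factory=list)  # pointers into the diff (hypothetical format)


def render_delta(items: list[DeltaItem]) -> str:
    """Render items as markdown grouped by section; empty sections say 'None'."""
    lines = ["# Behavior delta"]
    for section in SECTIONS:
        lines.append("")
        lines.append(f"## {section}")
        matching = [item for item in items if item.section == section]
        if not matching:
            lines.append("- None")
        for item in matching:
            refs = ", ".join(item.links) if item.links else "NO LINK"
            lines.append(f"- {item.summary} ({refs})")
    return "\n".join(lines)
```

Keeping the links on each item, rather than in a separate appendix, means a reviewer can jump from any single claim to the hunk that supports it.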

Lane routing tuned to behavior

Tag PRs (by agent + heuristics) into lanes:
- contract_touch: CLI/JSON schema/DB or API surface.
- invariant_touch: tagged domain rules, auth, limits.
- hot_path: known perf-sensitive commands/endpoints.
- verification_only: tests, scripts, harness changes.
- glue_low_risk: local script/stitching changes.
Lane → checklist examples (short, in PR template):
- contract_touch checklist:
  - Have all callers been enumerated?
  - Are versioning or shims provided?
  - Are golden I/O fixtures updated?
- invariant_touch checklist:
  - What was the old invariant? New one?
  - Which tests assert the new rule?
  - Rollback/feature flag path?
- hot_path checklist:
  - Estimate added calls/complexity.
  - Perf test or trace attached?
Harness enforces:
- Lane must be present; agent proposes, human can override.
- Required checklist boxes must be answered (by agent first, then edited by human).
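
Lane proposal and checklist enforcement can be sketched in a few lines. Everything specific here is an assumption made for illustration: the path patterns, the checklist keys, and the rule that the riskiest matching lane wins. Only the lane names and checklist questions come from the lists above.

```python
from fnmatch import fnmatch

# Hypothetical path patterns per lane; a real team would tune these.
LANE_PATTERNS = {
    "invariant_touch": ["domain/rules/*", "auth/*", "limits/*"],
    "contract_touch": ["cli/*", "schemas/*", "migrations/*", "api/*"],
    "hot_path": ["hot/*"],
    "verification_only": ["tests/*", "harness/*"],
}
# Assumed priority: when a PR matches several risk lanes, the first one listed wins.
RISK_LANES = ["invariant_touch", "contract_touch", "hot_path"]

# Required checklist keys per lane (key names are made up for this sketch).
REQUIRED_CHECKLIST = {
    "contract_touch": ["callers_enumerated", "versioning_or_shims", "golden_fixtures_updated"],
    "invariant_touch": ["old_vs_new_invariant", "asserting_tests", "rollback_path"],
    "hot_path": ["added_cost_estimate", "perf_test_or_trace"],
}


def touches(lane, path):
    """True if the path matches any pattern registered for the lane."""
    return any(fnmatch(path, pattern) for pattern in LANE_PATTERNS[lane])


def propose_lane(changed_paths):
    """Agent-side proposal; a human override would simply replace the result."""
    for lane in RISK_LANES:
        if any(touches(lane, path) for path in changed_paths):
            return lane
    # verification_only applies only when every changed path is test/harness code.
    if changed_paths and all(touches("verification_only", p) for p in changed_paths):
        return "verification_only"
    return "glue_low_risk"


def missing_answers(lane, answers):
    """List required checklist keys that are absent or blank for the lane."""
    return [key for key in REQUIRED_CHECKLIST.get(lane, []) if not answers.get(key, "").strip()]
```

A harness check would then fail the PR whenever `missing_answers(lane, answers)` is non-empty, whether the answers were drafted by the agent or edited by a human.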

Pre-review agents and diff ergonomics

Agent roles per PR:
- Summarizer: produce behavior-delta file.
- Contract mapper: list impacted commands/flags/APIs and call sites.
- Risk explainer: fill lane checklist with first pass answers.
CLI substrate hooks:
- review snapshot command runs flows before/after change, capturing:
  - CLI help output
  - sample command runs (inputs/outputs)
  - timing for tagged hot paths
- Agent uses snapshot to:
  - attach concrete before/after examples
  - flag non-obvious changes in defaults, logging, or error text
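
One way the snapshot step might work is to capture each sample command once on the base commit and once on the PR head, then diff the two captures. The sketch below assumes a snapshot is a dict mapping a label to a capture; `capture` and `diff_snapshots` are hypothetical helper names, not an existing tool.

```python
import difflib
import subprocess
import time


def capture(cmd, timeout=30):
    """Run one command (list of args) and record output, exit code, and wall time."""
    start = time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
        "seconds": time.perf_counter() - start,
    }


def diff_snapshots(before, after):
    """Compare two {label: capture} snapshots; return {label: [change lines]}."""
    changes = {}
    for label in sorted(set(before) | set(after)):
        old, new = before.get(label), after.get(label)
        if old is None or new is None:
            changes[label] = ["added" if old is None else "removed"]
            continue
        lines = []
        for key in ("stdout", "stderr"):
            lines += difflib.unified_diff(
                old[key].splitlines(), new[key].splitlines(),
                fromfile=f"before/{key}", tofile=f"after/{key}", lineterm="",
            )
        if old["exit_code"] != new["exit_code"]:
            lines.append(f"exit code: {old['exit_code']} -> {new['exit_code']}")
        if lines:
            changes[label] = lines
    return changes
```

Changed defaults, log lines, and error text show up as ordinary diff lines in the result. Timing is recorded in `seconds` but deliberately not compared here, since a single run is too noisy to judge a hot path; a real harness would likely repeat runs and compare medians.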

Guardrails so abstraction doesn’t hide edge cases

Guardrail 1: Summaries must be backed by artifacts
- Any bullet in behavior-delta must link to at least one of:
  - code hunk
  - test name
  - snapshot run
- Harness rejects behavior-delta sections with no backing links.
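
The rejection rule in Guardrail 1 is easy to mechanize once links have a fixed shape. This sketch assumes a made-up inline syntax, `[hunk:...]`, `[test:...]`, or `[snapshot:...]`, and assumes that an explicitly empty section is written as "- None"; neither convention comes from the source.

```python
import re

# Assumed link syntax inside a bullet: [hunk:path#L1-L9], [test:name], [snapshot:label].
BACKING_LINK = re.compile(r"\[(hunk|test|snapshot):[^\]\s]+\]")


def unbacked_bullets(markdown_text):
    """Return (line_number, text) for every bullet that has no backing link."""
    problems = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("- "):
            continue  # headings and blank lines carry no claims
        if stripped == "- None":
            continue  # an explicitly empty section needs no evidence
        if not BACKING_LINK.search(stripped):
            problems.append((number, stripped))
    return problems
```

The harness would fail the pre-review job whenever the returned list is non-empty, which forces the summarizer to either cite evidence or drop the claim.
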
Guardrail 2: Edge-case tripwires
- Maintain a short list of risk tags on files/dirs/tests:
  - money_path, auth_boundary, idempotency, gdpr, data_migration.
- If diff touches tagged areas:
  - require human to read those exact hunks (UI jumps there)
  - forbid “summary-only” approval; checkbox: “read all money_path hunks”.
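
A tripwire of this kind reduces to two questions: which changed hunks fall under a risk tag, and has a human acknowledged each of them. In the sketch below the tag names come from the list above, while the path patterns, the `(path, hunk_id)` representation, and the function names are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical tag map; the tags are from the guardrail, the paths are made up.
RISK_TAGS = {
    "money_path": ["billing/*", "payments/*"],
    "auth_boundary": ["auth/*"],
    "data_migration": ["migrations/*"],
}


def required_reads(changed_hunks):
    """Map each risk tag to the (path, hunk_id) pairs a human must open."""
    reads = {}
    for path, hunk_id in changed_hunks:
        for tag, patterns in RISK_TAGS.items():
            if any(fnmatch(path, pattern) for pattern in patterns):
                reads.setdefault(tag, []).append((path, hunk_id))
    return reads


def can_approve(changed_hunks, acknowledged):
    """Approval stays blocked until every tagged hunk appears in `acknowledged`."""
    needed = {hunk for hunks in required_reads(changed_hunks).values() for hunk in hunks}
    return needed <= set(acknowledged)
```

Tracking acknowledgement per hunk, rather than as one checkbox per PR, is a design choice that makes it harder to tick "read all money_path hunks" without opening them.
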
Guardrail 3: Negative diff views
- Auto-generate:
  - list of removed checks/guards/logs
  - removed or relaxed conditions around invariants
- Seniors must explicitly acknowledge each removed guard in high-risk lanes.
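
A negative diff view can start as a simple filter over the unified diff: keep removed lines that look like guards and that are not re-added verbatim elsewhere. The patterns below are heuristic assumptions, and `removed_guards` is a hypothetical helper; it needs Python 3.9 or later for `str.removeprefix`.

```python
import re

# Heuristic patterns for guard-like lines (an assumption; tune per codebase).
GUARD_PATTERN = re.compile(
    r"^\s*(if|elif|assert|raise)\b"   # conditionals and hard stops
    r"|\blog(ger|ging)?\.\w+\("       # logging calls
    r"|\bset -e\b|\|\| exit\b"        # common shell guards
)


def removed_guards(unified_diff):
    """Return (file, text) for removed guard-like lines that are not re-added.

    Simplification: any line starting with '--- ' is treated as a file header,
    so a removed line whose own text starts with '-- ' would be misread.
    """
    lines = unified_diff.splitlines()
    added = {l[1:].strip() for l in lines if l.startswith("+") and not l.startswith("+++ ")}
    found, current_file = [], None
    for line in lines:
        if line.startswith("--- "):
            current_file = line[4:].removeprefix("a/")
        elif line.startswith("-"):
            text = line[1:].strip()
            if text not in added and GUARD_PATTERN.search(text):
                found.append((current_file, text))
    return found
```

A relaxed condition shows up here too, because the old condition is removed and the new one differs textually. A guard that merely moved is filtered out, which keeps the acknowledgement list short enough to read.
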
Guardrail 4: Uncertainty surfacing
- Agent marks low-confidence areas:
  - heuristics: complex conditionals; large shell pipelines; regex-heavy scripts; cache or concurrency changes.
- These get a "read raw diff" badge; behavior abstraction is advisory only.
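
The low-confidence heuristics could begin as cheap line-level checks on added lines. The thresholds and keyword lists below are illustrative assumptions, not calibrated values, and `low_confidence_reasons` is a hypothetical name.

```python
import re

SINGLE_PIPE = re.compile(r"(?<!\|)\|(?!\|)")           # shell pipe, but not ||
BOOLEAN_OP = re.compile(r"\band\b|\bor\b|&&|\|\|")     # boolean connectives
REGEX_TOOL = re.compile(r"\bre\.(compile|search|match|sub)\b|\b(sed|awk|grep)\b")
CONCURRENCY = re.compile(r"\b(thread|lock|mutex|async|cache)\w*", re.IGNORECASE)


def low_confidence_reasons(added_lines):
    """Return sorted reasons why a summary of these lines should not be trusted."""
    reasons = set()
    for line in added_lines:
        if len(SINGLE_PIPE.findall(line)) >= 3:   # assumed threshold
            reasons.add("large shell pipeline")
        if len(BOOLEAN_OP.findall(line)) >= 3:    # assumed threshold
            reasons.add("complex conditional")
        if REGEX_TOOL.search(line):
            reasons.add("regex-heavy script")
        if CONCURRENCY.search(line):
            reasons.add("cache or concurrency change")
    return sorted(reasons)
```

Any non-empty result would attach the "read raw diff" badge to the affected hunk. The checks are intentionally over-eager, since a false badge costs a few minutes while a missed one defeats the guardrail.
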
Guardrail 5: Random raw-diff sampling
- For a fixed % of PRs per week (or per engineer), harness forces a conventional raw-diff review on a sampled slice, in order to:
  - detect systematic blind spots in summaries
  - keep reviewers calibrated
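
Sampling can be made deterministic so that the selection is reproducible and auditable after the fact. This is a sketch under assumptions: the 10% default rate, the salt, and the function name are all made up, and hashing is one possible technique rather than the method the source prescribes.

```python
import hashlib


def needs_raw_diff_review(pr_id, week, rate=0.10, salt="raw-diff-sample"):
    """Deterministically select roughly `rate` of PRs per week for raw-diff review."""
    digest = hashlib.sha256(f"{salt}:{week}:{pr_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate
```

Because the outcome depends only on the salt, week, and PR id, anyone who knows the salt can predict which PRs are sampled; keeping the salt in harness-only configuration avoids that.
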
Guardrail 6: Incident feedback loop
- When an incident is traced to a missed edge case:
  - note which lane, tags, and checklist were in play
  - add a tiny pattern rule (e.g. any rm -rf, DROP TABLE, eval usage → mandatory raw diff + second review)
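
Pattern rules of this kind can live in a small table that grows one entry per incident. The three rules below are the ones named in the example; the regular expressions themselves and the `escalations` helper are assumptions, and the `rm` pattern deliberately covers only the `-rf` and `-fr` spellings.

```python
import re

# (rule name, pattern). Append a new entry after each incident review.
PATTERN_RULES = [
    ("rm -rf", re.compile(r"\brm\s+-(rf|fr)\b")),
    ("DROP TABLE", re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE)),
    ("eval", re.compile(r"\beval\b")),
]


def escalations(added_lines):
    """Return (rule_name, line_number) for each added line that trips a rule."""
    hits = []
    for number, line in enumerate(added_lines, start=1):
        for name, pattern in PATTERN_RULES:
            if pattern.search(line):
                hits.append((name, number))
    return hits
```

Any hit would switch the PR to mandatory raw diff plus second review, independent of the lane the agent proposed.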

How this changes senior work

Seniors mainly:
- read behavior-delta files
- inspect tagged risk hunks and removed guards
- tighten or relax lane rules and checklists over time
They rarely:
- scan whole pipelines for style
- hand-construct behavior summaries

This keeps the review bottleneck centered on behavior and contracts, while the harness and agents tie every summary back to raw diffs, tests, and risk tags so that edge cases stay visible where they matter most.