When long-running agents manage scientific coding tasks that repeatedly reuse and update shared components (e.g., simulators, ETL pipelines, common analysis libraries) across many workflows, which forms of cross-workflow verification—such as regression suites on a lab-scale provenance graph, shadow replays of past runs under new code, or automatic re-estimation of key cross-workflow scientific claims—most effectively prevent silent errors from propagating through these shared components over weeks of agent activity, per unit of human review time?

anthropic-scientific-computing

Answer

The most value per unit of human review time comes from three layered mechanisms, each focused on high‑impact shared components and cross‑workflow claims:

  1. Dependency‑aware regression suites on the lab provenance graph
  • Maintain a small, curated set of “sentinel workflows” that heavily exercise shared simulators/ETL/libs.
  • Use the provenance graph to auto‑select which sentinels to rerun when a shared component changes (or its env/dep hash changes).
  • Runs are fully automated; humans review only deltas on a few key metrics/claims.
  • Best default: continuous, low human time, catches many propagation bugs early.
  2. Shadow replays of past runs under new code (selective, graph‑routed)
  • When a shared component changes, re‑execute a small sample of historically important downstream runs (high reuse, high impact, or known fragile) with the new code in a sandbox.
  • Compare outputs on a compact metric vector and key derived claims; flag large or pattern‑shift changes.
  • Human time is spent only on flagged diffs, not on replay setup.
  • Best for catching regression in realistic, end‑to‑end conditions.
  3. Automatic re‑estimation of key cross‑workflow scientific claims
  • Maintain a registry of cross‑workflow scientific claims (constants, effect sizes, calibrations) with links to supporting workflows.
  • When any upstream shared component or dataset version changes, auto‑recompute just the claim‑supporting steps or lightweight surrogates.
  • Monitor for claim drift beyond pre‑set tolerances; route those few cases to human review.
  • Best for catching subtle, scientifically meaningful shifts rather than pure numeric diffs.
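As a concrete illustration of mechanism (1), graph‑routed sentinel selection can be sketched in a few lines. The adjacency mapping, workflow names, and sentinel set below are all hypothetical; a real lab provenance graph would be queried the same way.

```python
# Sketch of graph-routed sentinel selection. The provenance graph is
# modeled as a dict mapping each node to its direct downstream nodes.
from collections import deque

def sentinels_to_rerun(provenance, sentinels, changed_component):
    """Return the sentinel workflows downstream of a changed component.

    provenance: dict mapping node -> set of direct downstream nodes
    sentinels:  set of curated sentinel workflow names
    """
    affected, queue = set(), deque([changed_component])
    while queue:
        node = queue.popleft()
        for downstream in provenance.get(node, ()):
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected & sentinels

# Illustrative graph: changing the ETL component should trigger only
# the sentinels that transitively depend on it.
provenance = {
    "simulator": {"wf_calibration"},
    "etl": {"wf_ingest"},
    "wf_ingest": {"wf_analysis"},
}
sentinels = {"wf_calibration", "wf_analysis"}
print(sorted(sentinels_to_rerun(provenance, sentinels, "etl")))
# → ['wf_analysis']
```

The same traversal keys off environment/dependency hash changes: any event that marks a node dirty feeds `changed_component`.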
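The diff step of mechanism (2), comparing a shadow replay's compact metric vector against the historical run, might look like the following sketch. Metric names and tolerances are illustrative, not from any real pipeline:

```python
# Flag metrics whose relative drift between the historical baseline and
# the shadow replay exceeds a per-metric tolerance.

def flag_drift(baseline, replay, tolerances, default_tol=0.01):
    """Return {metric: relative_change} for metrics exceeding tolerance."""
    flagged = {}
    for metric, old in baseline.items():
        new = replay.get(metric)
        if new is None:
            flagged[metric] = float("nan")  # metric vanished: always flag
            continue
        denom = abs(old) if old != 0 else 1.0
        rel = abs(new - old) / denom
        if rel > tolerances.get(metric, default_tol):
            flagged[metric] = rel
    return flagged

baseline = {"rmse": 0.104, "coverage": 0.95, "runtime_s": 812.0}
replay   = {"rmse": 0.131, "coverage": 0.95, "runtime_s": 840.0}
# Runtime is allowed to move more than scientific metrics.
print(flag_drift(baseline, replay, {"runtime_s": 0.10}))
```

Only the flagged dict reaches a human; unflagged replays are archived silently, which is what keeps the review cost low.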
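Mechanism (3) hinges on a claim registry with tolerance bands. A minimal sketch, in which the `Claim` record, its fields, and the `recompute` hook are assumptions rather than any real system's API:

```python
# Each cross-workflow claim records its last accepted value, an absolute
# tolerance band, and the upstream components it depends on. `recompute`
# stands in for rerunning the claim-supporting steps (or a surrogate).
from dataclasses import dataclass, field

@dataclass
class Claim:
    name: str
    value: float
    tolerance: float                      # half-width of acceptance band
    upstream: set = field(default_factory=set)

def claims_needing_review(claims, changed_component, recompute):
    """Re-estimate claims downstream of a change; return those outside band."""
    review = []
    for claim in claims:
        if changed_component not in claim.upstream:
            continue
        new_value = recompute(claim)
        if abs(new_value - claim.value) > claim.tolerance:
            review.append((claim.name, claim.value, new_value))
    return review

claims = [
    Claim("decay_constant", 1.32e-4, 5e-6, {"simulator"}),
    Claim("effect_size", 0.42, 0.02, {"etl"}),
]
# Pretend a simulator change shifted the constant beyond its band.
fake_recompute = lambda c: {"decay_constant": 1.41e-4, "effect_size": 0.42}[c.name]
print(claims_needing_review(claims, "simulator", fake_recompute))
```

Only claims that both depend on the changed component and drift outside their band are routed to a human, matching the "few cases" goal above.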

Relative effectiveness per unit of human time

  • Highest: (1) graph‑routed regression suites + (3) claim re‑estimation, because they are highly automatable and give small, focused diff surfaces for humans.
  • Next: (2) targeted shadow replays, especially for high‑impact workflows; they are more compute‑heavy but human‑efficient when diffs are summarized well.
  • Less efficient: ad‑hoc manual spot checks or full re‑audits of entire workflows after each change; they do not scale with weeks of agent activity.

Operational sketch

  • Use the lab‑scale provenance graph to:
    • rank components and workflows by downstream centrality/reuse;
    • auto‑trigger the above three checks when shared components or claims change;
    • limit human review to: (a) inconsistent regression results, (b) large shadow‑replay drifts, (c) claim re‑estimates outside bands.
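The centrality/reuse ranking in the first bullet can be approximated by counting distinct downstream nodes per component; the graph shape below is illustrative:

```python
# Rank shared components by how many distinct nodes sit downstream of
# them in the provenance graph (a simple reuse/centrality proxy).
from collections import deque

def downstream_reach(provenance, node):
    """Count distinct nodes reachable from `node`."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in provenance.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

provenance = {
    "common_lib": {"etl", "simulator"},
    "etl": {"wf_ingest"},
    "simulator": {"wf_calibration"},
    "wf_ingest": {"wf_analysis"},
}
components = ["common_lib", "etl", "simulator"]
ranking = sorted(components, key=lambda c: downstream_reach(provenance, c),
                 reverse=True)
print(ranking)
# → ['common_lib', 'etl', 'simulator']
```

Components at the top of this ranking get the densest sentinel coverage and the tightest claim tolerances; leaf components can be checked more sparsely.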

In combination, these three cross‑workflow mechanisms minimize silent propagation through shared components while keeping human review focused and sparse.