When long-running agents autonomously refactor and extend shared scientific codebases over weeks (e.g., simulators, analysis libraries, ETL pipelines), which granularity of checkpointed intent and constraints on code changes—such as function-level behavioral contracts, module-level invariants, or workflow-level “allowed change surfaces”—most reduce cross-workflow silent regressions per unit human review, and how do these levels interact with existing provenance-graph regression suites and shadow replays?
anthropic-scientific-computing
Answer
The best trust per unit of human review comes from combining:
- function-level contracts for high-risk utilities;
- module-level invariants for shared components;
- workflow-level allowed-change surfaces for integration and scope.
Relative value per human review
Function-level contracts
- Most useful on: numerics kernels, parsers, low-level ETL transforms.
- Form: pre/post-conditions, property tests, simple metamorphic checks.
- Effect: cheap to review once; then many future refactors are self-checked.
- Limitation: local; misses cross-function coupling and topology changes.
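A minimal sketch of what a function-level contract plus a metamorphic check can look like. Everything here is hypothetical: the `contract` decorator is illustrative (not a real library), and `stable_mean` stands in for a fragile numerics kernel.

```python
import math
import random

def contract(pre, post):
    """Minimal pre/post-condition decorator (illustrative, not a real library)."""
    def wrap(fn):
        def inner(*args, **kwargs):
            assert pre(*args, **kwargs), f"precondition violated in {fn.__name__}"
            result = fn(*args, **kwargs)
            assert post(result, *args, **kwargs), f"postcondition violated in {fn.__name__}"
            return result
        return inner
    return wrap

@contract(
    pre=lambda xs: len(xs) > 0 and all(math.isfinite(x) for x in xs),
    post=lambda r, xs: min(xs) <= r <= max(xs),  # a mean must lie within the data range
)
def stable_mean(xs):
    """Numerically stable running mean (Welford-style update)."""
    m = 0.0
    for i, x in enumerate(xs, start=1):
        m += (x - m) / i
    return m

# Metamorphic check: shuffling the input must not change the result (within tolerance).
rng = random.Random(0)
data = [rng.uniform(-1e6, 1e6) for _ in range(1000)]
shuffled = data[:]
rng.shuffle(shuffled)
assert math.isclose(stable_mean(data), stable_mean(shuffled), rel_tol=1e-9, abs_tol=1e-3)
```

Once a human has reviewed the pre/post-conditions and the metamorphic property, agent refactors of the function body are self-checked against them.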
Module-level invariants
- Most useful on: shared simulators, model-building libs, complex ETL stages.
- Form: small set of end-to-end metrics or invariants per module (mass/energy conservation, monotonicity, schema + simple stats, calibration curves).
- Effect: good at catching cross-workflow regressions when modules change; pairs well with sentinel workflows.
- Limitation: higher design/review cost; can be too coarse for subtle bugs.
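A sketch of module-level invariants over a simulator trajectory, under assumed state keys `mass` and `time` (both hypothetical): conservation within tolerance, plus strict monotonicity.

```python
def check_module_invariants(states):
    """Assert a small set of module-level invariants over a simulator trajectory.

    `states` is a list of dicts with hypothetical keys 'mass' and 'time'.
    """
    total0 = states[0]["mass"]
    for s in states:
        # Invariant 1: mass is conserved within a fixed relative tolerance.
        assert abs(s["mass"] - total0) < 1e-9 * max(1.0, abs(total0)), "mass not conserved"
    # Invariant 2: simulation time is strictly monotonic.
    times = [s["time"] for s in states]
    assert all(a < b for a, b in zip(times, times[1:])), "time not monotonic"
    return True

# Example trajectory from a hypothetical simulator step:
traj = [{"mass": 10.0, "time": t * 0.1} for t in range(5)]
assert check_module_invariants(traj)
```

The same checker can run as an assertion inside sentinel workflows, so any agent change to the module is exercised end to end.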
Workflow-level allowed-change surfaces
- Most useful on: widely reused pipelines and analysis workflows.
- Form: explicit lists like “may change: speed, internal refactoring; must not change: API, units, default cohorts, key summary stats beyond tolerance X.”
- Effect: focuses reviews on interface and claim stability; easy to route into provenance-graph checks.
- Limitation: easy to underspecify; needs periodic renegotiation.
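One way such a surface can be made machine-checkable is a declarative spec of "must not change" metrics with drift tolerances. The workflow name, metric keys, and tolerances below are all hypothetical.

```python
# Hypothetical declarative allowed-change surface for one workflow.
SURFACE = {
    "workflow": "cohort_analysis_v3",            # hypothetical name
    "may_change": ["runtime", "internal_refactoring"],
    "must_not_change": {
        "api": None,                             # exact match required
        "units": None,
        "summary/auc": 0.002,                    # numeric drift tolerance
        "summary/n_subjects": 0,                 # zero tolerance
    },
}

def violations(surface, old_metrics, new_metrics):
    """Return the must-not-change metrics that drifted beyond tolerance."""
    out = []
    for key, tol in surface["must_not_change"].items():
        if key not in old_metrics:
            continue  # metric not tracked in the baseline run
        old, new = old_metrics[key], new_metrics.get(key)
        if tol is None or tol == 0:
            if new != old:
                out.append(key)
        elif new is None or abs(new - old) > tol:
            out.append(key)
    return out

# An AUC drift of 0.0025 exceeds the 0.002 tolerance and is flagged:
assert violations(SURFACE, {"summary/auc": 0.81}, {"summary/auc": 0.8125}) == ["summary/auc"]
```

Because the spec is data, a CI gate can evaluate it mechanically and route only flagged drifts to human review.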
Which granularity “wins” where
- Core shared libs with mature usage:
- Primary: module invariants + workflow-level allowed surfaces.
- Secondary: function contracts only on a small critical set.
- Fragile numerics or parsing/ETL code:
- Primary: function contracts + property tests.
- Support: thin module invariants (e.g., simple distribution/consistency checks).
- Top-level workflows and orchestration code:
- Primary: workflow-level allowed-change surfaces (claims, APIs, datasets).
- Support: link those surfaces to which modules/functions are allowed to change.
Interaction with provenance-graph regression suites and shadow replays
Provenance-routed regression suites
- Use workflow-level allowed-change surfaces to decide:
- which sentinel workflows to rerun;
- which metrics/claims are “must not drift” vs “allowed to move.”
- Use module invariants as assertions inside those sentinels.
- Use function contracts as local oracles to explain failing sentinels.
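The routing step above can be sketched with the provenance graph as a plain adjacency map from modules to the sentinel workflows that consume them; all module and sentinel names below are hypothetical.

```python
# Hypothetical provenance graph: which sentinel workflows consume which modules.
PROVENANCE = {
    "etl.parsers": ["sentinel_ingest", "sentinel_cohort"],
    "sim.kernel": ["sentinel_convergence"],
    "stats.summary": ["sentinel_cohort", "sentinel_report"],
}

def sentinels_to_rerun(changed_modules, provenance=PROVENANCE):
    """Route a code change to the sentinel workflows whose provenance touches it."""
    hit = set()
    for mod in changed_modules:
        hit.update(provenance.get(mod, []))
    return sorted(hit)

# A change to the parsers and summary stats reruns three sentinels, not all of them:
assert sentinels_to_rerun(["etl.parsers", "stats.summary"]) == [
    "sentinel_cohort", "sentinel_ingest", "sentinel_report",
]
```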
Shadow replays
- Triggered when code changes outside an allowed-change surface, or when module invariants drift.
- For each shadow replay, compare only the metrics marked stable in the workflow’s allowed-change surface.
- When diffs appear, drill down via module invariants and function contracts to localize.
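A sketch of that comparison step, assuming baseline and shadow runs each emit a flat dict of metrics; only metrics listed as stable in the surface are diffed, each against its own tolerance.

```python
def shadow_replay_diff(stable_metrics, baseline, shadow, tolerances):
    """Compare only the metrics marked stable in the workflow's allowed-change
    surface; every other metric is free to move."""
    drifted = {}
    for name in stable_metrics:
        tol = tolerances.get(name, 0.0)
        delta = abs(shadow[name] - baseline[name])
        if delta > tol:
            drifted[name] = delta  # report the drift magnitude for triage
    return drifted

# Runtime improved and AUC drifted; only the stable metric (AUC) is reported:
drift = shadow_replay_diff(
    stable_metrics=["auc"],
    baseline={"auc": 0.81, "runtime_s": 120.0},
    shadow={"auc": 0.83, "runtime_s": 95.0},
    tolerances={"auc": 0.01},
)
assert list(drift) == ["auc"]
```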
Practical pattern (per unit of human review)
- Define once, then reuse:
- For each key module: 3–10 invariants.
- For each high-impact workflow: a short allowed-change surface.
- For each fragile function: 2–5 property-based contracts.
- Let agents:
- maintain tests/contracts;
- auto-update provenance links;
- propose changes to invariants/surfaces, but require human sign-off.
- Human review focuses on:
- new or edited module invariants;
- changes to workflow allowed-change surfaces;
- rare, high-impact function-contract edits.
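The sign-off boundary above can be enforced mechanically: a sketch of a review-routing policy, with hypothetical edit "kinds", that auto-merges routine agent maintenance but queues invariant/surface/contract edits for a human.

```python
# Hypothetical review-routing policy: which agent-proposed edit kinds need sign-off.
NEEDS_SIGNOFF = ("module_invariant", "allowed_change_surface", "function_contract")

def route_for_review(proposed_edits):
    """Split agent-proposed edits into auto-mergeable vs. human-review queues."""
    auto, review = [], []
    for edit in proposed_edits:
        (review if edit["kind"] in NEEDS_SIGNOFF else auto).append(edit)
    return auto, review

# Routine test maintenance merges automatically; an invariant edit waits for a human:
auto, review = route_for_review([
    {"kind": "test_update", "path": "tests/test_parsers.py"},
    {"kind": "module_invariant", "path": "invariants/sim.yaml"},
])
assert len(auto) == 1 and len(review) == 1
```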
This layering gives the largest reduction in cross-workflow silent regressions for a given review budget: local contracts catch routine refactor bugs, module invariants catch shared-component regressions, and workflow-level surfaces align both with provenance-routed regression suites and shadow replays.