When long-running agents autonomously refactor and extend shared scientific codebases over weeks (e.g., simulators, analysis libraries, ETL pipelines), which granularity of checkpointed intent and constraints on code changes—such as function-level behavioral contracts, module-level invariants, or workflow-level “allowed change surfaces”—most reduce cross-workflow silent regressions per unit human review, and how do these levels interact with existing provenance-graph regression suites and shadow replays?

anthropic-scientific-computing

Answer

The best regression protection per unit of human review comes from combining:

  • function-level contracts for high-risk utilities;
  • module-level invariants for shared components;
  • workflow-level allowed-change surfaces for integration and scope.
  1. Relative value per unit of human review
  • Function-level contracts

    • Most useful on: numerics kernels, parsers, low-level ETL transforms.
    • Form: pre/post-conditions, property tests, simple metamorphic checks.
    • Effect: cheap to review once; then many future refactors are self-checked.
    • Limitation: local; misses cross-function coupling and topology changes.
  • Module-level invariants

    • Most useful on: shared simulators, model-building libs, complex ETL stages.
    • Form: small set of end-to-end metrics or invariants per module (mass/energy conservation, monotonicity, schema + simple stats, calibration curves).
    • Effect: good at catching cross-workflow regressions when modules change; pairs well with sentinel workflows.
    • Limitation: higher design/review cost; can be too coarse for subtle bugs.
  • Workflow-level allowed-change surfaces

    • Most useful on: widely reused pipelines and analysis workflows.
    • Form: explicit lists like “may change: speed, internal refactoring; must not change: API, units, default cohorts, key summary stats beyond tolerance X.”
    • Effect: focuses reviews on interface and claim stability; easy to route into provenance-graph checks.
    • Limitation: easy to underspecify; needs periodic renegotiation.
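All three granularities can be kept as lightweight, checkable artifacts rather than prose. A minimal sketch of one of each, assuming illustrative names (`rebin_flux`, `check_conservation`, the surface fields and tolerances are made up for the example, not taken from any particular codebase):

```python
import math

# --- Function-level contract: pre/post-conditions on a numeric kernel ---
def rebin_flux(values, factor):
    """Illustrative kernel: sum-preserving rebinning of a 1-D series."""
    assert factor > 0 and len(values) % factor == 0              # pre-condition
    out = [sum(values[i:i + factor]) for i in range(0, len(values), factor)]
    assert math.isclose(sum(out), sum(values), rel_tol=1e-12)    # post: total preserved
    return out

# --- Module-level invariant: checked at the module boundary, not per function ---
def check_conservation(inputs_total, outputs_total, rel_tol=1e-9):
    """e.g. mass/energy conservation across a simulator or ETL stage."""
    if not math.isclose(inputs_total, outputs_total, rel_tol=rel_tol):
        raise AssertionError(f"conservation violated: {inputs_total} vs {outputs_total}")

# --- Workflow-level allowed-change surface: a declarative, reviewable record ---
SURFACE = {
    "workflow": "cohort_summary_v3",
    "may_change": ["runtime", "internal_refactoring"],
    # must-not-change entries: True = frozen outright, float = drift tolerance
    "must_not_change": {"api": True, "units": True, "mean_effect_size": 0.01},
}

def drift_allowed(metric, delta, surface=SURFACE):
    """Unlisted metrics may drift freely; listed ones only within tolerance."""
    tol = surface["must_not_change"].get(metric)
    return tol is None or (isinstance(tol, float) and abs(delta) <= tol)
```

The payoff is that each artifact is small enough to review once and then enforce mechanically on every subsequent agent-authored change.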
  2. Which granularity “wins” where
  • Core shared libs with mature usage:
    • Primary: module invariants + workflow-level allowed surfaces.
    • Secondary: function contracts only on a small critical set.
  • Fragile numerics or parsing/ETL code:
    • Primary: function contracts + property tests.
    • Support: thin module invariants (e.g., simple distribution/consistency checks).
  • Top-level workflows and orchestration code:
    • Primary: workflow-level allowed-change surfaces (claims, APIs, datasets).
    • Support: link those surfaces to which modules/functions are allowed to change.
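This mapping can itself live as data, so agents and reviewers consult the same routing table when deciding which checks a change must satisfy. A sketch under the assumption that code is tagged by category (the category and check names are illustrative):

```python
# Which checks apply to which kind of code; mirrors the guidance above.
CHECK_ROUTING = {
    "core_shared_lib":  {"primary": ["module_invariants", "allowed_surfaces"],
                         "secondary": ["function_contracts_critical_set"]},
    "fragile_numerics": {"primary": ["function_contracts", "property_tests"],
                         "secondary": ["thin_module_invariants"]},
    "orchestration":    {"primary": ["allowed_surfaces"],
                         "secondary": ["linked_change_permissions"]},
}

def required_checks(code_category):
    """All checks (primary first) that a change in this category must pass."""
    entry = CHECK_ROUTING[code_category]
    return entry["primary"] + entry["secondary"]
```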
  3. Interaction with provenance-graph regression and shadow replays
  • Provenance-routed regression suites

    • Use workflow-level allowed-change surfaces to decide:
      • which sentinel workflows to rerun;
      • which metrics/claims are “must not drift” vs “allowed to move.”
    • Use module invariants as assertions inside those sentinels.
    • Use function contracts as local oracles to explain failing sentinels.
  • Shadow replays

    • Triggered when code changes outside an allowed-change surface, or when module invariants drift.
    • For each shadow replay, compare only the metrics marked stable in the workflow’s allowed-change surface.
    • When diffs appear, drill down via module invariants and function contracts to localize.
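The routing step above can be sketched as a query over an in-memory provenance graph: changed modules select the sentinel workflows to rerun, and a shadow replay compares only the metrics each workflow's surface marks stable. All edges, metric names, and tolerances below are illustrative assumptions:

```python
# Provenance graph: module -> workflows that consume it.
PROVENANCE = {
    "sim_core": ["workflow_a", "workflow_b"],
    "etl_parse": ["workflow_b", "workflow_c"],
}

# Per-workflow stable metrics ("must not drift") with tolerances.
SURFACES = {
    "workflow_a": {"energy_total": 1e-9},
    "workflow_b": {"row_count": 0.0, "mean_value": 0.01},
    "workflow_c": {"schema_version": 0.0},
}

def sentinels_to_rerun(changed_modules):
    """Route a code change to the sentinel workflows it can affect."""
    hit = set()
    for mod in changed_modules:
        hit.update(PROVENANCE.get(mod, []))
    return sorted(hit)

def shadow_replay_diff(workflow, baseline, replay):
    """Compare only metrics marked stable in the workflow's surface."""
    violations = []
    for metric, tol in SURFACES[workflow].items():
        delta = abs(replay[metric] - baseline[metric])
        if delta > tol:
            violations.append((metric, delta))
    return violations  # empty list => no silent regression on stable metrics
```

A non-empty `violations` list is exactly the signal for drilling down via module invariants and function contracts to localize the regression.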
  4. Practical pattern (per unit human review)
  • Define once, then reuse:
    • For each key module: 3–10 invariants.
    • For each high-impact workflow: a short allowed-change surface.
    • For each fragile function: 2–5 property-based contracts.
  • Let agents:
    • maintain tests/contracts;
    • auto-update provenance links;
    • propose changes to invariants/surfaces, but require human sign-off.
  • Human review focuses on:
    • new or edited module invariants;
    • changes to workflow allowed-change surfaces;
    • rare, high-impact function-contract edits.
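The agent/human division of labor above can be enforced mechanically: agents may propose edits to invariants or surfaces, but nothing takes effect until a human sign-off is recorded. A minimal sketch (class and field names are hypothetical):

```python
class GovernedRegistry:
    """Agents propose; humans approve; only approved definitions are in force."""

    def __init__(self):
        self.active = {}    # name -> approved definition currently enforced
        self.pending = {}   # name -> (proposed definition, proposing agent)

    def propose(self, name, definition, agent):
        """Agents may stage a new or edited invariant/surface at any time."""
        self.pending[name] = (definition, agent)

    def approve(self, name, reviewer):
        """Human sign-off promotes a pending proposal into the active set."""
        definition, agent = self.pending.pop(name)
        self.active[name] = {"def": definition, "agent": agent, "approved_by": reviewer}

    def lookup(self, name):
        """Enforcement tooling reads only the approved definition."""
        return self.active[name]["def"]
```

Human review then consists of scanning the pending queue, which is precisely the short list of invariant, surface, and contract edits identified above.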

This layering gives the largest reduction in cross-workflow silent regressions for a given review budget: local contracts catch routine refactor bugs, module invariants catch shared-component regressions, and workflow-level surfaces align those checks with provenance-routed regression suites and shadow replays.