When long-running agents autonomously refactor and extend shared scientific codebases over weeks (e.g., simulators, analysis libraries, ETL pipelines), which granularity of checkpointed intent and constraints on code changes—such as function-level behavioral contracts, module-level invariants, or workflow-level “allowed change surfaces”—most reduce cross-workflow silent regressions per unit human review, and how do these levels interact with existing provenance-graph regression suites and shadow replays?
anthropic-scientific-computing
Answer
The best trust per unit of human review comes from combining:
- function-level contracts for high-risk utilities;
- module-level invariants for shared components;
- workflow-level allowed-change surfaces for integration and scope.
Relative value per human review
Function-level contracts
- Most useful on: numerics kernels, parsers, low-level ETL transforms.
- Form: pre/post-conditions, property tests, simple metamorphic checks.
- Effect: cheap to review once; then many future refactors are self-checked.
- Limitation: local; misses cross-function coupling and topology changes.
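A minimal sketch of what a function-level contract plus a metamorphic check can look like. Everything here is hypothetical: the `contract` decorator is illustrative (not a real library), and `stable_mean` stands in for a fragile numerics kernel.

```python
import math
import random

def contract(pre, post):
    """Minimal pre/post-condition decorator (illustrative, not a real library)."""
    def wrap(fn):
        def inner(*args, **kwargs):
            assert pre(*args, **kwargs), f"precondition violated in {fn.__name__}"
            result = fn(*args, **kwargs)
            assert post(result, *args, **kwargs), f"postcondition violated in {fn.__name__}"
            return result
        return inner
    return wrap

@contract(
    pre=lambda xs: len(xs) > 0 and all(math.isfinite(x) for x in xs),
    post=lambda r, xs: min(xs) <= r <= max(xs),  # a mean must lie within the data range
)
def stable_mean(xs):
    """Numerically stable running mean (Welford-style update)."""
    m = 0.0
    for i, x in enumerate(xs, start=1):
        m += (x - m) / i
    return m

# Metamorphic check: shuffling the input must not change the result (within tolerance).
rng = random.Random(0)
data = [rng.uniform(-1e6, 1e6) for _ in range(1000)]
shuffled = data[:]
rng.shuffle(shuffled)
assert math.isclose(stable_mean(data), stable_mean(shuffled), rel_tol=1e-9, abs_tol=1e-3)
```

Once a human has reviewed the pre/post-conditions and the metamorphic property, agent refactors of the function body are self-checked against them.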
Module-level invariants
- Most useful on: shared simulators, model-building libs, complex ETL stages.
- Form: small set of end-to-end metrics or invariants per module (mass/energy conservation, monotonicity, schema + simple stats, calibration curves).
- Effect: good at catching cross-workflow regressions when modules change; pairs well with sentinel workflows.
- Limitation: higher design/review cost; can be too coarse for subtle bugs.
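A sketch of module-level invariants over a simulator trajectory, under assumed state keys `mass` and `time` (both hypothetical): conservation within tolerance, plus strict monotonicity.

```python
def check_module_invariants(states):
    """Assert a small set of module-level invariants over a simulator trajectory.

    `states` is a list of dicts with hypothetical keys 'mass' and 'time'.
    """
    total0 = states[0]["mass"]
    for s in states:
        # Invariant 1: mass is conserved within a fixed relative tolerance.
        assert abs(s["mass"] - total0) < 1e-9 * max(1.0, abs(total0)), "mass not conserved"
    # Invariant 2: simulation time is strictly monotonic.
    times = [s["time"] for s in states]
    assert all(a < b for a, b in zip(times, times[1:])), "time not monotonic"
    return True

# Example trajectory from a hypothetical simulator step:
traj = [{"mass": 10.0, "time": t * 0.1} for t in range(5)]
assert check_module_invariants(traj)
```

The same checker can run as an assertion inside sentinel workflows, so any agent change to the module is exercised end to end.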
Workflow-level allowed-change surfaces
- Most useful on: widely reused pipelines and analysis workflows.
- Form: explicit lists like “may change: speed, internal refactoring; must not change: API, units, default cohorts, key summary stats beyond tolerance X.”
- Effect: focuses reviews on interface and claim stability; easy to route into provenance-graph checks.
- Limitation: easy to underspecify; needs periodic renegotiation.
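One way such a surface can be made machine-checkable is a declarative spec of "must not change" metrics with drift tolerances. The workflow name, metric keys, and tolerances below are all hypothetical.

```python
# Hypothetical declarative allowed-change surface for one workflow.
SURFACE = {
    "workflow": "cohort_analysis_v3",            # hypothetical name
    "may_change": ["runtime", "internal_refactoring"],
    "must_not_change": {
        "api": None,                             # exact match required
        "units": None,
        "summary/auc": 0.002,                    # numeric drift tolerance
        "summary/n_subjects": 0,                 # zero tolerance
    },
}

def violations(surface, old_metrics, new_metrics):
    """Return the must-not-change metrics that drifted beyond tolerance."""
    out = []
    for key, tol in surface["must_not_change"].items():
        if key not in old_metrics:
            continue  # metric not tracked in the baseline run
        old, new = old_metrics[key], new_metrics.get(key)
        if tol is None or tol == 0:
            if new != old:
                out.append(key)
        elif new is None or abs(new - old) > tol:
            out.append(key)
    return out

# An AUC drift of 0.0025 exceeds the 0.002 tolerance and is flagged:
assert violations(SURFACE, {"summary/auc": 0.81}, {"summary/auc": 0.8125}) == ["summary/auc"]
```

Because the spec is data, a CI gate can evaluate it mechanically and route only flagged drifts to human review.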
Which granularity “wins” where
- Core shared libs with mature usage:
- Primary: module invariants + workflow-level allowed surfaces.
- Secondary: function contracts only on a small critical set.
- Fragile numerics or parsing/ETL code:
- Primary: function contracts + property tests.
- Support: thin module invariants (e.g., simple distribution/consistency checks).
- Top-level workflows and orchestration code:
- Primary: workflow-level allowed-change surfaces (claims, APIs, datasets).
- Support: link those surfaces to which modules/functions are allowed to change.
Interaction with provenance-graph regression suites and shadow replays
Provenance-routed regression suites
- Use workflow-level allowed-change surfaces to decide:
- which sentinel workflows to rerun;
- which metrics/claims are “must not drift” vs “allowed to move.”
- Use module invariants as assertions inside those sentinels.
- Use function contracts as local oracles to explain failing sentinels.
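The routing step above can be sketched with the provenance graph as a plain adjacency map from modules to the sentinel workflows that consume them; all module and sentinel names below are hypothetical.

```python
# Hypothetical provenance graph: which sentinel workflows consume which modules.
PROVENANCE = {
    "etl.parsers": ["sentinel_ingest", "sentinel_cohort"],
    "sim.kernel": ["sentinel_convergence"],
    "stats.summary": ["sentinel_cohort", "sentinel_report"],
}

def sentinels_to_rerun(changed_modules, provenance=PROVENANCE):
    """Route a code change to the sentinel workflows whose provenance touches it."""
    hit = set()
    for mod in changed_modules:
        hit.update(provenance.get(mod, []))
    return sorted(hit)

# A change to the parsers and summary stats reruns three sentinels, not all of them:
assert sentinels_to_rerun(["etl.parsers", "stats.summary"]) == [
    "sentinel_cohort", "sentinel_ingest", "sentinel_report",
]
```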
Shadow replays
- Triggered when code changes outside an allowed-change surface, or when module invariants drift.
- For each shadow replay, compare only the metrics marked stable in the workflow’s allowed-change surface.
- When diffs appear, drill down via module invariants and function contracts to localize.
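A sketch of that comparison step, assuming baseline and shadow runs each emit a flat dict of metrics; only metrics listed as stable in the surface are diffed, each against its own tolerance.

```python
def shadow_replay_diff(stable_metrics, baseline, shadow, tolerances):
    """Compare only the metrics marked stable in the workflow's allowed-change
    surface; every other metric is free to move."""
    drifted = {}
    for name in stable_metrics:
        tol = tolerances.get(name, 0.0)
        delta = abs(shadow[name] - baseline[name])
        if delta > tol:
            drifted[name] = delta  # report the drift magnitude for triage
    return drifted

# Runtime improved and AUC drifted; only the stable metric (AUC) is reported:
drift = shadow_replay_diff(
    stable_metrics=["auc"],
    baseline={"auc": 0.81, "runtime_s": 120.0},
    shadow={"auc": 0.83, "runtime_s": 95.0},
    tolerances={"auc": 0.01},
)
assert list(drift) == ["auc"]
```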
Practical pattern (per unit of human review)
- Define once, then reuse:
- For each key module: 3–10 invariants.
- For each high-impact workflow: a short allowed-change surface.
- For each fragile function: 2–5 property-based contracts.
- Let agents:
- maintain tests/contracts;
- auto-update provenance links;
- propose changes to invariants/surfaces, but require human sign-off.
- Human review focuses on:
- new or edited module invariants;
- changes to workflow allowed-change surfaces;
- rare, high-impact function-contract edits.
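The sign-off boundary above can be enforced mechanically: a sketch of a review-routing policy, with hypothetical edit "kinds", that auto-merges routine agent maintenance but queues invariant/surface/contract edits for a human.

```python
# Hypothetical review-routing policy: which agent-proposed edit kinds need sign-off.
NEEDS_SIGNOFF = ("module_invariant", "allowed_change_surface", "function_contract")

def route_for_review(proposed_edits):
    """Split agent-proposed edits into auto-mergeable vs. human-review queues."""
    auto, review = [], []
    for edit in proposed_edits:
        (review if edit["kind"] in NEEDS_SIGNOFF else auto).append(edit)
    return auto, review

# Routine test maintenance merges automatically; an invariant edit waits for a human:
auto, review = route_for_review([
    {"kind": "test_update", "path": "tests/test_parsers.py"},
    {"kind": "module_invariant", "path": "invariants/sim.yaml"},
])
assert len(auto) == 1 and len(review) == 1
```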
This layering gives the largest reduction in cross-workflow silent regressions for a given review budget: local contracts catch routine refactor bugs, module invariants catch shared-component regressions, and workflow-level surfaces align both with provenance-routed regression suites and shadow replays.