For long-running agents that operate as sequences of short-lived, role-specialized agents passing only explicit artifacts, which minimal artifact fields and invariants (e.g., environment hash, dependency lockfile, input–output schema, test snapshot IDs) most reduce long-horizon silent errors and replay failures per unit of human review, compared with today’s heavier-weight notebooks and manifests, and how do these minimal bundles change across simulation, data-analysis, and code-refactor workflows?

anthropic-scientific-computing

Answer

Use small, standardized bundles keyed around (1) env+deps, (2) interface+data shape, (3) run intent+inputs, and (4) verification links. Heavier notebooks/manifests add clutter; these core fields plus a few workflow-specific extras usually give most of the benefit per unit review.

  1. Cross-workflow “minimal bundle” (for all roles)
  • Required fields

    • env_fingerprint: hash of runtime image + key configs (OS, language, BLAS/MPI, GPU/CPU flags).
    • dep_lock: dependency list with pinned versions and build options.
    • io_contract: machine-readable input/output schema (+ units, dtypes, array shapes, coordinate systems).
    • run_intent: short text spec (task, assumptions, key parameters) + version id.
    • run_inputs_ref: content hash / IDs of primary inputs (data snapshot, initial conditions, upstream artifact IDs).
    • seed_and_nondet: PRNG seeds + flags for nondeterminism (GPU nondet, parallel order, approximate kernels).
    • test_refs: IDs of tests/golden cases invoked + their result hashes.
    • metrics_summary: small vector of key diagnostics (loss, conservation error, basic sanity checks).
    • provenance_links: upstream artifact IDs and claim/result IDs this step depends on.
  • Core invariants

    • Reproducibility: (env_fingerprint, dep_lock, run_inputs_ref, seed_and_nondet) are sufficient to replay within tolerance.
    • Interface stability: io_contract must be backward-compatible across handoffs unless its version changes.
    • Traceability: every derived artifact has a pointer back to run_intent and provenance_links.
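The cross-workflow core can be sketched as a small Python record with a derived replay key. This is a minimal sketch, not a prescribed schema: the class name, field types, and the `replay_key` helper are illustrative assumptions; only the field names come from the list above.

```python
from dataclasses import dataclass, field
import hashlib
import json

@dataclass(frozen=True)
class ArtifactBundle:
    """Illustrative cross-workflow minimal bundle (types are assumptions)."""
    env_fingerprint: str          # hash of runtime image + key configs
    dep_lock: str                 # pinned dependency lockfile (id or hash)
    io_contract: dict             # machine-readable I/O schema, versioned
    run_intent: str               # short text spec + version id
    run_inputs_ref: str           # content hash / IDs of primary inputs
    seed_and_nondet: dict         # PRNG seeds + nondeterminism flags
    test_refs: tuple = ()         # test/golden-case IDs + result hashes
    metrics_summary: dict = field(default_factory=dict)
    provenance_links: tuple = ()  # upstream artifact/claim IDs

    def replay_key(self) -> str:
        """Reproducibility invariant: these four fields alone must be
        sufficient to replay the step within tolerance, so they alone
        define the replay identity of the artifact."""
        payload = json.dumps(
            [self.env_fingerprint, self.dep_lock,
             self.run_inputs_ref, self.seed_and_nondet],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()
```

Two bundles that differ only in `metrics_summary` or `provenance_links` share a replay key; changing any of the four replay fields (e.g. the seed) changes it.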
  2. Simulation workflows: minimal extras
  • Add fields

    • model_version: id for equations/model code + discretization scheme.
    • grid_and_solver: resolution, time step, solver family, tolerance.
    • conservation_checks: per-step or final invariants (mass/energy, positivity flags) with pass/fail.
  • Key invariants

    • Physics/constraints: conservation_checks must stay within configured bounds.
    • Stability: changes to grid_and_solver must trigger stricter test_refs, and any loosening of tolerances requires human review.
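The physics/constraints invariant is a mechanical gate: each conservation check must stay within its configured bound, or the handoff fails. A minimal sketch, assuming hypothetical check names and bounds:

```python
def check_conservation(conservation_checks: dict, bounds: dict):
    """Gate the handoff on physical invariants (illustrative).

    conservation_checks maps a named invariant (e.g. mass or energy
    drift) to its measured value; bounds maps the same names to the
    configured tolerance. Any check without a bound is treated as
    zero-tolerance. Returns (ok, violations)."""
    violations = {
        name: value
        for name, value in conservation_checks.items()
        if abs(value) > bounds.get(name, 0.0)
    }
    return (len(violations) == 0, violations)
```

A downstream role would refuse the artifact (or escalate to human review) whenever `ok` is false, rather than silently propagating a drifting simulation.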
  3. Data-analysis workflows: minimal extras
  • Add fields

    • dataset_version: id/hash of dataset slice, filters, and inclusion criteria.
    • cohort_spec: structured definition of population/selection.
    • analysis_plan: brief, structured steps (preprocess, model, metrics) with version.
  • Key invariants

    • Cohort consistency: dataset_version and cohort_spec must match; any change bumps analysis_plan version.
    • Schema lock: io_contract and dataset_version must match a known schema version or block handoff.
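The schema-lock invariant reduces to a lookup: the dataset slice must map to a known schema version, and the io_contract must declare that same version, otherwise the handoff is blocked. A minimal sketch, where the registry and its keys are hypothetical:

```python
def schema_lock_ok(io_contract: dict, dataset_version: str,
                   schema_registry: dict) -> bool:
    """Schema-lock check (illustrative): block handoff unless the
    dataset slice resolves to a known schema version and the
    io_contract declares exactly that version."""
    expected = schema_registry.get(dataset_version)
    return (expected is not None
            and io_contract.get("schema_version") == expected)
```

An unknown `dataset_version` fails closed, which is what turns silent schema drift into a visible blocked handoff.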
  4. Code-refactor workflows: minimal extras
  • Add fields

    • contract_map: list of public APIs, schemas, and invariants with versions.
    • change_summary: small diff summary focused on contract regions vs internals.
    • risk_score: simple scalar from structural metrics (e.g., contract_touch_fraction, spec_change_rate, consistency_drop).
  • Key invariants

    • Contract preservation: public portions of contract_map may change only with an explicit version bump and matching test_refs.
    • Risk-gated handoff: if risk_score or contract_touch_fraction exceed thresholds, human review is required before downstream roles use the artifact.
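The risk-gated handoff is a threshold check: if either metric exceeds its limit, the artifact is held for human review instead of flowing to downstream roles. A minimal sketch with illustrative default thresholds (the 0.7 and 0.3 values are assumptions, not recommendations):

```python
def handoff_allowed(risk_score: float, contract_touch_fraction: float,
                    max_risk: float = 0.7, max_touch: float = 0.3) -> bool:
    """Risk-gated handoff (illustrative): permit automatic handoff only
    when both the scalar risk score and the fraction of the change that
    touches public contracts stay under their thresholds."""
    return risk_score <= max_risk and contract_touch_fraction <= max_touch
```

Internal-only refactors with low risk pass automatically; anything that meaningfully touches `contract_map` regions stops for review.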
  5. Why these help per unit human review
  • Reviewers focus on a compact, stable header instead of full notebooks/manifests.
  • They can quickly answer: "Can I replay this?", "Did interfaces or cohorts change?", "Were key tests run?", "Did physics/cohort contracts hold?" without reading all code.
  • Most long-horizon silent errors (config drift, schema drift, hidden env changes, wrong cohort, broken contracts) show up as violations or inconsistencies in these fields.
  6. How bundles differ by workflow
  • Shared core: env_fingerprint, dep_lock, io_contract, run_intent, run_inputs_ref, seed_and_nondet, test_refs, metrics_summary, provenance_links.
  • Simulation: emphasize numerical/physical invariants (model_version, grid_and_solver, conservation_checks).
  • Data-analysis: emphasize data/cohort invariants (dataset_version, cohort_spec, analysis_plan).
  • Code-refactor: emphasize interface and structural risk (contract_map, change_summary, risk_score) plus strong test_refs.
  7. Minimality guidance
  • Default to the cross-workflow core + 2–3 workflow-specific fields.
  • Only add more when a new silent-error class appears that is not captured as a violation of existing fields/invariants.