For long-running agents that operate as sequences of short-lived, role-specialized agents passing only explicit artifacts, which minimal artifact fields and invariants (e.g., environment hash, dependency lockfile, input–output schema, test snapshot IDs) most reduce long-horizon silent errors and replay failures per unit of human review, compared with today’s heavier-weight notebooks and manifests, and how do these minimal bundles change across simulation, data-analysis, and code-refactor workflows?
anthropic-scientific-computing
Answer
Use small, standardized bundles organized around (1) env+deps, (2) interface+data shape, (3) run intent+inputs, and (4) verification links. Heavier notebooks and manifests add clutter; these core fields plus a few workflow-specific extras usually deliver most of the benefit per unit of review.
- Cross-workflow “minimal bundle” (for all roles)
- Required fields
- env_fingerprint: hash of runtime image + key configs (OS, language, BLAS/MPI, GPU/CPU flags).
- dep_lock: dependency list with pinned versions and build options.
- io_contract: machine-readable input/output schema (+ units, dtypes, array shapes, coordinate systems).
- run_intent: short text spec (task, assumptions, key parameters) + version id.
- run_inputs_ref: content hash / IDs of primary inputs (data snapshot, initial conditions, upstream artifact IDs).
- seed_and_nondet: PRNG seeds + flags for nondeterminism (GPU nondet, parallel order, approximate kernels).
- test_refs: IDs of tests/golden cases invoked + their result hashes.
- metrics_summary: small vector of key diagnostics (loss, conservation error, basic sanity checks).
- provenance_links: upstream artifact IDs and claim/result IDs this step depends on.
- Core invariants
- Reproducibility: (env_fingerprint, dep_lock, run_inputs_ref, seed_and_nondet) are sufficient to replay within tolerance.
- Interface stability: io_contract must be backward-compatible across handoffs unless its version changes.
- Traceability: every derived artifact has a pointer back to run_intent and provenance_links.
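As a sketch of how small the cross-workflow bundle can be, here is one possible shape in Python. `ArtifactBundle` and `replay_key` are hypothetical names, not a fixed API; hashing only four fields simply encodes the reproducibility invariant above, which says those fields alone should determine a replay.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class ArtifactBundle:
    """Cross-workflow minimal bundle (field names from the list above)."""
    env_fingerprint: str    # hash of runtime image + key configs
    dep_lock: str           # pinned dependency lockfile (or its hash)
    io_contract: dict       # machine-readable input/output schema
    run_intent: str         # short task spec + version id
    run_inputs_ref: str     # content hash of primary inputs
    seed_and_nondet: dict   # PRNG seeds + nondeterminism flags
    test_refs: list         # test/golden-case IDs + result hashes
    metrics_summary: dict   # small vector of key diagnostics
    provenance_links: list  # upstream artifact/claim IDs

def replay_key(b: ArtifactBundle) -> str:
    """Reproducibility invariant: these four fields alone should
    determine a replay (within tolerance), so hash only them."""
    payload = json.dumps(
        {"env": b.env_fingerprint, "deps": b.dep_lock,
         "inputs": b.run_inputs_ref, "seeds": b.seed_and_nondet},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

A downstream role can compare `replay_key` values before and after a rerun: any mismatch flags config or input drift without the reviewer opening the full artifact.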
- Simulation workflows: minimal extras
- Add fields
- model_version: id for equations/model code + discretization scheme.
- grid_and_solver: resolution, time step, solver family, tolerance.
- conservation_checks: per-step or final invariants (mass/energy, positivity flags) with pass/fail.
- Key invariants
- Physics/constraints: conservation_checks must stay within configured bounds.
- Stability: changes to grid_and_solver must trigger stricter test_refs, and human review when tolerances loosen.
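A conservation_checks gate can be a few lines. The helper below is an illustrative sketch (`check_conservation` and its bound names are assumptions, not a fixed API); it returns pass/fail plus the violating quantities so the reviewer sees exactly which invariant broke.

```python
def check_conservation(conservation_checks: dict, bounds: dict):
    """Physics invariant: each tracked error (e.g. mass or energy
    drift) must stay within its configured bound. A quantity with no
    configured bound defaults to 0.0, i.e. any nonzero error fails
    closed rather than passing silently."""
    failures = {name: err for name, err in conservation_checks.items()
                if abs(err) > bounds.get(name, 0.0)}
    return (not failures, failures)
```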
- Data-analysis workflows: minimal extras
- Add fields
- dataset_version: id/hash of dataset slice, filters, and inclusion criteria.
- cohort_spec: structured definition of population/selection.
- analysis_plan: brief, structured steps (preprocess, model, metrics) with version.
- Key invariants
- Cohort consistency: dataset_version and cohort_spec must match; any change bumps analysis_plan version.
- Schema lock: io_contract and dataset_version must match a known schema version or block handoff.
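Both data-analysis invariants reduce to cheap equality checks. The sketch below assumes a hypothetical schema registry (`KNOWN_SCHEMAS`) and integer analysis_plan versions; the names are illustrative, not a prescribed interface.

```python
# Hypothetical registry mapping dataset versions to approved schema versions.
KNOWN_SCHEMAS = {"cohort-2024-06": "schema-v3"}

def schema_lock(dataset_version: str, io_contract_version: str) -> bool:
    """Schema lock: block handoff unless the dataset slice matches a
    known schema version."""
    return KNOWN_SCHEMAS.get(dataset_version) == io_contract_version

def cohort_consistent(prev: dict, curr: dict) -> bool:
    """Cohort consistency: changing the dataset slice or cohort spec
    must bump the analysis_plan version."""
    changed = (prev["dataset_version"] != curr["dataset_version"]
               or prev["cohort_spec"] != curr["cohort_spec"])
    return (not changed) or (
        curr["analysis_plan_version"] > prev["analysis_plan_version"])
```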
- Code-refactor workflows: minimal extras
- Add fields
- contract_map: list of public APIs, schemas, and invariants with versions.
- change_summary: small diff summary focused on contract regions vs internals.
- risk_score: simple scalar from structural metrics (e.g., contract_touch_fraction, spec_change_rate, consistency_drop).
- Key invariants
- Contract preservation: public portions of contract_map may change only with explicit version bumps and matching test_refs.
- Risk-gated handoff: if risk_score or contract_touch_fraction exceeds its threshold, human review is required before downstream roles use the artifact.
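The risk-gated handoff is a simple threshold check. In the sketch below, `needs_human_review` is a hypothetical helper name and the default thresholds (0.5 and 0.2) are placeholder values a team would tune.

```python
def needs_human_review(risk_score: float, contract_touch_fraction: float,
                       max_risk: float = 0.5, max_touch: float = 0.2) -> bool:
    """Risk-gated handoff: True means a human must sign off before
    downstream roles may consume this artifact. Either signal alone
    is enough to trigger review."""
    return risk_score > max_risk or contract_touch_fraction > max_touch
```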
- Why these help per unit human review
- Reviewers focus on a compact, stable header instead of full notebooks/manifests.
- They can quickly answer: "Can I replay this?", "Did interfaces or cohorts change?", "Were key tests run?", "Did physics/cohort contracts hold?" without reading all code.
- Most long-horizon silent errors (config drift, schema drift, hidden env changes, wrong cohort, broken contracts) show up as violations or inconsistencies in these fields.
- How bundles differ by workflow
- Shared core: env_fingerprint, dep_lock, io_contract, run_intent, run_inputs_ref, seed_and_nondet, test_refs, metrics_summary, provenance_links.
- Simulation: emphasize numerical/physical invariants (model_version, grid_and_solver, conservation_checks).
- Data-analysis: emphasize data/cohort invariants (dataset_version, cohort_spec, analysis_plan).
- Code-refactor: emphasize interface and structural risk (contract_map, change_summary, risk_score) plus strong test_refs.
- Minimality guidance
- Default to the cross-workflow core + 2–3 workflow-specific fields.
- Only add more when a new silent-error class appears that is not captured as a violation of existing fields/invariants.