In scientific computing workflows decomposed into short-lived, role-specialized agents that pass only explicit artifacts, what measurable properties of those artifacts (e.g., degree of schema change, novelty of assumptions in claim artifacts, or divergence from historical run-manifest patterns) best predict which specific handoff checkpoints should receive expensive self-adversarial verification to minimize long-horizon silent errors per unit additional compute or human review?
anthropic-scientific-computing
Answer
The most useful predictors are change- and reuse-weighted: prioritize self-adversarial verification at handoffs where artifacts show large structural change, high semantic or assumption novelty, high downstream reuse centrality, and anomalous divergence from historical patterns, while de-prioritizing small, local, low-impact changes. A simple weighted risk score over these features can route expensive checks to a small fraction of high-risk handoffs.
Concrete signals (all measurable at handoff time)
- Structural-change signals (code/config/data)
- Schema/API delta size: number and type of added/removed/changed fields, function signatures, or config keys.
- Contract-touch fraction: fraction of edits that touch contract-governed or previously brittle components.
- Control-/data-flow expansion: new external calls, new side-effectful operations, new fan-out of dependencies.
- Claim- and assumption-level signals
- Assumption novelty: count of new or significantly changed assumptions in claim artifacts vs prior versions.
- Scope expansion: whether the claim now purports to cover new regimes, datasets, or parameter ranges.
- Cross-claim tension: degree of logical or numerical tension with existing cross-workflow scientific claims.
- Historical and behavioral anomaly signals
- Deviation from run-manifest patterns: how unusual this combination of parameters, datasets, and modules is relative to past successful runs.
- Test/consistency drift: small but systematic drops in existing test or cross-artifact consistency metrics.
- Resource-use anomaly: unexpected spikes in runtime, memory, or I/O relative to similar changes.
- Impact and centrality signals
- Downstream dependency count: how many later agents or workflows consume this artifact.
- Role criticality: whether the artifact defines shared libraries, ETL specs, simulators, or cross-workflow claims.
- Reuse span: how many distinct workflows or experiments reference this artifact class.
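The signals above can be gathered into a per-handoff feature vector. A minimal sketch, assuming each signal has already been normalized to [0, 1] (all field names and the example values are hypothetical, not part of any specific pipeline):

```python
from dataclasses import dataclass

@dataclass
class HandoffFeatures:
    """Per-handoff signals, each normalized to [0, 1].

    Field names are illustrative; a real pipeline would derive them
    from diffs, claim artifacts, run manifests, and dependency graphs.
    """
    structural_change: float       # schema/API delta size, contract-touch fraction
    assumption_novelty: float      # new or changed assumptions in claim artifacts
    historical_anomaly: float      # divergence from past run-manifest patterns
    dependency_centrality: float   # downstream consumers / reuse span
    recent_failure_history: float  # recent local failures in this artifact class

def clip01(x: float) -> float:
    """Clamp a raw signal into [0, 1] so weights stay comparable."""
    return max(0.0, min(1.0, x))

# Hypothetical handoff: a shared-schema change that adds a new assumption.
features = HandoffFeatures(
    structural_change=clip01(0.8),
    assumption_novelty=clip01(0.6),
    historical_anomaly=clip01(0.3),
    dependency_centrality=clip01(0.9),
    recent_failure_history=clip01(0.1),
)
```

Normalizing every signal to the same range is what lets a single set of weights trade them off in the routing score below.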
Routing rule (sketch)
- Compute a simple risk score per handoff, e.g. R = w1·(structural change size) + w2·(assumption novelty) + w3·(historical anomaly) + w4·(dependency centrality) + w5·(recent local failure history).
- Route expensive self-adversarial verification only when R exceeds a calibrated threshold, plus a thin baseline of random or time-based checks to catch low-signal drifts.
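The routing rule above can be sketched as a weighted sum plus a thin random audit. The weights, threshold, and baseline rate here are placeholders that would need calibration against historical handoff outcomes:

```python
import random

def risk_score(f: dict, weights=(0.3, 0.25, 0.2, 0.15, 0.1)) -> float:
    """R = w1*(structural change) + w2*(assumption novelty)
         + w3*(historical anomaly) + w4*(dependency centrality)
         + w5*(recent failure history), each signal in [0, 1].

    Weight values are illustrative, not calibrated."""
    signals = (
        f["structural_change"],
        f["assumption_novelty"],
        f["historical_anomaly"],
        f["dependency_centrality"],
        f["recent_failure_history"],
    )
    return sum(w * s for w, s in zip(weights, signals))

def should_verify(f: dict, threshold=0.5, baseline_rate=0.02,
                  rng=random.random) -> bool:
    """Route expensive self-adversarial verification when R exceeds the
    threshold, plus a thin random baseline to catch low-signal drift
    that does not perturb the scored features."""
    return risk_score(f) > threshold or rng() < baseline_rate
```

The random baseline matters: without it, errors that systematically avoid the scored features would never be audited, which is exactly the residual failure mode noted below.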
Effect on silent errors
- This concentrates self-adversarial effort on handoffs most likely to introduce systemic or widely propagated errors (shared code, shared claims, unusual configs), lowering long-horizon silent errors per unit compute/human review.
- Residual silent errors skew toward globally coherent modeling mistakes and very low-signal drifts that do not strongly perturb these features.