For long-running agents refactoring shared scientific libraries over days, how does adding lab-scale provenance graph–aware checkpointing—where each checkpoint also records and queries the local neighborhood of affected cross-workflow scientific claims and dependents—change the rate and localization of silent errors relative to artifact-local checkpoints only, and what minimal provenance features (e.g., number of affected claims, fan-out of dependents, history of past regressions on those nodes) are needed to route high-risk checkpoints to human review?

anthropic-scientific-computing

Answer

Provenance-aware checkpoints mainly rebalance errors: fewer long-lived, cross-workflow silent failures; more localized, earlier-detected issues around high-centrality nodes. A small set of simple graph features is likely enough to triage.

Effect vs artifact-local checkpoints only

  • Global rate: modest reduction in undetected silent errors that touch shared claims; little change for purely local bugs.
  • Localization: errors cluster around a smaller set of shared nodes (core claims and library APIs) and are flagged closer to their introduction.
  • Propagation: bad refactors to high-fan-out code/claims are more likely to trigger checkpoints + review before they spread across many workflows.

Minimal provenance features for triage

Track, per checkpoint, a few local graph statistics over the touched nodes (library functions, schemas, cross-workflow scientific claims):

  1. Affected-claim count
  • Number of distinct cross-workflow scientific claims reachable within k hops that depend on the edited artifacts.
  • Use a threshold: if above N_claims, escalate.
  2. Dependents fan-out
  • Max or sum of direct dependents (workflows / artifacts) of the touched nodes.
  • High fan-out ⇒ higher risk; combine with change size.
  3. Past regression history
  • Simple per-node score: count of past test failures, rollbacks, or human-rejected checkpoints involving that node.
  • Prior failures raise the risk tier even for small edits.
  4. Cross-claim inconsistency blips (optional but high value)
  • Recompute a small panel of key cross-workflow claims affected by the edit and compare them to their last accepted values.
  • Large deviations or increased disagreement between workflows push the checkpoint to mandatory review.
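A minimal sketch of how the first three features could be computed over a provenance graph. The adjacency-dict representation, the `claim:` node-naming convention, and the example graph are illustrative assumptions, not a prescribed schema:

```python
from collections import deque

def affected_claims(graph, edited, k):
    """Count distinct claim nodes reachable within k hops of edited artifacts.

    graph: dict mapping node -> list of direct dependents (provenance edges).
    edited: set of artifact nodes touched by the checkpoint.
    k: hop limit defining the local neighborhood.
    """
    seen = set(edited)
    frontier = deque((node, 0) for node in edited)
    claims = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # stop expanding beyond the k-hop neighborhood
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                if dep.startswith("claim:"):  # hypothetical naming convention
                    claims.add(dep)
                frontier.append((dep, depth + 1))
    return len(claims)

def max_fanout(graph, edited):
    """Largest number of direct dependents among touched nodes."""
    return max((len(graph.get(n, [])) for n in edited), default=0)

def max_regression_score(history, edited):
    """Worst past-regression count among touched nodes."""
    return max((history.get(n, 0) for n in edited), default=0)

# Toy example: one shared library function feeding two workflows,
# which in turn support two cross-workflow claims.
graph = {
    "lib:fit_model": ["wf:assay_A", "wf:assay_B"],
    "wf:assay_A": ["claim:binding_affinity"],
    "wf:assay_B": ["claim:binding_affinity", "claim:dose_response"],
}
history = {"lib:fit_model": 2}
edited = {"lib:fit_model"}

print(affected_claims(graph, edited, k=3))   # 2
print(max_fanout(graph, edited))             # 2
print(max_regression_score(history, edited)) # 2
```

The breadth-first walk keeps the computation local (bounded by k), which matters if the full lab-scale provenance graph is large.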

Simple triage rule sketch

  • Compute a risk score per checkpoint from:
    • R1: normalized change size near high-fan-out nodes.
    • R2: affected-claim count.
    • R3: max regression-history score among touched nodes.
    • R4: any cross-claim inconsistency.
  • Buckets:
    • Auto-continue: all metrics low.
    • Agent-only deep checks: moderate R1–R3, no R4.
    • Human review required: R4 true, or both R2 and R3 above tuned thresholds.
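The bucket rule above can be sketched as a small routing function. The threshold values and the exact low/moderate cutoffs are placeholders that a lab would tune against its own checkpoint history:

```python
def route_checkpoint(r1, r2, r3, r4, t1=0.5, t2=5, t3=3):
    """Route a checkpoint to a review tier from the four risk signals.

    r1: normalized change size near high-fan-out nodes (0..1).
    r2: affected-claim count.
    r3: max regression-history score among touched nodes.
    r4: True if any cross-claim inconsistency was detected.
    t1, t2, t3: illustrative thresholds, to be tuned per lab.
    """
    # Mandatory human review: inconsistency observed, or a change that is
    # both claim-heavy and historically regression-prone.
    if r4 or (r2 > t2 and r3 > t3):
        return "human-review"
    # Any nonzero signal short of that gets agent-only deep checks.
    if r1 > t1 or r2 > 0 or r3 > 0:
        return "agent-deep-checks"
    return "auto-continue"

print(route_checkpoint(0.1, 0, 0, False))  # auto-continue
print(route_checkpoint(0.6, 3, 1, False))  # agent-deep-checks
print(route_checkpoint(0.2, 8, 4, False))  # human-review
print(route_checkpoint(0.0, 0, 0, True))   # human-review
```

Keeping R4 as a hard escalation trigger reflects the point above: a recomputed claim drifting from its last accepted value is direct evidence of a cross-workflow regression, not just a risk proxy.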

Net effect

  • Compared to artifact-local checkpoints, provenance-aware ones:
    • Reduce undetected, lab-wide silent errors from refactors to shared code/claims.
    • Shift remaining errors toward low-fan-out, workflow-local changes.
    • Improve mapping between “where to look” and “what might be broken” for humans reviewing high-risk checkpoints.
  • Gains are largest when many workflows reuse a small shared library and a small set of cross-workflow scientific claims.