When long-running agents are allowed to perform multi-hour code refactors and simulation campaigns under principal-style contracts, what combinations of dynamic, risk-sensitive checkpointing (based on online risk signals) and compute-budget governance (start/stop/fork policies tied to those same signals) most reduce end-to-end silent-error rates per unit of human and compute budget, and how do these joint policies trade off between catching gradual, low-signal drifts and containing high-risk but rare catastrophic failures?
anthropic-scientific-computing
Answer
Best-effort answer:
- Most effective joint pattern (for multi-hour refactors/sim runs)
- Use a small set of online risk signals (e.g., contract-touch fraction, cross-artifact/claim inconsistencies, anomaly in resource use, recent test failures).
- Map these signals to three risk bands (low/medium/high), with hysteresis so the band assignment does not flap on noisy signals.
- For each band, couple:
- checkpoint density (when to verify, when to ask for human input), and
- compute actions (continue, throttle, fork, or halt).
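The signal-to-band mapping with hysteresis can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the signal names, weights, and thresholds are all placeholder assumptions to be tuned per deployment.

```python
from dataclasses import dataclass

# Hypothetical risk signals; names and weights are illustrative only.
@dataclass
class RiskSignals:
    contract_touch_fraction: float   # share of contract-covered code touched
    inconsistency_count: int         # cross-artifact/claim inconsistencies
    resource_anomaly: float          # z-score of resource-use anomaly
    recent_test_failures: int

def risk_score(s: RiskSignals) -> float:
    """Combine signals into a scalar in [0, 1]; weights are placeholders."""
    return min(1.0,
               0.4 * s.contract_touch_fraction
               + 0.2 * min(s.inconsistency_count, 5) / 5
               + 0.2 * min(abs(s.resource_anomaly), 3) / 3
               + 0.2 * min(s.recent_test_failures, 3) / 3)

# Hysteresis: entering a higher band uses the upper threshold;
# dropping back requires falling below the lower one.
ENTER = {"medium": 0.35, "high": 0.70}
EXIT  = {"medium": 0.25, "high": 0.55}

def next_band(current: str, score: float) -> str:
    if current == "low":
        if score >= ENTER["high"]:
            return "high"
        if score >= ENTER["medium"]:
            return "medium"
        return "low"
    if current == "medium":
        if score >= ENTER["high"]:
            return "high"
        if score < EXIT["medium"]:
            return "low"
        return "medium"
    # current == "high"
    if score < EXIT["medium"]:
        return "low"
    if score < EXIT["high"]:
        return "medium"
    return "high"
```

The gap between `ENTER` and `EXIT` thresholds is what prevents flapping: a score hovering near 0.35 cannot bounce the band between low and medium every step.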
- Example policy grid (sketch)
- Low risk
- Checkpoints: coarse, mostly automated tests + a few golden cases.
- Compute: full speed; no forking; minimal human review.
- Medium risk
- Checkpoints: denser; run contract/golden suites; occasional self-adversarial probes on touched modules.
- Compute: throttled; allow short forks for A/B comparisons; require human review for large API/schema or physics-model changes.
- High risk
- Checkpoints: immediate full contract/golden suite + targeted adversarial checks.
- Compute: pause main run; spawn small, capped forks to diagnose (rollback candidate, alt implementation, replay with more logging); resume only if forks agree and tests pass; otherwise require human decision.
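The grid above can be expressed as plain data so a scheduler can look up the coupled (checkpoint, compute) actions per band. All names and values here are illustrative assumptions, not a real scheduler API.

```python
# Policy grid as data: per band, which checks to run and how compute behaves.
# throttle is a fraction of full speed; 0.0 means the main run is paused.
POLICY_GRID = {
    "low": {
        "checkpoints": ["automated_tests", "golden_cases_sample"],
        "compute": {"throttle": 1.0, "fork": None, "human_review": "none"},
    },
    "medium": {
        "checkpoints": ["contract_suite", "golden_suite",
                        "self_adversarial_probes_touched_modules"],
        "compute": {"throttle": 0.5, "fork": "short_ab",
                    "human_review": "large_api_schema_or_physics_changes"},
    },
    "high": {
        "checkpoints": ["full_contract_suite", "full_golden_suite",
                        "targeted_adversarial_checks"],
        "compute": {"throttle": 0.0, "fork": "diagnostic_capped",
                    "human_review": "required_on_disagreement"},
    },
}

def actions_for(band: str) -> tuple:
    """Return (checkpoints_to_run, compute_actions) for a risk band."""
    entry = POLICY_GRID[band]
    return entry["checkpoints"], entry["compute"]
```

Keeping the grid as data (rather than branching logic) makes the band policy auditable and easy to tune during pilot runs.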
- Where this combo beats either mechanism alone
- Relative to dynamic checkpointing alone (cf. f7156ab6):
- Compute governance limits blast radius when risk spikes (no long, high-risk tail of silent errors).
- Forks let you use disagreement between branches as an extra error signal under the same total budget.
- Relative to compute-governance alone (cf. 8214a430):
- Risk-tuned checkpoints give the trust signals more bite: high-risk periods see stronger tests, not just slower or halted compute.
- Drift vs catastrophic failures (high-level tradeoff)
- To catch gradual, low-signal drift:
- Maintain a thin floor of fixed, time/step-based checkpoints (golden cases + cheap schema/API checks) even in low-risk band.
- Periodically sample extra checks on “boring” intervals, regardless of risk score.
- This spends some compute/human budget, but it shrinks the class of drifts that would otherwise never cross a dynamic threshold.
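The fixed floor plus random extra sampling can be sketched as two small predicates. The interval and probability are placeholder assumptions to be tuned against the drift rates one actually fears.

```python
import random

def baseline_due(step: int, floor_every: int = 500) -> bool:
    """Fixed floor: run cheap golden/schema checks every `floor_every` steps
    regardless of risk band, so slow drifts cannot hide below dynamic
    thresholds."""
    return step % floor_every == 0

def extra_probe_due(rng: random.Random, p: float = 0.02) -> bool:
    """Occasionally sample an extra check on "boring" intervals: probability
    p per step, independent of the risk score. p is a placeholder."""
    return rng.random() < p
```

Randomizing the extra probes matters: a drift that is invisible to the risk signals cannot also time itself to dodge checks whose schedule it cannot predict.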
- To contain rare catastrophic failures:
- Make high-risk transition thresholds conservative; when they trigger, couple:
- aggressive checkpointing (heavy tests, self-adversarial verification on changed hotspots), and
- strong compute actions (pause + fork under small caps).
- This raises the chance that catastrophic bugs either fail fast under tests or surface as fork disagreement before consuming much compute.
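The high-risk gate described above, pause then fork then resume-only-on-agreement, can be sketched as a single function. `run_fork` callables are hypothetical: each is assumed to return an output digest plus a tests-passed flag.

```python
from typing import Callable, List, Tuple

def high_risk_gate(forks: List[Callable[[], Tuple[str, bool]]]) -> str:
    """Run small, capped diagnostic forks; return 'resume' only if every
    fork produced the same output digest and passed its tests, otherwise
    'escalate' to a human decision."""
    results = [fork() for fork in forks]
    digests = {digest for digest, _ in results}
    all_pass = all(ok for _, ok in results)
    if len(digests) == 1 and all_pass:
        return "resume"      # forks agree and tests pass
    return "escalate"        # disagreement or failure: human decides
```

Note that fork disagreement here is itself an error signal, independent of any single test suite, which is what lets this gate catch bugs the checkpoints miss.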
- Simple practical recipe
- Fix: (a) a minimal baseline schedule, and (b) a 3-band risk→(checkpoint, compute) policy.
- Tune:
- Risk signals: start with contract-touch fraction, cross-artifact diff metrics, and recent failure history.
- Band thresholds: tune on pilot runs so that most time is spent in the low band, with only short bursts in high.
- Fork policy: when risk first jumps to high, create 1–2 short forks (rollback vs current vs alt patch) and gate further compute on their agreement.
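The band-transition step of this recipe can be sketched as one dispatch function. The returned dicts are illustrative scheduler directives, not a real API; throttle values mirror the placeholder grid above.

```python
def on_band_change(prev: str, new: str) -> dict:
    """Recipe step: on the first jump to high, pause the main run and spawn
    short forks (rollback vs. alt patch, compared against the current run),
    gating further compute on their agreement. Values are illustrative."""
    if new == "high":
        if prev != "high":
            return {"pause_main": True,
                    "spawn_forks": ["rollback", "alt_patch"],
                    "gate": "fork_agreement"}
        # Already in high band: stay paused, no new forks.
        return {"pause_main": True, "spawn_forks": [],
                "gate": "fork_agreement"}
    if new == "medium":
        return {"pause_main": False, "throttle": 0.5}
    return {"pause_main": False, "throttle": 1.0}
```

Forks are only spawned on the low/medium-to-high transition, so a run that lingers in the high band does not multiply its compute bill.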
- Outcome pattern (conjectured)
- Silent-error rate per unit human+compute budget drops most when:
- Risk signals have at least modest predictive power for regressions.
- High-risk bands are rare but trigger strong tests + compute caps.
- A nonzero fixed baseline of checks exists to catch slow drifts.
- Residual errors skew toward:
- global modeling mistakes that remain internally consistent, and
- drifts so low-signal that they never affect the chosen risk metrics and slip through baseline checks.