For long-running agents that refactor and extend scientific codebases, how does enforcing artifact-only memory (all state must be serialized into versioned specs, manifests, and notebooks between planning cycles) change the spectrum of silent errors and regressions compared with allowing rich opaque internal memory plus periodic checkpoints, under the same test suite and human-review budget?

anthropic-scientific-computing | Updated at 2026-04-07 07:34

Answer

Artifact-only memory tends to reduce long-horizon, hard-to-audit drift and make regressions more visible, but it increases local wiring and contract errors and loses some useful long-range context. Under a fixed test suite and human-review budget, the trade is fewer deep, opaque regressions in exchange for more shallow, better-localized ones.

Summary comparison

Silent errors that usually decrease with artifact-only memory
- Slow assumption/config drift across many cycles (env vars, hidden flags, ad hoc caches).
- Non-reproducible changes (state not reflected in code, specs, or manifests).
- Hidden refactor regressions that never get written into any reviewed artifact.
- "Heisenbug" history effects (order-dependent behavior) rooted in mutable internal context.
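
As a rough illustration of why these error classes shrink, the sketch below serializes a planning-cycle state into a canonical manifest and records a content hash, so any state that influences behavior has to appear in a diffable, verifiable artifact. The function names (`serialize_state`, `load_state`) and the example fields are hypothetical, not taken from any particular agent framework.

```python
import hashlib
import json
from typing import Dict, Tuple


def serialize_state(state: Dict) -> Tuple[str, str]:
    """Return (canonical_json, sha256_hex) for a planning-cycle state."""
    # sort_keys plus fixed separators give a canonical form, so two states
    # with the same content always produce the same bytes and the same hash.
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, digest


def load_state(payload: str, expected_digest: str) -> Dict:
    """Reload a manifest, refusing content that does not match its digest."""
    actual = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if actual != expected_digest:
        raise ValueError("manifest content does not match recorded digest")
    return json.loads(payload)


# Hypothetical state: the env var is written down instead of living in memory.
state_a = {"solver": "rk4", "dt": 0.01, "env": {"OMP_NUM_THREADS": "4"}}
state_b = {"env": {"OMP_NUM_THREADS": "4"}, "dt": 0.01, "solver": "rk4"}

payload_a, digest_a = serialize_state(state_a)
payload_b, digest_b = serialize_state(state_b)
restored = load_state(payload_a, digest_a)
```

Key order does not affect the digest here, so a changed digest between cycles signals a real change in recorded state rather than a serialization accident.
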
Silent errors that usually increase or shift
- Schema / interface mismatches between serialized artifacts and code.
- Misinterpreted specs or manifests when the next cycle/agent reloads them.
- Lost rationale for a refactor that was never serialized, leading to later "fixes" that reintroduce old bugs.
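
One possible mitigation for these boundary errors is to validate every manifest on reload and to make the rationale a required field. The sketch below is a minimal hand-rolled check; the field names, `SUPPORTED_SCHEMA_VERSION`, and `validate_manifest` are assumptions made for illustration, and a real project might prefer a dedicated schema library.

```python
from typing import Dict, List

# Hypothetical contract for a manifest written at the end of a planning cycle.
SUPPORTED_SCHEMA_VERSION = 2
REQUIRED_FIELDS = {
    "schema_version": int,
    "units": str,
    "grid_shape": list,
    "rationale": str,  # why the change was made, kept alongside the change
}


def validate_manifest(manifest: Dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in manifest:
            problems.append(f"missing field: {name}")
        elif not isinstance(manifest[name], expected_type):
            got = type(manifest[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {got}")
    version = manifest.get("schema_version")
    if isinstance(version, int) and version != SUPPORTED_SCHEMA_VERSION:
        problems.append(
            f"schema_version {version} != supported {SUPPORTED_SCHEMA_VERSION}"
        )
    return problems


good = {
    "schema_version": 2,
    "units": "SI",
    "grid_shape": [64, 64],
    "rationale": "Switched to SI units to match the upstream solver.",
}
stale = {"schema_version": 1, "units": "SI", "grid_shape": "64x64"}

good_problems = validate_manifest(good)
stale_problems = validate_manifest(stale)
# stale_problems ->
# ['grid_shape: expected list, got str',
#  'missing field: rationale',
#  'schema_version 1 != supported 2']
```

A check like this turns a silent misinterpretation on reload into a loud, localized failure, which is the kind of error the existing test suite and reviewers can act on.
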
Net effect on regressions
- Regressions arrive in discrete steps, each tied to an artifact change, and are easier to bisect to a specific artifact version.
- Fewer regressions sneak through via opaque memory; more are tied to diffs that tests or humans can inspect.
- Overall silent-regression rate often falls for stable, contract-heavy codebases; it can rise for fast-changing, loosely specified ones.
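
The bisection claim can be sketched as a binary search over an ordered list of artifact versions. This assumes the regression is monotonic (once a version is bad, all later versions are bad) and that a check such as the test suite can be run against any stored version; `first_bad_version` and the version labels are hypothetical.

```python
from typing import Callable, Optional, Sequence


def first_bad_version(
    versions: Sequence[str], is_good: Callable[[str], bool]
) -> Optional[str]:
    """Return the earliest version for which is_good is False, or None.

    Assumes versions are ordered oldest to newest and that badness is
    monotonic: every version after the first bad one is also bad.
    """
    lo, hi = 0, len(versions)  # the first bad index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid + 1  # first bad version is strictly after mid
        else:
            hi = mid  # mid is bad, so the first bad one is mid or earlier
    return versions[lo] if lo < len(versions) else None


# Hypothetical history: the regression entered with manifest version v4.
versions = ["v1", "v2", "v3", "v4", "v5"]
known_bad = {"v4", "v5"}

culprit = first_bad_version(versions, lambda v: v not in known_bad)
clean = first_bad_version(versions, lambda v: True)
```

With opaque internal memory there is no comparable ordered list of states to search, which is one reason history-dependent regressions are harder to localize in that setting.
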
Oversight interaction (same human budget)
- Artifact-only memory gives human review more leverage: reviewers can focus on diffs to specs, manifests, and notebooks at a few checkpoints instead of guessing about internal state.
- With rich internal memory, the same review time is spent on fewer explicit deltas and more behavior that must be inferred, which makes subtle drift harder to catch.
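
To make the "explicit deltas" concrete, the sketch below computes a flat diff between two manifest versions that a reviewer could scan at a checkpoint. `diff_manifests` and the example keys are illustrative assumptions; nested manifests would need a recursive variant.

```python
from typing import Any, Dict


def diff_manifests(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Dict]:
    """Summarize top-level differences between two manifest versions."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical manifests from two consecutive planning cycles.
old = {"dt": 0.01, "solver": "rk4", "tolerance": 1e-6}
new = {"dt": 0.005, "solver": "rk4", "precision": "float32"}

delta = diff_manifests(old, new)
# delta ->
# {'added': {'precision': 'float32'},
#  'removed': {'tolerance': 1e-06},
#  'changed': {'dt': (0.01, 0.005)}}
```

A reviewer reading this delta sees that a tolerance setting disappeared in the same cycle that precision was lowered, a pairing that would be invisible if both decisions lived only in the agent's internal memory.
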

So, under matched tests and review, artifact-only memory shifts failures toward explicit, boundary-level errors and away from deep, history-dependent drift. When artifacts and contracts are well designed, this improves auditability and usually end-to-end trust. It can underperform rich internal memory when tasks depend heavily on tacit, long-range context that is hard to serialize.