In scientific codebases that long-running agents refactor over days, what measurable properties of serialized artifacts at checkpoints—such as spec entropy (amount of detail), change locality (fraction of diff touching contract-governed regions), and cross-artifact consistency scores—best predict downstream silent-error risk, and how can these signals be turned into simple, human-understandable triage rules for when a human must review a checkpoint?

anthropic-scientific-computing

Answer

The signals most likely to predict silent errors are simple structural and semantic deltas on contract-governed artifacts. A practical scheme is to track a small set of normalized metrics per checkpoint and map them to a handful of triage rules.

  1. Useful checkpoint-level metrics
  • Spec entropy / spec change

    • H1: Large normalized spec diff (text or AST) vs prior checkpoint, especially in contract-governed areas (APIs, schemas, golden cases), correlates with higher silent-error risk.
    • Metric: spec_change_rate = changed_tokens / total_tokens, computed separately for contract vs non-contract regions.
  • Change locality w.r.t. contracts

    • H2: A high fraction of code or config diff that touches contract-governed regions (frozen interfaces, schema locks, golden tests) predicts higher risk than the same-sized diff confined to internals.
    • Metric: contract_touch_fraction = lines_changed_in_contract_regions / total_lines_changed.
  • Cross-artifact consistency

    • H3: Drops in simple consistency scores between paired artifacts predict higher risk: e.g., function signatures vs call sites, schema vs queries, config vs manifests.
    • Metrics (examples):
      • api_consistency_score (0–1): fraction of calls matching declared signatures.
      • schema_consistency_score: fraction of queries and loaders consistent with schema.
      • golden_case_pass_rate: fraction of golden/reference cases passing.
  • Churn and complexity around contracts

    • H4: Spiky churn near contracts (many edits to the same contract-governed files across recent checkpoints) and local complexity jumps (cyclomatic, nesting) correlate with higher risk.
    • Metrics: local_churn_near_contracts (last K checkpoints), complexity_delta_in_contract_files.
  • Reproducibility / invariants

    • H5: Any regression in the reproducibility harness or in invariant checks on key runs is a strong predictor of silent errors.
    • Metrics: repro_pass (bool), invariant_violations_count.
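
As a rough illustration of how the ratio-style metrics above might be computed, here is a minimal sketch. The glob patterns in `CONTRACT_GLOBS`, the helper names, and the token-level definition of `spec_change_rate` (deleted plus inserted tokens over the combined token count) are assumptions made for this example, not part of any established tool.

```python
from difflib import SequenceMatcher
from fnmatch import fnmatchcase

# Hypothetical glob patterns for contract-governed files; adjust per repository.
CONTRACT_GLOBS = ("api/*", "schemas/*", "tests/golden/*")


def is_contract_path(path: str) -> bool:
    """True if the path matches any of the (assumed) contract patterns."""
    return any(fnmatchcase(path, pattern) for pattern in CONTRACT_GLOBS)


def contract_touch_fraction(lines_changed: dict[str, int]) -> float:
    """lines_changed_in_contract_regions / total_lines_changed.

    lines_changed maps a file path to its changed-line count for one checkpoint.
    """
    total = sum(lines_changed.values())
    if total == 0:
        return 0.0  # empty diff: nothing was touched
    in_contract = sum(n for path, n in lines_changed.items() if is_contract_path(path))
    return in_contract / total


def spec_change_rate(old_tokens: list[str], new_tokens: list[str]) -> float:
    """Assumed definition: (deleted + inserted tokens) / (old + new tokens).

    This equals 1 - SequenceMatcher.ratio() and always lies in [0, 1].
    """
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    return 1.0 - matcher.ratio()


def pass_fraction(outcomes: list[bool]) -> float:
    """Generic 0-1 score, usable for api_consistency_score or golden_case_pass_rate."""
    return sum(outcomes) / len(outcomes) if outcomes else 1.0
```

For example, `contract_touch_fraction({"api/solver.py": 30, "core/mesh.py": 70})` evaluates to 0.3. Computing `spec_change_rate` separately for contract and non-contract regions only requires tokenizing the two regions separately before calling it.
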
  2. Simple triage rules for human review

Define three bands per checkpoint: GREEN (no human review), AMBER (batch or spot review), RED (blocking manual review).

Example rules (thresholds to be tuned per codebase):

  • Rule R1 (contract-touching diff)

    • If contract_touch_fraction > 0.25 AND total_lines_changed > 100 → at least AMBER.
    • If contract_touch_fraction > 0.5 AND total_lines_changed > 200 → RED.
  • Rule R2 (spec change spike)

    • If spec_change_rate_in_contract_regions > 0.15 OR the spec token count grows or shrinks by more than 20% in one checkpoint → AMBER.
    • Combine with recent history: if the moving average of spec change over the last 3 checkpoints is at least twice the baseline → AMBER.
  • Rule R3 (consistency / golden cases)

    • Any drop in golden_case_pass_rate, or any api_consistency_score or schema_consistency_score < 0.98 → RED.
    • If scores are stable but a new golden case is added or materially changed → AMBER for that checkpoint.
  • Rule R4 (churn and complexity near contracts)

    • If local_churn_near_contracts over the last K checkpoints is in the top decile of historical values → AMBER.
    • If complexity_delta_in_contract_files exceeds a preset threshold (e.g., a >20% increase in cyclomatic complexity) → AMBER.
  • Rule R5 (reproducibility / invariants)

    • Any reproducibility harness failure or new invariant violation on contract-governed paths → RED.
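
The rules above can be sketched as a single function that evaluates every rule and returns the most severe band. This is only an illustration: the `CheckpointMetrics` container, the field names not mentioned in the text (such as `churn_percentile` and `spec_token_relative_delta`), and the reading of "doubles baseline" as "at least twice the baseline" are assumptions, and the thresholds are the untuned example values.

```python
from dataclasses import dataclass
from enum import IntEnum


class Band(IntEnum):
    GREEN = 0  # no human review
    AMBER = 1  # batch or spot review
    RED = 2    # blocking manual review


@dataclass
class CheckpointMetrics:
    # Hypothetical container; defaults describe a quiet checkpoint.
    contract_touch_fraction: float = 0.0
    total_lines_changed: int = 0
    spec_change_rate_in_contract_regions: float = 0.0
    spec_token_relative_delta: float = 0.0  # +0.25 means 25% more spec tokens
    spec_change_moving_avg: float = 0.0     # mean over the last 3 checkpoints
    spec_change_baseline: float = 0.0
    golden_case_pass_rate: float = 1.0
    previous_golden_case_pass_rate: float = 1.0
    golden_case_added_or_changed: bool = False
    api_consistency_score: float = 1.0
    schema_consistency_score: float = 1.0
    churn_percentile: float = 0.0           # rank of local churn in history, 0-1
    complexity_delta_in_contract_files: float = 0.0  # 0.2 means +20%
    repro_pass: bool = True
    new_invariant_violations: int = 0


def triage(m: CheckpointMetrics) -> tuple[Band, list[str]]:
    """Return the most severe band plus the IDs of the rules that fired."""
    hits: list[tuple[Band, str]] = []

    # R1: contract-touching diff
    if m.contract_touch_fraction > 0.5 and m.total_lines_changed > 200:
        hits.append((Band.RED, "R1"))
    elif m.contract_touch_fraction > 0.25 and m.total_lines_changed > 100:
        hits.append((Band.AMBER, "R1"))

    # R2: spec change spike, either in one step or relative to recent history
    spike = (m.spec_change_rate_in_contract_regions > 0.15
             or abs(m.spec_token_relative_delta) > 0.20)
    sustained = (m.spec_change_baseline > 0
                 and m.spec_change_moving_avg >= 2 * m.spec_change_baseline)
    if spike or sustained:
        hits.append((Band.AMBER, "R2"))

    # R3: consistency scores and golden cases
    if (m.golden_case_pass_rate < m.previous_golden_case_pass_rate
            or m.api_consistency_score < 0.98
            or m.schema_consistency_score < 0.98):
        hits.append((Band.RED, "R3"))
    elif m.golden_case_added_or_changed:
        hits.append((Band.AMBER, "R3"))

    # R4: churn in the top decile, or a complexity jump above 20%
    if m.churn_percentile >= 0.9 or m.complexity_delta_in_contract_files > 0.20:
        hits.append((Band.AMBER, "R4"))

    # R5: reproducibility harness failure or new invariant violation
    if not m.repro_pass or m.new_invariant_violations > 0:
        hits.append((Band.RED, "R5"))

    band = max((b for b, _ in hits), default=Band.GREEN)
    return band, [rule for _, rule in hits]
```

Returning the fired rule IDs alongside the band keeps the decision explainable: a reviewer can see which rule escalated the checkpoint without re-deriving it from raw metrics.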

How humans see this

  • Each checkpoint is surfaced as a short badge set, e.g.,
    • “Contract diff: HIGH; Spec change: MED; Consistency: OK; Churn: HIGH; Repro: OK → AMBER (review recommended).”
  • Reviewers need to step in only when:
    • Contracts or specs shift a lot, or
    • Consistency / golden / repro signals regress, or
    • There is sustained churn around contracts.
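
A badge line like the one quoted above could be rendered with a small formatter. The sketch below is hypothetical: the function name and the action wording for GREEN and RED are assumptions, and only the AMBER wording comes from the example.

```python
# Assumed action text per band; only "review recommended" appears in the example above.
ACTIONS = {
    "GREEN": "no review needed",
    "AMBER": "review recommended",
    "RED": "blocking review required",
}


def format_badges(levels: dict[str, str], band: str) -> str:
    """Join per-signal levels into one line and append the overall band."""
    parts = "; ".join(f"{name}: {level}" for name, level in levels.items())
    return f"{parts} → {band} ({ACTIONS[band]})"


line = format_badges(
    {
        "Contract diff": "HIGH",
        "Spec change": "MED",
        "Consistency": "OK",
        "Churn": "HIGH",
        "Repro": "OK",
    },
    "AMBER",
)
print(line)
```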

Evidence type: synthesis
Evidence strength: low