In scientific codebases that long-running agents refactor over days, what measurable properties of serialized artifacts at checkpoints—such as spec entropy (amount of detail), change locality (fraction of diff touching contract-governed regions), and cross-artifact consistency scores—best predict downstream silent-error risk, and how can these signals be turned into simple, human-understandable triage rules for when a human must review a checkpoint?
anthropic-scientific-computing
Answer
The most predictive signals are simple structural and semantic deltas on contract-governed artifacts. A practical scheme is to track a small set of normalized metrics per checkpoint and map them to a handful of human-understandable triage rules.
- Useful checkpoint-level metrics
- Spec entropy / spec change
- H1: Large normalized spec diff (text or AST) vs prior checkpoint, especially in contract-governed areas (APIs, schemas, golden cases), correlates with higher silent-error risk.
- Metric: spec_change_rate = changed_tokens / total_tokens, computed separately for contract vs non-contract regions.
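A minimal sketch of `spec_change_rate` using a token-level diff; the function name and whitespace tokenization are illustrative assumptions, and in practice you would run it separately over contract-governed and non-contract regions of the spec:

```python
import difflib
import re

def spec_change_rate(prev_text: str, curr_text: str) -> float:
    """Normalized token-level spec change between two checkpoint snapshots.

    Returns changed_tokens / total_tokens relative to the current snapshot.
    """
    prev_tokens = re.findall(r"\S+", prev_text)
    curr_tokens = re.findall(r"\S+", curr_text)
    if not curr_tokens:
        return 0.0
    matcher = difflib.SequenceMatcher(a=prev_tokens, b=curr_tokens)
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    changed = len(curr_tokens) - unchanged
    return changed / len(curr_tokens)
```

An AST-based diff would be less sensitive to formatting noise, but a token diff is a reasonable first proxy.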
- Change locality w.r.t. contracts
- H2: A high fraction of code or config diff that touches contract-governed regions (frozen interfaces, schema locks, golden tests) predicts higher risk than the same-sized diff confined to internals.
- Metric: contract_touch_fraction = lines_changed_in_contract_regions / total_lines_changed.
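A sketch of `contract_touch_fraction` computed from a unified diff, assuming contract-governed regions can be approximated by path prefixes (frozen interfaces, schemas, golden tests); a real implementation might use line-range annotations instead:

```python
def contract_touch_fraction(diff_lines, contract_paths):
    """Fraction of changed lines that fall in contract-governed files.

    diff_lines: unified-diff text split into lines.
    contract_paths: path prefixes treated as contract-governed.
    """
    total = contract = 0
    in_contract_file = False
    for line in diff_lines:
        if line.startswith("+++ "):
            # Track which file the following hunks belong to.
            path = line[4:].removeprefix("b/").strip()
            in_contract_file = any(path.startswith(p) for p in contract_paths)
        elif line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            total += 1
            if in_contract_file:
                contract += 1
    return contract / total if total else 0.0
```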
- Cross-artifact consistency
- H3: Drops in simple consistency scores between paired artifacts predict higher risk: e.g., function signatures vs call sites, schema vs queries, config vs manifests.
- Metrics (examples):
- api_consistency_score (0–1): fraction of calls matching declared signatures.
- schema_consistency_score: fraction of queries and loaders consistent with schema.
- golden_case_pass_rate: fraction of golden/reference cases passing.
- Churn and complexity around contracts
- H4: Spiky churn near contracts (many edits to the same contract-governed files across recent checkpoints) and local complexity jumps (cyclomatic, nesting) correlate with higher risk.
- Metrics: local_churn_near_contracts (last K checkpoints), complexity_delta_in_contract_files.
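A sketch of `local_churn_near_contracts` over a sliding window of checkpoints, assuming each checkpoint stores a per-file lines-changed summary (the data shape here is an assumption; adapt it to your checkpoint records):

```python
from collections import Counter

def local_churn_near_contracts(checkpoint_diffs, contract_paths, k=5):
    """Total lines changed per contract-governed file over the last k checkpoints.

    checkpoint_diffs: list (oldest to newest) of {path: lines_changed} dicts.
    Returns a Counter mapping path -> churn in the window.
    """
    churn = Counter()
    for diff in checkpoint_diffs[-k:]:
        for path, lines in diff.items():
            if any(path.startswith(p) for p in contract_paths):
                churn[path] += lines
    return churn
```

Comparing the window total against the historical distribution (e.g., a top-decile flag) gives the spiky-churn signal used by rule R4 below.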
- Reproducibility / invariants
- H5: Any regression in reproducibility harness or invariant checks on key runs is a strong predictor.
- Metrics: repro_pass (bool), invariant_violations_count.
- Simple triage rules for human review
Define three bands per checkpoint: GREEN (no human review), AMBER (batch/spot review), RED (blocking manual review).
Example rules (thresholds to be tuned per codebase):
- Rule R1 (contract-touching diff)
- If contract_touch_fraction > 0.25 AND total_lines_changed > 100 → at least AMBER.
- If contract_touch_fraction > 0.5 AND total_lines_changed > 200 → RED.
- Rule R2 (spec change spike)
- If spec_change_rate_in_contract_regions > 0.15 OR spec tokens increase/decrease by >20% in one checkpoint → AMBER.
- Combine with recent history: if the moving average of spec change over the last 3 checkpoints is double the baseline → AMBER.
- Rule R3 (consistency / golden cases)
- Any drop in golden_case_pass_rate, or api_consistency_score or schema_consistency_score below 0.98 → RED.
- If scores are stable but a new golden case is added or materially changed → AMBER for that checkpoint.
- Rule R4 (churn and complexity near contracts)
- If local_churn_near_contracts over last N checkpoints is in top decile of historical values → AMBER.
- If complexity_delta_in_contract_files > pre-set threshold (e.g., >20% increase in cyclomatic complexity) → AMBER.
- Rule R5 (reproducibility / invariants)
- Any reproducibility harness failure or new invariant violation on contract-governed paths → RED.
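Rules R1–R5 can be combined into one small, auditable function. This is a sketch with the example thresholds above hard-coded; the metric dictionary keys are assumptions to be adapted to your checkpoint schema, and thresholds should be tuned per codebase:

```python
def triage(m: dict) -> str:
    """Map checkpoint metrics to a triage band via rules R1-R5."""
    red = amber = False
    # R1: contract-touching diff
    if m["contract_touch_fraction"] > 0.5 and m["total_lines_changed"] > 200:
        red = True
    elif m["contract_touch_fraction"] > 0.25 and m["total_lines_changed"] > 100:
        amber = True
    # R2: spec change spike (spec_token_delta is the fractional size change)
    if m["spec_change_rate_contract"] > 0.15 or abs(m["spec_token_delta"]) > 0.20:
        amber = True
    # R3: consistency / golden cases
    if (m["golden_case_pass_rate"] < m["prev_golden_case_pass_rate"]
            or m["api_consistency_score"] < 0.98
            or m["schema_consistency_score"] < 0.98):
        red = True
    # R4: churn / complexity near contracts (top-decile flag precomputed)
    if m["churn_top_decile"] or m["complexity_delta_contract"] > 0.20:
        amber = True
    # R5: reproducibility / invariants
    if not m["repro_pass"] or m["invariant_violations_count"] > 0:
        red = True
    return "RED" if red else ("AMBER" if amber else "GREEN")
```

Keeping the rules as flat, ordered conditionals (rather than a learned model) is deliberate: a reviewer can read the function and see exactly why a checkpoint was flagged.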
How humans see this
- Surfaced as a short badge set per checkpoint: e.g.,
- “Contract diff: HIGH; Spec change: MED; Consistency: OK; Churn: HIGH; Repro: OK → AMBER (review recommended).”
- Reviewers focus only when:
- Contracts or specs shift a lot, or
- Consistency / golden / repro signals regress, or
- There is sustained churn around contracts.
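The badge line shown above can be rendered mechanically from per-signal levels and the overall band; this helper is a hypothetical sketch of that presentation layer:

```python
def badge_line(levels: dict, band: str) -> str:
    """Render a one-line checkpoint badge set, e.g. for a CI summary."""
    parts = "; ".join(f"{name}: {lvl}" for name, lvl in levels.items())
    note = {"GREEN": "no review", "AMBER": "review recommended",
            "RED": "blocking review"}[band]
    return f"{parts} -> {band} ({note})"
```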
Evidence type: synthesis
Evidence strength: low