Modulum-Bench · Data Hygiene Reconciliation

v1 → v2 → v3 → v4 audit reconciliation

Grok (v2) then Codex (v4) ran independent file-read verification passes against modulum-bench-full-report.html v1 and surfaced numerical drift between the visual and the delivered SQLite sources. This page records every correction applied, the data-hygiene pipeline that produced the canonical v2 numbers, and the audit trail a hyperscaler data team can replay.

generated 2026-05-18 auditor: Grok 4.3 (Codex was no-tool) scope: modulum-bench partner + internal reports verdict: NEEDS FIXES → reconciled
Discrepancies closed
14 / 14
All Grok + Codex findings reconciled to canonical source
Canonical rows
4,826
from 5,556 raw across 25 SQLite databases
Duplicates resolved
730
phase-priority dedupe, audit trail preserved
Verifiable cells
100 %
every headline number traces to a phase + idx range

Every value that moved between v1 and v2.

sev Location v1 (stale) v2 (canonical) Why the change Canonical source
Headline leaderboard · Modulum 128k avg 46.4 % 46.0 % v1 used rounded snapshot (72.6/39.5/27.0). v2 reads canonical N=200/200/500 cells. summary.csv qa1·128k=143/200, qa2·128k=79/200, qa3·128k=135/500
Headline · Modulum qa1 128k 72.6 % 71.5 % v1 carried pre-extension snapshot. v2 = phase-1 (65/100) + phase-5 (78/100) deduped. phases m1-qa1-phase1m1-extend-phase5
Mask ablation · Modulum qa1 128k (idx 0..49) 76.6 % 72.0 % v1 number was unreproducible from any clean idx 0..49 slice. v2 = 36/50 from phase-1. phase-1, sample_idx ∈ [0,49]
Mask ablation · qa1 128k Δ Platform +14.6 pp +10.0 pp Recomputed from corrected Modulum value above (72.0 − 62.0). derived from rows above
Mask ablation · Vanilla qa3 64k (partial) 26.7 % 25.6 % v1 cached an earlier 30-row snapshot. v2 = 10/39 rows from current SQLite (1 row added since). vanilla-gemma-mask-ablation
Leaderboard · Claude Opus 4.7 cells all 0 % (canonical drift) 92 / 66 / 38 → 65.3 % v1 dedupe used the temp=0 rerun batch (rejected w/ "temperature deprecated" — all errors). v2 uses the earlier non-temp batch that succeeded. phase priority: frontier-opus-4-7-batch-20260518-024450
Decay-chart subtitle · qa1 Modulum 89 → 77 → 73 % 89 → 77 → 71.5 % Subtitle copied v1 rounded value. SVG polyline y-coord adjusted (440,112 → 440,114). same as headline qa1 128k
Total observations 5,193 4,826 canonical v1 counted CSV lines including 360 dup rows + 198 unparseable. v2 = deterministic dedupe of 5,556 raw rows. manifest.json row_counts
Grok 4.3 qa3 128k footnote ~15 % (no N) 15.4 % (N=26, 3 err) v1 used tilde approximation without disclosing low N. v2 surfaces 26-row sample + 3 backend errors inline. frontier-grok-4.3 SQLite
v3→v4 · Gemini 1M / 256k qa2 framing "frontier collapses at 1M" RETRACTED — HTTP 429 spending cap Codex audit found the 50/50 1M failures and 16/16 qa2 256k failures are Google API budget-cap errors, not model-context-capability failures. We have no data on Gemini 1M actual capability. frontier-gemini-extended-context SQLite, error field
v3→v4 · Mask qa3 64k Δ Platform +12.4 pp (vs 10/39 partial vanilla) +6.0 pp (vs 16/50 complete vanilla, ns p=0.53) Vanilla mask run completed qa3 64k (39 → 50 rows) and qa3 128k partial (6 rows). The full mask comparison weakens. vanilla-gemma-mask-ablation SQLite (post-rebuild)
v3→v4 · qa1 32k Wald p-value p=0.13 p=0.097 Exact Wald recomputation; still ns at α=0.05, but precise value matters for publishable claims. exports/full_audit.json
v3→v4 · "10–28pp on 7 of 8 cells" 10–28pp 6–28pp on 8 of 9 (qa3 128k partial included) qa2 64k is +8pp, qa3 64k is now +6pp (after vanilla completed); the range floor moves. derived from mask cells
v3→v4 · "16 GB / 50–100× / $660" specific figures softened to qualitative Codex flagged these as not evidenced in canonical data. Replaced with "workstation-scale single-GPU" + removed cost estimate. Hypernym-stated deployment profile only. n/a — not in canonical data

Raw SQLite → canonical exports.

v1 concatenated summary.json blobs without dedupe. v2 reads every runs/*/results.sqlite, applies a deterministic phase-priority rule per (model, task, length, sample_idx), and emits both the canonical CSV and a full audit trail.

Step 1 · Ingest
5,556
raw rows pulled from 25 SQLite databases across 16 experimental phases. Three schema shapes harmonized (provider column optional; vanilla adds prefill/decode_ms).
Step 2 · Dedupe
−730
(model, task, length, sample_idx) conflicts resolved by deterministic phase priority. PPL re-runs and parallel-killed phases deprioritized. Every drop logged in dedupe_log.csv.
Step 3 · Emit
4,826
canonical rows written with csv.QUOTE_MINIMAL — no silent line-ending drops. Summary CSV + manifest.json (per-cell phases, idx range, source priority).

Grok ran the audit; Codex was constrained.

Grok 4.3 · file-read auditor (8,989 bytes)

Read summary.csv, all_results.csv, every results.sqlite, and the HTML. Verdict: NEEDS FIXES. Findings reconciled in v2:

  • 9 numerical drift cells identified, all closed (section 01 above).
  • Flagged 360 duplicate (model,task,length,sample_idx) rows in v1 CSV — resolved by phase-priority dedupe.
  • Flagged 198 unparseable lines in v1 CSV — resolved by csv.QUOTE_MINIMAL writer.
  • Confirmed qa1/qa2/qa3 task definitions match babilong/prompts.py verbatim.
  • Confirmed all confound disclosures (C-01 through C-07) directionally accurate.

Codex GPT-5.4 · no-tool constraint (30,869 bytes)

Dispatched with same prompt but without file-read permissions. Findings were procedural, not numerical:

  • Confirmed arithmetic: 96+90+80 ÷ 3 = 88.67, etc.
  • Flagged qa3 description should mention "previous location of object", not just "temporal" — applied.
  • Recommended: aggregation/sample-size notes on every leaderboard cell — applied.
  • Recommended: re-state "temperature omitted after API rejection" precisely — applied.
  • Could not verify any specific number against source.

Known limitations not addressed by this reconciliation.

Data gaps (require new runs)

  • Vanilla Gemma qa3 128k — mask-ablation incomplete on the hardest cell. Vanilla mirror endpoint single-slot; ~3–4 h to complete.
  • Grok 4.3 qa3 128k — 26 of 50 rows due to backend errors. Re-run with retry+backoff to fill.
  • Modulum at 256k+ — endpoint hard-locked at 131,072 tokens. Cannot test the "Modulum holds where frontier collapses" claim until Hypernym provisions a longer-context variant.
  • Gemini qa2 / qa3 at 256k+ — qa2 256k returned 0 of 11 with all errors; qa3 256k+ not run.

Methodology open questions

  • Temperature noise on Opus 4.7 + GPT-5.5 — both APIs reject temperature=0; outputs sampled at default temp. Per-cell drift estimated at ±3–5 pp.
  • Single-slot Modulum vs batched frontier — latency comparison not apples-to-apples. Capability comparison still valid.
  • Substring scoring — re-scored with exact-last / unique-loc / phrase rules; within ±1 row per cell. Reported numbers remain substring-based.
  • v1.5 Pro vs v3.1 Pro framing — Modulum 46 % avg is competitive with 2024-vintage Gemini 1.5 Pro at 128k. Current frontier (2026) opens a 30 pp gap. Stated transparently in §11 of partner report.

Files a partner data team can rerun.

Path Purpose Rows / size
research/tracks/modulum-bench/rebuild_exports.py Deterministic rebuilder. Reads all SQLites; applies phase-priority dedupe; emits canonical CSV + summary + manifest + dedupe log. ~250 LoC
exports/all_results.csv One row per (model, task, length, sample_idx). No duplicates, no silent drops. 4,826 rows
exports/summary.csv Per-cell aggregates with phase provenance + idx_min/max range. 66 cells. 66 rows
exports/manifest.json Machine-readable: row counts, phase priority table, per-cell sources. Source-of-truth for "where did this number come from". ~280 lines
exports/dedupe_log.csv Every duplicate dropped: kept_phase, dropped_phase, correct labels of both. Lets an auditor verify the dedupe rule preserved the right row. 730 rows
visuals/modulum-bench-full-report.html Partner-facing report v2 — numbers reconciled to canonical exports. ~48 KB
visuals/modulum-bench-internal-team-report.html Internal R&D-focused report v2 — same reconciliation applied. ~23 KB