Grok (v2) then Codex (v4) ran independent file-read verification passes against modulum-bench-full-report.html v1 and surfaced numerical drift between the visual and the delivered SQLite sources. This page records every correction applied, the data-hygiene pipeline that produced the canonical v2 numbers, and the audit trail a hyperscaler data team can replay.
| sev | Location | v1 (stale) | v2 (canonical) | Why the change | Canonical source |
|---|---|---|---|---|---|
| Headline leaderboard · Modulum 128k avg | 46.4 % | 46.0 % | v1 used rounded snapshot (72.6/39.5/27.0). v2 reads canonical N=200/200/500 cells. | summary.csv qa1·128k=143/200, qa2·128k=79/200, qa3·128k=135/500 |
|
| Headline · Modulum qa1 128k | 72.6 % | 71.5 % | v1 carried pre-extension snapshot. v2 = phase-1 (65/100) + phase-5 (78/100) deduped. | phases m1-qa1-phase1 ∪ m1-extend-phase5 |
|
| Mask ablation · Modulum qa1 128k (idx 0..49) | 76.6 % | 72.0 % | v1 number was unreproducible from any clean idx 0..49 slice. v2 = 36/50 from phase-1. | phase-1, sample_idx ∈ [0,49] | |
| Mask ablation · qa1 128k Δ Platform | +14.6 pp | +10.0 pp | Recomputed from corrected Modulum value above (72.0 − 62.0). | derived from rows above | |
| Mask ablation · Vanilla qa3 64k (partial) | 26.7 % | 25.6 % | v1 cached an earlier 30-row snapshot. v2 = 10/39 rows from current SQLite (1 row added since). | vanilla-gemma-mask-ablation |
|
| Leaderboard · Claude Opus 4.7 cells | all 0 % (canonical drift) | 92 / 66 / 38 → 65.3 % | v1 dedupe used the temp=0 rerun batch (rejected w/ "temperature deprecated" — all errors). v2 uses the earlier non-temp batch that succeeded. | phase priority: frontier-opus-4-7-batch-20260518-024450 |
|
| Decay-chart subtitle · qa1 Modulum | 89 → 77 → 73 % | 89 → 77 → 71.5 % | Subtitle copied v1 rounded value. SVG polyline y-coord adjusted (440,112 → 440,114). | same as headline qa1 128k | |
| Total observations | 5,193 | 4,826 canonical | v1 counted CSV lines including 360 dup rows + 198 unparseable. v2 = deterministic dedupe of 5,556 raw rows. | manifest.json row_counts |
|
| Grok 4.3 qa3 128k footnote | ~15 % (no N) | 15.4 % (N=26, 3 err) | v1 used tilde approximation without disclosing low N. v2 surfaces 26-row sample + 3 backend errors inline. | frontier-grok-4.3 SQLite |
|
| v3→v4 · Gemini 1M / 256k qa2 framing | "frontier collapses at 1M" | RETRACTED — HTTP 429 spending cap | Codex audit found the 50/50 1M failures and 16/16 qa2 256k failures are Google API budget-cap errors, not model-context-capability failures. We have no data on Gemini 1M actual capability. | frontier-gemini-extended-context SQLite, error field |
|
| v3→v4 · Mask qa3 64k Δ Platform | +12.4 pp (vs 10/39 partial vanilla) | +6.0 pp (vs 16/50 complete vanilla, ns p=0.53) | Vanilla mask run completed qa3 64k (39 → 50 rows) and qa3 128k partial (6 rows). The full mask comparison weakens. | vanilla-gemma-mask-ablation SQLite (post-rebuild) | |
| v3→v4 · qa1 32k Wald p-value | p=0.13 | p=0.097 | Exact Wald recomputation; still ns at α=0.05, but precise value matters for publishable claims. | exports/full_audit.json |
|
| v3→v4 · "10–28pp on 7 of 8 cells" | 10–28pp | 6–28pp on 8 of 9 (qa3 128k partial included) | qa2 64k is +8pp, qa3 64k is now +6pp (after vanilla completed); the range floor moves. | derived from mask cells | |
| v3→v4 · "16 GB / 50–100× / $660" | specific figures | softened to qualitative | Codex flagged these as not evidenced in canonical data. Replaced with "workstation-scale single-GPU" + removed cost estimate. Hypernym-stated deployment profile only. | n/a — not in canonical data |
v1 concatenated summary.json blobs without dedupe. v2 reads every runs/*/results.sqlite, applies a deterministic phase-priority rule per (model, task, length, sample_idx), and emits both the canonical CSV and a full audit trail.
(model, task, length, sample_idx) conflicts resolved by deterministic phase priority. PPL re-runs and parallel-killed phases deprioritized. Every drop logged in dedupe_log.csv.csv.QUOTE_MINIMAL — no silent line-ending drops. Summary CSV + manifest.json (per-cell phases, idx range, source priority).
Read summary.csv, all_results.csv, every results.sqlite, and the HTML. Verdict: NEEDS FIXES. Findings reconciled in v2:
(model,task,length,sample_idx) rows in v1 CSV — resolved by phase-priority dedupe.csv.QUOTE_MINIMAL writer.babilong/prompts.py verbatim.Dispatched with same prompt but without file-read permissions. Findings were procedural, not numerical:
temperature=0; outputs sampled at default temp. Per-cell drift estimated at ±3–5 pp.| Path | Purpose | Rows / size |
|---|---|---|
research/tracks/modulum-bench/rebuild_exports.py |
Deterministic rebuilder. Reads all SQLites; applies phase-priority dedupe; emits canonical CSV + summary + manifest + dedupe log. | ~250 LoC |
exports/all_results.csv |
One row per (model, task, length, sample_idx). No duplicates, no silent drops. | 4,826 rows |
exports/summary.csv |
Per-cell aggregates with phase provenance + idx_min/max range. 66 cells. | 66 rows |
exports/manifest.json |
Machine-readable: row counts, phase priority table, per-cell sources. Source-of-truth for "where did this number come from". | ~280 lines |
exports/dedupe_log.csv |
Every duplicate dropped: kept_phase, dropped_phase, correct labels of both. Lets an auditor verify the dedupe rule preserved the right row. | 730 rows |
visuals/modulum-bench-full-report.html |
Partner-facing report v2 — numbers reconciled to canonical exports. | ~48 KB |
visuals/modulum-bench-internal-team-report.html |
Internal R&D-focused report v2 — same reconciliation applied. | ~23 KB |