Modulum-Bench · Data Hygiene Reconciliation

v1 → v2 → v3 → v4 audit reconciliation

Grok (v2) then Codex (v4) ran independent file-read verification passes against modulum-bench-full-report.html v1 and surfaced numerical drift between the visual and the delivered SQLite sources. This page records every correction applied, the data-hygiene pipeline that produced the canonical v2 numbers, and the audit trail a hyperscaler data team can replay.

generated 2026-05-18 auditor: Grok 4.3 (Codex was no-tool) scope: modulum-bench partner + internal reports verdict: NEEDS FIXES → reconciled

Discrepancies closed

14 / 14

All Grok + Codex findings reconciled to canonical source

Canonical rows

4,826

from 5,556 raw across 25 SQLite databases

Duplicates resolved

730

phase-priority dedupe, audit trail preserved

Verifiable cells

100 %

every headline number traces to a phase + idx range

01 · Numerical corrections

Every value that moved between v1 and v2.

Location	v1 (stale)	v2 (canonical)	Why the change	Canonical source
Headline leaderboard · Modulum 128k avg	46.4 %	46.0 %	v1 used rounded snapshot (72.6/39.5/27.0). v2 reads canonical N=200/200/500 cells.	`summary.csv` qa1·128k=143/200, qa2·128k=79/200, qa3·128k=135/500
Headline · Modulum qa1 128k	72.6 %	71.5 %	v1 carried pre-extension snapshot. v2 = phase-1 (65/100) + phase-5 (78/100) deduped.	phases `m1-qa1-phase1` ∪ `m1-extend-phase5`
Mask ablation · Modulum qa1 128k (idx 0..49)	76.6 %	72.0 %	v1 number was unreproducible from any clean idx 0..49 slice. v2 = 36/50 from phase-1.	phase-1, sample_idx ∈ [0,49]
Mask ablation · qa1 128k Δ Platform	+14.6 pp	+10.0 pp	Recomputed from corrected Modulum value above (72.0 − 62.0).	derived from rows above
Mask ablation · Vanilla qa3 64k (partial)	26.7 %	25.6 %	v1 cached an earlier 30-row snapshot. v2 = 10/39 rows from current SQLite (1 row added since).	`vanilla-gemma-mask-ablation`
Leaderboard · Claude Opus 4.7 cells	all 0 % (canonical drift)	92 / 66 / 38 → 65.3 %	v1 dedupe used the temp=0 rerun batch (rejected w/ "temperature deprecated" — all errors). v2 uses the earlier non-temp batch that succeeded.	phase priority: `frontier-opus-4-7-batch-20260518-024450`
Decay-chart subtitle · qa1 Modulum	89 → 77 → 73 %	89 → 77 → 71.5 %	Subtitle copied v1 rounded value. SVG polyline y-coord adjusted (440,112 → 440,114).	same as headline qa1 128k
Total observations	5,193	4,826 canonical	v1 counted CSV lines including 360 dup rows + 198 unparseable. v2 = deterministic dedupe of 5,556 raw rows.	`manifest.json` row_counts
Grok 4.3 qa3 128k footnote	~15 % (no N)	15.4 % (N=26, 3 err)	v1 used tilde approximation without disclosing low N. v2 surfaces 26-row sample + 3 backend errors inline.	`frontier-grok-4.3` SQLite
v3→v4 · Gemini 1M / 256k qa2 framing	"frontier collapses at 1M"	RETRACTED — HTTP 429 spending cap	Codex audit found the 50/50 1M failures and 16/16 qa2 256k failures are Google API budget-cap errors, not model-context-capability failures. We have no data on Gemini 1M actual capability.	`frontier-gemini-extended-context` SQLite, error field
v3→v4 · Mask qa3 64k Δ Platform	+12.4 pp (vs 10/39 partial vanilla)	+6.0 pp (vs 16/50 complete vanilla, ns p=0.53)	Vanilla mask run completed qa3 64k (39 → 50 rows) and qa3 128k partial (6 rows). The full mask comparison weakens.	vanilla-gemma-mask-ablation SQLite (post-rebuild)
v3→v4 · qa1 32k Wald p-value	p=0.13	p=0.097	Exact Wald recomputation; still ns at α=0.05, but precise value matters for publishable claims.	`exports/full_audit.json`
v3→v4 · "10–28pp on 7 of 8 cells"	10–28pp	6–28pp on 8 of 9 (qa3 128k partial included)	qa2 64k is +8pp, qa3 64k is now +6pp (after vanilla completed); the range floor moves.	derived from mask cells
v3→v4 · "16 GB / 50–100× / $660"	specific figures	softened to qualitative	Codex flagged these as not evidenced in canonical data. Replaced with "workstation-scale single-GPU" + removed cost estimate. Hypernym-stated deployment profile only.	n/a — not in canonical data

02 · Data hygiene pipeline

Raw SQLite → canonical exports.

v1 concatenated summary.json blobs without dedupe. v2 reads every runs/*/results.sqlite, applies a deterministic phase-priority rule per (model, task, length, sample_idx), and emits both the canonical CSV and a full audit trail.

Step 1 · Ingest

5,556

raw rows pulled from 25 SQLite databases across 16 experimental phases. Three schema shapes harmonized (provider column optional; vanilla adds prefill/decode_ms).

Step 2 · Dedupe

−730

(model, task, length, sample_idx) conflicts resolved by deterministic phase priority. PPL re-runs and parallel-killed phases deprioritized. Every drop logged in dedupe_log.csv.

Step 3 · Emit

4,826

canonical rows written with csv.QUOTE_MINIMAL — no silent line-ending drops. Summary CSV + manifest.json (per-cell phases, idx range, source priority).

03 · Verification panel

Grok ran the audit; Codex was constrained.

Grok 4.3 · file-read auditor (8,989 bytes)

Read summary.csv, all_results.csv, every results.sqlite, and the HTML. Verdict: NEEDS FIXES. Findings reconciled in v2:

9 numerical drift cells identified, all closed (section 01 above).
Flagged 360 duplicate (model,task,length,sample_idx) rows in v1 CSV — resolved by phase-priority dedupe.
Flagged 198 unparseable lines in v1 CSV — resolved by csv.QUOTE_MINIMAL writer.
Confirmed qa1/qa2/qa3 task definitions match babilong/prompts.py verbatim.
Confirmed all confound disclosures (C-01 through C-07) directionally accurate.

Codex GPT-5.4 · no-tool constraint (30,869 bytes)

Dispatched with same prompt but without file-read permissions. Findings were procedural, not numerical:

Confirmed arithmetic: 96+90+80 ÷ 3 = 88.67, etc.
Flagged qa3 description should mention "previous location of object", not just "temporal" — applied.
Recommended: aggregation/sample-size notes on every leaderboard cell — applied.
Recommended: re-state "temperature omitted after API rejection" precisely — applied.
Could not verify any specific number against source.

04 · Outside v2 scope

Known limitations not addressed by this reconciliation.

Data gaps (require new runs)

Vanilla Gemma qa3 128k — mask-ablation incomplete on the hardest cell. Vanilla mirror endpoint single-slot; ~3–4 h to complete.
Grok 4.3 qa3 128k — 26 of 50 rows due to backend errors. Re-run with retry+backoff to fill.
Modulum at 256k+ — endpoint hard-locked at 131,072 tokens. Cannot test the "Modulum holds where frontier collapses" claim until Hypernym provisions a longer-context variant.
Gemini qa2 / qa3 at 256k+ — qa2 256k returned 0 of 11 with all errors; qa3 256k+ not run.

Methodology open questions

Temperature noise on Opus 4.7 + GPT-5.5 — both APIs reject temperature=0; outputs sampled at default temp. Per-cell drift estimated at ±3–5 pp.
Single-slot Modulum vs batched frontier — latency comparison not apples-to-apples. Capability comparison still valid.
Substring scoring — re-scored with exact-last / unique-loc / phrase rules; within ±1 row per cell. Reported numbers remain substring-based.
v1.5 Pro vs v3.1 Pro framing — Modulum 46 % avg is competitive with 2024-vintage Gemini 1.5 Pro at 128k. Current frontier (2026) opens a 30 pp gap. Stated transparently in §11 of partner report.

05 · Replay artifacts

Files a partner data team can rerun.

Path	Purpose	Rows / size
`research/tracks/modulum-bench/rebuild_exports.py`	Deterministic rebuilder. Reads all SQLites; applies phase-priority dedupe; emits canonical CSV + summary + manifest + dedupe log.	~250 LoC
`exports/all_results.csv`	One row per (model, task, length, sample_idx). No duplicates, no silent drops.	4,826 rows
`exports/summary.csv`	Per-cell aggregates with phase provenance + idx_min/max range. 66 cells.	66 rows
`exports/manifest.json`	Machine-readable: row counts, phase priority table, per-cell sources. Source-of-truth for "where did this number come from".	~280 lines
`exports/dedupe_log.csv`	Every duplicate dropped: kept_phase, dropped_phase, correct labels of both. Lets an auditor verify the dedupe rule preserved the right row.	730 rows
`visuals/modulum-bench-full-report.html`	Partner-facing report v2 — numbers reconciled to canonical exports.	~48 KB
`visuals/modulum-bench-internal-team-report.html`	Internal R&D-focused report v2 — same reconciliation applied.	~23 KB