Record: SmearGate BOS Fix 3-Seed Compliance Re-run — val_bpb 1.06141 (3-seed mean)#1868
Conversation
3-seed reproduction of PR openai#1851 (SmearGate BOS document boundary fix). Code is byte-identical to openai#1851 by @aquariouseworkman.

Results (post-TTT BPB):
- Seed 42: 1.06128 (original openai#1851 author)
- Seed 314: 1.06087 (this submission)
- Seed 1234: 1.06220 (this submission)
- Mean: 1.06145 ± 0.00068

All artifacts < 16,000,000 bytes. All runs < 600s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
You are amazing!!!
PR openai#1902 (cocohearts) accepted openai#1851/openai#1868 over openai#1736 and excluded openai#1855 only on significance grounds (p=0.325). Our prior 050 line built on openai#1797, which is under a validity cloud per cocohearts, so we re-anchor the research baseline on openai#1855's accepted chain.

This is a pure port with zero modifications: files are copied verbatim from codemath3000/parameter-golf:submission/sp8192-lqer-bos-smear-fix-9hp-stack @ 1e43966 into records/track_10min_16mb/2026-04-29_PR1855_Port_Baseline/. Spec 060B+ will fork exp/060B-* etc. to stack quant-repair / deploy-time levers (046B-tight SDClip, 046L deploy-time repair, 046G-tighter, etc.) on this baseline.
…an 1.06141

Re-ran all 3 seeds (42, 314, 1234) with GPTQ_RESERVE_SECONDS=8.0 (was 0.5) to ensure GPTQ hessian collection completes within the 600s training budget.

Code changes:
- Serialize artifact immediately after training (before diagnostic eval)
- Added timing instrumentation (serialize_wallclock, GPTQ sub-timings)

Results (all seeds fresh re-run on RunPod 8×H100 SXM):
- Seed 42: post-TTT BPB = 1.06083, artifact = 15,949,701 bytes, eval = 525.5s
- Seed 314: post-TTT BPB = 1.06091, artifact = 15,951,777 bytes, eval = 429.5s
- Seed 1234: post-TTT BPB = 1.06249, artifact = 15,951,968 bytes, eval = 481.2s
- 3-seed mean: 1.06141 ± 0.00093

Compliance: training loop ends at ~592s, GPTQ hessians end at ~595.5s (<600s). RunPod cost: ~$31.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
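The serialize-before-diagnostic ordering and the reserve-seconds margin described above can be sketched as follows. This is a minimal sketch, not the submission's code: the driver function, callable names, and parameterized budgets are hypothetical; only the 600s limit and 8.0s reserve come from the record.

```python
import time

def run_within_budget(train_step, collect_hessians, serialize_artifact,
                      run_diagnostic_eval, budget_s=600.0, reserve_s=8.0):
    """Hypothetical driver for the ordering above: stop training reserve_s
    before the budget so GPTQ hessian collection finishes inside it, then
    serialize the artifact *before* the (unbudgeted) diagnostic eval."""
    start = time.monotonic()
    # Train only until (budget - reserve), leaving room for hessians.
    while time.monotonic() - start < budget_s - reserve_s:
        train_step()
    collect_hessians()                      # must end before the budget mark
    gptq_end_s = time.monotonic() - start   # GPTQ sub-timing instrumentation
    t0 = time.monotonic()
    serialize_artifact()                    # artifact saved before eval
    serialize_wallclock = time.monotonic() - t0
    run_diagnostic_eval()                   # diagnostic eval is not budgeted
    return gptq_end_s, serialize_wallclock
```

With the original 0.5s reserve, a slow hessian pass could overrun the 600s mark; widening the reserve trades a few training steps for a guaranteed-compliant finish.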
cocohearts
left a comment
Thanks, this is useful supporting evidence for #1851. Before merge, please make it a standalone support/record package: include all three seed logs in this directory, including the seed 42 log currently referenced from #1851, and keep the README/submission.json phrased as a 3-seed compliance reproduction/support package rather than a separate technique claim. No ML change needed.
Duplicate requested-changes review submitted by automation; keeping the earlier requested-changes review active.
…1851
- Add original seed 42 log from PR openai#1851 (@aquariouseworkman)
- Add original seeds 314 and 1234 from independent reproduction
- Rename compliance re-run logs to *_rerun_gptq8s.log for clarity
- Rewrite README as support/compliance package (not a new technique claim)
- Rewrite submission.json with headline val_bpb=1.06145 (original 3-seed mean)
- Document compliance re-run as supplementary evidence (no stat-sig difference)
Updated, including both the original 3 seeds and my independent compliance re-run. Thanks!
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes: openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier, 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- No PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000 and docs_train=8181945; the sums match, so it is CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is the gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
- Leak introduced: PR openai#1736 by @dexhunter (Apr 19), the first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27), which switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day), which rebuilt the dataset locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is therefore inflated by val memorization; spec 301 was designed to measure how much of it remains under clean data.
Files:
- caseops-memory-leakage/README.md: overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md: 34-row master table with evidence
- caseops-memory-leakage/family-tree.md: ASCII trees with [C]/[L] annotations
Record Support: 3-Seed Compliance Reproduction for PR #1851
val_bpb = 1.06145 ± 0.00068 (3-seed mean) | ~15.95 MB | 8×H100 SXM 80GB
Summary
This PR is now positioned as a standalone support / record package for PR #1851, not as a separate technique claim.
It does three things:
1. Includes the original 3-seed support logs, including the seed 42 log previously referenced from PR #1851.
2. Includes the later compliance re-run logs as supplementary evidence.
3. Rephrases the README and submission.json as a 3-seed compliance reproduction/support package rather than a separate technique claim.

No ML change is claimed here. The technique is still the PR #1851 stack.
Original 3-seed support result
These are the headline results this PR should be judged on.
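For reference, the headline mean and spread follow directly from the three per-seed numbers, using the sample standard deviation (n-1):

```python
import statistics

# Original 3-seed support results (post-TTT BPB)
seed_bpb = {42: 1.06128, 314: 1.06087, 1234: 1.06220}

mean = statistics.mean(seed_bpb.values())   # 3-seed mean
std = statistics.stdev(seed_bpb.values())   # sample std, n-1 denominator
print(f"{mean:.5f} ± {std:.5f}")            # prints 1.06145 ± 0.00068
```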
All three artifacts are under 16,000,000 bytes and all eval times are under 600s.
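The two compliance gates quoted throughout this record reduce to a one-line check. The limits (16,000,000 bytes, 600s) are from the track rules; the helper name is mine, and the per-seed figures below are the compliance re-run numbers reported earlier in this thread:

```python
ARTIFACT_LIMIT_BYTES = 16_000_000
EVAL_LIMIT_SECONDS = 600.0

def is_compliant(artifact_bytes: int, eval_seconds: float) -> bool:
    """Track gate: artifact strictly under 16,000,000 bytes, eval under 600s."""
    return (artifact_bytes < ARTIFACT_LIMIT_BYTES
            and eval_seconds < EVAL_LIMIT_SECONDS)

# Compliance re-run figures: seed -> (artifact bytes, eval seconds)
rerun = {42: (15_949_701, 525.5), 314: (15_951_777, 429.5), 1234: (15_951_968, 481.2)}
assert all(is_compliant(b, t) for b, t in rerun.values())
```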
Logs now included in this submission directory
Original support logs:
- train_seed42_pr1851_original.log
- train_seed314_original.log
- train_seed1234_original.log

Later compliance re-run logs:
- train_seed42_rerun_gptq8s.log
- train_seed314_rerun_gptq8s.log
- train_seed1234_rerun_gptq8s.log

Later compliance re-run (supplementary evidence only)
The original runs used GPTQ_RESERVE_SECONDS=0.5, which left too little margin for GPTQ hessian collection. To confirm compliance, I re-ran all 3 seeds later with GPTQ_RESERVE_SECONDS=8.0 and serialize-before-diagnostic ordering so GPTQ hessians complete within the 600s training-data-access budget.

Comparison to the original 3-seed support package:
That delta is well within ordinary seed noise, so the compliance fix does not materially change model quality.
Technique / attribution
This remains the PR #1851 technique stack:
What changed in this PR update
GitHub link: #1851