
Record: SmearGate BOS Fix 3-Seed Compliance Re-run — val_bpb 1.06141 (3-seed mean)#1868

Merged
cocohearts merged 3 commits into openai:main from Christopher-Lee-McClendon:submission/record-1851-3seed
Apr 30, 2026

Conversation

Contributor

@Christopher-Lee-McClendon commented Apr 27, 2026

Record Support: 3-Seed Compliance Reproduction for PR #1851

val_bpb = 1.06145 ± 0.00068 (3-seed mean ± std) | ~15.95 MB | 8×H100 SXM 80GB

Summary

This PR is now positioned as a standalone support / record package for PR #1851, not as a separate technique claim.

It does three things:

  1. packages the original 3-seed support evidence for PR #1851 (Record: val_bpb = 1.06128, SmearGate BOS Fix + PR #1787 base + Smear Gate + LQER Asymmetric + Phased TTT),
  2. includes the original seed 42 log from PR #1851 alongside the two independent reproduction logs,
  3. adds a later compliance re-run showing GPTQ fits within the 600s training budget with no statistically significant change in validation BPB.

No ML change is claimed here. The technique is still the PR #1851 stack.

Original 3-seed support result

These are the headline results this PR should be judged on.

| Seed | Post-TTT BPB | Artifact (bytes) | Eval time | Source |
|------|--------------|------------------|-----------|--------|
| 42 | 1.06128183 | 15,952,086 | 519.5s | Original PR #1851 by @aquariouseworkman |
| 314 | 1.06086831 | 15,952,419 | 525.6s | Independent reproduction |
| 1234 | 1.06220261 | 15,952,690 | 479.6s | Independent reproduction |
| Mean ± Std | 1.06145 ± 0.00068 | | | |

All three artifacts are under 16,000,000 bytes and all eval times are under 600s.
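The headline mean and spread can be reproduced from the table with a short snippet (values copied from the table above; `statistics.stdev` is the sample, n−1 form, which matches the reported ± 0.00068):

```python
import statistics

# Post-TTT BPB per seed, copied from the table above.
bpb = {42: 1.06128183, 314: 1.06086831, 1234: 1.06220261}

mean = statistics.mean(bpb.values())
std = statistics.stdev(bpb.values())  # sample (n-1) standard deviation

print(f"{mean:.5f} ± {std:.5f}")  # 1.06145 ± 0.00068
```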

Logs now included in this submission directory

Original support logs

  • train_seed42_pr1851_original.log
  • train_seed314_original.log
  • train_seed1234_original.log

Later compliance re-run logs

  • train_seed42_rerun_gptq8s.log
  • train_seed314_rerun_gptq8s.log
  • train_seed1234_rerun_gptq8s.log

Later compliance re-run (supplementary evidence only)

The original runs used GPTQ_RESERVE_SECONDS=0.5, which left too little margin for GPTQ Hessian collection. To confirm compliance, I later re-ran all three seeds with GPTQ_RESERVE_SECONDS=8.0 and serialize-before-diagnostic ordering, so the GPTQ Hessians complete within the 600s training-data-access budget.
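The reserve-and-reorder scheme described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: `run_with_budget`, `train_step`, `collect_hessians`, `serialize`, and `diagnostic_eval` are stand-in names, and the real run uses a 600 s budget with an 8 s reserve.

```python
import time

def run_with_budget(train_step, collect_hessians, serialize, diagnostic_eval,
                    budget_s=600.0, reserve_s=8.0):
    """Stop training early enough that GPTQ Hessian collection still fits
    inside the training-data-access budget, then serialize the artifact
    before the (unbudgeted) diagnostic eval."""
    start = time.monotonic()
    deadline = start + budget_s
    # Reserve `reserve_s` at the end of the budget for Hessian collection.
    while time.monotonic() < deadline - reserve_s:
        if not train_step():       # train_step returns False when done
            break
    collect_hessians()             # must finish before `deadline`
    serialize()                    # write the artifact immediately after training
    diagnostic_eval()              # runs outside the budgeted window
    return time.monotonic() - start
```

The key ordering choice is that serialization happens right after Hessian collection, so a slow diagnostic eval can never push the artifact write past the deadline.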

| Seed | Re-run post-TTT BPB | Train time | GPTQ ends by | Artifact (bytes) |
|------|---------------------|------------|--------------|------------------|
| 42 | 1.06083288 | 592.1s | ~595.5s ✅ | 15,949,701 |
| 314 | 1.06090748 | 592.0s | ~595.5s ✅ | 15,951,777 |
| 1234 | 1.06248776 | 592.1s | ~595.5s ✅ | 15,951,968 |
| Mean ± Std | 1.06141 ± 0.00093 | | | |

Comparison to the original 3-seed support package:

  • original mean: 1.06145
  • compliance re-run mean: 1.06141
  • delta: -0.00004

That delta is well within ordinary seed noise, so the compliance fix does not materially change model quality.
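As a quick sanity check on that claim (per-seed BPBs copied from the two tables; "seed noise" here is the larger of the two sample standard deviations):

```python
import statistics

# Seeds 42, 314, 1234 in order, from the two tables above.
original = [1.06128183, 1.06086831, 1.06220261]
rerun    = [1.06083288, 1.06090748, 1.06248776]

delta = statistics.mean(rerun) - statistics.mean(original)
noise = max(statistics.stdev(original), statistics.stdev(rerun))

print(f"delta = {delta:+.5f}, seed std = {noise:.5f}")  # delta = -0.00004, seed std = 0.00093
```

The mean shift is more than an order of magnitude smaller than the per-seed spread, which is what "well within ordinary seed noise" means here.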

Technique / attribution

This remains the PR #1851 technique stack.

What changed in this PR update

GitHub link: #1851

3-seed reproduction of PR openai#1851 (SmearGate BOS document boundary fix).
Code is byte-identical to openai#1851 by @aquariouseworkman.

Results (post-TTT BPB):
  Seed 42:   1.06128  (original openai#1851 author)
  Seed 314:  1.06087  (this submission)
  Seed 1234: 1.06220  (this submission)
  Mean:      1.06145 ± 0.00068

All artifacts < 16,000,000 bytes. All runs < 600s.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aquariouseworkman
Contributor

You are amazing!!!

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
PR openai#1902 (cocohearts) accepted openai#1851/openai#1868 over openai#1736 and excluded openai#1855
only on significance grounds (p=0.325). Our prior 050 line built on openai#1797
which is under validity-cloud per cocohearts. Re-anchor research baseline
on openai#1855's accepted chain.

Pure port — zero modifications. Files copied verbatim from
codemath3000/parameter-golf:submission/sp8192-lqer-bos-smear-fix-9hp-stack
@ 1e43966 into records/track_10min_16mb/2026-04-29_PR1855_Port_Baseline/.

Spec 060B+ will fork exp/060B-* etc. to stack quant-repair / deploy-time
levers (046B-tight SDClip, 046L deploy-time repair, 046G-tighter, etc.)
on this baseline.
Christopher-Lee-McClendon added a commit to Christopher-Lee-McClendon/parameter-golf that referenced this pull request Apr 29, 2026
…an 1.06141

Re-ran all 3 seeds (42, 314, 1234) with GPTQ_RESERVE_SECONDS=8.0 (was 0.5)
to ensure GPTQ hessian collection completes within the 600s training budget.

Code changes:
- Serialize artifact immediately after training (before diagnostic eval)
- Added timing instrumentation (serialize_wallclock, GPTQ sub-timings)

Results (all seeds fresh re-run on RunPod 8×H100 SXM):
  Seed 42:   post-TTT BPB = 1.06083, artifact = 15,949,701, eval = 525.5s
  Seed 314:  post-TTT BPB = 1.06091, artifact = 15,951,777, eval = 429.5s
  Seed 1234: post-TTT BPB = 1.06249, artifact = 15,951,968, eval = 481.2s
  3-seed mean: 1.06141 ± 0.00093

Compliance: training loop ends at ~592s, GPTQ hessians end at ~595.5s (<600s).
RunPod cost: ~$31.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Christopher-Lee-McClendon changed the title from "Record: SmearGate BOS Fix 3-Seed Reproduction — val_bpb 1.06145 (3-seed mean)" to "Record: SmearGate BOS Fix 3-Seed Compliance Re-run — val_bpb 1.06141 (3-seed mean)" on Apr 29, 2026
Collaborator

@cocohearts left a comment


Thanks, this is useful supporting evidence for #1851. Before merge, please make it a standalone support/record package: include all three seed logs in this directory, including the seed 42 log currently referenced from #1851, and keep the README/submission.json phrased as a 3-seed compliance reproduction/support package rather than a separate technique claim. No ML change needed.


@cocohearts dismissed their stale review on April 29, 2026 at 19:20

Duplicate requested-changes review submitted by automation; keeping the earlier requested-changes review active.

…1851

- Add original seed 42 log from PR openai#1851 (@aquariouseworkman)
- Add original seeds 314 and 1234 from independent reproduction
- Rename compliance re-run logs to *_rerun_gptq8s.log for clarity
- Rewrite README as support/compliance package (not a new technique claim)
- Rewrite submission.json with headline val_bpb=1.06145 (original 3-seed mean)
- Document compliance re-run as supplementary evidence (no stat-sig difference)
@Christopher-Lee-McClendon
Contributor Author

Updated, including both the original 3 seeds and my independent compliance re-run. Thanks!

@cocohearts merged commit fdde8dc into openai:main on Apr 30, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations