Record candidate: StageB v2 CaseOps TTT seed42 1.06095913 by Kbediako · Pull Request #2121 · openai/parameter-golf

Kbediako · 2026-05-01T13:51:10Z

Summary

Adds a final-day StageB v2 CaseOps + phased TTT record candidate.

Best submitted single-run score:

seed 42 final rank128/p2000 eval-only pass: val_bpb = 1.06095913, val_loss = 2.32177184
counted submission bytes: 15,995,233
train loop wallclock for the source artifact: 591.924s
final TTT eval timer: 546.432s
final shell process real: 733.28s

Important caveats

This is intentionally framed as a single-run candidate, not a clean top-2 3-seed mean.

Official-reference seed confirmation, using the final seed42 score:

seed 42: 1.06095913
seed 0: 1.06162560
seed 1234: 1.06220754
42/0/1234 mean: 1.06159742

Auxiliary forced-seed confirmation, using the final seed42 score:

seed 314: 1.06100570
seed 999: 1.06247161
42/314/999 mean: 1.06147881
best seed999 rescue mean: 1.06146669

The seed42 score is below the accepted #1855 leaderboard row mean (1.06107587) under a single-run ordering, but it does not clear the accepted top-2 3-seed target.
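The means quoted above can be rechecked with a few lines of Python. This is a minimal sketch, not part of the record folder: the scores are copied from the seed lists in this description, and the names `scores`, `seed_mean`, and `ACCEPTED_1855_MEAN` are illustrative assumptions.

```python
from statistics import fmean

# val_bpb per seed, copied from the seed lists in this PR description.
scores = {
    42: 1.06095913,
    0: 1.06162560,
    1234: 1.06220754,
    314: 1.06100570,
    999: 1.06247161,
}

# Accepted #1855 leaderboard row mean, as quoted in this PR.
ACCEPTED_1855_MEAN = 1.06107587


def seed_mean(seeds):
    """Arithmetic mean of val_bpb over the given seeds."""
    return fmean(scores[s] for s in seeds)


official = seed_mean([42, 0, 1234])
auxiliary = seed_mean([42, 314, 999])

print(f"42/0/1234 mean:  {official:.8f}")   # 1.06159742
print(f"42/314/999 mean: {auxiliary:.8f}")  # 1.06147881

# Lower is better: seed42 alone is below the #1855 mean, both 3-seed means are not.
print(scores[42] < ACCEPTED_1855_MEAN)   # True
print(official < ACCEPTED_1855_MEAN)     # False
print(auxiliary < ACCEPTED_1855_MEAN)    # False
```

Both 3-seed means sit above the #1855 row mean, which is why the headline only holds under a single-run ordering.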

Timing caveat:

Script-reported TTT eval timers are under 600s.
Shell process real time for final seed42, seed0, seed314, and the best seed999 rescue TTT exceeds 600s.
Artifact production including serialization/GPTQ logged over 600s for seed42, seed0, and seed1234. If reviewers count those full process windows, this should not be accepted as a clean official record.
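To make the timing caveat concrete, the sketch below compares the seed42 timings from the Summary against a 600s limit. It is an illustration only: the helper `over_limit` and the dictionary layout are assumptions, and which window reviewers actually count is the open question raised above.

```python
LIMIT_S = 600.0

# Seed42 timings in seconds, copied from the Summary section of this PR.
timings = {
    "train loop wallclock": 591.924,
    "final TTT eval timer": 546.432,
    "final shell process real": 733.28,
}


def over_limit(timings, limit=LIMIT_S):
    """Return {window name: overshoot in seconds} for windows above the limit."""
    return {
        name: round(seconds - limit, 3)
        for name, seconds in timings.items()
        if seconds > limit
    }


print(over_limit(timings))  # {'final shell process real': 133.28}
```

Under the script-reported timers nothing exceeds the limit; under a process-wall interpretation the final shell run is over by roughly 133s.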

TTT legality caveat:

The scored script scores each batch before its LoRA update and uses normalized model probabilities.
The global TTT queue is length-sorted, and the phase-prefix global updates have not been verified against a strict original-validation-order rerun.
A strict-order variant was prepared locally, but it was not rerun to a submission-grade 8xH100 score.
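The ordering concern can be illustrated with a toy example. The sketch below is not the scored script: a count-based character model stands in for the LoRA update, and `VOCAB`, `score_then_update`, and `docs` are hypothetical. It only shows the score-first pattern (each batch is scored with normalized probabilities before the model adapts on it) and that the processing order changes the resulting score.

```python
import math
from collections import Counter

VOCAB = "abc"  # hypothetical toy vocabulary


def score_then_update(batches):
    """Score each batch under the current model, then adapt on it.

    The count-based model is a stand-in for the LoRA update. The point is
    the ordering: no batch influences the probabilities used to score it.
    Returns the mean negative log-likelihood per token, in nats.
    """
    counts = Counter({ch: 1 for ch in VOCAB})  # add-one prior
    total_nll, total_tokens = 0.0, 0
    for batch in batches:
        norm = sum(counts.values())  # probabilities counts[ch] / norm sum to 1
        total_nll += -sum(math.log(counts[ch] / norm) for ch in batch)
        total_tokens += len(batch)
        counts.update(batch)  # adaptation happens only after scoring
    return total_nll / total_tokens


docs = ["aab", "c", "abab", "cc"]
original_order = score_then_update(docs)
length_sorted = score_then_update(sorted(docs, key=len))

print(f"original order: {original_order:.4f}")
print(f"length-sorted:  {length_sorted:.4f}")
```

Because each score depends on what the model has already adapted on, a length-sorted queue and the original validation order are not interchangeable without a rerun, which is the gap the caveat describes.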

Dependency caveat:

The Runpod research image used brotli, python-minifier/pyminify, and FlashAttention 3's flash_attn_interface.
The record folder includes requirements.txt, but the official runtime should still be checked for the FlashAttention 3 interface.

Method

Accepted-family CaseOps + phased score-first LoRA TTT stack.
Brotli-only self-contained compression path; no lrzip, apt-get, byte PPM, or casefold path.
Scalar/control quantization plus LQER_TOP_K=1.
Final seed42 score uses rank128/prefix2000/beta999/wd1 on the same under-cap artifact.
NGRAM_MIX_ALPHA=0.
PufferLib was inspected for workflow ideas only; no PufferLib code was copied.

Evidence

The folder includes:

submission.json
README.md
train_gpt.py
seed42/seed0/seed1234 train and TTT logs
final seed42 rank128 TTT log and JSONL summary
seed314/seed999 confirmation TTT logs
runpod_terminal_summary.json
runpod_seed0_1234_summary.json
runpod_seed42_rank128_final_summary.jsonl
CaseOps support files and tokenizer model

Flywheel node: 1450c4a2-7893-4a09-ab33-c5b4bee7e380

Flywheel executions:

seed314/seed999 confirmation: 02221723-3999-4255-8d08-9ced71d8d206
seed0/seed1234 official-reference confirmation: 159e253f-b7a8-44dc-bc0b-0084ba29c429
final seed42 rank128 eval: 18ec3489-c592-46eb-a2c2-6fe3648e83f8
final submission-prep tracking: a87aef4d-8207-49ee-90b7-dc66b28022ef

Validation

submission.json parses with python -m json.tool
runpod_terminal_summary.json parses with python -m json.tool
runpod_seed0_1234_summary.json parses with python -m json.tool
runpod_seed42_rank128_final_summary.jsonl parses as JSONL
train_gpt.py, prepare_caseops_data.py, and lossless_caps.py compile with python -m py_compile
git diff --check passes
Runpod pods were stopped and deleted after evidence capture.
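The JSONL check listed above has no one-line equivalent of `python -m json.tool`, so a small parser is one way to do it. This is a sketch under assumptions: `parse_jsonl` is a hypothetical helper, and the sample rows use an invented schema because the real summary file's fields are not shown in this PR.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text: one JSON value per non-blank line."""
    records = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: {exc}") from exc
    return records


# Hypothetical sample rows; field names are assumptions, not the real schema.
sample = (
    '{"seed": 42, "val_bpb": 1.06095913}\n'
    '{"seed": 0, "val_bpb": 1.06162560}\n'
)
rows = parse_jsonl(sample)
print(len(rows), rows[0]["seed"])  # 2 42
```

For a file on disk, the same function can be called on the result of `pathlib.Path(path).read_text()`.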

cocohearts · 2026-05-02T18:15:19Z

Leaderboard audit note (pre-cutoff state): I don't think this is record-ready as submitted. The headline is a single seed only; the disclosed 3-seed means are worse than PR #1855, and the final process timing evidence exceeds 600s for key runs. This needs a clean matching 3-seed under-cap package to be considered.

Kbediako · 2026-05-02T22:00:11Z

Thanks for the audit note. I agree with the conclusion and am withdrawing this PR as a record candidate.

The current evidence is not a clean matching 3-seed under-cap package:

The headline 1.06095913 BPB is a single seed42 result.
The disclosed 3-seed evidence does not clear the current clean frontier: official-reference mean 1.06159742; best mixed auxiliary mean about 1.06146669, and that is not a same-profile clean record package.
The final timing evidence does not clear the 600s bar under a strict process-wall interpretation; key eval/process or train-to-artifact windows exceed 600s.
Strict original-order TTT was not reproduced for the final family, and the CaseOps data-lineage/split proof needs cleaner documentation.

I am not pushing a cosmetic cleanup commit to this record-track branch because it would not create the missing evidence. The local artifacts and logs remain useful research evidence, but this PR should not be reviewed as leaderboard-ready.

A future record attempt should be a separate clean package with same-code 3-seed evidence, full-validation coverage, artifact <16,000,000 bytes, train/eval process wall under 600s, and explicit legality/data-lineage proof.

Kbediako added 2 commits May 1, 2026 23:50

Add StageB v2 CaseOps TTT record candidate

f68fa31

Add official-reference seed evidence to StageB record candidate

564fb9e

leon2k2k2k mentioned this pull request May 1, 2026

Train/val data leakage in CaseOps records — prepare_caseops_data.py default overlaps 80% of val docs with training data #2127

Open

Update StageB record candidate final seed42 score

ab8ea3d

Kbediako changed the title ~~Record candidate: StageB v2 CaseOps TTT seed42 1.06099764~~ Record candidate: StageB v2 CaseOps TTT seed42 1.06095913 May 1, 2026

Kbediako closed this May 2, 2026

Kbediako wants to merge 3 commits into openai:main from Kbediako:pg-stageb-v2-single-run-record