Record candidate: StageB v2 CaseOps TTT seed42 1.06095913 #2121
Kbediako wants to merge 3 commits into openai:main from
Leaderboard audit note (pre-cutoff state): I don't think this is record-ready as submitted. The headline is a single seed only; the disclosed 3-seed means are worse than PR #1855, and the final process timing evidence exceeds 600s for key runs. This needs a clean matching 3-seed under-cap package to be considered.

Thanks for the audit note. I agree with the conclusion and am withdrawing this PR as a record candidate. The current evidence is not a clean matching 3-seed under-cap package:
I am not pushing a cosmetic cleanup commit to this record-track branch because it would not create the missing evidence. The local artifacts and logs remain useful research evidence, but this PR should not be reviewed as leaderboard-ready. A future record attempt should be a separate clean package with same-code 3-seed evidence, full-validation coverage, artifact
Summary
Adds a final-day StageB v2 CaseOps + phased TTT record candidate.
Best submitted single-run score:
val_bpb = 1.06095913
val_loss = 2.32177184
15,995,233
591.924s
546.432s
real: 733.28s
Important caveats
This is intentionally framed as a single-run candidate, not a clean top-2 3-seed mean.
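The difference between a single-run ordering and a 3-seed mean can be made concrete with a short check. The sketch below is illustrative only: it uses the four official-reference values and the #1855 row mean (1.06107587) quoted in this description, and it assumes that the first three values are per-seed scores (the first being seed42) and the fourth is their mean, which the arithmetic supports. The names `ROW_MEAN_1855`, `seed_scores`, and `beats` are hypothetical, and lower val_bpb is treated as better.

```python
from statistics import mean

# Accepted #1855 leaderboard row mean, as quoted in this description.
ROW_MEAN_1855 = 1.06107587

# Official-reference values from this description.
# Assumption: the first three are per-seed scores (seed42 first),
# and the fourth reported value is their mean.
seed_scores = [1.06095913, 1.06162560, 1.06220754]
reported_mean = 1.06159742


def beats(score: float, target: float) -> bool:
    """Lower val_bpb is better, so a score beats a target if it is smaller."""
    return score < target


single_run = seed_scores[0]
three_seed_mean = mean(seed_scores)

if __name__ == "__main__":
    print(f"single run      : {single_run:.8f}  beats row mean: {beats(single_run, ROW_MEAN_1855)}")
    print(f"3-seed mean     : {three_seed_mean:.8f}  beats row mean: {beats(three_seed_mean, ROW_MEAN_1855)}")
    print(f"matches reported: {round(three_seed_mean, 8) == reported_mean}")
```

Under these assumptions the single seed42 run falls below the row mean while the 3-seed mean does not, which is why the submission is framed as a single-run candidate only.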
Official-reference seed confirmation, using the final seed42 score:
1.06095913
1.06162560
1.06220754
1.06159742
Auxiliary forced-seed confirmation, using the final seed42 score:
1.06100570
1.06247161
1.06147881
1.06146669
The seed42 score is below the accepted #1855 leaderboard row mean (1.06107587) under a single-run ordering, but it does not clear the accepted top-2 3-seed target.
Timing caveat:
realtime for final seed42, seed0, seed314, and the best seed999 rescue TTT exceeds 600s.
TTT legality caveat:
Dependency caveat:
brotli, python-minifier/pyminify, and FlashAttention 3's flash_attn_interface.
requirements.txt, but the official runtime should still be checked for the FlashAttention 3 interface.
Method
lrzip, apt-get, byte PPM, or casefold path.
LQER_TOP_K=1.
NGRAM_MIX_ALPHA=0.
Evidence
The folder includes:
submission.json
README.md
train_gpt.py
runpod_terminal_summary.json
runpod_seed0_1234_summary.json
runpod_seed42_rank128_final_summary.jsonl
Flywheel node:
1450c4a2-7893-4a09-ab33-c5b4bee7e380
Flywheel executions:
02221723-3999-4255-8d08-9ced71d8d206
159e253f-b7a8-44dc-bc0b-0084ba29c429
18ec3489-c592-46eb-a2c2-6fe3648e83f8
a87aef4d-8207-49ee-90b7-dc66b28022ef
Validation
submission.json parses with python -m json.tool
runpod_terminal_summary.json parses with python -m json.tool
runpod_seed0_1234_summary.json parses with python -m json.tool
runpod_seed42_rank128_final_summary.jsonl parses as JSONL
train_gpt.py, prepare_caseops_data.py, and lossless_caps.py compile with python -m py_compile
git diff --check passes
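
The file-level checks listed under Validation could be reproduced with a small script along the following lines. This is a minimal sketch, not the script used for this PR: it assumes the file names given in this description and that it is run from the submission folder. The helper names `check_json`, `check_jsonl`, `check_compiles`, and `run_checks` are hypothetical. It uses the standard-library `json` and `py_compile` modules in place of the `python -m json.tool` and `python -m py_compile` commands, and leaves `git diff --check` to the shell.

```python
import json
import py_compile
from pathlib import Path

# File names as listed in this description; adjust to the actual folder layout.
JSON_FILES = [
    "submission.json",
    "runpod_terminal_summary.json",
    "runpod_seed0_1234_summary.json",
]
JSONL_FILES = ["runpod_seed42_rank128_final_summary.jsonl"]
PY_FILES = ["train_gpt.py", "prepare_caseops_data.py", "lossless_caps.py"]


def check_json(path) -> bool:
    """Return True if the file parses as a single JSON document."""
    try:
        json.loads(Path(path).read_text(encoding="utf-8"))
        return True
    except (OSError, ValueError):
        return False


def check_jsonl(path) -> bool:
    """Return True if every non-blank line parses as JSON on its own."""
    try:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if line.strip():
                json.loads(line)
        return True
    except (OSError, ValueError):
        return False


def check_compiles(path) -> bool:
    """Return True if the file byte-compiles without a syntax error."""
    try:
        py_compile.compile(str(path), doraise=True)
        return True
    except (OSError, py_compile.PyCompileError):
        return False


def run_checks(folder=".") -> dict:
    """Run every check and map each file name to a pass/fail flag."""
    root = Path(folder)
    results = {}
    for name in JSON_FILES:
        results[name] = check_json(root / name)
    for name in JSONL_FILES:
        results[name] = check_jsonl(root / name)
    for name in PY_FILES:
        results[name] = check_compiles(root / name)
    return results


if __name__ == "__main__":
    for name, ok in run_checks().items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
```

These checks only confirm that the files are well-formed; they say nothing about the seed count or timing evidence that the audit note asks for.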