Record candidate: StageB v2 CaseOps TTT seed42 1.06095913 #2121

Closed
Kbediako wants to merge 3 commits into openai:main from Kbediako:pg-stageb-v2-single-run-record

@Kbediako commented May 1, 2026

Summary

Adds a final-day StageB v2 CaseOps + phased TTT record candidate.

Best submitted single-run score:

  • seed 42 final rank128/p2000 eval-only pass: val_bpb = 1.06095913, val_loss = 2.32177184
  • counted submission bytes: 15,995,233
  • train loop wallclock for the source artifact: 591.924s
  • final TTT eval timer: 546.432s
  • final shell process real: 733.28s
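As a sanity check on the headline pair, the two reported numbers can be related through the usual bits-per-byte conversion. The sketch below assumes that val_loss is a mean per-token cross-entropy in nats and that val_bpb is that loss converted to bits and divided by the average bytes per token. Neither assumption is stated in this PR, so the implied bytes-per-token value is a derived quantity, not a logged one. The function names are illustrative, and the 16,000,000-byte cap is the figure quoted in the follow-up comment in this thread.

```python
import math

VAL_LOSS_NATS = 2.32177184     # reported val_loss (assumed: nats per token)
VAL_BPB = 1.06095913           # reported val_bpb
SUBMISSION_BYTES = 15_995_233  # reported counted submission bytes
BYTE_CAP = 16_000_000          # cap as quoted in the follow-up comment


def bits_per_token(loss_nats: float) -> float:
    """Convert a cross-entropy in nats to bits."""
    return loss_nats / math.log(2)


def implied_bytes_per_token(loss_nats: float, bpb: float) -> float:
    """Bytes per token implied by bpb = bits_per_token / bytes_per_token."""
    return bits_per_token(loss_nats) / bpb


def byte_headroom(used: int, cap: int = BYTE_CAP) -> int:
    """Remaining bytes under the cap (negative means over the cap)."""
    return cap - used


if __name__ == "__main__":
    print(f"bits/token          : {bits_per_token(VAL_LOSS_NATS):.6f}")
    print(f"implied bytes/token : {implied_bytes_per_token(VAL_LOSS_NATS, VAL_BPB):.4f}")
    print(f"byte headroom       : {byte_headroom(SUBMISSION_BYTES)}")
```

If the implied bytes-per-token value disagrees with the tokenizer statistics in the logs, one of the two assumptions above does not hold for this script.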

Important caveats

This is intentionally framed as a single-run candidate, not a clean top-2 3-seed mean.

Official-reference seed confirmation, using the final seed42 score:

  • seed 42: 1.06095913
  • seed 0: 1.06162560
  • seed 1234: 1.06220754
  • 42/0/1234 mean: 1.06159742

Auxiliary forced-seed confirmation, using the final seed42 score:

  • seed 314: 1.06100570
  • seed 999: 1.06247161
  • 42/314/999 mean: 1.06147881
  • best seed999 rescue mean: 1.06146669

The single seed42 score (1.06095913) is below the accepted #1855 leaderboard row mean (1.06107587) under a single-run ordering, but the 3-seed means above do not clear the accepted top-2 3-seed target.
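The means above can be reproduced from the per-seed scores with a few lines of Python. This is a minimal sketch using only the values listed in this comment; the dictionary and function names are illustrative, and the seed999 rescue score is left out because only its mean is reported here.

```python
from statistics import mean

# Per-seed val_bpb values as listed in this comment.
SCORES = {
    42: 1.06095913,
    0: 1.06162560,
    1234: 1.06220754,
    314: 1.06100570,
    999: 1.06247161,
}
ACCEPTED_1855_MEAN = 1.06107587  # accepted #1855 leaderboard row mean


def seed_mean(seeds):
    """Arithmetic mean of val_bpb over the given seeds."""
    return mean(SCORES[s] for s in seeds)


if __name__ == "__main__":
    for label, seeds in [("42/0/1234", (42, 0, 1234)), ("42/314/999", (42, 314, 999))]:
        m = seed_mean(seeds)
        verdict = "below" if m < ACCEPTED_1855_MEAN else "not below"
        print(f"{label} mean = {m:.8f} ({verdict} the #1855 row mean)")
    print(f"seed42 alone below #1855 mean: {SCORES[42] < ACCEPTED_1855_MEAN}")
```

Lower val_bpb is better, so a mean has to come in under the reference value to count as an improvement.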

Timing caveat:

  • Script-reported TTT eval timers are under 600s.
  • Shell process real time for final seed42, seed0, seed314, and the best seed999 rescue TTT exceeds 600s.
  • Artifact production including serialization/GPTQ logged over 600s for seed42, seed0, and seed1234. If reviewers count those full process windows, this should not be accepted as a clean official record.
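The gap between the two timing interpretations can be made explicit by checking each reported seed42 window against the 600s bar. A minimal sketch, assuming every window is compared against the same limit; the dictionary keys are illustrative labels, not field names from the logs.

```python
LIMIT_S = 600.0

# Seed42 timings as reported in the summary of this PR.
SEED42_TIMERS = {
    "train_loop_wallclock": 591.924,
    "ttt_eval_timer": 546.432,
    "shell_process_real": 733.28,
}


def over_limit(timers, limit=LIMIT_S):
    """Return {name: seconds over the limit} for every window that exceeds it."""
    return {name: sec - limit for name, sec in timers.items() if sec > limit}


if __name__ == "__main__":
    violations = over_limit(SEED42_TIMERS)
    for name, sec in SEED42_TIMERS.items():
        status = "OVER" if name in violations else "ok"
        print(f"{name:22s} {sec:8.3f}s  {status}")
```

Under the script-timer reading nothing is flagged; under the process-wall reading the shell window is the one that fails.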

TTT legality caveat:

  • The scored script scores each batch before its LoRA update and uses normalized model probabilities.
  • The global TTT queue is length-sorted, and the phase-prefix global updates have not been verified against a strict original-validation-order rerun.
  • A strict-order variant was prepared locally, but it was not rerun to a submission-grade 8xH100 score.
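The two properties in this caveat, score-before-update and evaluation order, can be illustrated with a toy model. This is a sketch only: an add-one smoothed unigram model stands in for the LoRA-adapted network, none of the names come from train_gpt.py, and the ascending sort is an assumption because the sort direction is not stated here.

```python
import math


class ToyAdaptiveModel:
    """Add-one smoothed unigram model; a stand-in for the adapted network."""

    def __init__(self, vocab_size):
        self.counts = [1.0] * vocab_size

    def probs(self):
        total = sum(self.counts)
        return [c / total for c in self.counts]  # normalized: sums to 1

    def nll(self, batch):
        p = self.probs()
        return -sum(math.log(p[t]) for t in batch)

    def update(self, batch):
        for t in batch:
            self.counts[t] += 1.0


def score_first_eval(batches, vocab_size):
    """Mean NLL per token where each batch is scored before the model adapts on it."""
    model = ToyAdaptiveModel(vocab_size)
    total_nll, total_tokens = 0.0, 0
    for batch in batches:
        total_nll += model.nll(batch)  # 1) score with state that has not seen this batch
        total_tokens += len(batch)
        model.update(batch)            # 2) only then adapt on it
    return total_nll / total_tokens


def preserves_original_order(queue_indices):
    """True when the queue visits documents in their original index order."""
    return list(queue_indices) == sorted(queue_indices)


if __name__ == "__main__":
    docs = [[0, 0, 1], [0, 0, 0, 0], [2]]
    length_sorted = sorted(range(len(docs)), key=lambda i: len(docs[i]))
    print("original order :", score_first_eval(docs, vocab_size=3))
    print("length-sorted  :", score_first_eval([docs[i] for i in length_sorted], vocab_size=3))
    print("queue keeps original order:", preserves_original_order(length_sorted))
```

Both runs are score-first, yet they can report different losses because the adaptation state at each document depends on what was seen earlier. That is why a strict original-order rerun matters for legality.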

Dependency caveat:

  • The Runpod research image used brotli, python-minifier/pyminify, and FlashAttention 3's flash_attn_interface.
  • The record folder includes requirements.txt, but the official runtime should still be checked for the FlashAttention 3 interface.

Method

  • Accepted-family CaseOps + phased score-first LoRA TTT stack.
  • Brotli-only self-contained compression path; no lrzip, apt-get, byte PPM, or casefold path.
  • Scalar/control quantization plus LQER_TOP_K=1.
  • Final seed42 score uses rank128/prefix2000/beta999/wd1 on the same under-cap artifact.
  • NGRAM_MIX_ALPHA=0.
  • PufferLib was inspected for workflow ideas only; no PufferLib code was copied.

Evidence

The folder includes:

  • submission.json
  • README.md
  • train_gpt.py
  • seed42/seed0/seed1234 train and TTT logs
  • final seed42 rank128 TTT log and JSONL summary
  • seed314/seed999 confirmation TTT logs
  • runpod_terminal_summary.json
  • runpod_seed0_1234_summary.json
  • runpod_seed42_rank128_final_summary.jsonl
  • CaseOps support files and tokenizer model

Flywheel node: 1450c4a2-7893-4a09-ab33-c5b4bee7e380

Flywheel executions:

  • seed314/seed999 confirmation: 02221723-3999-4255-8d08-9ced71d8d206
  • seed0/seed1234 official-reference confirmation: 159e253f-b7a8-44dc-bc0b-0084ba29c429
  • final seed42 rank128 eval: 18ec3489-c592-46eb-a2c2-6fe3648e83f8
  • final submission-prep tracking: a87aef4d-8207-49ee-90b7-dc66b28022ef

Validation

  • submission.json parses with python -m json.tool
  • runpod_terminal_summary.json parses with python -m json.tool
  • runpod_seed0_1234_summary.json parses with python -m json.tool
  • runpod_seed42_rank128_final_summary.jsonl parses as JSONL
  • train_gpt.py, prepare_caseops_data.py, and lossless_caps.py compile with python -m py_compile
  • git diff --check passes
  • Runpod pods were stopped and deleted after evidence capture.
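The parse and compile checks above can also be run in one pass from Python. The sketch below is an illustrative equivalent, not the commands actually used: it assumes the files sit in a single record folder (the folder argument is hypothetical) and uses the built-in compile() as a stand-in for python -m py_compile.

```python
import json
from pathlib import Path

JSON_FILES = ["submission.json", "runpod_terminal_summary.json", "runpod_seed0_1234_summary.json"]
JSONL_FILES = ["runpod_seed42_rank128_final_summary.jsonl"]
PY_FILES = ["train_gpt.py", "prepare_caseops_data.py", "lossless_caps.py"]


def parse_json(text):
    """Raise ValueError (json.JSONDecodeError) if text is not valid JSON."""
    return json.loads(text)


def parse_jsonl(text):
    """Parse one JSON value per non-empty line and return the list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def check_python_source(source, name="<source>"):
    """Raise SyntaxError if the source does not compile; nothing is executed."""
    compile(source, name, "exec")


def validate_folder(folder):
    """Run every check against files in `folder`; return {filename: 'ok' or error text}."""
    checks = (
        [(n, parse_json) for n in JSON_FILES]
        + [(n, parse_jsonl) for n in JSONL_FILES]
        + [(n, lambda s, n=n: check_python_source(s, n)) for n in PY_FILES]
    )
    results = {}
    for name, check in checks:
        try:
            check((Path(folder) / name).read_text(encoding="utf-8"))
            results[name] = "ok"
        except (OSError, ValueError, SyntaxError) as exc:
            results[name] = f"{type(exc).__name__}: {exc}"
    return results
```

Calling validate_folder on the record folder returns a per-file status map, which is easier to paste into a PR than seven separate command outputs.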

@Kbediako changed the title from "Record candidate: StageB v2 CaseOps TTT seed42 1.06099764" to "Record candidate: StageB v2 CaseOps TTT seed42 1.06095913" on May 1, 2026
@cocohearts (Collaborator) commented

Leaderboard audit note (pre-cutoff state): I don't think this is record-ready as submitted. The headline is a single seed only; the disclosed 3-seed means are worse than PR #1855, and the final process timing evidence exceeds 600s for key runs. This needs a clean matching 3-seed under-cap package to be considered.

@Kbediako (Author) commented May 2, 2026

Thanks for the audit note. I agree with the conclusion and am withdrawing this PR as a record candidate.

The current evidence is not a clean matching 3-seed under-cap package:

  • The headline 1.06095913 BPB is a single seed42 result.
  • The disclosed 3-seed evidence does not clear the current clean frontier: official-reference mean 1.06159742; best mixed auxiliary mean about 1.06146669, and that is not a same-profile clean record package.
  • Under a strict process-wall interpretation, the final timing evidence does not clear the 600s bar; key eval/process or train-to-artifact windows exceed 600s.
  • Strict original-order TTT was not reproduced for the final family, and the CaseOps data-lineage/split proof needs cleaner documentation.

I am not pushing a cosmetic cleanup commit to this record-track branch because it would not create the missing evidence. The local artifacts and logs remain useful research evidence, but this PR should not be reviewed as leaderboard-ready.

A future record attempt should be a separate clean package with same-code 3-seed evidence, full-validation coverage, artifact <16,000,000 bytes, train/eval process wall under 600s, and explicit legality/data-lineage proof.

@Kbediako closed this on May 2, 2026