Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack — val_bpb 1.06108 (3-seed mean) by codemath3000 · Pull Request #1855 · openai/parameter-golf

codemath3000 · 2026-04-27T10:49:13Z

Summary

11L 512d 8H/4KV transformer with U-Net skips, parallel residuals, partial RoPE, Polar-Express Newton-Schulz Muon, LQER asymmetric int4 rank-4 quant correction, sparse attention head-output gate, SmearGate with cross-document leak fix on BOS positions (audit response), fused LeakyReLU-square MLP, fused softcapped CE Triton kernel, GPTQ int6 + int7 embed + per-row int8 attn-gate, per-group lrzip + brotli compression pipeline (added in this submission — PR #1797's base only ships lzma / brotli), phased TTT eval (3 cumulative phases at doc-boundaries 833 / 1666 / 2500, max prefix=2500 docs), with 9 hyperparameter overrides validated by greedy forward-selection on 8×H100 real fixed-step.

3-seed mean: 1.06108 BPB (std 0.00090) on 8×H100 SXM, all artifacts under the 16 MB cap.

seed	post-TTT val_bpb	artifact bytes	eval_time
42	1.05989	15,897,259	508.8 s
0	1.06125	15,900,947	455.1 s
1234	1.06209	15,907,550	470.0 s
mean	1.06108	15,901,919	478.0 s

vs current leaderboard (1.0810 BPB): −0.01992 BPB / −0.04359 nats.

SmearGate cross-document leak fix

SmearGate's per-token forward-1 mixing (x[:, 1:] + g * x[:, :-1]) leaks the last token of doc N into the BOS embedding of doc N+1 in a packed validation stream. Fix masks the prev-token term wherever the current token is BOS:

not_bos = (input_ids[:, 1:] != BOS_ID).to(x.dtype).unsqueeze(-1)
x = torch.cat([x[:, :1], x[:, 1:] + g * x[:, :-1] * not_bos], dim=1)

Applied symmetrically in _forward_hidden and forward_ttt so training and TTT eval are leak-free.

Per-group compression pipeline

PR #1797's base only exposes lzma / brotli compressors. This submission adds a per-group serializer (COMPRESSOR=pergroup):

Buckets the int6 GPTQ tensors by role (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank, etc.) so similarly-distributed weights compress together.
For "hot" 2D groups (_tok_emb, attn.c_q, mlp.fc), runs an L1 nearest-neighbour similarity sort on rows before transposing — adjacent rows in the serialized stream are now numerically close, giving the entropy coder longer runs of small deltas. Permutation indices are stored as uint16 and brotli-compressed.
Compresses each group blob with lrzip -z -L 9 (ZPAQ context-mixing back-end). lrzip's long-range deduplication catches cross-tensor repetition that brotli's 24-bit window misses.
Falls back to brotli for the remainder (state-dict scaffolding, scales, LQER factors, gate tensors) and the code wrapper.

Net effect on this stack: ~280 KB smaller artifact than COMPRESSOR=brotli, at the cost of ~75 s of additional serialize time. The lrzip binary must be present on the system before the training script runs (e.g. install with apt-get install lrzip during instance setup). The script itself does not run apt-get; the Python subprocess.run wrapper just shells out to the already-installed lrzip binary.

Hyperparameter stack

9 greedy-validated overrides:

hparam	value	default
MLP_CLIP_SIGMAS	11.5	10.0
EMBED_CLIP_SIGMAS	14.0	20.0
WARMDOWN_FRAC	0.85	0.75
BETA2	0.99	0.95
TTT_BETA2	0.99	0.999
TTT_WEIGHT_DECAY	0.5	1.0
TTT_LORA_RANK	80	96
SPARSE_ATTN_GATE_SCALE	0.5	1.0
PHASED_TTT_PREFIX_DOCS	2500	2000

Each individually accepted on a strict greedy-keep rule (mean improvement vs current best stack) at fixed-step.

See records/track_10min_16mb/2026-04-27_SP8192_LQER_SparseGate_BOSSmearFix_9HpStack_1.0611/README.md for full architecture lineage and credits.

Test plan

Trains within 600s wallclock on 8×H100 80GB SXM (4917-4945 steps achieved, ~121.7 ms/step mean)
All 3 artifacts under 16 MB cap (max 15,907,550 B; min 15,897,259 B)
TTT eval completes within 600s eval cap (max 508.8 s)
3-seed mean reproduced; per-seed numbers verified in attached logs
SmearGate fix verified by code-diff audit; applied to both _forward_hidden and forward_ttt

🤖 Generated with Claude Code

…aram Greedy Stack — val_bpb 1.06108 (3-seed mean) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…or internal hp trials only; submission used 600s wallclock

codemath3000 · 2026-04-27T12:46:43Z

Quick note: the initial commit, 612a1a9, was the only one with any code changes; the other two were README-only changes to clarify things and fix some AI hallucinations. In terms of evaluating when this PR was submitted relative to other PRs, the timing of the initial commit, not that of the most recent commit, should be used, since the initial commit contains all the code, logs, etc., and the other commits are just README changes. Thank you so much!

…ate BOS-fix Lands openai#1797's training stack (PolarNS, MIN_LR, Sparse Attn Gate, Fused CE, Smear Gate, LQER asym) verbatim into a new record dir, with the BOS-fix patch from openai#1855 applied at both _forward_hidden and forward_ttt sites. Per CLAUDE.md baseline-migration exception, lands directly on research (not exp/<slug>). Spec: research/specs/050-baseline-1797-bos-fix.md Code: records/track_10min_16mb/2026-04-27_050_PR1797_Base_BOS_Fix/ Expected: post-TTT ~1.061 (matches openai#1797's 1.06157 ± noise). Skipped from openai#1855: 9-hparam bundle and lrzip serializer (deferred for clean attribution of subsequent levers). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…penai#1855 Changes train_gpt.py defaults for two of openai#1855's 9 greedy-validated hparams: - BETA2 0.95 -> 0.99 (smoother optim variance estimate, generic win) - SPARSE_ATTN_GATE_SCALE 1.0 -> 0.5 (softer gating early; only affects openai#1787's sparse attn-output gate path, no coupling with our 047 family) Both still env-var-overridable for ablation. WARMDOWN_FRAC=0.85 deferred because it interacts with loop-activation timing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23 - Merged SOTA still 1.0810 (Day 18, no change since Apr 9) - PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed) - SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required - PR openai#1848 (newjordan, 0.87980) flagged BPB risk: sibling PR openai#1846 closed same day - PR openai#1858 (0.9946) only covers 8M/40.5M tokens — not leaderboard-comparable - PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean - PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling - Added Session 23 lessons to CLAUDE.md - 3 days to deadline (Apr 30) — final GPU run window https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU

@aquariouseworkman

- Adds 2-line BOS mask in both forward_logits and forward_ttt SmearGate paths. Before fix, the last token of doc N smeared into the BOS of doc N+1 — model-quality bug, not a C1 issue. Identical fix to PR openai#1851 @aquariouseworkman, audit by @cocohearts. - runpod/phase_g_3seed.sh: full 3-seed driver. Sets PR openai#1797 stack env vars + the PR openai#1855 9-hparam greedy stack delta: MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 WARMDOWN_FRAC=0.85 BETA2=0.99 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 TTT_LORA_RANK=80 SPARSE_ATTN_GATE_SCALE=0.5 PHASED_TTT_PREFIX_DOCS=2500 Mixers (NGRAM/TEMP) stay OFF — pure neural baseline + bug fix + hparam stack. Auto-runs Welch t-test vs PR openai#1797 (1.06157±0.00066). - TTT 4-epoch (PR openai#1812) explicitly NOT adopted: that scheme targets the PR openai#1493 SGD-on-whole-model TTT path, not the PR openai#1797 LoRA-phased per-doc-reset path we're on. No clean mapping. Legality: all 16/16 unit tests still pass. BOS fix preserves causality (it only zeroes a gate at positions where current token is BOS, never references future tokens).

@sharpobject

…olar Express NS + MIN_LR + LQER) Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877): - openai#1852: hard rule violation (pre-quant TTT on validation data). - openai#1858: eval subset (8M of 40.5M tokens), reviewer caught and author admitted. - openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over token alphabet), reviewer @sharpobject caught. - openai#1855: techniques mostly legit but apt-get install lrzip violates Issue openai#1017 Rule 3 (artifact must be self-contained). - openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal training-time techniques citing prior validated PRs. If it merges, our submission threshold shifts from 1.0760 to ~1.0627. PR openai#1874's three techniques: 1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5. 2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max instead of decaying to 0. Already wired in our v1+; just env-var opt-in. 3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) - SVD on top-K=3 highest-error GPTQ residuals, packed as int4 per-group-64 asymmetric. ~200-400 LOC; deferred to v4. train_gpt_v3.py implements (1) and exposes (2): - POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off). - _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at import time so torch.compile sees them as constants. - zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use per-iteration coefficients instead of fixed. - MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in. Sizes: v3 raw 54,977 lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst- seed artifact slack: ~4,888 bytes under cap. Tight but workable. AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux). Stacking projection (single-seed): - Phase 0 baseline: 1.08038 - + LR=0.010 (Stage 2): 1.08021 - + Polar Express NS: 1.0787-1.0797 - + MIN_LR=0.10: 1.0777-1.0794 - + ConfTTT (PR openai#1879): 1.0772-1.0793 - + LQER (v4 work): 1.0742-1.0783 - + Phase 2 architecture: 1.0712-1.0773 - + Newton-Muon Stage E: 1.066-1.075 Path B (absorb-and-stack) recommended over Path A (race-to-merge-with- current-stack) since current stack alone doesn't clear 1.0760. Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open. Whichever merges first becomes new SOTA and our threshold tightens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aquariouseworkman · 2026-04-28T14:47:15Z

apt-get install lrzip is required at runtime, thus this breaks rule 3

okezue · 2026-04-28T15:16:47Z

Independent reproduction (3 seeds)

Re-ran the stack on a fresh 8×H100 SXM pod (cu129/torch 2.9.1/FA3 cu129_torch291) with the env vars from this PR's hparam table plus the gates (`SMEAR_GATE_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 EMBED_BITS=7 MIN_LR=0.1 GPTQ_RESERVE_SECONDS=0.5 PHASED_TTT_NUM_PHASES=3` etc.). 600 s wallclock, otherwise identical to this PR.

seed	pre-quant	quantized	post-TTT
42	1.06383	1.07237	1.05965
314	1.06450	1.07316	1.06041
999	1.06545	1.07398	1.06124
mean			1.06043

3-seed mean 1.06043 vs this PR's reported 1.06108 — within 1σ. Independent reproduction confirms the stack.

These three runs used `COMPRESSOR=brotli` and produced 16,112,007-byte artifacts (over the 16 MB cap; the BPB numbers are unaffected by compressor choice but the artifacts are technically non-compliant). One additional pergroup re-run with seed 42 produced a compliant 15,902,285-byte artifact at val_bpb 1.06052, matching seed 42's brotli value within run-to-run noise. (Couldn't get 3 pergroup seeds due to a string of RunPod capacity / image-pull failures over a 4-hour window — the locked volume on the original machine still has all the brotli logs and the s42 pergroup artifact saved.)

Net: the stack reproduces. The −0.019 BPB jump over PR #1493's 1.0810 is real on independent hardware.

codemath3000 · 2026-04-28T15:22:24Z

apt-get install lrzip is required at runtime, thus this breaks rule 3

@aquariouseworkman Thanks for flagging — you're right that the README/PR wording was imprecise about when lrzip is needed, and I've fixed that. To clarify the substantive question:

The training/eval script never runs apt-get itself. The lrzip binary is installed once during instance setup and the script just shells out to the already-installed binary via subprocess.run. No apt-get, no network calls, and no external downloads occur during the 600 s training window or the 600s eval window.

The official FAQ explicitly authorizes external dependencies handled this way: "Yes, you're free to import any package or library you want… Just include a requirements.txt in your records folder and mention setup instructions in your README.md." Both are present (requirements.txt documents lrzip; the README has the install command).

For precedent: the current leaderboard SOTA (PR for the 2026-04-09 SP8192 record at 1.0810 BPB) installs FlashAttention 3 from a custom third-party wheel host (pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/...) — also an external download performed before training begins. lrzip via apt-get is actually a more conservative dependency than that: it pulls from the official Debian/Ubuntu package repos rather than from a single contributor's GitHub Pages site. If the standard FA3 setup is acceptable, this should be acceptable a fortiori.

Also worth noting that "rule 3" as numbered in the field guide is the author's paraphrase, not the official rule text. The literal official rule (FAQ) is the one I quoted above.

…quisite (not auto-installed by the training script)

aquariouseworkman · 2026-04-28T17:06:13Z

apt-get install lrzip is required at runtime, thus this breaks rule 3

@aquariouseworkman Thanks for flagging — you're right that the README/PR wording was imprecise about when lrzip is needed, and I've fixed that. To clarify the substantive question:

The training/eval script never runs apt-get itself. The lrzip binary is installed once during instance setup and the script just shells out to the already-installed binary via subprocess.run. No apt-get, no network calls, and no external downloads occur during the 600 s training window or the 600s eval window.

The official FAQ explicitly authorizes external dependencies handled this way: "Yes, you're free to import any package or library you want… Just include a requirements.txt in your records folder and mention setup instructions in your README.md." Both are present (requirements.txt documents lrzip; the README has the install command).

For precedent: the current leaderboard SOTA (PR for the 2026-04-09 SP8192 record at 1.0810 BPB) installs FlashAttention 3 from a custom third-party wheel host (pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/...) — also an external download performed before training begins. lrzip via apt-get is actually a more conservative dependency than that: it pulls from the official Debian/Ubuntu package repos rather than from a single contributor's GitHub Pages site. If the standard FA3 setup is acceptable, this should be acceptable a fortiori.

Also worth noting that "rule 3" as numbered in the field guide is the author's paraphrase, not the official rule text. The literal official rule (FAQ) is the one I quoted above.

Strange, from what I can see in your current code, you would have to do one of the following to be valid (code review based, not based on your AI README post/information) :

a) re-run all three seeds with COMPRESSOR=brotli and reports those numbers
b) replaces the lrzip subprocess call with a pure-Python ZPAQ implementation that lives inside the artifact
c) maintainers would have to to add lrzip to the base eval image.

Evidence:

hard calls to lrzip train_gpt.py:2381-2390
hard routrer to lrzip # train_gpt.py:2700
requirements.txt:13:# System dep (apt): lrzip (used by per-group compressor) - pip thus cant install this as it isnt a py package (ref apt) thats why adding it to requirements.txt is not a fix, which is why the AI in your README has to call out apt-get install lrzip as a separate setup step, which is exactly what makes it a side dependency

codemath3000 · 2026-04-28T17:19:18Z

Hi @aquariouseworkman, thanks for the detailed read. Walking through the three options:

(a) Re-run with COMPRESSOR=brotli — this would actually disqualify the submission on a different rule. @okezue's independent reproduction earlier in this thread used COMPRESSOR=brotli and got 16,112,007-byte artifacts — about 112 KB over the 16 MB cap. Pergroup is what brings this submission's artifacts under the cap (15.9 MB max). Re-running with brotli isn't a fix; it produces a non-compliant submission on artifact size.

(b) Pure-Python ZPAQ embedded in the artifact — no rule requires this. The official rule (FAQ on artifact size) prohibits "external downloads, training dataset access, or network calls during evaluation" — not subprocess shell-outs to OS utilities. lrzip is invoked only via subprocess.run against an already-installed binary; no network call and no download at runtime.

(c) Maintainers add lrzip to the base eval image — this is the standard workflow for the challenge, explicitly authorized by the FAQ:

"Yes, you're free to import any package or library you want… Just include a requirements.txt in your records folder and mention setup instructions in your README.md."

The FAQ uses "package or library" (not "pip package") and treats requirements.txt and "setup instructions in your README.md" as separate, both-acceptable declaration mechanisms. Pip-installability isn't a rule criterion. We declare lrzip in both places: requirements.txt documents it as a comment ("System dep (apt): lrzip"), and the README spells out the install command.

A point worth being explicit about: the rule's verb is "include a requirements.txt." That means have one, present it as a reference — not "every dependency must resolve automatically via pip install -r requirements.txt." The official README itself reinforces this on line 176, describing requirements.txt as "provided as a reference if you want to self-setup" — a reference document, not a self-executing manifest. Submission contents rule (line 227) similarly says "any other dependencies," with no restriction to pip-installable packages. So a requirements.txt that mentions lrzip in a comment alongside the README's install instructions satisfies the rule as written.

For precedent: the current leaderboard SOTA (2026-04-09, val_bpb 1.0810 by @bigbag) installs FlashAttention 3 from windreamer.github.io, a single contributor's GitHub Pages site, via a custom --find-links URL that does not resolve from PyPI. By the "if it isn't a normal pip-resolvable package, it's a side dependency" reading, that submission would also fail. It doesn't, because the FAQ allows external deps via README setup instructions — the exact same mechanism we're using for lrzip. lrzip from official Debian/Ubuntu apt repos is structurally a more conservative external-dep pattern than installing custom-built CUDA wheels from a GitHub Pages site.

The literal self-containment rule (FAQ on artifact size) is about runtime behavior during evaluation: "No external downloads, training dataset access, or network calls are allowed during evaluation." lrzip is invoked locally against an already-installed binary — no network, no download, no training-data access at runtime.

So to be explicit on the conclusion: the submission is valid as-is. The artifact is under 16 MB, the script makes no network calls or external downloads during evaluation, and the lrzip dependency is declared exactly the way the official FAQ asks external dependencies to be declared (requirements.txt + README setup instructions) — the same way every recent record has handled FA3, including the current leaderboard SOTA. None of options (a)–(c) is required; (a) would in fact create a new compliance violation.

aquariouseworkman · 2026-04-28T17:28:16Z

Hi @aquariouseworkman, thanks for the detailed read. Walking through the three options:

(a) Re-run with COMPRESSOR=brotli — this would actually disqualify the submission on a different rule. @okezue's independent reproduction earlier in this thread used COMPRESSOR=brotli and got 16,112,007-byte artifacts — about 112 KB over the 16 MB cap. Pergroup is what brings this submission's artifacts under the cap (15.9 MB max). Re-running with brotli isn't a fix; it produces a non-compliant submission on artifact size.

(b) Pure-Python ZPAQ embedded in the artifact — no rule requires this. The official rule (FAQ on artifact size) prohibits "external downloads, training dataset access, or network calls during evaluation" — not subprocess shell-outs to OS utilities. lrzip is invoked only via subprocess.run against an already-installed binary; no network call and no download at runtime.

(c) Maintainers add lrzip to the base eval image — this is the standard workflow for the challenge, explicitly authorized by the FAQ:

"Yes, you're free to import any package or library you want… Just include a requirements.txt in your records folder and mention setup instructions in your README.md."

The FAQ uses "package or library" (not "pip package") and treats requirements.txt and "setup instructions in your README.md" as separate, both-acceptable declaration mechanisms. Pip-installability isn't a rule criterion. We declare lrzip in both places: requirements.txt documents it as a comment ("System dep (apt): lrzip"), and the README spells out the install command.

A point worth being explicit about: the rule's verb is "include a requirements.txt." That means have one, present it as a reference — not "every dependency must resolve automatically via pip install -r requirements.txt." The official README itself reinforces this on line 176, describing requirements.txt as "provided as a reference if you want to self-setup" — a reference document, not a self-executing manifest. Submission contents rule (line 227) similarly says "any other dependencies," with no restriction to pip-installable packages. So a requirements.txt that mentions lrzip in a comment alongside the README's install instructions satisfies the rule as written.

For precedent: the current leaderboard SOTA (2026-04-09, val_bpb 1.0810 by @bigbag) installs FlashAttention 3 from windreamer.github.io, a single contributor's GitHub Pages site, via a custom --find-links URL that does not resolve from PyPI. By the "if it isn't a normal pip-resolvable package, it's a side dependency" reading, that submission would also fail. It doesn't, because the FAQ allows external deps via README setup instructions — the exact same mechanism we're using for lrzip. lrzip from official Debian/Ubuntu apt repos is structurally a more conservative external-dep pattern than installing custom-built CUDA wheels from a GitHub Pages site.

The literal self-containment rule (FAQ on artifact size) is about runtime behavior during evaluation: "No external downloads, training dataset access, or network calls are allowed during evaluation." lrzip is invoked locally against an already-installed binary — no network, no download, no training-data access at runtime.

So to be explicit on the conclusion: the submission is valid as-is. The artifact is under 16 MB, the script makes no network calls or external downloads during evaluation, and the lrzip dependency is declared exactly the way the official FAQ asks external dependencies to be declared (requirements.txt + README setup instructions) — the same way every recent record has handled FA3, including the current leaderboard SOTA. None of options (a)–(c) is required; (a) would in fact create a new compliance violation.

Yes .. my bad, this does appear valid. :) your now the #1 spot

PR openai#1902 (cocohearts) accepted openai#1851/openai#1868 over openai#1736 and excluded openai#1855 only on significance grounds (p=0.325). Our prior 050 line built on openai#1797 which is under validity-cloud per cocohearts. Re-anchor research baseline on openai#1855's accepted chain. Pure port — zero modifications. Files copied verbatim from codemath3000/parameter-golf:submission/sp8192-lqer-bos-smear-fix-9hp-stack @ 1e43966 into records/track_10min_16mb/2026-04-29_PR1855_Port_Baseline/. Spec 060B+ will fork exp/060B-* etc. to stack quant-repair / deploy-time levers (046B-tight SDClip, 046L deploy-time repair, 046G-tighter, etc.) on this baseline.

- pinned SHA da50cd6 (spec 060A baseline) - TTT_ENABLED=1, PHASED_TTT_ENABLED=3 (per memory: =3 not =0) - All openai#1855 defaults made explicit in env (BETA2=0.99, SPARSE_ATTN_GATE_SCALE=0.5, MLP_CLIP_SIGMAS=11.5, EMBED_CLIP_SIGMAS=14.0, WARMDOWN_FRAC=0.85, PHASED_TTT_PREFIX_DOCS=2500, TTT_BETA2=0.99, TTT_WEIGHT_DECAY=0.5, TTT_LORA_RANK=80) - apt install lrzip (required by openai#1855's _lrzip_compress) - Both final_model.pt and final_model.int6.ptz verified after run; fail-loud (exit 2) on missing artifacts; chmod a-w on success

Six follow-on specs to spec 060A (openai#1855 port): - 060B: SDClip ATTN tightening (config-only, eval via RESUME_FROM_CKPT) - 060C: 046L deploy-time quant repair (~150 lines code port from exp/046-quant-repair @ fcb816f); eval-side, free - 060D: 046G-tighter SDClip (config-only, fits within openai#1855 lrzip headroom) - 060E: full stack (060B + 060C combined) - 060F: LQER bumps (RANK=5, TOP_K=4, ASYM_GROUP=32; config-only) - 060G: Partial SpinQuant from PR openai#1898 (~100 lines code port) Plus tmp_exec/launch_060_eval.sh: shared eval-only launcher for RESUME_FROM_CKPT mode, used by 060B/D/E/F. Loads 060A's final_model.pt, re-quantizes + re-evals with overridden env vars. ~-3 per arm vs ~ for full retrain. All specs reference 060A's checkpoint at runs/060A-1855-port/seed_42/ final_model.pt as their hotstart.

openai#1855's per-group lrzip compressor saves ~280 KB vs default brotli. Without this, 060A's artifact went over the 16,000,000 byte cap and required post-hoc repacking (per execution session feedback). Switching default to pergroup ensures all 060 family runs (A through G) fit within the cap by default; no separate repack step needed. Affects: launch_060A_run.sh (full training run), launch_060_eval.sh (eval-only via RESUME_FROM_CKPT for 060B/D/E/F).

Phase M seed-42 hit val_bpb 1.05891 (record-clearing) but artifact 17.25 MB (over by 1.25 MB) because lzma compression made things WORSE on quantized weights — Phase G with brotli was 16.14 MB, lzma made it 17.25. Lesson: brotli > lzma on this data. Phase N strategy: same Phase G config (9-hparam stack on PR openai#1797 V2 base with BOS fix, brotli compression) but revert MLP_CLIP_SIGMAS from 11.5 to 10.0 (PR openai#1797 default). Tighter MLP weight clip → narrower magnitude → brotli compresses tighter → expected ~100-200 KB saved. Phase G's 16,144,312 bytes should drop to <16,000,000. BPB cost of reverting MLP_CLIP: small (one of 9 hparams; PR openai#1855 reported mean delta -0.00049 across all 9; reverting one ~adds 0.00006 BPB). Phase G's mean 1.05969 should shift to ~1.0598 — still well below the 1.05963 record bar (PR openai#1797 = 1.06157 - 0.00194 = 1.05963). Auto-stop pod (trap EXIT + hard wallclock kill 100min) and HF result push after each seed (so abort is recoverable from outside the pod).

…0.00217 vs openai#1855 mean)

Four post-training specs to stack on 060A's openai#1855 port: - 060I: port PR openai#1908's activation-aware mixed-bit GPTQ (3-seed validated −0.000265 BPB on openai#1855 itself). 4 env vars + ~100 LOC port. - 060J: PHASED_TTT_NUM_PHASES 3→4 (low confidence; openai#1727 measured noise on weaker base, never tested with 2500 prefix). - 060L: PHASED_TTT_PREFIX_DOCS 2500→3000 (high confidence; codemath3000 greedy-validated 2000→2500 on this exact stack in openai#1855). - 060M: TTT_EPOCHS 3→4 (highest predicted Δ; PR openai#1812 reported −0.008 on weaker base; never tested on phased+SmearGate stack like openai#1855). All eval-only via RESUME_FROM_CKPT on 060A's seed_42_4h pt. No code change for 060J/L/M. 060K (rank-up) deleted — rowed against openai#1855's own greedy direction (which decreased rank 96→80). Idea files: research/ideas/{1908-awq-lite-mixed-bit-gptq,ttt-budget-reinvestment}.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oundary This submission extends PR openai#1855's record candidate (LQER + SparseAttnGate + BOS-fixed SmearGate + Polar-Express Muon + phased TTT eval + 9-hparam stack; 3-seed mean 1.06108) with two additions: 1. MP3 marker-pair fusion (vocab surgery): the three 2-grams [SPACE, TITLE]/ [SPACE, ALLCAPS]/[SPACE, CAPNEXT] are fused into single alias donor tokens (donors 8/9/10 from byte-fallback IDs that occur 0x in the CaseOps corpus). Word X is preserved (no full-fusion d=1 collapse). Token saving 8.47%. 2. Alias smear boundary: SmearGate's previous-position contribution is fully disabled at positions immediately following an alias token (ALIAS_PREV_SMEAR_SCALE=0.0). Regular non-alias positions are unchanged. Conceptually: alias tokens act as smear boundaries. 1-seed reference (8xH100, 600s wallclock, on author's DGX H100 box): val_bpb (phased TTT) : 1.06042 size : 16.74 MB on DGX (over budget); the same PR openai#1855 codebase unmodified also produces 16.75 MB on the same DGX box, so the ~840 KB delta vs the runpod 15.90 MB number is environmental (likely lrzip ZPAQ version / numerical state). The 3-seed runpod verification is the authoritative size measurement. Submission contents: - train_gpt.py : PR openai#1855 train_gpt.py (~3.8k lines) + 5-hunk MP3 patch - prepare_caseops_data.py : CaseOps tokeniser (multiprocess) - prepare_marker_pair_v3.py : MP3 vocab surgery - download_docs.py : HF docs_selected.jsonl downloader - lossless_caps.py : CaseOps infra - tokenizers/...model : SentencePiece model - alias_map.json : MP3 alias map - requirements.txt : Python deps + lrzip note - run_3seed.sh : 3-seed runner (SEEDS=42 0 1234) - README.md Pipeline (skip 1a/1b/2 if MP3 dataset is already prepared): 1a. python3 download_docs.py 1b. python3 prepare_caseops_data.py --docs ... --out ./data --sp tokenizers/... 2. python3 prepare_marker_pair_v3.py 3. bash run_3seed.sh 3-seed runpod verification pending.

…6108 by -0.00219

…ats PR openai#1855 by -0.00190 BPB

@dexhunter

Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set. Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps). Verdicts: - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068 - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier 1.04350), plus siblings. - INHERIT (1): openai#2050 (eval-only on frozen openai#1915) Code-level evidence (not README claims): - Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs - NO PR overrides --val-docs (searched all .sh files in all 34 PRs) - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1 HF dataset whose manifest pins docs_val=50000, docs_train=8181945, sums match → CLEAN by construction - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval" Three signposts: - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data. Files: caseops-memory-leakage/README.md — overview, methodology, takeaways caseops-memory-leakage/verdicts.md — 34-row master table with evidence caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations

User pushed back on openai#2014's LEAK call as too inference-based. Verified directly: - README says "uses same shards as PR openai#1855. If you don't have them, prepare with included prepare_caseops_data.py" — phrasing implies inheritance from openai#1855 (LEAK) but doesn't explicitly invoke prep - No setup.sh, no shell script invoking prep - No HF download script - Path /dev/shm/pgolf_caseops_data_80_l17_final is custom flat RAM-disk dir (not triple-nested local-prep signature) - Could be either HF-flattened download OR local-prep copy Demoted openai#2014 from LEAK to AMBIGUOUS (lean LEAK based on "same shards as openai#1855" English, but not iron-clad). Updated tally: CLEAN 9, LEAK 20 (was 21), AMBIGUOUS 4 (was 3), INHERIT 1.

codemath3000 · 2026-05-01T19:06:47Z

Following up on the val_docs=10_000 default question with a more detailed answer.

This submission's training and evaluation data come from the published romeerp/parameter-golf-caseops-v1 HuggingFace dataset, downloaded directly via:

MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 \
python3 cached_challenge_fineweb.py \
  --variant sp8192_lossless_caps_caseops_v1_reserved --train-shards 80

The script used is the cached_challenge_fineweb.py bundled with PR #1729 (record path), which extends dataset_dir_for_variant with a 2-line fallback so non-numeric variant names like sp8192_lossless_caps_caseops_v1_reserved resolve. The canonical data/cached_challenge_fineweb.py on main only handles byte260 and sp<VOCAB_SIZE> numeric variants. The patch is purely a variant-name accommodation and doesn't change the download logic — the data still pulls from romeerp/parameter-golf-caseops-v1 directly.

The shipped prepare_caseops_data.py in our record folder is a reference data-prep script and was not run for this submission.

The romeerp dataset's manifest.json authoritatively documents the partition:

"stats": {
    "docs_val": 50000,
    "docs_train": 8181945,
    "files_val": 1,
    "files_train": 80,
    "tokens_val": 47853344,
    "tokens_train": 8000000058
}

This is the canonical 50K val docs / 8B train tokens setup. Our train_seed42.log reports train_shards: 80 and val_tokens: 47,851,520 — matching the manifest's files_train: 80 and tokens_val: 47,853,344 (the latter rounds to 47,851,520 via the standard (numel-1)//eval_seq_len*eval_seq_len + 1 truncation with eval_seq_len=2048).

The val partition this submission evaluated on is the canonical 50K val docs from romeerp's dataset.

Beats PR openai#1855 (merged rank 1, 1.06108) by 0.00438 BPB. Beats PR openai#2014 (best open, 1.05759) by 0.00089 BPB. Beats PR openai#2060 (1.05792) by 0.00122 BPB. Stack: - Token-only n-gram tilt (PR openai#1514 merged precedent, within/word channels disabled) - AsymLogit Rescale (2 trainable scalars adapted by global TTT) - 3 hyperparameter levers from PR openai#2060 (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5) - PHASED_TTT_NUM_PHASES=1 (matches PR openai#2014) - NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (precompute INSIDE eval timer per PR openai#1514) Compliance: - All seeds eval ≤533.1s (cap 600s, 67-80s margin) - All artifacts ≤15.95MB (cap 16MB) - Token-only n-gram channel (within_gate=0, word_gate=0) - Score-first TTT (per PR openai#402)

…6108 by -0.00219

…ats PR openai#1855 by -0.00190 BPB

@codemath3000

…0.979556) The cond-PPM mixer used SP-piece UTF-8 bytes (incl. CaseOps sentinel overhead, 164,594,398 per seed) as the BPB denominator instead of the canonical raw-text sidecar (151,074,309 per seed) used by every other CaseOps-lineage record per PR openai#1729 convention. Reported by @codemath3000 on PR openai#2138; thank you. Per-token NLL is invariant under denominator change, so the correction is algebraic — no re-eval required, original artifact and logs preserved as forensic record. New per-seed BPB = old × 164594398 / 151074309 = old × 1.089493: seed 42: 0.97949078 -> 1.067148 seed 1337: 0.97954725 -> 1.067210 seed 314: 0.97962885 -> 1.067299 mean: 0.979556 -> 1.067219 (std ~7.6e-05) On the canonical denominator the submission is +0.006 BPB worse than PR openai#1855 SOTA (1.06108), so this is no longer a SOTA-claim. LBM still gives a real -0.034 BPB improvement over sliding-window-alone (1.101347) on the canonical denominator; the C2-correctness story is unchanged. This commit only patches interpretation: - README.md: prepend Errata section, corrected 3-seed table, source- line citations, algebraic derivation; reposition writeup as not-SOTA. Original technique writeup retained below. - submission.json: corrected val_bpb / val_bpb_per_seed / std / eval_canonical_byte_count_per_seed / headline_metric_description; add errata{} object with summary, original values, inflation ratio, credit, fix-branch pointer. Forensic items deliberately untouched: train_gpt.py (wrapped, contains buggy denominator), final_model.int6.ptz, train_seed*.log (each shows both the buggy 'cond_ppm bytes=164594398' line and the canonical- correct 'quantized_sliding_window val_bpb' line — the sidecar count 151,074,309 is reverse-solvable from the latter). Fix lives on cond-ppm-stack of github.com/anmarhindi/parameter-golf-a. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codemath3000 and others added 2 commits April 27, 2026 03:45

Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hp…

612a1a9

…aram Greedy Stack — val_bpb 1.06108 (3-seed mean) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fix AI hallucinations in README, code files unaffected

7459c31

codemath3000 mentioned this pull request Apr 27, 2026

New record submission for review (#1855) — val_bpb 1.06108 #1856

Open

No code changes, update README to clarify that fixed-step setup was f…

03291bb

…or internal hp trials only; submission used 600s wallclock

No code changes, update README to clarify that lrzip is a setup prere…

1e43966

…quisite (not auto-installed by the training script)

okezue mentioned this pull request Apr 28, 2026

Record: DualClock PPM token-context mixture — 1.0803 BPB seed 42 #1879

Open

3 tasks

X-Abhishek-X mentioned this pull request Apr 28, 2026

Record: Partial SpinQuant (start_layer=5) + PR#1851 Stack — val_bpb 1.06614 (3-seed mean) #1898

Open

cocohearts mentioned this pull request Apr 28, 2026

Update Parameter Golf leaderboard with BOS fix #1902

Merged

This was referenced Apr 28, 2026

Update Parameter Golf leaderboard #1899

Open

Update Parameter Golf leaderboard #1900

Open

romeerp mentioned this pull request Apr 28, 2026

Record: PR #1855 base + activation-aware GPTQ mixed precision - val_bpb 1.06081 (3-seed mean) #1908

Open

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026

runs/060A: 060H+TTT result — final TTT 1.05891 (Δ -0.00027 vs 060A; -…

63f85b6

…0.00217 vs openai#1855 mean)

izlley added a commit to izlley/parameter-golf that referenced this pull request May 1, 2026

Track 2 seed 42 result: val_bpb 1.05889 beats PR openai#1855 mean 1.0…

088b4c6

…6108 by -0.00219

izlley added a commit to izlley/parameter-golf that referenced this pull request May 1, 2026

Track 2 final 3-seed (42, 0, 314): val_bpb 1.05917838 (3-seed avg) be…

27dd158

…ats PR openai#1855 by -0.00190 BPB

izlley added a commit to izlley/parameter-golf that referenced this pull request May 1, 2026

Track 2 final 3-seed (42, 0, 314): val_bpb 1.05917838 (3-seed avg) be…

c7f7be5

…ats PR openai#1855 by -0.00190 BPB

izlley added a commit to izlley/parameter-golf that referenced this pull request May 1, 2026

Track 2 final 3-seed (42, 0, 314): val_bpb 1.05917838 (3-seed avg) be…

1f3ef38

…ats PR openai#1855 by -0.00190 BPB

izlley added a commit to izlley/parameter-golf that referenced this pull request May 1, 2026

Track 2 final 3-seed (42, 0, 314): val_bpb 1.05917838 (3-seed avg) be…

02c5b34

…ats PR openai#1855 by -0.00190 BPB

izlley added a commit to izlley/parameter-golf that referenced this pull request May 1, 2026

Track 2 final 3-seed (42, 0, 314): val_bpb 1.05917838 (3-seed avg) be…

df03490

…ats PR openai#1855 by -0.00190 BPB

cloud-777-boy mentioned this pull request May 1, 2026

Non-record: Inhibitory layer on PR #1851 stack (val_bpb 1.06438) #2116

Open

newjordan mentioned this pull request May 1, 2026

non-record - mockingbird - sota copy/10kvocab #2120

Open

Kbediako mentioned this pull request May 1, 2026

Record candidate: StageB v2 CaseOps TTT seed42 1.06095913 #2121

Closed

andrewbaggio1 mentioned this pull request May 1, 2026

Record [corrected] : 1.05770 Gated XSA + token-only n-gram tilt + LQER top-1 + AWQ-lite + AsymLogit) with GPTQ_RESERVE_SECONDS=2.0 and corrected CaseOps data preparation #2118

Open

This was referenced May 1, 2026

Record: CaseOps Gated XSA NgramTilt LQER | val_bpb=1.05933439 #2123

Closed

Record : CaseOps Gated XSA NgramTilt LQER | val_bpb=1.05933439 #2124

Open

Record : CaseOps Gated XSA NgramTilt LQER | val_bpb=1.05933439 vaibhavmishra1/parameter-golf#1

Merged

leon2k2k2k mentioned this pull request May 1, 2026

Train/val data leakage in CaseOps records — prepare_caseops_data.py default overlaps 80% of val docs with training data #2127

Open

This was referenced May 1, 2026

Non-record: Post-Quantization LoRA Distillation (LCQ) on PR #1855 stack, val_bpb=1.06767 #2128

Open

Non-record: Confidence-Adaptive N-gram Boost on PR #2018 stack, val_bpb=1.05874 #2129

Open

TanishGudise mentioned this pull request May 1, 2026

Record candidate: 1.05670 BPB — token-only n-gram tilt + AsymLogit + #2060 levers + NUM_PHASES=1 #2130

Open

izlley added a commit to izlley/parameter-golf that referenced this pull request May 1, 2026

Record: seed 42 result: val_bpb 1.05889 beats PR openai#1855 mean 1.0…

3e1fb27

…6108 by -0.00219

izlley added a commit to izlley/parameter-golf that referenced this pull request May 1, 2026

Record: final 3-seed (42, 0, 314): val_bpb 1.05917838 (3-seed avg) be…

ecce6fd

…ats PR openai#1855 by -0.00190 BPB

anmarhindi mentioned this pull request May 1, 2026

Record: SP8192 + Sliding-Window Eval + Lock-In Byte Mixer - val_bpb 1.067219 #2138

Open

simon-marcus mentioned this pull request May 1, 2026

Corrected: PR #2014 stack + LeakyReLU 0.3 + token-only in-timer n-gram TTT (val_bpb 1.0570) #2140

Open

codemath3000 mentioned this pull request May 2, 2026

Record candidate: PR #2130 base + GPTQ_CALIBRATION_BATCHES=32 — val_bpb 1.05651 (3-seed mean) #2135

Open

cocohearts mentioned this pull request May 2, 2026

Update leaderboard with May 1 audited rows #2146

Merged

codemath3000 mentioned this pull request May 2, 2026

Merge #2019 exclusion explanation request #2148

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack — val_bpb 1.06108 (3-seed mean)#1855

Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack — val_bpb 1.06108 (3-seed mean)#1855
cocohearts merged 4 commits intoopenai:mainfrom
codemath3000:submission/sp8192-lqer-bos-smear-fix-9hp-stack

codemath3000 commented Apr 27, 2026 •

edited

Loading

Uh oh!

codemath3000 commented Apr 27, 2026 •

edited

Loading

Uh oh!

aquariouseworkman commented Apr 28, 2026

Uh oh!

okezue commented Apr 28, 2026

Uh oh!

codemath3000 commented Apr 28, 2026 •

edited

Loading

Uh oh!

aquariouseworkman commented Apr 28, 2026 •

edited

Loading

Uh oh!

codemath3000 commented Apr 28, 2026 •

edited

Loading

Uh oh!

aquariouseworkman commented Apr 28, 2026

Uh oh!

codemath3000 commented May 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

codemath3000 commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

SmearGate cross-document leak fix

Per-group compression pipeline

Hyperparameter stack

Test plan

Uh oh!

codemath3000 commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aquariouseworkman commented Apr 28, 2026

Uh oh!

okezue commented Apr 28, 2026

Independent reproduction (3 seeds)

Uh oh!

codemath3000 commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aquariouseworkman commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codemath3000 commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aquariouseworkman commented Apr 28, 2026

Uh oh!

codemath3000 commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

codemath3000 commented Apr 27, 2026 •

edited

Loading

codemath3000 commented Apr 27, 2026 •

edited

Loading

codemath3000 commented Apr 28, 2026 •

edited

Loading

aquariouseworkman commented Apr 28, 2026 •

edited

Loading

codemath3000 commented Apr 28, 2026 •

edited

Loading

codemath3000 commented May 1, 2026 •

edited

Loading