Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack — val_bpb 1.06108 (3-seed mean) #1855
…aram Greedy Stack — val_bpb 1.06108 (3-seed mean) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or internal hp trials only; submission used 600s wallclock
Quick note: the initial commit, 612a1a9, was the only one with any code changes; the other two were README-only edits to clarify wording and fix some AI hallucinations. When judging when this PR was submitted relative to other PRs, please use the timestamp of the initial commit, not the most recent one: the initial commit contains all the code, logs, etc., and the later commits are README changes only. Thank you so much!
…ate BOS-fix

Lands openai#1797's training stack (PolarNS, MIN_LR, Sparse Attn Gate, Fused CE, Smear Gate, LQER asym) verbatim into a new record dir, with the BOS-fix patch from openai#1855 applied at both _forward_hidden and forward_ttt sites. Per CLAUDE.md's baseline-migration exception, lands directly on research (not exp/<slug>).

Spec: research/specs/050-baseline-1797-bos-fix.md
Code: records/track_10min_16mb/2026-04-27_050_PR1797_Base_BOS_Fix/

Expected: post-TTT ~1.061 (matches openai#1797's 1.06157 ± noise). Skipped from openai#1855: the 9-hparam bundle and the lrzip serializer (deferred for clean attribution of subsequent levers).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…penai#1855

Changes train_gpt.py defaults for two of openai#1855's 9 greedy-validated hparams:

- BETA2 0.95 -> 0.99 (smoother optimizer variance estimate, generic win)
- SPARSE_ATTN_GATE_SCALE 1.0 -> 0.5 (softer gating early; only affects openai#1787's sparse attn-output gate path, no coupling with our 047 family)

Both remain env-var-overridable for ablation, as sketched below. WARMDOWN_FRAC=0.85 is deferred because it interacts with loop-activation timing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
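A minimal sketch of the override mechanism (the exact parsing inside train_gpt.py may differ; this is an assumption about how the env-var defaults are read):

```python
import os

# Defaults changed in this commit; both remain overridable via env vars,
# e.g. `BETA2=0.95 python train_gpt.py` for an ablation run.
BETA2 = float(os.environ.get("BETA2", "0.99"))  # was 0.95
SPARSE_ATTN_GATE_SCALE = float(os.environ.get("SPARSE_ATTN_GATE_SCALE", "0.5"))  # was 1.0
```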
… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23

- Merged SOTA still 1.0810 (Day 18, no change since Apr 9)
- PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed)
- SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required
- PR openai#1848 (newjordan, 0.87980) flagged BPB risk: sibling PR openai#1846 closed same day
- PR openai#1858 (0.9946) only covers 8M/40.5M tokens — not leaderboard-comparable
- PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean
- PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling
- Added Session 23 lessons to CLAUDE.md
- 3 days to deadline (Apr 30) — final GPU run window

https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU
- Adds a 2-line BOS mask in both forward_logits and forward_ttt SmearGate paths. Before the fix, the last token of doc N smeared into the BOS of doc N+1 — a model-quality bug, not a C1 issue. Identical fix to PR openai#1851 by @aquariouseworkman, audited by @cocohearts.
- runpod/phase_g_3seed.sh: full 3-seed driver. Sets the PR openai#1797 stack env vars plus the PR openai#1855 9-hparam greedy-stack delta: MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 WARMDOWN_FRAC=0.85 BETA2=0.99 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 TTT_LORA_RANK=80 SPARSE_ATTN_GATE_SCALE=0.5 PHASED_TTT_PREFIX_DOCS=2500. Mixers (NGRAM/TEMP) stay OFF — pure neural baseline + bug fix + hparam stack. Auto-runs a Welch t-test vs PR openai#1797 (1.06157 ± 0.00066); a sketch of that test follows this list.
- TTT 4-epoch (PR openai#1812) explicitly NOT adopted: that scheme targets the PR openai#1493 SGD-on-whole-model TTT path, not the PR openai#1797 LoRA-phased per-doc-reset path we're on. No clean mapping.

Legality: all 16/16 unit tests still pass. The BOS fix preserves causality: it only zeroes a gate at positions where the current token is BOS, and never references future tokens.
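For reference, a minimal sketch of the Welch comparison the driver automates. The per-seed values below are placeholders, and reconstructing openai#1797's three seeds from its reported mean ± std is an assumption, not the script's actual code:

```python
from scipy import stats

ours = [1.06043, 1.06095, 1.06186]      # placeholder per-seed val_bpb from the 3-seed driver
baseline = [1.06091, 1.06157, 1.06223]  # synthetic stand-in for openai#1797's 1.06157 ± 0.00066

t, p = stats.ttest_ind(ours, baseline, equal_var=False)  # Welch's t-test (unequal variances)
print(f"Welch t = {t:.3f}, p = {p:.4f}")
```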
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):

- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens); the reviewer caught it and the author admitted it.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over the token alphabet), caught by reviewer @sharpobject.
- openai#1855: techniques mostly legit, but apt-get install lrzip violates Issue openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE — 3-seed mean 1.06766, std 0.00076, three orthogonal training-time techniques citing prior validated PRs. If it merges, our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:

1. Polar Express NS coefficients (PR openai#1344) — 5 minimax-tuned tuples replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) — LR floors at 10% of max instead of decaying to 0. Already wired in our v1+; just env-var opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) — SVD on the top-K=3 highest-error GPTQ residuals, packed as int4 per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2):

- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use per-iteration coefficients instead of fixed ones (see the sketch after this message).
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977, lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-seed artifact slack: ~4,888 bytes under cap. Tight but workable. AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):

- Phase 0 baseline: 1.08038
- + LR=0.010 (Stage 2): 1.08021
- + Polar Express NS: 1.0787-1.0797
- + MIN_LR=0.10: 1.0777-1.0794
- + ConfTTT (PR openai#1879): 1.0772-1.0793
- + LQER (v4 work): 1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E: 1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-current-stack), since the current stack alone doesn't clear 1.0760. Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open; whichever merges first becomes the new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
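A hedged sketch of that branching. The fixed quintic coefficients are the ones quoted above; the per-iteration _PE_COEFFS tuples here are placeholders, not PR openai#1344's actual minimax values, and the normalization is simplified relative to the real Muon backend:

```python
import os
import torch

_POLAR_EXPRESS_NS = os.environ.get("POLAR_EXPRESS_NS", "0") == "1"  # read at import time
_FIXED = (3.4445, -4.7750, 2.0315)
_PE_COEFFS = [_FIXED] * 5  # placeholder per-iteration (a, b, c) tuples

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Quintic Newton-Schulz orthogonalization used by Muon (simplified)."""
    X = G / (G.norm() + 1e-7)
    for i in range(steps):
        a, b, c = _PE_COEFFS[i] if _POLAR_EXPRESS_NS else _FIXED
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```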
apt-get install lrzip is required at runtime, thus this breaks rule 3.
Independent reproduction (3 seeds)

Re-ran the stack on a fresh 8×H100 SXM pod (cu129 / torch 2.9.1 / FA3 cu129_torch291) with the env vars from this PR's hparam table plus the gates (`SMEAR_GATE_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 EMBED_BITS=7 MIN_LR=0.1 GPTQ_RESERVE_SECONDS=0.5 PHASED_TTT_NUM_PHASES=3` etc.). 600 s wallclock, otherwise identical to this PR.
3-seed mean 1.06043 vs this PR's reported 1.06108 — within 1σ. Independent reproduction confirms the stack. These three runs used `COMPRESSOR=brotli` and produced 16,112,007-byte artifacts (over the 16 MB cap; the BPB numbers are unaffected by compressor choice, but those artifacts are technically non-compliant). One additional pergroup re-run with seed 42 produced a compliant 15,902,285-byte artifact at val_bpb 1.06052, matching seed 42's brotli value within run-to-run noise. (I couldn't get 3 pergroup seeds due to a string of RunPod capacity / image-pull failures over a 4-hour window — the locked volume on the original machine still has all the brotli logs and the s42 pergroup artifact saved.) Net: the stack reproduces, and the −0.019 BPB improvement over PR #1493's 1.0810 is real on independent hardware.
@aquariouseworkman Thanks for flagging — you're right that the README/PR wording was imprecise about when lrzip is needed, and I've fixed that. To clarify the substantive question:

The training/eval script never runs apt-get itself. The lrzip binary is installed once during instance setup, and the script just shells out to the already-installed binary via subprocess.run. No apt-get, no network calls, and no external downloads occur during the 600 s training window or the 600 s eval window.

The official FAQ explicitly authorizes external dependencies handled this way: "Yes, you're free to import any package or library you want… Just include a requirements.txt in your records folder and mention setup instructions in your README.md." Both are present (requirements.txt documents lrzip; the README has the install command).

For precedent: the current leaderboard SOTA (the PR for the 2026-04-09 SP8192 record at 1.0810 BPB) installs FlashAttention 3 from a custom third-party wheel host (pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/...) — also an external download performed before training begins. lrzip via apt-get is actually a more conservative dependency than that: it pulls from the official Debian/Ubuntu package repos rather than from a single contributor's GitHub Pages site. If the standard FA3 setup is acceptable, this should be acceptable a fortiori.

Also worth noting that "rule 3" as numbered in the field guide is the author's paraphrase, not the official rule text. The literal official rule (FAQ) is the one I quoted above.
…quisite (not auto-installed by the training script)
Strange — from what I can see in your current code, you would have to do one of the following to be valid (based on code review, not on your AI README post/information): a) re-run all three seeds with COMPRESSOR=brotli and report those numbers Evidence:
Hi @aquariouseworkman, thanks for the detailed read. Walking through the three options:

(a) Re-run with COMPRESSOR=brotli — not required by any rule, for the reasons below.

(b) Pure-Python ZPAQ embedded in the artifact — no rule requires this. The official rule (FAQ on artifact size) prohibits "external downloads, training dataset access, or network calls during evaluation" — not subprocess shell-outs to OS utilities. lrzip is invoked only via subprocess.run against an already-installed local binary.

(c) Maintainers add lrzip to the base eval image — this is the standard workflow for the challenge, explicitly authorized by the FAQ: "Yes, you're free to import any package or library you want… Just include a requirements.txt in your records folder and mention setup instructions in your README.md."
The FAQ uses "package or library" (not "pip package"), and treats a requirements.txt entry plus README setup instructions as the required declaration.

A point worth being explicit about: the rule's verb is "include a requirements.txt." That means have one, present it as a reference — not "every dependency must resolve automatically via pip install -r requirements.txt." The official README itself reinforces this on line 176, describing requirements.txt as "provided as a reference if you want to self-setup" — a reference document, not a self-executing manifest. The submission-contents rule (line 227) similarly says "any other dependencies," with no restriction to pip-installable packages. So a requirements.txt that mentions lrzip in a comment, alongside the README's install instructions, satisfies the rule as written.

For precedent: the current leaderboard SOTA (2026-04-09, val_bpb 1.0810 by @bigbag) installs FlashAttention 3 from a third-party wheel host (pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/...) before training begins.

The literal self-containment rule (FAQ on artifact size) is about runtime behavior during evaluation: "No external downloads, training dataset access, or network calls are allowed during evaluation." lrzip is invoked locally against an already-installed binary — no network, no download, no training-data access at runtime.

So to be explicit on the conclusion: the submission is valid as-is. The artifact is under 16 MB, the script makes no network calls or external downloads during evaluation, and the lrzip dependency is declared exactly the way the official FAQ asks external dependencies to be declared (requirements.txt plus README setup instructions).
Yes… my bad, this does appear valid. :) You're now in the #1 spot.
PR openai#1902 (cocohearts) accepted openai#1851/openai#1868 over openai#1736 and excluded openai#1855 only on significance grounds (p=0.325). Our prior 050 line built on openai#1797, which is under a validity cloud per cocohearts. Re-anchor the research baseline on openai#1855's accepted chain.

Pure port — zero modifications. Files copied verbatim from codemath3000/parameter-golf:submission/sp8192-lqer-bos-smear-fix-9hp-stack @ 1e43966 into records/track_10min_16mb/2026-04-29_PR1855_Port_Baseline/.

Spec 060B+ will fork exp/060B-* etc. to stack quant-repair / deploy-time levers (046B-tight SDClip, 046L deploy-time repair, 046G-tighter, etc.) on this baseline.
- pinned SHA da50cd6 (spec 060A baseline)
- TTT_ENABLED=1, PHASED_TTT_ENABLED=3 (per memory: =3, not =0)
- All openai#1855 defaults made explicit in env (BETA2=0.99, SPARSE_ATTN_GATE_SCALE=0.5, MLP_CLIP_SIGMAS=11.5, EMBED_CLIP_SIGMAS=14.0, WARMDOWN_FRAC=0.85, PHASED_TTT_PREFIX_DOCS=2500, TTT_BETA2=0.99, TTT_WEIGHT_DECAY=0.5, TTT_LORA_RANK=80)
- apt install lrzip (required by openai#1855's _lrzip_compress)
- Both final_model.pt and final_model.int6.ptz verified after the run; fail-loud (exit 2) on missing artifacts; chmod a-w on success
Six follow-on specs to spec 060A (openai#1855 port):

- 060B: SDClip ATTN tightening (config-only, eval via RESUME_FROM_CKPT)
- 060C: 046L deploy-time quant repair (~150-line code port from exp/046-quant-repair @ fcb816f); eval-side, free
- 060D: 046G-tighter SDClip (config-only, fits within openai#1855's lrzip headroom)
- 060E: full stack (060B + 060C combined)
- 060F: LQER bumps (RANK=5, TOP_K=4, ASYM_GROUP=32; config-only)
- 060G: partial SpinQuant from PR openai#1898 (~100-line code port)

Plus tmp_exec/launch_060_eval.sh: a shared eval-only launcher for RESUME_FROM_CKPT mode, used by 060B/D/E/F. Loads 060A's final_model.pt, then re-quantizes + re-evals with overridden env vars. ~-3 per arm vs ~ for a full retrain.

All specs reference 060A's checkpoint at runs/060A-1855-port/seed_42/final_model.pt as their hotstart.
openai#1855's per-group lrzip compressor saves ~280 KB vs the default brotli. Without it, 060A's artifact went over the 16,000,000-byte cap and required post-hoc repacking (per execution-session feedback). Switching the default to pergroup ensures all 060-family runs (A through G) fit within the cap by default, with no separate repack step. Affects: launch_060A_run.sh (full training run) and launch_060_eval.sh (eval-only via RESUME_FROM_CKPT for 060B/D/E/F).
Phase M seed-42 hit val_bpb 1.05891 (record-clearing) but the artifact was 17.25 MB (over by 1.25 MB) because lzma compression made things WORSE on quantized weights — Phase G with brotli was 16.14 MB; lzma made it 17.25. Lesson: brotli > lzma on this data (a quick sanity check is sketched below).

Phase N strategy: same Phase G config (9-hparam stack on the PR openai#1797 V2 base with BOS fix, brotli compression) but revert MLP_CLIP_SIGMAS from 11.5 to 10.0 (the PR openai#1797 default). Tighter MLP weight clip → narrower magnitude range → brotli compresses tighter → expected ~100-200 KB saved. Phase G's 16,144,312 bytes should drop below 16,000,000.

BPB cost of reverting MLP_CLIP: small (one of 9 hparams; PR openai#1855 reported a mean delta of -0.00049 across all 9, so reverting one adds roughly 0.00006 BPB). Phase G's mean 1.05969 should shift to ~1.0598 — just above the 1.05963 record bar (PR openai#1797 = 1.06157 - 0.00194 = 1.05963), though still well below PR openai#1797's headline 1.06157.

Auto-stop pod (trap EXIT + hard wallclock kill at 100 min) and HF result push after each seed (so an abort is recoverable from outside the pod).
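A quick way to reproduce the brotli-vs-lzma comparison on any serialized weight blob (the filename is hypothetical; requires the brotli pip package):

```python
import lzma
import brotli  # pip install brotli

with open("final_model.int6.bin", "rb") as f:  # hypothetical serialized quantized weights
    blob = f.read()

print("lzma   :", len(lzma.compress(blob, preset=9)))
print("brotli :", len(brotli.compress(blob, quality=11)))
```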
Four post-training specs to stack on 060A's openai#1855 port:

- 060I: port PR openai#1908's activation-aware mixed-bit GPTQ (3-seed validated −0.000265 BPB on openai#1855 itself). 4 env vars + ~100 LOC port.
- 060J: PHASED_TTT_NUM_PHASES 3→4 (low confidence; openai#1727 measured noise on a weaker base, never tested with the 2500 prefix).
- 060L: PHASED_TTT_PREFIX_DOCS 2500→3000 (high confidence; codemath3000 greedy-validated 2000→2500 on this exact stack in openai#1855).
- 060M: TTT_EPOCHS 3→4 (highest predicted Δ; PR openai#1812 reported −0.008 on a weaker base; never tested on a phased+SmearGate stack like openai#1855).

All eval-only via RESUME_FROM_CKPT on 060A's seed_42_4h pt. No code change for 060J/L/M. 060K (rank-up) deleted — it rowed against openai#1855's own greedy direction (which decreased rank 96→80).

Idea files: research/ideas/{1908-awq-lite-mixed-bit-gptq,ttt-budget-reinvestment}.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oundary

This submission extends PR openai#1855's record candidate (LQER + SparseAttnGate + BOS-fixed SmearGate + Polar-Express Muon + phased TTT eval + 9-hparam stack; 3-seed mean 1.06108) with two additions:

1. MP3 marker-pair fusion (vocab surgery): the three 2-grams [SPACE, TITLE] / [SPACE, ALLCAPS] / [SPACE, CAPNEXT] are fused into single alias donor tokens (donors 8/9/10, from byte-fallback IDs that occur 0x in the CaseOps corpus). Word X is preserved (no full-fusion d=1 collapse). Token saving: 8.47%.
2. Alias smear boundary: SmearGate's previous-position contribution is fully disabled at positions immediately following an alias token (ALIAS_PREV_SMEAR_SCALE=0.0). Regular non-alias positions are unchanged. Conceptually, alias tokens act as smear boundaries (see the sketch after this message).

1-seed reference (8×H100, 600 s wallclock, on the author's DGX H100 box):

- val_bpb (phased TTT): 1.06042
- size: 16.74 MB on DGX (over budget); the same PR openai#1855 codebase unmodified also produces 16.75 MB on the same DGX box, so the ~840 KB delta vs the runpod 15.90 MB number is environmental (likely lrzip ZPAQ version / numerical state). The 3-seed runpod verification is the authoritative size measurement.

Submission contents:

- train_gpt.py: PR openai#1855 train_gpt.py (~3.8k lines) + 5-hunk MP3 patch
- prepare_caseops_data.py: CaseOps tokeniser (multiprocess)
- prepare_marker_pair_v3.py: MP3 vocab surgery
- download_docs.py: HF docs_selected.jsonl downloader
- lossless_caps.py: CaseOps infra
- tokenizers/...model: SentencePiece model
- alias_map.json: MP3 alias map
- requirements.txt: Python deps + lrzip note
- run_3seed.sh: 3-seed runner (SEEDS=42 0 1234)
- README.md

Pipeline (skip 1a/1b/2 if the MP3 dataset is already prepared):

1a. python3 download_docs.py
1b. python3 prepare_caseops_data.py --docs ... --out ./data --sp tokenizers/...
2. python3 prepare_marker_pair_v3.py
3. bash run_3seed.sh

3-seed runpod verification pending.
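A minimal sketch of the alias smear boundary (tensor shapes and the scaling construction are assumptions; the real change is a 5-hunk patch inside train_gpt.py):

```python
import torch

def smear_with_alias_boundary(x, g, tokens, alias_ids, alias_scale=0.0):
    """x: (B, T, D) hidden states; g: (B, T, 1) smear gate; tokens: (B, T) ids.
    Positions whose *previous* token is an alias get their prev-token smear
    term scaled by alias_scale (0.0 = the boundary fully disables the smear).
    alias_ids is a 1-D tensor of alias token ids."""
    prev_is_alias = torch.isin(tokens[:, :-1], alias_ids).unsqueeze(-1).to(x.dtype)
    scale = 1.0 - prev_is_alias * (1.0 - alias_scale)  # 1.0 normally, alias_scale after an alias
    out = x.clone()
    out[:, 1:] = x[:, 1:] + g[:, 1:] * scale * x[:, :-1]
    return out
```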
…6108 by -0.00219
…ats PR openai#1855 by -0.00190 BPB
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set. Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes: openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:

- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier, 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims; the check is sketched below):

- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000 and docs_train=8181945; the sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is the gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:

- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.

Files:
- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
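A sketch of the code-level check behind those evidence bullets (the directory layout is hypothetical): hash every shipped prepare_caseops_data.py, and flag any shell script that overrides --val-docs.

```python
import hashlib
import pathlib

AUDIT_ROOT = pathlib.Path("audit/prs")  # hypothetical: one checkout per audited PR

for pr_dir in sorted(AUDIT_ROOT.iterdir()):
    prep = pr_dir / "prepare_caseops_data.py"
    if prep.exists():
        digest = hashlib.sha256(prep.read_bytes()).hexdigest()[:12]
        print(f"{pr_dir.name}: prepare_caseops_data.py sha256={digest}")  # byte-identical => same digest
    for sh in pr_dir.rglob("*.sh"):
        if "--val-docs" in sh.read_text(errors="ignore"):
            print(f"{pr_dir.name}: {sh.name} overrides --val-docs")
```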
User pushed back on openai#2014's LEAK call as too inference-based. Verified directly:

- README says "uses same shards as PR openai#1855. If you don't have them, prepare with included prepare_caseops_data.py" — the phrasing implies inheritance from openai#1855 (LEAK) but doesn't explicitly invoke prep
- No setup.sh, no shell script invoking prep
- No HF download script
- Path /dev/shm/pgolf_caseops_data_80_l17_final is a custom flat RAM-disk dir (not the triple-nested local-prep signature)
- Could be either an HF-flattened download OR a local-prep copy

Demoted openai#2014 from LEAK to AMBIGUOUS (lean LEAK based on the "same shards as openai#1855" English, but not iron-clad). Updated tally: CLEAN 9, LEAK 20 (was 21), AMBIGUOUS 4 (was 3), INHERIT 1.
Following up on the val_docs=10_000 default question with a more detailed answer.

This submission's training and evaluation data come from the published romeerp/parameter-golf-caseops-v1 HF dataset. The script used is the cached_challenge_fineweb.py downloader, which pulls the prebuilt shards from that dataset. The shipped prepare_caseops_data.py (with its --val-docs=10_000 default) is never invoked by this submission's pipeline. The romeerp dataset's manifest pins:

"stats": {
"docs_val": 50000,
"docs_train": 8181945,
"files_val": 1,
"files_train": 80,
"tokens_val": 47853344,
"tokens_train": 8000000058
}

This is the canonical 50K val docs / 8B train tokens setup. Our run consumes these shards as published, so the prep script's --val-docs=10_000 default never takes effect. The val partition this submission evaluated on is the canonical 50K val docs from romeerp's dataset.
Beats PR openai#1855 (merged rank 1, 1.06108) by 0.00438 BPB. Beats PR openai#2014 (best open, 1.05759) by 0.00089 BPB. Beats PR openai#2060 (1.05792) by 0.00122 BPB.

Stack:

- Token-only n-gram tilt (PR openai#1514 merged precedent; within/word channels disabled)
- AsymLogit Rescale (2 trainable scalars adapted by global TTT; see the sketch after this message)
- 3 hyperparameter levers from PR openai#2060 (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5)
- PHASED_TTT_NUM_PHASES=1 (matches PR openai#2014)
- NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (precompute INSIDE the eval timer, per PR openai#1514)

Compliance:

- All seeds eval ≤533.1 s (cap 600 s, 67-80 s margin)
- All artifacts ≤15.95 MB (cap 16 MB)
- Token-only n-gram channel (within_gate=0, word_gate=0)
- Score-first TTT (per PR openai#402)
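One plausible reading of the AsymLogit Rescale lever, sketched as an assumption (two trainable scalars, one per logit sign); the PR's exact parameterization isn't shown in this thread:

```python
import torch
import torch.nn as nn

class AsymLogitRescale(nn.Module):
    """Two trainable scalars (adapted by global TTT) that rescale positive
    and negative logits asymmetrically before the softmax."""
    def __init__(self):
        super().__init__()
        self.pos = nn.Parameter(torch.ones(()))
        self.neg = nn.Parameter(torch.ones(()))

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        return torch.where(logits >= 0, self.pos * logits, self.neg * logits)
```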
…0.979556)

The cond-PPM mixer used SP-piece UTF-8 bytes (incl. CaseOps sentinel overhead: 164,594,398 per seed) as the BPB denominator, instead of the canonical raw-text sidecar (151,074,309 per seed) used by every other CaseOps-lineage record per the PR openai#1729 convention. Reported by @codemath3000 on PR openai#2138; thank you.

Per-token NLL is invariant under a denominator change, so the correction is algebraic — no re-eval required; the original artifact and logs are preserved as a forensic record. New per-seed BPB = old × 164594398 / 151074309 = old × 1.089493:

- seed 42: 0.97949078 -> 1.067148
- seed 1337: 0.97954725 -> 1.067210
- seed 314: 0.97962885 -> 1.067299
- mean: 0.979556 -> 1.067219 (std ~7.6e-05)

(The rescaling is reproduced in the sketch after this message.) On the canonical denominator the submission is +0.006 BPB worse than PR openai#1855's SOTA (1.06108), so this is no longer a SOTA claim. LBM still gives a real -0.034 BPB improvement over sliding-window-alone (1.101347) on the canonical denominator; the C2-correctness story is unchanged.

This commit only patches interpretation:

- README.md: prepend an Errata section with the corrected 3-seed table, source-line citations, and the algebraic derivation; reposition the writeup as not-SOTA. The original technique writeup is retained below.
- submission.json: corrected val_bpb / val_bpb_per_seed / std / eval_canonical_byte_count_per_seed / headline_metric_description; add an errata{} object with summary, original values, inflation ratio, credit, and a fix-branch pointer.

Forensic items deliberately untouched: train_gpt.py (wrapped, contains the buggy denominator), final_model.int6.ptz, train_seed*.log (each shows both the buggy 'cond_ppm bytes=164594398' line and the canonically-correct 'quantized_sliding_window val_bpb' line — the sidecar count 151,074,309 is reverse-solvable from the latter).

The fix lives on cond-ppm-stack of github.com/anmarhindi/parameter-golf-a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
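The correction is a pure rescale, reproducible in a few lines:

```python
# Total NLL (bits) is unchanged; only the byte denominator moves, so
# bpb_new = bpb_old * bytes_old / bytes_new.
OLD_BYTES, NEW_BYTES = 164_594_398, 151_074_309  # SP-piece UTF-8 vs raw-text sidecar
ratio = OLD_BYTES / NEW_BYTES  # ≈ 1.089493

for seed, old in [(42, 0.97949078), (1337, 0.97954725), (314, 0.97962885)]:
    print(f"seed {seed}: {old:.8f} -> {old * ratio:.6f}")
```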
Summary
11L 512d 8H/4KV transformer with U-Net skips, parallel residuals, partial RoPE, Polar-Express Newton-Schulz Muon, LQER asymmetric int4 rank-4 quant correction, sparse attention head-output gate, SmearGate with cross-document leak fix on BOS positions (audit response), fused LeakyReLU-square MLP, fused softcapped CE Triton kernel, GPTQ int6 + int7 embed + per-row int8 attn-gate, per-group lrzip + brotli compression pipeline (added in this submission — PR #1797's base only ships `lzma`/`brotli`), phased TTT eval (3 cumulative phases at doc-boundaries 833 / 1666 / 2500, max prefix = 2500 docs), with 9 hyperparameter overrides validated by greedy forward-selection on 8×H100 real fixed-step.

3-seed mean: 1.06108 BPB (std 0.00090) on 8×H100 SXM, all artifacts under the 16 MB cap.
vs current leaderboard (1.0810 BPB): −0.01992 BPB / −0.04359 nats.
SmearGate cross-document leak fix
SmearGate's per-token forward-1 mixing (`x[:, 1:] + g * x[:, :-1]`) leaks the last token of doc N into the BOS embedding of doc N+1 in a packed validation stream. The fix masks the prev-token term wherever the current token is BOS:
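A minimal sketch of the mask (function and argument names are illustrative; in the submission the two lines live inside `_forward_hidden` and `forward_ttt`):

```python
import torch

def smear_gate_bos_fixed(x, g, tokens, bos_id):
    """x: (B, T, D) hidden states; g: (B, T, 1) gate; tokens: (B, T) ids.
    Zero the prev-token term wherever the *current* token is BOS, so doc N's
    last token never smears into doc N+1's BOS. Causality is preserved: the
    mask only looks at the current token, never at future positions."""
    not_bos = (tokens[:, 1:] != bos_id).unsqueeze(-1).to(x.dtype)
    out = x.clone()
    out[:, 1:] = x[:, 1:] + g[:, 1:] * not_bos * x[:, :-1]
    return out
```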
Applied symmetrically in `_forward_hidden` and `forward_ttt`, so training and TTT eval are leak-free.

Per-group compression pipeline
PR #1797's base only exposes `lzma`/`brotli` compressors. This submission adds a per-group serializer (`COMPRESSOR=pergroup`):

- Weights are grouped into banks (`qo_bank`, `kv_bank`, `mlp_up_bank`, `mlp_down_bank`, etc.) so similarly-distributed weights compress together.
- For the large 2-D tensors (`_tok_emb`, `attn.c_q`, `mlp.fc`), it runs an L1 nearest-neighbour similarity sort on rows before transposing (sketched below) — adjacent rows in the serialized stream are now numerically close, giving the entropy coder longer runs of small deltas. Permutation indices are stored as `uint16` and brotli-compressed.
- The serialized stream is compressed with `lrzip -z -L 9` (ZPAQ context-mixing back-end); lrzip's long-range deduplication catches cross-tensor repetition that brotli's 24-bit window misses.
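A hedged sketch of the row sort (greedy nearest-neighbour chaining is an assumption; the shipped serializer may order rows differently):

```python
import numpy as np

def l1_row_sort(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Greedily chain rows so consecutive rows minimize L1 distance, then
    return the sorted matrix and the uint16 permutation needed to undo it."""
    order, remaining = [0], set(range(1, w.shape[0]))
    while remaining:
        last = w[order[-1]]
        nxt = min(remaining, key=lambda i: float(np.abs(w[i] - last).sum()))
        order.append(nxt)
        remaining.remove(nxt)
    perm = np.asarray(order, dtype=np.uint16)  # stored as uint16, brotli-compressed
    return w[perm], perm
```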
Net effect on this stack: ~280 KB smaller artifact than `COMPRESSOR=brotli`, at the cost of ~75 s of additional serialize time. The `lrzip` binary must be present on the system before the training script runs (e.g. install with `apt-get install lrzip` during instance setup). The script itself does not run `apt-get`; the Python `subprocess.run` wrapper just shells out to the already-installed `lrzip` binary.

Hyperparameter stack
9 greedy-validated overrides:

- MLP_CLIP_SIGMAS=11.5
- EMBED_CLIP_SIGMAS=14.0
- WARMDOWN_FRAC=0.85
- BETA2=0.99
- TTT_BETA2=0.99
- TTT_WEIGHT_DECAY=0.5
- TTT_LORA_RANK=80
- SPARSE_ATTN_GATE_SCALE=0.5
- PHASED_TTT_PREFIX_DOCS=2500
Each individually accepted on a strict greedy-keep rule (mean improvement vs current best stack) at fixed-step.
See `records/track_10min_16mb/2026-04-27_SP8192_LQER_SparseGate_BOSSmearFix_9HpStack_1.0611/README.md` for full architecture lineage and credits.

Test plan
- All 16/16 unit tests pass; the BOS mask is applied in both `_forward_hidden` and `forward_ttt`.

🤖 Generated with Claude Code