
Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop45 + PhasedTTT — val_bpb 1.06549 #1736

Merged
cocohearts merged 3 commits into openai:main from dexhunter:dexhunter/caseops-gatedattn-quantgate-1.06549
Apr 29, 2026

Conversation

@dexhunter
Contributor

Summary

3-seed results (8×H100 80GB SXM, 10-min train / 10-min eval budgets)

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | Artifact (bytes) | train_time | eval_time |
|------|-------|-------------|--------------|------------------|------------|-----------|
| 42   | 4854  | 1.07847     | 1.06610      | 15,978,834       | 596.18s    | 396.9s    |
| 0    | 4843  | 1.07719     | 1.06473      | 15,971,476       | 596.17s    | 399.3s    |
| 1234 | 4847  | 1.07811     | 1.06563      | 15,975,050       | 596.08s    | 395.5s    |
| Mean | 4848  | 1.07792     | 1.06549      | 15,975,120       | 596.14s    | 397.23s   |
| Std  |       | 0.00066     | 0.00070      | 3,698            | 0.06s      | 1.9s      |

All three seeds clear size, train-time, and eval-time budgets with substantial headroom. 3-seed std is 0.00070 BPB — well inside the 0.005 significance floor.

Key innovation — CaseOps tokenizer + byte sidecar

CaseOps is a bijective, character-level text transform that removes English capitalization from the body text and records it as four operator tokens (TITLE, ALLCAPS, CAPNEXT, ESC), which become SentencePiece user_defined_symbols. Because the transform is fully invertible (decode(encode(s)) == s), no information is lost, and BPE merges allocate vocabulary to content rather than to case variants. The submission ships with a per-token byte sidecar (fineweb_val_bytes_*.bin, uint16, parallel to the val shards) so that BPB is computed on the ORIGINAL pre-transform UTF-8 bytes, not on the transformed representation — the score is on the same FineWeb text, just with a different tokenization front end.
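The invertibility claim can be illustrated with a minimal character-level sketch. This is not the PR's lossless_caps.py — it uses only the CAPNEXT and ESC operators (TITLE and ALLCAPS are word-level compressions of the same idea), with control characters standing in for the operator tokens, and assumes ASCII English text:

```python
CAPNEXT, ESC = "\x01", "\x02"  # stand-ins for the operator tokens

def encode(s):
    """Lower-case the text, recording each capital as CAPNEXT + lowercase."""
    out = []
    for ch in s:
        if ch in (CAPNEXT, ESC):
            out.append(ESC)           # escape literal operator characters
            out.append(ch)
        elif ch.isupper() and ch.lower() != ch:
            out.append(CAPNEXT)       # record the capitalization event
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def decode(t):
    """Exact inverse of encode: replay the recorded case operators."""
    out, i = [], 0
    while i < len(t):
        ch = t[i]
        if ch == ESC:
            out.append(t[i + 1]); i += 2
        elif ch == CAPNEXT:
            out.append(t[i + 1].upper()); i += 2
        else:
            out.append(ch); i += 1
    return "".join(out)
```

Because every transformation is recorded (and literal operator characters are escaped), decode(encode(s)) == s holds for any input, which is what lets BPB be scored against the original bytes.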

Rule compliance

  • Artifact ≤ 16,000,000 bytes DECIMAL (README FAQ + Issue A Field Guide to Valid Submissions #1017 §II.1): ✅ all seeds ≤ 15,978,834.
  • train_time ≤ 600s (README line 6): ✅ all seeds 596.08–596.18s.
  • total_eval_time ≤ 600s (README FAQ, separate budget): ✅ all seeds 395.5–399.3s.
  • Score-first TTT (Issue A Field Guide to Valid Submissions #1017 Condition 3): ✅ phased TTT snapshots the pre-update score on each chunk before the LoRA adapter step; per-doc LoRA reset between documents.
  • BPB on original bytes (Issue A Field Guide to Valid Submissions #1017 §V): ✅ per-token byte sidecar encodes canonical UTF-8 byte count of each val position.
  • No val data in training: ✅ training uses only fineweb_train_*.bin shards.
  • Reproducible: prepare_caseops_data.py is deterministic given the input FineWeb doc stream.
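The byte-sidecar scoring rule in the compliance list reduces to a simple identity: sum the model's NLL in nats over val positions, convert to bits, and divide by the original byte count from the sidecar. A hedged sketch (function name is illustrative, not from train_gpt.py):

```python
import math

def bpb(nll_nats_per_token, sidecar_bytes_per_token):
    # Divide total NLL (converted nats -> bits) by the canonical UTF-8
    # byte count recorded in the uint16 sidecar, so a tokenizer change
    # alone cannot inflate or deflate the score.
    total_nats = sum(nll_nats_per_token)
    total_bytes = sum(sidecar_bytes_per_token)
    return total_nats / (math.log(2) * total_bytes)
```

With this definition, a model that assigns exactly ln 2 nats (one bit) of loss per one-byte token scores 1.0 BPB regardless of how the text was tokenized.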

Test plan

  • Organizer reviews submission folder contents (train_gpt.py, prepare_caseops_data.py, tokenizer .model, 3 seed logs, submission.json, README.md, lossless_caps.py).
  • Organizer runs prepare_caseops_data.py to generate CaseOps shards + val byte sidecar.
  • Organizer reproduces at least one seed: SEED=42 CASEOPS_ENABLED=1 GATED_ATTN_QUANT_GATE=1 ... torchrun --standalone --nproc_per_node=8 train_gpt.py (full env in README).
  • Reproduced quantized_ttt_phased val_bpb matches the logged 1.06610 (±0.0007) within seed noise.
  • Artifact size, train_time, total_eval_time all within budgets on re-run.

Lineage

… — val_bpb 1.06549

3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green:
- artifact 15,975,120 bytes mean (≤16,000,000 DECIMAL)
- train_time 596.14s mean (≤600s)
- total_eval_time 397.23s mean (≤600s)

Builds on PR openai#1530 SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1)
bijective case preprocessing from PR openai#1729 with a per-token byte sidecar
so BPB is scored on original pre-transform UTF-8 bytes. Adds a learned
attention out-gate (init_std=0.005) + quant-gate scaling that recovers
the ~40 KB of overhead introduced by the new control tokens, keeping
every seed under the 16 MB decimal cap.

Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter
  (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) —
  reversible case-factoring with byte sidecar; stronger legality than
  casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738
  builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Bulk import of dexhunter's openai#1736 unmerged submission
(openai#1736, commit e100586) for reproduction as our
new research baseline. Source: records/track_10min_16mb/
2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/.

9 files, ~6856 lines:
- train_gpt.py (training script)
- lossless_caps.py (bijective CaseOps transform)
- prepare_caseops_data.py (data retokenization script)
- fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model (SP tokenizer)
- README.md, submission.json, 3 per-seed training logs

No modifications to repo-root files. Spec: research/specs/008-1736-reproduction.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
After 2026-04-19 frontier scan, rebasing the research baseline from
merged SOTA openai#1493 (1.0810) to unmerged PR openai#1736 (dexhunter, claimed
1.06549). Rationale: credible frontier moved ~0.015 bpb past merged
SOTA in 10 days via witnessed, legal levers (CaseOps tokenizer,
attn-out gate, phased TTT). Continuing off spec-000 leaves us behind
before we try anything.

- CLAUDE.md: baseline declared; baseline-migration specs land on
  research directly (exception to exp/<slug> convention).
- research/frontier-map.md: credibility filter + dependency map.
- diary/2026-04-19-frontier-{scan,map}.md: per-PR evidence base.
- research/ideas/1736-improvement.md: three-spec migration plan.
- research/specs/008-1736-reproduction.md: spec for the reproduction
  run, pinned to commit 154c9b8 (openai#1736 import at e100586).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Collapse spec 008 to seed 42 only and add a one-line pre-GPTQ FP
checkpoint save at runs/008-1736-reproduction/seed_42/pre_gptq.pt
(env-var gated via SAVE_PRE_GPTQ=1 so the reproduction itself is
unaffected when the flag is off).

Rationale: SpinQuant and subsequent quant-family experiments are
purely post-training transforms, so hotstarting off a single
pre-GPTQ FP checkpoint is far cheaper than retraining per spec.
Single-seed comparison against openai#1736's seed-42 (1.06610, ±0.003)
is apples-to-apples for screening. Cost drops ~$40 -> ~$17 for
this spec and ~$10 -> ~$1–2 per downstream quant experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
First lever layered on the new openai#1736 baseline. Hadamard rotation of
weight matrices before GPTQ quantization, hotstarted off spec 008's
pre_gptq.pt FP checkpoint. No retraining.

Witnessed at claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on a
openai#1529-adjacent base; expected to compose cleanly with openai#1736 since
the quant stage is orthogonal to CaseOps / attention gates / phased
TTT. Rotation is a post-training transform with three classes
(residual-stream, per-layer attn, per-layer MLP); FP forward pass is
invariant by construction, only quantization error drops.

Cost ~$6 (hotstart off spec 008 checkpoint), vs ~$30 for a full
retrain. Same hotstart checkpoint reused by future quant
experiments (per-group bit, AR-selfgen calib, AWQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
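For context on the rotation these quant experiments apply before GPTQ, here is a minimal sketch of an orthonormal Hadamard rotation via the Sylvester construction (illustrative only; the actual rotation code in the PRs is not shown here):

```python
import numpy as np

def hadamard(n):
    # Sylvester construction: n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])   # orthonormal: H @ H.T == I

# Rotating a weight matrix into the Hadamard basis before quantization
# spreads per-channel outliers across all channels.
R = hadamard(8)
W = np.arange(64, dtype=np.float64).reshape(8, 8)
W_rot = W @ R
```

Because R is orthogonal, the rotation can be undone exactly in float; only the quantization step behaves differently in the rotated basis.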
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Based on reading train_gpt.py at commit 154c9b8:

Good: RMSNorm is gamma-free (line 529), so the usual gamma-fold step
doesn't apply. RMSNorm is rotation-equivariant directly.

Bad: openai#1736 has five OTHER per-channel multipliers on residual flow
(attn_scale, mlp_scale, resid_mix, skip_weights, skip_gates). These
are the real fold targets, not RMSNorm. resid_mix is pre-norm and
cannot be cleanly folded.

Split into three SpinQuant modes selectable by SPINQUANT_MODE:
- internal_only (R_a, R_m per layer; no residual rotation)
- full (internal + R0, with attn_scale/mlp_scale/skip folds and
  resid_mix freeze-to-mean compromise)
- port_1695 (conditional on openai#1695 diff being meaningfully different)

All three run back-to-back on one pod hotstarted off spec 008's
final_model.pt. ~30 min total GPU, ~$17-22 budget, one eval.

research/ideas/spinquant-integration-notes.md captures the full
design analysis (per-multiplier fold feasibility, three-option
tradeoff, shared-code plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Added SPINQUANT_MODE=baseline as a fourth variant that applies no
rotation — just loads final_model.pt, runs serialize/deserialize/
eval/TTT on it. Two purposes:

1. Closes the loop on spec 008's missed post-TTT number (watcher
   stopped the pod before the TTT eval ran). No separate $3
   eval-only rerun needed.
2. Provides the apples-to-apples local reference for measuring the
   three SpinQuant variants' Deltas — removes any cross-pod bf16
   drift from the comparison.

Order: baseline -> internal_only -> full -> port_1695, sequential on
one pod. Gate: if baseline lands outside openai#1736's 1.06610 +/- 0.003,
halt before running rotations (means spec 008 reproduction is off).

Total cost ~$27 (was $22); absorbs ~$3 of otherwise-separate eval
rerun, so net increment is ~$2 for four measured numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Two new files in the openai#1736 submission dir:

spinquant_hotstart.py (~360 LOC):
- Imports from train_gpt.py for Hyperparameters/GPT/serialize/deserialize/
  eval_val/eval_val_ttt_phased/BatchedTTTLoRA/etc.
- Modes: baseline, internal_only (R_a only, per-layer per-KV-group, d_head
  rotation on V-output and O-input).
- full, port_1695 are stubs — raise NotImplementedError with explanation.
- Pipeline: load FP state_dict from HOTSTART_FP_CKPT -> apply rotations
  in-place on banked qo_bank/kv_bank -> optional pre-quant diagnostic eval
  -> call serialize() (GPTQ+compress) -> deserialize() -> quantized eval
  -> phased TTT eval -> write final.json.
- Reproduces the TTT eval block from train_and_eval (lines 2997-3075) in
  _run_ttt_eval() rather than refactoring the source file.

test_rotation_invariance.py (~250 LOC):
- CPU-only, standalone (no train_gpt.py import due to flash_attn_3/triton
  module-level deps).
- Self-contained minimal attention forward: Q/K/V projection from the
  banked tensors, RMSNorm on Q and K (matches real model's bound on
  attention logits; without this, trained weights saturate softmax and
  float noise in V amplifies catastrophically).
- Tests baseline (bit-exact identity) and internal_only (rel tolerance
  1e-4) against either synthetic random weights or spec 008's
  final_model.pt. Both pass cleanly (rel_max ~1e-6 on real checkpoint).
- Can load either banked (qo_bank/kv_bank) or unbanked
  (blocks.N.attn.*.weight) state_dict format.

Spec 009 updated: reduced scope to 2 modes (baseline, internal_only) for
this session; full and port_1695 deferred. Rationale in the spec: MLP
LeakyReLU-squared breaks R_m float-invariance, resid_mix can't be cleanly
folded through RMSNorm, both needing design before implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Cleanup pass to resolve inconsistencies between the spec and what's
actually in spinquant_hotstart.py + test_rotation_invariance.py:

- Title + scope: 2-mode sweep (baseline, internal_only); full and
  port_1695 explicitly deferred to a follow-up spec.
- Checkpoint path: pre_gptq.pt (what execution's spec-008 patch
  produced, after _unbank_state_dict), not final_model.pt.
- Accept criteria: preflight via test_rotation_invariance.py
  (ALL TESTS PASS), then per-mode on pod.
- Rotation structure: trimmed to just the implemented R_a class
  with exact banked-tensor indexing. R_0 / R_m / skip-stream /
  RMSNorm-fold sections moved to 'not implemented (deferred)'.
- RMSNorm-fold section removed entirely: openai#1736's RMSNorm is
  gamma-free (F.rms_norm with no weight arg), so no fold needed.
- Code-changes section: points at the files on disk instead of
  TODO pseudocode.
- Execution protocol: 2 modes back-to-back on 8xH100, explicit
  preflight step.
- Hardware ladder: 8xH100 required (phased TTT is 8-rank DDP).
- Cost estimate: ~$15 total for 2 modes.
- Open questions: reframed around unbanked-checkpoint load,
  bf16 drift, GPTQ interaction, phased-TTT compatibility.
- What this spec does NOT do: clarified that residual rotation,
  R_m, resid_mix, and port_1695 are all deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…approach

Read openai#1695's diff. Their approach is fundamentally different from
the static-weight-rotation + folds design I had in mind for 'full'
mode. They do ONLINE activation rotation: 4 global Hadamard rotations
inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv,
attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in
the rotated basis; rotated Hessians keep the quant-side accounting
honest. Rotations OFF during training, ON after deserialize for
eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the
full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the
  LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input, never touch the
  residual stream. All per-channel multipliers (attn_scale,
  mlp_scale, resid_mix, skip_weights) operate in unchanged basis.

No float invariance — the model IS different post-rotation. The bet
is that the rotated-basis GPTQ delivers lower quant error and that
the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in
favor of a future 'port_1695' spec that ports their online scheme.
Internal_only mode from spec 009 remains useful as an independent
data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…n sprint

Session-narrative entry covering today's work:

- Frontier filter + baseline migration from merged SOTA openai#1493 (1.0810)
  to unmerged openai#1736 (1.0655), rationale, CLAUDE.md update.
- Spec 008 run partial result (training reproduced openai#1736 within
  +0.00016 at pre-quant; post-TTT gate number not captured due to
  watcher bug; projected pass ~1.06626).
- Spec 009 design evolution through three scope cuts: 4 modes ->
  unified sweep -> +baseline mode -> cut to 2 modes after discovering
  real architectural blockers (MLP LeakyReLU breaks R_m, resid_mix
  doesn't fold cleanly).
- openai#1695 diff discovery: they do online activation rotation, not
  static weight rotation. Sidesteps both LeakyReLU and resid_mix.
  Reframes 'full' mode -> port_1695 mode as the next quant-side
  spec.
- Specs 010 (port_1695, design only) and 011 (tapered WD, design
  only) drafted. Only spec 009 is truly runnable right now.

Closes with state-of-play table, modal plan, lessons-learned, and
open questions for next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Implements the port_1695 SpinQuant variant from PR openai#1695 onto the
openai#1736 stack. All changes env-var-gated (SPINQUANT_ENABLED=0 default)
so spec 008 and spec 009's baseline/internal_only modes are
unaffected bit-for-bit.

train_gpt.py changes (+247 lines):
- import hashlib
- Hyperparameters.spinquant_enabled, spinquant_seed
- CastedLinear._sq_active class flag (default False)
- Utility block: _stable_seed, _hadamard_rotation, install_spinquant_
  rotations, _SQ_KEY_TO_TAG, _spinquant_rotate_sd_and_H
- 4 forward-path hook sites (2 each in CausalSelfAttention,
  MLP, _block_with_lora, _parallel_block_with_lora):
  - pre-QKV: x_qkv = x @ R_attn_in
  - pre-attn-proj: y @ R_attn_proj_in
  - pre-fc: x @ R_mlp_in
  - post-activation pre-proj: hidden @ R_mlp_proj_in
- serialize(): call _spinquant_rotate_sd_and_H after Hessian collection
  and before GPTQ. Rotates weights (W @ R) and Hessians (R.T @ H @ R).
- deserialize(): install_spinquant_rotations + set _sq_active=True
  after loading rotated weights.
- MLP.forward: disable fused kernel when SpinQuant active.
- LoRA (TTT path) uses unrotated n, base path uses rotated n_qkv.

spinquant_hotstart.py changes:
- port_1695 mode no longer raises NotImplementedError. Sets
  h.spinquant_enabled=True and h.spinquant_seed; train_gpt.py's
  machinery does the rest.

Math: orthogonal R means R @ R.T == I, so x_rot @ W_rot = x @ R @
(W @ R).T = x @ R @ R.T @ W.T = x @ W.T. Pre-quant forward is
bit-identical to unrotated; GPTQ sees rotated basis where outliers
are spread more evenly and quantization error drops.

Spec 010 doc updated to reflect the implementation state. Execution
runs via SPINQUANT_MODE=port_1695 on spinquant_hotstart.py.

Not tested on GPU — flash_attn_3 not available on the dev box.
Syntax clean. First pod run will verify end-to-end behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
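The algebra in the "Math:" paragraph above can be checked numerically in a few lines (synthetic shapes, not the model's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))   # activations
W = rng.standard_normal((d, d))   # a linear layer's weight

# Orthogonal R via a normalized Sylvester-Hadamard (d a power of two)
H = np.array([[1.0]])
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])
R = H / np.sqrt(d)

# x_rot @ W_rot.T == x @ W.T, since R @ R.T == I
max_err = np.abs((x @ R) @ (W @ R).T - x @ W.T).max()
```

In exact arithmetic the difference is zero; in float64 it is at machine-epsilon scale, which is the "bit-identical pre-quant forward" claim modulo rounding.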
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Continuation of the morning diary. Covers:

- Spec 009 baseline closed spec 008's gate at 1.06728 (matches
  openai#1736's 1.06610 within bf16 noise). internal_only null (+0.00003).
- Spec 010 port_1695 also null aggregate (-0.00005), BUT per-batch
  analysis revealed a striking regime-dependent effect: rotation
  helps long-context docs (-0.0064 bpb on dl>1000) and hurts
  short-context docs (+0.0146 on dl<300). The null is a
  cancellation, not an absence of effect.
- 'TTT substitutes for rotation' hypothesis revised — the rotation
  Delta is ~0 at both pre-TTT and post-TTT stages. What rotation
  actually does is shift where in the doc-length distribution the
  model is strong, without changing the aggregate.
- Designed + implemented spec 010b (SPINQUANT_SITES env var) to
  isolate which sites (attn vs MLP) carry the help vs hurt. Ready
  for execution, ~$25.
- Lessons: look at per-batch trajectory data before concluding a
  null is null. Length-sorted running averages are systematically
  biased. Don't pivot prematurely from a signal you haven't fully
  interrogated.

Still $163 under project budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Closes the SpinQuant investigation arc with spec 010b's results and
an honest retrospective on the false-signal episode.

Key findings:
- All 5 SpinQuant variants (baseline, internal_only, port_1695,
  attn_only, mlp_only) land within 0.00009 bpb at final val_bpb.
  Pure null. openai#1736 has seed std ~0.00070; we are 10x below that.
- Pt.2's "regime-dependence is exploitable" hypothesis refuted.
  attn_only ≈ baseline on rank 0 (attention rotation does nothing);
  mlp_only has inverse regime from port_1695 (hurts long, helps
  short); neither subset comes close to port_1695's emergent
  rank-0 trajectory lead.
- Rank-0 rb spread across variants: 0.0075 bpb.
  Final val_bpb spread across variants: 0.000085 bpb.
  80x compression from 8-rank aggregation + TTT LoRA uniform
  absorption.

Mistake I owned up to: read rank-0 rb:1.0657 for mlp_only at batch
780 and suggested "mlp_only might actually net positive." Final.json
came out +0.000005 above baseline. Rank-0 rb is rank 0's 1/8 slice,
not a preview of the submission number.

Methodology corrections for future runs:
- Always check final.json before any trend interpretation
- Rank-0 rb is a progress indicator, not a metric preview
- When pre-TTT diagnostic_quantized spread < 0.001, post-TTT will
  be near-identical (TTT LoRA dominates)

Budget: spent ~$52 of $200 total. 10 days left.

Next: spec 011 (tapered Muon WD retrain) — upstream of TTT, might
unlock something TTT can't absorb. Patch still unwritten.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Scanned 200 PRs from 2026-04-11..20. After exclusion filters, 3 candidates
beat spec 011's expected Δ: openai#1682 (GradPower Muon p=0.9), openai#1648 (xIELU +
per-layer QK gain), openai#1555 (Tap-In eval cache). Full artifacts at
~/competition-pr/pr-scan-2026-04-20/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…1716)

Two orthogonal training-time levers queued behind spec 011:

- bpb-weighted-loss.md (port openai#1519): weight CE by UTF-8 bytes per token.
  Aligns training objective with eval metric. Risk: SP8192 vocab
  destabilization (author warns on large vocabs) + CaseOps byte LUT
  accounting (~1hr of careful code).

- bigram-hash-embed.md (port openai#1716): 16384×32 hash-table bigram embed
  added to token embedding pre-block-0. ~540K params / ~400KB artifact.
  openai#1736 genuinely lacks this despite prevalence in competitive lineages.

Recommended sequencing: 011 → 012 (QK) → 013 (BigramHash, lower risk)
→ 014 (BPB-weighted, higher risk).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
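The bpb-weighted-loss idea in the first bullet can be sketched in a few lines. This is a sketch of the general technique, not #1519's code; numpy is used for clarity and the function name is illustrative:

```python
import numpy as np

def byte_weighted_ce(logits, targets, bytes_per_token):
    # Per-token cross-entropy, weighted by each target token's UTF-8
    # byte count, so the training objective matches the bits-per-byte
    # eval metric instead of plain per-token CE.
    logits = np.asarray(logits, dtype=np.float64)
    z = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -logp[np.arange(len(targets)), targets]
    w = np.asarray(bytes_per_token, dtype=np.float64)
    return (ce * w).sum() / w.sum()
```

With uniform byte weights this reduces to ordinary mean cross-entropy; longer-byte tokens otherwise pull proportionally more gradient, which is the destabilization risk the author warns about on large vocabs.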
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
110 LOC pure addition to train_gpt.py, fully env-gated by
BIGRAM_HASH_ENABLED=0/1. Default-off invariant: with env unset the
forward pass, state_dict, and optimizer param list are byte-identical
to baseline.

Components:
- BigramHashEmbedding(nn.Module): embed(buckets, dim) + CastedLinear
  proj(dim, model_dim). proj._zero_init=True -> identity at step 0.
  Hash: ((prime_a * curr) ^ (prime_b * prev)) % buckets. Position-0
  fallback: prev = curr (self-bigram). Cross-doc leakage not special
  cased, matching openai#1736's SmearGate convention.
- GPT.__init__: creates self.bigram_embed when enabled else None.
- forward_logits + forward_ttt: additive merge of bigram(input_ids)
  to tok_emb(input_ids) before SmearGate. attr-guarded.
- Optimizers: embed.weight -> AdamW optimizer_tok (embed_wd), proj.weight
  -> Muon matrix_params.
- GPTQ hessian hooks: bigram_embed.embed output -> (dim,dim) hessian;
  bigram_embed.proj input -> (dim,dim) hessian (proj is <=65536 numel
  so fp16 passthrough; harmless hook).
- Startup log line echoing config.

Sizing: 16384*32 int6 embed ~= 393KB. 512*32 fp16 proj = 32KB.
Total ~425KB added to artifact; budget dry-run needed before launch.

Env vars (defaults): BIGRAM_HASH_ENABLED=0, BIGRAM_HASH_BUCKETS=16384,
BIGRAM_HASH_DIM=32, BIGRAM_HASH_PRIME_A=36313, BIGRAM_HASH_PRIME_B=27191.

Bug lesson learned from exp/training-bundle commit 8d54854: when Edit's
old_string only captures part of a for-loop body, trailing loop
statements get pushed outside the loop and may be absorbed by nearby
conditional blocks. This patch is a pure prepend/append style (no
splits of existing blocks) so that failure mode is avoided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
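The bucket hash described in the components list, sketched standalone with the default env-var values quoted above (the embedding-table wiring lives in the patch, not here):

```python
import numpy as np

PRIME_A, PRIME_B, BUCKETS = 36313, 27191, 16384  # patch defaults

def bigram_buckets(ids):
    # Hash each (prev, curr) token pair into one of BUCKETS slots:
    # ((prime_a * curr) ^ (prime_b * prev)) % buckets.
    ids = np.asarray(ids, dtype=np.int64)
    prev = np.concatenate([ids[:1], ids[:-1]])  # position 0: self-bigram
    return ((PRIME_A * ids) ^ (PRIME_B * prev)) % BUCKETS
```

The position-0 fallback (prev = curr) matches the description above; cross-document leakage is deliberately not special-cased.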
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Compiled reference list for architecture-side research thread, including:

- XSA identified as Exclusive Self-Attention (Apple, arXiv 2603.09078).
  Matches openai#1736's _xsa_efficient exactly.

- Universal Transformer (Dehghani 2018), ACT (Graves 2016) as
  foundational recurrence references.

- Key 2025 finding from ILR paper (arXiv 2505.01855): allocating more
  iterations to EARLIER layers yields optimal results. openai#1736's Loop45
  (middle layers) may be sub-optimally positioned.

- Parallel residuals literature: GPT-J / PaLM well-studied, multi-lane
  variants (Branchformer etc.) mostly in vision, thin in NLP.

- Synthesis of candidate variants prioritized by novelty × EV × cost.

- Proposed next step: instrument openai#1736 to log cross-pass cosine
  similarity during training. If high → cross-pass XSA worth trying.
  If already low → different variant needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Added section on 'when to activate recurrence' research. Key findings:
- ProRes, SGT, Staged Training all recommend progressive/curriculum
  activation over hard switches
- Literature has conflicting claims about WHERE convergence happens
  first (shallow vs deep layers)
- Consistent claim: progressive beats hard switch for stability
- openai#1736's enable_looping_at=0.35 is a hard switch — suboptimal per lit

Candidate variants identified, ranked by implementation cost:
env-var sweeps (1,2) vs code-change ramps (3,4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
… candidates

User shared a deep timeline of all recurrence experiments in the
PG competition (openai#8 through openai#1739). Several of my previously-proposed
experiments have ALREADY BEEN TESTED ON THIS STACK and shown to fail:

KILLED:
- Timing sweep earlier: openai#1726 showed 0.15 is +0.050 worse; openai#1739
  showed step-0 catastrophic (1.3936 bpb)
- Progressive ramp: openai#1663 showed hard-onset = smooth, no difference
- Position shift: openai#1726 showed layer 2-7 +0.163 worse, layer 5-6 shift
  +0.006 worse — layer 3-5 IS the empirical sweet spot

Also corrected the baseline config: openai#1736 uses LOOP_START=3 LOOP_END=5
(three layers: 3, 4, 5 — "Loop345"), not Loop45 as directory name
suggests. 3 layers × 3 passes = 17 virtual layers.

VIABLE candidates:
- Recur-Alpha (openai#1714, Anakintano): learnable scalar per looped block,
  init 0 → identity. 6 params. Author's grant ran out before TTT eval
  so composition with openai#1736's phased TTT is genuinely open. NEW TOP PICK.
- Cross-pass XSA: still novel, untested in any PR
- Loop3-6 variant (openai#1678): tashapais running it; might wait for result

Recommendation updated: port Recur-Alpha onto openai#1736 as spec 015.
~$25, identity-at-init (safe), 30 LOC, direct recurrence question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Shelving actions:
- Wrote research/evaluations/014-bpb-weighted-loss.md with full
  rationale and revisit criteria (post-deadline only)
- Added SHELVED status banner to top of the spec file
- Added experiments.md row marking 014 as 🗄️ SHELVED (permanently)

Decision: do NOT retune. Magnitude too large (+0.0619 = 62× shelve
threshold) to be recoverable via LR sweep. Three-null pattern (011,
013, 014) confirms that incremental ports from different-stack authors
do not transfer to openai#1736. Moving budget to spec 015 (Recur-Alpha).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Replica of spec-000-era lr_schedule.py for openai#1736/spec-015's stack.
Shows all four training-time schedules on one figure:

  1. lr_mul (warmdown)       — wallclock-based, starts at step 1207
  2. effective LR            — MATRIX_LR × lr_mul, concrete numbers
  3. Muon momentum           — step-based warmup, plateau at step 1500
  4. looping_active          — hard switch at step 1690 (wallclock 35%)

Key non-obvious finding: warmdown (step 1207) begins BEFORE looping
activates (step 1690). When recurrence kicks in, LR is already ~17%
decayed. This sequencing is baked into openai#1736's defaults.

Five distinct training regimes:
- [0, 1207]:    muon momentum warming, nothing else changing
- [1207, 1500]: warmdown begins, muon still warming
- [1500, 1690]: warmdown continues, muon plateau, looping still off
- [1690]:       looping activates (architectural change)
- [1690, 4828]: all settled, just linear LR decay

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
- diary/2026-04-21-recur-alpha-findings.md — full story of specs 015/016
  single-seed screens: α trajectories side-by-side, 5 findings (α>1 on
  pass-2, <1 on pass-3 at depth, depth-monotonicity inverts between
  passes, plateau is path-dependent, late-training rate unchanged), full
  caveats section, ranked next steps.

- research/ideas/beating-1736-note.md — four-run throughput + pipeline
  comparison (008/015/016/openai#1736). Works backward from target 1.06610 to
  a 0.00183 gap on pre-quant post-EMA; matched-throughput alone gives 3.3×
  margin over the gap. Risk ranks TTT composition as the one unknown
  (GPTQ cost is validated at +0.00947 parity). Concludes: single matched-
  clock NA run with bug-fixed TTT pipeline (~$10-15) settles the whole
  story.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
Primary submission-candidate run for recur-alpha family. Same commit as
016 (4dd2d63); NA 8xH100 to eliminate JP throughput variance; full
training + GPTQ + phased-TTT pipeline end-to-end (no EVAL_ONLY_CHECKPOINT
bypass that OOM'd in 016 post-hoc).

Goal: post-TTT val_bpb <= 1.06550 (beat openai#1736's 1.06610 by >= 0.0005).

Runs regardless of 016b's throughput-tax outcome:
- If no tax: high-confidence attempt at openai#1736 beat
- If tax: diagnostic for TTT x recur-alpha composition
- Either way we capture the post-TTT number that 016 post-hoc missed

Single seed 42 first, 3-seed conditional on clear-promote bucket. Costs
~\$10 single-seed, ~\$30-34 with 3-seed confirmation. Includes conditional
decision tree on 016b branches and tok/s-logging requirements for direct
throughput comparison with 016b's 2xH100 data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
…016 full pipeline"

NA-1 has no 8xH100 capacity today. Reframe spec 017 as: run spec 016's
commit (4dd2d63) with full training + GPTQ + phased-TTT pipeline end-to-
end on whichever region has capacity (JP is fine). Primary purpose is
capturing the post-TTT val_bpb that 016's screen (killed early) and 016
post-hoc TTT eval (OOM'd) both missed.

On JP expected post-TTT ~1.0679-1.0682 — close to but probably not
beating openai#1736's 1.06610. Still worth it: real composition measurement
replaces the projection chain.

Path fixes: JP volume jlxvxeiol4 mounts at /runpod (not /workspace);
example launch command rewritten accordingly. Memory entry added to
cross-session reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codemath3000

Running prepare_caseops_data.py as published, then running train_gpt.py with PHASED_TTT_ENABLED=1, reproducibly raises ZeroDivisionError: float division by zero at train_gpt.py:2303 in _loss_bpb_from_sums: byte_sum.item() is 0 because _find_docs (line 2209) returns an empty list.

The prep script never inserts BOS markers, and the tokenizer reserves IDs 0-7 (<pad>, <s>, </s>, <unk>, and the four CaseOps operators), so sp.encode can never naturally output ID 1. The training loop has a fallback at _init_shard lines 408-409 (if self.bos_idx.size == 0: self.bos_idx = np.array([0], ...)), so training completes, but the phased TTT eval path has no analogous fallback.

Am I missing a prep step, or should prepare_caseops_data.py be prepending BOS_ID=1 to each doc (matching download_hf_docs_and_tokenize.py:364-366)?
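For what it's worth, the fix pattern the question points at would look roughly like the sketch below (function and variable names are illustrative, not the repo's; only BOS_ID=1 and the zero sidecar entry come from the report itself):

```python
import numpy as np

BOS_ID = 1  # <s> control token; sp.encode never emits it since IDs 0-7 are reserved

def encode_doc_with_bos(sp_ids, bytes_per_token):
    """Prepend BOS to one doc's tokens and keep the byte sidecar aligned.

    Illustrative sketch of the download_hf_docs_and_tokenize.py pattern:
    the BOS row gets a 0 byte count, so the original-UTF-8 byte_sum that
    BPB is computed over is unchanged.
    """
    tokens = np.array([BOS_ID] + list(sp_ids), dtype=np.uint16)
    sidecar = np.array([0] + list(bytes_per_token), dtype=np.uint16)
    assert tokens.shape == sidecar.shape  # sidecar stays parallel to tokens
    return tokens, sidecar
```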

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
Submission-quality test of constant-α (017 endpoint values) with
full training + GPTQ + phased-TTT pipeline. Pins commit 2895db3 on
exp/recur-alpha-constant-full, which extends 018c's constant-α
wiring to the TTT forward path.

Target: beat openai#1736's 1.06610 post-TTT. Expected range 1.0650-1.0675
based on 018c's 92% throughput recovery + TTT bug fix. Single seed
42 first, 3-seed conditional on clear promote.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
- 4,697 steps (vs 4,828 for 008) due to slow JP node, not constant-α overhead
- Per-step quality strictly better than 008/017 at matched steps
- Linear extrapolation to step 4828 → post-TTT ~1.0606 (beats openai#1736)
- Recommendation: rerun on NA-1 pod

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
PR openai#1902 (cocohearts) accepted openai#1851/openai#1868 over openai#1736 and excluded openai#1855
only on significance grounds (p=0.325). Our prior 050 line built on openai#1797
which is under validity-cloud per cocohearts. Re-anchor research baseline
on openai#1855's accepted chain.

Pure port — zero modifications. Files copied verbatim from
codemath3000/parameter-golf:submission/sp8192-lqer-bos-smear-fix-9hp-stack
@ 1e43966 into records/track_10min_16mb/2026-04-29_PR1855_Port_Baseline/.

Spec 060B+ will fork exp/060B-* etc. to stack quant-repair / deploy-time
levers (046B-tight SDClip, 046L deploy-time repair, 046G-tighter, etc.)
on this baseline.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 29, 2026
… new SOTA 1.0608 imminent; PPM-D concerns raised; final day

- Discovered organizer has 2 pending branches staging 14 new leaderboard records
- BOS-fix branch confirms CaseOps LEGAL (PRs openai#1729/openai#1736/openai#1769/openai#1787 included as records)
- New SOTA when merged: 1.0608 (codemath3000, PR openai#1855); new target ≤1.0558
- Tap-In V6 (PR openai#1518) confirmed legal by organizer branch inclusion
- PPM-D: @valerio-oai raised concerns on PR openai#1835 (3M/40.5M partial data + autoregressivity); do not implement
- SmearGate BOS fix required (top entry PR openai#1855 uses it)
- Updated CLAUDE.md competition strategy + added Session 24 lessons learned
- Added Apr 29 daily research log entry

https://claude.ai/code/session_01AAiiKSwWxDtGTexxogAkeZ
@cocohearts cocohearts merged commit 76e2037 into openai:main Apr 29, 2026
cocohearts pushed a commit that referenced this pull request Apr 29, 2026
…Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token.
-0.00171 BPB vs PR #1736 (1.06549), -0.00043 vs PR #1779 (1.06421).

Stacks 4 orthogonal wins on top of PR #1736, all ablation-validated on
seed 0 against stock #1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from
  PR #1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x
  with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96
  params/layer vs dense GatedAttn 4096), preserving the attn_gate_w
  name so the int8-per-row quant path still routes it (size-range
  check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training
  forward; eval path keeps eager numerics unchanged
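The Polar Express change in the first bullet swaps a single repeated quintic tuple for a per-iteration schedule. A minimal NumPy sketch of the mechanism (the fixed tuple below is the (3.44, -4.78, 2.03) quintic cited above at higher precision; the five distinct tuples from PR #1344 are not reproduced here, so the schedule is just a list argument):

```python
import numpy as np

# Fixed-tuple baseline: one quintic coefficient set applied 5x.
FIXED = [(3.4445, -4.7750, 2.0315)] * 5

def newtonschulz5(G, coeffs=FIXED):
    """Approximate orthogonalization of a square matrix G.

    Each iteration maps every singular value s through a*s + b*s^3 + c*s^5.
    A Polar Express-style variant passes 5 DISTINCT (a, b, c) tuples in
    `coeffs` instead of repeating one; this sketch shows where they plug in.
    """
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm bounds spectral norm by 1
    for a, b, c in coeffs:              # one coefficient tuple per iteration
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```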

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0
(was 4000) together reclaim ~15s of training budget for additional
depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max
15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-
599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every
individual seed beats its PR #1736 counterpart (deltas -1.20 to
-2.27 mBPB).

Changes are fully orthogonal to PR #1779's frozen recurrent α/β and
PR #1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches
PR #1736 d7263a3 and PR #1769 fe7c309): sp.encode can't emit
BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's
_loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cocohearts added a commit that referenced this pull request Apr 29, 2026
… Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335

Merge accepted Parameter Golf record submission #1787.
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
External reproductions of this submission failed with ZeroDivisionError
in phased TTT eval because the shipped prep script did not prepend the
<s> control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7
(pad/s/</s>/unk + 4 CaseOps operators), so sp.encode cannot emit ID 1
naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers
with no fallback. Training ran because _init_shard:408-409 falls back to
bos_idx=[0] when no BOS is found; phased TTT eval has no equivalent
fallback.

Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0 to
the byte sidecar (BOS = 0 original bytes). Matches the canonical pattern
in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06549 metric is unaffected — val_bpb reduces to
loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is unchanged
with BOS prepended. Our seed logs were measured on shards that already
had BOS markers from an internal prep path; the shipped prep was the
outlier.

Also adds a Reproduction sanity check section to README.md that asserts
bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR openai#1736 comment 4285805497.
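A minimal version of the README sanity check described in that commit might look like this (the shard contents below are stand-ins; the real check would read the first val shard from disk):

```python
import numpy as np

BOS_ID = 1
# Stand-in for something like np.fromfile("fineweb_val_000000.bin",
# dtype=np.uint16) -- the shard filename is an assumption, not verified.
shard = np.array([1, 45, 902, 13, 1, 77, 8], dtype=np.uint16)

bos_count = int((shard == BOS_ID).sum())
assert bos_count > 0, "no BOS markers in val shard: rerun the fixed prep script"
```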
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
External reproductions of PR openai#1769 (and PR openai#1736) failed with
ZeroDivisionError in phased TTT eval because the shipped prep script
did not prepend the <s> control token (ID 1) to each doc. The SP
tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators),
so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs
(line 2209) requires BOS markers with no fallback. Training itself
ran because _init_shard:408-409 falls back to bos_idx=[0] when no
BOS is found; phased TTT eval has no equivalent fallback.

Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0
to the byte sidecar (BOS = 0 original bytes). Matches the canonical
pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06453 metric is unaffected — val_bpb reduces to
loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is
unchanged with BOS prepended. Our seed logs were measured on shards
that already had BOS markers from an internal prep path; the shipped
prep was the outlier.

Also adds a Reproduction sanity check section to README.md that
asserts bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR openai#1736 comment 4285805497.
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
… Attn Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token.
-0.00171 BPB vs PR openai#1736 (1.06549), -0.00043 vs PR openai#1779 (1.06421).

Stacks 4 orthogonal wins on top of PR openai#1736, all ablation-validated on
seed 0 against stock openai#1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from
  PR openai#1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x
  with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96
  params/layer vs dense GatedAttn 4096), preserving the attn_gate_w
  name so the int8-per-row quant path still routes it (size-range
  check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training
  forward; eval path keeps eager numerics unchanged
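The sparse head-output gate in the third bullet can be sketched as follows (hedged: the exact parameterization and placement in #1787 are not reproduced here; this only contrasts a per-head scalar gate with a dense per-channel one):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attn_out(attn_out, attn_gate_w):
    """attn_out: (seq, n_heads, head_dim); attn_gate_w: (n_heads,) scalars.

    One learned scalar per head (n_heads params/layer) broadcast over
    head_dim, instead of a dense (d_model,) gate -- the sparse-vs-dense
    parameter-count contrast described above.
    """
    return attn_out * sigmoid(attn_gate_w)[None, :, None]
```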

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0
(was 4000) together reclaim ~15s of training budget for additional
depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max
15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-
599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every
individual seed beats its PR openai#1736 counterpart (deltas -1.20 to
-2.27 mBPB).

Changes are fully orthogonal to PR openai#1779's frozen recurrent α/β and
PR openai#1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches
PR openai#1736 d7263a3 and PR openai#1769 fe7c309): sp.encode can't emit
BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's
_loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 →
    V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request May 3, 2026
…ns through 2026-04-27

Adds the 10 most recent leaderboard records from origin/main, including:
- 2026-04-27_SP8192_LQER_SparseGate_BOSSmearFix_9HpStack_1.0611 (CURRENT TOP, val_bpb 1.06108)
- 2026-04-23_SP8192_CaseOps_SparseGate_QuantGate_Loop45_PhasedTTT_PolarNS_MinLR_FusedCE
- 2026-04-22_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT_MLPClip12
- 2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT
- 2026-04-18_PR1626_CaseOps_Taper
- 2026-04-14_MultiPhaseGlobalSGD_PhasedTTT
- 2026-04-13_VarLenAttn_PhasingTTT2000
- 2026-04-10_VarLenAttn
- 2026-04-09_A2_Muon097_3Seed
- 2026-03-29_Loader_FullGPTQ_XSA11_BigramHash2816

These records use the SP8192 tokenizer + 8x H100 + TTT + advanced quantization
on a non-DEQ baseline (standard transformer with U-Net skips, Loop4-5 depth
recurrence, XSA, Polar-Express NS, etc). Top record (1.0611) lineage:
PR openai#1797 (SmearGate + LQER) -> PR openai#1787 (Polar NS + MIN_LR + SparseAttnGate +
Fused CE) -> PR openai#1736 (CaseOps + GatedAttn + QuantGate + Loop4-5 + PhasedTTT).

SparseAttnGate (PR openai#1787) reviewed for incorporation into our model arch:
analysis follows in next commit / iter 120 design spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


3 participants