
Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop45 + PhasedTTT — val_bpb 1.06549 #1736

Merged
cocohearts merged 3 commits into openai:main from dexhunter:dexhunter/caseops-gatedattn-quantgate-1.06549
Apr 29, 2026

Conversation

@dexhunter
Contributor

Summary

3-seed results (8×H100 80GB SXM, 10-min train / 10-min eval budgets)

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | Artifact (bytes) | train_time | eval_time |
|------|-------|-------------|--------------|------------------|------------|-----------|
| 42   | 4854  | 1.07847     | 1.06610      | 15,978,834       | 596.18s    | 396.9s    |
| 0    | 4843  | 1.07719     | 1.06473      | 15,971,476       | 596.17s    | 399.3s    |
| 1234 | 4847  | 1.07811     | 1.06563      | 15,975,050       | 596.08s    | 395.5s    |
| Mean | 4848  | 1.07792     | 1.06549      | 15,975,120       | 596.14s    | 397.23s   |
| Std  |       | 0.00066     | 0.00070      | 3,698            | 0.06s      | 1.9s      |

All three seeds clear size, train-time, and eval-time budgets with substantial headroom. 3-seed std is 0.00070 BPB — well inside the 0.005 significance floor.

Key innovation — CaseOps tokenizer + byte sidecar

CaseOps is a bijective, character-level text transform that removes English capitalization from the body text and records it as four operator tokens (TITLE, ALLCAPS, CAPNEXT, ESC), which become SentencePiece user_defined_symbols. Because the transform is fully invertible (decode(encode(s)) == s), no information is lost, and BPE merges allocate vocabulary to content rather than to case variants. The submission ships with a per-token byte sidecar (fineweb_val_bytes_*.bin, uint16, parallel to the val shards) so that BPB is computed on the ORIGINAL pre-transform UTF-8 bytes, not on the transformed representation — the score is on the same FineWeb text, just with a different tokenization front end.
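The invertibility claim can be illustrated with a minimal character-level sketch. This is not the PR's lossless_caps.py — it uses only the CAPNEXT and ESC operators (TITLE and ALLCAPS are word-level compressions of the same idea), with control characters standing in for the operator tokens, and assumes ASCII English text:

```python
CAPNEXT, ESC = "\x01", "\x02"  # stand-ins for the operator tokens

def encode(s):
    """Lower-case the text, recording each capital as CAPNEXT + lowercase."""
    out = []
    for ch in s:
        if ch in (CAPNEXT, ESC):
            out.append(ESC)           # escape literal operator characters
            out.append(ch)
        elif ch.isupper() and ch.lower() != ch:
            out.append(CAPNEXT)       # record the capitalization event
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def decode(t):
    """Exact inverse of encode: replay the recorded case operators."""
    out, i = [], 0
    while i < len(t):
        ch = t[i]
        if ch == ESC:
            out.append(t[i + 1]); i += 2
        elif ch == CAPNEXT:
            out.append(t[i + 1].upper()); i += 2
        else:
            out.append(ch); i += 1
    return "".join(out)
```

Because every transformation is recorded (and literal operator characters are escaped), decode(encode(s)) == s holds for any input, which is what lets BPB be scored against the original bytes.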

Rule compliance

  • Artifact ≤ 16,000,000 bytes DECIMAL (README FAQ + Issue A Field Guide to Valid Submissions #1017 §II.1): ✅ all seeds ≤ 15,978,834.
  • train_time ≤ 600s (README line 6): ✅ all seeds 596.08–596.18s.
  • total_eval_time ≤ 600s (README FAQ, separate budget): ✅ all seeds 395.5–399.3s.
  • Score-first TTT (Issue A Field Guide to Valid Submissions #1017 Condition 3): ✅ phased TTT snapshots the pre-update score on each chunk before the LoRA adapter step; per-doc LoRA reset between documents.
  • BPB on original bytes (Issue A Field Guide to Valid Submissions #1017 §V): ✅ per-token byte sidecar encodes canonical UTF-8 byte count of each val position.
  • No val data in training: ✅ training uses only fineweb_train_*.bin shards.
  • Reproducible: prepare_caseops_data.py is deterministic given the input FineWeb doc stream.
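The byte-sidecar scoring rule in the compliance list reduces to a simple identity: sum the model's NLL in nats over val positions, convert to bits, and divide by the original byte count from the sidecar. A hedged sketch (function name is illustrative, not from train_gpt.py):

```python
import math

def bpb(nll_nats_per_token, sidecar_bytes_per_token):
    # Divide total NLL (converted nats -> bits) by the canonical UTF-8
    # byte count recorded in the uint16 sidecar, so a tokenizer change
    # alone cannot inflate or deflate the score.
    total_nats = sum(nll_nats_per_token)
    total_bytes = sum(sidecar_bytes_per_token)
    return total_nats / (math.log(2) * total_bytes)
```

With this definition, a model that assigns exactly ln 2 nats (one bit) of loss per one-byte token scores 1.0 BPB regardless of how the text was tokenized.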

Test plan

  • Organizer reviews submission folder contents (train_gpt.py, prepare_caseops_data.py, tokenizer .model, 3 seed logs, submission.json, README.md, lossless_caps.py).
  • Organizer runs prepare_caseops_data.py to generate CaseOps shards + val byte sidecar.
  • Organizer reproduces at least one seed: SEED=42 CASEOPS_ENABLED=1 GATED_ATTN_QUANT_GATE=1 ... torchrun --standalone --nproc_per_node=8 train_gpt.py (full env in README).
  • Reproduced quantized_ttt_phased val_bpb matches the logged 1.06610 (±0.0007) within seed noise.
  • Artifact size, train_time, total_eval_time all within budgets on re-run.

Lineage

… — val_bpb 1.06549

3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green:
- artifact 15,975,120 bytes mean (≤16,000,000 DECIMAL)
- train_time 596.14s mean (≤600s)
- total_eval_time 397.23s mean (≤600s)

Builds on PR openai#1530 SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1)
bijective case preprocessing from PR openai#1729 with a per-token byte sidecar
so BPB is scored on original pre-transform UTF-8 bytes. Adds a learned
attention out-gate (init_std=0.005) + quant-gate scaling that recovers
the ~40 KB of overhead introduced by the new control tokens, keeping
every seed under the 16 MB decimal cap.

Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter
  (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) —
  reversible case-factoring with byte sidecar; stronger legality than
  casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738
  builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Bulk import of dexhunter's openai#1736 unmerged submission
(openai#1736, commit e100586) for reproduction as our
new research baseline. Source: records/track_10min_16mb/
2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/.

9 files, ~6856 lines:
- train_gpt.py (training script)
- lossless_caps.py (bijective CaseOps transform)
- prepare_caseops_data.py (data retokenization script)
- fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model (SP tokenizer)
- README.md, submission.json, 3 per-seed training logs

No modifications to repo-root files. Spec: research/specs/008-1736-reproduction.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
After 2026-04-19 frontier scan, rebasing the research baseline from
merged SOTA openai#1493 (1.0810) to unmerged PR openai#1736 (dexhunter, claimed
1.06549). Rationale: credible frontier moved ~0.015 bpb past merged
SOTA in 10 days via witnessed, legal levers (CaseOps tokenizer,
attn-out gate, phased TTT). Continuing off spec-000 leaves us behind
before we try anything.

- CLAUDE.md: baseline declared; baseline-migration specs land on
  research directly (exception to exp/<slug> convention).
- research/frontier-map.md: credibility filter + dependency map.
- diary/2026-04-19-frontier-{scan,map}.md: per-PR evidence base.
- research/ideas/1736-improvement.md: three-spec migration plan.
- research/specs/008-1736-reproduction.md: spec for the reproduction
  run, pinned to commit 154c9b8 (openai#1736 import at e100586).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Collapse spec 008 to seed 42 only and add a one-line pre-GPTQ FP
checkpoint save at runs/008-1736-reproduction/seed_42/pre_gptq.pt
(env-var gated via SAVE_PRE_GPTQ=1 so the reproduction itself is
unaffected when the flag is off).

Rationale: SpinQuant and subsequent quant-family experiments are
purely post-training transforms, so hotstarting off a single
pre-GPTQ FP checkpoint is far cheaper than retraining per spec.
Single-seed comparison against openai#1736's seed-42 (1.06610, ±0.003)
is apples-to-apples for screening. Cost drops ~$40 -> ~$17 for
this spec and ~$10 -> ~$1–2 per downstream quant experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
First lever layered on the new openai#1736 baseline. Hadamard rotation of
weight matrices before GPTQ quantization, hotstarted off spec 008's
pre_gptq.pt FP checkpoint. No retraining.

Witnessed at claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on a
openai#1529-adjacent base; expected to compose cleanly with openai#1736 since
the quant stage is orthogonal to CaseOps / attention gates / phased
TTT. Rotation is a post-training transform with three classes
(residual-stream, per-layer attn, per-layer MLP); FP forward pass is
invariant by construction, only quantization error drops.

Cost ~$6 (hotstart off spec 008 checkpoint), vs ~$30 for a full
retrain. Same hotstart checkpoint reused by future quant
experiments (per-group bit, AR-selfgen calib, AWQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
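For context on the rotation these quant experiments apply before GPTQ, here is a minimal sketch of an orthonormal Hadamard rotation via the Sylvester construction (illustrative only; the actual rotation code in the PRs is not shown here):

```python
import numpy as np

def hadamard(n):
    # Sylvester construction: n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(H.shape[0])   # orthonormal: H @ H.T == I

# Rotating a weight matrix into the Hadamard basis before quantization
# spreads per-channel outliers across all channels.
R = hadamard(8)
W = np.arange(64, dtype=np.float64).reshape(8, 8)
W_rot = W @ R
```

Because R is orthogonal, the rotation can be undone exactly in float; only the quantization step behaves differently in the rotated basis.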
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Based on reading train_gpt.py at commit 154c9b8:

Good: RMSNorm is gamma-free (line 529), so the usual gamma-fold step
doesn't apply. RMSNorm is rotation-equivariant directly.

Bad: openai#1736 has five OTHER per-channel multipliers on residual flow
(attn_scale, mlp_scale, resid_mix, skip_weights, skip_gates). These
are the real fold targets, not RMSNorm. resid_mix is pre-norm and
cannot be cleanly folded.

Split into three SpinQuant modes selectable by SPINQUANT_MODE:
- internal_only (R_a, R_m per layer; no residual rotation)
- full (internal + R0, with attn_scale/mlp_scale/skip folds and
  resid_mix freeze-to-mean compromise)
- port_1695 (conditional on openai#1695 diff being meaningfully different)

All three run back-to-back on one pod hotstarted off spec 008's
final_model.pt. ~30 min total GPU, ~$17-22 budget, one eval.

research/ideas/spinquant-integration-notes.md captures the full
design analysis (per-multiplier fold feasibility, three-option
tradeoff, shared-code plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Added SPINQUANT_MODE=baseline as a fourth variant that applies no
rotation — just loads final_model.pt, runs serialize/deserialize/
eval/TTT on it. Two purposes:

1. Closes the loop on spec 008's missed post-TTT number (watcher
   stopped the pod before the TTT eval ran). No separate $3
   eval-only rerun needed.
2. Provides the apples-to-apples local reference for measuring the
   three SpinQuant variants' Deltas — removes any cross-pod bf16
   drift from the comparison.

Order: baseline -> internal_only -> full -> port_1695, sequential on
one pod. Gate: if baseline lands outside openai#1736's 1.06610 +/- 0.003,
halt before running rotations (means spec 008 reproduction is off).

Total cost ~$27 (was $22); absorbs ~$3 of otherwise-separate eval
rerun, so net increment is ~$2 for four measured numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Two new files in the openai#1736 submission dir:

spinquant_hotstart.py (~360 LOC):
- Imports from train_gpt.py for Hyperparameters/GPT/serialize/deserialize/
  eval_val/eval_val_ttt_phased/BatchedTTTLoRA/etc.
- Modes: baseline, internal_only (R_a only, per-layer per-KV-group, d_head
  rotation on V-output and O-input).
- full, port_1695 are stubs — raise NotImplementedError with explanation.
- Pipeline: load FP state_dict from HOTSTART_FP_CKPT -> apply rotations
  in-place on banked qo_bank/kv_bank -> optional pre-quant diagnostic eval
  -> call serialize() (GPTQ+compress) -> deserialize() -> quantized eval
  -> phased TTT eval -> write final.json.
- Reproduces the TTT eval block from train_and_eval (lines 2997-3075) in
  _run_ttt_eval() rather than refactoring the source file.

test_rotation_invariance.py (~250 LOC):
- CPU-only, standalone (no train_gpt.py import due to flash_attn_3/triton
  module-level deps).
- Self-contained minimal attention forward: Q/K/V projection from the
  banked tensors, RMSNorm on Q and K (matches real model's bound on
  attention logits; without this, trained weights saturate softmax and
  float noise in V amplifies catastrophically).
- Tests baseline (bit-exact identity) and internal_only (rel tolerance
  1e-4) against either synthetic random weights or spec 008's
  final_model.pt. Both pass cleanly (rel_max ~1e-6 on real checkpoint).
- Can load either banked (qo_bank/kv_bank) or unbanked
  (blocks.N.attn.*.weight) state_dict format.

Spec 009 updated: reduced scope to 2 modes (baseline, internal_only) for
this session; full and port_1695 deferred. Rationale in the spec: MLP
LeakyReLU-squared breaks R_m float-invariance, resid_mix can't be cleanly
folded through RMSNorm, both needing design before implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Cleanup pass to resolve inconsistencies between the spec and what's
actually in spinquant_hotstart.py + test_rotation_invariance.py:

- Title + scope: 2-mode sweep (baseline, internal_only); full and
  port_1695 explicitly deferred to a follow-up spec.
- Checkpoint path: pre_gptq.pt (what execution's spec-008 patch
  produced, after _unbank_state_dict), not final_model.pt.
- Accept criteria: preflight via test_rotation_invariance.py
  (ALL TESTS PASS), then per-mode on pod.
- Rotation structure: trimmed to just the implemented R_a class
  with exact banked-tensor indexing. R_0 / R_m / skip-stream /
  RMSNorm-fold sections moved to 'not implemented (deferred)'.
- RMSNorm-fold section removed entirely: openai#1736's RMSNorm is
  gamma-free (F.rms_norm with no weight arg), so no fold needed.
- Code-changes section: points at the files on disk instead of
  TODO pseudocode.
- Execution protocol: 2 modes back-to-back on 8xH100, explicit
  preflight step.
- Hardware ladder: 8xH100 required (phased TTT is 8-rank DDP).
- Cost estimate: ~$15 total for 2 modes.
- Open questions: reframed around unbanked-checkpoint load,
  bf16 drift, GPTQ interaction, phased-TTT compatibility.
- What this spec does NOT do: clarified that residual rotation,
  R_m, resid_mix, and port_1695 are all deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…approach

Read openai#1695's diff. Their approach is fundamentally different from
the static-weight-rotation + folds design I had in mind for 'full'
mode. They do ONLINE activation rotation: 4 global Hadamard rotations
inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv,
attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in
the rotated basis; rotated Hessians keep the quant-side accounting
honest. Rotations OFF during training, ON after deserialize for
eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the
full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the
  LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input, never touch the
  residual stream. All per-channel multipliers (attn_scale,
  mlp_scale, resid_mix, skip_weights) operate in unchanged basis.

No float invariance — the model IS different post-rotation. The bet
is that the rotated-basis GPTQ delivers lower quant error and that
the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in
favor of a future 'port_1695' spec that ports their online scheme.
Internal_only mode from spec 009 remains useful as an independent
data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…n sprint

Session-narrative entry covering today's work:

- Frontier filter + baseline migration from merged SOTA openai#1493 (1.0810)
  to unmerged openai#1736 (1.0655), rationale, CLAUDE.md update.
- Spec 008 run partial result (training reproduced openai#1736 within
  +0.00016 at pre-quant; post-TTT gate number not captured due to
  watcher bug; projected pass ~1.06626).
- Spec 009 design evolution through three scope cuts: 4 modes ->
  unified sweep -> +baseline mode -> cut to 2 modes after discovering
  real architectural blockers (MLP LeakyReLU breaks R_m, resid_mix
  doesn't fold cleanly).
- openai#1695 diff discovery: they do online activation rotation, not
  static weight rotation. Sidesteps both LeakyReLU and resid_mix.
  Reframes 'full' mode -> port_1695 mode as the next quant-side
  spec.
- Specs 010 (port_1695, design only) and 011 (tapered WD, design
  only) drafted. Only spec 009 is truly runnable right now.

Closes with state-of-play table, modal plan, lessons-learned, and
open questions for next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Implements the port_1695 SpinQuant variant from PR openai#1695 onto the
openai#1736 stack. All changes env-var-gated (SPINQUANT_ENABLED=0 default)
so spec 008 and spec 009's baseline/internal_only modes are
unaffected bit-for-bit.

train_gpt.py changes (+247 lines):
- import hashlib
- Hyperparameters.spinquant_enabled, spinquant_seed
- CastedLinear._sq_active class flag (default False)
- Utility block: _stable_seed, _hadamard_rotation, install_spinquant_
  rotations, _SQ_KEY_TO_TAG, _spinquant_rotate_sd_and_H
- 4 forward-path hook sites (2 each in CausalSelfAttention,
  MLP, _block_with_lora, _parallel_block_with_lora):
  - pre-QKV: x_qkv = x @ R_attn_in
  - pre-attn-proj: y @ R_attn_proj_in
  - pre-fc: x @ R_mlp_in
  - post-activation pre-proj: hidden @ R_mlp_proj_in
- serialize(): call _spinquant_rotate_sd_and_H after Hessian collection
  and before GPTQ. Rotates weights (W @ R) and Hessians (R.T @ H @ R).
- deserialize(): install_spinquant_rotations + set _sq_active=True
  after loading rotated weights.
- MLP.forward: disable fused kernel when SpinQuant active.
- LoRA (TTT path) uses unrotated n, base path uses rotated n_qkv.

spinquant_hotstart.py changes:
- port_1695 mode no longer raises NotImplementedError. Sets
  h.spinquant_enabled=True and h.spinquant_seed; train_gpt.py's
  machinery does the rest.

Math: orthogonal R means R @ R.T == I, so x_rot @ W_rot = x @ R @
(W @ R).T = x @ R @ R.T @ W.T = x @ W.T. Pre-quant forward is
bit-identical to unrotated; GPTQ sees rotated basis where outliers
are spread more evenly and quantization error drops.

Spec 010 doc updated to reflect the implementation state. Execution
runs via SPINQUANT_MODE=port_1695 on spinquant_hotstart.py.

Not tested on GPU — flash_attn_3 not available on the dev box.
Syntax clean. First pod run will verify end-to-end behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
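The algebra in the "Math:" paragraph above can be checked numerically in a few lines (synthetic shapes, not the model's):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))   # activations
W = rng.standard_normal((d, d))   # a linear layer's weight

# Orthogonal R via a normalized Sylvester-Hadamard (d a power of two)
H = np.array([[1.0]])
while H.shape[0] < d:
    H = np.block([[H, H], [H, -H]])
R = H / np.sqrt(d)

# x_rot @ W_rot.T == x @ W.T, since R @ R.T == I
max_err = np.abs((x @ R) @ (W @ R).T - x @ W.T).max()
```

In exact arithmetic the difference is zero; in float64 it is at machine-epsilon scale, which is the "bit-identical pre-quant forward" claim modulo rounding.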
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Continuation of the morning diary. Covers:

- Spec 009 baseline closed spec 008's gate at 1.06728 (matches
  openai#1736's 1.06610 within bf16 noise). internal_only null (+0.00003).
- Spec 010 port_1695 also null aggregate (-0.00005), BUT per-batch
  analysis revealed a striking regime-dependent effect: rotation
  helps long-context docs (-0.0064 bpb on dl>1000) and hurts
  short-context docs (+0.0146 on dl<300). The null is a
  cancellation, not an absence of effect.
- 'TTT substitutes for rotation' hypothesis revised — the rotation
  Delta is ~0 at both pre-TTT and post-TTT stages. What rotation
  actually does is shift where in the doc-length distribution the
  model is strong, without changing the aggregate.
- Designed + implemented spec 010b (SPINQUANT_SITES env var) to
  isolate which sites (attn vs MLP) carry the help vs hurt. Ready
  for execution, ~$25.
- Lessons: look at per-batch trajectory data before concluding a
  null is null. Length-sorted running averages are systematically
  biased. Don't pivot prematurely from a signal you haven't fully
  interrogated.

Still $163 under project budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Closes the SpinQuant investigation arc with spec 010b's results and
an honest retrospective on the false-signal episode.

Key findings:
- All 5 SpinQuant variants (baseline, internal_only, port_1695,
  attn_only, mlp_only) land within 0.00009 bpb at final val_bpb.
  Pure null. openai#1736 has seed std ~0.00070; we are 10x below that.
- Pt.2's "regime-dependence is exploitable" hypothesis refuted.
  attn_only ≈ baseline on rank 0 (attention rotation does nothing);
  mlp_only has inverse regime from port_1695 (hurts long, helps
  short); neither subset comes close to port_1695's emergent
  rank-0 trajectory lead.
- Rank-0 rb spread across variants: 0.0075 bpb.
  Final val_bpb spread across variants: 0.000085 bpb.
  80x compression from 8-rank aggregation + TTT LoRA uniform
  absorption.

Mistake I owned up to: read rank-0 rb:1.0657 for mlp_only at batch
780 and suggested "mlp_only might actually net positive." Final.json
came out +0.000005 above baseline. Rank-0 rb is rank 0's 1/8 slice,
not a preview of the submission number.

Methodology corrections for future runs:
- Always check final.json before any trend interpretation
- Rank-0 rb is a progress indicator, not a metric preview
- When pre-TTT diagnostic_quantized spread < 0.001, post-TTT will
  be near-identical (TTT LoRA dominates)

Budget: spent ~$52 of $200 total. 10 days left.

Next: spec 011 (tapered Muon WD retrain) — upstream of TTT, might
unlock something TTT can't absorb. Patch still unwritten.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Scanned 200 PRs from 2026-04-11..20. After exclusion filters, 3 candidates
beat spec 011's expected Δ: openai#1682 (GradPower Muon p=0.9), openai#1648 (xIELU +
per-layer QK gain), openai#1555 (Tap-In eval cache). Full artifacts at
~/competition-pr/pr-scan-2026-04-20/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…1716)

Two orthogonal training-time levers queued behind spec 011:

- bpb-weighted-loss.md (port openai#1519): weight CE by UTF-8 bytes per token.
  Aligns training objective with eval metric. Risk: SP8192 vocab
  destabilization (author warns on large vocabs) + CaseOps byte LUT
  accounting (~1hr of careful code).

- bigram-hash-embed.md (port openai#1716): 16384×32 hash-table bigram embed
  added to token embedding pre-block-0. ~540K params / ~400KB artifact.
  openai#1736 genuinely lacks this despite prevalence in competitive lineages.

Recommended sequencing: 011 → 012 (QK) → 013 (BigramHash, lower risk)
→ 014 (BPB-weighted, higher risk).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
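The bpb-weighted-loss idea in the first bullet can be sketched in a few lines. This is a sketch of the general technique, not #1519's code; numpy is used for clarity and the function name is illustrative:

```python
import numpy as np

def byte_weighted_ce(logits, targets, bytes_per_token):
    # Per-token cross-entropy, weighted by each target token's UTF-8
    # byte count, so the training objective matches the bits-per-byte
    # eval metric instead of plain per-token CE.
    logits = np.asarray(logits, dtype=np.float64)
    z = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -logp[np.arange(len(targets)), targets]
    w = np.asarray(bytes_per_token, dtype=np.float64)
    return (ce * w).sum() / w.sum()
```

With uniform byte weights this reduces to ordinary mean cross-entropy; longer-byte tokens otherwise pull proportionally more gradient, which is the destabilization risk the author warns about on large vocabs.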
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
110 LOC pure addition to train_gpt.py, fully env-gated by
BIGRAM_HASH_ENABLED=0/1. Default-off invariant: with env unset the
forward pass, state_dict, and optimizer param list are byte-identical
to baseline.

Components:
- BigramHashEmbedding(nn.Module): embed(buckets, dim) + CastedLinear
  proj(dim, model_dim). proj._zero_init=True -> identity at step 0.
  Hash: ((prime_a * curr) ^ (prime_b * prev)) % buckets. Position-0
  fallback: prev = curr (self-bigram). Cross-doc leakage not special
  cased, matching openai#1736's SmearGate convention.
- GPT.__init__: creates self.bigram_embed when enabled else None.
- forward_logits + forward_ttt: additive merge of bigram(input_ids)
  to tok_emb(input_ids) before SmearGate. attr-guarded.
- Optimizers: embed.weight -> AdamW optimizer_tok (embed_wd), proj.weight
  -> Muon matrix_params.
- GPTQ hessian hooks: bigram_embed.embed output -> (dim,dim) hessian;
  bigram_embed.proj input -> (dim,dim) hessian (proj is <=65536 numel
  so fp16 passthrough; harmless hook).
- Startup log line echoing config.

Sizing: 16384*32 int6 embed ~= 393KB. 512*32 fp16 proj = 32KB.
Total ~425KB added to artifact; budget dry-run needed before launch.

Env vars (defaults): BIGRAM_HASH_ENABLED=0, BIGRAM_HASH_BUCKETS=16384,
BIGRAM_HASH_DIM=32, BIGRAM_HASH_PRIME_A=36313, BIGRAM_HASH_PRIME_B=27191.

Bug lesson learned from exp/training-bundle commit 8d54854: when Edit's
old_string only captures part of a for-loop body, trailing loop
statements get pushed outside the loop and may be absorbed by nearby
conditional blocks. This patch is a pure prepend/append style (no
splits of existing blocks) so that failure mode is avoided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
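The bucket hash described in the components list, sketched standalone with the default env-var values quoted above (the embedding-table wiring lives in the patch, not here):

```python
import numpy as np

PRIME_A, PRIME_B, BUCKETS = 36313, 27191, 16384  # patch defaults

def bigram_buckets(ids):
    # Hash each (prev, curr) token pair into one of BUCKETS slots:
    # ((prime_a * curr) ^ (prime_b * prev)) % buckets.
    ids = np.asarray(ids, dtype=np.int64)
    prev = np.concatenate([ids[:1], ids[:-1]])  # position 0: self-bigram
    return ((PRIME_A * ids) ^ (PRIME_B * prev)) % BUCKETS
```

The position-0 fallback (prev = curr) matches the description above; cross-document leakage is deliberately not special-cased.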
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Compiled reference list for architecture-side research thread, including:

- XSA identified as Exclusive Self-Attention (Apple, arXiv 2603.09078).
  Matches openai#1736's _xsa_efficient exactly.

- Universal Transformer (Dehghani 2018), ACT (Graves 2016) as
  foundational recurrence references.

- Key 2025 finding from ILR paper (arXiv 2505.01855): allocating more
  iterations to EARLIER layers yields optimal results. openai#1736's Loop45
  (middle layers) may be sub-optimally positioned.

- Parallel residuals literature: GPT-J / PaLM well-studied, multi-lane
  variants (Branchformer etc.) mostly in vision, thin in NLP.

- Synthesis of candidate variants prioritized by novelty × EV × cost.

- Proposed next step: instrument openai#1736 to log cross-pass cosine
  similarity during training. If high → cross-pass XSA worth trying.
  If already low → different variant needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Added section on 'when to activate recurrence' research. Key findings:
- ProRes, SGT, Staged Training all recommend progressive/curriculum
  activation over hard switches
- Literature has conflicting claims about WHERE convergence happens
  first (shallow vs deep layers)
- Consistent claim: progressive beats hard switch for stability
- openai#1736's enable_looping_at=0.35 is a hard switch — suboptimal per lit

Candidate variants identified, ranked by implementation cost:
env-var sweeps (1,2) vs code-change ramps (3,4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
… candidates

User shared a deep timeline of all recurrence experiments in the
PG competition (openai#8 through openai#1739). Several of my previously-proposed
experiments have ALREADY BEEN TESTED ON THIS STACK and shown to fail:

KILLED:
- Timing sweep earlier: openai#1726 showed 0.15 is +0.050 worse; openai#1739
  showed step-0 catastrophic (1.3936 bpb)
- Progressive ramp: openai#1663 showed hard-onset = smooth, no difference
- Position shift: openai#1726 showed layer 2-7 +0.163 worse, layer 5-6 shift
  +0.006 worse — layer 3-5 IS the empirical sweet spot

Also corrected the baseline config: openai#1736 uses LOOP_START=3 LOOP_END=5
(three layers: 3, 4, 5 — "Loop345"), not Loop45 as directory name
suggests. 3 layers × 3 passes = 17 virtual layers.

VIABLE candidates:
- Recur-Alpha (openai#1714, Anakintano): learnable scalar per looped block,
  init 0 → identity. 6 params. Author's grant ran out before TTT eval
  so composition with openai#1736's phased TTT is genuinely open. NEW TOP PICK.
- Cross-pass XSA: still novel, untested in any PR
- Loop3-6 variant (openai#1678): tashapais running it; might wait for result

Recommendation updated: port Recur-Alpha onto openai#1736 as spec 015.
~$25, identity-at-init (safe), 30 LOC, direct recurrence question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Shelving actions:
- Wrote research/evaluations/014-bpb-weighted-loss.md with full
  rationale and revisit criteria (post-deadline only)
- Added SHELVED status banner to top of the spec file
- Added experiments.md row marking 014 as 🗄️ SHELVED (permanently)

Decision: do NOT retune. Magnitude too large (+0.0619 = 62× shelve
threshold) to be recoverable via LR sweep. Three-null pattern (011,
013, 014) confirms that incremental ports from different-stack authors
do not transfer to openai#1736. Moving budget to spec 015 (Recur-Alpha).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Replica of spec-000-era lr_schedule.py for openai#1736/spec-015's stack.
Shows all four training-time schedules on one figure:

  1. lr_mul (warmdown)       — wallclock-based, starts at step 1207
  2. effective LR            — MATRIX_LR × lr_mul, concrete numbers
  3. Muon momentum           — step-based warmup, plateau at step 1500
  4. looping_active          — hard switch at step 1690 (wallclock 35%)

Key non-obvious finding: warmdown (step 1207) begins BEFORE looping
activates (step 1690). When recurrence kicks in, LR is already ~17%
decayed. This sequencing is baked into openai#1736's defaults.

Five distinct training regimes:
- [0, 1207]:    muon momentum warming, nothing else changing
- [1207, 1500]: warmdown begins, muon still warming
- [1500, 1690]: warmdown continues, muon plateau, looping still off
- [1690]:       looping activates (architectural change)
- [1690, 4828]: all settled, just linear LR decay

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
- diary/2026-04-21-recur-alpha-findings.md — full story of specs 015/016
  single-seed screens: α trajectories side-by-side, 5 findings (α>1 on
  pass-2, <1 on pass-3 at depth, depth-monotonicity inverts between
  passes, plateau is path-dependent, late-training rate unchanged), full
  caveats section, ranked next steps.

- research/ideas/beating-1736-note.md — four-run throughput + pipeline
  comparison (008/015/016/openai#1736). Works backward from target 1.06610 to
  a 0.00183 gap on pre-quant post-EMA; matched-throughput alone gives 3.3×
  margin over the gap. Risk ranks TTT composition as the one unknown
  (GPTQ cost is validated at +0.00947 parity). Concludes: single matched-
  clock NA run with bug-fixed TTT pipeline (~$10-15) settles the whole
  story.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
Primary submission-candidate run for recur-alpha family. Same commit as
016 (4dd2d63); NA 8xH100 to eliminate JP throughput variance; full
training + GPTQ + phased-TTT pipeline end-to-end (no EVAL_ONLY_CHECKPOINT
bypass that OOM'd in 016 post-hoc).

Goal: post-TTT val_bpb <= 1.06550 (beat openai#1736's 1.06610 by >= 0.0005).

Runs regardless of 016b's throughput-tax outcome:
- If no tax: high-confidence attempt at openai#1736 beat
- If tax: diagnostic for TTT x recur-alpha composition
- Either way we capture the post-TTT number that 016 post-hoc missed

Single seed 42 first, 3-seed conditional on clear-promote bucket. Costs
~\$10 single-seed, ~\$30-34 with 3-seed confirmation. Includes conditional
decision tree on 016b branches and tok/s-logging requirements for direct
throughput comparison with 016b's 2xH100 data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
…016 full pipeline"

NA-1 has no 8xH100 capacity today. Reframe spec 017 as: run spec 016's
commit (4dd2d63) with full training + GPTQ + phased-TTT pipeline end-to-
end on whichever region has capacity (JP is fine). Primary purpose is
capturing the post-TTT val_bpb that 016's screen (killed early) and 016
post-hoc TTT eval (OOM'd) both missed.

On JP expected post-TTT ~1.0679-1.0682 — close to but probably not
beating openai#1736's 1.06610. Still worth it: real composition measurement
replaces the projection chain.

Path fixes: JP volume jlxvxeiol4 mounts at /runpod (not /workspace);
example launch command rewritten accordingly. Memory entry added to
cross-session reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codemath3000

Running prepare_caseops_data.py as published, then running train_gpt.py with PHASED_TTT_ENABLED=1, reproducibly raises ZeroDivisionError: float division by zero at train_gpt.py:2303 in _loss_bpb_from_sums: byte_sum.item() is 0 because _find_docs (line 2209) returns an empty list.

The prep script never inserts BOS markers, and the tokenizer reserves IDs 0-7 (<pad>, <s>, </s>, <unk>, and the four CaseOps operators), so sp.encode can never naturally output ID 1. The training loop has a fallback at _init_shard lines 408-409 (if self.bos_idx.size == 0: self.bos_idx = np.array([0], ...)), so training completes, but the phased TTT eval path has no analogous fallback.

Am I missing a prep step, or should prepare_caseops_data.py be prepending BOS_ID=1 to each doc (matching download_hf_docs_and_tokenize.py:364-366)?
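For what it's worth, the fix pattern the question points at would look roughly like the sketch below (function and variable names are illustrative, not the repo's; only BOS_ID=1 and the zero sidecar entry come from the report itself):

```python
import numpy as np

BOS_ID = 1  # <s> control token; sp.encode never emits it since IDs 0-7 are reserved

def encode_doc_with_bos(sp_ids, bytes_per_token):
    """Prepend BOS to one doc's tokens and keep the byte sidecar aligned.

    Illustrative sketch of the download_hf_docs_and_tokenize.py pattern:
    the BOS row gets a 0 byte count, so the original-UTF-8 byte_sum that
    BPB is computed over is unchanged.
    """
    tokens = np.array([BOS_ID] + list(sp_ids), dtype=np.uint16)
    sidecar = np.array([0] + list(bytes_per_token), dtype=np.uint16)
    assert tokens.shape == sidecar.shape  # sidecar stays parallel to tokens
    return tokens, sidecar
```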

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
Submission-quality test of constant-α (017 endpoint values) with
full training + GPTQ + phased-TTT pipeline. Pins commit 2895db3 on
exp/recur-alpha-constant-full, which extends 018c's constant-α
wiring to the TTT forward path.

Target: beat openai#1736's 1.06610 post-TTT. Expected range 1.0650-1.0675
based on 018c's 92% throughput recovery + TTT bug fix. Single seed
42 first, 3-seed conditional on clear promote.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
- 4,697 steps (vs 4,828 for 008) due to slow JP node, not constant-α overhead
- Per-step quality strictly better than 008/017 at matched steps
- Linear extrapolation to step 4828 → post-TTT ~1.0606 (beats openai#1736)
- Recommendation: rerun on NA-1 pod

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
PR openai#1902 (cocohearts) accepted openai#1851/openai#1868 over openai#1736 and excluded openai#1855
only on significance grounds (p=0.325). Our prior 050 line built on openai#1797
which is under validity-cloud per cocohearts. Re-anchor research baseline
on openai#1855's accepted chain.

Pure port — zero modifications. Files copied verbatim from
codemath3000/parameter-golf:submission/sp8192-lqer-bos-smear-fix-9hp-stack
@ 1e43966 into records/track_10min_16mb/2026-04-29_PR1855_Port_Baseline/.

Spec 060B+ will fork exp/060B-* etc. to stack quant-repair / deploy-time
levers (046B-tight SDClip, 046L deploy-time repair, 046G-tighter, etc.)
on this baseline.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 29, 2026
… new SOTA 1.0608 imminent; PPM-D concerns raised; final day

- Discovered organizer has 2 pending branches staging 14 new leaderboard records
- BOS-fix branch confirms CaseOps LEGAL (PRs openai#1729/openai#1736/openai#1769/openai#1787 included as records)
- New SOTA when merged: 1.0608 (codemath3000, PR openai#1855); new target ≤1.0558
- Tap-In V6 (PR openai#1518) confirmed legal by organizer branch inclusion
- PPM-D: @valerio-oai raised concerns on PR openai#1835 (3M/40.5M partial data + autoregressivity); do not implement
- SmearGate BOS fix required (top entry PR openai#1855 uses it)
- Updated CLAUDE.md competition strategy + added Session 24 lessons learned
- Added Apr 29 daily research log entry

https://claude.ai/code/session_01AAiiKSwWxDtGTexxogAkeZ
@cocohearts cocohearts merged commit 76e2037 into openai:main Apr 29, 2026
cocohearts pushed a commit that referenced this pull request Apr 29, 2026
…Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token.
-0.00171 BPB vs PR #1736 (1.06549), -0.00043 vs PR #1779 (1.06421).

Stacks 4 orthogonal wins on top of PR #1736, all ablation-validated on
seed 0 against stock #1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from
  PR #1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x
  with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96
  params/layer vs dense GatedAttn 4096), preserving the attn_gate_w
  name so the int8-per-row quant path still routes it (size-range
  check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training
  forward; eval path keeps eager numerics unchanged
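The Polar Express change in the first bullet swaps a single repeated quintic tuple for a per-iteration schedule. A minimal NumPy sketch of the mechanism (the fixed tuple below is the (3.44, -4.78, 2.03) quintic cited above at higher precision; the five distinct tuples from PR #1344 are not reproduced here, so the schedule is just a list argument):

```python
import numpy as np

# Fixed-tuple baseline: one quintic coefficient set applied 5x.
FIXED = [(3.4445, -4.7750, 2.0315)] * 5

def newtonschulz5(G, coeffs=FIXED):
    """Approximate orthogonalization of a square matrix G.

    Each iteration maps every singular value s through a*s + b*s^3 + c*s^5.
    A Polar Express-style variant passes 5 DISTINCT (a, b, c) tuples in
    `coeffs` instead of repeating one; this sketch shows where they plug in.
    """
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius norm bounds spectral norm by 1
    for a, b, c in coeffs:              # one coefficient tuple per iteration
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```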

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0
(was 4000) together reclaim ~15s of training budget for additional
depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max
15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-
599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every
individual seed beats its PR #1736 counterpart (deltas -1.20 to
-2.27 mBPB).

Changes are fully orthogonal to PR #1779's frozen recurrent α/β and
PR #1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches
PR #1736 d7263a3 and PR #1769 fe7c309): sp.encode can't emit
BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's
_loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cocohearts added a commit that referenced this pull request Apr 29, 2026
… Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335

Merge accepted Parameter Golf record submission #1787.
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
External reproductions of this submission failed with ZeroDivisionError
in phased TTT eval because the shipped prep script did not prepend the
<s> control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7
(pad/s/</s>/unk + 4 CaseOps operators), so sp.encode cannot emit ID 1
naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers
with no fallback. Training ran because _init_shard:408-409 falls back to
bos_idx=[0] when no BOS is found; phased TTT eval has no equivalent
fallback.

Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0 to
the byte sidecar (BOS = 0 original bytes). Matches the canonical pattern
in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06549 metric is unaffected — val_bpb reduces to
loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is unchanged
with BOS prepended. Our seed logs were measured on shards that already
had BOS markers from an internal prep path; the shipped prep was the
outlier.

Also adds a Reproduction sanity check section to README.md that asserts
bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR openai#1736 comment 4285805497.
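A minimal version of the README sanity check described in that commit might look like this (the shard contents below are stand-ins; the real check would read the first val shard from disk):

```python
import numpy as np

BOS_ID = 1
# Stand-in for something like np.fromfile("fineweb_val_000000.bin",
# dtype=np.uint16) -- the shard filename is an assumption, not verified.
shard = np.array([1, 45, 902, 13, 1, 77, 8], dtype=np.uint16)

bos_count = int((shard == BOS_ID).sum())
assert bos_count > 0, "no BOS markers in val shard: rerun the fixed prep script"
```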
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
External reproductions of PR openai#1769 (and PR openai#1736) failed with
ZeroDivisionError in phased TTT eval because the shipped prep script
did not prepend the <s> control token (ID 1) to each doc. The SP
tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators),
so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs
(line 2209) requires BOS markers with no fallback. Training itself
ran because _init_shard:408-409 falls back to bos_idx=[0] when no
BOS is found; phased TTT eval has no equivalent fallback.

Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0
to the byte sidecar (BOS = 0 original bytes). Matches the canonical
pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06453 metric is unaffected — val_bpb reduces to
loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is
unchanged with BOS prepended. Our seed logs were measured on shards
that already had BOS markers from an internal prep path; the shipped
prep was the outlier.

Also adds a Reproduction sanity check section to README.md that
asserts bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR openai#1736 comment 4285805497.
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
… Attn Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token.
-0.00171 BPB vs PR openai#1736 (1.06549), -0.00043 vs PR openai#1779 (1.06421).

Stacks 4 orthogonal wins on top of PR openai#1736, all ablation-validated on
seed 0 against stock openai#1736 before stacking:
- Polar Express per-iteration minimax Newton-Schulz coefficients (from
  PR openai#1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x
  with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96
  params/layer vs dense GatedAttn 4096), preserving the attn_gate_w
  name so the int8-per-row quant path still routes it (size-range
  check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training
  forward; eval path keeps eager numerics unchanged
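The sparse head-output gate in the third bullet can be sketched as follows (hedged: the exact parameterization and placement in #1787 are not reproduced here; this only contrasts a per-head scalar gate with a dense per-channel one):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attn_out(attn_out, attn_gate_w):
    """attn_out: (seq, n_heads, head_dim); attn_gate_w: (n_heads,) scalars.

    One learned scalar per head (n_heads params/layer) broadcast over
    head_dim, instead of a dense (d_model,) gate -- the sparse-vs-dense
    parameter-count contrast described above.
    """
    return attn_out * sigmoid(attn_gate_w)[None, :, None]
```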

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0
(was 4000) together reclaim ~15s of training budget for additional
depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max
15,940,380 B, ~60 KB headroom), the 600s train budget (599.46-
599.57s), and the 600s TTT-eval budget (412.8-511.3s). Every
individual seed beats its PR openai#1736 counterpart (deltas -1.20 to
-2.27 mBPB).

Changes are fully orthogonal to PR openai#1779's frozen recurrent α/β and
PR openai#1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches
PR openai#1736 d7263a3 and PR openai#1769 fe7c309): sp.encode can't emit
BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's
_loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 →
    V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
jzmyres pushed a commit to jzmyres/parameter-golf that referenced this pull request May 3, 2026
…ns through 2026-04-27

Adds the 10 most recent leaderboard records from origin/main, including:
- 2026-04-27_SP8192_LQER_SparseGate_BOSSmearFix_9HpStack_1.0611 (CURRENT TOP, val_bpb 1.06108)
- 2026-04-23_SP8192_CaseOps_SparseGate_QuantGate_Loop45_PhasedTTT_PolarNS_MinLR_FusedCE
- 2026-04-22_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT_MLPClip12
- 2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT
- 2026-04-18_PR1626_CaseOps_Taper
- 2026-04-14_MultiPhaseGlobalSGD_PhasedTTT
- 2026-04-13_VarLenAttn_PhasingTTT2000
- 2026-04-10_VarLenAttn
- 2026-04-09_A2_Muon097_3Seed
- 2026-03-29_Loader_FullGPTQ_XSA11_BigramHash2816

These records use the SP8192 tokenizer + 8x H100 + TTT + advanced quantization
on a non-DEQ baseline (standard transformer with U-Net skips, Loop4-5 depth
recurrence, XSA, Polar-Express NS, etc). Top record (1.0611) lineage:
PR openai#1797 (SmearGate + LQER) -> PR openai#1787 (Polar NS + MIN_LR + SparseAttnGate +
Fused CE) -> PR openai#1736 (CaseOps + GatedAttn + QuantGate + Loop4-5 + PhasedTTT).

SparseAttnGate (PR openai#1787) reviewed for incorporation into our model arch:
analysis follows in next commit / iter 120 design spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


3 participants