Commit 416b0e4

leon2k2k2k and claude committed
spec 008 + baseline migration to openai#1736
After 2026-04-19 frontier scan, rebasing the research baseline from merged SOTA openai#1493 (1.0810) to unmerged PR openai#1736 (dexhunter, claimed 1.06549).

Rationale: credible frontier moved ~0.015 bpb past merged SOTA in 10 days via witnessed, legal levers (CaseOps tokenizer, attn-out gate, phased TTT). Continuing off spec-000 leaves us behind before we try anything.

- CLAUDE.md: baseline declared; baseline-migration specs land on research directly (exception to exp/<slug> convention).
- research/frontier-map.md: credibility filter + dependency map.
- diary/2026-04-19-frontier-{scan,map}.md: per-PR evidence base.
- research/ideas/1736-improvement.md: three-spec migration plan.
- research/specs/008-1736-reproduction.md: spec for the reproduction run, pinned to commit 154c9b8 (openai#1736 import at e100586).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 154c9b8 commit 416b0e4

6 files changed

Lines changed: 902 additions & 1 deletion

CLAUDE.md

Lines changed: 9 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,13 @@
11
# Parameter Golf — Repo Conventions
22

33
## What this is
4-
OpenAI Parameter Golf challenge, **record track**. Goal: beat SOTA **1.0810 bpb**. Deadline **2026-04-30**. Training code (`train_gpt_sota.py`, `hotstart.py`, `run_*.sh`) lives at the repo root alongside this scaffold.
4+
OpenAI Parameter Golf challenge, **record track**. Goal: beat SOTA. Deadline **2026-04-30**. Training code lives at the repo root alongside this scaffold.
5+
6+
**Current baseline (as of 2026-04-20):** rebased from merged SOTA #1493 (1.0810) to PR **#1736** (dexhunter, unmerged, claimed val_bpb **1.06549**: SP8192 + CaseOps tokenizer + attn-out gate + quant-gate + Loop45 + phased TTT). Rationale: in the ~10 days since the merged SOTA landed, the credible unmerged frontier moved ~0.015 bpb past it on a small number of witnessed, legal levers; continuing to iterate off spec-000 leaves us behind the frontier before we even try. Modal-scenario analysis in `research/frontier-map.md`, `diary/2026-04-19-frontier-scan.md`, `diary/2026-04-19-frontier-map.md`. Full plan in `research/ideas/1736-improvement.md`.
7+
8+
**Two baselines live in the repo:**
9+
- `train_gpt_sota.py` + `hotstart.py` + `run_*.sh` at repo root — the old #1493-derived baseline (spec 000). Retained for backward-compat sanity reruns.
10+
- `records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/train_gpt.py` + siblings — the new #1736-derived baseline. This is the base specs 008+ use.
511

612
## Two session modes
713
Every Claude session in this repo is either **research** or **execution**. They do different things and must not overlap.
@@ -56,6 +62,8 @@ Any session unsure which mode it's in should ask the user before acting.
5662

5763
Ideas with **no code change** (hyperparam-only) don't need a branch. The spec pins a `research` commit and lists the config diff.
5864

65+
**Exception — baseline-migration specs.** A spec whose purpose *is* to move the research baseline (e.g. spec 008: rebase to #1736) lands its code directly on `research` rather than on an `exp/<slug>` branch. Rationale: the code is becoming the new baseline, not a side experiment to be evaluated and possibly discarded. Use this exception only when the spec's explicit goal is "replace the baseline," not "try a variant." Normal experimentation still goes through `exp/<slug>`.
66+
5967
**Worktrees:**
6068
Code for an `exp/<slug>` branch lives in a worktree so the research session can keep editing `research/specs/` etc. on `research` while code changes are made in parallel.
6169

diary/2026-04-19-frontier-map.md

Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
1+
# Frontier Map — as of 2026-04-19
2+
3+
Snapshot of the Record-track frontier on `openai/parameter-golf`. All PRs below are **Record** track (not Non-record). Scope: last ~2 weeks of activity.
4+
5+
**Legend:**
6+
- ✅ MERGED — accepted, in main
7+
- 🟢 CLEAN — open, no compliance concerns raised
8+
- 🟡 DISPUTED — open, pending maintainer ruling
9+
- 🔴 BROKEN — self-withdrawn, bug confirmed, or canonical bpb > SOTA
10+
- ⛔ BANNED — uses explicitly ruled-illegal mechanism
11+
12+
---
13+
14+
## Timeline view (dated)
15+
16+
### 2026-04-09 — ✅ merged SOTA 1.0810
17+
- **#1493** bigbag — SP8192 + 3L Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — **1.0810**
18+
19+
### 2026-04-10 to 04-11 — first post-SOTA challengers
20+
- **#1523** superseded (closed) — Triple Recurrence + Banking + Fused MLP — 1.0778
21+
- **#1529** msisovic — Improved Parallel Residuals — 1.0758 🟢
22+
- **#1530** samacqua — VarLen attn + Fused MLP + doc-indep TTT — **1.07336** 🟢 ← **FOUNDATION A**
23+
24+
### 2026-04-12 to 04-14 — multi-phase TTT emerges
25+
- **#1557** ndokutovich — N-gram Tilt + Hessian SDClip — 1.0773 🟡 (self-flagged AT-RISK)
26+
- **#1578** mikeapedia — Lossy Casefold Tokenizer — 1.0668 🟡 (tokenizer dispute, lossy variant)
27+
- **#1610** romeerp — VarLenAttn + PhasingTTT — 1.0728 🟢
28+
- **#1626** samacqua — VarLen + Fused MLP + Multi-Phase Global SGD TTT — **1.07193** 🟢 ← **FOUNDATION B**
29+
30+
### 2026-04-16 — gates
31+
- **#1667** MarioPaerle — SmearGate + AttnOutGate — 1.07139 🟢 (untested, 0 audits)
32+
33+
### 2026-04-17
34+
- **#1687** resouer — K_KVShare_Wider FLA 🔴 CLOSED (parent of GDN cluster, same byte bug)
35+
- **#1693** dexhunter — Casefold V4 + AttnOutGate + MP Global SGD TTT — 1.05733 🟡 (tokenizer dispute)
36+
- **#1695** X-Abhishek-X — SpinQuant V1 + MP-SGD-TTT — 1.0759 🟢
37+
- **#1696** kings-crown — Block Attn Residuals + Tuned Legal TTT — 1.1224 🟢 (above SOTA)
38+
- **#1698** arsenis-cmd — GatedDeltaNet + Legal Score-First TTT — 1.00995 🔴 (byte bug + artifact oversize)
39+
- **#1700** jorge-asenjo — SP8192 + MP-SGD + Phased TTT — 1.07219 🟢
40+
- **#1705** genji0306 — K_KVShare_Wider FLA 🔴 CLOSED (GDN family)
41+
42+
### 2026-04-18
43+
- **#1711** aamodbhatt — GatedDeltaNet + Score-First TTT + Brotli — 1.00980 🔴 CLOSED (byte bug, self-withdrawn)
44+
- **#1712** aamodbhatt — GatedDeltaNet + Brotli (No TTT) — 1.01902 🔴 CLOSED (byte bug, self-withdrawn)
45+
- **#1715** G3sparky — QK-Gain 5.5 — 1.0810 🟢 (ties SOTA, no beat)
46+
- **#1716** himanshudongre — SP8192 + BigramHash d=32 + Path A v3 — 1.07882 🟡
47+
- **#1722** deborahnelson — Trinity SLOT v3 + Pre-Quant TTT — 0.65802 ⛔ (SLOT + pre-quant stack)
48+
- **#1723** SlavH — Nairi 9L 512D vocab1024 — 0.5384 🔴 (invalid eval, target clamping)
49+
- **#1727** yahya010 — MP-SGD TTT 4 phases + QK-Gain 5.25 — **1.07217** 🟢 (config diff on #1700)
50+
51+
### 2026-04-19 — big day
52+
- **#1729** romeerp — Lossless CaseOps + Tapered WD — **1.06780** 🟡 (tokenizer dispute, cleanest variant)
53+
- **#1731** Victory963 — Hadamard+AWQ+Layerwise 🔴 CLOSED (replaced by #1732)
54+
- **#1732** Victory963 — Hadamard+AWQ+Layerwise+Hessian — 1.0785 🟢 (untested, kitchen sink)
55+
- **#1734** yahya010 — GatedDeltaNet + Legal TTT + Brotli-11 — 1.01080 🔴 CLOSED (byte bug, self-withdrawn)
56+
- **#1735** AjAnubolu — SP8192 + Parallel Pre-Quant TTT — **1.0429** 🟡 (pre-quant TTT dispute)
57+
- **#1736** dexhunter — SP8192 + CaseOps + GatedAttn + QuantGate + Loop45 + PhasedTTT — **1.06549** 🟡 (tokenizer dispute)
58+
- **#1738** alertcat — #1735 + CaseOps V15 — **1.03540** 🟡🟡 (tokenizer + pre-quant TTT, double dispute)
59+
60+
### 2026-04-19 (today)
61+
- See above — big day: #1729, #1732, #1735, #1736, #1738 (5 frontier PRs in one day)
62+
63+
---
64+
65+
## Family map (by mechanism)
66+
67+
### 🟢 CLEAN frontier chain — everything merge-ready
68+
69+
```
                  #1493 ✅ 1.0810 (merged SOTA)

              ┌─────────────┴─────────────┐
              │                           │
        FOUNDATION A                FOUNDATION B
      #1530 🟢 1.07336            #1626 🟢 1.07193
   VarLen + doc-indep TTT      VarLen + MP Global SGD TTT
              │                           │
              ├─ #1610 🟢 1.0728          ├─ #1700 🟢 1.07219
              │  (+ phased global SGD)    │  (+ SP8192 + 3 phases)
              │                           │        │
              │                           │        └─ #1727 🟢 1.07217
              │                           │           (phases=4, config-only)
              │                           │
              │                           └─ (#1670 Casefold → #1693)

              └─ #1667 🟢 1.07139 (SmearGate + AttnOutGate, standalone, no audits)

Independent quantization branch:
#1529/#1445 → #1695 🟢 1.0759 (SpinQuant V1 — Hadamard rotation before GPTQ)
```
91+
92+
**Clean floor today: ~1.071.** Achievable with no disputed rulings.
93+
94+
### 🟡 DISPUTED — tokenizer family (waiting on Issue #1604)
95+
96+
**The axis of the debate is lossy vs. lossless**, not "any normalization vs. none." Two sub-families:
97+
98+
**Lossy casefold sub-family** — irreversible `.lower()` after NFKC; decoded token IDs cannot reproduce original capitalization:
99+
```
#1578  mikeapedia  04-13  Lossy Casefold        1.0668  🟡 likely ILLEGAL
#1670  (parent, Casefold V4 base)                       🟡 likely ILLEGAL
#1693  dexhunter   04-17  Casefold V4 + gates   1.0573  🟡 likely ILLEGAL (inherits lossy)
```
104+
105+
**Lossless CaseOps sub-family** — bijective transform with `TITLE`/`ALLCAPS`/`CAPNEXT`/`ESC` operator tokens; `decode(encode(s)) == s`; BPB denominator against original UTF-8 via byte sidecar:
106+
```
#1729  romeerp    04-19  Lossless CaseOps + WD     1.0678  🟡 likely LEGAL
#1736  dexhunter  04-19  Lossless CaseOps + gates  1.0655  🟡 likely LEGAL
```
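To make the lossless property concrete, here is a minimal sketch of a case-operator transform that round-trips exactly. It is not the tokenizer from #1729 or #1736: the marker code points (private-use characters), the ASCII-only word rules, and the helper names `encode`, `decode`, and `bits_per_byte` are all assumptions chosen for illustration. The only properties taken from the text are the four operator names, `decode(encode(s)) == s`, and counting bytes against the original UTF-8.

```python
import math
import re

# Hypothetical marker characters (Unicode private-use area). The real CaseOps
# operator tokens and their exact semantics are not shown in the source.
TITLE, ALLCAPS, CAPNEXT, ESC = "\ue000", "\ue001", "\ue002", "\ue003"
MARKERS = {TITLE, ALLCAPS, CAPNEXT, ESC}
LOWER = set("abcdefghijklmnopqrstuvwxyz")

# Match either a maximal run of ASCII letters or one literal marker character.
# Restricting to ASCII avoids Unicode case mappings that do not round-trip.
_UNIT = re.compile("[A-Za-z]+|[" + "".join(sorted(MARKERS)) + "]")


def _encode_unit(m):
    w = m.group(0)
    if w in MARKERS:                       # literal marker in the input: escape it
        return ESC + w
    if w.islower():
        return w
    if len(w) > 1 and w.isupper():
        return ALLCAPS + w.lower()
    if w[0].isupper() and (len(w) == 1 or w[1:].islower()):
        return TITLE + w.lower()
    # Mixed case: flag each uppercase letter individually.
    return "".join(CAPNEXT + c.lower() if c.isupper() else c for c in w)


def encode(s):
    """Lowercase ASCII words, recording the original casing with operator markers."""
    return _UNIT.sub(_encode_unit, s)


def decode(t):
    """Invert encode() by replaying the operator markers."""
    out, i = [], 0
    while i < len(t):
        c = t[i]
        if c == ESC:                       # next char is a literal marker
            out.append(t[i + 1])
            i += 2
        elif c == CAPNEXT:                 # uppercase exactly one letter
            out.append(t[i + 1].upper())
            i += 2
        elif c in (TITLE, ALLCAPS):        # applies to the following letter run
            j = i + 1
            while j < len(t) and t[j] in LOWER:
                j += 1
            word = t[i + 1:j]
            out.append(word.capitalize() if c == TITLE else word.upper())
            i = j
        else:
            out.append(c)
            i += 1
    return "".join(out)


def bits_per_byte(total_nll_nats, original_text):
    """BPB with the denominator taken from the original UTF-8 bytes."""
    return total_nll_nats / math.log(2) / len(original_text.encode("utf-8"))


s = "Hello WORLD from an iPhone"
assert decode(encode(s)) == s              # lossless: exact round trip
assert s.lower() == "hello world from an iphone"   # lossy: casing is gone for good
```

The last two lines show the asymmetry the thread turns on: the operator-token form can be decoded back to the exact input, while plain `.lower()` keeps no record of the original casing.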
110+
111+
**Issue #1604 thread analysis:**
112+
113+
| Commenter | Date | Position | Targets |
|---|---|---|---|
| SPThole | 04-14 | concern about generative usability under case-fold | (general) |
| sharpobject | 04-14 | **"exact bytes of validation documents must be reproducible by decoding the token IDs"** — this is the clean principle | (proposed standard) |
| tejasnaladala | 04-16 | formal argument: lossy casefold breaks BPB semantics because `byte_count` is no longer of original bytes | #1670, #1585, #1578 (all lossy) |
| mikeapedia | 04-16 | counter: NFKC is already lossy, every submission counts post-NFKC bytes | (defense of lossy) |
| andrewbaggio1 | 04-17 | "casefold should be illegal, opens door to removing spaces" | (against lossy) |
120+
121+
**Crucial asymmetry:** every objection in the thread targets lossy casefold specifically. Nobody has challenged CaseOps (lossless + original-UTF-8 byte counting) on its bijective property. The sharpobject principle ("exact bytes reproducible from token IDs") is the likely synthesis: CaseOps satisfies it, casefold doesn't.
122+
123+
**Expected ruling:**
124+
125+
| Family | Expected ruling | Survivors |
|---|---|---|
| Lossy casefold | illegal | — (kills #1578, #1670, #1693) |
| Lossless CaseOps | legal | #1729 (1.0678), #1736 (1.0655) |
129+
130+
**If this plays out:** floor drops to ~1.065 (#1736) with high confidence. Dexhunter's architectural add-ons (GatedAttn, QuantGate, Loop45) are preserved.
131+
132+
### 🟡 DISPUTED — pre-quant TTT (waiting on Issue #1017 Condition 3) — **~85-90% dead on physics**
133+
134+
Uses 21 epochs of AdamW on val data *before* artifact freeze.
135+
136+
```
#1735  AjAnubolu  04-19  SP8192 + Parallel Pre-Quant TTT      1.0429
#1738  alertcat   04-19  #1735 + CaseOps V15 (inherits both)  1.0354  🟡🟡 double dispute
```
140+
141+
**The physics argument (stronger than the textual one):**
142+
143+
From bigbag's own Issue #1017: **"Corpus-level TTT has a ceiling of approximately 0.0003 bits."** (Verified — direct quote from the issue body.)
144+
145+
#1735 claims **−0.038 bpb** from pre-quant TTT alone. That's **~100× the ceiling** the merged-SOTA author put on paper. The FineWeb train/val splits are random samples from the same source distribution with "negligible divergence across every measure" — there is no distributional ground for TTT to make up.
146+
147+
Either:
148+
(a) bigbag's ceiling analysis is wrong — unlikely; he's the authority and has measured it empirically
149+
(b) #1735 is extracting signal from somewhere it shouldn't — i.e. the val stream itself
150+
151+
"Frozen before eval" is a textual defense. The ceiling is a physical one. **Physical usually wins rule-interpretation disputes.**
152+
153+
**Estimated probability of ruling against: 85–90%.** Regardless of whether the textual ruling on Condition 3 lands strict or permissive, a headline number 100× over the stated corpus-level ceiling will not be allowed to sit as SOTA.
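The "~100×" figure in the physics argument can be checked directly from the two numbers quoted above. This is a minimal sketch; the constant names are illustrative, and it assumes the quoted ceiling (in bits) and the claimed gain (in bpb) are directly comparable, as the argument itself does.

```python
# Figures quoted in the text above.
CORPUS_TTT_CEILING_BITS = 0.0003   # ceiling quoted from Issue #1017
CLAIMED_GAIN_BPB = 0.038           # gain attributed to pre-quant TTT in #1735

ratio = CLAIMED_GAIN_BPB / CORPUS_TTT_CEILING_BITS
print(f"claimed gain is about {ratio:.0f}x the quoted ceiling")  # about 127x
```

The exact ratio is closer to 127× than 100×, so the order-of-magnitude claim holds with some margin.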
154+
155+
**Legal-regardless artifact:** the **8-GPU federated-averaging** systems trick itself is a pure parallelization idea — it speeds up any multi-epoch TTT loop without affecting legality. Worth extracting even after this PR family dies.
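For reference, the general shape of federated averaging is sketched below on a toy one-parameter model. This is not the code from #1735: the model, learning rate, data, and function names are all assumptions, and the workers run sequentially here where a real setup would run them on separate GPUs. It only illustrates the pattern of local training on a shard followed by parameter averaging.

```python
def local_sgd(w, shard, lr):
    """One pass of SGD over a worker's shard for the toy model y = w * x."""
    for x, y in shard:
        grad = 2.0 * (w * x - y) * x       # d/dw of (w*x - y)^2
        w -= lr * grad
    return w


def federated_average(shards, rounds, lr, w0=0.0):
    """Each round: every worker starts from the shared weight, trains locally,
    then the shared weight becomes the mean of the workers' results."""
    w = w0
    for _ in range(rounds):
        local = [local_sgd(w, shard, lr) for shard in shards]
        w = sum(local) / len(local)
    return w


NUM_WORKERS = 8                             # mirrors the 8-GPU setting
TRUE_W = 3.0
xs = [0.25 * (i + 1) for i in range(16)]    # 16 toy points, x in (0, 4]
data = [(x, TRUE_W * x) for x in xs]
shards = [data[2 * k: 2 * k + 2] for k in range(NUM_WORKERS)]

w = federated_average(shards, rounds=30, lr=0.02)
print(abs(w - TRUE_W) < 1e-3)
```

Because each worker only touches its own shard between averaging steps, the per-round wall-clock cost is roughly that of one shard rather than the full dataset, which is why the idea is independent of any legality question.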
156+
157+
### 🔴 BROKEN — GatedDeltaNet (GDN) cluster
158+
159+
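The four GDN PRs below (#1698, #1711, #1712, #1734) share the same `build_sentencepiece_luts` byte-counting bug, which inflates the denominator by ~17.46%, so the real canonical bpb is ~1.19, **worse than SOTA**. The two K_KVShare_Wider FLA PRs (#1687, #1705) are the closed parent architecture.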
160+
161+
```
#1687  resouer      K_KVShare_Wider FLA                       🔴 CLOSED (parent arch)
#1698  arsenis-cmd  GDN + Legal Score-First TTT      1.00995  🔴 OPEN, byte bug + artifact oversize
#1705  genji0306    K_KVShare_Wider FLA              1.0339   🔴 CLOSED
#1711  aamodbhatt   GDN + Score-First TTT            1.00980  🔴 CLOSED by author
#1712  aamodbhatt   GDN + Brotli (No TTT)            1.01902  🔴 CLOSED by author
#1734  yahya010     GDN + Legal TTT + Brotli-11      1.01080  🔴 CLOSED by author ("canonical would not beat SOTA")
```
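The "~1.19" canonical figure follows from undoing the denominator inflation. The sketch below assumes the bug multiplies the byte count by 1.1746 uniformly, which is one reading of "inflates denominator by ~17.46%"; the variable names are illustrative.

```python
DENOMINATOR_INFLATION = 0.1746     # "~17.46%" from the text
MERGED_SOTA_BPB = 1.0810

# Reported bpb for the four GDN PRs listed above.
reported = {"#1698": 1.00995, "#1711": 1.00980, "#1712": 1.01902, "#1734": 1.01080}

# bpb = bits / bytes, so an inflated byte count deflates bpb by the same factor.
canonical = {pr: bpb * (1.0 + DENOMINATOR_INFLATION) for pr, bpb in reported.items()}

for pr, bpb in canonical.items():
    verdict = "beats SOTA" if bpb < MERGED_SOTA_BPB else "worse than SOTA"
    print(pr, round(bpb, 3), verdict)
```

Under this assumption all four corrected values land between 1.18 and 1.20, well above the merged 1.0810.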
169+
170+
**Authors of 3 of the 4 GDN PRs (#1711, #1712, #1734) closed their own submissions** after the byte-bug audit.
171+
172+
**Residual interest:** the K_KVShare_Wider architecture itself (10L 544d, KV-share stride=2) might be real but we'd need a clean re-implementation with canonical byte accounting to know. Low priority.
173+
174+
### ⛔ BANNED — Trinity / SLOT / N-gram family
175+
176+
Stacking banned mechanisms. Ignore.
177+
178+
```
#1246  deborahnelson  Trinity v7+skip              0.22311  ⛔ (N-gram Order-22 + SLOT + Pre-Quant)
#1722  deborahnelson  Trinity SLOT v3 + Pre-Quant  0.65802  ⛔
#1557  ndokutovich    N-gram Tilt + SDClip         1.0773   🟡 (self-flagged AT-RISK, N-gram family)
```
183+
184+
Also-banned mechanisms ruled out earlier:
185+
- N-gram caches with target-in-key (PR #779 ruling, 2026-03-27)
186+
- SLOT per-window eval-time AdamW (PR #1376 ruling)
187+
188+
### 🔴 OTHER INVALID
189+
190+
```
#1723  SlavH  Nairi 9L 512D vocab1024  0.5384  🔴 (model vocab 1024 vs tokenizer 8192 → target clamping, not a valid BPB)
```
193+
194+
---
195+
196+
## At-a-glance rankings (filter by status)
197+
198+
| Rank | Status | bpb | PR | What |
|---|---|---|---|---|
| 1 | 🟡🟡 | 1.0354 | #1738 | CaseOps + pre-quant TTT (double dispute) |
| 2 | 🟡 | 1.0429 | #1735 | pre-quant TTT (disputed) |
| 3 | 🟡 | 1.0573 | #1693 | Casefold + gates (tokenizer disputed) |
| 4 | 🟡 | 1.0655 | #1736 | CaseOps + gates (tokenizer disputed) |
| 5 | 🟡 | 1.0668 | #1578 | lossy Casefold (tokenizer disputed) |
| 6 | 🟡 | 1.0678 | #1729 | lossless CaseOps (tokenizer disputed, cleanest) |
| 7 | 🟢 | 1.0714 | #1667 | SmearGate + AttnOutGate (untested) |
| 8 | 🟢 | 1.0719 | #1626 | VarLen + MP-SGD TTT ← **FOUNDATION B** |
| 9 | 🟢 | 1.0722 | #1727 | phases=4 on #1700 |
| 10 | 🟢 | 1.0722 | #1700 | SP8192 + MP-SGD Phased TTT |
| 11 | 🟢 | 1.0728 | #1610 | VarLen + PhasingTTT |
| 12 | 🟢 | 1.0734 | #1530 | VarLen + doc-indep TTT ← **FOUNDATION A** |
| 13 | 🟢 | 1.0759 | #1695 | SpinQuant V1 (Hadamard) + MP-SGD |
| 14 | 🟢 | 1.0785 | #1732 | Hadamard + AWQ kitchen sink |
| — | ✅ | 1.0810 | #1493 | **merged SOTA** |
215+
216+
## Key rulings to watch
217+
218+
| Issue | Affects | Expected outcome | Stake |
|---|---|---|---|
| **#1604 tokenizer** | lossy casefold: #1578/#1670/#1693 | illegal (high confidence) | no frontier loss — lossless CaseOps survives |
| **#1604 tokenizer** | lossless CaseOps: #1729, #1736 | legal (high confidence) | floor drops to 1.0655 |
| **#1017 C3 pre-quant TTT** | #1735, #1738 | illegal (85-90%, ceiling-based) | no real floor change — these were never real |
223+
224+
**Expected map post-rulings:**
225+
- True legal floor: **~1.0655** (#1736) via lossless CaseOps stack
226+
- Uncontested-legal floor if we're conservative: ~1.0714 (#1667), or ~1.072–1.073 (#1626/#1530 foundations)
227+
- The −0.035 "pre-quant TTT" tier is fiction; treat it as non-existent
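The post-ruling floors above can be reproduced by filtering the rankings on which disputed mechanisms end up legal. This is a sketch only: the mechanism tags and the `legal_floor` helper are illustrative names, the bpb values are transcribed from the rankings table, and tagging #1738's CaseOps variant as lossless is an assumption.

```python
# (PR, bpb, mechanisms still awaiting a ruling), transcribed from the rankings.
ENTRIES = [
    ("#1738", 1.0354, {"lossless_caseops", "prequant_ttt"}),  # CaseOps variant assumed lossless
    ("#1735", 1.0429, {"prequant_ttt"}),
    ("#1693", 1.0573, {"lossy_casefold"}),
    ("#1736", 1.0655, {"lossless_caseops"}),
    ("#1578", 1.0668, {"lossy_casefold"}),
    ("#1729", 1.0678, {"lossless_caseops"}),
    ("#1667", 1.0714, set()),
    ("#1626", 1.0719, set()),
    ("#1530", 1.0734, set()),
]


def legal_floor(entries, ruled_legal):
    """Lowest bpb among entries whose disputed mechanisms are all ruled legal."""
    eligible = [(bpb, pr) for pr, bpb, disputes in entries if disputes <= ruled_legal]
    return min(eligible)


expected = legal_floor(ENTRIES, {"lossless_caseops"})   # expected rulings
conservative = legal_floor(ENTRIES, set())              # nothing disputed survives
print(expected, conservative)
```

With only lossless CaseOps ruled legal the floor is #1736 at 1.0655; with every dispute resolved against the submitter it falls back to #1667 at 1.0714.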
228+
229+
## Pace
230+
- **Clean frontier drift:** 1.0810 → 1.0714 in 10 days (−0.0096 bpb / ~−0.0010/day)
231+
- **Disputed frontier (if ruled legal):** 1.0810 → 1.0354 in 10 days (−0.046 bpb)
232+
- **New PRs/day** on record track: ~2–3 (accelerating — 5 frontier PRs on 04-19 alone)
233+
- **Deadline: 2026-04-30.** 11 days left.
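The pace figures can be recomputed from the endpoints stated in the list. A short sketch, with variable names chosen for illustration and the 10-day window taken as 2026-04-09 to 2026-04-19:

```python
from datetime import date

MERGED_SOTA = 1.0810
CLEAN_BEST = 1.0714        # #1667
DISPUTED_BEST = 1.0354     # #1738
ELAPSED_DAYS = 10          # 2026-04-09 to 2026-04-19

clean_delta = CLEAN_BEST - MERGED_SOTA
disputed_delta = DISPUTED_BEST - MERGED_SOTA
days_left = (date(2026, 4, 30) - date(2026, 4, 19)).days

print(f"clean: {clean_delta:+.4f} bpb, {clean_delta / ELAPSED_DAYS:+.5f}/day")
print(f"disputed: {disputed_delta:+.4f} bpb")
print(f"days left: {days_left}")
```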
