
Commit 5123db7

leon2k2k2k and claude committed
diary pt.4: PR scan for novel levers to port onto openai#1736
Scanned 200 PRs from 2026-04-11..20. After exclusion filters, 3 candidates beat spec 011's expected Δ: openai#1682 (GradPower Muon p=0.9), openai#1648 (xIELU + per-layer QK gain), openai#1555 (Tap-In eval cache). Full artifacts at ~/competition-pr/pr-scan-2026-04-20/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 16d39a7 commit 5123db7

1 file changed

Lines changed: 50 additions & 0 deletions


diary/2026-04-20-pr-scan-pt4.md

@@ -0,0 +1,50 @@
# 2026-04-20 pt.4 — PR scan: novel levers to port onto #1736
**Session kind:** research (deep discovery, no pod). Continuation of pt.3 (SpinQuant exhausted). **Days to deadline:** 10.
## TL;DR
- Scanned 200 PRs created 2026-04-11 → 2026-04-20. After exclusion filters (banned mechanisms, already in #1736, already tested), ~20 in-scope candidates remain.
- **Three candidates have an expected Δ that matches or beats spec 011's thin −0.0005 to −0.002 range, survive TTT absorption, and stack orthogonally with each other and with 011.**
- Strongest finding: #1648's per-layer QK_GAIN values converge to the 2.0–3.0 range rather than a uniform 5.25, which directly contradicts #1736's QK setting. If that direction is right, it's the biggest remaining architecture lever.
- Zero compute spent. Research artifacts are in `~/competition-pr/pr-scan-2026-04-20/`.
11+
12+
## Top 3 candidates (full detail in `~/competition-pr/pr-scan-2026-04-20/top-candidates.md`)
13+
14+
1. **#1682 GradPower for Muon (p=0.9)**: 3-LOC optimizer tweak. Author's matched 1×H100 ablation: −0.00353 bpb vs p=1.0. Bundles with spec 011 in one retrain. Expected Δ: −0.001 to −0.003. **Highest ROI-per-LOC.**
2. **#1648 xIELU + per-layer QK gain + symmetric resid_mix**: ~50 LOC. Biggest potential lever (−0.002 to −0.005) but needs per-layer coefficient convergence against #1736's stack. The author explicitly reports that the "model consistently prefers much softer attention than 5.0 default", which directly challenges #1736's QK_GAIN=5.25.
3. **#1555 Tap-In min_match=1**: eval-time prefix-match cache. It sits downstream of TTT, so TTT can't absorb it. No retrain needed (~$5 rerun). Expected Δ: −0.0015 to −0.002 on top of #1736.
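
For orientation on candidate 1, here is a minimal sketch of what a gradient-power transform could look like. It assumes "GradPower" means the elementwise sign-preserving power `sign(g) * |g|**p` applied to the gradient before the optimizer step. #1682's actual diff is not reproduced here, and the function name `grad_power` is hypothetical.

```python
import math

def grad_power(grad, p=0.9):
    """Elementwise sign-preserving power: sign(g) * |g|**p.

    Assumption: this is the transform #1682 applies before the Muon update.
    With p=1.0 the gradient is returned unchanged; with p<1 entries with
    magnitude below 1 grow and entries with magnitude above 1 shrink.
    """
    return [math.copysign(abs(g) ** p, g) for g in grad]

if __name__ == "__main__":
    g = [0.01, -0.5, 2.0]
    print(grad_power(g, p=1.0))  # baseline: unchanged
    print(grad_power(g, p=0.9))  # compressed dynamic range
```

In a real training script the same transform would be applied to tensors rather than lists, which is consistent with the "3-LOC" size quoted above.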
17+
18+
## Meta-observations about the PR landscape
19+
20+
- **The SpinQuant/Hadamard/AWQ/Hessian-clip cluster (#1689, #1695, #1732, #1651)** lands within ±0.003 of each other. This is consistent with our own TTT-absorption finding from specs 010/010b.
- **Depth-recurrence space is crowded but stale.** ~15 PRs explore variants of depth recurrence; #1736 already has Loop45.
- **Training-time optimizer tweaks are underexplored.** GradPower is the cleanest example.
- **Eval-time score-before-update lookups are legal and under-used.** Tap-In / score-first TTT / eval-hash-embed all sit downstream of TTT, so TTT cannot absorb them.
- **#1709's XSA `F.rms_norm` bug does not affect #1736.** Our `_xsa_efficient` already uses `F.normalize(v, dim=-1)` at line 784. No free fix available.
- **#1586's claimed innovations (per-layer adaptive GPTQ clip + MATRIX_LR 0.026) are already baked into #1736** (MLP_CLIP_SIGMAS=10.0 / ATTN_CLIP_SIGMAS=13.0 / MATRIX_CLIP_SIGMAS=12.85; matrix_lr=0.026).
- **Credible elite zone (1.02–1.07 bpb claimed) has only one non-excluded entry not already in our plan: #1729 (spec 011 tapered WD).** The #1738 / #1735 pair is banned per Condition 3. The FLA cluster is byte-bug-banned.
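
The score-before-update pattern mentioned above can be illustrated with a toy prefix-match cache. This is a generic sketch, not #1555's implementation: the function name, the `max_ctx` parameter, and the reading of `min_match` as "minimum number of matching context tokens" are all assumptions, and the blending of cache hits with model probabilities is omitted.

```python
def score_then_update(tokens, max_ctx=3, min_match=1):
    """Return, per position, the cache's guess made BEFORE seeing that token.

    Hypothetical illustration: the cache maps a context tuple to the token
    that most recently followed it. Each position is scored first and only
    afterwards is the true token written into the cache.
    """
    cache = {}
    guesses = []
    for i, tok in enumerate(tokens):
        guess = None
        # 1) Score: try the longest already-seen suffix first.
        for n in range(min(max_ctx, i), min_match - 1, -1):
            ctx = tuple(tokens[i - n:i])
            if ctx in cache:
                guess = cache[ctx]
                break
        guesses.append(guess)
        # 2) Update: the true token enters the cache only after scoring.
        for n in range(min_match, min(max_ctx, i) + 1):
            cache[tuple(tokens[i - n:i])] = tok
    return guesses

if __name__ == "__main__":
    print(score_then_update([1, 2, 3, 1, 2, 3]))
```

Because the lookup for position `i` only ever reads entries written at earlier positions, the cache never sees the token it is scoring, which is the property that makes this family of tricks legal at eval time.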
27+
28+
## Recommended sequencing (10-day runway, ~$148 remaining)
29+
30+
| Day | Action | Cost |
|---|---|---|
| 1 | Spec 011 + 012 bundle: tapered WD + MUON_GRAD_POWER=0.9 in single retrain | ~$20 |
| 2 | Spec 014 Tap-In on winner (no retrain) | ~$5 |
| 3–5 | Spec 013 xIELU + per-layer QK (2-3 mini convergences + 1 full) | ~$30 |
| 8–9 | Final 3-seed on winning stack | ~$30–40 |

Total: ~$85–105. Margin: ~$40–60.
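
The totals can be checked with a few lines of arithmetic. This sketch uses only the row costs from the table and the ~$148 figure from the section heading; the names are illustrative. As itemized, the rows sum to $85–95, so the stated upper bound of ~$105 appears to include roughly $10 of slack that the table does not list (an inference, not something the diary states).

```python
BUDGET = 148  # "~$148 remaining" from the section heading

# (low, high) cost estimate per table row, in dollars
ROWS = {
    "day 1: spec 011 + 012 bundle": (20, 20),
    "day 2: spec 014 Tap-In": (5, 5),
    "days 3-5: spec 013 xIELU + per-layer QK": (30, 30),
    "days 8-9: final 3-seed": (30, 40),
}

def total_range(rows):
    """Sum the low and high estimates across all rows."""
    low = sum(lo for lo, _ in rows.values())
    high = sum(hi for _, hi in rows.values())
    return low, high

low, high = total_range(ROWS)
margin = (BUDGET - high, BUDGET - low)

if __name__ == "__main__":
    print(f"itemized total: ${low}-{high}")
    print(f"margin against ${BUDGET}: ${margin[0]}-{margin[1]}")
    print(f"margin at the stated $105 ceiling: ${BUDGET - 105}")
```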
38+
39+
## Artifacts
40+
41+
- `~/competition-pr/pr-scan-2026-04-20/census.csv`: all 200 PRs with triage reasons.
- `~/competition-pr/pr-scan-2026-04-20/ideas.md`: per-PR classification (class/Δ-bucket/feasibility/novelty) for in-scope candidates.
- `~/competition-pr/pr-scan-2026-04-20/top-candidates.md`: ranked top 5 with cost/risk/stack notes.
- `~/competition-pr/pr-scan-2026-04-20/bodies/*.json`: raw PR bodies fetched for deep-dived candidates.
45+
46+
## What to decide in the morning
47+
48+
1. **Approve the day-1 bundle** (tapered WD + GradPower in one spec) vs. splitting them into 011/012 separately. The bundle is cheaper but confounds attribution.
2. **Whether to spec 013 (xIELU + per-layer QK) before or after seeing spec 011/012 results.** Convergence-loop methodology is novel to us and may eat time.
3. **Whether Tap-In's ~400 LOC is worth the ~−0.0015 expected.** It's unique in that it sits downstream of TTT, and we know TTT absorbs everything upstream, but it's the biggest patch of the cheap options.
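
As background for decision 2, the "softer attention" claim can be made concrete with a toy softmax. The sketch assumes QK gain acts as a scalar multiplier on normalized query-key similarity scores before the softmax; the diary does not show where #1736 applies it. The scores and the per-layer gain values are hypothetical, chosen only to sit inside the 2.0–3.0 range reported by #1648.

```python
import math

def attention_weights(scores, gain):
    """Softmax over gain-scaled similarity scores (assumed to lie in [-1, 1])."""
    scaled = [gain * s for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(weights):
    """Shannon entropy in nats; higher means softer (more spread-out) attention."""
    return -sum(w * math.log(w) for w in weights if w > 0)

if __name__ == "__main__":
    scores = [0.9, 0.2, -0.1, -0.5]      # hypothetical cosine-like similarities
    uniform_gain = 5.25                  # #1736's setting per the diary
    per_layer_gains = [2.0, 2.5, 3.0]    # hypothetical values in the reported range
    print("uniform:", round(entropy(attention_weights(scores, uniform_gain)), 3))
    for layer, g in enumerate(per_layer_gains):
        print(f"layer {layer} gain {g}:", round(entropy(attention_weights(scores, g)), 3))
```

Lower gains spread probability mass over more keys, so if the learned per-layer values really do settle well below 5.25, the uniform setting is forcing sharper attention than the model would choose.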
