diary pt.4: PR scan for novel levers to port onto openai#1736

leon2k2k2k · claude · leon2k2k2k · commit 5123db73e39f · 2026-04-20T13:47:40.000+08:00
Scanned 200 PRs from 2026-04-11..20. After exclusion filters, 3 candidates beat spec 011's expected Δ: openai#1682 (GradPower Muon p=0.9), openai#1648 (xIELU + per-layer QK gain), openai#1555 (Tap-In eval cache). Full artifacts at ~/competition-pr/pr-scan-2026-04-20/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/diary/2026-04-20-pr-scan-pt4.md b/diary/2026-04-20-pr-scan-pt4.md
@@ -0,0 +1,50 @@
+# 2026-04-20 pt.4 — PR scan: novel levers to port onto #1736
+
+**Session kind:** research (deep discovery, no pod). Continuation of pt.3 (SpinQuant exhausted). **Days to deadline:** 10.
+
+## TL;DR
+
+- Scanned 200 PRs created 2026-04-11 → 2026-04-20. After exclusion filters (banned mechanisms, already-in-#1736, already-tested): ~20 in-scope candidates.
+- **Three candidates have an expected Δ that matches or beats spec 011's thin −0.0005 to −0.002 range (more negative is better), survive TTT absorption, and stack orthogonally with each other and with 011.**
+- Strongest finding: in #1648 the per-layer QK_GAIN values converge to the 2.0–3.0 range rather than a uniform 5.25, which directly contradicts #1736's QK_GAIN=5.25 setting. If that direction is right, it's the biggest remaining architecture lever.
+- Zero compute spent. Research artifacts in `~/competition-pr/pr-scan-2026-04-20/`.
+
+## Top 3 candidates (full detail in `~/competition-pr/pr-scan-2026-04-20/top-candidates.md`)
+
+1. **#1682 GradPower for Muon (p=0.9)** — 3-LOC optimizer tweak. Author's matched 1×H100 ablation: −0.00353 bpb vs p=1.0. Bundles with spec 011 in one retrain. Expected Δ: −0.001 to −0.003. **Highest ROI-per-LOC.**
+2. **#1648 xIELU + per-layer QK gain + symmetric resid_mix** — ~50 LOC. Biggest potential lever (−0.002 to −0.005) but needs per-layer coefficient convergence against #1736's stack. Author explicitly reports: "model consistently prefers much softer attention than 5.0 default" → directly challenges #1736's QK_GAIN=5.25.
+3. **#1555 Tap-In min_match=1** — eval-time prefix-match cache. Downstream of TTT → can't be absorbed. No retrain needed (~$5 rerun). Expected Δ: −0.0015 to −0.002 on top of #1736.
+
+## Meta-observations about the PR landscape
+
+- **SpinQuant/Hadamard/AWQ/Hessian-clip cluster (#1689, #1695, #1732, #1651)** all land within ±0.003 of each other. Consistent with our own TTT-absorption finding from specs 010/010b.
+- **Depth-recurrence space is crowded but stale.** ~15 PRs explore variants of depth recurrence; #1736 already has Loop45.
+- **Training-time optimizer tweaks are underexplored.** GradPower is the cleanest example.
+- **Eval-time score-before-update lookups are legal and under-used.** Tap-In / score-first TTT / eval-hash-embed all sit downstream of TTT — TTT cannot absorb them.
+- **#1709's XSA `F.rms_norm` bug does not affect #1736.** Our `_xsa_efficient` already uses `F.normalize(v, dim=-1)` at line 784. No free fix available.
+- **#1586's claimed innovations (per-layer adaptive GPTQ clip + MATRIX_LR 0.026) are already baked into #1736** (MLP_CLIP_SIGMAS=10.0 / ATTN_CLIP_SIGMAS=13.0 / MATRIX_CLIP_SIGMAS=12.85; matrix_lr=0.026).
+- **Credible elite zone (1.02–1.07 bpb claimed) has only one non-excluded entry not already in our plan: #1729 (spec 011 tapered WD).** The #1738 / #1735 pair is banned per Condition 3. The FLA cluster is byte-bug-banned.
+
+## Recommended sequencing (10-day runway, ~$148 remaining)
+
+| Day | Action | Cost |
+|---|---|---|
+| 1 | Spec 011 + 012 bundle: tapered WD + MUON_GRAD_POWER=0.9 in single retrain | ~$20 |
+| 2 | Spec 014 Tap-In on winner (no retrain) | ~$5 |
+| 3–5 | Spec 013 xIELU + per-layer QK (2–3 mini convergences + 1 full) | ~$30 |
+| 8–9 | Final 3-seed on winning stack | ~$30–40 |
+
+Total: ~$85–105. Margin: ~$40–60.
+
+## Artifacts
+
+- `~/competition-pr/pr-scan-2026-04-20/census.csv` — all 200 PRs with triage reasons.
+- `~/competition-pr/pr-scan-2026-04-20/ideas.md` — per-PR classification (class/Δ-bucket/feasibility/novelty) for in-scope candidates.
+- `~/competition-pr/pr-scan-2026-04-20/top-candidates.md` — ranked top 5 with cost/risk/stack notes.
+- `~/competition-pr/pr-scan-2026-04-20/bodies/*.json` — raw PR bodies fetched for deep-dived candidates.
+
+## What to decide in the morning
+
+1. **Whether to run the day-1 bundle** (tapered WD + GradPower in a single retrain) or split it into separate specs 011 and 012. The bundle is cheaper but confounds attribution.
+2. **Whether to write spec 013 (xIELU + per-layer QK) before or after seeing spec 011/012 results.** The convergence-loop methodology is new to us and may eat time.
+3. **Whether Tap-In's ~400 LOC is worth its expected Δ of about −0.0015.** It sits downstream of TTT, and our finding so far is that TTT absorbs the upstream levers we tested, so its gain should survive. It is also the biggest patch of the cheap options.
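
The budget behind decision 1 can be sanity-checked with a few lines of Python. This is a minimal sketch that sums only the four rows listed in the "Recommended sequencing" table and subtracts them from the ~$148 remaining; `PLAN`, `REMAINING_USD`, and `budget_summary` are illustrative names, not part of the repo. The listed rows come to about $85–95, so the stated ~$85–105 total presumably carries roughly $10 of slack for the unlisted days 6–7. That reading is an assumption, not something the diary states.

```python
# Costs in USD copied from the "Recommended sequencing" table.
# Each row is (label, low estimate, high estimate).
PLAN = [
    ("Day 1: spec 011 + 012 bundle (single retrain)", 20, 20),
    ("Day 2: spec 014 Tap-In (no retrain)", 5, 5),
    ("Days 3-5: spec 013 xIELU + per-layer QK", 30, 30),
    ("Days 8-9: final 3-seed on winning stack", 30, 40),
]
REMAINING_USD = 148  # "~$148 remaining" from the section heading


def budget_summary(plan, remaining):
    """Return the (low, high) planned spend and the (low, high) leftover margin."""
    low = sum(row_low for _, row_low, _ in plan)
    high = sum(row_high for _, _, row_high in plan)
    # The smallest margin corresponds to the highest spend, and vice versa.
    return {"total": (low, high), "margin": (remaining - high, remaining - low)}


if __name__ == "__main__":
    summary = budget_summary(PLAN, REMAINING_USD)
    print("planned spend (USD):", summary["total"])
    print("leftover margin (USD):", summary["margin"])
```

Replacing the day-1 row with two separate retrains would show what splitting specs 011 and 012 costs. The diary does not give a per-retrain price for that case, so it would have to be supplied.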