| 1 | +# 2026-04-20 pt.4 — PR scan: novel levers to port onto #1736 |
| 2 | + |
| 3 | +**Session kind:** research (deep discovery, no pod). Continuation of pt.3 (SpinQuant exhausted). **Days to deadline:** 10. |
| 4 | + |
| 5 | +## TL;DR |
| 6 | + |
| 7 | +- Scanned 200 PRs created 2026-04-11 → 2026-04-20. After exclusion filters (banned mechanisms, already-in-#1736, already-tested): ~20 in-scope candidates. |
| 8 | +- **Three candidates have an expected improvement at least as large as spec 011's thin −0.0005 to −0.002 bpb range, survive TTT absorption, and stack orthogonally with each other and with 011.** |
| 9 | +- Strongest finding: #1648's per-layer QK_GAIN convergence (2.0–3.0 range, not uniform 5.25) directly contradicts #1736's QK setting. If that direction is right, it's the biggest remaining architecture lever. |
| 10 | +- Zero compute spent. Research artifacts in `~/competition-pr/pr-scan-2026-04-20/`. |
| 11 | + |
| 12 | +## Top 3 candidates (full detail in `~/competition-pr/pr-scan-2026-04-20/top-candidates.md`) |
| 13 | + |
| 14 | +1. **#1682 GradPower for Muon (p=0.9)** — 3-LOC optimizer tweak. Author's matched 1×H100 ablation: −0.00353 bpb vs p=1.0. Bundles with spec 011 in one retrain. Expected Δ: −0.001 to −0.003. **Highest ROI-per-LOC.** |
| 15 | +2. **#1648 xIELU + per-layer QK gain + symmetric resid_mix** — ~50 LOC. Biggest potential lever (−0.002 to −0.005) but needs per-layer coefficient convergence against #1736's stack. Author explicitly reports: "model consistently prefers much softer attention than 5.0 default" → directly challenges #1736's QK_GAIN=5.25. |
| 16 | +3. **#1555 Tap-In min_match=1** — eval-time prefix-match cache. Downstream of TTT → can't be absorbed. No retrain needed (~$5 rerun). Expected Δ: −0.0015 to −0.002 on top of #1736. |
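To make the size of the #1682 change concrete, here is a minimal sketch of a sign-power gradient transform. It assumes GradPower means replacing each gradient entry g with sign(g)·|g|^p before the optimizer update. That reading, the function name `grad_power`, and the plain-list representation are assumptions for illustration, not code taken from #1682 or #1736.

```python
import math


def grad_power(grad, p=0.9):
    """Elementwise sign-power transform: g -> sign(g) * |g|**p.

    Assumption: this is what MUON_GRAD_POWER controls. With p=1.0 the
    gradient comes back unchanged, i.e. the p=1.0 ablation baseline.
    """
    return [math.copysign(abs(g) ** p, g) for g in grad]


# Toy gradient: with p < 1, small entries grow slightly and large ones shrink.
g = [-2.0, -0.01, 0.0, 0.01, 2.0]
print(grad_power(g, p=0.9))
print(grad_power(g, p=1.0) == g)
```

In a real optimizer the same transform would be applied to the gradient tensor just before the momentum update, which is consistent with the change being only a few lines.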
| 17 | + |
| 18 | +## Meta-observations about the PR landscape |
| 19 | + |
| 20 | +- **SpinQuant/Hadamard/AWQ/Hessian-clip cluster (#1689, #1695, #1732, #1651)** all land within 0.003 bpb of each other. Consistent with our own TTT-absorption finding from specs 010/010b. |
| 21 | +- **Depth-recurrence space is crowded but stale.** ~15 PRs explore variants of depth recurrence; #1736 already has Loop45. |
| 22 | +- **Training-time optimizer tweaks are underexplored.** GradPower is the cleanest example. |
| 23 | +- **Eval-time score-before-update lookups are legal and under-used.** Tap-In / score-first TTT / eval-hash-embed all sit downstream of TTT — TTT cannot absorb them. |
| 24 | +- **#1709's XSA `F.rms_norm` bug does not affect #1736.** Our `_xsa_efficient` already uses `F.normalize(v, dim=-1)` at line 784. No free fix available. |
| 25 | +- **#1586's claimed innovations (per-layer adaptive GPTQ clip + MATRIX_LR 0.026) are already baked into #1736** (MLP_CLIP_SIGMAS=10.0 / ATTN_CLIP_SIGMAS=13.0 / MATRIX_CLIP_SIGMAS=12.85; matrix_lr=0.026). |
| 26 | +- **Credible elite zone (1.02–1.07 bpb claimed) has only one non-excluded entry that is not already in #1736: #1729, which our plan already covers as spec 011 (tapered WD).** The #1738 / #1735 pair is banned per Condition 3. The FLA cluster is byte-bug-banned. |
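The score-before-update pattern mentioned above can be illustrated with a toy prefix-match cache. This is a sketch under stated assumptions: the cache maps the most recent tokens to whichever token followed them last time, and `min_match` is read as the minimum matched context length. `PrefixCache`, `score_stream`, and `max_match` are hypothetical names, not taken from #1555.

```python
class PrefixCache:
    """Maps a tuple of recent tokens to the token that last followed it."""

    def __init__(self, max_match=3, min_match=1):
        self.max_match = max_match
        self.min_match = min_match
        self.table = {}

    def lookup(self, history):
        # Try the longest context first, then fall back to shorter ones.
        longest = min(self.max_match, len(history))
        for n in range(longest, self.min_match - 1, -1):
            hit = self.table.get(tuple(history[-n:]))
            if hit is not None:
                return hit
        return None

    def update(self, history, token):
        # Record the token under every context length we are allowed to match.
        longest = min(self.max_match, len(history))
        for n in range(self.min_match, longest + 1):
            self.table[tuple(history[-n:])] = token


def score_stream(tokens, cache):
    """Count cache hits, scoring each position before the cache sees its token."""
    hits, history = 0, []
    for tok in tokens:
        guess = cache.lookup(history)  # 1. predict from past tokens only
        hits += int(guess == tok)      # 2. score against the true token
        cache.update(history, tok)     # 3. only now reveal the token
        history.append(tok)
    return hits


print(score_stream(list("abcabcabc"), PrefixCache()))
```

The ordering inside `score_stream` is the point: the lookup for a position never depends on that position's own token, which is what keeps this kind of eval-time cache on the legal side.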
| 27 | + |
| 28 | +## Recommended sequencing (10-day runway, ~$148 remaining) |
| 29 | + |
| 30 | +| Day | Action | Cost | |
| 31 | +|---|---|---| |
| 32 | +| 1 | Spec 011 + 012 bundle: tapered WD + MUON_GRAD_POWER=0.9 in single retrain | ~$20 | |
| 33 | +| 2 | Spec 014 Tap-In on winner (no retrain) | ~$5 | |
| 34 | +| 3–5 | Spec 013 xIELU + per-layer QK (2-3 mini convergences + 1 full) | ~$30 | |
| 35 | +| 8–9 | Final 3-seed on winning stack | ~$30–40 | |
| 36 | + |
| 37 | +Total: ~$85–105. Margin: ~$40–60. |
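As a quick cross-check of the budget, the sketch below sums the per-row costs exactly as listed in the table. The rows alone come to about $85–95, so the stated ~$105 upper bound presumably leaves roughly $10 of slack for the unallocated days 6–7. That interpretation is an assumption, since the table does not say.

```python
# Per-step cost ranges in USD as (low, high), copied from the sequencing table.
steps = {
    "day 1: spec 011 + 012 bundle": (20, 20),
    "day 2: spec 014 Tap-In": (5, 5),
    "days 3-5: spec 013 xIELU + per-layer QK": (30, 30),
    "days 8-9: final 3-seed": (30, 40),
}
remaining = 148  # approximate budget left, from the section heading

low = sum(lo for lo, _ in steps.values())
high = sum(hi for _, hi in steps.values())
print(f"listed steps: ${low}-{high}")
print(f"margin after listed steps: ${remaining - high}-{remaining - low}")
```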
| 38 | + |
| 39 | +## Artifacts |
| 40 | + |
| 41 | +- `~/competition-pr/pr-scan-2026-04-20/census.csv` — all 200 PRs with triage reasons. |
| 42 | +- `~/competition-pr/pr-scan-2026-04-20/ideas.md` — per-PR classification (class/Δ-bucket/feasibility/novelty) for in-scope candidates. |
| 43 | +- `~/competition-pr/pr-scan-2026-04-20/top-candidates.md` — ranked top 5 with cost/risk/stack notes. |
| 44 | +- `~/competition-pr/pr-scan-2026-04-20/bodies/*.json` — raw PR bodies fetched for deep-dived candidates. |
| 45 | + |
| 46 | +## What to decide in the morning |
| 47 | + |
| 48 | +1. **Approve the day-1 bundle** (tapered WD + GradPower in one spec) vs. splitting them into 011/012 separately. Bundle is cheaper but confounds attribution. |
| 49 | +2. **Whether to spec 013 (xIELU + per-layer QK) before or after seeing spec 011/012 results.** Convergence-loop methodology is novel to us; may eat time. |
| 50 | +3. **Whether Tap-In's ~400 LOC is worth the expected ~−0.0015 bpb.** It is unique in sitting downstream of TTT, and we know TTT absorbs everything upstream, but it is also the largest patch among the cheap options. |