
Commit 16d39a7

leon2k2k2k and claude committed
diary 2026-04-20 pt.3: SpinQuant exhausted, methodology correction
Closes the SpinQuant investigation arc with spec 010b's results and an honest retrospective on the false-signal episode.

Key findings:
- All 5 SpinQuant variants (baseline, internal_only, port_1695, attn_only, mlp_only) land within 0.00009 bpb at final val_bpb. Pure null. openai#1736 has seed std ~0.00070; we are 10x below that.
- Pt.2's "regime-dependence is exploitable" hypothesis refuted. attn_only ≈ baseline on rank 0 (attention rotation does nothing); mlp_only has the inverse regime from port_1695 (hurts long, helps short); neither subset comes close to port_1695's emergent rank-0 trajectory lead.
- Rank-0 rb spread across variants: 0.0075 bpb. Final val_bpb spread across variants: 0.000085 bpb. 80x compression from 8-rank aggregation + TTT LoRA uniform absorption.

Mistake I owned up to: read rank-0 rb:1.0657 for mlp_only at batch 780 and suggested "mlp_only might actually net positive." final.json came out +0.000005 above baseline. Rank-0 rb is rank 0's 1/8 slice, not a preview of the submission number.

Methodology corrections for future runs:
- Always check final.json before any trend interpretation
- Rank-0 rb is a progress indicator, not a metric preview
- When pre-TTT diagnostic_quantized spread < 0.001, post-TTT will be near-identical (TTT LoRA dominates)

Budget: spent ~$52 of $200 total. 10 days left. Next: spec 011 (tapered Muon WD retrain) — upstream of TTT, might unlock something TTT can't absorb. Patch still unwritten.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 8815c4d commit 16d39a7

1 file changed

Lines changed: 104 additions & 0 deletions

# 2026-04-20 pt.3 — SpinQuant exhausted: spec 010b results and what we learned

**Session kind:** research, post-execution runs for spec 010b. Continuation of pt.2 (SpinQuant results + regime finding) and pt.1 (baseline migration design). **Days to deadline:** 10.
## TL;DR

- Spec 010b ran `attn_only` and `mlp_only` on 8×H100, plus confirmed `port_1695` and baseline numbers. **All five SpinQuant variants land within 0.00009 bpb of each other at final val_bpb.** Full null across the family.
- My working hypothesis from pt.2 ("attention rotation carries the long-doc help, MLP rotation carries the short-doc hurt") is **refuted**. Neither subset reproduces port_1695's early-batch trajectory lead; the two rotations together create a nonlinear, emergent effect that neither delivers alone.
- The "port_1695 dropping to rb=1.0524" mid-eval was rank 0's running average over 98 batches of its local doc shard, **not** a preview of final val_bpb. There is an 80× compression between the rank-0 rb spread and the final val_bpb spread across variants.
- **SpinQuant is fully exhausted on this stack.** Spending more on rotation variants is not going to move the metric.
- Strategic pivot: spec 011 (tapered WD retrain). Anything upstream of phased TTT is what's left.
## What ran today (morning → evening summary)

| run | mode | final val_bpb | Δ vs baseline |
|---|---|---|---|
| spec 008 (morning) | #1736 reproduction | — (projected) | ref |
| spec 009 | baseline | 1.067283 | ref |
| spec 009 | internal_only (R_a) | 1.067309 | +0.000026 |
| spec 010 | port_1695 (all 4 sites, online) | 1.067232 | −0.000050 |
| spec 010b | attn_only | 1.067225 | **−0.000059** (best) |
| spec 010b | mlp_only | 1.067288 | +0.000005 |

**Spread across 5 variants: 0.00009 bpb.** #1736's reported 3-seed std is 0.00070 bpb. We are ~10× below that. All variants are statistically indistinguishable from baseline. Total spend on the SpinQuant arc today: ~$36.
## The session's mistake I want to call out

In pt.2, I wrote idea files (`research/ideas/rotation-regime-dependence.md` and `port-1695-long-vs-short.md`) based on the per-batch trajectory data from spec 010. That data showed rotation helping long docs (−0.007 bpb) and hurting short docs (+0.015 bpb). I interpreted this as a *real, exploitable* regime-dependence and designed spec 010b to isolate which rotation sites carried which half.

The mistake: I was reading rank 0's running-average bpb (the `rb` column in `ttp:` log lines) and treating it as a preview of the final val_bpb. It is not. Rank 0 sees only 1/8 of the eval docs; the log's `rb:1.0524` was a real number over rank 0's local doc mix but had no predictive power for the metric aggregated across all 8 ranks.

When spec 010b's data came in, I compounded the mistake by reading an `rb:1.0657` for mlp_only at batch 780 and telling the user "mlp_only might actually net positive!" The user then asked me to check `final.json` — and the actual aggregate came out at +0.000005, above baseline.

**Lessons from the false-signal episode:**

1. **Always check `final.json` before interpreting a trend.** Rank-0 rb is a useful progress indicator, not a metric.
2. **The 80× compression from rank-0 rb spread to final val_bpb spread is not small noise.** It's an artifact of (a) token-weighted aggregation across 8 ranks vs batch-weighted rb on rank 0, and (b) phased TTT's LoRA adapting uniformly and absorbing variant-specific pre-TTT differences.
3. **"Regime-dependent" is real at the rank-0 level but unexploitable at the aggregated-metric level** on this eval distribution, because the distribution across ranks smooths the regime effect.

I'm leaving the two idea files in place but will update them to reflect what we actually learned — specifically, that the regime-dependence is a real property of the forward pass but does NOT translate to a leaderboard lever.
## What's genuinely real from the investigation

1. **SpinQuant rotation changes the quantized forward pass.** Rank-0 `rb` values span 0.0075 bpb across variants. The rotation is doing something mechanistically.
2. **TTT LoRA absorbs that difference at the global-aggregation level.** `diagnostic_quantized` (pre-TTT) variants span 0.00012 bpb; `quantized_ttt_phased` (post-TTT) variants span 0.00009 bpb. TTT is the dominant effect (~−0.013 bpb) and it is essentially variant-independent.
3. **Attention rotation alone has near-zero effect** on rank-0's trajectory. The `attn_only` run's `rb` column was identical to baseline's to 4 decimal places for the first ~200 batches. Physical explanation: `softmax(QKᵀ)V` is rotation-equivariant in V's head_dim, and the attention weights in #1736's stack don't have the outlier structure that rotation is meant to smooth.
4. **MLP rotation alone has the inverse rank-0 regime from port_1695.** `mlp_only` starts with a large rank-0 rb lead (1.0900 at batch 5 vs baseline's 1.1142), hurts in the middle (1.0876 at batch 25 vs baseline's 1.0771), then converges.
5. **port_1695's rank-0 rb lead is emergent** — neither `attn_only` nor `mlp_only` comes close (port_1695 at batch 5 is rb=1.0595 vs attn_only's 1.1142 and mlp_only's 1.0900).

The first four points are genuinely interesting facts about how rotations interact with quantized #1736. The fifth rules out a simple additive decomposition across rotation sites.
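The equivariance claim in point 3 can be illustrated with a toy check. This is a hedged sketch (random matrices, single head, exact arithmetic; not #1736's actual code): rotating V along head_dim by an orthogonal R and folding Rᵀ into the output projection leaves the attention output unchanged, which is exactly why rotation only matters once quantization breaks the exactness.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                                    # toy sequence length, head dim
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
W_O = rng.normal(size=(d, d))                  # output projection
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal rotation

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

A = softmax(Q @ K.T / np.sqrt(d))              # attention weights, untouched by R
out_plain = (A @ V) @ W_O
out_rot = (A @ (V @ R)) @ (R.T @ W_O)          # rotate V, fold R^T into W_O
assert np.allclose(out_plain, out_rot)         # identical in exact arithmetic
```

Quantizing `V @ R` and `R.T @ W_O` before the matmuls is where the two sides would start to differ; with no outlier structure to smooth, there is nothing for that difference to buy.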
## The "rank" conversation, summarized

User asked: "what is rank?" Thread (captured here because it's a useful methodology note):

- `torchrun --nproc_per_node=8` launches 8 GPU processes; each is a "rank" (0–7).
- Each rank processes ~98 of the 782 eval batches (disjoint subsets).
- Only rank 0 writes to the log file; ranks 1–7 compute silently.
- The final `val_bpb` in `final.json` comes from `all_reduce(loss, tokens, bytes)` across all 8 ranks, so it's aggregated over the full eval.
- **The `rb:` column in `ttp:` log lines is rank 0 only.** Different rank-0 doc shards get different `rb` trajectories on different runs/seeds.

We discussed adding per-rank logging (~30 LOC patch + rerun) to validate the regime-dependence across rank boundaries. Decided: not worth the time for the leaderboard push; worth doing later if we want to write this up properly as a research artifact.
## Decisions for research

1. **SpinQuant fully exhausted.** No further rotation variants worth testing for leaderboard purposes. Seed sweeps, layer-selective, `full` (static R₀ + fold), `attn_in_only` — all projected to land within 0.0001 bpb of baseline based on the 010b data.
2. **Update the idea files.** The regime-dependence observation is real *mechanistically* but **does not translate to a leaderboard lever** on this eval distribution with this TTT stack. The idea files need a correction section.
3. **Move to spec 011** (tapered Muon WD retrain). A full training run that modifies the trained weights themselves — upstream of TTT, not something TTT can absorb. Code patch still unwritten (~30–50 LOC).
4. **Add a methodology note to EXECUTION.md:** *"The rank-0 `rb` column in `ttp:` log lines reflects 1/8 of the eval. Only `final.json` val_bpb is the submission number. Do not extrapolate trends from rb."*
5. **Keep the per-rank logging idea as a low-priority research follow-up** — useful if we ever want to publish the SpinQuant negative result with proper statistics. Not on the 04-30 critical path.
## Cost and budget

| Spec / arc | Cost |
|---|---|
| Spec 008 (morning, execution) | ~$16 |
| SpinQuant investigation (009 + 010 + 010b) | ~$36 |
| **Today's total** | **~$52** |
| **Project spend to date** | **~$52** of $200 budget |

$148 remaining, 10 days left. Plenty of runway for spec 011 (~$20), a plausible spec 012 (whatever lever 011 points at), and a final 3-seed confirmation on the winning stack (~$30–40).
## Open questions for the next session

1. Spec 011's WD-taper patch: what's the cleanest insertion point in `train_gpt.py`'s training loop? It goes into the Muon optimizer's per-step update logic.
2. Does tapered WD on #1736 reproduce #1729's claimed small positive? If yes, we have our first leaderboard lever of the push. If no, SwiGLU or layerwise LR decay are the next candidates.
3. Does any training-time change on #1736's stack unlock levers that TTT *can't* absorb? This is the meta-question — if TTT absorbs quant-side changes, does it also absorb training-time changes? Spec 011 is the first test.
4. Per-rank analysis of spec 010's data (deferred): it would confirm whether the rotation regime-dependence replicates across rank boundaries. Worth doing if we end up writing this up as a research artifact post-deadline.
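For question 1, the taper itself is tiny; what matters is the insertion point. A hypothetical sketch of the schedule only (the names `step`, `total_steps`, and `wd_max`, and the decoupled-WD application shown in the comment, are assumptions, not the actual train_gpt.py/Muon code):

```python
def tapered_wd(step: int, total_steps: int, wd_max: float = 0.1) -> float:
    """Linearly taper weight decay from wd_max at step 0 down to 0 at the final step."""
    return wd_max * (1.0 - step / total_steps)

# Hypothetical decoupled-WD application inside the optimizer's per-step update:
#   p.data.mul_(1 - lr * tapered_wd(step, total_steps))
```

The open design question is exactly where that multiply belongs relative to Muon's orthogonalized update, which is what the ~30–50 LOC estimate covers.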
## What I'd do differently if we ran SpinQuant from scratch

- **Demand final.json numbers before any interpretation.** Don't read `rb` trajectories as forecasts.
- **Run `baseline` + `port_1695` first**, before drilling into site ablations. The "SpinQuant is exhausted" finding would have been clear 6 hours earlier, saving the spec 010b design cycle.
- **Look at the `diagnostic_quantized` (pre-TTT) spread first.** If it's under ~0.001 across variants, TTT will smash everything to near-identical post-TTT numbers. Spec 009's 0.00003 gap was already a strong hint.
- **The regime-dependence framing moved too fast.** "Rotation helps long, hurts short" was a real observation, but I leapt to "this is exploitable" before checking whether the aggregate moved. The aggregate is what matters.
## State going into next session

- `fork/research` at commit `8815c4d` (rotation-regime-dependence idea file pushed).
- All 5 SpinQuant variants measured; runs in `runs/{009,010,010b}-*/`.
- Spec 011 doc ready; code not written.
- Spec 010b's summary.md has the full comparative analysis; worth reading before touching SpinQuant again.

Done for the day on SpinQuant. Tomorrow: spec 011 patch + run.
