
Commit d95dd08

leon2k2k2k and claude committed

spec 009: SpinQuant V1 hotstart on top of openai#1736

First lever layered on the new openai#1736 baseline: Hadamard rotation of weight matrices before GPTQ quantization, hotstarted off spec 008's pre_gptq.pt FP checkpoint. No retraining. Witnessed at a claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on an openai#1529-adjacent base; expected to compose cleanly with openai#1736, since the quant stage is orthogonal to CaseOps / attention gates / phased TTT. Rotation is a post-training transform with three classes (residual-stream, per-layer attn, per-layer MLP); the FP forward pass is invariant by construction, so only quantization error drops. Cost ~$6 (hotstart off the spec 008 checkpoint) vs ~$30 for a full retrain. The same hotstart checkpoint is reused by future quant experiments (per-group bit, AR-selfgen calib, AWQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent 9306f89 commit d95dd08

1 file changed: 168 additions & 0 deletions
# Spec 009 — SpinQuant V1 hotstart on top of #1736

**Slug:** `spinquant-hotstart`
**Created:** 2026-04-20
**Links to idea:** `research/ideas/1736-improvement.md` (spec-009 section)
**Depends on:** spec 008 complete, `runs/008-1736-reproduction/seed_42/pre_gptq.pt` present.

## Hypothesis

Hadamard rotation of weight matrices before GPTQ quantization (SpinQuant V1) spreads weight-distribution outliers uniformly across input dimensions. This reduces quantization error at a fixed bit-width, improving post-quant val_bpb without touching the float-precision forward pass. Witnessed on PR #1695 (X-Abhishek-X) at a claimed −0.005 bpb on top of a #1529-adjacent base; expected to compose cleanly with #1736, since the quant stage is orthogonal to CaseOps / attention gates / phased TTT.

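To see the mechanism concretely, here is a toy, self-contained sketch (not spec code — the quantizer below is plain round-to-nearest standing in for GPTQ, and the sizes are arbitrary): a single outlier inflates the quantization scale for the whole tensor, and a signed-Hadamard rotation spreads that outlier's energy across all dimensions, shrinking the error.

```python
# Toy demonstration (not spec code): symmetric round-to-nearest quantization
# error for a weight vector with one outlier, before vs. after a signed
# Hadamard rotation. Illustrates the outlier-spreading claim above.
import numpy as np
from scipy.linalg import hadamard  # d must be a power of 2

rng = np.random.default_rng(0)
d = 256
w = rng.normal(0, 0.02, size=d)
w[7] = 1.0                                         # a single large outlier

H = hadamard(d) / np.sqrt(d)                       # orthogonal: H @ H.T == I
R = np.diag(rng.choice([-1.0, 1.0], size=d)) @ H   # signed Hadamard

def quant_mse(x, bits=4):
    # symmetric per-tensor round-to-nearest at the given bit-width
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(x / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return ((q * scale - x) ** 2).mean()

print("plain  :", quant_mse(w))        # scale dominated by the outlier
print("rotated:", quant_mse(w @ R.T))  # outlier energy spread across dims
```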
## Baseline

Spec 008's reproduced seed-42 val_bpb (target ~1.06610 ± 0.003). The exact number is whatever spec 008 actually lands; this spec compares Δ against it.

## Expected Δ

- **Strong:** −0.005 to −0.007 bpb vs spec 008 → SpinQuant confirmed as a free quant lever on #1736; stacks with CaseOps.
- **Weak:** −0.001 to −0.004 bpb → partial benefit; investigate whether the rotation classes were fully applied or whether #1736's int6 GPTQ is already less outlier-sensitive than #1695's stack.
- **Null or negative:** |Δ| ≤ 0.001 or positive → implementation bug suspected (likely a missed consistent-pair rotation); halt and debug before proceeding.

## Accept criteria

### Phase 1 — FP invariance sanity (before GPTQ)

- Load `pre_gptq.pt`, run one forward pass on a small batch, record logits.
- Apply all rotation classes (see "Rotation structure" below).
- Re-run the same forward pass post-rotation and compare logits (a minimal sketch of this check follows the list).
- **Must match within float tolerance** (max abs diff ≤ 1e-3 bf16, ≤ 1e-5 fp32). If not, rotation pairs are inconsistent → halt, debug.

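A minimal sketch of this gate, assuming hypothetical names — `check_fp_invariance` and the `apply_rotations` callable are placeholders for whatever `spinquant_hotstart.py` actually defines; only the tolerance logic mirrors the criterion above:

```python
# Hypothetical sketch of the Phase 1 gate (names are placeholders, not the
# real spinquant_hotstart.py API). `apply_rotations` is whatever callable
# installs R0 and the per-layer attn/MLP rotations in-place.
import torch

def check_fp_invariance(model, batch, apply_rotations, tol=1e-3):
    """Logits must agree before vs. after rotation, within float tolerance."""
    model.eval()
    with torch.no_grad():
        logits_before = model(batch).float()
        apply_rotations(model)   # consistent-pair rotations, in-place
        logits_after = model(batch).float()
    max_diff = (logits_before - logits_after).abs().max().item()
    if max_diff > tol:           # tol: 1e-3 for bf16, 1e-5 for fp32
        raise RuntimeError(f"FP invariance failed: max |dlogit| = {max_diff:.3e}")
    return {"max_abs_logit_diff": max_diff}
```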
### Phase 2 — GPTQ + TTT + eval

- GPTQ quantization completes under the same config as #1736 (`EMBED_BITS=7`, `MLP_CLIP_SIGMAS=12.0`, `ATTN_CLIP_SIGMAS=13.0`, etc.).
- Artifact < 16,000,000 bytes.
- Phased TTT eval completes within 600 s.
- val_bpb is reported.

### Primary success

- **val_bpb < spec 008 baseline by ≥ 0.003** (moves in the expected direction with plausible magnitude).
- Ideally ≥ 0.0072 for a standalone-record claim (0.005 nat threshold; 0.005 / ln 2 ≈ 0.0072 bits), but not required — this is a screen.

## Config diff

Same env block as spec 008 (identical GPTQ / TTT / gate settings). Three additions:

```
SPINQUANT_ENABLED=1
SPINQUANT_SEED=42  # seed for the random orthogonal / signed-Hadamard generator
HOTSTART_FP_CKPT=/workspace/runs/008-1736-reproduction/seed_42/pre_gptq.pt
```

No training run. No `MATRIX_LR` / `MUON` / dataset settings matter — this spec skips training entirely.
## Rotation structure

Three classes, all fixed (not learned), stored as non-parameter buffers:

| Class | Shape | Applied to | # distinct Rs | Constraint |
|---|---|---|---|---|
| **Residual-stream R₀** | d_model × d_model | Embedding weight, attn input projections (Q/K/V slices of `qo_bank`/`kv_bank`), attn output projection input side, MLP W1 input side, MLP W2 output side, lm_head input side | 1 (global, shared across all 11 layers) | must be orthogonal; applied as `W ← R₀·W` or `W·R₀ᵀ` per in/out side |
| **Per-layer attn R_a^ℓ** | d_head × d_head | Internal Q·Kᵀ rotation per layer | 11 (or fewer if banked) | applied to Q-out and K-out consistently |
| **Per-layer MLP R_m^ℓ** | d_ff × d_ff | Internal W1→W2 rotation per layer | 11 | applied to W1-out and W2-in consistently |

**R construction:** preferred `R = diag(±1) · Hadamard(d)` (signed Hadamard — structured, outlier-spreading). Fallback: random orthogonal via `torch.linalg.qr(torch.randn(d,d))`.
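A sketch of this construction under stated assumptions — `scipy.linalg.hadamard` for the structured path (requires d to be a power of two), QR for the fallback; `make_rotation` is an illustrative name, not the script's real API:

```python
# Sketch of the R construction above. Signed Hadamard when d is a power of
# two; QR-based random orthogonal otherwise. Seeded for reproducibility.
import torch
from scipy.linalg import hadamard

def make_rotation(d: int, seed: int) -> torch.Tensor:
    gen = torch.Generator().manual_seed(seed)
    if d & (d - 1) == 0:  # power of two -> structured signed Hadamard
        H = torch.from_numpy(hadamard(d)).double() / d ** 0.5
        signs = torch.randint(0, 2, (d,), generator=gen).double() * 2 - 1
        return (torch.diag(signs) @ H).float()
    # Fallback: random orthogonal from QR. Q is orthogonal regardless; fix
    # column signs so the result doesn't depend on the QR sign convention.
    A = torch.randn(d, d, generator=gen, dtype=torch.float64)
    Q, R = torch.linalg.qr(A)
    Q = Q * torch.sign(torch.diagonal(R))
    return Q.float()
```
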
**Critical:** for `R₀`, every residual-stream read/write side must use the same R₀ across all layers (including across Loop45 recurrence passes — key R₀ by "residual stream" not by "invocation"). Any miss breaks float invariance.
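Why float invariance holds, schematically: every write onto the residual stream is premultiplied by R₀ and every read is postmultiplied by R₀ᵀ, so orthogonality cancels each pair. (This ignores norm layers between write and read, which must commute with R₀ — e.g. RMSNorm with folded scales — for the identity to hold exactly.)

```latex
(W_{\mathrm{in}} R_0^{\top})(R_0 W_{\mathrm{out}})\,x
  = W_{\mathrm{in}}\,(R_0^{\top} R_0)\,W_{\mathrm{out}}\,x
  = W_{\mathrm{in}} W_{\mathrm{out}}\,x,
\qquad R_0^{\top} R_0 = I.
```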
## Code changes

- **Branch:** `research` (this is a commitment-class change — the quant lever becomes part of our baseline if it lands).
- **New file:** `records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/spinquant_hotstart.py` — a standalone script that:
  1. Loads the FP checkpoint specified by `HOTSTART_FP_CKPT`.
  2. Generates R₀, R_a^ℓ, R_m^ℓ using `SPINQUANT_SEED`.
  3. Applies rotations to banked weight tensors, slicing `qo_bank` / `kv_bank` as appropriate (see the application sketch after this section).
  4. Sanity-checks FP forward invariance on one batch (Phase 1 accept).
  5. Invokes #1736's existing GPTQ + TTT + eval pipeline on the rotated weights (reusing functions from `train_gpt.py`).
  6. Writes the artifact + bpb to `runs/009-spinquant-hotstart/`.
- **No modifications** to `train_gpt.py` other than exposing a couple of its GPTQ/eval functions as importable (if they're currently inlined under `if __name__ == "__main__":`).
- **Reference:** read #1695's diff when available; port rotation bookkeeping onto #1736's banked layout.

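A hypothetical sketch of step 3. The `[Q; O]` stacking of `qo_bank` and the (out, in) weight layout below are assumptions pending open question 2; only the pattern is the point — weights that read the residual stream get `W @ R0.T`, weights that write it get `R0 @ W`:

```python
# Hypothetical sketch of the rotation application in spinquant_hotstart.py.
# Slice boundaries and the [Q; O] stacking are ASSUMPTIONS (open question 2).
# Invariant pattern: stream-reading weights -> W @ R0.T,
#                    stream-writing weights -> R0 @ W.
import torch

@torch.no_grad()
def rotate_residual_stream(state: dict, r0: torch.Tensor, d_model: int) -> dict:
    # Embedding rows are residual-stream vectors: rotate each row.
    state["embed.weight"] = state["embed.weight"] @ r0.T

    for name, w in list(state.items()):
        if name.endswith("qo_bank"):        # assumed stacked as [Q; O] on dim 0
            q, o = w[:d_model], w[d_model:]
            state[name] = torch.cat([q @ r0.T,  # Q projection reads the stream
                                     r0 @ o])   # O projection writes it back
        elif name.endswith("kv_bank"):      # K and V both read the stream
            state[name] = w @ r0.T

    # lm_head reads the final residual state.
    state["lm_head.weight"] = state["lm_head.weight"] @ r0.T
    return state
```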
## Hardware ladder

- [x] **1×H100** — sufficient. No training, just rotate + GPTQ (~2 min) + TTT eval (~6–10 min). Could also use 2×H100 if DDP is needed for TTT parallelization, but TTT eval in #1736 runs on 8 ranks per its phased-TTT setup — may need to check whether single-rank eval is supported.
- **Fallback:** 8×H100 if phased TTT requires multi-rank.

## Seed plan

Single seed (42), matching spec 008. Compares directly against spec 008's seed-42 number.

## Inputs

- **FP checkpoint:** `runs/008-1736-reproduction/seed_42/pre_gptq.pt` (spec 008 output).
- **Data:** same CaseOps dataset as spec 008 (on the persistent volume, already prepared).
- **Tokenizer:** bundled with the #1736 submission dir, unchanged.

## Execution protocol

Single pod, single seed, single pass:

```bash
cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT

mkdir -p /workspace/runs/009-spinquant-hotstart

NCCL_NET=Socket DATA_DIR=./data \
CASEOPS_ENABLED=1 \
PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \
MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 \
EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \
GATED_ATTN_ENABLED=1 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 \
SPINQUANT_ENABLED=1 SPINQUANT_SEED=42 \
HOTSTART_FP_CKPT=/workspace/runs/008-1736-reproduction/seed_42/pre_gptq.pt \
SEED=42 \
torchrun --standalone --nproc_per_node=8 spinquant_hotstart.py \
> /workspace/runs/009-spinquant-hotstart/run.log 2>&1
```

## Kill protocol

- FP invariance check fails (Phase 1) → halt, save rotation seeds + diff stats, flag research.
- GPTQ calibration fails or hangs > 5 min → halt.
- Eval hangs > 15 min → halt, stop pod.
- After successful completion: stop pod per memory default.

## Stop-early criteria

- Phase 1 FP invariance: max abs logit diff > 1e-3 (bf16) → halt before wasting GPTQ time.
- Artifact size > 16 MB → halt, flag.
- val_bpb > spec 008 baseline + 0.003 (got *worse*) → likely a rotation error, halt.

## Checkpoints to emit

**None.** Spec 009 is pure post-training; no new FP state is worth saving. The only artifacts are the rotated-and-quantized `.ptz` submission and the log.

## Cost estimate

| Item | Cost |
|---|---|
| Pod spin-up | $1 |
| Phase 1 invariance check (1×H100, ~1 min) | $0.10 |
| Phase 2 GPTQ + TTT eval (8×H100, ~10–15 min) | $3 |
| Buffer for debug | $2 |
| **Total** | **~$6** |

If FP invariance fails on the first try and we debug once, the total rises to ~$10.

## Extra artifacts

- `runs/009-spinquant-hotstart/run.log` — full log
- `runs/009-spinquant-hotstart/artifact.ptz` — SpinQuant + GPTQ quantized submission
- `runs/009-spinquant-hotstart/rotation_seeds.json` — reproducibility: records the `SPINQUANT_SEED` and any per-layer seed offsets used
- `runs/009-spinquant-hotstart/invariance_report.json` — Phase 1 diff stats (max/mean logit diff pre vs post rotation)
- `runs/009-spinquant-hotstart/final.json` — val_bpb, Δ vs spec 008, artifact size, wall times

## Open questions for interview

1. **Can #1736's `train_gpt.py` export its GPTQ + TTT entry points as importable functions?** If they're all inlined under `main()`, the hotstart script has to duplicate orchestration code. Preferable: a minimal refactor in `train_gpt.py` that splits `main()` into named helpers `spinquant_hotstart.py` can call (see the sketch after this list).
2. **Banked layout accounting** — `qo_bank` and `kv_bank` in #1736 concatenate the Q+O and K+V projections respectively. Need to verify which slices correspond to which residual-stream reads/writes before rotating (a diagram from the code will confirm).
3. **Loop45 consistency** — confirm that `R₀` applied to the residual-stream weights of layers 4 and 5 is invariant across the multiple invocations of those layers in the recurrence.
4. **Phased TTT compatibility** — phased TTT adapts weights during eval. The SpinQuant rotation must be applied *before* phased TTT sees the weights, so TTT adapts the rotated weights. Verify the ordering in #1736's eval pipeline.
5. **Reference code availability** — is #1695's diff visible (is the PR source readable)? If yes, port directly. If not, re-derive from the SpinQuant paper and cross-check against the banked layout.

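If the refactor in question 1 happens, it would look something like the sketch below — all names are illustrative, not `train_gpt.py`'s real identifiers:

```python
# Hypothetical shape of the minimal train_gpt.py refactor (open question 1):
# pull the GPTQ and eval stages out of main() into named helpers so that
# spinquant_hotstart.py can `from train_gpt import run_gptq, run_phased_ttt_eval`.

def run_gptq(model, calib_batches, cfg):
    """Existing GPTQ code, moved verbatim out of main()."""
    ...

def run_phased_ttt_eval(model, cfg):
    """Existing phased-TTT eval code, moved verbatim out of main()."""
    ...

def main(cfg):
    model = build_and_train(cfg)            # unchanged training path
    run_gptq(model, make_calib(cfg), cfg)   # same call order as before
    return run_phased_ttt_eval(model, cfg)

if __name__ == "__main__":                  # orchestration stays behind the guard
    main(load_cfg_from_env())               # hypothetical config loader
```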
## What this spec does NOT do

- Does not retrain any weights.
- Does not tune `SPINQUANT_SEED` — first run uses 42; if rotation-seed sensitivity matters (unlikely for Hadamard) we can sweep later.
- Does not change CaseOps, gates, TTT, or any other non-quant lever.
- Does not emit a pre-quant checkpoint (spec 008's serves for all quant-family hotstarts).
- Does not run multi-seed — matches spec 008's single-seed convention.
