|
| 1 | +# Spec 009 — SpinQuant V1 hotstart on top of #1736 |
| 2 | + |
| 3 | +**Slug:** `spinquant-hotstart` |
| 4 | +**Created:** 2026-04-20 |
| 5 | +**Links to idea:** `research/ideas/1736-improvement.md` (spec-009 section) |
| 6 | +**Depends on:** spec 008 complete, `runs/008-1736-reproduction/seed_42/pre_gptq.pt` present. |
| 7 | + |
| 8 | +## Hypothesis |
| 9 | + |
| 10 | +Hadamard rotation of weight matrices before GPTQ quantization (SpinQuant V1) spreads weight-distribution outliers uniformly across input dimensions. This reduces quantization error at fixed bit-width, improving post-quant val_bpb without touching the float-precision forward pass. Reported in PR #1695 (X-Abhishek-X) at a claimed −0.005 bpb on top of a #1529-adjacent base; expected to compose cleanly with #1736 since the quant stage is orthogonal to CaseOps / attention gates / phased TTT. |
| 11 | + |
| 12 | +## Baseline |
| 13 | + |
| 14 | +Spec 008's reproduced seed-42 val_bpb (target ~1.06610 ± 0.003). Exact number is whatever spec 008 actually lands; this spec compares Δ against that. |
| 15 | + |
| 16 | +## Expected Δ |
| 17 | + |
| 18 | +- **Strong:** −0.005 to −0.007 bpb vs spec 008 → SpinQuant confirmed as a free quant lever on #1736, stacks with CaseOps. |
| 19 | +- **Weak:** −0.001 to −0.004 bpb → partial benefit; investigate whether rotation classes were fully applied or whether #1736's int6 GPTQ is already less outlier-sensitive than #1695's stack. |
| 20 | +- **Null or negative:** |Δ| ≤ 0.001 or positive → implementation bug suspected (likely a missed consistent-pair rotation); halt and debug before proceeding. |
| 21 | + |
| 22 | +## Accept criteria |
| 23 | + |
| 24 | +### Phase 1 — FP invariance sanity (before GPTQ) |
| 25 | +- Load `pre_gptq.pt`, run one forward pass on a small batch, record logits. |
| 26 | +- Apply all rotation classes (see "Rotation structure" below). |
| 27 | +- Re-run the same forward pass post-rotation, compare logits. |
| 28 | +- **Must match within float tolerance** (max abs diff ≤ 1e-3 bf16, ≤ 1e-5 fp32). If not, rotation pairs are inconsistent → halt, debug. |
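The Phase 1 check can be illustrated on a toy model. This is a minimal sketch, not the hotstart script: it uses plain Python lists instead of torch tensors, a hypothetical two-matrix "model" (an embedding `E` feeding an lm_head `W` directly), and a hard-coded normalized 4×4 Hadamard matrix as the residual-stream rotation. The helper `invariance_report` and its field names are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Plain nested-list matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def invariance_report(before, after, tol):
    """Compare two logit matrices and summarise the absolute differences."""
    diffs = [abs(x - y) for rb, ra in zip(before, after) for x, y in zip(rb, ra)]
    max_diff = max(diffs)
    return {
        "max_abs_diff": max_diff,
        "mean_abs_diff": sum(diffs) / len(diffs),
        "passed": max_diff <= tol,
    }


# Toy stand-in for the model (assumption): logits[t][v] = E[t] . W[v]
rng = random.Random(0)
d, vocab = 4, 6
E = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(vocab)]  # vocab x d
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(vocab)]  # vocab x d

# Normalised 4x4 Hadamard matrix, used here as the residual-stream rotation.
H4 = ([1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1])
R = [[0.5 * v for v in row] for row in H4]

logits_before = matmul(E, transpose(W))

# Consistent pair: every embedding row e becomes R e, every head row w becomes R w,
# so (R e) . (R w) = e . w and the logits are unchanged.
E_rot = matmul(E, transpose(R))
W_rot = matmul(W, transpose(R))
logits_after = matmul(E_rot, transpose(W_rot))
report = invariance_report(logits_before, logits_after, tol=1e-5)

# Inconsistent pair: only the write side is rotated, so invariance breaks.
logits_broken = matmul(E_rot, transpose(W))
broken = invariance_report(logits_before, logits_broken, tol=1e-5)

print(report["passed"], broken["passed"])
```

Rotating only one side of a pair (the `broken` case) fails the check, which is the failure mode the halt condition is meant to catch.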
| 29 | + |
| 30 | +### Phase 2 — GPTQ + TTT + eval |
| 31 | +- GPTQ quantization completes under same config as #1736 (`EMBED_BITS=7`, `MLP_CLIP_SIGMAS=12.0`, `ATTN_CLIP_SIGMAS=13.0`, etc.). |
| 32 | +- Artifact < 16,000,000 bytes. |
| 33 | +- Phased TTT eval completes within 600 s. |
| 34 | +- val_bpb is reported. |
| 35 | + |
| 36 | +### Primary success |
| 37 | +- **val_bpb at least 0.003 below the spec 008 baseline** (moves in the expected direction with plausible magnitude). |
| 38 | +- Ideally ≥ 0.0072 for a standalone-record claim (0.005 nat threshold), but not required — this is a screen. |
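The 0.0072 figure is the 0.005 nat threshold converted to bits, assuming both quantities are measured per byte:

$$
\Delta_{\text{bpb}} = \frac{0.005\ \text{nat}}{\ln 2} \approx \frac{0.005}{0.6931} \approx 0.0072\ \text{bits per byte}
$$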
| 39 | + |
| 40 | +## Config diff |
| 41 | + |
| 42 | +Same env block as spec 008 (identical GPTQ / TTT / gate settings). Two additions: |
| 43 | + |
| 44 | +``` |
| 45 | +SPINQUANT_ENABLED=1 |
| 46 | +SPINQUANT_SEED=42 # seed for the random orthogonal / signed-Hadamard generator |
| 47 | +HOTSTART_FP_CKPT=/workspace/runs/008-1736-reproduction/seed_42/pre_gptq.pt |
| 48 | +``` |
| 49 | + |
| 50 | +No training run. No `MATRIX_LR` / `MUON` / dataset settings matter — this spec skips training entirely. |
| 51 | + |
| 52 | +## Rotation structure |
| 53 | + |
| 54 | +Three classes, all fixed (not learned), stored as non-parameter buffers: |
| 55 | + |
| 56 | +| Class | Shape | Applied to | # distinct Rs | Constraint | |
| 57 | +|---|---|---|---|---| |
| 58 | +| **Residual-stream R₀** | d_model × d_model | Embedding weight, attn input projections (Q/K/V slices of `qo_bank`/`kv_bank`), attn output projection input side, MLP W1 input side, MLP W2 output side, lm_head input side | 1 (global, shared across all 11 layers) | must be orthogonal; applied as `W ← R₀·W` or `W·R₀ᵀ` per in/out side | |
| 59 | +| **Per-layer attn R_a^ℓ** | d_head × d_head | Internal Q·Kᵀ rotation per layer | 11 (or fewer if banked) | applied to Q-out and K-out consistently | |
| 60 | +| **Per-layer MLP R_m^ℓ** | d_ff × d_ff | Internal W1→W2 rotation per layer | 11 | applied to W1-out and W2-in consistently | |
| 61 | + |
| 62 | +**R construction:** preferred `R = diag(±1) · Hadamard(d) / √d` (signed Hadamard, normalized by 1/√d so that R is orthogonal; structured and outlier-spreading). Fallback: random orthogonal via the Q factor of `torch.linalg.qr(torch.randn(d, d))`, which returns a `(Q, R)` pair. |
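A minimal sketch of the preferred construction, in plain Python for illustration (the actual script would presumably build torch tensors). It assumes the Sylvester construction, which only exists for power-of-two `d`, and applies the 1/√d normalization that makes the result orthogonal. The function names `sylvester_hadamard`, `signed_hadamard`, and `max_orthogonality_error` are hypothetical, and the way a seed maps to the sign vector is an assumption.

```python
import math
import random


def sylvester_hadamard(d):
    """Return the d x d Sylvester Hadamard matrix (entries +1/-1)."""
    if d < 1 or d & (d - 1):
        raise ValueError("Sylvester construction needs d to be a power of two")
    h = [[1]]
    while len(h) < d:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def signed_hadamard(d, seed):
    """R = diag(s) @ H / sqrt(d), with s drawn from {-1, +1} using `seed`."""
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in range(d)]
    scale = 1.0 / math.sqrt(d)
    return [[s * v * scale for v in row]
            for s, row in zip(signs, sylvester_hadamard(d))]


def max_orthogonality_error(r):
    """Largest absolute entry of R @ R^T - I."""
    d = len(r)
    err = 0.0
    for i in range(d):
        for j in range(d):
            dot = sum(r[i][k] * r[j][k] for k in range(d))
            err = max(err, abs(dot - (1.0 if i == j else 0.0)))
    return err


R = signed_hadamard(8, seed=42)
print(max_orthogonality_error(R))  # should be at float round-off level
```

Because every entry of `R` has the same magnitude 1/√d, a single large weight is spread evenly over all `d` coordinates after rotation, which is the outlier-spreading property the spec relies on. If a model dimension is not a power of two, this sketch does not apply and the random-orthogonal fallback would be needed.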
| 63 | + |
| 64 | +**Critical:** for `R₀`, every residual-stream read/write side must use the same R₀ across all layers (including across Loop45 recurrence passes — key R₀ by "residual stream" not by "invocation"). Any miss breaks float invariance. |
| 65 | + |
| 66 | +## Code changes |
| 67 | + |
| 68 | +- **Branch:** `research` (this is a commitment-class change — quant lever becomes part of our baseline if it lands). |
| 69 | +- **New file:** `records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/spinquant_hotstart.py` — a standalone script that: |
| 70 | + 1. Loads the FP checkpoint specified by `HOTSTART_FP_CKPT`. |
| 71 | + 2. Generates R₀, R_a^ℓ, R_m^ℓ using `SPINQUANT_SEED`. |
| 72 | + 3. Applies rotations to banked weight tensors (slicing `qo_bank` / `kv_bank` as appropriate). |
| 73 | + 4. Sanity-checks FP forward invariance on one batch (Phase 1 accept). |
| 74 | + 5. Invokes #1736's existing GPTQ + TTT + eval pipeline on the rotated weights (reusing functions from `train_gpt.py`). |
| 75 | + 6. Writes the artifact + bpb to `runs/009-spinquant-hotstart/`. |
| 76 | +- **No modifications** to `train_gpt.py` other than exposing a couple of its GPTQ/eval functions as importable (if they're currently inlined under `if __name__ == "__main__"`). |
| 77 | +- **Reference:** read #1695's diff when available; port rotation bookkeeping onto #1736's banked layout. |
| 78 | + |
| 79 | +## Hardware ladder |
| 80 | + |
| 81 | +- [x] **1×H100**: sufficient in principle. No training, just rotate + GPTQ (~2 min) + TTT eval (~6–10 min). Caveat: TTT eval in #1736 runs on 8 ranks per its phased-TTT setup, and the execution protocol below launches with `--nproc_per_node=8`, so first check whether single-rank eval is supported. 2×H100 is an option if DDP is needed only for TTT parallelization. |
| 82 | +- **Fallback:** 8×H100 if phased TTT requires multi-rank. |
| 83 | + |
| 84 | +## Seed plan |
| 85 | + |
| 86 | +Single seed (42), matching spec 008. Compares directly against spec 008's seed-42 number. |
| 87 | + |
| 88 | +## Inputs |
| 89 | + |
| 90 | +- **FP checkpoint:** `runs/008-1736-reproduction/seed_42/pre_gptq.pt` (spec 008 output). |
| 91 | +- **Data:** same CaseOps dataset as spec 008 (on persistent volume, already prepared). |
| 92 | +- **Tokenizer:** bundled with #1736 submission dir, unchanged. |
| 93 | + |
| 94 | +## Execution protocol |
| 95 | + |
| 96 | +Single pod, single seed, single pass: |
| 97 | + |
| 98 | +```bash |
| 99 | +cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT |
| 100 | + |
| 101 | +mkdir -p /workspace/runs/009-spinquant-hotstart |
| 102 | + |
| 103 | +NCCL_NET=Socket DATA_DIR=./data \ |
| 104 | +CASEOPS_ENABLED=1 \ |
| 105 | +PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \ |
| 106 | +MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 \ |
| 107 | +EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \ |
| 108 | +GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \ |
| 109 | +GATED_ATTN_ENABLED=1 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 \ |
| 110 | +SPINQUANT_ENABLED=1 SPINQUANT_SEED=42 \ |
| 111 | +HOTSTART_FP_CKPT=/workspace/runs/008-1736-reproduction/seed_42/pre_gptq.pt \ |
| 112 | +SEED=42 \ |
| 113 | +torchrun --standalone --nproc_per_node=8 spinquant_hotstart.py \ |
| 114 | + > /workspace/runs/009-spinquant-hotstart/run.log 2>&1 |
| 115 | +``` |
| 116 | + |
| 117 | +## Kill protocol |
| 118 | + |
| 119 | +- FP invariance check fails (Phase 1) → halt, save rotation seeds + diff stats, flag research. |
| 120 | +- GPTQ calibration fails or hangs > 5 min → halt. |
| 121 | +- Eval hangs > 15 min → halt, stop pod. |
| 122 | +- After successful completion: stop pod per memory default. |
| 123 | + |
| 124 | +## Stop-early criteria |
| 125 | + |
| 126 | +- Phase 1 FP invariance: max abs logit diff > 1e-3 (bf16) → halt before wasting GPTQ time. |
| 127 | +- Artifact size ≥ 16,000,000 bytes (the Phase 2 limit) → halt, flag. |
| 128 | +- val_bpb > spec 008 baseline + 0.003 (got *worse*) → likely rotation error, halt. |
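The stop-early criteria can be sketched as a single guard function. This is an illustration under stated assumptions, not part of #1736: the function name, the reason strings, and the constant names are hypothetical, and the artifact limit is taken to be the 16,000,000-byte bound from the Phase 2 accept criteria.

```python
from typing import Optional

ARTIFACT_LIMIT_BYTES = 16_000_000  # assumed: same bound as the Phase 2 accept criteria
INVARIANCE_TOL_BF16 = 1e-3         # Phase 1 max abs logit diff for bf16
BPB_REGRESSION_MARGIN = 0.003      # "got worse" margin relative to spec 008


def stop_early_reason(max_logit_diff: Optional[float] = None,
                      artifact_bytes: Optional[int] = None,
                      val_bpb: Optional[float] = None,
                      baseline_bpb: Optional[float] = None) -> Optional[str]:
    """Return a halt reason, or None if no stop-early criterion fires.

    Arguments left as None are treated as not yet measured and are skipped,
    so the function can be called after each pipeline stage.
    """
    if max_logit_diff is not None and max_logit_diff > INVARIANCE_TOL_BF16:
        return "fp_invariance_failed"
    if artifact_bytes is not None and artifact_bytes >= ARTIFACT_LIMIT_BYTES:
        return "artifact_too_large"
    if (val_bpb is not None and baseline_bpb is not None
            and val_bpb > baseline_bpb + BPB_REGRESSION_MARGIN):
        return "val_bpb_regressed"
    return None


# After Phase 1 only the logit diff is known.
print(stop_early_reason(max_logit_diff=5e-3))
# After Phase 2 everything is known; 1.0661 is the spec 008 target, not a measured value.
print(stop_early_reason(max_logit_diff=2e-4, artifact_bytes=15_900_000,
                        val_bpb=1.0610, baseline_bpb=1.0661))
```

Checking the invariance criterion first mirrors the intent of halting before any GPTQ time is spent.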
| 129 | + |
| 130 | +## Checkpoints to emit |
| 131 | + |
| 132 | +**None.** Spec 009 is pure post-training, no new FP state worth saving. The only artifact is the rotated-and-quantized `.ptz` submission + the log. |
| 133 | + |
| 134 | +## Cost estimate |
| 135 | + |
| 136 | +| Item | Cost | |
| 137 | +|---|---| |
| 138 | +| Pod spin-up | $1 | |
| 139 | +| Phase 1 invariance check (1×H100, ~1 min) | $0.10 | |
| 140 | +| Phase 2 GPTQ + TTT eval (8×H100, ~10–15 min) | $3 | |
| 141 | +| Buffer for debug | $2 | |
| 142 | +| **Total** | **~$6** | |
| 143 | + |
| 144 | +If FP invariance fails on the first try and we debug once, the total rises to ~$10. |
| 145 | + |
| 146 | +## Extra artifacts |
| 147 | + |
| 148 | +- `runs/009-spinquant-hotstart/run.log` — full log |
| 149 | +- `runs/009-spinquant-hotstart/artifact.ptz` — SpinQuant + GPTQ quantized submission |
| 150 | +- `runs/009-spinquant-hotstart/rotation_seeds.json` — reproducibility: records the `SPINQUANT_SEED` and any per-layer seed offsets used |
| 151 | +- `runs/009-spinquant-hotstart/invariance_report.json` — Phase 1 diff stats (max/mean logit diff pre vs post rotation) |
| 152 | +- `runs/009-spinquant-hotstart/final.json` — val_bpb, Δ vs spec 008, artifact size, wall times |
| 153 | + |
| 154 | +## Open questions for interview |
| 155 | + |
| 156 | +1. **Can #1736's `train_gpt.py` export its GPTQ + TTT entry points as importable functions?** If they're all inlined under `main()`, the hotstart script has to duplicate orchestration code. Preferable: minimal refactor in `train_gpt.py` to split `main()` into named helpers that `spinquant_hotstart.py` can call. |
| 157 | +2. **Banked layout accounting**: `qo_bank` and `kv_bank` in #1736 concatenate the Q+O and K+V projections respectively. Need to verify which slices correspond to which residual-stream reads/writes before rotating (a diagram from the code will confirm). |
| 158 | +3. **Loop45 consistency** — confirm that `R₀` applied to residual-stream weights of layers 4 and 5 is invariant across the multiple invocations of those layers in the recurrence. |
| 159 | +4. **Phased TTT compatibility** — phased TTT adapts weights during eval. SpinQuant rotation must be applied *before* phased TTT sees the weights, so TTT adapts the rotated weights. Verify the ordering in #1736's eval pipeline. |
| 160 | +5. **Reference code availability** — is #1695's diff visible (is the PR source readable)? If yes, port directly. If not, we re-derive from the SpinQuant paper and cross-check against the banked layout. |
| 161 | + |
| 162 | +## What this spec does NOT do |
| 163 | + |
| 164 | +- Does not retrain any weights. |
| 165 | +- Does not tune `SPINQUANT_SEED` — first run uses 42; if rotation-seed sensitivity matters (unlikely for Hadamard) we can sweep later. |
| 166 | +- Does not change CaseOps, gates, TTT, or any other non-quant lever. |
| 167 | +- Does not emit a pre-quant checkpoint (spec 008's serves for all quant-family hotstarts). |
| 168 | +- Does not run multi-seed — matches spec 008's single-seed convention. |