spec 008: single seed, pre-GPTQ checkpoint for quant hotstart
Collapse spec 008 to seed 42 only and add a one-line pre-GPTQ FP
checkpoint save at runs/008-1736-reproduction/seed_42/pre_gptq.pt
(env-var gated via SAVE_PRE_GPTQ=1 so the reproduction itself is
unaffected when the flag is off).
Rationale: SpinQuant and subsequent quant-family experiments are
purely post-training transforms, so hotstarting off a single
pre-GPTQ FP checkpoint is far cheaper than retraining per spec.
Single-seed comparison against openai#1736's seed-42 (1.06610, ±0.003)
is apples-to-apples for screening. Cost drops ~$40 -> ~$17 for
this spec and ~$10 -> ~$1–2 per downstream quant experiment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
research/specs/008-1736-reproduction.md — 58 additions, 44 deletions
```diff
@@ -10,11 +10,9 @@ We can reproduce dexhunter's unmerged PR #1736 (val_bpb 1.06549, 3-seed mean, st
 
 ## Baseline
 
-Comparison reference for this spec is the claimed number from PR #1736's submission:
-
-- seed 42: 1.06610
-- seed 0: 1.06473
-- seed 1234: 1.06563
-
-**mean: 1.06549 ± 0.00070** (per their `submission.json`)
+Comparison reference for this spec is #1736's **seed-42** number from their `submission.json`: **val_bpb = 1.06610**.
+
+(For reference: their full 3-seed set was 42=1.06610, 0=1.06473, 1234=1.06563, mean=1.06549±0.00070. We only reproduce seed 42; per-seed comparison is apples-to-apples and sufficient for screening.)
 
 Our spec-000 number (1.0810, merged-SOTA replica) remains on the books only as a legacy reference for backward-compat sanity reruns.
```
```diff
@@ -34,12 +32,12 @@ Not a delta experiment — a baseline migration. Success criterion is *reproduci
 
 - Training runs 50–100 steps with no NaN, finite first-step loss.
 - Step time within 2× of expected for 2×H100 (i.e., not catastrophically slow).
 
-### Phase 3 — 8×H100 3-seed official
-
-- All 3 seeds complete without NaN / divergence.
-- Each artifact < 16,000,000 bytes (decimal cap, per #1736's submission).
-- Each seed within 600 s train + 600 s eval budget.
-
-**Primary accept:** 3-seed mean val_bpb within ±0.001 of 1.0655 (matches #1736's reported std band).
-
-**Secondary accept:** individual seeds within ±0.003 of their matched #1736 seed value.
+### Phase 3 — 8×H100 single-seed official
+
+- Seed 42 completes without NaN / divergence.
+- Artifact < 16,000,000 bytes (decimal cap, per #1736's submission).
+- Within 600 s train + 600 s eval budget.
+
+**Primary accept:** val_bpb within **±0.003 of 1.06610** (#1736's seed-42 number).
+
+**Pre-GPTQ checkpoint saved** to `runs/008-1736-reproduction/seed_42/pre_gptq.pt` (FP weights, right before GPTQ quantization runs). This is the hotstart input for specs 009+ quant experiments.
 
 ## Config diff
```
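The Phase 3 gates above are mechanical enough to script as a single pass/fail check. A minimal sketch (the helper name is hypothetical; the ±0.003 band, the 16,000,000-byte decimal cap, and the 600 s budgets are the spec's own numbers):

```python
REF_VAL_BPB = 1.06610      # #1736's seed-42 number
TOLERANCE = 0.003          # primary-accept band
ARTIFACT_CAP = 16_000_000  # decimal byte cap, per #1736's submission

def phase3_accepts(val_bpb: float, artifact_bytes: int,
                   train_s: float, eval_s: float) -> bool:
    """Return True iff the single seed-42 run meets every Phase 3 gate."""
    return (abs(val_bpb - REF_VAL_BPB) <= TOLERANCE
            and artifact_bytes < ARTIFACT_CAP
            and train_s <= 600
            and eval_s <= 600)
```

A run at 1.0668 with a 15.9 MB artifact inside budget passes; any single gate missing (e.g. an artifact at exactly the cap) fails the whole phase.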
```diff
@@ -85,11 +83,11 @@ SEED=<42|0|1234>
 
 - [x] **Phase 1** — CPU (data prep). Any tiny pod, or run on a GPU pod during setup.
 - [x] **Phase 2** — 2×H100 smoke (~10 min, ~$0.50). Skipped only if execution has very high confidence in the integration.
-- [x] **Phase 3** — 8×H100 3-seed official (~30 min × 3).
+- [x] **Phase 3** — 8×H100 single-seed official (~30 min, ~$10).
 
 ## Seed plan
 
-Three seeds, matching #1736: **42, 0, 1234**. Gives apples-to-apples comparison against their reported numbers.
+**Single seed: 42.** Apples-to-apples against #1736's seed-42 number (1.06610). Multi-seed confirmation is deferred to a final leaderboard spec if/when a composition looks submission-ready. Screening single-seed-vs-single-seed is consistent with our step-matched-comparison convention and saves ~$20 per quant experiment downstream.
```
```diff
+**Exactly one:** `runs/008-1736-reproduction/seed_42/pre_gptq.pt` — FP16/FP32 weights saved right before GPTQ quantization runs.
+
+Rationale: the entire quant-family spec chain (009 SpinQuant, plus any future per-group-bit / AR-selfgen / AWQ experiments) can hotstart off this single checkpoint because SpinQuant and its siblings are post-training transforms. Per-experiment cost drops from ~$10 retrain to ~$1–2 rotate-and-requant. The one-line injection into `train_gpt.py` is gated on an env var so the reproduction itself is unaffected.
-
-Rationale: #1736's `train_gpt.py` doesn't implement our `CKPT_DIR`/`CKPT_STEPS` convention, and patching it adds reproduction risk for negligible gain. If specs 009+ need hotstart off this trajectory, a targeted checkpoint-enabled rerun can be scheduled then.
+
+No intermediate / phase-boundary checkpoints. No post-GPTQ checkpoints (the training log + `final.json` carry the info we'd want).
```
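The env-var-gated save might look like the following sketch. The helper and the `run_gptq` call site are assumptions about #1736's `train_gpt.py` (which the spec deliberately doesn't restructure), and plain `pickle` stands in for `torch.save` so the gate logic is runnable standalone:

```python
import os
import pickle  # stand-in; the real injection uses torch.save on the state_dict

PRE_GPTQ_PATH = "runs/008-1736-reproduction/seed_42/pre_gptq.pt"

def maybe_save_pre_gptq(state_dict, path=PRE_GPTQ_PATH):
    """One-line-equivalent hook: save FP weights iff SAVE_PRE_GPTQ=1.

    Returns False without touching disk when the flag is unset, so the
    plain reproduction is behaviorally unchanged.
    """
    if os.environ.get("SAVE_PRE_GPTQ") != "1":
        return False
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(state_dict, f)  # real code: torch.save(state_dict, path)
    return True

# Hypothetical placement in train_gpt.py, immediately before GPTQ runs:
#   maybe_save_pre_gptq(model.state_dict())
#   quantized = run_gptq(model)
```

The return value makes the gate easy to assert on in the Phase 3 log check.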
```diff
 ## Stop-early criteria
 
 - Import / CUDA failure on smoke → halt, flag.
 - NaN in train_loss at any step → halt, mark failed.
+- Seed-42 val_bpb > 0.003 off 1.06610 → halt, flag research; decide whether to retry or fall back to #1626 clean-foundation baseline before spawning downstream quant specs.
```
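The NaN gate can be a simple log grep, for example (the `train_loss:` log format is an assumption; execution should match it to #1736's actual output):

```python
import math
import re
from typing import Optional

def halt_reason(log_text: str) -> Optional[str]:
    """Scan training-log text for stop-early conditions; None means keep going."""
    for line in log_text.splitlines():
        m = re.search(r"train_loss[:=]\s*(\S+)", line)
        if not m:
            continue
        try:
            loss = float(m.group(1))
        except ValueError:
            return f"unparseable train_loss: {line!r}"
        if math.isnan(loss) or math.isinf(loss):
            return f"NaN/inf train_loss: {line!r}"
    return None
```

Running this over the tail of `train.log` between eval points gives the halt/continue decision without any change to the training script itself.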
```diff
 ## Cost estimate
 
 | Item | Cost |
 |---|---|
 | Phase 1 (data prep, CPU-idle GPU) | ~$1–2 |
 | Phase 2 (2×H100 smoke, 10 min) | ~$0.50 |
-| Phase 3 (8×H100 × 3 seeds) | ~$30 |
-| Buffer for debug | ~$10 |
-| **Total** | **~$40** |
+| Phase 3 (8×H100, single seed, ~30 min) | ~$10 |
+| Buffer for debug | ~$5 |
+| **Total** | **~$17** |
+
+Downstream quant experiments (spec 009+) hotstart off the checkpoint from this run, so each costs ~$1–2 instead of ~$10.
```
```diff
 ## Extra artifacts
 
-- `runs/008-1736-reproduction/seed_<S>/train.log` for each seed
-- `runs/008-1736-reproduction/seed_<S>/artifact.ptz` (or whatever `train_gpt.py` writes) for each seed
-- `runs/008-1736-reproduction/smoke/train.log` for Phase 2
```
```diff
 2. **HF shortcut** — is `romeerp/parameter-golf-caseops-v1` on HuggingFace byte-compatible with what `prepare_caseops_data.py` produces? If yes, Phase 1 can be replaced with a ~20 GB download + schema check (saves ~2 hours). Quick way to test: download one val shard + its byte sidecar from HF, compare byte-for-byte against local prep output on a sample doc.
 3. **flash-attn-3 install** — is the wheel at `https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/` still reachable from the pod's region? Preflight step per #1736 README: `pip install flash_attn_3 --no-deps --find-links <wheel-url>`. If unreachable, fallback?
 4. **Smoke override mechanism** — does `train_gpt.py` accept an `ITERATIONS` / `MAX_STEPS` env var, or a `DISABLE_EVAL` flag? Execution should grep the script for the iteration count constant and pick the cleanest override path. If no clean override exists, we can instead just run the full training and abort after ~2 min of logging — the smoke goal is the first 50 steps of log output.
-5. **Phase 3 seed-42 early-gate** — if seed-42 misses (>0.003 off 1.06610), do we halt before spawning seeds 0 and 1234? Suggest yes — saves ~$20 on a likely-miss. Execution to implement a simple log-grep gate between seeds.
+5. **Pre-GPTQ hook location** — execution should grep `train_gpt.py` for where GPTQ is invoked (likely a function call on the FP model) and inject the `torch.save(...)` one line before, gated on `SAVE_PRE_GPTQ`. Verify the saved state_dict loads correctly before declaring Phase 3 a pass (simple `torch.load(...)` check on the pod).
```
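The byte-for-byte comparison in open question 2 is cheap to script; a streaming-hash sketch (the paths in the usage comment are hypothetical):

```python
import hashlib

def files_byte_identical(path_a: str, path_b: str, chunk: int = 1 << 20) -> bool:
    """Compare two files byte-for-byte via streaming SHA-256 digests."""
    digests = []
    for path in (path_a, path_b):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # read in 1 MiB chunks so 20 GB shards never sit fully in memory
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        digests.append(h.digest())
    return digests[0] == digests[1]

# e.g. one HF-downloaded val shard vs. local prep output (names illustrative):
# files_byte_identical("hf/val_000.bin", "local/val_000.bin")
```

If a single sampled shard matches, the schema check can proceed; any mismatch means Phase 1 runs the full local prep as planned.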
```diff
 ## What this spec does NOT do
 
-- Does not modify #1736's `train_gpt.py` or bundle our checkpoint logic into it.
-- Does not attempt to beat 1.0655. Success is *reproduce*.
-- Does not save intermediate checkpoints.
-- Does not run a full 2×H100 mini (40 min, ~$3) — only a ~10 min smoke. Full screening confidence comes from Phase 3.
+- Does not attempt to beat 1.06610. Success is *reproduce*.
+- Does not run 3 seeds — seed 42 only. Multi-seed confirmation is deferred to a potential final leaderboard spec.
+- Does not save intermediate / phase-boundary checkpoints. One pre-GPTQ checkpoint only.
+- Does not run a full 2×H100 mini (40 min, ~$3) — only a ~10 min smoke.
+- Does not modify `train_gpt.py` beyond the env-var-gated pre-GPTQ checkpoint save.
 
 - Does not test other unmerged PRs (#1735, #1738, #1667, #1729, #1695) as alternative bases. Those belong in specs 009+.
```