spec 008: single seed, pre-GPTQ checkpoint for quant hotstart
Collapse spec 008 to seed 42 only and add a one-line pre-GPTQ FP
checkpoint save at runs/008-1736-reproduction/seed_42/pre_gptq.pt
(env-var gated via SAVE_PRE_GPTQ=1 so the reproduction itself is
unaffected when the flag is off).
Rationale: SpinQuant and subsequent quant-family experiments are
purely post-training transforms, so hotstarting off a single
pre-GPTQ FP checkpoint is far cheaper than retraining per spec.
Single-seed comparison against openai#1736's seed-42 (1.06610, ±0.003)
is apples-to-apples for screening. Cost drops ~$40 -> ~$17 for
this spec and ~$10 -> ~$1–2 per downstream quant experiment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
research/specs/008-1736-reproduction.md — 58 additions, 44 deletions
```diff
@@ -10,11 +10,9 @@ We can reproduce dexhunter's unmerged PR #1736 (val_bpb 1.06549, 3-seed mean, st
 
 ## Baseline
 
-Comparison reference for this spec is the claimed number from PR #1736's submission:
-
-- seed 42: 1.06610
-- seed 0: 1.06473
-- seed 1234: 1.06563
-
-**mean: 1.06549 ± 0.00070** (per their `submission.json`)
+Comparison reference for this spec is #1736's **seed-42** number from their `submission.json`: **val_bpb = 1.06610**.
+
+(For reference: their full 3-seed set was 42=1.06610, 0=1.06473, 1234=1.06563, mean=1.06549±0.00070. We only reproduce seed 42; per-seed comparison is apples-to-apples and sufficient for screening.)
 
 Our spec-000 number (1.0810, merged-SOTA replica) remains on the books only as a legacy reference for backward-compat sanity reruns.
```
```diff
@@ -34,12 +32,12 @@ Not a delta experiment — a baseline migration. Success criterion is *reproduci
 
 - Training runs 50–100 steps with no NaN, finite first-step loss.
 - Step time within 2× of expected for 2×H100 (i.e., not catastrophically slow).
 
-### Phase 3 — 8×H100 3-seed official
-
-- All 3 seeds complete without NaN / divergence.
-- Each artifact < 16,000,000 bytes (decimal cap, per #1736's submission).
-- Each seed within 600 s train + 600 s eval budget.
-
-**Primary accept:** 3-seed mean val_bpb within ±0.001 of 1.0655 (matches #1736's reported std band).
-
-**Secondary accept:** individual seeds within ±0.003 of their matched #1736 seed value.
+### Phase 3 — 8×H100 single-seed official
+
+- Seed 42 completes without NaN / divergence.
+- Artifact < 16,000,000 bytes (decimal cap, per #1736's submission).
+- Within 600 s train + 600 s eval budget.
+
+**Primary accept:** val_bpb within **±0.003 of 1.06610** (#1736's seed-42 number).
+
+**Pre-GPTQ checkpoint saved** to `runs/008-1736-reproduction/seed_42/pre_gptq.pt` (FP weights, right before GPTQ quantization runs). This is the hotstart input for specs 009+ quant experiments.
 
 ## Config diff
```
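The Phase 3 gates above are mechanical enough to script as a single pass/fail check. A minimal sketch (the helper name is hypothetical; the ±0.003 band, the 16,000,000-byte decimal cap, and the 600 s budgets are the spec's own numbers):

```python
REF_VAL_BPB = 1.06610      # #1736's seed-42 number
TOLERANCE = 0.003          # primary-accept band
ARTIFACT_CAP = 16_000_000  # decimal byte cap, per #1736's submission

def phase3_accepts(val_bpb: float, artifact_bytes: int,
                   train_s: float, eval_s: float) -> bool:
    """Return True iff the single seed-42 run meets every Phase 3 gate."""
    return (abs(val_bpb - REF_VAL_BPB) <= TOLERANCE
            and artifact_bytes < ARTIFACT_CAP
            and train_s <= 600
            and eval_s <= 600)
```

A run at 1.0668 with a 15.9 MB artifact inside budget passes; any single gate missing (e.g. an artifact at exactly the cap) fails the whole phase.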
```diff
@@ -85,11 +83,11 @@ SEED=<42|0|1234>
 
 - [x] **Phase 1** — CPU (data prep). Any tiny pod, or run on a GPU pod during setup.
 - [x] **Phase 2** — 2×H100 smoke (~10 min, ~$0.50). Skipped only if execution has very high confidence in the integration.
-- [x] **Phase 3** — 8×H100 3-seed official (~30 min × 3).
+- [x] **Phase 3** — 8×H100 single-seed official (~30 min, ~$10).
 
 ## Seed plan
 
-Three seeds, matching #1736: **42, 0, 1234**. Gives apples-to-apples comparison against their reported numbers.
+**Single seed: 42.** Apples-to-apples against #1736's seed-42 number (1.06610). Multi-seed confirmation is deferred to a final leaderboard spec if/when a composition looks submission-ready. Screening single-seed-vs-single-seed is consistent with our step-matched-comparison convention and saves ~$20 per quant experiment downstream.
```
```diff
+**Exactly one:** `runs/008-1736-reproduction/seed_42/pre_gptq.pt` — FP16/FP32 weights saved right before GPTQ quantization runs.
+
+Rationale: the entire quant-family spec chain (009 SpinQuant, plus any future per-group-bit / AR-selfgen / AWQ experiments) can hotstart off this single checkpoint because SpinQuant and its siblings are post-training transforms. Per-experiment cost drops from ~$10 retrain to ~$1–2 rotate-and-requant. The one-line injection into `train_gpt.py` is gated on an env var so the reproduction itself is unaffected.
-
-Rationale: #1736's `train_gpt.py` doesn't implement our `CKPT_DIR`/`CKPT_STEPS` convention, and patching it adds reproduction risk for negligible gain. If specs 009+ need hotstart off this trajectory, a targeted checkpoint-enabled rerun can be scheduled then.
+
+No intermediate / phase-boundary checkpoints. No post-GPTQ checkpoints (the training log + `final.json` carry the info we'd want).
```
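The env-var-gated save might look like the following sketch. The helper and the `run_gptq` call site are assumptions about #1736's `train_gpt.py` (which the spec deliberately doesn't restructure), and plain `pickle` stands in for `torch.save` so the gate logic is runnable standalone:

```python
import os
import pickle  # stand-in; the real injection uses torch.save on the state_dict

PRE_GPTQ_PATH = "runs/008-1736-reproduction/seed_42/pre_gptq.pt"

def maybe_save_pre_gptq(state_dict, path=PRE_GPTQ_PATH):
    """One-line-equivalent hook: save FP weights iff SAVE_PRE_GPTQ=1.

    Returns False without touching disk when the flag is unset, so the
    plain reproduction is behaviorally unchanged.
    """
    if os.environ.get("SAVE_PRE_GPTQ") != "1":
        return False
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(state_dict, f)  # real code: torch.save(state_dict, path)
    return True

# Hypothetical placement in train_gpt.py, immediately before GPTQ runs:
#   maybe_save_pre_gptq(model.state_dict())
#   quantized = run_gptq(model)
```

The return value makes the gate easy to assert on in the Phase 3 log check.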
```diff
 ## Stop-early criteria
 
 - Import / CUDA failure on smoke → halt, flag.
 - NaN in train_loss at any step → halt, mark failed.
+- Seed-42 val_bpb > 0.003 off 1.06610 → halt, flag research; decide whether to retry or fall back to #1626 clean-foundation baseline before spawning downstream quant specs.
```
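The NaN gate can be a simple log grep, for example (the `train_loss:` log format is an assumption; execution should match it to #1736's actual output):

```python
import math
import re
from typing import Optional

def halt_reason(log_text: str) -> Optional[str]:
    """Scan training-log text for stop-early conditions; None means keep going."""
    for line in log_text.splitlines():
        m = re.search(r"train_loss[:=]\s*(\S+)", line)
        if not m:
            continue
        try:
            loss = float(m.group(1))
        except ValueError:
            return f"unparseable train_loss: {line!r}"
        if math.isnan(loss) or math.isinf(loss):
            return f"NaN/inf train_loss: {line!r}"
    return None
```

Running this over the tail of `train.log` between eval points gives the halt/continue decision without any change to the training script itself.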
```diff
 ## Cost estimate
 
 | Item | Cost |
 |---|---|
 | Phase 1 (data prep, CPU-idle GPU) | ~$1–2 |
 | Phase 2 (2×H100 smoke, 10 min) | ~$0.50 |
-| Phase 3 (8×H100 × 3 seeds) | ~$30 |
-| Buffer for debug | ~$10 |
-| **Total** | **~$40** |
+| Phase 3 (8×H100, single seed, ~30 min) | ~$10 |
+| Buffer for debug | ~$5 |
+| **Total** | **~$17** |
+
+Downstream quant experiments (spec 009+) hotstart off the checkpoint from this run, so each costs ~$1–2 instead of ~$10.
```
```diff
 ## Extra artifacts
 
-- `runs/008-1736-reproduction/seed_<S>/train.log` for each seed
-- `runs/008-1736-reproduction/seed_<S>/artifact.ptz` (or whatever `train_gpt.py` writes) for each seed
-- `runs/008-1736-reproduction/smoke/train.log` for Phase 2
```
```diff
 2. **HF shortcut** — is `romeerp/parameter-golf-caseops-v1` on HuggingFace byte-compatible with what `prepare_caseops_data.py` produces? If yes, Phase 1 can be replaced with a ~20 GB download + schema check (saves ~2 hours). Quick way to test: download one val shard + its byte sidecar from HF, compare byte-for-byte against local prep output on a sample doc.
 3. **flash-attn-3 install** — is the wheel at `https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/` still reachable from the pod's region? Preflight step per #1736 README: `pip install flash_attn_3 --no-deps --find-links <wheel-url>`. If unreachable, fallback?
 4. **Smoke override mechanism** — does `train_gpt.py` accept an `ITERATIONS` / `MAX_STEPS` env var, or a `DISABLE_EVAL` flag? Execution should grep the script for the iteration count constant and pick the cleanest override path. If no clean override exists, we can instead just run the full training and abort after ~2 min of logging — the smoke goal is the first 50 steps of log output.
-5. **Phase 3 seed-42 early-gate** — if seed-42 misses (>0.003 off 1.06610), do we halt before spawning seeds 0 and 1234? Suggest yes — saves ~$20 on a likely-miss. Execution to implement a simple log-grep gate between seeds.
+5. **Pre-GPTQ hook location** — execution should grep `train_gpt.py` for where GPTQ is invoked (likely a function call on the FP model) and inject the `torch.save(...)` one line before, gated on `SAVE_PRE_GPTQ`. Verify the saved state_dict loads correctly before declaring Phase 3 a pass (simple `torch.load(...)` check on the pod).
```
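The byte-for-byte comparison in open question 2 is cheap to script; a streaming-hash sketch (the paths in the usage comment are hypothetical):

```python
import hashlib

def files_byte_identical(path_a: str, path_b: str, chunk: int = 1 << 20) -> bool:
    """Compare two files byte-for-byte via streaming SHA-256 digests."""
    digests = []
    for path in (path_a, path_b):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # read in 1 MiB chunks so 20 GB shards never sit fully in memory
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        digests.append(h.digest())
    return digests[0] == digests[1]

# e.g. one HF-downloaded val shard vs. local prep output (names illustrative):
# files_byte_identical("hf/val_000.bin", "local/val_000.bin")
```

If a single sampled shard matches, the schema check can proceed; any mismatch means Phase 1 runs the full local prep as planned.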
```diff
 ## What this spec does NOT do
 
-- Does not modify #1736's `train_gpt.py` or bundle our checkpoint logic into it.
-- Does not attempt to beat 1.0655. Success is *reproduce*.
-- Does not save intermediate checkpoints.
-- Does not run a full 2×H100 mini (40 min, ~$3) — only a ~10 min smoke. Full screening confidence comes from Phase 3.
+- Does not attempt to beat 1.06610. Success is *reproduce*.
+- Does not run 3 seeds — seed 42 only. Multi-seed confirmation is deferred to a potential final leaderboard spec.
+- Does not save intermediate / phase-boundary checkpoints. One pre-GPTQ checkpoint only.
+- Does not run a full 2×H100 mini (40 min, ~$3) — only a ~10 min smoke.
+- Does not modify `train_gpt.py` beyond the env-var-gated pre-GPTQ checkpoint save.
 
 - Does not test other unmerged PRs (#1735, #1738, #1667, #1729, #1695) as alternative bases. Those belong in specs 009+.
```