
Commit 9306f89

leon2k2k2k and claude committed
spec 008: single seed, pre-GPTQ checkpoint for quant hotstart
Collapse spec 008 to seed 42 only and add a one-line pre-GPTQ FP checkpoint save at runs/008-1736-reproduction/seed_42/pre_gptq.pt (env-var gated via SAVE_PRE_GPTQ=1 so the reproduction itself is unaffected when the flag is off).

Rationale: SpinQuant and subsequent quant-family experiments are purely post-training transforms, so hotstarting off a single pre-GPTQ FP checkpoint is far cheaper than retraining per spec. Single-seed comparison against openai#1736's seed-42 (1.06610, ±0.003) is apples-to-apples for screening. Cost drops ~$40 -> ~$17 for this spec and ~$10 -> ~$1–2 per downstream quant experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 416b0e4 commit 9306f89

1 file changed

Lines changed: 58 additions & 44 deletions

research/specs/008-1736-reproduction.md
@@ -10,11 +10,9 @@ We can reproduce dexhunter's unmerged PR #1736 (val_bpb 1.06549, 3-seed mean, st
 
 ## Baseline
 
-Comparison reference for this spec is the claimed number from PR #1736's submission:
-- seed 42: 1.06610
-- seed 0: 1.06473
-- seed 1234: 1.06563
-- **mean: 1.06549 ± 0.00070** (per their `submission.json`)
+Comparison reference for this spec is #1736's **seed-42** number from their `submission.json`: **val_bpb = 1.06610**.
+
+(For reference: their full 3-seed set was 42=1.06610, 0=1.06473, 1234=1.06563, mean=1.06549±0.00070. We only reproduce seed 42; per-seed comparison is apples-to-apples and sufficient for screening.)
 
 Our spec-000 number (1.0810, merged-SOTA replica) remains on the books only as a legacy reference for backward-compat sanity reruns.
 
@@ -34,12 +32,12 @@ Not a delta experiment — a baseline migration. Success criterion is *reproduci
 - Training runs 50–100 steps with no NaN, finite first-step loss.
 - Step time within 2× of expected for 2×H100 (i.e., not catastrophically slow).
 
-### Phase 3 — 8×H100 3-seed official
-- All 3 seeds complete without NaN / divergence.
-- Each artifact < 16,000,000 bytes (decimal cap, per #1736's submission).
-- Each seed within 600 s train + 600 s eval budget.
-- **Primary accept:** 3-seed mean val_bpb within ±0.001 of 1.0655 (matches #1736's reported std band).
-- **Secondary accept:** individual seeds within ±0.003 of their matched #1736 seed value.
+### Phase 3 — 8×H100 single-seed official
+- Seed 42 completes without NaN / divergence.
+- Artifact < 16,000,000 bytes (decimal cap, per #1736's submission).
+- Within 600 s train + 600 s eval budget.
+- **Primary accept:** val_bpb within **±0.003 of 1.06610** (#1736's seed-42 number).
+- **Pre-GPTQ checkpoint saved** to `runs/008-1736-reproduction/seed_42/pre_gptq.pt` (FP weights, right before GPTQ quantization runs). This is the hotstart input for specs 009+ quant experiments.
 
 ## Config diff
 
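
The ±0.003 accept gate above is scriptable. A minimal sketch of the post-run check, assuming `val_bpb` appears in the training log in a greppable `val_bpb: <float>` form (the regex is an assumption about `train_gpt.py`'s log format, to be confirmed on the pod):

```python
# Hedged accept-gate sketch; the log-line regex is an assumption.
import re

REF, TOL = 1.06610, 0.003  # #1736's seed-42 val_bpb and this spec's accept band

with open("/workspace/runs/008-1736-reproduction/seed_42/train.log") as f:
    vals = re.findall(r"val_bpb[:=]?\s*([0-9.]+)", f.read())

bpb = float(vals[-1])  # last reported eval is the official number
print("ACCEPT" if abs(bpb - REF) <= TOL else "MISS", f"val_bpb={bpb:.5f}")
```
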

@@ -85,11 +83,11 @@ SEED=<42|0|1234>
 
 - [x] **Phase 1** — CPU (data prep). Any tiny pod, or run on a GPU pod during setup.
 - [x] **Phase 2** — 2×H100 smoke (~10 min, ~$0.50). Skipped only if execution has very high confidence in the integration.
-- [x] **Phase 3** — 8×H100 3-seed official (~30 min × 3).
+- [x] **Phase 3** — 8×H100 single-seed official (~30 min, ~$10).
 
 ## Seed plan
 
-Three seeds, matching #1736: **42, 0, 1234**. Gives apples-to-apples comparison against their reported numbers.
+**Single seed: 42.** Apples-to-apples against #1736's seed-42 number (1.06610). Multi-seed confirmation is deferred to a final leaderboard spec if/when a composition looks submission-ready. Screening single-seed-vs-single-seed is consistent with our step-matched-comparison convention and saves ~$20 per quant experiment downstream.
 
 ## Inputs
 
@@ -133,25 +131,35 @@ torchrun --standalone --nproc_per_node=2 train_gpt.py
 
 If smoke passes → proceed to Phase 3. If smoke fails → halt, post diagnosis, flag research.
 
-### Phase 3 — 8×H100 3-seed official
+### Phase 3 — 8×H100 single-seed official (seed 42)
+
+**Before launching:** patch `train_gpt.py` to save the pre-GPTQ FP checkpoint. Find the code path right before the GPTQ quantization call and inject:
+
+```python
+if int(os.environ.get("SAVE_PRE_GPTQ", "0")) and rank == 0:
+    torch.save(model.state_dict(), os.environ["PRE_GPTQ_CKPT_PATH"])
+```
+
+Gated on an env var so the change is invisible when not requested (keeps the reproduction as clean as possible). One-line equivalent acceptable.
 
 ```bash
 cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT
 
-for SEED in 42 0 1234; do
-  mkdir -p /workspace/runs/008-1736-reproduction/seed_${SEED}
-  NCCL_NET=Socket DATA_DIR=./data \
-  CASEOPS_ENABLED=1 \
-  PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \
-  MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 \
-  EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
-  MATRIX_LR=0.026 \
-  GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \
-  GATED_ATTN_ENABLED=1 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 \
-  SEED=$SEED \
-  torchrun --standalone --nproc_per_node=8 train_gpt.py \
-    > /workspace/runs/008-1736-reproduction/seed_${SEED}/train.log 2>&1
-done
+mkdir -p /workspace/runs/008-1736-reproduction/seed_42
+
+NCCL_NET=Socket DATA_DIR=./data \
+CASEOPS_ENABLED=1 \
+PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \
+MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 \
+EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
+MATRIX_LR=0.026 \
+GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \
+GATED_ATTN_ENABLED=1 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 \
+SAVE_PRE_GPTQ=1 \
+PRE_GPTQ_CKPT_PATH=/workspace/runs/008-1736-reproduction/seed_42/pre_gptq.pt \
+SEED=42 \
+torchrun --standalone --nproc_per_node=8 train_gpt.py \
+  > /workspace/runs/008-1736-reproduction/seed_42/train.log 2>&1
 ```
 
 ### Kill protocol
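
A note on the injected snippet: it assumes `os` and `torch` are already imported in `train_gpt.py` and that a `rank` variable is in scope at the injection point. Under `torchrun`, each process also receives its rank as the `RANK` env var, so a self-contained variant of the hook could look like the sketch below (a sketch under those assumptions, not #1736's actual code; the script's own rank handling, e.g. `dist.get_rank()`, may differ):

```python
# Self-contained variant of the env-gated pre-GPTQ save hook (a sketch).
import os

import torch
from torch import nn


def save_pre_gptq_if_requested(model: nn.Module) -> None:
    """Call once, immediately before the GPTQ quantization step."""
    rank = int(os.environ.get("RANK", "0"))  # torchrun exports RANK per process
    if int(os.environ.get("SAVE_PRE_GPTQ", "0")) and rank == 0:
        # Save the still-unquantized FP weights as the spec 009+ hotstart input.
        torch.save(model.state_dict(), os.environ["PRE_GPTQ_CKPT_PATH"])
```
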
@@ -161,34 +169,39 @@ done
 
 ## Checkpoints to emit
 
-**None.** Checkpoints disabled for this spec.
+**Exactly one:** `runs/008-1736-reproduction/seed_42/pre_gptq.pt` — FP16/FP32 weights saved right before GPTQ quantization runs.
+
+Rationale: the entire quant-family spec chain (009 SpinQuant, plus any future per-group-bit / AR-selfgen / AWQ experiments) can hotstart off this single checkpoint because SpinQuant and its siblings are post-training transforms. Per-experiment cost drops from ~$10 retrain to ~$1–2 rotate-and-requant. The one-line injection into `train_gpt.py` is gated on an env var so the reproduction itself is unaffected.
 
-Rationale: #1736's `train_gpt.py` doesn't implement our `CKPT_DIR`/`CKPT_STEPS` convention, and patching it adds reproduction risk for negligible gain. If specs 009+ need hotstart off this trajectory, a targeted checkpoint-enabled rerun can be scheduled then.
+No intermediate / phase-boundary checkpoints. No post-GPTQ checkpoints (the training log + `final.json` carry the info we'd want).
 
 ## Stop-early criteria
 
 - Import / CUDA failure on smoke → halt, flag.
 - NaN in train_loss at any step → halt, mark failed.
 - Step time > 2× expected → halt, investigate.
-- Any seed artifact > 16 MB → halt, flag (our build mismatches #1736's compression).
-- Seed-42 official run val_bpb > 0.003 off 1.06610 → halt before running seeds 0 and 1234, flag research.
+- Artifact > 16 MB → halt, flag (our build mismatches #1736's compression).
+- Seed-42 val_bpb > 0.003 off 1.06610 → halt, flag research; decide whether to retry or fall back to #1626 clean-foundation baseline before spawning downstream quant specs.
 
 ## Cost estimate
 
 | Item | Cost |
 |---|---|
 | Phase 1 (data prep, CPU-idle GPU) | ~$1–2 |
 | Phase 2 (2×H100 smoke, 10 min) | ~$0.50 |
-| Phase 3 (8×H100 × 3 seeds) | ~$30 |
-| Buffer for debug | ~$10 |
-| **Total** | **~$40** |
+| Phase 3 (8×H100, single seed, ~30 min) | ~$10 |
+| Buffer for debug | ~$5 |
+| **Total** | **~$17** |
+
+Downstream quant experiments (spec 009+) hotstart off the checkpoint from this run, so each costs ~$1–2 instead of ~$10.
 
 ## Extra artifacts
 
-- `runs/008-1736-reproduction/seed_<S>/train.log` for each seed
-- `runs/008-1736-reproduction/seed_<S>/artifact.ptz` (or whatever `train_gpt.py` writes) for each seed
-- `runs/008-1736-reproduction/smoke/train.log` for Phase 2
-- `runs/008-1736-reproduction/final.json` — 3-seed summary (mean bpb, std, per-seed bpb, per-seed artifact size, wall times)
+- `runs/008-1736-reproduction/seed_42/train.log` — full training stdout/stderr
+- `runs/008-1736-reproduction/seed_42/pre_gptq.pt` — pre-GPTQ FP checkpoint (hotstart input for spec 009+)
+- `runs/008-1736-reproduction/seed_42/artifact.ptz` (or whatever `train_gpt.py` writes) — the submission artifact
+- `runs/008-1736-reproduction/smoke/train.log` — Phase 2 smoke log
+- `runs/008-1736-reproduction/final.json` — seed-42 summary (bpb, artifact size, wall times, step at which pre_gptq.pt was saved)
 - `runs/008-1736-reproduction/notes.md` — execution narrative + any deviations from this spec
 
 ## Open questions for interview
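
One footgun worth flagging in the stop-early list: the cap is 16,000,000 decimal bytes, not 16 MiB (16,777,216). A minimal check sketch, assuming the artifact lands at the `artifact.ptz` path listed under extra artifacts (the actual output filename of `train_gpt.py` may differ):

```python
# Hedged byte-cap check; the artifact path/name is an assumption from this spec.
import os

ARTIFACT = "/workspace/runs/008-1736-reproduction/seed_42/artifact.ptz"
CAP = 16_000_000  # decimal cap, per #1736's submission -- not 16 MiB

size = os.path.getsize(ARTIFACT)
print(f"{size} bytes, headroom {CAP - size}: {'OK' if size < CAP else 'OVER CAP'}")
```
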
@@ -197,12 +210,13 @@ Rationale: #1736's `train_gpt.py` doesn't implement our `CKPT_DIR`/`CKPT_STEPS`
 2. **HF shortcut** — is `romeerp/parameter-golf-caseops-v1` on HuggingFace byte-compatible with what `prepare_caseops_data.py` produces? If yes, Phase 1 can be replaced with a ~20 GB download + schema check (saves ~2 hours). Quick way to test: download one val shard + its byte sidecar from HF, compare byte-for-byte against local prep output on a sample doc.
 3. **flash-attn-3 install** — is the wheel at `https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/` still reachable from the pod's region? Preflight step per #1736 README: `pip install flash_attn_3 --no-deps --find-links <wheel-url>`. If unreachable, fallback?
 4. **Smoke override mechanism** — does `train_gpt.py` accept an `ITERATIONS` / `MAX_STEPS` env var, or a `DISABLE_EVAL` flag? Execution should grep the script for the iteration count constant and pick the cleanest override path. If no clean override exists, we can instead just run the full training and abort after ~2 min of logging — the smoke goal is the first 50 steps of log output.
-5. **Phase 3 seed-42 early-gate** — if seed-42 misses (>0.003 off 1.06610), do we halt before spawning seeds 0 and 1234? Suggest yes — saves ~$20 on a likely-miss. Execution to implement a simple log-grep gate between seeds.
+5. **Pre-GPTQ hook location** — execution should grep `train_gpt.py` for where GPTQ is invoked (likely a function call on the FP model) and inject the `torch.save(...)` one line before, gated on `SAVE_PRE_GPTQ`. Verify the saved state_dict loads correctly before declaring Phase 3 a pass (simple `torch.load(...)` check on the pod).
 
 ## What this spec does NOT do
 
-- Does not modify #1736's `train_gpt.py` or bundle our checkpoint logic into it.
-- Does not attempt to beat 1.0655. Success is *reproduce*.
-- Does not save intermediate checkpoints.
-- Does not run a full 2×H100 mini (40 min, ~$3) — only a ~10 min smoke. Full screening confidence comes from Phase 3.
+- Does not attempt to beat 1.06610. Success is *reproduce*.
+- Does not run 3 seeds — seed 42 only. Multi-seed confirmation is deferred to a potential final leaderboard spec.
+- Does not save intermediate / phase-boundary checkpoints. One pre-GPTQ checkpoint only.
+- Does not run a full 2×H100 mini (40 min, ~$3) — only a ~10 min smoke.
+- Does not modify `train_gpt.py` beyond the env-var-gated pre-GPTQ checkpoint save.
 - Does not test other unmerged PRs (#1735, #1738, #1667, #1729, #1695) as alternative bases. Those belong in specs 009+.
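
On question 5's load check: a minimal verification sketch, assuming the checkpoint is a plain state_dict as saved by the injected hook (tensor names and dtypes depend on `train_gpt.py`'s model):

```python
# Hedged sanity check: confirm pre_gptq.pt loads and holds unquantized FP tensors.
import torch

CKPT = "/workspace/runs/008-1736-reproduction/seed_42/pre_gptq.pt"

sd = torch.load(CKPT, map_location="cpu")
assert isinstance(sd, dict) and len(sd) > 0, "empty or non-dict checkpoint"
for name, t in sd.items():
    if t.is_floating_point():
        # FP16/BF16/FP32 expected pre-GPTQ; anything else suggests a bad hook site.
        assert t.dtype in (torch.float16, torch.bfloat16, torch.float32), (name, t.dtype)
n_params = sum(t.numel() for t in sd.values())
print(f"OK: {len(sd)} tensors, {n_params / 1e6:.1f}M parameters")
```
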
