specs 060B-060G: stack on top of 060A baseline

leon2k2k2k · leon2k2k2k · commit b3900bea7c7b · 2026-04-29T04:02:14.000+08:00
Six follow-on specs to spec 060A (openai#1855 port):

- 060B: SDClip ATTN tightening (config-only, eval via RESUME_FROM_CKPT)
- 060C: 046L deploy-time quant repair (~150 lines code port from exp/046-quant-repair @ fcb816f); eval-side, free
- 060D: 046G-tighter SDClip (config-only, fits within openai#1855 lrzip headroom)
- 060E: full stack (060B + 060C combined)
- 060F: LQER bumps (RANK=5, TOP_K=4, ASYM_GROUP=32; config-only)
- 060G: Partial SpinQuant from PR openai#1898 (~100 lines code port)

Plus tmp_exec/launch_060_eval.sh: shared eval-only launcher for RESUME_FROM_CKPT mode, used by 060B/D/E/F. Loads 060A's final_model.pt, re-quantizes + re-evals with overridden env vars. ~-3 per arm vs ~ for full retrain.

All specs reference 060A's checkpoint at runs/060A-1855-port/seed_42/final_model.pt as their hotstart.
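The eval-only pattern described above (same checkpoint, same launcher, different env-var overrides per arm) can be illustrated with a short Python sketch. This is not the content of `tmp_exec/launch_060_eval.sh`, which is not shown here; the helper name `eval_command` is hypothetical, and the checkpoint path is the absolute form used in the spec files.

```python
import shlex

# Paths as written in the specs; the helper itself is a hypothetical illustration.
CKPT = "/workspace/runs/060A-1855-port/seed_42/final_model.pt"
LAUNCHER = "tmp_exec/launch_060_eval.sh"


def eval_command(run_label, overrides, seed=42):
    """Render the shell command for one eval-only arm.

    `overrides` maps env-var names to string values, e.g. {"LQER_RANK": "5"}.
    """
    env = {"SEED": str(seed), "RUN_LABEL": run_label, "RESUME_FROM_CKPT": CKPT}
    env.update(overrides)
    assignments = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    return f"{assignments} bash {shlex.quote(LAUNCHER)}"


if __name__ == "__main__":
    print(eval_command("seed_42_F1", {"LQER_RANK": "5"}))
```

Rendering the command from a dict keeps every arm's overrides in one place, which makes it harder to run two arms with accidentally different checkpoints or seeds.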
diff --git a/research/specs/060B-sdclip-tight.md b/research/specs/060B-sdclip-tight.md
@@ -0,0 +1,70 @@
+# Spec 060B — 046B-tight SDClip on 060A baseline (eval-only via RESUME_FROM_CKPT)
+
+**Date:** 2026-04-29
+**Branch:** `research` (config-only; no code change)
+**Parent:** 060A (`final_model.pt` from `runs/060A-1855-port/seed_42/`)
+
+## Hypothesis
+
+Per spec 046's measurements, tightening the MLP/ATTN/EMBED clip sigmas reduces quantization noise and gains **−0.00086 BPB** (046B-tight on the prior baseline). The original 046B-tight exceeded the 16 MB artifact cap by 184 KB. With #1855's lrzip+brotli compressor (≈ 280 KB savings), the same SDClip tightening should now fit within 16 MB with roughly 96 KB to spare, assuming the savings transfer.
+
+## Baseline
+
+060A (1-seed): val_bpb expected in [1.0595, 1.0625]. Use exact measured number once 060A completes.
+
+## Expected Δ
+
+**−0.00086 BPB** vs 060A (research-measured on 045-armD base; expected to transfer with similar magnitude on #1855 base since LQER + GPTQ stack is identical).
+
+## Accept criteria
+
+- post-quant + post-TTT val_bpb ≤ (060A val_bpb − 0.0003)
+- artifact size ≤ 15,990,000 bytes (≥10 KB margin)
+- no GPTQ failure or NaN
+
+## Config diff vs 060A
+
+```
+MLP_CLIP_SIGMAS:    11.5         (1855 default; UNCHANGED, already the 046B-tight value)
+ATTN_CLIP_SIGMAS:   13.0 → 12.5  (the only override in this spec)
+EMBED_CLIP_SIGMAS:  14.0         (1855 default; UNCHANGED, already reduced from 15.0 in 1855; 046B-tight used 14.5)
+```
+
+**Important note.** #1855 already sets `MLP_CLIP_SIGMAS=11.5` and `EMBED_CLIP_SIGMAS=14.0` (≈ 046B-tight values), so 060B's only increment over 060A is `ATTN_CLIP_SIGMAS=12.5` (was 13.0). The big SDClip win is already baked into 060A, so the −0.00086 BPB figure is likely an upper bound on what this run can show.
+
+## Code changes
+
+None. Pure env-var override + RESUME_FROM_CKPT path (already in our 046 lineage at commit `0ea6a97`).
+
+## Hardware ladder
+
+- 4×H100, eval-only mode (`RESUME_FROM_CKPT` set to the 060A checkpoint path)
+- ~5-7 min wall, ~$1-2 cost
+
+## Seed plan
+
+1 seed: 42 (matches 060A).
+
+## Inputs
+
+- Hotstart: `/workspace/runs/060A-1855-port/seed_42/final_model.pt`
+- Train data: same as 060A
+- Tokenizer: same
+
+## Stop-early criteria
+
+- Artifact > 16,000,000 bytes → fail (compression assumption broken; investigate)
+- post-quant val_bpb > 1.075 → kill (something broke)
+
+## Cost estimate
+
+~$2, ~10 min wall.
+
+## Run command
+
+```bash
+SEED=42 RUN_LABEL=seed_42 \
+  RESUME_FROM_CKPT=/workspace/runs/060A-1855-port/seed_42/final_model.pt \
+  ATTN_CLIP_SIGMAS=12.5 \
+  bash tmp_exec/launch_060_eval.sh
+```
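The accept and stop-early rules of spec 060B can be written as a small check. This is a minimal sketch, not part of the commit: the function name `judge_060b` and its arguments are hypothetical, and the 060A baseline must be filled in from the real run once it completes. The thresholds are the ones stated in the spec.

```python
# Thresholds copied from spec 060B; everything else is a hypothetical illustration.
ARTIFACT_ACCEPT_BYTES = 15_990_000  # accept: artifact size <= this (>= 10 KB margin)
ARTIFACT_CAP_BYTES = 16_000_000     # stop-early: artifact above this fails the run
MIN_GAIN_BPB = 0.0003               # accept: post-TTT val_bpb <= baseline - this
KILL_BPB = 1.075                    # stop-early: post-quant val_bpb above this


def judge_060b(post_quant_bpb, post_ttt_bpb, artifact_bytes, baseline_bpb,
               had_failure=False):
    """Classify one run as 'fail', 'kill', 'accept', or 'reject'.

    `had_failure` stands for a GPTQ failure or NaN during the run.
    """
    if artifact_bytes > ARTIFACT_CAP_BYTES:
        return "fail"    # compression assumption broken; investigate
    if had_failure or post_quant_bpb > KILL_BPB:
        return "kill"    # something broke
    fits = artifact_bytes <= ARTIFACT_ACCEPT_BYTES
    wins = post_ttt_bpb <= baseline_bpb - MIN_GAIN_BPB
    return "accept" if (fits and wins) else "reject"
```

Note that a run can land between 15,990,000 and 16,000,000 bytes: it does not trip the stop-early rule, but it also misses the accept margin, so the sketch reports it as `reject`.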
diff --git a/research/specs/060C-deploy-time-quant-repair.md b/research/specs/060C-deploy-time-quant-repair.md
@@ -0,0 +1,79 @@
+# Spec 060C — 046L deploy-time quant repair on 060A baseline
+
+**Date:** 2026-04-29
+**Branch:** `exp/060C-deploy-repair` (forked from research, code from `exp/046-quant-repair` @ `fcb816f`)
+**Parent:** 060A checkpoint + 046L code (already exists, just needs cherry-pick into 060A's train_gpt.py).
+
+## Hypothesis
+
+046L's deploy-time quant repair fits the passthrough fp16 parameters at eval time, using calibration data that the model generates autoregressively. It adds zero bytes to the artifact (it uses a spare 100-180 s of the eval budget), so the 16 MB cap does not constrain it. On 045-armD it was specced but never fully measured; the predicted gain is roughly −0.001 to −0.005 BPB if it acts even partially like TTT.
+
+## Baseline
+
+060A.
+
+## Expected Δ
+
+**−0.001 to −0.005 BPB**, low confidence (was never cleanly validated end-to-end, only specced).
+
+## Accept criteria
+
+- post-quant + post-TTT val_bpb ≤ (060A − 0.0005)
+- eval_time still ≤ 600s (deploy-time repair runs in ~60s; should fit)
+- no NaN, no instability
+
+## Config diff vs 060A
+
+```
+DEPLOY_TIME_REPAIR_ENABLED=1
+DEPLOY_TIME_REPAIR_BATCHES=8
+DEPLOY_TIME_REPAIR_SEQ_LEN=512
+DEPLOY_TIME_REPAIR_LR=1e-3
+DEPLOY_TIME_REPAIR_ITERS=5
+```
+
+## Code changes
+
+Cherry-pick three blocks from `exp/046-quant-repair` @ `fcb816f` into 060A's `train_gpt.py`:
+
+1. `fit_passthrough_to_self_consistency()` — the passthrough param fit function (~50 lines).
+2. `ARSelfGenCalibLoader` class — generates AR samples without val data leak (~30 lines).
+3. Hook in `train_and_eval()` after `deserialize()` (~10 lines):
+   ```python
+   eval_model = deserialize(h, device)
+   if h.num_loops > 0:
+       eval_model.looping_active = True
+   if h.deploy_time_repair_enabled:
+       repair_calib = generate_ar_calib(eval_model, h, n_batches=h.deploy_time_repair_batches, seq_len=h.deploy_time_repair_seq_len)
+       fit_passthrough_to_self_consistency(eval_model, repair_calib, h)
+   ```
+
+Plus 5 new env-var-driven Hyperparameters fields.
+
+## Hardware ladder
+
+- 4×H100, RESUME_FROM_CKPT mode (no re-train; load 060A's pt, repair, eval).
+- ~10-15 min wall, ~$3 cost.
+
+## Seed plan
+
+1 seed: 42.
+
+## Inputs
+
+- Hotstart: `/workspace/runs/060A-1855-port/seed_42/final_model.pt`
+- AR calib generated at eval time (no external data needed beyond what's in the model).
+
+## Stop-early criteria
+
+- Repair fit diverges (loss > initial × 2 after 3 iters) → skip repair, run vanilla eval
+- Post-repair val_bpb > 1.075 → kill (repair broke the model)
+
+## Cost estimate
+
+~$3, ~15 min wall.
+
+## Open questions
+
+1. Cherry-pick onto 060A's #1855-derived `train_gpt.py`: verify that the LR schedule and LoRA-A path don't conflict with the 046L hooks.
+2. AR-calib generator may need adjustment for #1855's slightly different forward signature.
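The five env-var-driven settings and the divergence rule of spec 060C can be sketched as follows. This is an illustration under assumptions, not the code from `exp/046-quant-repair`: the class name `DeployRepairConfig` and the function `repair_diverged` are hypothetical, the fallback values simply reuse the numbers in the config diff (with the repair assumed off when the flag is unset), and "after 3 iters" is read as "the loss recorded after the third repair iteration".

```python
import os
from dataclasses import dataclass


@dataclass
class DeployRepairConfig:
    """Hypothetical container for the DEPLOY_TIME_REPAIR_* settings."""
    enabled: bool = False   # assumption: off unless explicitly enabled
    batches: int = 8
    seq_len: int = 512
    lr: float = 1e-3
    iters: int = 5

    @classmethod
    def from_env(cls, env=None):
        """Read settings from a mapping (defaults to os.environ)."""
        env = os.environ if env is None else env
        return cls(
            enabled=env.get("DEPLOY_TIME_REPAIR_ENABLED", "0") == "1",
            batches=int(env.get("DEPLOY_TIME_REPAIR_BATCHES", "8")),
            seq_len=int(env.get("DEPLOY_TIME_REPAIR_SEQ_LEN", "512")),
            lr=float(env.get("DEPLOY_TIME_REPAIR_LR", "1e-3")),
            iters=int(env.get("DEPLOY_TIME_REPAIR_ITERS", "5")),
        )


def repair_diverged(losses, check_after=3, factor=2.0):
    """Return True if the fit should be abandoned in favour of vanilla eval.

    losses[0] is the loss before any repair step; losses[i] is the loss
    after i steps. With fewer than `check_after` steps recorded, no
    verdict is given and the function returns False.
    """
    if len(losses) <= check_after:
        return False
    return losses[check_after] > factor * losses[0]
```

Keeping the divergence test separate from the fit loop makes it easy to fall back to vanilla eval, which is what the stop-early criteria ask for.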
diff --git a/research/specs/060D-sdclip-tighter.md b/research/specs/060D-sdclip-tighter.md
@@ -0,0 +1,73 @@
+# Spec 060D — 046G-tighter SDClip on 060A baseline (eval-only via RESUME_FROM_CKPT)
+
+**Date:** 2026-04-29
+**Branch:** `research` (config-only; no code change)
+**Parent:** 060A `final_model.pt` + 060B if 060B passes (suggests SDClip direction holds).
+
+## Hypothesis
+
+046G measured **−0.00146 BPB** by tightening SDClip another step (each clip −1.0σ further). 060B applies one step (ATTN 13.0→12.5); 060D applies the next step (every clip −1.0σ from the 060A defaults). All 046G arms were illegal (+428 KB over the cap on 045-armD). #1855's lrzip compressor frees only ~280 KB, so the aggressive arm would likely still be ~148 KB over. If 060B fits with margin, we can probably afford to tighten only ATTN+EMBED, not MLP.
+
+## Baseline
+
+060A (or 060B if 060B passes).
+
+## Expected Δ
+
+**−0.0015 BPB** vs 060A if all three clips tighten; **−0.0010 BPB** if only two (artifact-fit constraint).
+
+## Accept criteria
+
+- post-quant + post-TTT val_bpb ≤ (060A − 0.0010)
+- artifact size ≤ 16,000,000 bytes (hard cap; must fit)
+- no GPTQ failure or NaN
+
+## Config diff vs 060A
+
+Two arms (run sequentially; pick whichever fits + wins):
+
+**Arm D-aggressive** (each clip −1.0σ further):
+```
+MLP_CLIP_SIGMAS:    11.5 → 10.5
+ATTN_CLIP_SIGMAS:   13.0 → 12.0
+EMBED_CLIP_SIGMAS:  14.0 → 13.0
+```
+
+**Arm D-conservative** (only ATTN+EMBED):
+```
+ATTN_CLIP_SIGMAS:   13.0 → 12.0
+EMBED_CLIP_SIGMAS:  14.0 → 13.0
+(MLP_CLIP_SIGMAS unchanged at 11.5)
+```
+
+## Code changes
+
+None. Pure env-var override + RESUME_FROM_CKPT.
+
+## Hardware ladder
+
+- 4×H100, eval-only mode (RESUME_FROM_CKPT)
+- ~5-7 min wall per arm, ~$1-2 each
+
+## Seed plan
+
+1 seed (42) per arm.
+
+## Stop-early criteria
+
+- Artifact > 16 MB → fail this arm; fall back to next arm
+- post-quant val_bpb > 1.080 → kill
+
+## Cost estimate
+
+~$3-4 for both arms.
+
+## Run command
+
+```bash
+# Try aggressive first; if oversize, fall back to conservative.
+SEED=42 RUN_LABEL=seed_42_D_agg \
+  RESUME_FROM_CKPT=/workspace/runs/060A-1855-port/seed_42/final_model.pt \
+  MLP_CLIP_SIGMAS=10.5 ATTN_CLIP_SIGMAS=12.0 EMBED_CLIP_SIGMAS=13.0 \
+  bash tmp_exec/launch_060_eval.sh
+```
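The size reasoning and the arm fallback in spec 060D can be made explicit with a short sketch. The overage and savings figures are the ones quoted in the specs. The helper names are hypothetical, and the projection is only rough, since the overages were measured on a different base than 060A.

```python
# Figures quoted in specs 060B and 060D (all approximate).
OVERAGE_046G_KB = 428        # 046G arms on 045-armD: over the cap by this much
OVERAGE_046B_TIGHT_KB = 184  # 046B-tight on the prior baseline
LRZIP_SAVINGS_KB = 280       # approximate savings from #1855's compressor

# Arms in the order the run command tries them.
ARMS_060D = [
    ("D-aggressive", {"MLP_CLIP_SIGMAS": "10.5",
                      "ATTN_CLIP_SIGMAS": "12.0",
                      "EMBED_CLIP_SIGMAS": "13.0"}),
    ("D-conservative", {"ATTN_CLIP_SIGMAS": "12.0",
                        "EMBED_CLIP_SIGMAS": "13.0"}),
]


def projected_overage_kb(old_overage_kb, savings_kb=LRZIP_SAVINGS_KB):
    """Positive result: still over the cap. Negative result: headroom."""
    return old_overage_kb - savings_kb


def pick_arm(artifact_bytes_by_arm, cap_bytes=16_000_000):
    """Return the first arm (aggressive first) whose artifact fits, else None."""
    for name, _overrides in ARMS_060D:
        size = artifact_bytes_by_arm.get(name)
        if size is not None and size <= cap_bytes:
            return name
    return None
```

Under these numbers the aggressive arm projects to about 148 KB over the cap and 046B-tight to about 96 KB under it, which is why the conservative arm exists as a fallback. The sketch only encodes the size fallback; choosing between two arms that both fit still requires comparing their val_bpb.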
diff --git a/research/specs/060E-full-stack.md b/research/specs/060E-full-stack.md
@@ -0,0 +1,62 @@
+# Spec 060E — Full quant-repair stack on 060A (eval-only)
+
+**Date:** 2026-04-29
+**Branch:** `exp/060C-deploy-repair` (uses 060C's code port)
+**Parent:** 060A pt + 060C code; depends on 060B and 060C results being individually positive.
+
+## Hypothesis
+
+Combine the two levers, provided each validates on its own:
+- 060B's ATTN clip tightening (expected −0.0008 BPB)
+- 060C's deploy-time quant repair (expected −0.001 to −0.005 BPB)
+
+Expected combined gain: **−0.002 to −0.005 BPB**, assuming additivity (they target different error sources: SDClip reduces quant noise at calibration; deploy-time repair recovers passthrough fp16 params at eval).
+
+## Baseline
+
+060A.
+
+## Expected Δ
+
+**−0.002 to −0.005 BPB** vs 060A. Lower confidence on additivity — they could partially overlap.
+
+## Accept criteria
+
+- val_bpb ≤ min(060B, 060C) − 0.0005 (better than either alone)
+- artifact ≤ 16 MB
+- no instability
+
+## Config diff vs 060A
+
+```
+ATTN_CLIP_SIGMAS=12.5            (from 060B)
+DEPLOY_TIME_REPAIR_ENABLED=1     (from 060C)
+DEPLOY_TIME_REPAIR_BATCHES=8
+DEPLOY_TIME_REPAIR_SEQ_LEN=512
+DEPLOY_TIME_REPAIR_LR=1e-3
+DEPLOY_TIME_REPAIR_ITERS=5
+```
+
+## Code changes
+
+Same as 060C (deploy-time repair port). 060B is config-only on top.
+
+## Hardware ladder
+
+- 4×H100, eval-only via RESUME_FROM_CKPT
+- ~15 min wall, ~$3
+
+## Stop-early criteria
+
+- Either component alone (B or C) underperforms 060A → stop family; pivot to single-lever approach
+- Combined val_bpb ≥ 060A → no synergy, treat as null and ship best-of-{B,C}
+
+## Run command
+
+```bash
+SEED=42 RUN_LABEL=seed_42_full \
+  RESUME_FROM_CKPT=/workspace/runs/060A-1855-port/seed_42/final_model.pt \
+  ATTN_CLIP_SIGMAS=12.5 \
+  DEPLOY_TIME_REPAIR_ENABLED=1 DEPLOY_TIME_REPAIR_ITERS=5 \
+  bash tmp_exec/launch_060_eval.sh
+```
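The additivity assumption and the decision rules of spec 060E amount to a few lines of arithmetic. A minimal sketch with hypothetical function names follows; "underperforms 060A" is read here as a strictly higher val_bpb than 060A, which is an assumption.

```python
def combined_range(delta_b, delta_c_range):
    """Naive additive estimate: add 060B's delta to both ends of 060C's range."""
    small, large = delta_c_range
    return (delta_b + small, delta_b + large)


def accept_060e(val_e, val_b, val_c, margin=0.0005):
    """Accept only if the stack beats the better single lever by `margin`."""
    return val_e <= min(val_b, val_c) - margin


def stop_family(val_a, val_b, val_c):
    """Stop the combined run if either lever alone is worse than 060A."""
    return val_b > val_a or val_c > val_a


def is_null(val_e, val_a):
    """Combined result no better than 060A: ship the best of B and C instead."""
    return val_e >= val_a


if __name__ == "__main__":
    small, large = combined_range(-0.0008, (-0.001, -0.005))
    print(f"additive estimate: {small:.4f} to {large:.4f} BPB")
```

The naive sum spans −0.0018 to −0.0058 BPB, slightly wider than the spec's stated −0.002 to −0.005, which is consistent with the spec's caveat that the two levers may partially overlap.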
diff --git a/research/specs/060F-lqer-bumps.md b/research/specs/060F-lqer-bumps.md
@@ -0,0 +1,75 @@
+# Spec 060F — LQER capacity bumps on 060A baseline (eval-only)
+
+**Date:** 2026-04-29
+**Branch:** `research` (config-only)
+**Parent:** 060A `final_model.pt`.
+
+## Hypothesis
+
+046D measured LQER `RANK=6` and `TOP_K=5/8` as null on 045-armD. However, 046D ran without lrzip headroom and on a different base. With #1855's lrzip compressor giving ~280 KB of headroom, we can afford bigger LQER tensors. Per 046K's rank-adaptive analysis, the marginal residual on each additional rank is small, but it might compound with other 060 levers.
+
+## Baseline
+
+060A.
+
+## Expected Δ
+
+**−0.001 to −0.003 BPB**, low confidence (046D measured null on prior base).
+
+## Accept criteria
+
+- val_bpb ≤ (060A − 0.0005)
+- artifact ≤ 16 MB
+
+## Config diff vs 060A
+
+Three arms (sequential, pick best):
+
+**Arm F1 — RANK bump:**
+```
+LQER_RANK: 4 → 5     (cost ~50 KB)
+LQER_TOP_K: 3        (unchanged)
+```
+
+**Arm F2 — TOP_K bump:**
+```
+LQER_RANK: 4         (unchanged)
+LQER_TOP_K: 3 → 4    (cost ~30-50 KB)
+```
+
+**Arm F3 — finer asym groups:**
+```
+LQER_ASYM_GROUP: 64 → 32   (cost ~50-80 KB; finer per-group quant scales)
+```
+
+## Code changes
+
+None. Env-var override + RESUME_FROM_CKPT.
+
+## Hardware ladder
+
+- 4×H100, eval-only.
+- ~5-7 min wall per arm, ~$1-2 each.
+
+## Seed plan
+
+1 seed (42) per arm.
+
+## Cost estimate
+
+~$5-6 total for all three arms.
+
+## Stop-early criteria
+
+- Artifact > 16 MB → fail this arm
+- 046D showed null on RANK=6; if F1 (RANK=5) is null, skip F2/F3
+
+## Run command (per arm)
+
+```bash
+# Arm F1
+SEED=42 RUN_LABEL=seed_42_F1 \
+  RESUME_FROM_CKPT=/workspace/runs/060A-1855-port/seed_42/final_model.pt \
+  LQER_RANK=5 \
+  bash tmp_exec/launch_060_eval.sh
+```
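The three 060F arms, their estimated byte costs, and the skip rule can be collected in one place. This is a sketch under assumptions: the names `ARMS_060F`, `worst_case_cost_kb`, and `arms_after_f1` are hypothetical, and because the spec does not define "null" numerically, the sketch borrows the 0.0005 BPB accept margin as the threshold.

```python
# Env overrides and (low, high) size-cost estimates in KB, from spec 060F.
ARMS_060F = {
    "F1": {"env": {"LQER_RANK": "5"}, "cost_kb": (50, 50)},
    "F2": {"env": {"LQER_TOP_K": "4"}, "cost_kb": (30, 50)},
    "F3": {"env": {"LQER_ASYM_GROUP": "32"}, "cost_kb": (50, 80)},
}
HEADROOM_KB = 280  # approximate headroom attributed to #1855's lrzip compressor


def worst_case_cost_kb(arm_names):
    """Sum the upper cost estimates of the given arms."""
    return sum(ARMS_060F[name]["cost_kb"][1] for name in arm_names)


def arms_after_f1(f1_gain_bpb, null_threshold=0.0005):
    """Arms still worth running once F1 has finished.

    `f1_gain_bpb` is 060A's val_bpb minus F1's val_bpb (positive = better).
    A gain below the threshold counts as null, so F2 and F3 are skipped.
    """
    if f1_gain_bpb < null_threshold:
        return []
    return ["F2", "F3"]


if __name__ == "__main__":
    print(worst_case_cost_kb(["F1", "F2", "F3"]), "KB worst case if stacked")
```

Even if all three bumps were stacked, the worst-case estimate (180 KB) stays under the ~280 KB headroom, although the spec runs the arms one at a time and the real artifact size still has to be measured per arm.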
diff --git a/research/specs/060G-partial-spinquant.md b/research/specs/060G-partial-spinquant.md
diff --git a/tmp_exec/launch_060_eval.sh b/tmp_exec/launch_060_eval.sh