
Commit b3900be

specs 060B-060G: stack on top of 060A baseline
Six follow-on specs to spec 060A (openai#1855 port):
- 060B: SDClip ATTN tightening (config-only, eval via RESUME_FROM_CKPT)
- 060C: 046L deploy-time quant repair (~150 lines code port from exp/046-quant-repair @ fcb816f); eval-side, free
- 060D: 046G-tighter SDClip (config-only, fits within openai#1855 lrzip headroom)
- 060E: full stack (060B + 060C combined)
- 060F: LQER bumps (RANK=5, TOP_K=4, ASYM_GROUP=32; config-only)
- 060G: Partial SpinQuant from PR openai#1898 (~100 lines code port)

Plus tmp_exec/launch_060_eval.sh: shared eval-only launcher for RESUME_FROM_CKPT mode, used by 060B/D/E/F. Loads 060A's final_model.pt, re-quantizes + re-evals with overridden env vars. ~-3 per arm vs ~ for full retrain.

All specs reference 060A's checkpoint at runs/060A-1855-port/seed_42/final_model.pt as their hotstart.
1 parent 786ed3a commit b3900be

7 files changed

Lines changed: 538 additions & 0 deletions


Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
1+
# Spec 060B — 046B-tight SDClip on 060A baseline (eval-only via RESUME_FROM_CKPT)
2+
3+
**Date:** 2026-04-29
4+
**Branch:** `research` (config-only; no code change)
5+
**Parent:** 060A (`final_model.pt` from `runs/060A-1855-port/seed_42/`)
6+
7+
## Hypothesis
8+
9+
Per spec 046's measurements, tightening MLP/ATTN/EMBED clip sigmas reduces quantization noise and gains **−0.00086 BPB** (046B-tight on prior baseline). The original 046B-tight ran into the 16 MB artifact cap (+184 KB over). With #1855's lrzip+brotli compressor (≈ −280 KB savings), the same SDClip tightening should now fit comfortably within 16 MB.
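The size argument in this hypothesis is plain arithmetic. A minimal sketch, assuming the two quoted figures (+184 KB over the cap, about 280 KB saved by the new compressor) carry over unchanged to the 060A checkpoint and that 1 KB means 1,000 bytes; the names below are illustrative and do not come from `train_gpt.py`.

```python
CAP_BYTES = 16_000_000   # 16 MB artifact cap quoted in the spec
KB = 1_000               # assumption: decimal kilobytes


def projected_artifact_bytes(overage_kb: float, savings_kb: float) -> int:
    """Project artifact size from a prior overage and an estimated compressor saving."""
    return int(CAP_BYTES + overage_kb * KB - savings_kb * KB)


# 046B-tight was +184 KB over the cap; lrzip+brotli is estimated to save ~280 KB.
size_060b = projected_artifact_bytes(overage_kb=184, savings_kb=280)
margin_060b = CAP_BYTES - size_060b

print(f"projected size: {size_060b:,} bytes, margin under cap: {margin_060b:,} bytes")
```

Under these assumptions the projected margin is roughly 96 KB, which would clear the 10 KB margin the accept criteria ask for; the real number has to come from the measured artifact.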
10+
11+
## Baseline
12+
13+
060A (1-seed): val_bpb expected in [1.0595, 1.0625]. Use exact measured number once 060A completes.
14+
15+
## Expected Δ
16+
17+
**−0.00086 BPB** vs 060A (research-measured on 045-armD base; expected to transfer with similar magnitude on #1855 base since LQER + GPTQ stack is identical).
18+
19+
## Accept criteria
20+
21+
- post-quant + post-TTT val_bpb ≤ (060A val_bpb − 0.0003)
22+
- artifact size ≤ 15,990,000 bytes (≥10 KB margin)
23+
- no GPTQ failure or NaN
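The three accept criteria can be checked mechanically once the run finishes. A minimal sketch, assuming the eval log yields a final `val_bpb` and an artifact size in bytes; `accept_060b` and the example numbers are hypothetical, and GPTQ failure is represented only by a NaN score here.

```python
import math


def accept_060b(val_bpb: float, baseline_bpb: float, artifact_bytes: int,
                min_gain: float = 0.0003, max_bytes: int = 15_990_000) -> bool:
    """Return True only if all 060B accept criteria hold."""
    if math.isnan(val_bpb):
        return False  # NaN counts as a failed run
    beats_baseline = val_bpb <= baseline_bpb - min_gain
    fits_cap = artifact_bytes <= max_bytes
    return beats_baseline and fits_cap


# Hypothetical numbers; the real 060A baseline is not yet measured.
print(accept_060b(val_bpb=1.0605, baseline_bpb=1.0610, artifact_bytes=15_900_000))
```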
24+
25+
## Config diff vs 060A
26+
27+
```
MLP_CLIP_SIGMAS: 11.5 (1855 default; UNCHANGED, already the 046B-tight value)
ATTN_CLIP_SIGMAS: 13.0 → 12.5
EMBED_CLIP_SIGMAS: 14.0 (1855 default; UNCHANGED, already reduced from 15.0 in 1855; 046B-tight used 14.5)
```
32+
33+
**Important note.** #1855 already sets `MLP_CLIP_SIGMAS=11.5` and `EMBED_CLIP_SIGMAS=14.0` (≈ 046B-tight values), so 060B's only incremental change over 060A is `ATTN_CLIP_SIGMAS=12.5` (was 13.0). Most of the SDClip win is already baked into 060A, so the ATTN-only gain may fall short of the full −0.00086 BPB measured for 046B-tight.
34+
35+
## Code changes
36+
37+
None. Pure env-var override + RESUME_FROM_CKPT path (already in our 046 lineage at commit `0ea6a97`).
38+
39+
## Hardware ladder
40+
41+
- 4×H100, eval-only mode (`RESUME_FROM_CKPT=<path to 060A final_model.pt>`)
42+
- ~5-7 min wall, ~$1-2 cost
43+
44+
## Seed plan
45+
46+
1 seed: 42 (matches 060A).
47+
48+
## Inputs
49+
50+
- Hotstart: `/workspace/runs/060A-1855-port/seed_42/final_model.pt`
51+
- Train data: same as 060A
52+
- Tokenizer: same
53+
54+
## Stop-early criteria
55+
56+
- Artifact > 16,000,000 bytes → fail (compression assumption broken; investigate)
57+
- post-quant val_bpb > 1.075 → kill (something broke)
58+
59+
## Cost estimate
60+
61+
~$2, ~10 min wall.
62+
63+
## Run command
64+
65+
```bash
# Eval-only: load 060A's checkpoint, re-quantize with the tighter ATTN clip, re-eval.
SEED=42 RUN_LABEL=seed_42 \
RESUME_FROM_CKPT=/workspace/runs/060A-1855-port/seed_42/final_model.pt \
ATTN_CLIP_SIGMAS=12.5 \
bash tmp_exec/launch_060_eval.sh
```
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
1+
# Spec 060C — 046L deploy-time quant repair on 060A baseline
2+
3+
**Date:** 2026-04-29
4+
**Branch:** `exp/060C-deploy-repair` (forked from research, code from `exp/046-quant-repair` @ `fcb816f`)
5+
**Parent:** 060A checkpoint + 046L code (already exists, just needs cherry-pick into 060A's train_gpt.py).
6+
7+
## Hypothesis
8+
9+
046L's deploy-time quant repair fits the passthrough fp16 parameters at eval time, using AR-self-generated calibration data. It adds zero bytes to the artifact (it uses a spare 100-180 s of eval budget), so the 16 MB cap does not constrain it. On 045-armD it was specced but never fully measured; the predicted gain is roughly −0.001 to −0.005 BPB if it acts even partially like TTT.
10+
11+
## Baseline
12+
13+
060A.
14+
15+
## Expected Δ
16+
17+
**−0.001 to −0.005 BPB**, low confidence (was never cleanly validated end-to-end, only specced).
18+
19+
## Accept criteria
20+
21+
- post-quant + post-TTT val_bpb ≤ (060A − 0.0005)
22+
- eval_time still ≤ 600s (deploy-time repair runs in ~60s; should fit)
23+
- no NaN, no instability
24+
25+
## Config diff vs 060A
26+
27+
```
DEPLOY_TIME_REPAIR_ENABLED=1
DEPLOY_TIME_REPAIR_BATCHES=8
DEPLOY_TIME_REPAIR_SEQ_LEN=512
DEPLOY_TIME_REPAIR_LR=1e-3
DEPLOY_TIME_REPAIR_ITERS=5
```
34+
35+
## Code changes
36+
37+
Cherry-pick three blocks from `exp/046-quant-repair` @ `fcb816f` into 060A's `train_gpt.py`:
38+
39+
1. `fit_passthrough_to_self_consistency()` — the passthrough param fit function (~50 lines).
40+
2. `ARSelfGenCalibLoader` class — generates AR samples without val data leak (~30 lines).
41+
3. Hook in `train_and_eval()` after `deserialize()` (~10 lines):
42+
```python
eval_model = deserialize(h, device)
if h.num_loops > 0:
    eval_model.looping_active = True
# Optional deploy-time repair, fitted on AR-self-generated calibration data.
if h.deploy_time_repair_enabled:
    repair_calib = generate_ar_calib(
        eval_model, h,
        n_batches=h.deploy_time_repair_batches,
        seq_len=h.deploy_time_repair_seq_len,
    )
    fit_passthrough_to_self_consistency(eval_model, repair_calib, h)
```
50+
51+
Plus 5 new env-var-driven Hyperparameters fields.
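The five fields could be read from the environment along these lines. This is a sketch under stated assumptions: the real `Hyperparameters` class in `train_gpt.py` is not shown in this spec, so `DeployRepairConfig`, its attribute names, and its defaults (repair off unless enabled, other defaults equal to the values in the config diff) are all hypothetical.

```python
import os
from dataclasses import dataclass, field


def _env(name: str, default: str) -> str:
    """Read an environment variable, falling back to a string default."""
    return os.environ.get(name, default)


@dataclass
class DeployRepairConfig:
    # Hypothetical container; defaults mirror the 060C config diff, with repair off by default.
    enabled: bool = field(default_factory=lambda: bool(int(_env("DEPLOY_TIME_REPAIR_ENABLED", "0"))))
    batches: int = field(default_factory=lambda: int(_env("DEPLOY_TIME_REPAIR_BATCHES", "8")))
    seq_len: int = field(default_factory=lambda: int(_env("DEPLOY_TIME_REPAIR_SEQ_LEN", "512")))
    lr: float = field(default_factory=lambda: float(_env("DEPLOY_TIME_REPAIR_LR", "1e-3")))
    iters: int = field(default_factory=lambda: int(_env("DEPLOY_TIME_REPAIR_ITERS", "5")))


if __name__ == "__main__":
    print(DeployRepairConfig())
```

Using `default_factory` means the environment is read when the config object is created rather than at import time, which keeps per-arm overrides from the launcher effective.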
52+
53+
## Hardware ladder
54+
55+
- 4×H100, RESUME_FROM_CKPT mode (no re-train; load 060A's pt, repair, eval).
56+
- ~10-15 min wall, ~$3 cost.
57+
58+
## Seed plan
59+
60+
1 seed: 42.
61+
62+
## Inputs
63+
64+
- Hotstart: `/workspace/runs/060A-1855-port/seed_42/final_model.pt`
65+
- AR calib generated at eval time (no external data needed beyond what's in the model).
66+
67+
## Stop-early criteria
68+
69+
- Repair fit diverges (loss > initial × 2 after 3 iters) → skip repair, run vanilla eval
70+
- Post-repair val_bpb > 1.075 → kill (repair broke the model)
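The first stop-early rule can be expressed as a small guard on the repair loss history. A minimal sketch, assuming `losses[0]` is the loss before any repair step and `losses[k]` is the loss after `k` iterations; `repair_diverged` and `choose_eval_path` are hypothetical helpers, not part of the 046L code.

```python
from typing import Sequence


def repair_diverged(losses: Sequence[float], check_after: int = 3, factor: float = 2.0) -> bool:
    """True if the loss after `check_after` iterations exceeds `factor` x the initial loss."""
    if len(losses) <= check_after:
        return False  # not enough iterations recorded to apply the rule
    return losses[check_after] > factor * losses[0]


def choose_eval_path(losses: Sequence[float]) -> str:
    """Fall back to vanilla eval when the repair fit diverges."""
    return "vanilla" if repair_diverged(losses) else "repaired"


print(choose_eval_path([1.0, 0.9, 0.85, 0.8]))
```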
71+
72+
## Cost estimate
73+
74+
~$3, ~15 min wall.
75+
76+
## Open questions
77+
78+
1. Cherry-pick onto 060A's #1855-derived `train_gpt.py`: verify that the LR schedule and LoRA-A path don't conflict with the 046L hooks.
79+
2. AR-calib generator may need adjustment for #1855's slightly different forward signature.
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
1+
# Spec 060D — 046G-tighter SDClip on 060A baseline (eval-only via RESUME_FROM_CKPT)
2+
3+
**Date:** 2026-04-29
4+
**Branch:** `research` (config-only; no code change)
5+
**Parent:** 060A `final_model.pt` + 060B if 060B passes (suggests SDClip direction holds).
6+
7+
## Hypothesis
8+
9+
046G measured **−0.00146 BPB** by tightening SDClip another step (each clip −1.0σ further). 060B applies one step (ATTN 13.0→12.5); 060D applies the next step (everything −1.0σ from 060A defaults). 046G arms were ALL ILLEGAL (+428 KB over cap on 045-armD); with #1855's lrzip headroom we have ~280 KB of budget — likely still over. If 060B fits with margin, we can probably afford only ATTN+EMBED tighter, not MLP.
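The "likely still over" conclusion follows from the two quoted numbers. A short sketch, assuming the +428 KB overage measured on 045-armD and the ~280 KB lrzip saving both transfer unchanged to the 060A base and that 1 KB means 1,000 bytes; the variable names are illustrative.

```python
CAP_BYTES = 16_000_000
KB = 1_000  # assumption: decimal kilobytes

overage_046g_kb = 428    # all-clips-tighter arm on 045-armD
lrzip_savings_kb = 280   # estimated saving from the #1855 compressor

projected_bytes = CAP_BYTES + (overage_046g_kb - lrzip_savings_kb) * KB
still_over_kb = (projected_bytes - CAP_BYTES) / KB

print(f"aggressive arm projected at {projected_bytes:,} bytes ({still_over_kb:.0f} KB over)")
```

A projected shortfall of about 148 KB is why the spec keeps a conservative arm that leaves the MLP clip alone.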
10+
11+
## Baseline
12+
13+
060A (or 060B if 060B passes).
14+
15+
## Expected Δ
16+
17+
**−0.0015 BPB** vs 060A if all three clips tighten; **−0.0010 BPB** if only two (artifact-fit constraint).
18+
19+
## Accept criteria
20+
21+
- post-quant + post-TTT val_bpb ≤ (060A − 0.0010)
22+
- artifact size ≤ 16,000,000 bytes (HARD cap; must fit)
23+
- no GPTQ failure or NaN
24+
25+
## Config diff vs 060A
26+
27+
Two arms (run sequentially; pick whichever fits + wins):
28+
29+
**Arm D-aggressive** (each clip −1.0σ further):
30+
```
MLP_CLIP_SIGMAS: 11.5 → 10.5
ATTN_CLIP_SIGMAS: 13.0 → 12.0
EMBED_CLIP_SIGMAS: 14.0 → 13.0
```
35+
36+
**Arm D-conservative** (only ATTN+EMBED):
37+
```
ATTN_CLIP_SIGMAS: 13.0 → 12.0
EMBED_CLIP_SIGMAS: 14.0 → 13.0
(MLP_CLIP_SIGMAS unchanged at 11.5)
```
42+
43+
## Code changes
44+
45+
None. Pure env-var override + RESUME_FROM_CKPT.
46+
47+
## Hardware ladder
48+
49+
- 4×H100, eval-only mode (RESUME_FROM_CKPT)
50+
- ~5-7 min wall per arm, ~$1-2 each
51+
52+
## Seed plan
53+
54+
1 seed (42) per arm.
55+
56+
## Stop-early criteria
57+
58+
- Artifact > 16 MB → fail this arm; fall back to next arm
59+
- post-quant val_bpb > 1.080 → kill
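The aggressive-then-conservative fallback can be sketched as a loop over the two arms. This assumes a hypothetical `run_arm(env)` callable that wraps the eval launcher and returns `(artifact_bytes, val_bpb)`; treating the 1.080 kill threshold as "stop and return nothing" is one reading of the rule above, not something the spec spells out.

```python
from typing import Callable, Dict, Optional, Tuple

CAP_BYTES = 16_000_000

# Env-var overrides per arm, copied from the config diff in this spec.
ARMS = [
    ("D-aggressive", {"MLP_CLIP_SIGMAS": "10.5", "ATTN_CLIP_SIGMAS": "12.0", "EMBED_CLIP_SIGMAS": "13.0"}),
    ("D-conservative", {"ATTN_CLIP_SIGMAS": "12.0", "EMBED_CLIP_SIGMAS": "13.0"}),
]


def pick_arm(run_arm: Callable[[Dict[str, str]], Tuple[int, float]],
             kill_bpb: float = 1.080) -> Optional[Tuple[str, float]]:
    """Return (arm name, val_bpb) for the first arm that fits the cap, or None."""
    for name, env in ARMS:
        artifact_bytes, val_bpb = run_arm(env)
        if artifact_bytes > CAP_BYTES:
            continue  # oversize: fail this arm and fall back to the next one
        if val_bpb > kill_bpb:
            return None  # something broke; stop rather than try further arms
        return name, val_bpb
    return None
```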
60+
61+
## Cost estimate
62+
63+
~$3-4 for both arms.
64+
65+
## Run command
66+
67+
```bash
# Try aggressive first; if oversize, fall back to conservative.
SEED=42 RUN_LABEL=seed_42_D_agg \
RESUME_FROM_CKPT=/workspace/runs/060A-1855-port/seed_42/final_model.pt \
MLP_CLIP_SIGMAS=10.5 ATTN_CLIP_SIGMAS=12.0 EMBED_CLIP_SIGMAS=13.0 \
bash tmp_exec/launch_060_eval.sh
```

research/specs/060E-full-stack.md

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
1+
# Spec 060E — Full quant-repair stack on 060A (eval-only)
2+
3+
**Date:** 2026-04-29
4+
**Branch:** `exp/060C-deploy-repair` (uses 060C's code port)
5+
**Parent:** 060A pt + 060C code; depends on 060B and 060C results being individually positive.
6+
7+
## Hypothesis
8+
9+
Combine the two levers, each of which must first validate on its own:
10+
- 060B's ATTN clip tightening (−0.0008)
11+
- 060C's deploy-time quant repair (−0.001 to −0.005)
12+
13+
Expected combined gain: **−0.002 to −0.005 BPB**, assuming additivity (they target different error sources: SDClip reduces quant noise at calibration; deploy-time repair recovers passthrough fp16 params at eval).
14+
15+
## Baseline
16+
17+
060A.
18+
19+
## Expected Δ
20+
21+
**−0.002 to −0.005 BPB** vs 060A. Lower confidence on additivity — they could partially overlap.
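Under strict additivity, the combined range is just the sum of the two per-lever estimates. A small sketch using only the figures quoted in this spec (−0.0008 for the ATTN clip, −0.001 to −0.005 for the repair); the variable names are illustrative.

```python
attn_clip_gain = -0.0008             # 060B estimate, BPB
repair_gain_range = (-0.005, -0.001)  # 060C estimate, (optimistic, pessimistic)

# Naive additive combination, rounded to 4 decimals to hide float noise.
combined_range = tuple(round(attn_clip_gain + g, 4) for g in repair_gain_range)

print(f"additive estimate: {combined_range[1]} to {combined_range[0]} BPB")
```

The naive sum spans about −0.0018 to −0.0058 BPB, so the stated −0.002 to −0.005 range reads as a rounded version that is slightly more cautious at the optimistic end, consistent with the caveat about partial overlap.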
22+
23+
## Accept criteria
24+
25+
- val_bpb ≤ min(060B, 060C) − 0.0005 (better than either alone)
26+
- artifact ≤ 16 MB
27+
- no instability
28+
29+
## Config diff vs 060A
30+
31+
```
ATTN_CLIP_SIGMAS=12.5 (from 060B)
DEPLOY_TIME_REPAIR_ENABLED=1 (from 060C)
DEPLOY_TIME_REPAIR_BATCHES=8
DEPLOY_TIME_REPAIR_SEQ_LEN=512
DEPLOY_TIME_REPAIR_LR=1e-3
DEPLOY_TIME_REPAIR_ITERS=5
```
39+
40+
## Code changes
41+
42+
Same as 060C (deploy-time repair port). 060B is config-only on top.
43+
44+
## Hardware ladder
45+
46+
- 4×H100, eval-only via RESUME_FROM_CKPT
47+
- ~15 min wall, ~$3
48+
49+
## Stop-early criteria
50+
51+
- Either component alone (B or C) underperforms 060A → stop family; pivot to single-lever approach
52+
- Combined val_bpb ≥ 060A → no synergy, treat as null and ship best-of-{B,C}
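The accept and stop-early rules for 060E combine into a single decision. A minimal sketch, assuming lower `val_bpb` is better and that all four scores come from seed 42; `decide_060e`, its return labels, and the "inconclusive" case (better than 060A but not by the required margin over the best single lever) are hypothetical.

```python
def decide_060e(bpb_a: float, bpb_b: float, bpb_c: float, bpb_e: float,
                margin: float = 0.0005) -> str:
    """Map 060A/B/C/E scores to an outcome following the spec's rules."""
    if bpb_b > bpb_a or bpb_c > bpb_a:
        return "stop-family"          # a single lever underperforms the baseline
    if bpb_e >= bpb_a:
        return "null-ship-best-of-B-C"  # no synergy from combining
    if bpb_e <= min(bpb_b, bpb_c) - margin:
        return "accept-060E"          # better than either lever alone by the margin
    return "inconclusive"             # improved, but not by the required margin


# Hypothetical scores for illustration only.
print(decide_060e(bpb_a=1.0610, bpb_b=1.0602, bpb_c=1.0595, bpb_e=1.0588))
```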
53+
54+
## Run command
55+
56+
```bash
SEED=42 RUN_LABEL=seed_42_full \
RESUME_FROM_CKPT=/workspace/runs/060A-1855-port/seed_42/final_model.pt \
ATTN_CLIP_SIGMAS=12.5 \
DEPLOY_TIME_REPAIR_ENABLED=1 DEPLOY_TIME_REPAIR_ITERS=5 \
bash tmp_exec/launch_060_eval.sh
```

research/specs/060F-lqer-bumps.md

Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
1+
# Spec 060F — LQER capacity bumps on 060A baseline (eval-only)
2+
3+
**Date:** 2026-04-29
4+
**Branch:** `research` (config-only)
5+
**Parent:** 060A `final_model.pt`.
6+
7+
## Hypothesis
8+
9+
046D measured LQER `RANK=6` and `TOP_K=5/8` as null on 045-armD. However, 046D ran without lrzip headroom and on a different base. With #1855's lrzip compressor giving ~280 KB of headroom, we can afford bigger LQER tensors. Per 046K's rank-adaptive analysis, the marginal residual captured by each additional rank is small, but it might compound with other 060 levers.
10+
11+
## Baseline
12+
13+
060A.
14+
15+
## Expected Δ
16+
17+
**−0.001 to −0.003 BPB**, low confidence (046D measured null on prior base).
18+
19+
## Accept criteria
20+
21+
- val_bpb ≤ (060A − 0.0005)
22+
- artifact ≤ 16 MB
23+
24+
## Config diff vs 060A
25+
26+
Three arms (sequential, pick best):
27+
28+
**Arm F1 — RANK bump:**
29+
```
LQER_RANK: 4 → 5 (cost ~50 KB)
LQER_TOP_K: 3 (unchanged)
```
33+
34+
**Arm F2 — TOP_K bump:**
35+
```
LQER_RANK: 4 (unchanged)
LQER_TOP_K: 3 → 4 (cost ~30-50 KB)
```
39+
40+
**Arm F3 — finer asym groups:**
41+
```
LQER_ASYM_GROUP: 64 → 32 (cost ~50-80 KB; finer per-group quant scales)
```
44+
45+
## Code changes
46+
47+
None. Env-var override + RESUME_FROM_CKPT.
48+
49+
## Hardware ladder
50+
51+
- 4×H100, eval-only.
52+
- ~5-7 min wall per arm, ~$1-2 each.
53+
54+
## Seed plan
55+
56+
1 seed (42) per arm.
57+
58+
## Cost estimate
59+
60+
~$5-6 total for all three arms.
61+
62+
## Stop-early criteria
63+
64+
- Artifact > 16 MB → fail this arm
65+
- 046D showed null on RANK=6; if F1 (RANK=5) is null, skip F2/F3
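The per-arm size costs and the skip rule can be sanity-checked together. A sketch under stated assumptions: it takes the full ~280 KB of headroom as available on 060A (the spec does not confirm the measured 060A artifact size), and it treats "null" as a gain smaller than the 0.0005 BPB accept margin; both helper names are hypothetical.

```python
HEADROOM_KB = 280  # assumption: all of the lrzip headroom is free on 060A

# Estimated artifact cost per arm in KB, (low, high), from the config diff.
ARM_COST_KB = {"F1": (50, 50), "F2": (30, 50), "F3": (50, 80)}

# Worst case if all three bumps were ever stacked in one artifact.
worst_case_stack_kb = sum(high for _, high in ARM_COST_KB.values())
fits_even_if_stacked = worst_case_stack_kb <= HEADROOM_KB


def remaining_arms(f1_delta_bpb: float, accept_margin: float = 0.0005) -> list:
    """Skip F2/F3 when F1's gain is smaller than the accept margin (treated as null)."""
    return [] if f1_delta_bpb > -accept_margin else ["F2", "F3"]


print(worst_case_stack_kb, fits_even_if_stacked, remaining_arms(-0.0012))
```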
66+
67+
## Run command (per arm)
68+
69+
```bash
# Arm F1
SEED=42 RUN_LABEL=seed_42_F1 \
RESUME_FROM_CKPT=/workspace/runs/060A-1855-port/seed_42/final_model.pt \
LQER_RANK=5 \
bash tmp_exec/launch_060_eval.sh
```
