Commit e8681bc

leon2k2k2k and claude committed
spec 019: hardcoded α full-pipeline on 8xH100 (Path A)
Submission-quality test of constant-α (017 endpoint values) with full training + GPTQ + phased-TTT pipeline. Pins commit 2895db3 on exp/recur-alpha-constant-full, which extends 018c's constant-α wiring to the TTT forward path. Target: beat openai#1736's 1.06610 post-TTT. Expected range 1.0650-1.0675 based on 018c's 92% throughput recovery + TTT bug fix. Single seed 42 first, 3-seed conditional on clear promote. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 03e8bba commit e8681bc

1 file changed

Lines changed: 148 additions & 0 deletions

# Spec 019 — Recur-Alpha with hardcoded α, full-pipeline on 8×H100

**Slug:** `recur-alpha-constant-full`
**Created:** 2026-04-21
**Links to:** spec 018c (throughput diagnostic), `research/evaluations/018c-recur-alpha-constant.md`, `research/ideas/beating-1736-note.md`
## Hypothesis

Spec 018c showed that hardcoding α as Python float constants (017's endpoint values) lets torch.compile specialize the lerp kernel, recovering **92% of the blend overhead** at proxy scale. At full model scale this approaches zero throughput tax.

This spec tests the training-quality trade-off: **does losing α's adaptive learning (frozen at 017's values from step 1) hurt post-TTT val_bpb?**

Working backward from the target:

- Target post-TTT: ≤ 1.06610 (beat #1736)
- 017's actual post-TTT with learnable α: 1.06733 (missed by 0.00123) — but 017 had a buggy TTT path (α not applied during TTT)
- 019 recovers ~44 steps from throughput savings (~0.002 bpb at the training endpoint) and fixes the TTT bug (direction unknown, probably helps)
- Expected post-TTT range: 1.0650–1.0675
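The expected-range arithmetic above can be sketched in a few lines (an illustrative calculation only; the TTT-fix contribution is the unquantified term, which is why the band is asymmetric around the central estimate):

```python
# Sketch of the work-backward arithmetic from the Hypothesis section.
post_ttt_017  = 1.06733   # 017's post-TTT with learnable alpha, buggy TTT path
step_gain_bpb = 0.00200   # ~44 extra steps from the throughput savings
target_1736   = 1.06610   # post-TTT number to beat

central = post_ttt_017 - step_gain_bpb   # ≈ 1.06533
assert central < target_1736             # central estimate clears the target
```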
## Baseline

Primary comparison: **#1736's 1.06610** (canonical target).
Secondary comparison: **017's 1.06733** (same recur-alpha mechanism, different α values and a buggy TTT path).
## Accept criteria

- Training completes without NaN/divergence
- Both `final_model.pt` and `final_model.int6.ptz` are emitted
- Post-GPTQ val_bpb captured
- **Phased-TTT val_bpb captured** (the submission-gate number)

**Decision criterion (post-TTT val_bpb):**

| Post-TTT | Bucket | Next action |
|---|---|---|
| ≤ 1.06550 | Clear beat of #1736 | 3-seed confirmation (~$30), then submission |
| (1.06550, 1.06710] | Close, within seed std | 3-seed to resolve |
| (1.06710, 1.06910] | Inside gate but worse than #1736 | Shelve recur-alpha for submission; still a mechanistic finding |
| > 1.06910 | Outside gate | Investigate; hardcoded α likely hurt too much |
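The bucket boundaries can be expressed as a small classifier (a sketch of the decision table; the return strings are shorthand, not tooling the repo is known to provide):

```python
def decision_bucket(post_ttt_val_bpb: float) -> str:
    """Map a post-TTT val_bpb to the decision bucket from the table."""
    if post_ttt_val_bpb <= 1.06550:
        return "clear beat -> 3-seed confirmation, then submission"
    if post_ttt_val_bpb <= 1.06710:
        return "within seed std -> 3-seed to resolve"
    if post_ttt_val_bpb <= 1.06910:
        return "inside gate, worse than #1736 -> shelve for submission"
    return "outside gate -> investigate"
```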
## Code changes

**Branch:** `exp/recur-alpha-constant-full`, forked from `aabfbea` (018c's commit) plus additional TTT wiring.
**Commit:** `2895db3` on `fork/exp/recur-alpha-constant-full`.

Key properties:

- **No learnable α** — values hardcoded at `((1.078125, 1.2734375, 1.3984375), (1.015625, 0.97265625, 0.83203125))` from 017's endpoint
- **torch.compile sees α as compile-time constants** at both the `forward_logits` and `forward_ttt` lerp sites
- **TTT bug fixed** — recur-alpha now applies during TTT adaptation and eval (it was missing in 015/016/017)
- `self.recur_alpha = None` — not a Parameter, so no gradient tracking and no optimizer state
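A minimal sketch of what "α as compile-time constants" means, assuming the blend is a plain lerp between the pre-loop and looped activations (the `blend` name and the group/index scheme are illustrative; the real wiring lives in `train_gpt.py` on the pinned commit):

```python
# 017's endpoint alpha values as plain Python floats, so a compiler such as
# torch.compile can fold them into the kernel instead of loading a Parameter
# tensor each step. Structure and indexing here are illustrative.
RECUR_ALPHA = ((1.078125, 1.2734375, 1.3984375),
               (1.015625, 0.97265625, 0.83203125))

def blend(h_prev: float, h_loop: float, group: int, idx: int) -> float:
    """Lerp previous and looped activations with a fixed scalar alpha."""
    a = RECUR_ALPHA[group][idx]           # compile-time constant, no gradient
    return h_prev + a * (h_loop - h_prev)
```

Because the values are module-level floats rather than tensors, there is no optimizer state to carry and no guard on a parameter's value at trace time.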
## Hardware ladder

- **Skip smoke** — cite 018c's mini-model throughput run (commit `aabfbea`); the new TTT wiring is a surgical change that doesn't affect the training-path graph.
- **8×H100, region = whichever has capacity.** NA preferred (cleaner throughput); JP is an acceptable fallback (same constant-α benefit).
- **Seed 42 first.** 3-seed (42/43/44) conditional on the clear-promote bucket.
## Seed plan

Single seed (42) first. 3-seed if the result promotes.
## Inputs

- Data: CaseOps dataset. On JP under `/runpod/data/...`, on NA under `/workspace/data/...`.
- Tokenizer: `fineweb_8192_bpe.model`, bundled.
- Hotstart: **none — fresh from-scratch training with hardcoded α from step 1.**
## Execution protocol

Standard #1736 full pipeline. Example for JP:

```bash
cd /runpod/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT
git checkout 2895db3

mkdir -p /runpod/runs/019-recur-alpha-constant-full/seed_42
mkdir -p /runpod/.torch_inductor_cache

NCCL_NET=Socket DATA_DIR=/runpod/data \
  ARTIFACT_DIR=/runpod/runs/019-recur-alpha-constant-full/seed_42 \
  TORCHINDUCTOR_CACHE_DIR=/runpod/.torch_inductor_cache \
  CASEOPS_ENABLED=1 \
  PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \
  MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 \
  EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
  MATRIX_LR=0.026 \
  GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \
  GATED_ATTN_ENABLED=1 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 \
  RECUR_ALPHA_ENABLED=1 \
  TRAIN_LOG_EVERY=100 \
  SEED=42 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py \
  > /runpod/runs/019-recur-alpha-constant-full/seed_42/train.log 2>&1
```

Substitute `/workspace` for `/runpod` on NA. pyminify must be installed (preflight).
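The JP/NA path substitution can be captured in a one-line helper when scripting around the run (a hypothetical convenience, not something the repo provides):

```python
def mount_root(region: str) -> str:
    """Per the spec: JP pods mount under /runpod, NA pods under /workspace."""
    return {"JP": "/runpod", "NA": "/workspace"}[region]

# e.g. artifact_dir = mount_root("NA") + "/runs/019-recur-alpha-constant-full/seed_42"
```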
## Expected throughput

Per 018c's 92% recovery at proxy scale, 019 on the full model across 8×H100 should run at nearly baseline (008) throughput. Expected endpoint step count: ~4825–4850 (vs 017's 4784 with tensor α). Gain: ~40–65 more training steps than 017, worth ~0.002 bpb of training-endpoint improvement.

Tok/s snapshot logging in final.json is required (compare against 017 at steps 100/1000/2000/3000/4000/4500).
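One way to do the step-matched comparison, assuming the snapshots from the two final.json files have been loaded as step-to-tok/s dicts (the key layout is an assumption):

```python
SNAP_STEPS = (100, 1000, 2000, 3000, 4000, 4500)

def throughput_ratios(toks_019: dict, toks_017: dict) -> dict:
    """Step-matched 019/017 tok/s ratios; >1.0 means 019 ran faster."""
    return {s: toks_019[s] / toks_017[s]
            for s in SNAP_STEPS if s in toks_019 and s in toks_017}
```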
## Checkpoints / artifacts to emit

Inherited baseline:

- `final_model.pt` — pre-GPTQ FP state dict, post-EMA
- `final_model.int6.ptz` — quantized submission artifact
- `train.log` — full pipeline sequence
- `final.json` — must include `val_bpb`, `val_bpb_pre_gptq_post_ema`, `val_bpb_post_gptq`, **`val_bpb_post_ttt`**, `stopping_early_at_step`, `layer_loop_enabled_at_step`, tok/s snapshots, and `recur_alpha_values_hardcoded` (the constants from the table, for audit)
- `notes.md`
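A quick completeness check on final.json before calling the run done (a sketch; the key names are taken from the list above):

```python
REQUIRED_FINAL_JSON_KEYS = (
    "val_bpb", "val_bpb_pre_gptq_post_ema", "val_bpb_post_gptq",
    "val_bpb_post_ttt", "stopping_early_at_step",
    "layer_loop_enabled_at_step", "recur_alpha_values_hardcoded",
)

def missing_final_json_keys(final_json: dict) -> list:
    """Return the required keys absent from a parsed final.json."""
    return [k for k in REQUIRED_FINAL_JSON_KEYS if k not in final_json]
```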
## Stop-early criteria

Unconditional:

- NaN/inf in train_loss → halt
- Step time > 2× spec 008's → halt

Conditional on `looping_active=True`:

- Training loss > 008's matched-step loss + 0.03 for 5+ consecutive log entries → halt (hardcoded α might be badly off for this seed's trajectory)
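The three halting rules fold into one predicate (an illustrative sketch; the `recent_gaps` framing, meaning recent train_loss minus 008's matched-step loss, is my own):

```python
import math

def should_halt(train_loss: float, step_time: float, step_time_008: float,
                recent_gaps: list, looping_active: bool) -> bool:
    """recent_gaps: recent (train_loss - 008 matched-step loss) values, newest last."""
    if not math.isfinite(train_loss):   # NaN/inf -> unconditional halt
        return True
    if step_time > 2 * step_time_008:   # more than 2x spec-008's step time
        return True
    if (looping_active and len(recent_gaps) >= 5
            and all(g > 0.03 for g in recent_gaps[-5:])):
        return True                     # sustained loss regression vs 008
    return False
```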
## Cost estimate

| Item | Cost |
|---|---|
| 8×H100 × ~25 min (compile + training + full pipeline) | ~$10 |
| Rsync + pod stop | ~$0.10 |
| **Single-seed total** | **~$10–12** |
| (Conditional) 3-seed: 2 additional runs | ~$20–24 |
| **If 3-seed promotes** | **~$30–36** |
## Open questions for interview

1. **Is hardcoded α at 017's values appropriate for seed 42?** 017 itself ran seed 42, so the α values were learned on this exact seed and should reproduce well. For seeds 43/44, hardcoded α is a different-seed transplant; it could be fine (the α shape reproduces across seeds per the 016/017 finding) or a slight mismatch.
2. **What if α at 017's values produces worse val_bpb than 017's learnable trajectory?** Then we know α's co-evolution with the weights matters — pivot to Path B (spec 020: learn, then freeze).
3. **The TTT fix is included — could it interact badly with recur-alpha?** It should be fine, since TTT now applies α consistently with training, but this is unmeasured at full-pipeline scale.
## Sequencing

- Run **after** 018c (already done; throughput validated).
- Run **before** any Path-B (learn-then-freeze) experiment — 019's simpler approach is worth testing first.
- If 019 promotes: 3-seed confirmation → submission candidate.
- If 019 is null or a regression vs 017: 020 (Path B, learn-then-freeze) becomes the next test.
## What 019 does NOT do

- Does not learn α (that's the whole point)
- Does not test Path B (learn then freeze) — deferred, conditional on 019's outcome
- Does not run smoke (trusting 018c's validation of the constant-α path plus the surgical TTT extension)
- Does not attempt freeze-plus-constant-fold mid-training (too expensive: recompile cost exceeds the savings for a 10-minute run)
