# Spec 019 — Recur-Alpha with hardcoded α, full-pipeline on 8×H100

**Slug:** `recur-alpha-constant-full`
**Created:** 2026-04-21
**Links to:** spec 018c (throughput diagnostic), `research/evaluations/018c-recur-alpha-constant.md`, `research/ideas/beating-1736-note.md`

## Hypothesis

Spec 018c showed that hardcoding α as Python float constants (017's endpoint values) lets torch.compile specialize the lerp kernel and recover **92% of the blend overhead** at proxy scale. Extrapolated to full model scale, the throughput tax should approach zero.

This spec tests the training-quality trade-off: **does losing α's adaptive learning (frozen at 017's values from step 1) hurt post-TTT val_bpb?**

Working backward from the target:
- Target post-TTT: ≤ 1.06610 (beat #1736)
- 017's actual post-TTT with learnable α: 1.06733 (missed by 0.00123) — but 017 had a buggy TTT path (α not applied during TTT)
- 019 recovers ~44 steps from throughput savings (~0.002 bpb at the training endpoint) and fixes the TTT bug (unknown direction, probably helps)
- Expected post-TTT range: 1.0650 – 1.0675
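The arithmetic above can be sanity-checked in a few lines. All values are copied from this spec; the ~0.002 bpb recovery figure is the spec's own estimate, not derived here:

```python
# Sanity check of the target arithmetic (values copied from this spec).
target = 1.06610            # #1736's post-TTT val_bpb, the canonical target
spec017_post_ttt = 1.06733  # 017's post-TTT with learnable alpha

miss = round(spec017_post_ttt - target, 5)
assert miss == 0.00123      # "missed by 0.00123"

# If the ~0.002 bpb estimated recovery from extra steps holds, 017's number
# would land inside the quoted 1.0650-1.0675 expected range.
recovered = spec017_post_ttt - 0.002
assert 1.0650 <= recovered <= 1.0675
```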

## Baseline

Primary comparison: **#1736's 1.06610** (canonical target).
Secondary comparison: **017's 1.06733** (same recur-alpha mechanism, different α values & buggy TTT).

## Accept criteria

- Training completes without NaN/divergence
- `final_model.pt` and `final_model.int6.ptz` are both emitted
- Post-GPTQ val_bpb captured
- **Phased-TTT val_bpb captured** (the submission-gate number)

**Decision criterion (post-TTT val_bpb):**

| Post-TTT | Bucket | Next action |
|---|---|---|
| ≤ 1.06550 | Clear beat of #1736 | 3-seed confirmation (~$30), then submission |
| (1.06550, 1.06710] | Close, within seed std | 3-seed to resolve |
| (1.06710, 1.06910] | Inside gate but worse than #1736 | Shelve recur-alpha for submission; still a mechanistic finding |
| > 1.06910 | Outside gate | Investigate; hardcoded α likely hurt too much |
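The decision table maps onto a simple threshold cascade. This is a hypothetical helper for illustration — thresholds are copied from the table above, but the function itself is not part of the pipeline:

```python
# Hypothetical helper mapping a post-TTT val_bpb to the decision buckets above.
def decision_bucket(post_ttt_bpb: float) -> str:
    if post_ttt_bpb <= 1.06550:
        return "clear-beat: 3-seed confirmation, then submission"
    if post_ttt_bpb <= 1.06710:
        return "close: 3-seed to resolve"
    if post_ttt_bpb <= 1.06910:
        return "inside-gate: shelve recur-alpha for submission"
    return "outside-gate: investigate hardcoded alpha"

# Example: 017's 1.06733 (buggy TTT) would land in the third bucket.
assert decision_bucket(1.06733).startswith("inside-gate")
```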

## Code changes

**Branch:** `exp/recur-alpha-constant-full`, forked from `aabfbea` (018c's commit) plus additional TTT wiring.
**Commit:** `2895db3` on `fork/exp/recur-alpha-constant-full`.

Key properties:
- **No learnable α** — values hardcoded at `((1.078125, 1.2734375, 1.3984375), (1.015625, 0.97265625, 0.83203125))` from 017's endpoint
- **torch.compile sees α as compile-time constants** in both the `forward_logits` and `forward_ttt` lerp sites
- **TTT bug fixed** — recur-alpha now applies during TTT adaptation and eval (it was missing in 015/016/017)
- `self.recur_alpha = None` — not a Parameter, so no gradient tracking and no optimizer state
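The constant-α blend can be sketched as follows. This is an illustrative pure-Python stand-in, not the actual `train_gpt.py` code — the real lerp sites operate on torch tensors inside compiled graphs, where a Python-float α lets torch.compile bake the constant into the kernel:

```python
# Minimal sketch of the constant-alpha blend (names are illustrative, not the
# actual train_gpt.py identifiers). 017-endpoint constants copied from this spec.
RECUR_ALPHA = ((1.078125, 1.2734375, 1.3984375),
               (1.015625, 0.97265625, 0.83203125))

def lerp(a: float, b: float, alpha: float) -> float:
    """Blend a toward b by alpha: a + alpha * (b - a)."""
    return a + alpha * (b - a)

# alpha = 1.0 returns b exactly; the hardcoded values sit near 1, so each blend
# is roughly "take the new value, with a scaled correction from the old one".
assert lerp(0.0, 2.0, 1.0) == 2.0
assert lerp(0.0, 2.0, RECUR_ALPHA[0][0]) == 2.0 * 1.078125
```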

## Hardware ladder

- **Skip smoke** — cite 018c's mini-model throughput run (commit `aabfbea`); the new TTT wiring is a surgical change that does not affect the training-path graph.
- **8×H100, region = whichever has capacity.** NA preferred (cleaner throughput); JP is an acceptable fallback (same constant-α benefit).
- **Seed 42 first.** 3-seed (42/43/44) is conditional on the clear-promote bucket.

## Seed plan

Single seed (42) first; 3-seed if results promote.

## Inputs

- Data: CaseOps dataset — `/runpod/data/...` on JP, `/workspace/data/...` on NA.
- Tokenizer: `fineweb_8192_bpe.model`, bundled.
- Hotstart: **none — fresh from-scratch training with hardcoded α from step 1.**

## Execution protocol

Standard #1736 full pipeline. Example for JP:

```bash
cd /runpod/parameter-golf/records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT
git checkout 2895db3

mkdir -p /runpod/runs/019-recur-alpha-constant-full/seed_42
mkdir -p /runpod/.torch_inductor_cache

NCCL_NET=Socket DATA_DIR=/runpod/data \
ARTIFACT_DIR=/runpod/runs/019-recur-alpha-constant-full/seed_42 \
TORCHINDUCTOR_CACHE_DIR=/runpod/.torch_inductor_cache \
CASEOPS_ENABLED=1 \
PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \
MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 \
EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
MATRIX_LR=0.026 \
GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \
GATED_ATTN_ENABLED=1 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 \
RECUR_ALPHA_ENABLED=1 \
TRAIN_LOG_EVERY=100 \
SEED=42 \
torchrun --standalone --nproc_per_node=8 train_gpt.py \
  > /runpod/runs/019-recur-alpha-constant-full/seed_42/train.log 2>&1
```

Substitute `/workspace` for `/runpod` on NA. pyminify must be installed (preflight check).

## Expected throughput

Per 018c's 92% recovery at proxy scale, 019 on 8×H100 at full model size should run at nearly baseline (008) throughput. Expected endpoint step count: ~4825–4850 (vs 017's 4784 with tensor α). That is ~41–66 extra training steps over 017, worth roughly 0.002 bpb of training-endpoint improvement.

Tok/s snapshots must be logged in `final.json` (compare at steps 100/1000/2000/3000/4000/4500 vs 017).
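The step-count gain follows directly from the numbers quoted above (a quick check, using only the spec's own figures):

```python
# Step-count arithmetic from the throughput estimate (values from this spec).
steps_017 = 4784                      # 017's endpoint with tensor alpha
expected_low, expected_high = 4825, 4850  # expected 019 endpoint range

gain_low = expected_low - steps_017   # extra steps at the low end
gain_high = expected_high - steps_017 # extra steps at the high end
assert (gain_low, gain_high) == (41, 66)
```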

## Checkpoints / artifacts to emit

Inherited baseline:
- `final_model.pt` — pre-GPTQ FP state dict, post-EMA
- `final_model.int6.ptz` — quantized submission artifact
- `train.log` — full pipeline sequence
- `final.json` — must include `val_bpb`, `val_bpb_pre_gptq_post_ema`, `val_bpb_post_gptq`, **`val_bpb_post_ttt`**, `stopping_early_at_step`, `layer_loop_enabled_at_step`, tok/s snapshots, and `recur_alpha_values_hardcoded` (the hardcoded constants, recorded for audit)
- `notes.md`
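A post-run audit of `final.json` can be sketched as below. The key names are copied from the artifact list above; the helper itself is hypothetical, not part of the pipeline:

```python
# Hypothetical post-run check that final.json carries the required keys.
import json

REQUIRED_KEYS = {
    "val_bpb",
    "val_bpb_pre_gptq_post_ema",
    "val_bpb_post_gptq",
    "val_bpb_post_ttt",
    "stopping_early_at_step",
    "layer_loop_enabled_at_step",
    "recur_alpha_values_hardcoded",
}

def missing_keys(final_json: dict) -> set:
    """Return the required keys absent from a parsed final.json dict."""
    return REQUIRED_KEYS - final_json.keys()

# Usage after a run (path illustrative):
#   with open(".../final.json") as f:
#       assert not missing_keys(json.load(f))
```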

## Stop-early criteria

Unconditional:
- NaN/inf in train_loss → halt
- Step time > 2× spec 008's → halt

Conditional on `looping_active=True`:
- Training loss > 008's matched-step loss + 0.03 for 5+ consecutive log entries → halt (hardcoded α might be badly off for this seed's trajectory)
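The loss-based halt rule above can be sketched as a small stateful check. This is illustrative — the real pipeline's variable names and logging cadence may differ:

```python
# Hypothetical sketch of the stop-early rule (margin and limit from this spec).
import math

LOSS_MARGIN = 0.03     # allowed gap over 008's matched-step loss
CONSECUTIVE_LIMIT = 5  # log entries over margin before halting

def should_halt(train_loss: float,
                baseline_008_loss: float,
                consecutive_bad: int) -> tuple:
    """Return (halt, updated consecutive-bad counter)."""
    if math.isnan(train_loss) or math.isinf(train_loss):
        return True, consecutive_bad          # unconditional halt
    if train_loss > baseline_008_loss + LOSS_MARGIN:
        consecutive_bad += 1                  # one more bad log entry
    else:
        consecutive_bad = 0                   # recovered; reset the streak
    return consecutive_bad >= CONSECUTIVE_LIMIT, consecutive_bad
```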

## Cost estimate

| Item | Cost |
|---|---|
| 8×H100 × ~25 min (compile + training + full pipeline) | ~$10 |
| Rsync + pod stop | ~$0.10 |
| **Single-seed total** | **~$10–12** |
| (Conditional) 3-seed: 2 additional runs | ~$20–24 |
| **If 3-seed promotes** | **~$30–36** |

## Open questions for interview

1. **Is hardcoded α at 017's values appropriate for seed 42?** 017 itself ran seed 42, so the α values were learned on this exact seed and should reproduce well. For seeds 43/44, hardcoded α is a different-seed transplant; it could be fine (the α shape reproduces across seeds per the 016/017 finding) or a slight mismatch.
2. **What if α at 017's values produces worse val_bpb than 017's learnable trajectory?** Then we know α's co-evolution with the weights matters — pivot to Path B (spec 020: learn, then freeze).
3. **The TTT fix is included — any risk it interacts badly with recur-alpha?** It should be fine: TTT now applies α consistently with training. But it is unmeasured at full pipeline scale.

## Sequencing

- Run **after** 018c (already done; throughput validated).
- Run **before** any Path-B (learn-then-freeze) experiment — 019's simpler approach is worth testing first.
- If 019 promotes: 3-seed confirmation → submission candidate.
- If 019 is null or a regression vs 017: 020 (Path B, learn-then-freeze) becomes the next test.

## What 019 does NOT do

- Does not learn α (that's the whole point)
- Does not test Path B (learn then freeze) — deferred, conditional on 019's outcome
- Does not run smoke (trusting 018c's validation of the constant-α path plus the surgical TTT extension)
- Does not attempt freeze-plus-constant-fold mid-training (too expensive: recompile cost exceeds the savings for a 10-min run)