
Commit a1e073e

nprime06 and claude committed
feat(submission): PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE — val_bpb 1.06378

3-seed mean val_bpb = 1.06378 (std 0.00058), val_loss = 2.32794 nats/token. −0.00171 BPB vs PR #1736 (1.06549), −0.00043 vs PR #1779 (1.06421). Stacks 4 orthogonal wins on top of PR #1736, all ablation-validated on seed 0 against stock #1736 before stacking:

- Polar Express per-iteration minimax Newton-Schulz coefficients (from PR #1344), replacing the fixed (3.44, -4.78, 2.03) tuple applied 5x with 5 distinct tuples baked into zeropower_via_newtonschulz5
- MIN_LR=0.10 warmdown floor (was 0)
- Sparse attention head-output gate (modded-nanogpt pattern, 96 params/layer vs dense GatedAttn's 4096), preserving the attn_gate_w name so the int8-per-row quant path still routes it (size-range check widened to 32..8192)
- Triton fused softcapped cross-entropy kernel on the training forward; the eval path keeps eager numerics unchanged

Polish: GPTQ_RESERVE_SECONDS=0.5 (was 4) and VAL_LOSS_EVERY=0 (was 4000) together reclaim ~15s of training budget for additional depth-3 steps.

All 3 seeds (42, 0, 1234) clear the 16M decimal cap (max 15,940,380 B, ~60 KB headroom), the 600s train budget (599.46–599.57s), and the 600s TTT-eval budget (412.8–511.3s). Every individual seed beats its PR #1736 counterpart (deltas −1.20 to −2.27 mBPB). Changes are fully orthogonal to PR #1779's frozen recurrent α/β and PR #1767's LoRA-TTT tweaks — stackable.

Also ships the BOS-fix patch for prepare_caseops_data.py (matches PR #1736 d7263a3 and PR #1769 fe7c309): sp.encode can't emit BOS_ID=1 since IDs 0-7 are reserved, and phased TTT's _loss_bpb_from_sums divides by zero on BOS-less shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attention Gate + Fused CE — val_bpb 1.06378

**val_bpb: 1.06378** (3-seed mean, std=0.00058) | **val_loss: 2.32794 nats/token** (std=0.00128) | **~15.94 MB** | 8×H100 SXM | Phased TTT

**−0.00171 BPB vs PR #1736** (−0.00445 nats vs 1.06549), **−0.00043 vs PR #1779** (1.06421). Every individual seed beats its PR #1736 counterpart, and the changes are fully orthogonal to PR #1779's frozen α/β — stackable.
## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, phased TTT, 10-min train / 10-min eval budgets)

### Core table (phased TTT)

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | TTT gain | TTT eval time | Artifact (bytes) |
|------|-------:|------------:|-------------:|---------:|--------------:|-----------------:|
| 42 | 4961 | 1.07699 | 1.06444 | -0.01255 | 511.3s | 15,940,380 |
| 0 | 4957 | 1.07603 | 1.06353 | -0.01250 | 440.9s | 15,939,508 |
| 1234 | 4964 | 1.07595 | 1.06336 | -0.01259 | 412.8s | 15,939,918 |
| **Mean** | **4961** | **1.07632** | **1.06378** | **-0.01255** | **455.0s** | **15,939,935** |
| **Std** | | 0.00058 | **0.00058** | | 50.3s | 436 |
19+
### Supplemental diagnostics
20+
21+
| Seed | Post-EMA BPB (pre-quant) | Quantized BPB (no TTT) | Post-TTT BPB | val_loss (nats) | Train time |
22+
|------|-------------------------:|-----------------------:|-------------:|----------------:|-----------:|
23+
| 42 | 1.06764 | 1.07699 | 1.06444 | 2.32939 | 599.46s |
24+
| 0 | 1.06667 | 1.07603 | 1.06353 | 2.32740 | 599.56s |
25+
| 1234 | 1.06665 | 1.07595 | 1.06336 | 2.32703 | 599.57s |
26+
27+
All three seeds clear both 600s budgets (train + TTT eval) and the 16,000,000-byte decimal artifact cap (60+ KB headroom). 3-seed std is 0.00058 BPB ≈ 0.00151 nats, well under the 0.005-nat significance floor.
28+
### Head-to-head vs PR #1736 (matched seeds)

| Seed | This PR | PR #1736 | Δ (mBPB) |
|------|--------:|---------:|---------:|
| 42 | 1.06444 | 1.06610 | −1.66 |
| 0 | 1.06353 | 1.06473 | −1.20 |
| 1234 | 1.06336 | 1.06563 | −2.27 |
| **Mean** | **1.06378** | **1.06549** | **−1.71** |
## What this submission adds over PR #1736

- **Polar Express Newton-Schulz coefficients (ported from PR #1344):** Replaces Muon's fixed `(a,b,c) = (3.4445, -4.775, 2.0315)` tuple applied 5 times with 5 per-iteration minimax-optimized tuples baked into `zeropower_via_newtonschulz5`, producing a higher-quality polar factor per step at unchanged `MUON_BACKEND_STEPS=5`. (A sketch of the iteration follows this list.)
- **MIN_LR=0.10 warmdown floor:** Floors the LR warmdown at 10% of max instead of 0, so the final ~25% of training continues to deliver useful gradient updates instead of frozen no-ops. (See the schedule sketch below.)
- **Sparse attention head-output gate (modded-nanogpt pattern):** Replaces PR #1736's dense `GatedAttn (8, 512) = 4096 params/layer` with a narrow-input variant `(8, gate_window=12) = 96 params/layer`; preserves the `attn_gate_w` name so the existing int8-per-row gate quantization path still routes it (after widening its size-range check to 32..8192). Saves ~44K params (≈ ~44 KB of artifact) with no measurable BPB cost. (See the gate sketch below.)
- **Fused softcapped cross-entropy (Triton, training-only):** A single streaming kernel reads pre-softcap `logits_proj` once and computes `(softcap*tanh, LSE, per-row loss)` in-register; the backward mirrors the forward symbolically. Registered via `torch.library.custom_op` + `register_autograd`. The eval path (`forward_logits`) keeps the eager `softcap*tanh + F.cross_entropy` numerics unchanged from PR #1736. (See the registration sketch below.)
- **Polish:** `GPTQ_RESERVE_SECONDS=0.5` (was 4) and `VAL_LOSS_EVERY=0` (was 4000) together reclaim ~15s of the 600s training budget for additional depth-3 steps.
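A minimal sketch of the per-iteration coefficient plumbing. The five Polar Express tuples themselves live in PR #1344 and are not reproduced here; the placeholder list below reuses the baseline tuple five times so the sketch runs. The function shape follows the standard modded-nanogpt `zeropower_via_newtonschulz5`:

```python
import torch

# Placeholder: the submission bakes in 5 distinct minimax tuples from PR #1344;
# only the baseline tuple below is quoted in this README.
COEFFS = [(3.4445, -4.7750, 2.0315)] * 5  # one (a, b, c) tuple per iteration

@torch.no_grad()
def zeropower_via_newtonschulz5(G: torch.Tensor, coeffs=COEFFS) -> torch.Tensor:
    """Approximate the polar factor of G with a quintic Newton-Schulz
    iteration, consuming one coefficient tuple per step."""
    X = G.bfloat16()
    transposed = G.size(-2) > G.size(-1)
    if transposed:
        X = X.mT
    # Normalize so the spectral norm is <= 1 before iterating.
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for a, b, c in coeffs:
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.mT if transposed else X
```

With `MUON_BACKEND_STEPS=5` unchanged, the loop body runs exactly once per tuple, so swapping the coefficient list is the entire change.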
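A minimal sketch of the floored warmdown, assuming the schedule returns a multiplier on the max LR and that the warmdown covers the final ~25% of steps; `lr_scale` and `warmdown_frac` are illustrative names, and only `MIN_LR=0.10` is from the submission:

```python
MIN_LR = 0.10  # floor as a fraction of max LR (was 0)

def lr_scale(step: int, num_steps: int, warmdown_frac: float = 0.25) -> float:
    """Constant LR, then a linear warmdown clamped at MIN_LR instead of 0."""
    warmdown_start = int(num_steps * (1 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    frac = (num_steps - step) / (num_steps - warmdown_start)
    return max(MIN_LR, frac)  # late steps keep 10% LR rather than freezing
```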
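A minimal sketch of the narrow-input head gate, under stated assumptions: a per-head sigmoid gate computed from only the first `gate_window` channels of the block input, broadcast over `head_dim`. The `SparseHeadGate` class name, the zero init, and the exact channel slice are assumptions; the `attn_gate_w` name, the (8, 12) shape, and the 96-param count are from the submission:

```python
import torch
import torch.nn as nn

class SparseHeadGate(nn.Module):
    """Head-output gate in the modded-nanogpt style: (8, 12) = 96 params
    instead of the dense GatedAttn (8, 512) = 4096 params per layer."""

    def __init__(self, num_heads: int = 8, gate_window: int = 12):
        super().__init__()
        # Keep the `attn_gate_w` name so the int8-per-row quant path,
        # with its size-range check widened to 32..8192, still routes it.
        self.attn_gate_w = nn.Parameter(torch.zeros(num_heads, gate_window))

    def forward(self, x: torch.Tensor, head_out: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); head_out: (B, T, num_heads, head_dim)
        gate = torch.sigmoid(x[..., : self.attn_gate_w.size(1)] @ self.attn_gate_w.mT)
        return head_out * gate.unsqueeze(-1)  # (B, T, H, 1) broadcast over head_dim
```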
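The registration pattern, sketched with an eager reference body standing in for the Triton kernel (reproducing the kernel here would be guesswork). The `submission::softcap_ce` op name and `SOFTCAP = 30.0` are placeholders; `torch.library.custom_op` and `register_autograd` are the real torch ≥ 2.4 API named above. Assumes flattened 2D `(N, V)` logits:

```python
import torch
import torch.nn.functional as F

SOFTCAP = 30.0  # placeholder value; the real softcap comes from the model config

@torch.library.custom_op("submission::softcap_ce", mutates_args=())
def softcap_ce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Eager reference body; the submission launches the streaming Triton kernel here.
    capped = SOFTCAP * torch.tanh(logits.float() / SOFTCAP)
    return F.cross_entropy(capped, targets)

@softcap_ce.register_fake
def _(logits, targets):
    # Shape/dtype-only stub so torch.compile can trace through the op.
    return logits.new_empty((), dtype=torch.float32)

def _setup_context(ctx, inputs, output):
    logits, targets = inputs
    ctx.save_for_backward(logits, targets)

def _backward(ctx, grad_out):
    logits, targets = ctx.saved_tensors
    t = torch.tanh(logits.float() / SOFTCAP)
    # d(capped)/d(logits) = 1 - tanh^2; chain through mean cross-entropy.
    probs = torch.softmax(SOFTCAP * t, dim=-1)
    probs[torch.arange(targets.numel(), device=targets.device), targets] -= 1.0
    grad_logits = probs * (1.0 - t * t) / targets.numel()
    return (grad_out * grad_logits).to(logits.dtype), None  # no grad for targets

softcap_ce.register_autograd(_backward, setup_context=_setup_context)
```

The training forward would then call `softcap_ce(logits_proj.view(-1, V), targets.view(-1))`, while `forward_logits` keeps the eager PR #1736 numerics for eval.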
**Implementation note — TTT path mirroring:** `_block_with_lora` and `_parallel_block_with_lora` manually unroll attention composition (bypassing `CausalSelfAttention.forward`) to thread in LoRA adapters, so any new attention-forward gate must be mirrored in both helpers or TTT silently skips it. We caught this during validation: training applied the sparse gate while TTT didn't, producing a post-TTT BPB of 1.908. All three forward paths now have matching conditional branches. (A shared-helper alternative is sketched below.)
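One structural way to make the mirroring automatic rather than manual, sketched under the assumption that all three paths can call one common helper (`apply_attn_gate` is a hypothetical name; the submission keeps three matching branches instead):

```python
import torch

def apply_attn_gate(x: torch.Tensor, head_out: torch.Tensor,
                    attn_gate_w: torch.Tensor) -> torch.Tensor:
    """Single source of truth for the head-output gate. If
    CausalSelfAttention.forward, _block_with_lora, and
    _parallel_block_with_lora all call this helper, a future gate change
    cannot apply in training yet be silently skipped in TTT."""
    gate = torch.sigmoid(x[..., : attn_gate_w.size(-1)] @ attn_gate_w.mT)
    return head_out * gate.unsqueeze(-1)
```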
## Rule compliance

- **Artifact ≤ 16,000,000 bytes DECIMAL**: all 3 seeds ≤ 15,940,380 bytes (~60 KB headroom).
- **train_time ≤ 600s**: all 3 seeds 599.46–599.57s.
- **TTT eval time ≤ 600s**: all 3 seeds 412.8–511.3s.
- **Score-first TTT**: phased TTT unchanged from PR #1736; snapshots the pre-update score on each chunk BEFORE the LoRA adapter step (per-doc LoRA reset via `reusable_lora.reset()`), satisfying Issue #1017 Condition 3.
- **BPB on original bytes**: per-token byte sidecar unchanged from PR #1736.
- **Reversibility**: CaseOps transform unchanged — `decode_lossless_caps_v2(encode_lossless_caps_v2(x)) == x` (a round-trip check is sketched after this list).
- **No val data in training**: training uses only `fineweb_train_*.bin` shards.
- **No external network during eval**: self-contained; tokenizer + transform ship with the submission.
## Known bug fix in `prepare_caseops_data.py`

This submission ships the BOS-fix patch identified on PR #1779 and patched on PR #1736 (d7263a3) and PR #1769 (fe7c309). The original prep script called `sp.encode(transformed, out_type=int)` without prepending BOS_ID=1; since the SP model reserves IDs 0–7, BOS cannot be emitted organically, and phased TTT's `_loss_bpb_from_sums` divides by zero on BOS-less shards. The fix is a 4-line diff — see `prepare_caseops_data.py` line 168. (Its effect is sketched below.)
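A minimal sketch of what the fix does, using the standard sentencepiece API; the actual 4-line diff lives at the referenced line, and `encode_with_bos` is an illustrative wrapper name:

```python
import sentencepiece as spm

BOS_ID = 1  # reserved: IDs 0-7 are held out of the organic vocabulary

sp = spm.SentencePieceProcessor(
    model_file="tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model"
)

def encode_with_bos(transformed: str) -> list[int]:
    # sp.encode can never emit BOS_ID on its own, so prepend it explicitly;
    # otherwise phased TTT's _loss_bpb_from_sums divides by zero on the shard.
    return [BOS_ID] + sp.encode(transformed, out_type=int)
```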
## Requirements

```bash
# PyTorch 2.9.1+cu128 (or compatible) + Flash Attention 3 for Hopper:
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install flash-attn-interface sentencepiece triton numpy
# Python ≥ 3.12 (minified f-strings use PEP 701 nested same-type quotes).
# Training-only Triton fused CE kernel requires triton ≥ 3.0 (ships with torch 2.9.1).
```
## Lineage

- Builds on **PR #1736** (dexhunter): the SP8192 + CaseOps + GatedAttn + Loop3-5 + PhasedTTT stack. All of PR #1736's innovations are preserved (tokenizer, byte sidecar, quant-gate, phased TTT) — this submission only adds.
- **Polar Express NS** ported from **PR #1344** (5-step minimax Newton-Schulz coefficients, originally for Muon).
- **Sparse attention head-output gate** pattern from the modded-nanogpt speedrun (narrow `gate_window` input instead of full `dim`), with the `attn_gate_w` naming preserved so PR #1736's quant-gate int8 routing continues to work.
- **Fused softcapped CE Triton kernel** is an original port; it follows the `torch.library.custom_op` + `register_autograd` pattern used elsewhere in the repo.
- **MIN_LR floor** is a trivial schedule change; no ports.
## Credits

- @samacqua — PR #1530 base stack.
- @romeerp — PR #1729 CaseOps concept + byte sidecar accounting.
- @bigbag — PR #1493 merged SOTA (1.0810).
- @dexhunter — PR #1736 (direct baseline for this PR).
- @leon2k2k2k — PR #1779 frozen recurrent α/β and TTT improvements (orthogonal, stackable).
## Included files

- `train_gpt.py` — main training script (~146 KB pre-minify; includes the fused CE Triton kernel block).
- `submission.json` — metadata.
- `README.md` — this file.
- `train_seed42.log`, `train_seed0.log`, `train_seed1234.log` — 3-seed run logs.
- `tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` — CaseOps SentencePiece model (366.5 KB, identical to PR #1736).
- `lossless_caps.py` — bijective CaseOps transform (identical to PR #1736).
- `prepare_caseops_data.py` — one-time data prep script with the BOS-fix patch applied.
