Commit e100586

Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop45 + PhasedTTT — val_bpb 1.06549
3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green:
- artifact 15,975,120 bytes mean (≤16,000,000 decimal)
- train_time 596.14s mean (≤600s)
- total_eval_time 397.23s mean (≤600s)

Builds on PR openai#1530 SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1) bijective case preprocessing from PR openai#1729 with a per-token byte sidecar so BPB is scored on the original pre-transform UTF-8 bytes. Adds a learned attention out-gate (init_std=0.005) plus quant-gate scaling that recovers the ~40 KB of overhead introduced by the new control tokens, keeping every seed under the 16 MB decimal cap.

Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).
1 parent 75700cb commit e100586

# Record: SP8192 + CaseOps + Gated Attention + Quant Gate + Loop4-5 + Phased TTT — val_bpb 1.06549

**val_bpb: 1.06549** (3-seed mean, std=0.00070) | **val_loss: 2.33168 nats/token** (std=0.00152) | **~15.98 MB** | 8×H100 SXM | Phased TTT

## Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128, phased TTT, 10-min train / 10-min eval budgets)

### Core table (phased TTT)

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | TTT gain | TTT time | Artifact (bytes) |
|------|-------:|------------:|-------------:|---------:|---------:|-----------------:|
| 42 | 4854 | 1.07847 | 1.06610 | -0.01237 | 396.9s | 15,978,834 |
| 0 | 4843 | 1.07719 | 1.06473 | -0.01247 | 399.3s | 15,971,476 |
| 1234 | 4847 | 1.07811 | 1.06563 | -0.01248 | 395.5s | 15,975,050 |
| **Mean** | **4848** | **1.07792** | **1.06549** | **-0.01244** | **397.2s** | **15,975,120** |
| **Std** | | 0.00066 | **0.00070** | | 1.9s | 3,698 |
### Supplemental diagnostics

| Seed | Post-EMA BPB (pre-quant) | Quantized BPB (no TTT) | Sliding/TTT BPB | val_loss (nats) | Train time | Eval time |
|------|-------------------------:|-----------------------:|----------------:|----------------:|-----------:|----------:|
| 42 | 1.06907 | 1.07847 | 1.06610 | 2.33302 | 596.18s | 396.9s |
| 0 | 1.06779 | 1.07719 | 1.06473 | 2.33002 | 596.17s | 399.3s |
| 1234 | 1.06872 | 1.07811 | 1.06563 | 2.33199 | 596.08s | 395.5s |

All three seeds clear both 600s budgets (train + eval) and the 16,000,000-byte decimal artifact cap. The 3-seed std of 0.00070 BPB ≈ 0.00153 nats, well under the 0.005-nat significance floor.
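As a sanity check (not part of the submission code), the BPB-to-nats conversion implied by the reported means can be reproduced directly; the bytes/token figure below is derived from the two reported means, not separately measured:

```python
import math

# Reported 3-seed means from the tables above.
mean_bpb = 1.06549        # bits per original byte
mean_loss_nats = 2.33168  # nats per token

# loss_nats = bpb * ln(2) * bytes_per_token, so the implied bytes/token is:
bytes_per_token = mean_loss_nats / (mean_bpb * math.log(2))

# The 3-seed BPB spread expressed in nats/token:
std_bpb = 0.00070
std_nats = std_bpb * math.log(2) * bytes_per_token
print(f"bytes/token ≈ {bytes_per_token:.3f}, std ≈ {std_nats:.5f} nats")
```

The result (≈ 3.157 bytes/token, ≈ 0.00153 nats) agrees with the reported val_loss std of 0.00152 up to rounding.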
## Key Innovation — CaseOps tokenizer

CaseOps (`lossless_caps_caseops_v1`) is a **bijective**, character-level text transform applied before SentencePiece training. It removes English capitalization from the body of the text and records it as four operator tokens that become part of the BPE vocabulary as SentencePiece `user_defined_symbols`:

- `TITLE` — next word is TitleCase
- `ALLCAPS` — next word (or region) is UPPERCASE
- `CAPNEXT` — next letter is capitalized
- `ESC` — escape for a literal operator-looking sequence

Because the transform is fully invertible (`decode_lossless_caps_v2(encode_lossless_caps_v2(s)) == s` for all strings), **no information is lost**. The SP model sees lowercase-normalized text, so the BPE merges allocate vocabulary around content instead of around case-duplicated variants ("the"/"The"/"THE" collapse to one surface form with operator prefixes). This reclaims ~0.005-0.006 nats per token on FineWeb.

**BPB is scored on ORIGINAL pre-transform UTF-8 bytes**, not on the transformed representation. The training pipeline emits per-token byte sidecar shards (`fineweb_val_bytes_XXXXXX.bin`, uint16, parallel to the val token shards) that record the canonical byte cost of each target position; eval sums those to get true bytes. This sidesteps the "bytes-per-token shift" concern: the score is on the same FineWeb text, just with a different tokenization front end.

```python
# Transform (character-level, bijective):
text = "The quick brown FOX."
encoded = encode_lossless_caps_v2(text)
# → "<TITLE>the quick brown <ALLCAPS>fox."
assert decode_lossless_caps_v2(encoded) == text
```
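The full transform lives in `lossless_caps.py`. As a self-contained illustration of why such a scheme round-trips, here is a deliberately simplified single-operator sketch (toy names, one CAPNEXT-style control character, no ESC handling or word-level operators):

```python
# Simplified sketch of the CaseOps idea with a single CAPNEXT-style operator.
# This is NOT the repo's lossless_caps.py (which uses four operators plus an
# escape); it only illustrates why such a transform is bijective.
CAP = "\x01"  # stand-in for the CAPNEXT control token

def toy_encode(s: str) -> str:
    out = []
    for ch in s:
        if ch.isupper():
            out.append(CAP)        # record the capitalization...
            out.append(ch.lower()) # ...and emit the lowercased letter
        elif ch == CAP:
            raise ValueError("input contains the reserved control char")
        else:
            out.append(ch)
    return "".join(out)

def toy_decode(s: str) -> str:
    out, cap = [], False
    for ch in s:
        if ch == CAP:
            cap = True             # next letter was originally uppercase
        else:
            out.append(ch.upper() if cap else ch)
            cap = False
    return "".join(out)

text = "The quick brown FOX."
assert toy_decode(toy_encode(text)) == text  # round-trip is lossless
```

The real transform additionally compresses whole-word patterns (`TITLE`, `ALLCAPS`) so common capitalized words cost one operator token instead of one per letter.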
## Changes from PR #1530 / PR #1626 baseline

| Component | PR #1530 | This submission |
|-----------|---------:|----------------:|
| Tokenizer | SP8192 FineWeb BPE | SP8192 FineWeb BPE + CaseOps operator tokens |
| BPB accounting | uniform piece.encode() | per-token byte sidecar (original bytes) |
| Attention out-gate | n/a | **learned `gate` scalar per head** (init_std=0.005) |
| Attention quant gate | n/a | **quant-time gate scaling** (~40 KB artifact savings) |
| Depth recurrence | n/a | Loop4-5 (layers 4-5 run twice) |
| TTT | multi-phase SGD score-first | multi-phase SGD score-first (kept) |
| Clip sigmas | (MLP=12, ATTN=13) | (MLP=12, ATTN=13) |
| Embed bits | 7 | 7 |

Net: **-0.00644 BPB / -0.01665 nats vs PR #1626 (1.07193)**, **3.3× the 0.005-nat record bar**.
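The sidecar accounting row above amounts to the following arithmetic (an illustrative sketch with made-up numbers, not the eval code): per-position NLL is summed in nats and divided by the total ORIGINAL byte count read from the uint16 sidecar shard.

```python
import math

# Sketch of sidecar-based BPB accounting (illustrative, not the exact eval
# path): each val target position carries the byte count of the ORIGINAL
# pre-transform UTF-8 text in a parallel uint16 shard, and eval sums those.
per_token_nats = [2.1, 2.4, 1.9, 2.6]  # hypothetical per-target NLL (nats)
sidecar_bytes = [4, 3, 5, 4]           # hypothetical original bytes per target

total_nats = sum(per_token_nats)
total_bytes = sum(sidecar_bytes)

# bits per ORIGINAL byte, independent of how the transformed text tokenizes
val_bpb = total_nats / (math.log(2) * total_bytes)
print(f"val_bpb = {val_bpb:.5f}")
```

Because the denominator is fixed by the original text, changing the tokenizer front end (here, CaseOps) cannot inflate the score by shifting bytes per token.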
## Rule compliance

- **Artifact ≤ 16,000,000 bytes DECIMAL**: all 3 seeds ≤ 15,978,834 bytes (21+ KB headroom).
- **train_time ≤ 600s**: all 3 seeds 596.1-596.2s.
- **total_eval_time ≤ 600s**: all 3 seeds 395.5-399.3s.
- **Score-first TTT**: phased TTT snapshots the pre-update score on each chunk BEFORE the LoRA adapter step (per-doc LoRA reset via `reusable_lora.reset()`), satisfying Issue #1017 Condition 3.
- **BPB on original bytes**: per-token byte sidecar encodes the canonical UTF-8 byte count of each val position; transformed text is only the tokenization front end.
- **Reversibility**: `decode_lossless_caps_v2(encode_lossless_caps_v2(x)) == x` checked by the bijectivity test (see `tools/test_caseops_bijectivity.py` in the author's working tree; the transform is also verifiable in-repo via `lossless_caps.py`).
- **No val data in training**: training uses only `fineweb_train_*.bin` shards.
- **No external network during eval**: self-contained; tokenizer + transform ship with the submission.
## Requirements

```bash
# PyTorch 2.9.1+cu128 (or compatible) + Flash Attention 3 for Hopper:
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install flash-attn-interface sentencepiece triton numpy
# Python ≥ 3.12 (minified f-strings use PEP 701 nested same-type quotes).
```
## Data setup (run ONCE)

The submission ships with the trained CaseOps SentencePiece model (`tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model`) and the bijective transform module (`lossless_caps.py`). Train/val shards and the byte sidecar are rebuilt from the canonical FineWeb-10B doc stream produced by `data/download_hf_docs_and_tokenize.py` in the repo root:

```bash
# 1. Ensure docs_selected.jsonl exists (standard setup step for the repo).
python3 ../../data/download_hf_docs_and_tokenize.py  # or point to existing file

# 2. Build CaseOps-transformed shards + val byte sidecar.
python3 prepare_caseops_data.py \
  --docs ./fineweb10B_raw/docs_selected.jsonl \
  --out ./data/datasets/fineweb10B_sp8192_caseops/datasets \
  --sp ./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
```
Output layout (what `train_gpt.py` expects with `CASEOPS_ENABLED=1`):

```
data/datasets/fineweb10B_sp8192_caseops/datasets/
  tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
  datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/
    fineweb_train_000000.bin
    fineweb_train_000001.bin
    ...
    fineweb_val_000000.bin
    fineweb_val_bytes_000000.bin
```
## Run command (3-seed reproduction)

```bash
for SEED in 42 0 1234; do
  NCCL_NET=Socket \
  DATA_DIR=./data \
  CASEOPS_ENABLED=1 \
  PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 PHASED_TTT_NUM_PHASES=3 \
  MLP_CLIP_SIGMAS=12.0 ATTN_CLIP_SIGMAS=13.0 \
  EMBED_BITS=7 EMBED_CLIP_SIGMAS=15.0 \
  MATRIX_LR=0.026 \
  GPTQ_RESERVE_SECONDS=4 GPTQ_CALIBRATION_BATCHES=16 \
  GATED_ATTN_ENABLED=1 GATED_ATTN_INIT_STD=0.005 GATED_ATTN_QUANT_GATE=1 \
  SEED=$SEED \
  torchrun --standalone --nproc_per_node=8 train_gpt.py \
    > train_seed${SEED}.log 2>&1
done
```
## Lineage

- Builds on **PR #1530** (samacqua) SP8192 + Loop4-5 + parallel residuals + phased TTT stack.
- Borrows **PR #1626** (ours) multi-phase SGD phased TTT schedule (3 phases on the first 2000 val docs).
- Adopts **CaseOps** reversible case preprocessing + per-token byte sidecar BPB accounting from **PR #1729** (romeerp), which established that bijective text preprocessing preserving byte-level BPB is rule-compliant.
- Adds the learned `gated_attn` out-gate (init_std=0.005) + quant-gate scaling (`GATED_ATTN_QUANT_GATE=1`), which recovers the ~15-40 KB of artifact overhead introduced by the new control tokens and sidecar path, keeping all three seeds under the 16 MB decimal cap.
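How a quant-time gate can save artifact bytes is worth spelling out. The sketch below shows the idea under explicit assumptions (a per-head scalar gate applied as `(1 + g)` on each head's output, folded into the rows of the output projection at quantization time so no separate gate tensor is stored); the actual `train_gpt.py` mechanics may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, head_dim, d_model = 8, 64, 128

# Learned per-head scalar gate, initialized ~N(0, 0.005): near-identity at init.
gate = rng.normal(0.0, 0.005, size=n_heads)

attn_out = rng.normal(size=(n_heads, head_dim))        # per-head attention outputs
w_out = rng.normal(size=(n_heads * head_dim, d_model))  # output projection

# Gated forward pass: scale each head's output by (1 + g_h).
y_gated = ((1.0 + gate)[:, None] * attn_out).reshape(-1) @ w_out

# Quant-time fold: absorb the gate into w_out's rows instead of storing it.
scale = np.repeat(1.0 + gate, head_dim)  # one scale factor per w_out row
w_folded = scale[:, None] * w_out
y_folded = attn_out.reshape(-1) @ w_folded

assert np.allclose(y_gated, y_folded)  # same output, no stored gate tensor
```

Under these assumptions the gate improves training dynamics for free at inference: after folding, the quantized checkpoint has exactly the ungated parameter layout.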
## Credits

- @samacqua — PR #1530 base stack.
- @romeerp — PR #1729 CaseOps concept + byte sidecar accounting.
- @bigbag — PR #1493 merged SOTA (1.0810).
- @MarioPaerle — PR #1667 AttnOutGate pattern.
## Included files

- `train_gpt.py` — main training script (131,887 bytes pre-minify).
- `submission.json` — metadata.
- `README.md` — this file.
- `train_seed42.log`, `train_seed0.log`, `train_seed1234.log` — 3-seed run logs.
- `tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` — CaseOps SentencePiece model (366.5 KB).
- `lossless_caps.py` — bijective CaseOps transform (used by `prepare_caseops_data.py`).
- `prepare_caseops_data.py` — one-time data prep script that tokenizes FineWeb via CaseOps and emits the per-token byte sidecar.
