# Record: CaseOps Tokenizer + Mild WD Taper

**val_bpb: 1.06780** (3-seed mean, std 0.00037) | **2.33674 nats** | **~15.94 MB** | 8xH100 SXM, ~596s train + ~488s TTT eval

This record builds directly on PR #1626's legal multi-phase TTT stack and adds two changes:

1. A lossless case-operations tokenizer/data export, hosted publicly at [romeerp/parameter-golf-caseops-v1](https://huggingface.co/datasets/romeerp/parameter-golf-caseops-v1)
2. A mild late Muon weight-decay taper (`WD_TAPER_START_FRAC=0.70`, `WD_TAPER_FINAL_MULT=0.50`)

The tokenizer/data are not checked into this PR; they are downloaded from the Hugging Face dataset above with the included `cached_challenge_fineweb.py`.

## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, Phased TTT)

| Seed | Steps | Pre-Quant BPB | Quantized BPB | **Post-TTT BPB** | Artifact (bytes) |
|------|-------|---------------|---------------|------------------|------------------|
| 0 | 4,921 | 1.07032992 | 1.08152131 | **1.06805820** | 15,932,307 |
| 42 | 4,866 | 1.07065549 | 1.08171495 | **1.06806595** | 15,935,802 |
| 1234 | 4,870 | 1.06971629 | 1.08036614 | **1.06727867** | 15,943,106 |
| **Mean** | | **1.07023390** | **1.08120080** | **1.06780094** | **15,937,072** |

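The headline numbers follow from the Post-TTT column; assuming the quoted std is the population std over the three seeds, they can be checked with the standard library:

```python
import statistics

# Post-TTT BPB per seed, from the results table above
post_ttt = [1.06805820, 1.06806595, 1.06727867]

mean_bpb = statistics.fmean(post_ttt)
std_bpb = statistics.pstdev(post_ttt)  # population std over the 3 seeds

print(f"mean={mean_bpb:.8f} std={std_bpb:.5f}")  # mean=1.06780094 std=0.00037
```
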
## Supplemental Diagnostics

| Seed | Pre-Quant BPB | Quantized BPB | Post-TTT BPB | val_loss (nats) | Code size (bytes) | Total (bytes) | Train time | Eval time |
|------|---------------|---------------|--------------|-----------------|-------------------|---------------|------------|-----------|
| 0 | 1.07032992 | 1.08152131 | 1.06805820 | 2.33730724 | 28,320 | 15,932,307 | 596.1s | 488.8s |
| 42 | 1.07065549 | 1.08171495 | 1.06806595 | 2.33732420 | 30,985 | 15,935,802 | 596.1s | 482.0s |
| 1234 | 1.06971629 | 1.08036614 | 1.06727867 | 2.33560135 | 30,985 | 15,943,106 | 596.1s | 494.6s |

## Tokenizer: Lossless Case-Ops

The tokenizer uses a lossless text transform, `lossless_caps_caseops_v1`, that factorizes text into:

- a lowercase lexical stream
- a tiny reserved capitalization side-channel

Reserved control symbols:

- `TITLE`
- `ALLCAPS`
- `CAPNEXT`
- `ESC`

Behavior over maximal ASCII alphabetic runs:

- lowercase words stay lowercase
- `TitleCase` words become `TITLE + lowercase(word)`
- ALL-CAPS words become `ALLCAPS + lowercase(word)`
- mixed-case words use sparse `CAPNEXT` markers
- control symbols themselves are escaped losslessly with `ESC`

Examples:

- `The NASA Launch` -> `TITLE the ALLCAPS nasa TITLE launch`
- `iPhone OpenAI` -> `i CAPNEXT phone TITLE open CAPNEXT a CAPNEXT i`

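For illustration, the behavior above can be sketched as follows. This is a hypothetical reconstruction, not the code in `lossless_caps.py`: sentinels are rendered as bracketed placeholder strings rather than reserved token IDs, and `ESC` escaping is omitted (with reserved IDs, literal-text collisions cannot occur):

```python
import re

# Sentinels as placeholder strings; the real tokenizer reserves dedicated IDs.
TITLE, ALLCAPS, CAPNEXT = "[TITLE]", "[ALLCAPS]", "[CAPNEXT]"

def encode_caseops(text: str) -> str:
    out = []
    for run in re.split(r"([A-Za-z]+)", text):  # maximal ASCII alphabetic runs
        if run.isascii() and run.isalpha() and run.isupper() and len(run) > 1:
            out.append(ALLCAPS + run.lower())   # ALL-CAPS word
        else:
            for i, ch in enumerate(run):
                if ch.isascii() and ch.isupper():
                    # leading capital -> TITLE, later capitals -> sparse CAPNEXT
                    out.append((TITLE if i == 0 else CAPNEXT) + ch.lower())
                else:
                    out.append(ch)
    return "".join(out)

def decode_caseops(s: str) -> str:
    out, i = [], 0
    while i < len(s):
        if s.startswith(ALLCAPS, i):
            i += len(ALLCAPS)
            j = i
            while j < len(s) and s[j].isalpha():
                j += 1
            out.append(s[i:j].upper())          # uppercase the whole run
            i = j
        elif s.startswith(TITLE, i) or s.startswith(CAPNEXT, i):
            i += len(TITLE) if s.startswith(TITLE, i) else len(CAPNEXT)
            out.append(s[i].upper())            # uppercase the next character
            i += 1
        else:
            out.append(s[i])
            i += 1
    return "".join(out)
```

Round-tripping the README's examples, `decode_caseops(encode_caseops(...))` returns the original string exactly, which is the lossless property the byte-sidecar evaluation relies on.
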
The point is to remove redundant case variation from the main lexical token stream without losing any information. At evaluation time, BPB is still charged against the original raw UTF-8 bytes, not the transformed stream.

## Why This Is Still Real BPB

The exporter writes validation byte sidecars:

- `fineweb_val_000000.bin`
- `fineweb_val_bytes_000000.bin`

The trainer then loads the byte sidecar directly and reports:

- `val_bpb:byte_sidecar:enabled`

So scoring is done against the exact original byte counts rather than the tokenized/transformed length. This preserves a true byte-level objective even though the tokenizer uses a reversible preprocessing transform.
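
Concretely, byte-sidecar BPB is just summed cross-entropy converted from nats to bits and divided by the raw byte count. A minimal sketch (the function name is illustrative, not from `train_gpt.py`):

```python
import math

def val_bpb(sum_ce_nats: float, sidecar_total_bytes: int) -> float:
    """Bits per byte: total cross-entropy in bits over the raw UTF-8 byte count."""
    return sum_ce_nats / (math.log(2) * sidecar_total_bytes)
```

Because the denominator is the sidecar's raw byte total rather than the token count, shortening the token stream with the case-ops transform cannot inflate the metric.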

## Main Idea

The core intuition is that the standard `sp8192` tokenizer still forces the model to represent a lot of casing variation directly in the lexical stream. Transforming capitalized words into sentinel + lowercase frees up vocabulary for more useful tokens, and may also provide an inductive bias that helps the model learn capitalization as a rule.

The intuition behind the weight-decay taper: high weight decay in this challenge serves to make the weights more compressible by reducing their entropy. That pressure is needed early in training, but near the end the weights are largely settled and unlikely to spike into outliers, so trading some weight decay for better optimization can pay off.

This submission keeps the legal PR #1626 architecture and phased-TTT evaluation path, but swaps in the lossless case-ops tokenizer/data export above. On top of that, it adds a mild late taper on Muon weight decay:

- full Muon WD until 70% of training
- then a linear taper to 50% of the base WD by the end
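
The taper is a per-step multiplier on the base Muon weight decay; as a sketch (the helper name is illustrative, not from `train_gpt.py`):

```python
def wd_multiplier(step: int, total_steps: int,
                  start_frac: float = 0.70,    # WD_TAPER_START_FRAC
                  final_mult: float = 0.50) -> float:  # WD_TAPER_FINAL_MULT
    """Multiplier on the base Muon weight decay at a given training step."""
    frac = step / total_steps
    if frac <= start_frac:
        return 1.0                             # full WD for the first 70%
    t = (frac - start_frac) / (1.0 - start_frac)
    return 1.0 + t * (final_mult - 1.0)        # linear taper to 0.5x at the end
```
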

This combination improves pretrained BPB and quantized phased-TTT BPB while staying under the 16 MB artifact cap.

## Changes from PR #1626

| Change | Source | Effect |
|--------|--------|--------|
| CaseOps tokenizer + exported dataset | **Novel (this work)** | cleaner lexical stream, exact byte-sidecar eval |
| Validation byte-sidecar BPB accounting | **Novel (this work)** | exact raw-byte metric with transformed tokenizer |
| Mild late Muon WD taper (`0.70 -> 0.50`) | This work | small but consistent BPB win |
| Public HF dataset/tokenizer download path | This work | reproducible on fresh pods |

## Rule Compliance

- **Causal:** all scoring remains autoregressive / causal.
- **Normalized:** scoring uses standard cross-entropy over the full vocabulary.
- **Score-before-update:** phased TTT keeps the legal, score-first scheme of PR #1626.
- **Single pass:** no rescoring of validation tokens.
- **No validation during training:** training uses only train shards.
- **Full validation split:** the full exported validation split is scored.
- **Byte accounting:** BPB is computed from the validation byte sidecar, which exactly matches the raw UTF-8 byte total of the exported docs.

## Public Artifacts

- Dataset + tokenizer: [romeerp/parameter-golf-caseops-v1](https://huggingface.co/datasets/romeerp/parameter-golf-caseops-v1)

The HF dataset repo contains:

- the caseops tokenizer model / vocab
- the exported train shards
- the exported validation shard
- the validation byte-sidecar shard
- `manifest.json`

## Requirements

Python >= 3.12. Flash Attention 3 (Hopper) required.

```bash
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
pip install -r requirements.txt
```

## Run Instructions

Run the commands in this section from the record directory:

```bash
cd records/track_10min_16mb/2026-04-18_PR1626_CaseOps_Taper
```

Prepare the public Hugging Face tokenizer + dataset on a fresh pod:

```bash
MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=datasets \
python3 cached_challenge_fineweb.py \
  --variant sp8192_lossless_caps_caseops_v1_reserved \
  --train-shards 80
```

This downloads both:

- the exported caseops dataset shards
- the caseops SentencePiece tokenizer artifact

from [romeerp/parameter-golf-caseops-v1](https://huggingface.co/datasets/romeerp/parameter-golf-caseops-v1).

From this record directory, train + quantize + phased eval for one seed:

```bash
NCCL_NET=Socket \
SEED=0 \
TOKENIZER_PATH=./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
DATASETS_DIR=./datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
torchrun --standalone --nproc_per_node=8 train_gpt.py \
  > train_seed0.log 2>&1
```

The submission script itself contains the intended defaults, including:

- `PHASED_TTT_ENABLED=1`
- `PHASED_TTT_NUM_PHASES=3`
- `EMBED_BITS=7`
- `EMBED_CLIP_SIGMAS=15.0`
- `MLP_CLIP_SIGMAS=12.0`
- `WD_TAPER_START_FRAC=0.70`
- `WD_TAPER_FINAL_MULT=0.50`

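The `EMBED_BITS` / `*_CLIP_SIGMAS` knobs suggest sigma-clipped uniform quantization. As a generic illustration of that technique only (the actual scheme lives in `train_gpt.py` and is not reproduced here):

```python
import statistics

def clip_quantize(ws, bits=7, clip_sigmas=15.0):
    """Clip to +/- clip_sigmas * std, then uniform-quantize to 2**bits integer codes."""
    lim = clip_sigmas * statistics.pstdev(ws)
    levels = 2 ** bits - 1                      # 127 steps for 7-bit codes
    scale = (2.0 * lim) / levels if lim > 0 else 1.0
    codes = [round((min(max(w, -lim), lim) + lim) / scale) for w in ws]
    dequant = [c * scale - lim for c in codes]  # values the model sees after loading
    return codes, dequant
```

Only the integer codes need to be stored in the artifact; the dequantized weights are reconstructed from `scale` and `lim` at load time.
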
## Rebuilding the Tokenizer / Dataset

The PR includes the actual Python sources used to create the tokenizer and exported dataset:

- `download_hf_docs_and_tokenize.py`
- `cached_challenge_fineweb.py`
- `lossless_caps.py`
- `tokenizer_specs_export_caseops_v1_reserved_only.json`

Tokenizer export spec:

```json
{
  "tokenizers": [
    {
      "name": "sp_bpe_8192_lossless_caps_caseops_v1_reserved",
      "dataset_suffix": "sp8192_lossless_caps_caseops_v1_reserved",
      "vocab_size": 8192,
      "text_transform": "lossless_caps_caseops_v1",
      "reserve_text_transform_controls": true,
      "model_prefix": "fineweb_8192_bpe_lossless_caps_caseops_v1_reserved"
    }
  ]
}
```

To rebuild the HF artifacts from public docs instead of downloading the prebuilt dataset:

```bash
python3 download_hf_docs_and_tokenize.py \
  --repo-id willdepueoai/parameter-golf \
  --remote-root datasets \
  --output-root data/caseops_export_rebuilt \
  --tokenizer-config tokenizer_specs_export_caseops_v1_reserved_only.json \
  --max-train-shards 80
```

That program imports the lossless transform implementation from `lossless_caps.py`, trains the SentencePiece model with the case-ops transform, exports the 80 train shards, exports the validation shard, and writes the validation byte sidecar needed for exact BPB scoring.

## Included Files

- `train_gpt.py`
- `requirements.txt`
- `README.md`
- `cached_challenge_fineweb.py`
- `download_hf_docs_and_tokenize.py`
- `lossless_caps.py`
- `tokenizer_specs_export_caseops_v1_reserved_only.json`
- `train_seed0.log`
- `train_seed42.log`
- `train_seed1234.log`