
Commit 11bae1d

Merge PR #1729: Record: CaseOps Tokenizer + Tapered WD - val_bpb 1.0678 (3-seed mean)
Merge accepted Parameter Golf record submission #1729.
2 parents 5c8e045 + 59b55a5 commit 11bae1d

11 files changed

Lines changed: 7571 additions & 0 deletions
Lines changed: 219 additions & 0 deletions
# Record: CaseOps Tokenizer + Mild WD Taper
# Record: CaseOps Tokenizer + Mild WD Taper

**val_bpb: 1.06780** (3-seed mean, std 0.00037) | **2.33674 nats** | **~15.94 MB** | 8xH100 SXM, ~596s train + ~488s TTT eval

This record builds directly on PR #1626's legal multi-phase TTT stack and adds two changes:

1. A lossless case-operations tokenizer/data export, hosted publicly at [romeerp/parameter-golf-caseops-v1](https://huggingface.co/datasets/romeerp/parameter-golf-caseops-v1)
2. A mild late Muon weight-decay taper (`WD_TAPER_START_FRAC=0.70`, `WD_TAPER_FINAL_MULT=0.50`)

The tokenizer and data are not checked into this PR; they are downloaded from the Hugging Face dataset above with the included `cached_challenge_fineweb.py`.
## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128, Phased TTT)

| Seed | Steps | Pre-Quant BPB | Quantized BPB | **Post-TTT BPB** | Artifact (bytes) |
|------|-------|---------------|---------------|------------------|------------------|
| 0 | 4,921 | 1.07032992 | 1.08152131 | **1.06805820** | 15,932,307 |
| 42 | 4,866 | 1.07065549 | 1.08171495 | **1.06806595** | 15,935,802 |
| 1234 | 4,870 | 1.06971629 | 1.08036614 | **1.06727867** | 15,943,106 |
| **Mean** | | **1.07023390** | **1.08120080** | **1.06780094** | **15,937,072** |
## Supplemental Diagnostics

| Seed | Pre-Quant BPB | Quantized BPB | Post-TTT BPB | val_loss (nats) | Code size (bytes) | Total artifact (bytes) | Train time | Eval time |
|------|---------------|---------------|--------------|-----------------|-------------------|------------------------|------------|-----------|
| 0 | 1.07032992 | 1.08152131 | 1.06805820 | 2.33730724 | 28,320 | 15,932,307 | 596.1s | 488.8s |
| 42 | 1.07065549 | 1.08171495 | 1.06806595 | 2.33732420 | 30,985 | 15,935,802 | 596.1s | 482.0s |
| 1234 | 1.06971629 | 1.08036614 | 1.06727867 | 2.33560135 | 30,985 | 15,943,106 | 596.1s | 494.6s |
## Tokenizer: Lossless Case-Ops

The tokenizer uses a lossless text transform, `lossless_caps_caseops_v1`, that factorizes text into:

- a lowercase lexical stream
- a tiny reserved capitalization side-channel

Reserved control symbols:

- `TITLE`
- `ALLCAPS`
- `CAPNEXT`
- `ESC`

Behavior over maximal ASCII alphabetic runs:

- lowercase words stay lowercase
- `TitleCase` words become `TITLE + lowercase(word)`
- `ALLCAPS` words become `ALLCAPS + lowercase(word)`
- mixed-case words use sparse `CAPNEXT` markers
- control symbols themselves are escaped losslessly with `ESC`

Examples:

- `The NASA Launch` -> `TITLE the ALLCAPS nasa TITLE launch`
- `iPhone OpenAI` -> `i CAPNEXT phone TITLE open CAPNEXT a CAPNEXT i`

The point is to remove redundant case variation from the main lexical token stream without losing any information. At evaluation time, BPB is still charged against the original raw UTF-8 bytes, not the transformed stream.
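To make the factorization concrete, here is a minimal, self-contained sketch of a lossless case-ops codec in the spirit of `lossless_caps_caseops_v1`. The control characters are illustrative stand-ins (the real transform reserves tokenizer symbols), and the mixed-case handling here is simplified, so its output can differ from the record's transform (which, per the example above, can emit `TITLE` inside mixed-case words); the point is only that the roundtrip is exact:

```python
import re

# Stand-in control characters for the reserved TITLE / ALLCAPS / CAPNEXT / ESC
# symbols; the actual lossless_caps.py reserves tokenizer pieces instead.
TITLE, ALLCAPS, CAPNEXT, ESC = "\u2460", "\u2461", "\u2462", "\u2463"
CONTROLS = {TITLE, ALLCAPS, CAPNEXT, ESC}

def encode(text: str) -> str:
    out = []
    for part in re.split(r"([A-Za-z]+)", text):  # maximal ASCII alpha runs
        if not part:
            continue
        if part.isalpha() and part.isascii():
            if part.islower():
                out.append(part)                        # lowercase stays as-is
            elif part.istitle():
                out.append(TITLE + part.lower())        # TitleCase -> TITLE + word
            elif part.isupper():
                out.append(ALLCAPS + part.lower())      # ALLCAPS -> ALLCAPS + word
            else:
                # mixed case: mark each uppercase letter individually
                out.append("".join(CAPNEXT + c.lower() if c.isupper() else c
                                   for c in part))
        else:
            # escape any literal control characters so decoding stays lossless
            out.append("".join(ESC + c if c in CONTROLS else c for c in part))
    return "".join(out)

def decode(encoded: str) -> str:
    out, i = [], 0
    while i < len(encoded):
        c = encoded[i]
        if c == ESC:
            out.append(encoded[i + 1]); i += 2
        elif c == CAPNEXT:
            out.append(encoded[i + 1].upper()); i += 2
        elif c in (TITLE, ALLCAPS):
            j = i + 1
            while j < len(encoded) and "a" <= encoded[j] <= "z":
                j += 1                                  # consume the lowercase word
            word = encoded[i + 1:j]
            out.append(word.capitalize() if c == TITLE else word.upper())
            i = j
        else:
            out.append(c); i += 1
    return "".join(out)
```

Lowercase text passes through untouched, so the lexical stream only pays for genuinely new information.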
## Why This Is Still Real BPB

The exporter writes validation byte sidecars:

- `fineweb_val_000000.bin`
- `fineweb_val_bytes_000000.bin`

The trainer then loads the byte sidecar directly and reports:

- `val_bpb:byte_sidecar:enabled`

So scoring is done against exact original-byte counts rather than tokenized/transformed length. This preserves a true byte-level objective even though the tokenizer uses a reversible preprocessing transform.
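A sketch of the accounting this implies: convert the summed validation cross-entropy from nats to bits and divide by the sidecar's raw UTF-8 byte total (the function name is illustrative; the actual computation lives in `train_gpt.py`):

```python
import math

def bits_per_byte(total_nats: float, total_raw_bytes: int) -> float:
    # Convert summed cross-entropy from nats to bits, then normalize by the
    # ORIGINAL byte count from the sidecar, not the transformed token count.
    return total_nats / (math.log(2) * total_raw_bytes)
```

Because the denominator is the raw byte count, a tokenizer that shortens the transformed stream cannot inflate the score; it only helps if it genuinely lowers total loss.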
## Main Idea

The core intuition is that standard `sp8192` still makes the model represent a lot of casing variation directly in the lexical stream. Transforming cased words into sentinel + lowercase frees up vocabulary for more useful tokens, and may also provide an inductive bias that helps the model learn capitalization as a rule.

The intuition behind the tapered weight decay is that a high weight decay in this challenge mainly serves to make weights more compressible by reducing their entropy. That pressure is needed early in training, but near the end the weights are largely settled and unlikely to spike into outliers, so trading some weight decay for better optimization can provide a benefit.

This submission keeps the legal PR #1626 architecture and phased-TTT evaluation path, but swaps in the lossless case-ops tokenizer/data export above. On top of that, it adds a mild late taper on Muon weight decay:

- full Muon WD until 70% of training
- then a linear taper to 50% of the base WD by the end

This combination improves both pretrained BPB and quantized phased-TTT BPB while staying under the 16 MB artifact cap.
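The taper can be sketched as a per-step multiplier on the base Muon weight decay. The function below is an illustrative reading of `WD_TAPER_START_FRAC` / `WD_TAPER_FINAL_MULT`, not the submission's exact code:

```python
def muon_wd_multiplier(step: int, total_steps: int,
                       start_frac: float = 0.70,
                       final_mult: float = 0.50) -> float:
    # Full weight decay until start_frac of training, then a linear ramp
    # down to final_mult * base_wd by the final step.
    frac = step / max(total_steps - 1, 1)
    if frac <= start_frac:
        return 1.0
    t = (frac - start_frac) / (1.0 - start_frac)  # 0 at taper start, 1 at end
    return 1.0 + t * (final_mult - 1.0)
```

The optimizer would then use `base_wd * muon_wd_multiplier(step, total_steps)` each step, keeping the compressibility pressure early and relaxing it only once the weights have settled.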
## Changes from PR #1626

| Change | Source | Effect |
|--------|--------|--------|
| CaseOps tokenizer + exported dataset | **Novel (this work)** | cleaner lexical stream, exact byte-sidecar eval |
| Validation byte-sidecar BPB accounting | **Novel (this work)** | exact raw-byte metric with transformed tokenizer |
| Mild late Muon WD taper (`0.70 -> 0.50`) | This work | small but consistent BPB win |
| Public HF dataset/tokenizer download path | This work | reproducible on fresh pods |
## Rule Compliance

- **Causal:** all scoring remains autoregressive / causal.
- **Normalized:** scoring uses standard cross-entropy over the full vocabulary.
- **Score-before-update:** phased TTT remains PR #1626-style, legal score-first TTT.
- **Single pass:** no rescoring of validation tokens.
- **No validation during training:** training uses only train shards.
- **Full validation split:** the full exported validation split is scored.
- **Byte accounting:** BPB is computed from the validation byte sidecar, which exactly matches the raw UTF-8 byte total of the exported docs.
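The score-before-update ordering can be illustrated with a toy one-parameter example (purely illustrative; the real phased TTT adapts the full model). Each "batch" is scored with the current parameter before the parameter is updated on it, so nothing is ever rescored after adaptation:

```python
def score_before_update(theta: float, batches, lr: float = 0.1):
    # Toy score-first TTT on a 1-D quadratic "loss" (x - theta)**2:
    # score each batch with the CURRENT theta, THEN adapt theta to it.
    scores = []
    for x in batches:
        scores.append((x - theta) ** 2)   # score first (single pass)
        theta += lr * 2 * (x - theta)     # then update toward the batch
    return scores, theta
```

Swapping the two lines inside the loop would let the model see a batch before scoring it, which is exactly what the score-before-update rule forbids.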
## Public Artifacts

- Dataset + tokenizer: [romeerp/parameter-golf-caseops-v1](https://huggingface.co/datasets/romeerp/parameter-golf-caseops-v1)

The HF dataset repo contains:

- the caseops tokenizer model / vocab
- the exported train shards
- the exported validation shard
- the validation byte-sidecar shard
- `manifest.json`
## Requirements

Python >= 3.12. Flash Attention 3 (Hopper) is required.

```bash
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
pip install -r requirements.txt
```
## Run Instructions

Run the commands in this section from the record directory:

```bash
cd records/track_10min_16mb/2026-04-18_PR1626_CaseOps_Taper
```
Prepare the public Hugging Face tokenizer + dataset on a fresh pod:

```bash
MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=datasets \
python3 cached_challenge_fineweb.py \
  --variant sp8192_lossless_caps_caseops_v1_reserved \
  --train-shards 80
```
This downloads both:

- the exported caseops dataset shards
- the caseops SentencePiece tokenizer artifact

from [romeerp/parameter-golf-caseops-v1](https://huggingface.co/datasets/romeerp/parameter-golf-caseops-v1).
From this record directory, train + quantize + phased eval for one seed:

```bash
NCCL_NET=Socket \
SEED=0 \
TOKENIZER_PATH=./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
DATASETS_DIR=./datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
torchrun --standalone --nproc_per_node=8 train_gpt.py \
  > train_seed0.log 2>&1
```
The submission script itself contains the intended defaults, including:

- `PHASED_TTT_ENABLED=1`
- `PHASED_TTT_NUM_PHASES=3`
- `EMBED_BITS=7`
- `EMBED_CLIP_SIGMAS=15.0`
- `MLP_CLIP_SIGMAS=12.0`
- `WD_TAPER_START_FRAC=0.70`
- `WD_TAPER_FINAL_MULT=0.50`
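As an illustration of what sigma-clipped quantization settings like `EMBED_BITS=7` and the `*_CLIP_SIGMAS` values mean, here is a toy uniform quantizer: clip weights to a multiple of their standard deviation, then round onto `2**bits` evenly spaced levels. This is a sketch only; the record's actual codec lives in `train_gpt.py` and may differ:

```python
import statistics

def quantize_clip_sigmas(weights, bits=7, clip_sigmas=12.0):
    # Clip to +/- clip_sigmas standard deviations, then uniformly quantize
    # the clipped range to 2**bits integer codes.
    lim = clip_sigmas * statistics.pstdev(weights)
    levels = 2 ** bits - 1
    scale = (2 * lim) / levels
    clipped = [min(max(w, -lim), lim) for w in weights]
    codes = [round((w + lim) / scale) for w in clipped]
    return codes, lim, scale

def dequantize(codes, lim, scale):
    return [c * scale - lim for c in codes]
```

Clipping extreme outliers before rounding keeps the quantization step small for the bulk of the distribution, which is also why the tapered weight decay above tries to keep late-training weights from spiking.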
## Rebuilding the Tokenizer / Dataset

The PR includes the actual Python sources used to create the tokenizer and exported dataset:

- `download_hf_docs_and_tokenize.py`
- `cached_challenge_fineweb.py`
- `lossless_caps.py`
- `tokenizer_specs_export_caseops_v1_reserved_only.json`
Tokenizer export spec:

```json
{
  "tokenizers": [
    {
      "name": "sp_bpe_8192_lossless_caps_caseops_v1_reserved",
      "dataset_suffix": "sp8192_lossless_caps_caseops_v1_reserved",
      "vocab_size": 8192,
      "text_transform": "lossless_caps_caseops_v1",
      "reserve_text_transform_controls": true,
      "model_prefix": "fineweb_8192_bpe_lossless_caps_caseops_v1_reserved"
    }
  ]
}
```
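As a sketch of how this spec ties to the paths used elsewhere in this record, the dataset directory name is `fineweb10B_` plus the spec's `dataset_suffix` (the helper below is hypothetical, for illustration only):

```python
import json

SPEC = """
{
  "tokenizers": [
    {
      "name": "sp_bpe_8192_lossless_caps_caseops_v1_reserved",
      "dataset_suffix": "sp8192_lossless_caps_caseops_v1_reserved",
      "vocab_size": 8192,
      "text_transform": "lossless_caps_caseops_v1",
      "reserve_text_transform_controls": true,
      "model_prefix": "fineweb_8192_bpe_lossless_caps_caseops_v1_reserved"
    }
  ]
}
"""

def dataset_dirs(spec_text: str) -> list[str]:
    # Mirror the fineweb10B_<suffix> naming used by DATASETS_DIR above.
    spec = json.loads(spec_text)
    return [f"fineweb10B_{t['dataset_suffix']}" for t in spec["tokenizers"]]
```

This matches the `DATASETS_DIR=./datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved` path in the training command.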
To rebuild the HF artifacts from public docs instead of downloading the prebuilt dataset:

```bash
python3 download_hf_docs_and_tokenize.py \
  --repo-id willdepueoai/parameter-golf \
  --remote-root datasets \
  --output-root data/caseops_export_rebuilt \
  --tokenizer-config tokenizer_specs_export_caseops_v1_reserved_only.json \
  --max-train-shards 80
```
That program imports the lossless transform implementation from `lossless_caps.py`, trains the SentencePiece model with the case-ops transform, exports the 80 train shards, exports the validation shard, and writes the validation byte-sidecar needed for exact BPB scoring.
## Included Files

- `train_gpt.py`
- `requirements.txt`
- `README.md`
- `cached_challenge_fineweb.py`
- `download_hf_docs_and_tokenize.py`
- `lossless_caps.py`
- `tokenizer_specs_export_caseops_v1_reserved_only.json`
- `train_seed0.log`
- `train_seed42.log`
- `train_seed1234.log`
Lines changed: 163 additions & 0 deletions
import argparse
import json
import os
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download


REPO_ID = os.environ.get("MATCHED_FINEWEB_REPO_ID", "willdepueoai/parameter-golf")
REMOTE_ROOT_PREFIX = os.environ.get("MATCHED_FINEWEB_REMOTE_ROOT_PREFIX", "datasets")
ROOT = Path(__file__).resolve().parent
DATASETS_DIR = ROOT / "datasets"
TOKENIZERS_DIR = ROOT / "tokenizers"


def dataset_dir_for_variant(name: str) -> str:
    if name == "byte260":
        return "fineweb10B_byte260"
    if name.startswith("sp"):
        # Covers plain sp<VOCAB_SIZE> names as well as suffixed variants such
        # as sp8192_lossless_caps_caseops_v1_reserved.
        return f"fineweb10B_{name}"
    raise ValueError(f"unsupported variant {name!r}; expected byte260 or sp<VOCAB_SIZE>")


def local_path_for_remote(relative_path: str) -> Path:
    remote_path = Path(relative_path)
    if REMOTE_ROOT_PREFIX and remote_path.parts[:1] == (REMOTE_ROOT_PREFIX,):
        remote_path = remote_path.relative_to(REMOTE_ROOT_PREFIX)
    if remote_path.parts[:1] == ("datasets",):
        return DATASETS_DIR.joinpath(*remote_path.parts[1:])
    if remote_path.parts[:1] == ("tokenizers",):
        return TOKENIZERS_DIR.joinpath(*remote_path.parts[1:])
    return ROOT / remote_path


def get(relative_path: str) -> None:
    destination = local_path_for_remote(relative_path)
    if destination.exists():
        return
    if destination.is_symlink():
        # exists() is False for a broken symlink; clear it before re-downloading.
        destination.unlink()

    remote_path = Path(relative_path)
    cached_path = Path(
        hf_hub_download(
            repo_id=REPO_ID,
            filename=remote_path.name,
            subfolder=remote_path.parent.as_posix() if remote_path.parent != Path(".") else None,
            repo_type="dataset",
        )
    )
    # HF cache entries may be snapshot symlinks. Resolve to the underlying blob so we
    # always materialize a real file in data/, not a broken relative symlink.
    cached_source = cached_path.resolve(strict=True)
    destination.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(cached_source, destination)
    except OSError:
        # Hard links can fail across filesystems; fall back to a plain copy.
        shutil.copy2(cached_source, destination)


def manifest_path() -> Path:
    return local_path_for_remote(f"{REMOTE_ROOT_PREFIX}/manifest.json")


def load_manifest(*, skip_manifest_download: bool) -> dict:
    path = manifest_path()
    if not path.is_file():
        if skip_manifest_download:
            raise FileNotFoundError(
                f"manifest.json is required for manifest-driven shard counts but is not present locally at {path}"
            )
        get(f"{REMOTE_ROOT_PREFIX}/manifest.json")
    return json.loads(path.read_text(encoding="utf-8"))


def artifact_paths_for_tokenizer(tokenizer_entry: dict) -> list[str]:
    artifacts = []
    for key in ("model_path", "vocab_path", "path"):
        value = tokenizer_entry.get(key)
        if value:
            artifacts.append(str(value))
    if not artifacts:
        raise ValueError(f"tokenizer entry is missing downloadable artifacts: {tokenizer_entry}")
    return artifacts


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Download challenge FineWeb shards from Hugging Face")
    parser.add_argument(
        "train_shards_positional",
        nargs="?",
        type=int,
        default=None,
        help=argparse.SUPPRESS,
    )
    parser.add_argument(
        "--train-shards",
        type=int,
        default=80,
        help="Number of training shards to download for the selected variant. Defaults to 80.",
    )
    parser.add_argument(
        "--variant",
        default="sp1024",
        help="Tokenizer family to download, for example sp1024, sp4096, or byte260.",
    )
    parser.add_argument(
        "--skip-manifest",
        action="store_true",
        help="Skip downloading manifest.json.",
    )
    parser.add_argument(
        "--with-docs",
        action="store_true",
        help="Also download docs_selected.jsonl and its sidecar for tokenizer retraining or dataset re-export.",
    )
    return parser


def main() -> None:
    args = build_parser().parse_args()
    dataset_dir = dataset_dir_for_variant(args.variant)
    train_shards = args.train_shards_positional if args.train_shards_positional is not None else args.train_shards
    if train_shards < 0:
        raise ValueError("train_shards must be non-negative")

    manifest = load_manifest(skip_manifest_download=args.skip_manifest)
    dataset_entry = next((x for x in manifest.get("datasets", []) if x.get("name") == dataset_dir), None)
    if dataset_entry is None:
        raise ValueError(f"dataset {dataset_dir} not found in {REMOTE_ROOT_PREFIX}/manifest.json")
    stats = dataset_entry.get("stats") or {}
    # Guard against missing stats entries: int(None) would raise TypeError.
    max_train_shards = int(stats.get("files_train") or 0)
    val_shards = int(stats.get("files_val") or 0)
    if train_shards > max_train_shards:
        raise ValueError(
            f"{args.variant} only has {max_train_shards} training shards on {REPO_ID}, requested {train_shards}"
        )
    tokenizer_name = dataset_entry.get("tokenizer_name")
    tokenizer_entry = next((x for x in manifest.get("tokenizers", []) if x.get("name") == tokenizer_name), None)
    if tokenizer_entry is None:
        raise ValueError(f"tokenizer {tokenizer_name} not found in {REMOTE_ROOT_PREFIX}/manifest.json")

    if args.with_docs:
        get(f"{REMOTE_ROOT_PREFIX}/docs_selected.jsonl")
        get(f"{REMOTE_ROOT_PREFIX}/docs_selected.source_manifest.json")

    dataset_prefix = f"{REMOTE_ROOT_PREFIX}/datasets/{dataset_dir}"
    for i in range(val_shards):
        get(f"{dataset_prefix}/fineweb_val_{i:06d}.bin")
    val_bytes_glob = dataset_entry.get("val_bytes_glob")
    if val_bytes_glob:
        for i in range(val_shards):
            get(f"{dataset_prefix}/fineweb_val_bytes_{i:06d}.bin")
    for i in range(train_shards):
        get(f"{dataset_prefix}/fineweb_train_{i:06d}.bin")

    for artifact_path in artifact_paths_for_tokenizer(tokenizer_entry):
        get(f"{REMOTE_ROOT_PREFIX}/{artifact_path}")


if __name__ == "__main__":
    main()
