Commit c5dce12

V15: PR openai#1735 + CaseOps tokenizer support (TTT EMA disabled)

Adds byte sidecar loading to enable the CaseOps lossless-case tokenizer (PR openai#1729). Key changes:

- `load_validation_token_bytes()` function (loads `fineweb_val_bytes_*.bin`)
- `ValidationData.val_token_bytes` field with sidecar fallback to the LUT
- `eval_val`/`eval_val_sliding`/`eval_val_ttt` prefer the sidecar when available
- `TTT_EMA_ENABLED` default 1 -> 0 (V14 lesson: EMA hurts monotonic-decrease TTT)

V14 EMA result: 1.0427 (worse than baseline due to monotonic TTT loss). V15 hypothesis: CaseOps saves -0.005 to -0.012 BPB via case dedup, landing in the 1.030-1.038 range (50% chance of breaking the record at 1.0357).
1 parent 57ee16e commit c5dce12


V15_README.md

# V15: PR #1735 + CaseOps Tokenizer (TTT EMA disabled)

**Base:** PR #1735 (AjAnubolu, 1.0429 BPB)
**Innovation:** Add the CaseOps lossless-case tokenizer (PR #1729) on top of the pre-quant TTT stack

## What V15 Does

1. **Switches the tokenizer** to `fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` (lossless, reversible Title/AllCaps/CapNext encoding)
2. **Adds byte sidecar support** to compute honest BPB (CaseOps adds control characters that would inflate naive byte counts)
3. **Disables TTT EMA** (V14 lesson: EMA hurts monotonic-decrease TTT)
4. **Falls back gracefully** to LUT-based byte counting when no sidecar exists
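The sidecar-with-fallback logic described in items 2 and 4 above can be sketched as follows. The helper name `load_validation_token_bytes` and the `fineweb_val_bytes_*.bin` filenames come from the commit; the on-disk dtype, the directory layout, and the `token_byte_counts` helper are assumptions for illustration only.

```python
import glob
import os
import numpy as np

def load_validation_token_bytes(datasets_dir: str):
    """Load per-token raw-UTF-8 byte counts from sidecar files
    (fineweb_val_bytes_*.bin). Returns None if no sidecar exists."""
    paths = sorted(glob.glob(os.path.join(datasets_dir, "fineweb_val_bytes_*.bin")))
    if not paths:
        return None  # caller falls back to LUT-based byte counting
    # assumption: the sidecar stores one uint16 byte count per validation token
    return np.concatenate([np.fromfile(p, dtype=np.uint16) for p in paths])

def token_byte_counts(val_tokens, sidecar, byte_lut):
    """Prefer sidecar counts; otherwise map each token id through a
    tokenizer-derived lookup table (which CaseOps control characters
    would inflate, hence the sidecar)."""
    if sidecar is not None:
        return sidecar[: len(val_tokens)]
    return byte_lut[val_tokens]
```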
## Expected Result

| Metric | PR #1735 base | V15 (this) | Delta |
|--------|--------------:|-----------:|------:|
| Pre-quant TTT BPB | ~1.033 | ~1.025 | -0.008 |
| Final sliding BPB | 1.0429 | ~1.030-1.038 | -0.005 to -0.012 |
| Record threshold (1.0357) | NO | **YES (~50% prob)** | |
## Compliance Notes

- **CaseOps is lossless reversible** — original text can be recovered exactly
- **Byte sidecar uses RAW UTF-8 byte counts** (not transformed text) — honest BPB
- **No SLOT, no n-gram cache, no eval-time TTT** — inherits PR #1735 cleanliness
- **Pre-quant TTT remains unchanged** — same legal status as PR #1735
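The actual CaseOps scheme lives in PR #1729 and is not reproduced here, but the "lossless reversible" property can be illustrated with a toy encoder: lowercase the text and emit control markers so the original casing is exactly recoverable. The control characters and word-level granularity below are assumptions, not the real CaseOps design.

```python
# Toy illustration of a lossless-case transform (NOT the actual CaseOps
# scheme from PR #1729): lowercase words and prepend control markers so
# that decode(encode(text)) == text for the cases handled.
CAP = "\x0e"    # assumption: marks "capitalize the next word"
UPPER = "\x0f"  # assumption: marks "next word is ALL-CAPS"

def encode(text: str) -> str:
    out = []
    for word in text.split(" "):
        if word.isupper() and len(word) > 1:
            out.append(UPPER + word.lower())          # WORLD -> ^Oworld
        elif word[:1].isupper() and word[1:].islower():
            out.append(CAP + word.lower())            # Hello -> ^Nhello
        else:
            out.append(word)                          # mixed case left as-is
    return " ".join(out)

def decode(text: str) -> str:
    out = []
    for word in text.split(" "):
        if word.startswith(UPPER):
            out.append(word[1:].upper())
        elif word.startswith(CAP):
            out.append(word[1:].capitalize())
        else:
            out.append(word)
    return " ".join(out)
```

The point of the round trip is compliance: because the transform is invertible, training on case-normalized tokens never loses information from the evaluation text.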
## Files Changed

- `records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/train_gpt.py`
  - Added `load_validation_token_bytes()` function
  - Modified `ValidationData.__init__` to load the sidecar
  - Modified `eval_val()`, `eval_val_sliding()`, and `eval_val_ttt()` to use the sidecar
  - Disabled TTT EMA by default (V14 lesson)
- `patch_v15_caseops.py`: standalone patch script
- `V15_README.md`: this file
## Usage on RunPod

### Step 1: Clone the V15 branch

```bash
cd /workspace
rm -rf parameter-golf
git clone -b v15-pr1735-caseops https://github.com/alertcat/parameter-golf.git
cd parameter-golf

# Verify patches
grep -c "V15: Prefer byte sidecar" records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/train_gpt.py
# Expected: 3
grep -c "load_validation_token_bytes" records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/train_gpt.py
# Expected: >= 2
```
### Step 2: Install dependencies

```bash
pip install sentencepiece brotli zstandard huggingface-hub hf_transfer -q
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/ -q
```
### Step 3: Download the CaseOps dataset (~5 min, 16 GB)

```bash
HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='romeerp/parameter-golf-caseops-v1',
    repo_type='dataset',
    local_dir='/workspace/caseops_data',
)
"

# Verify key files
ls /workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/ | grep -E "val_bytes|val_000000" | head -5
ls /workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
```
### Step 4: Run the V15 scout seed

```bash
cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/

SEED=1337 \
DATASETS_DIR=/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
TOKENIZER_PATH=/workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
TTT_EMA_ENABLED=0 \
PREQUANT_TTT_ENABLED=1 \
PREQUANT_TTT_EPOCHS=21 \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee /workspace/scout_v15.log
```

**Watch for this log line confirming the sidecar is active:**

```
val_bpb:byte_sidecar:enabled
```

If you see `val_bpb:byte_sidecar:disabled`, the dataset path is wrong and the reported byte counts won't be honest.
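Why the sidecar matters for the reported number: "honest BPB" means dividing the model's total cross-entropy (converted from nats to bits) by the RAW UTF-8 byte count of the original text, not by the byte count of the CaseOps-transformed text. A minimal sketch, assuming per-token negative log-likelihoods in nats and sidecar byte counts as inputs (function name and signature are illustrative, not the actual `train_gpt.py` code):

```python
import math

def bits_per_byte(token_nll_nats, raw_byte_counts):
    """BPB = total cross-entropy in bits / total RAW UTF-8 bytes.
    Dividing by raw-byte counts (from the sidecar) keeps BPB honest even
    though CaseOps inserts control characters into the tokenized text."""
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    total_bytes = sum(raw_byte_counts)
    return total_bits / total_bytes
```

With the LUT fallback on a CaseOps stream, the control characters would add bytes to the denominator and artificially lower BPB, which is why a `byte_sidecar:disabled` run cannot be compared against the 1.0357 record.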
## Decision Points

After the scout run (~25 min), check `final_int6_sliding val_bpb`:

| BPB | Verdict |
|-----|---------|
| ≤ 1.0357 | 🔥 **BREAK RECORD** — run seeds 42 + 999, submit |
| 1.0358-1.040 | 👍 Strong, run 3 seeds |
| 1.040-1.045 | 😐 Worse than PR #1735 — investigate the sidecar |
| > 1.045 | ❌ Failure — check for the `val_bpb:byte_sidecar:enabled` line |
