# V15: PR #1735 + CaseOps Tokenizer (TTT EMA disabled)

**Base:** PR #1735 (AjAnubolu, 1.0429 BPB)
**Innovation:** Add CaseOps lossless-case tokenizer (PR #1729) on top of pre-quant TTT stack

## What V15 Does

1. **Switches tokenizer** to `fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` (lossless, reversible Title/AllCaps/CapNext encoding)
2. **Adds byte sidecar support** to compute honest BPB (CaseOps adds control characters that would inflate naive byte counts)
3. **Disables TTT EMA** (V14 lesson: EMA hurts monotonic-decrease TTT)
4. **Falls back gracefully** to LUT-based byte counting when no sidecar exists

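The sidecar-vs-LUT byte accounting in items 2 and 4 can be sketched as follows. This is a minimal illustration, not the actual `train_gpt.py` code; the function names and the assumption that the sidecar is a flat array of per-position byte counts are mine:

```python
import math

def token_byte_count(token_ids, sidecar=None, byte_lut=None):
    """Raw UTF-8 byte count for a validation slice.

    Prefer the sidecar (exact per-position byte counts of the ORIGINAL
    text, so CaseOps control characters never inflate the denominator);
    fall back to a per-token-id LUT of surface-form byte lengths.
    """
    if sidecar is not None:
        return sum(sidecar[: len(token_ids)])
    return sum(byte_lut[t] for t in token_ids)

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """BPB = total cross-entropy converted to bits / raw byte count."""
    return mean_loss_nats * n_tokens / (math.log(2) * n_bytes)
```

Sanity check: at a mean loss of ln 2 nats per token and one byte per token, this gives exactly 1.0 BPB.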
## Expected Result

| Metric | PR #1735 base | V15 (this) | Delta |
|--------|--------------:|-----------:|------:|
| Pre-quant TTT BPB | ~1.033 | ~1.025 | -0.008 |
| Final sliding BPB | 1.0429 | ~1.030-1.038 | -0.005 to -0.012 |
| Record threshold (1.0357) | NO | **YES (~50% prob)** | |

## Compliance Notes

- **CaseOps is lossless reversible** — original text can be recovered exactly
- **Byte sidecar uses RAW UTF-8 byte counts** (not transformed text) — honest BPB
- **No SLOT, no n-gram cache, no eval-time TTT** — inherits PR #1735 cleanliness
- **Pre-quant TTT remains unchanged** — same legal status as PR #1735

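To illustrate the reversibility claim, here is a toy version of a case-ops transform (my own sketch; the real PR #1729 tokenizer reserves dedicated tokens and also handles Title/AllCaps spans, which this does not):

```python
MARK = "\x0e"  # hypothetical "capitalize next char" control byte

def encode(text):
    """Lowercase the text, marking each originally-uppercase char."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(MARK)
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def decode(text):
    """Invert encode(): uppercase the char following each marker."""
    out, upper_next = [], False
    for ch in text:
        if ch == MARK:
            upper_next = True
        elif upper_next:
            out.append(ch.upper())
            upper_next = False
        else:
            out.append(ch)
    return "".join(out)
```

Note that `encode()` makes the stream longer than the original, which is exactly why a naive byte count over the transformed text would inflate BPB, and why the sidecar counts the bytes of the original text instead.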
## Files Changed

- `records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/train_gpt.py`
  - Added `load_validation_token_bytes()` function
  - Modified `ValidationData.__init__` to load sidecar
  - Modified `eval_val()` to use sidecar
  - Modified `eval_val_sliding()` to use sidecar
  - Modified `eval_val_ttt()` to use sidecar
  - Disabled TTT EMA by default (V14 lesson)
- `patch_v15_caseops.py`: standalone patch script
- `V15_README.md`: this file

## Usage on RunPod

### Step 1: Clone V15 branch

```bash
cd /workspace
rm -rf parameter-golf
git clone -b v15-pr1735-caseops https://github.com/alertcat/parameter-golf.git
cd parameter-golf

# Verify patches
grep -c "V15: Prefer byte sidecar" records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/train_gpt.py
# Expected: 3
grep -c "load_validation_token_bytes" records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/train_gpt.py
# Expected: >= 2
```

### Step 2: Install deps

```bash
pip install sentencepiece brotli zstandard huggingface-hub hf_transfer -q
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/ -q
```

### Step 3: Download CaseOps dataset (~5 min, 16 GB)

```bash
HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='romeerp/parameter-golf-caseops-v1',
    repo_type='dataset',
    local_dir='/workspace/caseops_data',
)
"

# Verify key files
ls /workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/ | grep -E "val_bytes|val_000000" | head -5
ls /workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
```

### Step 4: Run V15 scout seed

```bash
cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/

SEED=1337 \
  DATASETS_DIR=/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
  TOKENIZER_PATH=/workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  TTT_EMA_ENABLED=0 \
  PREQUANT_TTT_ENABLED=1 \
  PREQUANT_TTT_EPOCHS=21 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee /workspace/scout_v15.log
```

**Watch for this log line confirming the sidecar is active:**
```
val_bpb:byte_sidecar:enabled
```

If you instead see `val_bpb:byte_sidecar:disabled`, the dataset path is wrong and the reported BPB will not be honest.

## Decision Points

After the scout run (~25 min), check `final_int6_sliding val_bpb`:

| BPB | Verdict |
|-----|---------|
| ≤ 1.0357 | 🔥 **BREAK RECORD** — run seeds 42 + 999, submit |
| 1.0358-1.0400 | 👍 Strong, run 3 seeds |
| 1.0401-1.0450 | 😐 Worse than PR #1735 — investigate sidecar |
| > 1.0450 | ❌ Failure — check for the `val_bpb:byte_sidecar:enabled` line |
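
The decision table can be encoded as a small helper for scripting seed sweeps (thresholds are taken from the table above; the label strings are mine):

```python
def verdict(bpb, record=1.0357):
    """Map a final_int6_sliding val_bpb to the decision table."""
    if bpb <= record:
        return "break record: run seeds 42 + 999, submit"
    if bpb <= 1.040:
        return "strong: run 3 seeds"
    if bpb <= 1.045:
        return "worse than PR #1735: investigate sidecar"
    return "failure: check the byte_sidecar log line"
```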