Non-record: ByteJEPA — True Byte-Level JEPA (val_bpb 1.3496) #1443

Merged
valerio-oai merged 1 commit into openai:main from hardik-bhadani-git:bytejepa-submission on May 3, 2026

Conversation

Contributor

@hardik-bhadani-git hardik-bhadani-git commented Apr 7, 2026

Summary

  • True byte-level JEPA (Joint-Embedding Predictive Architecture) with no tokenizer (vocab_size=256)
  • Three-stage training: pure JEPA pretraining (10%) → linear bridge JEPA→CE (70%) → pure CE + SWA (20%); loss-weight schedule sketched after this list
  • SIGReg anti-collapse regularizer replaces EMA target encoder, saving 2x memory
  • Directly addresses the open JEPA bounty in the Requests for PRs section
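
A minimal sketch of the 10%/70%/20% loss-weight schedule referenced above. Only the stage split comes from this PR; the helper name and the way the two losses are combined are illustrative assumptions, not the submission's actual code.

```python
# Illustrative sketch of the three-stage JEPA -> bridge -> CE schedule.
# Only the 10% / 70% / 20% split is taken from the PR description.

def loss_weights(step: int, total_steps: int) -> tuple[float, float]:
    """Return (w_jepa, w_ce) for the current training step."""
    frac = step / total_steps
    if frac < 0.10:                   # stage 1: pure JEPA pretraining
        return 1.0, 0.0
    if frac < 0.80:                   # stage 2: linear bridge JEPA -> CE
        t = (frac - 0.10) / 0.70      # ramps 0 -> 1 across the bridge
        return 1.0 - t, t
    return 0.0, 1.0                   # stage 3: pure CE (+ SWA averaging)

# In the training loop (sketch):
#   w_jepa, w_ce = loss_weights(step, total_steps)
#   loss = w_jepa * jepa_loss + w_ce * ce_loss
```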

Results

Metric                   Value
Post-quant val_bpb       1.3496
Pre-quant val_bpb        1.3429
Total size (int8+zlib)   15,132,832 bytes (< 16MB)
GPU                      1x RTX 5090
Wallclock                3600s (1 hour)
Steps                    4141

Why this matters

This is the first submission to successfully change the core learning objective from token-prediction to representation-prediction (JEPA). Byte-level models have an inherent context disadvantage (~4.7x less effective context than SP1024), so the BPB won't match tokenized submissions, but the approach demonstrates that latent-space prediction can drive meaningful language model training.
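
For readers new to the objective, below is a minimal sketch of what a JEPA-style latent-prediction loss with a variance/covariance anti-collapse term could look like. The class names, predictor shape, and the regularizer (a VICReg-style stand-in) are illustrative assumptions; the submission's SIGReg formulation may differ.

```python
# Sketch of a JEPA-style latent-prediction objective for a byte-level model.
# Names, predictor shape, and the anti-collapse term are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    """Lightweight MLP that maps context representations to predicted
    target-byte representations."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def anti_collapse(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """VICReg-style variance/covariance penalty on (N, dim) embeddings,
    standing in for SIGReg (no EMA target encoder needed)."""
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = F.relu(1.0 - std).mean()               # keep per-dim variance up
    cov = (z.T @ z) / max(z.shape[0] - 1, 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / z.shape[1]     # decorrelate dimensions
    return var_loss + cov_loss

def jepa_loss(context_h: torch.Tensor, target_h: torch.Tensor,
              predictor: Predictor, reg_weight: float = 0.1) -> torch.Tensor:
    """Predict target representations from context representations in latent
    space; collapse is prevented by the regularizer rather than an EMA encoder."""
    pred = predictor(context_h)
    pred_loss = F.smooth_l1_loss(pred, target_h.detach())
    return pred_loss + reg_weight * anti_collapse(pred.reshape(-1, pred.shape[-1]))
```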

Test plan

  • Training completes successfully on RTX 5090
  • Post-quant int8+zlib roundtrip produces valid val_bpb (roundtrip sketched after this list)
  • Total artifact size under 16MB cap
  • Train log included
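
The int8+zlib roundtrip and the 16MB size check referenced in the test plan could look roughly like the sketch below, assuming simple symmetric per-tensor scales; the submission's actual quantization and packing format may differ.

```python
# Sketch of an int8 + zlib roundtrip and size check (per-tensor symmetric
# scales are an assumption, not necessarily the submission's scheme).
import zlib
import torch

def quantize_state_dict(state_dict):
    """int8-quantize float tensors with a symmetric per-tensor scale."""
    packed = {}
    for name, t in state_dict.items():
        if t.is_floating_point():
            scale = t.abs().max().clamp(min=1e-8) / 127.0
            q = torch.round(t / scale).clamp(-127, 127).to(torch.int8)
            packed[name] = (q, scale)
        else:
            packed[name] = (t, None)        # keep int/bool buffers as-is
    return packed

def dequantize(packed):
    """Invert the roundtrip for post-quant evaluation."""
    return {name: (q.float() * scale if scale is not None else q)
            for name, (q, scale) in packed.items()}

def compressed_size_bytes(packed) -> int:
    """Rough artifact size: zlib over the concatenated tensor payloads
    (ignores scales and metadata, so it slightly underestimates)."""
    blob = b"".join(q.cpu().numpy().tobytes() for q, _ in packed.values())
    return len(zlib.compress(blob, level=9))

# Usage sketch:
#   packed = quantize_state_dict(model.state_dict())
#   assert compressed_size_bytes(packed) < 16 * 1024 * 1024   # 16MB cap
#   model.load_state_dict(dequantize(packed))                  # post-quant eval
```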

Pure byte-level Joint-Embedding Predictive Architecture with no tokenizer
(vocab=256). Three-stage training: JEPA pretraining -> bridge -> CE+SWA.
Addresses the open JEPA bounty in the Requests for PRs section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
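
One way the stage-3 "CE + SWA" phase could be implemented is with PyTorch's built-in SWA helper, sketched below with a toy model; the submission may average weights by hand instead.

```python
# Sketch of Stochastic Weight Averaging during the final CE stage, assuming
# torch.optim.swa_utils is used (an assumption; the PR may hand-roll this).
import torch
from torch import nn, optim
from torch.optim.swa_utils import AveragedModel

model = nn.Linear(8, 8)                      # stand-in for the byte-level GPT
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
swa_model = AveragedModel(model)             # maintains a running weight average

for _ in range(10):                          # stand-in for stage-3 CE steps
    x, y = torch.randn(4, 8), torch.randn(4, 8)
    loss = nn.functional.mse_loss(model(x), y)   # cross-entropy in the real model
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    swa_model.update_parameters(model)       # fold current weights into the average

final_state = swa_model.module.state_dict()  # averaged weights get quantized/evaluated
```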
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…ramLite reversal, new directions

Subagent re-verified the 3 still-novel patches (TabHash, GatedAttention, MTP)
against the latest 25 open PRs. Zero hits — they remain uncontested, even
though only MTP shows marginal training-loss benefit at our scale.

EngramLite (Patch 22) verdict SOFT-REVERSED: EL2 cycle-2 = 3.2742, only
+0.0008 above champion. Tied within noise, not falsified.

Spend ~$1.40 / $36 (6% utilization). Pod healthy.

New comp directions worth considering for next research fire: Per-Sample
SLOT (legal variant of suspicious PR openai#1430), Codebook VQ compression
(PR openai#1433), ByteJEPA (PR openai#1443 — non-competitive but novel category).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: ByteJEPA — True Byte-Level JEPA (val_bpb 1.3496)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Analysis

PR #1443 (2026-03-30_ByteJEPA) introduces a ByteJEPA architecture — a byte-level GPT with a three-stage training curriculum: pure JEPA pretraining, a linear JEPA→CE bridge, and pure CE fine-tuning with Stochastic Weight Averaging (SWA).

### N-gram / Hash Bug Check

No hash tables, XOR operations, or n-gram structures of any kind are present in the model or training code. The only occurrence of ^ in the file is in a comment (line 576: "relu^2 MLP"). No BigramHash, no TrigramHash, no target-XOR-into-hash-key pattern. CLEAR.

### Pre-Quant TTT Check

val_tokens are used exclusively inside eval_val() (lines 200–247), which runs under model.eval() + torch.inference_mode() with .detach() on the batch loss (line 234). There is no optimizer, no backward(), no zero_grad(), no AdamW step touching val tokens at any point. Val tokens are never passed to the training loop. No TTT of any kind. CLEAR.

### Score-First TTT Check (PR #1413 pattern)

No is_last_chunk guard, no scored-region conditional training, no score-first pattern. This submission does not implement TTT at all. N/A.

### Scored-Region SLOT Check

No evidence of a scored-region SLOT structure — no special handling of the held-out scored window, no per-chunk routing logic. CLEAR.

### Architecture Summary

- Pure transformer GPT (byte-level, vocab=256)
- JEPA auxiliary objective using a lightweight Predictor MLP and SIGReg variance/covariance regularizer
- Three-stage training: JEPA → bridge → CE+SWA
- Val evaluation is read-only inference throughout
- Quantization: int8 + zlib compression post-training

This is a clean pure-neural submission. All training happens on train tokens only; val...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
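
The read-only evaluation pattern described in the audit (model.eval() plus torch.inference_mode(), no optimizer contact with val tokens) would look roughly like the sketch below. This is illustrative only, not the submission's eval_val().

```python
# Sketch of a read-only bits-per-byte eval loop matching the audited pattern:
# eval() + inference_mode(), no gradients, no optimizer touches val tokens.
import math
import torch
import torch.nn.functional as F

@torch.inference_mode()
def eval_val_bpb(model, val_batches) -> float:
    model.eval()
    total_loss, total_bytes = 0.0, 0
    for x, y in val_batches:                 # x, y: byte-id tensors of shape (B, T)
        logits = model(x)                    # (B, T, 256) byte-level logits
        loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
        total_loss += loss.item() * y.numel()
        total_bytes += y.numel()
    model.train()
    return (total_loss / total_bytes) / math.log(2)   # nats/byte -> bits/byte
```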

Contributor

@valerio-oai valerio-oai left a comment


Selected for the notable non-record submissions section.

@valerio-oai valerio-oai merged commit 4908dc4 into openai:main May 3, 2026