Non-record: ByteJEPA — True Byte-Level JEPA (val_bpb 1.3496) #1443

Merged
valerio-oai merged 1 commit into openai:main from hardik-bhadani-git:bytejepa-submission on May 3, 2026

Conversation

Contributor

@hardik-bhadani-git hardik-bhadani-git commented Apr 7, 2026

Summary

  • True byte-level JEPA (Joint-Embedding Predictive Architecture) with no tokenizer (vocab_size=256)
  • Three-stage training: pure JEPA pretraining (10%) → linear bridge JEPA→CE (70%) → pure CE + SWA (20%); loss-weight schedule sketched after this list
  • SIGReg anti-collapse regularizer replaces EMA target encoder, saving 2x memory
  • Directly addresses the open JEPA bounty in the Requests for PRs section
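
A minimal sketch of the 10%/70%/20% loss-weight schedule referenced above. Only the stage split comes from this PR; the helper name and the way the two losses are combined are illustrative assumptions, not the submission's actual code.

```python
# Illustrative sketch of the three-stage JEPA -> bridge -> CE schedule.
# Only the 10% / 70% / 20% split is taken from the PR description.

def loss_weights(step: int, total_steps: int) -> tuple[float, float]:
    """Return (w_jepa, w_ce) for the current training step."""
    frac = step / total_steps
    if frac < 0.10:                   # stage 1: pure JEPA pretraining
        return 1.0, 0.0
    if frac < 0.80:                   # stage 2: linear bridge JEPA -> CE
        t = (frac - 0.10) / 0.70      # ramps 0 -> 1 across the bridge
        return 1.0 - t, t
    return 0.0, 1.0                   # stage 3: pure CE (+ SWA averaging)

# In the training loop (sketch):
#   w_jepa, w_ce = loss_weights(step, total_steps)
#   loss = w_jepa * jepa_loss + w_ce * ce_loss
```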

Results

Metric                   Value
Post-quant val_bpb       1.3496
Pre-quant val_bpb        1.3429
Total size (int8+zlib)   15,132,832 bytes (< 16MB)
GPU                      1x RTX 5090
Wallclock                3600s (1 hour)
Steps                    4141

Why this matters

This is the first submission to successfully change the core learning objective from token-prediction to representation-prediction (JEPA). Byte-level models have an inherent context disadvantage (~4.7x less effective context than SP1024), so the BPB won't match tokenized submissions, but the approach demonstrates that latent-space prediction can drive meaningful language model training.
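
For readers new to the objective, below is a minimal sketch of what a JEPA-style latent-prediction loss with a variance/covariance anti-collapse term could look like. The class names, predictor shape, and the regularizer (a VICReg-style stand-in) are illustrative assumptions; the submission's SIGReg formulation may differ.

```python
# Sketch of a JEPA-style latent-prediction objective for a byte-level model.
# Names, predictor shape, and the anti-collapse term are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Predictor(nn.Module):
    """Lightweight MLP that maps context representations to predicted
    target-byte representations."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def anti_collapse(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """VICReg-style variance/covariance penalty on (N, dim) embeddings,
    standing in for SIGReg (no EMA target encoder needed)."""
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = F.relu(1.0 - std).mean()               # keep per-dim variance up
    cov = (z.T @ z) / max(z.shape[0] - 1, 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / z.shape[1]     # decorrelate dimensions
    return var_loss + cov_loss

def jepa_loss(context_h: torch.Tensor, target_h: torch.Tensor,
              predictor: Predictor, reg_weight: float = 0.1) -> torch.Tensor:
    """Predict target representations from context representations in latent
    space; collapse is prevented by the regularizer rather than an EMA encoder."""
    pred = predictor(context_h)
    pred_loss = F.smooth_l1_loss(pred, target_h.detach())
    return pred_loss + reg_weight * anti_collapse(pred.reshape(-1, pred.shape[-1]))
```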

Test plan

  • Training completes successfully on RTX 5090
  • Post-quant int8+zlib roundtrip produces valid val_bpb (roundtrip sketched after this list)
  • Total artifact size under 16MB cap
  • Train log included
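
The int8+zlib roundtrip and the 16MB size check referenced in the test plan could look roughly like the sketch below, assuming simple symmetric per-tensor scales; the submission's actual quantization and packing format may differ.

```python
# Sketch of an int8 + zlib roundtrip and size check (per-tensor symmetric
# scales are an assumption, not necessarily the submission's scheme).
import zlib
import torch

def quantize_state_dict(state_dict):
    """int8-quantize float tensors with a symmetric per-tensor scale."""
    packed = {}
    for name, t in state_dict.items():
        if t.is_floating_point():
            scale = t.abs().max().clamp(min=1e-8) / 127.0
            q = torch.round(t / scale).clamp(-127, 127).to(torch.int8)
            packed[name] = (q, scale)
        else:
            packed[name] = (t, None)        # keep int/bool buffers as-is
    return packed

def dequantize(packed):
    """Invert the roundtrip for post-quant evaluation."""
    return {name: (q.float() * scale if scale is not None else q)
            for name, (q, scale) in packed.items()}

def compressed_size_bytes(packed) -> int:
    """Rough artifact size: zlib over the concatenated tensor payloads
    (ignores scales and metadata, so it slightly underestimates)."""
    blob = b"".join(q.cpu().numpy().tobytes() for q, _ in packed.values())
    return len(zlib.compress(blob, level=9))

# Usage sketch:
#   packed = quantize_state_dict(model.state_dict())
#   assert compressed_size_bytes(packed) < 16 * 1024 * 1024   # 16MB cap
#   model.load_state_dict(dequantize(packed))                  # post-quant eval
```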

Pure byte-level Joint-Embedding Predictive Architecture with no tokenizer
(vocab=256). Three-stage training: JEPA pretraining -> bridge -> CE+SWA.
Addresses the open JEPA bounty in the Requests for PRs section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
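
One way the stage-3 "CE + SWA" phase could be implemented is with PyTorch's built-in SWA helper, sketched below with a toy model; the submission may average weights by hand instead.

```python
# Sketch of Stochastic Weight Averaging during the final CE stage, assuming
# torch.optim.swa_utils is used (an assumption; the PR may hand-roll this).
import torch
from torch import nn, optim
from torch.optim.swa_utils import AveragedModel

model = nn.Linear(8, 8)                      # stand-in for the byte-level GPT
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
swa_model = AveragedModel(model)             # maintains a running weight average

for _ in range(10):                          # stand-in for stage-3 CE steps
    x, y = torch.randn(4, 8), torch.randn(4, 8)
    loss = nn.functional.mse_loss(model(x), y)   # cross-entropy in the real model
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    swa_model.update_parameters(model)       # fold current weights into the average

final_state = swa_model.module.state_dict()  # averaged weights get quantized/evaluated
```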
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 7, 2026
…ramLite reversal, new directions

Subagent re-verified the 3 still-novel patches (TabHash, GatedAttention, MTP)
against the latest 25 open PRs. Zero hits — they remain uncontested, even
though only MTP shows marginal training-loss benefit at our scale.

EngramLite (Patch 22) verdict SOFT-REVERSED: EL2 cycle-2 = 3.2742, only
+0.0008 above champion. Tied within noise, not falsified.

Spend ~$1.40 / $36 (6% utilization). Pod healthy.

New comp directions worth considering for next research fire: Per-Sample
SLOT (legal variant of suspicious PR openai#1430), Codebook VQ compression
(PR openai#1433), ByteJEPA (PR openai#1443 — non-competitive but novel category).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-record: ByteJEPA — True Byte-Level JEPA (val_bpb 1.3496)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Analysis

PR #1443 (2026-03-30_ByteJEPA) introduces a ByteJEPA architecture — a byte-level GPT with a three-stage training curriculum: pure JEPA pretraining, a linear JEPA→CE bridge, and pure CE fine-tuning with Stochastic Weight Averaging (SWA).

### N-gram / Hash Bug Check

No hash tables, XOR operations, or n-gram structures of any kind are present in the model or training code. The only occurrence of ^ in the file is in a comment (line 576: "relu^2 MLP"). No BigramHash, no TrigramHash, no target-XOR-into-hash-key pattern. CLEAR.

### Pre-Quant TTT Check

val_tokens are used exclusively inside eval_val() (lines 200–247), which runs under model.eval() + torch.inference_mode() with .detach() on the batch loss (line 234). There is no optimizer, no backward(), no zero_grad(), no AdamW step touching val tokens at any point. Val tokens are never passed to the training loop. No TTT of any kind. CLEAR.

### Score-First TTT Check (PR #1413 pattern)

No is_last_chunk guard, no scored-region conditional training, no score-first pattern. This submission does not implement TTT at all. N/A.

### Scored-Region SLOT Check

No evidence of a scored-region SLOT structure — no special handling of the held-out scored window, no per-chunk routing logic. CLEAR.

### Architecture Summary

- Pure transformer GPT (byte-level, vocab=256)
- JEPA auxiliary objective using a lightweight Predictor MLP and SIGReg variance/covariance regularizer
- Three-stage training: JEPA → bridge → CE+SWA
- Val evaluation is read-only inference throughout
- Quantization: int8 + zlib compression post-training

This is a clean pure-neural submission. All training happens on train tokens only; val...

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
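
The read-only evaluation pattern described in the audit (model.eval() plus torch.inference_mode(), no optimizer contact with val tokens) would look roughly like the sketch below. This is illustrative only, not the submission's eval_val().

```python
# Sketch of a read-only bits-per-byte eval loop matching the audited pattern:
# eval() + inference_mode(), no gradients, no optimizer touches val tokens.
import math
import torch
import torch.nn.functional as F

@torch.inference_mode()
def eval_val_bpb(model, val_batches) -> float:
    model.eval()
    total_loss, total_bytes = 0.0, 0
    for x, y in val_batches:                 # x, y: byte-id tensors of shape (B, T)
        logits = model(x)                    # (B, T, 256) byte-level logits
        loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
        total_loss += loss.item() * y.numel()
        total_bytes += y.numel()
    model.train()
    return (total_loss / total_bytes) / math.log(2)   # nats/byte -> bits/byte
```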

Contributor

@valerio-oai valerio-oai left a comment


Selected for the notable non-record submissions section.

@valerio-oai valerio-oai merged commit 4908dc4 into openai:main May 3, 2026