Four new env-gated hyperparameters; each defaults to a no-op, so spec 008
behavior is byte-identical when the vars are unset:
- WD_TAPER_START_FRAC / WD_TAPER_FINAL_MULT (port of openai#1729): linear Muon
  WD taper from 1.0 at start_step down to final_mult at h.iterations, applied
  in step_fn before optimizers.step. Adam/embed WD are untouched, per openai#1729.
- MUON_GRAD_POWER (port of openai#1682): g = sign(g) * |g|^p, applied to Muon
  gradients just before the momentum buffer update. Covers both the sharded
  and non-sharded paths.
- QK_GAIN_INIT (existing): already present; its default is unchanged. Setting
  QK_GAIN_INIT=2.5 at runtime gives uniformly softer attention, per the
  convergence finding in openai#1648.
- QK_GAIN_PER_LAYER (new): comma-separated list that overrides each block's
  attn.q_gain after block construction; validated to match num_layers.
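The WD taper above can be sketched roughly as follows. This is a hypothetical
reconstruction: the function name, the `step_fn` integration point, and the
default values are my assumptions; only the env var names and the linear
1.0 → final_mult schedule come from this commit.

```python
import os

def muon_wd_multiplier(step: int, total_steps: int) -> float:
    """Linear Muon weight-decay taper.

    Returns 1.0 up to start_step = WD_TAPER_START_FRAC * total_steps,
    then interpolates linearly down to WD_TAPER_FINAL_MULT at
    total_steps. With both env vars unset the result is always 1.0,
    i.e. a no-op, matching the byte-identical-by-default guarantee.
    """
    start_frac = float(os.environ.get("WD_TAPER_START_FRAC", "1.0"))
    final_mult = float(os.environ.get("WD_TAPER_FINAL_MULT", "1.0"))
    start_step = start_frac * total_steps
    if step <= start_step or total_steps <= start_step:
        return 1.0
    frac = min((step - start_step) / (total_steps - start_step), 1.0)
    return 1.0 + frac * (final_mult - 1.0)
```

In step_fn the multiplier would scale only the Muon groups' weight decay
before optimizers.step, leaving the Adam/embed WD untouched as stated above.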
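The MUON_GRAD_POWER transform is elementwise, so a scalar sketch captures it;
the real optimizer would apply the same map to whole gradient tensors on both
the sharded and non-sharded paths, right before the momentum buffer update.
The helper name is an assumption.

```python
import math
import os

def grad_power(g: float) -> float:
    """Elementwise g -> sign(g) * |g|^p with p = MUON_GRAD_POWER.

    Unset (or p == 1.0) leaves the gradient untouched, so the default
    is a no-op. math.copysign supplies the sign(g) factor, so negative
    gradients keep their sign after the |g|^p magnitude transform.
    """
    p = float(os.environ.get("MUON_GRAD_POWER", "1.0"))
    if p == 1.0:
        return g
    return math.copysign(abs(g) ** p, g)
```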
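The two QK-gain knobs could be resolved together roughly like this. The helper
name, error message, and fallback default of 1.0 are assumptions; the env var
names, the comma-separated format, the attn.q_gain target, and the num_layers
validation come from the commit.

```python
import os

def resolve_qk_gains(num_layers: int) -> list[float]:
    """Per-layer q_gain values to assign after block construction.

    QK_GAIN_PER_LAYER, if set, must supply exactly num_layers
    comma-separated floats; otherwise every layer falls back to the
    uniform QK_GAIN_INIT value (assumed to default to 1.0 here).
    """
    per_layer = os.environ.get("QK_GAIN_PER_LAYER", "")
    if per_layer:
        gains = [float(x) for x in per_layer.split(",")]
        if len(gains) != num_layers:
            raise ValueError(
                f"QK_GAIN_PER_LAYER has {len(gains)} entries, "
                f"expected num_layers={num_layers}"
            )
        return gains
    init = float(os.environ.get("QK_GAIN_INIT", "1.0"))
    return [init] * num_layers

# Hypothetical application after the blocks are built:
#   for block, gain in zip(model.blocks, resolve_qk_gains(len(model.blocks))):
#       block.attn.q_gain = gain
```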
Also: one startup log line echoing the four values for post-hoc verification.
Spec: research/specs/012-training-bundle.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>