Commit 9ca2bf8

leon2k2k2k and claude committed

diary + ideas: recur-alpha findings + beating-1736 analysis

- diary/2026-04-21-recur-alpha-findings.md — full story of specs 015/016 single-seed screens: α trajectories side-by-side, 5 findings (α>1 on pass-2, <1 on pass-3 at depth, depth-monotonicity inverts between passes, plateau is path-dependent, late-training rate unchanged), full caveats section, ranked next steps.
- research/ideas/beating-1736-note.md — four-run throughput + pipeline comparison (008/015/016/openai#1736). Works backward from target 1.06610 to a 0.00183 gap on pre-quant post-EMA; matched-throughput alone gives 3.3× margin over the gap. Risk ranks TTT composition as the one unknown (GPTQ cost is validated at +0.00947 parity). Concludes: a single matched-clock NA run with the bug-fixed TTT pipeline (~$10-15) settles the whole story.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ac4708f commit 9ca2bf8

2 files changed

Lines changed: 259 additions & 0 deletions


diary/2026-04-21-recur-alpha-findings.md — 129 additions & 0 deletions
# 2026-04-21 — Recur-Alpha findings (specs 015 + 016)

Two single-seed screens on learnable per-pass α blending, run earlier today on the #1736 stack. Writing the story up now while fresh; formal evaluations deferred to post-3-seed. A public note may follow.

## 1. What Recur-Alpha is (brief)

In Loop345, layers 3-5 run 3 passes each. Baseline #1736 does unweighted recurrence — every pass fully commits its block output to the residual stream. Recur-Alpha adds a learnable scalar `α` per (extra-pass, looped-layer) position:

```
y = block(x_current)
x_new = α × y + (1 − α) × x_current
```

6 scalars total (2 extra passes × 3 looped layers). At α=0: pure passthrough. At α=1: standard Loop345. At α∈(0,1): partial commitment. At α>1: amplify block, subtract residual.
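A minimal PyTorch-style sketch of the blend, for readers who want the shape of the thing. The module name, the `blocks` wiring, and the pass ordering are illustrative assumptions, not the #1736 implementation:

```python
import torch
import torch.nn as nn

class RecurAlphaLoop(nn.Module):
    """Illustrative per-(extra-pass, looped-layer) alpha blend.

    Hypothetical sketch: `blocks` holds the 3 looped transformer blocks
    (layers 3-5); each extra pass p and looped layer l gets its own
    learnable scalar alpha[p, l].
    """

    def __init__(self, blocks, n_extra_passes=2, alpha_init=0.0):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        # 6 learnable scalars: (2 extra passes) x (3 looped layers)
        self.alpha = nn.Parameter(
            torch.full((n_extra_passes, len(blocks)), alpha_init)
        )

    def forward(self, x):
        # pass 1: standard, fully committed block updates
        for block in self.blocks:
            x = block(x)
        # extra passes 2 and 3: alpha-blended commitment
        for p in range(self.alpha.shape[0]):
            for l, block in enumerate(self.blocks):
                y = block(x)
                a = self.alpha[p, l]
                x = a * y + (1 - a) * x   # a=0: passthrough, a=1: full commit
        return x
```

With `alpha_init=0.0` this corresponds to spec 015's init (pure passthrough at activation); `alpha_init=1.0` corresponds to spec 016.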
**Why we ran it.** #1714 tested this on a pre-#1736 stack and got 1.0857 pre-TTT; their compute ran out before phased-TTT composition. Recur-Alpha's behavior on #1736's full stack was literally unmeasured — and it looked like the strongest remaining same-parent recurrence lever, given how many cousin ideas had already been disproven (#1663, #1726, #1739).

- **Spec 015** (α init = 0): safety-first init. Passthrough at activation; the model learns α upward.
- **Spec 016** (α init = 1): removes the "catch-up handicap" 015 incurs by having to learn α up from 0 at looping activation.
## 2. Results

Training-endpoint val_bpb (post-EMA, pre-GPTQ, no TTT — screening mode):

| run | α init | endpoint step | endpoint val_bpb | matched @4000 | late-training rate (Δ bpb / Δ step) |
|---|---|---|---|---|---|
| spec 008 | — (baseline) | 4828 | 1.0697 | 1.1110 | 5.0e-5 |
| spec 015 | 0 | 4761 | 1.0696 | 1.1078 | 5.0e-5 |
| spec 016 | 1 | 4708 | 1.0712 | **1.1072** | 5.1e-5 |
### Step-deficit correction

Both α runs ran short vs 008 due to JP pod throughput variance (~1-1.4% slower). The late-training improvement rate is nearly identical across all three runs — ~5.1e-5 per step after step 4000. That means the endpoint gap between 016 and 015 is almost entirely explained by 016 running 53 fewer steps, not by the α=1 init being worse.

Extrapolating 016 to the 015 step count (4761):

- 53 more steps × 5.1e-5/step ≈ 0.00270 additional improvement
- Projected 016 endpoint ≈ 1.0712 − 0.0027 = **~1.0685**

If the projection holds, 016 beats 015 by ~0.0011 and beats 008 by ~0.0012 — both ≥5× the noise floor (SOTA std ≈ 0.0002). This would make 1.0685 our best training-endpoint number on this stack, period.
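The same projection as a quick arithmetic check (plain Python; all constants are copied from the table above, and the linear-rate assumption is exactly the one flagged in the caveats):

```python
# Linear extrapolation of 016 to 015's step count, using the shared
# late-training rate. All numbers come from the results table above.
rate = 5.1e-5          # bpb improvement per step after step ~4000
steps_016, steps_015 = 4708, 4761
endpoint_016 = 1.0712  # 016 endpoint val_bpb (post-EMA, pre-GPTQ)

extra_steps = steps_015 - steps_016                 # 53
projected_016 = endpoint_016 - extra_steps * rate

print(f"{extra_steps} steps -> {extra_steps * rate:.5f} bpb")          # ~0.00270
print(f"projected 016 endpoint @ {steps_015}: {projected_016:.4f}")    # ~1.0685
```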
**Noise/signal judgment:** the matched-step @4000 Δ of −0.0006 (016 vs 015) falls in spec 016's "null" bucket, but the step-corrected endpoint picture is consistent with a real −0.001-ish gain. Call it *weak promote with a confound*; a 3-seed or matched-clock rerun settles it.
## 3. α trajectories side-by-side

Layout: `[[pass2_L3, pass2_L4, pass2_L5], [pass3_L3, pass3_L4, pass3_L5]]`. Activation fires at step ~2142 (015) / ~2123 (016).

```
step  | 015 α (init=0)                            | 016 α (init=1)
------+-------------------------------------------+-------------------------------------------
2000  | [[0.00, 0.00, 0.00], [0.00, 0.00, 0.00]]  | [[1.00, 1.00, 1.00], [1.00, 1.00, 1.00]]  init
2200  | [[0.03, 0.07, 0.14], [0.16, 0.24, 0.33]]  | [[0.84, 1.02, 0.90], [0.75, 0.76, 0.88]]  post-act
2500  | [[1.00, 1.16, 1.37], [0.85, 0.76, 0.75]]  | —
3000  | [[1.04, 1.16, 1.38], [0.98, 0.86, 0.76]]  | [[1.13, 1.30, 1.40], [1.04, 0.93, 0.85]]
4000  | [[1.04, 1.16, 1.38], [1.01, 0.89, 0.77]]  | [[1.13, 1.30, 1.40], [1.04, 0.96, 0.85]]
4700  | [[1.04, 1.16, 1.38], [1.01, 0.89, 0.77]]  | [[1.13, 1.30, 1.40], [1.04, 0.96, 0.85]]  saturated
```
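To make the nesting concrete, a tiny hypothetical helper (not from the training code) that formats a (2, 3) α parameter into exactly this layout:

```python
import torch

def alpha_snapshot(alpha: torch.Tensor) -> list:
    """Format a (2, 3) alpha tensor -- rows = extra passes 2 and 3,
    columns = looped layers L3, L4, L5 -- as
    [[pass2_L3, pass2_L4, pass2_L5], [pass3_L3, pass3_L4, pass3_L5]]."""
    return [[round(v, 2) for v in row] for row in alpha.detach().tolist()]

# e.g. 015's plateau from the table above:
print(alpha_snapshot(torch.tensor([[1.04, 1.16, 1.38], [1.01, 0.89, 0.77]])))
```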
**Shape is preserved.** Both runs converge to: pass-2 increases with depth (L3 < L4 < L5), pass-3 decreases with depth (L3 > L4 > L5). The *shape* of the plateau reproduces across inits.

**Magnitude is not.** 016's plateau sits ~0.10 higher than 015's everywhere. Same shape, translated.

**Non-monotone exploration in 016.** At step 2200 (just past activation), 016's pass-2 values dipped *below* 1.0 at L3 and L5 (0.84 and 0.90) before climbing back above. The α=1 init didn't smoothly drift to its final plateau — it first moved down, then corrected. The loss surface near α=1 has a downward gradient on some components before the optimizer finds the upward basin.
## 4. Findings

**Finding 1: α > 1 is preferred on pass-2, especially at depth.** Both 015 and 016 chose values above 1.0 on every pass-2 layer. pass2_L5 saturated at 1.38 (015) and 1.40 (016). The model actively amplifies block output beyond standard residual addition; it's saying "standard Loop345 under-commits, especially at deeper looped layers."

**Finding 2: α < 1 is preferred on pass-3 at deep layers.** pass3_L5 settled at 0.77 (015) and 0.85 (016). Pass-3 under-commits at depth — the third pass through layer 5 contributes less than a standard residual add.

**Finding 3: Depth monotonicity inverts between passes.** Pass-2 climbs with depth (L3 < L4 < L5); pass-3 descends (L3 > L4 > L5). In both runs. This is a specific, non-trivial structure: the two extra passes play *different* roles as a function of layer depth. Simplest mechanistic story: pass-2 overshoots (amplify, inject new direction), pass-3 partially corrects back (damp, stabilize). Hardcoded Loop345 with α=1 everywhere can't express this.

**Finding 4: The α plateau is path-dependent.** Different init → different converged α values (same shape, ~0.10 offset). Not a global optimum. α co-evolves with the other model parameters, and the init determines which equivalence-class configuration you land on. This rules out "train once, read off the learned α, then hard-code it" — there is no canonical shape.

**Finding 5: Late-training per-step improvement rate is independent of the α plateau.** 008/015/016 all show a ~5.0-5.1e-5 val_bpb drop per step after step 4000. The α shape doesn't slow warmdown-phase refinement. The earlier worry (that α>1's negative weight on the residual would destabilize late training) was wrong — refinement proceeds equally well from both plateaus.
## 5. What this tells us about Loop345

- **Fixed α=1 is approximately tuned, not optimally tuned.** It under-commits on pass-2 at depth and over-commits on pass-3 at depth. The baseline ships with the wrong commitment coefficient, and the model tells us so when given 6 degrees of freedom.
- **Two-stage overshoot-correct structure** is what the learned α expresses. A fixed-α architecture cannot do this; making α per-pass-per-layer is the minimum expressiveness needed to recover it.
- **Robustness of shape, fragility of magnitude.** Future recurrence variants should look for constraints that encourage the shape (e.g. structural priors on pass-2 > 1 > pass-3 at depth) without pinning specific values; one possible reparameterization is sketched below.
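One way such a prior could look — a hypothetical reparameterization (nothing we have run) that bakes the pass-2 > 1 > pass-3 shape into the parameterization while leaving the magnitudes free to move:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapePriorAlpha(nn.Module):
    """Hypothetical: encode the pass-2 amplify / pass-3 damp shape structurally.

    The free parameters theta are unconstrained; the mapping guarantees
    alpha_pass2 > 1 and 0 < alpha_pass3 < 1 per looped layer, without
    pinning the specific plateau values that 015/016 happened to find.
    """

    def __init__(self, n_layers=3):
        super().__init__()
        self.theta_pass2 = nn.Parameter(torch.zeros(n_layers))
        self.theta_pass3 = nn.Parameter(torch.zeros(n_layers))

    def alphas(self):
        alpha_pass2 = 1.0 + F.softplus(self.theta_pass2)  # always > 1
        alpha_pass3 = torch.sigmoid(self.theta_pass3)      # always in (0, 1)
        return alpha_pass2, alpha_pass3
```

At `theta = 0` this initializes around α ≈ 1.69 for pass-2 and α ≈ 0.5 for pass-3; whether that init is sane is exactly the kind of question the path-dependence finding says would need its own screen.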
## 6. Caveats — what we don't know

- **Single seed per run.** No statistical confidence. The shape could collapse or invert on seed 43/44; the magnitudes almost certainly will shift. Until 3-seed confirms, findings 1-3 are hypotheses, not facts.
- **JP pod variance masked endpoint comparisons.** 016's +0.0016 nominal endpoint regression looked real on first read — it isn't, but catching that required careful per-step rate analysis. A matched-clock rerun or 3-seed would eliminate this confound class entirely.
- **Hardware throughput cost.** 016 got 120 fewer steps than 008. Some fraction of that could be legitimate recur-alpha overhead (tiny extra blend ops × 6 slots × many steps) distinct from pod variance. We haven't isolated this. If real, it applies to every future recur-alpha spec as a hidden tax.
- **TTT composition is untested.** Every number here is pre-TTT, pre-GPTQ. The 1.0685 projection assumes a typical TTT/GPTQ gain (~0.003-0.005) on top, giving submission val_bpb ~1.063-1.065. But SpinQuant in specs 009/010 got *fully absorbed* by phased TTT — we can't assume Recur-Alpha survives absorption until we run the full pipeline. This is the single biggest open risk on this whole thread.
- **1.0685 is an extrapolation, not a measurement.** Linear projection at 5.1e-5/step. Late-training improvement might decelerate or accelerate differently in the final ~50 steps. Likely close, but not certain.
- **grad_norm logging in 015 was cosmetic only.** 016's post-fix logs show grad_norm in the 0.001-0.007 range, confirming autograd was always fine. 015's α values clearly moved (0 → 1.38), which couldn't have happened if grads weren't flowing. Mentioned for completeness — not a data issue.
- **Option B (fixed α at the learned shape) is now ruled out.** 015 and 016 converged to *different* α values (same shape, different magnitudes). There's no canonical "learned shape" to freeze. This design is off the table.
- **Pre-training / initial-state sensitivity untested.** Both runs were from scratch. Behavior when hotstarted from a pre-activation checkpoint is unknown.
## 7. Next steps (conditional)

**Most valuable next moves (rank order):**

1. **3-seed 015 (seeds 43, 44) + 3-seed 016 (seeds 43, 44)** in parallel on 4 pods. ~$20. Nails down: (a) whether the −0.001-ish gain is real, (b) whether the α shape reproduces across seeds, (c) whether the 015-vs-016 init question resolves under a seed-averaged comparison.

2. **Full-pipeline run (TTT + GPTQ) on 016 seed 42, resumed from its `final_model.pt`** (which is on the JP volume per execution's notes). ~$5-10 on 1×H100 — decouples training from eval and lets us see whether Recur-Alpha survives TTT absorption. Should be done *before* burning $20 on 3-seed, because if TTT absorbs the entire gain, the 3-seed exercise is academic.

3. **If TTT composes and 3-seed confirms:** a submission-grade full run with 3 seeds at the ~1.0685 training endpoint. ~$60. Produces the number we'd submit.

**Candidates worth considering but not immediate:**

- Cross-pass XSA (orthogonality constraint between pass-2 and pass-3 outputs). Directly tests the overshoot-correct hypothesis. Would use 015's α=0 init (confirmed stable). Can wait.
- α magnitude constraint (clamp α ≤ 1, or reparameterize through a sigmoid). Would test whether the α>1 preference is productive or artifactual.

**Ruled out:**

- Option B (fixed α at the learned shape) — killed by finding 4.
- α = 0.5 init — redundant given path-dependence; would just find a third plateau.
- Submitting 016 as-is — the single-seed + pod-variance confound makes it unreliable.
## 8. For a future public note

Short outline for if/when we write this up externally:

> **Headline:** Per-pass, per-depth learnable α blending reveals a pass-2 amplify / pass-3 damp structure in PR #1736's Loop345 recurrence. α > 1 at pass-2, α < 1 at pass-3, depth-monotonicity inverts between passes. Training-endpoint val_bpb improves by ~0.001 vs the fixed α=1 baseline, on stack #1736.
>
> **Evidence:** side-by-side α trajectories from two inits (α=0, α=1) — same converged shape, different magnitudes; step-matched val_bpb comparison against baseline; behavior holds on the full #1736 stack (CaseOps + phased TTT + GatedAttn + QuantGate), not a toy subset.
>
> **Caveats:** single seed; pre-TTT numbers; hardware variance partially masks raw endpoints; TTT/GPTQ composition pending.

Not writing the actual note now — waiting on 3-seed + TTT composition first.
## Artifacts

- `runs/015-recur-alpha/seed_42/{train.log,final.json,notes.md}` — local
- `runs/016-recur-alpha-ones/seed_42/{train.log,final.json,notes.md}` — local
- `runs/016-recur-alpha-ones/seed_42/final_model.pt` — on JP volume `jlxvxeiol4`, 135 MB, post-EMA pre-GPTQ. Available as a hotstart source for the TTT-composition experiment.
- `/workspace/.torch_inductor_cache` on JP volume — populated with 016's compile cache (commit 4dd2d63); cuts the next same-commit launch from ~10 min to ~1-2 min.
research/ideas/beating-1736-note.md — 130 additions & 0 deletions
# Beating-1736 note — what it takes, given 016's post-hoc data

**Written:** 2026-04-21 (after spec 016's post-hoc TTT eval OOM'd but captured pre-quant and post-GPTQ numbers)
**Status:** living document; update as runs land

## Target

Beat **#1736's claimed post-TTT val_bpb = 1.06610** on our submission with a meaningful margin (≥0.0005, given ~0.0002 seed std).
## Four-run comparison (all JP 8×H100, 596s wallclock cap)

### Throughput

| metric | 008 | 015 | 016 | #1736 |
|---|---|---|---|---|
| endpoint step | 4828 | 4761 | 4708 | 4854 |
| tok/s at step 4000 | 6.59M | 6.40M | 6.34M | — |
| tok/s at step 4500 | 6.45M | 6.33M | 6.26M | — |
| tok/s at step 4700 | — | 6.29M | 6.21M | — |
| **steps/596s** | **8.10** | **7.99** | **7.90** | **8.14** |
| throughput vs #1736 | 99.5% | 98.2% | 97.1% | 100% |

### Training-endpoint val_bpb

| metric | 008 | 015 | 016 | #1736 |
|---|---|---|---|---|
| endpoint val_bpb (bare) | 1.0697 | 1.0696 | 1.0712 | 1.0696 |
| pre-quant post-EMA | 1.06922 | 1.06916 | 1.07083 | 1.06906 |
| matched-step @4000 | 1.1110 | 1.1078 | 1.1072 | — |

### Post-training pipeline (only 016 and #1736 have these numbers)

| stage | 016 | #1736 | 016 Δ |
|---|---|---|---|
| post-GPTQ (int6) | 1.08029 | 1.07847 | +0.00182 |
| GPTQ cost | +0.00946 | +0.00941 | +0.00005 |
| post-TTT submission | — (OOM) | 1.06610 | — |
| TTT recovery | ? | −0.01237 | ? |
## The chain of assumptions

To go from the current 016 measurements to a post-TTT number that beats 1.06610:

```
training endpoint (pre-quant post-EMA)
  → + GPTQ cost (≈ +0.00947)
  → post-GPTQ (int6)
  → + TTT recovery (≈ −0.01237 for #1736)
  → post-TTT submission
```

**Working backward:**

- Target post-TTT: ≤ 1.06610
- If TTT recovery = #1736's −0.01237 → post-GPTQ needs ≤ 1.07847
- If GPTQ cost = observed +0.00947 → pre-quant post-EMA needs ≤ **1.06900**

016's current pre-quant post-EMA at step 4708: **1.07083**.

**Gap to close:** 1.07083 − 1.06900 = **0.00183**.
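The backward chain as a small arithmetic check (plain Python; the GPTQ-cost and TTT-recovery constants are the single observed values from the tables above, not distributions):

```python
# Work backward from #1736's post-TTT target to the pre-quant post-EMA
# value 016 needs, then compare against what 016 actually measured.
target_post_ttt = 1.06610      # #1736's claimed submission val_bpb
ttt_recovery    = -0.01237     # observed on #1736
gptq_cost       = +0.00947     # observed on 016 (matches #1736 to within 5e-5)

required_post_gptq = target_post_ttt - ttt_recovery    # 1.07847
required_pre_quant = required_post_gptq - gptq_cost    # 1.06900

measured_pre_quant_016 = 1.07083                       # 016 at step 4708
gap = measured_pre_quant_016 - required_pre_quant      # 0.00183

rate = 5.1e-5                                          # bpb/step, late training
print(f"gap = {gap:.5f} bpb ≈ {gap / rate:.0f} extra steps")      # ~36 steps
print(f"matched-to-008 gain vs gap: {0.00612 / gap:.1f}x margin")  # ~3.3x
```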
## How much each lever gives us

At the measured late-training rate of ~5.1e-5 per step:

| scenario | steps gained | bpb improvement | result |
|---|---|---|---|
| Current 016 (step 4708) | 0 | 0 | 1.07083 → miss by 0.00183 |
| +36 steps (step 4744) | +36 | 0.00183 | **just hits 1.06900** |
| Matched to 008 (step 4828) | +120 | 0.00612 | 1.06471 — **beats by 0.00429** |
| Matched to #1736 (step 4854) | +146 | 0.00745 | 1.06338 — **beats by 0.00562** |

**The matched-throughput bar clears the target with 3.3× margin.** The rate analysis says throughput alone is the easy lever.
## Risk ranking — what could kill this

| assumption | current confidence | risk if worse by 0.00183 |
|---|---|---|
| Late-training rate holds at ~5.0e-5/step | High — 3 independent runs show it | Low — would need the rate to halve |
| GPTQ cost stays ≤ +0.00947 | High — measured on 016, matches #1736 | Low — already validated |
| **TTT recovery stays ≥ −0.01054** (#1736 got −0.01237) | **UNTESTED** | **HIGH** — recur-alpha × phased-TTT composition is the unknown |
## The one-experiment answer

A single properly-configured run settles the whole story:

- **Matched-clock 016 on NA 8×H100** (kill the JP pod variance)
- **Full TTT pipeline** (not the EVAL_ONLY_CHECKPOINT bypass that OOM'd)
- **Bug fix**: the eval-only bypass skipped the CUDA-graph + loop warmup, starving the TTT allocator. Fix = restore the warmup phases on the eval path, or just let training-then-eval run in one pod within the training wallclock cap.

Cost: ~$10-15. Delivers:

1. A real post-TTT number, not a projection
2. A real NA-throughput baseline to check whether JP variance was the whole story
3. Confirms/disproves "matched-throughput 016 beats #1736 single-seed"

This is the single most valuable next run on this research thread.
## Scenarios by outcome

**Outcome A — Post-TTT ≤ 1.06550** (matches or beats the target by a clear margin):

- Promote to 3-seed confirmation (~$15-20).
- If 3-seed holds: submission-grade run for the leaderboard.

**Outcome B — Post-TTT in [1.06550, 1.06710]** (roughly within noise of #1736):

- Too close to call single-seed. 3-seed to resolve.
- Consider stacking with another lever (spec 017 candidate) first.

**Outcome C — Post-TTT > 1.06710** (clearly worse than #1736):

- The recur-alpha gain got absorbed by TTT. Shelve the submission path.
- Keep the α-shape findings as a mechanistic result; don't build further on it.
- Pivot to a non-recurrence lever (cross-pass XSA is a candidate but would need its own TTT composition test).
## What we've learned about #1736-beating regardless of 016

These are observations that survive regardless of 016's outcome:

1. **Matched-step @4000 Δ is the honest comparison for training endpoints** — raw endpoint comparisons on JP are noisy by ±67-120 steps (~0.003-0.006 bpb). Research should report matched-step numbers whenever possible.

2. **GPTQ cost is stable at ~+0.00944** across #1736 and 016. Budget this into any projection.

3. **TTT recovery (≈ −0.01237 for #1736) is the "submission multiplier"**, but it has been shown capable of absorbing upstream gains entirely (SpinQuant in specs 009/010). Don't assume recovery; test it.

4. **Pre-quant post-EMA as a function of step count is nearly linear** late in training across 008/015/016/#1736 (regressed at ~5.0e-5/step, within ±0.0002 of fit). This is useful for quick projections but shouldn't be trusted inside the final ~50 steps, where warmdown nonlinearities kick in.

5. **Same-pool JP pod variance is ±1-3% throughput** (~50-150 step deficit on a 596s wallclock). The NA pool may be cleaner; we haven't measured it.
## Not-on-this-thread but related

- If 016 passes the submission bar, the next stacking candidate is a training-time lever orthogonal to recurrence (cross-pass XSA won't stack trivially; training-dynamics levers like the tapered WD from spec 011 might).
- Multi-lever stacking math: stacking a −0.001-bpb training-time lever on top of matched-throughput 016 moves the projected endpoint from ~1.06338 to ~1.06238; compared directly against #1736's 1.06610 (i.e., assuming the post-training pipeline is roughly margin-neutral), the cumulative 2-lever path wins by ~0.004.
