research notes: recurrence activation schedule literature
Added section on 'when to activate recurrence' research. Key findings:
- ProRes, SGT, Staged Training all recommend progressive/curriculum activation over hard switches
- Literature has conflicting claims about WHERE convergence happens first (shallow vs deep layers)
- Consistent claim: progressive beats hard switch for stability
- openai#1736's enable_looping_at=0.35 is a hard switch — suboptimal per lit
Candidate variants identified, ranked by implementation cost:
env-var sweeps (1,2) vs code-change ramps (3,4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
research/ideas/recurrence-parallel-literature.md: 45 additions & 0 deletions
@@ -182,3 +182,48 @@ If you read TWO: add **Exclusive Self-Attention** (2603.09078) — the XSA paper
If you read THREE: add **Universal Transformer** (1807.03819) — the foundational recurrence paper.

Everything else is reference material.

---

## Recurrence schedule / when to activate (added 2026-04-21)

### What the literature says

1. **ProRes — Progressive Residual Warmup** — arXiv [2603.05369](https://arxiv.org/abs/2603.05369). Residual branches activate sequentially from shallow to deep, with a coefficient warming 0→1 linearly (see the sketch after this list). Shallow layers stabilize first.
2. **Sparse Growing Transformer (SGT)** — arXiv [2603.23998](https://arxiv.org/abs/2603.23998). Recurrence is activated in DEEPER layers FIRST, then extended to shallower ones. Claims deeper layers differentiate earlier in training.
3. **Staged Training for Transformer LMs** — Shen et al., ICML 2022 ([PDF](https://proceedings.mlr.press/v162/shen22f/shen22f.pdf)). Progressive stacking with heuristic schedules (50K/70K/280K steps for 3/6/12-layer models).
4. **Learning to Grow (LiGO)** — reuses pretrained smaller models as initialization for deeper ones. Saves ~44% of FLOPs on BERT-Base pretraining.
5. **Curriculum-Guided Layer Scaling (CGLS)** — arXiv [2506.11389](https://arxiv.org/abs/2506.11389). Couples progressive model expansion with a data-complexity curriculum.
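A minimal sketch of the shallow-to-deep linear warmup idea from ProRes (item 1), as I read it. The window sizing, the staggering, and the function name `prores_style_coeffs` are illustrative assumptions, not the paper's exact schedule.

```python
def prores_style_coeffs(step: int, total_steps: int, num_layers: int) -> list[float]:
    """Per-layer residual coefficients, warmed up shallow-to-deep.

    Each layer's coefficient ramps linearly 0 -> 1 over its own window;
    windows are staggered so shallow layers finish warming before deep
    layers start. The window length here is a guess for illustration.
    """
    window = total_steps / (2 * num_layers)      # warmup length per layer
    coeffs = []
    for layer in range(num_layers):
        start = layer * window                   # shallow layers start earlier
        progress = (step - start) / window
        coeffs.append(min(max(progress, 0.0), 1.0))
    return coeffs
```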
### Key tensions in the literature

- **ProRes / FreezeOut:** shallow layers converge first → freeze them early.
- **SGT:** deeper layers differentiate first → activate recurrence there first.
- **ILR:** early-layer recurrence beats late-layer recurrence (in their test).

These are contradictory claims about WHERE convergence happens first, and they are likely architecture- and task-dependent. **Our stack's dynamics haven't been measured.**

### Consistent claim across papers
**Progressive/curriculum schedules beat hard switches** for training stability. #1736's current `enable_looping_at=0.35` is a hard switch — the literature would predict this is suboptimal but not catastrophic.
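To make "hard switch" concrete: assuming `enable_looping_at` is a fraction of total training steps (which is how the variant list below reads) and some maximum loop count applies once it fires, the current behavior is a step function. The names and the `max_loops` default are placeholders, not #1736's actual code.

```python
def loops_at(progress: float, enable_looping_at: float = 0.35, max_loops: int = 2) -> int:
    """Hard-switch schedule: zero recurrence before the threshold, full recurrence after."""
    return 0 if progress < enable_looping_at else max_loops
```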
### Candidate variants for "better timing"

Ranked by cost:

1. **Earlier activation:** `enable_looping_at=0.15` or `0.20`. Env var only. Cheap to test (3-point sweep sketched at the end of this note).
2. **Later activation:** `enable_looping_at=0.50` or `0.70`. Env var only. Cheap to test.
3. **Smooth ramp of NUM_LOOPS:** 0 → 1 at 35% → 2 at 60%. Code change, ~30-50 LOC; schedule sketched after this list.
4. **Mixing-coefficient warmup:** blend single-forward and looped output via an α that ramps from 0 to 1. Code change, moderate; also sketched after this list.
5. **Progressive layer expansion (SGT-style):** layer-specific activation schedule. Complex; probably not worth the budget.
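A sketch of variant 3, assuming `NUM_LOOPS` is driven off training progress and that "0 → 1 at 35% → 2 at 60%" means the loop count steps up at those fractions. The thresholds and the function name are placeholders for whatever we would sweep:

```python
def ramped_num_loops(progress: float) -> int:
    """Staged loop-count schedule for variant 3.

    progress: fraction of training completed, in [0, 1].
    0 loops early, 1 loop from 35% of training, 2 loops from 60%.
    """
    schedule = [(0.60, 2), (0.35, 1)]   # (threshold, loop count), checked high to low
    for threshold, loops in schedule:
        if progress >= threshold:
            return loops
    return 0
```

This keeps switch points but softens the jump from no recurrence straight to the final loop count; the two thresholds become the tunable knobs.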
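And a sketch of variant 4's mixing-coefficient warmup. The looped stand-in (`block(block(x))`) and the ramp window (35% → 60%) are assumptions; the real change would blend whatever the looped block actually returns:

```python
def alpha(progress: float, start: float = 0.35, end: float = 0.60) -> float:
    """Linear ramp of the mixing coefficient from 0 at `start` to 1 at `end`."""
    if progress <= start:
        return 0.0
    if progress >= end:
        return 1.0
    return (progress - start) / (end - start)

def mixed_output(x, block, progress: float):
    """Blend single-forward and looped outputs with the warmup coefficient."""
    y_single = block(x)           # one pass through the block
    y_looped = block(block(x))    # stand-in for the looped/recurrent path
    a = alpha(progress)
    return (1.0 - a) * y_single + a * y_looped
```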
### Honest assessment for our decision

- The literature is NOT conclusive. "Progressive beats hard switch" is consistent, but the specific schedules vary.
- Variants 1 and 2 are env-var-only and could be tested in a 3-run sweep (~$15 total in screening mode).
- Variants 3 and 4 need code changes; higher risk of bugs per our Edit-split-block lesson.
- Running a 3-point sweep of `enable_looping_at` on spec 008 (sketched below) would generate empirically grounded data about OUR stack's dynamics, which neither the ILR nor SGT papers provide.
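A sketch of that 3-point sweep. The launcher name, spec path, and CLI flags are placeholders for our stack; only `enable_looping_at` comes from the notes above, and it is assumed to be read from the environment as variants 1 and 2 suggest:

```python
import os
import subprocess

# Hypothetical 3-point sweep over the activation threshold (screening mode).
# "train.py" and "specs/008.yaml" are stand-ins for the real entry point and spec.
for value in ("0.15", "0.35", "0.50"):
    env = {**os.environ, "enable_looping_at": value}
    subprocess.run(
        ["python", "train.py", "--spec", "specs/008.yaml", "--mode", "screening"],
        env=env,
        check=True,
    )
```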