
Commit ca370ff

leon2k2k2k and claude committed
research notes: recurrence activation schedule literature
Added section on 'when to activate recurrence' research.

Key findings:
- ProRes, SGT, Staged Training all recommend progressive/curriculum activation over hard switches
- Literature has conflicting claims about WHERE convergence happens first (shallow vs deep layers)
- Consistent claim: progressive beats hard switch for stability
- openai#1736's enable_looping_at=0.35 is a hard switch — suboptimal per lit

Candidate variants identified, ranked by implementation cost: env-var sweeps (1,2) vs code-change ramps (3,4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 6c55441 commit ca370ff

1 file changed

Lines changed: 45 additions & 0 deletions


research/ideas/recurrence-parallel-literature.md

@@ -182,3 +182,48 @@ If you read TWO: add **Exclusive Self-Attention** (2603.09078) — the XSA paper
If you read THREE: add **Universal Transformer** (1807.03819) — the foundational recurrence paper.

Everything else is reference material.

---
## Recurrence schedule / when to activate (added 2026-04-21)
### What the literature says
1. **ProRes — Progressive Residual Warmup** — arXiv [2603.05369](https://arxiv.org/abs/2603.05369). Residual branches activate sequentially from shallow to deep, each with a coefficient warming 0→1 linearly. Shallow layers stabilize first. (A toy version of this schedule is sketched after this list.)
2. **Sparse Growing Transformer (SGT)** — arXiv [2603.23998](https://arxiv.org/abs/2603.23998). Recurrence is activated in DEEPER layers FIRST, then extended to shallower ones. Claims deeper layers differentiate earlier in training.
3. **Staged Training for Transformer LMs** — Shen et al., ICML 2022 ([PDF](https://proceedings.mlr.press/v162/shen22f/shen22f.pdf)). Progressive stacking with heuristic schedules (50K/70K/280K steps for 3-/6-/12-layer models).
4. **Learning to Grow (LiGO)** — reuses smaller pretrained models as the initialization for deeper ones. Saves ~44% FLOPs on BERT-Base pretraining.
5. **Curriculum-Guided Layer Scaling (CGLS)** — arXiv [2506.11389](https://arxiv.org/abs/2506.11389). Couples progressive model expansion with data-complexity curriculum.
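A minimal sketch of the ProRes-style shallow-to-deep warmup from item 1, assuming a per-layer residual coefficient driven by training progress. The function name, the staggering formula, and the `ramp_frac` default are our own illustration, not the paper's code.

```python
# Toy ProRes-style schedule: each layer's residual-branch coefficient ramps
# 0 -> 1 linearly, with shallow layers starting their ramp before deep ones.
def prores_coeff(layer_idx: int, num_layers: int, progress: float,
                 ramp_frac: float = 0.1) -> float:
    """Residual coefficient in [0, 1] for one layer at training `progress` in [0, 1]."""
    # Stagger ramp starts so layer 0 (shallowest) activates first.
    start = (layer_idx / num_layers) * (1.0 - ramp_frac)
    return min(1.0, max(0.0, (progress - start) / ramp_frac))

if __name__ == "__main__":
    # At 30% of training: shallow layers fully on, deeper layers still off.
    for i in range(0, 12, 3):
        print(f"layer {i:2d}: coeff = {prores_coeff(i, 12, progress=0.30):.2f}")
```

SGT's deep-first claim (item 2) would just invert the stagger, e.g. `start = (1 - layer_idx / num_layers) * (1.0 - ramp_frac)`.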
### Key tensions in the literature
- **ProRes / FreezeOut:** shallow layers converge first → freeze them early.
- **SGT:** deeper layers differentiate first → activate recurrence there first.
- **ILR:** early-layer recurrence beats late-layer recurrence (in their test).

These are contradictory claims about WHERE convergence happens first. Likely architecture- and task-dependent. **Our stack's dynamics haven't been measured.**
### Consistent claim across papers
**Progressive/curriculum schedules beat hard switches** for training stability. #1736's current `enable_looping_at=0.35` is a hard switch — the literature would predict this is suboptimal but not catastrophic.
### Candidate variants for "better timing"
Ranked by cost:
1. **Earlier activation:** `enable_looping_at=0.15` or `0.20`. Env var only. Cheap to test.
2. **Later activation:** `enable_looping_at=0.50` or `0.70`. Env var only. Cheap to test.
3. **Smooth ramp of NUM_LOOPS:** 0 until 35% of training, 1 from 35%, 2 from 60% (see the sketch after this list). Code change, ~30-50 LOC.
4. **Mixing-coefficient warmup:** blend single-forward and looped output via an α that ramps from 0 to 1 (also sketched after this list). Code change, moderate cost.
5. **Progressive layer expansion (SGT-style):** layer-specific activation schedule. Complex; probably not worth the budget.
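Rough sketches of variants 3 and 4, mainly to gauge the code-change cost. The function names, the reuse of 35%/60% as breakpoints for the α ramp, and the assumption that `progress` is the completed fraction of training are ours; wiring these into the real forward pass is the ~30-50 LOC estimated above.

```python
# Variant 3: staged NUM_LOOPS schedule: 0 loops until 35% of training,
# 1 loop from 35%, 2 loops from 60% (thresholds from the list above).
def num_loops_schedule(progress: float) -> int:
    if progress < 0.35:
        return 0
    if progress < 0.60:
        return 1
    return 2

# Variant 4: mixing-coefficient warmup. alpha ramps linearly 0 -> 1 between
# `start` and `end` (breakpoints reused from variant 3 as an assumption); the
# forward pass would blend:
#     out = (1 - alpha) * single_forward_out + alpha * looped_out
def loop_mix_alpha(progress: float, start: float = 0.35, end: float = 0.60) -> float:
    return min(1.0, max(0.0, (progress - start) / (end - start)))
```

Both are pure functions of training progress, so they can be logged and swept without touching optimizer state.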
### Honest assessment for our decision
- The literature is NOT conclusive. Progressive > hard is consistent, but specific schedules vary.
- Variants 1 and 2 are env-var-only and could be tested in a 3-run sweep (~$15 total in screening mode).
- Variants 3 and 4 need code changes; higher risk of bugs per our Edit-split-block lesson.
- Running a 3-point sweep of `enable_looping_at` on spec 008 would generate empirically grounded data about OUR stack's dynamics, which neither the ILR nor SGT papers provide (a hypothetical launcher is sketched below).
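If we run that sweep, something like the launcher below would do it. The `ENABLE_LOOPING_AT` env var name, the `train.py --spec 008` entry point, and the choice of three points (early, current 0.35 baseline, late) are placeholders for whatever the repo actually exposes, not its real interface.

```python
# Hypothetical 3-point sweep launcher for spec 008; env var name and training
# entry point are placeholders, not the repo's actual interface.
import os
import subprocess

for frac in ("0.15", "0.35", "0.50"):  # early / current baseline / late
    env = dict(os.environ, ENABLE_LOOPING_AT=frac)
    subprocess.run(["python", "train.py", "--spec", "008"], env=env, check=True)
```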
