Labels: call for contribution, enhancement, good first issue
Description
- Activation offloading (see implementation here)
- Fusing optimizer step into backward pass (see implementation here)
- Utilize `full_shard` / `reshard_after_forward` (see here). I wasn't 100% sure whether this is already implemented in veRL.
These optimizations largely trade additional compute for lower peak memory usage, so they may only be useful when training larger models and in GPU-constrained settings.
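As an illustration of that trade-off for activation offloading, the sketch below uses PyTorch's built-in `torch.autograd.graph.save_on_cpu` context manager. This is a stand-in, not the linked implementation: the wrapper class `OffloadedBlock` is a hypothetical name, and a production version would likely overlap the host/device copies with compute.

```python
import torch
from torch import nn
from torch.autograd.graph import save_on_cpu


class OffloadedBlock(nn.Module):
    """Hypothetical wrapper: tensors saved for backward inside `block`
    are moved to host memory and copied back when backward needs them."""

    def __init__(self, block: nn.Module) -> None:
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pinned host memory only helps (and is only needed) for GPU inputs.
        with save_on_cpu(pin_memory=x.is_cuda):
            return self.block(x)


torch.manual_seed(0)
block = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
x = torch.randn(3, 4, requires_grad=True)

# Reference gradient without offloading.
block(x).sum().backward()
reference_grad = x.grad.clone()

# Same computation with saved activations offloaded to the CPU.
x.grad = None
block.zero_grad()
OffloadedBlock(block)(x).sum().backward()
offloaded_grad = x.grad.clone()
```

The gradients are the same either way; what changes is where the saved activations live between forward and backward. On a GPU the extra copies cost time, which is why this tends to pay off only when activation memory is the limiting factor.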