Additional memory optimization features · verl-project/verl#144

(6 comments) (3 reactions) (0 assignees) Python (3,940 forks)

call for contribution, enhancement, good first issue

Repository metrics

Stars: (21,533 stars)
PR merge metrics: (Avg merge 5d) (146 merged PRs in 30d)

Description

Activation offloading (see implementation here)
Fusing optimizer step into backward pass (see implementation here)
Utilize full_shard reshard_after_forward (see here). I wasn't sure whether this is already implemented in veRL.
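As a rough illustration of the first item, here is a minimal sketch of activation offloading using PyTorch's built-in `torch.autograd.graph.save_on_cpu` context manager. This is not the torchtune implementation linked above, which is more elaborate; the helper name `forward_backward_with_offload`, the toy model, and the squared-output loss are assumptions made for the example.

```python
import torch
import torch.nn as nn
from torch.autograd.graph import save_on_cpu


def forward_backward_with_offload(model, batch, pin_memory=None):
    """Run one forward/backward pass with saved activations kept on CPU.

    Hypothetical helper for illustration; not part of veRL or torchtune.
    """
    if pin_memory is None:
        # Pinned host memory only makes sense when a CUDA device is present.
        pin_memory = torch.cuda.is_available()
    # Tensors saved for backward inside this context are moved to CPU and
    # copied back to their original device when backward needs them.
    with save_on_cpu(pin_memory=pin_memory):
        loss = model(batch).pow(2).mean()  # toy loss, assumed for the demo
    loss.backward()
    return loss.detach()


# Toy model and data, assumed for the example.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
batch = torch.randn(8, 16)
loss = forward_backward_with_offload(model, batch)
```

On a GPU, the saved activations live in host memory between forward and backward, so peak device memory drops at the cost of extra host-device copies. On a CPU-only machine the context manager is effectively a no-op, which makes it easy to check that gradients are unchanged.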

These optimizations largely trade additional compute for lower peak memory usage, so they may only be useful when training larger models or in GPU-constrained settings.

Contributor guide

Research direction: Explore the linked activation offloading and fused optimizer step implementations from torchtune (torchtune/training/_activation_offloading.py and torchtune/training/memory.py). Study the FSDP2 API differences described in torchtitan (docs/fsdp.md). Then survey the veRL codebase to identify where similar memory optimizations can be integrated, focusing on the training loop and model sharding. Consider discussing with the maintainers which optimization is most needed first.
Tech stack: Python, PyTorch
Domain: machine learning, performance
Issue type: Feature
Difficulty: 4
Estimated time: Over 1 week
Activity status: Fresh
Clarity: Mostly clear
Prerequisites: Python, PyTorch, FSDP
Newbie friendliness: 15
