OpenRLHF is a Ray-based RLHF framework for training large models (7B to 70B+ parameters) with vLLM-accelerated inference. It supports PPO, GRPO, RLOO, and DPO in one package and claims a 2× speedup over DeepSpeedChat, which it attributes to its distributed architecture, ZeRO-3, and GPU resource sharing. The standout feature is the Hybrid Engine, which lets vLLM and DeepSpeed share the same GPUs through sleep modes; this matters when you are running actor, critic, reward, and reference models simultaneously. Choose GRPO if you want to skip the critic model and save memory, or stick with PPO for maximum control. The setup is Docker-heavy and Ray-centric, so expect more infrastructure overhead than with simpler frameworks like TRL; that is the trade-off for scaling to multi-node clusters.
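To make the GRPO versus PPO trade-off concrete, here is a minimal sketch of the group-relative advantage that lets GRPO drop the critic: several responses are sampled for the same prompt, and each response's reward is normalized against the group's own mean and standard deviation instead of a learned value estimate. This is plain Python and not the framework's code; the function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper for illustration only. The group mean serves as the
    baseline that a critic (value model) would otherwise provide in PPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a reward model or rule.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Responses that score above the group mean get a positive advantage and those below get a negative one, so no separate critic network has to be trained or kept in GPU memory.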
npx skills add https://github.com/orchestra-research/ai-research-skills --skill openrlhf-training