This handles reinforcement learning training for LLMs using ByteDance's verl framework, which powers their Doubao models. It supports GRPO, PPO, and other RL algorithms with flexible backends that let you swap between FSDP, Megatron-LM, vLLM, and SGLang. The skill walks through math reasoning workflows, critic-based PPO setups, and large-scale training configurations for models up to 671B parameters. If you're doing RLHF or post-training at scale and need production-grade infrastructure that doesn't lock you into one framework, this is solid. The HybridFlow architecture separates orchestration from computation, which makes debugging distributed RL much less painful than fighting through monolithic training loops.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill verl-rl-training