This skill provides guidance for fine-tuning language models with Group Relative Policy Optimization (GRPO) using the TRL library. Reach for it when you need to enforce structured outputs (XML tags, JSON) or improve reasoning on verifiable tasks such as math or coding, where you have clear reward signals but no preference pairs. It walks through the full workflow: preparing chat-formatted datasets, composing multiple reward functions (correctness, format, style), and tuning hyperparameters such as group size and learning rate. The reward function examples are especially practical, showing both binary scoring and incremental rewards with partial credit. Orchestra Research also includes honest guidance on when not to use GRPO, which saves you from going down the wrong path on supervised tasks.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill grpo-rl-training
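
To make the difference between binary scoring and incremental rewards with partial credit concrete, here is a minimal sketch in plain Python. It is not taken from the skill itself: the `<reasoning>`/`<answer>` tag layout, the function names, the 0.25-per-tag weighting, and the `answer` dataset column are all assumptions chosen for illustration. The `(completions, **kwargs) -> list of floats` shape follows the convention TRL uses for custom reward functions.

```python
import re

# Hypothetical tag layout for illustration; the skill's own format may differ.
STRICT_PATTERN = re.compile(
    r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>$", re.DOTALL
)
TAGS = ("<reasoning>", "</reasoning>", "<answer>", "</answer>")


def _text(completion):
    """Accept either a plain string or a chat-style list of message dicts."""
    if isinstance(completion, str):
        return completion
    return completion[0]["content"]


def strict_format_reward(completions, **kwargs):
    """Binary scoring: 1.0 only if the whole completion matches the layout."""
    return [
        1.0 if STRICT_PATTERN.match(_text(c).strip()) else 0.0
        for c in completions
    ]


def partial_format_reward(completions, **kwargs):
    """Incremental scoring: 0.25 for each expected tag that appears exactly once."""
    return [
        sum(0.25 for tag in TAGS if _text(c).count(tag) == 1)
        for c in completions
    ]


def correctness_reward(completions, answer, **kwargs):
    """Binary scoring against a reference column (assumed to be named `answer`)."""
    rewards = []
    for completion, reference in zip(completions, answer):
        match = re.search(r"<answer>(.*?)</answer>", _text(completion), re.DOTALL)
        is_correct = bool(match) and match.group(1).strip() == str(reference).strip()
        rewards.append(1.0 if is_correct else 0.0)
    return rewards


if __name__ == "__main__":
    good = "<reasoning>2 + 2</reasoning>\n<answer>4</answer>"
    sloppy = "<reasoning>2 + 2</reasoning> 4"
    print(strict_format_reward([good, sloppy]))
    print(partial_format_reward([good, sloppy]))
    print(correctness_reward([good, sloppy], answer=["4", "4"]))
```

Functions like these would typically be passed together as the `reward_funcs` list of TRL's `GRPOTrainer`. The partial-credit variant gives the policy a gradient toward the target format even when early completions never match the strict pattern, which is the usual reason to pair the two.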