Production fork of slime for enterprise RL training on massive MoE models. The real draw here is unified FP8 training and inference, with bit-wise alignment between SGLang rollouts and Megatron training that prevents the routing inconsistencies that destabilize MoE training. It also includes Rollout Routing Replay, which records the expert routing decisions made during rollout and replays them in training, eliminating the quantization-induced discrepancies that cause RL collapse. Supports INT4 quantization-aware training (fit 1TB models on single H200s) and speculative RL with EAGLE for a 25%+ rollout speedup. If you're training DeepSeek V3 or Qwen3-MoE scale models and need production stability over research flexibility, this is the tool.
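
To make the routing-replay idea concrete, here is a minimal, framework-free sketch in plain Python. It is not the tool's actual API: `top_k_experts` and `RoutingRecorder` are hypothetical names invented for illustration, and the logits are made-up toy values. It only shows why recomputing top-k routing on slightly different numerics can pick a different expert, while replaying the recorded choice cannot.

```python
from typing import List


def top_k_experts(logits: List[float], k: int) -> List[int]:
    """Return indices of the k largest router logits (ties go to the lower index)."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]


class RoutingRecorder:
    """Hypothetical helper: stores the per-token expert choices made during rollout."""

    def __init__(self) -> None:
        self._choices: List[List[int]] = []

    def record(self, experts: List[int]) -> None:
        self._choices.append(list(experts))

    def replay(self, token_idx: int) -> List[int]:
        return self._choices[token_idx]


# Toy router logits for two tokens as seen by the rollout engine.
rollout_logits = [[0.50, 0.49, 0.10], [0.20, 0.90, 0.85]]
# The same tokens on the training side, with a small numeric drift on token 0
# (standing in for a precision or kernel difference; values are illustrative).
train_logits = [[0.49, 0.50, 0.10], [0.20, 0.90, 0.85]]

# Rollout: route each token and record the decision.
recorder = RoutingRecorder()
for logits in rollout_logits:
    recorder.record(top_k_experts(logits, k=1))

# Training without replay: routing is recomputed and may diverge from rollout.
recomputed = [top_k_experts(logits, k=1) for logits in train_logits]
# Training with replay: the recorded rollout decisions are reused as-is.
replayed = [recorder.replay(i) for i in range(len(train_logits))]

print("recomputed:", recomputed)
print("replayed:  ", replayed)
```

In this toy case the recomputed routing sends token 0 to a different expert than the rollout did, while the replayed routing matches the rollout by construction. A real system would record routing per layer and per token inside the inference engine and feed it to the training forward pass; none of that plumbing is shown here.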
npx skills add https://github.com/orchestra-research/ai-research-skills --skill miles-rl-training