SimPO is a reference-free preference optimization method that, per its authors, outperforms DPO by up to 6.4 points on AlpacaEval 2.0 without needing a reference model during training. Use it when you have preference pairs (chosen/rejected responses) and want simpler, more efficient alignment than DPO or PPO. The reference implementation builds on Hugging Face's alignment-handbook and requires careful tuning of the learning rate (3e-7 to 1e-6) and beta (2.0 to 10.0). It runs on single-node setups with DeepSpeed ZeRO-3, which makes it accessible if you don't have massive distributed infrastructure. The honest take: if DPO is your baseline for preference alignment, SimPO can give you better results with less complexity, though you still need to babysit hyperparameters to avoid divergence.
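To make "reference-free" concrete, here is a minimal sketch of the SimPO objective for a single preference pair, in plain Python. It assumes the formulation from the SimPO paper: the implicit reward is the length-normalized log-probability of a response scaled by beta, and the loss asks the chosen reward to beat the rejected one by a target margin gamma. The function names, the gamma parameter (not mentioned above), and the default values are illustrative assumptions, not the alignment-handbook API.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one (chosen, rejected) pair.

    Each argument is a list of per-token log-probabilities from the policy
    being trained. No reference model appears anywhere in the computation.
    beta and gamma defaults are illustrative, not recommended settings.
    """
    # Implicit reward: beta times the average per-token log-probability.
    chosen_reward = beta * sum(chosen_token_logps) / len(chosen_token_logps)
    rejected_reward = beta * sum(rejected_token_logps) / len(rejected_token_logps)

    # The chosen response should beat the rejected one by at least gamma.
    margin = chosen_reward - rejected_reward - gamma

    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    good = simpo_loss([-0.1, -0.2], [-1.0, -2.0], beta=2.0, gamma=0.5)
    bad = simpo_loss([-1.0, -2.0], [-0.1, -0.2], beta=2.0, gamma=0.5)
    print(f"policy prefers chosen:   loss = {good:.4f}")
    print(f"policy prefers rejected: loss = {bad:.4f}")
```

Because the reward is an average over tokens, longer responses get no automatic advantage, and because beta multiplies that average directly, large beta values sharpen the loss quickly. That is one reason the learning rate and beta need to be tuned together.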
npx skills add https://github.com/orchestra-research/ai-research-skills --skill simpo-training