This is a solid implementation guide for compressing large language models with knowledge distillation, with working code for the main techniques. It covers standard teacher-student training with temperature scaling and soft targets, plus MiniLLM's reverse KLD approach, which suits generative models better because it is mode-seeking rather than mode-covering: the student concentrates on the teacher's high-probability outputs instead of spreading probability mass over regions the teacher considers unlikely. The examples show real transfers such as 70B to 7B LLaMA models and include response distillation for creating synthetic training data. There is good coverage of the alpha hyperparameter for balancing the soft and hard losses. If you're trying to deploy a cheaper model or transfer GPT-4 capabilities to something you can run locally, this gives you the core techniques with reasonable defaults.
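To make the pieces above concrete, here is a minimal sketch of a blended distillation loss in plain Python (standard library only, single token position). It is not the skill's own code: the function names, the default `temperature=2.0` and `alpha=0.5`, and the convention that `alpha` weights the soft loss are all assumptions chosen for illustration. The `reverse` flag switches between forward KL (teacher to student, as in standard distillation) and reverse KL (student to teacher, the direction MiniLLM uses).

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5, reverse=False):
    """Blend a soft (teacher-matching) loss with a hard (ground-truth) loss.

    Assumption: alpha weights the soft loss and (1 - alpha) the hard loss;
    some codebases use the opposite convention.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    if reverse:
        soft = kl_divergence(p_student, p_teacher)  # reverse KL: mode-seeking
    else:
        soft = kl_divergence(p_teacher, p_student)  # forward KL: mode-covering
    soft *= temperature ** 2  # keeps gradient scale comparable across temperatures
    hard = -math.log(softmax(student_logits)[label])  # cross-entropy at T = 1
    return alpha * soft + (1 - alpha) * hard


# Illustrative distributions (made up for this sketch): a bimodal teacher,
# a student that spreads mass everywhere, and one that commits to a single mode.
teacher = [0.495, 0.01, 0.495]
covering = [0.34, 0.32, 0.34]
peaked = [0.98, 0.01, 0.01]

forward_covering = kl_divergence(teacher, covering)
forward_peaked = kl_divergence(teacher, peaked)
reverse_covering = kl_divergence(covering, teacher)
reverse_peaked = kl_divergence(peaked, teacher)
```

On these toy distributions, forward KL scores the spread-out student as the better fit, while reverse KL prefers the student that commits to one of the teacher's modes and leaves little mass on the outcome the teacher rates as unlikely. That is the behavior the reverse KLD objective is chosen for in generative settings.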
npx skills add https://github.com/orchestra-research/ai-research-skills --skill knowledge-distillation