This skill covers three approaches to speeding up LLM text generation without quality loss: classic speculative decoding, where a small draft model proposes tokens that a large target model verifies in parallel (1.5-2× speedup); Medusa, which adds extra prediction heads and uses tree-based attention to guess several future tokens at once (2.3-3.6× speedup); and Lookahead Decoding, which reformulates generation as parallel Jacobi iteration over n-grams. It includes working code for assisted generation in transformers, the Medusa library, and Lookahead parameters. It is best suited to latency-sensitive workloads such as chatbots or code completion, provided you have the memory to load a draft model or can train lightweight heads. The methods are mathematically sound and have been used in production.
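The draft-then-verify loop behind classic speculative decoding can be illustrated with a toy, dependency-free sketch. Everything here is an assumption made for illustration: `target_next` and `draft_next` are hypothetical stand-ins for real models (plain functions that return the greedy next token), and the verification is greedy matching rather than the probabilistic accept/reject rule used with sampling. In a real system the verification of all `k` proposed tokens would be one batched forward pass of the target model, which is where the speedup comes from.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function mapping a token context to its greedy next token.
Model = Callable[[List[int]], int]

def target_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large, slow target model."""
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the small draft model.
    It agrees with the target except when the context length is a multiple of 4."""
    tok = target_next(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 50

def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    out = list(prompt)
    goal = len(prompt) + n
    rounds = 0
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real model would do
        #    this in a single batched forward pass, counted here as one round.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft guessed correctly
            else:
                accepted.append(expected)  # replace first mismatch, discard the rest
                break
        else:
            # Every guess matched: the same pass also yields one extra token.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out, rounds

if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, rounds = speculative_decode(target_next, draft_next, prompt, n=20)
    print("matches plain greedy decoding:", tokens == greedy_decode(target_next, prompt, 20))
    print("verification rounds:", rounds, "for 20 new tokens")
```

Because every emitted token is either confirmed or supplied by the target, the output is identical to decoding with the target alone; the draft only changes how many target rounds are needed. How much that helps in practice depends on how often the draft agrees with the target and on the relative cost of the two models.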
npx skills add https://github.com/orchestra-research/ai-research-skills --skill speculative-decoding