If you're running large language models on Apple Silicon with MLX and latency is the bottleneck, DFlash is built for that problem. It uses a small 1B-parameter draft model to speculatively generate 16 tokens at once via block diffusion, then verifies them in a single pass through your target model. It's lossless: every emitted token matches what your target model would produce under greedy decoding. You get 1.7x to 4x speedups on Qwen models, with 87-90% acceptance rates. It ships with a CLI, an OpenAI-compatible server, and a Python API. The tape-replay rollback for GatedDeltaNet models and the custom Metal kernels show this was built by someone who actually cares about making inference fast on M-series chips.
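To make the draft-then-verify idea concrete, here is a minimal pure-Python sketch of greedy speculative decoding. It is not DFlash's code or API: `target_next`, `draft_next`, `greedy_decode`, and `speculative_decode` are hypothetical names, the "models" are toy functions, and the toy draft proposes tokens one at a time rather than through block diffusion. It only illustrates why accepting the longest matching prefix and substituting the target's token at the first mismatch keeps the output identical to plain greedy decoding.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Toy target model: a deterministic function of the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[int]) -> int:
    """Toy draft model: agrees with the target except at every 7th position."""
    tok = target_next(prefix)
    return tok if len(prefix) % 7 else (tok + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    n_new: int,
    block: int = 16,
) -> Tuple[List[int], float]:
    """Draft a block, keep the prefix the target agrees with, then repeat."""
    out = list(prompt)
    goal = len(prompt) + n_new
    proposed = accepted = 0
    while len(out) < goal:
        # 1) The draft proposes up to `block` tokens.
        k = min(block, goal - len(out))
        ctx = list(out)
        drafted = []
        for _ in range(k):
            tok = draft(ctx)
            drafted.append(tok)
            ctx.append(tok)
        proposed += k
        # 2) Verify against the target's greedy choice at each position.
        #    A real system scores all positions in one batched forward pass;
        #    this loop is the sequential equivalent.
        for tok in drafted:
            want = target(out)
            if tok == want:
                out.append(tok)
                accepted += 1
            else:
                out.append(want)  # first mismatch: take the target's token
                break             # and discard the rest of the block
    return out, accepted / proposed
```

Comparing `speculative_decode(target_next, draft_next, [1, 2, 3], 40)[0]` with `greedy_decode(target_next, [1, 2, 3], 40)` gives the same token list, which is the lossless property in miniature. The second return value is the fraction of drafted tokens that were accepted for this toy pair.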
npx skills add https://github.com/aradotso/trending-skills --skill dflash-mlx-speculative-decoding