This is a pure C/Metal inference engine that runs Qwen3.5-397B (397B parameters, 209GB on disk) on a MacBook Pro with 48GB RAM at 4.4 tokens per second. It streams expert weights from SSD on demand using parallel pread calls, with no Python or ML frameworks at runtime. The repo includes hand-tuned Metal shaders with fused dequantization FMA kernels and uses Accelerate BLAS for the GatedDeltaNet attention layers. You'll need an M3 Max or similar, a 1TB SSD, and patience for the weight extraction and repacking process. There's an optional 2-bit quantization path that's faster but breaks tool calling. This is what happens when someone decides PyTorch is too slow and writes 7000 lines of Objective-C instead.
npx skills add https://github.com/aradotso/trending-skills --skill flash-moe-inference
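
To make the on-demand streaming idea concrete, here is a minimal C sketch of pulling several expert blocks from a single weight file with parallel pread calls on worker threads. The file name, expert size, offsets, and thread count are placeholders for illustration, not the repo's actual layout or code.

```c
/* Minimal sketch of streaming expert weights from SSD with parallel pread.
 * Layout constants and "experts.bin" are assumptions, not the real format. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_THREADS  4
#define EXPERT_BYTES (8u * 1024 * 1024)   /* assumed size of one packed expert */

typedef struct {
    int    fd;      /* shared read-only fd for the weight file   */
    off_t  offset;  /* byte offset of this expert in the file    */
    void  *dst;     /* preallocated buffer slice for this expert */
} read_job;

static void *read_expert(void *arg) {
    read_job *job = (read_job *)arg;
    size_t done = 0;
    /* pread is safe on a shared fd across threads: explicit offset, no seek */
    while (done < EXPERT_BYTES) {
        ssize_t n = pread(job->fd, (char *)job->dst + done,
                          EXPERT_BYTES - done, job->offset + (off_t)done);
        if (n <= 0) { perror("pread"); break; }
        done += (size_t)n;
    }
    return NULL;
}

int main(void) {
    int fd = open("experts.bin", O_RDONLY);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    void *buf = malloc((size_t)NUM_THREADS * EXPERT_BYTES);
    pthread_t tid[NUM_THREADS];
    read_job  job[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        job[i].fd     = fd;
        job[i].offset = (off_t)i * EXPERT_BYTES;  /* pretend the router chose these */
        job[i].dst    = (char *)buf + (size_t)i * EXPERT_BYTES;
        pthread_create(&tid[i], NULL, read_expert, &job[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);

    close(fd);
    free(buf);
    return 0;
}
```

In a real engine the offsets would come from the MoE router's expert selection each layer, and the buffers would feed the dequantization kernels; this sketch only shows the concurrent-read pattern.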