This is a pure C/Metal inference engine that runs Qwen3.5-397B (397B parameters, 209GB on disk) on a MacBook Pro with 48GB RAM at 4.4 tokens per second. It streams expert weights from SSD on demand using parallel pread calls, with no Python or ML frameworks at runtime. The repo includes hand-tuned Metal shaders with fused dequantization FMA kernels and uses Accelerate BLAS for the GatedDeltaNet attention layers. You'll need an M3 Max or similar, a 1TB SSD, and patience for the weight extraction and repacking process. There's an optional 2-bit quantization path that's faster but breaks tool calling. This is what happens when someone decides PyTorch is too slow and writes 7000 lines of Objective-C instead.
npx skills add https://github.com/aradotso/trending-skills --skill flash-moe-inference
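
To make the on-demand streaming idea concrete, here is a minimal C sketch of pulling several expert blocks from a single weight file with parallel pread calls on worker threads. The file name, expert size, offsets, and thread count are placeholders for illustration, not the repo's actual layout or code.

```c
/* Minimal sketch of streaming expert weights from SSD with parallel pread.
 * Layout constants and "experts.bin" are assumptions, not the real format. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_THREADS  4
#define EXPERT_BYTES (8u * 1024 * 1024)   /* assumed size of one packed expert */

typedef struct {
    int    fd;      /* shared read-only fd for the weight file   */
    off_t  offset;  /* byte offset of this expert in the file    */
    void  *dst;     /* preallocated buffer slice for this expert */
} read_job;

static void *read_expert(void *arg) {
    read_job *job = (read_job *)arg;
    size_t done = 0;
    /* pread is safe on a shared fd across threads: explicit offset, no seek */
    while (done < EXPERT_BYTES) {
        ssize_t n = pread(job->fd, (char *)job->dst + done,
                          EXPERT_BYTES - done, job->offset + (off_t)done);
        if (n <= 0) { perror("pread"); break; }
        done += (size_t)n;
    }
    return NULL;
}

int main(void) {
    int fd = open("experts.bin", O_RDONLY);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    void *buf = malloc((size_t)NUM_THREADS * EXPERT_BYTES);
    pthread_t tid[NUM_THREADS];
    read_job  job[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        job[i].fd     = fd;
        job[i].offset = (off_t)i * EXPERT_BYTES;  /* pretend the router chose these */
        job[i].dst    = (char *)buf + (size_t)i * EXPERT_BYTES;
        pthread_create(&tid[i], NULL, read_expert, &job[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);

    close(fd);
    free(buf);
    return 0;
}
```

In a real engine the offsets would come from the MoE router's expert selection each layer, and the buffers would feed the dequantization kernels; this sketch only shows the concurrent-read pattern.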