Use this skill to convert models to GGUF and run them with llama.cpp, especially if you're deploying on Apple Silicon or other consumer hardware, or need CPU inference. It covers the full workflow: converting Hugging Face models, quantizing with K-quant methods (Q4_K_M is the recommended default), and using importance matrices to preserve quality at lower bit widths. It includes CLI examples and Python examples using llama-cpp-python, plus server mode for OpenAI-compatible APIs. The real value is in the practical workflows and the quantization comparison table of size/quality tradeoffs. If you need maximum speed on NVIDIA GPUs in production, you'd want TensorRT-LLM instead, but for flexible deployment across hardware without requiring a GPU, this covers the whole path from conversion to serving.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill gguf-quantization
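
To give a feel for the convert, quantize, and importance-matrix workflow described above, here is a minimal Python sketch that only assembles the command lines and does not run them. It assumes a llama.cpp checkout that provides `convert_hf_to_gguf.py` and the `llama-quantize` and `llama-imatrix` binaries on your PATH; the helper name `build_gguf_commands`, the output file names, and the example paths are hypothetical, and flags can differ between llama.cpp versions, so check them against your build before running anything.

```python
import shlex
from pathlib import PurePosixPath
from typing import List, Optional


def build_gguf_commands(
    model_dir: str,
    out_dir: str,
    quant_type: str = "Q4_K_M",
    calibration_file: Optional[str] = None,
) -> List[List[str]]:
    """Assemble (but do not execute) the llama.cpp commands for the workflow.

    Hypothetical helper for illustration. Tool names and flags are assumed
    to match a recent llama.cpp build; file names are placeholders.
    """
    out = PurePosixPath(out_dir)
    f16_path = str(out / "model-f16.gguf")
    quant_path = str(out / f"model-{quant_type.lower()}.gguf")

    # Step 1: convert the Hugging Face checkpoint to a 16-bit GGUF file.
    commands = [[
        "python", "convert_hf_to_gguf.py", model_dir,
        "--outfile", f16_path, "--outtype", "f16",
    ]]

    quantize = ["llama-quantize"]
    if calibration_file is not None:
        # Optional step 2: compute an importance matrix from calibration text.
        imatrix_path = str(out / "imatrix.dat")
        commands.append([
            "llama-imatrix", "-m", f16_path,
            "-f", calibration_file, "-o", imatrix_path,
        ])
        quantize += ["--imatrix", imatrix_path]

    # Final step: quantize the f16 GGUF to the requested type.
    quantize += [f16_path, quant_path, quant_type]
    commands.append(quantize)
    return commands


if __name__ == "__main__":
    # Example paths are placeholders, not files shipped with the skill.
    for cmd in build_gguf_commands("./my-model", "out", "Q4_K_M", "calibration.txt"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before passing each list to `subprocess.run(cmd, check=True)`.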