This gets a full multimodal voice assistant running locally with no API costs. Gemma 4 E2B handles speech and vision understanding, Kokoro TTS speaks back, and everything runs on device behind a FastAPI WebSocket server. The pipeline is simple: Silero VAD in the browser detects when you're talking, the browser sends audio plus camera frames over WebSocket, and the server streams sentence-level TTS back so you hear responses before they're fully generated. Runs on Apple Silicon with MLX or Linux with ONNX. Expect around 2.5 to 3 seconds of end-to-end latency on an M3 Pro. You can interrupt mid-sentence, tweak the system prompt to change behavior, and adjust VAD sensitivity. Good for prototyping private voice interfaces or experimenting with on-device inference without racking up cloud bills.
npx skills add https://github.com/aradotso/trending-skills --skill parlor-on-device-ai
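
The sentence-level streaming mentioned above is what lets audio start before the model has finished generating. As a rough sketch of the idea (not Parlor's actual code; `sentences_from_tokens` and the boundary regex are hypothetical names chosen for illustration), the server can buffer streamed tokens and release each sentence to TTS as soon as it is complete:

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: '.', '!' or '?' followed by whitespace.
# This is a simplification; abbreviations like "e.g. " would split too early.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Everything except the last piece is a finished sentence;
        # the last piece may still be growing, so keep it buffered.
        parts = _BOUNDARY.split(buffer)
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever is left when the model stops generating.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    # Stand-in for tokens streamed from the language model.
    stream = ["Hel", "lo the", "re. How", " are", " you? I", " am fine"]
    for sentence in sentences_from_tokens(stream):
        # In a real server, each sentence would be handed to TTS here
        # and the resulting audio sent back over the WebSocket.
        print(sentence)
```

Because the generator yields as soon as a boundary appears, the first sentence can be synthesized and played while later ones are still being generated, and an interruption only has to discard the sentences not yet spoken.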