This skill provides local speech-to-text with Faster Whisper, built specifically for the JARVIS voice assistant. It is a complete, TDD-based implementation that prioritizes privacy (audio is processed locally and deleted immediately) and real-time performance through VAD filtering and streaming transcription. The guide walks through model selection from tiny to large-v3, sets concrete latency targets (under 300ms for short audio), and shows optimization patterns such as int8 quantization on CPU and chunked processing, so transcription can begin before a recording is complete. The skill carries a medium risk rating because it processes audio and therefore raises privacy concerns, but it is a solid choice if you need offline voice recognition without cloud dependencies.
npx skills add https://github.com/martinholovsky/claude-skills-generator --skill speech-to-text
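
The optimization patterns named in the description (int8 quantization on CPU, VAD filtering, chunked processing) can be sketched roughly as follows. This is a minimal illustration and not the skill's actual code: the helper names `iter_chunks` and `transcribe_chunks`, the 2-second chunk length, and the `base` model size are assumptions chosen for the example. Only `WhisperModel`, `compute_type="int8"`, and `vad_filter=True` come from the faster-whisper library itself.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio


def iter_chunks(
    samples: Sequence[float],
    chunk_seconds: float = 2.0,  # assumed chunk length, not taken from the skill
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[Sequence[float]]:
    """Yield fixed-length slices so transcription can start before recording ends."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    step = max(1, int(chunk_seconds * sample_rate))
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


def transcribe_chunks(samples: Sequence[float], model_size: str = "base") -> List[str]:
    """Transcribe audio chunk by chunk on CPU (hypothetical helper)."""
    # Imported lazily so iter_chunks stays usable without these dependencies.
    import numpy as np
    from faster_whisper import WhisperModel

    # int8 quantization reduces memory use and speeds up CPU inference.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")

    texts: List[str] = []
    for chunk in iter_chunks(samples):
        audio = np.asarray(chunk, dtype=np.float32)
        # vad_filter=True skips non-speech regions before decoding.
        segments, _info = model.transcribe(audio, vad_filter=True)
        # segments is a lazy generator; iterating it runs the decoding.
        texts.extend(segment.text.strip() for segment in segments)
    return texts
```

Because the audio stays in memory as an array and is never written to disk, nothing remains to clean up once the function returns. Cutting at fixed boundaries can split a word across two chunks, so a real pipeline would more likely cut on VAD-detected silences or overlap adjacent chunks.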