Converts speech to text with timestamps and optional speaker identification, which suits meeting transcription, subtitle generation, and other voice-recording workflows. Supports multiple providers: run locally with Transformers.js for zero API costs, or use fal, Replicate, or RunPod for cloud processing with diarization. The JSON output includes segment-level timing, plus speaker labels when diarization is enabled. One caveat: the local provider uses Moonshine instead of Whisper and claims 5x speed gains, but it does not support speaker identification, so you'll need a cloud provider if that matters for your use case. Works with mp3, wav, m4a, and ogg files, and auto-detects the language if you don't specify one.
npx skills add https://github.com/agntswrm/agent-media --skill audio-transcribe
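
For the subtitle use case mentioned above, here is a minimal sketch of turning segment-level output into an SRT file. The description does not document the JSON schema, so the field names used here (`segments`, `start`, `end`, `text`, `speaker`) and the sample data are assumptions for illustration only; check them against the skill's actual output before relying on this.

```python
import json


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text from a list of transcript segments.

    Assumed (hypothetical) segment shape:
    {"start": float, "end": float, "text": str, "speaker": str (optional)}
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        speaker = seg.get("speaker")  # absent when diarization is off
        if speaker:
            text = f"[{speaker}] {text}"
        timing = f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


# Hypothetical sample mimicking the assumed output shape.
sample = json.loads("""
{
  "segments": [
    {"start": 0.0, "end": 2.5, "text": " Hello everyone.", "speaker": "SPEAKER_00"},
    {"start": 2.5, "end": 4.0, "text": "Hi."}
  ]
}
""")

srt_text = segments_to_srt(sample["segments"])
print(srt_text)
```

Because the speaker label is read with `.get`, the same function handles output from the local provider (no speaker labels) and from the cloud providers (labels present).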