This skill converts written documents into text-to-speech (TTS) narration with word-level timing for video production. It splits your markdown into natural scene breaks based on argument flow rather than heading structure alone, generates audio with a local Qwen3-TTS model, then transcribes that audio with Whisper to get precise word timestamps for captions. The full narration pipeline is preferable to the legacy per-scene approach because it avoids volume jumps between concatenated clips. The output is scene text files, WAV audio, VTT captions, and timing boundaries, all ready for video composition. It requires a heavy local setup: Deno, Python 3.12, ffmpeg, and a 7.8GB TTS model. That makes it a fit for people already committed to local TTS workflows who need word-level timing data.
npx skills add https://github.com/jwynia/agent-skills --skill document-to-narration
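
To make the caption step described above more concrete, here is a minimal Python sketch of how word-level timestamps could be grouped into WebVTT cues. It is an illustration only, not the skill's actual implementation: the `Word` class, the `format_timestamp` and `words_to_vtt` helpers, and the fixed `max_words` grouping rule are all hypothetical, and the sample timings are made up.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word with its timing (hypothetical structure)."""
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def words_to_vtt(words: list[Word], max_words: int = 7) -> str:
    """Group timed words into fixed-size cues and render a WebVTT document.

    Assumption: cues are cut every `max_words` words. A real pipeline
    might instead break on punctuation or pauses.
    """
    lines = ["WEBVTT", ""]
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_timestamp(chunk[0].start)
        end = format_timestamp(chunk[-1].end)
        lines.append(f"{start} --> {end}")
        lines.append(" ".join(w.text for w in chunk))
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data standing in for transcription output.
    sample = [
        Word("Hello", 0.0, 0.4),
        Word("world", 0.5, 0.9),
        Word("again", 1.0, 1.5),
    ]
    print(words_to_vtt(sample, max_words=2))
```

Each cue spans from the first word's start to the last word's end, so caption timing follows the transcribed audio rather than an estimate derived from text length.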