Three preprocessing commands that generate assets for video compositions: text-to-speech with Kokoro (54 voices, 9 languages, runs locally), Whisper transcription (word-level timestamps for captions), and background removal with u2net (transparent cutouts for overlay work). Each tool downloads its model on first run and caches it under ~/.cache/hyperframes/. The real value is in the output formats: TTS produces clean WAV files ready to drop into a timeline; transcription normalizes everything to the same JSON shape whether you're importing SRT or VTT or running fresh inference; background removal can emit both the cutout and the inverse plate in one pass, which saves a step when you need layered composites. Useful if you're assembling programmatic video and need narration, captions, or transparent talking heads without calling external APIs.
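The actual JSON schema hyperframes emits isn't documented here, but the normalization idea can be sketched. The snippet below is a minimal, hypothetical example of collapsing SRT cues into one segment-level shape (`{"segments": [{"start", "end", "text"}]}`); the field names and structure are assumptions for illustration, not the tool's real schema.

```python
import re

# Assumed timestamp pattern: HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT).
TS = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def ts_to_seconds(ts: str) -> float:
    """Convert a subtitle timestamp to seconds as a float."""
    h, m, s, ms = map(int, TS.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000

def srt_to_segments(srt: str) -> dict:
    """Normalize SRT text into a hypothetical segment-list JSON shape."""
    segments = []
    for block in srt.strip().split("\n\n"):
        lines = block.splitlines()
        # Skip malformed blocks; line 2 of a cue holds "start --> end".
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = (p.strip() for p in lines[1].split("-->"))
        segments.append({
            "start": ts_to_seconds(start),
            "end": ts_to_seconds(end),
            "text": " ".join(lines[2:]).strip(),
        })
    return {"segments": segments}

demo = """1
00:00:00,000 --> 00:00:01,500
Hello world.

2
00:00:01,500 --> 00:00:03,250
Second caption line."""

print(srt_to_segments(demo))
```

Once captions from any source land in one shape like this, downstream timeline code only has to handle a single format, which is presumably the point of the tool's normalization step.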
npx skills add https://github.com/heygen-com/hyperframes --skill hyperframes-media