This is a complete local TTS workflow built around the Qwen3-TTS models, with three distinct modes: CustomVoice for built-in speakers with emotion control, VoiceDesign for describing a voice in natural language (like "high-pitched loli voice"), and VoiceClone for mimicking reference audio. The real utility is batch dubbing: it turns long articles into multi-voice audio with automatic speaker assignment and emotion tagging, then merges the segments with FFmpeg. It supports Chinese, English, Japanese, and Korean, with speakers like Vivian and Ryan available out of the box. The documentation is thorough and includes actual command examples, though it leans heavily on Chinese-language content. You'll need a GPU for reasonable performance, and FFmpeg must be installed for the batch features. If you're doing voiceovers for articles, audiobooks, or multi-character dialogue, this handles the whole pipeline from text splitting to the final WAV.
npx skills add https://github.com/mu-zi-lee/qwen3-tts-skill --skill qwen3-tts-skills
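
To give a feel for the FFmpeg merge step described above, here is a minimal Python sketch of joining per-segment WAV files. It is not taken from the skill itself: the use of FFmpeg's concat demuxer is an assumption about how such a merge could be done, and the function names and segment file names are hypothetical.

```python
import subprocess
from pathlib import Path


def concat_list(segments):
    """Return the text of an FFmpeg concat-demuxer list for the given files."""
    lines = []
    for seg in segments:
        # A single quote inside a path is closed, escaped, and reopened.
        escaped = str(seg).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def merge_command(list_path, output_path):
    """Build the FFmpeg argument list for a stream-copy concat.

    Stream copy (-c copy) assumes every segment shares the same
    sample rate, channel count, and sample format.
    """
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", str(list_path),
        "-c", "copy",
        str(output_path),
    ]


def merge_segments(segments, output_path, list_path="segments.txt"):
    """Write the list file and run FFmpeg (requires ffmpeg on PATH)."""
    Path(list_path).write_text(concat_list(segments), encoding="utf-8")
    subprocess.run(merge_command(list_path, output_path), check=True)


# Hypothetical usage, one file per dubbed passage:
# merge_segments(["seg_001_vivian.wav", "seg_002_ryan.wav"], "article.wav")
```

Stream copy avoids re-encoding, which only works when all segments come out of the TTS step in the same format; if they differ, the segments would need to be re-encoded instead.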