This is a full-featured text-to-speech system with built-in voice cloning. You feed it reference audio (3 to 30 seconds of clean speech) and it extracts voice characteristics that you can reuse across different text. It analyzes the text for emotional context and automatically adjusts speed, pitch, and volume to match the sentiment. The dual-model setup is practical: the 1.7B model for quality work such as audiobooks, the 0.6B model for real-time applications. Streaming mode handles long text by splitting it into chunks and synthesizing them in sequence. The larger model needs 8 GB or more of VRAM; the smaller one runs on CPU. The emotion adaptation is agent-driven, so results depend on how well the agent reads the mood of your text.
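The description does not say how streaming mode chunks long text. As an illustration only, here is a minimal Python sketch of one common approach: break at sentence boundaries and pack sentences into chunks under a character budget. The function name `chunk_text` and the `max_chars` parameter are hypothetical and are not part of the skill's API.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: breaks after sentence-ending punctuation where
    possible, and falls back to word boundaries for over-long sentences.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the budget is cut at the last
        # space before the limit (or hard-cut if it contains no space).
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars)
            if cut <= 0:
                cut = max_chars
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if not sentence:
            continue
        # Pack the sentence into the current chunk if it still fits.
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "Hello there. How are you? I am fine."
    for piece in chunk_text(sample, max_chars=25):
        print(piece)
```

Each chunk could then be handed to the synthesizer in order, so audio for the first chunk can start playing while later chunks are still being generated. A smaller budget lowers time to first audio but gives the model less context per chunk for prosody.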
npx -y skills add anbeime/skill --skill tts-voice-synthesis --agent claude-code
Installs into .claude/skills of the current project.