MOSS-TTS-Nano is a 0.1B-parameter text-to-speech model that runs on CPU and supports voice cloning in 20 languages. It is far smaller than most TTS systems, yet it generates 48 kHz stereo audio in real time using a pipeline of an audio tokenizer and an autoregressive LLM. You give it a reference audio clip and your text, and it speaks the text in the reference voice. The streaming API suits low-latency applications, and the FastAPI server keeps the model loaded in memory, so repeated requests avoid the reload cost. It is a good choice when you need multilingual voice cloning without a GPU, or when you are prototyping speech features locally. Because generation is autoregressive, quality may not match diffusion-based models, but the trade-off in speed and resource use is reasonable for real-time use cases.
npx skills add https://github.com/aradotso/trending-skills --skill moss-tts-nano-speech
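
The description rests on two ideas: the model is loaded once and reused across requests, and audio is streamed fast enough to count as real time. The sketch below illustrates both using only the Python standard library. It does not use the MOSS-TTS-Nano API: `StandInModel`, `get_model`, `synthesize`, the 16-bit sample width, and the 0.05 s of audio per character are all hypothetical placeholders chosen for illustration. Only the 48 kHz stereo format comes from the description.

```python
import time
from functools import lru_cache

SAMPLE_RATE = 48_000      # from the description: 48 kHz output
CHANNELS = 2              # from the description: stereo
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM
FRAMES_PER_CHAR = 2_400   # assumption: 0.05 s of audio per input character


class StandInModel:
    """Hypothetical placeholder, NOT the real MOSS-TTS-Nano interface."""

    load_count = 0  # counts how many times a model instance was built

    def __init__(self):
        StandInModel.load_count += 1

    def stream(self, text, chunk_frames=4_800):
        """Yield chunks of silent PCM, 0.1 s per full chunk."""
        total_frames = len(text) * FRAMES_PER_CHAR
        bytes_per_frame = CHANNELS * BYTES_PER_SAMPLE
        for start in range(0, total_frames, chunk_frames):
            frames = min(chunk_frames, total_frames - start)
            yield bytes(frames * bytes_per_frame)


@lru_cache(maxsize=1)
def get_model():
    """Build the model on first use, then return the cached instance."""
    return StandInModel()


def synthesize(text):
    """Return (pcm_bytes, audio_seconds, real_time_factor) for `text`."""
    model = get_model()
    started = time.perf_counter()
    pcm = b"".join(model.stream(text))
    elapsed = time.perf_counter() - started
    audio_seconds = len(pcm) / (SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE)
    # Real-time factor: compute time divided by audio duration.
    # A value below 1.0 means audio is produced faster than it plays back.
    rtf = elapsed / audio_seconds if audio_seconds else float("inf")
    return pcm, audio_seconds, rtf
```

Calling `synthesize` twice builds the stand-in model only once, which is the same effect a long-running server gets by holding the model in memory. With a real model, the real-time factor is the number to watch: it should stay below 1.0 on the target CPU for streaming playback to keep up.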