This skill generates talking-head videos from a portrait photo plus text or audio. P-Video-Avatar is the standout here: 18x faster and 6x cheaper than alternatives, with built-in text-to-speech in 30 voices and 10 languages, and output at up to 1080p. You can control tone and background with separate prompts. It's good for cranking out UGC-style ads, product demos, or explainer videos at scale without touching video editing software. The workflow is simple: generate a portrait with P-Image, then feed it to P-Video-Avatar along with your script. Other models, such as OmniHuman, handle multi-character scenes if you need that, but for most cases the all-in-one speed and cost of P-Video-Avatar are hard to beat.
npx skills add https://github.com/inference-sh-skills/skills --skill ai-avatar-video
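
The two-step workflow described above (portrait first, then the avatar video) can be sketched as a small plan builder. This is only an illustration: the model IDs, the input field names (`prompt`, `image`, `script`, `tone_prompt`, `background_prompt`, `voice`, `language`, `resolution`), and the `from_step` reference are all hypothetical placeholders, not the actual P-Image or P-Video-Avatar API. It builds the plan as plain data and makes no network calls.

```python
import json

# Hypothetical model identifiers, used only for illustration.
PORTRAIT_MODEL = "p-image"
AVATAR_MODEL = "p-video-avatar"


def build_avatar_pipeline(portrait_prompt, script, tone_prompt="",
                          background_prompt="", voice="default",
                          language="en", resolution="1080p"):
    """Return a two-step plan: portrait generation, then the avatar video.

    Every field name here is an assumption, not a documented parameter.
    """
    if not script.strip():
        raise ValueError("script must not be empty")

    # Step 1: generate the portrait image from a text prompt.
    portrait_step = {
        "model": PORTRAIT_MODEL,
        "inputs": {"prompt": portrait_prompt},
    }

    # Step 2: animate the portrait, speaking the script with built-in TTS.
    avatar_inputs = {
        "image": {"from_step": 0},  # placeholder reference to step 1's output
        "script": script,
        "voice": voice,
        "language": language,
        "resolution": resolution,
    }
    # Tone and background are separate, optional prompts.
    if tone_prompt:
        avatar_inputs["tone_prompt"] = tone_prompt
    if background_prompt:
        avatar_inputs["background_prompt"] = background_prompt

    avatar_step = {"model": AVATAR_MODEL, "inputs": avatar_inputs}
    return [portrait_step, avatar_step]


if __name__ == "__main__":
    plan = build_avatar_pipeline(
        portrait_prompt="friendly presenter, studio lighting",
        script="Here is what our new product does.",
        tone_prompt="upbeat and casual",
        background_prompt="bright home office",
    )
    print(json.dumps(plan, indent=2))
```

Keeping the plan as plain data makes it easy to generate many variations (different scripts, voices, or backgrounds) in a loop before sending anything to a real endpoint.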