A comprehensive reference for vision, audio, and video generation across multiple providers. Covers image analysis with Claude Opus 4-8 and GPT-5.5, document processing with Gemini 3.1, speech-to-text with support for audio up to 9.5 hours long, and AI video generation with Kling v3, Sora 2, Veo 3.1, and Runway Gen-4.5. The multi-shot video patterns are especially useful if you need character consistency across scenes. Includes canonical model IDs, provider comparisons, and async polling patterns. The common mistakes section will save you from the usual traps, such as forgetting max_tokens on vision requests or trying to use video APIs synchronously. There are nine rules in total, and the quick reference table makes them easy to navigate.
npx skills add https://github.com/yonatangross/orchestkit --skill multimodal-llm
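
For readers unfamiliar with the async polling idea mentioned in the description, here is a minimal sketch of the general shape: submit a job, then check its status at intervals until it finishes or a timeout is hit. This is not taken from the skill or from any provider's SDK. The function name `poll_until_done`, the `fetch_status` callback, and the status strings "queued", "running", "succeeded", and "failed" are all hypothetical placeholders; real video APIs use their own field names and states.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until the job succeeds, fails, or times out.

    fetch_status is a hypothetical callback that returns the job's current
    state as a dict with a "status" key (an assumption, not a real API shape).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "succeeded":
            return job
        if status == "failed":
            raise RuntimeError(f"job failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        # Wait before the next check instead of blocking on one long request.
        sleep(interval_s)


# Simulated job that needs three status checks to finish (no network calls).
_fake_states = iter(
    [
        {"status": "queued"},
        {"status": "running"},
        {"status": "succeeded", "url": "https://example.com/video.mp4"},
    ]
)
result = poll_until_done(
    lambda: next(_fake_states), interval_s=0.0, sleep=lambda _seconds: None
)
print(result["url"])
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in practice `fetch_status` would wrap whatever status endpoint the chosen provider exposes.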