A unified interface for Google Gemini's multimodal capabilities: transcribe audio up to 9.5 hours long, analyze images with object detection and segmentation, process videos (including YouTube URLs), extract tables and charts from PDFs, and generate images from text. It supports both Google AI Studio and Vertex AI, with context windows of up to 2M tokens. The main benefit is one consistent API for every modality instead of a separate service for each. Use gemini-2.5-flash for most work; segmentation and image generation require specific models. It also includes utilities for optimizing large files before upload and for converting documents to Markdown. It is built for developers who need multimodal AI without maintaining many separate integrations.
npx skills add https://github.com/mrgoonie/claudekit-skills --skill ai-multimodal
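
The model guidance in the description can be sketched as a small selection helper. This is only an illustration, not the skill's actual code: the function name `choose_model`, the task labels, and the `override` parameter are all assumptions. The description does not name the specialized models for segmentation or image generation, so the sketch asks the caller to supply one rather than guessing.

```python
from typing import Optional

# Default named in the description above.
DEFAULT_MODEL = "gemini-2.5-flash"

# Hypothetical task labels; the skill itself may name these differently.
SPECIALIZED_TASKS = {"segmentation", "image_generation"}


def choose_model(task: str, override: Optional[str] = None) -> str:
    """Return a model ID for a task (illustrative helper, not part of the skill)."""
    if override:
        # An explicit choice by the caller always wins.
        return override
    if task in SPECIALIZED_TASKS:
        # The description says these tasks need specific models but does not
        # name them, so fail loudly instead of falling back to the default.
        raise ValueError(
            f"Task '{task}' needs a specific model; pass it via override."
        )
    return DEFAULT_MODEL


if __name__ == "__main__":
    print(choose_model("transcription"))  # prints "gemini-2.5-flash"
```

Failing loudly for the specialized tasks keeps a general-purpose default from being sent silently to a request it may not support.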