LLaVA combines a CLIP vision encoder with a Vicuna language model to build conversational image-understanding systems. Use it for visual question answering, multi-turn image chat, or document understanding, where you need natural back-and-forth dialogue about images. The 7B model is the sweet spot at roughly 14 GB of VRAM in half precision, and 4-bit quantization brings that down to about 4 GB. The code is open source under Apache 2.0, and the model scores competitively on VQAv2 and GQA. Be aware that it hallucinates details and struggles with spatial reasoning and counting, so verify critical information. If you only need image classification, stick with CLIP. If you need GPT-4V quality and have the budget, use the GPT-4V API instead. LLaVA is for when you want capable vision-language chat running on your own infrastructure.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill llava
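
The VRAM figures quoted above for the 7B model (about 14 GB, dropping to about 4 GB with quantization) can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate, not a measurement: it assumes roughly 7 billion parameters, assumes the 14 GB figure refers to fp16 weights and the 4 GB figure to 4-bit weights, and counts weights only. The function name `weight_memory_gb` and the precision table are illustrative, not part of LLaVA or any library.

```python
# Rough weight-memory estimate for a language model.
# Assumptions (not from LLaVA itself): ~7e9 parameters, weights only.
# Activations, the KV cache, and the vision encoder add overhead on top.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(7e9, precision)
        print(f"{precision}: {gb:.1f} GB")
```

Under these assumptions fp16 weights come to 14 GB and 4-bit weights to 3.5 GB, which is consistent with the quoted 14 GB and 4 GB once some runtime overhead is added. Actual usage depends on image resolution, context length, and the inference stack, so treat these numbers as a lower bound.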