BLIP-2 bridges a frozen vision encoder and a frozen large language model with a lightweight Q-Former, giving you high-quality image captioning and visual question answering without fine-tuning either large model. Only the Q-Former (~188M parameters) is trained while the ViT and LLM stay frozen, which was enough for state-of-the-art zero-shot results at the time of release. Checkpoints come in OPT (2.7B, 6.7B) and FlanT5 (XL, XXL) variants, and the transformers library makes them straightforward to load and run. The architecture is clever: a fixed set of learned queries cross-attends to the image features, and the query outputs are then fed into the frozen LLM for generation. If you need instruction-following specifically, InstructBLIP is the successor, but BLIP-2 remains solid for straightforward captioning and VQA tasks where you want LLM-quality reasoning about images.
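To make the "learned queries cross-attend to image features" step concrete, here is a minimal toy sketch in plain Python. It is not the BLIP-2 implementation: it uses a single attention head with no projection matrices, random numbers stand in for trained queries and ViT outputs, and the names `cross_attend`, `random_matrix`, `DIM`, and `NUM_QUERIES` are hypothetical. It only illustrates why the LLM sees a fixed number of visual tokens regardless of how many patches the image produces.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_features):
    """Single-head scaled dot-product cross-attention (toy version).

    queries:        num_queries x dim, learned and independent of the image
    image_features: num_patches x dim, as produced by a frozen vision encoder
    Returns num_queries x dim: one image summary vector per query.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every image patch.
        scores = [
            sum(a * b for a, b in zip(q, patch)) / math.sqrt(dim)
            for patch in image_features
        ]
        weights = softmax(scores)
        # Attention-weighted average of the patch features.
        outputs.append([
            sum(w * patch[j] for w, patch in zip(weights, image_features))
            for j in range(dim)
        ])
    return outputs


def random_matrix(rows, cols, rng):
    """Hypothetical helper: Gaussian noise standing in for real tensors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8          # toy width, far smaller than a real model
NUM_QUERIES = 4  # toy count; the BLIP-2 paper uses 32 queries

learned_queries = random_matrix(NUM_QUERIES, DIM, rng)  # stand-in for trained parameters
small_image = random_matrix(16, DIM, rng)               # 16 patch features
large_image = random_matrix(64, DIM, rng)               # 64 patch features

summary_small = cross_attend(learned_queries, small_image)
summary_large = cross_attend(learned_queries, large_image)

# Both images are compressed to NUM_QUERIES vectors before reaching the LLM.
print(len(summary_small), len(summary_large))  # prints "4 4"
```

The point of the design is that the query count, not the image resolution, sets the length of the visual input handed to the frozen LLM, which keeps the trainable bridge small and the LLM's context cost constant.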
npx skills add https://github.com/orchestra-research/ai-research-skills --skill blip-2-vision-language