CLIP lets you match images and text without training a classifier first. You give it an image and a list of labels ("a dog", "a cat", "a sunset"), and it ranks the labels by cosine similarity between the image embedding and each label's text embedding. It works surprisingly well for semantic image search, content moderation, and zero-shot classification. It was trained on 400M image-text pairs from the web. The ViT-B/32 model is the sweet spot for speed and quality. The main limitation is that it embeds the whole image at once, so it doesn't handle regions or fine-grained details well, and vague text prompts give you vague results. If you need actual captions or chat, use BLIP-2 or LLaVA instead.
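
The label-matching step described above can be sketched in plain Python. This is a minimal illustration, not CLIP itself: the 3-dimensional vectors are made-up stand-ins for the embeddings a CLIP image encoder and text encoder would produce, and `zero_shot_scores` is a hypothetical helper name. The `scale=100.0` default is an assumption that mirrors the temperature typically applied to CLIP similarities before the softmax.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_scores(image_embedding, label_embeddings, scale=100.0):
    """Turn image-to-label cosine similarities into probabilities.

    `scale` is an assumed temperature; larger values sharpen the softmax.
    """
    labels = list(label_embeddings)
    logits = [
        scale * cosine_similarity(image_embedding, label_embeddings[label])
        for label in labels
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy embeddings (assumption): real ones would come from the CLIP encoders.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a dog": [1.0, 0.0, 0.1],
    "a cat": [0.1, 1.0, 0.0],
    "a sunset": [0.0, 0.2, 1.0],
}

scores = zero_shot_scores(image_embedding, label_embeddings)
best_label = max(scores, key=scores.get)
print(best_label)
```

Because the scores are a softmax over the labels you supply, they only say which label fits best relative to the others. An image that matches none of the labels still gets a "winner", which is one reason specific prompts tend to work better than vague ones.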
npx skills add https://github.com/orchestra-research/ai-research-skills --skill clip