Meta's 86M parameter classifier that catches prompt injection and jailbreak attempts before they hit your LLM. Returns three scores: benign, injection (embedded instructions in third-party data), and jailbreak (direct override attempts). Runs in under 2ms on GPU, works across 8 languages, and hits 99%+ true positive rate with sub-1% false positives. The dual-mode design is smart: use stricter thresholds (0.5) for user inputs, looser ones (0.3) for RAG documents and API responses. Batch processing support makes it practical for filtering retrieved docs at scale. One caveat: legitimate security research queries can trigger false positives, so you'll want context-aware thresholds for trusted users. Best layered with content moderators like LlamaGuard rather than used as your only defense.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill prompt-guard
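The dual-mode thresholding and batch filtering described above, as an added example that is not code from Meta or from this skill. The logits are constructed stand-ins for classifier output. The label order, the choice of which labels are checked per source, and the 0.8 threshold for trusted users are assumptions.

```python
import math

LABELS = ("benign", "injection", "jailbreak")  # assumed output order
# 0.5 and 0.3 are the values quoted above; 0.8 for trusted users is an assumed value.
THRESHOLDS = {"user": 0.5, "third_party": 0.3, "trusted_user": 0.8}
# Assumption: user turns are gated on jailbreak only, third-party data on both labels.
CHECKED = {"user": ("jailbreak",), "trusted_user": ("jailbreak",),
           "third_party": ("injection", "jailbreak")}

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate(logits, source):
    """Return (allowed, label_that_tripped_or_None, scores) for one text."""
    if source not in THRESHOLDS:
        raise ValueError(f"unknown source {source!r}")  # fail closed on typos
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    scores = dict(zip(LABELS, softmax(logits)))
    for label in CHECKED[source]:
        if scores[label] >= THRESHOLDS[source]:
            return False, label, scores
    return True, None, scores

def filter_documents(batch):
    """Split (doc_id, logits) pairs into kept ids and dropped (id, label) pairs."""
    kept, dropped = [], []
    for doc_id, logits in batch:
        allowed, label, _ = gate(logits, "third_party")
        if allowed:
            kept.append(doc_id)
        else:
            dropped.append((doc_id, label))
    return kept, dropped

# Constructed logits: a security-research question and two retrieved documents.
RESEARCH_QUERY = [0.0, -2.0, 0.6]   # jailbreak score is about 0.62
RETRIEVED = [("doc-1", [3.0, -1.0, -1.0]),   # clearly benign
             ("doc-2", [1.0, 0.5, -2.0])]    # injection score is about 0.37

if __name__ == "__main__":
    print("user:", gate(RESEARCH_QUERY, "user")[:2])
    print("trusted_user:", gate(RESEARCH_QUERY, "trusted_user")[:2])
    print("retrieved:", filter_documents(RETRIEVED))
```

The same text can pass in one mode and fail in the other: `doc-2` is dropped as third-party data but its logits would pass as a user turn, and the research query is blocked for an ordinary user but allowed for a trusted one.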