If you're building LLM applications, you need evals, and Phoenix gives you a practical framework for building them. The approach is sensible: start with error analysis to understand what's actually failing, build code-based evaluators for deterministic checks first, then layer in LLM judges for the nuanced cases. The skill covers pre-built evaluators for common patterns like RAG, but the real value is in helping you build custom ones from your actual failures. One thing I appreciate: it's explicit about validating your evaluators against human labels (aiming for 80%+ accuracy), and it prefers binary pass/fail over fuzzy scoring scales. It works in both Python and TypeScript and requires a running Phoenix server.
npx skills add https://github.com/arize-ai/phoenix --skill phoenix-evals
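
To make the "code-based evaluator first, then validate against human labels" idea concrete, here is a minimal sketch in plain Python. It does not use the Phoenix API: the names `contains_citation`, `agreement`, and `labeled`, along with the sample data, are all hypothetical and only illustrate a binary pass/fail check scored against hand-labeled examples with the 80% target mentioned above.

```python
import re

# Hypothetical deterministic evaluator: pass if the output contains
# a numeric bracketed citation such as [1] or [12].
def contains_citation(output: str) -> bool:
    return re.search(r"\[\d+\]", output) is not None

# Fraction of examples where the evaluator's verdict matches the human label.
def agreement(evaluator, examples) -> float:
    matches = sum(evaluator(text) == label for text, label in examples)
    return matches / len(examples)

# Made-up (output, human_label) pairs; True means a human judged it a pass.
labeled = [
    ("Paris is the capital of France [1].", True),
    ("Paris is the capital of France.", False),
    ("See [2] and [3] for details.", True),
    ("As noted in [source], it rains often.", False),
    # A human failed this one because the bracket is a section number,
    # not a citation; the regex cannot tell the difference.
    ("Jump to section [4] of the form.", False),
]

score = agreement(contains_citation, labeled)
print(f"agreement with human labels: {score:.0%}")
print("good enough to trust" if score >= 0.8 else "refine the evaluator")
```

The last example is a deliberate disagreement: reading the mismatches is usually what tells you whether to tighten the code check or hand that case to an LLM judge.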