This skill handles the full lifecycle of LLM-as-judge and code-based evaluators in Arize: creating evaluators with custom prompts and classification choices, mapping template variables to actual span or experiment columns, running evals against traces or datasets, and setting up continuous monitoring as new data arrives. You can score at span, trace, or session granularity, which matters if you're evaluating multi-turn conversations or agent trajectories rather than single outputs. The skill wraps the ax CLI and assumes you've already configured Arize credentials. One thing to watch: if an eval run fails, the skill won't fabricate results or fall back to manually assigned scores. It tells you what broke and points you at the integration settings or support.
npx skills add https://github.com/arize-ai/arize-skills --skill arize-evaluator
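
To make the variable-mapping and classification-choice ideas above concrete, here is a minimal Python sketch of the concept only. It does not use the Arize or ax API: the template text, the column names, and the helpers render_prompt and score_label are all hypothetical, chosen for illustration. It shows how placeholders in a judge prompt might be filled from mapped columns, and how a returned label might be turned into a score, failing loudly instead of inventing a result.

```python
import re
from typing import Mapping, Tuple

# Everything below is illustrative; none of these names come from Arize or ax.

# Judge prompt with two template variables, {input} and {output}.
TEMPLATE = (
    "Question: {input}\n"
    "Answer: {output}\n"
    "Is the answer correct? Reply with one word: correct or incorrect."
)

# Template variable -> column that holds its value (column names are assumed).
VARIABLE_MAPPING = {
    "input": "attributes.input.value",
    "output": "attributes.output.value",
}

# Classification choices: label the judge may return -> numeric score.
CHOICES = {"correct": 1.0, "incorrect": 0.0}


def render_prompt(
    template: str, mapping: Mapping[str, str], row: Mapping[str, object]
) -> str:
    """Fill each {variable} in the template from the column mapped to it."""

    def substitute(match: "re.Match[str]") -> str:
        variable = match.group(1)
        if variable not in mapping:
            raise KeyError(f"no column mapped for template variable {variable!r}")
        column = mapping[variable]
        if column not in row:
            raise KeyError(f"column {column!r} is missing from the row")
        return str(row[column])

    return re.sub(r"\{(\w+)\}", substitute, template)


def score_label(raw_label: str, choices: Mapping[str, float]) -> Tuple[str, float]:
    """Normalize the judge's label and look up its score; never guess."""
    label = raw_label.strip().lower()
    if label not in choices:
        raise ValueError(
            f"unexpected label {raw_label!r}; expected one of {sorted(choices)}"
        )
    return label, choices[label]
```

A usage sketch: calling render_prompt(TEMPLATE, VARIABLE_MAPPING, row) with a row that contains both mapped columns yields the filled prompt, and score_label(" Correct ", CHOICES) returns ("correct", 1.0). An unmapped variable, a missing column, or an unrecognized label raises an error, which mirrors the report-the-failure behavior described above.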