If you're building LLM apps and need to move beyond vibes-based testing, this skill gives you the Phoenix evaluation framework. It's refreshingly opinionated: error analysis before automation, code evaluators before LLM judges, binary pass/fail over fuzzy scoring. You get pre-built evaluators for common cases like RAG, tools to build custom ones in Python or TypeScript, and, crucially, validation workflows to check your LLM judges against human labels (the skill recommends a minimum of 80% for both the true positive rate and the true negative rate). The workflow guides are practical, covering everything from trace sampling through production guardrails. Note that it requires running a Phoenix server, so there's infrastructure to manage.
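To make the judge-validation threshold concrete, here is a minimal sketch in plain Python of checking an LLM judge against human labels. It does not use the Phoenix API; the function names `judge_agreement` and `judge_is_trustworthy` and the sample labels are hypothetical, and the 0.8 default simply mirrors the 80% minimum mentioned above.

```python
from typing import Sequence, Tuple


def judge_agreement(human: Sequence[bool], judge: Sequence[bool]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate) of judge vs. human labels.

    True means "pass", False means "fail"; human labels are treated as ground truth.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must have the same length")

    positives = sum(1 for h in human if h)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")

    # Count examples where the judge agrees with the human on each class.
    true_pos = sum(1 for h, j in zip(human, judge) if h and j)
    true_neg = sum(1 for h, j in zip(human, judge) if not h and not j)
    return true_pos / positives, true_neg / negatives


def judge_is_trustworthy(
    human: Sequence[bool], judge: Sequence[bool], threshold: float = 0.8
) -> bool:
    """Assumed rule: both rates must meet the threshold (0.8 per the text above)."""
    tpr, tnr = judge_agreement(human, judge)
    return tpr >= threshold and tnr >= threshold


if __name__ == "__main__":
    # Hypothetical labels for ten traces: five human passes, five human fails.
    human_labels = [True] * 5 + [False] * 5
    judge_labels = [True, True, True, True, False, False, False, False, True, True]
    tpr, tnr = judge_agreement(human_labels, judge_labels)
    print(f"TPR={tpr:.2f} TNR={tnr:.2f} ok={judge_is_trustworthy(human_labels, judge_labels)}")
```

Checking the two rates separately matters because a judge that passes almost everything can score a high overall accuracy on a mostly-passing dataset while missing most of the real failures.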
npx -y skills add github/awesome-copilot --skill phoenix-evals --agent claude-code
Installs into .claude/skills of the current project.
github/awesome-copilot
alirezarezvani/claude-skills
microsoft/win-dev-skills