This is a full workflow for testing Python apps that call LLMs, built around the pixie-qa framework. It walks you through defining evaluation criteria, instrumenting your application, building test datasets, and running scored evaluations with real LLM calls (no mocking). The skill is opinionated about what good eval looks like: your app code runs end-to-end while external data sources get stubbed, but the LLM itself must be real or the scores mean nothing. It's designed for people who want concrete test results, not another planning doc. If you're building an AI app and need to move past "it seems to work" into measurable quality improvement, this gives you the structure to actually do it.
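The split described above (app code runs end-to-end, external data is stubbed, the LLM is real) can be sketched roughly as follows. This is a minimal illustration, not pixie-qa's API: every name here (`Case`, `stub_fetch_docs`, `answer_question`, `run_eval`) is hypothetical, and the substring check stands in for whatever evaluation criteria you define.

```python
from dataclasses import dataclass
from typing import Callable

# All names below are hypothetical illustrations; none come from pixie-qa.


@dataclass
class Case:
    question: str
    must_mention: str  # criterion: the answer should contain this phrase


def stub_fetch_docs(question: str) -> list[str]:
    """Stand-in for an external data source (database, search API)."""
    return ["Refunds are accepted within 30 days of purchase."]


def answer_question(
    question: str,
    fetch_docs: Callable[[str], list[str]],
    call_llm: Callable[[str], str],
) -> str:
    """The application code under test; it runs end-to-end."""
    context = "\n".join(fetch_docs(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)


def run_eval(cases: list[Case], call_llm: Callable[[str], str]) -> float:
    """Return the fraction of cases whose answer meets its criterion."""
    passed = 0
    for case in cases:
        # Data source is stubbed; the LLM callable is whatever the caller passes.
        answer = answer_question(case.question, stub_fetch_docs, call_llm)
        if case.must_mention.lower() in answer.lower():
            passed += 1
    return passed / len(cases) if cases else 0.0
```

In a real run, `call_llm` would wrap your actual model client so the score reflects real model behavior; only `stub_fetch_docs` is a stand-in.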
npx -y skills add github/awesome-copilot --skill eval-driven-dev --agent claude-code
Installs into .claude/skills of the current project.
github/awesome-copilot
alirezarezvani/claude-skills
microsoft/win-dev-skills