This skill tests whether your agent workflow actually works in practice. It walks you through five evaluation dimensions (task completion, output quality, error behavior, user experience, consistency) and has you run concrete scenarios, including edge cases and adversarial inputs. You record what happened against what should have happened in a structured table, then get a graded report with specific improvement actions. The real value is that it won't let you skip the uncomfortable tests. Most people check only the happy path; this pushes you to test malformed input, tool failures, and adversarial cases as well. If you're shipping an agent workflow to users, run this first and prepare to feel slightly embarrassed by what breaks.
npx skills add https://github.com/sharpdeveye/maestro --skill evaluate
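To make the expected-versus-actual table described above more concrete, here is a minimal Python sketch of how such records could be kept and summarized by hand. It is not the skill's own format or output: `ScenarioResult`, `pass_rates`, and the sample rows are hypothetical names and placeholder data invented for illustration. Only the five dimension names come from the description.

```python
from dataclasses import dataclass

# The five evaluation dimensions named in the description.
DIMENSIONS = (
    "task completion",
    "output quality",
    "error behavior",
    "user experience",
    "consistency",
)


@dataclass
class ScenarioResult:
    """One row of a hypothetical expected-versus-actual table."""
    scenario: str
    dimension: str
    expected: str
    actual: str
    passed: bool

    def __post_init__(self):
        # Reject rows that do not map onto one of the five dimensions.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")


def pass_rates(results):
    """Return the pass fraction per dimension, or None if it was never tested."""
    rates = {}
    for dim in DIMENSIONS:
        rows = [r for r in results if r.dimension == dim]
        rates[dim] = sum(r.passed for r in rows) / len(rows) if rows else None
    return rates


# Placeholder rows for illustration only; these are not real test results.
example_results = [
    ScenarioResult("valid request", "task completion",
                   "returns summary", "returns summary", True),
    ScenarioResult("malformed JSON input", "error behavior",
                   "clear error message", "unhandled exception", False),
    ScenarioResult("tool timeout", "error behavior",
                   "retries, then reports failure",
                   "retries, then reports failure", True),
]

if __name__ == "__main__":
    for dimension, rate in pass_rates(example_results).items():
        label = "not tested" if rate is None else f"{rate:.0%} passed"
        print(f"{dimension}: {label}")
```

Reporting untested dimensions as `None` rather than silently omitting them is a deliberate choice in this sketch: it keeps the gaps visible, in the same spirit as not skipping the uncomfortable tests.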