Use this when you need to test whether your AI agent is making the right decisions, not just whether your code works. You'll use evalTest for logic validation and appEvalTest for UI-heavy scenarios. The workflow is built around a policy system: new tests start as USUALLY_PASSES, then get promoted to ALWAYS_PASSES once they are stable, which locks them in as regression coverage. What I like here is the pragmatism. It includes a specific guide for each of the three things you actually do: creating tests (with workspace seeding and breakpoint assertions), fixing failures (with investigation steps), and promoting candidates when they're ready. The decision tree at the top saves you from overthinking which approach to use.
npx skills add https://github.com/google-gemini/gemini-cli --skill behavioral-evals
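
To make the promotion policy concrete, here is a minimal sketch of what "promote once stable" could look like. It is not the skill's actual code: apart from the two policy labels, every name here (EvalRecord, isStable, promoteIfStable) is hypothetical, and the stability rule (the last five runs all passed) is an assumption chosen for illustration.

```typescript
// Hypothetical sketch: only the two policy labels come from the description
// above. The types, helpers, and stability rule are illustrative assumptions.
type Policy = "USUALLY_PASSES" | "ALWAYS_PASSES";

interface EvalRecord {
  name: string;
  policy: Policy;
  // Results of recent runs, oldest first; true means the run passed.
  recentRuns: boolean[];
}

// Assumed stability rule: the last `window` runs all passed.
function isStable(runs: boolean[], window: number): boolean {
  if (window <= 0 || runs.length < window) return false;
  return runs.slice(-window).every((passed) => passed);
}

// Returns a copy promoted to ALWAYS_PASSES when the test looks stable;
// otherwise returns the record unchanged.
function promoteIfStable(record: EvalRecord, window = 5): EvalRecord {
  if (record.policy === "USUALLY_PASSES" && isStable(record.recentRuns, window)) {
    return { ...record, policy: "ALWAYS_PASSES" };
  }
  return record;
}

const candidate: EvalRecord = {
  name: "asks-before-deleting-files",
  policy: "USUALLY_PASSES",
  recentRuns: [false, true, true, true, true, true],
};

const flaky: EvalRecord = {
  ...candidate,
  name: "flaky-example",
  recentRuns: [true, true, false, true, true],
};

console.log(promoteIfStable(candidate).policy); // ALWAYS_PASSES
console.log(promoteIfStable(flaky).policy); // USUALLY_PASSES
```

The point of the two-tier design is that a new test can fail occasionally without blocking anything, while a promoted test is treated as a regression guard. The window size is a judgment call; a real project may use a different signal for "stable".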