Automates SageMaker model evaluation by generating Python code for two evaluation approaches: LLM-as-Judge, where another LLM grades your model's outputs, and Custom Scorer, which uses Lambda functions for programmatic checks such as math or code validation. Custom Scorer works with both open source models and Nova; LLM-as-Judge supports open source models only. The workflow is conversational and deliberate: it asks one question at a time to determine which evaluation type fits your situation, validates prerequisites such as dataset availability and model compatibility, and then generates the evaluation code. If you're running benchmarks or comparing fine-tuned models in SageMaker, this handles the boilerplate so you can focus on interpreting results instead of wiring up evaluation infrastructure.
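To make the Custom Scorer idea concrete, here is a minimal sketch of a Lambda-style handler that grades math answers by comparing the last number in the model's output with the expected answer. The event shape (`samples`, `model_output`, `expected`, `id`), the helper `extract_final_number`, and the returned `results` structure are all assumptions made for illustration; the code the skill actually generates, and the payload SageMaker sends to the scorer, may look different.

```python
import re


def extract_final_number(text):
    """Return the last number appearing in text as a float, or None."""
    # Drop thousands separators so "1,234" is read as 1234.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def lambda_handler(event, context):
    # Hypothetical event shape (an assumption, not a documented schema):
    # {"samples": [{"id": ..., "model_output": ..., "expected": ...}]}
    results = []
    for sample in event.get("samples", []):
        predicted = extract_final_number(str(sample.get("model_output", "")))
        expected = extract_final_number(str(sample.get("expected", "")))
        correct = (
            predicted is not None
            and expected is not None
            and abs(predicted - expected) < 1e-6
        )
        # Binary score: 1.0 when the final numbers match, otherwise 0.0.
        results.append({"id": sample.get("id"), "score": 1.0 if correct else 0.0})
    return {"results": results}
```

The handler can be exercised locally by calling `lambda_handler({"samples": [...]}, None)` before wiring it into an evaluation job. Binary scoring keeps the sketch short; partial credit or code-execution checks would follow the same pattern.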
npx -y skills add awslabs/agent-plugins --skill model-evaluation --agent claude-code
Installs into .claude/skills of the current project.