This walks you through the three pieces you need for LangSmith evaluations: writing evaluators (LLM-as-judge or custom code), defining run functions that capture your agent's outputs, and actually running the evals, either locally with evaluate() or by uploading them via the CLI. The golden rule here is solid: always inspect your actual output structure before writing extraction logic, because frameworks vary wildly. One thing to watch: LLM-as-judge evaluators can't be uploaded yet, only run locally, so for dataset comparisons you'll want evaluate() with local evaluators. The examples cover both Python and TypeScript, and there's a helpful table showing how local and uploaded evaluators behave differently, which matters more than you'd think, since the expected return formats aren't the same.
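Here's a minimal sketch of how those three pieces fit together in Python, assuming a recent langsmith SDK, a LANGSMITH_API_KEY in the environment, and an existing dataset; the dataset name, target function, and evaluator below are illustrative, not part of the skill itself.

```python
# A minimal local evaluation sketch, assuming a recent langsmith Python SDK,
# LANGSMITH_API_KEY set in the environment, and an existing dataset named
# "my-agent-dataset" (the name, target, and evaluator here are illustrative).
from langsmith import evaluate
from langsmith.schemas import Example, Run


def run_agent(inputs: dict) -> dict:
    # Run function: swap this stub for a call to your real agent and return
    # its outputs as a dict. Inspect the actual output structure your
    # framework produces before writing any extraction logic.
    return {"answer": f"echo: {inputs.get('question', '')}"}


def exact_match(run: Run, example: Example) -> dict:
    # Custom code evaluator: compare the run's output to the reference
    # output stored on the dataset example.
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": int(predicted == expected)}


results = evaluate(
    run_agent,                       # target / run function
    data="my-agent-dataset",         # dataset of examples in LangSmith
    evaluators=[exact_match],        # local evaluators run alongside the target
    experiment_prefix="agent-eval",  # groups the results as one experiment
)
```

Since LLM-as-judge evaluators only run locally for now, they'd slot into the same evaluators list here rather than being uploaded.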
To install the skill:

```
npx skills add https://github.com/langchain-ai/langsmith-skills --skill langsmith-evaluator
```