# Evals in CI

Run agent evals on every PR, fail builds below a minimum accuracy, and surface results in the PR UI. Agent quality should gate merges the same way unit tests do.

`@agentskit/eval/ci` plus the bundled `agentskit-evals` composite action wire your suite into GitHub Actions: a JUnit report for the test-result UI, Markdown for `$GITHUB_STEP_SUMMARY`, inline annotations on failures, and a minimum-accuracy gate that fails the job when agent quality regresses.
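Before wiring in the library, the gate itself is easy to state. Here is a minimal, self-contained sketch of the idea (the `CaseResult` shape and the exact-match scoring are assumptions for illustration, not `@agentskit/eval`'s types):

```typescript
// Sketch of a minimum-accuracy gate: score cases, compare to a threshold,
// and return the process exit code CI should see. Hypothetical shapes;
// the real @agentskit/eval types may differ.
type CaseResult = { input: string; expected: string; actual: string }

function accuracy(results: CaseResult[]): number {
  const passed = results.filter(r => r.actual.trim() === r.expected).length
  return results.length === 0 ? 1 : passed / results.length
}

// 0 = job passes, 1 = job fails (accuracy below the gate).
function gate(results: CaseResult[], minAccuracy: number): number {
  return accuracy(results) >= minAccuracy ? 0 : 1
}

const results: CaseResult[] = [
  { input: 'Capital of France?', expected: 'Paris', actual: 'Paris' },
  { input: 'Square root of 64?', expected: '8', actual: '9' },
]
console.log(accuracy(results)) // 0.5
console.log(gate(results, 0.9)) // 1 → job fails
```

The recipe below delegates this logic to `reportToCi`, which also writes the reports.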
## Install

```shell
npm install -D @agentskit/eval
```

## Author an eval runner
```ts
import { runEval } from '@agentskit/eval'
import { reportToCi } from '@agentskit/eval/ci'
import { createRuntime } from '@agentskit/runtime'
import { anthropic } from '@agentskit/adapters'

const runtime = createRuntime({
  adapter: anthropic({ apiKey: process.env.ANTHROPIC_API_KEY!, model: 'claude-sonnet-4-6' }),
})

const result = await runEval({
  agent: async input => (await runtime.run(input)).content,
  suite: {
    name: 'qa-baseline',
    cases: [
      { input: 'Capital of France?', expected: 'Paris' },
      { input: 'Square root of 64?', expected: '8' },
    ],
  },
})

// The composite action sets these env vars from its inputs.
const min = Number(process.env.AGENTSKIT_EVAL_MIN_ACCURACY ?? '1')
const outDir = process.env.AGENTSKIT_EVAL_OUT_DIR ?? 'agentskit-evals'

const report = await reportToCi({
  suiteName: 'qa-baseline',
  result,
  minAccuracy: min,
  outDir,
})

// A non-zero exit code fails the CI job when accuracy is below the gate.
if (!report.pass) process.exit(1)
```

## Drop in the composite action
```yaml
name: evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/agentskit-evals
        with:
          script: evals/run.ts
          min-accuracy: '0.9'
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Action inputs:
| Input | Default | Purpose |
|---|---|---|
| `script` | (required) | Path to the runner |
| `node-version` | `20` | Node.js version |
| `package-manager` | `pnpm` | `pnpm` / `npm` / `yarn` |
| `min-accuracy` | `1` | Fail below this (0..1) |
| `out-dir` | `agentskit-evals` | Reports directory |
| `upload-artifact` | `true` | Publish reports as an artifact |
## What you get in the PR

- `report.xml` — JUnit, surfaced by test-reporter actions
- `report.md` — appended to the workflow summary
- `::error::` / `::notice::` annotations inline on the diff
- Exit code 1 when accuracy drops below `min-accuracy`
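The inline annotations use GitHub's workflow-command syntax: any `::error ...::message` line printed to the job log becomes an annotation. As a sketch of what a failing case turns into (the field choices are illustrative, not `@agentskit/eval`'s exact output):

```typescript
// Sketch: turn a failing eval case into a GitHub Actions annotation line.
// GitHub parses `::error ...::message` from the job log and renders it
// inline in the PR. Field choices here are assumptions for illustration.
function errorAnnotation(file: string, title: string, message: string): string {
  // Workflow commands require %, \r, and \n in message data to be escaped.
  const esc = (s: string) =>
    s.replace(/%/g, '%25').replace(/\r/g, '%0D').replace(/\n/g, '%0A')
  return `::error file=${file},title=${esc(title)}::${esc(message)}`
}

console.log(
  errorAnnotation('evals/run.ts', 'qa-baseline', 'expected "8", got "9"'),
)
// ::error file=evals/run.ts,title=qa-baseline::expected "8", got "9"
```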
## Reporters

Each reporter is also exported for custom pipelines:
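As one sketch of what a custom pipeline can do, here is a hand-rolled Markdown summary step in the spirit of such a reporter (the per-case result shape is an assumption, not the library's type):

```typescript
import { appendFileSync } from 'node:fs'

// Sketch of a custom pipeline step: render a Markdown results table and
// append it to the GitHub Actions run summary. The CaseOutcome shape is
// an assumption for illustration.
type CaseOutcome = { input: string; pass: boolean }

function summaryTable(suite: string, cases: CaseOutcome[]): string {
  const rows = cases.map(c => `| ${c.input} | ${c.pass ? 'pass' : 'fail'} |`)
  return [`### ${suite}`, '', '| Case | Result |', '|---|---|', ...rows, ''].join('\n')
}

const md = summaryTable('qa-baseline', [
  { input: 'Capital of France?', pass: true },
  { input: 'Square root of 64?', pass: false },
])

// Anything appended to this file shows up in the workflow run summary.
if (process.env.GITHUB_STEP_SUMMARY) {
  appendFileSync(process.env.GITHUB_STEP_SUMMARY, md)
}
```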
```ts
import {
  renderJUnit,
  renderMarkdown,
  renderGitHubAnnotations,
} from '@agentskit/eval/ci'
```

## See also
- Eval suite — author the suite itself
- Deterministic replay — pin cassettes in CI