# Evals in CI

Run agent evals on every PR, fail builds below a minimum accuracy, and surface results in the PR UI. Agent quality should gate merges the same way unit tests do.

`@agentskit/eval/ci` plus the bundled `agentskit-evals` composite action wire your suite into GitHub Actions: a JUnit report for the test-result UI, Markdown for `$GITHUB_STEP_SUMMARY`, inline annotations on failures, and a minimum-accuracy gate that fails the job when agent quality regresses.
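Before wiring in the library, the gate itself is easy to state. Here is a minimal, self-contained sketch of the idea (the `CaseResult` shape and the exact-match scoring are assumptions for illustration, not `@agentskit/eval`'s types):

```typescript
// Sketch of a minimum-accuracy gate: score cases, compare to a threshold,
// and return the process exit code CI should see. Hypothetical shapes;
// the real @agentskit/eval types may differ.
type CaseResult = { input: string; expected: string; actual: string }

function accuracy(results: CaseResult[]): number {
  const passed = results.filter(r => r.actual.trim() === r.expected).length
  return results.length === 0 ? 1 : passed / results.length
}

// 0 = job passes, 1 = job fails (accuracy below the gate).
function gate(results: CaseResult[], minAccuracy: number): number {
  return accuracy(results) >= minAccuracy ? 0 : 1
}

const results: CaseResult[] = [
  { input: 'Capital of France?', expected: 'Paris', actual: 'Paris' },
  { input: 'Square root of 64?', expected: '8', actual: '9' },
]
console.log(accuracy(results)) // 0.5
console.log(gate(results, 0.9)) // 1 → job fails
```

The recipe below delegates this logic to `reportToCi`, which also writes the reports.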
## Install

```shell
npm install -D @agentskit/eval
```

## Author an eval runner
```ts
import { runEval } from '@agentskit/eval'
import { reportToCi } from '@agentskit/eval/ci'
import { createRuntime } from '@agentskit/runtime'
import { anthropic } from '@agentskit/adapters'

const runtime = createRuntime({
  adapter: anthropic({ apiKey: process.env.ANTHROPIC_API_KEY!, model: 'claude-sonnet-4-6' }),
})

const result = await runEval({
  agent: async input => (await runtime.run(input)).content,
  suite: {
    name: 'qa-baseline',
    cases: [
      { input: 'Capital of France?', expected: 'Paris' },
      { input: 'Square root of 64?', expected: '8' },
    ],
  },
})

// The composite action sets these env vars from its inputs.
const min = Number(process.env.AGENTSKIT_EVAL_MIN_ACCURACY ?? '1')
const outDir = process.env.AGENTSKIT_EVAL_OUT_DIR ?? 'agentskit-evals'

const report = await reportToCi({
  suiteName: 'qa-baseline',
  result,
  minAccuracy: min,
  outDir,
})

// A non-zero exit code fails the CI job when accuracy is below the gate.
if (!report.pass) process.exit(1)
```

## Drop in the composite action
```yaml
name: evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/agentskit-evals
        with:
          script: evals/run.ts
          min-accuracy: '0.9'
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Action inputs:
| Input | Default | Purpose |
|---|---|---|
| `script` | (required) | Path to the runner |
| `node-version` | `20` | Node.js version |
| `package-manager` | `pnpm` | `pnpm` / `npm` / `yarn` |
| `min-accuracy` | `1` | Fail below this (0..1) |
| `out-dir` | `agentskit-evals` | Reports directory |
| `upload-artifact` | `true` | Publish reports as an artifact |
## What you get in the PR

- `report.xml` — JUnit, surfaced by test-reporter actions
- `report.md` — appended to the workflow summary
- `::error::` / `::notice::` annotations inline on the diff
- Exit code 1 when accuracy drops below `min-accuracy`
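The inline annotations use GitHub's workflow-command syntax: any `::error ...::message` line printed to the job log becomes an annotation. As a sketch of what a failing case turns into (the field choices are illustrative, not `@agentskit/eval`'s exact output):

```typescript
// Sketch: turn a failing eval case into a GitHub Actions annotation line.
// GitHub parses `::error ...::message` from the job log and renders it
// inline in the PR. Field choices here are assumptions for illustration.
function errorAnnotation(file: string, title: string, message: string): string {
  // Workflow commands require %, \r, and \n in message data to be escaped.
  const esc = (s: string) =>
    s.replace(/%/g, '%25').replace(/\r/g, '%0D').replace(/\n/g, '%0A')
  return `::error file=${file},title=${esc(title)}::${esc(message)}`
}

console.log(
  errorAnnotation('evals/run.ts', 'qa-baseline', 'expected "8", got "9"'),
)
// ::error file=evals/run.ts,title=qa-baseline::expected "8", got "9"
```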
## Reporters

Each reporter is also exported for custom pipelines:
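As one sketch of what a custom pipeline can do, here is a hand-rolled Markdown summary step in the spirit of such a reporter (the per-case result shape is an assumption, not the library's type):

```typescript
import { appendFileSync } from 'node:fs'

// Sketch of a custom pipeline step: render a Markdown results table and
// append it to the GitHub Actions run summary. The CaseOutcome shape is
// an assumption for illustration.
type CaseOutcome = { input: string; pass: boolean }

function summaryTable(suite: string, cases: CaseOutcome[]): string {
  const rows = cases.map(c => `| ${c.input} | ${c.pass ? 'pass' : 'fail'} |`)
  return [`### ${suite}`, '', '| Case | Result |', '|---|---|', ...rows, ''].join('\n')
}

const md = summaryTable('qa-baseline', [
  { input: 'Capital of France?', pass: true },
  { input: 'Square root of 64?', pass: false },
])

// Anything appended to this file shows up in the workflow run summary.
if (process.env.GITHUB_STEP_SUMMARY) {
  appendFileSync(process.env.GITHUB_STEP_SUMMARY, md)
}
```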
```ts
import {
  renderJUnit,
  renderMarkdown,
  renderGitHubAnnotations,
} from '@agentskit/eval/ci'
```

## See also
- Eval suite — author the suite itself
- Deterministic replay — pin cassettes in CI