
Evals in CI

Run agent evals on every PR, fail builds below a minimum accuracy, surface results in the PR UI.

Agent quality should gate merges the same way unit tests do. @agentskit/eval/ci + the bundled agentskit-evals composite action wire your suite into GitHub Actions: JUnit report for the test-result UI, Markdown for $GITHUB_STEP_SUMMARY, inline annotations on failures, and a minimum-accuracy gate that fails the job when agent quality regresses.

Install

npm install -D @agentskit/eval

Author an eval runner

evals/run.ts
import { runEval } from '@agentskit/eval'
import { reportToCi } from '@agentskit/eval/ci'
import { createRuntime } from '@agentskit/runtime'
import { anthropic } from '@agentskit/adapters'

const runtime = createRuntime({
  adapter: anthropic({ apiKey: process.env.ANTHROPIC_API_KEY!, model: 'claude-sonnet-4-6' }),
})

const result = await runEval({
  agent: async input => (await runtime.run(input)).content,
  suite: {
    name: 'qa-baseline',
    cases: [
      { input: 'Capital of France?', expected: 'Paris' },
      { input: 'Square root of 64?', expected: '8' },
    ],
  },
})

const min = Number(process.env.AGENTSKIT_EVAL_MIN_ACCURACY ?? '1')
const outDir = process.env.AGENTSKIT_EVAL_OUT_DIR ?? 'agentskit-evals'

const report = await reportToCi({
  suiteName: 'qa-baseline',
  result,
  minAccuracy: min,
  outDir,
})

if (!report.pass) process.exit(1)
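The gate itself reduces to a simple ratio check: accuracy is passed cases over total cases, compared against the threshold. A minimal sketch of that logic under assumed types (not the library's actual implementation — `reportToCi` handles this for you):

```typescript
// Hypothetical per-case result shape, for illustration only.
interface CaseResult {
  input: string
  expected: string
  actual: string
  pass: boolean
}

// Compute accuracy and apply the min-accuracy gate.
function gate(cases: CaseResult[], minAccuracy: number): { accuracy: number; pass: boolean } {
  const passed = cases.filter(c => c.pass).length
  const accuracy = cases.length === 0 ? 0 : passed / cases.length
  return { accuracy, pass: accuracy >= minAccuracy }
}

// One failure out of two cases against a 0.9 threshold:
const { accuracy, pass } = gate(
  [
    { input: 'Capital of France?', expected: 'Paris', actual: 'Paris', pass: true },
    { input: 'Square root of 64?', expected: '8', actual: 'eight', pass: false },
  ],
  0.9,
)
// accuracy is 0.5, pass is false → the runner exits 1 and the job fails
```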

Drop in the composite action

.github/workflows/evals.yml
name: evals
on: [pull_request]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/agentskit-evals
        with:
          script: evals/run.ts
          min-accuracy: '0.9'
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Action inputs:

Input             Default          Purpose
script            (required)       Path to the runner
node-version      20               Node.js version
package-manager   pnpm             pnpm / npm / yarn
min-accuracy      1                Fail below this (0..1)
out-dir           agentskit-evals  Reports directory
upload-artifact   true             Publish reports as an artifact

What you get in the PR

  • report.xml — JUnit, surfaced by test-reporter actions
  • report.md — appended to the workflow summary
  • ::error:: / ::notice:: annotations inline on the diff
  • Exit code 1 when accuracy drops below min-accuracy
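The inline annotations use GitHub's documented workflow-command syntax: any line printed to stdout in the form `::error file=...,line=...::message` becomes an annotation on the diff. A sketch of what formatting one failure might look like (the real reporter's exact fields may differ):

```typescript
// Format a GitHub Actions ::error:: workflow command for a failed case.
// GitHub requires %, CR, and LF in the message portion to be percent-escaped.
function formatErrorAnnotation(file: string, line: number, message: string): string {
  const escaped = message.replace(/%/g, '%25').replace(/\r/g, '%0D').replace(/\n/g, '%0A')
  return `::error file=${file},line=${line}::${escaped}`
}

console.log(formatErrorAnnotation('evals/run.ts', 32, 'expected "8", got "eight"'))
// prints ::error file=evals/run.ts,line=32::expected "8", got "eight"
```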

Reporters

Each reporter is also exported for custom pipelines:

import {
  renderJUnit,
  renderMarkdown,
  renderGitHubAnnotations,
} from '@agentskit/eval/ci'
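To give a feel for the shape `report.xml` takes, here is a hypothetical minimal JUnit renderer. This is an illustration only — the exported `renderJUnit` is the supported path, and its signature may differ:

```typescript
// Hypothetical case shape for this sketch.
interface JUnitCase {
  name: string
  failure?: string
}

// Render a single <testsuite> in the JUnit XML shape test-reporter
// actions consume. (A real renderer would also XML-escape names and messages.)
function renderJUnitSketch(suiteName: string, cases: JUnitCase[]): string {
  const failures = cases.filter(c => c.failure).length
  const body = cases
    .map(c =>
      c.failure
        ? `  <testcase name="${c.name}"><failure message="${c.failure}"/></testcase>`
        : `  <testcase name="${c.name}"/>`,
    )
    .join('\n')
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<testsuite name="${suiteName}" tests="${cases.length}" failures="${failures}">`,
    body,
    '</testsuite>',
  ].join('\n')
}
```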
