Eval suites

Define cases with inputs and assertions, then run them against any async agent function to get pass rates and latency metrics.

runEval is the entry point for all evaluations: give it an async function that wraps your agent and an EvalSuite with test cases, and it returns a report with per-case results and aggregate metrics. Assertions can be boolean functions, regex, or an LLM-as-judge that returns a rationale.

import { runEval } from '@agentskit/eval'

const suite = {
  name: 'support-triage',
  cases: [
    {
      id: 'refund',
      input: 'How do I get a refund?',
      assert: (out) => out.includes('refund policy'),
    },
  ],
}

const report = await runEval({
  agent: async (input) => runtime.run({ input }).then((r) => r.output),
  suite,
})

console.log(report.passRate, report.failures)

#Assertions

boolean fn → pass/fail
async LLM-as-judge → ({ pass, rationale })
regex → match required

#Metrics

Built-in: passRate, latencyP50, latencyP95, tokensTotal, usdTotal.

Replay · Snapshots · CI
Recipe: eval suite

Explore nearby

✎ Edit this page on GitHub·Found a problem? Open an issue →·How to contribute →

Eval suites

#Assertions

#Metrics

#Related

Explore nearby

On this page