Eval suites
Define cases with inputs and assertions, then run them against any async agent function to get pass rates and latency metrics.
runEval is the entry point for all evaluations: give it an async function that wraps your agent and an EvalSuite with test cases, and it returns a report with per-case results and aggregate metrics. Assertions can be boolean functions, regex, or an LLM-as-judge that returns a rationale.
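import { runEval } from '@agentskit/eval'

// A suite groups related cases; each case pairs an input with an assertion.
const suite = {
  name: 'support-triage',
  cases: [
    {
      id: 'refund',
      input: 'How do I get a refund?',
      // Boolean assertion: passes when the output mentions the refund policy.
      assert: (out) => out.includes('refund policy'),
    },
  ],
}

// `runtime` is assumed to be an agent runtime created elsewhere in your app.
const report = await runEval({
  agent: async (input) => runtime.run({ input }).then((r) => r.output),
  suite,
})

console.log(report.passRate, report.failures)

#Assertions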
- boolean fn → pass/fail
- async LLM-as-judge → ({ pass, rationale })
- regex → match required
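
The three assertion forms listed above can be pictured as one union that is normalized to a single verdict shape. The sketch below is a self-contained illustration only: the `Assertion` and `Verdict` types, the `check` helper, and `stubJudge` are hypothetical names, not the actual @agentskit/eval types or internals, and the judge is a stub standing in for a real model call.

```typescript
// Hypothetical shapes for illustration; not the library's real types.
type Verdict = { pass: boolean; rationale: string }
type Assertion =
  | RegExp
  | ((output: string) => boolean | Verdict | Promise<boolean | Verdict>)

// Normalize any assertion form into a { pass, rationale } verdict.
async function check(assertion: Assertion, output: string): Promise<Verdict> {
  if (assertion instanceof RegExp) {
    // Reset lastIndex so regexes with the `g` or `y` flag behave predictably.
    assertion.lastIndex = 0
    const pass = assertion.test(output)
    return { pass, rationale: pass ? 'regex matched' : 'regex did not match' }
  }
  // Works for both sync predicates and async judges.
  const result = await assertion(output)
  if (typeof result === 'boolean') {
    return { pass: result, rationale: `predicate returned ${result}` }
  }
  return result
}

// Stub judge: a real LLM-as-judge would call a model here instead.
const stubJudge = async (output: string): Promise<Verdict> => ({
  pass: output.toLowerCase().includes('refund'),
  rationale: 'stub judge: checked for the word "refund"',
})
```

Normalizing to one verdict shape means a report could show a rationale for every failure, whichever assertion form produced it.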
#Metrics
Built-in: passRate, latencyP50, latencyP95, tokensTotal, usdTotal.
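
To make the aggregate metrics concrete, here is a minimal sketch of how a pass rate and latency percentiles could be derived from per-case results. It rests on assumptions the source does not confirm: the `CaseResult` shape and the `percentile` and `summarize` helpers are hypothetical, `passRate` is treated as a fraction between 0 and 1, and percentiles use the nearest-rank method. The library's own calculation may differ.

```typescript
// Hypothetical per-case result shape, for illustration only.
type CaseResult = { id: string; pass: boolean; latencyMs: number }

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN
  const sorted = [...values].sort((a, b) => a - b)
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length))
  return sorted[rank - 1]
}

// Aggregate per-case results into suite-level metrics.
function summarize(results: CaseResult[]) {
  const passed = results.filter((r) => r.pass).length
  const latencies = results.map((r) => r.latencyMs)
  return {
    passRate: results.length === 0 ? 0 : passed / results.length,
    latencyP50: percentile(latencies, 50),
    latencyP95: percentile(latencies, 95),
  }
}
```

With only a handful of cases, P95 under nearest-rank is simply the slowest case, so latency percentiles become informative only as the suite grows.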
#Related
- Replay · Snapshots · CI
- Recipe: eval suite