@agentskit/eval
Eval suites + deterministic replay + snapshot testing + prompt diff + CI reporters.
@agentskit/eval is how you stop shipping changes to prompts, tools, and providers on vibes alone. It gives you repeatable ways to measure quality, compare behavior, and catch regressions before they reach users.
#When to reach for it
- You want to score agent quality with numbers, in CI.
- You want deterministic replay (record once, replay forever).
- You want Jest-style prompt snapshots with semantic tolerance.
- You want a "git blame for prompts" — diff + attribution.
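To make "record once, replay forever" concrete, here is a minimal, self-contained sketch of the idea. It does not use the @agentskit/eval API: `ModelCall`, `recordInto`, and `replayFrom` are hypothetical names for illustration, and the model call is synchronous only to keep the example short.

```typescript
// Hypothetical types for illustration; not the @agentskit/eval API.
type ModelCall = (prompt: string) => string

// A cassette maps each prompt to the response recorded for it.
type Cassette = Map<string, string>

// Wrap a live model: every call is forwarded and its response is stored.
function recordInto(cassette: Cassette, live: ModelCall): ModelCall {
  return (prompt) => {
    const response = live(prompt)
    cassette.set(prompt, response)
    return response
  }
}

// Serve responses from the cassette only; the live model is never called.
function replayFrom(cassette: Cassette): ModelCall {
  return (prompt) => {
    const recorded = cassette.get(prompt)
    if (recorded === undefined) {
      throw new Error(`No recording for prompt: ${prompt}`)
    }
    return recorded
  }
}

// Stand-in for a nondeterministic provider: each call gives a new answer.
let liveCalls = 0
const fakeLive: ModelCall = (prompt) => `answer #${++liveCalls} to "${prompt}"`

const cassette: Cassette = new Map()
const recording = recordInto(cassette, fakeLive)
const firstAnswer = recording('Capital of France?')

const replaying = replayFrom(cassette)
const secondAnswer = replaying('Capital of France?')
const thirdAnswer = replaying('Capital of France?')
```

Replayed calls return the recorded text and leave `liveCalls` unchanged, which is what makes a replayed run repeatable in CI.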
#Best fit
- Add this when an agent starts becoming product-critical.
- Pair with @agentskit/observability so you can turn real failures into eval cases.
- Pair with @agentskit/adapters to compare providers against the same suite.
- Pair with @agentskit/runtime when you need to evaluate full multi-step workflows, not isolated prompts.
#Install
npm install -D @agentskit/eval
#Hello world
import { runEval } from '@agentskit/eval'

// `runtime` is assumed to be an agent runtime created elsewhere (for example with @agentskit/runtime).
const result = await runEval({
  agent: async (input) => (await runtime.run(input)).content,
  suite: {
    name: 'qa',
    cases: [{ input: 'Capital of France?', expected: 'Paris' }],
  },
})
console.log(`${result.passed}/${result.totalCases} passed`)

Running a suite like this on every change is the feedback loop that lets a team keep improving an agent without losing control of it.
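For readers who want a mental model of what an eval run computes, the sketch below scores a suite by hand. It is an illustration under stated assumptions, not the library's implementation: `scoreSuite` and the interfaces are hypothetical, the agent is synchronous for brevity, and the "output contains the expected text" rule is an assumed matching rule.

```typescript
// Hypothetical shapes modelled on the Hello world example;
// the real @agentskit/eval types may differ.
interface EvalCase { input: string; expected: string }
interface EvalSuite { name: string; cases: EvalCase[] }
interface EvalSummary { passed: number; totalCases: number; failures: string[] }

// Assumption: a case passes when the output contains the expected text,
// ignoring letter case.
function scoreSuite(agent: (input: string) => string, suite: EvalSuite): EvalSummary {
  const failures: string[] = []
  for (const evalCase of suite.cases) {
    const output = agent(evalCase.input)
    if (!output.toLowerCase().includes(evalCase.expected.toLowerCase())) {
      failures.push(evalCase.input)
    }
  }
  return {
    passed: suite.cases.length - failures.length,
    totalCases: suite.cases.length,
    failures,
  }
}

// Stub agent so the sketch runs without a model.
const stubAgent = (input: string): string =>
  input.includes('France') ? 'The capital is Paris.' : 'I do not know.'

const summary = scoreSuite(stubAgent, {
  name: 'qa',
  cases: [
    { input: 'Capital of France?', expected: 'Paris' },
    { input: 'Capital of Peru?', expected: 'Lima' },
  ],
})
console.log(`${summary.passed}/${summary.totalCases} passed`) // prints "1/2 passed"
```

Keeping the failing inputs alongside the counts makes it easier to turn a red run into a concrete case to debug.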
#Surface
- runEval({ agent, suite }).
- /replay: createRecordingAdapter · createReplayAdapter · cassettes · createTimeTravelSession · replayAgainst · summarizeReplay.
- /snapshot: matchPromptSnapshot.
- /diff: promptDiff · attributePromptChange · formatDiff.
- /ci: renderJUnit · renderMarkdown · renderGitHubAnnotations · reportToCi.
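The snapshot entry point is described above as Jest-style with semantic tolerance. As a rough illustration of what a tolerance means, the sketch below accepts an output when it is "close enough" to a stored snapshot. It swaps in token-overlap (Jaccard) similarity as a crude stand-in for semantic similarity; how `matchPromptSnapshot` actually compares text is not documented here, and `jaccard` and `matchesSnapshot` are hypothetical helpers.

```typescript
// Lowercase the text and split it into a set of alphanumeric tokens.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0))
}

// Jaccard similarity: shared tokens divided by all distinct tokens.
function jaccard(a: string, b: string): number {
  const setA = tokenSet(a)
  const setB = tokenSet(b)
  if (setA.size === 0 && setB.size === 0) return 1
  let shared = 0
  setA.forEach((token) => {
    if (setB.has(token)) shared++
  })
  return shared / (setA.size + setB.size - shared)
}

// Assumption: tolerance is the allowed drop in similarity, from 0 to 1.
function matchesSnapshot(actual: string, snapshot: string, tolerance: number): boolean {
  return jaccard(actual, snapshot) >= 1 - tolerance
}

const storedSnapshot = 'The capital of France is Paris.'
const reworded = 'Paris is the capital of France.'
const changedFact = 'The capital of France is Lyon.'

const rewordedOk = matchesSnapshot(reworded, storedSnapshot, 0.2) // same tokens, so it matches
const changedOk = matchesSnapshot(changedFact, storedSnapshot, 0.2) // 5 of 7 tokens shared, so it does not
```

Token overlap ignores meaning, so a production setup would more likely compare embeddings or use a model-based judge; the threshold logic stays the same.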
#Recipes
- Eval suite
- Deterministic replay
- Time-travel debug
- Replay-different-model
- Prompt snapshots
- Prompt diff
- Evals in CI
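As a rough picture of what a CI reporter does with eval results, the sketch below renders a Markdown summary that could be posted as a job summary or pull request comment. It is not the package's `renderMarkdown`: `CaseResult` and `toMarkdownReport` are hypothetical, and the table layout is an assumption.

```typescript
// Hypothetical result shape for illustration.
interface CaseResult { caseName: string; passed: boolean; detail?: string }

function toMarkdownReport(suiteName: string, results: CaseResult[]): string {
  const passedCount = results.filter((r) => r.passed).length
  // Escape pipes so case text cannot break the table layout.
  const cell = (text: string): string => text.replace(/\|/g, '\\|')
  const rows = results.map(
    (r) => `| ${cell(r.caseName)} | ${r.passed ? 'pass' : 'FAIL'} | ${cell(r.detail ?? '')} |`
  )
  return [
    `### ${suiteName}: ${passedCount}/${results.length} passed`,
    '',
    '| Case | Result | Detail |',
    '| --- | --- | --- |',
    ...rows,
  ].join('\n')
}

const caseResults: CaseResult[] = [
  { caseName: 'Capital of France?', passed: true },
  { caseName: 'Capital of Peru?', passed: false, detail: 'got "Quito | Bogota"' },
]

const report = toMarkdownReport('qa', caseResults)
// A CI script would typically fail the job when any case regresses.
const shouldFailBuild = caseResults.some((r) => !r.passed)
console.log(report)
```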
#Stability
- Version: 0.4.1
- Tier: alpha
- Contract: evolving
- Roadmap: see packages roadmap for what this package needs to reach v1.0.
#Source
npm: @agentskit/eval · repo: packages/eval
Explore nearby
- Packages overview
Every AgentsKit package at a glance: what it does, when to reach for it, where to read the deep dive.
- Roadmap
Per-package stability status, current version, and what each package needs to reach v1.0.
- @agentskit/core
Shared contract layer: TypeScript types, headless chat controller, stream helpers. Zero-dep, under 10 KB gzipped.
- @agentskit/observability-langfuse
Langfuse tracing backend for @agentskit/observability, with spans for plan, tool, model, and HITL gates and token/cost/latency capture.
- @agentskit/eval-braintrust
Braintrust scoring pipeline for @agentskit/eval, with quality and robustness scorers, CI regression alerts, and dataset sync.