@agentskit/eval

Eval suites + deterministic replay + snapshot testing + prompt diff + CI reporters.

@agentskit/eval is how you stop shipping changes to prompts, tools, and providers on vibes alone. It gives you repeatable ways to measure quality, compare behavior, and catch regressions before they reach users.

#When to reach for it

You want to score agent quality with numbers, in CI.
You want deterministic replay (record once, replay forever).
You want Jest-style prompt snapshots with semantic tolerance.
You want a "git blame for prompts": a diff plus attribution for each change.

#Best fit

Add this when an agent starts becoming product-critical.
Pair with @agentskit/observability so you can turn real failures into eval cases.
Pair with @agentskit/adapters to compare providers against the same suite.
Pair with @agentskit/runtime when you need to evaluate full multi-step workflows, not isolated prompts.

#Install

npm install -D @agentskit/eval

#Hello world

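import { runEval } from '@agentskit/eval'

// `runtime` is assumed to be an agent runtime created elsewhere in your app
// (for example with @agentskit/runtime); it is not defined in this snippet.
const result = await runEval({
  // The agent under test: receives a case input and returns the text to score.
  agent: async (input) => (await runtime.run(input)).content,
  suite: {
    name: 'qa',
    cases: [{ input: 'Capital of France?', expected: 'Paris' }],
  },
})
// Report how many cases passed out of the total.
console.log(`${result.passed}/${result.totalCases} passed`)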

Running a suite like this on every change gives you a feedback loop, and that loop is what lets a team keep improving an agent without losing control of it.
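To make the idea of a suite run concrete, here is a minimal sketch of what scoring a suite could look like. It does not use or describe the internals of @agentskit/eval: the types `EvalCase`, `Suite`, and `Agent`, the function `runSuiteSketch`, and the case-insensitive substring rule are all assumptions chosen for illustration, and the package's real scoring may differ.

```typescript
// Hypothetical local types. They mirror the shape used in the hello-world
// example but are NOT imported from @agentskit/eval.
type EvalCase = { input: string; expected: string }
type Suite = { name: string; cases: EvalCase[] }
type Agent = (input: string) => Promise<string>

// Assumption: a case passes when the expected text appears in the output,
// ignoring case. The real package may score cases differently.
async function runSuiteSketch(agent: Agent, suite: Suite) {
  let passed = 0
  for (const c of suite.cases) {
    const output = await agent(c.input)
    if (output.toLowerCase().includes(c.expected.toLowerCase())) passed++
  }
  return { name: suite.name, passed, totalCases: suite.cases.length }
}

// Stub agent so the sketch runs without a model or network access.
const stubAgent: Agent = async (input) =>
  input.includes('France') ? 'The capital is Paris.' : 'I do not know.'

const demoSuite: Suite = {
  name: 'qa',
  cases: [
    { input: 'Capital of France?', expected: 'Paris' },
    { input: 'Capital of Peru?', expected: 'Lima' },
  ],
}

// The stub answers the first case and misses the second: prints "1/2 passed".
runSuiteSketch(stubAgent, demoSuite).then((r) =>
  console.log(`${r.passed}/${r.totalCases} passed`),
);
```

Because the result is a pair of numbers, a CI job can compare it against a threshold and fail the build when the pass count drops.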

#Surface

runEval({ agent, suite }).
/replay (universal, in-memory): createRecordingAdapter · createReplayAdapter · cassettes · createTimeTravelSession · replayAgainst · summarizeReplay.
/replay/io (Node-only): saveCassette · loadCassette. Browser and native applications should persist serialized cassettes through a host-owned storage adapter.
/snapshot: matchPromptSnapshot.
/diff: promptDiff · attributePromptChange · formatDiff.
/ci: renderJUnit · renderMarkdown · renderGitHubAnnotations · reportToCi.
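The record-once, replay-forever idea behind the /replay entries can be sketched in a few lines. This is a conceptual illustration only: `ModelCall`, `Cassette`, `recordInto`, and `replayFrom` are hypothetical names invented here, and they are not the signatures of createRecordingAdapter, createReplayAdapter, or any other export of the package.

```typescript
// Hypothetical types for illustration; not the @agentskit/eval API.
type ModelCall = (prompt: string) => Promise<string>
type Cassette = { entries: Record<string, string> }

// Wrap a live call and store each prompt/response pair in the cassette.
function recordInto(cassette: Cassette, live: ModelCall): ModelCall {
  return async (prompt) => {
    const response = await live(prompt)
    cassette.entries[prompt] = response
    return response
  }
}

// Serve responses from the cassette only; fail loudly on an unrecorded prompt.
function replayFrom(cassette: Cassette): ModelCall {
  return async (prompt) => {
    const hit = cassette.entries[prompt]
    if (hit === undefined) {
      throw new Error(`No recorded response for prompt: ${prompt}`)
    }
    return hit
  }
}

async function demo() {
  // Stand-in for a provider call so the sketch runs offline.
  let liveCalls = 0
  const fakeProvider: ModelCall = async (prompt) => {
    liveCalls++
    return `echo: ${prompt}`
  }

  const cassette: Cassette = { entries: {} }
  const recorded = await recordInto(cassette, fakeProvider)('hello')

  // A cassette in this sketch is plain data, so it survives a JSON round trip.
  const restored: Cassette = JSON.parse(JSON.stringify(cassette))
  const replayed = await replayFrom(restored)('hello')

  // The replay never touches the provider, so liveCalls stays at 1.
  return { recorded, replayed, liveCalls }
}

demo().then((r) => console.log(r));
```

Keeping the cassette as serializable data is the design choice that lets storage vary by host: a Node process can write it to disk, while a browser or native app can hand the same string to its own storage layer.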

#Recipes

#Stability

Version: 0.4.1
Tier: alpha
Contract: evolving
Roadmap: see the packages roadmap for what this package needs to reach v1.0.

#Source

npm: @agentskit/eval · repo: packages/eval
