@agentskit/eval — for agents

Evaluation harness + deterministic replay + snapshot testing + diff + CI reporters.

# Install

npm install @agentskit/eval

# Primary exports

`runEval({ agent, suite })`: runs an `EvalSuite` against any async agent function.

# Subpaths

| Subpath | Contents |
| --- | --- |
| `@agentskit/eval/replay` | Universal in-memory APIs: `createRecordingAdapter`, `createReplayAdapter`, cassettes, `createTimeTravelSession`, `replayAgainst`, `summarizeReplay`. Browser and React Native conditions exclude Node built-ins; legacy filesystem export names reject with a Node-only diagnostic. See Deterministic replay, Time travel, Replay-different-model. |
| `@agentskit/eval/replay/io` | Node-only `saveCassette` and `loadCassette` filesystem helpers. Do not import this subpath from browser, Expo, or React Native applications. |
| `@agentskit/eval/snapshot` | `matchPromptSnapshot` (exact / normalized / similarity). See Snapshots. |
| `@agentskit/eval/diff` | `promptDiff`, `attributePromptChange`, `formatDiff`. See Prompt diff. |
| `@agentskit/eval/ci` | `renderJUnit`, `renderMarkdown`, `renderGitHubAnnotations`, `reportToCi`. See Evals in CI. |
| `@agentskit/eval/braintrust` | Braintrust scoring pipeline, scorer families, regression detection, and dataset upload helpers. Install the optional `braintrust` peer when using this subpath. |
| `@agentskit/eval/braintrust/scorers` | Braintrust-compatible quality and robustness scorers. |
| `@agentskit/eval/braintrust/ci` | Braintrust regression detection and Markdown alerts. |

`@agentskit/eval/replay/io`, `@agentskit/eval/snapshot`, and `@agentskit/eval/ci` are Node-oriented filesystem/CI entry points. For browsers, Expo, and React Native, use the conditional `@agentskit/eval/replay` entry, which is the portable choice.
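The snapshot subpath lists three comparison modes (exact / normalized / similarity). As a rough intuition for how such modes can differ, here is a minimal sketch. It is not the implementation of `matchPromptSnapshot`: the names `matchesSnapshot`, `normalize`, and `jaccard` are hypothetical, whitespace collapsing is only an assumed meaning of "normalized", and Jaccard word overlap with a 0.8 threshold is a stand-in for whatever similarity measure the library actually uses.

```typescript
// Illustrative only: none of these names are exports of @agentskit/eval.
type Mode = 'exact' | 'normalized' | 'similarity'

// Collapse whitespace runs and trim, so formatting-only changes compare equal.
const normalize = (text: string): string => text.replace(/\s+/g, ' ').trim()

// Jaccard overlap of lowercase word sets: 1 for identical sets, 0 for disjoint ones.
function jaccard(a: string, b: string): number {
  const setA = new Set(normalize(a).toLowerCase().split(' '))
  const setB = new Set(normalize(b).toLowerCase().split(' '))
  let shared = 0
  setA.forEach((word) => {
    if (setB.has(word)) shared += 1
  })
  return shared / (setA.size + setB.size - shared)
}

// Assumed semantics: strict equality, equality after normalization,
// or a similarity score at or above the threshold.
function matchesSnapshot(
  actual: string,
  snapshot: string,
  mode: Mode,
  threshold = 0.8,
): boolean {
  switch (mode) {
    case 'exact':
      return actual === snapshot
    case 'normalized':
      return normalize(actual) === normalize(snapshot)
    case 'similarity':
      return jaccard(actual, snapshot) >= threshold
  }
}
```

Under these assumptions, a prompt that differs only in spacing fails the exact mode but passes the normalized one, and a prompt with one added word can still pass the similarity mode.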

Replay helpers take defensive snapshots of requests, chunks, dates, and plain metadata. Caller mutation cannot rewrite a recording or an already-created replay adapter.
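To make the defensive-snapshot guarantee concrete, the sketch below shows the general copy-on-record, copy-on-read pattern. It is an illustration under assumptions, not the library's code: `createRecorder`, `RecordedCall`, and `snapshot` are hypothetical names, and the clone only handles dates, arrays, and plain objects.

```typescript
// Illustrative only: these names are hypothetical, not exports of @agentskit/eval.
interface RecordedCall {
  request: { prompt: string; metadata: Record<string, unknown> }
  at: Date
}

// Deep-copy dates, arrays, and plain objects; primitives are returned as-is.
function snapshot<T>(value: T): T {
  if (value instanceof Date) return new Date(value.getTime()) as unknown as T
  if (Array.isArray(value)) return value.map((item) => snapshot(item)) as unknown as T
  if (value !== null && typeof value === 'object') {
    const copy: Record<string, unknown> = {}
    for (const [key, inner] of Object.entries(value)) copy[key] = snapshot(inner)
    return copy as unknown as T
  }
  return value
}

function createRecorder() {
  const calls: RecordedCall[] = []
  return {
    // Copy on the way in, so later caller mutation cannot reach stored data.
    record(call: RecordedCall): void {
      calls.push(snapshot(call))
    },
    // Copy on the way out, so readers cannot mutate the stored recording.
    entries(): RecordedCall[] {
      return snapshot(calls)
    },
  }
}

const recorder = createRecorder()
const call: RecordedCall = {
  request: { prompt: 'Capital of France?', metadata: { attempt: 1 } },
  at: new Date(0),
}
recorder.record(call)

// Mutating the caller's objects after recording...
call.request.metadata['attempt'] = 99
call.at.setFullYear(2000)

// ...leaves the stored snapshot untouched.
const stored = recorder.entries()[0]
```

Here `stored` still holds the original `attempt` value and timestamp, which is the property the paragraph above describes for recordings and replay adapters.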

# Minimal example

import { runEval } from '@agentskit/eval'

const result = await runEval({
  // `runtime` is assumed to be an already-configured agent runtime (see @agentskit/runtime).
  agent: async (input) => (await runtime.run(input)).content,
  suite: {
    name: 'qa',
    cases: [{ input: 'Capital of France?', expected: 'Paris' }],
  },
})

console.log(`${result.passed}/${result.totalCases}`)

Malformed agent response objects and assertion predicates that throw are recorded as failed cases rather than aborting the suite. When a predicate fails, the agent output and token usage are preserved for diagnosis.
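The failure-isolation behavior described above can be pictured with a small stand-alone runner. This is a hedged sketch, not `runEval` itself: `runCases`, `Case`, and `CaseResult` are hypothetical, a non-string response stands in for "malformed", substring matching is an assumed comparison rule, and token usage is omitted for brevity.

```typescript
// Hypothetical shapes for illustration; the real EvalSuite types may differ.
type Expected = string | ((output: string) => boolean)

interface Case {
  input: string
  expected: Expected
}

interface CaseResult {
  input: string
  passed: boolean
  output?: string
  error?: string
}

async function runCases(
  agent: (input: string) => Promise<unknown>,
  cases: Case[],
): Promise<CaseResult[]> {
  const results: CaseResult[] = []
  for (const { input, expected } of cases) {
    let output: string | undefined
    try {
      const raw = await agent(input)
      // A malformed response (anything but a string here) fails this case only.
      if (typeof raw !== 'string') throw new Error('malformed agent response')
      output = raw
      // Assumption: string expectations match by substring; predicates decide themselves.
      const passed = typeof expected === 'function' ? expected(raw) : raw.includes(expected)
      results.push({ input, passed, output })
    } catch (err) {
      // Keep whatever output was captured so the failure can be diagnosed.
      results.push({ input, passed: false, output, error: String(err) })
    }
  }
  return results
}

// Three cases: one passes, one returns a malformed response, one has a throwing predicate.
const demo = runCases(
  async (input) => (input === 'bad' ? { oops: true } : 'Paris'),
  [
    { input: 'Capital of France?', expected: 'Paris' },
    { input: 'bad', expected: 'Paris' },
    {
      input: 'throws',
      expected: () => {
        throw new Error('predicate blew up')
      },
    },
  ],
)
```

All three cases produce a result: the suite keeps going after each failure, and the case with the throwing predicate still carries the agent output alongside the error message.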

@agentskit/runtime.
@agentskit/observability — trace + cost data feeds eval comparisons.
@agentskit/core/eval-format — portable format spec.
