Evals

Run eval suites against any async agent function, replay recorded sessions in CI, and track prompt regressions with snapshots.

Agent quality degrades silently: a prompt change that improves one case can break three others, and you only find out in production. @agentskit/eval gives you pass/fail metrics, deterministic replay without network calls, snapshot diffing for prompts, and reporters that integrate with any CI stack.

#Suites

runEval({ agent, suite }) — run any EvalSuite against any async agent fn. Recipe.
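
To make the idea of a suite run concrete, here is a minimal, self-contained sketch of a pass/fail eval loop. It does not import @agentskit/eval. The names `AgentFn`, `SketchCase`, `SketchSuite`, `SketchResult`, and `runSketchEval` are hypothetical and exist only for illustration, so the real `EvalSuite` type and the value returned by `runEval` may look different.

```typescript
// Hypothetical types for illustration only; not the @agentskit/eval API.
type AgentFn = (input: string) => Promise<string>

interface SketchCase {
  input: string
  // Returns true when the agent's output is acceptable for this case.
  check: (output: string) => boolean
}

interface SketchSuite {
  name: string
  cases: SketchCase[]
}

interface SketchResult {
  total: number
  passed: number
  failed: number
  accuracy: number
}

// Runs every case sequentially against the agent and counts passes.
async function runSketchEval(agent: AgentFn, suite: SketchSuite): Promise<SketchResult> {
  let passed = 0
  for (const c of suite.cases) {
    try {
      const output = await agent(c.input)
      if (c.check(output)) passed++
    } catch {
      // An agent that throws counts as a failed case.
    }
  }
  const total = suite.cases.length
  return { total, passed, failed: total - passed, accuracy: total === 0 ? 0 : passed / total }
}

// A toy agent and a two-case suite: the first case passes, the second fails.
const echoAgent: AgentFn = async (input) => `echo: ${input}`

const sketchSuite: SketchSuite = {
  name: 'echo-basics',
  cases: [
    { input: 'hi', check: (out) => out === 'echo: hi' },
    { input: 'bye', check: (out) => out.includes('goodbye') },
  ],
}
```

Calling `runSketchEval(echoAgent, sketchSuite)` resolves to a result with one pass and one failure. Because the agent is just an async function, the same loop works for a real model-backed agent or a stub.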

#Deterministic replay

createRecordingAdapter + createReplayAdapter — bit-for-bit replay. Recipe.
createTimeTravelSession — rewind + override + fork. Recipe.
replayAgainst — A/B cassette vs different model. Recipe.
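
The record-then-replay idea behind these helpers can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: `ModelAdapter`, `CassetteEntry`, `recordingSketch`, and `replaySketch` are hypothetical names, and the real adapter interface and cassette format may differ.

```typescript
// Hypothetical adapter shape for illustration; the real interface may differ.
type ModelAdapter = (prompt: string) => Promise<string>

interface CassetteEntry {
  prompt: string
  response: string
}

// Wraps a live adapter and appends every call and its response to the cassette.
function recordingSketch(live: ModelAdapter, cassette: CassetteEntry[]): ModelAdapter {
  return async (prompt) => {
    const response = await live(prompt)
    cassette.push({ prompt, response })
    return response
  }
}

// Serves recorded responses in order and never calls a live model.
// Throws if the cassette runs out or the prompt differs from the recording.
function replaySketch(cassette: readonly CassetteEntry[]): ModelAdapter {
  let cursor = 0
  return async (prompt) => {
    const entry = cassette[cursor]
    if (entry === undefined) throw new Error(`cassette exhausted at call ${cursor}`)
    if (entry.prompt !== prompt) throw new Error(`prompt mismatch at call ${cursor}`)
    cursor++
    return entry.response
  }
}

// A stand-in for a live model whose output changes on every call.
let liveCalls = 0
const fakeLiveModel: ModelAdapter = async (prompt) => `reply ${++liveCalls} to "${prompt}"`
```

Record once by wrapping `fakeLiveModel` with `recordingSketch`, then hand the cassette to `replaySketch` in CI. Failing loudly on a prompt mismatch is a design choice in this sketch: it turns an unnoticed prompt change into a visible test failure instead of a silently stale response.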

#Snapshots + diff

matchPromptSnapshot — Jest-style with exact / normalized / similarity. Recipe.
promptDiff + attributePromptChange — git-blame for prompts. Recipe.
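
The three matching modes can be pictured with a small, self-contained sketch. Everything here is an assumption made for illustration: `MatchMode`, `normalizePrompt`, `jaccard`, and `matchesSnapshot` are hypothetical names, and word-set Jaccard overlap is only one plausible similarity measure, not necessarily the one `matchPromptSnapshot` uses.

```typescript
type MatchMode = 'exact' | 'normalized' | 'similarity'

// Collapse every run of whitespace to a single space and trim the ends.
function normalizePrompt(text: string): string {
  return text.replace(/\s+/g, ' ').trim()
}

// Jaccard overlap of lowercase word sets: shared words divided by all distinct words.
function jaccard(a: string, b: string): number {
  const wordsA = new Set(normalizePrompt(a).toLowerCase().split(' ').filter(Boolean))
  const wordsB = new Set(normalizePrompt(b).toLowerCase().split(' ').filter(Boolean))
  if (wordsA.size === 0 && wordsB.size === 0) return 1
  let shared = 0
  wordsA.forEach((w) => {
    if (wordsB.has(w)) shared++
  })
  return shared / (wordsA.size + wordsB.size - shared)
}

// Compares the current prompt with a stored snapshot under the chosen mode.
// The threshold only applies to 'similarity' mode.
function matchesSnapshot(
  current: string,
  snapshot: string,
  mode: MatchMode,
  threshold = 0.8,
): boolean {
  if (mode === 'exact') return current === snapshot
  if (mode === 'normalized') return normalizePrompt(current) === normalizePrompt(snapshot)
  return jaccard(current, snapshot) >= threshold
}
```

In this sketch, a prompt that differs from its snapshot only in whitespace fails `exact` but passes `normalized`, while `similarity` tolerates small wording changes up to the chosen threshold.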

#CI reporters

reportToCi + renderJUnit + renderMarkdown + renderGitHubAnnotations. Recipe.
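
For readers unfamiliar with JUnit-style reports, the sketch below shows roughly what turning eval outcomes into JUnit XML involves. It is not the library's `renderJUnit`: `CaseOutcome`, `escapeXml`, and `renderJUnitSketch` are hypothetical names, and the output uses only the common `testsuite`, `testcase`, and `failure` elements.

```typescript
// Hypothetical outcome shape for illustration only.
interface CaseOutcome {
  name: string
  passed: boolean
  message?: string
}

// Escape characters that are not allowed inside XML attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// Renders one <testsuite> with a <testcase> per outcome; failures get a <failure> child.
function renderJUnitSketch(suiteName: string, outcomes: CaseOutcome[]): string {
  const failures = outcomes.filter((o) => !o.passed).length
  const cases = outcomes.map((o) => {
    const open = `  <testcase name="${escapeXml(o.name)}"`
    if (o.passed) return `${open}/>`
    const message = escapeXml(o.message ?? 'failed')
    return `${open}>\n    <failure message="${message}"/>\n  </testcase>`
  })
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<testsuite name="${escapeXml(suiteName)}" tests="${outcomes.length}" failures="${failures}">`,
    ...cases,
    '</testsuite>',
  ].join('\n')
}
```

Most CI systems that accept JUnit XML read the `tests` and `failures` counts plus the per-case `failure` elements, which is why this format works as a lowest common denominator across CI stacks.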

#Open format

@agentskit/core/eval-format — portable eval JSON spec. Specs.

#CI integration with Braintrust

@agentskit/eval/braintrust wraps the Braintrust SDK and pushes eval results to Braintrust, so experiments, scores, and traces are visible in the Braintrust dashboard without any additional pipeline setup.

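import { runEval } from '@agentskit/eval'
import { braintrustReporter } from '@agentskit/eval/braintrust'

// myAgent and mySuite are your own agent function and EvalSuite, defined elsewhere.
await runEval({
  agent: myAgent,
  suite: mySuite,
  // The non-null assertion (!) assumes BRAINTRUST_API_KEY is set in the environment.
  reporters: [braintrustReporter({ apiKey: process.env.BRAINTRUST_API_KEY! })],
})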

In CI, set BRAINTRUST_API_KEY as a secret and add the eval run as a step after your test suite. Failed evals surface as non-zero exit codes.
