Evals

Run eval suites against any async agent function, replay recorded sessions in CI, and track prompt regressions with snapshots.

Agent quality degrades silently: a prompt change that improves one case breaks three others, and you only find out in production. @agentskit/eval gives you pass/fail metrics, deterministic replay without network calls, snapshot diffing for prompts, and reporters that integrate with any CI stack.

#Suites

  • runEval({ agent, suite }): run any EvalSuite against any async agent function. Recipe.
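
To make the idea concrete, here is a minimal, self-contained sketch of what running a suite against an async agent function involves. It does not use @agentskit/eval. The `SketchCase`, `SketchSuite`, and `runSuiteSketch` names and shapes are hypothetical, and the real EvalSuite type may look different.

```typescript
// Hypothetical shapes for illustration only; not the @agentskit/eval types.
type SketchCase = {
  name: string
  input: string
  // A case passes when this predicate returns true for the agent's output.
  expect: (output: string) => boolean
}

type SketchSuite = { name: string; cases: SketchCase[] }

type SketchSummary = {
  passed: number
  failed: number
  results: { name: string; passed: boolean; error?: string }[]
}

// Run every case in order. A thrown error is recorded as a failed case
// instead of aborting the rest of the suite.
async function runSuiteSketch(
  agent: (input: string) => Promise<string>,
  suite: SketchSuite,
): Promise<SketchSummary> {
  const results: SketchSummary['results'] = []
  for (const c of suite.cases) {
    try {
      const output = await agent(c.input)
      results.push({ name: c.name, passed: c.expect(output) })
    } catch (err) {
      results.push({ name: c.name, passed: false, error: String(err) })
    }
  }
  const passed = results.filter((r) => r.passed).length
  return { passed, failed: results.length - passed, results }
}
```

Because the agent is just `(input) => Promise<string>` in this sketch, the same suite can be pointed at a stub, a recorded session, or a live model without changing the cases.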

#Deterministic replay

  • createRecordingAdapter + createReplayAdapter: bit-for-bit replay. Recipe.
  • createTimeTravelSession: rewind + override + fork. Recipe.
  • replayAgainst: A/B a recorded cassette against a different model. Recipe.
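
The record-then-replay pattern behind these helpers can be sketched in a few lines. This is an illustration under assumptions, not the library's implementation: the `Adapter` and `Cassette` shapes and the `recordingSketch` / `replaySketch` functions are hypothetical stand-ins for whatever createRecordingAdapter and createReplayAdapter actually do.

```typescript
// Hypothetical adapter shape: one async call from a prompt to a response.
type Adapter = { complete: (prompt: string) => Promise<string> }
type Cassette = { prompt: string; response: string }[]

// Wrap a live adapter and append every exchange to the cassette.
function recordingSketch(live: Adapter, cassette: Cassette): Adapter {
  return {
    async complete(prompt) {
      const response = await live.complete(prompt)
      cassette.push({ prompt, response })
      return response
    },
  }
}

// Serve responses from the cassette in recorded order. No live adapter is
// involved, so replay makes no network calls and is fully deterministic.
function replaySketch(cassette: Cassette): Adapter {
  let cursor = 0
  return {
    async complete(prompt) {
      const entry = cassette[cursor]
      if (!entry) throw new Error(`Cassette exhausted at call ${cursor}`)
      if (entry.prompt !== prompt) throw new Error(`Prompt mismatch at call ${cursor}`)
      cursor += 1
      return entry.response
    },
  }
}
```

Failing loudly on a prompt mismatch is a deliberate choice in this sketch: if the agent starts sending different prompts than the ones recorded, the cassette no longer describes the run and should be re-recorded.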

#Snapshots + diff

  • matchPromptSnapshot: Jest-style with exact / normalized / similarity. Recipe.
  • promptDiff + attributePromptChange: git-blame for prompts. Recipe.
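
The three comparison modes can be pictured roughly as follows. This sketch is an assumption about what each mode means, not the behavior of matchPromptSnapshot: in particular, the similarity metric (Jaccard overlap of word sets) and the default threshold of 0.8 are illustrative choices, and `matchesSnapshotSketch` is a hypothetical name.

```typescript
type MatchMode = 'exact' | 'normalized' | 'similarity'

// Collapse whitespace runs and trim, so formatting-only edits do not count as changes.
const normalize = (s: string): string => s.replace(/\s+/g, ' ').trim()

// Jaccard overlap of lowercase word sets: 1 means the same vocabulary, 0 means disjoint.
function jaccard(a: string, b: string): number {
  const wa = new Set(normalize(a).toLowerCase().split(' ').filter(Boolean))
  const wb = new Set(normalize(b).toLowerCase().split(' ').filter(Boolean))
  if (wa.size === 0 && wb.size === 0) return 1
  const shared = Array.from(wa).filter((w) => wb.has(w)).length
  return shared / (wa.size + wb.size - shared)
}

// Compare the current prompt with its stored snapshot under the chosen mode.
// The threshold only applies to 'similarity'.
function matchesSnapshotSketch(
  current: string,
  snapshot: string,
  mode: MatchMode,
  threshold = 0.8,
): boolean {
  switch (mode) {
    case 'exact':
      return current === snapshot
    case 'normalized':
      return normalize(current) === normalize(snapshot)
    case 'similarity':
      return jaccard(current, snapshot) >= threshold
  }
}
```

In this sketch, exact catches any byte-level change, normalized ignores whitespace churn, and similarity tolerates small rewordings up to the threshold.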

#CI reporters

  • reportToCi + renderJUnit + renderMarkdown + renderGitHubAnnotations. Recipe.
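
For orientation, this is roughly what turning eval results into a JUnit report involves. It is a hand-rolled sketch, not the output of renderJUnit: the `CaseResult` shape and `renderJUnitSketch` function are hypothetical, and only a small subset of the JUnit XML attributes is emitted.

```typescript
// Hypothetical result shape for illustration.
type CaseResult = { name: string; passed: boolean; message?: string }

// Escape the five XML special characters so names and messages are safe in attributes.
const escapeXml = (s: string): string =>
  s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;')

// Render one <testsuite> with a <testcase> per eval case;
// failed cases carry a nested <failure> element.
function renderJUnitSketch(suiteName: string, results: CaseResult[]): string {
  const failures = results.filter((r) => !r.passed).length
  const cases = results.map((r) => {
    const open = `  <testcase name="${escapeXml(r.name)}"`
    return r.passed
      ? `${open}/>`
      : `${open}>\n    <failure message="${escapeXml(r.message ?? 'failed')}"/>\n  </testcase>`
  })
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<testsuite name="${escapeXml(suiteName)}" tests="${results.length}" failures="${failures}">`,
    ...cases,
    '</testsuite>',
  ].join('\n')
}
```

Most CI systems that display test results accept JUnit XML, which is why a renderer of this kind is usually enough to surface eval failures next to ordinary test failures.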

#Open format

  • @agentskit/core/eval-format: portable eval JSON spec. Specs.

#CI integration with Braintrust

@agentskit/eval-braintrust wraps the Braintrust SDK to push eval results to Braintrust, so experiments, scores, and traces are visible in the Braintrust dashboard without any additional pipeline setup.

import { runEval } from '@agentskit/eval'
import { braintrustReporter } from '@agentskit/eval-braintrust'

await runEval({
  agent: myAgent,
  suite: mySuite,
  reporters: [braintrustReporter({ apiKey: process.env.BRAINTRUST_API_KEY! })],
})

In CI, set BRAINTRUST_API_KEY as a secret and add the eval run as a step after your test suite. Failed evals surface as non-zero exit codes, so the step fails the build.
