Evals
Run eval suites against any async agent function, replay recorded sessions in CI, and track prompt regressions with snapshots.
Agent quality degrades silently: a prompt change that improves one case breaks three others, and you only find out in production. `@agentskit/eval` gives you pass/fail metrics, deterministic replay without network calls, snapshot diffing for prompts, and reporters that integrate with any CI stack.
#Suites
- `runEval({ agent, suite })`: run any `EvalSuite` against any async agent function. Recipe.
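To make the idea of a suite run concrete, here is a minimal, self-contained sketch of what such a runner might compute. It does not use the library: `Agent`, `EvalCase`, `SuiteReport`, and `runSuiteSketch` are hypothetical names chosen for illustration, and the real `runEval` and `EvalSuite` shapes may differ.

```typescript
// Hypothetical local types: illustrative only, not the @agentskit/eval API.
type Agent = (input: string) => Promise<string>

interface EvalCase {
  name: string
  input: string
  // Returns true when the agent output is acceptable for this case.
  assert: (output: string) => boolean
}

interface CaseResult {
  name: string
  passed: boolean
  latencyMs: number
  error?: string
}

interface SuiteReport {
  total: number
  passed: number
  passRate: number
  results: CaseResult[]
}

// Runs every case sequentially and records pass/fail plus wall-clock latency.
async function runSuiteSketch(agent: Agent, cases: EvalCase[]): Promise<SuiteReport> {
  const results: CaseResult[] = []
  for (const c of cases) {
    const start = Date.now()
    try {
      const output = await agent(c.input)
      results.push({ name: c.name, passed: c.assert(output), latencyMs: Date.now() - start })
    } catch (err) {
      // A thrown error counts as a failed case rather than aborting the run.
      results.push({
        name: c.name,
        passed: false,
        latencyMs: Date.now() - start,
        error: err instanceof Error ? err.message : String(err),
      })
    }
  }
  const passed = results.filter((r) => r.passed).length
  return {
    total: results.length,
    passed,
    passRate: results.length === 0 ? 0 : passed / results.length,
    results,
  }
}
```

Treating a thrown error as a failed case, rather than aborting, is a design choice of this sketch: one broken case should not hide the results of the others.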
#Deterministic replay
- `createRecordingAdapter` + `createReplayAdapter`: bit-for-bit replay. Recipe.
- `createTimeTravelSession`: rewind + override + fork. Recipe.
- `replayAgainst`: A/B a cassette against a different model. Recipe.
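The record-then-replay pattern can be sketched without the library. The snippet below is an assumption-laden illustration: `ModelCall`, `Cassette`, `recordInto`, and `replayFrom` are hypothetical names, and the cassette is keyed by the raw prompt string, which may not match how the real adapters key or store recordings.

```typescript
// Hypothetical shapes for illustration; the real adapters in @agentskit/eval may differ.
type ModelCall = (prompt: string) => Promise<string>

// A cassette maps each prompt to the response recorded for it.
type Cassette = Map<string, string>

// Wraps a live model call and stores every response in the cassette.
function recordInto(cassette: Cassette, live: ModelCall): ModelCall {
  return async (prompt) => {
    const response = await live(prompt)
    cassette.set(prompt, response)
    return response
  }
}

// Serves responses from the cassette only; it never calls the live model.
function replayFrom(cassette: Cassette): ModelCall {
  return async (prompt) => {
    const recorded = cassette.get(prompt)
    if (recorded === undefined) {
      // Failing loudly keeps CI deterministic: an unrecorded prompt is a test bug.
      throw new Error(`No recorded response for prompt: ${prompt}`)
    }
    return recorded
  }
}
```

In this sketch, a cassette could be persisted between runs by serializing its entries to JSON, for example with `JSON.stringify(Array.from(cassette))`.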
#Snapshots + diff
- `matchPromptSnapshot`: Jest-style with exact / normalized / similarity. Recipe.
- `promptDiff` + `attributePromptChange`: git-blame for prompts. Recipe.
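As a rough illustration of how three matching modes could differ, the sketch below compares a rendered prompt against a stored snapshot. Everything here is an assumption: `matchesSnapshot` is a hypothetical helper (not the `matchPromptSnapshot` signature), "normalized" is taken to mean whitespace-insensitive, and "similarity" is implemented as Jaccard overlap of word sets with an arbitrary default threshold of 0.9.

```typescript
// Hypothetical helper for illustration; not the matchPromptSnapshot signature.
type MatchMode = 'exact' | 'normalized' | 'similarity'

// Collapses whitespace runs and trims, so formatting-only edits do not count as drift.
function normalize(text: string): string {
  return text.replace(/\s+/g, ' ').trim()
}

// Splits normalized, lowercased text into a set of distinct words.
function wordSet(text: string): Set<string> {
  return new Set(normalize(text).toLowerCase().split(' ').filter((w) => w.length > 0))
}

// Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: string, b: string): number {
  const setA = wordSet(a)
  const setB = wordSet(b)
  if (setA.size === 0 && setB.size === 0) return 1
  let shared = 0
  setA.forEach((word) => {
    if (setB.has(word)) shared += 1
  })
  return shared / (setA.size + setB.size - shared)
}

// Returns true when the actual prompt matches the snapshot under the chosen mode.
function matchesSnapshot(
  actual: string,
  snapshot: string,
  mode: MatchMode,
  threshold: number = 0.9, // assumed default, used by 'similarity' only
): boolean {
  switch (mode) {
    case 'exact':
      return actual === snapshot
    case 'normalized':
      return normalize(actual) === normalize(snapshot)
    case 'similarity':
      return jaccard(actual, snapshot) >= threshold
  }
}
```

Under these assumptions, a prompt that differs from its snapshot only by line breaks fails `exact` but passes `normalized`, while a prompt with one word swapped passes only `similarity`, and only if the threshold is loose enough.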
#CI reporters
- `reportToCi` + `renderJUnit` + `renderMarkdown` + `renderGitHubAnnotations`. Recipe.
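For readers unfamiliar with what a JUnit-style report contains, the following sketch renders a minimal `<testsuite>` document from a list of case outcomes. It is not the library's `renderJUnit`: `CaseOutcome` and `toJUnitXml` are hypothetical names, and only a small subset of the commonly used JUnit XML attributes is emitted.

```typescript
// Hypothetical result shape; the input accepted by the library's renderJUnit may differ.
interface CaseOutcome {
  name: string
  passed: boolean
  seconds: number
  message?: string
}

// Escapes the five XML special characters so names and messages stay well-formed.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;')
}

// Renders one <testsuite> with a <testcase> per eval case; failures get a <failure> child.
function toJUnitXml(suiteName: string, outcomes: CaseOutcome[]): string {
  const failures = outcomes.filter((o) => !o.passed).length
  const cases = outcomes.map((o) => {
    const open = `  <testcase name="${escapeXml(o.name)}" time="${o.seconds}"`
    if (o.passed) return `${open}/>`
    const message = escapeXml(o.message ?? 'eval failed')
    return `${open}>\n    <failure message="${message}"/>\n  </testcase>`
  })
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    `<testsuite name="${escapeXml(suiteName)}" tests="${outcomes.length}" failures="${failures}">`,
    ...cases,
    '</testsuite>',
  ].join('\n')
}
```

Most CI systems that ingest JUnit XML key on the `tests` and `failures` counts and the per-case `<failure>` elements, which is why the sketch emits those and little else.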
#Open format
- `@agentskit/core/eval-format`: portable eval JSON spec. Specs.
#CI integration with Braintrust
`@agentskit/eval-braintrust` wraps the Braintrust SDK to push eval results to Braintrust: experiments, scores, and traces are visible in the Braintrust dashboard without any additional pipeline setup.
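```typescript
import { runEval } from '@agentskit/eval'
import { braintrustReporter } from '@agentskit/eval-braintrust'

// myAgent is your async agent function and mySuite is your EvalSuite, both defined elsewhere.
await runEval({
  agent: myAgent,
  suite: mySuite,
  // Pushes the eval results to Braintrust; the API key is read from the environment.
  reporters: [braintrustReporter({ apiKey: process.env.BRAINTRUST_API_KEY! })],
})
```

In CI, set `BRAINTRUST_API_KEY` as a secret and add the eval run as a step after your test suite. Failed evals surface as non-zero exit codes.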
#Related
Explore nearby
- Eval suites: Define cases with inputs and assertions, then run them against any async agent function to get pass rates and latency metrics.
- Deterministic replay: Record LLM responses to a cassette file, then replay them in CI without network calls for fast, deterministic tests.
- Prompt snapshots + diff: Assert that rendered prompts haven't changed unexpectedly, and trace exactly which edit caused a drift.