Evals
Measure quality with numbers, not vibes. Suites, replay, snapshots, diff, CI reporters.
Suites
runEval({ agent, suite }): run any EvalSuite against any async agent function. Recipe.
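To make the suite idea concrete, here is a minimal sketch of running a list of cases against an async agent function and counting passes. It does not import from AgentsKit: SketchCase, SketchSuite, and runSuiteSketch are hypothetical local stand-ins, exact string matching is an assumed scoring rule, and the real runEval and EvalSuite shapes may differ. The recipe has the actual signature.

```typescript
// Hypothetical local types; these are NOT the AgentsKit EvalSuite definitions.
type SketchCase = { input: string; expected: string };
type SketchSuite = { name: string; cases: SketchCase[] };
type SketchResult = { passed: number; failed: number; accuracy: number };

// Run every case through the agent and score by exact string match (an assumption).
async function runSuiteSketch(
  agent: (input: string) => Promise<string>,
  suite: SketchSuite,
): Promise<SketchResult> {
  let passed = 0;
  for (const c of suite.cases) {
    const output = await agent(c.input);
    if (output === c.expected) passed += 1;
  }
  const total = suite.cases.length;
  return { passed, failed: total - passed, accuracy: total === 0 ? 0 : passed / total };
}

// A stub agent that upper-cases its input, standing in for a real model call.
const echoAgent = async (input: string): Promise<string> => input.toUpperCase();

const demoSuite: SketchSuite = {
  name: "uppercase",
  cases: [
    { input: "hi", expected: "HI" },
    { input: "ok", expected: "ok" }, // deliberately failing case
  ],
};
```

Because the agent is just an async function from input to output, the same suite can be pointed at a stub, a recorded replay, or a live model without changing the cases.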
Deterministic replay
createRecordingAdapter + createReplayAdapter: bit-for-bit replay. Recipe.
createTimeTravelSession: rewind + override + fork. Recipe.
replayAgainst: A/B a recorded cassette against a different model. Recipe.
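The record-then-replay pattern behind these primitives can be sketched in a few lines. This is an illustration only, not the AgentsKit adapters: Model, Cassette, recordSketch, and replaySketch are hypothetical names, and the cassette layout (an ordered list of prompt/response pairs) is an assumption.

```typescript
// Hypothetical shapes for illustration; the real adapter and cassette types may differ.
type Model = (prompt: string) => Promise<string>;
type Cassette = { prompt: string; response: string }[];

// Wrap a live model so every call is appended to the cassette.
function recordSketch(model: Model, cassette: Cassette): Model {
  return async (prompt) => {
    const response = await model(prompt);
    cassette.push({ prompt, response });
    return response;
  };
}

// Serve responses from the cassette in order, never touching the live model.
function replaySketch(cassette: Cassette): Model {
  let cursor = 0;
  return async (prompt) => {
    const entry = cassette[cursor];
    if (!entry || entry.prompt !== prompt) {
      throw new Error(`cassette mismatch at call ${cursor}`);
    }
    cursor += 1;
    return entry.response;
  };
}

// A deliberately non-deterministic stand-in for a live model.
let calls = 0;
const liveModel: Model = async (prompt) => `${prompt}#${++calls}`;
```

Failing loudly on a mismatched prompt is a design choice in this sketch: a replay that silently falls through to the live model would no longer be deterministic.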
Snapshots + diff
matchPromptSnapshot: Jest-style snapshots with exact, normalized, or similarity matching. Recipe.
promptDiff + attributePromptChange: git-blame for prompts. Recipe.
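One plausible reading of the three snapshot matching modes is sketched below. It is not the matchPromptSnapshot implementation: matchSketch is a hypothetical helper, whitespace collapsing is an assumed definition of "normalized", and word-level Jaccard overlap with a 0.8 default threshold is an assumed similarity metric.

```typescript
type Mode = "exact" | "normalized" | "similarity";

// Collapse runs of whitespace and trim the ends (assumed normalization rule).
const normalize = (s: string): string => s.trim().replace(/\s+/g, " ");

// Word-level Jaccard overlap in [0, 1]; an assumed stand-in for the real metric.
function jaccard(a: string, b: string): number {
  const wordsA = new Set(normalize(a).toLowerCase().split(" "));
  const wordsB = new Set(normalize(b).toLowerCase().split(" "));
  const shared = Array.from(wordsA).filter((w) => wordsB.has(w)).length;
  return shared / (wordsA.size + wordsB.size - shared);
}

// Compare a prompt against its stored snapshot under the chosen mode.
function matchSketch(actual: string, snapshot: string, mode: Mode, threshold = 0.8): boolean {
  if (mode === "exact") return actual === snapshot;
  if (mode === "normalized") return normalize(actual) === normalize(snapshot);
  return jaccard(actual, snapshot) >= threshold;
}
```

The modes trade strictness for noise tolerance: exact catches every byte change, normalized ignores formatting churn, and similarity tolerates small wording edits up to the threshold.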
CI reporters
reportToCi + renderJUnit + renderMarkdown + renderGitHubAnnotations. Recipe.
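For a sense of what a CI reporter emits, the sketch below renders eval results as a minimal JUnit-style XML document that most CI systems can ingest. It is not the renderJUnit implementation: CaseResult and renderJUnitSketch are hypothetical, and the real reporter may emit more attributes (timings, properties, nested suites).

```typescript
// Hypothetical result shape for illustration only.
type CaseResult = { name: string; passed: boolean; message?: string };

// Escape the characters that are unsafe inside XML attribute values.
const escapeXml = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;");

// Render one <testsuite> with a <testcase> per result and a <failure> child on failures.
function renderJUnitSketch(suiteName: string, results: CaseResult[]): string {
  const failures = results.filter((r) => !r.passed).length;
  const cases = results.map((r) => {
    const name = escapeXml(r.name);
    return r.passed
      ? `  <testcase name="${name}"/>`
      : `  <testcase name="${name}"><failure message="${escapeXml(r.message ?? "failed")}"/></testcase>`;
  });
  return [
    `<testsuite name="${escapeXml(suiteName)}" tests="${results.length}" failures="${failures}">`,
    ...cases,
    `</testsuite>`,
  ].join("\n");
}
```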
Open format
@agentskit/core/eval-format: portable eval JSON spec. Specs.
Per-primitive deep dives land in step 6 of the docs IA rollout.