Deterministic replay
Record a real agent session once, then replay it bit-for-bit in tests and bug repros — no more flaky LLM traces.
LLMs are non-deterministic. That makes debugging a nightmare: a bug that reproduces locally vanishes the next run, and CI flakes when providers return slightly different tokens.
@agentskit/eval/replay fixes this by recording every StreamChunk
an adapter produces into a cassette, then letting you replay it
through a fake adapter that is bit-for-bit identical — same tokens,
same order, same tool calls, zero network.
Install
npm install -D @agentskit/evalRecord once
Wrap any real adapter. Every streamed chunk is captured into a
Cassette object you can persist to disk.
import { createRecordingAdapter, saveCassette } from '@agentskit/eval/replay'
import { createRuntime } from '@agentskit/runtime'
import { anthropic } from '@agentskit/adapters'
const base = anthropic({ apiKey: process.env.ANTHROPIC_API_KEY!, model: 'claude-sonnet-4-6' })
const { factory, cassette } = createRecordingAdapter(base, { seed: 'bug-repro-#427' })
const runtime = createRuntime({ adapter: factory })
await runtime.run('Summarize the quarterly report')
await saveCassette('./fixtures/bug-427.cassette.json', cassette)Replay forever
In tests — or when you're iterating on a fix — swap the real adapter for the cassette. No API keys, no latency, no flake.
import { createReplayAdapter, loadCassette } from '@agentskit/eval/replay'
import { createRuntime } from '@agentskit/runtime'
const cassette = await loadCassette('./fixtures/bug-427.cassette.json')
const runtime = createRuntime({ adapter: createReplayAdapter(cassette) })
const result = await runtime.run('Summarize the quarterly report')
expect(result.content).toContain('Q3 revenue')Matching modes
The replay adapter accepts a mode option that controls how incoming
requests map to recorded entries.
| Mode | Behavior | Use when |
|---|---|---|
strict (default) | Request fingerprint (messages + context) must match exactly | Your test sends the same input as the recording |
sequential | Returns next unused entry regardless of request | Requests include timestamps or volatile metadata |
loose | Matches by last user message content only | You care about the prompt, not the surrounding context |
createReplayAdapter(cassette, { mode: 'sequential' })What goes in a cassette
Cassettes are plain JSON. Commit them to git, diff them in PRs.
{
"version": 1,
"seed": "bug-repro-#427",
"entries": [
{
"request": { "messages": [{ "role": "user", "content": "..." }] },
"chunks": [
{ "type": "text", "content": "Hello" },
{ "type": "tool_call", "toolCall": { "id": "t1", "name": "search", "args": "{...}" } },
{ "type": "done" }
]
}
]
}See also
- Prompt snapshot testing — assert prompts stay stable
- Prompt diff — attribute output changes to prompt changes
- Eval suite — score agents in CI