agentskit.js
Recipes

Prompt snapshot testing

Jest-style snapshot tests for prompts, with semantic tolerance so small wording drift doesn't break CI.

Prompts are code. They should be reviewed in PRs and tested like code. @agentskit/eval/snapshot gives you snapshot testing — the same "write once, assert next time" workflow as Jest — with one twist that matters for LLM outputs: semantic tolerance.

Exact-match snapshots are too brittle for model outputs. Normalized and similarity-based modes let you assert intent without pinning every comma.

Install

npm install -D @agentskit/eval

Quick start

prompts.test.ts
import { matchPromptSnapshot } from '@agentskit/eval/snapshot'
import { expect, it } from 'vitest'

it('reviewer skill system prompt stays stable', async () => {
  const actual = buildReviewerSystemPrompt({ language: 'typescript' })
  const result = await matchPromptSnapshot(actual, './__snapshots__/reviewer.snap.md')
  expect(result.matched).toBe(true)
})

First run creates the snapshot file. Next run compares. Update snapshots on purpose with UPDATE_SNAPSHOTS=1 vitest or { update: true }.

Matching modes

ModeWhat matchesUse for
{ kind: 'exact' } (default)Byte-for-byteSource-of-truth prompt templates
{ kind: 'normalized' }Case + punctuation + whitespace ignoredPrompts with cosmetic drift
{ kind: 'similarity', threshold }Jaccard token similarity ≥ thresholdLLM-generated prompts or summaries
{ kind: 'similarity', threshold, embed }Cosine of embeddings ≥ thresholdFull semantic assertions
await matchPromptSnapshot(actual, path, {
  mode: { kind: 'similarity', threshold: 0.85 },
})

Embedding-based snapshots

Plug in any embedding function — OpenAI, local, whatever — to compare snapshots by meaning instead of tokens.

import { OpenAI } from 'openai'
const openai = new OpenAI()

async function embed(text: string) {
  const r = await openai.embeddings.create({ model: 'text-embedding-3-small', input: text })
  return r.data[0].embedding
}

await matchPromptSnapshot(output, './__snapshots__/answer.snap.txt', {
  mode: { kind: 'similarity', threshold: 0.9, embed },
})

Low-level primitives

If you're building your own harness, the comparison logic is exposed:

import { comparePrompt, jaccard, cosine, normalize } from '@agentskit/eval/snapshot'

await comparePrompt('hello world', 'hello,  world!', { kind: 'normalized' })
// => { matched: true, reason: 'normalized match', ... }

See also

✎ Edit this page on GitHub·Found a problem? Open an issue →·How to contribute →

On this page