Prompt diff
Git blame for prompts — find which prompt change is responsible for an output change.
You shipped a prompt tweak. A week later the outputs look different,
and you have no idea which line did it. @agentskit/eval/diff solves
that: line-level diff + heuristic attribution that points at the
prompt lines most likely responsible for an output shift.
Install
npm install -D @agentskit/eval
Diff two prompt versions
import { promptDiff, formatDiff } from '@agentskit/eval/diff'
const diff = promptDiff(oldPrompt, newPrompt)
console.log(formatDiff(diff))
// You are a helpful assistant.
// - Answer briefly.
// + Answer with pirate slang.
Each entry in diff.lines is { op: 'equal' | 'add' | 'remove', lineNo, content }.
Totals (added, removed, changed) are on the result.
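To make the entry shape concrete, here is a minimal sketch that works on hand-written entries rather than on a real promptDiff result. The DiffLine type follows the description above. The helpers changedLines and countOps are hypothetical and not part of @agentskit/eval, and the lineNo values are illustrative.

```typescript
// Shape of one diff entry, as described in the text above.
type DiffOp = 'equal' | 'add' | 'remove'

interface DiffLine {
  op: DiffOp
  lineNo: number
  content: string
}

// Hypothetical helper (not part of @agentskit/eval): keep only the changed lines.
function changedLines(lines: DiffLine[]): DiffLine[] {
  return lines.filter((line) => line.op !== 'equal')
}

// Hypothetical helper: count how many entries carry each op.
function countOps(lines: DiffLine[]): Record<DiffOp, number> {
  const counts: Record<DiffOp, number> = { equal: 0, add: 0, remove: 0 }
  for (const line of lines) counts[line.op] += 1
  return counts
}

// Hand-written entries mirroring the pirate example; lineNo values are illustrative.
const sampleLines: DiffLine[] = [
  { op: 'equal', lineNo: 1, content: 'You are a helpful assistant.' },
  { op: 'remove', lineNo: 2, content: 'Answer briefly.' },
  { op: 'add', lineNo: 2, content: 'Answer with pirate slang.' },
]

console.log(changedLines(sampleLines).map((l) => `${l.op}: ${l.content}`))
console.log(countOps(sampleLines))
```

Filtering out 'equal' entries like this is a cheap way to log only what changed in a review or CI message. How the library computes its own changed total is not stated here, so the sketch counts only the three ops.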
Attribute an output change
Given the old and new prompts and the old and new outputs, attributePromptChange estimates which changed prompt lines probably caused the output shift. It uses simple token overlap, which is good enough to rank suspects.
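As a rough illustration of what a token-overlap heuristic can look like, the sketch below ranks changed prompt lines by how many of their words show up in the new part of the output. This is an assumption about the general technique, not the library's implementation. The names tokenize, rankSuspects and Suspect are hypothetical, and the sample prompts and outputs are made up for the example.

```typescript
// Lowercase a string and split it into a set of unique word tokens.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9']+/g) ?? [])
}

interface Suspect {
  content: string
  overlap: number // number of tokens shared with the output delta
}

// Hypothetical scoring, NOT the library's implementation:
// rank changed prompt lines by how many of their tokens appear
// in the part of the output that is new.
function rankSuspects(
  changedPromptLines: string[],
  oldOutput: string,
  newOutput: string,
): Suspect[] {
  const oldTokens = tokenize(oldOutput)
  // Output delta: tokens present in the new output but not in the old one.
  const delta = new Set(Array.from(tokenize(newOutput)).filter((t) => !oldTokens.has(t)))

  return changedPromptLines
    .map((content) => ({
      content,
      overlap: Array.from(tokenize(content)).filter((t) => delta.has(t)).length,
    }))
    .sort((a, b) => b.overlap - a.overlap)
}

// Made-up data: two changed prompt lines and an output that gained a sentence.
const ranked = rankSuspects(
  ['Always mention the refund policy.', 'Keep a friendly tone.'],
  'Thanks for reaching out.',
  'Thanks for reaching out. Our refund policy covers 30 days.',
)
console.log(ranked[0].content) // the refund-policy line ranks first
```

A heuristic like this only catches changes that echo prompt wording in the output, so treat the ranking as a list of suspects to inspect, not as proof.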
import { attributePromptChange } from '@agentskit/eval/diff'
const report = attributePromptChange({
oldPrompt: 'You are a helpful assistant.\nAnswer briefly.',
newPrompt: 'You are a helpful assistant.\nAnswer with pirate slang.',
oldOutput: 'Hello, how can I help?',
newOutput: 'Ahoy matey, what be yer query?',
})
console.log(report.suspectLines)
// [{ op: 'add', lineNo: 2, content: 'Answer with pirate slang.' }]
console.log(report.score) // 1.0: every changed line overlaps the output delta
Pair with replay + snapshots
The workflow:
- Record the old session with createRecordingAdapter.
- Tweak the prompt, generate a new output.
- Snapshot the new output with matchPromptSnapshot. If it matches, you're done.
- If it doesn't, attribute with attributePromptChange to see which tweak is load-bearing.
You now have the LLM-equivalent of git bisect for prompts.
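The branch between the last two workflow steps can be sketched as below. The signatures of createRecordingAdapter and matchPromptSnapshot are not shown in this recipe, so the sketch uses a hypothetical in-memory stand-in (checkSnapshot, snapshots, CheckResult) with exact string comparison. The real helpers may behave differently.

```typescript
// Hypothetical stand-in for a snapshot store. It does not mirror the
// signature or behavior of the real matchPromptSnapshot.
const snapshots = new Map<string, string>()

type CheckResult = { status: 'match' } | { status: 'changed'; previous: string }

// Compare an output with the stored snapshot for a prompt id.
// The first call for an id stores the output and reports a match.
function checkSnapshot(id: string, output: string): CheckResult {
  const previous = snapshots.get(id)
  if (previous === undefined) {
    snapshots.set(id, output)
    return { status: 'match' }
  }
  return previous === output ? { status: 'match' } : { status: 'changed', previous }
}

// First call records the old output; second call checks the new one.
checkSnapshot('greeting', 'Hello, how can I help?')
const result = checkSnapshot('greeting', 'Ahoy matey, what be yer query?')

if (result.status === 'changed') {
  // This is where you would hand the old/new prompt and output
  // to attributePromptChange to rank the suspect lines.
  console.log('output changed; previous was:', result.previous)
}
```

Keeping the previous output on the changed result means the attribution step has both outputs at hand without re-running the old prompt.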