Prompt diff

Git blame for prompts: find which prompt change is responsible for an output change.

You shipped a prompt tweak. A week later the outputs look different, and you have no idea which line did it. @agentskit/eval/diff addresses that with a line-level diff plus a heuristic attribution step that points at the prompt lines most likely responsible for an output shift.

Install

npm install -D @agentskit/eval

Diff two prompt versions

compare-prompts.ts

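import { promptDiff, formatDiff } from '@agentskit/eval/diff'

// Two versions of the same prompt; only the second line differs.
const oldPrompt = 'You are a helpful assistant.\nAnswer briefly.'
const newPrompt = 'You are a helpful assistant.\nAnswer with pirate slang.'

const diff = promptDiff(oldPrompt, newPrompt)
console.log(formatDiff(diff))
//   You are a helpful assistant.
// - Answer briefly.
// + Answer with pirate slang.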

Each entry in diff.lines is { op: 'equal' | 'add' | 'remove', lineNo, content }. Totals (added, removed, changed) are on the result.
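
As a rough illustration of consuming that entry shape, the sketch below declares a local DiffLine type that mirrors the fields described above and tallies entries by op. The DiffLine type, the tally helper, and the sample lineNo values are assumptions made for illustration, not exports of @agentskit/eval; with the library you would presumably pass diff.lines instead of the hand-written array.

```typescript
// Local stand-in for the entry shape described above (assumed, not imported from the library).
type DiffLine = { op: 'equal' | 'add' | 'remove'; lineNo: number; content: string }

// Hypothetical helper: count how many entries carry each op.
function tally(lines: DiffLine[]): Record<DiffLine['op'], number> {
  const counts = { equal: 0, add: 0, remove: 0 }
  for (const line of lines) counts[line.op] += 1
  return counts
}

// Hand-written entries modelled on the pirate example; lineNo values are illustrative.
const sampleLines: DiffLine[] = [
  { op: 'equal', lineNo: 1, content: 'You are a helpful assistant.' },
  { op: 'remove', lineNo: 2, content: 'Answer briefly.' },
  { op: 'add', lineNo: 2, content: 'Answer with pirate slang.' },
]

const totals = tally(sampleLines)
console.log(totals) // { equal: 1, add: 1, remove: 1 }
```

Keeping the tally in a separate helper makes it easy to assert on prompt churn in a test, for example failing a review check when a change removes more lines than expected.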

Attribute an output change

Given the old and new prompts and the old and new outputs, attributePromptChange estimates which changed prompt lines probably caused the output shift. The heuristic is simple token overlap, which is good enough to rank suspects.

import { attributePromptChange } from '@agentskit/eval/diff'

const report = attributePromptChange({
  oldPrompt: 'You are a helpful assistant.\nAnswer briefly.',
  newPrompt: 'You are a helpful assistant.\nAnswer with pirate slang.',
  oldOutput: 'Hello, how can I help?',
  newOutput: 'Ahoy matey, what be yer query?',
})

console.log(report.suspectLines)
// [{ op: 'add', lineNo: 2, content: 'Answer with pirate slang.' }]
console.log(report.score) // 1.0 — every changed line overlaps the output delta
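
To make the idea of token overlap concrete, here is a minimal sketch of one way such a heuristic could work. It is not the library's implementation: the tokens, outputDelta, and rankSuspects helpers are hypothetical, and the tokenizer (lowercased alphanumeric words) is an assumption. The sketch collects tokens that appear only in the new output, then ranks added prompt lines by how many of those tokens they share.

```typescript
// Lowercased word tokens of a string (assumed tokenizer, for illustration only).
function tokens(text: string): Set<string> {
  return new Set(text.toLowerCase().match(/[a-z0-9']+/g) ?? [])
}

// Tokens present in the new output but absent from the old one.
function outputDelta(oldOutput: string, newOutput: string): Set<string> {
  const before = tokens(oldOutput)
  return new Set(Array.from(tokens(newOutput)).filter((t) => !before.has(t)))
}

// Rank added prompt lines by how many of their tokens appear in the output delta.
// Lines with no overlap are dropped from the suspect list.
function rankSuspects(addedLines: string[], oldOutput: string, newOutput: string) {
  const delta = outputDelta(oldOutput, newOutput)
  return addedLines
    .map((content) => ({
      content,
      overlap: Array.from(tokens(content)).filter((t) => delta.has(t)).length,
    }))
    .filter((suspect) => suspect.overlap > 0)
    .sort((a, b) => b.overlap - a.overlap)
}

const ranked = rankSuspects(
  ['Greet the user with "ahoy".', 'Keep answers under three sentences.'],
  'Hello, how can I help?',
  'Ahoy, how can I help?',
)
console.log(ranked)
// [ { content: 'Greet the user with "ahoy".', overlap: 1 } ]
```

A purely lexical heuristic like this only catches changes whose wording surfaces in the output, so it is best treated as a way to order suspects for manual review rather than as proof of cause.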

Pair with replay + snapshots

The workflow:

1. Record the old session with createRecordingAdapter.
2. Tweak the prompt and generate a new output.
3. Snapshot the new output with matchPromptSnapshot. If it matches, you're done.
4. If it doesn't, run attributePromptChange to see which tweak is load-bearing.
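
The decision at the end of that workflow can be sketched as a small function. Because the signatures of matchPromptSnapshot and attributePromptChange are not spelled out here, the sketch injects both steps as plain callbacks; checkPromptTweak, Versions, Verdict, and the two stub callbacks are all hypothetical names used for illustration only.

```typescript
// Hypothetical shapes; the library's own types may differ.
type Versions = { oldPrompt: string; newPrompt: string; oldOutput: string; newOutput: string }
type Verdict = { status: 'unchanged' } | { status: 'changed'; suspects: string[] }

// The snapshot check and the attribution step are injected as callbacks,
// so this sketch does not depend on any particular library signature.
function checkPromptTweak(
  versions: Versions,
  snapshotMatches: (output: string) => boolean,
  attribute: (versions: Versions) => string[],
): Verdict {
  // If the new output still matches the snapshot, there is nothing to investigate.
  if (snapshotMatches(versions.newOutput)) return { status: 'unchanged' }
  // Otherwise ask the attribution step which prompt lines to inspect first.
  return { status: 'changed', suspects: attribute(versions) }
}

const versions: Versions = {
  oldPrompt: 'You are a helpful assistant.\nAnswer briefly.',
  newPrompt: 'You are a helpful assistant.\nAnswer with pirate slang.',
  oldOutput: 'Hello, how can I help?',
  newOutput: 'Ahoy matey, what be yer query?',
}

const verdict = checkPromptTweak(
  versions,
  (output) => output === versions.oldOutput, // stub: exact match against the recorded output
  () => ['Answer with pirate slang.'], // stub: canned suspects for this example
)
console.log(verdict)
// { status: 'changed', suspects: [ 'Answer with pirate slang.' ] }
```

In a real setup the two stubs would be replaced by calls into the snapshot and attribution helpers named in the steps above.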

You now have the LLM-equivalent of git bisect for prompts.
