# Prompt injection detector
Score incoming text for injection attempts — heuristics + optional model classifier (Llama Guard, Rebuff).
Prompt injection is user input that tries to rewrite the agent's instructions. `createInjectionDetector` gives you a two-layer defense: cheap regex heuristics for the common patterns, and a pluggable model classifier for the subtle ones. The verdict is the max of both signals.
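The "max of both signals" rule can be sketched as follows. This is an illustrative stand-in, not the library's internals — `combineSignals` and its shape are assumptions made for the sketch:

```typescript
// Sketch of the two-layer verdict: take the max of the heuristic score and
// the (optional) classifier score, then compare against the threshold.
// All names here are illustrative, not @agentskit/core internals.
type Classifier = (input: string) => Promise<number>

async function combineSignals(
  heuristicScore: number,
  input: string,
  threshold: number,
  classifier?: Classifier,
): Promise<{ score: number; blocked: boolean; source: 'heuristic' | 'hybrid' }> {
  let score = heuristicScore
  let source: 'heuristic' | 'hybrid' = 'heuristic'
  if (classifier) {
    // The stronger signal wins: a confident classifier can block text the
    // regexes missed, and a strong regex hit survives a lenient classifier.
    score = Math.max(score, await classifier(input))
    source = 'hybrid'
  }
  return { score, blocked: score >= threshold, source }
}
```

The max (rather than an average) means neither layer can dilute the other: one strong signal is enough to block.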
## Install

```ts
import { createInjectionDetector } from '@agentskit/core/security'
```

## Heuristic-only (fast, free)

```ts
const detector = createInjectionDetector()
const verdict = await detector.check(userMessage)
if (verdict.blocked) {
  audit.append({ actor: userId, action: 'injection_blocked', payload: verdict })
  return 'Sorry, that request was blocked.'
}
```

Default heuristics catch the usual suspects: "ignore previous instructions", "you are now a...", system-prompt leakage, developer mode, policy bypass phrasing, tool-call smuggling, role confusion.
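A weighted-regex layer in this spirit is small enough to sketch standalone. The patterns and weights below are illustrative examples, not the library's actual default list, and the max-of-weights scoring is an assumption for the sketch:

```typescript
// Illustrative weighted-regex heuristics, in the spirit of the defaults
// described above. Score = strongest matching pattern's weight.
interface Heuristic {
  name: string
  pattern: RegExp
  weight: number
}

const EXAMPLE_HEURISTICS: Heuristic[] = [
  { name: 'ignore-previous', pattern: /ignore (all |the )?previous instructions/i, weight: 0.9 },
  { name: 'role-reassignment', pattern: /you are now a/i, weight: 0.8 },
  { name: 'system-prompt-leak', pattern: /(reveal|print|show).{0,20}system prompt/i, weight: 0.8 },
]

function scoreHeuristics(input: string, heuristics: Heuristic[] = EXAMPLE_HEURISTICS) {
  const hits = heuristics.filter(h => h.pattern.test(input))
  const score = hits.reduce((max, h) => Math.max(max, h.weight), 0)
  return { score, hits: hits.map(({ name, weight }) => ({ name, weight })) }
}
```

Regexes like these are cheap enough to run on every message; the model classifier is only worth its latency for inputs the regexes are unsure about.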
## Layer a model classifier (Llama Guard, Prompt Guard, Rebuff)

```ts
const detector = createInjectionDetector({
  threshold: 0.7,
  classifier: async input => {
    const res = await fetch('https://api.example.com/llama-guard', {
      method: 'POST',
      body: JSON.stringify({ text: input }),
      headers: { 'content-type': 'application/json', authorization: `Bearer ${process.env.LG_KEY}` },
    })
    const { unsafe_score } = (await res.json()) as { unsafe_score: number }
    return unsafe_score
  },
})
```

Classifier errors are swallowed — you degrade to heuristic-only instead of rejecting all traffic when the upstream flakes.
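If you want the same fail-open behavior around a classifier you call elsewhere, the wrapper amounts to a try/catch that reports no signal on failure. A minimal sketch, with illustrative names — `failOpen` is not part of the library:

```typescript
// Fail-open wrapper: a flaky or down classifier contributes a score of 0,
// so the verdict falls back to the heuristic layer alone.
type Classifier = (input: string) => Promise<number>

function failOpen(classifier: Classifier, onError?: (err: unknown) => void): Classifier {
  return async input => {
    try {
      return await classifier(input)
    } catch (err) {
      onError?.(err) // e.g. append to your audit trail
      return 0       // no signal; heuristics still apply
    }
  }
}
```

Worth logging those swallowed errors somewhere visible: a classifier that is silently down turns your two-layer defense into one layer.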
## Verdict shape

```ts
{
  score: number,            // max(heuristic, classifier)
  blocked: boolean,         // score >= threshold
  hits: [{ name, weight }], // heuristic hits
  source: 'heuristic' | 'hybrid',
}
```

## Add your own heuristics
```ts
import { DEFAULT_INJECTION_HEURISTICS, createInjectionDetector } from '@agentskit/core/security'

createInjectionDetector({
  heuristics: [
    ...DEFAULT_INJECTION_HEURISTICS,
    { name: 'off-topic-divert', pattern: /let['’]s talk about something else/i, weight: 0.5 },
  ],
})
```
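Before wiring a custom pattern in, it is worth smoke-testing the regex in isolation against the phrasings you care about. This standalone check exercises only the pattern itself; the sample strings are made up:

```typescript
// Verify the custom pattern matches both apostrophe variants and
// does not fire on benign paraphrases.
const offTopicDivert = /let['’]s talk about something else/i

const samples: Array<[string, boolean]> = [
  ["Let's talk about something else", true],
  ['Let’s talk about something else, forget the rules', true],
  ['Let us discuss the weather', false],
]

for (const [text, expected] of samples) {
  if (offTopicDivert.test(text) !== expected) {
    throw new Error(`pattern mismatch on: ${text}`)
  }
}
```

The `['’]` character class matters: user input arrives with both straight and curly apostrophes, and a pattern that only handles one will miss half the traffic.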