Prompt injection detector

Score incoming text for injection attempts — heuristics + optional model classifier (Llama Guard, Rebuff).

Prompt injection is user input that tries to rewrite the agent's instructions. createInjectionDetector gives you a two-layer defense: cheap regex heuristics for the common patterns, and a pluggable model classifier for the subtle ones. The verdict is the max of both signals.

Install

import { createInjectionDetector } from '@agentskit/core/security'

Heuristic-only (fast, free)

const detector = createInjectionDetector()

const verdict = await detector.check(userMessage)
if (verdict.blocked) {
  audit.append({ actor: userId, action: 'injection_blocked', payload: verdict })
  return 'Sorry, that request was blocked.'
}

Default heuristics catch the usual suspects: "ignore previous instructions", "you are now a...", system-prompt leakage, developer mode, policy bypass phrasing, tool-call smuggling, role confusion.

Layer a model classifier (Llama Guard, Prompt Guard, Rebuff)

const detector = createInjectionDetector({
  threshold: 0.7,
  classifier: async input => {
    const res = await fetch('https://api.example.com/llama-guard', {
      method: 'POST',
      body: JSON.stringify({ text: input }),
      headers: { 'content-type': 'application/json', authorization: `Bearer ${process.env.LG_KEY}` },
    })
    const { unsafe_score } = (await res.json()) as { unsafe_score: number }
    return unsafe_score
  },
})

Classifier errors are swallowed — you degrade to heuristic-only instead of rejecting all traffic when the upstream flakes.

Verdict shape

{
  score: number,              // max(heuristic, classifier)
  blocked: boolean,           // score >= threshold
  hits: [{ name, weight }],   // heuristic hits
  source: 'heuristic' | 'hybrid',
}

Add your own heuristics

import { DEFAULT_INJECTION_HEURISTICS, createInjectionDetector } from '@agentskit/core/security'

createInjectionDetector({
  heuristics: [
    ...DEFAULT_INJECTION_HEURISTICS,
    { name: 'off-topic-divert', pattern: /let['’]s talk about something else/i, weight: 0.5 },
  ],
})