Prompt injection

Detect instruction-hijacking patterns in user input, tool results, and RAG chunks before they reach the model.

Prompt injection is the main attack surface for agents: a user, or content the agent retrieves, attempts to override the system prompt or change the agent's behavior. createInjectionDetector catches common patterns with zero-cost heuristics and can optionally escalate high-risk inputs to an LLM classifier.

import { createInjectionDetector } from '@agentskit/core/security'

const detector = createInjectionDetector({
  classifier: async (text) => {
    // Optional LLM-based classifier; `adapter` stands for whichever model adapter you have configured
    return adapter.complete({ ... })
  },
})

const verdict = await detector.check(userInput)
if (verdict.blocked) throw new Error(verdict.reason)

#Heuristic layer

The heuristic layer runs synchronously and at zero cost, since it relies on pattern matching rather than a model call. It catches phrases such as "ignore previous instructions", role-swap attempts, system-prompt leak probes, and fenced payloads like <|system|>.
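
To make the idea concrete, here is a minimal sketch of what a synchronous heuristic pass could look like. The rule names, the regular expressions, and the scanHeuristics function are all hypothetical and chosen for illustration; they are not AgentsKit's actual rule set, which this page does not list.

```typescript
// Hypothetical rule table: these patterns are illustrative only,
// not the rules shipped with createInjectionDetector.
type HeuristicHit = { rule: string; match: string }

const RULES: Array<{ rule: string; pattern: RegExp }> = [
  {
    rule: 'override-instructions',
    pattern: /\b(ignore|disregard|forget)\s+(all\s+)?(previous|prior|above)\s+(instructions|prompts?|rules)\b/i,
  },
  {
    rule: 'role-swap',
    pattern: /\b(you\s+are\s+now|act\s+as|pretend\s+to\s+be)\b/i,
  },
  {
    rule: 'system-prompt-leak',
    pattern: /\b(reveal|print|show|repeat)\s+(me\s+)?(your|the)\s+(system\s+prompt|initial\s+instructions)\b/i,
  },
  {
    rule: 'fenced-payload',
    pattern: /<\|(system|assistant|user)\|>/i,
  },
]

// Runs every rule once against the text and returns the rules that matched.
// Purely synchronous: no network or model call is involved.
function scanHeuristics(text: string): HeuristicHit[] {
  const hits: HeuristicHit[] = []
  for (const { rule, pattern } of RULES) {
    const m = pattern.exec(text)
    if (m) hits.push({ rule, match: m[0] })
  }
  return hits
}
```

Patterns this broad will produce false positives (for example, "act as" appears in plenty of benign requests), which is one reason a second, model-based layer can be useful for borderline inputs.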

#Model classifier layer

The classifier is pluggable. Use any adapter to score inputs that pass the heuristics but still look suspicious.
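
The escalation logic can be sketched as a two-stage check. This is an assumption-heavy illustration: the Classifier contract (a risk score between 0 and 1), the threshold values, and the checkWithEscalation helper are hypothetical and not taken from the library. Only the blocked and reason fields mirror the usage snippet earlier on this page.

```typescript
// Verdict shape inferred from the usage snippet (verdict.blocked, verdict.reason).
type Verdict = { blocked: boolean; reason?: string }

// Hypothetical classifier contract: resolves to a risk score in [0, 1].
type Classifier = (text: string) => Promise<number>

// Hypothetical thresholds, chosen only for illustration.
const DEFAULT_THRESHOLDS = { block: 0.9, escalate: 0.4 }

async function checkWithEscalation(
  text: string,
  heuristicScore: (text: string) => number,
  classifier?: Classifier,
  thresholds = DEFAULT_THRESHOLDS,
): Promise<Verdict> {
  const score = heuristicScore(text)

  // Clear heuristic match: block without spending a model call.
  if (score >= thresholds.block) {
    return { blocked: true, reason: 'heuristic match' }
  }

  // Low suspicion, or no classifier configured: let the input through.
  if (score < thresholds.escalate || !classifier) {
    return { blocked: false }
  }

  // Borderline input: escalate to the (more expensive) classifier.
  const risk = await classifier(text)
  return risk >= thresholds.block
    ? { blocked: true, reason: 'classifier flagged input' }
    : { blocked: false }
}
```

The point of the ordering is cost: the classifier only runs for inputs in the grey zone between the two thresholds, so most traffic never triggers a model call.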

#Where to run it

User input → chat.send preprocess.
Tool results → before feeding them back into the loop (tool output can be attacker-controlled).
RAG retrievals → classify each chunk before injecting it into the context.
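
As a sketch of the RAG case above, the helper below screens each retrieved chunk and keeps only the ones that are not blocked. The DetectorLike interface is an assumption inferred from the detector.check(...) usage shown earlier (a promise resolving to an object with blocked and reason); filterChunks and the result shape are hypothetical names introduced for this example.

```typescript
// Assumed from the usage snippet: check() resolves to { blocked, reason }.
type InjectionVerdict = { blocked: boolean; reason?: string }
interface DetectorLike {
  check(text: string): Promise<InjectionVerdict>
}

type FilterResult = {
  safe: string[]
  dropped: Array<{ chunk: string; reason: string }>
}

// Hypothetical helper: checks every chunk concurrently, then splits the
// chunks into those safe to add to the context and those that were blocked.
async function filterChunks(
  detector: DetectorLike,
  chunks: string[],
): Promise<FilterResult> {
  const results = await Promise.all(
    chunks.map(async (chunk) => ({ chunk, verdict: await detector.check(chunk) })),
  )

  const out: FilterResult = { safe: [], dropped: [] }
  for (const { chunk, verdict } of results) {
    if (verdict.blocked) {
      out.dropped.push({ chunk, reason: verdict.reason ?? 'unspecified' })
    } else {
      out.safe.push(chunk)
    }
  }
  return out
}
```

The same pattern would apply to tool results: run the check on the tool's output and decide whether to drop it, redact it, or surface an error before the text re-enters the agent loop.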
