How do I stop indirect prompt injection hidden inside retrieved documents?

Question

Accepted Answer

To prevent indirect prompt injection from retrieved documents, implement layered mitigations that include input transformations, trust tagging, and robust context management. This addresses the OWASP LLM01 Prompt Injection risk. Input Transformations and Trust Tagging: Apply input transformations to untrusted content and tag each context segment with its provenance and trust level. This allows the model to be conditioned to respect these tags, treating instructions from low-trust segments as data rather than directives. Structural Separation and Hierarchical Context: Employ structural delimiters and role-based channels to segregate context elements. Implement a hierarchical context architecture with a sealed top layer for system prompts and policies, a sticky middle layer for session-critical facts, and a rolling tail for compactable content to prevent policy instructions from being dropped during summarization. Provenance Tracking and Context Integrity: Maintain a provenance graph for every context element, enabling traceability of any token to its source. This helps ensure context window integrity, which is crucial as anything in the context window shapes agent behavior. Runtime Validation and Intent Re-verification: Implement runtime-layer tool call validation against the agent's current intent. Before any consequential action, re-derive whether the action falls within the originally attested intent, rather than relying on potentially corrupted agent reasoning. Output Filtering and Content Classification: For outgoing data, implement output filtering and content classification to prevent agents from including sensitive context content in tool calls or external responses. This addresses the "Context-as-exfiltration-channel" threat. Adversarial Testing: Conduct adversarial testing to identify and mitigate vulnerabilities related to prompt injection. This is a general control for OWASP LLM01 Prompt Injection.

How do I stop indirect prompt injection hidden inside retrieved documents?

How does your AI agent score?

Related questions