What is system prompt leakage and how do I prevent it?

Question

Accepted Answer

System prompt leakage (OWASP LLM07) occurs when a system prompt that contains secrets, access rules, or sensitive logic is exposed. The underlying mistake is treating prompt confidentiality as a security control. An adversary who extracts the prompt learns how the application works internally and obtains any sensitive information embedded in it.

To prevent system prompt leakage, implement the following controls:

Keep secrets, credentials, and authorization logic out of prompts. Enforce those controls in code or infrastructure instead.
Design plugins and tools to be safe even if the prompt is fully known. An attacker who has read the entire system prompt should still be unable to trigger damaging actions through them.
Utilize a hierarchical context architecture where a sealed top layer contains system prompts and policies that are never compacted or summarized. This prevents policy instructions from being dropped or pushed out by adversarial inputs.
Maintain a provenance graph for every context element to trace any token to its source and enforce segregation through structural delimiters and role-based channels. This helps in understanding the trust level of each context segment.
Apply zero trust to the runtime window by conditioning the model on segment provenance, ensuring that instructions in low-trust segments are treated as data, not directives. This prevents content from untrusted sources from being interpreted as instructions without explicit authorization.
Apply output filtering and content classification to outgoing data so that agents cannot copy the contents of their context into tool calls or external responses, where it would leak internal information.
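
As a rough illustration, three of the controls above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the prompt text, the role table, the tag format, and every function name here are hypothetical and do not come from OWASP or any particular framework.

```python
from dataclasses import dataclass

# Hypothetical system prompt: behaviour guidance only, no secrets or access rules.
SYSTEM_PROMPT = (
    "You are a support assistant for Example Corp. "
    "Answer billing questions politely."
)

# Authorization lives in code, not in the prompt (hypothetical role table).
ROLE_PERMISSIONS = {
    "viewer": {"read_invoice"},
    "admin": {"read_invoice", "issue_refund"},
}


def authorize_tool_call(role: str, tool: str) -> bool:
    """Allow a tool call only if the caller's role grants it, whatever the model asked for."""
    return tool in ROLE_PERMISSIONS.get(role, set())


@dataclass(frozen=True)
class Segment:
    source: str    # where the text came from, e.g. "system", "user", "web"
    trusted: bool  # only trusted segments may carry instructions
    text: str


def render_context(segments) -> str:
    """Wrap each segment in a delimiter that records its provenance and trust level."""
    parts = []
    for seg in segments:
        label = "instructions" if seg.trusted else "data"
        # Escape "<" in untrusted text so it cannot forge or close a delimiter.
        body = seg.text if seg.trusted else seg.text.replace("<", "&lt;")
        parts.append(f"<{label} source={seg.source!r}>\n{body}\n</{label}>")
    return "\n".join(parts)


def _shingles(text: str, n: int) -> set:
    """Return every run of n consecutive lowercased words in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def leaks_system_prompt(output: str, system_prompt: str = SYSTEM_PROMPT, n: int = 5) -> bool:
    """Return True if the output repeats any run of n consecutive words from the prompt."""
    return bool(_shingles(output, n) & _shingles(system_prompt, n))
```

The output check only catches verbatim reuse. A paraphrased, translated, or encoded copy of the prompt would pass it, which is why the first two controls matter most: if the prompt holds nothing sensitive and the tools enforce their own permissions, a leak reveals little of value.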
