What is OWASP LLM10 unbounded consumption and how do I cap tokens, cost, and denial-of-wallet on an LLM API?

Question

Accepted Answer

OWASP LLM10 (Unbounded Consumption) covers resource exhaustion and denial-of-wallet attacks, as well as model extraction and theft through unbounded querying. To cap tokens and cost and to prevent denial-of-wallet on an LLM API, implement the following controls:

Rate Limits and Quotas: Apply rate limiting (e.g., a sliding window with a configurable requests-per-minute limit) and quotas on API requests to prevent excessive usage.

Token and Spend Caps: Set an explicit token limit on every API call (e.g., max_tokens) and enforce it by capping the maximum value a caller may request. Add a cost accounting system that multiplies tracked token usage by per-million-token pricing tables to attribute spend, accumulates per-session totals, and warns or gates the user when spend crosses a predefined threshold. Besides addressing LLM10 directly, this supports the NIST AI RMF GOVERN function by managing financial impact.

Context Management with Thresholds: Use context management strategies that detect an approaching context window limit early and apply progressively more expensive remedies. Define token thresholds that trigger proactive actions (e.g., auto-compacting conversation history at 70% of the effective window) and that block new requests entirely at higher levels (e.g., 98%). This helps prevent prompt_too_long errors and keeps token consumption under control.

Circuit Breakers for Compaction Failures: Stop attempting to compact context after a set number of consecutive failures (e.g., 3) so that a persistently failing operation does not keep burning API budget.

Session-Scoped Cost Tracking: Track costs per session, and make sure that resuming a different session does not carry costs over from a previous one. Otherwise totals are inflated by spend that belongs to another session.
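The controls described so far can be sketched in a few small pieces. This is a minimal illustration, not a reference implementation: the names SlidingWindowLimiter, SessionCost, and context_action are hypothetical, the prices in PRICES_PER_MILLION are placeholders rather than any provider's real rates, and the 70%, 98%, and 3-failure values are simply the examples from the text.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Placeholder prices in USD per million tokens (assumption, not real rates).
PRICES_PER_MILLION = {"example-model": {"input": 3.00, "output": 15.00}}


class SlidingWindowLimiter:
    """Allow at most `rpm` requests in any rolling window of `window_s` seconds."""

    def __init__(self, rpm: int, window_s: float = 60.0):
        self.rpm = rpm
        self.window_s = window_s
        self._stamps = deque()

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.rpm:
            return False
        self._stamps.append(now)
        return True


@dataclass
class SessionCost:
    """Accumulates spend for one session; a new session gets a new instance."""
    session_id: str
    warn_usd: float
    total_usd: float = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> bool:
        price = PRICES_PER_MILLION[model]
        self.total_usd += (input_tokens * price["input"]
                           + output_tokens * price["output"]) / 1_000_000
        # True means the caller should warn or gate the user.
        return self.total_usd >= self.warn_usd


def context_action(used_tokens: int, window: int, compaction_failures: int) -> str:
    """Map context usage to an action using the example thresholds."""
    ratio = used_tokens / window
    if ratio >= 0.98:
        return "block"
    if ratio >= 0.70:
        # Circuit breaker: stop compacting after 3 consecutive failures.
        return "skip_compaction" if compaction_failures >= 3 else "compact"
    return "proceed"
```

Keeping one SessionCost object per session ID is one simple way to avoid cost bleed when a different session is resumed. In a multi-process deployment the limiter and totals would need shared storage, which this sketch does not cover.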

Budget Gates and Atomic Refusal: Implement a hard budget gate that raises an exception (e.g., BudgetExceeded) before an LLM call would push daily spend past the defined cap, so the call is refused atomically instead of running and overshooting the budget.

Audit Logging: Audit-log every sampling call with details such as the model used, token count, and server name. This provides visibility into usage and helps identify potential abuse.
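
A hard budget gate with audit logging might look like the sketch below. Only the BudgetExceeded name comes from the answer above; DailyBudget, guarded_call, the call_model callable, and its "total_tokens" return field are hypothetical stand-ins for whatever client you use. The daily reset and the reconciliation of estimated against actual cost are left out for brevity.

```python
import logging
import threading

audit_log = logging.getLogger("llm.audit")


class BudgetExceeded(Exception):
    """Raised when a call would push daily spend past the cap."""


class DailyBudget:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self._lock = threading.Lock()

    def reserve(self, estimated_usd: float) -> None:
        # Check and update under one lock so two concurrent callers
        # cannot both pass the check and jointly overshoot the cap.
        with self._lock:
            if self.spent_usd + estimated_usd > self.cap_usd:
                raise BudgetExceeded(
                    f"call would raise spend to "
                    f"{self.spent_usd + estimated_usd:.4f} USD, "
                    f"cap is {self.cap_usd:.4f} USD"
                )
            self.spent_usd += estimated_usd


def guarded_call(budget, call_model, *, model, server, max_tokens, estimated_usd):
    """Refuse before spending, then audit-log the call that did run."""
    budget.reserve(estimated_usd)  # raises before the model is called
    result = call_model(model=model, max_tokens=max_tokens)
    audit_log.info("model=%s server=%s tokens=%d",
                   model, server, result["total_tokens"])
    return result
```

Reserving the estimated cost before the call is what makes the refusal atomic: a request that would cross the cap never reaches the model, so nothing is spent on it.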

