How do I threat-model an autonomous AI agent end to end?

Question

Accepted Answer

Threat modeling an autonomous AI agent end to end means following a structured approach that covers the system's whole lifecycle and traces potential attack paths across multiple layers. Treat it as a continuous engineering practice rather than a one-time checklist.

Conduct a formal MAESTRO threat modeling pass, and remediate its findings, as part of the validation and integration phase. The MAESTRO framework organizes threats across seven architectural layers: Foundation Models (L1), Data Operations (L2), Agent Frameworks (L3), Deployment and Infrastructure (L4), Evaluation and Observability (L5), Security and Compliance (L6), and Agent Ecosystem (L7). Threats rarely stay confined to a single layer, so the threat model must trace plausible attack chains from any layer through to their consequential effects.
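To make the idea of tracing attack chains across layers concrete, here is a minimal Python sketch. It is not part of MAESTRO itself: the `Threat` type, the `attack_chains` helper, and every threat in `THREATS` are hypothetical, invented only to show how a chain that starts in one layer can be followed through to a consequential effect in another.

```python
from dataclasses import dataclass

# The seven MAESTRO layers, numbered as in the answer above.
MAESTRO_LAYERS = {
    1: "Foundation Models",
    2: "Data Operations",
    3: "Agent Frameworks",
    4: "Deployment and Infrastructure",
    5: "Evaluation and Observability",
    6: "Security and Compliance",
    7: "Agent Ecosystem",
}


@dataclass(frozen=True)
class Threat:
    name: str
    layer: int                   # MAESTRO layer number (1-7)
    enables: tuple = ()          # names of threats this one makes possible
    consequential: bool = False  # True if this is a harmful end effect


def attack_chains(threats, start):
    """Return every path from `start` to a consequential threat."""
    by_name = {t.name: t for t in threats}
    chains = []

    def walk(name, path):
        if name in path:  # guard against cycles in the model
            return
        threat = by_name[name]
        path = path + [name]
        if threat.consequential:
            chains.append(path)
        for nxt in threat.enables:
            walk(nxt, path)

    walk(start, [])
    return chains


# Illustrative threats only; a real model would come from your own system.
THREATS = [
    Threat("poisoned retrieval document", 2, enables=("injected tool call",)),
    Threat("injected tool call", 3,
           enables=("sandbox credential theft", "unauthorized email send")),
    Threat("sandbox credential theft", 4, enables=("lateral movement",)),
    Threat("lateral movement", 4, consequential=True),
    Threat("unauthorized email send", 7, consequential=True),
]

if __name__ == "__main__":
    layer_of = {t.name: t.layer for t in THREATS}
    for chain in attack_chains(THREATS, "poisoned retrieval document"):
        print(" -> ".join(f"{n} (L{layer_of[n]})" for n in chain))
```

Each returned chain is a candidate finding: the controls you pick should break at least one link in it, ideally at more than one layer.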

Key controls and considerations for threat modeling include the following (OWASP identifiers use the 2023 v1.1 numbering of the LLM Top 10):

Human Oversight (NIST AI RMF Function: Govern, OWASP LLM Top 10: LLM02-Insecure Output Handling): Design human intervention points, override mechanisms, and deadman switches so that humans can always observe agent actions and step in.
Autonomy (NIST AI RMF Function: Govern, OWASP LLM Top 10: LLM07-Insecure Plugin Design): Establish explicit, machine-readable autonomy levels and policies so that an agent cannot act outside its defined bounds. Mitigate "autonomy creep" through periodic re-attestation, and mitigate "autonomy shopping" by expressing policy at the effect level, so that the same effect gets the same decision whichever tool produces it.
Data Governance (NIST AI RMF Function: Govern, OWASP LLM Top 10: LLM06-Sensitive Information Disclosure): Implement data classification baselines and ensure that derived data inherits the classification of its inputs, so that sensitive data is never downgraded by being transformed. Mitigate memory poisoning and sensitive data retention through read-only stores, provenance tracking, and redaction workflows.
Identity & Intent (NIST AI RMF Function: Govern, OWASP LLM Top 10: LLM07-Insecure Plugin Design): Stand up workload identity for the agent platform and integrate it with enterprise identity providers. Ensure that authority cannot expand and that intent binds to action: authorization is re-derived from the original intent for each action, which makes prompt injection less effective.
Runtime (NIST AI RMF Function: Map, OWASP LLM Top 10: LLM07-Insecure Plugin Design): Deploy LLM gateways and tool brokers as runtime chokepoints so that every model and tool call is mediated and cannot bypass checks. Implement basic policy enforcement at these chokepoints.
Evaluation and Observability (NIST AI RMF Function: Measure, OWASP LLM Top 10: LLM08-Excessive Agency): Instrument comprehensive telemetry at every chokepoint and build an audit pipeline. Implement immutable event logs, separate operational dashboards for grader results, and human sampling of passed outcomes to mitigate evaluation attack surfaces such as grader manipulation and log injection.
Deployment and Infrastructure (NIST AI RMF Function: Map, OWASP LLM Top 10: LLM07-Insecure Plugin Design): Harden sandboxes, restrict network access, scope credentials, and require explicit tool approvals for sensitive actions to mitigate threats such as container compromise and lateral movement.
Ecosystem (NIST AI RMF Function: Map, OWASP LLM Top 10: LLM07-Insecure Plugin Design): Address supply chain risks in fallback chains by treating them as production paths for security review. Ensure that new components conform to existing architectural commitments for identity, data, context, and runtime contracts.
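Several of the controls above meet at the runtime chokepoint, so a small sketch may help. The Python below is illustrative only, not a reference implementation: `ToolBroker`, `AuditLog`, `Classification`, the policy vocabulary (`allow`, `require_approval`, `deny`), and the example tools are all assumptions made for this answer. It shows a default-deny policy keyed on effects rather than tool names, derived data inheriting the highest classification of its inputs, and a hash-chained log. The log here is only tamper-evident in memory; a real deployment would also need write-once storage to approach immutability.

```python
import hashlib
import json
from dataclasses import dataclass, field
from enum import IntEnum


class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def derived_classification(inputs):
    """Derived data inherits the highest classification among its inputs."""
    return max(inputs, default=Classification.PUBLIC)


@dataclass
class AuditLog:
    """Append-only log; each entry commits to the hash of the previous one."""
    entries: list = field(default_factory=list)

    @staticmethod
    def _digest(prev, event):
        body = json.dumps(event, sort_keys=True)
        return hashlib.sha256((prev + body).encode()).hexdigest()

    def append(self, event):
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        self.entries.append(
            {"prev": prev, "event": event, "hash": self._digest(prev, event)}
        )

    def verify(self):
        """Recompute the chain; any edited entry breaks it."""
        prev = "0" * 64
        for entry in self.entries:
            if entry["prev"] != prev:
                return False
            if self._digest(prev, entry["event"]) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True


@dataclass
class ToolBroker:
    policy: dict  # effect -> "allow" | "require_approval" | "deny"
    tools: dict   # tool name -> (effect, callable)
    log: AuditLog = field(default_factory=AuditLog)

    def call(self, agent_id, tool_name, approved_by=None, **kwargs):
        # Unknown tools map to an effect no policy allows.
        effect, fn = self.tools.get(tool_name, ("unknown", None))
        decision = self.policy.get(effect, "deny")  # default deny
        if decision == "require_approval" and approved_by:
            decision = "allow"
        allowed = decision == "allow"
        # Every attempt is logged, whether or not it is allowed.
        self.log.append({
            "agent": agent_id, "tool": tool_name, "effect": effect,
            "allowed": allowed, "approved_by": approved_by,
        })
        if not allowed:
            raise PermissionError(f"{tool_name}: effect '{effect}' not permitted")
        return fn(**kwargs)


# Hypothetical policy and tools. Two tools share the "send_email" effect,
# so switching tools does not change the decision (no "autonomy shopping").
POLICY = {"read_file": "allow", "send_email": "require_approval"}
TOOLS = {
    "cat": ("read_file", lambda path: f"contents of {path}"),
    "smtp_send": ("send_email", lambda to, body: f"sent to {to}"),
    "webhook_mail": ("send_email", lambda to, body: f"posted to {to}"),
}
broker = ToolBroker(policy=POLICY, tools=TOOLS)
```

Because the policy is keyed on the effect, `smtp_send` and `webhook_mail` both require approval, and a tool the broker has never heard of is denied and still logged. Calling `broker.log.verify()` after the fact checks that no logged entry has been altered.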
