What are vector and embedding weaknesses in LLM and RAG applications?

Question

Accepted Answer

Vector and embedding weaknesses in LLM and RAG applications primarily involve data leakage, unauthorized reconstruction of sensitive information, and integrity issues within the context window. These weaknesses are categorized under OWASP LLM08: Vector and Embedding Weaknesses.

Embedding Inversion: An attacker can reconstruct the original text, sometimes substantially, from embeddings stored in a vector database. Vector stores can therefore contain "reconstruction-grade representations of sensitive data" and should be governed as primary data stores.
PII Leakage through Derived Data: Embeddings, summaries, and logs derived from Personally Identifiable Information (PII) can still enable reconstruction or correlation attacks that extract personal information, even when the derived data is not itself classified as PII in source systems.
Cross-Context Leakage: One user's data leaks into another's context, either through memory that persists across sessions or through shared caches and embedding stores that mix context between tenants.
Retrieval Poisoning: Content retrieved by a RAG pipeline is often the lowest-trust segment of the context, because anyone whose documents end up in the corpus can author it. Poisoned documents can therefore manipulate the agent's context.
Access-Control Bypass: Treating a vector database as "just" a search index, with relaxed access controls, can lead to access-control bypasses, because the index holds sensitive data that requires robust governance.
Right-to-Erasure Failures: When a user requests deletion, copies of their data may persist in memory stores, embeddings, summaries, fine-tuning data, and logs, so the request is not actually fulfilled.
Data Residency Violations: An agent might retrieve data from one region and process it through a model API hosted in a non-compliant region, violating data residency requirements.
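
To make the right-to-erasure item concrete, here is a minimal sketch of a lineage-tracking inventory, assuming every derived record (embedding, summary, log entry) is registered against the data subject it came from at write time. The class name DataInventory, the store names, and the IDs are all hypothetical, and plain dicts stand in for real stores.

```python
from collections import defaultdict


class DataInventory:
    """Hypothetical inventory: remembers which derived records belong to which subject."""

    def __init__(self):
        self._stores = {}                  # store name -> dict-like store
        self._lineage = defaultdict(set)   # subject id -> {(store name, record id)}

    def register_store(self, name, store):
        self._stores[name] = store

    def record(self, subject_id, store_name, record_id, value):
        # Write the derived record and remember where it went.
        self._stores[store_name][record_id] = value
        self._lineage[subject_id].add((store_name, record_id))

    def erase(self, subject_id):
        # Delete every tracked copy for this subject and report what was removed.
        removed = []
        for store_name, record_id in sorted(self._lineage.pop(subject_id, set())):
            if self._stores[store_name].pop(record_id, None) is not None:
                removed.append((store_name, record_id))
        return removed


# Usage: plain dicts stand in for a vector store and a summary store.
vectors, summaries = {}, {}
inventory = DataInventory()
inventory.register_store("vectors", vectors)
inventory.register_store("summaries", summaries)

inventory.record("user-42", "vectors", "v1", (0.1, 0.9))
inventory.record("user-42", "summaries", "s1", "User asked about refunds")
inventory.record("user-7", "vectors", "v2", (0.4, 0.2))

removed = inventory.erase("user-42")
print(removed)   # [('summaries', 's1'), ('vectors', 'v1')]
print(vectors)   # {'v2': (0.4, 0.2)}
```

This only reaches copies that were registered and that live in stores supporting per-record deletion. Data already incorporated into fine-tuned model weights is outside what this sketch can remove.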

Mitigations include applying the same access controls to vector databases as to the original text, encrypting embeddings at rest, using differentially private embedding techniques, enforcing strict per-tenant memory scoping, and keeping confidential data in separate physical or logical vector indexes. Access-controlled retrieval, per-tenant and per-source partitioning, sanitizing ingested content, and validating retrieval relevance are also crucial. A data classification service and a continuously updated data inventory are recommended as well.
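
As a rough illustration of three of these controls (per-tenant partitioning, access-controlled retrieval, and a relevance check), here is a minimal in-memory sketch. It is not any particular vector database's API: the names Chunk and TenantScopedIndex, the role model, and the 0.5 score threshold are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str
    embedding: tuple          # tuple of floats
    allowed_roles: frozenset  # roles permitted to read this chunk


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TenantScopedIndex:
    """Hypothetical index with one logical partition per tenant."""

    def __init__(self):
        self._partitions = {}  # tenant id -> list of Chunk

    def add(self, chunk):
        self._partitions.setdefault(chunk.tenant_id, []).append(chunk)

    def search(self, tenant_id, roles, query, k=3, min_score=0.5):
        # 1. Partitioning: only the caller's own tenant partition is searched.
        candidates = self._partitions.get(tenant_id, [])
        # 2. Access control: drop chunks the caller's roles may not read.
        readable = [c for c in candidates if c.allowed_roles & set(roles)]
        # 3. Relevance check: discard weak matches instead of padding the context.
        scored = [(cosine(query, c.embedding), c) for c in readable]
        hits = sorted((p for p in scored if p[0] >= min_score),
                      key=lambda p: p[0], reverse=True)
        return [c.text for _, c in hits[:k]]


index = TenantScopedIndex()
index.add(Chunk("acme", "Acme Q3 pricing", (1.0, 0.0), frozenset({"sales"})))
index.add(Chunk("acme", "Acme payroll export", (0.9, 0.1), frozenset({"hr"})))
index.add(Chunk("globex", "Globex Q3 pricing", (1.0, 0.0), frozenset({"sales"})))

results = index.search("acme", {"sales"}, (1.0, 0.0))
print(results)  # ['Acme Q3 pricing']
```

The filters run before ranking, so a chunk from another tenant or a restricted role never reaches the scoring step, let alone the prompt. Filtering after retrieval would leave room for a highly similar but unauthorized chunk to crowd out permitted results.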
