Should I redact or tokenize PII before generating embeddings for a RAG store?

Question

Accepted Answer

Yes. Redact or tokenize PII before generating embeddings for a RAG store. Embeddings are derived data, but they are not anonymous: embedding inversion attacks can reconstruct original text from the vectors, so PII that goes into the embedding model can leak back out of the vector store. For governance purposes, treat the vector database as a primary data store rather than a disposable index. This addresses OWASP LLM Top 10 risk LLM02 (Sensitive Information Disclosure).

Implement classification inheritance: any data derived from classified inputs, embeddings included, inherits at least the classification of those inputs. This keeps the security properties of derived data traceable to the source, so the classification is not lost when text is turned into vectors.

Maintain a per-user data inventory across all stores, and implement deletion workflows that propagate to derived data. Without them, copies of personal data proliferate and right-to-erasure requests fail because the embeddings outlive the source record. Both this and classification inheritance fall under the Govern function of the NIST AI RMF.

For highly sensitive data, consider differentially private embedding techniques to further protect against embedding inversion attacks, and encrypt embeddings at rest where warranted.
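
As a rough illustration of tokenizing before embedding, here is a minimal Python sketch. Everything in it is an assumption made for the example: the names `PII_PATTERN`, `tokenize_pii`, and `prepare_chunk` are hypothetical, the two regexes only catch simple email and phone formats, and the key would normally come from a secret manager. It replaces each match with a keyed token, keeps the token-to-value mapping outside the vector store, and carries the source classification onto the record that gets embedded.

```python
from __future__ import annotations

import hashlib
import hmac
import re

# Illustrative patterns only (an assumption of this sketch). Regexes miss names,
# addresses, and many formats, so real pipelines need a dedicated PII detector.
PII_PATTERN = re.compile(
    r"(?P<EMAIL>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
    r"|(?P<PHONE>\+?\d[\d\s().-]{7,}\d)"
)


def tokenize_pii(text: str, key: bytes) -> tuple[str, dict[str, str]]:
    """Replace each PII match with a keyed token.

    Returns the tokenized text plus a token -> original value mapping, which
    should live in a separate, access-controlled store (not the vector DB).
    """
    vault: dict[str, str] = {}

    def replace(match: re.Match[str]) -> str:
        value = match.group(0)
        # Keyed HMAC: the same value always maps to the same token under one
        # key, and tokens cannot be recomputed without the key.
        # Truncated to 12 hex characters to keep tokens short.
        digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]
        token = f"<{match.lastgroup}_{digest}>"
        vault[token] = value
        return token

    # One pass over the text, so tokens already inserted are never re-scanned.
    return PII_PATTERN.sub(replace, text), vault


def prepare_chunk(
    text: str, key: bytes, source_classification: str
) -> tuple[dict, dict[str, str]]:
    """Build the record to embed; it inherits the source's classification."""
    safe_text, vault = tokenize_pii(text, key)
    record = {
        "text": safe_text,  # pass only this field to the embedding model
        "classification": source_classification,
        "pii_tokens": sorted(vault),  # lets a deletion workflow find this chunk
    }
    return record, vault


if __name__ == "__main__":
    demo_key = b"demo-only-key"  # hypothetical; load a real key from a secret store
    record, vault = prepare_chunk(
        "Contact Jane at jane.doe@example.com or +1 (555) 010-9999 about the refund.",
        demo_key,
        "confidential",
    )
    print(record["text"])
    print(record["classification"], len(vault))
```

Note that the name "Jane" passes through untouched, which is why a regex-only approach is not enough on its own. Keeping the mapping in a separate store also means an erasure request can drop the mapping entries for a user and use `pii_tokens` to locate the affected chunks.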
