Should I redact or tokenize PII before generating embeddings for a RAG store?
Yes. Redact or tokenize PII before generating embeddings for a RAG store. Embeddings are derived data, but they can still leak PII, and embedding inversion attacks can reconstruct the original text from the vectors.
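As a rough illustration of tokenizing before embedding, the sketch below replaces detected PII with keyed, deterministic tokens and keeps the originals in a separate mapping. Everything named here is an assumption for illustration: `tokenize_pii`, the `PII_PATTERNS` table, the token format, and the in-memory `vault` are hypothetical, and the two regexes stand in for a real PII detector, which would need far broader coverage.

```python
import hashlib
import hmac
import re

# Hypothetical, deliberately narrow patterns; a production system would use
# a dedicated PII detection step rather than two regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def tokenize_pii(text: str, key: bytes, vault: dict[str, str]) -> str:
    """Replace matched PII with keyed tokens and record originals in `vault`.

    The token is an HMAC of the value, so the same value always maps to the
    same token (retrieval can still match on it) without exposing the value.
    """

    def make_replacer(label: str):
        def replace(match: re.Match) -> str:
            value = match.group(0)
            digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
            token = f"<{label}_{digest.hexdigest()[:12]}>"
            vault[token] = value  # assumed to live outside the vector store
            return token

        return replace

    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(make_replacer(label), text)
    return text


if __name__ == "__main__":
    vault: dict[str, str] = {}
    safe = tokenize_pii("Contact jane@example.com, SSN 123-45-6789.", b"demo-key", vault)
    print(safe)  # only this tokenized text would be passed to the embedding model
```

Only the tokenized text reaches the embedding model in this sketch. The mapping back to the original values is assumed to sit in a separately governed store with its own access controls, and plain redaction (dropping the value instead of tokenizing it) is the simpler option when the value is never needed again.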
- Treat vector databases as primary data stores for governance purposes. Embedding inversion attacks can reconstruct original text from embeddings, so derived data can leak PII just as the source data can. This addresses OWASP LLM Top 10 risk LLM02:2025 (Sensitive Information Disclosure).
- Implement classification inheritance: any data derived from classified inputs, such as embeddings, inherits at least the highest classification among its inputs. This keeps the security properties of derived data traceable to the source, so classification is not lost along the way. This aligns with the Govern function of the NIST AI RMF.
- Maintain a per-user data inventory across all stores, and implement deletion workflows that propagate to derived data, to address right-to-erasure failures. This helps manage the proliferation of personal data copies and also supports the Govern function of the NIST AI RMF.
- Consider differentially-private embedding techniques for highly sensitive data to further protect against embedding inversion attacks.
- Encrypt embeddings at rest where warranted to add another layer of protection for sensitive data.
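The classification-inheritance and deletion-propagation points in the list above can be sketched together, since both depend on each embedding carrying lineage back to its sources. This is a minimal in-memory illustration under stated assumptions: the `Classification` levels, the `SourceDoc` and `EmbeddingRecord` types, and the `derive_record` and `erase_user` helpers are all hypothetical names, and a real deployment would store this metadata alongside the vectors in the vector database.

```python
from dataclasses import dataclass
from enum import IntEnum


class Classification(IntEnum):
    # Hypothetical levels; higher value means more sensitive.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass
class SourceDoc:
    doc_id: str
    user_id: str
    classification: Classification


@dataclass
class EmbeddingRecord:
    vector: list[float]
    source_ids: list[str]   # lineage back to the source documents
    user_ids: set[str]      # data subjects whose data contributed
    classification: Classification


def derive_record(vector: list[float], sources: list[SourceDoc]) -> EmbeddingRecord:
    """Build an embedding record that inherits the highest input classification."""
    return EmbeddingRecord(
        vector=vector,
        source_ids=[s.doc_id for s in sources],
        user_ids={s.user_id for s in sources},
        classification=max(s.classification for s in sources),
    )


def erase_user(
    user_id: str,
    sources: dict[str, SourceDoc],
    embeddings: list[EmbeddingRecord],
) -> tuple[dict[str, SourceDoc], list[EmbeddingRecord]]:
    """Drop a user's source documents and every embedding derived from them."""
    kept_sources = {k: v for k, v in sources.items() if v.user_id != user_id}
    kept_embeddings = [e for e in embeddings if user_id not in e.user_ids]
    return kept_sources, kept_embeddings
```

Dropping a whole embedding when any contributing user is erased is the conservative choice in this sketch. Re-embedding the remaining sources after removal would be an alternative where one vector mixes several users' data.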
Grounded in
- Designing Agentic AI Systems with the ORCHIDEAS Framework
This AI-generated answer is for guidance only — not a certification, audit, or penetration test. Grounded in the NIST AI RMF, OWASP LLM Top 10, and ISO/IEC 42001 control text; verify applicability to your environment.