ANALYSIS

LLMs with retrieval-augmented generation: Good or bad for privacy compliance?

Published 17 Dec. 2025


Contributors:

Julia Kaufmann

AIGP, CIPP/E

Partner

Osborne Clarke

Florian Eisenmenger

CIPP/E, CIPM, FIP

Osborne Clarke GmbH & Co. KG

Editor's note: The IAPP is policy neutral. We publish contributed opinion and analysis pieces to enable our members to hear a broad spectrum of views in our domains.

Retrieval-augmented generation is a technique that enables AI systems with large language models to leverage additional information and documents to optimize output.

Unlike with training and fine-tuning, this additional information and these documents are not used to train or improve the LLM itself. Put simply, documents or information retrieved from an external source are fed into the AI system to enhance the accuracy and reliability of the output generated by the LLM. While the underlying LLM itself remains unchanged, it receives additional information and context to inform its output.

A typical example of a RAG-based AI system is a chatbot that can connect to external databases through application programming interfaces to incorporate a document search functionality or process a large number of uploaded documents to generate more tailored output.

The additional information and documents used by RAG can, of course, contain personal data. Privacy concerns arise because more personal data can have a negative impact on key privacy principles — such as data minimization, transparency and lawfulness — and RAG may result in dataflows to external parties.

The March 2025 paper "AI Privacy Risks & Mitigations — Large Language Models" published by the European Data Protection Board’s Support Pool of Experts discusses, among other things, privacy risks resulting from the use of RAG. Nevertheless, in its October 2025 guidelines on data protection law issues related to generative AI systems using RAG, the Datenschutzkonferenz — the Conference of the Independent Data Protection Authorities of Germany — concluded that RAG can also have a significant positive impact on privacy compliance.

Data privacy compliance risks

According to the EDPB's Support Pool of Experts, when using RAG, the external database may contain personal data, including sensitive data, that is subsequently processed by the AI system to generate the output without proper safeguards and privacy compliance measures in place. This risk is particularly high if neither the external databases nor the content of such external databases used as a source have been properly identified and assessed before deployment.

Furthermore, if the RAG-based AI system uses APIs to connect with external databases, prompts containing personal data may be transmitted to third parties without knowledge of any subsequent retention and processing by the third party. Poorly configured retrieval logic could also result in irrelevant, misleading or misinterpreted context being fed into the AI system, which then generates incorrect or hallucinated output.

Insufficient security measures for personal data transmitted via APIs to external databases, as well as data processed and stored in the AI system using RAG, pose a risk of unauthorized access or data leaks. Moreover, privacy risks may also arise from the inadvertent storage of personal data in log files of the RAG-based AI system.
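The log-file risk can be illustrated with a minimal sketch that masks identifiers before a log record is written. It uses Python's standard logging module; the logger name, the placeholder text and the choice to mask only e-mail addresses are assumptions made for illustration, not measures taken from either paper.

```python
import io
import logging
import re

# Assumption: only e-mail addresses are masked here. A real deployment would
# need broader detection (names, phone numbers, customer IDs and so on).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class RedactEmailFilter(logging.Filter):
    """Mask e-mail addresses in a log record before a handler writes it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so values passed as args are covered.
        record.msg = EMAIL_RE.sub("[REDACTED_EMAIL]", record.getMessage())
        record.args = ()
        return True  # keep the record, now without the address


def build_logger(stream: io.StringIO) -> logging.Logger:
    """Return a logger whose only handler applies the redaction filter."""
    logger = logging.getLogger("rag_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(RedactEmailFilter())
    logger.addHandler(handler)
    return logger
```

Attaching the filter to the handler means every record passing through that handler is masked, regardless of which part of the RAG pipeline emitted it.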

These privacy risks must be addressed through appropriate mitigation measures. The right measures vary on a case-by-case basis, but may include: proper due diligence of the external sources, their content (including sensitive data) and the APIs used by RAG; ideally restricting retrieval sources to internal sources with appropriate access right restrictions; due diligence on the data retention concepts of any external databases used by RAG; anonymizing or pseudonymizing prompts shared with external databases used by RAG; minimizing logging; and establishing data transfer agreements with third parties. Additional measures include regularly evaluating the output of the RAG-based AI system for accuracy and non-hallucination; implementing transparency measures, retrieval filters and configurations to reduce the risk of processing and leaking sensitive data; weighting or flagging retrieved content with source information for appropriate context; and maintaining a robust data rights management system.
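One of these measures, pseudonymizing prompts before they leave the organization, could look roughly like the sketch below. It is a simplified illustration: the function names, the token format and the restriction to e-mail addresses are hypothetical. The mapping stays inside the organization, so the external database only ever sees placeholder tokens.

```python
import re

# Assumption: e-mail addresses stand in for all direct identifiers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(prompt: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct e-mail address with a stable placeholder token."""
    mapping: dict[str, str] = {}  # token -> original value, kept internally

    def _swap(match: re.Match) -> str:
        value = match.group(0)
        # Reuse the token if the same address appears more than once.
        for token, original in mapping.items():
            if original == value:
                return token
        token = f"<EMAIL_{len(mapping) + 1}>"
        mapping[token] = value
        return token

    return EMAIL_RE.sub(_swap, prompt), mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Re-insert the original values into text returned to the user."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

Because the tokens can be reversed with the mapping, this is pseudonymization rather than anonymization, and the mapping itself remains personal data that needs protection.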

Data privacy compliance benefits

Conversely, the overarching benefit of RAG compared to generative AI systems without it is the generation of more accurate, dynamic, updated and real-time output by leveraging information and documents in addition to the knowledge base of the underlying LLM. Large language models are trained on data up to a specific point in time and may lack awareness of recent developments or specific knowledge.

The DSK's RAG guidelines examine not only the privacy risks, but also the positive impact of such benefits on privacy compliance, particularly regarding the principles of accuracy, transparency and lawfulness.

Accuracy. The core benefit of RAG serves the accuracy principle. By leveraging information and data from additional sources beyond the knowledge base of the LLM itself, the output should be more up-to-date and context-specific, thereby reducing the risk of inaccurate or hallucinated output.

According to the DSK, the level of risk reduction depends significantly upon the quality of the additional information and documents as well as various technical aspects, such as prioritization and subsequent processing by the AI system. A mix of languages in prompts, additional information or documents should be avoided to maintain the level of accuracy.

Transparency. Retrieval-augmented generation does not inherently increase the overall transparency of an AI system, and transparency obligations regarding the processing of personal data by the RAG-based AI system must be addressed through typical transparency methods.

However, if RAG enables identification of the sources for the generated output, it enhances clarity and explainability, which in turn positively impacts the balancing-of-interests test.
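How a RAG pipeline might make sources identifiable can be sketched as follows. This is an illustration under stated assumptions, not a method described by the DSK: the Chunk type, the source identifiers and the "[n]" citation convention are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_id: str  # hypothetical identifier, e.g. a document path or record ID
    text: str


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number each retrieved chunk and label it with its source."""
    lines = ["Answer using only the numbered sources and cite them as [n].", ""]
    for n, chunk in enumerate(chunks, start=1):
        lines.append(f"[{n}] (source: {chunk.source_id}) {chunk.text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def cited_sources(answer: str, chunks: list[Chunk]) -> list[str]:
    """Map the [n] markers found in an answer back to source identifiers."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return [c.source_id for n, c in enumerate(chunks, start=1) if n in cited]
```

Whether the model actually follows the citation instruction is not guaranteed, so the mapping from markers back to sources would still need to be checked as part of regular output evaluation.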

Principle of lawfulness. The processing of any personal data by a RAG-based AI system requires a legal basis. For this analysis, the processing performed by the RAG components must be taken into account. According to the DSK, RAG can support a positive balancing-of-interests test, particularly by reducing the risk of inaccurate or outdated output through the use of retrieved information, enhancing clarity and explainability, and enabling the identification of the sources used for the generated output.

Beyond the privacy benefits of RAG, the RAG guidelines also discuss privacy risks, particularly those relating to data minimization, purpose limitation and integrity and confidentiality.

Data minimization. Given the increased volume of data processed by an AI system through RAG, steps must be considered to minimize the processing of personal data. This could be achieved by restricting the retrieval of information from external sources, implementing retention concepts for retrieved information and logs within the AI system, and removing personal data from retrieved information — through data transformation steps, for example — prior to further processing by the AI system.
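A retention concept for retrieved information could be sketched as below. The cache structure and the 30-day period used in the usage note are assumptions chosen for illustration; neither paper prescribes a specific retention period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class CachedChunk:
    doc_id: str  # hypothetical identifier of the retrieved document
    text: str
    retrieved_at: datetime  # when the chunk was pulled from the source


def purge_expired(
    cache: list[CachedChunk],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> list[CachedChunk]:
    """Return only the chunks still inside the retention period."""
    now = now or datetime.now(timezone.utc)
    return [c for c in cache if now - c.retrieved_at <= max_age]
```

Running such a purge on a schedule, for example with a max_age of timedelta(days=30), keeps retrieved personal data from accumulating inside the AI system beyond the period the organization has defined.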

Purpose limitation. To comply with the purpose limitation principle when retrieving personal data from external sources for further processing by the RAG-based AI system, the initial purposes for which such personal data was collected must be adhered to in the context of the AI system, particularly through appropriate access right concepts and data separation measures.

The potential linkage of retrieved personal data with personal data generated by the LLM as output may infringe upon the purpose limitation principle, which represents a common risk for AI systems.

Integrity and confidentiality. To comply with the principle of integrity and confidentiality, appropriate access right concepts must be applied to RAG, particularly regarding the content of external sources. These measures may also enable the lawful processing of sensitive data by the RAG-based AI system, provided such sensitive data is not used for further LLM training and remains stored within the external source.
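The access right concepts mentioned under both purpose limitation and integrity and confidentiality can be pictured as a filter applied before any retrieved document reaches the LLM. The following is a minimal sketch; the role names, purpose labels and Document fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_roles: frozenset  # roles permitted to read this document
    purposes: frozenset  # purposes the data was originally collected for


def permitted(docs: list[Document], user_roles: set, purpose: str) -> list[Document]:
    """Keep only documents the user may access for the stated purpose."""
    return [
        d
        for d in docs
        # Both checks must pass: a matching role AND a compatible purpose.
        if d.allowed_roles & user_roles and purpose in d.purposes
    ]
```

Filtering before generation, rather than after, means documents that fail either check never enter the model's context and therefore cannot surface in the output.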

Data subject rights. While the DSK acknowledges that ensuring compliance with data subject rights for the LLM itself remains an unresolved issue, the guidelines emphasize that data subject rights can — and must — be upheld for the AI system's inputs and outputs, as well as for the retrieved information and documents used by RAG.

Key takeaways

On the one hand, RAG, like any other technology used in the context of AI systems, bears privacy risks. The paper by the EDPB's Support Pool of Experts and the DSK's RAG guidelines provide helpful guidance for identifying these vulnerabilities. Nevertheless, a case-by-case risk assessment for each RAG-based AI system remains a compliance necessity.

On the other hand, as illustrated by the DSK's guidelines, RAG can also help achieve compliance with certain privacy principles — most notably the principle of lawfulness. In particular, reducing the risk of inaccurate or outdated output using retrieved information, increasing clarity and explainability, and identifying the sources used for the generated output can help tip the scale toward an overriding legitimate interest. To leverage these benefits in an organization's privacy compliance assessments, it is essential to understand and configure how RAG is implemented for a specific AI system.

Julia Kaufmann, AIGP, CIPP/E, is a partner and Florian Eisenmenger, CIPP/E, CIPM, FIP, is counsel at Osborne Clarke.

This content is eligible for Continuing Professional Education credits. Please self-submit according to CPE policy guidelines.


Tags:

AI and machine learning, Risk management, Data security
