Editor's note: The IAPP is policy neutral. We publish contributed opinion and analysis pieces to enable our members to hear a broad spectrum of views in our domains.
Regurgitation, memorization, hallucination ... Sometimes it feels like artificial intelligence taxonomy is taken straight out of a thriller/horror movie with zombies chewing up our memories and spitting out a different version to mess with humanity. Now that I write this, I am willing to bet that movie has been made — probably at a low point of cinematic creativity.
The results of an ongoing public consultation by the German Federal Commissioner for Data Protection and Freedom of Information (BfDI) could turn out to be more entertaining than that movie. With this survey, the BfDI wants to "contribute to the development of data protection-compliant approaches for dealing with memorized data" in large language models. By its own admission, the BfDI construes the topic of data memorization broadly, covering approaches such as deduplication, the use of anonymous or anonymized training data, fine-tuning without personal data and differential privacy.
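Deduplication, the first approach on that list, illustrates why these measures matter: strings repeated across a training corpus are memorized far more readily than unique ones, so removing duplicates before training reduces the risk of regurgitation. The following is a minimal Python sketch of exact-duplicate removal under a naive case-and-whitespace normalization; the sample corpus and the normalization rule are illustrative assumptions, not anything prescribed by the BfDI.

```python
# Minimal sketch of exact-duplicate removal from a training corpus,
# one of the memorization mitigations named in the consultation.
# The normalization rule and sample corpus are illustrative only.
import hashlib

def normalize(text: str) -> str:
    # Collapse case and whitespace so trivially re-encoded copies collide.
    return " ".join(text.lower().split())

def deduplicate(corpus: list[str]) -> list[str]:
    seen: set[str] = set()
    kept: list[str] = []
    for doc in corpus:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:  # keep only the first copy of each document
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "Jane Doe lives at 1 Example Street.",
    "jane doe  lives at 1 Example Street.",  # case/whitespace variant
    "An unrelated sentence about the weather.",
]
print(deduplicate(corpus))  # the near-identical record appears only once
```

Production pipelines go further, using near-duplicate detection such as MinHash over n-grams, but the privacy rationale is the same: the fewer times a personal record appears in the training data, the less likely the model is to reproduce it verbatim.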
Data reconstruction attacks have emerged as a concerning threat to privacy. As explained in this IAPP article, "the ability to reconstruct training samples is partly explained by the tendency of neural networks to memorize their training data. Generative AI systems are fed carefully crafted instructions and receive context-sensitive information to summarize or complete other tasks. The input is also used for training, which means a lot of data is memorized. Taking advantage of this, attackers can simply ask for the repetition of confidential information that exists in the context of a conversation or an instruction."
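To make that attack surface concrete, the sketch below probes a model for verbatim regurgitation: it feeds the model the prefix of a string suspected to be in its training data and checks whether greedy decoding reproduces the rest. It assumes a Hugging Face causal language model; the model name, the canary string and the verbatim-match heuristic are illustrative assumptions, not part of the BfDI consultation or the article quoted above.

```python
# Minimal regurgitation probe: does the model complete a suspected
# training string verbatim? A match is evidence of memorization.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any Hugging Face causal LM will do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Hypothetical record an attacker suspects was memorized in training.
canary = "Contact Jane Doe at jane.doe@example.com or +49 30 1234567."
prefix, expected_suffix = canary[:30], canary[30:]

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,  # greedy decoding: the model's most likely continuation
    pad_token_id=tokenizer.eos_token_id,
)
continuation = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

if expected_suffix.strip() in continuation:
    print("Model reproduced the canary verbatim: likely memorized.")
else:
    print("No verbatim regurgitation for this prefix.")
```

Published extraction attacks are more elaborate, sampling many continuations and ranking them by likelihood, but the principle is the one the article describes: whatever the model memorized, a well-chosen prompt can surface.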
Against that backdrop, the consultation builds on the European Data Protection Board's Opinion 28/2024, adopted in December 2024, which addresses certain data protection aspects of the processing of personal data in the context of AI models. Highly anticipated, the opinion established, among other things, the conditions under which an AI model can be considered "anonymous."
The EDPB established that, by default, AI models are very likely to require a "thorough evaluation of the risks of identification" against the board's expectations regarding anonymization. The opinion also provided a non-prescriptive and non-exhaustive list of possible elements data protection authorities can consider when assessing a controller's claim of anonymity. It pointed to mitigation measures that facilitate data subject rights, such as unlearning techniques to address memorization.
The BfDI considers that "complete anonymization of AI models is generally not reliably possible" given the volume of data used for training, even if the original intent is to train the model using anonymous data.
The ongoing consultation, open until 31 Aug., lists eight questions. It is a short questionnaire, but one that raises complex questions of legal interpretation alongside difficult technical ones.
The first set of questions aims to collect feedback on circumstances under which an LLM could be considered anonymous, technical measures organizations use or plan to use to prevent "data memorization," and risk assessment approaches pertaining to personal data being extracted from an LLM.
The questionnaire also looks at the fundamental definition of processing. The BfDI asks whether the computation a prompt triggers constitutes processing as defined by the EU General Data Protection Regulation, even if the AI model's output is not personal data.
The "intensity of data processing" and potential related assessment obligations are another focus of the consultation. The BfDI hopes to gather practical feedback on methods that estimate the amount and type of personal data memorized, determine whether the AI model used contains the personal data of a specific individual, and assess their informative value and possible limitations.
The BfDI also seeks input on how data subjects can exercise their rights in AI models — particularly access, rectification and erasure — given their "black box architecture."
Several European DPAs have widely relayed the consultation, indicating at least a clear common interest.
Isabelle Roccia, CIPP/E, is the managing director, Europe, for the IAPP.
This article originally appeared in the Europe Data Protection Digest, a free weekly IAPP newsletter. Subscriptions to this and other IAPP newsletters can be found here.