On 17 December, the European Data Protection Board adopted Opinion 28/2024 "on certain data protection aspects related to the processing of personal data in the context of AI models."
Although organizations appreciate explicit confirmation that legitimate interest is a legal basis for the development and deployment of artificial intelligence models, they hoped to move from theoretical discussions to practical solutions. We are partway there.
The wording "case by case" appears 16 times in the opinion, and "may" or "might" 161 times. This will likely lead to diverging approaches between DPAs, but it is possible this strategic ambiguity is deliberate on the EDPB's part.
Does an AI model contain personal data?
This is a big question, which the Hamburg Commissioner for Data Protection and Freedom of Information notably attempted to answer in its July discussion paper. The Hamburg DPA determined that certain AI models, specifically large language models, do not store personal data, so certain EU General Data Protection Regulation requirements, notably those concerning data subject rights, do not apply to the model itself.
The EDPB seems to disagree: "AI models trained with personal data cannot, in all cases, be considered anonymous." This approach requires a case-by-case assessment of whether the model is anonymous, by reference to the definition of personal data under the GDPR.
An AI model will be considered anonymous when the controller can demonstrate, with evidence, that personal data from the training set cannot be extracted from the model (for example, through model inversion or other attacks) and that any output produced when querying the model does not relate to the individuals whose personal data was used to train it, such that the likelihood of either happening is "insignificant."
The opinion refers to existing anonymization principles and approaches, including Article 29 Working Party Opinion 05/2014 on anonymization techniques, which the EDPB may be due to update under its Work Programme 2024-2025, and reiterates that the controller needs to consider "all the means reasonably likely to be used" by the controller or another person to identify individuals, citing the Breyer and OC v European Commission cases. In the context of an AI model, the EDPB acknowledges this will require a thorough assessment, which may involve consideration of elements such as model design, model analysis, model testing and documentation.
While the EDPB leaves this open to case-by-case analysis based on existing anonymization principles, it sets a high bar for all developers, whether small, medium or large, to meet (for example, how does an attack on a model constitute "legal means" in accordance with the Breyer case?) and causes problems for entire models that return information about individuals, e.g., public figures, when queried.
There could be diverging approaches between DPAs, who do not always agree on general anonymization principles. The key for AI developers aiming for anonymity is to make sure they can produce evidence supporting an anonymization assessment, whether in a data protection impact assessment or separately, and to consider what safeguards, such as filtering, can be built in to reduce the risk of reidentification through model queries or other methods of extraction.
Can legitimate interests be a legal basis for AI processing?
The EDPB confirmed that legitimate interest can generally serve as a legal basis for the development and deployment of AI models. Building on its recent guidelines on processing personal data based on Article 6(1)(f) of the GDPR, the EDPB outlines a three-step test: identifying a legitimate interest, ensuring the necessity of the processing, and conducting a balancing exercise to confirm that the controller's legitimate interest is not overridden by the data subjects' interests or fundamental rights and freedoms.
Notably, the EDPB presents a set of criteria that controllers and DPAs can apply under the three-step test specifically for AI development and deployment. This level of detail is helpful to some extent, offering some concrete measures to justify processing activities related to AI. Only "some" because, for example, the guidance on legitimate interest from France's data protection authority, the Commission nationale de l'informatique et des libertés, is more comprehensive. Again, it is possible the EDPB intentionally leaves strategic ambiguity here. While the first step of the test, identifying a legitimate interest, is quite straightforward, controllers will find the challenge in the necessity test and the balancing exercise.
The EDPB sets a high bar for necessity in relation to the volume of personal data involved in the AI model. This needs to be assessed in light "of less intrusive alternatives that may reasonably be available to achieve just as effectively the purpose of the legitimate interest pursued." If the purpose can be achieved using an AI system that does not involve handling personal data, then processing personal data should be deemed unnecessary. The EDPB outlines potential risks, followed by mitigations that could tip the balance in favor of the controller:
- Technical measures such as deidentification, pseudonymization, or data masking (a minimal pseudonymization sketch follows this list).
- Facilitating data subjects' rights, such as offering an unconditional opt-out or addressing data regurgitation or memorization through unlearning techniques.
- Enhanced transparency, including media campaigns, emails, graphics, FAQs, labels, model cards, and voluntary annual transparency reports.
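To make the first of these measures more concrete, below is a minimal, hypothetical sketch of pseudonymizing a direct identifier in a training record by replacing it with a keyed hash. The key, field names and helper function are illustrative assumptions only, not anything the opinion prescribes.

```python
import hashlib
import hmac

# Illustrative only: replace a direct identifier with a keyed (HMAC-SHA256)
# hash before the record enters a training corpus, so the raw value is not
# retained while records remain linkable. The key is an assumption and would
# need to be stored and rotated separately from the dataset.
SECRET_KEY = b"store-this-key-separately-from-the-dataset"

def pseudonymize(value: str) -> str:
    """Return a keyed hash standing in for a direct identifier."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane.doe@example.com", "comment": "Great service!"}
record["email"] = pseudonymize(record["email"])
print(record)  # the email address is now an opaque token
```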
For AI deployment, measures could include technical safeguards, such as output filters, to prevent the storage, regurgitation or generation of personal data, especially in generative AI models. Controllers could also offer rights beyond the GDPR, such as post-training techniques to remove or suppress personal data from trained AI models, despite the technical obstacles emphasized by AI model developers in the public debate preceding this opinion.
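As an illustration of what such an output filter could look like in its simplest form, the sketch below scans generated text for common personal-data patterns and redacts them before the response is returned. The patterns, names and example text are assumptions for illustration, not any provider's actual safeguard.

```python
import re

# Minimal illustration of an output filter: scan generated text for common
# personal-data patterns (email addresses, phone-number-like strings) and
# redact them before the response is returned to the user.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def filter_output(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} redacted]", text)
    return text

print(filter_output("Contact Jane at jane.doe@example.com or +44 20 7946 0958."))
# -> Contact Jane at [email redacted] or [phone redacted].
```

A production safeguard would go well beyond regular expressions (for example, named-entity recognition or retrieval-based checks), but the control point is the same: filter the model's output before it leaves the system.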
The EDPB highlights measures relating to web scraping: while indiscriminate large-scale processing is unreasonable, selective data collection can be justified. Controllers must exclude information containing personal data that might harm vulnerable individuals, such as those facing potential abuse or prejudice. Data collection should also avoid sensitive or intrusive sources, particularly those involving sensitive data, location data or financial data.
Importantly, the EDPB no longer implies that social media data is an impermissible source, as it did in the ChatGPT interim report, which is encouraging for ongoing disputes with social media platforms. Further safeguards include respecting "robots.txt" or "ai.txt" signals, as informed by the EU AI Act and copyright laws. Finally, enhanced transparency through public campaigns, detailed disclosures, FAQs or transparency labels can help bridge the information gap between controllers and individuals and manage individuals' reasonable expectations.
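By way of example, honoring "robots.txt" signals can be checked programmatically before a page is scraped. The sketch below uses Python's standard urllib.robotparser; the user agent string and URLs are placeholders, and "ai.txt"-style signals have no standard parser, so they would need analogous, custom handling.

```python
from urllib import robotparser

# Minimal sketch: before scraping a page, check whether the site's robots.txt
# allows our crawler's user agent to fetch it. The user agent and URLs below
# are placeholders, not a real crawler.
USER_AGENT = "ExampleAITrainingBot"

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

url = "https://example.com/profiles/jane-doe"
if parser.can_fetch(USER_AGENT, url):
    print(f"Allowed to fetch {url}")
else:
    print(f"robots.txt disallows {url}; skipping this page")
```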
What are the consequences of unlawful processing in AI development?
The final question considers the impact of unlawful processing, in the sense of lacking a valid legal basis, during the development of an AI model on the subsequent use of that model.
Again, this is expressed as general observations, with the EDPB emphasizing the need for case-by-case assessment and leaving room for DPA discretion.
The EDPB identifies three scenarios.
Scenario 1: The model is developed unlawfully, the model retains personal data and is subsequently processed by the same controller
An assessment will need to be conducted to determine whether the initial and subsequent processing constitute different purposes and the extent to which a lack of legal basis for the former affects the latter. For example, if the legal basis for the subsequent processing is legitimate interests, the fact that the initial processing was unlawful should be taken into account in the legitimate interest assessment. Corrective measures imposed by a DPA in response to the initial processing can have downstream impact too, for example if they require deletion of the personal data.
Scenario 2: The model is developed unlawfully, the model retains personal data and is deployed by another controller
DPAs will need to assess the responsibilities of each controller separately but should consider whether the second controller assessed whether the model was developed unlawfully, for example, whether the data originates from a breach or the initial processing was subject to a finding of infringement from a DPA or a court. The opinion acknowledges that the degree of this "due diligence"-type assessment may vary depending on the risks, i.e., it acknowledges proportionality.
Scenario 3: The model is developed unlawfully, but is anonymized before the same or another controller processes personal data in its deployment
If it can be demonstrated that the subsequent operation of the model does not involve personal data, i.e., that the model is anonymous, then the GDPR would not apply, and the unlawfulness of any initial processing should not affect the subsequent operation of the model.
This provides a roadmap for potential enforcement and introduces a concept of due diligence that could safeguard subsequent processing/deployment of an AI model against initial unlawful processing, provided a DPA hasn’t disrupted that by, for example, ordering deletion earlier on in the process. Effectively, there are two ways to protect that subsequent processing: demonstrate the model is anonymous, or if that is not possible, satisfy yourself that the model was not obviously developed unlawfully.
The opinion leaves open the possibility that a DPA could take measures that would defeat any subsequent processing or use of an AI model entirely. Hints at possible worst-case scenarios include "erasing part of the dataset that was processed unlawfully or, where this is not possible … the erasure of the whole dataset used to develop the AI model and/or the AI model itself." The EDPB qualifies that such a step would need to be proportionate and that, when assessing proportionality in this context, DPAs may take into account remediation measures the controller can take, such as retraining. Third-party deployers of AI models will also not be able to entirely distance themselves from the development phase, making further diligence likely.
What is missing?
Some challenging topics are explicitly excluded: automated decision making, compatibility of purposes, DPIAs and data protection by design.
The opinion sidesteps how sensitive data can be used, even unintentionally, to train AI models, although the nature of the data affects the AI developer's ability to rely on legitimate interest. Notably, the opinion still makes challenging general comments, including one on a recent Court of Justice of the European Union finding that when even one sensitive data item is included in a "bloc" of data, an Article 9 condition is required for the entire set, and another on the narrow nature of the "manifestly made public" condition.
The opinion also remains silent on the relevance, in the context of the development of AI models, of the CJEU case law that recognized the possibility for search engines to rely on legitimate interest. Finally, it misses an opportunity to clarify how Article 11 of the GDPR, on processing that does not require identification of individuals, applies to the development of LLMs.
Though the opinion omits key clarifications, this could be a deliberate move to maintain some strategic ambiguity as technology and regulations are developing at a fast pace. The real game-changer will lie in how DPAs choose to interpret and enforce this opinion, ultimately steering the future of AI innovation in the EU.
Alex Jameson is senior associate, Izabela Kowalczuk-Pakula is partner, Willy Mikalef is partner and Nils Lölfing is counsel at Bird & Bird.