On 17 Dec., the European Data Protection Board adopted Opinion 28/2024 "on certain data protection aspects related to the processing of personal data in the context of AI models."
Although organizations appreciate explicit confirmation that legitimate interest can be a legal basis for the development and deployment of artificial intelligence models, they hoped to move from theoretical discussions to practical solutions. We are partway there.
The wording "case by case" appears 16 times in the opinion, and "may" or "might" 161 times. This will likely lead to diverging approaches among data protection authorities, although it is possible this strategic ambiguity is deliberate on the EDPB's part.
Does an AI model contain personal data?
This is a big question, which the Hamburg Commissioner for Data Protection and Freedom of Information notably attempted to answer in its discussion paper in July. The HmbBfDI determined certain AI models, specifically large language models, do not store personal data, so certain EU General Data Protection Regulation requirements and their focus on data subject rights do not apply to the models themselves.
The EDPB seems to disagree: "AI models trained with personal data cannot, in all cases, be considered anonymous." Its approach requires a case-by-case assessment of whether the model is anonymous, by reference to the definition of personal data under the GDPR.
An AI model will be considered anonymous when the controller can demonstrate, with evidence, both that personal data from the training set cannot be extracted from the model, for example through model inversion or other attacks, and that any output produced when querying the model does not relate to the individuals whose personal data was used to train it, such that the likelihood of either occurring is "insignificant."
The opinion refers to existing anonymization principles and approaches, including the Article 29 Working Party's Opinion 05/2014 on anonymization techniques, which the EDPB is due to update according to its Work Programme 2024-2025. It notes controllers need to consider "all the means reasonably likely to be used" by the controller or another person to identify individuals, referring to the European Commission v. Breyer and OC v. European Commission cases. In the context of an AI model, the EDPB notes this will require a thorough assessment, which may involve consideration of elements such as model design, model analysis, model testing and documentation.
While the EDPB leaves this open for case-by-case analysis based on existing anonymization principles, it sets a high bar for all developers, small, medium or large, to meet. Open questions remain: for example, how does an attack on a model constitute "legal means" in accordance with the Breyer case? The approach could also cause problems for entire models that return information about individuals when queried, for example about public figures.
There could also be diverging approaches, as different DPAs do not always agree on general anonymization principles. The key for AI developers aiming for anonymity is to make sure they can produce evidence supporting an anonymization assessment, whether in a data protection impact assessment or separately, as well as to consider what safeguards, such as filtering, can be built in to reduce the risk of reidentification through model queries or other methods of extraction.
Can legitimate interests be a legal basis for AI processing?
The EDPB confirmed legitimate interest can generally serve as a legal basis for the development and deployment of AI models. Building on its recent guidelines on processing personal data based on Article 6(1)(f) of the GDPR, the EDPB outlined a three-step test: identifying a legitimate interest, ensuring the necessity of the processing, and conducting a balancing exercise to confirm the controller's legitimate interest is not overridden by the data subjects' interests or fundamental rights and freedoms.
Notably, the EDPB presents a set of criteria that controllers and DPAs can apply under the three-step test specifically for AI development and deployment. This level of detail is helpful to some extent, offering some concrete measures to justify processing activities related to AI. Only "some" because, for example, the guidance on legitimate interest from France's DPA, the Commission nationale de l'informatique et des libertés, is more comprehensive. Again, it is possible the EDPB intentionally leaves strategic ambiguity here. While the first step of the test, identifying a legitimate interest, is quite straightforward, controllers will find the challenge in the necessity assessment and the balancing exercise.
The EDPB sets a high bar for necessity in relation to the volume of personal data involved in the AI model. This needs to be assessed in light "of less intrusive alternatives that may reasonably be available to achieve just as effectively the purpose of the legitimate interest pursued." If the purpose can be achieved using an AI system that does not involve handling personal data, then processing personal data should be deemed unnecessary. The EDPB outlines potential risks, followed by mitigations that could tip the balance in favor of the controller:
- Technical measures such as deidentification, pseudonymization or data masking (see the sketch after this list).
- Facilitating data subjects' rights, such as offering an unconditional opt out or addressing data regurgitation or memorization through unlearning techniques.
- Enhanced transparency, including media campaigns, emails, graphics, FAQs, labels, model cards and voluntary annual transparency reports.
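By way of illustration, the minimal Python sketch below shows one way pseudonymization and data masking of direct identifiers could look before records enter a training corpus. The field names, key handling and masking choices are assumptions made for this example, not measures prescribed by the opinion.

```python
# Illustrative sketch only: keyed pseudonymization and masking of direct
# identifiers before records enter a training corpus. Field names and the
# secret-key handling are assumptions, not requirements from the opinion.
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-securely-stored-key"  # assumed to be kept separate from the data

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed, non-reversible token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def mask_record(record: dict) -> dict:
    """Pseudonymize assumed identifier fields and drop contact details outright."""
    masked = dict(record)
    for field in ("name", "email"):   # hypothetical identifier fields
        if field in masked:
            masked[field] = pseudonymize(masked[field])
    masked.pop("phone", None)         # data masking: remove the field entirely
    return masked

print(mask_record({"name": "Jane Doe", "email": "jane@example.com", "phone": "555-0100"}))
```

A keyed hash is used here so the mapping cannot be reversed without the separately stored key, which is one common design choice for pseudonymization; the opinion does not endorse any particular technique.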
For AI deployment, measures could include technical safeguards, such as output filters, to prevent the storage, regurgitation or generation of personal data, especially in generative AI models. Controllers could also offer rights beyond the GDPR, such as post-training techniques to remove or suppress personal data from trained AI models, despite the technical obstacles emphasized by AI model developers in the public debate preceding this opinion.
The EDPB highlights measures on web scraping; while indiscriminate large-scale processing is unreasonable, selective data collection can be justified. Controllers must exclude information containing personal data that might harm vulnerable individuals, such as those facing potential abuse or prejudice. Data collection should also avoid sensitive or intrusive sources, particularly those involving sensitive, location or financial data.
Importantly, the EDPB no longer implies social media data is an impermissible source, as it did in the ChatGPT interim report, which is encouraging for ongoing disputes with social media platforms. Further safeguards include respecting "robots.txt" or "ai.txt" signals, as informed by the EU AI Act and copyright laws. Enhanced transparency through public campaigns, detailed disclosures, FAQs or transparency labels can finally help bridge the information gap between controllers and individuals and manage individuals' expectations.
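To illustrate the "robots.txt" point, the brief Python sketch below shows one way a scraper could check a site's robots.txt before collecting a page for training data; the crawler name and URLs are hypothetical, and the opinion does not prescribe any particular implementation.

```python
# Illustrative sketch only: honoring robots.txt before scraping a page for
# training data. The user-agent string and URLs are hypothetical.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetches and parses the site's robots.txt

user_agent = "ExampleAICrawler"                       # hypothetical crawler name
target = "https://example.com/profiles/jane-doe"      # hypothetical page

if parser.can_fetch(user_agent, target):
    print("Allowed to fetch", target)
else:
    print("robots.txt disallows fetching", target, "- skipping")
```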
What are the consequences of unlawful processing in AI development?
The final question considers the impact that unlawful processing, in the sense of lacking a valid legal basis, during the development of an AI model has on the subsequent use of that model.
Again, this is expressed as general observations with the EDPB emphasizing the need for case-by-case assessment and room for DPAs' discretion.
The EDPB identifies three scenarios.
Scenario 1: The model is developed unlawfully, retains personal data and is subsequently processed by the same controller
An assessment will need to be conducted to determine if the initial and subsequent processing constitute different purposes and the extent to which a lack of legal basis for the former impacts the latter. For example, if the legal basis for subsequent processing is legitimate interests, the unlawful initial processing should be taken into account in the legitimate interest assessment. Corrective measures imposed by a DPA in response to the initial processing can have downstream impact too, such as if it requires deletion of the personal data.
Scenario 2: The model is developed unlawfully, retains personal data and is deployed by another controller
DPAs will need to assess the responsibilities of each controller separately but should consider whether the second controller assessed whether the model was developed unlawfully, for example, whether the data originates from a breach or the initial processing was subject to a finding of infringement by a DPA or a court. The opinion acknowledges the degree of this due diligence-type assessment may vary depending on the risks, meaning it acknowledges proportionality.
Scenario 3: The model is developed unlawfully but is anonymized before the same or another controller processes personal data in its deployment
If it can be demonstrated the subsequent operation of the model does not involve personal data, such as if the model is anonymous, then the GDPR does not apply and the unlawfulness of any initial processing should not impact subsequent operation of the model.
This provides a roadmap for potential enforcement and introduces a concept of due diligence that could safeguard subsequent processing/deployment of an AI model against initial unlawful processing, provided a DPA has not disrupted that by, for example, ordering deletion earlier on in the process. Effectively, there are two ways to protect that subsequent processing: demonstrate the model is anonymous or, if that is not possible, confirm the model was not obviously developed unlawfully.
The opinion leaves the possibility that a DPA could take measures that would defeat any subsequent processing or use of an AI model entirely. Hints at possible worst-case scenarios here include the possibility of "erasing part of the dataset that was processed unlawfully or, where this is not possible … the erasure of the whole dataset used to develop the AI model and/or the AI model itself." The EDPB qualifies that such a step would need to be proportionate and that, when assessing proportionality in this context, DPAs may take into account controller remediation measures such as retraining. Third-party deployers of AI models will also not be able to entirely distance themselves from the development phase, making further diligence likely.
What is missing?
Some challenging topics, such as automated decision-making, compatibility of purposes, DPIAs and data protection by design, are explicitly excluded.
The opinion sidesteps how sensitive data can be used to train AI models, even unintentionally, although the nature of the data impacts the AI developer's ability to rely on the legitimate interest. Notably, the opinion still makes challenging general comments, including on a recent Court of Justice of the European Union finding that, when even one sensitive data item is included in a "bloc," the Article 9 condition is required for the entire set, as well as another on the narrow nature of the "manifestly made public" condition.
The opinion also remains silent on the relevance, in the context of the development of AI models, of the CJEU case law that recognized the possibility for search engines to rely on the legitimate interest legal basis. Finally, it misses an opportunity to clarify how GDPR Article 11 on processing that does not require identification of individuals applies to the development of LLMs.
Though the opinion omits key clarifications, this could be a deliberate move to maintain some strategic ambiguity as technology and regulations are developing at a fast pace. The real game changer will lie in how DPAs choose to interpret and enforce this opinion, ultimately steering the future of AI innovation in the EU.
Alex Jameson is a senior associate, Izabela Kowalczuk-Pakula is a partner, Willy Mikalef is a partner and Nils Lölfing is a counsel at Bird & Bird.