Editor's note: The IAPP is policy neutral. We publish contributed opinion and analysis pieces to enable our members to hear a broad spectrum of views in our domains.
Aligning artificial intelligence development with the EU General Data Protection Regulation is no easy feat, especially when dealing with special categories of data. There are many reasons for this, but one stands out: developers can't process special categories of data — for example, health data for health care applications — without applying one of the exceptions under Article 9(2).
And there are no good options here. Or are there?
Article 9(1) of the GDPR states that the processing of special categories of data is subject to a strict prohibition unless at least one of the exceptions in Article 9(2) applies. This doesn't leave many options for AI developers, but two alternatives are commonly cited.
First is explicit consent. However, this approach requires an unequivocal and informed affirmative action from the data subject, which is impractical — if not impossible — at scale. The second relies on the exception under Article 9(2)(e) for special categories of data that have been "manifestly made public by the data subject." Again, while theoretically plausible, AI developers face the hurdle of demonstrating that the data subject clearly intended to make the information public.
As the Higher Regional Court of Cologne recently stated, this is no easy task: the exception will most likely apply only where the data subject uploads special categories of data to their own public platforms, such as social media profiles and posts.
But what about Article 9(2)(j) of the GDPR, which includes an exception enabling the processing of special categories of data for the purposes of scientific research?
To understand how the processing of special categories of data for AI development can fit within the concept of scientific research, we need to go back to the Hamburg Regional Court's September 2024 decision in Kneschke v. LAION. Granted, that decision concerns EU copyright rules and the directive on copyright and related rights in the Digital Single Market. It does, however, focus on the concept of scientific research as it relates to dataset collection for AI training purposes.
The case pitted Robert Kneschke, a photographer who had licensed a photo to a photo agency, against LAION, the nonprofit association that compiles and releases large-scale image-text datasets for AI research. To build those datasets, LAION downloaded publicly accessible images and analyzed their content and metadata in order to extract and record matching URLs and descriptions. Kneschke's photo was included in these datasets, leading him to seek an injunction against LAION to prevent further acts of reproduction.
One of the main points of the case was whether LAION's use of the image could be considered text and data mining for scientific research under Article 3 of the copyright directive. The court concluded that it could, since scientific research extends beyond the direct act of discovering new knowledge to the preparatory work that makes that discovery possible.
This means that ancillary steps aimed at enabling the acquisition of new knowledge — here, curating and publishing datasets for AI training — fall under the concept of scientific research. The datasets don't need to be used by the curator itself — in this case, LAION — since making them available to researchers is enough for the activity to qualify as scientific research. The key factor in treating dataset creation as scientific research is therefore the purpose behind the actions taken: whether the content was reproduced in order to derive new information.
If we bring this into the data protection and GDPR space, similar arguments can be used to justify the processing of special categories of data for AI training, either under Article 9(2)(j) or under Article 6(4), which permits the processing of personal data for a purpose other than the one it was collected for, provided the new processing is compatible with the original purpose.
First, while the LAION case focused only on dataset creation, it is possible to extend its arguments to the broader AI development stage. As argued by the Hamburg Regional Court and the European Data Protection Supervisor, scientific research is generally defined as a systematic and methodological activity aimed at generating new knowledge through the testing of hypotheses and transparent scrutiny and validation.
This activity should also contribute to the public interest or to the collective body of knowledge and, according to Recital 159 of the GDPR, can encompass technological development, whether publicly or privately funded. AI development arguably fits this definition, as a series of steps intended to increase the collective body of knowledge — especially when it serves the public interest, as with AI applications in the health care sector.
This qualification is also implicit in the court's decision: If ancillary steps to the creation of knowledge — such as dataset creation, as done by LAION — are considered scientific research, then it is only logical that the end result those steps enable — AI development — should likewise be considered scientific research.
However, for the scientific research exception to apply, developers will still need to ensure proportionate processing accompanied by the appropriate safeguards outlined in Article 89 of the GDPR. This means, for example, that AI developers will need to assess their data sources to exclude unnecessary personal data, especially special categories of data.
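By way of illustration only (the GDPR does not prescribe any particular technique, and the terms and names in this sketch are hypothetical), a first pass over a training corpus might screen records for likely special-category content and route matches out of the dataset. A keyword screen like the Python sketch below is a blunt instrument; a real pipeline would combine it with trained classifiers and human review.

```python
import re

# Hypothetical term list for illustration; not an exhaustive taxonomy of
# the GDPR Article 9 categories. A production pipeline would pair a screen
# like this with trained classifiers and human review.
SPECIAL_CATEGORY_TERMS = re.compile(
    r"\b(diagnos\w*|prescription|hiv|religio\w*|trade union|ethnic\w*)\b",
    re.IGNORECASE,
)

def likely_special_category(text: str) -> bool:
    """Return True if a record appears to contain special-category data
    and should be excluded or routed for further review."""
    return bool(SPECIAL_CATEGORY_TERMS.search(text))

records = [
    "Sunset over the harbor, shot on 35mm film.",
    "Patient diagnosis: type 2 diabetes, prescription renewed.",
]
kept = [r for r in records if not likely_special_category(r)]
print(kept)  # only the first record survives the screen
```

The design choice is deliberately conservative: records are dropped or flagged on suspicion, which tracks the data minimization logic of Article 89 better than trying to prove each record harmless.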
If the collection of special categories of data cannot be avoided, developers should apply anonymization and/or pseudonymization procedures to the extent possible. Ultimately, if the model needs to be trained on special categories of data, developers should assess whether the data could be memorized or potentially reproduced in outputs. If these risks exist, they should also adopt measures to mitigate them, such as deploying privacy-enhancing technologies to reduce inference risks and testing models against adversarial attacks.
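As a minimal sketch of what pseudonymization could look like in practice (the key handling and names here are illustrative assumptions, not a prescribed method), direct identifiers can be replaced with keyed, non-reversible tokens before training, so that records remain linkable for research without exposing the identifier itself:

```python
import hmac
import hashlib

# Hypothetical key for illustration. In practice the key would live in a
# secrets manager, be rotated, and be stored separately from the dataset;
# that separation is what makes this pseudonymization rather than mere hashing.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(identifier: str) -> str:
    """Map a direct identifier to a stable, keyed token. The same input
    always yields the same token, so records can still be linked across
    the dataset without revealing the underlying identifier."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

print(pseudonymize("patient-12345"))
```

Note that, under the GDPR, pseudonymized data remains personal data; the technique reduces risk and supports the Article 89 safeguards, but it does not take the processing outside the regulation's scope.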
This suggests that looking at AI development as scientific research under the GDPR may offer AI developers a way out of the prohibition on processing special categories of data under Article 9(1), provided they follow scientific methodologies aimed at generating new knowledge and advancing technology.
This does not mean, however, that the processing of this data is unconstrained by the requirements of the GDPR. It's quite the opposite. AI developers will still need to comply with data minimization and anonymization standards as well as the appropriate safeguards to ensure security and respect for the rights, interests and freedoms of data subjects.
Francisco Arga e Lima, CIPP/E, is a data protection and AI consultant in Portugal and a Ph.D. candidate and guest lecturer at NOVA School of Law.
