Editor's note: The IAPP is policy neutral. We publish contributed opinion pieces to enable our members to hear a broad spectrum of views in our domains. 

There are two proposed provisions in the EU Digital Omnibus to amend the EU General Data Protection Regulation, both intended to simplify the GDPR and facilitate artificial intelligence innovation. The amendments relate to a contentious topic: the legal basis on which a large language model provider may process personal data, both regular and special-category, for training LLMs.

The Commission is proposing to simplify the GDPR by providing that AI providers:

  1. May rely on the legitimate interest basis for AI development and operation, provided they apply enhanced safeguards and provide data subjects with an unconditional right to opt out (Recitals 30-31; new Article 88c GDPR).
  2. May rely on a new exemption allowing the processing of "residual" special-category data for development and operation of AI (new Article 9(2) sub k GDPR), provided specific conditions are met (Recital 33; new Article 9(5) GDPR). 

These specific conditions require controllers to implement suitable safeguards to avoid the collection and processing of special-category data and, where residual special-category data remains despite these measures, to remove such information without undue delay or, where removal would require disproportionate effort, to implement effective technical and organizational safeguards ensuring that the information cannot be used to generate outputs, be inferred or otherwise be disclosed to third parties.


Both proposed amendments miss the mark and create more issues than they simplify.

Legitimate interest basis: Unconditional opt-out

The opt-out requirement makes sense where an LLM provider leverages its user content to train its LLM, i.e., first-party data. It does not make sense where the provider scrapes the internet, i.e., third-party data. Providing an opt-out to all individuals whose personal data may be available on the internet would require a global opt-out database. This is not only a practical impossibility, but also undesirable from a data protection point of view. We need to provide individuals with material protection, not procedural measures. If my 80-year-old grandmother forgets to opt out, would everything then be allowed?

Conflicting standpoints on using first-party data for LLM training

It is likely that the proposal in the Digital Omnibus was triggered by the conflicting standpoints of the EU data protection authorities as to the legal basis for using first-party data for training LLMs.

In 2019, the U.K. Information Commissioner's Office and the Hamburg Commissioner for Data Protection and Freedom of Information ruled that further processing of personal data generated by a service provider was not compatible with the original purpose of processing and required an opt-in. 

On 21 May 2025, Ireland's Data Protection Commission issued a statement concerning Meta training its LLMs using public content shared by adults on Facebook and Instagram across the EU/EEA (i.e., first-party data).

After two years of consultation with leading technology companies and a request for the opinion of the European Data Protection Board, which resulted in the EDPB Opinion on AI models, the DPC agreed to several amendments to Meta's policies. 

The amendments include an updated transparency notice, an easier-to-use objection form, an opt-out and a longer notice period for users. Meta further has to ensure the objection forms work properly within its apps and that access to the objection form is provided for more than a year. Other safeguards to be implemented are deidentification measures, filtering of training data and output filters, as well as updated risk assessments and other documentation required under the GDPR and other EU laws, such as a legitimate interest assessment, a data protection impact assessment and a compatibility assessment.

The DPC announced it will continuously monitor Meta's roll-out of these measures to ensure data subjects indeed have a proper opportunity to opt-out. 

On 23 May 2025, just two days later, the Higher Regional Court of Cologne agreed with the DPC that an opt-out sufficed and dismissed the application for an injunction against Meta for training its LLM. After expressing initial objections to the Cologne court's decision and initiating a so-called urgency procedure under Article 66 GDPR against the DPC as the lead DPA for Meta, the Hamburg DPA decided to drop the urgency procedure to ensure a consistent approach across the EU.

On 23 Sept. 2025, the ICO followed the DPC's position in respect of LinkedIn, which received the go-ahead from the ICO to train its LLM after agreeing to implement safeguards similar to those adopted by Meta, including offering a clear and simple route for users to object to the processing.

However, the Netherlands' DPA, the Autoriteit Persoonsgegevens, raised serious concerns about LinkedIn's plans to leverage user data for training AI systems on an opt-out basis, indicating it has not yet been decided whether this complies with the GDPR. This statement has not been followed up on in any manner.

In October 2025, France's DPA, the Commission nationale de l'informatique et des libertés, issued a similar statement, urging individuals to use the opt-out possibilities while reserving its judgment on the validity of the legal basis relied on by LLM providers. However, this statement can no longer be found on the CNIL's website. 

All in all, European DPAs are more or less aligned on the position that digital service providers can rely on legitimate interest to use their user content for training LLMs, provided appropriate safeguards are in place and they offer an opt-out to their users. 

To this extent, the proposal in the Digital Omnibus is aligned with developing practice for first-party data and, in that sense, seems unnecessary. The proposed opt-out, however, is not limited to the situation where an LLM provider uses its first-party data; it also extends to the situation where the provider scrapes the internet.

Why the opt-out should not apply to scraped data

In an earlier article, I discussed that the EDPB, the CNIL and the European Data Protection Supervisor all indicated LLM providers can potentially rely on the legitimate interest basis when scraping personal data from public websites to train their LLMs, provided they apply suitable safeguards, including anonymization and pseudonymization of personal data immediately after collection.

The EDPB and CNIL initially floated the idea that an opt-out should also be provided in these cases, seemingly inspired by the copyright holder's right to opt out of text and data mining for LLM training purposes. In that context, not opting out means the copyright holder's content may be used by LLM providers for training. As noted above, providing an opt-out to all individuals whose personal data may be publicly available on the internet is not only a practical impossibility, but also undesirable from a data protection point of view.

Where many individuals will not bother to register relevant content on the opt-out list, this may well undermine their data protection rather than increase it: not registering means their data will be used for training purposes. This is a fundamental point. In many previous publications, including with the IAPP, I have advocated for data protection laws that move toward providing individuals with material data protection, rather than rights they do not understand and will not exercise in the first place.

It is noteworthy that the CNIL's suggestion to implement an opt-out database for internet scraping has been dropped from its updated guidance.

Publicly available is not the same as belonging to the public domain

A topic not covered by DPAs is that LLM providers seem to generally assume that when LLMs are trained on publicly available data from the internet, privacy problems are reduced for the obvious reason that the data is already public. However, publicly accessible does not mean that those personal data also belong to the public domain.

For example, information shared publicly by individuals on social media is usually shared in a certain context. Information may also be shared by others specifically to violate someone's privacy, as in the case of doxing. Public availability of data should therefore not be mistaken for data intended to belong to the public domain.

Humans naturally understand when it is appropriate to share sensitive information based on context. LLMs, however, are not currently designed to do this. Training an LLM for public use on data that was not intended for that level of public exposure violates the original privacy expectations. In other words, Nissenbaum's contextual integrity is not respected.

Ideally, we want LLMs to be trained solely on data that belongs to the public domain. The first Dutch-language LLM, GPT-NL, overcomes this issue by training the model only on data of public persons that belongs to the public domain. It does so by applying a narrow definition of public persons, namely only those with a Wikipedia page, and by training GPT-NL only on information relating to the public person's public role.

GPT-NL is not trained on scraped data, but solely on opted-in datasets of high-quality data, such as content from the Dutch government, national libraries, archives and the combined Dutch publishers, with any dubious content, for example gossip, excluded. This reflects the latest research insights, where scientists recommend developing LLMs trained exclusively on data explicitly intended for public use. Further innovation here is required.

New limited exemption for processing 'residual' special-category data

For an LLM to function, it needs to be trained on personal data of public persons insofar as these belong to the public domain. These data may contain special-category data: the fact that the Pope is Catholic, that King Charles has cancer, etc. These elements cannot be qualified as residual processing of special-category data, as they are left in on purpose and no safeguards are implemented to stop this information from being generated by the LLM in its outputs. The CNIL has already acknowledged this, and I recommend the Commission rethink this new proposed exemption.

As it stands, it does not simplify, it further confuses. 

Lokke Moerel is a professor of Global ICT Law at Tilburg University. This article reflects the personal opinion of the author only.