Editor's note: This article is part four of a series on key data protection issues posed by large language models. The first article discussed the interim report issued by the European Data Protection Board's ChatGPT Taskforce, which stated that LLMs may not be trained with special categories of data. The second article discussed whether individuals can be considered controllers for their own inputs and outputs when using LLM chatbots. The third article discussed the Hamburg and Danish data protection authorities' position that LLMs do not contain personal data, so that data subjects' rights cannot relate to the model itself, and that a model having been trained in violation of the EU General Data Protection Regulation does not affect the lawfulness of its use within an artificial intelligence system.
The issue of whether data subjects' rights to access, deletion and correction of their personal information should apply to training data presents a conundrum. EU supervisory authorities take the position that DSRs need to be upheld throughout the process of training large language models, while at the same time requiring LLM providers to anonymize the training data to be able to rely on the legitimate interest basis for using this data for LLM training purposes.
This raises the question: If an LLM provider complies with this guidance and removes personal data from training data to the maximum extent possible, what relevance do DSRs have to the remaining training data?
Legitimate interest basis
The European Data Protection Board, France's data protection authority, the Commission nationale de l'informatique et des libertés, and the European Data Protection Supervisor have all indicated LLM providers can potentially rely on the legitimate interest basis when scraping personal data from public websites to train LLMs, provided they apply suitable safeguards.
These include an extensive set of data minimization measures: first, preventing certain sources, such as social media profiles, pornographic sites or health forums, from being collected; second, filtering out certain data categories, such as banking transactions or geolocation data; and third, anonymizing and pseudonymizing personal data immediately after collection.
The DPAs are strictest on special categories of data. These need to be filtered out, either at the collection stage by not collecting the data in the first place, or immediately thereafter by deleting it prior to training LLMs.
Relevance of DSRs to training data
Where an LLM provider complies with the guidance and removes personal data from training data to the maximum extent possible, the relevance of DSRs in relation to such training data becomes a moot point. Yet the EDPB, the CNIL, the Hamburg Commissioner for Data Protection and Freedom of Information and EDPS all explicitly state in guidance that DSRs need to be upheld throughout the process of training LLMs.
The CNIL acknowledges anonymization may make it impossible to respond to DSRs and that Article 11 of the EU General Data Protection Regulation does not require the LLM provider to retain additional personal data to be able to identify individuals and respond to a DSR. The CNIL, however, then finds new ways in which DSRs could be serviced. For example, it suggests data subjects can provide an exact copy of the content to which their request pertains, on the basis of which the LLM provider can search its training data for the document.
The EDPB and CNIL even require LLM providers to maintain an opt-out list — in the CNIL's terminology a "push back" list — where individuals can list the websites from which they do not want personal information to be scraped for LLM training purposes. This basically amounts to LLM providers having to maintain an opt-out database for the population at large. The requirement goes beyond individuals' right to object under Article 21, which applies whenever a controller relies on legitimate interest as a legal basis for processing. Article 21 does not give the individual a discretionary right to object; the objection must be based on grounds relating to his or her particular situation, and it may be overridden if the controller demonstrates compelling legitimate grounds for the processing.
In a similar vein, the EDPS indicates that a traceable record of the processing of personal data should be kept, and datasets should be managed in a traceable manner, which may support the exercise of DSRs.
We see this guidance as a forced effort to keep DSRs relevant in the LLM training phase, while the real protection of nonpublic persons is achieved by anonymizing and pseudonymizing the training data and implementing safeguards that prevent LLM outputs from including their personal data. DSRs relating to training data will not deliver material data protection.
Even if personal data were deleted from the training data in response to a DSR, there is no guarantee that the output of newly trained LLMs would be affected. Deletion of personal data from training data also has no impact on LLMs that are already trained: in the current state of the art, it is not yet possible to train an already trained LLM to forget something, a capability known as "machine unlearning."
The new ways to facilitate responses to DSRs proposed by the EDPB, CNIL and EDPS may further prove counterproductive. Requiring traceable records of processed personal data, as suggested by the EDPS, may assist the LLM provider in re-identifying individuals, but at the same time undermines the anonymization efforts. Requiring individuals to subscribe to opt-out lists, as the EDPB and CNIL suggest, forces individuals to act, while the compliance efforts to protect individuals should be the responsibility of LLM providers.
The EDPB and CNIL's idea seems inspired by copyright holders' right to opt out from text and data mining for LLM training purposes. In that context, not opting out means the copyright holder's content may be used by LLM providers for training purposes. Because many individuals will not bother to register relevant content on the push back list, this may well end up undermining their data protection rather than enhancing it.
Not registering on the opt-out list will simply result in their data being used for training purposes. This is a fundamental point. In many previous publications, including with the IAPP, this author has advocated for data protection laws that move toward providing individuals with material data protection, rather than rights they do not understand and will not exercise in the first place. In a world of too many choices, the autonomy of the individual is reduced rather than increased.
As Cass Sunstein stated in "The Ethics of Influence," "autonomy does not require choices everywhere; it does not justify an insistence on active choosing in all contexts. ... People should be allowed to devote their attention to the questions that, in their view, deserve attention. If people have to make choices everywhere, their autonomy is reduced, if only because they cannot focus on those activities that seem to them most worthy of their time."
Personal data of public persons does not always have to be anonymized in training data
The first article in this series concluded DPAs are too absolute where they discuss the legal basis for LLM training based on the distinction between "regular" and "special categories of data." For LLMs, the more relevant distinction is between personal data of "public persons" and "non-public persons."
Currently available LLMs are generally trained on data that include information about public figures, including their publicly available special category data. Without those data, the LLMs would not be able to respond to prompts such as "Who is the King of the Netherlands?" "Who is the most popular rock star?" or "What is Obama's most famous quote?" nor to questions on sensitive topics such as "Was Michael Jackson convicted of a crime?" or "Does King Charles have cancer?"
The first article further concluded that the absolute position of the EDPB and CNIL that all special categories of data need to be filtered out is contrary to case law of the Court of Justice of the European Union. In line with GC and Others v CNIL, the prohibitions of GDPR's Articles 9 and 10 should apply to LLM providers, taking into account their roles, powers and capabilities.
The prohibitions should not apply as if the LLM provider has itself caused the special category of data to appear on the internet, as the activities of the LLM provider occur only after such data have already been published online. Accordingly, the prohibitions can apply to an LLM provider only because of the inclusion of such data in the output of an LLM, and therefore by means of an ex-post verification based on a DSR request of an individual. This includes any objection to processing filed by the individual based on personal grounds under Article 21.
For LLM providers, this means they need to facilitate requests to delete or block personal data in their LLM outputs, rather than already at the training phase. Upon such a request, the LLM provider should either block the relevant output or invoke an exception under Articles 9, 10 or 21 that allows the output to be retained.
As with search engines, the applicable exemption could be found in the substantial public interest in the fundamental right of any person, both natural and legal, to freedom of information under the EU Charter of Fundamental Rights. This seems like the right solution. Publicly available information on public persons should not be restricted by excluding new forms of distribution.
Data protection rights should still apply, but only in proportion to the impact of this new form of distribution. Though it may seem counterintuitive, this means LLM providers do not need to service DSRs of public persons in respect of training data. These data were already in the public domain, and DSRs can be effectuated at that level. Where LLM providers increase the impact by using these data as training data, the additional impact must be mitigated by servicing DSRs at the output level.
Personal data of nonpublic persons should be anonymized in training data
As set out in the first article in this series, DPAs are correct in requiring LLM providers to apply anonymization techniques to protect the personal data of nonpublic persons. Where there are many John Smiths and limited personal data available on nonpublic persons, chances of LLMs generating correct outputs diminish drastically.
Again, this already seems to be the practical conclusion of providers of the current generation of LLM chatbots. Most LLM chatbot providers have by now implemented safeguards to prevent outputs from including personal data of nonpublic individuals and to prevent their LLM chatbots from answering private or sensitive questions. Where personal data of nonpublic persons are inadvertently included in outputs, those individuals should be able to make DSR requests.
Again, this is not an obstacle in practice. Many LLM providers have dedicated channels in place to respond to DSRs by blocking outputs if they inadvertently contain personal data of nonpublic persons.
Note that the anonymization techniques applied to training data take a different form than what the DPAs have considered appropriate up until now. LLMs must ultimately be trained on personal data to be able to understand the concepts of persons and objects in sentences. For example, in the sentence "James's inspection of the car showed that it was time for an oil change," it is clear to a human reader that it is the car's oil that needs changing and not James's.
An LLM, however, requires extensive training data to be able to interpret sentences that address multiple persons and objects and accurately link parts of the sentence to such persons or objects. This means deleting all personal data from training sets makes it impossible to train an LLM to interpret natural language in a contextually relevant manner.
The techniques for performing contextual anonymization are still developing, but the long and short of it is that simply filtering out names does not preserve the required context. This is, however, not absolute: certain specific data elements can be filtered out based on their formatting, for example credit card numbers, email addresses, geolocation data and telephone numbers. Such elements can be removed from the training dataset without affecting the context required to train a functioning LLM.
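As a minimal sketch of this kind of format-based filtering (the patterns and placeholders below are my own illustrations, not the techniques used by any particular provider), elements recognizable purely by their formatting can be removed while the names and sentence structure that carry the context remain intact:

```python
import re

# Hypothetical, simplified patterns for elements recognizable by their formatting alone.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s\-()]{7,}\d"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scrub_formatted_elements(text: str) -> str:
    """Replace format-recognizable data elements with placeholders; keep the rest of the text."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REMOVED]", text)
    return text

sample = "James emailed james.smith@example.com from +31 6 1234 5678 about the oil change."
print(scrub_formatted_elements(sample))
# -> "James emailed [EMAIL REMOVED] from [PHONE REMOVED] about the oil change."
```

Note that the sentence structure, and thus the context an LLM needs to learn from, survives this kind of filtering, whereas blanket removal of all names would not leave a usable training example.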
Again: Why DSRs are not relevant to training data
LLM training datasets generally consist of large and unstructured sets of text. The data originally scraped have undergone many preparation and processing activities, including applying anonymization techniques, to make the set suitable for LLM training. These activities mitigate undue privacy impacts on individuals, but also make it difficult to respond to DSRs.
The CNIL's suggestion that individuals can provide a copy of specific documents, for the LLM provider to verify whether these are included in the training data, will not work in most cases. It is unlikely a PDF or other document is still included in its original form in the training set, while the content of the document may very well be part of the unstructured data. A response that the document is not included in the training data is, therefore, not a conclusive answer as to whether its content was used for training.
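A simplified illustration of why such a document-level check is inconclusive: the file as provided will rarely survive the preparation pipeline unchanged, yet fragments of its text may still sit in the unstructured training set. The corpus, document and fragment length below are all hypothetical.

```python
# Hypothetical, tiny "training corpus": unstructured text left after scraping and cleaning.
corpus = "biography of john smith born 1980 lives in utrecht works as a teacher"

# The document the data subject provides (e.g. text extracted from a PDF).
document = "Biography of John Smith.\nBorn 1980. Lives in Utrecht. Works as a teacher."

def normalize(text: str) -> str:
    # Lowercase, strip punctuation and collapse whitespace, mimicking data preparation.
    return " ".join(text.lower().replace(".", " ").replace("\n", " ").split())

# 1) Looking for the document in its original form almost always fails.
exact_hit = document in corpus                      # False

# 2) A content-level check on normalized fragments may still succeed.
words = normalize(document).split()
window = 5
content_hit = any(
    " ".join(words[i:i + window]) in corpus
    for i in range(len(words) - window + 1)
)                                                   # True

print(exact_hit, content_hit)  # False True: the file is "gone", its content is not
```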
This may be different for AI tools trained with images or videos, in which case the provider is more likely to be able to compare the provided image or video to the training dataset. This is also what certain providers of AI models trained with images already do; for example, OpenAI enables artists and creatives to provide a copy of content they do not want to be used for the next training run.
As previously stated, removing certain data elements from the training dataset does not necessarily change the LLM's outputs. As set out in the third article in this series, LLMs are trained to generate probabilistic responses. They do not rely on a database of pre-defined answers from which an answer, once removed, can no longer be provided as an output. As a result, deleting or correcting information in the LLM training dataset does not take away a potential impact on an individual caused by the outputs of the LLM.
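To make that distinction concrete, the toy sketch below (my own simplification, not how any actual LLM provider operates) contrasts a lookup table of pre-defined answers, where deletion is immediately effective, with a trivially small "trained" model whose learned parameters are untouched by later edits to the training text.

```python
from collections import defaultdict

training_text = "the king of the netherlands is willem-alexander".split()

# (a) Database of pre-defined answers: deleting the entry removes the output.
answers = {"king of the netherlands": "willem-alexander"}
del answers["king of the netherlands"]
print(answers.get("king of the netherlands"))  # None -- the answer is gone

# (b) "Trained" model: bigram counts stand in for learned parameters.
model = defaultdict(list)
for prev, nxt in zip(training_text, training_text[1:]):
    model[prev].append(nxt)

# Editing the training data afterwards does not touch what was already learned.
training_text = [w for w in training_text if w != "willem-alexander"]
print(model["is"])  # ['willem-alexander'] -- still produced until the model is retrained
```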
The reason that updates to the LLM training dataset do not automatically result in changed outputs is that — in the current state of the art — it is not possible to train an LLM to forget or unlearn something. Google issued a challenge in 2023, inviting AI scientists and experts to compete to create a model that can successfully learn and unlearn information. This is, however, not yet a practical possibility.
This means the only way to carry changes in the training data through to the LLM itself is to retrain the entire LLM on the updated training dataset. For sophisticated LLMs, such retraining generally takes several months — for example, the training of the BigScience LLM was expected to take three to four months.
The time and resources required for such retraining limit how often LLMs are retrained in practice, and thus how long it takes for any change in the LLM training dataset to have an impact on the actual outputs of the LLM in practice.
Why opt-out registers will, at best, prove unproductive
Providing nonpublic persons with DSRs for training data, or requiring LLM providers to set up opt-out registers, is not a practical way to provide material protection for individuals' personal information. As stated, removing specific data elements from the LLM training dataset has limited impact. A far more relevant solution is to implement specific safeguards at the output level, such as ensuring that certain prompts do not receive a substantive response or that certain outputs are blocked.
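As a minimal sketch of what such an output-level safeguard could look like (the blocklist, refusal text and generate() stub are hypothetical; production systems rely on trained classifiers rather than string matching):

```python
NONPUBLIC_PERSONS = {"john smith", "jane doe"}   # illustrative only
REFUSAL = "I can't share information about private individuals."

def generate(prompt: str) -> str:
    # Stand-in for a call to an actual LLM.
    return f"Here is what I found about: {prompt}"

def safeguarded_generate(prompt: str) -> str:
    # Block at the prompt stage: certain prompts do not get a substantive response.
    if any(name in prompt.lower() for name in NONPUBLIC_PERSONS):
        return REFUSAL
    # Block at the output stage: outputs containing protected names are withheld.
    output = generate(prompt)
    if any(name in output.lower() for name in NONPUBLIC_PERSONS):
        return REFUSAL
    return output

print(safeguarded_generate("Where does John Smith live?"))          # refusal
print(safeguarded_generate("Who is the King of the Netherlands?"))  # substantive answer
```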
DPA guidance stressing that DSRs should apply to LLM training data therefore basically puts individuals on the wrong footing as to the relevance of their rights. It leads to senseless DSR requests, which LLM providers will systematically answer as not serviceable, to the detriment of serviceable requests in respect of outputs, which do provide material data protection.
The digital revolution is transforming society as we know it. The advancement of generative AI requires us to holistically rethink how to provide meaningful data protection to individuals whose personal data are used for LLM training. In line with GC and Others v CNIL, this requires taking into account LLM providers' roles, powers and capabilities, and imposing data protection obligations that correspond to them.
Protection of both individuals' privacy rights and society's interest in reaping the benefits of the new technology can be achieved by applying anonymization techniques that prevent material privacy impacts while at the same time preserving the context in the training content that is needed for useful LLMs. AI itself plays an important role here, as it allows for more sophisticated anonymization techniques.
In this early phase of LLM development, DPAs would do well to collaborate with researchers to take inventory of the tools currently in use, assess their filtering and anonymization techniques, and share best practices. Focusing on nominal rights that do not provide protection in practice is not the way forward, and abandoning that focus does not create a gap in protection. The DPAs have extensive powers to request information from LLM providers on their data anonymization practices. In addition, the EU AI Office has the power under the AI Act to request information on LLM providers' "curation methodologies (e.g. cleaning, filtering, etc.)".
Lokke Moerel is a professor of Global ICT Law at Tilburg University. This article reflects the personal opinion of the author only.