Editor's note: This article is part of a series on key data protection issues posed by large language models. This first article discusses the interim report issued by the European Data Protection Board's ChatGPT Taskforce, which states that LLMs may not be trained with special categories of data.

In April 2023, the European Data Protection Board established its ChatGPT Taskforce to coordinate the various national enforcement actions taken by EU data protection authorities against OpenAI, the provider of the chatbot ChatGPT, which is powered by OpenAI's large language model.

The main issue of contention is whether OpenAI has a legal basis under the EU General Data Protection Regulation to scrape personal data from public websites to train its LLM.

Developing LLMs requires vast amounts of training data sourced by providers from publicly available websites through web scraping.

Because it is practically impossible to obtain consent from every individual whose personal data might be scraped, the only legal basis potentially available under the GDPR is legitimate interest, on which the U.K. Information Commissioner's Office issued a consultation in March 2024.

The scraped personal data may contain special categories of data — data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, health, and sexual orientation, for example — and criminal data, which require the provider to rely on one of the exceptions under GDPR's Articles 9 and 10.

First EU guidance on the 'lawfulness' of web scraping

The first EU guidance on the "lawfulness" of web scraping for purposes of training an LLM is included in the ChatGPT Taskforce's 24 May report. It takes an unusually pro-innovation approach, acknowledging that LLM providers can potentially rely on the legitimate interest basis when scraping personal data, while cautioning that this will require suitable safeguards to prevent privacy impacts for individuals.

Examples of such safeguards include "defining precise collection criteria and ensuring that certain data categories are not collected or that certain sources (such as public social media profiles) are excluded from data collection."  

The EDPB is most strict on the processing of special category and criminal data. The report states these data categories should be filtered out either at the collection stage — by avoiding collecting the data in the first place — or immediately thereafter — by deleting the data prior to training LLMs.
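The report does not prescribe how such filtering should be implemented technically. Purely as an illustration of the safeguards described above, a collection-stage filter might combine an exclusion list of sources with a screen for special category data; the domains, keywords and document structure in the sketch below are our own assumptions, not anything specified by the taskforce, and a production pipeline would rely on trained classifiers rather than keyword matching.

```python
# Illustrative sketch only: a collection-stage filter that (1) excludes
# certain sources, such as public social media profiles, and (2) drops
# documents that appear to contain special category data before they
# reach the training corpus. Domain and keyword lists are hypothetical.
from urllib.parse import urlparse

EXCLUDED_DOMAINS = {"facebook.com", "x.com", "instagram.com"}  # hypothetical

# Crude screen for GDPR Article 9/10 categories; real pipelines would use
# trained classifiers rather than keyword matching.
SPECIAL_CATEGORY_KEYWORDS = {
    "diagnosis", "religion", "political party", "sexual orientation",
    "ethnicity", "trade union", "criminal conviction",
}

def allowed_source(url: str) -> bool:
    """Return False for sources excluded from collection altogether."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)

def contains_special_category_data(text: str) -> bool:
    """Rough screen for special category data in a scraped document."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in SPECIAL_CATEGORY_KEYWORDS)

def filter_for_training(scraped_docs: list[dict]) -> list[dict]:
    """Keep only documents from permitted sources without flagged content.

    Each document is assumed to be a dict with 'url' and 'text' keys.
    """
    return [
        doc for doc in scraped_docs
        if allowed_source(doc["url"])
        and not contains_special_category_data(doc["text"])
    ]
```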

The report reflects only the common denominators agreed upon by the EU DPAs. Individual DPAs may hold different opinions and are not bound by the positions in the report. This is already evidenced by recent guidance issued by the Netherlands' DPA, Autoriteit Persoonsgegevens, which took the position that web scraping is almost always unlawful.

Web scraping of special category data is considered permissible only when the individual has given explicit consent or the information has been manifestly made public by the individual. This basically prohibits the collection of special category data for LLM training purposes. Obtaining consent from every individual whose data might be scraped is practically impossible, and in many cases the relevant special category data will not have been manifestly made public by the individuals themselves.

Is the EDPB's restrictive position on special category data justified?

The question is whether the restrictive positions of the EDPB and the AP on processing special category data are correct.

Training LLMs is impossible without including personal data about public figures. Without that data, the LLMs would not be able to respond to prompts such as "Who is the king of the Netherlands?," "Who is the most popular rockstar?," or "What is Obama's most famous quote?"

LLMs would also be unable to provide accurate responses about significant political or religious speeches, even though this information is publicly available through sources like Wikipedia and public news sites.

Publicly available information about public figures often includes special categories of data, such as their political opinions and religious beliefs, health status, sexual preferences, and racial or ethnic origin. It may also include information about criminal investigations, including allegations and acquittals.

Importantly, these types of data may not initially have been disclosed by the public figures themselves. Instead, such personal data is typically processed and published under the journalistic exemption provided by Article 85 of the GDPR.

A strict application of the GDPR's prohibitions under Articles 9 and 10, as advocated by the EDPB and the AP, would require LLM providers to filter out all names, including those of public figures, and ensure they are unrecognizable in the LLM's output. This would severely limit the utility of LLMs. Not surprisingly, currently available LLMs are generally trained on data that includes information about public figures, including their publicly available special category data. A strict interpretation that prohibits such data collection and processing essentially amounts to a ban on LLMs.

Strict interpretation is contrary to CJEU case law

The question of whether, and under what conditions, a party may process special categories of data that were previously published by a third party has already been answered by the Court of Justice of the European Union.

In GC and Others v. CNIL, the CJEU was presented with a similar conundrum in the context of search engine operators. Search results include references and links to press articles on public people, which in many cases include special categories of data. These press articles are initially published by news websites under the journalistic exemption. In the earlier Google Spain case, the CJEU ruled search engines cannot also rely on the journalistic exemption because they do not act solely for journalistic purposes.

In GC and Others v. CNIL, the CJEU was asked whether GDPR Articles 9 and 10 apply in full to search engines. If so, search engines would be de facto prohibited, as none of the exemptions apply. Personal data included in the referenced press articles are not always manifestly made public by individuals, and obtaining consent from all individuals included in search results is impossible in practice.

The CJEU was faced with a binary choice. The individuals who had requested the search engine operator to dereference search results containing their special categories of data argued that the prohibitions of Articles 9 and 10 apply in full to search engine operators. The search engine operator argued the diametric opposite: the prohibitions should not apply at all, because the search engine merely references articles already published by third parties and therefore does not require a new exception.

In its judgment, the CJEU basically refers to the opinion of its advocate general, which is more extensive in its reasoning and therefore worth discussing. The advocate general, confronted with the binary choice presented by the parties, considered that a full application of the prohibitions of Articles 9 and 10 would be prohibitive for search engines and therefore cannot be the correct interpretation. It would result in the search engine operator having to vet all published articles before displaying a link, which the advocate general considered "neither possible nor desirable."

The opinion states that a literal application of the prohibition on processing special category data "would require a search engine to ascertain that a list of results displayed following a search carried out on the basis of the name of a natural person does not contain any link to internet pages comprising data covered by that provision, and to do so ex ante and systematically, that is to say, even in the absence of a request for de-referencing from a data subject. To my mind, an ex-ante systematic control is neither possible nor desirable."

The advocate general considered that the prohibitions should not apply as if the search engine had itself caused the special categories of data to appear in the referenced internet pages, because search engines intervene only after such data has been published online.

Accordingly, the prohibitions do apply to search engine operators, but only because of their referencing. Therefore, a search engine does not have to systematically verify whether it can rely on an exemption before displaying a link but can do so ex post based on a dereferencing request by the data subject.

The CJEU, explicitly referring to its advocate general, subsequently rejected the all-or-nothing approach of the parties and ruled that the prohibitions can only apply to the operator in regard to its specific responsibilities, powers and capabilities.

The decision states that the "operator of a search engine is responsible not because personal data referred to in those provisions appear on a web page published by a third party but because of the referencing of that page and in particular the display of the link to that web page in the list of results presented to internet users following a search on the basis of an individual’s name, since such a display of the link in such a list is liable significantly to affect the data subject’s fundamental rights to privacy and to the protection of the personal data relating to him."

The CJEU then concluded that, in those circumstances, the prohibitions "can apply to that operator only by reason of that referencing and thus via a verification, under the supervision of the competent national authorities, on the basis of a request by the data subject."

In other words, if a search engine receives a dereferencing request, it then needs to verify whether it had a valid exemption to process such special category data in the first place. According to the CJEU, an applicable exemption can be found in Article 9(2)(g) of the GDPR, which enables processing necessary for reasons of substantial public interest on the basis of EU or member state law. According to the CJEU, a relevant basis in EU law can be found in Article 11 of the EU Charter of Fundamental Rights, which enshrines the freedoms of expression and information.

In sum, the search engine operator cannot rely on the journalistic exemption to publish special category data, but it can rely on the fundamental right of any person — both natural and legal — to the freedom of information. This seems the best way forward. Access to publicly available information on public people should not be restricted by excluding new forms of distribution. Data protection rights should still apply, but only according to the impacts of such new forms of distribution.

Applying the CJEU case law to LLM providers

Like search engine operators, LLM providers do not publish special categories of data of public people on websites. In a similar vein, LLM providers are responsible not because special category data appears on a webpage but because such data is scraped and used for training an LLM, which may cause the data to appear in the LLM's output when an individual includes a prompt requesting such information.

In this sense, LLM outputs following a prompt are comparable to search engine results following a search query. This is also how LLM chatbots are often used, as a search engine "on steroids," whereby the output summarizes the response to a prompt, rather than providing links to relevant web pages in response to a search query. This is also evidenced by the implementation of LLM chatbots as co-pilot functionality by search engines like Bing.

In line with the CJEU's decision, the prohibitions of GDPR Articles 9 and 10 should apply to LLM providers, taking into account their roles, powers and capabilities. The prohibitions should not apply as if the LLM provider has itself caused the special category data to appear on the internet, as the activities of the LLM provider occur only after such data has already been published online.

Accordingly, the prohibitions can apply to an LLM provider only because of the inclusion of such data in the output of an LLM, and therefore by means of an ex post verification based on the request of an individual.

For LLM providers, this means they need to facilitate requests to delete or block personal data in their LLM outputs. The LLM provider should either block the relevant output or invoke an exception under GDPR Articles 9 and 10 that allows for the output to be retained. As with search engines, the applicable exemption could be found in the substantial public interest in the freedom of information under the EU Charter of Fundamental Rights.
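The GDPR does not dictate a particular technical mechanism for this ex post verification. One way to picture it, on our own assumptions rather than anything mandated by the CJEU or the EDPB, is an output filter that checks generated text against a register of upheld deletion or blocking requests before it is returned to the user; the register structure and the simple name matching below are hypothetical and would need to be far more robust in practice.

```python
# Hypothetical sketch of an ex post blocking layer: before an LLM response
# is returned, it is checked against a register of data subjects whose
# deletion or blocking requests have been upheld. Matching here is a
# simple substring check on names, which a real system would refine.
from dataclasses import dataclass, field

@dataclass
class BlockingRegister:
    """Register of names for which output must be suppressed."""
    blocked_names: set[str] = field(default_factory=set)

    def add_request(self, name: str) -> None:
        """Record an upheld request by a data subject."""
        self.blocked_names.add(name.lower())

    def screen(self, generated_text: str) -> str:
        """Return the output, or a refusal if it mentions a blocked person."""
        lowered = generated_text.lower()
        for name in self.blocked_names:
            if name in lowered:
                return "I cannot share personal information about this individual."
        return generated_text

# Example usage; the name is invented.
register = BlockingRegister()
register.add_request("Jane Doe")
print(register.screen("Jane Doe was reportedly treated for ..."))  # suppressed
print(register.screen("The king of the Netherlands is ..."))       # passes through
```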

Concluding thoughts

DPAs also take an absolute approach when they discuss the legal basis for processing, relying on the distinction between processing "regular" personal data and special categories of data. For LLMs and search engines, the more relevant distinction is between personal data, whether regular or special category, of public and nonpublic people.

In the public interest, public individuals must accept that many aspects of their lives can be reported on, including sensitive topics like their health or criminal investigations. This does not automatically apply to nonpublic people. In fact, it can be questioned whether there is a legitimate interest to include any personal data of nonpublic people in LLM outputs.

When there are many John Smiths and only limited personal data is available on nonpublic people, the chances of LLMs generating correct outputs diminish drastically. This seems to be the practical conclusion of providers of the current generation of LLM chatbots, which appear to no longer include personal data of nonpublic people in outputs, and rightly so.

The above should not be taken in an absolute manner. As also required by the EDPB taskforce, LLM providers are responsible for implementing suitable safeguards in their tools and algorithms.

Also, for public people, LLM providers should consider which data categories and sources are necessary for a properly functioning LLM. For example, sensitive data such as Social Security numbers, passport details, credit card information and precise location data are not relevant to the public roles of individuals and should be filtered out of the LLM training dataset, during collection where possible and otherwise immediately thereafter.
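How such filtering is carried out is likewise left open. As a simplified assumption on our part, identifiers with a fixed format, such as U.S. Social Security numbers or credit card numbers, can be caught with pattern matching during preprocessing, while less structured sensitive data requires dedicated detection tooling; the patterns below are deliberately crude examples.

```python
# Simplified, illustrative redaction of fixed-format identifiers from
# training text. The patterns are examples only; production pipelines
# typically combine regular expressions with dedicated PII-detection tools.
import re

PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),              # e.g. 123-45-6789
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),   # crude 13-16 digit match
    "COORDINATES": re.compile(r"\b\d{1,3}\.\d{4,},\s*\d{1,3}\.\d{4,}\b"),
}

def redact_identifiers(text: str) -> str:
    """Replace matched identifiers with a placeholder before training."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REMOVED]", text)
    return text

print(redact_identifiers("Card 4111 1111 1111 1111 was used at 52.3676, 4.9041."))
# -> "Card [CREDIT_CARD REMOVED] was used at [COORDINATES REMOVED]."
```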

To ensure accuracy, LLM providers should further ensure that they scrape data from reputable sources rather than indiscriminately gathering information from social media and gossip websites.
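Again as an assumption rather than an established practice, this can be operationalized by inverting the exclusion list sketched earlier and admitting only pre-vetted domains into the corpus, so that unknown sources are rejected by default; the domains below are placeholders.

```python
# Hypothetical allowlist approach: only documents from pre-vetted domains
# are admitted into the training corpus. This is stricter than an exclusion
# list, because sources that have not been reviewed are dropped by default.
from urllib.parse import urlparse

VETTED_DOMAINS = {"wikipedia.org", "reuters.com"}  # placeholder examples

def from_vetted_source(url: str) -> bool:
    """Return True only for URLs hosted on a pre-vetted domain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in VETTED_DOMAINS)

urls = [
    "https://en.wikipedia.org/wiki/Willem-Alexander_of_the_Netherlands",
    "https://gossip.example/celebrities",
]
print([u for u in urls if from_vetted_source(u)])  # keeps only the Wikipedia page
```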

With this information, a first version of the LLM can be trained. It should then be updated over time to remove outputs that individuals have objected to and other outputs that were flagged as inappropriate.

Lokke Moerel is senior of counsel at Morrison Foerster and Professor of Global ICT Law at Tilburg University. Marijn Storm is of counsel at Morrison Foerster.