Editor's note: This article is part three of a series on key data protection issues posed by large language models. The first article discussed the interim report issued by the European Data Protection Board's ChatGPT Taskforce, which stated that LLMs may not be trained with special categories of data. The second article discussed whether individuals can be considered controllers for their own inputs and outputs when using LLM chatbots.
In May 2024, the Hamburg Commissioner for Data Protection and Freedom of Information published a Discussion Paper on large language models and personal data that derived three main principles.
First, the DPA determined LLMs do not "store" personal data and therefore data subjects' rights cannot relate to the model itself. Second, insofar as personal data is processed in an "LLM-supported AI system," the processing must comply with the requirements of the EU General Data Protection Regulation, and data subjects' rights can apply to the input and output of the artificial intelligence system. These data subjects' requests should be serviced by the deployers of the LLM-supported AI system — rather than the provider of the LLM. Third, the DPA determined that if LLMs are trained in violation of the GDPR, the lawfulness of using the models within an LLM-supported AI system is not affected.
The Hamburg DPA referred to October 2023 guidance from Denmark's data protection authority, Datatilsynet, on the use of AI by public authorities, which also states that AI models do not constitute personal data in themselves, because they are only the result of the processing of personal data. Datatilsynet compared the models to statistical reports, which are the result of processing personal data but do not constitute personal data by themselves.
Although the DPAs do not make a clear determination, the result of their position is that the users of an LLM-supported AI system are the sole controllers for inputs and outputs and that they should respond to related data subjects' rights requests.
For purposes of this article, we state upfront that it is not in dispute that LLMs consist of a collection of tokens, which are mathematical representations only and which, from a technical perspective, do not "store" personal data. LLMs are not databases from which outputs are pulled. Outputs of LLMs are probabilistic in nature — and not reproductions of training data — and can therefore be inaccurate.
Further, the outputs of the current generation of LLM chatbots will only contain personal data in very limited cases. Most LLM chatbot providers have by now implemented safeguards to prevent outputs from including personal data of nonpublic individuals. We further see providers increasingly implement safeguards to prevent their LLM chatbots from answering private or sensitive questions. Outputs of LLMs, however, may still contain personal data on public persons, for example when responding to prompts such as "Who is the King of the Netherlands?," "Who is the most popular rock star?," or "Provide a biography of Obama."
Do LLMs store personal data?
The question of whether LLMs "store" personal data is not discussed in the May 2024 Report of the European Data Protection Board's ChatGPT Taskforce. And rightly so. The GDPR is technology neutral and its definitions are functional rather than technical. The Hamburg and Danish DPAs apply a narrow technical definition and conclude that the content of an LLM — that is, the tokens — does not constitute personal data.
This definition is contrary to longstanding case law and guidance, which provides that information, such as software code or a collection of tokens, can qualify as personal data not only by virtue of its content, but also where its purpose or result is that it may have an impact on individuals. Where new tools cause additional impacts on individuals, the providers of those tools qualify as controllers and are responsible for compliance with the GDPR, including data subjects' rights requests.
Our first article drew the parallel between LLMs and search engines, which is also compelling here. In GC and Others v. CNIL, the Court of Justice of the European Union did not consider whether search engines themselves store personal data. This is not a relevant question. The CJEU considered search engine operators to be controllers for the processing of personal data "by users" of the search engine "since the display of links following a search on the basis of an individual's name is liable to significantly affect the data subjects' rights to privacy and data protection," this in addition to the original publication of the personal data on the websites of third parties.
According to the CJEU, it would be contrary "to the objective of the GDPR — which is to ensure, through a broad definition of the concept of 'controller,' effective and complete protection of data subjects — to exclude the operator of a search engine from that definition on the ground that it does not exercise control over the personal data published on the web pages of third parties." The CJEU concludes that search engine providers, therefore, are controllers for the processing of personal data by the users and are responsible for data subjects' requests.
In a similar vein, LLM providers are responsible for the processing of personal data by users when their LLM is used as a chatbot. This use may significantly affect the rights of data subjects, since personal data may appear in the output following a prompt, for example one based on an individual's name. Also, this impact is in addition to the original publication of the personal data on websites of third parties, which the LLM providers scrape and include in their training datasets.
In this sense, outputs of LLM chatbots following a prompt are comparable to search results following a search engine query. This is why the Danish DPA's comparison of an LLM and a statistical report is not on point. You cannot query a statistical report based on an individual's name to produce personal data. An LLM chatbot can be asked questions relating to public individuals, which may result in output containing personal data about the individual. In other words, an LLM can qualify as personal data because it is used as an LLM chatbot which may produce such personal data upon request.
A discussion of whether LLMs "store" personal data without considering the purpose of their use and the resulting impact on individuals is therefore not relevant. If the Hamburg and Danish DPAs' guidance were followed, LLM providers would not be responsible for any inaccurate outputs relating to public persons. This would lead to a gap in data protection in cases where the LLM providers make their LLMs available via an API to third-party deployers ("closed-source").
In these cases, only the LLM providers can implement safeguards to prevent such inaccurate outputs, for example, by training the LLM not to respond to certain prompts or by blocking certain output. The conclusion would be different if the LLM providers make their LLM available on an open-source basis, as then deployers can implement the required safeguards themselves.
In Google Spain, the CJEU ruled that the objective of the GDPR is to ensure, through a broad definition of "controller," effective and complete protection of data subjects. The EDPB also adopted this position in its Guidelines on the concepts of controller and processor in the GDPR. Such effective and complete protection can only be achieved if LLM providers are considered joint controllers for the use of their LLMs and are responsible for data subjects' requests. Interestingly, this does not cause resistance or prove difficult in practice. To the contrary, we see that the main LLM providers already have dedicated channels in place to respond to data subjects' requests by blocking outputs that are incorrect, and rightly so.
Summary of the Hamburg DPA's arguments
Tokens are mathematical representations only. According to the Hamburg DPA, LLMs do not contain personal data because the data used to train the LLMs is transformed into abstract mathematical representations and probability weights. The Hamburg DPA considers this to be a different type of information from what the CJEU has considered to be personal data, such as dynamic IP addresses, vehicle identification numbers, or other coded character strings like the transparency and consent strings used to capture digital advertising consents. Unlike the identifiers addressed in CJEU case law, LLMs include individual tokens as language fragments — such as "M," "ia," " Mü" and "ller" instead of the name Mia Müller. Such tokens lack individual information content and therefore do not serve as placeholders for an identity.
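To make the tokenization point concrete, the minimal Python sketch below splits a name into subword fragments using a toy, made-up vocabulary. The fragments and integer IDs are invented for illustration; real LLM tokenizers learn much larger vocabularies from data, but the principle is the same: text becomes a sequence of numeric IDs for fragments, not a stored name.

```python
# Illustrative only: a toy vocabulary mapping language fragments to integer IDs.
# Real tokenizers (e.g., byte-pair encoding) learn these fragments from data.
TOY_VOCAB = {"M": 101, "ia": 202, " Mü": 303, "ller": 404}

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match tokenization of text into integer token IDs."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"No token covers position {i} in {text!r}")
    return ids

print(tokenize("Mia Müller", TOY_VOCAB))  # [101, 202, 303, 404]
```

The model itself operates only on such ID sequences and the weights learned over them; whether that nevertheless "relates to" an individual is the legal question discussed below.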
Output is probabilistic, not reproduction. LLMs do not learn text or personal data; they learn correlations between tokens based on probability weights. Because tokens are only correlated based on probability, they do not constitute personal data under the GDPR.
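The probabilistic nature of outputs can be illustrated with an equally minimal sketch. The token scores below are invented for illustration; at each step, a real LLM computes such scores for every token in its vocabulary and samples the next token from the resulting probability distribution, which is why the same prompt can yield different continuations.

```python
import math
import random

# Hypothetical scores ("logits") for a handful of candidate next tokens.
logits = {"Willem": 3.4, "Amsterdam": 2.1, "banana": -1.0}

def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Turn raw scores into a probability distribution over tokens."""
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

probs = softmax(logits)
# Sampling (rather than looking up stored text) is what makes outputs
# probabilistic and occasionally inaccurate.
next_token = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs, "->", next_token)
```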
Risk of personal data extraction not relevant. The Hamburg and Danish DPAs acknowledge the risk of personal data being retrieved from an LLM via extraction attacks, model inversion attacks, or membership inference attacks. The Danish DPA does not consider these risks to mean that the LLM qualifies as personal data. The Hamburg DPA further acknowledges that LLMs can reproduce training content but considers that the mere presence of personal data in LLM outputs is not conclusive evidence that this data has been memorized by the LLM. According to the DPA, an instance of output matching personal data could also be a coincidence. Further, in relation to the above methods used to extract personal data from LLMs, the Hamburg DPA considers that such extraction methods are "not allowed by law" and thus are not considered by the CJEU as a method of identification.
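For readers unfamiliar with these attacks, the sketch below illustrates the basic intuition behind a membership inference test: text the model memorized during training tends to receive a noticeably higher likelihood than unseen text. The function model_token_logprobs, its numbers, and the threshold are all hypothetical placeholders; real attacks are considerably more sophisticated.

```python
from statistics import mean

def model_token_logprobs(text: str) -> list[float]:
    # Hypothetical stand-in for querying the target model for the
    # log-probability it assigns to each token of the candidate string.
    # We pretend memorized text gets values near 0 and unseen text does not.
    return [-0.4, -0.7, -0.3, -0.5] if "Müller" in text else [-4.2, -3.8, -5.1]

def looks_memorized(candidate: str, threshold: float = -2.0) -> bool:
    """Flag candidate strings the model assigns an unusually high likelihood."""
    return mean(model_token_logprobs(candidate)) > threshold

print(looks_memorized("Mia Müller, born in ..."))         # True in this toy setup
print(looks_memorized("a sentence the model never saw"))  # False in this toy setup
```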
All three arguments are contrary to CJEU case law and EDPB guidance
Tokens are mathematical representations only. The DPAs' interpretation of what constitutes personal data is contrary to the guidance of the EDPB's predecessor, the Article 29 Working Party, in its 2007 Opinion on the concept of personal data. The WP29 discusses the definition in Article 4(1) of the GDPR, which breaks down into three elements: "any information"; "relating to"; and "an identified or identifiable natural person."
"Any information." The opinion explains that the term "any information" calls for a wide interpretation of the concept of personal data, which includes any information "regardless of the nature or content of the information, and the technical format in which it is presented," which also includes "information stored in a computer memory by means of binary code." This means the tokens included in an LLM qualify as "information."
"Relating to." The opinion provides that information can "relate to" a data subject by virtue of: 'content', which applies to information that is "about" a data subject; 'purpose', which applies to information that is used or likely to be used with a purpose to evaluate, treat in a certain way, or influence the status or behavior of a data subject; or 'result', which applies to information if the use thereof is likely to have an impact on an individual's rights and interests.
The opinion makes it clear that these are alternative rather than cumulative criteria, stating "in order to consider that the data 'relate' to an individual, a 'content' element or a 'purpose' element or a 'result' element should be present."
Where the Hamburg DPA considers that tokens "lack individual information content and therefore do not serve as placeholders for an identity," only one element — "content" — is considered. Also applying the purpose and result criteria leads to a different conclusion. The information in the LLM — the "tokens" — can relate to an individual due to the purpose and result, because personal data may appear in the LLM chatbot's output following a prompt on the basis of an individual's name or a query such as "Who is the King of the Netherlands?"
"An identified or identifiable person." The opinion specifically clarifies that it is not the "information" that needs to identify a person, identifiability can also be caused by other means, such as by using the LLM chatbot.
When assessing LLMs against these three elements, the conclusion is that the tokens qualify as information that may relate to a data subject due to purpose and result, because personal data may appear in the output of the chatbot following a prompt relating to a public person. The tokens themselves do not need to identify the person; the identification can also be brought about by other means, that is, the LLM chatbot.
Output is probabilistic, not reproduction. It is not in dispute that the outputs of LLMs are probabilistic in nature and often inaccurate, and that LLMs are not databases from which outputs are pulled. The Hamburg DPA's conclusion, however, that LLMs therefore do not constitute personal data, is not correct.
The GDPR is about ensuring that companies providing tools that can create outputs containing information about individuals keep those outputs accurate. If inaccurate personal information did not constitute personal data, the GDPR's accuracy requirements would be moot. The report of the EDPB's ChatGPT Taskforce rightfully states that the principle of accuracy also applies to information of a probabilistic nature:
"As a matter of fact, due to the probabilistic nature of the system, the current training approach leads to a model which may also produce biased or made up outputs. In addition, the outputs provided by ChatGPT are likely to be taken as factually accurate by end users, including information relating to individuals, regardless of their actual accuracy. In any case, the principle of data accuracy must be complied with."
See in a similar vein the June 2024 guidance issued by France's data protection authority, the Commission nationale de l'informatique et des libertés.
Risk of personal data extraction not relevant. The reasoning that the existence of security measures to prevent leakage of personal data from LLMs means that such LLMs do not contain personal data turns the GDPR upside down.
The GDPR applies if personal data can be leaked and requires security measures to prevent that from happening. The fact that security measures are successful does not mean that the GDPR no longer applies.
Similarly, pseudonymizing data does not bring personal data outside the purview of the GDPR. It is the other way around: pseudonymization techniques are privacy-by-design requirements necessary to minimize privacy impacts.
Concluding thoughts
The DPAs' positions that LLMs do not "store" personal data and that data subjects' rights therefore cannot relate to an LLM, and that the deployers of an LLM chatbot — rather than the provider of the LLM — should service data subject requests, do not hold.
If this guidance were followed, LLM chatbot providers would not be responsible for any inaccurate outputs. This would lead to a gap in data protection, as — in most cases — only the providers of the LLM chatbots can respond to data subjects' requests by either training the LLM not to respond to certain prompts or by blocking outputs of their LLM chatbots that are incorrect.
This is supported by the recent opinion of the joint body of the Conference of the Independent Data Protection Authorities of Germany, the Datenschutzkonferenz, which, in a 2 Sept. press release, stated that LLMs trained with personal data form the core of any AI system and that "(a)nyone who analyzes their functional conditions in more detail from a legal and technical perspective will find that even here, in the vast majority of cases, the processing of personal data cannot be ruled out." (Author translation.)
We also question the third position of the Hamburg DPA, that training an LLM in violation of the GDPR does not affect the lawfulness of using the model within an LLM-supported AI system. Where new tools cause additional impacts on individuals, the providers of those tools are responsible for compliance with the GDPR, including preventing bias in the outputs of their models, ensuring adequate security measures to prevent leakage of personal data, and implementing safeguards to prevent their LLM chatbots from providing information about nonpublic persons, among other obligations.
Again, this does not seem controversial in practice. The main LLM chatbot providers are already undertaking significant efforts to increase accuracy, prevent hallucinations and bias and ensure jailbreak robustness, and rightly so.
Where it is established that a specific LLM is designed and trained in violation of the GDPR, it is difficult to see how this would not affect the lawfulness of the further use of such an LLM by a third-party controller within an LLM-supported AI system, as is the position of the Hamburg DPA. Any further use of such an LLM would then likely not comply with the GDPR's fairness principle and accountability requirements.
Lokke Moerel is senior of counsel at Morrison Foerster and Professor of Global ICT Law at Tilburg University.
Marijn Storm is of counsel at Morrison Foerster.