Large language models offer benefits such as improved efficiency, productivity, research and innovation. This has also been recognized by the European Data Protection Board taskforce in its report published 23 May. All LLMs are developed and trained using datasets.
There is currently a strong public debate in Norway on the development of LLMs, raising important privacy questions, especially in light of the EDPB taskforce report and the investigations into OpenAI's processing of personal data in the context of its chatbot, ChatGPT. Some argue consent, not legitimate interest, is the appropriate legal basis.
Five stages of development
The use and development of any artificial intelligence technology involves processing of data, and sometimes personal data.
However, an LLM will typically not retain the information it has learned from, or linked to, an individual in a retrievable format, so it is debatable to what extent the EU General Data Protection Regulation applies.
The LLM is not a database with retrievable information but does contain language patterns. Nevertheless, if personal data is actually processed, a legal basis set out in the GDPR is required.
The taskforce divided the development of an LLM into five stages, of which the collection of training data, pre-processing (including filtering) and training are relevant to the discussion of legal basis.
Can legitimate interest be used to develop AI models?
Companies developing LLMs typically use legitimate interest as a legal basis. This is understandable, as obtaining consent from every individual whose data might be used in training the model is logistically challenging, if not impossible.
Although the EDPB taskforce does not give a definitive answer as to what legal basis could be appropriate, it leaves legitimate interest open as a possibility. Its use would require a careful balancing test by the controller. Depending on how the LLM is actually deployed and whether it serves the broader public good, the controller may have such a legitimate interest.
However, there are numerous ways of developing LLMs for different purposes, meaning the balancing test could land differently based on the specific facts in each case.
While the taskforce identified that legitimate interest could be used as a legal basis in the first three stages, there are other interesting takeaways, too.
The taskforce does not distinguish between third-party, first-party, public and private data
The taskforce report focuses on ChatGPT and its developer OpenAI, and as a result, on the collection, pre-processing and training based on data obtained through "web scraping," that is, third-party data.
Web scraping involves automatic collection and extraction of data from publicly available sources on the internet. The collecting entity obtains data from individuals it does not have a direct relationship with.
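For illustration only, the following is a minimal sketch of what web scraping can look like in practice, assuming Python with the widely used requests and beautifulsoup4 libraries; the URL and the extraction logic are hypothetical and are not drawn from the taskforce report or from any specific developer's data collection pipeline.

    # Minimal illustrative web-scraping sketch (hypothetical example).
    # Requires: pip install requests beautifulsoup4
    import requests
    from bs4 import BeautifulSoup

    # A hypothetical publicly available page; the collecting entity has no
    # direct relationship with the individuals whose text appears on it.
    url = "https://example.com/public-reviews"

    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Extract visible text snippets, such as user comments, which may
    # incidentally contain personal data (names, opinions, locations).
    comments = [p.get_text(strip=True) for p in soup.find_all("p")]

    for comment in comments:
        print(comment)

Collected at scale across many such sources, snippets like these form the raw material that is later filtered during pre-processing and used for training.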
The taskforce does not address whether its assessment and considerations would differ in cases where the data used in the LLM is collected from the entity's own customers, that is, first-party data. But when training an LLM, does the distinction between first-party and third-party data matter?
To the extent first-party data is public, the distinction between first-party and third-party data seems irrelevant: the third-party data of one controller is, by definition, the first-party data of another controller, and first-party data a user has made public becomes third-party data for any other company collecting public data.
The public or private nature of the personal data does, however, play an important role. Privacy expectations would be different if the first-party data were private, and it also matters in which context these private first-party datasets were generated: in the context of using a specific AI system, for example the prompts of a chatbot, where the data is needed to deliver a service, improve the AI system at hand or continue training the LLM, or in another context unrelated to the use of a specific AI system, where the data is used to train an LLM.
While the fundamental GDPR principles apply to first-party and third-party data, and to private and public personal data alike, their implementation differs primarily in terms of transparency requirements when the controller has a relationship with the user.
The balancing test involves assessing various elements: the controller should consider, among other things, its relationship with the individuals, whether they could reasonably expect their data to be used for the relevant processing activity, the actual impact of the processing and whether an opt-out mechanism is implemented.
From a privacy perspective, the lack of a direct relationship with the relevant data subjects entails a lower degree of control over the transparency offered to them. A company that uses its first-party data in LLM training or AI product development would be better equipped to readily inform its customers or users of the intended processing activity and to impose adequate safeguards, such as opt-out solutions and tools for managing data subject requests.
The taskforce's assessment of adequate safeguards
Further, the taskforce notes adequate safeguards play a special role in reducing undue impact on data subjects and that such safeguards may change the balancing test in favor of the controller. It states that excluding certain "sensitive" sources, such as public social media profiles, from data collection could be such a safeguard.
Interestingly, the taskforce does not provide any reasoning based on the GDPR for why public social media profiles, where people express themselves and publicly share their opinions, should be excluded as sources for identifying language patterns in LLMs. This is a weakness of the report, perhaps because the taskforce focuses on the third-party data obtained by OpenAI.
Generally, the GDPR does not prohibit processing of personal data from social media platforms. Processing of such data could impact the balancing test, in particular, if this data is kept private by the individuals.
At first glance, there could seem to be good reasons from a privacy perspective for OpenAI to exclude social media data from collection.
However, the fact that there is no direct interaction between OpenAI and the individuals applies to any third-party data; it is not particular to social media data. Moreover, if individuals using social media platforms might not have expected OpenAI to collect their data through web scraping, the same would apply to any other publicly available digital resource, such as comments rating hotels or restaurants, newspaper articles or a personal blog.
In any event, the use of third-party data, whether originating from social media or otherwise, raises the question of how to provide sufficient information to such individuals regarding the envisaged training and the opt-out procedure. These considerations would be relevant in the balancing test. Limiting web scraping of third-party data from the kinds of sources that are more likely to be personal data intensive, for example, identifiable public profiles or private information rather than public comments, could therefore be a safeguard that changes the balancing test in favor of the controller.
In other words, there are good reasons for assessing and excluding specific sources when using third-party data. But does the same apply where a controller intends to use its first-party data for the training of its own LLMs? And, again, does it matter whether this first-party data is private or public?
Of course, the answer would depend on the various factors that must be considered in the balancing test. However, there are good reasons for not interpreting the taskforce's wording too literally. After all, the report focuses on the processing of third-party data and on ChatGPT specifically.
From a privacy perspective, the processing of first-party data would not necessitate use of the exemption pursuant to Article 14(5)(b) of the GDPR. The controller would be able to provide users with information, and the possibility to opt out of the processing, in advance. Irrespective of whether the data at hand is first-party or third-party data, the opt-out procedure must be fair.
The combination of providing information and an opt-out possibility would enable data subjects to exercise their rights more easily, with greater control and increased transparency.
Also, for first-party data, a contractual relationship between the controller and the user would typically exist. In contrast to data processing where such a relationship is absent, a controller processing data from users with whom it has a contractual relationship could uphold a higher level of transparency.
Given the inherent differences between the processing of first-party and third-party data, and between the private and public nature of this data, there is no good reason to categorically exclude the processing of first-party data for the development of LLMs, irrespective of whether the data originated via social media or otherwise.
We would argue such use may, in many cases, be based on legitimate interest.
Norway's data protection authority, Datatilsynet, is addressing the question of which legal basis is correct. It appears the DPA will coordinate the outcome of this discussion with data protection authorities in the European Economic Area and possibly with the EDPB. It is an important discussion: the views of the DPAs and the opinions of the EDPB will have a large impact on how Europe may develop AI going forward.
In this case, making consent the only possible legal basis would hardly strengthen the EU's ability to develop sound AI tools. We hope legitimate interest, based on a sound balancing test and a fair opt-out solution, will be acknowledged.
Anna Olberg Eide is a senior associate at the law firm Schjødt.