Editor's note: The IAPP is policy neutral. We publish contributed opinion and analysis pieces to enable our members to hear a broad spectrum of views in our domains.

European Meta users were notified in April that the company would begin training its artificial intelligence models on public content shared on its platforms. Because Meta considered this processing of personal data to rest on the legal basis of legitimate interests rather than consent, the company offered users a one-month period to object to their public data being used for training purposes.

The AI training involved public information, such as posts and comments of Instagram and Facebook users in the EU — some published years ago, when AI existed mostly in science fiction.

Meta's actions did not go without response. The privacy advocacy group NOYB sent a "cease and desist" letter over Meta's AI training and raised the possibility of a class-action lawsuit in the European Union, with proceedings still ongoing.

While this may grow into another saga like the one involving EU-U.S. data transfers, the question remains: what is an appropriate legal basis for training an AI model? If legitimate interest is not appropriate, then what is? Will the EU AI Act add more complexity to this?

First-party and third-party data vs. web scraping

According to the European Data Protection Board, there are two main approaches to obtaining data for training an AI model. The first is obtaining the data locally, that is, using the organization's own data acquired from its customers, users and so on, either directly or through a defined third party. The second is web scraping, obtaining data from public online sources such as social media.

The first approach makes compliance with the EU General Data Protection Regulation easier, but leads to a more limited dataset and potentially a less accurate AI model. This is where consent can still be meaningful and feasible.

By contrast, web scraping allows for much larger datasets and a more accurate AI model, but is more invasive from a privacy perspective. Because it is considered more effective, it seems to be the norm rather than the exception for the development phase of AI models. This was the approach of Clearview AI, which developed its facial recognition technology based on images available on the web, without requesting consent from or informing the data subjects concerned.

This did not go unnoticed by European data protection authorities. Clearview AI received hefty fines in France, Greece, Italy, the Netherlands and Sweden, ranging between 250,000 and 30.5 million euros. Several other European DPAs declared the web scraping unlawful and a breach of the GDPR's transparency principle. The regulators' decisions ordered Clearview AI to substantially change its practices and to delete parts of its datasets — not to mention the reputational impacts.

Is web scraping lawful at all under the GDPR?

Consent is practically impossible with web scraping: there are too many unidentified data subjects involved in the process, and the opt-in rate would most likely be low. Relying on the performance of a contract can be another option, but it would apply only to the few data subjects who actually enter into a contract to use the service. This would not be a valid lawful ground for most data subjects.

Legitimate interest is therefore becoming the common approach. What legitimate interests could be identified in this situation? The EDPB mentioned, for example, developing a conversational agent to assist users, such as ChatGPT; fraud detection; and improving threat detection for an information system.

Sensitive data in web scraping

The problem with web scraping is that the types of personal data that will be processed cannot be assessed in advance. With Meta's nearly 3.5 billion daily product users, it is almost impossible to accurately determine which parts of Meta's dataset will qualify as special categories of data and which will not. This is particularly difficult for data revealing political opinions or religious or philosophical beliefs, for example, where the line between such data and "ordinary" personal data is harder to draw. For Clearview AI, the risk is even higher, as the dataset concerned consists of biometric data, so the images clearly fall under the prohibition of Article 9(1) of the GDPR.

Therefore, measures should be implemented to exclude special categories of data before web scraping. If that is not possible, a relevant exception under Article 9(2) of the GDPR must apply. While explicit consent is not practically feasible in most cases, another exception applies where the personal data concerned have been manifestly made public by the data subject.

The question of transparency remains, however. How should data subjects be informed? How can they object to processing? Data subjects are often unaware their data is being used to train an AI model, especially when that data was uploaded more than 10 years ago. And what about photos that were not manifestly made public by the person depicted, such as photos of someone uploaded by another person?

As for substantial public interest, it must be laid down in EU or national law. Some argue Article 10(5) of the EU AI Act lifts the prohibition of Article 9(1) of the GDPR and allows the processing of special categories of data. Such processing must be strictly necessary for bias detection and correction, and it applies only to high-risk AI systems. However, this topic remains heavily debated and will most likely be assessed by the European Commission or the EDPB.

What about automated decision-making?

Article 22 of the GDPR should not apply to generative AI systems, such as ChatGPT, as no decision producing legal or similarly significant effects is taken — at least for personal use cases.

This might change where facial recognition is used as part of an application process for a service, an increasingly common practice in the financial technology sector. In such cases, the most appropriate option seems to be explicit consent based on first-party data or clearly defined third-party data.

As a general rule, automated decision-making may not be based on special categories of data, with two exceptions under Article 22(4) of the GDPR: explicit consent or a substantial public interest, in both cases accompanied by suitable measures to safeguard the data subject's rights, freedoms and legitimate interests. As for consent, the challenge is how to keep it freely given if an organization wants to make it a regular part of its process. While the AI Act may open the door to substantial public interest, it seems an opaque option at this stage.

How to explain this to the public?

The lawful grounds and exceptions are extremely complex when it comes to training an AI model, causing debates even among privacy professionals. It boils down to transparency: how is this all best explained to customers, users and stakeholders? Even in the narrow cases where consent is allowed — or required — is it possible to ask for meaningful consent? Can we assume the data subjects are "reasonably well-informed, observant and circumspect" persons, as the AI Act puts it?

Professor Daniel Solove's suggestion might resolve this, at least for consent. According to Solove, neither fully embracing nor abandoning consent is satisfactory. Instead, he proposed "murky consent": accepting that privacy consent is a legal fiction rather than something perfectly meaningful, which seems a fitting idea when working with consent and AI.

Marcell Szikszai, CIPM, is data protection officer at Advanzia Bank.