The field of artificial intelligence, particularly the rise of large language models, presents unprecedented opportunities intertwined with complex data protection concerns.
European regulators at the forefront of data protection are keen to ensure AI innovation is in line with the principles enshrined in the EU General Data Protection Regulation.
In its recent publication "Legal bases in data protection when using artificial intelligence," the Baden-Württemberg Commissioner for Data Protection and Freedom of Information (LfDI) breaks down the complexities of processing personal data within AI systems, raising critical questions about data collection, training methods and the potential impact on the rights of data subjects.
Studies in this area have already drawn on the perspectives of other prominent regulators, including France's data protection authority, the Commission nationale de l'informatique et des libertés (CNIL); the U.K. Information Commissioner's Office (ICO); Germany's Hamburg Commissioner for Data Protection and Freedom of Information (HmbBfDI); and the European Data Protection Board (EDPB).
Analyzing their investigations, guidelines and studies can shed light on key data protection considerations shaping the responsible development and use of AI systems in the European landscape.
Data protection compliance throughout the AI life cycle
Given data's extensive use in AI — particularly the training of AI systems on massive datasets — the GDPR comes into play whenever that data relates to an identified or identifiable natural person.
While LLMs may not be explicitly designed to process personal data, they can inadvertently store or indirectly reveal such information through outputs or inferences. The abstract nature of language processing in LLMs doesn't negate potential privacy risks. Therefore, data storage, model outputs, the potential for re-identification — through model attacks, for example — and the evolving nature of technology need to be carefully considered from a data protection and privacy perspective.
Regulators share the understanding that relying solely on the abstract nature of AI, or the complexity of extraction attacks, does not absolve data controllers of their data protection responsibilities. To ensure compliance throughout the life cycle of an AI system, the LfDI points out that different stages of processing relevant to data protection law must be considered: collection of training data, processing of data for AI training, deployment of AI applications, use of AI applications, and use of AI results.
For each phase, controllers must assess data processing from the perspective of all stakeholders — the provider, the user and the data subject — recognizing that each phase may involve different data processing activities and may require separate legal bases and the implementation of technical and organizational measures for compliance.
Navigating legal bases for AI data processing
The LfDI, echoing the approach of the EDPB, the ICO, and the CNIL, meticulously examines the legal bases under the GDPR for processing personal data in AI systems. Key highlights include:
- Consent (Art. 6(1)(a) GDPR). While consent is a possible legal basis, its suitability for AI systems depends on the processing phase. It is generally unsuitable for large-scale data collection and training datasets where direct contact with data subjects is impractical. Challenges include ensuring informed consent for complex AI systems, addressing the right to withdraw consent and its impact on AI functionality, and managing potential transparency and traceability issues.
- Contract performance (Art. 6(1)(b) GDPR). This basis may cover AI-driven tasks within existing contracts, such as AI speech generation or medical diagnostics. However, it doesn't legitimize processing personal data of third parties or data use beyond the scope of the contract.
- Legal obligation (Art. 6(1)(c) GDPR). Its application in AI systems is restricted as it mandates strict necessity and a specific legal obligation.
- Vital interests (Art. 6(1)(d) GDPR). Because this basis is geared to acute, short-term situations, it is generally unsuitable for the long-term endeavor of training AI systems, though it might justify using AI in life-saving situations under specific circumstances.
- Legitimate interests (Art. 6(1)(f) GDPR). Applicable to nonpublic bodies, this basis offers flexibility due to its open wording but requires careful balancing of interests. The ICO, CNIL, and LfDI highlight this as a commonly used basis for AI development, emphasizing factors such as data sensitivity, processing scope, impact on data subjects, AI system characteristics, data subject expectations, data minimization, and the necessity of processing.
Transparency
AI systems, particularly deep learning models, often struggle with transparency due to their complexity and opaque decision-making processes. Tracing the origin and transformation of data throughout the AI life cycle is another obstacle.
To address this, the CNIL and LfDI encourage the use of clear diagrams, practical examples and plain language to make information more accessible. The ICO recommends its guidance "Explaining Decisions Made with AI" for detailed, practical recommendations on how to achieve transparency throughout the AI life cycle.
Data storage in LLMs and data minimization
The ICO and LfDI address the tension between data-intensive AI systems and the principle of data minimization. While LLMs primarily store information as abstract mathematical representations (embeddings) — with no direct link to individuals, as highlighted by the HmbBfDI, referencing the CJEU's jurisprudence — the authorities nonetheless address concerns that personal data may be stored or revealed, particularly through inference.
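To illustrate the concept, the minimal Python sketch below shows how a sentence containing personal data is reduced to a fixed-length vector of numbers. It assumes the open-source sentence-transformers library, and the model name is simply one publicly available example; the point is that the stored vector carries no directly readable identifiers, even though inference risks remain.

```python
# Minimal sketch: text becomes an abstract numeric representation
# (embedding). Assumes the open-source sentence-transformers package;
# the model name below is one publicly available example.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# A sentence that contains personal data in plain text ...
sentence = "Jane Doe lives at 12 Example Street."

# ... is stored only as a fixed-length vector of floating-point numbers.
embedding = model.encode(sentence)

print(embedding.shape)  # (384,) for this model: numbers, not names
```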
The ICO stresses the need for ongoing evaluation and retraining of AI models to potentially reduce reliance on large historical datasets. The regulators emphasize clear data retention policies, ensuring that training data is deleted when no longer needed to retrain models.
Data accuracy
The EDPB and LfDI underline the importance of data accuracy, particularly for AI systems whose outputs are perceived as factual. They highlight the potential for bias and inaccuracy in LLMs and call for measures to ensure and communicate the reliability of AI-generated content.
The EDPB's investigation focuses on web scraping practices and the need to scrutinize sources for training data collection. The ICO emphasizes that data accuracy goes beyond statistical accuracy, stressing the importance of considering different types of error — such as false positives and negatives — and using nuanced statistical measures like precision and recall.
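A small worked example makes these measures concrete. The Python sketch below uses purely illustrative confusion-matrix counts to compute precision (how much of what the system flagged was correct) and recall (how much of what should have been flagged was caught):

```python
# Illustrative confusion-matrix counts for a hypothetical AI classifier.
true_positives = 80   # items correctly flagged
false_positives = 20  # items flagged in error
false_negatives = 10  # items the system missed

# Precision: of everything the system flagged, how much was correct?
precision = true_positives / (true_positives + false_positives)  # 0.80

# Recall: of everything that should have been flagged, how much was caught?
recall = true_positives / (true_positives + false_negatives)     # ~0.89

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

A system can score well on one measure and poorly on the other, which is why a single headline accuracy figure can mislead.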
Data subjects' rights
The EDPB questionnaire and the LfDI and CNIL studies suggest mechanisms for individuals to provide additional information to facilitate identification and the exercise of their rights, including access, rectification, erasure and objection. They also propose model retraining and output filtering as potential alternatives for meeting data subjects' requests, given the challenges of identifying data subjects in large datasets and complex models.
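Output filtering, in its simplest form, intercepts a model's response and redacts known identifiers before the text reaches the user. The deliberately simplified Python sketch below illustrates the idea; the blocklist, the pattern and the function name are hypothetical, and a production system would rely on dedicated personal-data-detection tooling rather than a handcrafted pattern:

```python
import re

# Deliberately simplified sketch of output filtering. The blocklist and
# email pattern are illustrative only; real systems would use dedicated
# personal-data-detection tooling.
BLOCKED_NAMES = {"Jane Doe"}  # e.g., data subjects who exercised their rights
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def filter_output(text: str) -> str:
    """Redact blocked names and email addresses from model output."""
    for name in BLOCKED_NAMES:
        text = text.replace(name, "[REDACTED]")
    return EMAIL_PATTERN.sub("[REDACTED]", text)

print(filter_output("Contact Jane Doe at jane.doe@example.com."))
# -> Contact [REDACTED] at [REDACTED].
```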
Using data for new purposes
Processing data for purposes beyond the original collection, such as training AI systems with existing data, presents challenges. The CNIL and LfDI emphasize the need for a "compatibility test" to assess the link between purposes, collection context, reasonable expectations, data sensitivity and safeguards.
In addition, they point to the need to establish a legal basis for further processing and to enter into contractual agreements with third-party data providers to ensure data lawfulness and compatibility.
Accountability
Both the LfDI and the ICO acknowledge the difficulties in demonstrating compliance and maintaining auditable records due to the complexities of AI systems. They emphasize the need for clarity in controller/processor relationships, particularly when outsourcing AI development or using third-party AI systems.
The ICO recommends documentation throughout the AI development process, including risk assessments, data minimization strategies, and steps taken to ensure data protection by design and default.
The LfDI also provides guidance on determining controller/processor responsibilities, highlighting the importance of contractual clarity and due diligence when procuring AI systems from external providers.
Harnessing the power of AI, upholding data protection rights
The authorities' positions represent a starting point for addressing the complex issue of data protection in AI systems. Their risk-based, balanced approach aims to foster innovation while safeguarding privacy, underscoring the importance of implementing these recommendations, and monitoring their effectiveness, within robust programs that combine AI governance and data protection.
The continuous evolution of this field, with ongoing studies and publications, will undoubtedly shape how the world navigates the intersection of AI and data protection. By adopting a proactive and adaptable approach, we can harness the power of AI while upholding the fundamental right to data protection.
Henrique Fabretti Moraes, CIPP/E, CIPM, CIPT, CDPO/BR, FIP, is a partner and Helena Dominguez Bianchi, CIPP/E, is an attorney at Opice Blum.