Editor's note: The IAPP is policy neutral. We publish contributed opinion and analysis pieces to enable our members to hear a broad spectrum of views in our domains.
In the fast-paced world of artificial intelligence, the balance between innovation and regulation is constantly tested.
Organizations are eager to push the boundaries of what AI can achieve, but they must do so while navigating stringent data privacy laws like the EU General Data Protection Regulation and the EU AI Act, along with other critical requirements like robustness, accountability and transparency.
The key question remains: how can organizations train AI models effectively while ensuring, in particular, data privacy compliance? This has led to a growing conversation about anonymization and deidentification, and the murky space between them.
Before diving into all the possibilities, it's helpful to clarify the differences between three key techniques: deidentification, pseudonymization and anonymization.
First, "deidentification" involves removing or altering some or all of the personal identifiers in a dataset, using techniques such as masking, redacting or generalizing data. The key point, though, is that deidentification does not render the dataset completely anonymous, because the data can still be reconstituted or reidentified by combining it with other data elements or information.
Second, "pseudonymization" is a specific form of deidentification in which identifying data elements are replaced with artificial identifiers. In the vast majority of cases, the dataset can be restored using a key that links the artificial identifiers back to the original values. As an aside, under Article 32 of the GDPR, pseudonymization is considered a technique that supports the security of the data processed.
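To make the distinction concrete, the minimal Python sketch below contrasts the two techniques. The record fields, masking choices and token format are illustrative assumptions, not a prescribed method.

```python
import secrets

# Illustrative record; field names are hypothetical.
record = {"name": "Ada Nowak", "email": "ada@example.com",
          "birth_year": 1986, "city": "Lyon"}

def deidentify(rec):
    """Deidentification: mask, redact or generalize identifiers.
    The result is NOT anonymous; combining it with other data
    may still allow reidentification."""
    return {
        "name": "***",                                  # masking
        "email": "***",                                 # redaction
        "birth_year": (rec["birth_year"] // 10) * 10,   # generalization to a decade
        "city": rec["city"],
    }

# Pseudonymization: replace identifiers with artificial tokens and keep
# a separate key (a lookup table) that can reverse the substitution.
pseudonym_key = {}

def pseudonymize(rec):
    token = "P-" + secrets.token_hex(4)
    pseudonym_key[token] = {"name": rec["name"], "email": rec["email"]}
    return {"subject_id": token, "birth_year": rec["birth_year"], "city": rec["city"]}

print(deidentify(record))
print(pseudonymize(record))
print(pseudonym_key)  # whoever holds this key can restore the original identities
```

The point to notice is that the pseudonymization key is simply a lookup table: whoever holds it can restore the original identities.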
Because both deidentification and pseudonymization leave room for reidentification, data processed using these methods still falls under data protection laws like the GDPR.
The only way to truly take the data processing outside the scope of the applicable data protection laws is to use the third technique — "anonymization." This means identifiers are permanently and irreversibly removed or modified so the data can never be linked to an individual.
Traditionally, data privacy has relied on a clear distinction between pseudonymization and anonymization. That distinction is critical, as advancements in technology are making true, irreversible anonymization increasingly difficult, if not impossible, to achieve, not just today but also in the future.
A more nuanced way to look at this is through "subjective anonymization," an approach recognizing that anonymization depends not only on removing identifiers but also on the context in which the data is used and the actors involved.
We need to keep in mind that any similarity in nature implies a similarity in legal treatment, while any difference in nature requires a different legal approach. Legal scholars have long debated distinctions in obligations and enforcement, dating back to French legal traditions. This principle applies to anonymization as well — should we treat it as an absolute concept, or should its definition depend on context?
Based on a substantially similar approach, the traditional view of anonymization as a binary process, where data is either anonymous or it isn't, is now being questioned. Legal and academic discussions suggest a more nuanced approach: whether data is truly anonymous depends on context, and if the recipient party's capacity to reidentify the data differs from that of the divulging party, that difference should be taken into account and may warrant a different legal treatment.
This is where we need to draw a line between "objective" anonymization, which meets the GDPR's standards for everyone, and "subjective" anonymization, which meets those standards for one party but not for another.
This perspective was underscored by a recent legal opinion of the Advocate General of the Court of Justice of the European Union, who pointed out how supposedly anonymous data can still be reidentified when combined with external information. In essence, anonymization isn't just about removing identifiers from a dataset; it's also about considering who has access to the data, what other information is available, and how likely it is that someone could reidentify individuals.
Imagine an AI model trained on a dataset stripped of direct identifiers like names and addresses. At first glance, the data appears anonymous. But what happens if another organization has a separate dataset with overlapping individuals, perhaps containing behavioral patterns, purchase histories, or social media activity? With the right analytical tools, these datasets could be cross-referenced, and individuals could be reidentified. Suddenly, the line between anonymous and identifiable data becomes blurred, exposing companies to legal and ethical risks.
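A minimal sketch of such a linkage attack, assuming two hypothetical datasets that happen to share quasi-identifiers such as postal code, birth year and gender (all names and values here are invented for illustration):

```python
import pandas as pd

# "Anonymous" training extract: direct identifiers removed, quasi-identifiers kept.
training_extract = pd.DataFrame({
    "postal_code": ["69003", "75011", "69003"],
    "birth_year":  [1986, 1990, 1972],
    "gender":      ["F", "M", "F"],
    "purchase_segment": ["luxury", "budget", "luxury"],
})

# A second organization's dataset that still contains names.
external_data = pd.DataFrame({
    "name":        ["Ada Nowak", "Jean Martin"],
    "postal_code": ["69003", "75011"],
    "birth_year":  [1986, 1990],
    "gender":      ["F", "M"],
})

# Cross-referencing on the shared quasi-identifiers re-attaches identities
# to records that looked anonymous in isolation.
reidentified = training_extract.merge(
    external_data, on=["postal_code", "birth_year", "gender"], how="inner"
)
print(reidentified[["name", "purchase_segment"]])
```

Even without a single direct identifier in the training extract, a simple join on shared attributes is enough to re-attach names to behavioral records.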
Subjective anonymization acknowledges this reality. It argues that data anonymization should not be evaluated in isolation but in the context of the data environment in which it exists. This shifts the focus from purely technical methods to a broader assessment of risks and technical and organizational measures, considering factors like data linkage potential, the motivation and resources of potential attackers, and the regulatory landscape.
In addition, it embraces one of the core underlying principles of Article 32 of the GDPR: a risk-based approach grounded in the accountability of the controller and processor. It takes into account not just technical protection, that is, objective anonymization, but also the ability to ensure the ongoing confidentiality of the data while promoting AI robustness, a new requirement under the EU AI Act.
For AI model training, this presents both opportunities and challenges. On one hand, subjective anonymization could offer a more flexible approach to data privacy, allowing organizations to leverage data while still managing risks effectively and addressing the robustness requirement.
On the other hand, it introduces regulatory uncertainty. If authorities accept the idea that anonymization is context-dependent, organizations may have more room to maneuver. But if regulators continue to insist on an absolute standard of anonymization, businesses will have to adhere to stricter data processing rules, potentially stifling innovation; hindering robustness, transparency and accountability; and, perhaps more importantly, preventing the European market from leading in this space.
The tension between regulation and AI development is particularly evident in different global approaches to anonymization. European regulators, including the European Data Protection Board, have historically taken a conservative stance, emphasizing the need for permanent irreversibility in anonymization.
However, recent court rulings have introduced a more pragmatic perspective, suggesting anonymization should be assessed based on the means reasonably available to those who handle the data. This shift aligns with the principle of subjective anonymization and could pave the way for more practical applications.
Practical implementation of subjective anonymization requires careful risk assessments and ongoing monitoring. Organizations need to consider not just how data is anonymized from a technical perspective, but also the broader ecosystem in which it is used.
This means evaluating factors like the likelihood of data linkage with other datasets, the technical feasibility of reidentification, the incentives for malicious actors to attempt reidentification, and who has and will likely have access to the data.
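By way of illustration only, such an assessment can be sketched as a simple contextual scoring exercise. The factors mirror the ones listed above, while the weights, thresholds and wording are assumptions, not a regulatory benchmark.

```python
from dataclasses import dataclass

@dataclass
class ReidentificationContext:
    # Each factor scored 0 (low) to 3 (high); the scale is illustrative.
    linkage_likelihood: int      # availability of overlapping external datasets
    technical_feasibility: int   # tools and skills needed to reidentify
    attacker_incentive: int      # motivation and resources of likely adversaries
    access_breadth: int          # how widely the data is, or will be, shared

def residual_risk(ctx: ReidentificationContext) -> str:
    """Very rough contextual risk rating; thresholds are assumptions."""
    score = (ctx.linkage_likelihood + ctx.technical_feasibility
             + ctx.attacker_incentive + ctx.access_breadth)
    if score <= 3:
        return "low - may support a subjective-anonymization argument"
    if score <= 7:
        return "medium - additional technical and organizational measures needed"
    return "high - treat the data as personal data under the GDPR"

print(residual_risk(ReidentificationContext(1, 1, 0, 1)))
print(residual_risk(ReidentificationContext(3, 2, 2, 3)))
```

In practice, such a checklist would feed into a documented risk assessment and be revisited as the data environment changes.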
These considerations are particularly relevant as companies explore new approaches to data privacy, such as synthetic data and Edge AI. Synthetic data (artificially generated datasets that mimic real data) has been hailed as a potential solution to privacy concerns. While it offers a promising alternative, its risks reinforce the argument for subjective anonymization: data privacy must be assessed in context rather than relying on a one-size-fits-all approach.
However, even synthetic data is not immune to risks. If the original data used to generate synthetic datasets is not properly anonymized, there remains a possibility that individuals could still be reidentified. This is especially concerning given that prompt-based attacks designed to coax large language models into giving up their prized data are a real and present issue under discussion among AI governance professionals. Moreover, biases present in the source data could be transferred to synthetic datasets, raising ethical concerns about fairness and representativeness.
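As a simplified illustration of both the promise and the limits, the sketch below generates synthetic rows by sampling from distributions fitted to a hypothetical source table. The columns, parameters and generator are assumptions chosen for brevity; real synthetic data tools are considerably more sophisticated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical source data; in practice this would be real customer records.
source = pd.DataFrame({
    "age":    rng.normal(45, 12, 1000).round(),
    "income": rng.lognormal(10.5, 0.4, 1000).round(),
})

# Naive synthetic generator: sample each column independently from a
# distribution fitted to the source. This mimics aggregate statistics
# but discards cross-column correlations.
synthetic = pd.DataFrame({
    "age":    rng.normal(source["age"].mean(), source["age"].std(), 1000).round(),
    "income": rng.lognormal(np.log(source["income"]).mean(),
                            np.log(source["income"]).std(), 1000).round(),
})

# Caveat: more sophisticated generators that are overfit to the source
# can reproduce rare, real records almost verbatim, which is exactly the
# reidentification risk described above.
print(source.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```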
Edge AI, which processes data on devices rather than in centralized cloud systems, offers another avenue for privacy protection. By keeping data localized to the edge devices, Edge AI reduces the need for large-scale data transfers, potentially minimizing exposure to reidentification risks. However, it does not eliminate the challenge of ensuring that AI models are trained on diverse, unbiased datasets.
The debate over subjective anonymization is emblematic of a larger struggle: how to balance the need for innovation with the imperative of data protection in a world confronting the rise of autocracies amid a contest over who will win the AI battle. AI thrives on data, yet regulations are becoming more stringent.
The future of AI governance may depend on finding a middle ground — one that acknowledges the limitations of traditional anonymization while upholding strong ethical safeguards.
As AI regulations evolve, particularly with the EU AI Act, organizations will face increasing pressure to demonstrate the quality and traceability of the data they use. This raises critical questions: Will regulators formally recognize subjective anonymization as a viable approach? Can businesses develop AI solutions that are both effective and privacy-compliant without being hamstrung by inappropriate or overly restrictive rules?
The answers to these questions will shape the trajectory of AI development for years to come. The debate over anonymization is only just beginning. In an era where AI development and privacy concerns collide, organizations must embrace a new, context-driven approach — or risk being left behind.
Roy Kamp, AIGP, CIPP/E, CIPP/US, CIPM, FIP, is legal director and Noémie Weinbaum, AIGP, CIPP/E, CIPP/US, CIPM, CDPO/FR, FIP, is senior managing counsel, privacy compliance at UKG.