The EU General Data Protection Regulation is among the most influential data privacy laws in the world — setting the standard, in many ways, for how global organizations implement their data privacy programs. But the GDPR, and EU data protection law more generally, suffer from one central problem: one of their most important provisions is unclear.

Specifically, the GDPR defines anonymous data as data that “does not relate to an identified or identifiable natural person or to personal data rendered anonymous” such that “the data subject is not or no longer identifiable.” Data that meets this criterion is not subject to the GDPR, making anonymous data the holy grail of data protection. If you can anonymize data, regulations like the GDPR simply no longer apply: neither their onerous requirements on handling data nor their very high fines. From a compliance standpoint, anonymous data makes your life easier.

The problem is that even though the GDPR specifically calls out anonymous data, and even though European data protection authorities have publicly discussed anonymization for decades, it’s unclear whether anyone really knows what “anonymization” means in practice. Even the regulators themselves acknowledge this: Spain’s DPA, the Agencia Española de Protección de Datos, and the European Data Protection Supervisor released a joint document titled “10 misunderstandings related to anonymization” to clarify these exact issues.

But in the big picture, significant uncertainty remains, which can leave organizations attempting to anonymize their data in a deep bind, even when they don’t treat anonymization as a free license. Before we can get to what organizations can do about uncertain anonymization standards, we must first examine how we got here.

Conflicting regulatory guidance in the EU

Let’s start with the uncertainty around what “anonymization” truly means under EU data protection standards. Even though the GDPR discusses anonymization in Recital 26, we must go back to opinions issued by the Article 29 Working Party, which predates the European Data Protection Board. In 2007, the WP issued an opinion that clearly articulated the difference between “anonymization” and “pseudonymization.” It defined pseudonymization as privacy protective but technically reversible. Anonymization, on the other hand, was defined as follows: “Disguising identities can also be done in a way that no reidentification is possible, e.g. by one-way cryptography, which creates in general anonymized data.”
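As a concrete illustration of the technique the 2007 opinion mentions, here is a minimal Python sketch of one-way cryptography applied to a direct identifier (the identifier and salt below are hypothetical). Whether such a transform is enough on its own is precisely the question later guidance reopened.

```python
import hashlib

def one_way_pseudonym(identifier: str, salt: str) -> str:
    """Replace a direct identifier with a one-way (non-reversible) digest."""
    # SHA-256 cannot be inverted, but the same input always yields the same
    # output, so linkage across datasets remains possible unless the salt
    # is kept secret and rotated per dataset.
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()

print(one_way_pseudonym("alice@example.com", salt="long-random-secret"))
```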

In 2007, the WP erred on the side of flexibility, writing that as long as “appropriate technical measures” were put in place to prevent reidentification of data, that data could be considered anonymous. The 2007 opinion was therefore reasonable and in line with other anonymization standards.

Then came the WP’s 2014 opinion on anonymization techniques, which turned this analysis on its head and set the path for significant confusion about EU anonymization standards that exists to this day. In particular, the WP revisited the difference between anonymization and pseudonymization, and declared a “specific pitfall is to consider pseudonymised data to be equivalent to anonymized data.” Pseudonymity allows for identifiability, the WP wrote, and “therefore stays inside the scope of the legal regime of data protection.”

In this new analysis, the difference between anonymization and pseudonymization lies in the likelihood of reidentification, that is, whether it’s possible to derive personal information from deidentified data. As study after study has demonstrated, however, it’s pretty much impossible to perfectly anonymize data, meaning some possibility of reidentification often remains. So how should organizations determine what is likely?

Here the WP enumerated three specific reidentification risks:

  • Singling out — the ability to locate an individual's record within a data set.
  • Linkability — the ability to link two records pertaining to the same individual or group of individuals.
  • Inference — the ability to confidently guess or estimate values using other information.

The WP stated an anonymization solution that protected against each of these risks “would be robust against re-identification performed by the most likely and reasonable means the data controller and any third party may employ.” In other words, anonymization that protects against each of these three risks would be satisfactory and stand up to regulatory scrutiny.
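To make the three risks concrete, here is a minimal Python sketch, using entirely hypothetical records and quasi-identifiers, of how each one can be checked against a supposedly deidentified dataset:

```python
from collections import Counter

# Hypothetical "deidentified" records: direct identifiers removed,
# but quasi-identifiers (ZIP code, birth year, sex) retained.
records = [
    {"zip": "75001", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"zip": "75001", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "75002", "birth_year": 1955, "sex": "M", "diagnosis": "diabetes"},
]

quasi_ids = ("zip", "birth_year", "sex")

# Singling out: any quasi-identifier combination that appears exactly once
# isolates a single individual's record in the dataset.
counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
singled_out = [combo for combo, n in counts.items() if n == 1]
print("Combinations that single someone out:", singled_out)

# Linkability: the same quasi-identifiers let an attacker join this dataset
# with an external one (e.g., a public register) on matching values.
external = [{"zip": "75002", "birth_year": 1955, "sex": "M", "name": "J. Doe"}]
links = [
    (r, e) for r in records for e in external
    if all(r[q] == e[q] for q in quasi_ids)
]
print("Records linkable to a named individual:", len(links))

# Inference: even without an exact match, an attacker who knows someone is a
# 1980-born woman in 75001 can narrow her diagnosis to just two values.
candidates = {r["diagnosis"] for r in records
              if (r["zip"], r["birth_year"], r["sex"]) == ("75001", 1980, "F")}
print("Inferable diagnoses for the known profile:", candidates)
```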

So far, so good. The problem emerged, however, when the WP went on to suggest both aggregation and destruction of the raw data were also needed to achieve anonymization:

“It is critical to understand that when a data controller does not delete the original (identifiable) data at event-level, and the data controller hands over part of this dataset (for example after removal or masking of identifiable data), the resulting dataset is still personal data. Only if the data controller would aggregate the data to a level where the individual events are no longer identifiable, the resulting dataset can be qualified as anonymous.”

In other words, only by aggregating data into group statistics and permanently deleting the original data could organizations have full confidence their data was anonymized and therefore outside the scope of data protection regulations in the EU.
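A minimal sketch of that reading, using hypothetical event-level records, might look like the following; the minimum group size is an illustrative choice, not a threshold the WP specified:

```python
from collections import defaultdict

# Hypothetical event-level records (still personal data under the 2014 reading).
events = [
    {"user_id": "u1", "city": "Paris",  "purchases": 3},
    {"user_id": "u2", "city": "Paris",  "purchases": 1},
    {"user_id": "u3", "city": "Berlin", "purchases": 5},
    {"user_id": "u4", "city": "Berlin", "purchases": 2},
]

MIN_GROUP_SIZE = 2  # suppress groups too small to hide any individual

# Aggregate to group statistics so no individual event survives in the output.
totals, counts = defaultdict(int), defaultdict(int)
for e in events:
    totals[e["city"]] += e["purchases"]
    counts[e["city"]] += 1

aggregate = {
    city: {"users": counts[city], "avg_purchases": totals[city] / counts[city]}
    for city in totals
    if counts[city] >= MIN_GROUP_SIZE
}

# Per the 2014 opinion, the controller would also delete the event-level data
# (symbolized here by dropping the in-memory records).
del events
print(aggregate)
```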

Due to this U-turn, EU regulators still vacillate between the 2007 and 2014 standards. Some regulators have stated a residual risk of reidentification is acceptable as long as the right precautions are in place; the U.K. Information Commissioner's Office and Ireland’s Data Protection Commission took this tack. But other DPAs, like France’s Commission Nationale de l'Informatique et des Libertés, used more absolutist language in their guidance.

Organizations attempting to comply with these standards and aiming to meet EU anonymization requirements get stuck between a rock and a hard place. What can they do?

Give up and embrace pseudonymization

The first option is to give up on the project of anonymizing data entirely and simply treat all deidentified data as pseudonymized. While pseudonymized data does not fall outside the scope of EU data protection because reidentification is still possible, the compliance burden for pseudonymous data can be significantly lighter, assuming the processing purpose is legitimate, a legal basis is established (or the secondary purpose is considered compatible with the initial purpose), and the data controller is not in a position to identify individuals (which makes most individual rights virtually nonexistent, except the rights to information and to object).

Standards for how to implement pseudonymization techniques vary, but many overlap with anonymization practices under other legal frameworks outside the EU. Here, for example, is how the European Union Agency for Cybersecurity described pseudonymization techniques:

“The choice of a pseudonymization technique and policy depends on different parameters, primarily the data protection level and the utility of the pseudonymized dataset (that the pseudonymization entity wishes to achieve) ... Still, utility requirements might lead the pseudonymization entity towards a combination of different approaches or variations of a selected approach. Similarly, with regard to pseudonymization policies, fully randomized pseudonymization offers the best protection level but prevents any comparison between databases.”
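The policy trade-off ENISA describes, deterministic pseudonyms that allow comparison between databases versus fully randomized ones that do not, can be sketched as follows (the key and identifiers here are hypothetical):

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"hypothetical-key-held-by-the-pseudonymization-entity"

def deterministic_pseudonym(identifier: str) -> str:
    """Keyed (HMAC) pseudonym: the same person gets the same token everywhere,
    so two pseudonymized databases can still be compared or joined."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def randomized_pseudonym(identifier: str) -> str:
    """Fully randomized pseudonym: a fresh token each time, which maximizes
    protection but prevents any comparison between databases."""
    return secrets.token_hex(8)

print(deterministic_pseudonym("alice@example.com"))  # stable across calls
print(randomized_pseudonym("alice@example.com"))     # different every call
```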

If this seems a lot like anonymization to you, you’re not alone. These similarities at a technical level, combined with the 2014 guidance from the WP, have led some organizations to give up on anonymization entirely, at least until EU DPAs provide further clarity.

That said, applying EU data protection standards to all types of pseudonymous data, irrespective of the strength of the pseudonymization process, can be problematic when data must be accessed quickly and shared among different types of stakeholders. Those who use pseudonymous data for their own purposes are “controllers” under EU data protection laws and must make sure they tick all the right boxes before processing the data. It’s also unclear how pseudonymization can really help justify data transfers to third countries without adequacy decisions, as the standard adopted seems to be anonymization rather than pseudonymization. Researchers, especially in the medical space, have been quite vocal about the problems this causes.

So if organizations aren’t willing to give up on anonymization entirely, what else could they do? They have a few options.

Argue the risk of reidentification is sufficiently remote

The next option lies in arguing the means of reidentification are not reasonably likely to be used, relying more heavily on the WP’s 2007 opinion than on its 2014 opinion or, at least, ignoring the most problematic paragraphs of the 2014 opinion. The question becomes: how can organizations argue that, even though some risk of reidentification remains, the risk is sufficiently remote and their data is therefore anonymous?

The 2014 WP guidance itself referred to the importance of context and confirmed that to meet the legal test, “account must be taken of ‘all’ the means ‘likely reasonably’ to be used for identification by the controller and third parties, paying special attention to what has lately become, in the current state of technology, ‘likely reasonably’ (given the increase in computational power and tools available).” The WP also stated that when it is not possible to pursue an approach based on the mitigation of the three risks mentioned above, a risk-based approach remains an option.

A risk-based approach implies adopting an attacker-centric definition of anonymization, which appears compatible with the legal test. Indeed, the legal test focuses on an assessment of the reidentification means reasonably likely to be used by the controller or another person, i.e., an attacker. To anticipate attackers’ behavior, deidentification experts rely upon risk models to guide their selection of data and context controls.
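One common way such risk models are operationalized is to measure how large the “crowd” is that each record hides in. The sketch below, with hypothetical records and an illustrative rather than prescribed threshold, estimates per-record reidentification risk as the inverse of the size of each record's quasi-identifier equivalence class:

```python
from collections import Counter

def reidentification_risk(records, quasi_ids):
    """Estimate per-record risk as 1 / (size of the record's quasi-identifier
    equivalence class) -- a common heuristic in disclosure risk models."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return [1 / counts[tuple(r[q] for q in quasi_ids)] for r in records]

# Hypothetical generalized records (truncated ZIP codes, banded ages).
records = [
    {"zip": "750**", "age_band": "40-49", "sex": "F"},
    {"zip": "750**", "age_band": "40-49", "sex": "F"},
    {"zip": "750**", "age_band": "40-49", "sex": "F"},
    {"zip": "130**", "age_band": "60-69", "sex": "M"},
]

risks = reidentification_risk(records, ("zip", "age_band", "sex"))
max_risk = max(risks)
avg_risk = sum(risks) / len(risks)

# The acceptable threshold is a judgment call documented in the risk assessment;
# 0.05 (equivalence classes of at least 20) is used here purely as an example.
THRESHOLD = 0.05
print(f"max risk={max_risk:.2f}, avg risk={avg_risk:.2f}, "
      f"acceptable={max_risk <= THRESHOLD}")
```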

Trusted third parties

The next option is to rely on what are called “trusted third parties,” or TTPs, which can help serve as intermediaries between organizations possessing the raw data and those who seek to use anonymous data. Specifically, when one party wants to share anonymous data with a secondary organization, a TTP can “broker” the exchange, implementing deidentification techniques on the raw data, which remains under the control of the original party while sharing the deidentified data with the secondary organization.

In 2013, the WP addressed this arrangement in an opinion on purpose limitation. It described TTPs as operating “in situations where a number of organizations each want to anonymize the personal data they hold for use in a collaborative project,” noting they can be used “to link datasets from separate organizations, and then create anonymized records for researchers.”
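A highly simplified sketch of that arrangement, with hypothetical datasets and a linkage key known only to the TTP, might look like this:

```python
import hashlib
import hmac

class TrustedThirdParty:
    """Sketch of a TTP that links records from two organizations on a shared
    identifier, then releases only deidentified, linked records."""

    def __init__(self, linkage_key: bytes):
        self._key = linkage_key  # known only to the TTP

    def _token(self, identifier: str) -> str:
        return hmac.new(self._key, identifier.encode(), hashlib.sha256).hexdigest()

    def link_and_deidentify(self, dataset_a, dataset_b):
        b_by_token = {self._token(r["national_id"]): r for r in dataset_b}
        linked = []
        for r in dataset_a:
            match = b_by_token.get(self._token(r["national_id"]))
            if match:
                # Drop the direct identifier before anything leaves the TTP.
                linked.append({"age_band": r["age_band"],
                               "diagnosis": match["diagnosis"]})
        return linked

# Hypothetical inputs from two separate organizations.
hospital = [{"national_id": "AB123", "age_band": "40-49"}]
registry = [{"national_id": "AB123", "diagnosis": "asthma"}]
ttp = TrustedThirdParty(linkage_key=b"hypothetical-ttp-only-secret")
print(ttp.link_and_deidentify(hospital, registry))
```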

Introducing a third party to perform the deidentification and keep the raw data separate, the WP suggested, seemed to be another useful method to achieve anonymization. The 2013 opinion even described a state referred to as “complete anonymization” that could be achieved in this setting.

One year earlier, the U.K. ICO described this type of arrangement as “particularly effective where a number of organizations each want to anonymize the personal data they hold for use in a collaborative project.” The ICO went to great lengths to describe how this approach enables anonymization.

Inserting a third party into the deidentification process is one central way to bolster the claim to anonymity and can facilitate the creation of anonymized data when data from different sources is linked together. The inclusion of a TTP can serve as one component of a broader risk-based approach.

Emerging technological approaches 

All the anonymization options above are heavy on process: assessing all the reasonably likely ways data could be reidentified, as in the risk-based approach, or inserting third parties to manage the data, as in the use of TTPs. One way to streamline and accelerate these processes is to rely on a group of emerging technologies that help automate deidentification. It is worth noting that because these technologies are still emerging and, in many senses, still being proven, it’s unclear to what extent they support anonymization under the EU frameworks. That said, there are clear signs these technologies might stand up to regulatory scrutiny.

Take, for example, synthetic data, which consists of creating new data from a sample set that preserves the correlations within that sample set but does not recreate any direct identifiers. The use of synthetic data is growing in the health care space in particular, offering a promising way to extract trends from health data without using patient identifiers directly. Indeed, one such solution was designated as anonymous data under GDPR standards by the CNIL. The technology remains in its infancy and does not necessarily eliminate all reidentification risks, so it remains to be seen how useful it can be in real-world settings, but EU DPAs do seem open to labeling such data anonymous under data protection standards.
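Real synthetic data generators are far more sophisticated, and their privacy properties must be evaluated carefully, but the core idea can be sketched with a toy example: fit a simple statistical model to a sample and then draw entirely new records from it. The data and model below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical real sample: columns are (age, systolic blood pressure).
real = np.array([[34, 118], [52, 131], [61, 142], [45, 127], [70, 150]], float)

# Fit a simple generative model: a multivariate Gaussian capturing the
# means and correlation structure of the sample.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw brand-new records from the model -- none corresponds to a real person,
# but the age/blood-pressure correlation of the sample is preserved.
synthetic = rng.multivariate_normal(mean, cov, size=5)
print("real correlation:     ", np.corrcoef(real, rowvar=False)[0, 1])
print("synthetic correlation:", np.corrcoef(synthetic, rowvar=False)[0, 1])
```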

Differential privacy is a mathematical privacy framework that also holds promise for anonymization. The framework is a method of inserting controlled randomization into a data analysis process, resulting in limits on the amount of personal information any attacker can infer. The U.S. Census Bureau uses the technique to protect the privacy of respondents’ data. While EU DPAs have yet to formally opine on differential privacy, we believe they’re likely to look favorably on the technique.
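As a rough illustration of the idea, not a production-grade implementation, the sketch below releases a count with Laplace noise calibrated to the query’s sensitivity, which is the core mechanism behind many differential privacy deployments; the records and epsilon value are hypothetical.

```python
import random

def dp_count(records, predicate, epsilon: float) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1
    (adding or removing one person changes a count by at most 1)."""
    true_count = sum(1 for r in records if predicate(r))
    # Difference of two exponentials yields Laplace noise with scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Hypothetical survey records.
records = [{"age": a} for a in (23, 37, 41, 58, 62, 29, 45)]
print(dp_count(records, lambda r: r["age"] >= 40, epsilon=0.5))
```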

Synthetic data and differential privacy are not the only techniques that hold promise for anonymization. Some tout the benefits of federated learning, which can, if implemented with an eye for compliance, serve a similar function as TTPs, although in practice it tends to be used as a data minimization technique rather than an anonymization technique. While there are often significant technical barriers in practice and deployment, another technique known as secure multiparty computation can be used to design multiparty data processing protocols that simulate a TTP without actually having one. In that sense, the future holds new opportunities for meeting the requirements of EU data protection laws, even if such mandates remain unclear.
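To give a flavor of how secure multiparty computation can simulate a trusted third party, here is a toy additive secret-sharing sketch in which several parties learn only their joint total, never each other’s inputs; the party count and values are hypothetical.

```python
import random

PRIME = 2**61 - 1  # arithmetic is done modulo a large prime

def share(value: int, n_parties: int):
    """Split a private value into n additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

# Three hospitals each hold a private patient count; no single party (and no
# TTP) ever sees another party's input, yet the total can be reconstructed.
private_counts = [120, 340, 75]
all_shares = [share(v, 3) for v in private_counts]

# Each party sums the shares it received (one per input); combining the
# partial sums reveals only the aggregate.
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]
print(sum(partial_sums) % PRIME)  # -> 535, the total, with inputs kept secret
```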

What to do?

Absent further clarification from EU regulatory authorities themselves, there is no one-size-fits-all approach to anonymization for organizations seeking to comply with EU data protection standards. Still, there is a host of concrete options, and clear arguments, these organizations can use to get value out of their data while ensuring it remains protected.
