The concept of anonymization, despite its ambiguities, is critical for data science programs around the world, yet it remains confusing across jurisdictions. Data that meets the standards for "anonymization," for example, is generally not subject to privacy or data protection laws. With the rapid adoption of artificial intelligence, which is typically trained on vast amounts of data, the need for clarification has only grown stronger, as has the push toward standardization.

Nowhere is this standard more important than in the EU, which set the bar for regulating the use of data, and whose regulations companies are rightly attuned to when building global compliance programs for their data. In other words, getting anonymization right for EU data forms a core component of any global strategy focused on responsibly collecting and utilizing data.

However, EU legal anonymization standards are among the trickiest to implement in practice — and have long been criticized for their ambiguities. This article provides an overview of the confusion around this standard and how it is progressively evolving.

Conflicting guidance in the EU

As we wrote in a previous article, the main source of confusion lies in existing regulatory guidance in the EU focused on what anonymization is and is not.

The original guidance issued by the Article 29 Working Party, for example, suggested a risk-based approach to anonymization is possible. At a high level, a risk-based approach to anonymization allows for some residual risk that the data could still theoretically be identified in the future: the lower the risk, the stronger the claim to anonymization. This risk-based approach is commonly applied in several jurisdictions and has been a central tenet of anonymization standards in the U.S. The Federal Trade Commission, for example, promoted this standard in 2012, and it has shaped state-level privacy laws around the U.S. ever since. A risk-based approach typically entails maintaining tight control over the way the data is reused, which is why closed data environments, with monitoring and auditing capabilities, are so important.
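To make the idea of residual risk concrete, here is a minimal sketch that estimates a worst-case singling-out probability for a toy set of records grouped by quasi-identifiers. The records, field values and the 0.2 threshold are invented for illustration and carry no legal weight; a real assessment would also weigh the context controls around the data.

```python
from collections import Counter

# Hypothetical microdata records: each tuple is a combination of
# quasi-identifiers (age band, ZIP prefix, gender). All values are invented.
records = [
    ("30-39", "750", "F"),
    ("30-39", "750", "F"),
    ("30-39", "750", "M"),
    ("40-49", "751", "M"),
    ("40-49", "751", "M"),
    ("50-59", "752", "F"),  # unique combination -> highest residual risk
]

def residual_risk(rows):
    """Estimate residual reidentification risk as the worst-case probability
    of singling out a record: 1 / (size of the smallest group sharing the
    same quasi-identifier combination)."""
    group_sizes = Counter(rows)
    return 1 / min(group_sizes.values())

risk = residual_risk(records)
print(f"Worst-case singling-out probability: {risk:.2f}")

# Under a risk-based approach, the release decision would compare this figure
# (together with the context controls in place) against an agreed threshold.
if risk <= 0.2:  # illustrative threshold only, not a legal standard
    print("Within the illustrative risk tolerance")
else:
    print("Above the illustrative tolerance; generalize further or tighten the environment")
```

The threshold and grouping rule are stand-ins: the point is only that risk is quantified and weighed, rather than required to be exactly zero.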

Risk-based approaches to anonymization have also long been adopted in the health care space, which has relied on anonymized data for decades and arguably uses some of the most sensitive data available. Because study after study has demonstrated that some residual risk of reidentification always remains, this risk-based approach is more or less the only workable way to apply anonymization in practice.

At the same time, however, the Article 29 Working Party's guidance also appeared to endorse a restrictive view of anonymization that directly undercut the risk-based approach, requiring the destruction of all raw data, irrespective of the technique applied to the data or the context controls implemented. Context controls limit the processing environment of the data without directly affecting what the data looks like. In practice, this made the circumstances under which data could be usefully anonymized extremely rare, if not impossible. If no risk of reidentification was allowed to remain, however remote, data had to be practically useless to be considered anonymized.

Meanwhile, the European Data Protection Supervisor focused on the criterion of irreversibility, meaning the techniques used to anonymize data can never be reversed, in a way that also appears to be at odds with the Article 29 Working Party's initial risk-based approach.

To make matters even more confusing, those in the technical community who have tried to formalize the EU concept of anonymity have adopted the most restrictive approach to anonymization, often oversimplifying legal tests for anonymization and driving the adoption of the most conservative methods possible. This paper, for example, which builds upon a prior paper on the same subject, argues legal standards for anonymization are violated if even a single record within a data set can be singled out.

However, it appears issues related to anonymization in the EU could soon be addressed, if not resolved entirely. While the European Data Protection Board has yet to release its forthcoming guidance on anonymization, three recent initiatives, namely two legislative proposals and a decision of the General Court, show EU institutions are increasingly sympathetic to a more pragmatic stance.

The Data Act

Let's start with the EU Data Act, now in the final trilogue negotiation stage, which encourages data reuse by allowing users of connected devices to gain access to usage data, by reducing contractual imbalances in data sharing contracts to the benefit of small and medium-sized enterprises, and by enabling public sector bodies to access and use private-sector data where necessary in exceptional circumstances.

One core objective of the Data Act is to set harmonized rules for:

  1. The design of connected products so that data generated by a connected product, or generated during the provision of related services, can be accessed by the user of that product.
  2. Data holders making the data they accessed from a connected product, or generated during the provision of a related service, available to data subjects, users or data recipients at their request.

The Data Act applies to both observed personal and nonpersonal data, often referred to as "usage data" in practice, and excludes derived and inferred data. Recital 6 of the act, as amended by the European Parliament, makes it clear the drafters expect a significant proportion of data will amount to nonpersonal, and therefore anonymized, data:

Many connected products, for example, in the civil infrastructure, energy generation or transport sectors, are recording data about their environment or interaction with other elements of that infrastructure without any actions by the user or any third party. Such data may often be non-personal in nature and valuable for the user or third parties, which may use it to improve their operations, the overall functioning of a network or system or by making it available to others.

Recital 24a of the European Parliament's version then notes:

A substantial hurdle to non-personal data sharing by businesses thus results from the lack of predictability of economic returns from investing in the curation and making available of data sets or data products. In order to allow for the emergence of liquid, efficient and fair markets for non-personal data in the Union, it must be clarified which party has the right to offer such data on a marketplace. Users should therefore have the right to share non-personal data with data recipients for commercial and non-commercial purposes.

Why are these provisions so important? Previous regulatory guidance, such as the Article 29 Working Party's opinion on the concept of personal data, clarifies that all data can become personal data when the purpose of the processing is to learn things about individuals. This is what the Article 29 Working Party used to call "personal data by purpose."

The only sensible way to make the Data Act work, then, is to accept that data is no longer personal data once a risk-based anonymization process has been applied. More specifically, because all data can technically be used to infer personal information, the only way to apply the Data Act is to assume a risk-based approach to anonymization is a valid option and that a reasonably low residual risk of identification is sufficient.

The European Health Data Space Regulation 

Next, let's turn to the European Health Data Space regulation, which is at an earlier stage of the legislative process. This proposed regulation, among other things, sets forth "rules and mechanisms supporting the secondary use of electronic health data." As electronic health data is special category data, it is crucial to put appropriate safeguards in place to enable its reuse. Among the list of safeguards are provisions relating to anonymization and pseudonymization.

Recital 43 states in particular that: 

In addition to the tasks necessary to ensure effective secondary use of health data, the health data access body should strive to expand the availability of additional health datasets, support the development of AI in health and promote the development of common standards. They should apply tested techniques that ensure electronic health data is processed in a manner that preserves the privacy of the information contained in the data for which secondary use is allowed, including techniques for pseudonymisation, anonymization, generalization, suppression and randomisation of personal data.

Recital 43 adds "Health data access bodies can prepare datasets to the data user requirement linked to the issued data permit. This includes rules for anonymization of microdata sets." 

On the other hand, Recital 50 provides: 

Where the applicant needs anonymised statistical data, it should submit a data request application, requiring the health data access body to provide directly the result.

Recital 43 therefore seems to suggest it is possible to anonymize individual-level data.

Recitals 43 and 50, read together, confirm there are two types of anonymization processes, as in this explanation where we distinguished between local and global anonymization: the anonymization of microdata sets, or "units of data that aggregate statistics are compiled from" to use the words of the European Commission in its Eurostat documentation, and the anonymization of statistical data.
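A minimal sketch of that distinction follows, using invented records and purely illustrative generalization and suppression rules: the local route transforms and releases individual-level rows, while the global route releases only aggregate statistics compiled from those units.

```python
from collections import Counter

# Hypothetical individual-level health records (microdata); all values invented.
microdata = [
    {"age": 34, "postcode": "75001", "diagnosis": "asthma"},
    {"age": 37, "postcode": "75002", "diagnosis": "asthma"},
    {"age": 52, "postcode": "75003", "diagnosis": "diabetes"},
    {"age": 58, "postcode": "75004", "diagnosis": "diabetes"},
]

def generalize(record):
    """Local anonymization step: generalize age to a band and suppress the
    detailed postcode, while keeping the row at individual level."""
    decade = record["age"] // 10 * 10
    return {
        "age_band": f"{decade}-{decade + 9}",
        "postcode": record["postcode"][:2] + "***",
        "diagnosis": record["diagnosis"],
    }

# Route 1: anonymization of the microdata set itself (individual-level rows).
anonymized_microdata = [generalize(r) for r in microdata]

# Route 2: anonymization of statistical data, i.e. releasing only aggregates
# compiled from the microdata units, as with a data request for results.
statistics = Counter((generalize(r)["age_band"], r["diagnosis"]) for r in microdata)

print(anonymized_microdata)
print(dict(statistics))
```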

Because individual-level data will always carry a statistical probability of reidentification when reused, as we noted above, we are left with a risk-based approach to anonymization. Again, it is simply not possible to eliminate reidentification risks. As a nod to this reading, the proposed regulation also states clearly that, just as with the Data Governance Act, certain "categories of electronic health data can remain particularly sensitive even when they are in anonymized format and thus non-personal."

This assertion is true even when the output data is aggregated, as clarified in Recital 64, which states "for these types of health data, there remains a risk for reidentification [even] after the anonymization or aggregation."

In practice, this means the European Health Data Space regulation proposal appears to clearly endorse a risk-based approach to anonymization, aligning itself with the Data Act and other approaches to minimizing the risk of reidentification. 

The SRB v EDPS case

The third notable event is the decision of the EU General Court in Single Resolution Board v. EDPS, decided 26 April.

That case involves the processing of personal comments produced by the shareholders and creditors of Banco Popular, who were affected by the resolution decision adopted by the SRB, the central resolution authority in the Banking Union. Some of these comments were shared with a third party, Deloitte, which did not have access to identifying data used for registration purposes, such as proof of participants' identity and ownership of Banco Popular's capital instruments. The comments shared with third parties, including Deloitte, were filtered, categorized and, when identical, aggregated, and then associated with an alphanumeric code: a 33-digit globally unique identifier randomly generated at the time the responses to the form were received.
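The coding arrangement at the heart of the case can be pictured with the short sketch below. The participant labels and comments are invented, and the exact format of the 33-character code is assumed for illustration; the point is simply that the link table stays with the controller while the recipient sees only coded comments.

```python
import secrets
import string

# Hypothetical comments keyed by registration identity; all content invented.
registration_data = {
    "participant-001": "The valuation report understates asset values.",
    "participant-002": "The resolution decision ignored minority holders.",
}

ALPHABET = string.ascii_uppercase + string.digits

def random_code(length=33):
    """Generate a random alphanumeric code, loosely mirroring the 33-character
    identifier described in the judgment (format assumed for illustration)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

link_table = {}       # code -> registration identity, retained by the controller
shared_dataset = []   # what the recipient receives

for identity, comment in registration_data.items():
    code = random_code()
    link_table[code] = identity
    shared_dataset.append({"code": code, "comment": comment})

print(shared_dataset)  # no registration identities appear in the shared data
```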

The EDPS asserted that the SRB had breached the data protection obligation in Article 15 of Regulation 2018/1725, by failing to inform data subjects that their personal data would then be shared with third parties. When assessing the SRB's claim that the court should annul the EDPS's revised decision, the General Court had to determine whether the comments transmitted to Deloitte amounted to personal data within the meaning of Article 3(1) of Regulation 2018/1725 and, in particular, whether the data subjects remained identifiable. In other words, the General Court had to determine whether that data had been anonymized. Here, the General Court disagreed with the EDPS, arguing that a risk-based approach does indeed meet the legal standards for anonymization in the EU.

The General Court relied upon a previous Court of Justice of the European Union case, commonly referred to as the Patrick Breyer case, and considered two factors in determining whether reasonable standards for anonymization were met:

  1. The controlled environment, meaning the context controls placed on the data environment such as access controls, asserting that "Deloitte did not have access to the identification data received during the registration phase that would have allowed the participants to be linked to their comments by virtue of the alphanumeric code."
  2. The data itself, meaning the data controls, such as masking, placed on the data to transform its appearance, asserting that "the alphanumeric code appearing on the information transmitted to Deloitte did not in itself allow the authors of the comments to be identified" and although "it cannot be ruled out that personal views or opinions may constitute personal data … such a conclusion cannot be based on a presumption … but must be based on the examination of whether, by its content, purpose or effect, a view is linked to a particular person." 

The General Court then held it was "necessary to put oneself in Deloitte's position in order to determine whether the information transmitted to it relates to 'identifiable persons.'" As the EDPS did not put itself in Deloitte's position, the General Court considered that the EDPS was "incorrect to maintain that it was not necessary to ascertain whether the authors of the information transmitted to Deloitte were re-identifiable by Deloitte or whether such reidentification was reasonably possible." In other words, the EDPS failed to consider "that it was for the EDPS to determine whether the possibility of combining the information that had been transmitted to Deloitte with the additional information held by the SRB constituted a means likely reasonably to be used by Deloitte to identify the authors of the comments." 

In practice, this means an absolutist approach to anonymization, in which no chance of reidentification is tolerated (a standard that is, for the most part, statistically impossible to meet), does not reign supreme in the EU. Instead, the General Court's decision adopts a risk-based approach by examining the controls placed on both the data and its context, allowing a low probability of identification to remain as long as that probability is not likely or reasonable.

It's important to note that some objections to the General Court's opinion have been raised. Recital 26 of the EU General Data Protection Regulation, for example, requires taking into consideration the intended data recipient to determine whether an individual is identifiable, as well as the means likely to be used by all situationally relevant attackers. Under this type of analysis, the data controller itself may also be viewed as a potential attacker, which seems to be the default position of some regulators, such as the Irish Supervisory Authority. However, data controllers can also become trustworthy when appropriate context controls are in place, meaning the data controller should not always be within the list of potential attackers, as the U.K. Information Commissioner's Office itself has stated.

It is also worth noting that it's unclear whether the purpose for which the data was shared was a motivating factor in the General Court's analysis, despite the fact that the purpose of data usage weighs heavily when performing a risk-based approach to anonymization and usually excludes individual decision making. The EDPS itself has made this clear, as has the European Health Data Space regulation in confirming the importance of purpose restrictions (see Articles 34 and 35).

The impact of changing anonymization standards

The EDPB is still in the process of formalizing anonymization guidance, so it remains to be seen whether the move toward a more pragmatic approach to anonymization in the EU will be officially endorsed by the bloc's most important assembly of data protection regulators. In the meantime, however, it appears clear a risk-based approach is gaining steam in the EU.